GSE72500 Processing Pipeline

RIP-Seq, 3 processing steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

Ro60 iCLIP

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Illumina software used for basecalling.

bcl2fastq v2.20 (tool and version inferred with models/gemini-2.5-flash)

$ Bash example

# Install bcl2fastq (distributed by Illumina; the conda channel and package name below are unverified)
# conda install -c bioconda bcl2fastq2

# Define input and output directories
RUN_FOLDER_DIR="/path/to/illumina/run/folder"
OUTPUT_DIR="/path/to/output/fastq"
SAMPLE_SHEET="/path/to/sample_sheet.csv" # Optional, but highly recommended for demultiplexing

# Execute bcl2fastq for basecalling and demultiplexing
bcl2fastq --runfolder-dir "${RUN_FOLDER_DIR}" \
          --output-dir "${OUTPUT_DIR}" \
          --sample-sheet "${SAMPLE_SHEET}" \
          --no-lane-splitting # Example common parameter, adjust as needed
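
After basecalling, it can be useful to confirm that the resulting FASTQ files are intact before mapping. The following is a minimal sketch, not part of the published pipeline: `fastq_summary` is a hypothetical helper name, and it assumes standard four-line FASTQ records as written by Illumina software.

```shell
# Hypothetical helper (not from the source): summarize a FASTQ stream.
# Reads uncompressed FASTQ on stdin and assumes four lines per record.
fastq_summary() {
  awk '
    # Line 2 of each record is the sequence.
    NR % 4 == 2 { seq_len = length($0); total += seq_len; reads++ }
    # Line 4 is the quality string; it should match the sequence length.
    NR % 4 == 0 && length($0) != seq_len { bad++ }
    END {
      printf "reads=%d bases=%d length_mismatches=%d\n", reads, total, bad
    }
  '
}

# Usage with a gzipped file (the path is a placeholder):
#   gzip -dc /path/to/output/fastq/sample.fastq.gz | fastq_summary
```

A non-zero `length_mismatches` count suggests a truncated or corrupted file.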


Reads were mapped to human genome build hg19 using STAR (https://code.google.com/p/rna-star/) with the "outFilterMultimapNmax 20" option, then PCR duplicates were removed using unique nmers in the barcode sequence.

STAR v2.4.x (version inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_index/hg19" # Path to the STAR genome index for hg19
READS="input.fastq.gz" # Input FASTQ file (assuming single-end for this example)
OUTPUT_PREFIX="mapped_reads" # Prefix for output files

# 1. Map reads to human genome build hg19 using STAR
#    Parameter from the source text: "outFilterMultimapNmax 20"
#    --readFilesCommand zcat is required because the input FASTQ is gzip-compressed
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}." \
     --outFilterMultimapNmax 20 \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 # Adjust number of threads as needed

# 2. Remove PCR duplicates using unique nmers in the barcode sequence
#    This step typically involves UMI (Unique Molecular Identifier) based deduplication.
#    The exact command depends on how the barcode sequence is incorporated into the reads
#    (e.g., in the read header, or at the start of the read sequence).
#    A common tool for this is `umi_tools dedup`.

# Install umi_tools (if not already installed)
# conda install -c bioconda umi_tools

# Example using umi_tools dedup (assuming the UMI was appended to the read name by a previous extraction step):
# samtools index "${OUTPUT_PREFIX}.Aligned.sortedByCoord.out.bam"  # umi_tools dedup needs an indexed BAM
# umi_tools dedup \
#     --stdin "${OUTPUT_PREFIX}.Aligned.sortedByCoord.out.bam" \
#     --stdout "${OUTPUT_PREFIX}.deduplicated.bam" \
#     --extract-umi-method=read_id \
#     --umi-separator="_" \
#     --log "${OUTPUT_PREFIX}.deduplication.log"
# "_" is the separator written by `umi_tools extract`; change it if the UMI was attached differently.
# If reads are paired-end, add --paired. If the UMI needs to be extracted from the read sequence first, use `umi_tools extract` prior to STAR.
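
The idea behind barcode-based duplicate removal can also be sketched without a dedicated tool. The example below is an illustration under stated assumptions, not the authors' method: `dedup_by_barcode` is a hypothetical helper, the barcode is assumed to be the last `_`-delimited field of the read name, and duplicates are defined by identical chromosome, leftmost position, strand, and barcode. It ignores the fact that reverse-strand reads are better compared by their alignment end.

```shell
# Hypothetical helper (not from the source): collapse SAM records that share
# chromosome, leftmost position, strand, and barcode.
# Reads SAM text on stdin and writes SAM text on stdout.
dedup_by_barcode() {
  awk -F '\t' '
    /^@/ { print; next }                  # pass header lines through unchanged
    int($2 / 4) % 2 == 1 { next }         # skip unmapped reads (flag 0x4)
    {
      n = split($1, name_parts, "_")
      barcode = name_parts[n]             # assumed: barcode is the last "_" field
      strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag 0x10 = reverse strand
      key = $3 SUBSEP $4 SUBSEP strand SUBSEP barcode
      if (!(key in seen)) { seen[key] = 1; print }   # keep the first read per key
    }
  '
}

# Usage (file names are placeholders):
#   samtools view -h mapped_reads.Aligned.sortedByCoord.out.bam \
#     | dedup_by_barcode \
#     | samtools view -b -o mapped_reads.deduplicated.bam -
```

Because the first read per key is kept, the output order follows the input order, so a coordinate-sorted input stays sorted.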


Peak calling was performed using pyicoclip (http://regulatorygenomics.upf.edu/Software/Pyicoteo/pyicoclip.html) using RefSeq genes as the region file.

RefSeq v0.1.1

$ Bash example

# Install pyicoteo (which includes pyicoclip)
# pip install pyicoteo

# Placeholder for input BAM file (aligned reads)
# Replace with your actual input BAM file
INPUT_BAM="input.bam"

# Placeholder for RefSeq genes region file (e.g., BED format)
# This file defines the regions where peaks will be called.
# Example: Download RefSeq genes for the genome assembly used for mapping (hg19 in this dataset)
# from resources like UCSC Table Browser, Ensembl, or NCBI.
REFSEQ_GENES_BED="refseq_genes.bed"

# Placeholder for genome FASTA file
# Replace with your actual genome FASTA file (e.g., hg19.fa, matching the mapping step)
GENOME_FASTA="genome.fa"

# Output prefix for peak files
OUTPUT_PREFIX="pyicoclip_peaks"

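# Execute pyicoclip (installed as part of the Pyicoteo package).
# Assumption: the call follows the pattern shown in the Pyicoteo documentation,
#   pyicoclip <experiment> <output> -f <format> --region <region file>
# Confirm the flags and supported input formats with `pyicoclip --help` before use.
# GENOME_FASTA is not passed here because no genome FASTA option is assumed.
pyicoclip "${INPUT_BAM}" "${OUTPUT_PREFIX}.pk" -f bam --region "${REFSEQ_GENES_BED}"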
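
The source text does not say how the RefSeq region file was built. One possible way, shown here as a sketch only, is to convert the UCSC `refGene` table for hg19 into BED6. `refgene_to_bed` is a hypothetical helper, the UCSC column order (bin, name, chrom, strand, txStart, txEnd, ...) is assumed, and whether pyicoclip expects exactly this layout should be checked against its documentation.

```shell
# Hypothetical helper (not from the source): convert UCSC refGene table rows
# on stdin to BED6 on stdout, one line per transcript.
# Assumed input columns: bin, name, chrom, strand, txStart, txEnd, ...
refgene_to_bed() {
  awk -F '\t' 'BEGIN { OFS = "\t" }
    # BED6: chrom, start, end, name, score, strand (UCSC tables are 0-based)
    { print $3, $5, $6, $2, 0, $4 }' \
    | LC_ALL=C sort -k1,1 -k2,2n \
    | uniq
}

# Usage (the download location is an assumption; verify it before use):
#   curl -s https://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz \
#     | gzip -dc | refgene_to_bed > refseq_genes.bed
```

Sorting by chromosome and start keeps the file usable with tools that expect sorted BED input, and `uniq` drops exact duplicate rows.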

Tools Used

STAR

Raw Source Text

Illumina software used for basecalling.
Reads were mapped to human genome build hg19 using STAR (https://code.google.com/p/rna-star/) with the "outFilterMultimapNmax 20" option, then PCR duplicates were removed using unique nmers in the barcode sequence. Peak calling was performed using pyicoclip (http://regulatorygenomics.upf.edu/Software/Pyicoteo/pyicoclip.html) using RefSeq genes as the region file.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Bed files include peaks.
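
Since the supplementary BED files contain the called peaks, a quick per-chromosome summary can help check a reanalysis against them. This is a minimal sketch: `summarize_peaks` is a hypothetical helper, and it assumes tab-separated BED with at least three columns.

```shell
# Hypothetical helper (not from the source): count peaks and covered bases
# per chromosome from BED text on stdin.
summarize_peaks() {
  awk -F '\t' '
    /^(track|browser|#)/ { next }                 # skip BED header lines
    NF >= 3 { n[$1]++; bp[$1] += $3 - $2 }        # BED intervals are half-open
    END { for (c in n) printf "%s\t%d\t%d\n", c, n[c], bp[c] }
  ' | LC_ALL=C sort -k1,1
}

# Usage (the file name is a placeholder):
#   gzip -dc peaks.bed.gz | summarize_peaks
```

The output columns are chromosome, number of peaks, and total peak length in bases.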
