GSE72500 Processing Pipeline
RIP-Seq
3 steps
Publication
The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression. Science (New York, N.Y.) (2015). PMID 26382853
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Illumina software used for basecalling.
bcl2fastq v2.20 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install bcl2fastq (distributed by Illumina; the conda package below is an unverified example)
# conda install -c bioconda bcl2fastq2

# Define input and output directories
RUN_FOLDER_DIR="/path/to/illumina/run/folder"
OUTPUT_DIR="/path/to/output/fastq"
SAMPLE_SHEET="/path/to/sample_sheet.csv"  # Optional, but highly recommended for demultiplexing

# Execute bcl2fastq for basecalling and demultiplexing
bcl2fastq --runfolder-dir "${RUN_FOLDER_DIR}" \
  --output-dir "${OUTPUT_DIR}" \
  --sample-sheet "${SAMPLE_SHEET}" \
  --no-lane-splitting  # Example common parameter, adjust as needed
2
Reads were mapped to human genome build hg19 using STAR (https://code.google.com/p/rna-star/) with the "outFilterMultimapNmax 20" option, then PCR duplicates were removed using unique nmers in the barcode sequence.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_index/hg19"  # Path to the STAR genome index for hg19
READS="input.fastq.gz"                 # Input FASTQ file (assuming single-end for this example)
OUTPUT_PREFIX="mapped_reads"           # Prefix for output files

# 1. Map reads to human genome build hg19 using STAR
#    Parameter from the source text: "outFilterMultimapNmax 20"
#    (READS is gzipped, so STAR needs --readFilesCommand to decompress it)
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}." \
  --outFilterMultimapNmax 20 \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 8  # Adjust number of threads as needed

# 2. Remove PCR duplicates using unique nmers in the barcode sequence
#    This is UMI (Unique Molecular Identifier) based deduplication.
#    The exact command depends on where the barcode sits in the reads
#    (e.g., in the read header, or at the start of the read sequence).
#    The source does not name the tool used; `umi_tools dedup` is one common choice.

# Install umi_tools (if not already installed)
# conda install -c bioconda umi_tools

# Example using umi_tools dedup (assuming the UMI was appended to the read name,
# separated by ":", in a previous extraction step). dedup needs an indexed BAM:
# samtools index "${OUTPUT_PREFIX}.Aligned.sortedByCoord.out.bam"
# umi_tools dedup \
#   --stdin "${OUTPUT_PREFIX}.Aligned.sortedByCoord.out.bam" \
#   --stdout "${OUTPUT_PREFIX}.deduplicated.bam" \
#   --extract-umi-method=read_id \
#   --umi-separator=":" \
#   --log "${OUTPUT_PREFIX}.deduplication.log"
# If reads are paired-end, add --paired.
# If the UMI must be extracted from the read sequence first, run `umi_tools extract` before STAR.
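The deduplication rule in step 2 can be illustrated without any alignment tools. The following is a minimal sketch, not the authors' method: it assumes the alignments have already been reduced to a tab-separated table of chromosome, start, strand, barcode n-mer, and read name (all toy values are hypothetical), and it keeps one read per unique combination of position, strand, and n-mer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy alignment table (hypothetical values): chrom, start, strand, UMI, read name.
reads=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 + ACGT read1 \
  chr1 100 + ACGT read2 \
  chr1 100 + TTGA read3 \
  chr1 100 - ACGT read4 \
  chr2 500 + ACGT read5)

# Keep the first read seen for each (chrom, start, strand, UMI) key.
dedup=$(printf '%s\n' "$reads" | awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++')

printf '%s\n' "$dedup"
total=$(printf '%s\n' "$reads" | wc -l | tr -d ' ')
kept=$(printf '%s\n' "$dedup" | wc -l | tr -d ' ')
echo "kept ${kept} of ${total} reads"
```

Here read2 is dropped because it shares position, strand, and n-mer with read1, while read3 (different n-mer) and read4 (opposite strand) are retained. Dedicated tools additionally tolerate sequencing errors in the n-mer, which this sketch does not.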
3
Peak calling was performed using pyicoclip (http://regulatorygenomics.upf.edu/Software/Pyicoteo/pyicoclip.html) using RefSeq genes as the region file.
RefSeq v0.1.1
$ Bash example
# Install pyicoteo (which includes pyicoclip)
# pip install pyicoteo

# Aligned, deduplicated reads from step 2 (placeholder name)
INPUT_BAM="input.bam"

# RefSeq genes region file in BED format.
# This file defines the regions where peaks will be called.
# Download RefSeq genes for the assembly used in mapping (hg19 here)
# from resources like the UCSC Table Browser or NCBI.
REFSEQ_GENES_BED="refseq_genes.bed"

# Output peak file (placeholder name)
OUTPUT_PEAKS="pyicoclip_peaks.pk"

# Run pyicoclip. The argument layout follows the example in the pyicoclip
# documentation; treat the flags, and "bam" as an accepted -f value, as
# assumptions and confirm them with `pyicoclip --help` before use.
pyicoclip "${INPUT_BAM}" "${OUTPUT_PEAKS}" -f bam --region "${REFSEQ_GENES_BED}"
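Step 3 needs a region file of RefSeq genes, which the source does not describe how to build. One possible route, sketched here under assumptions, is to convert a UCSC refGene-style table (columns: bin, name, chrom, strand, txStart, txEnd, ...) into six-column BED with awk. The two rows are hypothetical toy records, and whether pyicoclip expects exactly this layout should be checked against its documentation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Two toy rows in the UCSC refGene.txt column order (values are illustrative):
# bin name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds score name2
refgene=$(printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  586 NM_000002 chr2 - 7000 9000 7100 8900 1 7000, 9000, 0 GENEB \
  585 NM_000001 chr1 + 1000 5000 1200 4800 2 1000,3000, 2000,5000, 0 GENEA)

# Emit BED6: chrom, txStart, txEnd, transcript ID, score placeholder, strand.
# refGene coordinates are already 0-based half-open, so they are copied unchanged.
refseq_bed=$(printf '%s\n' "$refgene" |
  awk -F'\t' 'BEGIN { OFS = "\t" } { print $3, $5, $6, $2, 0, $4 }' |
  sort -k1,1 -k2,2n)

printf '%s\n' "$refseq_bed"
```

With a real hg19 refGene table, the `printf` would be replaced by reading the downloaded file, and the sorted output written to the file named in `REFSEQ_GENES_BED`.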
Tools Used
Raw Source Text
Illumina software used for basecalling.
Reads were mapped to human genome build hg19 using STAR (https://code.google.com/p/rna-star/) with the "outFilterMultimapNmax 20" option, then PCR duplicates were removed using unique nmers in the barcode sequence.
Peak calling was performed using pyicoclip (http://regulatorygenomics.upf.edu/Software/Pyicoteo/pyicoclip.html) using RefSeq genes as the region file.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Bed files include peaks.
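Since the supplementary BED files contain the peaks, a quick sanity check before downstream use can be helpful. The sketch below is illustrative only: the peak records are hypothetical toy values, and it assumes a plain tab-separated BED layout with chromosome, start, and end in the first three columns.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy peak records (hypothetical coordinates): chrom, start, end, name.
peaks=$(printf '%s\t%s\t%s\t%s\n' \
  chr1 1500 1620 peak1 \
  chr1 4100 4180 peak2 \
  chr2 7300 7350 peak3)

# Count malformed records: fewer than 3 columns, or start not below end.
bad=$(printf '%s\n' "$peaks" |
  awk -F'\t' 'NF < 3 || $2 + 0 >= $3 + 0 { n++ } END { print n + 0 }')

# Tally peaks per chromosome as "chrom<TAB>count".
per_chrom=$(printf '%s\n' "$peaks" | cut -f1 | sort | uniq -c |
  awk '{ print $2 "\t" $1 }')

echo "malformed records: ${bad}"
printf '%s\n' "$per_chrom"
```

For a downloaded file, the `printf '%s\n' "$peaks"` calls would be replaced by reading the BED file, skipping any `track` or `browser` header lines first.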