GSE77633 Processing Pipeline
Publication
Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nature Methods (2016), PMID 27018577
Dataset
GSE77633: Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [iCLIP]
Processing Steps
1
Sequencing reads from CLIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output filenames (placeholders)
# Replace 'input.fastq.gz' with your actual input FASTQ file
# Replace 'output.fastq.gz' with your desired output FASTQ file
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.fastq.gz"

cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b TCGTATGCCGTCTTCTGCTTG \
  -b ATCTCGTATGCCGTCTTCTGCTTG \
  -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
  -b TGGAATTCTCGGGTGCCAAGG \
  -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
  -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
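The -m 18 setting in step 1 means no read shorter than 18 nt should survive trimming. One way to sanity-check a trimmed file is sketched below. The function name count_short_reads is hypothetical, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: count FASTQ records whose sequence is shorter than a minimum length.
# Assumes each record is exactly four lines, so the sequence is line 2 of every group of 4.
count_short_reads() {
  min_len="$1"   # minimum accepted read length, e.g. 18 to mirror cutadapt -m 18
  shift
  awk -v min="$min_len" '
    NR % 4 == 2 && length($0) < min { n++ }   # sequence line shorter than the minimum
    END { print n + 0 }                       # print 0 when nothing was counted
  ' "$@"
}

# Usage (reads FASTQ from stdin or from file arguments):
#   gzip -dc output.fastq.gz | count_short_reads 18
```

A result of 0 is what the -m 18 filter should produce. Any other value suggests the file was not produced with the parameters listed above.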
2
Reads were then mapped against a database of repetitive elements derived from RepBase18.05.
bowtie2 v2.4.5 (tool and version inferred with models/gemini-2.5-flash, not stated in the source text; the source names Bowtie 1.0.0 for the Repbase alignment, see step 3)
$ Bash example
# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Define input reads (adjust for single-end or paired-end as needed)
# For paired-end reads:
READS_R1="reads_R1.fastq.gz"
READS_R2="reads_R2.fastq.gz"
# For single-end reads (uncomment and adjust if applicable):
# READS_SINGLE="reads.fastq.gz"

OUTPUT_SAM="mapped_to_repeats.sam"
# Prefix for unmapped reads files
# (written as unmapped_to_repeats_1.fastq.gz and unmapped_to_repeats_2.fastq.gz)
UNMAPPED_PREFIX="unmapped_to_repeats"

# Define repetitive elements database (RepBase18.05)
# RepBase is a commercial database. Users typically obtain a license or use derived public datasets.
# For demonstration, assume 'RepBase18.05.fasta' is available in the working directory.
REPEATS_FASTA="RepBase18.05.fasta"
REPEATS_INDEX_PREFIX="RepBase18.05_index"

# Build Bowtie2 index for repetitive elements
# This step only needs to be run once per reference database
# bowtie2-build "${REPEATS_FASTA}" "${REPEATS_INDEX_PREFIX}"

# Map reads against the repetitive elements database (paired-end example)
# --very-sensitive-local is an assumed preset, not taken from the source text
# --un-conc-gz writes read pairs that did not align concordantly, for subsequent
# mapping to the main genome; bowtie2 replaces '%' in the path with the mate number (1 or 2)
bowtie2 \
  --very-sensitive-local \
  -x "${REPEATS_INDEX_PREFIX}" \
  -1 "${READS_R1}" \
  -2 "${READS_R2}" \
  --un-conc-gz "${UNMAPPED_PREFIX}_%.fastq.gz" \
  -S "${OUTPUT_SAM}"

# If using single-end reads, the command would be:
# bowtie2 \
#   --very-sensitive-local \
#   -x "${REPEATS_INDEX_PREFIX}" \
#   -U "${READS_SINGLE}" \
#   --un-gz "${UNMAPPED_PREFIX}.fastq.gz" \
#   -S "${OUTPUT_SAM}"

# Optional: Convert SAM to BAM and sort
# samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam"
# samtools index "${OUTPUT_SAM%.sam}.bam"
3
Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
$ Bash example
# Install Bowtie (version 1.0.0 might require specific steps or older environments)
# For example, if using conda:
# conda create -n bowtie1_env bowtie=1.0.0
# conda activate bowtie1_env

# Align reads using Bowtie 1.0.0
# Assuming 'repbase_index' is the prefix for the Bowtie index files
# (e.g., repbase_index.1.ebwt, repbase_index.2.ebwt, etc.)
# and 'reads.fastq' contains the input reads.
# The output will be in SAM format due to the -S flag.
bowtie -S -q -p 16 -e 100 -l 20 repbase_index reads.fastq > aligned_reads.sam
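Only reads that did not map to Repbase go on to the genome alignment, so the unmapped records have to be pulled out of the Bowtie SAM output first. A minimal sketch of that hand-off follows. It assumes single-end reads and that unaligned reads are present in the SAM file with FLAG bit 0x4 set. The function name sam_unmapped_to_fastq is hypothetical and is not the authors' script.

```shell
# Hypothetical helper: print unmapped single-end reads from SAM input as FASTQ.
# SAM columns used: 1 = read name, 2 = FLAG, 10 = sequence, 11 = base qualities.
sam_unmapped_to_fastq() {
  awk -F '\t' '
    /^@/ { next }               # skip SAM header lines
    int($2 / 4) % 2 == 1 {      # FLAG bit 0x4 is set: the read is unmapped
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }
  ' "$@"
}

# Usage:
#   sam_unmapped_to_fastq aligned_reads.sam > unmapped_reads.fastq
```

Unmapped reads are stored in their original orientation, so no reverse-complementing is needed. With samtools available, `samtools fastq -f 4` is the more usual route to the same result.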
4
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
STAR v2.3.0e
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star=2.3.0e

# Define variables (replace with actual paths)
GENOME_DIR="/path/to/hg19_STAR_index"
INPUT_FASTQ="unmapped_reads.fastq"
OUTPUT_PREFIX="aligned_to_hg19_"

# Run STAR alignment
# Assumption: --outSAMtype is not among the source parameters and appears to be a
# later addition to STAR, so 2.3.0e may reject it. If it does, drop the option and
# convert the default SAM output with samtools instead.
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1
5
Reads that were PCR replicates were removed from each CLIP-seq library using a custom script.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Note: the source used a custom, barcode-aware script for this step.
# samtools markdup is shown only as a generic stand-in and ignores barcodes.

# Define input and output file names
INPUT_BAM="aligned_reads.bam"
TEMP_NAMESORT_BAM="temp_namesort.bam"
TEMP_FIXMATE_BAM="temp_fixmate.bam"
TEMP_COORD_SORTED_BAM="temp_coord_sorted.bam"
FINAL_COORD_SORTED_BAM="deduplicated_reads_coord_sorted.bam"

# 1. Sort by read name
# samtools fixmate expects reads grouped by name.
samtools sort -n -o "${TEMP_NAMESORT_BAM}" "${INPUT_BAM}"

# 2. Add mate score tags
# -m: add the mate score tag that samtools markdup uses to pick which read to keep.
samtools fixmate -m "${TEMP_NAMESORT_BAM}" "${TEMP_FIXMATE_BAM}"

# 3. Sort by coordinate
# samtools markdup requires coordinate-sorted input.
samtools sort -o "${TEMP_COORD_SORTED_BAM}" "${TEMP_FIXMATE_BAM}"

# 4. Remove PCR duplicates
# -r: Remove duplicate reads (instead of just marking them).
# -s: Print statistics to stderr (captured here in a .stats file).
samtools markdup -r -s "${TEMP_COORD_SORTED_BAM}" "${FINAL_COORD_SORTED_BAM}" 2> "${FINAL_COORD_SORTED_BAM}.stats"

# 5. Index the final BAM file
# Indexing allows for fast retrieval of reads by genomic location.
samtools index "${FINAL_COORD_SORTED_BAM}"

# Clean up temporary files
rm "${TEMP_NAMESORT_BAM}" "${TEMP_FIXMATE_BAM}" "${TEMP_COORD_SORTED_BAM}"
6
Briefly, one read with a unique barcode was kept at each nucleotide position when more than one read with the same barcode was mapped to the same location.
umi_tools dedup v1.1.2 (tool and version inferred with models/gemini-2.5-flash, not stated in the source text)
$ Bash example
# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Placeholder for input and output files
INPUT_BAM="aligned_reads_with_umis.bam"
OUTPUT_BAM="deduplicated_reads.bam"

# Deduplicate reads based on Unique Molecular Identifier (UMI) and mapping position.
# This command assumes that the input BAM is coordinate-sorted and indexed, and that
# the UMI is already stored in a BAM tag named 'UB' (UMI Barcode).
# If the UMI is instead appended to the read name after a '_' separator, use:
# umi_tools dedup -I "${INPUT_BAM}" -S "${OUTPUT_BAM}" --extract-umi-method=read_id --umi-separator='_'

# umi_tools dedup groups reads by mapping position and UMI,
# then keeps one representative read per group.
umi_tools dedup \
  -I "${INPUT_BAM}" \
  -S "${OUTPUT_BAM}" \
  --extract-umi-method=tag \
  --umi-tag=UB \
  --log "${OUTPUT_BAM%.bam}.log"
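The rule stated in step 6 (keep one read per barcode at each position) can also be sketched without any UMI tooling. The example below assumes a hypothetical tab-separated table with the columns chromosome, position, strand, barcode, and read name. That layout and the function name collapse_by_barcode are illustrative and are not the authors' custom script.

```shell
# Hypothetical helper: keep the first read seen for each
# (chromosome, position, strand, barcode) combination and drop later repeats.
# Input columns (tab-separated): chrom, pos, strand, barcode, read_name.
collapse_by_barcode() {
  awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++' "$@"
}

# Usage:
#   collapse_by_barcode reads_with_barcodes.tsv > collapsed_reads.tsv
```

Reads with different barcodes at the same position are all kept, because they are taken to come from distinct RNA fragments rather than PCR copies. Unlike umi_tools, this sketch requires barcodes to match exactly and does no error correction.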
7
Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).
$ Bash example
# Install CLIPper (assuming Python environment)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Example usage of CLIPper for cluster calling on one deduplicated BAM file.
# The flag names -b (input BAM), -s (species) and -o (output BED) follow the CLIPper README;
# confirm them with `clipper --help` for your installed version.
# File names are placeholders. The species is hg19, matching the genome build of this dataset.
# The '--threshold-' parameter is truncated in the description, so it is left out here rather than guessed.
clipper \
  -b deduplicated_reads.bam \
  -s hg19 \
  -o clipper_clusters.bed \
  --bonferroni \
  --superlocal
Raw Source Text
Sequencing reads from CLIP-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff' 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT. Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009). Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1. Reads that were PCR replicates were removed from each CLIP-seq library using a custom script. Briefly one read with a unique barcode was kept at each nucleotide position when more than one with the same barcode was mapped to the same location Clusters were then assigned using the CLIPper software with parameters --bonferroni --superlocal --threshold- software (Lovci et al., 2013).
Genome_build: hg19
Supplementary_files_format_and_content: bigWig, bigBed format, contains clusters of predicted RBFOX2 binding