GSE112782 Processing Pipeline
RIP-Seq
5 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis. Cell Stem Cell (2019), PMID 31588046
Dataset
GSE112782: The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [CLIP-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR.
$ Bash example
# Install STAR (e.g., via conda)
# conda install -c bioconda star

# --- Reference data preparation (run once) ---
# Obtain the RepBase 18.04 human-specific repetitive element sequences as FASTA.
# The file and directory names below are placeholders.
# repbase_fasta="RepBase18.04_human_repeats.fasta"
# genome_dir="star_repbase_index"

# Build a STAR index for the RepBase sequences.
# A repeat library is far smaller than a full genome (which typically needs tens of GB of RAM);
# the STAR manual recommends lowering --genomeSAindexNbases for small references.
# STAR --runMode genomeGenerate \
#      --genomeDir "${genome_dir}" \
#      --genomeFastaFiles "${repbase_fasta}" \
#      --runThreadN 8

# --- Alignment step ---
input_read1="reads_R1.fastq.gz"      # adapter-trimmed R1 FASTQ
input_read2="reads_R2.fastq.gz"      # adapter-trimmed R2 FASTQ (omit if single-end)
output_prefix="aligned_to_repbase"   # prefix for output files
genome_dir="star_repbase_index"      # STAR index built from RepBase 18.04

# --outReadsUnmapped Fastx writes the reads that did NOT map to repeats to
# ${output_prefix}.Unmapped.out.mate1 (and .mate2); those reads feed the genome alignment in step 2.
# Using this flag is an assumption about how the repeat filter was implemented.
STAR --genomeDir "${genome_dir}" \
     --readFilesIn "${input_read1}" "${input_read2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${output_prefix}." \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outReadsUnmapped Fastx \
     --runThreadN 8   # adjust threads as needed
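The step says reads were adapter-trimmed, but the example above starts from FASTQ files that are already trimmed. As a rough illustration of what 3' adapter trimming does, here is a minimal sketch on a toy FASTQ using only awk. The adapter sequence, minimum length, and file names are hypothetical, and the match is exact; a real run would use a dedicated trimmer (cutadapt is a common choice) that tolerates mismatches and partial adapters.

```shell
set -eu
workdir=$(mktemp -d)
adapter="AGATCGGAAG"   # hypothetical adapter sequence, for illustration only
min_len=5              # hypothetical minimum read length after trimming

# Toy FASTQ: read1 and read3 contain the adapter, read2 does not.
printf '%s\n' \
  '@read1' 'ACGTACGTAGATCGGAAGTT' '+' 'IIIIIIIIIIIIIIIIIIII' \
  '@read2' 'GGGGCCCCAAAA' '+' 'IIIIIIIIIIII' \
  '@read3' 'ACGAGATCGGAAGA' '+' 'IIIIIIIIIIIIII' \
  > "$workdir/raw.fastq"

# Cut each read at the first exact adapter match, trim the quality string
# to the same length, and drop reads shorter than min_len.
awk -v adapter="$adapter" -v min_len="$min_len" '
  NR % 4 == 1 { header = $0 }
  NR % 4 == 2 {
    read_seq = $0
    pos = index(read_seq, adapter)
    if (pos > 0) read_seq = substr(read_seq, 1, pos - 1)
  }
  NR % 4 == 0 {
    qual = substr($0, 1, length(read_seq))
    if (length(read_seq) >= min_len) print header "\n" read_seq "\n+\n" qual
  }
' "$workdir/raw.fastq" > "$workdir/trimmed.fastq"

cat "$workdir/trimmed.fastq"
```

In this toy input, read1 is cut back to the bases before the adapter, read2 passes through unchanged, and read3 is discarded because too little sequence remains.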
2
Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR.
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_index/hg19"   # placeholder for the hg19 STAR genome index
INPUT_FASTQ="input_reads.fastq.gz"      # placeholder: reads left after the repeat-mapping reads from step 1 were removed
OUTPUT_PREFIX="aligned_reads"           # prefix for output files

# Run STAR alignment.
# "Repeat-mapping reads were removed" refers to reads that aligned to RepBase in step 1,
# so the input here must already exclude them.
# --outFilterMultimapNmax 1 is a separate, assumed choice that keeps only uniquely mapping reads.
# Drop --readFilesCommand zcat if the input FASTQ is not gzip-compressed.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --runThreadN 8   # example: use 8 threads
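Removing repeat-mapping reads amounts to discarding every read that aligned in the repeat step and keeping the rest. STAR can write the unaligned reads itself (its --outReadsUnmapped Fastx option); the sketch below shows the same idea by hand on toy data, filtering a FASTQ against a list of read IDs. All file names are hypothetical, and in a real run the ID list would have to be derived from the repeat alignment.

```shell
set -eu
workdir=$(mktemp -d)

# Toy adapter-trimmed reads.
printf '%s\n' \
  '@r1 extra' 'ACGT' '+' 'IIII' \
  '@r2' 'GGCC' '+' 'IIII' \
  '@r3' 'TTAA' '+' 'IIII' \
  '@r4' 'CCGG' '+' 'IIII' \
  > "$workdir/trimmed.fastq"

# Hypothetical list of read IDs that aligned to repetitive elements.
printf '%s\n' 'r2' 'r4' > "$workdir/repeat_ids.txt"

# First file: load the IDs to discard. Second file: on each header line,
# decide whether to keep the 4-line record, then print only kept records.
awk '
  NR == FNR { repeat_id[$1] = 1; next }
  FNR % 4 == 1 { id = substr($1, 2); keep = !(id in repeat_id) }
  keep { print }
' "$workdir/repeat_ids.txt" "$workdir/trimmed.fastq" > "$workdir/non_repeat.fastq"

kept_reads=$(awk 'NR % 4 == 1' "$workdir/non_repeat.fastq" | wc -l | tr -d ' ')
echo "reads kept after repeat removal: $kept_reads"
```

The read ID is taken as the first whitespace-delimited token of the header without the leading "@", which matches how aligners usually report read names.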
3
PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5' adapter and remaining reads retained as "usable reads".
$ Bash example
# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Remove PCR duplicate reads using UMIs. This assumes the UMIs were already extracted
# from the 5' adapter and appended to the read names (e.g., by a preceding
# 'umi_tools extract' step or during adapter trimming).
# --paired is for paired-end data; omit it for single-end reads.
# umi_tools dedup expects a coordinate-sorted, indexed BAM.
# --output-stats takes a file-name prefix; --log names the log file.
# Replace the file names below with real paths.
samtools index input_aligned_reads.bam

umi_tools dedup \
    --stdin input_aligned_reads.bam \
    --stdout deduplicated_usable_reads.bam \
    --paired \
    --output-stats deduplication_stats \
    --log deduplication.log
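To make the deduplication rule concrete: reads that share the same alignment start, strand, and UMI are treated as PCR copies of one molecule, and only one of them is kept. Here is a minimal sketch on a toy table using awk. The column layout and file names are hypothetical, and the comparison is exact; tools such as umi_tools can additionally merge UMIs that differ only by sequencing errors, which this sketch does not attempt.

```shell
set -eu
workdir=$(mktemp -d)

# Toy alignments. Columns: read_id, chrom, start, strand, umi
printf '%s\t%s\t%s\t%s\t%s\n' \
  readA chr1 1000 + ACGTA \
  readB chr1 1000 + ACGTA \
  readC chr1 1000 + TTGCA \
  readD chr1 1000 - ACGTA \
  readE chr2 5000 + ACGTA \
  readF chr2 5000 + ACGTA \
  > "$workdir/aligned.tsv"

# Keep the first read seen for each (chrom, start, strand, umi) combination.
awk -F '\t' '!seen[$2 FS $3 FS $4 FS $5]++' \
  "$workdir/aligned.tsv" > "$workdir/usable.tsv"

total=$(wc -l < "$workdir/aligned.tsv" | tr -d ' ')
usable=$(wc -l < "$workdir/usable.tsv" | tr -d ' ')
echo "usable reads: $usable of $total"
```

readB and readF are dropped as duplicates, while readC (different UMI) and readD (opposite strand) survive even though they start at the same position as readA.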
4
Peaks were called on the usable reads by CLIPper and assigned to gene regions annotated in Gencode (v19).
$ Bash example
# Install CLIPper if not already installed. It is distributed through the
# YeoLab/clipper GitHub repository; follow that repository's install instructions.

# Define input and output files
INPUT_BAM="usable_reads.bam"                   # placeholder: deduplicated usable reads (BAM)
OUTPUT_PEAKS="clipper_peaks.bed"
GENCODE_V19_BED="gencode.v19.annotation.bed"   # placeholder: Gencode v19 gene regions in BED format

# Make sure the Gencode v19 annotation is available.
# Example for downloading the GTF and converting gene records to BED (adjust paths and tools as needed):
# wget -O gencode.v19.annotation.gtf.gz "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
# gunzip gencode.v19.annotation.gtf.gz
# awk '$3 == "gene" {print $1"\t"($4-1)"\t"$5"\t"$10"\t0\t"$7}' gencode.v19.annotation.gtf | sed 's/[";]//g' > gencode.v19.annotation.bed

# 1. Call peaks with CLIPper.
# The executable name (clipper or clipper.py) depends on how it was installed.
clipper -b "${INPUT_BAM}" -s hg19 -o "${OUTPUT_PEAKS}"

# 2. Assign peaks to gene regions with bedtools intersect.
# Install bedtools if not already installed:
# conda install -c bioconda bedtools
# -s requires peak and gene to be on the same strand; this is an assumption for strand-specific CLIP data.
bedtools intersect -a "${OUTPUT_PEAKS}" -b "${GENCODE_V19_BED}" -s -wa -wb > "${OUTPUT_PEAKS%.bed}_annotated_genes.bed"
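Assigning a peak to a gene region reduces to an interval-overlap test. With BED's half-open coordinates, a peak and a gene overlap when each starts before the other ends, and for strand-specific data the strands should also match. The sketch below applies that rule to toy data with awk; the gene names, coordinates, and the "unassigned" label are hypothetical, and a real analysis would use bedtools or a similar tool on the full annotation.

```shell
set -eu
workdir=$(mktemp -d)

# Toy peaks (BED6): chrom, start, end, name, score, strand
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 150 peak1 0 + \
  chr1 400 450 peak2 0 - \
  chr2 100 150 peak3 0 + \
  > "$workdir/peaks.bed"

# Toy gene regions (BED6) with hypothetical names.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 50 300 GENE_A 0 + \
  chr1 350 500 GENE_B 0 + \
  chr2 120 900 GENE_C 0 + \
  > "$workdir/genes.bed"

# Load genes first, then report the first same-strand gene each peak overlaps.
awk -F '\t' -v OFS='\t' '
  NR == FNR {
    n++
    g_chr[n] = $1; g_start[n] = $2 + 0; g_end[n] = $3 + 0
    g_name[n] = $4; g_strand[n] = $6
    next
  }
  {
    hit = "unassigned"
    for (i = 1; i <= n; i++) {
      if ($1 == g_chr[i] && $6 == g_strand[i] &&
          ($2 + 0) < g_end[i] && g_start[i] < ($3 + 0)) {
        hit = g_name[i]
        break
      }
    }
    print $4, hit
  }
' "$workdir/genes.bed" "$workdir/peaks.bed" > "$workdir/peak_genes.tsv"

cat "$workdir/peak_genes.tsv"
```

peak2 lies inside GENE_B's coordinates but on the opposite strand, so it stays unassigned; dropping the strand condition would assign it to GENE_B.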
5
Each peak was normalized to the size-matched input (SMInput)
$ Bash example
# Install bedtools and samtools if not already installed
# conda install -c bioconda bedtools samtools

# Define input files (replace with actual paths)
# IP_PEAKS: BED file containing the peaks called on the IP sample.
# IP_BAM: aligned reads (BAM) for the immunoprecipitation (IP) sample.
# SM_INPUT_BAM: aligned reads (BAM) for the size-matched input (SMInput) sample.
# IP_LIB_SIZE_FILE: file containing the total number of mapped reads for the IP sample.
# SM_INPUT_LIB_SIZE_FILE: file containing the total number of mapped reads for the SMInput sample.
IP_PEAKS="ip_peaks.bed"
IP_BAM="ip_aligned.bam"
SM_INPUT_BAM="sm_input_aligned.bam"
IP_LIB_SIZE_FILE="ip_library_size.txt"
SM_INPUT_LIB_SIZE_FILE="sm_input_library_size.txt"

# Example of generating the library size files (counts mapped primary alignments):
# samtools view -c -F 260 "${IP_BAM}" > "${IP_LIB_SIZE_FILE}"
# samtools view -c -F 260 "${SM_INPUT_BAM}" > "${SM_INPUT_LIB_SIZE_FILE}"

# Read library sizes (total mapped reads) from the respective files
IP_LIB_SIZE=$(cat "${IP_LIB_SIZE_FILE}")
SM_INPUT_LIB_SIZE=$(cat "${SM_INPUT_LIB_SIZE_FILE}")

# 1. Count reads in peaks for both samples with bedtools multicov (both BAMs must be indexed).
# multicov appends one count column per BAM after the existing BED columns,
# so the last two columns of peak_raw_counts.tsv are IP_counts and SMInput_counts.
bedtools multicov -bams "${IP_BAM}" "${SM_INPUT_BAM}" -bed "${IP_PEAKS}" > peak_raw_counts.tsv

# 2. Normalize: compute reads per million (RPM) for IP and SMInput,
# then the ratio IP_RPM / SMInput_RPM for each peak.
# The RPM ratio is an assumed reading of "normalized to the size-matched input".
# No pseudocount is added; peaks with zero SMInput signal get "NA".
awk -v ip_lib_size="${IP_LIB_SIZE}" -v sm_input_lib_size="${SM_INPUT_LIB_SIZE}" '
BEGIN {
    OFS = "\t"
    print "chrom", "start", "end", "name", "IP_counts", "SMInput_counts", "IP_RPM", "SMInput_RPM", "Normalized_Ratio"
}
{
    # The counts are the last two columns, whatever the width of the input BED
    ip_counts = $(NF - 1)
    sm_input_counts = $NF

    # RPM for each sample (guarding against an empty library)
    ip_rpm = (ip_lib_size > 0) ? (ip_counts / ip_lib_size) * 1000000 : 0
    sm_input_rpm = (sm_input_lib_size > 0) ? (sm_input_counts / sm_input_lib_size) * 1000000 : 0

    # Normalized ratio (IP_RPM / SMInput_RPM)
    if (sm_input_rpm > 0) {
        normalized_ratio = ip_rpm / sm_input_rpm
    } else {
        normalized_ratio = "NA"
    }
    print $1, $2, $3, $4, ip_counts, sm_input_counts, ip_rpm, sm_input_rpm, normalized_ratio
}' peak_raw_counts.tsv > normalized_peaks.tsv

# Reference genome: hg19 (used for alignment and peak calling, not directly in this normalization step)
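The dataset's supplementary files report a log2 fold change of IP over SMInput per peak. As a worked example of that quantity, the sketch below computes a library-size-scaled log2 fold change on toy counts. The library sizes, the counts, and the pseudocount of 1 (used to avoid dividing by zero) are all hypothetical choices and not the authors' documented procedure. The p-value that accompanies each peak in the deposited files is not computed here.

```shell
set -eu
workdir=$(mktemp -d)
ip_total=1000000      # hypothetical usable reads in the IP library
input_total=2000000   # hypothetical usable reads in the SMInput library
pseudocount=1         # assumption: added to both counts to avoid division by zero

# Toy counts. Columns: peak name, IP reads in peak, SMInput reads in peak
printf '%s\t%s\t%s\n' \
  peak1 15 3 \
  peak2 7 15 \
  peak3 31 0 \
  > "$workdir/counts.tsv"

# fold = ((ip + pc) / ip_total) / ((input + pc) / input_total), rearranged
# so that the intermediate products are exact integers.
awk -F '\t' -v ip_total="$ip_total" -v input_total="$input_total" -v pc="$pseudocount" '
  {
    fold = (($2 + pc) * input_total) / (($3 + pc) * ip_total)
    printf "%s\t%.2f\n", $1, log(fold) / log(2)
  }
' "$workdir/counts.tsv" > "$workdir/log2fc.tsv"

cat "$workdir/log2fc.tsv"
```

For peak1 the arithmetic is (16 / 1,000,000) / (4 / 2,000,000) = 8, so the log2 fold change is 3. peak2 has the same scaled signal in both libraries and comes out at 0.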
Tools Used
Raw Source Text
Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR. Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5' adapter and remaining reads retained as "usable reads". Peaks were called on the usable reads by CLIPper, and assigned to gene regions annotated in Gencode (v19). Each peak was normalized to the size-matched input (SMInput).
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Each bed file was generated by read normalization between IP over SMInput. The columns in the bed files represent (chr, start, stop, -log10(pvalue), log2(fold change), strand). The CSV files contain the log2 fold change of reads upon IP over SMInput in annotated regions of each transcripts. Each bigwig file contains read distribution of each RBP, and bigbed contains clusters of predicted RBP binding.
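Given the BED column layout described above (chr, start, stop, -log10(pvalue), log2(fold change), strand), a common first step when reusing the deposited files is to keep only peaks that pass significance and enrichment cutoffs. Here is a minimal sketch with awk on toy rows. The cutoff values and file names are hypothetical and should be chosen for the analysis at hand.

```shell
set -eu
workdir=$(mktemp -d)
min_neglog10_p=3   # hypothetical cutoff on column 4, -log10(pvalue)
min_log2_fc=3      # hypothetical cutoff on column 5, log2(fold change)

# Toy rows in the layout described for the supplementary BED files.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 150 5.2 4.1 + \
  chr1 400 450 1.3 6.0 - \
  chr2 100 150 8.0 2.5 + \
  chr3 700 760 3.0 3.0 - \
  > "$workdir/peaks.bed"

# Keep rows where both values meet their cutoffs (compared numerically).
awk -F '\t' -v p="$min_neglog10_p" -v fc="$min_log2_fc" \
  '($4 + 0) >= (p + 0) && ($5 + 0) >= (fc + 0)' \
  "$workdir/peaks.bed" > "$workdir/significant.bed"

n_sig=$(wc -l < "$workdir/significant.bed" | tr -d ' ')
echo "peaks passing both cutoffs: $n_sig"
```

The second row fails on significance and the third on enrichment, so only the first and last rows are kept.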