GSE180955 Processing Pipeline
RIP-Seq
5 steps
Publication
The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors. Nature Communications (2022), PMID 35781533
Dataset
GSE180955: RBM17 Mediates Evasion of Pro-Leukemic Factors from Splicing-coupled NMD to Enforce Leukemic Stem Cell Maintenance
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_FASTQ="raw_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"
REPORT_FILE="cutadapt_report.txt"

# Adapter and barcode settings are placeholders; the actual sequences depend on
# the library preparation and the specific eCLIP protocol.
# 3' adapter sequence (e.g., from Illumina TruSeq Small RNA or similar)
ADAPTER_3PRIME="TGGAATTCTCGGGTGCCAAGGAACTCCAG"
# eCLIP reads often carry a random barcode at the 5' end.
# Example: remove a fixed 4 bp barcode from the 5' end.
BARCODE_5PRIME_LENGTH=4

# Other common trimming parameters
MIN_LENGTH=18      # Minimum read length after trimming (e.g., 18-20 bp for eCLIP)
QUALITY_CUTOFF=20  # Phred quality cutoff for 3' end trimming
THREADS=8          # Number of CPU threads to use

# Run cutadapt on single-end reads.
# -u removes a fixed number of bases from the 5' end. A 5' adapter made only of
# N wildcards (-g "NNNN") would match any sequence, so it is not used here.
cutadapt \
  -u "${BARCODE_5PRIME_LENGTH}" \
  -a "${ADAPTER_3PRIME}" \
  -m "${MIN_LENGTH}" \
  -q "${QUALITY_CUTOFF}" \
  --cores="${THREADS}" \
  --report=full \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}" \
  > "${REPORT_FILE}" 2>&1

# Note: for paired-end reads, -A gives the 3' adapter of read 2 and -U removes
# a fixed number of bases from read 2.
# Example for paired-end:
# cutadapt \
#   -u "${BARCODE_5PRIME_LENGTH}" \
#   -a "${ADAPTER_3PRIME_R1}" \
#   -A "${ADAPTER_3PRIME_R2}" \
#   -m "${MIN_LENGTH}" \
#   -q "${QUALITY_CUTOFF}" \
#   --cores="${THREADS}" \
#   --report=full \
#   -o "${OUTPUT_FASTQ_R1}" \
#   -p "${OUTPUT_FASTQ_R2}" \
#   "${INPUT_FASTQ_R1}" \
#   "${INPUT_FASTQ_R2}" \
#   > "${REPORT_FILE}" 2>&1
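After trimming, it can help to confirm that the output respects the minimum-length setting. The following is a minimal sketch, not part of the published pipeline: the function name `fastq_stats` and the three-read demo file are hypothetical, and it assumes a standard four-line-per-record FASTQ.

```shell
#!/usr/bin/env bash
# Sketch: summarize read count and length range of a FASTQ stream.
# fastq_stats and the demo data are hypothetical, for illustration only.
set -euo pipefail

fastq_stats() {
    # FASTQ on stdin; sequence lines are lines 2, 6, 10, ... of the stream.
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (n == 1 || len > max) max = len
         }
         END { printf "reads=%d min_len=%d max_len=%d\n", n, min, max }'
}

# Tiny made-up FASTQ standing in for trimmed output (read lengths 18, 20, 25).
demo_fastq="$(mktemp)"
cat > "${demo_fastq}" <<'EOF'
@read1
ACGTACGTACGTACGTAC
+
IIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
@read3
ACGTACGTACGTACGTACGTACGTA
+
IIIIIIIIIIIIIIIIIIIIIIIII
EOF

demo_summary="$(fastq_stats < "${demo_fastq}")"
echo "${demo_summary}"
rm -f "${demo_fastq}"
```

For a real gzipped file, the same function could be fed with `gzip -cd trimmed_reads.fastq.gz | fastq_stats`; a `min_len` below the `-m` value would suggest the trimming step did not run as intended.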
2
Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.
$ Bash example
# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Note: the source text does not name the aligner used for this step;
# bowtie2 is an assumption made for this example.

# Define variables
# INPUT_FASTQ: gzipped FASTQ file containing trimmed reads.
# REPBASE_INDEX_PREFIX: path and prefix of the RepBase bowtie2 index files.
#   The index is built from a FASTA file of repetitive sequences (e.g., hg38_repbase.fa).
#   For eCLIP, this often refers to a pre-built index provided with the pipeline.
# OUTPUT_UNMAPPED_FASTQ: gzipped FASTQ file for reads that did NOT map to RepBase.
# THREADS: number of threads for bowtie2.
INPUT_FASTQ="trimmed_reads.fastq.gz"
REPBASE_INDEX_PREFIX="path/to/repbase_index/hg38_repbase"  # Example: built from hg38_repbase.fa
OUTPUT_UNMAPPED_FASTQ="unmapped_from_repbase.fastq.gz"
THREADS=8

# Build the RepBase index (if not already built).
# This is typically done once per reference database, after obtaining the
# RepBase FASTA file (e.g., hg38_repbase.fa):
# bowtie2-build hg38_repbase.fa "${REPBASE_INDEX_PREFIX}"

# Map trimmed reads against RepBase. Reads that do not map are written to
# OUTPUT_UNMAPPED_FASTQ; the alignments themselves are discarded.
# (Paired-end-only options such as --no-dovetail are omitted for single-end input.)
bowtie2 \
  -p "${THREADS}" \
  -x "${REPBASE_INDEX_PREFIX}" \
  -U "${INPUT_FASTQ}" \
  --un-gz "${OUTPUT_UNMAPPED_FASTQ}" \
  --very-sensitive-local \
  --no-unal \
  --no-hd \
  --no-sq \
  -S /dev/null
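A quick way to see how much the repeat filter removed is to compare read counts before and after it. This is a sketch under stated assumptions: the helper names `count_reads` and `percent_removed` are hypothetical, and the demo records are made up rather than taken from the dataset.

```shell
#!/usr/bin/env bash
# Sketch: report the percentage of reads removed by a filtering step.
# count_reads and percent_removed are hypothetical helper names.
set -euo pipefail

count_reads() {
    # FASTQ on stdin; each record spans four lines, so reads = lines / 4.
    awk 'END { print int(NR / 4) }'
}

percent_removed() {
    # $1 = reads before filtering, $2 = reads remaining after filtering
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before <= 0) { print "NA"; exit }
        printf "%.2f\n", 100 * (before - after) / before
    }'
}

# Made-up demo: four records before filtering, three after.
before_count="$(printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 4 | count_reads)"
after_count="$(printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 | count_reads)"
removed_pct="$(percent_removed "${before_count}" "${after_count}")"
echo "before=${before_count} after=${after_count} removed=${removed_pct}%"
```

With real files, the counts could come from `gzip -cd trimmed_reads.fastq.gz | count_reads` and `gzip -cd unmapped_from_repbase.fastq.gz | count_reads`.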
3
Remaining reads were mapped to the appropriate genome build using the STAR aligner.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/genome_dir/hg38_star_index"  # Placeholder: STAR genome index for hg38
INPUT_READS="unmapped_from_repbase.fastq.gz"      # Reads remaining after repeat filtering (assuming single-end)
OUTPUT_PREFIX="mapped_reads"                      # Prefix for output files
THREADS=8                                         # Number of threads to use

# Run STAR alignment.
# --readFilesCommand is needed because the input is gzipped.
# --alignMatesGapMax only affects paired-end reads and is harmless here.
STAR \
  --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_READS}" \
  --readFilesCommand gunzip -c \
  --outFileNamePrefix "${OUTPUT_PREFIX}." \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterType BySJout \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --outFilterMultimapNmax 20 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --sjdbScore 1
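STAR writes a summary to `<prefix>Log.final.out`, which with the prefix above would be `mapped_reads.Log.final.out`. The sketch below pulls the uniquely mapped percentage out of that summary; the function name `uniquely_mapped_pct` is hypothetical, and the demo text is a made-up two-line excerpt that only imitates the log layout.

```shell
#!/usr/bin/env bash
# Sketch: extract the "Uniquely mapped reads %" value from STAR's Log.final.out.
# uniquely_mapped_pct is a hypothetical helper; the demo numbers are invented.
set -euo pipefail

uniquely_mapped_pct() {
    # Log text on stdin; fields are separated by "|", the value is in field 2.
    awk -F'|' '/Uniquely mapped reads %/ {
        gsub(/[ \t%]/, "", $2)   # drop padding and the percent sign
        print $2
    }'
}

# Made-up excerpt imitating the log layout (label | TAB value).
demo_pct="$(printf '   Number of input reads |\t1000\n   Uniquely mapped reads %% |\t89.03%%\n' \
    | uniquely_mapped_pct)"
echo "uniquely mapped: ${demo_pct}%"
```

On a real run this could be called as `uniquely_mapped_pct < mapped_reads.Log.final.out`.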
4
For eCLIP samples, read densities were calculated to identify eCLIP peaks.
$ Bash example
# Note: the source text does not name the peak caller. CLIPper, the peak caller
# used by the YeoLab eCLIP pipeline, is assumed here.

# Install dependencies (if not already installed)
# pip install numpy scipy pysam pybedtools pyBigWig
#
# Clone the clipper repository (if not already cloned) and install it as
# described in its README
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Define input and output paths
# Replace 'eclip_sample.bam' with the actual aligned eCLIP BAM file
ECLIP_BAM="eclip_sample.bam"

# Species / genome build key. CLIPper's -s option takes a species name, not a
# chromosome sizes file. 'hg38' is a placeholder; the value must be a key that
# the installed CLIPper version supports (check `clipper --help`).
SPECIES="hg38"

# Output BED file for the identified peaks
OUTPUT_BED="eclip_peaks.bed"

# Run clipper to identify eCLIP peaks from read densities
# -b: input BAM file
# -s: species / genome build
# -o: output BED file
# Significance thresholds are left at the tool's defaults.
clipper \
  -b "${ECLIP_BAM}" \
  -s "${SPECIES}" \
  -o "${OUTPUT_BED}"
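To make the idea of a read density concrete, the sketch below counts read start positions in fixed-size bins and lists the bins that reach a minimum count. It is a deliberately crude illustration, not CLIPper's algorithm and not the authors' method; the function name `dense_bins`, the bin size, the threshold, and the demo intervals are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: crude binned read density from BED-like read intervals.
# dense_bins, the bin size and the threshold are hypothetical choices.
set -euo pipefail

dense_bins() {
    # $1 = bin size in bp, $2 = minimum reads per bin
    # stdin: tab-separated chrom, start, end (one read per line)
    awk -v bin="$1" -v min_reads="$2" 'BEGIN { FS = OFS = "\t" }
        {
            b = int($2 / bin) * bin      # bin start containing the read start
            count[$1 OFS b]++
        }
        END {
            for (key in count)
                if (count[key] >= min_reads) {
                    split(key, f, OFS)
                    print f[1], f[2], f[2] + bin, count[key]
                }
        }' | LC_ALL=C sort -k1,1 -k2,2n
}

# Made-up reads: three start in chr1:100-200, one in chr1:5000-5100, two in chr2:300-400.
demo_bins="$(printf '%s\t%s\t%s\n' \
    chr1 100 140  chr1 110 150  chr1 120 160 \
    chr1 5000 5040  chr2 300 340  chr2 310 350 \
    | dense_bins 100 2)"
echo "${demo_bins}"
```

Each output row is chrom, bin start, bin end, and read count. A real peak caller additionally models background and tests significance, which this sketch does not attempt.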
5
The eCLIP data processing pipeline can be obtained from the following link: https://github.com/YeoLab/eclip
$ Bash example
# Install cwltool (if not already installed)
# pip install cwltool

# Clone the YeoLab eCLIP CWL pipeline repository
git clone https://github.com/YeoLab/eclip.git
cd eclip

# --- Placeholder for input data and reference genome ---
# Replace with actual paths to your eCLIP FASTQ files, genome FASTA, and STAR index.
# For human (hg38) as a placeholder:
# The reference genome FASTA (e.g., hg38) can be downloaded from UCSC or Ensembl.
# Example download:
# wget -P /path/to/references http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip /path/to/references/hg38.fa.gz
# mkdir -p /path/to/references/hg38
# mv /path/to/references/hg38.fa /path/to/references/hg38/hg38.fa
# Example STAR index generation (adjust parameters as needed):
# mkdir -p /path/to/references/hg38_star_index
# STAR --runMode genomeGenerate \
#   --genomeDir /path/to/references/hg38_star_index \
#   --genomeFastaFiles /path/to/references/hg38/hg38.fa \
#   --runThreadN 8  # Adjust thread count

# Define paths for the workflow inputs
ECLIP_FASTQ="/path/to/your/eclip_sample.fastq.gz"     # Example eCLIP FASTQ file
GENOME_FASTA="/path/to/references/hg38/hg38.fa"       # Example reference genome FASTA
STAR_INDEX_DIR="/path/to/references/hg38_star_index"  # Example STAR index directory
OUTPUT_PREFIX="eclip_sample_processed"                # Output prefix for generated files
OUTPUT_DIR="/path/to/eclip_output"                    # Directory for all output files
mkdir -p "${OUTPUT_DIR}"

# NOTE: the workflow file names and the input keys below are illustrative
# placeholders that have not been verified against the repository. Check the
# repository's README and example input files for the actual names.

# Create an input YAML file for the main eCLIP workflow
cat <<EOF > eclip_workflow_inputs.yaml
fastq_file:
  class: File
  path: ${ECLIP_FASTQ}
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_star_index:
  class: Directory
  path: ${STAR_INDEX_DIR}
output_prefix: "${OUTPUT_PREFIX}"
EOF

# Execute the main eCLIP CWL workflow (alignment, BAM processing, and initial filtering)
cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_workflow.cwl eclip_workflow_inputs.yaml

# --- Subsequent steps for a complete eCLIP analysis ---
# After running the main workflow, you would typically proceed with:
# 1. Peak calling: using 'workflows/eclip_peak_calling_workflow.cwl' with aligned
#    BAMs from eCLIP and input control samples.
#    Example: cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_peak_calling_workflow.cwl peak_calling_inputs.yaml
# 2. IDR (Irreproducible Discovery Rate): using 'workflows/eclip_idr_workflow.cwl'
#    with peak files from biological replicates.
#    Example: cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_idr_workflow.cwl idr_inputs.yaml
# For more details on these steps and their inputs, refer to the YeoLab/eclip GitHub repository.
Raw Source Text
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.
Remaining reads were mapped to the appropriate genome build using STAR aligner
For eCLIP samples, read densities were calculated to identify eCLIP peaks.
eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip
Supplementary_files_format_and_content: bigwig, bed
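Since the supplementary files include BED, a first look at a peak file can be as simple as counting intervals and averaging their widths. The following is a minimal sketch; the function name `bed_summary` and the demo intervals are hypothetical, and it assumes tab-separated BED with at least three columns.

```shell
#!/usr/bin/env bash
# Sketch: count intervals and report mean width for a BED file.
# bed_summary and the demo intervals are hypothetical.
set -euo pipefail

bed_summary() {
    # BED on stdin; header lines (#, track, browser) are skipped.
    awk 'BEGIN { FS = "\t" }
         !/^(#|track|browser)/ && NF >= 3 { n++; total += $3 - $2 }
         END {
             if (n == 0) { print "peaks=0"; exit }
             printf "peaks=%d mean_width=%.1f\n", n, total / n
         }'
}

# Made-up BED content: one header line and three intervals of width 50, 100, 30.
demo_bed_summary="$(printf 'track name=demo\nchr1\t100\t150\nchr1\t400\t500\nchr2\t10\t40\n' \
    | bed_summary)"
echo "${demo_bed_summary}"
```

A downloaded supplementary file could be summarized the same way, for example `gzip -cd peaks.bed.gz | bed_summary`, where `peaks.bed.gz` is a placeholder file name.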