GSE180955 Processing Pipeline

RIP-Seq, 5 steps

Publication

The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.

Nature Communications (2022), PMID 35781533

Dataset

GSE180955

RBM17 Mediates Evasion of Pro-Leukemic Factors from Splicing-coupled NMD to Enforce Leukemic Stem Cell Maintenance

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

    cutadapt v4.0
    $ Bash example
    # Install cutadapt (e.g., via conda)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    INPUT_FASTQ="raw_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads.fastq.gz"
    REPORT_FILE="cutadapt_report.txt"
    
    # Define common eCLIP adapter sequences and barcode (placeholders)
    # These are examples; actual sequences depend on library preparation and specific eCLIP protocol.
    # 3' adapter sequence (e.g., from Illumina TruSeq Small RNA or similar)
    ADAPTER_3PRIME="TGGAATTCTCGGGTGCCAAGGAACTCCAG"
    # 5' barcode. For eCLIP, a random barcode often sits at the 5' end of the read.
    # Example: remove a fixed 4 bp barcode from the 5' end with -u/--cut, which
    # cutadapt applies before adapter trimming. Cutting discards the barcode; if it
    # is needed later (e.g., for PCR-duplicate removal), move it to the read name first.
    BARCODE_5PRIME_LENGTH=4 # Example: 4 bp random barcode
    
    # Other common parameters for eCLIP trimming:
    MIN_LENGTH=18 # Minimum read length after trimming (e.g., 18-20 bp for eCLIP)
    QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming (e.g., Phred score 20)
    THREADS=8 # Number of CPU threads to use
    
    # Execute cutadapt command for single-end reads
    cutadapt \
      -u "${BARCODE_5PRIME_LENGTH}" \
      -a "${ADAPTER_3PRIME}" \
      -m "${MIN_LENGTH}" \
      -q "${QUALITY_CUTOFF}" \
      --cores="${THREADS}" \
      --report=full \
      -o "${OUTPUT_FASTQ}" \
      "${INPUT_FASTQ}" \
      > "${REPORT_FILE}" 2>&1
    
    # Note: For paired-end reads, the command would be more complex, 
    # typically involving -A for the reverse read's 3' adapter and -G for its 5' adapter/barcode.
    # Example for paired-end:
    # cutadapt \
    #   -a "${ADAPTER_3PRIME_R1}" \
    #   -g "${ADAPTER_5PRIME_BARCODE_R1}" \
    #   -A "${ADAPTER_3PRIME_R2}" \
    #   -G "${ADAPTER_5PRIME_BARCODE_R2}" \
    #   -m "${MIN_LENGTH}" \
    #   -q "${QUALITY_CUTOFF}" \
    #   --cores="${THREADS}" \
    #   --report=full \
    #   -o "${OUTPUT_FASTQ_R1}" \
    #   -p "${OUTPUT_FASTQ_R2}" \
    #   "${INPUT_FASTQ_R1}" \
    #   "${INPUT_FASTQ_R2}" \
    #   > "${REPORT_FILE}" 2>&1
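
To check how many reads survive trimming, the records in the input and output FASTQ files can be counted. The following is a minimal sketch, assuming standard four-line FASTQ records; the function name `count_fastq_reads` and the tiny mock files are hypothetical and stand in for the real cutadapt input and output.

```shell
set -eu

# Count records in a FASTQ file (plain or gzipped); assumes 4 lines per record.
count_fastq_reads() {
  case "$1" in
    *.gz) gzip -dc "$1" | awk 'END { print NR / 4 }' ;;
    *)    awk 'END { print NR / 4 }' "$1" ;;
  esac
}

# Mock files (hypothetical) standing in for cutadapt input and output.
demo_dir="$(mktemp -d)"
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGT\n+\nIIII\n' \
  | gzip > "${demo_dir}/raw_reads.fastq.gz"
printf '@r1\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' \
  | gzip > "${demo_dir}/trimmed_reads.fastq.gz"

raw_n="$(count_fastq_reads "${demo_dir}/raw_reads.fastq.gz")"
kept_n="$(count_fastq_reads "${demo_dir}/trimmed_reads.fastq.gz")"
echo "reads before trimming: ${raw_n}; after trimming: ${kept_n}"
```

A large drop between the two counts usually points at a wrong adapter sequence or a minimum length that is too strict for the library.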
  2.

    Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.

    bowtie2 v2.3.4.3 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install bowtie2 if not already installed
    # conda install -c bioconda bowtie2
    
    # Define variables
    # INPUT_FASTQ: Path to the gzipped FASTQ file containing trimmed reads.
    # REPBASE_INDEX_PREFIX: Path and prefix for the RepBase bowtie2 index files.
    #   This index should be built from a FASTA file containing repetitive sequences (e.g., hg38_repbase.fa).
    #   For eCLIP, this often refers to a pre-built index provided with the pipeline.
    # OUTPUT_UNMAPPED_FASTQ: Path for the gzipped FASTQ file to store reads that did NOT map to RepBase.
    # THREADS: Number of threads to use for bowtie2.
    INPUT_FASTQ="trimmed_reads.fastq.gz"
    REPBASE_INDEX_PREFIX="path/to/repbase_index/hg38_repbase" # Example: built from hg38_repbase.fa
    OUTPUT_UNMAPPED_FASTQ="unmapped_from_repbase.fastq.gz"
    THREADS=8
    
    # Build RepBase index (if not already built)
    # This step is typically performed once for a given reference database.
    # You would first need to obtain the RepBase FASTA file (e.g., hg38_repbase.fa).
    # For example, if you have 'hg38_repbase.fa':
    # bowtie2-build hg38_repbase.fa ${REPBASE_INDEX_PREFIX}
    
    # Map trimmed reads against RepBase to identify and remove repetitive sequences.
    # Reads that do not map to RepBase are written to OUTPUT_UNMAPPED_FASTQ.
    # Alignments are discarded (-S /dev/null); only the unaligned reads are kept.
    bowtie2 \
        -p "${THREADS}" \
        -x "${REPBASE_INDEX_PREFIX}" \
        -U "${INPUT_FASTQ}" \
        --un-gz "${OUTPUT_UNMAPPED_FASTQ}" \
        --very-sensitive-local \
        --no-unal \
        --no-hd \
        --no-sq \
        -S /dev/null
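
bowtie2 prints an alignment summary to standard error, which shows how many reads the repeat filter removed. The sketch below assumes that summary was saved by appending something like `2> bowtie2_repbase.log` to the command (a hypothetical file name); a mock log with made-up numbers stands in for it here.

```shell
set -eu

# Mock bowtie2 summary (illustrative numbers, single-end layout).
log_file="$(mktemp)"
printf '%s\n' \
  '10000 reads; of these:' \
  '  10000 (100.00%) were unpaired; of these:' \
  '    8500 (85.00%) aligned 0 times' \
  '    900 (9.00%) aligned exactly 1 time' \
  '    600 (6.00%) aligned >1 times' \
  '15.00% overall alignment rate' > "${log_file}"

# Reads that aligned to RepBase are the ones removed by this step.
repeat_rate="$(awk '/overall alignment rate/ { print $1 }' "${log_file}")"
# Reads that aligned 0 times are the ones written by --un-gz.
kept_reads="$(awk '/aligned 0 times/ { print $1 }' "${log_file}")"

echo "repeat-mapping rate: ${repeat_rate}; reads passed on to genome mapping: ${kept_reads}"
```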
  3.

    Remaining reads were mapped to the appropriate genome build using the STAR aligner.

    STAR (version not stated; entry inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    GENOME_DIR="/path/to/genome_dir/hg38_star_index" # Placeholder for STAR genome index for hg38
    INPUT_READS="unmapped_from_repbase.fastq.gz" # Reads left after the RepBase filter (single-end assumed)
    OUTPUT_PREFIX="mapped_reads" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Run STAR alignment
    # The source does not state STAR parameters; the filter settings below are illustrative.
    # --readFilesCommand zcat lets STAR read the gzipped FASTQ directly.
    STAR \
      --runThreadN "${THREADS}" \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${INPUT_READS}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_PREFIX}." \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterType BySJout \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.04 \
      --outFilterMultimapNmax 20 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --sjdbScore 1
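
STAR writes a summary named `Log.final.out` next to its other outputs (here `mapped_reads.Log.final.out`, given the prefix above). A minimal sketch for pulling individual metrics out of that file follows; the helper name `extract_star_metric` is hypothetical, and the mock file contains made-up values in STAR's `label |<tab>value` layout.

```shell
set -eu

# Print the value of one metric from a STAR Log.final.out file.
# $1 = metric label exactly as printed left of the "|"; $2 = path to the log.
extract_star_metric() {
  awk -F'|' -v label="$1" '
    {
      key = $1
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the label
    }
    key == label {
      val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the tab before the value
      print val
    }
  ' "$2"
}

# Mock log (illustrative numbers only).
star_log="$(mktemp)"
printf '%s |\t%s\n' \
  '                          Number of input reads' '8500' \
  '                        Uniquely mapped reads %' '80.00%' > "${star_log}"

input_reads="$(extract_star_metric 'Number of input reads' "${star_log}")"
unique_pct="$(extract_star_metric 'Uniquely mapped reads %' "${star_log}")"
echo "input reads: ${input_reads}; uniquely mapped: ${unique_pct}"
```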
  4.

    For eCLIP samples, read densities were calculated to identify eCLIP peaks.

    eCLIP peak calling (clipper, installed from source)
    $ Bash example
    # Install dependencies (if not already installed)
    # pip install numpy scipy pysam pybedtools pyBigWig
    #
    # Clone the clipper repository (if not already cloned)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper   # then install it following the repository README
    
    # Define input and output paths
    # Replace 'eclip_sample.bam' with the actual aligned eCLIP BAM file
    ECLIP_BAM="eclip_sample.bam"
    
    # clipper selects its bundled gene annotation by a species/genome identifier,
    # not by a chromosome sizes file. "GRCh38" is a placeholder assumed to match the
    # hg38 index used for mapping; confirm the identifiers your installed version
    # accepts with `clipper --help`.
    SPECIES="GRCh38"
    
    # Define the output BED file for the identified peaks
    OUTPUT_BED="eclip_peaks.bed"
    
    # Run clipper to calculate read densities and identify eCLIP peaks
    # --bam: Input BAM file (coordinate-sorted and indexed)
    # --species: Genome identifier
    # --outfile: Output BED file of peaks
    # Significance thresholds are left at clipper's defaults; see `clipper --help`
    # for the cutoff options available in your installed version.
    clipper \
        --bam "${ECLIP_BAM}" \
        --species "${SPECIES}" \
        --outfile "${OUTPUT_BED}"
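
Read densities are commonly compared across samples after scaling to reads per million (RPM). The source does not say which normalization was used, so the following is only a sketch under that assumption: it rescales a four-column bedGraph of raw coverage (which could come from a tool such as `bedtools genomecov -bg -ibam`, also an assumption) by the total number of mapped reads. The function name and the mock data are hypothetical.

```shell
set -eu

# Scale raw bedGraph coverage to reads per million (RPM).
# $1 = total mapped reads; $2 = bedGraph with columns chrom, start, end, raw count.
normalize_bedgraph_rpm() {
  awk -v total="$1" '
    { printf "%s\t%s\t%s\t%.4f\n", $1, $2, $3, $4 * 1000000 / total }
  ' "$2"
}

# Mock bedGraph (illustrative values) and a mock library size of 2,000,000 reads.
raw_bg="$(mktemp)"
printf 'chr1\t100\t150\t5\nchr1\t150\t200\t20\n' > "${raw_bg}"

rpm_out="$(normalize_bedgraph_rpm 2000000 "${raw_bg}")"
printf '%s\n' "${rpm_out}"
```

The scaled bedGraph can then be converted to bigWig, one of the file types listed for this dataset, with a converter of your choice.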
  5.

    The eCLIP data processing pipeline can be requested from the following link: https://github.com/YeoLab/eclip

    eCLIP: YeoLab CWL pipeline from https://github.com/YeoLab/eclip (pre-2021 version)
    $ Bash example
    # Install cwltool (if not already installed)
    # pip install cwltool
    
    # Clone the YeoLab eCLIP CWL pipeline repository
    git clone https://github.com/YeoLab/eclip.git
    cd eclip
    
    # --- Placeholder for input data and reference genome ---
    # Replace with actual paths to your eCLIP FASTQ files, genome FASTA, and STAR index.
    # For human (hg38) as a placeholder:
    # Reference genome FASTA (e.g., hg38) can be downloaded from UCSC or Ensembl.
    # Example download: wget -P /path/to/references http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip /path/to/references/hg38.fa.gz
    # mv /path/to/references/hg38.fa /path/to/references/hg38/hg38.fa
    
    # Example STAR index generation (adjust parameters as needed):
    # mkdir -p /path/to/references/hg38_star_index
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/references/hg38_star_index \
    #      --genomeFastaFiles /path/to/references/hg38/hg38.fa \
    #      --runThreadN 8 # Adjust thread count
    
    # Define paths for the workflow inputs
    # Example eCLIP FASTQ file (replace with your actual data)
    ECLIP_FASTQ="/path/to/your/eclip_sample.fastq.gz"
    # Example reference genome FASTA path
    GENOME_FASTA="/path/to/references/hg38/hg38.fa"
    # Example STAR index directory path
    STAR_INDEX_DIR="/path/to/references/hg38_star_index"
    # Output prefix for generated files
    OUTPUT_PREFIX="eclip_sample_processed"
    # Directory for all output files
    OUTPUT_DIR="/path/to/eclip_output"
    mkdir -p "${OUTPUT_DIR}"
    
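    # Create an input YAML file for the main eCLIP workflow.
    # The input keys below are illustrative placeholders; take the actual names
    # from the workflow definitions in the cloned repository.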
    cat <<EOF > eclip_workflow_inputs.yaml
    fastq_file:
      class: File
      path: ${ECLIP_FASTQ}
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_star_index:
      class: Directory
      path: ${STAR_INDEX_DIR}
    output_prefix: "${OUTPUT_PREFIX}"
    EOF
    
    # Execute the main eCLIP CWL workflow (alignment, BAM processing, and initial filtering).
    # 'workflows/eclip_workflow.cwl' is a placeholder path; substitute the workflow file shipped in the repository.
    cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_workflow.cwl eclip_workflow_inputs.yaml
    
    # --- Subsequent steps for a complete eCLIP analysis ---
    # After running the main workflow, you would typically proceed with the steps below.
    # The workflow file names shown are placeholders as well.
    # 1. Peak calling: using 'workflows/eclip_peak_calling_workflow.cwl' with aligned BAMs from eCLIP and input control samples.
    #    Example: cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_peak_calling_workflow.cwl peak_calling_inputs.yaml
    # 2. IDR (Irreproducible Discovery Rate): using 'workflows/eclip_idr_workflow.cwl' with peak files from biological replicates.
    #    Example: cwltool --outdir "${OUTPUT_DIR}" workflows/eclip_idr_workflow.cwl idr_inputs.yaml
    # For details on these steps and their inputs, refer to the YeoLab/eclip GitHub repository.

Tools Used

Raw Source Text
Raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner
For eCLIP samples, read densities were calculated to identify eCLIP peaks.
eclip data processing pipeline can be requested from the following link : https://github.com/YeoLab/eclip
Supplementary_files_format_and_content: bigwig, bed