GSE117290 Processing Pipeline

OTHER code_examples 7 steps

Publication

Large-scale tethered function assays identify factors that regulate mRNA stability and translation.

Nature structural & molecular biology (2020) — PMID 32807991

Dataset

GSE117290

Large-scale tethered function assays identify factors that regulate mRNA stability and translation (eCLIP)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    library strategy: CLIP-seq

    $ Bash example
    # Install STAR (splice-aware aligner)
    # conda install -c bioconda star
    
    # Install CLIPper (peak caller)
    # pip install clipper
    
    # --- Placeholder for reference genome and STAR index ---
    # Replace with actual paths to your genome FASTA and STAR index.
    # For hg38, you would typically download the genome FASTA and build a STAR index.
    # Example for hg38:
    # wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip hg38.fa.gz
    # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_hg38_index --genomeFastaFiles hg38.fa --runThreadN 8
    
    STAR_GENOME_DIR="/path/to/STAR_hg38_index" # Path to your pre-built STAR genome index for hg38
    GENOME_FASTA="/path/to/hg38.fa" # Path to the hg38 genome FASTA file
    
    # --- Placeholder for input FASTQ files ---
    # Replace with your actual CLIP-seq FASTQ file(s)
    CLIP_SEQ_FASTQ="clip_seq_sample.fastq.gz"
    
    # 1. Align CLIP-seq reads using STAR
    # This step aligns the raw sequencing reads to the reference genome (hg38).
    STAR --genomeDir "${STAR_GENOME_DIR}" \
         --readFilesIn "${CLIP_SEQ_FASTQ}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "clip_seq_aligned_" \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8 # Adjust number of threads as needed
    
    ALIGNED_BAM="clip_seq_aligned_Aligned.sortedByCoord.out.bam"
    
    # 2. Perform CLIP-seq Peak Calling using CLIPper
    # This identifies significant binding sites (peaks) from the aligned reads.
    # The -s parameter specifies the species (e.g., hg38).
    # The --species-genome parameter provides the path to the genome FASTA, which CLIPper uses for background modeling.
    clipper -b "${ALIGNED_BAM}" \
            -s hg38 \
            -o "clip_seq_peaks.bed" \
            --species-genome "${GENOME_FASTA}"
  2. 2

    Reads were processed essentially as described (Van Nostrand et al., 2016b).

    STAR (Inferred with models/gemini-2.5-flash) v2.5.2a GitHub
    $ Bash example
    # Install cutadapt
    # conda install -c bioconda cutadapt=1.9.1
    
    # Install STAR
    # conda install -c bioconda star=2.5.2a
    
    # Define input/output files and reference genome paths
    # Replace with actual input FASTQ file paths
    READ1="input_sample_R1.fastq.gz"
    READ2="input_sample_R2.fastq.gz" # If paired-end, otherwise set to empty string or remove
    OUTPUT_DIR="./processed_reads"
    
    # Placeholder for human genome hg38 STAR index. Generate with STAR --runMode genomeGenerate.
    # Example: GENOME_DIR="/path/to/STAR_genome_index/hg38"
    # Download reference genome (e.g., from UCSC or Ensembl) and GTF annotation file.
    # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_genome_index/hg38 \
    #      --genomeFastaFiles /path/to/hg38.fa \
    #      --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \
    #      --runThreadN 16
    GENOME_DIR="/path/to/STAR_genome_index/hg38"
    
    # 3' adapter sequence for eCLIP, as specified in Van Nostrand et al., 2016b and eCLIP CWL workflow
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    MIN_READ_LENGTH=15 # Minimum read length after trimming, as specified in Van Nostrand et al., 2016b
    
    mkdir -p "$OUTPUT_DIR"
    
    # Step 1: Adapter trimming with Cutadapt
    # Using parameters consistent with Van Nostrand et al., 2016b and eCLIP CWL workflow
    # -a: 3' adapter sequence for forward reads
    # -A: 3' adapter sequence for reverse reads (if paired-end)
    # -m: minimum read length after trimming
    # -o: output file for R1
    # -p: output file for R2 (if paired-end)
    # --cores: number of CPU cores to use
    
    TRIMMED_READ1="${OUTPUT_DIR}/trimmed_R1.fastq.gz"
    TRIMMED_READ2="${OUTPUT_DIR}/trimmed_R2.fastq.gz"
    
    echo "Starting adapter trimming with Cutadapt..."
    if [ -n "$READ2" ]; then
      cutadapt -a "$ADAPTER_3PRIME" -A "$ADAPTER_3PRIME" -m "$MIN_READ_LENGTH" \
               -o "$TRIMMED_READ1" -p "$TRIMMED_READ2" \
               --cores=8 "$READ1" "$READ2" > "${OUTPUT_DIR}/cutadapt_report.txt"
    else
      cutadapt -a "$ADAPTER_3PRIME" -m "$MIN_READ_LENGTH" \
               -o "$TRIMMED_READ1" \
               --cores=8 "$READ1" > "${OUTPUT_DIR}/cutadapt_report.txt"
    fi
    echo "Cutadapt trimming complete. Report saved to ${OUTPUT_DIR}/cutadapt_report.txt"
    
    # Step 2: Alignment to reference genome with STAR
    # Using parameters consistent with Van Nostrand et al., 2016b and eCLIP CWL workflow
    # --genomeDir: path to the STAR genome index
    # --readFilesIn: input FASTQ files (trimmed reads)
    # --runThreadN: number of CPU threads to use
    # --outFileNamePrefix: prefix for output files
    # --outSAMtype: output SAM/BAM format (BAM SortedByCoordinate is common for downstream tools)
    # --outFilterMultimapNmax: maximum number of loci an alignment can map to
    # --outFilterMismatchNmax: maximum number of mismatches per read (allows many, filtered by score)
    # --outFilterScoreMinOverLread: minimum alignment score (as a fraction of read length)
    # --outFilterMatchNminOverLread: minimum number of matched bases (as a fraction of read length)
    # --alignIntronMax: maximum intron length (for RNA-seq, eCLIP is RNA-based)
    # --alignIntronMin: minimum intron length
    # --limitBAMsortRAM: limit RAM for BAM sorting (e.g., 30GB)
    
    STAR_OUTPUT_PREFIX="${OUTPUT_DIR}/STAR_aligned_"
    
    echo "Starting alignment with STAR..."
    if [ -n "$READ2" ]; then
      STAR --genomeDir "$GENOME_DIR" \
           --readFilesIn "$TRIMMED_READ1" "$TRIMMED_READ2" \
           --runThreadN 8 \
           --outFileNamePrefix "$STAR_OUTPUT_PREFIX" \
           --outSAMtype BAM SortedByCoordinate \
           --outFilterMultimapNmax 20 \
           --outFilterMismatchNmax 999 \
           --outFilterScoreMinOverLread 0.33 \
           --outFilterMatchNminOverLread 0.33 \
           --alignIntronMax 1000000 \
           --alignIntronMin 20 \
           --limitBAMsortRAM 30000000000 # 30GB RAM
    else
      STAR --genomeDir "$GENOME_DIR" \
           --readFilesIn "$TRIMMED_READ1" \
           --runThreadN 8 \
           --outFileNamePrefix "$STAR_OUTPUT_PREFIX" \
           --outSAMtype BAM SortedByCoordinate \
           --outFilterMultimapNmax 20 \
           --outFilterMismatchNmax 999 \
           --outFilterScoreMinOverLread 0.33 \
           --outFilterMatchNminOverLread 0.33 \
           --alignIntronMax 1000000 \
           --alignIntronMin 20 \
           --limitBAMsortRAM 30000000000 # 30GB RAM
    fi
    echo "STAR alignment complete."
    
    # Rename output BAM file for clarity
    mv "${STAR_OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${OUTPUT_DIR}/aligned_reads.bam"
    
    # Optional: Index the BAM file for faster access by downstream tools
    # Install samtools
    # conda install -c bioconda samtools
    # samtools index "${OUTPUT_DIR}/aligned_reads.bam"
    
    # Optional: Remove intermediate trimmed FASTQ files to save space
    # rm "$TRIMMED_READ1" "$TRIMMED_READ2"
  3. 3

    Brifely, reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR.

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # --- Reference Data Preparation (RepBase 18.04 Human Repetitive Elements) ---
    # Note: RepBase is a proprietary database. Access requires a license.
    # This example assumes you have obtained the human-specific repetitive elements
    # from RepBase version 18.04 in FASTA format.
    
    # Placeholder for RepBase FASTA file (replace with actual path)
    REPBASE_FASTA="repbase_human_18.04.fasta"
    REPBASE_STAR_INDEX_DIR="repbase_star_index"
    
    # Create STAR genome index for RepBase elements
    # Adjust --genomeSAindexNbases if the total length of repetitive elements is significantly smaller than a full genome.
    # A value of 10-12 might be more appropriate for a collection of shorter sequences like repeats.
    STAR --runMode genomeGenerate \
         --genomeDir ${REPBASE_STAR_INDEX_DIR} \
         --genomeFastaFiles ${REPBASE_FASTA} \
         --genomeSAindexNbases 10 \
         --runThreadN 8
    
    # --- Read Alignment ---
    # Input adapter-trimmed reads (replace with actual path to your FASTQ file)
    TRIMMED_READS="trimmed_reads.fastq.gz"
    OUTPUT_PREFIX="star_repbase_alignment/"
    
    # Create output directory if it doesn't exist
    mkdir -p $(dirname ${OUTPUT_PREFIX})
    
    # Align adapter-trimmed reads to the RepBase STAR index
    # Parameters for repetitive element mapping:
    # --outFilterMultimapNmax 100: Allow reads to map to up to 100 loci (common for repeats).
    # --outSAMmultNmax 10: Report up to 10 alignments per read (adjust as needed, -1 for all).
    # --alignIntronMax 1: Effectively disable splicing for repetitive elements (set to 0 or 1).
    # --readFilesCommand zcat: For gzipped input FASTQ files.
    STAR --runMode alignReads \
         --genomeDir ${REPBASE_STAR_INDEX_DIR} \
         --readFilesIn ${TRIMMED_READS} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 10 \
         --outFilterMultimapNmax 100 \
         --outSAMmultNmax 10 \
         --alignIntronMax 1 \
         --runThreadN 8
    
  4. 4

    Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR.

    $ Bash example
    # Install STAR (e.g., using conda):
    # conda install -c bioconda star=2.7.10a
    
    # Define variables
    # Replace with actual paths to your STAR genome index and input FASTQ file(s)
    GENOME_DIR="/path/to/STAR_index/hg19"
    INPUT_FASTQ="input_reads.fastq.gz" # For paired-end, use "read1.fastq.gz read2.fastq.gz"
    OUTPUT_PREFIX="aligned_reads"
    NUM_THREADS=8 # Adjust based on available CPU cores
    
    # Create STAR genome index if not already present (example command, run once per genome/annotation)
    # This step requires the genome FASTA and optionally a GTF/GFF annotation file.
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/hg19.gtf \
    #      --sjdbOverhang 100 \
    #      --runThreadN ${NUM_THREADS}
    
    # Align reads with STAR
    # Parameters are adapted from the Yeo lab's eCLIP Snakemake workflow (Skipper)
    STAR \
      --genomeDir ${GENOME_DIR} \
      --readFilesIn ${INPUT_FASTQ} \
      --readFilesCommand zcat \
      --outFileNamePrefix ${OUTPUT_PREFIX}. \
      --runThreadN ${NUM_THREADS} \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes NH HI AS NM MD \
      --outSAMstrandField intronMotif \
      --outFilterMultimapNmax 20 \
      --alignSJDBoverhangMin 8 \
      --alignSJoverhangMin 8 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.05 \
      --outFilterType BySJout \
      --outFilterScoreMinOverLread 0.33 \
      --outFilterMatchNminOverLread 0.33 \
      --limitSjdbInsertNsj 2000000
  5. 5

    PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as ‘usable reads’.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install umi_tools
    # conda install -c bioconda umi_tools
    
    # PCR duplicate reads were removed using UMI sequences. This typically involves a prior step
    # (umi_tools extract) to move UMIs from the read sequence/adapter to the read ID, followed by alignment.
    # This command performs the deduplication on the aligned reads.
    # The --extract-method=read_id parameter indicates that UMIs are expected in the read IDs,
    # which is a common output format from umi_tools extract.
    # The --umi-separator='_' parameter specifies the separator used to delimit the UMI from the original read ID.
    # If reads are paired-end, the --paired flag would be added.
    
    umi_tools dedup \
        --input aligned_reads.bam \
        --output deduplicated_reads.bam \
        --extract-method=read_id \
        --umi-separator='_'
  6. 6

    Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)

    CLIPper v0.0.1 GitHub
    $ Bash example
    # --- Installation (commented out) ---
    # Install CLIPper via Docker
    # docker pull yeolab/clipper:0.0.1
    # Install yeolab/eclip for peak assignment script
    # docker pull yeolab/eclip:0.0.1
    
    # --- Reference Data Setup (commented out) ---
    # Download Gencode v19 annotation (GRCh37)
    # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    
    # Create a genome size file for hg19 (GRCh37)
    # This file lists chromosome names and their lengths. It can be generated from a genome FASTA index.
    # Example for hg19 (GRCh37) - replace with actual path to your genome fasta and use samtools:
    # samtools faidx <path_to_hg19.fa>
    # cut -f1,2 <path_to_hg19.fa>.fai > hg19.chrom.sizes
    
    # --- Placeholder Input Data ---
    # Assume 'usable_reads.bam' is the input BAM file containing aligned reads.
    # Assume 'hg19.chrom.sizes' is the genome size file for the reference genome.
    # Assume 'gencode.v19.annotation.gtf' is the Gencode annotation file for v19.
    
    # --- Peak Calling with CLIPper ---
    # Execute CLIPper using its Docker image. The output will be a BED file (e.g., clipper_peaks.bed)
    # and a log file (clipper_peaks.log).
    docker run --rm -v "$(pwd)":/data yeolab/clipper:0.0.1 \
        python /usr/local/bin/clipper.py \
        -s /data/hg19.chrom.sizes \
        -o /data/clipper_peaks \
        /data/usable_reads.bam
    
    # --- Assign Peaks to Gene Regions ---
    # This step uses the 'assign_peaks_to_genes.py' script, which is part of the yeolab/eclip pipeline.
    # It takes the called peaks and the Gencode GTF to assign peaks to annotated gene regions.
    docker run --rm -v "$(pwd)":/data yeolab/eclip:0.0.1 \
        python /usr/local/bin/assign_peaks_to_genes.py \
        --peaks /data/clipper_peaks.bed \
        --gtf /data/gencode.v19.annotation.gtf \
        --output /data/clipper_peaks_assigned.bed
  7. 7

    Each peak was normalized to the size-matched input (SMInput)

    calculate_fold_change_and_pvalue.py (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This script is a custom Python script used within the Yeo Lab eCLIP pipelines (e.g., skipper).
    # It calculates fold change and p-value for peaks relative to a size-matched input.
    
    # Installation of dependencies (assuming Python 3 is installed):
    # conda create -n eclip_norm_env python=3.8
    # conda activate eclip_norm_env
    # pip install pysam pybedtools numpy scipy
    # conda install -c bioconda bedtools
    
    # Download the script (if not part of a larger pipeline checkout):
    # wget https://raw.githubusercontent.com/yeolab/skipper/master/workflow/scripts/calculate_fold_change_and_pvalue.py
    
    # Define input/output files and reference genome sizes
    # Replace with actual paths to your files
    IP_BAM="path/to/your/ip_replicate1.bam" # Aligned reads for the immunoprecipitation sample
    SM_INPUT_BAM="path/to/your/sm_input_replicate1.bam" # Aligned reads for the size-matched input sample
    PEAKS_BED="path/to/your/called_peaks.bed" # BED file containing the called peaks (e.g., from clipper)
    OUTPUT_BED="path/to/your/normalized_peaks.bed" # Output BED file with added fold change and p-value scores
    GENOME_SIZES="path/to/your/hg38.chrom.sizes" # Chromosome sizes file for the reference genome (e.g., hg38)
    
    # Example for obtaining hg38.chrom.sizes:
    # wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes
    
    python calculate_fold_change_and_pvalue.py \
        --ip_bam "${IP_BAM}" \
        --input_bam "${SM_INPUT_BAM}" \
        --peaks_bed "${PEAKS_BED}" \
        --output_bed "${OUTPUT_BED}" \
        --genome_sizes "${GENOME_SIZES}"

Tools Used

Raw Source Text
library strategy: CLIP-seq
Reads were processed essentially as described (Van Nostrand et al., 2016b).
Brifely, reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR. Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as ‘usable reads’.
Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)
Each peak was normalized to the size-matched input (SMInput)
Genome_build: Homo sapiens UCSC hg19
← Back to Analysis