GSE205536 Processing Pipeline

OTHER code_examples 13 steps

Publication

Multiplexed transcriptome discovery of RNA-binding protein binding sites by antibody-barcode eCLIP.

Nature methods (2023) — PMID 36550273

Dataset

GSE205536

Multiplexed transcriptome discovery of RNA binding protein binding sites by antibody-barcode eCLIP

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Data is processed in a similar way to standard eCLIP pipeline, except for a few adjustments to ABC’s multiplex design and library structure.

    eCLIP vCWL workflow (state of yeolab/eclip repository before 2021) GitHub
    $ Bash example
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Clone the eCLIP CWL workflow repository
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Placeholder for input files and reference genome
    # Replace with actual paths to your FASTQ files and reference genome/annotations
    # The description mentions "adjustments to ABC’s multiplex design and library structure".
    # These adjustments would typically involve custom demultiplexing, adapter trimming, or UMI extraction
    # steps that might precede or be integrated into the initial stages of the eCLIP workflow.
    # For this standard eCLIP pipeline representation, we assume these initial steps are handled
    # or the input FASTQ files are ready for standard processing.
    
    INPUT_FASTQ_R1="path/to/your/sample_R1.fastq.gz"
    INPUT_FASTQ_R2="path/to/your/sample_R2.fastq.gz"
    
    # Reference genome (using hg38 as a common placeholder)
    GENOME_FASTA="path/to/your/hg38.fa"
    GENOME_GTF="path/to/your/hg38.gtf"
    STAR_INDEX="path/to/your/star_index_hg38" # Pre-built STAR index for hg38
    
    # Create a CWL input YAML file (example based on yeolab/eclip structure)
    # This is a simplified representation; actual inputs might be more complex
    cat << EOF > inputs.yaml
    fastq_r1:
      class: File
      path: ${INPUT_FASTQ_R1}
    fastq_r2:
      class: File
      path: ${INPUT_FASTQ_R2}
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_gtf:
      class: File
      path: ${GENOME_GTF}
    star_index:
      class: Directory
      path: ${STAR_INDEX}
    # Add other necessary parameters as per eclip.cwl, e.g., adapters, output_prefix, etc.
    # For example, if custom adapters are used due to ABC's library structure:
    # adapters:
    #   class: File
    #   path: path/to/your/custom_adapters.fa
    EOF
    
    # Execute the eCLIP CWL workflow
    # This command runs the main eclip.cwl workflow, which orchestrates alignment, deduplication, peak calling, etc.
    cwltool --outdir ./eclip_output eclip.cwl inputs.yaml
  2. 2

    UMI are extracted using umitools 1.0.0 and adaptors are removed using cutadapt 2.8, then fastqs are demultiplexed based on the 5nt barcode sequence using fastx toolkit.

    cutadapt v2.8 GitHub
    $ Bash example
    # Install cutadapt (version 2.8)
    # conda create -n cutadapt_env cutadapt=2.8
    # conda activate cutadapt_env
    
    # Define input and output files
    # Input files are assumed to be the output from the UMI extraction step.
    INPUT_R1="umi_extracted_R1.fastq.gz"
    INPUT_R2="umi_extracted_R2.fastq.gz"
    OUTPUT_R1="cutadapt_trimmed_R1.fastq.gz"
    OUTPUT_R2="cutadapt_trimmed_R2.fastq.gz"
    
    # Define common Illumina adaptors (these might need to be adjusted based on specific library preparation kit)
    ADAPTOR_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina Universal Adaptor
    ADAPTOR_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina Read 2 Adaptor (or same as R1 depending on library)
    
    # Execute cutadapt to remove adaptors
    cutadapt -a "${ADAPTOR_R1}" -A "${ADAPTOR_R2}" \
             -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
             "${INPUT_R1}" "${INPUT_R2}"
  3. 3

    Since ABC is sequenced on the reverse strand, we reverse complement sequences.

    seqtk (Inferred with models/gemini-2.5-flash) v1.3 GitHub
    $ Bash example
    # Install seqtk if not already available
    # conda install -c bioconda seqtk
    
    # Reverse complement sequences from 'input.fastq' and save to 'output.fastq'
    # The '-r' option performs reverse complement.
    seqtk seq -r input.fastq > output.fastq
  4. 4

    Alignment to the repetitive regions removes multi-mapped reads, then aligned to the genomic sequences using STAR 2.7.6.

    $ Bash example
    # Install STAR if not already available
    # conda install -c bioconda star
    
    # Define variables
    READ1="input_R1.fastq.gz"
    READ2="input_R2.fastq.gz" # Assuming paired-end reads, common for eCLIP
    OUTPUT_PREFIX="sample_aligned"
    THREADS=$(nproc) # Use all available CPU cores
    
    # --- Reference Data Paths (PLACEHOLDERS - Update with actual paths) ---
    # Blacklist/Repetitive regions STAR index (e.g., rRNA, tRNA, simple repeats, ENCODE blacklist regions)
    # This index should be built from sequences that are to be filtered out.
    # Example: /path/to/star_index/blacklist_hg38
    BLACKLIST_STAR_INDEX="/path/to/star_index/blacklist_hg38"
    
    # Genomic STAR index (e.g., human hg38)
    # Example: /path/to/star_index/hg38
    GENOME_STAR_INDEX="/path/to/star_index/hg38"
    
    # Temporary files for unmapped reads from the blacklist alignment
    UNMAPPED_R1_TEMP="${OUTPUT_PREFIX}_blacklist_Unmapped.out.mate1"
    UNMAPPED_R2_TEMP="${OUTPUT_PREFIX}_blacklist_Unmapped.out.mate2"
    
    echo "Step 1: Aligning reads to repetitive regions (blacklist) to identify and filter out non-specific reads."
    # Align to blacklist genome. Reads that map here are considered repetitive/non-specific.
    # --outSAMtype None: Do not output BAM/SAM for this step, only interested in unmapped reads.
    # --outReadsUnmapped Fastx: Output unmapped reads in FASTQ format.
    # --outFilterMultimapNmax 100: Allow multi-mapping to blacklist, as we are primarily interested in reads that *don't* map anywhere here.
    STAR --runThreadN ${THREADS} \
         --genomeDir ${BLACKLIST_STAR_INDEX} \
         --readFilesIn ${READ1} ${READ2} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX}_blacklist_ \
         --outSAMtype None \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 100
    
    echo "Step 2: Aligning unmapped reads from blacklist alignment to the genomic sequences."
    # Align the reads that did not map to the blacklist to the main genome.
    # --outFilterMultimapNmax 1: Ensures only uniquely mapped reads are reported for the genomic alignment,
    #                             effectively "removing multi-mapped reads" from the final genomic BAM.
    # Standard STAR parameters for RNA-seq alignment are included.
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_STAR_INDEX} \
         --readFilesIn ${UNMAPPED_R1_TEMP} ${UNMAPPED_R2_TEMP} \
         --readFilesCommand cat \
         --outFileNamePrefix ${OUTPUT_PREFIX}_genome_ \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterType BySJout \
         --outFilterScoreMinOverLread 0.3 \
         --outFilterMatchNminOverLread 0.3 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 60000000000 # Adjust based on available RAM (e.g., 60GB)
    
    # Clean up temporary unmapped files
    rm ${UNMAPPED_R1_TEMP} ${UNMAPPED_R2_TEMP}
    
    echo "Alignment complete. Final BAM: ${OUTPUT_PREFIX}_genome_Aligned.sortedByCoordinate.out.bam"
    
  5. 5

    The pipeline is available at https://github.com/algaebrown/oligoCLIP.git

    oligoCLIP vlatest GitHub
    $ Bash example
    # Clone the oligoCLIP pipeline repository
    git clone https://github.com/algaebrown/oligoCLIP.git
    cd oligoCLIP
    
    # The description does not provide specific commands or parameters to run the pipeline.
    # Please refer to the repository's documentation (e.g., README.md) for usage instructions.
    # Example placeholder:
    # python run_oligoCLIP.py --input <input_file> --output <output_directory>
  6. 6

    Given the genome alignment, we subsequently follow the steps to do peak call using CLIPper, then rank the peaks comparing to “backgrounds”, either SMInput (eCLIP), RNA-seq (ABC), or to other multiplex libraries (internal normalization).

    CLIPper vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # pip install clipper
    
    # Download a genome size file (e.g., for hg38)
    # wget -O hg38.chrom.sizes http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    
    # Execute CLIPper for peak calling
    # Replace 'aligned_reads.bam' with your IP sample's aligned BAM file.
    # Replace 'background_reads.bam' with your control (SMInput, RNA-seq, or other multiplex library) aligned BAM file.
    # Replace 'hg38.chrom.sizes' with the appropriate genome size file for your reference genome.
    # Replace 'clipper_output' with your desired output prefix.
    python clipper.py -b aligned_reads.bam -c background_reads.bam -s hg38.chrom.sizes -o clipper_output
  7. 7

    CLIPper uses nearby regions (+/- 500 b.p.) to estimate the background, followed by a Poisson distribution to assess the significance of a peak, whereas “normalization” takes into account of another background library.

    CLIPper v1.0.0 GitHub
    $ Bash example
    # Install CLIPper (example using pip, or check repo for specific instructions)
    # A common way to install is via pip, or by cloning the repository and running setup.py.
    # pip install clipper
    # Alternatively, for the yeolab/clipper version, you might clone the repo and run directly:
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install
    
    # Example command for CLIPper peak calling
    # 'eclip_ip.bam' represents the treatment (IP) library.
    # 'eclip_input.bam' represents the control (background) library, used for normalization.
    # The '-w 500' parameter specifies the window size (+/- 500 b.p.) for background estimation.
    # 'hg38' is used as a placeholder for the genome assembly, which CLIPper uses for chromosome information.
    clipper.py -t eclip_ip.bam -c eclip_input.bam -o clipper_peaks -s hg38 -w 500
  8. 8

    In the eCLIP protocol, the SMInput library is prepared the same way as the IP library, which estimates the background crosslinking rate of other RBPs with similar size ranges as well as the RNA expression level and all other bias in library preparation.

    eCLIP vN/A (Protocol description) GitHub
    $ Bash example
    # This step describes the wet-lab preparation of the SMInput library, which is part of the eCLIP protocol.
    # The SMInput library is prepared to estimate background crosslinking rates and RNA expression levels for normalization.
    # There is no direct computational command for the physical preparation of the library.
    # The following command represents a typical first computational step: alignment of sequencing reads generated from the prepared SMInput library.
    
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for input and output
    # Replace 'sm_input.fastq.gz' with the actual path to your SMInput sequencing reads.
    # Replace '/path/to/STAR_genome_index/hg38' with the actual path to your STAR genome index.
    SM_INPUT_FASTQ="sm_input.fastq.gz"
    GENOME_DIR="/path/to/STAR_genome_index/hg38" # Example for human hg38
    OUTPUT_PREFIX="sm_input_aligned"
    
    # Example command to generate STAR genome index (run once per genome)
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/hg38.fa \
    #      --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \
    #      --runThreadN 8 # Adjust thread count as needed
    
    # Align SMInput reads using STAR
    STAR --genomeDir ${GENOME_DIR} \
         --readFilesIn ${SM_INPUT_FASTQ} \
         --runThreadN 8 \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJoverhangMin 8 \
         --alignSJDBoverhangMin 1 \
         --sjdbScore 1 \
         --readFilesCommand zcat
  9. 9

    Since ABC does not have a SMInput, we realize that a total RNA-seq library can capture part of the bias.

    $ Bash example
    # Install Salmon (example using conda)
    # conda install -c bioconda salmon=1.10.2
    
    # Define variables
    # Using human GRCh38 as a common reference assembly placeholder.
    # Transcriptome FASTA and GTF files are required for building the Salmon index.
    GENOME_VERSION="GRCh38"
    TRANSCRIPTOME_FASTA="Homo_sapiens.${GENOME_VERSION}.cdna.all.fa.gz" # Placeholder for transcriptome FASTA
    GTF_FILE="Homo_sapiens.${GENOME_VERSION}.109.gtf.gz" # Placeholder for GTF file (optional for index, but good for context)
    SALMON_INDEX_DIR="salmon_index_${GENOME_VERSION}"
    READ1="sample_R1.fastq.gz" # Placeholder for input R1 FASTQ
    READ2="sample_R2.fastq.gz" # Placeholder for input R2 FASTQ
    OUTPUT_DIR="salmon_quant_output"
    NUM_THREADS=8 # Example number of threads
    
    # Download reference files (if not already present)
    # Example for Ensembl release 109:
    # wget -P . "ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/${TRANSCRIPTOME_FASTA}"
    # wget -P . "ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/${GTF_FILE}"
    
    # Build Salmon index (if not already built)
    # This step requires the transcriptome FASTA file.
    # salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --type quasi -k 31
    
    # Run Salmon quantification
    # -i: index directory
    # -l A: automatically detect library type
    # -1: first read file
    # -2: second read file
    # -o: output directory
    # --gcBias: correct for GC bias (relevant to "capture part of the bias" in description)
    # --seqBias: correct for sequence-specific bias
    # -p: number of threads
    salmon quant -i "${SALMON_INDEX_DIR}" \
                 -l A \
                 -1 "${READ1}" \
                 -2 "${READ2}" \
                 -o "${OUTPUT_DIR}" \
                 --gcBias \
                 --seqBias \
                 -p "${NUM_THREADS}"
  10. 10

    Furthermore, internal normalization to other multiplexed library can estimate not only the expression level, but also other biases introduced within the protocol.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub
    $ Bash example
    # Install R and DESeq2 if not already installed
    # conda create -n deseq2_env r-base bioconductor-deseq2
    # conda activate deseq2_env
    
    # --- Placeholder for input files ---
    # counts.tsv: A tab-separated file with gene/transcript IDs as rows and sample names as columns, containing raw counts.
    #             Example:
    #             gene_id    sample1    sample2    sample3
    #             geneA      100        150        120
    #             geneB      50         70         60
    #             ...
    #
    # sample_info.tsv: A tab-separated file with sample names as rows and experimental design variables as columns.
    #                  The row names MUST match the column names of counts.tsv.
    #                  Example:
    #                  sample_id    condition    batch
    #                  sample1      treated      batch1
    #                  sample2      control      batch1
    #                  sample3      treated      batch2
    #                  ...
    
    # Example R script for DESeq2 normalization
    cat << 'EOF' > normalize_deseq2.R
    library(DESeq2)
    
    # Load count data
    # Ensure the first column is used as row names (gene IDs)
    count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t", check.names=FALSE)
    
    # Load sample information
    # Ensure the first column is used as row names (sample IDs)
    sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t", check.names=FALSE)
    
    # Ensure sample order and names match between count data and sample info
    # This step is crucial for DESeq2
    if (!all(colnames(count_data) %in% rownames(sample_info))) {
      stop("Sample names in count_data columns do not match sample names in sample_info rows.")
    }
    sample_info <- sample_info[colnames(count_data), , drop=FALSE]
    
    # Create DESeqDataSet object
    # The 'design' formula specifies the experimental design.
    # Replace 'condition' with the actual column name from your sample_info.tsv that represents your primary variable of interest.
    # For simple normalization without differential expression, a design like ~1 or ~condition is often used.
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = ~ condition) # !!! IMPORTANT: Adjust 'condition' to your actual column name from sample_info.tsv
    
    # Estimate size factors for normalization (median-of-ratios method)
    dds <- estimateSizeFactors(dds)
    
    # Get normalized counts
    # These are the raw counts divided by the size factors.
    normalized_counts <- counts(dds, normalized=TRUE)
    
    # Save normalized counts to a file
    write.table(normalized_counts, file="normalized_counts.tsv", sep="\t", quote=FALSE, col.names=NA)
    
    message("Normalization complete. Normalized counts saved to normalized_counts.tsv")
    
    # Optional: For downstream analyses like PCA or clustering, it's often better to use
    # variance-stabilized data (rlog or vst transformation) rather than raw normalized counts.
    # These transformations account for the mean-variance relationship in count data.
    
    # # Rlog transformation (slower for large datasets, good for small-medium)
    # message("Performing rlog transformation...")
    # rlog_data <- rlog(dds, blind=FALSE) # blind=FALSE if you have a design formula
    # write.table(assay(rlog_data), file="rlog_transformed_counts.tsv", sep="\t", quote=FALSE, col.names=NA)
    # message("rlog transformed counts saved to rlog_transformed_counts.tsv")
    
    # # VST transformation (faster for large datasets)
    # message("Performing vst transformation...")
    # vst_data <- vst(dds, blind=FALSE) # blind=FALSE if you have a design formula
    # write.table(assay(vst_data), file="vst_transformed_counts.tsv", sep="\t", quote=FALSE, col.names=NA)
    # message("vst transformed counts saved to vst_transformed_counts.tsv")
    EOF
    
    # Execute the R script
    Rscript normalize_deseq2.R
    
    # Reference genome/annotation for upstream count generation (e.g., GRCh38/hg38)
    # Source: Ensembl, GENCODE, UCSC
  11. 11

    Ideally, as the number of RBP approaches infinity, the summation of “other RBPs in the multiplexed library” should approach the SMI-input.

    (Inferred with models/gemini-2.5-flash) vN/A
  12. 12

    To estimate the significance of the peak based on backgrounds, we used a chi-square test (or Fisher Exact test, if the number of reads < 5) to compare the number of reads in the IP library and the background library within and outside of the peak region.

    clipper (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
        # Install clipper (example, actual installation might vary)
        # pip install clipper
    
        # Define input files and genome assembly
        IP_BAM="ip_library.bam" # Placeholder for IP library BAM file
        BACKGROUND_BAM="background_library.bam" # Placeholder for background library BAM file
        GENOME_ASSEMBLY="hg38" # Placeholder for genome assembly (e.g., hg38, mm10)
        OUTPUT_DIR="clipper_peaks_output"
    
        # Create output directory if it doesn't exist
        mkdir -p "${OUTPUT_DIR}"
    
        # Run clipper to identify peaks and estimate significance using chi-square or Fisher's exact test.
        # The tool internally compares reads in IP vs background within and outside peak regions.
        clipper \
            --species "${GENOME_ASSEMBLY}" \
            --ip-bam "${IP_BAM}" \
            --control-bam "${BACKGROUND_BAM}" \
            --output-dir "${OUTPUT_DIR}" \
            --threshold 0.05 # Example p-value threshold for peak significance
    
  13. 13

    The pipeline is available at https://github.com/algaebrown/oligoCLIP.git

    oligoCLIP (Inferred with models/gemini-2.5-flash) vunspecified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Clone the oligoCLIP repository
    git clone https://github.com/algaebrown/oligoCLIP.git
    
    # Navigate into the cloned directory
    cd oligoCLIP
    
    # Refer to the project's documentation (e.g., README.md) for installation instructions, dependencies, and how to run the pipeline.

Tools Used

Raw Source Text
Data is processed in a similar way to standard eCLIP pipeline, except for a few adjustments to ABC’s multiplex design and library structure. UMI are extracted using umitools 1.0.0 and adaptors are removed using cutadapt 2.8, then fastqs are demultiplexed based on the 5nt barcode sequence using fastx toolkit. Since ABC is sequenced on the reverse strand, we reverse complement sequences. Alignment to the repetitive regions removes multi-mapped reads, then aligned to the genomic sequences using STAR 2.7.6. The pipeline is available at https://github.com/algaebrown/oligoCLIP.git
Given the genome alignment, we subsequently follow the steps to do peak call using CLIPper, then rank the peaks comparing to “backgrounds”, either SMInput (eCLIP), RNA-seq (ABC), or to other multiplex libraries (internal normalization). CLIPper uses nearby regions (+/- 500 b.p.) to estimate the background, followed by a Poisson distribution to assess the significance of a peak, whereas “normalization” takes into account of another background library. In the eCLIP protocol, the SMInput library is prepared the same way as the IP library, which estimates the background crosslinking rate of other RBPs with similar size ranges as well as the RNA expression level and all other bias in library preparation. Since ABC does not have a SMInput, we realize that a total RNA-seq library can capture part of the bias. Furthermore, internal normalization to other multiplexed library can estimate not only the expression level, but also other biases introduced within the protocol. Ideally, as the number of RBP approaches infinity, the summation of “other RBPs in the multiplexed library” should approach the SMI-input. To estimate the significance of the peak based on backgrounds, we used a chi-square test (or Fisher Exact test, if the number of reads < 5) to compare the number of reads in the IP library and the background library within and outside of the peak region. The pipeline is available at https://github.com/algaebrown/oligoCLIP.git
Assembly: GRCh38
Supplementary files format and content: bed (unnormalized CLIPper peak)
Supplementary files format and content: bed (normalized to RNA-seq)
Supplementary files format and content: bed (normalized to other multiplex library, repeat element removed)
← Back to Analysis