GSE205536 Processing Pipeline
Publication
Multiplexed transcriptome discovery of RNA-binding protein binding sites by antibody-barcode eCLIP.Nature methods (2023) — PMID 36550273
Dataset
GSE205536Multiplexed transcriptome discovery of RNA binding protein binding sites by antibody-barcode eCLIP
Processing Steps
Generate Jupyter Notebook-
1
Data is processed in a similar way to standard eCLIP pipeline, except for a few adjustments to ABCâs multiplex design and library structure.
$ Bash example
# Install cwltool if not already installed # pip install cwltool # Clone the eCLIP CWL workflow repository # git clone https://github.com/yeolab/eclip.git # cd eclip # Placeholder for input files and reference genome # Replace with actual paths to your FASTQ files and reference genome/annotations # The description mentions "adjustments to ABC’s multiplex design and library structure". # These adjustments would typically involve custom demultiplexing, adapter trimming, or UMI extraction # steps that might precede or be integrated into the initial stages of the eCLIP workflow. # For this standard eCLIP pipeline representation, we assume these initial steps are handled # or the input FASTQ files are ready for standard processing. INPUT_FASTQ_R1="path/to/your/sample_R1.fastq.gz" INPUT_FASTQ_R2="path/to/your/sample_R2.fastq.gz" # Reference genome (using hg38 as a common placeholder) GENOME_FASTA="path/to/your/hg38.fa" GENOME_GTF="path/to/your/hg38.gtf" STAR_INDEX="path/to/your/star_index_hg38" # Pre-built STAR index for hg38 # Create a CWL input YAML file (example based on yeolab/eclip structure) # This is a simplified representation; actual inputs might be more complex cat << EOF > inputs.yaml fastq_r1: class: File path: ${INPUT_FASTQ_R1} fastq_r2: class: File path: ${INPUT_FASTQ_R2} genome_fasta: class: File path: ${GENOME_FASTA} genome_gtf: class: File path: ${GENOME_GTF} star_index: class: Directory path: ${STAR_INDEX} # Add other necessary parameters as per eclip.cwl, e.g., adapters, output_prefix, etc. # For example, if custom adapters are used due to ABC's library structure: # adapters: # class: File # path: path/to/your/custom_adapters.fa EOF # Execute the eCLIP CWL workflow # This command runs the main eclip.cwl workflow, which orchestrates alignment, deduplication, peak calling, etc. cwltool --outdir ./eclip_output eclip.cwl inputs.yaml -
2
UMI are extracted using umitools 1.0.0 and adaptors are removed using cutadapt 2.8, then fastqs are demultiplexed based on the 5nt barcode sequence using fastx toolkit.
$ Bash example
# Install cutadapt (version 2.8) # conda create -n cutadapt_env cutadapt=2.8 # conda activate cutadapt_env # Define input and output files # Input files are assumed to be the output from the UMI extraction step. INPUT_R1="umi_extracted_R1.fastq.gz" INPUT_R2="umi_extracted_R2.fastq.gz" OUTPUT_R1="cutadapt_trimmed_R1.fastq.gz" OUTPUT_R2="cutadapt_trimmed_R2.fastq.gz" # Define common Illumina adaptors (these might need to be adjusted based on specific library preparation kit) ADAPTOR_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina Universal Adaptor ADAPTOR_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina Read 2 Adaptor (or same as R1 depending on library) # Execute cutadapt to remove adaptors cutadapt -a "${ADAPTOR_R1}" -A "${ADAPTOR_R2}" \ -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \ "${INPUT_R1}" "${INPUT_R2}" -
3
Since ABC is sequenced on the reverse strand, we reverse complement sequences.
$ Bash example
# Install seqtk if not already available # conda install -c bioconda seqtk # Reverse complement sequences from 'input.fastq' and save to 'output.fastq' # The '-r' option performs reverse complement. seqtk seq -r input.fastq > output.fastq
-
4
Alignment to the repetitive regions removes multi-mapped reads, then aligned to the genomic sequences using STAR 2.7.6.
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define variables READ1="input_R1.fastq.gz" READ2="input_R2.fastq.gz" # Assuming paired-end reads, common for eCLIP OUTPUT_PREFIX="sample_aligned" THREADS=$(nproc) # Use all available CPU cores # --- Reference Data Paths (PLACEHOLDERS - Update with actual paths) --- # Blacklist/Repetitive regions STAR index (e.g., rRNA, tRNA, simple repeats, ENCODE blacklist regions) # This index should be built from sequences that are to be filtered out. # Example: /path/to/star_index/blacklist_hg38 BLACKLIST_STAR_INDEX="/path/to/star_index/blacklist_hg38" # Genomic STAR index (e.g., human hg38) # Example: /path/to/star_index/hg38 GENOME_STAR_INDEX="/path/to/star_index/hg38" # Temporary files for unmapped reads from the blacklist alignment UNMAPPED_R1_TEMP="${OUTPUT_PREFIX}_blacklist_Unmapped.out.mate1" UNMAPPED_R2_TEMP="${OUTPUT_PREFIX}_blacklist_Unmapped.out.mate2" echo "Step 1: Aligning reads to repetitive regions (blacklist) to identify and filter out non-specific reads." # Align to blacklist genome. Reads that map here are considered repetitive/non-specific. # --outSAMtype None: Do not output BAM/SAM for this step, only interested in unmapped reads. # --outReadsUnmapped Fastx: Output unmapped reads in FASTQ format. # --outFilterMultimapNmax 100: Allow multi-mapping to blacklist, as we are primarily interested in reads that *don't* map anywhere here. STAR --runThreadN ${THREADS} \ --genomeDir ${BLACKLIST_STAR_INDEX} \ --readFilesIn ${READ1} ${READ2} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX}_blacklist_ \ --outSAMtype None \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 100 echo "Step 2: Aligning unmapped reads from blacklist alignment to the genomic sequences." # Align the reads that did not map to the blacklist to the main genome. # --outFilterMultimapNmax 1: Ensures only uniquely mapped reads are reported for the genomic alignment, # effectively "removing multi-mapped reads" from the final genomic BAM. # Standard STAR parameters for RNA-seq alignment are included. STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_STAR_INDEX} \ --readFilesIn ${UNMAPPED_R1_TEMP} ${UNMAPPED_R2_TEMP} \ --readFilesCommand cat \ --outFileNamePrefix ${OUTPUT_PREFIX}_genome_ \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterType BySJout \ --outFilterScoreMinOverLread 0.3 \ --outFilterMatchNminOverLread 0.3 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 60000000000 # Adjust based on available RAM (e.g., 60GB) # Clean up temporary unmapped files rm ${UNMAPPED_R1_TEMP} ${UNMAPPED_R2_TEMP} echo "Alignment complete. Final BAM: ${OUTPUT_PREFIX}_genome_Aligned.sortedByCoordinate.out.bam" -
5
The pipeline is available at https://github.com/algaebrown/oligoCLIP.git
$ Bash example
# Clone the oligoCLIP pipeline repository git clone https://github.com/algaebrown/oligoCLIP.git cd oligoCLIP # The description does not provide specific commands or parameters to run the pipeline. # Please refer to the repository's documentation (e.g., README.md) for usage instructions. # Example placeholder: # python run_oligoCLIP.py --input <input_file> --output <output_directory>
-
6
Given the genome alignment, we subsequently follow the steps to do peak call using CLIPper, then rank the peaks comparing to âbackgroundsâ, either SMInput (eCLIP), RNA-seq (ABC), or to other multiplex libraries (internal normalization).
$ Bash example
# Install CLIPper (if not already installed) # pip install clipper # Download a genome size file (e.g., for hg38) # wget -O hg38.chrom.sizes http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # Execute CLIPper for peak calling # Replace 'aligned_reads.bam' with your IP sample's aligned BAM file. # Replace 'background_reads.bam' with your control (SMInput, RNA-seq, or other multiplex library) aligned BAM file. # Replace 'hg38.chrom.sizes' with the appropriate genome size file for your reference genome. # Replace 'clipper_output' with your desired output prefix. python clipper.py -b aligned_reads.bam -c background_reads.bam -s hg38.chrom.sizes -o clipper_output
-
7
CLIPper uses nearby regions (+/- 500 b.p.) to estimate the background, followed by a Poisson distribution to assess the significance of a peak, whereas ânormalizationâ takes into account of another background library.
$ Bash example
# Install CLIPper (example using pip, or check repo for specific instructions) # A common way to install is via pip, or by cloning the repository and running setup.py. # pip install clipper # Alternatively, for the yeolab/clipper version, you might clone the repo and run directly: # git clone https://github.com/yeolab/clipper.git # cd clipper # python setup.py install # Example command for CLIPper peak calling # 'eclip_ip.bam' represents the treatment (IP) library. # 'eclip_input.bam' represents the control (background) library, used for normalization. # The '-w 500' parameter specifies the window size (+/- 500 b.p.) for background estimation. # 'hg38' is used as a placeholder for the genome assembly, which CLIPper uses for chromosome information. clipper.py -t eclip_ip.bam -c eclip_input.bam -o clipper_peaks -s hg38 -w 500
-
8
In the eCLIP protocol, the SMInput library is prepared the same way as the IP library, which estimates the background crosslinking rate of other RBPs with similar size ranges as well as the RNA expression level and all other bias in library preparation.
$ Bash example
# This step describes the wet-lab preparation of the SMInput library, which is part of the eCLIP protocol. # The SMInput library is prepared to estimate background crosslinking rates and RNA expression levels for normalization. # There is no direct computational command for the physical preparation of the library. # The following command represents a typical first computational step: alignment of sequencing reads generated from the prepared SMInput library. # Install STAR (if not already installed) # conda install -c bioconda star # Define variables for input and output # Replace 'sm_input.fastq.gz' with the actual path to your SMInput sequencing reads. # Replace '/path/to/STAR_genome_index/hg38' with the actual path to your STAR genome index. SM_INPUT_FASTQ="sm_input.fastq.gz" GENOME_DIR="/path/to/STAR_genome_index/hg38" # Example for human hg38 OUTPUT_PREFIX="sm_input_aligned" # Example command to generate STAR genome index (run once per genome) # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg38.fa \ # --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \ # --runThreadN 8 # Adjust thread count as needed # Align SMInput reads using STAR STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${SM_INPUT_FASTQ} \ --runThreadN 8 \ --outFileNamePrefix ${OUTPUT_PREFIX}. \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --readFilesCommand zcat -
9
Since ABC does not have a SMInput, we realize that a total RNA-seq library can capture part of the bias.
$ Bash example
# Install Salmon (example using conda) # conda install -c bioconda salmon=1.10.2 # Define variables # Using human GRCh38 as a common reference assembly placeholder. # Transcriptome FASTA and GTF files are required for building the Salmon index. GENOME_VERSION="GRCh38" TRANSCRIPTOME_FASTA="Homo_sapiens.${GENOME_VERSION}.cdna.all.fa.gz" # Placeholder for transcriptome FASTA GTF_FILE="Homo_sapiens.${GENOME_VERSION}.109.gtf.gz" # Placeholder for GTF file (optional for index, but good for context) SALMON_INDEX_DIR="salmon_index_${GENOME_VERSION}" READ1="sample_R1.fastq.gz" # Placeholder for input R1 FASTQ READ2="sample_R2.fastq.gz" # Placeholder for input R2 FASTQ OUTPUT_DIR="salmon_quant_output" NUM_THREADS=8 # Example number of threads # Download reference files (if not already present) # Example for Ensembl release 109: # wget -P . "ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/${TRANSCRIPTOME_FASTA}" # wget -P . "ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/${GTF_FILE}" # Build Salmon index (if not already built) # This step requires the transcriptome FASTA file. # salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}" --type quasi -k 31 # Run Salmon quantification # -i: index directory # -l A: automatically detect library type # -1: first read file # -2: second read file # -o: output directory # --gcBias: correct for GC bias (relevant to "capture part of the bias" in description) # --seqBias: correct for sequence-specific bias # -p: number of threads salmon quant -i "${SALMON_INDEX_DIR}" \ -l A \ -1 "${READ1}" \ -2 "${READ2}" \ -o "${OUTPUT_DIR}" \ --gcBias \ --seqBias \ -p "${NUM_THREADS}" -
10
Furthermore, internal normalization to other multiplexed library can estimate not only the expression level, but also other biases introduced within the protocol.
$ Bash example
# Install R and DESeq2 if not already installed # conda create -n deseq2_env r-base bioconductor-deseq2 # conda activate deseq2_env # --- Placeholder for input files --- # counts.tsv: A tab-separated file with gene/transcript IDs as rows and sample names as columns, containing raw counts. # Example: # gene_id sample1 sample2 sample3 # geneA 100 150 120 # geneB 50 70 60 # ... # # sample_info.tsv: A tab-separated file with sample names as rows and experimental design variables as columns. # The row names MUST match the column names of counts.tsv. # Example: # sample_id condition batch # sample1 treated batch1 # sample2 control batch1 # sample3 treated batch2 # ... # Example R script for DESeq2 normalization cat << 'EOF' > normalize_deseq2.R library(DESeq2) # Load count data # Ensure the first column is used as row names (gene IDs) count_data <- read.table("counts.tsv", header=TRUE, row.names=1, sep="\t", check.names=FALSE) # Load sample information # Ensure the first column is used as row names (sample IDs) sample_info <- read.table("sample_info.tsv", header=TRUE, row.names=1, sep="\t", check.names=FALSE) # Ensure sample order and names match between count data and sample info # This step is crucial for DESeq2 if (!all(colnames(count_data) %in% rownames(sample_info))) { stop("Sample names in count_data columns do not match sample names in sample_info rows.") } sample_info <- sample_info[colnames(count_data), , drop=FALSE] # Create DESeqDataSet object # The 'design' formula specifies the experimental design. # Replace 'condition' with the actual column name from your sample_info.tsv that represents your primary variable of interest. # For simple normalization without differential expression, a design like ~1 or ~condition is often used. dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # !!! IMPORTANT: Adjust 'condition' to your actual column name from sample_info.tsv # Estimate size factors for normalization (median-of-ratios method) dds <- estimateSizeFactors(dds) # Get normalized counts # These are the raw counts divided by the size factors. normalized_counts <- counts(dds, normalized=TRUE) # Save normalized counts to a file write.table(normalized_counts, file="normalized_counts.tsv", sep="\t", quote=FALSE, col.names=NA) message("Normalization complete. Normalized counts saved to normalized_counts.tsv") # Optional: For downstream analyses like PCA or clustering, it's often better to use # variance-stabilized data (rlog or vst transformation) rather than raw normalized counts. # These transformations account for the mean-variance relationship in count data. # # Rlog transformation (slower for large datasets, good for small-medium) # message("Performing rlog transformation...") # rlog_data <- rlog(dds, blind=FALSE) # blind=FALSE if you have a design formula # write.table(assay(rlog_data), file="rlog_transformed_counts.tsv", sep="\t", quote=FALSE, col.names=NA) # message("rlog transformed counts saved to rlog_transformed_counts.tsv") # # VST transformation (faster for large datasets) # message("Performing vst transformation...") # vst_data <- vst(dds, blind=FALSE) # blind=FALSE if you have a design formula # write.table(assay(vst_data), file="vst_transformed_counts.tsv", sep="\t", quote=FALSE, col.names=NA) # message("vst transformed counts saved to vst_transformed_counts.tsv") EOF # Execute the R script Rscript normalize_deseq2.R # Reference genome/annotation for upstream count generation (e.g., GRCh38/hg38) # Source: Ensembl, GENCODE, UCSC -
11
Ideally, as the number of RBP approaches infinity, the summation of âother RBPs in the multiplexed libraryâ should approach the SMI-input.
(Inferred with models/gemini-2.5-flash) vN/A -
12
To estimate the significance of the peak based on backgrounds, we used a chi-square test (or Fisher Exact test, if the number of reads < 5) to compare the number of reads in the IP library and the background library within and outside of the peak region.
$ Bash example
# Install clipper (example, actual installation might vary) # pip install clipper # Define input files and genome assembly IP_BAM="ip_library.bam" # Placeholder for IP library BAM file BACKGROUND_BAM="background_library.bam" # Placeholder for background library BAM file GENOME_ASSEMBLY="hg38" # Placeholder for genome assembly (e.g., hg38, mm10) OUTPUT_DIR="clipper_peaks_output" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run clipper to identify peaks and estimate significance using chi-square or Fisher's exact test. # The tool internally compares reads in IP vs background within and outside peak regions. clipper \ --species "${GENOME_ASSEMBLY}" \ --ip-bam "${IP_BAM}" \ --control-bam "${BACKGROUND_BAM}" \ --output-dir "${OUTPUT_DIR}" \ --threshold 0.05 # Example p-value threshold for peak significance -
13
The pipeline is available at https://github.com/algaebrown/oligoCLIP.git
oligoCLIP (Inferred with models/gemini-2.5-flash) vunspecified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Clone the oligoCLIP repository git clone https://github.com/algaebrown/oligoCLIP.git # Navigate into the cloned directory cd oligoCLIP # Refer to the project's documentation (e.g., README.md) for installation instructions, dependencies, and how to run the pipeline.
Raw Source Text
Data is processed in a similar way to standard eCLIP pipeline, except for a few adjustments to ABCâs multiplex design and library structure. UMI are extracted using umitools 1.0.0 and adaptors are removed using cutadapt 2.8, then fastqs are demultiplexed based on the 5nt barcode sequence using fastx toolkit. Since ABC is sequenced on the reverse strand, we reverse complement sequences. Alignment to the repetitive regions removes multi-mapped reads, then aligned to the genomic sequences using STAR 2.7.6. The pipeline is available at https://github.com/algaebrown/oligoCLIP.git Given the genome alignment, we subsequently follow the steps to do peak call using CLIPper, then rank the peaks comparing to âbackgroundsâ, either SMInput (eCLIP), RNA-seq (ABC), or to other multiplex libraries (internal normalization). CLIPper uses nearby regions (+/- 500 b.p.) to estimate the background, followed by a Poisson distribution to assess the significance of a peak, whereas ânormalizationâ takes into account of another background library. In the eCLIP protocol, the SMInput library is prepared the same way as the IP library, which estimates the background crosslinking rate of other RBPs with similar size ranges as well as the RNA expression level and all other bias in library preparation. Since ABC does not have a SMInput, we realize that a total RNA-seq library can capture part of the bias. Furthermore, internal normalization to other multiplexed library can estimate not only the expression level, but also other biases introduced within the protocol. Ideally, as the number of RBP approaches infinity, the summation of âother RBPs in the multiplexed libraryâ should approach the SMI-input. To estimate the significance of the peak based on backgrounds, we used a chi-square test (or Fisher Exact test, if the number of reads < 5) to compare the number of reads in the IP library and the background library within and outside of the peak region. The pipeline is available at https://github.com/algaebrown/oligoCLIP.git Assembly: GRCh38 Supplementary files format and content: bed (unnormalized CLIPper peak) Supplementary files format and content: bed (normalized to RNA-seq) Supplementary files format and content: bed (normalized to other multiplex library, repeat element removed)