GSE198212 Processing Pipeline

RIP-Seq code_examples 7 steps

Publication

Remodeling oncogenic transcriptomes by small molecules targeting NONO.

Nature Chemical Biology (2023), PMID 36864190

Dataset

GSE198212

Remodeling of oncogenic transcriptomes by small-molecules targeting the RNA-binding protein NONO

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Processed Using https://github.com/YeoLab/eclip 0.7.0

eCLIP v0.7.0 GitHub

$ Bash example

# The YeoLab eCLIP tool (https://github.com/YeoLab/eclip) is a CWL workflow.
# To run it, you would typically use a CWL runner like cwltool.

# Installation of cwltool (if not already installed)
# pip install cwltool
# or
# conda install -c conda-forge cwltool

# Clone the eclip workflow repository
# git clone https://github.com/YeoLab/eclip.git
# cd eclip

# Example command to run the eCLIP workflow using cwltool
# This is a placeholder as specific inputs (e.g., FASTQ files, reference genome) 
# are not provided in the description. 
# You would need to create an 'inputs.yaml' file specifying your data and parameters.
# Reference genome (e.g., hg38) would be specified within the inputs.yaml.
cwltool eclip.cwl --inputs inputs.yaml

After standard HiSeq demultiplexing, eCLIP libraries with distinct in-line barcodes were demultiplexed using custom scripts, and the random-mer was appended to the read name for later usage.

eCLIP v0.7.0 GitHub

$ Bash example

# Install umi_tools if not already available
# conda install -c bioconda umi_tools

# Step 1: Custom demultiplexing based on in-line barcodes.
# The description states "eCLIP libraries with distinct in-line barcodes were demultiplexed using custom scripts".
# The exact custom script is not provided, but conceptually it takes an input FASTQ file
# (after standard HiSeq demultiplexing) and separates reads into multiple FASTQ files
# based on identified in-line barcodes. For example, if 'input_hiseq_demux.fastq'
# contains reads from multiple eCLIP libraries, this step would produce
# 'library1.fastq', 'library2.fastq', etc.
#
# Example (conceptual, replace with actual custom script and parameters):
# python /path/to/custom_inline_barcode_demux.py \
# --input input_hiseq_demux.fastq \
# --barcode_map barcodes.tsv \
# --output_prefix demultiplexed_library_

# Step 2: Extract random-mer (UMI) and append to read name.
# This step is performed for each demultiplexed library file. The description states
# "the random-mer was appended to the read name for later usage".
# Assuming 'demultiplexed_library_X.fastq' is the output from the custom demultiplexing
# for a specific library, and the random-mer is 10 bases at the start of the read
# (an assumption; the actual length and position depend on the library design).
# -I/-S/-L are the umi_tools input, output, and log options; with the default
# string extraction method, each N in --bc-pattern marks one UMI base.
umi_tools extract \
    -I demultiplexed_library_X.fastq \
    -S demultiplexed_library_X_umi.fastq \
    --bc-pattern=NNNNNNNNNN \
    -L demultiplexed_library_X_umi.log
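
To make the phrase "the random-mer was appended to the read name" concrete, here is a minimal sketch in plain awk. It is not the authors' custom script, which is not provided. The function name `append_randommer`, the `_` separator, and the random-mer sitting at the start of the read are all assumptions made for illustration.

```shell
# Hypothetical helper: move the first N bases of each read (the random-mer)
# into the read name and trim the matching quality values.
# Reads a 4-line-per-record FASTQ on stdin and writes FASTQ to stdout.
append_randommer() {
    awk -v n="${1:-10}" '
        NR % 4 == 1 {                          # header line: keep any trailing comment
            name = $1
            comment = (NF > 1) ? substr($0, length($1) + 1) : ""
            next
        }
        NR % 4 == 2 {                          # sequence line
            print name "_" substr($0, 1, n) comment
            print substr($0, n + 1)
            next
        }
        NR % 4 == 3 { print "+"; next }        # separator line
        NR % 4 == 0 { print substr($0, n + 1) } # quality line
    '
}

# Demo on one synthetic record with a 4-base random-mer
printf '@read1 1:N:0\nACGTTTGGCCAA\n+\nIIIIFFFFHHHH\n' | append_randommer 4
```

In the demo, the header becomes `@read1_ACGT 1:N:0` and the read is shortened to `TTGGCCAA`, so the random-mer survives alignment in the read name and can be used later for duplicate removal.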

Reads were then adapter trimmed (cutadapt v1.9.dev1), and reads shorter than 18 bp were discarded.

cutadapt v1.9 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.9

# Adapter trimming and minimum length filtering
# -a: 3' adapter sequence (common Illumina TruSeq adapter used as a placeholder)
# -m: Discard reads shorter than the specified length (18 bp)
# -o: Output file for trimmed reads
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -m 18 -o trimmed_reads.fastq.gz input_reads.fastq.gz
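
As a sanity check on the 18 bp cutoff, the following sketch counts reads in a FASTQ stream that are still shorter than a given minimum. The function name `count_short_reads` is hypothetical, and the check assumes a standard 4-line-per-record FASTQ.

```shell
# Hypothetical check: count FASTQ records whose sequence is shorter than
# a minimum length (default 18). Reads FASTQ on stdin.
count_short_reads() {
    awk -v min="${1:-18}" 'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }'
}

# Demo on two synthetic records (4 bp and 20 bp); prints 1
printf '@a\nACGT\n+\nIIII\n@b\nACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIII\n' | count_short_reads 18

# On real output this would be run as (expected to print 0 after trimming with -m 18):
# zcat trimmed_reads.fastq.gz | count_short_reads 18
```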

Mapping was then first performed against human elements in RepBase (v18.05) with STAR (v2.4.0i), repeat-mapping reads were segregated for separate analysis, and all others were then mapped against the full human genome (hg19) including a database of splice junctions with STAR (v 2.4.0i) (Dobin et al., 2013).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define reference paths (placeholders)
REPBASE_FA="repbase_human_elements_v18.05.fa"
HG19_FA="hg19.fa"
HG19_GTF="hg19.gtf" # For splice junctions

# Define STAR genome directory paths
REPBASE_GENOME_DIR="star_index_repbase_v18.05"
HG19_GENOME_DIR="star_index_hg19"

# Define input reads (placeholders)
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"

# --- Step 1: Build STAR index for RepBase (if not already built) ---
# STAR --runMode genomeGenerate \
#      --genomeDir ${REPBASE_GENOME_DIR} \
#      --genomeFastaFiles ${REPBASE_FA} \
#      --runThreadN 8 # Adjust threads as needed

# --- Step 2: Build STAR index for hg19 with splice junctions (if not already built) ---
# STAR --runMode genomeGenerate \
#      --genomeDir ${HG19_GENOME_DIR} \
#      --genomeFastaFiles ${HG19_FA} \
#      --sjdbGTFfile ${HG19_GTF} \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust threads as needed

# --- Step 3: First mapping against human elements in RepBase ---
# Output mapped reads to a BAM, and unmapped reads to FastQ
STAR --genomeDir ${REPBASE_GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix repbase_mapping_ \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx \
     --runThreadN 8 # Adjust threads as needed

# Segregate repeat-mapping reads (repbase_mapping_Aligned.sortedByCoord.out.bam is the output)
# The description implies these are kept for separate analysis.

# --- Step 4: Map remaining reads (unmapped from RepBase) against the full human genome (hg19) ---
# STAR writes the Unmapped.out.mate* files uncompressed, so --readFilesCommand zcat is not used here.
STAR --genomeDir ${HG19_GENOME_DIR} \
     --readFilesIn repbase_mapping_Unmapped.out.mate1 repbase_mapping_Unmapped.out.mate2 \
     --outFileNamePrefix hg19_mapping_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 # Adjust threads as needed
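
The duplicate-removal step operates on uniquely mapping reads. One way to select them, sketched here under the assumption that STAR's default settings were used, is to keep alignments with MAPQ 255, the value STAR assigns to unique mappers by default. The function name `keep_unique_sam` and the output file name are hypothetical.

```shell
# Hypothetical filter: keep SAM header lines and alignments with MAPQ 255
# (column 5), STAR's default MAPQ for uniquely mapped reads.
keep_unique_sam() {
    awk -F '\t' '/^@/ || $5 == 255'
}

# Demo on a synthetic SAM stream: r1 is unique (MAPQ 255), r2 is multi-mapped (MAPQ 3)
printf '@HD\tVN:1.4\nr1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\nr2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' | keep_unique_sam

# Typical use on the STAR output (assumes samtools is installed):
# samtools view -h hg19_mapping_Aligned.sortedByCoord.out.bam | keep_unique_sam | samtools view -b -o hg19_unique.bam -
```

The same selection can be made directly with `samtools view -b -q 255`; the awk form is shown only to make the criterion explicit.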

Uniquely mapping reads were then run through a custom-built PCR duplicate removal script, removing duplicate reads based on sharing identical Read1 start position, Read2 start position, and random-mer sequence to leave 'Usable' reads.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Assuming 'aligned_sorted.bam' is a coordinate-sorted, indexed BAM file of uniquely mapped reads
# with UMIs already appended to the read names (e.g., by a preceding umi_tools extract step).
# 'deduplicated.bam' will be the output file containing 'Usable' reads.
# Note: umi_tools stands in here for the custom-built script, which is not provided.
# The --paired flag indicates paired-end reads.
# The --method=unique flag collapses only reads with identical UMIs, matching the
# "identical random-mer sequence" criterion in the description.
# samtools index aligned_sorted.bam
umi_tools dedup -I aligned_sorted.bam -S deduplicated.bam --paired --method=unique
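
The rule described above (identical Read1 start, Read2 start, and random-mer) can be illustrated without any alignment tooling. The sketch below assumes a hypothetical tab-separated table with one fragment per line and the random-mer stored as the last `_`-separated part of the read name. It is not the authors' script, and `dedup_fragments` is an illustrative name.

```shell
# Hypothetical input, one fragment per line, tab-separated:
#   read_name  chrom  strand  read1_start  read2_start
# where read_name ends in "_<random-mer>".
# Keeps the first fragment seen for each (chrom, strand, starts, random-mer) key.
dedup_fragments() {
    awk -F '\t' '{
        n = split($1, parts, "_")              # random-mer is the last part of the name
        key = $2 FS $3 FS $4 FS $5 FS parts[n]
        if (!(key in seen)) { seen[key] = 1; print }
    }'
}

# Demo: b_ACGT duplicates a_ACGT; c_TTTT differs by random-mer; d_ACGT differs by Read2 start
printf 'a_ACGT\tchr1\t+\t100\t250\nb_ACGT\tchr1\t+\t100\t250\nc_TTTT\tchr1\t+\t100\t250\nd_ACGT\tchr1\t+\t100\t260\n' | dedup_fragments
```

Only `b_ACGT` is dropped in the demo: fragments sharing both start positions are kept as distinct molecules when their random-mers differ.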

eCLIP data sets with multiple in-line barcodes were merged at the usable read stage, and cluster identification was performed on usable reads using CLIPper (Yeo et al., 2009) (available at https://github.com/YeoLab/clipper/releases/tag/1.0) with options -s GRCh38 -o --bonferroni --superlocal --threshold-method binomial --save-pickle.

CLIPper v1.0 GitHub
$ Bash example
```
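# Install CLIPper from its GitHub repository (see the repository README for the supported environment)
# git clone https://github.com/YeoLab/clipper.git

# Assuming 'merged_usable_reads.bam' is the input file containing usable reads.
# -b: input BAM; -s: species/genome build; -o: output peak file (BED).
# The output name 'clipper_peaks.bed' is a placeholder; the description lists -o without a value.
clipper -b merged_usable_reads.bam -s GRCh38 -o clipper_peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle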
```

Data were further downsampled to a roughly equal number of reads per replicate.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Downsample an input BAM file to a target number of reads.
# This assumes 'TARGET_READS' (e.g., the minimum read count across all replicates)
# and 'INPUT_BAM' are already defined for a specific replicate.

# Placeholder variables (replace with actual paths and target count)
INPUT_BAM="path/to/replicateX.bam"
OUTPUT_BAM="path/to/replicateX_downsampled.bam"
TARGET_READS=10000000 # Example: 10 million reads, determined from the lowest replicate count

# Get the current number of reads in the input BAM file
CURRENT_READS=$(samtools view -c "$INPUT_BAM")

# Check if downsampling is necessary
if (( CURRENT_READS > TARGET_READS )); then
    # Calculate the sampling fraction with a leading zero (e.g., 0.400000).
    # bc would print ".400000", which would produce an invalid "42..400000" below.
    FRACTION=$(awk -v t="$TARGET_READS" -v c="$CURRENT_READS" 'BEGIN { printf "%.6f", t / c }')
    echo "Downsampling $INPUT_BAM from $CURRENT_READS to $TARGET_READS reads (fraction: $FRACTION)"

    # Execute samtools view for downsampling
    # -b: Output in BAM format
    # -h: Include header
    # -s <seed>.<fraction>: Sample reads. The integer part is a random seed for reproducibility (e.g., 42),
    #                       the decimal part is the fraction of reads to retain (42.400000 keeps about 40%).
    # -o: Specify output file
    samtools view -b -h -s "42${FRACTION#0}" -o "$OUTPUT_BAM" "$INPUT_BAM"
else
    echo "$INPUT_BAM already has $CURRENT_READS reads, which is <= $TARGET_READS. Copying instead of downsampling."
    cp "$INPUT_BAM" "$OUTPUT_BAM"
fi
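
The description does not say how the common target was chosen. A minimal sketch, assuming the target is the smallest usable-read count among the replicates, computes the retained fraction for each replicate from a table of counts. The function name `downsample_fractions` and the input layout are hypothetical, and all counts are assumed to be positive.

```shell
# Hypothetical input on stdin: tab-separated "replicate<TAB>usable_read_count" lines.
# Prints "replicate<TAB>fraction", where fraction = smallest count / this count.
downsample_fractions() {
    awk -F '\t' '
        { name[NR] = $1; count[NR] = $2
          if (NR == 1 || $2 + 0 < min) min = $2 + 0 }
        END { for (i = 1; i <= NR; i++) printf "%s\t%.4f\n", name[i], min / count[i] }
    '
}

# Demo with three synthetic replicates
printf 'rep1\t20000000\nrep2\t10000000\nrep3\t40000000\n' | downsample_fractions
```

A fraction of 1.0000 marks the replicate that sets the target and needs no downsampling; the other fractions are the values to pass to the sampling step.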

Tools Used

eCLIP STAR

Raw Source Text

Processed Using https://github.com/YeoLab/eclip 0.7.0
After standard HiSeq demultiplexing, eCLIP libraries with distinct in-line barcodes were demultiplexed using custom scripts, and the random-mer was appended to the read name for later usage. Reads were then adapter trimmed (cutadapt v1.9.dev1) and reads less than 18 bp were discarded
Mapping was then first performed against human elements in RepBase (v18.05) with STAR (v2.4.0i), repeat-mapping reads were segregated for separate analysis, and all others were then mapped against the full human genome (hg19) including a database of splice junctions with STAR (v 2.4.0i) (Dobin et al., 2013). Uniquely mapping reads were then run through a custom-built PCR duplicate removal script, removing duplicate reads based on sharing identical Read1 start position, Read2 start position, and random-mer sequence to leave 'Usable' reads.
eCLIP data sets with multiple in-line barcodes were merged at the usable read stage, and cluster identification was performed on usable reads using CLIPper (Yeo et al., 2009) (available at https://github.com/YeoLab/clipper/releases/tag/1.0) with options -s GRCh38 -o --bonferroni --superlocal --threshold-method binomial --save-pickle
data are further downsampled to roughly equal number per replicate
Assembly: hg38
Supplementary files format and content: wig files represent read coverage for plus and minus strands
Supplementary files format and content: peak files
