GSE198212 Processing Pipeline
Publication
Remodeling oncogenic transcriptomes by small molecules targeting NONO. Nature Chemical Biology (2023), PMID 36864190
Dataset
GSE198212: Remodeling of oncogenic transcriptomes by small-molecules targeting the RNA-binding protein NONO
Processing Steps
1
Processed using the YeoLab eCLIP pipeline (https://github.com/YeoLab/eclip), version 0.7.0.
$ Bash example
# The YeoLab eCLIP pipeline (https://github.com/YeoLab/eclip) is a CWL workflow,
# typically run with a CWL runner such as cwltool.

# Install cwltool (if not already installed)
# pip install cwltool
# or
# conda install -c conda-forge cwltool

# Clone the eclip workflow repository
# git clone https://github.com/YeoLab/eclip.git
# cd eclip

# Example invocation with cwltool. This is a placeholder: the description does not
# give the specific inputs (e.g., FASTQ files, reference genome), and 'eclip.cwl'
# stands in for whichever workflow file the repository provides.
# cwltool takes the workflow followed by a job file, so create an 'inputs.yaml'
# that lists your data and parameters. The reference genome (e.g., hg38) would be
# specified within inputs.yaml.
cwltool eclip.cwl inputs.yaml
2
After standard HiSeq demultiplexing, eCLIP libraries with distinct in-line barcodes were demultiplexed using custom scripts, and the random-mer was appended to the read name for later usage.
$ Bash example
# Install umi_tools if not already available
# conda install -c bioconda umi_tools

# Step 1: Custom demultiplexing based on in-line barcodes.
# The description says only that "eCLIP libraries with distinct in-line barcodes were
# demultiplexed using custom scripts"; the script itself is not provided. Conceptually,
# it takes a FASTQ file (after standard HiSeq demultiplexing) and splits the reads into
# one FASTQ file per in-line barcode. For example, 'input_hiseq_demux.fastq' containing
# reads from several eCLIP libraries would yield 'library1.fastq', 'library2.fastq', etc.
#
# Example (conceptual; replace with the actual custom script and parameters):
# python /path/to/custom_inline_barcode_demux.py \
#   --input input_hiseq_demux.fastq \
#   --barcode_map barcodes.tsv \
#   --output_prefix demultiplexed_library_

# Step 2: Extract the random-mer (UMI) and append it to the read name.
# Run this for each demultiplexed library file. umi_tools stands in here for the
# custom script, which is not provided. 'demultiplexed_library_X.fastq' is assumed to be
# one output of the demultiplexing step, and the random-mer is assumed to be the first
# 10 bases of the read (a common length in eCLIP protocols).
# --bc-pattern: one N per random-mer base (ten here).
umi_tools extract \
  --stdin demultiplexed_library_X.fastq \
  --stdout demultiplexed_library_X_umi.fastq \
  --bc-pattern=NNNNNNNNNN \
  --log demultiplexed_library_X_umi.log
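To make the read-name step concrete, here is a minimal sketch of what "appending the random-mer to the read name" does to a single FASTQ record. The function name `append_randomer`, the default length of 10 bases, and the `_` separator are assumptions for illustration; the authors' custom script is not available, so this may differ from what they ran.

```shell
# Hypothetical helper: move the first N bases of each read (the assumed random-mer)
# into the read name, and trim the same N characters from the quality string.
# Usage: append_randomer [N] < in.fastq > out.fastq   (N defaults to 10)
append_randomer() {
  awk -v n="${1:-10}" '
    NR % 4 == 1 { header = $0; next }            # hold the name line until the UMI is known
    NR % 4 == 2 {
      umi = substr($0, 1, n)                     # leading N bases = random-mer
      sub(/^[^ ]+/, "&_" umi, header)            # append to the read ID (first token)
      print header
      print substr($0, n + 1)                    # remaining sequence
      next
    }
    NR % 4 == 3 { print; next }                  # separator line
    { print substr($0, n + 1) }                  # qualities, trimmed to match
  '
}

# Toy record: a 10-base random-mer followed by a 6-base insert.
printf '@read1 1:N:0\nACGTACGTACGGGTTT\n+\nIIIIIIIIIIJJJJJJ\n' | append_randomer 10
```

For the toy record, the read ID becomes `@read1_ACGTACGTAC` and only the six insert bases and their qualities remain in the sequence and quality lines.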
3
Reads were then adapter trimmed (cutadapt v1.9.dev1), and reads shorter than 18 bp were discarded.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.9

# Adapter trimming and minimum-length filtering
# -a: 3' adapter sequence (a common Illumina TruSeq adapter, used here as a placeholder)
# -m: discard reads shorter than the given length (18 bp)
# -o: output file for trimmed reads
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -m 18 -o trimmed_reads.fastq.gz input_reads.fastq.gz
4
Mapping was then first performed against human elements in RepBase (v18.05) with STAR (v2.4.0i), repeat-mapping reads were segregated for separate analysis, and all others were then mapped against the full human genome (hg19) including a database of splice junctions with STAR (v 2.4.0i) (Dobin et al., 2013).
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Reference paths (placeholders)
REPBASE_FA="repbase_human_elements_v18.05.fa"
HG19_FA="hg19.fa"
HG19_GTF="hg19.gtf"  # Annotation that supplies the splice junctions

# STAR genome directory paths
REPBASE_GENOME_DIR="star_index_repbase_v18.05"
HG19_GENOME_DIR="star_index_hg19"

# Input reads (placeholders)
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"

# --- Step 1: Build STAR index for RepBase (if not already built) ---
# STAR --runMode genomeGenerate \
#   --genomeDir "${REPBASE_GENOME_DIR}" \
#   --genomeFastaFiles "${REPBASE_FA}" \
#   --runThreadN 8  # Adjust threads as needed

# --- Step 2: Build STAR index for hg19 with splice junctions (if not already built) ---
# STAR --runMode genomeGenerate \
#   --genomeDir "${HG19_GENOME_DIR}" \
#   --genomeFastaFiles "${HG19_FA}" \
#   --sjdbGTFfile "${HG19_GTF}" \
#   --sjdbOverhang 100 \
#   --runThreadN 8  # Adjust threads as needed

# --- Step 3: First mapping against human elements in RepBase ---
# Mapped reads go to a BAM file; unmapped reads are written out as FASTQ.
STAR --genomeDir "${REPBASE_GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix repbase_mapping_ \
  --outSAMtype BAM SortedByCoordinate \
  --outReadsUnmapped Fastx \
  --runThreadN 8  # Adjust threads as needed

# The repeat-mapping reads are in repbase_mapping_Aligned.sortedByCoord.out.bam.
# The description implies these are kept for separate analysis.

# --- Step 4: Map the remaining reads (unmapped against RepBase) to the full human genome (hg19) ---
# STAR writes the unmapped mates uncompressed, so --readFilesCommand zcat is not used here.
STAR --genomeDir "${HG19_GENOME_DIR}" \
  --readFilesIn repbase_mapping_Unmapped.out.mate1 repbase_mapping_Unmapped.out.mate2 \
  --outFileNamePrefix hg19_mapping_ \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 8  # Adjust threads as needed
5
Uniquely mapping reads were then run through a custom-built PCR duplicate removal script, removing duplicate reads based on sharing identical Read1 start position, Read2 start position, and random-mer sequence to leave 'Usable' reads.
$ Bash example
# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# The duplicate-removal script used by the authors is custom and not provided;
# umi_tools dedup is used here as a stand-in.
# 'aligned_sorted.bam' is assumed to be a coordinate-sorted, indexed BAM of uniquely mapped
# reads with the random-mer (UMI) already in the read names (e.g., from a preceding
# umi_tools extract step). 'deduplicated.bam' will contain the 'Usable' reads.
# --paired: treat the input as paired-end reads.
# --method=unique: collapse only reads whose UMIs are identical, which is closer to the
#   described rule than the error-tolerant default (directional).
umi_tools dedup --stdin aligned_sorted.bam --stdout deduplicated.bam --paired --method=unique
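The rule described in this step (identical Read1 start, Read2 start, and random-mer) can also be sketched directly on a small table. This is only an illustration under stated assumptions: the function name `dedup_usable` and the six-column tab-separated layout are hypothetical, and including chromosome and strand in the key is an assumption that the description does not spell out.

```shell
# Hypothetical helper illustrating the duplicate-removal rule; not the authors' script.
# Expects tab-separated lines with an assumed column layout:
#   1 read name, 2 chromosome, 3 strand, 4 Read1 start, 5 Read2 start, 6 random-mer
# Keeps the first line seen for each combination of columns 2-6.
dedup_usable() {
  awk -F'\t' '!seen[$2, $3, $4, $5, $6]++'
}

# Toy input: r2 duplicates r1; r3 shares both start positions but has a different random-mer.
printf 'r1\tchr1\t+\t100\t250\tACGTACGTAC\nr2\tchr1\t+\t100\t250\tACGTACGTAC\nr3\tchr1\t+\t100\t250\tTTGTACGTAC\n' |
  dedup_usable
```

In the toy input, r1 and r3 are kept and r2 is dropped, because a different random-mer marks r3 as an independent molecule even though its coordinates match.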
6
eCLIP data sets with multiple in-line barcodes were merged at the usable read stage, and cluster identification was performed on usable reads using CLIPper (Yeo et al., 2009) (available at https://github.com/YeoLab/clipper/releases/tag/1.0) with options -s GRCh38 -o --bonferroni --superlocal --threshold-method binomial --save-pickle.
$ Bash example
# Install CLIPper from https://github.com/YeoLab/clipper (see the repository README).

# 'merged_usable_reads.bam' is assumed to be the BAM file of merged usable reads.
# -b names the input BAM and -o the output peak file (a BED file, not a directory);
# the output file name below is a placeholder.
clipper -b merged_usable_reads.bam -s GRCh38 -o merged_usable_reads.peaks.bed \
  --bonferroni --superlocal --threshold-method binomial --save-pickle
7
Data were further downsampled to a roughly equal number of reads per replicate.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Example: downsample one replicate's BAM file to a target number of reads.

# Placeholder variables (replace with actual paths and target count)
INPUT_BAM="path/to/replicateX.bam"
OUTPUT_BAM="path/to/replicateX_downsampled.bam"
TARGET_READS=10000000  # Example: 10 million reads, e.g., the lowest count across replicates

# Count the alignment records in the input BAM file
CURRENT_READS=$(samtools view -c "$INPUT_BAM")

# Downsample only if the file holds more reads than the target
if (( CURRENT_READS > TARGET_READS )); then
  # Six decimal digits of the sampling fraction, truncated (e.g., 400000 for 0.4).
  # Keeping only the digits avoids a malformed -s argument such as "42..400000".
  FRACTION_DIGITS=$(awk -v t="$TARGET_READS" -v c="$CURRENT_READS" \
    'BEGIN { printf "%06d", int(1000000 * t / c) }')
  echo "Downsampling $INPUT_BAM from $CURRENT_READS to about $TARGET_READS reads (fraction: 0.$FRACTION_DIGITS)"

  # samtools view options:
  # -b: output in BAM format
  # -h: include the header
  # -s <seed>.<fraction>: subsample; the integer part is the random seed (42 here, for
  #     reproducibility) and the decimal part is the fraction of reads to keep
  # -o: output file
  samtools view -b -h -s "42.${FRACTION_DIGITS}" "$INPUT_BAM" -o "$OUTPUT_BAM"
else
  echo "$INPUT_BAM already has $CURRENT_READS reads, which is <= $TARGET_READS. Copying instead of downsampling."
  cp "$INPUT_BAM" "$OUTPUT_BAM"
fi
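The description does not say how the common read depth was chosen. One plausible choice, shown here purely as an assumption, is to downsample every replicate to the smallest replicate's count. The sketch below takes a table of per-replicate counts (for instance collected with `samtools view -c`) and prints the target and the fraction to keep for each replicate; the function name `downsample_plan` and the two-column input layout are hypothetical.

```shell
# Hypothetical helper: read "name<TAB>read count" lines and print, per replicate,
# name<TAB>target<TAB>fraction, where the target is the smallest count in the table.
# Using the minimum as the target is an assumption, not something the source states.
downsample_plan() {
  awk -F'\t' '
    {
      name[NR] = $1
      count[NR] = $2 + 0                       # force numeric comparison
      if (NR == 1 || count[NR] < min) min = count[NR]
    }
    END {
      for (i = 1; i <= NR; i++)
        printf "%s\t%d\t%.4f\n", name[i], min, min / count[i]
    }
  '
}

# Toy counts for two replicates.
printf 'rep1\t12000000\nrep2\t8000000\n' | downsample_plan
```

With these toy counts, rep1 would keep about two thirds of its reads (fraction 0.6667) and rep2 would be left unchanged (fraction 1.0000).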
Raw Source Text
Processed Using https://github.com/YeoLab/eclip 0.7.0
After standard HiSeq demultiplexing, eCLIP libraries with distinct in-line barcodes were demultiplexed using custom scripts, and the random-mer was appended to the read name for later usage.
Reads were then adapter trimmed (cutadapt v1.9.dev1) and reads less than 18 bp were discarded
Mapping was then first performed against human elements in RepBase (v18.05) with STAR (v2.4.0i), repeat-mapping reads were segregated for separate analysis, and all others were then mapped against the full human genome (hg19) including a database of splice junctions with STAR (v 2.4.0i) (Dobin et al., 2013).
Uniquely mapping reads were then run through a custom-built PCR duplicate removal script, removing duplicate reads based on sharing identical Read1 start position, Read2 start position, and random-mer sequence to leave 'Usable' reads.
eCLIP data sets with multiple in-line barcodes were merged at the usable read stage, and cluster identification was performed on usable reads using CLIPper (Yeo et al., 2009) (available at https://github.com/YeoLab/clipper/releases/tag/1.0) with options -s GRCh38 -o --bonferroni --superlocal --threshold-method binomial --save-pickle
data are further downsampled to roughly equal number per replicate
Assembly: hg38
Supplementary files format and content: wig files represent read coverage for plus and minus strands
Supplementary files format and content: peak files