GSE164666 Processing Pipeline

OTHER code_examples 4 steps

Publication

Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.

Cancer discovery (2022) — PMID 34620690

Dataset

Integrative RNA-omics discovers GNAS alternative splicing as a phenotypic driver of splicing factor mutant neoplasms

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: eCLIP-Seq

eCLIP v2020 GitHub

$ Bash example

# Install cwltool if not already installed
# pip install cwltool

# Define input files and reference genome paths
# Replace these with your actual data paths
INPUT_FASTQ_R1="path/to/your_sample_R1.fastq.gz"
INPUT_FASTQ_R2="path/to/your_sample_R2.fastq.gz" # Optional, if paired-end
GENOME_FASTA="path/to/reference/genome.fa" # e.g., hg38.fa
GENOME_STAR_INDEX="path/to/star_index/" # Pre-built STAR index for the genome
ANNOTATION_GTF="path/to/genome_annotation.gtf" # e.g., gencode.v38.annotation.gtf
OUTPUT_DIR="eclip_output"
SAMPLE_NAME="my_eclip_experiment"

# The eCLIP pipeline is implemented as a CWL workflow from the yeolab/eclip repository.
# This command executes the 'eclip.cwl' workflow using cwltool.
# You would need to provide all required inputs as defined in the CWL workflow.
# A common practice is to define inputs in a YAML file (e.g., inputs.yaml)
# and pass it to cwltool. For demonstration, direct parameter passing is shown.

# Ensure you have the 'eclip.cwl' file from the yeolab/eclip repository.
# Example to get the workflow:
# git clone https://github.com/yeolab/eclip.git
# cd eclip

cwltool eclip.cwl \
  --fastq_r1 "${INPUT_FASTQ_R1}" \
  --fastq_r2 "${INPUT_FASTQ_R2}" \
  --genome_fasta "${GENOME_FASTA}" \
  --star_index "${GENOME_STAR_INDEX}" \
  --gtf "${ANNOTATION_GTF}" \
  --output_dir "${OUTPUT_DIR}" \
  --sample_name "${SAMPLE_NAME}" \
  --adapter_sequence "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" \
  --min_read_length 18 \
  --max_read_length 100 \
  --threads 8
  # Add other parameters as required by the eclip.cwl workflow (e.g., blacklist, chrom_sizes, etc.)

View on GitHub

For eCLIP-seq, and RNA-seq, raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v3.4 GitHub

$ Bash example

# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt

# Define input and output files
INPUT_R1="raw_R1.fastq.gz"
INPUT_R2="raw_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences and minimum read length
# ADAPTER_3PRIME: Common Illumina 3' adapter sequence (e.g., Illumina Universal Adapter)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# ADAPTER_5PRIME: Placeholder for 5' barcode/UMI sequence specific to the eCLIP library preparation.
# Replace NNNNNNNN with the actual 5' barcode/UMI sequence if known.
ADAPTER_5PRIME="NNNNNNNN"
MIN_LENGTH=18

# Execute cutadapt for adapter and barcode trimming
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -A "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -G "${ADAPTER_5PRIME}" \
  --minimum-length "${MIN_LENGTH}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"

View on GitHub

Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Define reference genome and index prefix for RepBase
# Download RepBase sequences (e.g., from UCSC RepeatMasker track or RepBase website)
# For example, using hg38 repetitive sequences:
# wget -O repbase_hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz
# gunzip repbase_hg38.fa.gz
# mv hg38.fa.out repbase_hg38.fa

REPBASE_FASTA="repbase_hg38.fa"
REPBASE_INDEX_PREFIX="repbase_hg38_index"

# Build bowtie2 index for RepBase (if not already built)
# bowtie2-build "${REPBASE_FASTA}" "${REPBASE_INDEX_PREFIX}"

# Input trimmed reads (assuming paired-end, adjust for single-end if necessary)
TRIMMED_R1="trimmed_reads_R1.fastq.gz"
TRIMMED_R2="trimmed_reads_R2.fastq.gz"

# Output reads that did NOT map to RepBase
# bowtie2 --un-conc option outputs files with .1 and .2 suffixes
OUTPUT_NON_REP_PREFIX="non_rep_reads"

# Map trimmed reads against RepBase and output reads that did not map
# --very-sensitive-local is often used for repeat filtering to ensure good mapping
# --un-conc outputs reads where neither mate mapped to the reference
# -S /dev/null discards the SAM output as we only care about the --un-conc output
bowtie2 \
    --very-sensitive-local \
    -x "${REPBASE_INDEX_PREFIX}" \
    -1 "${TRIMMED_R1}" \
    -2 "${TRIMMED_R2}" \
    --un-conc "${OUTPUT_NON_REP_PREFIX}" \
    -S /dev/null

# The output files will be named non_rep_reads.1 and non_rep_reads.2 (or .1.fastq and .2.fastq depending on bowtie2 version/config)
# You might need to rename them if specific .fastq.gz extensions are required for downstream tools:
# mv "${OUTPUT_NON_REP_PREFIX}.1" "${OUTPUT_NON_REP_PREFIX}_R1.fastq"
# mv "${OUTPUT_NON_REP_PREFIX}.2" "${OUTPUT_NON_REP_PREFIX}_R2.fastq"
# gzip "${OUTPUT_NON_REP_PREFIX}_R1.fastq"
# gzip "${OUTPUT_NON_REP_PREFIX}_R2.fastq"

Remaining reads were mapped to the appropriate genome build using STAR aligner. eCLIP peaks were called with CLIPPER and normalized to size-matched input sample

STAR vSTAR 2.7.0f, CLIPPER 0.0.1 GitHub

$ Bash example

# Define variables for paths and files
GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder: Replace with actual path to STAR genome index
GENOME_FASTA="/path/to/hg38.fa"       # Placeholder: Replace with actual path to genome FASTA
GTF_FILE="/path/to/gencode.vXX.annotation.gtf" # Placeholder: Replace with actual path to GTF annotation
ECLIP_READS_FASTQ="eclip_reads.fastq.gz" # Placeholder: Replace with actual path to eCLIP input FASTQ file
INPUT_READS_FASTQ="input_reads.fastq.gz" # Placeholder: Replace with actual path to size-matched input FASTQ file
ECLIP_ALIGNED_BAM="eclip_aligned.bam" # Output BAM for eCLIP sample
INPUT_ALIGNED_BAM="input_aligned.bam" # Output BAM for input sample
STAR_ECLIP_OUTPUT_PREFIX="star_eclip_output/"
STAR_INPUT_OUTPUT_PREFIX="star_input_output/"
CLIPPER_OUTPUT_PREFIX="clipper_peaks"
NUM_THREADS=8 # Adjust based on available resources

# --- STAR Alignment ---
# Tool: STAR (Spliced Transcripts Alignment to a Reference)
# Version: 2.7.0f (Inferred from yeolab/eclip CWL workflow)
# Source: https://github.com/alexdobin/STAR

# Installation (example using conda)
# conda install -c bioconda star

# Generate STAR genome index (run once per genome build)
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --runThreadN "${NUM_THREADS}"

# Map eCLIP reads to the appropriate genome build
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${ECLIP_READS_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${STAR_ECLIP_OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterMismatchNoverLmax 0.05 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --runThreadN "${NUM_THREADS}"

# Rename STAR output BAM for eCLIP sample
mv "${STAR_ECLIP_OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${ECLIP_ALIGNED_BAM}"

# Map size-matched input reads to the appropriate genome build
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${STAR_INPUT_OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterMismatchNoverLmax 0.05 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --runThreadN "${NUM_THREADS}"

# Rename STAR output BAM for input sample
mv "${STAR_INPUT_OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${INPUT_ALIGNED_BAM}"

# --- CLIPPER Peak Calling ---
# Tool: CLIPPER
# Version: 0.0.1 (Inferred from yeolab/eclip CWL workflow)
# Source: https://github.com/yeolab/clipper

# Installation (example using pip)
# pip install clipper
# or (example using conda)
# conda install -c bioconda clipper

# Call eCLIP peaks and normalize to size-matched input sample
clipper -e "${ECLIP_ALIGNED_BAM}" \
        -c "${INPUT_ALIGNED_BAM}" \
        -o "${CLIPPER_OUTPUT_PREFIX}" \
        -s hg38 \
        -p 0.01 \
        -f 0.05 \
        --bonferroni \
        --threshold-method pvalue \
        --save-bed \
        --save-wig \
        --save-bigwig

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Library strategy: eCLIP-Seq
For eCLIP-seq, and RNA-seq, raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner. eCLIP peaks were called with CLIPPER and normalized to size-matched input sample
Genome_build: hg19, gencode v19
Supplementary_files_format_and_content: eCLIP input normalized and filtered peaks in BED format. Peaks files (bigwigs) for eCLIP. Featurecounts file for RNA-seq samples.

← Back to Analysis