GSE164666 Processing Pipeline
OTHER
code_examples
4 steps
Publication
Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.Cancer discovery (2022) — PMID 34620690
Dataset
GSE164666Integrative RNA-omics discovers GNAS alternative splicing as a phenotypic driver of splicing factor mutant neoplasms
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: eCLIP-Seq
$ Bash example
# Install cwltool if not already installed # pip install cwltool # Define input files and reference genome paths # Replace these with your actual data paths INPUT_FASTQ_R1="path/to/your_sample_R1.fastq.gz" INPUT_FASTQ_R2="path/to/your_sample_R2.fastq.gz" # Optional, if paired-end GENOME_FASTA="path/to/reference/genome.fa" # e.g., hg38.fa GENOME_STAR_INDEX="path/to/star_index/" # Pre-built STAR index for the genome ANNOTATION_GTF="path/to/genome_annotation.gtf" # e.g., gencode.v38.annotation.gtf OUTPUT_DIR="eclip_output" SAMPLE_NAME="my_eclip_experiment" # The eCLIP pipeline is implemented as a CWL workflow from the yeolab/eclip repository. # This command executes the 'eclip.cwl' workflow using cwltool. # You would need to provide all required inputs as defined in the CWL workflow. # A common practice is to define inputs in a YAML file (e.g., inputs.yaml) # and pass it to cwltool. For demonstration, direct parameter passing is shown. # Ensure you have the 'eclip.cwl' file from the yeolab/eclip repository. # Example to get the workflow: # git clone https://github.com/yeolab/eclip.git # cd eclip cwltool eclip.cwl \ --fastq_r1 "${INPUT_FASTQ_R1}" \ --fastq_r2 "${INPUT_FASTQ_R2}" \ --genome_fasta "${GENOME_FASTA}" \ --star_index "${GENOME_STAR_INDEX}" \ --gtf "${ANNOTATION_GTF}" \ --output_dir "${OUTPUT_DIR}" \ --sample_name "${SAMPLE_NAME}" \ --adapter_sequence "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" \ --min_read_length 18 \ --max_read_length 100 \ --threads 8 # Add other parameters as required by the eclip.cwl workflow (e.g., blacklist, chrom_sizes, etc.) -
2
For eCLIP-seq, and RNA-seq, raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (e.g., via conda) # conda install -c bioconda cutadapt # Define input and output files INPUT_R1="raw_R1.fastq.gz" INPUT_R2="raw_R2.fastq.gz" OUTPUT_R1="trimmed_R1.fastq.gz" OUTPUT_R2="trimmed_R2.fastq.gz" # Define adapter sequences and minimum read length # ADAPTER_3PRIME: Common Illumina 3' adapter sequence (e.g., Illumina Universal Adapter) ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # ADAPTER_5PRIME: Placeholder for 5' barcode/UMI sequence specific to the eCLIP library preparation. # Replace NNNNNNNN with the actual 5' barcode/UMI sequence if known. ADAPTER_5PRIME="NNNNNNNN" MIN_LENGTH=18 # Execute cutadapt for adapter and barcode trimming cutadapt \ -a "${ADAPTER_3PRIME}" \ -A "${ADAPTER_3PRIME}" \ -g "${ADAPTER_5PRIME}" \ -G "${ADAPTER_5PRIME}" \ --minimum-length "${MIN_LENGTH}" \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" -
3
Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences.
bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install bowtie2 if not already installed # conda install -c bioconda bowtie2 # Define reference genome and index prefix for RepBase # Download RepBase sequences (e.g., from UCSC RepeatMasker track or RepBase website) # For example, using hg38 repetitive sequences: # wget -O repbase_hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz # gunzip repbase_hg38.fa.gz # mv hg38.fa.out repbase_hg38.fa REPBASE_FASTA="repbase_hg38.fa" REPBASE_INDEX_PREFIX="repbase_hg38_index" # Build bowtie2 index for RepBase (if not already built) # bowtie2-build "${REPBASE_FASTA}" "${REPBASE_INDEX_PREFIX}" # Input trimmed reads (assuming paired-end, adjust for single-end if necessary) TRIMMED_R1="trimmed_reads_R1.fastq.gz" TRIMMED_R2="trimmed_reads_R2.fastq.gz" # Output reads that did NOT map to RepBase # bowtie2 --un-conc option outputs files with .1 and .2 suffixes OUTPUT_NON_REP_PREFIX="non_rep_reads" # Map trimmed reads against RepBase and output reads that did not map # --very-sensitive-local is often used for repeat filtering to ensure good mapping # --un-conc outputs reads where neither mate mapped to the reference # -S /dev/null discards the SAM output as we only care about the --un-conc output bowtie2 \ --very-sensitive-local \ -x "${REPBASE_INDEX_PREFIX}" \ -1 "${TRIMMED_R1}" \ -2 "${TRIMMED_R2}" \ --un-conc "${OUTPUT_NON_REP_PREFIX}" \ -S /dev/null # The output files will be named non_rep_reads.1 and non_rep_reads.2 (or .1.fastq and .2.fastq depending on bowtie2 version/config) # You might need to rename them if specific .fastq.gz extensions are required for downstream tools: # mv "${OUTPUT_NON_REP_PREFIX}.1" "${OUTPUT_NON_REP_PREFIX}_R1.fastq" # mv "${OUTPUT_NON_REP_PREFIX}.2" "${OUTPUT_NON_REP_PREFIX}_R2.fastq" # gzip "${OUTPUT_NON_REP_PREFIX}_R1.fastq" # gzip "${OUTPUT_NON_REP_PREFIX}_R2.fastq" -
4
Remaining reads were mapped to the appropriate genome build using STAR aligner. eCLIP peaks were called with CLIPPER and normalized to size-matched input sample
$ Bash example
# Define variables for paths and files GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder: Replace with actual path to STAR genome index GENOME_FASTA="/path/to/hg38.fa" # Placeholder: Replace with actual path to genome FASTA GTF_FILE="/path/to/gencode.vXX.annotation.gtf" # Placeholder: Replace with actual path to GTF annotation ECLIP_READS_FASTQ="eclip_reads.fastq.gz" # Placeholder: Replace with actual path to eCLIP input FASTQ file INPUT_READS_FASTQ="input_reads.fastq.gz" # Placeholder: Replace with actual path to size-matched input FASTQ file ECLIP_ALIGNED_BAM="eclip_aligned.bam" # Output BAM for eCLIP sample INPUT_ALIGNED_BAM="input_aligned.bam" # Output BAM for input sample STAR_ECLIP_OUTPUT_PREFIX="star_eclip_output/" STAR_INPUT_OUTPUT_PREFIX="star_input_output/" CLIPPER_OUTPUT_PREFIX="clipper_peaks" NUM_THREADS=8 # Adjust based on available resources # --- STAR Alignment --- # Tool: STAR (Spliced Transcripts Alignment to a Reference) # Version: 2.7.0f (Inferred from yeolab/eclip CWL workflow) # Source: https://github.com/alexdobin/STAR # Installation (example using conda) # conda install -c bioconda star # Generate STAR genome index (run once per genome build) # STAR --runMode genomeGenerate \ # --genomeDir "${GENOME_DIR}" \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GTF_FILE}" \ # --runThreadN "${NUM_THREADS}" # Map eCLIP reads to the appropriate genome build STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${ECLIP_READS_FASTQ}" \ --readFilesCommand zcat \ --outFileNamePrefix "${STAR_ECLIP_OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterMismatchNoverLmax 0.05 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --runThreadN "${NUM_THREADS}" # Rename STAR output BAM for eCLIP sample mv "${STAR_ECLIP_OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${ECLIP_ALIGNED_BAM}" # Map size-matched input reads to the appropriate genome build STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_READS_FASTQ}" \ --readFilesCommand zcat \ --outFileNamePrefix "${STAR_INPUT_OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterMismatchNoverLmax 0.05 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --runThreadN "${NUM_THREADS}" # Rename STAR output BAM for input sample mv "${STAR_INPUT_OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" "${INPUT_ALIGNED_BAM}" # --- CLIPPER Peak Calling --- # Tool: CLIPPER # Version: 0.0.1 (Inferred from yeolab/eclip CWL workflow) # Source: https://github.com/yeolab/clipper # Installation (example using pip) # pip install clipper # or (example using conda) # conda install -c bioconda clipper # Call eCLIP peaks and normalize to size-matched input sample clipper -e "${ECLIP_ALIGNED_BAM}" \ -c "${INPUT_ALIGNED_BAM}" \ -o "${CLIPPER_OUTPUT_PREFIX}" \ -s hg38 \ -p 0.01 \ -f 0.05 \ --bonferroni \ --threshold-method pvalue \ --save-bed \ --save-wig \ --save-bigwig
Raw Source Text
Library strategy: eCLIP-Seq For eCLIP-seq, and RNA-seq, raw sequencing reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner. eCLIP peaks were called with CLIPPER and normalized to size-matched input sample Genome_build: hg19, gencode v19 Supplementary_files_format_and_content: eCLIP input normalized and filtered peaks in BED format. Peaks files (bigwigs) for eCLIP. Featurecounts file for RNA-seq samples.