GSE50178 Processing Pipeline
6 steps
Publication
ePRINT: exonuclease assisted mapping of protein-RNA interactions. Genome Biology (2024), PMID 38807229
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Library strategy: CLIP-Seq
$ Bash example
# Illustrative only: the source states just the library strategy (CLIP-Seq); it does not say that the eCLIP CWL workflow was used.
# This script shows how an eCLIP CWL workflow could be run to take raw CLIP-Seq reads through alignment, peak calling, and downstream analyses.
#
# Prerequisites:
#   1. cwltool installed (e.g., pip install cwltool)
#   2. The eclip CWL workflow repository cloned (e.g., git clone https://github.com/yeolab/eclip.git)
#   3. An input YAML file (e.g., eclip_inputs.yaml) defining all necessary parameters
#      and paths to input files (FASTQ, genome reference, etc.)
#
# Example placeholder for eclip_inputs.yaml (key names are hypothetical; use the ones the workflow requires):
#   fastq_replicate1: { class: File, path: "path/to/sample_rep1.fastq.gz" }
#   fastq_replicate2: { class: File, path: "path/to/sample_rep2.fastq.gz" }
#   fastq_control1:   { class: File, path: "path/to/control_rep1.fastq.gz" }
#   fastq_control2:   { class: File, path: "path/to/control_rep2.fastq.gz" }
#   genome_fasta:     { class: File, path: "path/to/GRCh37.fa" }   # Placeholder: this dataset reports GRCh37
#   genome_gtf:       { class: File, path: "path/to/GRCh37.gtf" }
#   rbp_name: "ExampleRBP"
#   output_prefix: "ExampleRBP_eCLIP"
#   # ... other parameters as required by the workflow

# Navigate to the cloned eclip workflow directory (if not already there)
# cd eclip

# Execute the workflow.
# Replace 'eclip.cwl' and 'eclip_inputs.yaml' with the actual workflow file and input configuration file.
cwltool --outdir ./eclip_output eclip.cwl eclip_inputs.yaml
2
Illumina default workflow (Casava1.4 software) was used for sequence basecalling.
$ Bash example
# Casava 1.4 is a proprietary software suite integrated with Illumina sequencers for basecalling and demultiplexing.
# It processes raw intensity data (BCL files) generated by the sequencer into FASTQ files.
# Direct command-line execution for Casava 1.4 as a standalone tool is not typically performed by users.
# The actual execution is handled by the Illumina instrument control software.
#
# The following is a conceptual representation of the basecalling step using Casava 1.4,
# assuming BCL files are located in a run directory and output FASTQ files are desired.
# 'run_illumina_casava_basecalling' is a placeholder name, not a real executable.
run_illumina_casava_basecalling \
  --version "1.4" \
  --input_bcl_directory "/path/to/illumina_run/Data/Intensities/BaseCalls" \
  --output_fastq_directory "/path/to/output/fastqs" \
  --workflow "default"
3
CLIP-seq reads were trimmed to remove adapter sequences.
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=1.18

# Define input and output files, and the adapter sequence
INPUT_READS="input_reads.fastq.gz"
OUTPUT_READS="trimmed_reads.fastq.gz"
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"   # Example adapter sequence, replace with actual adapter

# Trim adapter sequences from CLIP-seq reads
cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_READS}" "${INPUT_READS}"
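To make the trimming step concrete without depending on any particular tool, here is a minimal awk sketch of exact-match 3' adapter removal. It is not cutadapt's algorithm (it tolerates no mismatches or indels), and the source does not say which trimmer was used. The function name `trim_adapter`, the default minimum overlap of 3 bases, and the adapter in the demonstration are all assumptions made for illustration. Records are assumed to be plain four-line FASTQ.

```shell
# trim_adapter ADAPTER [MIN_OVERLAP] < reads.fastq > trimmed.fastq
# Hypothetical helper: removes the adapter (or a prefix of it at the read end)
# and everything after it, trimming the quality string to the same length.
trim_adapter() {
  awk -v A="$1" -v MIN="${2:-3}" '
    NR % 4 == 2 {
      len = length($0)
      cut = index($0, A)                  # full adapter found inside the read
      if (cut == 0) {
        # otherwise look for the longest adapter prefix ending the read
        for (n = length(A) - 1; n >= MIN; n--) {
          if (n <= len && substr($0, len - n + 1) == substr(A, 1, n)) {
            cut = len - n + 1
            break
          }
        }
      }
      keep = (cut > 0) ? cut - 1 : len    # number of bases to retain
      print substr($0, 1, keep)
      next
    }
    NR % 4 == 0 { print substr($0, 1, keep); next }   # quality line
    { print }                                          # header and "+" lines
  '
}

# Demonstration on two made-up reads: one with the full adapter, one with a partial adapter
printf '@r1\nACGTACGTAGATCGGAAGTT\n+\nIIIIIIIIIIIIIIIIIIII\n@r2\nACGTACGTAGATC\n+\nIIIIIIIIIIIII\n' |
  trim_adapter AGATCGGAAG
```

Both demonstration reads are cut back to their first eight bases. A real trimmer would also handle sequencing errors in the adapter and discard reads that become too short.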
4
Reads with identical sequence were collapsed into a single read.
$ Bash example
# Install FASTX-Toolkit (if not already installed)
# conda install -c bioconda fastx_toolkit

# Define input and output file paths
input_fastq="input_reads.fastq"        # Placeholder for your input FASTQ file (uncompressed)
output_fasta="collapsed_reads.fasta"   # Placeholder for your output FASTA file

# Collapse identical reads into a single read
#   -v: verbose output
#   -i: input file
#   -o: output file
# Note: for Phred+33 encoded FASTQ, FASTX-Toolkit tools may additionally need -Q33.
fastx_collapser -v -i "${input_fastq}" -o "${output_fasta}"
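What "collapsing" means can be sketched with standard Unix tools: count each distinct sequence and emit one FASTA record per sequence. The sketch below assumes plain four-line FASTQ records and names each record `rank-count`, a naming scheme chosen here to resemble collapsed-read identifiers. The function name `collapse_reads` is hypothetical, and the source does not say which tool the authors used.

```shell
# collapse_reads < reads.fastq > collapsed.fasta
# Hypothetical helper: one FASTA record per distinct read sequence,
# most abundant first, named ">rank-count".
collapse_reads() {
  awk 'NR % 4 == 2 { count[$0]++ }                 # sequence lines only
       END { for (s in count) print count[s], s }' |
    LC_ALL=C sort -k1,1nr -k2,2 |                  # by count (descending), then sequence
    awk '{ printf(">%d-%d\n%s\n", NR, $1, $2) }'
}

# Demonstration: three made-up reads, two of them identical
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nACGT\n+\nIIII\n' | collapse_reads
```

Because quality values are discarded, the output is FASTA, which is also the query format that BLAST expects in the next step.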
5
Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.
BLAST v2.10.1
$ Bash example
# Install BLAST+ (if not already installed)
# conda install -c bioconda blast

# Download GRCh37 genome assembly (Ensembl release 75) as a placeholder
# wget -O GRCh37.fa.gz "http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz"
# gunzip GRCh37.fa.gz

# Create a BLAST database for GRCh37
# makeblastdb -in GRCh37.fa -dbtype nucl -out GRCh37_blastdb

# Align reads to GRCh37 using blastn with default parameters.
# blastn expects FASTA queries, not FASTQ: 'collapsed_reads.fasta' is assumed to be the FASTA output
# of the read-collapsing step, and 'GRCh37_blastdb' the pre-built database.
blastn -query collapsed_reads.fasta -db GRCh37_blastdb -out blastn_alignment.txt
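The supplementary files are described as six-column BED (chromosome, start, end, name, score set to 1, strand). As a hedged sketch of how alignments could be turned into that layout, the function below converts BLAST tabular output to BED. It assumes the hits were written with `-outfmt 6`, which changes only the report format and is not mentioned in the source. The function name `blast6_to_bed` is hypothetical. In this format a minus-strand nucleotide hit has a subject start greater than its subject end, and BED start coordinates are 0-based.

```shell
# blast6_to_bed < hits.tsv > hits.bed
# Hypothetical helper. Input columns (BLAST -outfmt 6):
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
blast6_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ && NF >= 12 {
         s = $9 + 0; e = $10 + 0
         if (s <= e) { start = s - 1; stop = e; strand = "+" }
         else        { start = e - 1; stop = s; strand = "-" }
         print $2, start, stop, $1, 1, strand      # score fixed at 1, as in the source
       }'
}

# Demonstration with two made-up hits, one per strand
printf '1-2\tchr1\t100.000\t30\t0\t0\t1\t30\t1001\t1030\t1e-10\t56.5\n2-1\tchr2\t100.000\t25\t0\t0\t1\t25\t525\t501\t2e-08\t47.3\n' |
  blast6_to_bed
```

Reads with several hits produce several BED lines here. Whether and how the authors filtered multi-mapping reads is not stated in the source.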
6
Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.
$ Bash example
# CisGenome V2 must be obtained and installed from its authors' distribution.
# The download location below is an unverified placeholder; check the CisGenome project site for the actual source.
# For example:
# wget https://sourceforge.net/projects/cisgenome/files/CisGenome_V2.0/CisGenome_V2.0.tar.gz
# tar -xzf CisGenome_V2.0.tar.gz
# cd CisGenome_V2.0
# export PATH=$(pwd):$PATH

# Placeholder for input alignment file (e.g., from the alignment step)
INPUT_BAM="treatment.bam"
# Placeholder for output peak file prefix
OUTPUT_PREFIX="cisgenome_peaks"

# Run CisGenome V2 peak detection.
# Note: 'cisgenome.pl peak_detection' and the option letters below are placeholders that mirror the
# reported parameters (W=150, C>=12, S=25, Max Gap=50, Min Peak Length=100); they are not confirmed
# CisGenome command-line syntax. Consult the CisGenome manual for the actual program name, options,
# and accepted input formats (a conversion from BAM or BED may be required).
cisgenome.pl peak_detection \
  -w 150 \
  -c 12 \
  -s 25 \
  -g 50 \
  -l 100 \
  "${INPUT_BAM}" \
  "${OUTPUT_PREFIX}"
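To show how the five reported parameters interact, the following is a simplified sliding-window sketch. It is not CisGenome's algorithm. Windows of width W start every S bp, a window is kept when it contains at least C read start positions, kept windows separated by at most the maximum gap are merged, and merged regions shorter than the minimum length are dropped. Counting reads by their BED start coordinate and the function name `call_peaks` are assumptions made for illustration. With W=150 every kept window already exceeds the 100 bp minimum, so in this sketch the length filter only matters when W is smaller than the minimum peak length.

```shell
# call_peaks WINDOW CUTOFF STEP MAXGAP MINLEN < reads.bed > peaks.bed
# Hypothetical helper implementing a simplified window scan (not CisGenome).
call_peaks() {
  awk -v W="$1" -v C="$2" -v S="$3" 'BEGIN { FS = OFS = "\t" }
    {
      p = $2 + 0                                  # read start position
      lo = p - W + 1
      kmin = (lo <= 0) ? 0 : int((lo + S - 1) / S)
      kmax = int(p / S)
      # every window [k*S, k*S + W) that contains p gets one count
      for (k = kmin; k <= kmax; k++) n[$1 "\t" (k * S)]++
    }
    END {
      for (key in n) if (n[key] >= C) {
        split(key, a, "\t")
        print a[1], a[2], a[2] + W                # significant window as BED
      }
    }' |
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n |
  awk -v G="$4" -v L="$5" 'BEGIN { FS = OFS = "\t" }
    function emit() { if (chrom != "" && pk_end - pk_start >= L) print chrom, pk_start, pk_end }
    $1 != chrom || ($2 - pk_end) > G { emit(); chrom = $1; pk_start = $2 + 0; pk_end = $3 + 0; next }
    ($3 + 0) > pk_end { pk_end = $3 + 0 }
    END { emit() }'
}

# Demonstration: 12 made-up reads starting at chr1:1000 and 5 at chr1:5000,
# scanned with the parameters reported in the source
{
  for i in 1 2 3 4 5 6 7 8 9 10 11 12; do printf 'chr1\t1000\t1030\tread%s\t1\t+\n' "$i"; done
  for i in 1 2 3 4 5; do printf 'chr1\t5000\t5030\tlow%s\t1\t+\n' "$i"; done
} | call_peaks 150 12 25 50 100
```

In the demonstration only the cluster of 12 reads reaches the cutoff; the overlapping windows that contain it are merged into a single region, while the 5-read cluster is ignored.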
Raw Source Text
Library strategy: CLIP-Seq
Illumina default workflow (Casava1.4 software) was used for sequence basecalling.
CLIP-seq reads were trimmed to remove apator sequences.
Reads with identical sequence were collapsed into a signle read.
Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.
Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.
Genome_build: GRCh37
Supplementary_files_format_and_content: tab-delimited BED files include chr name, chr start, chr end, name of the BED line, score (set 1 to all), and strand.
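When working with the deposited BED files, a quick structural check against the description above can catch truncated or reformatted downloads. The sketch below assumes exactly six tab-delimited columns, integer coordinates with start < end, a score of 1 on every line, and a strand of + or -. The function name `validate_bed` is hypothetical, and the checks are inferred from the description rather than taken from any validator supplied with the dataset.

```shell
# validate_bed < peaks.bed
# Hypothetical helper: exits non-zero and reports offending lines on stderr
# if any record departs from the six-column layout described above.
validate_bed() {
  awk 'BEGIN { FS = "\t"; bad = 0 }
       NF != 6 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($2 + 0) >= ($3 + 0) ||
       $5 != 1 || ($6 != "+" && $6 != "-") {
         printf("line %d: unexpected BED record: %s\n", NR, $0) > "/dev/stderr"
         bad = 1
       }
       END { exit bad }'
}

# Demonstration on a made-up, well-formed record
printf 'chr1\t875\t1150\tpeak1\t1\t+\n' | validate_bed && echo "BED layout OK"
```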