GSE50178 Processing Pipeline


Publication

ePRINT: exonuclease assisted mapping of protein-RNA interactions.

Genome biology (2024) — PMID 38807229

Dataset

Identification of FUS RNA targets in HeLa cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Library strategy: CLIP-Seq

CLIP-seq v1.0.0 GitHub

$ Bash example

# This script demonstrates how to execute the eCLIP CWL workflow, which processes raw sequencing data from CLIP-Seq experiments through alignment, peak calling, and other downstream analyses.

# Prerequisites:
# 1. cwltool installed (e.g., pip install cwltool)
# 2. The eclip CWL workflow repository cloned (e.g., git clone https://github.com/yeolab/eclip.git)
# 3. An input YAML file (e.g., eclip_inputs.yaml) defining all necessary parameters
#    and paths to input files (FASTQ, genome reference, etc.).
#    Example placeholder for eclip_inputs.yaml:
#    fastq_replicate1: { class: File, path: "path/to/sample_rep1.fastq.gz" }
#    fastq_replicate2: { class: File, path: "path/to/sample_rep2.fastq.gz" }
#    fastq_control1: { class: File, path: "path/to/control_rep1.fastq.gz" }
#    fastq_control2: { class: File, path: "path/to/control_rep2.fastq.gz" }
#    genome_fasta: { class: File, path: "path/to/GRCh37.fa" } # Placeholder: this dataset reports Genome_build GRCh37
#    genome_gtf: { class: File, path: "path/to/GRCh37.gtf" }
#    rbp_name: "ExampleRBP"
#    output_prefix: "ExampleRBP_eCLIP"
#    # ... other parameters as required by eclip.cwl

# Navigate to the cloned eclip workflow directory (if not already there)
# cd eclip

# Execute the eCLIP CWL workflow
# Replace 'eclip_inputs.yaml' with the actual path to your input configuration file.
cwltool --outdir ./eclip_output eclip.cwl eclip_inputs.yaml


Illumina default workflow (Casava1.4 software) was used for sequence basecalling.

Casava v1.4 GitHub

$ Bash example

# Casava 1.4 is a proprietary software suite integrated with Illumina sequencers for basecalling and demultiplexing.
# It processes raw intensity data (BCL files) generated by the sequencer into FASTQ files.
# Direct command-line execution for Casava 1.4 as a standalone tool is not typically performed by users.
# The following is a conceptual representation of the basecalling step using Casava 1.4.

# Assuming BCL files are located in a run directory and output FASTQ files are desired.
# The actual execution is handled by the Illumina instrument control software.
# This placeholder command represents the basecalling process.
run_illumina_casava_basecalling \
  --version "1.4" \
  --input_bcl_directory "/path/to/illumina_run/Data/Intensities/BaseCalls" \
  --output_fastq_directory "/path/to/output/fastqs" \
  --workflow "default"


CLIP-seq reads were trimmed to remove adapter sequences.

cutadapt (inferred) v1.18 GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=1.18

# Define input and output files, and the adapter sequence
INPUT_READS="input_reads.fastq.gz"
OUTPUT_READS="trimmed_reads.fastq.gz"
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example adapter sequence, replace with actual adapter

# Trim adapter sequences from CLIP-seq reads
cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_READS}" "${INPUT_READS}"


Reads with identical sequence were collapsed into a single read.

fastx_collapser (Inferred with models/gemini-2.5-flash) v0.0.14 GitHub

$ Bash example

# Install FASTX-Toolkit (if not already installed)
# conda install -c bioconda fastx_toolkit

# Define input and output file paths
input_fastq="input_reads.fastq" # Placeholder for your input FASTQ file
output_fasta="collapsed_reads.fasta" # Placeholder for your output FASTA file

# Collapse identical reads into a single read
# -v: verbose output
# -i: input file
# -o: output file
fastx_collapser -v -i "${input_fastq}" -o "${output_fasta}"
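
To make the collapsing step concrete, the sketch below reproduces the idea with standard awk and sort on a three-read toy FASTQ. It is an illustration only, not a replacement for fastx_collapser: the file name demo_reads.fastq and its contents are hypothetical, and the ">rank-count" headers simply mirror the naming fastx_collapser gives collapsed reads.

```shell
set -eu

# Toy FASTQ (hypothetical): two identical reads and one unique read.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGA\n+\nIIII\n' > "${workdir}/demo_reads.fastq"

# Count each distinct sequence (line 2 of every 4-line record), rank by
# abundance, and emit FASTA with ">rank-count" headers.
collapsed=$(
  awk 'NR % 4 == 2 { count[$0]++ } END { for (s in count) print count[s] "\t" s }' \
    "${workdir}/demo_reads.fastq" \
  | sort -k1,1nr -k2,2 \
  | awk -F'\t' '{ printf(">%d-%d\n%s\n", NR, $1, $2) }'
)
printf '%s\n' "${collapsed}"
```

The count kept in each header records how many raw reads a collapsed sequence represents, which is worth carrying forward if read depth matters downstream.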


Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.

BLAST v2.10.1

$ Bash example

# Install BLAST+ (if not already installed)
# conda install -c bioconda blast

# Download GRCh37 genome assembly (Ensembl release 75) as a placeholder
# wget -O GRCh37.fa.gz "http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz"
# gunzip GRCh37.fa.gz

# Create a BLAST database for GRCh37
# makeblastdb -in GRCh37.fa -dbtype nucl -out GRCh37_blastdb

# Align reads to GRCh37 using blastn with default parameters
# blastn expects FASTA queries, so use the collapsed FASTA from the previous step rather than a FASTQ file;
# 'GRCh37_blastdb' is the pre-built database
blastn -query collapsed_reads.fasta -db GRCh37_blastdb -out blastn_alignment.txt
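
The default blastn output is a pairwise report, while the peak-calling step and the deposited files work with genomic intervals. As a rough sketch, assuming the search is rerun with `-outfmt 6` (tab-delimited columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), the hits could be reduced to one BED6 line per read. The demo hits and the "keep the highest-bitscore hit" rule are assumptions; the source does not say how hits were filtered.

```shell
set -eu

# Hypothetical hits in blastn tabular format (-outfmt 6).
hits=$(mktemp)
printf '1-2\tchr1\t100\t4\t0\t0\t1\t4\t101\t104\t1e-5\t50\n'  > "${hits}"
printf '1-2\tchr2\t75\t4\t1\t0\t1\t4\t500\t503\t0.1\t20\n'   >> "${hits}"
printf '2-1\tchr3\t100\t4\t0\t0\t1\t4\t210\t207\t1e-5\t50\n' >> "${hits}"

# Sort by read, then by bitscore (descending); keep the first hit per read.
# BLAST coordinates are 1-based and sstart > send marks a minus-strand hit;
# BED is 0-based, half-open. The score column is fixed at 1.
bed=$(
  sort -t "$(printf '\t')" -k1,1 -k12,12nr "${hits}" \
  | awk -F'\t' 'BEGIN { OFS = "\t" }
      !seen[$1]++ {
        if ($9 <= $10) { bed_start = $9 - 1;  bed_end = $10; strand = "+" }
        else           { bed_start = $10 - 1; bed_end = $9;  strand = "-" }
        print $2, bed_start, bed_end, $1, 1, strand
      }'
)
printf '%s\n' "${bed}"
```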

Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.

CisGenome V2 GitHub

$ Bash example

# CisGenome is distributed as a compiled software suite (a GUI plus command-line programs), not as a Perl script.
# Obtain it from the project's official download page and follow its installation notes.
# The archive name and URL below are unverified placeholders:
# wget https://sourceforge.net/projects/cisgenome/files/CisGenome_V2.0/CisGenome_V2.0.tar.gz
# tar -xzf CisGenome_V2.0.tar.gz
# cd CisGenome_V2.0
# export PATH="$(pwd):$PATH"

# Placeholder for the aligned-read input. The BlastN step above does not produce BAM,
# so hits must first be converted to a format CisGenome accepts (check its manual).
INPUT_ALIGNMENTS="treatment_alignments.txt"

# Placeholder for output peak file prefix
OUTPUT_PREFIX="cisgenome_peaks"

# Run CisGenome V2 peak detection
# Note: 'cisgenome.pl peak_detection' and the flag names below are hypothetical stand-ins,
# not a verified CisGenome invocation. Look up the actual peak-detection program and options
# in the CisGenome manual and map the reported parameters onto them:
# window size W=150, cutoff C>=12, step size S=25, max gap=50, min peak length=100.

cisgenome.pl peak_detection \
  -w 150 \
  -c 12 \
  -s 25 \
  -g 50 \
  -l 100 \
  "${INPUT_ALIGNMENTS}" \
  "${OUTPUT_PREFIX}"
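
To show what the window size, step size, and cutoff parameters mean in practice, here is a small awk sketch that counts read starts in sliding windows and keeps windows reaching the cutoff. It illustrates the parameters only and is not CisGenome's algorithm: the toy read positions are hypothetical, and the max-gap merging and minimum-peak-length filtering are left out.

```shell
set -eu

W=150   # window size in bp (source: W=150)
S=25    # step size in bp (source: S=25)
C=12    # minimum reads per window (source: C>=12)

# Toy read starts (hypothetical): 12 reads piled up at chr1:1000-1011, plus 1 stray read.
reads=$(mktemp)
for i in $(seq 0 11); do printf 'chr1\t%d\n' "$((1000 + i))"; done > "${reads}"
printf 'chr1\t5000\n' >> "${reads}"

windows=$(
  awk -F'\t' -v W="${W}" -v S="${S}" -v C="${C}" 'BEGIN { OFS = "\t" }
    {
      # Every window [ws, ws + W) with ws a multiple of S that contains this read.
      first = int(($2 - W) / S) * S + S
      if (first < 0) first = 0
      last = int($2 / S) * S
      for (ws = first; ws <= last; ws += S) count[$1 SUBSEP ws]++
    }
    END {
      for (key in count) if (count[key] >= C) {
        split(key, part, SUBSEP)
        print part[1], part[2], part[2] + W, count[key]
      }
    }' "${reads}" \
  | sort -k1,1 -k2,2n
)
printf '%s\n' "${windows}"
```

With this toy input, only the windows overlapping the pile-up pass the cutoff; the stray read at chr1:5000 never reaches 12 reads in any window.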


Tools Used

CLIP-seq BLAST

Raw Source Text

Library strategy: CLIP-Seq
Illumina default workflow (Casava1.4 software) was used for sequence basecalling.
CLIP-seq reads were trimmed to remove apator sequences. Reads with identical sequence were collapsed into a signle read. Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.
Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.
Genome_build: GRCh37
Supplementary_files_format_and_content: tab-delimited BED files include chr name, chr start, chr end, name of the BED line, score (set 1 to all), and strand.
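
The BED layout described above can be checked with a short awk filter before any downstream use. This is a minimal sketch under stated assumptions: the two example lines and the peak names are hypothetical, and the checks (six tab-delimited fields, start below end, score equal to 1, strand of + or -) follow only what the description states.

```shell
set -eu

# Hypothetical BED6 lines in the layout described above.
bed_file=$(mktemp)
printf 'chr1\t875\t1150\tpeak_1\t1\t+\n'  > "${bed_file}"
printf 'chr2\t300\t450\tpeak_2\t1\t-\n'  >> "${bed_file}"

# Count lines that break the described layout.
bad_lines=$(
  awk -F'\t' 'NF != 6 || $2 + 0 >= $3 + 0 || $5 != 1 || ($6 != "+" && $6 != "-") { n++ }
              END { print n + 0 }' "${bed_file}"
)
printf 'lines violating the layout: %s\n' "${bad_lines}"
```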
