GSE50178 Processing Pipeline


Publication

ePRINT: exonuclease assisted mapping of protein-RNA interactions.

Genome Biology (2024), PMID 38807229

Dataset

GSE50178

Identification of FUS RNA targets in HeLa cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: CLIP-Seq

    $ Bash example
    # Illustrative only: this runs the eCLIP CWL workflow, which takes raw CLIP-type sequencing reads through alignment, peak calling, and downstream analyses.
    # It is a generic alternative, not the BlastN/CisGenome procedure the submitters describe in the steps below.
    
    # Prerequisites:
    # 1. cwltool installed (e.g., pip install cwltool)
    # 2. The eclip CWL workflow repository cloned (e.g., git clone https://github.com/yeolab/eclip.git)
    # 3. An input YAML file (e.g., eclip_inputs.yaml) defining all necessary parameters
    #    and paths to input files (FASTQ, genome reference, etc.).
    #    Example placeholder for eclip_inputs.yaml (key names are illustrative; check the workflow's own input schema):
    #    fastq_replicate1: { class: File, path: "path/to/sample_rep1.fastq.gz" }
    #    fastq_replicate2: { class: File, path: "path/to/sample_rep2.fastq.gz" }
    #    fastq_control1: { class: File, path: "path/to/control_rep1.fastq.gz" }
    #    fastq_control2: { class: File, path: "path/to/control_rep2.fastq.gz" }
    #    genome_fasta: { class: File, path: "path/to/hg19.fa" } # Placeholder: this dataset reports Genome_build GRCh37 (hg19)
    #    genome_gtf: { class: File, path: "path/to/hg19.gtf" }
    #    rbp_name: "ExampleRBP"
    #    output_prefix: "ExampleRBP_eCLIP"
    #    # ... other parameters as required by eclip.cwl
    
    # Navigate to the cloned eclip workflow directory (if not already there)
    # cd eclip
    
    # Execute the eCLIP CWL workflow
    # Replace 'eclip_inputs.yaml' with the actual path to your input configuration file.
    cwltool --outdir ./eclip_output eclip.cwl eclip_inputs.yaml
  2. 2

    Illumina default workflow (Casava1.4 software) was used for sequence basecalling.

    Casava v1.4
    $ Bash example
    # Casava 1.4 is proprietary Illumina software that belongs to the instrument's own analysis pipeline.
    # It turns the sequencer's raw output into per-read sequence files for downstream analysis.
    # Users do not normally run it by hand as a standalone command-line tool.
    # The following is a conceptual representation of the basecalling step, not a real invocation.
    
    # Assuming BCL files are located in a run directory and output FASTQ files are desired.
    # The actual execution is handled by the Illumina instrument control software.
    # NOTE: run_illumina_casava_basecalling and its options are hypothetical placeholders, not a real executable.
    run_illumina_casava_basecalling \
      --version "1.4" \
      --input_bcl_directory "/path/to/illumina_run/Data/Intensities/BaseCalls" \
      --output_fastq_directory "/path/to/output/fastqs" \
      --workflow "default"
  3. 3

    CLIP-seq reads were trimmed to remove adapter sequences.

    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt=1.18
    
    # Define input and output files, and the adapter sequence
    INPUT_READS="input_reads.fastq.gz"
    OUTPUT_READS="trimmed_reads.fastq.gz"
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example adapter sequence, replace with actual adapter
    
    # Trim adapter sequences from CLIP-seq reads
    cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_READS}" "${INPUT_READS}"
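
To make the trimming step concrete, the following is a minimal sketch of 3' adapter removal in plain awk. It is a simplified stand-in, not cutadapt's algorithm: it only finds exact matches (no mismatches or indels), and the function name `trim_adapter` and the minimum overlap of 3 bases are assumptions chosen for illustration.

```shell
# trim_adapter ADAPTER < in.fastq > out.fastq
# Simplified, exact-match adapter trimming for 4-line FASTQ records.
trim_adapter() {
  awk -v adapter="$1" '
    NR % 4 == 2 {                         # sequence line of each record
      len = length($0)
      keep = len                          # default: keep the whole read
      pos = index($0, adapter)            # full adapter anywhere in the read
      if (pos > 0) {
        keep = pos - 1
      } else {
        # adapter prefix running off the end of the read (at least 3 bases)
        for (n = length(adapter) - 1; n >= 3; n--) {
          if (n <= len && substr($0, len - n + 1) == substr(adapter, 1, n)) {
            keep = len - n
            break
          }
        }
      }
      print substr($0, 1, keep)
      next
    }
    NR % 4 == 0 { print substr($0, 1, keep); next }   # clip qualities to the same length
    { print }                                         # header and "+" lines pass through
  '
}

# Example call (file names are placeholders):
# trim_adapter AGATCGGAAGAGC < input_reads.fastq > trimmed_reads.fastq
```

For real data a dedicated trimmer remains the better choice, because sequencing errors inside the adapter would defeat the exact match used here.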
  4. 4

    Reads with identical sequence were collapsed into a single read.

    fastx_collapser v0.0.14 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install FASTX-Toolkit (if not already installed)
    # conda install -c bioconda fastx_toolkit
    
    # Define input and output file paths
    input_fastq="input_reads.fastq" # Placeholder for your input FASTQ file
    output_fasta="collapsed_reads.fasta" # Placeholder for your output FASTA file
    
    # Collapse identical reads into a single read
    # -v: verbose output
    # -i: input file
    # -o: output file
    fastx_collapser -v -i "${input_fastq}" -o "${output_fasta}"
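
What collapsing does can be sketched with standard Unix tools alone. The function below is an illustration under stated assumptions, not the submitters' code: the name `collapse_reads` is hypothetical, the input is assumed to be uncompressed 4-line FASTQ, and the `>rank-count` header layout is modeled on fastx_collapser's output.

```shell
# collapse_reads < in.fastq > out.fasta
# Collapse identical sequences into one FASTA record each, most abundant first.
collapse_reads() {
  awk 'NR % 4 == 2' |                    # keep only the sequence line of each record
    LC_ALL=C sort |                      # group identical sequences together
    uniq -c |                            # count each distinct sequence
    LC_ALL=C sort -k1,1nr -k2,2 |        # highest count first, ties ordered by sequence
    awk '{ printf ">%d-%d\n%s\n", NR, $1, $2 }'   # header: rank, then copy number
}

# Example call (file names are placeholders):
# collapse_reads < trimmed_reads.fastq > collapsed_reads.fasta
```

Keeping the copy number in the header preserves the information that collapsing would otherwise discard, which matters if read counts are used later.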
  5. 5

    Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.

    BLAST v2.10.1
    $ Bash example
    # Install BLAST+ (if not already installed)
    # conda install -c bioconda blast
    
    # Download GRCh37 genome assembly (Ensembl release 75) as a placeholder
    # wget -O GRCh37.fa.gz "http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz"
    # gunzip GRCh37.fa.gz
    
    # Create a BLAST database for GRCh37
    # makeblastdb -in GRCh37.fa -dbtype nucl -out GRCh37_blastdb
    
    # Align reads to GRCh37 using blastn with default parameters
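    # blastn expects FASTA queries, so use the collapsed FASTA from the previous step rather than a FASTQ file.
    # 'GRCh37_blastdb' is the pre-built database from the makeblastdb command above.
    blastn -query collapsed_reads.fasta -db GRCh37_blastdb -out blastn_alignment.txt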
  6. 6

    Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.

    CisGenome V2
    $ Bash example
    # CisGenome is distributed as compiled C programs (command-line tools plus a Windows GUI), not as a Perl script.
    # The download URL and archive name below are unverified placeholders; check the CisGenome website for the real ones.
    # wget https://sourceforge.net/projects/cisgenome/files/CisGenome_V2.0/CisGenome_V2.0.tar.gz
    # tar -xzf CisGenome_V2.0.tar.gz
    # cd CisGenome_V2.0
    # export PATH=$(pwd):$PATH
    
    # Placeholder for the aligned reads in CisGenome's binary .bar format.
    # The BlastN hits from the previous step would first have to be converted; that conversion is not shown here.
    INPUT_BAR="treatment.bar"
    
    # Placeholders for the output folder and the output file title
    OUTPUT_DIR="."
    OUTPUT_PREFIX="cisgenome_peaks"
    
    # Run one-sample peak detection with the parameters reported for this dataset.
    # ASSUMPTION: the program name (hts_peakdetector) and the flag letters are given as best recalled from the CisGenome manual;
    # confirm them against the usage message of your CisGenome build before running.
    #   -w window size (bp), -s step size (bp), -c read-count cutoff, -g max gap (bp), -l min peak length (bp)
    hts_peakdetector \
      -i "${INPUT_BAR}" \
      -d "${OUTPUT_DIR}" \
      -o "${OUTPUT_PREFIX}" \
      -w 150 \
      -s 25 \
      -c 12 \
      -g 50 \
      -l 100
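
The role of the five reported parameters can be illustrated with a small sliding-window sketch. This is not CisGenome's algorithm. It assumes reads arrive as tab-delimited BED lines, counts each read by its start coordinate, places windows at multiples of the step size, and uses a hypothetical function name, `call_peaks`.

```shell
# call_peaks W C S GAP MINLEN < reads.bed > peaks.bed
# W = window size, C = read cutoff, S = step size, GAP = max gap, MINLEN = min peak length
call_peaks() {
  awk -v W="$1" -v C="$2" -v S="$3" 'BEGIN { FS = OFS = "\t" }
    {
      p = $2                               # 0-based read start
      first = int((p - W + S) / S)         # first window [k*S, k*S+W) that contains p
      if (first < 0) first = 0
      last = int(p / S)                    # last window that contains p
      for (k = first; k <= last; k++) n[$1 SUBSEP k]++
    }
    END {
      for (key in n) if (n[key] >= C) {    # keep windows that reach the cutoff
        split(key, a, SUBSEP)
        print a[1], a[2] * S, a[2] * S + W
      }
    }' |
    LC_ALL=C sort -k1,1 -k2,2n |           # order windows by chromosome, then start
    awk -v G="$4" -v L="$5" 'BEGIN { FS = OFS = "\t" }
      function emit() { if (chr != "" && end - start >= L) print chr, start, end }
      $1 != chr || $2 - end > G { emit(); chr = $1; start = $2; end = $3; next }
      $3 > end { end = $3 }                # extend the current region
      END { emit() }'
}

# Example call with the parameters reported for this dataset (file names are placeholders):
# call_peaks 150 12 25 50 100 < reads.bed > peaks.bed
```

In this simplified version every passing window is already 150 bp wide, so the 100 bp minimum length never removes a region; it would only matter for windows shorter than the minimum.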
    

Raw Source Text
Library strategy: CLIP-Seq
Illumina default workflow (Casava1.4 software) was used for sequence basecalling.
CLIP-seq reads were trimmed to remove adapter sequences. Reads with identical sequence were collapsed into a single read. Reads were aligned to the GRCh37 genome assembly using BlastN, with default parameters.
Peaks were determined using CisGenome V2 with the following parameters: Window Size W=150 bp; Cutoff C>=12 reads; Step Size S=25 bp; Max Gap=50 bp; Min Peak Length=100 bp.
Genome_build: GRCh37
Supplementary_files_format_and_content: tab-delimited BED files include chr name, chr start, chr end, name of the BED line, score (set 1 to all), and strand.
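
The six-column BED layout described above can be illustrated by converting tabular BLAST hits into it. This is a sketch under assumptions, not the submitters' procedure: it presumes the alignment was written with `-outfmt 6` (the standard 12-column tabular format, which is not blastn's default output), and the function name `blast6_to_bed` is hypothetical.

```shell
# blast6_to_bed < hits.tsv > hits.bed
# Columns used from BLAST tabular output: 1 = query id, 2 = subject id, 9 = sstart, 10 = send
blast6_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
    /^#/ { next }                                    # skip comment lines, if any
    {
      # BLAST coordinates are 1-based; minus-strand hits have sstart > send
      if ($9 <= $10) { start = $9 - 1;  end = $10; strand = "+" }
      else           { start = $10 - 1; end = $9;  strand = "-" }
      print $2, start, end, $1, 1, strand            # score is set to 1 for every line
    }'
}

# Example call (file names are placeholders):
# blast6_to_bed < blastn_alignment.tsv > hits.bed
```

Subtracting 1 from the start converts BLAST's 1-based inclusive coordinates to BED's 0-based, half-open convention.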