GSE86040 Processing Pipeline
Publication
Protein-RNA Networks Regulated by Normal and ALS-Associated Mutant HNRNPA2B1 in the Nervous System. Neuron (2016), PMID 27773581
Dataset
GSE86040: HNRNPA2B1 regulates alternative RNA processing in the nervous system and accumulates in granules in ALS IPSC-derived motor neurons [hnRNPA2B1_eCLIP_h…
Processing Steps
1
Reads were demultiplexed using custom scripts and the randomer was appended to the read name.
$ Bash example
# This example assumes a custom Python script 'eclip_demultiplex_and_randomer_append.py'
# that demultiplexes reads by barcode and appends the randomer to the read name.
# The script name and its flags are hypothetical; the source only says "custom scripts".
# The randomer length below (6) is a placeholder; set it to the length used in the library prep.

# Installation (example for a Python script with a Biopython dependency):
# conda create -n eclip_demux python=3.8
# conda activate eclip_demux
# pip install biopython

# Create a dummy barcode file (replace with actual barcodes and sample names)
# echo -e "AAAA\tsample1\nTTTT\tsample2" > barcodes.tsv

# Execute the custom script
python eclip_demultiplex_and_randomer_append.py \
  --input_fastq "raw_reads.fastq.gz" \
  --barcode_file "barcodes.tsv" \
  --randomer_length 6 \
  --output_dir "demultiplexed_reads"
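Because the demultiplexing scripts themselves are not published with the record, the following is only a minimal Python sketch of what such a step might do for a single read. It assumes the read starts with an inline barcode immediately followed by the randomer, that barcodes are matched exactly, and that the randomer is appended to the read ID with a colon. The function name `demultiplex_record`, the read layout, and the naming convention are all illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Optional, Tuple


def demultiplex_record(
    name: str,
    seq: str,
    qual: str,
    barcodes: Dict[str, str],
    randomer_len: int,
) -> Optional[Tuple[str, str, str, str]]:
    """Assign one FASTQ record to a sample and move its randomer into the read name.

    Assumed layout (hypothetical): <barcode><randomer><insert>.
    Returns (sample, new_name, trimmed_seq, trimmed_qual), or None when no
    barcode matches or the read is too short to contain the randomer.
    """
    for barcode, sample in barcodes.items():
        if not seq.startswith(barcode):
            continue
        start = len(barcode)
        end = start + randomer_len
        if len(seq) < end:
            return None
        randomer = seq[start:end]
        # Append the randomer to the read ID (the part before the first space),
        # so it survives aligners that drop the FASTQ comment field.
        read_id, sep, comment = name.partition(" ")
        new_name = f"{read_id}:{randomer}{sep}{comment}"
        # Strip barcode and randomer from both sequence and qualities.
        return sample, new_name, seq[end:], qual[end:]
    return None
```

Keeping the randomer in the read ID lets the later duplicate-removal step recover it from the aligned BAM without a side file.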
2
Reads were trimmed, filtered for repetitive elements, and mapped to human genome assembly hg19 as described in iCLIP computational analysis.
$ Bash example
# Input FASTQ file
READS="input.fastq.gz"
OUTPUT_PREFIX="sample"

# Reference paths (placeholders - replace with actual paths)
# Download hg19 genome and annotation files from UCSC or Ensembl.
# Build STAR index for hg19:
# STAR --runThreadN <threads> --runMode genomeGenerate --genomeDir /path/to/hg19_STAR_index --genomeFastaFiles /path/to/hg19.fa --sjdbGTFfile /path/to/hg19.gtf
HG19_STAR_INDEX="/path/to/hg19_STAR_index"

# Create a repeatome FASTA file (e.g., combining rRNA, tRNA, snRNA, snoRNA sequences)
# and build a STAR index for it.
# Example repeatome FASTA: hg19_rRNA_tRNA_snRNA_snoRNA_ensembl_ERCC.fa
# STAR --runThreadN <threads> --runMode genomeGenerate --genomeDir /path/to/repeatome_STAR_index --genomeFastaFiles /path/to/repeatome.fa
REPEATOME_STAR_INDEX="/path/to/repeatome_STAR_index"

# 1. Trimming with cutadapt
# Common 3' adapter for iCLIP/eCLIP. Adjust the adapter sequence as needed.
# -a: 3' adapter sequence
# -q 20: Trim low-quality ends (Phred score < 20)
# -m 15: Discard reads shorter than 15 bp after trimming
# conda install -c bioconda cutadapt
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -q 20 -m 15 \
  -o "${OUTPUT_PREFIX}_trimmed.fastq.gz" "$READS"

# 2. Filtering for repetitive elements (e.g., rRNA, tRNA, snRNA, snoRNA) using STAR
# Align reads to a repeatome index and keep the unmapped reads.
# --outReadsUnmapped Fastx: Write unmapped reads to a FASTQ file.
# conda install -c bioconda star
STAR --runThreadN 8 \
  --genomeDir "$REPEATOME_STAR_INDEX" \
  --readFilesIn "${OUTPUT_PREFIX}_trimmed.fastq.gz" \
  --outFileNamePrefix "${OUTPUT_PREFIX}_repeatome_" \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 3 \
  --outFilterScoreMinOverLread 0 \
  --outFilterMatchNminOverLread 0 \
  --outFilterType BySJout \
  --outSAMattributes All \
  --outSAMtype BAM Unsorted \
  --outReadsUnmapped Fastx \
  --outStd Log \
  --readFilesCommand zcat

# The unmapped reads from the repeatome alignment are the filtered reads.
# STAR writes Unmapped.out.mate1 as uncompressed FASTQ, so compress it
# (a plain rename to .fastq.gz would break the zcat step below).
gzip -c "${OUTPUT_PREFIX}_repeatome_Unmapped.out.mate1" > "${OUTPUT_PREFIX}_filtered_for_repeats.fastq.gz"

# 3. Mapping to human genome assembly hg19 with STAR
# --outSAMtype BAM SortedByCoordinate: Output a coordinate-sorted BAM file.
STAR --runThreadN 8 \
  --genomeDir "$HG19_STAR_INDEX" \
  --readFilesIn "${OUTPUT_PREFIX}_filtered_for_repeats.fastq.gz" \
  --outFileNamePrefix "${OUTPUT_PREFIX}_hg19_" \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 3 \
  --outFilterScoreMinOverLread 0 \
  --outFilterMatchNminOverLread 0 \
  --outFilterType BySJout \
  --outSAMattributes All \
  --outSAMtype BAM SortedByCoordinate \
  --outReadsUnmapped Fastx \
  --outStd Log \
  --readFilesCommand zcat
3
PCR duplicated reads were removed based on the start positions of read1, read2, and the sequence of the randomer.
4
eCLIP peaks were identified using CLIPPER with parameters -s hg19 -o --bonferroni --superlocal --threshold-method binomial --save-pickle (Lovci et al., NSMB, 2013).
$ Bash example
# Install CLIPPER (example, adjust as needed), ideally in a dedicated conda environment:
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# pip install git+https://github.com/yeolab/clipper.git

# Input BAM, assumed to be PCR-deduplicated as described in step 3.
# File names are placeholders; replace with actual paths.
CLIP_BAM="clip_sample1.dedup.bam"
OUTPUT_BED="clip_sample1.peaks.bed"

# Run CLIPPER peak calling with the parameters from the description:
# -s hg19 -o --bonferroni --superlocal --threshold-method binomial --save-pickle
# The invocation follows the CLIPPER README usage (clipper -b BAM -s SPECIES -o OUT.bed).
# The -b flag and the `clipper` entry point are not stated in the source text,
# so confirm them with `clipper -h` for your installed version.
# The size-matched input is not passed to CLIPPER; it is used in the
# normalization step that follows.
clipper \
  -b "${CLIP_BAM}" \
  -s hg19 \
  -o "${OUTPUT_BED}" \
  --bonferroni \
  --superlocal \
  --threshold-method binomial \
  --save-pickle
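The duplicate-removal rule from step 3 (same read1 start, same read2 start, same randomer) has no example of its own, so here is a minimal Python sketch of it. It works on already-parsed read pairs instead of a BAM file to stay free of dependencies. The `ReadPair` record and the function name are hypothetical, and including chromosome and strand in the key is an assumption the source does not state explicitly.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class ReadPair(NamedTuple):
    """One aligned read pair; the randomer is assumed to be parsed from the read name."""
    name: str
    chrom: str
    strand: str
    read1_start: int
    read2_start: int
    randomer: str


def remove_pcr_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep the first pair seen for each (position, randomer) combination.

    Pairs sharing read1 start, read2 start, and randomer are treated as
    PCR duplicates. Chromosome and strand are added to the key as an
    assumption, since start positions alone are not unique genome-wide.
    """
    seen: Set[Tuple[str, str, int, int, str]] = set()
    kept: List[ReadPair] = []
    for pair in pairs:
        key = (pair.chrom, pair.strand, pair.read1_start, pair.read2_start, pair.randomer)
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept
```

Two pairs at identical coordinates but with different randomers are both kept, which is the point of the randomer: it separates independent molecules from PCR copies.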
5
Peak strength was then normalized against a size matched input by calculating fold enrichment of number of reads in IP versus number of reads in size matched input.
normalize_bedgraph.py (Inferred with models/gemini-2.5-flash), from yeolab/eclip workflow (CWL version, before 2021)
$ Bash example
# Assuming ip_signal.bedgraph and input_signal.bedgraph are generated from previous steps
# (e.g., from bedtools genomecov or similar signal generation tools, potentially after library size normalization).

# Clone the eclip repository to access the script if not already available.
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Execute the normalization script to calculate fold enrichment.
# NOTE: the script name was inferred, and the script path and flags below are unverified.
# Check the script's --help output before running.
# -i: Input IP bedGraph file containing read counts or signal.
# -c: Input control (size-matched input) bedGraph file containing read counts or signal.
# -o: Output bedGraph file for fold enrichment.
# --method fold_enrichment: Specifies the normalization method to calculate fold enrichment.
python scripts/normalize_bedgraph.py \
  -i ip_signal.bedgraph \
  -c input_signal.bedgraph \
  -o ip_fold_enrichment.bedgraph \
  --method fold_enrichment
6
Peaks were called significant if the number of reads in IP was greater than the number of reads in input and the peaks had a Bonferroni-corrected Fisher exact p-value of less than 0.05.
$ Bash example
# Install CLIPPER (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# Ensure Python dependencies are met, e.g., pysam, numpy
# pip install -r requirements.txt

# Call peaks on the IP BAM against hg19, the genome build used for this dataset.
# IP.bam and ip_peaks.bed are placeholders; replace with actual file paths.
# The -b flag follows the CLIPPER README usage; confirm with `clipper -h`.
# The IP-versus-input read comparison and the Bonferroni-corrected Fisher exact
# test (p < 0.05) are applied to the resulting peaks afterwards. The source does
# not name the script used for that filter, so no command is shown for it.
clipper \
  -b IP.bam \
  -s hg19 \
  -o ip_peaks.bed \
  --bonferroni
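To make the significance rule concrete, here is a minimal, dependency-free Python sketch. It builds a 2x2 table of reads inside versus outside the peak for IP and input, computes a one-sided Fisher exact p-value, applies a Bonferroni correction for the number of peaks tested, and requires more IP reads than input reads. The choice of a one-sided test, the use of library totals for the table margins, and all function names are assumptions; the source states only the two criteria.

```python
import math


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(
    ip_in_peak: int, input_in_peak: int, ip_total: int, input_total: int
) -> float:
    """One-sided Fisher exact p-value that IP is enriched in the peak.

    Table (assumed): rows are IP / input, columns are in-peak / elsewhere.
    Sums the hypergeometric upper tail from the observed IP count upward.
    """
    if not (0 <= ip_in_peak <= ip_total and 0 <= input_in_peak <= input_total):
        raise ValueError("in-peak counts must lie between 0 and the library total")
    in_peak = ip_in_peak + input_in_peak
    all_reads = ip_total + input_total
    log_denominator = _log_comb(all_reads, in_peak)
    p_value = 0.0
    for k in range(ip_in_peak, min(ip_total, in_peak) + 1):
        log_term = (
            _log_comb(ip_total, k)
            + _log_comb(input_total, in_peak - k)
            - log_denominator
        )
        p_value += math.exp(log_term)
    return min(1.0, p_value)


def is_significant_peak(
    ip_in_peak: int,
    input_in_peak: int,
    ip_total: int,
    input_total: int,
    n_tests: int,
    alpha: float = 0.05,
) -> bool:
    """Apply both criteria: IP count above input count, and corrected p < alpha."""
    if ip_in_peak <= input_in_peak:
        return False
    p_value = fisher_exact_greater(ip_in_peak, input_in_peak, ip_total, input_total)
    bonferroni_p = min(1.0, p_value * n_tests)
    return bonferroni_p < alpha
```

Working in log space keeps the binomial coefficients finite for libraries with millions of reads, and the loop only runs over the reads inside the peak, so it stays fast.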
Tools Used
Raw Source Text
Reads were demultiplexed using custom scripts and the randomer was appended to the read name. Reads were trimmed, filtered for repetitive elements, and mapped to human genome assembly hg19 as described in iCLIP computational analysis. PCR duplicated reads were removed based on the start positions of read1, read2, and the sequence of the randomer. eCLIP peaks were identified using CLIPPER with parameters -s hg19 -o --bonferroni --superlocal --threshold-method binomial --save-pickle (Lovci et al. NSMB, 2013). Peak strength was then normalized against a size matched input by calculating fold enrichment of number of reads in IP versus number of reads in size matched input. Peaks were called significant if the number of reads in IP was greater than the number of reads in input and the the peaks a Bonferroni corrected fisher exact p-value of less than .05.
Genome_build: hg19
Supplementary_files_format_and_content: peaks.bed and bigwig