GSE228444 Processing Pipeline

Publication

Integrated multi-omic characterizations of the synapse reveal RNA processing factors and ubiquitin ligases associated with neurodevelopmental disorders.

Cell systems (2025) — PMID 40054464

Dataset

GSE228444

Integrated proteomic and multi-omic characterizations of the synapse reveal RNA processing factor and ubiquitin ligases associated with neurodevelopme…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Following sequencing, raw reads were aligned to GRCh38 and analyzed following a previously published pipeline (Van Nostrand et al., 2016).

STAR (Inferred with models/gemini-2.5-flash) v2.5.2b GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index creation (run once)
# This step prepares the GRCh38 genome for alignment.
# Replace /path/to/ with actual paths to your genome files.
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/GRCh38_STAR_index \
#      --genomeFastaFiles /path/to/GRCh38.fasta \
#      --sjdbGTFfile /path/to/GRCh38.gtf \
#      --runThreadN 8

# Align raw reads to GRCh38 using STAR. The parameter values below are inferred and
# illustrative; verify them against the published eCLIP pipeline before use.
# Replace 'read1.fastq.gz', 'read2.fastq.gz' with your actual input files.
# Replace '/path/to/GRCh38_STAR_index' with the path to your STAR genome index.
# Output BAM file will be 'aligned_reads/sample_Aligned.sortedByCoord.out.bam'.

mkdir -p aligned_reads

STAR \
  --runThreadN 8 \
  --genomeDir /path/to/GRCh38_STAR_index \
  --readFilesIn read1.fastq.gz read2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix aligned_reads/sample_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.05 \
  --seedSearchStartLmax 30 \
  --winAnchorMultimapNmax 50 \
  --outFilterScoreMinOverLread 0 \
  --outFilterMatchNminOverLread 0 \
  --limitBAMsortRAM 60000000000

# Note: The full Van Nostrand et al., 2016 eCLIP pipeline (as implemented in yeolab/eclip)
# includes additional steps for adapter trimming, duplicate removal, peak calling
# (e.g., CLIPper), IDR, and downstream analysis beyond just alignment.

Consistent with the ENCODE standard (ref. 65), reads aligning to artifact-enriched or repetitive genomic regions were removed.

samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Download ENCODE hg38 blacklist (example, use appropriate assembly for your data)
# wget -O hg38-blacklist.v2.bed.gz https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
# gunzip hg38-blacklist.v2.bed.gz

# Define input and output files
INPUT_BAM="aligned_reads/sample_Aligned.sortedByCoord.out.bam" # Coordinate-sorted BAM from the alignment step
OUTPUT_BAM="filtered_reads.bam"
BLACKLIST_BED="hg38-blacklist.v2.bed" # Path to the ENCODE hg38 blacklist BED file

# Filter reads aligning to blacklisted regions
# -L selects reads overlapping the blacklist (discarded via -o /dev/null);
# -U writes the remaining, non-overlapping reads to OUTPUT_BAM
samtools view -b -L "${BLACKLIST_BED}" -U "${OUTPUT_BAM}" -o /dev/null "${INPUT_BAM}"

Reproducible and significant peaks of aligned reads were defined as IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥ 8.

IDR (listed as v0.01, which appears to echo the IDR cutoff rather than a release number) GitHub

$ Bash example

# Install the IDR command-line tool (if not already installed)
# conda install -c bioconda idr

# The Yeo lab's merge_peaks repository wraps IDR for eCLIP replicates
# git clone https://github.com/yeolab/merge_peaks.git

# NOTE: the exact merge_peaks invocation is not given in the source, so the command
# below is an illustrative stand-in that calls the idr tool directly.
# Input peak files for two replicates (e.g., CLIPper peaks after input normalization),
# assumed here to be in narrowPeak format.
# Replace 'rep1_peaks.narrowPeak' and 'rep2_peaks.narrowPeak' with actual file paths.
# Replace 'reproducible_peaks.idr.txt' with your desired output file.

idr \
    --samples rep1_peaks.narrowPeak rep2_peaks.narrowPeak \
    --input-file-type narrowPeak \
    --rank p.value \
    --idr-threshold 0.01 \
    --output-file reproducible_peaks.idr.txt

# The P <= 0.001 and fold enrichment >= 8 cutoffs are not idr options;
# apply them as a separate filtering step on the peak table.
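
The significance and enrichment cutoffs stated for this step can be applied with a short awk filter. This is a minimal sketch, not the authors' implementation. It assumes a six-column, tab-separated BED-like peak file in which column 4 holds -log10(P) and column 5 holds log2(fold enrichment); that layout is an assumption and is not confirmed by the source for this dataset. Under it, P ≤ 0.001 becomes -log10(P) ≥ 3 and fold enrichment ≥ 8 becomes log2(fold enrichment) ≥ 3. The function name `filter_peaks` and the file `toy_peaks.bed` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep peaks with P <= 0.001 and fold enrichment >= 8.
# ASSUMPTION: column 4 = -log10(P), column 5 = log2(fold enrichment).
# Usage: filter_peaks PEAKS.bed [min_neglog10_p] [min_log2_fold]
filter_peaks() {
  local min_neglog10_p="${2:-3}"   # 3 = -log10(0.001)
  local min_log2_fold="${3:-3}"    # 3 = log2(8)
  awk -F'\t' -v p="$min_neglog10_p" -v fc="$min_log2_fold" \
    '$4 + 0 >= p && $5 + 0 >= fc' "$1"
}

# Toy input: only the first peak passes both cutoffs.
printf 'chr1\t100\t200\t5.2\t4.1\t+\n'  > toy_peaks.bed
printf 'chr1\t300\t400\t2.0\t4.5\t-\n' >> toy_peaks.bed   # fails the P cutoff
printf 'chr2\t500\t600\t6.0\t1.5\t+\n' >> toy_peaks.bed   # fails the fold cutoff

filter_peaks toy_peaks.bed
```

If the peak table stores raw P values or linear fold enrichment instead, pass different thresholds or adjust the comparison accordingly.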

Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts following the priority order consistent with the previous study

GENCODE v2.27.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install dependencies (if not already installed, recommended in a conda environment):
# conda create -n eclip_annotation -c conda-forge -c bioconda python=3.7 pybedtools gffutils bedtools
# conda activate eclip_annotation

# Download the GENCODE v26 annotation GTF file
GENCODE_GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz"
GENCODE_GTF="gencode.v26.annotation.gtf"
wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
gunzip -f "${GENCODE_GTF}.gz"

# Placeholder for eCLIP peaks file (replace with your actual input BED file)
# Example: eCLIP_peaks.bed
# For demonstration, create a dummy peaks file:
echo -e "chr1\t1000\t2000\tpeak1\t100\t+\nchr1\t3000\t4000\tpeak2\t100\t-" > eCLIP_peaks.bed

# The annotation with priority order is typically handled by a custom Python script
# that uses pybedtools (a Python wrapper for bedtools) and gffutils to parse the GTF
# and intersect peaks with features under a prioritization rule
# (e.g., exon > UTR > intron > intergenic; the exact order is defined in the previous study).

# NOTE: the script location and command-line flags below are inferred and unverified.
# Confirm the actual script name, path, and options in the Yeo lab repositories before running.
# wget https://raw.githubusercontent.com/yeolab/eclip/master/tools/annotate_peaks_with_genes/annotate_peaks_with_genes.py
# chmod +x annotate_peaks_with_genes.py

# Execute the annotation script (hypothetical interface)
python annotate_peaks_with_genes.py \
  --peaks eCLIP_peaks.bed \
  --annotation "${GENCODE_GTF}" \
  --output annotated_eCLIP_peaks.bed
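
The priority-order rule in this step can be illustrated independently of any particular annotation script. The sketch below assumes a two-column, tab-separated table with one row per peak/feature overlap (peak ID, feature type), such as could be derived from `bedtools intersect` output. The default priority list is purely illustrative, since the source defers to the previous study for the actual order. The function name `assign_region`, the feature labels, and the file `toy_overlaps.tsv` are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each peak, keep the overlapping feature type
# that ranks highest in a priority list.
# ASSUMPTION: input is TSV with peak_id <TAB> feature_type, one row per overlap.
# ASSUMPTION: the default priority list is illustrative, not the published order.
# Usage: assign_region OVERLAPS.tsv [comma,separated,priority,list]
assign_region() {
  local priority="${2:-CDS,five_prime_UTR,three_prime_UTR,exon,intron}"
  awk -F'\t' -v order="$priority" '
    BEGIN {
      OFS = "\t"
      n = split(order, names, ",")
      for (i = 1; i <= n; i++) rank[names[i]] = i   # lower rank = higher priority
    }
    {
      r = ($2 in rank) ? rank[$2] : n + 1           # unknown types rank last
      if (!($1 in best)) {                          # first time this peak is seen
        ids[++count] = $1; best[$1] = r; label[$1] = $2
      } else if (r < best[$1]) {                    # better-ranked feature found
        best[$1] = r; label[$1] = $2
      }
    }
    END { for (i = 1; i <= count; i++) print ids[i], label[ids[i]] }
  ' "$1"
}

# Toy input: peak1 overlaps an intron and a CDS; peak2 an intron and a 3' UTR.
printf 'peak1\tintron\npeak1\tCDS\n'              > toy_overlaps.tsv
printf 'peak2\tintron\npeak2\tthree_prime_UTR\n' >> toy_overlaps.tsv
printf 'peak3\tintron\n'                         >> toy_overlaps.tsv

assign_region toy_overlaps.tsv
```

Peaks that overlap no annotated feature never appear in such an overlap table and would need to be labeled intergenic in a separate step.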

Raw Source Text

Following sequencing, raw reads were aligned to GRCh38 and analyzed following a previously published pipeline (Van Nostrand et al., 2016).
Consistent with the ENCODE standard (ref. 65), reads aligning to artifact-enriched or repetitive genomic regions were removed.
Reproducible and significant peaks of aligned reads were defined as IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥ 8.
Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts following the priority order consistent with the previous study.
Assembly: GRCh38
Supplementary files format and content: wig files represent read coverage for plus and minus strands
Supplementary files format and content: peak files
