GSE228444 Processing Pipeline


Publication

Integrated multi-omic characterizations of the synapse reveal RNA processing factors and ubiquitin ligases associated with neurodevelopmental disorders.

Cell Systems (2025), PMID 40054464

Dataset

GSE228444

Integrated proteomic and multi-omic characterizations of the synapse reveal RNA processing factor and ubiquitin ligases associated with neurodevelopme…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Following sequencing, raw reads were aligned to GRCh38 and analyzed following a previously published pipeline (Van Nostrand et al., 2016).

    STAR (Inferred with models/gemini-2.5-flash) v2.5.2b GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index creation (run once)
    # This step prepares the GRCh38 genome for alignment.
    # Replace /path/to/ with actual paths to your genome files.
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/GRCh38_STAR_index \
    #      --genomeFastaFiles /path/to/GRCh38.fasta \
    #      --sjdbGTFfile /path/to/GRCh38.gtf \
    #      --runThreadN 8
    
    # Align raw reads to GRCh38 using STAR. The parameters below are illustrative and
    # have not been checked against the published eCLIP pipeline; verify them before use.
    # Replace 'read1.fastq.gz', 'read2.fastq.gz' with your actual input files.
    # Replace '/path/to/GRCh38_STAR_index' with the path to your STAR genome index.
    # Output BAM file will be 'aligned_reads/sample_Aligned.sortedByCoord.out.bam'.
    
    mkdir -p aligned_reads
    
    STAR \
      --runThreadN 8 \
      --genomeDir /path/to/GRCh38_STAR_index \
      --readFilesIn read1.fastq.gz read2.fastq.gz \
      --readFilesCommand zcat \
      --outFileNamePrefix aligned_reads/sample_ \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes All \
      --outFilterMultimapNmax 20 \
      --alignSJDBoverhangMin 8 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.05 \
      --seedSearchStartLmax 30 \
      --winAnchorMultimapNmax 50 \
      --outFilterScoreMinOverLread 0 \
      --outFilterMatchNminOverLread 0 \
      --limitBAMsortRAM 60000000000
    
    # Note: The full Van Nostrand et al., 2016 eCLIP pipeline (as implemented in yeolab/eclip)
    # includes additional steps for adapter trimming, duplicate removal, peak calling
    # (e.g., CLIPper), IDR, and downstream analysis beyond just alignment.
  2. Consistent with the ENCODE standard (ref. 65), reads aligning to artifact-enriched or repetitive genomic regions were removed.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Download ENCODE hg38 blacklist (example, use appropriate assembly for your data)
    # wget -O hg38-blacklist.v2.bed.gz https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
    # gunzip hg38-blacklist.v2.bed.gz
    
    # Define input and output files
    INPUT_BAM="aligned_reads/sample_Aligned.sortedByCoord.out.bam" # Output of the STAR step above
    OUTPUT_BAM="filtered_reads.bam"
    BLACKLIST_BED="hg38-blacklist.v2.bed" # Path to the ENCODE hg38 blacklist BED file
    
    # Filter reads aligning to blacklisted regions.
    # -L selects reads overlapping the blacklist; these go to the main output (/dev/null).
    # -U writes the remaining, non-overlapping reads to OUTPUT_BAM.
    # Options precede the input file so the command also works where getopt
    # does not permute arguments (e.g., some non-GNU systems).
    samtools view -b -L "${BLACKLIST_BED}" -U "${OUTPUT_BAM}" -o /dev/null "${INPUT_BAM}"
  3. Reproducible and significant peaks of aligned reads were defined using an IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥ 8.

    IDR (version not stated; 0.01 is the IDR cutoff, not a software version) GitHub
    $ Bash example
    # Install IDR (if not already installed)
    # conda install -c bioconda idr
    
    # The Yeo lab publishes an IDR-based peak-merging workflow (yeolab/merge_peaks):
    # git clone https://github.com/yeolab/merge_peaks.git
    # Its invocation is not reproduced here; consult that repository for the exact commands.
    
    # Generic sketch using the standalone idr command-line tool instead (an assumption,
    # not the authors' confirmed method). Inputs are per-replicate peak files in
    # narrowPeak format. Replace 'rep1_peaks.narrowPeak' and 'rep2_peaks.narrowPeak'
    # with actual file paths, and 'reproducible_peaks.idr.txt' with your desired output.
    
    idr --samples rep1_peaks.narrowPeak rep2_peaks.narrowPeak \
        --input-file-type narrowPeak \
        --rank p.value \
        --idr-threshold 0.01 \
        --output-file reproducible_peaks.idr.txt
    
    # The P <= 0.001 and fold enrichment >= 8 cutoffs are applied separately to the
    # resulting peaks; which columns hold these values depends on the peak caller.
  4. Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts, following a priority order consistent with the previous study.

    GENCODE v26 annotation (tool version v2.27.1 inferred with models/gemini-2.5-flash; likely refers to bedtools) GitHub
    $ Bash example
    # Install dependencies (if not already installed, recommended in a conda environment):
    # conda create -n eclip_annotation -c conda-forge -c bioconda python=3.7 pybedtools gffutils bedtools
    # conda activate eclip_annotation
    
    # Download the GENCODE v26 annotation GTF file
    GENCODE_GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz"
    GENCODE_GTF="gencode.v26.annotation.gtf"
    wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
    gunzip -f "${GENCODE_GTF}.gz"
    
    # Placeholder for eCLIP peaks file (replace with your actual input BED file)
    # Example: eCLIP_peaks.bed
    # For demonstration, create a dummy peaks file:
    echo -e "chr1\t1000\t2000\tpeak1\t100\t+\nchr1\t3000\t4000\tpeak2\t100\t-" > eCLIP_peaks.bed
    
    # Annotation with a priority order is typically handled by a custom Python script
    # built on pybedtools (a Python wrapper for bedtools) and gffutils, which parse the GTF
    # and resolve overlaps with prioritization logic (e.g., exon > UTR > intron > intergenic;
    # the exact order used in the study is not given here).
    
    # The script name, URL, and flags below are unverified placeholders; confirm them
    # against the Yeo lab's repositories before use.
    # wget https://raw.githubusercontent.com/yeolab/eclip/master/tools/annotate_peaks_with_genes/annotate_peaks_with_genes.py
    # chmod +x annotate_peaks_with_genes.py
    
    # Execute the annotation script (uncomment once a verified script is in place)
    # python annotate_peaks_with_genes.py \
    #   --peaks eCLIP_peaks.bed \
    #   --annotation "${GENCODE_GTF}" \
    #   --output annotated_eCLIP_peaks.bed
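
The priority-based annotation described in step 4 can be illustrated with a minimal sketch. It assumes a two-column, tab-separated table of peak ID and overlapping feature type (for example, derived from intersecting peaks with GENCODE features) and keeps one region per peak. The function name `assign_priority_region`, the feature labels, and the order CDS > 5UTR > 3UTR > intron > intergenic are all hypothetical; the source only states that the order follows the previous study.

```shell
# Hypothetical priority order (an assumption, not taken from the source):
# a lower rank wins when a peak overlaps several feature types.
assign_priority_region() {
  awk -F'\t' -v OFS='\t' '
    BEGIN {
      rank["CDS"] = 1; rank["5UTR"] = 2; rank["3UTR"] = 3
      rank["intron"] = 4; rank["intergenic"] = 5
    }
    {
      peak = $1; feature = $2
      if (!(feature in rank)) next             # skip unknown feature labels
      if (!(peak in best)) {
        order[++n] = peak                      # remember first-seen peak order
        best[peak] = feature
      } else if (rank[feature] < rank[best[peak]]) {
        best[peak] = feature                   # higher-priority feature found
      }
    }
    END { for (i = 1; i <= n; i++) print order[i], best[order[i]] }
  '
}

# Demo with made-up overlaps: peak1 hits an intron and a CDS, peak2 a 3UTR.
printf 'peak1\tintron\npeak1\tCDS\npeak2\tintergenic\npeak2\t3UTR\n' \
  | assign_priority_region
```

In the demo, peak1 is assigned CDS and peak2 is assigned 3UTR, because those labels carry the lowest rank among each peak's overlaps. Swapping in the study's actual order only requires editing the `rank` table.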
Raw Source Text
Following sequencing, raw reads were aligned to GRCh38 and analyzed following a previously published pipeline (Nostrand et al, 2016)
Consistent with the ENCODE standard 65, reads aligning to artifact-enriched or repetitive genomic regions were removed
Reproducible and significant peaks of aligned reads were defined as IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥8.
Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts following the priority order consistent with the previous study
Assembly: GRCh38
Supplementary files format and content: wig files represent read coverage for plus and minus strands
Supplementary files format and content: peak files
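
The step 3 significance cutoffs are often applied on a log scale: P ≤ 0.001 is equivalent to -log10(P) ≥ 3, and fold enrichment ≥ 8 is equivalent to log2(fold) ≥ 3. The sketch below filters a peak file on those two values. The column layout (-log10 P in column 4, log2 fold enrichment in column 5) and the function name `filter_significant_peaks` are assumptions, so check the actual peak file format first. The IDR cutoff is not handled here.

```shell
# Keep peaks with -log10(P) >= 3 (P <= 0.001) and log2(fold) >= 3 (fold >= 8).
# Assumed (hypothetical) tab-separated layout:
#   chrom, start, end, -log10(P), log2(fold enrichment), strand
filter_significant_peaks() {
  awk -F'\t' -v min_neglog10_p=3 -v min_log2_fold=3 '
    ($4 + 0) >= min_neglog10_p && ($5 + 0) >= min_log2_fold
  '
}

# Demo with made-up peaks: the second fails the P cutoff, the fourth the fold cutoff.
printf 'chr1\t100\t200\t5.2\t4.1\t+\nchr1\t300\t400\t2.9\t6.0\t-\nchr2\t500\t600\t3.0\t3.0\t+\nchr2\t700\t800\t8.0\t2.5\t-\n' \
  | filter_significant_peaks
```

Both comparisons are inclusive, so a peak sitting exactly at a threshold (the third demo line) is kept, matching the "≤" and "≥" in the stated criteria.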