GSE228444 Processing Pipeline
4 steps
Publication
Integrated multi-omic characterizations of the synapse reveal RNA processing factors and ubiquitin ligases associated with neurodevelopmental disorders. Cell Systems (2025), PMID 40054464
Dataset
GSE228444: Integrated proteomic and multi-mic characterizations of the synapse reveal RNA processing factor and ubiquitin ligases associated with neurodevelopme…
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
After sequencing, raw reads were aligned to GRCh38 and analyzed with a previously published pipeline (Van Nostrand et al., 2016).
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index creation (run once).
# This step prepares the GRCh38 genome for alignment.
# Replace /path/to/ with actual paths to your genome files.
# STAR --runMode genomeGenerate \
#   --genomeDir /path/to/GRCh38_STAR_index \
#   --genomeFastaFiles /path/to/GRCh38.fasta \
#   --sjdbGTFfile /path/to/GRCh38.gtf \
#   --runThreadN 8

# Align raw reads to GRCh38 using STAR. The parameter values below are
# illustrative and are not confirmed by the source text.
# Replace 'read1.fastq.gz', 'read2.fastq.gz' with your actual input files.
# Replace '/path/to/GRCh38_STAR_index' with the path to your STAR genome index.
# Output BAM file will be 'aligned_reads/sample_Aligned.sortedByCoord.out.bam'.
mkdir -p aligned_reads
STAR \
  --runThreadN 8 \
  --genomeDir /path/to/GRCh38_STAR_index \
  --readFilesIn read1.fastq.gz read2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix aligned_reads/sample_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.05 \
  --seedSearchStartLmax 30 \
  --winAnchorMultimapNmax 50 \
  --outFilterScoreMinOverLread 0 \
  --outFilterMatchNminOverLread 0 \
  --limitBAMsortRAM 60000000000

# Note: The full Van Nostrand et al., 2016 eCLIP pipeline (as implemented in yeolab/eclip)
# includes additional steps for adapter trimming, duplicate removal, peak calling
# (e.g., CLIPper), IDR, and downstream analysis beyond just alignment.
2
Consistent with the ENCODE standard (ref. 65), reads aligning to artifact-enriched or repetitive genomic regions were removed.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Download ENCODE hg38 blacklist (example, use appropriate assembly for your data)
# wget -O hg38-blacklist.v2.bed.gz https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
# gunzip hg38-blacklist.v2.bed.gz

# Define input and output files
INPUT_BAM="aligned_reads.bam"          # Placeholder for input aligned BAM file
OUTPUT_BAM="filtered_reads.bam"
BLACKLIST_BED="hg38-blacklist.v2.bed"  # Path to the ENCODE hg38 blacklist BED file

# Filter reads aligning to blacklisted regions.
# -L selects reads overlapping the blacklist; they go to stdout and are discarded.
# -U writes the reads that were not selected (the ones to keep) to OUTPUT_BAM.
samtools view -b -L "${BLACKLIST_BED}" -U "${OUTPUT_BAM}" "${INPUT_BAM}" > /dev/null
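The overlap rule behind this filter can be illustrated on plain BED intervals without samtools. This is a minimal sketch, assuming reads have already been converted to BED and that the blacklist fits in memory. The function name `drop_blacklisted` and the toy coordinates are hypothetical and do not come from the published pipeline.

```shell
# drop_blacklisted BLACKLIST_BED READS_BED
# Prints every read interval that overlaps no blacklist interval.
# BED intervals are 0-based and half-open: [start, end).
drop_blacklisted() {
  awk -F'\t' -v bl="$1" '
    BEGIN {
      # Load the blacklist intervals into memory.
      while ((getline line < bl) > 0) {
        split(line, f, "\t")
        n++; chrom[n] = f[1]; s[n] = f[2] + 0; e[n] = f[3] + 0
      }
      close(bl)
    }
    {
      # Two half-open intervals overlap when each starts before the other ends.
      hit = 0
      for (i = 1; i <= n; i++)
        if ($1 == chrom[i] && $2 + 0 < e[i] && $3 + 0 > s[i]) { hit = 1; break }
      if (!hit) print
    }
  ' "$2"
}

# Toy data (hypothetical coordinates).
tmpdir=$(mktemp -d)
printf 'chr1\t100\t200\n' > "$tmpdir/blacklist.bed"
printf 'chr1\t50\t100\tread_a\nchr1\t150\t250\tread_b\nchr2\t150\t250\tread_c\n' > "$tmpdir/reads.bed"
kept=$(drop_blacklisted "$tmpdir/blacklist.bed" "$tmpdir/reads.bed" | cut -f4 | paste -sd, -)
echo "$kept"
rm -rf "$tmpdir"
```

In the toy data, read_a ends exactly where the blacklisted interval begins, so under half-open coordinates it does not overlap and is kept. The linear scan over the blacklist is adequate for a sketch but would be slow on a full BAM; an indexed tool is the practical choice there.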
3
Reproducible and significant peaks of aligned reads were defined by an IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥ 8.
$ Bash example
# Install IDR Python package (if not already installed)
# pip install idr

# Clone the yeolab/merge_peaks repository (if not already cloned)
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Example usage of merge_peaks.py for IDR analysis.
# NOTE: the script name and every flag below are unverified placeholders;
# check the repository's documentation for the actual entry point and options.
# Input peak files for two replicates (e.g., from a CLIP peak caller such as CLIPper).
# These files are expected to be in a format like narrowPeak, containing P-values and fold enrichment.
# Replace 'rep1_peaks.narrowPeak' and 'rep2_peaks.narrowPeak' with actual file paths.
# Replace 'reproducible_peaks' with your desired output prefix.
python merge_peaks.py \
  --peak_files rep1_peaks.narrowPeak rep2_peaks.narrowPeak \
  --idr_threshold 0.01 \
  --p_value_threshold 0.001 \
  --fold_enrichment_threshold 8 \
  --output_prefix reproducible_peaks
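The significance and enrichment cutoffs translate directly to log scale: P ≤ 0.001 is equivalent to -log10(P) ≥ 3, and fold enrichment ≥ 8 is equivalent to log2(fold) ≥ 3. The sketch below applies those two cutoffs with awk. It assumes a tab-separated peak file whose fourth column is -log10(P) and whose fifth column is log2 fold enrichment; that layout, the function name `filter_peaks`, and the toy rows are assumptions to check against your own files. The IDR cutoff is not covered here.

```shell
# filter_peaks: read peaks on stdin, print those passing both cutoffs.
# Assumed columns: chrom, start, end, -log10(P), log2(fold), strand.
filter_peaks() {
  awk -F'\t' -v min_l10p=3 -v min_l2fc=3 \
    '$4 + 0 >= min_l10p && $5 + 0 >= min_l2fc'
}

# Toy input (hypothetical values): two rows pass, two fail.
passed=$(
  printf 'chr1\t100\t200\t5.2\t4.1\t+\nchr1\t300\t400\t2.9\t6.0\t+\nchr2\t500\t600\t3.0\t3.0\t-\nchr2\t700\t800\t8.0\t2.5\t-\n' |
    filter_peaks |
    awk -F'\t' '{ print $1 ":" $2 "-" $3 }' |
    paste -sd, -
)
echo "$passed"
```

The row at exactly 3.0 and 3.0 is retained because both cutoffs are inclusive, matching the ≤ and ≥ in the step description.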
4
Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts, following the priority order used in the previous study.
$ Bash example
# Install dependencies (if not already installed, recommended in a conda environment):
# conda create -n eclip_annotation -c conda-forge -c bioconda python=3.7 pybedtools gffutils bedtools
# conda activate eclip_annotation

# Download the GENCODE v26 annotation GTF file
GENCODE_GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz"
GENCODE_GTF="gencode.v26.annotation.gtf"
wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
gunzip -f "${GENCODE_GTF}.gz"

# Placeholder for eCLIP peaks file (replace with your actual input BED file)
# Example: eCLIP_peaks.bed
# For demonstration, create a dummy peaks file:
printf 'chr1\t1000\t2000\tpeak1\t100\t+\nchr1\t3000\t4000\tpeak2\t100\t-\n' > eCLIP_peaks.bed

# The annotation with priority order is typically handled by a custom Python script
# from the Yeo lab's eCLIP pipeline, which leverages pybedtools (a Python wrapper for bedtools)
# and gffutils to parse GTF and perform intersections with specific prioritization logic
# (e.g., exon > UTR > intron > intergenic).

# To use a script from the Yeo lab's eCLIP pipeline, you would first download it.
# NOTE: the URL, script name, and flags below are unverified placeholders;
# confirm them against the Yeo lab repositories before running.
# wget https://raw.githubusercontent.com/yeolab/eclip/master/tools/annotate_peaks_with_genes/annotate_peaks_with_genes.py
# chmod +x annotate_peaks_with_genes.py

# Execute the annotation script (requires the script to be present locally)
python annotate_peaks_with_genes.py \
  --peaks eCLIP_peaks.bed \
  --annotation "${GENCODE_GTF}" \
  --output annotated_eCLIP_peaks.bed
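Priority-based annotation means that when a peak overlaps several feature types, the type highest in an ordered list wins. The sketch below shows only that selection step. The source does not state the actual priority order, so the order in `PRIORITY` is a placeholder, and the function name `pick_region` and the two-column input (peak ID, feature type, one line per overlap) are hypothetical.

```shell
# Highest to lowest priority. Placeholder order, not taken from the source.
PRIORITY="CDS,5UTR,3UTR,intron"

# pick_region: stdin is "peak_id<TAB>feature_type", one line per overlap.
# Prints one line per peak with its highest-priority feature type.
pick_region() {
  awk -F'\t' -v order="$PRIORITY" '
    BEGIN {
      n = split(order, names, ",")
      for (i = 1; i <= n; i++) rank[names[i]] = i
    }
    {
      r = ($2 in rank) ? rank[$2] : n + 1   # unknown types rank last
      if (!($1 in best) || r < best[$1]) { best[$1] = r; label[$1] = $2 }
      if (!($1 in seen)) { seen[$1] = 1; ids[++m] = $1 }   # keep input order
    }
    END { for (i = 1; i <= m; i++) print ids[i] "\t" label[ids[i]] }
  '
}

# Toy overlaps (hypothetical): peak1 and peak3 each hit two feature types.
result=$(
  printf 'peak1\tintron\npeak1\t3UTR\npeak2\tintron\npeak3\tCDS\npeak3\t5UTR\n' |
    pick_region | tr '\t' ':' | paste -sd, -
)
echo "$result"
```

Peaks with no overlapping feature never appear in the input, so an intergenic label would have to be assigned by whatever step produces the overlap table.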
Raw Source Text
Following sequencing, raw reads were aligned to GRCh38 and analyzed following a previously published pipeline (Nostrand et al, 2016)
Consistent with the ENCODE standard 65, reads aligning to artifact-enriched or repetitive genomic regions were removed
Reproducible and significant peaks of aligned reads were defined as IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥8.
Genic regions of eCLIP peaks were annotated based on overlap with GENCODE v26 transcripts following the priority order consistent with the previous study
Assembly: GRCh38
Supplementary files format and content: wig files represent read coverage for plus and minus strands
Supplementary files format and content: peak files