GSE171553 Processing Pipeline

RIP-Seq · 2 steps

Publication

A multi-scale map of cell structure fusing protein images and interactions.

Nature (2021) — PMID 34819669

Dataset

GSE171553

Mapping cell structure across scales by fusing protein images and interactions

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    After sequencing, raw reads were aligned to GRCh38 and analyzed following the detailed instructions in ENCODE eCLIP-seq Processing Pipeline v2.2 (https://www.encodeproject.org/pipelines/ENCPL357ADL/).

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define input files and reference genome index
    READS_R1="raw_reads_R1.fastq.gz"
    READS_R2="raw_reads_R2.fastq.gz"
    GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 STAR index (e.g., from ENCODE or UCSC)
    OUTPUT_PREFIX="aligned_eCLIP_sample"
    
    # Align raw reads to GRCh38 using STAR. The parameter values below are illustrative
    # and are not confirmed by the source; verify each against the ENCODE eCLIP pipeline documentation.
    STAR \
      --runThreadN 8 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READS_R1}" "${READS_R2}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_PREFIX}_" \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes All \
      --outFilterMultimapNmax 20 \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.04 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --outFilterScoreMinOverLread 0.75 \
      --outFilterMatchNminOverLread 0.75 \
      --limitBAMsortRAM 30000000000
    
    # The ENCODE eCLIP-seq Processing Pipeline v2.2 includes further steps before and after alignment.
    # The tools named below are inferred, not stated in the source text:
    # - Adapter trimming before alignment (e.g., using cutadapt)
    # - PCR duplicate removal after alignment based on UMIs (e.g., using UMI-tools or custom scripts)
    # - Filtering and sorting of BAM files (e.g., using samtools)
    # - Peak calling (e.g., using CLIPper: https://github.com/yeolab/clipper)
    # - Normalization of IP peaks against the size-matched input control
    # - IDR analysis for reproducible peaks (e.g., using merge_peaks: https://github.com/yeolab/merge_peaks)
    # - Generation of bigWig tracks for visualization
    
  2.

    Consistent with the ENCODE standard, reads aligning to artifact-enriched or repetitive genomic regions were removed.

    bedtools v2.30.0 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools=2.30.0
    
    # Define input and output file paths
    INPUT_BAM="aligned_reads.bam"
    OUTPUT_BAM="filtered_reads.bam"
    BLACKLIST_BED="GRCh38_unified_blacklist_V2.bed"
    
    # Download the ENCODE blacklist file for GRCh38 if not available
    # (URL not verified; confirm the source and file version before use)
    # wget -O "${BLACKLIST_BED}.gz" https://raw.githubusercontent.com/ENCODE-DCC/chip-seq-pipeline2/master/references/GRCh38_unified_blacklist_V2.bed.gz
    # gunzip -f "${BLACKLIST_BED}.gz"
    
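    # Remove alignments that overlap artifact-enriched or repetitive regions.
    # -v keeps only entries in -a with no overlap in -b; with BAM input, the output is also BAM.
    # Using the blacklist BED as the region set is an assumption; the source does not name the file.
    # For paired-end data each alignment is tested independently, so a read can lose its mate.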
    bedtools intersect -v -a "${INPUT_BAM}" -b "${BLACKLIST_BED}" > "${OUTPUT_BAM}"
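
The exclusion rule used in step 2 can be illustrated on toy intervals without any bioinformatics tools installed. The sketch below is only an approximation of what `bedtools intersect -v` does for BED input: it keeps an interval when it shares no base with any region in the exclusion file. The read names, coordinates, and blacklist region are hypothetical, and coordinates are assumed to be 0-based and half-open, as in BED.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical read alignments: chrom, start, end, name
printf 'chr1\t100\t150\tread_a\n' >  "${workdir}/reads.bed"
printf 'chr1\t480\t530\tread_b\n' >> "${workdir}/reads.bed"
printf 'chr2\t100\t150\tread_c\n' >> "${workdir}/reads.bed"

# Hypothetical exclusion region: chrom, start, end
printf 'chr1\t500\t1000\n' > "${workdir}/blacklist.bed"

# First file (blacklist) is loaded into arrays; each read from the second file
# is printed only if it overlaps no blacklist region on the same chromosome.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { n++; bl_chrom[n] = $1; bl_start[n] = $2 + 0; bl_end[n] = $3 + 0; next }
     {
       hit = 0
       for (i = 1; i <= n; i++) {
         # Half-open intervals overlap when each starts before the other ends
         if ($1 == bl_chrom[i] && ($2 + 0) < bl_end[i] && ($3 + 0) > bl_start[i]) { hit = 1; break }
       }
       if (!hit) print
     }' "${workdir}/blacklist.bed" "${workdir}/reads.bed" > "${workdir}/kept.bed"

cat "${workdir}/kept.bed"
```

Here `read_b` is dropped because it extends 30 bases into the excluded region, while `read_a` and `read_c` are kept. The nested loop is fine for a toy file but is far slower than bedtools on real data.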

Tools Used

Raw Source Text
After sequencing, raw reads were aligned to GRCh38 and analyzed following the detailed instructions in ENCODE eCLIP-seq Processing Pipeline v2.2 (https://www.encodeproject.org/pipelines/ENCPL357ADL/).
Consistent with the ENCODE standard, reads aligning to artifact-enriched or repetitive genomic regions were removed.
Genome_build: GRCh38
Supplementary_files_format_and_content: The processed file contains reproducible and significant peaks of aligned reads at IDR cutoff of 0.01, P ≤ 0.001, and fold enrichment ≥ 8. All peaks have annotated genic region based on overlap with GENCODE v26 transcripts.
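
The significance thresholds quoted above translate directly into log-scale cutoffs: fold enrichment ≥ 8 is log2 fold enrichment ≥ 3, and P ≤ 0.001 is -log10(P) ≥ 3. As a minimal sketch, the filter below applies both cutoffs to a toy peak table. The column layout (chrom, start, end, name, log2 fold enrichment, -log10 P) and all peak values are hypothetical, not the layout of the deposited file, and IDR filtering is assumed to have been done upstream.

```shell
#!/usr/bin/env bash
set -euo pipefail

peaks="$(mktemp)"

# Hypothetical columns: chrom, start, end, name, log2 fold enrichment, -log10(P)
printf 'chr1\t1000\t1050\tpeak_1\t3.4\t5.2\n' >  "${peaks}"
printf 'chr1\t2000\t2060\tpeak_2\t2.1\t6.0\n' >> "${peaks}"
printf 'chr2\t3000\t3040\tpeak_3\t4.0\t2.5\n' >> "${peaks}"
printf 'chr3\t500\t560\tpeak_4\t3.0\t3.0\n'   >> "${peaks}"

# Keep peaks meeting both cutoffs:
#   fold enrichment >= 8  <=>  log2 fold enrichment >= 3
#   P <= 0.001            <=>  -log10(P) >= 3
significant="$(awk -F'\t' '($5 + 0) >= 3 && ($6 + 0) >= 3' "${peaks}")"

printf '%s\n' "${significant}"
```

Only `peak_1` and `peak_4` pass: `peak_2` fails the enrichment cutoff and `peak_3` fails the P-value cutoff. Check the actual column order of the supplementary file before adapting the field numbers.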