GSE201895 Processing Pipeline

OTHER code_examples 8 steps

Publication

Science translational medicine (2022) — PMID 35767654

Dataset

GSE201895

MECP2-related pathways are dysregulated in a cortical organoid model of Myotonic dystrophy [eCLIP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Prepare STAR index for RepBase v18.05 (conceptual step, assuming index is pre-built) ---
# This step is typically done once for a given reference.
# Assuming 'repbase_18.05_human.fa' contains the human-specific repetitive elements from RepBase v18.05.
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir repbase_index_18.05 --genomeFastaFiles repbase_18.05_human.fa

# --- Mapping reads to RepBase v18.05 using STAR ---
# Define variables for clarity
GENOME_DIR="repbase_index_18.05" # Directory containing the STAR index for RepBase v18.05
READS_R1="trimmed_reads_R1.fastq.gz" # Path to adapter-trimmed forward reads
READS_R2="trimmed_reads_R2.fastq.gz" # Path to adapter-trimmed reverse reads (assuming paired-end reads)
OUTPUT_PREFIX="mapped_to_repbase" # Prefix for output files (e.g., mapped_to_repbaseAligned.sortedByCoord.out.bam)

STAR --runThreadN 8 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --alignIntronMax 1 \
     --alignEndsType Local \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 10 \
     --outFilterMismatchNoverLmax 0.1 \
     --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting (adjust based on available memory)

View on GitHub

Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Placeholder for the STAR genome index directory for hg19.
# This directory must contain the pre-built STAR index files (e.g., SA, SAindex, genome, chrName.txt, etc.).
# To build the index, you would typically run:
# STAR --runThreadN <num_threads> --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/hg19 \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length
GENOME_DIR="/path/to/STAR_index/hg19_GRCh37" 

# Input FASTQ files. Adjust for single-end reads if necessary.
READ1="input_reads_R1.fastq.gz"
READ2="input_reads_R2.fastq.gz" 

# Prefix for output files (e.g., aligned_reads_Aligned.sortedByCoord.out.bam)
OUTPUT_PREFIX="aligned_reads"

# Number of threads to use for STAR
THREADS=8 

# Run STAR alignment
# The description states "Repeat-mapping reads were removed" prior to mapping.
# This implies the input FASTQ files are already filtered for such reads.
# STAR will then map the remaining reads to the human genome assembly (hg19).
# The --outFilterMultimapNmax parameter controls the maximum number of loci
# a read is allowed to map to. A value of 1 would mean only uniquely mapping reads.
# The default is 20. Here, we use 20 as a common setting, assuming pre-filtering
# handled the "repeat-mapping reads" definition.
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outSAMunmapped Within

View on GitHub

PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools (e.g., via conda)
# conda install -c bioconda umi-tools=1.1.2

# This command removes PCR duplicate reads based on UMI sequences.
# It assumes UMIs have been extracted to the read names (e.g., by a prior umi_tools extract step)
# from the 5' adapter region, as is common practice in eCLIP pipelines.
# The --method=unique ensures only one read per unique UMI and mapping position is retained.
# --paired-end is used as eCLIP data is typically paired-end.
# Replace input.bam with your aligned BAM file and output.bam with your desired output file name.
umi_tools dedup \
    --extract-umis-from-read-names \
    --method=unique \
    --paired-end \
    -I input.bam \
    -S output.bam \
    --output-stats output.dedup.stats

View on GitHub

Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron.

CLIPper v1.0 GitHub

$ Bash example

# Install CLIPper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# # Ensure Python dependencies are met, e.g., numpy, scipy

# Download GENCODE hg19 annotation (release 19 is commonly used with hg19 for hg19 builds)
# wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

# Define variables
INPUT_BAM="path/to/usable_reads.bam" # Placeholder for the input BAM file containing usable reads
OUTPUT_BED="clipper_peaks.bed"       # Output file for called peaks
GENCODE_GTF="gencode.v19.annotation.gtf" # GENCODE (hg19) annotation file
GENOME_SIZE="hs"                     # Human genome size abbreviation for CLIPper (hg19)

# Execute CLIPper to call peaks
# The '-s hs' parameter tells CLIPper to use human genome specific settings for peak calling.
# The '-g' parameter provides gene annotations which CLIPper can use for filtering or context.
python CLIPper.py -b "${INPUT_BAM}" -g "${GENCODE_GTF}" -s "${GENOME_SIZE}" -o "${OUTPUT_BED}"

# Note on gene region assignment:
# The description mentions peaks were "assigned to gene regions annotated in GENCODE (hg19)
# with the following descending priority order: CDS, 5â€™UTR, 3â€™UTR, proximal intron, and distal intron."
# This hierarchical assignment is a downstream annotation step that typically uses a separate script
# or tool (e.g., bedtools, custom Python script) to intersect the CLIPper output BED file
# with the GENCODE GTF/BED annotation and apply the specified priority logic. CLIPper itself
# outputs peak coordinates and does not directly perform this complex hierarchical assignment.

View on GitHub

Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub

$ Bash example

# This script defines proximal intron regions as extending up to 500 bp from an exon-intron junction.
# It requires a GTF annotation file for gene features.

# --- Configuration Variables ---
# Replace with your actual GTF file path (e.g., Ensembl GRCh38, release 109)
GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
# Output BED file for the defined proximal intron regions
OUTPUT_BED="proximal_introns_500bp.bed"

# --- Installation (commented out) ---
# # Install bedtools if not already available
# conda install -c bioconda bedtools

# --- Main Workflow ---

# 1. Extract all transcripts as BED format (0-based start, 1-based end)
#    Fields: chr, start, end, name, score, strand
awk '
BEGIN { OFS="\t" }
$3 == "transcript" {
    print $1, $4-1, $5, "transcript_" NR, ".", $7
}' "$GTF_FILE" > transcripts.bed

# 2. Extract all exons as BED format
awk '
BEGIN { OFS="\t" }
$3 == "exon" {
    print $1, $4-1, $5, "exon_" NR, ".", $7
}' "$GTF_FILE" > exons.bed

# 3. Subtract exons from transcripts to get all introns
#    This generates a BED file where each entry is an intron. The strand information is preserved.
bedtools subtract -a transcripts.bed -b exons.bed > all_introns.bed

# 4. Define the first 500bp of each intron (proximal to the upstream exon-intron junction)
#    Handles strand correctly: for '+' strand introns, it takes from the start; for '-' strand introns,
#    it takes from the end (which is the 5' end of the intron on the reverse strand).
awk 'BEGIN {OFS="\t"} {
    if ($6 == "+") { # Forward strand intron
        # Take from start (0-based) up to 500bp or intron end, whichever is smaller
        print $1, $2, ($2+500 < $3 ? $2+500 : $3), $4, $5, $6
    } else { # Reverse strand intron
        # Take from end (0-based) back 500bp or intron start, whichever is larger
        print $1, ($3-500 > $2 ? $3-500 : $2), $3, $4, $5, $6
    }
}' all_introns.bed > proximal_introns_500bp_start.bed

# 5. Define the last 500bp of each intron (proximal to the downstream exon-intron junction)
#    Handles strand correctly: for '+' strand introns, it takes from the end; for '-' strand introns,
#    it takes from the start (which is the 3' end of the intron on the reverse strand).
awk 'BEGIN {OFS="\t"} {
    if ($6 == "+") { # Forward strand intron
        # Take from end (0-based) back 500bp or intron start, whichever is larger
        print $1, ($3-500 > $2 ? $3-500 : $2), $3, $4, $5, $6
    } else { # Reverse strand intron
        # Take from start (0-based) up to 500bp or intron end, whichever is smaller
        print $1, $2, ($2+500 < $3 ? $2+500 : $3), $4, $5, $6
    }
}' all_introns.bed > proximal_introns_500bp_end.bed

# 6. Combine the start and end proximal regions and merge any overlapping intervals
#    -s ensures merging is strand-specific. Sorting by chromosome, start, and strand is crucial for bedtools merge.
cat proximal_introns_500bp_start.bed proximal_introns_500bp_end.bed \
    | sort -k1,1 -k2,2n -k6,6 \
    | bedtools merge -s -i - > "$OUTPUT_BED"

# 7. Clean up intermediate files
rm transcripts.bed exons.bed all_introns.bed proximal_introns_500bp_start.bed proximal_introns_500bp_end.bed

View on GitHub

Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput.

scripts/calculate_enrichment.py (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Installation (conceptual, assuming a Conda environment for skipper):
# git clone https://github.com/yeolab/skipper.git
# cd skipper
# conda env create -f environment.yaml
# conda activate skipper_env

# Define placeholder paths for input and output files
# IP_BAM: Immunoprecipitation (IP) aligned reads in BAM format
# SMINPUT_BAM: Size-matched input (SMInput) aligned reads in BAM format
# RAW_PEAK_BED: Peaks identified in the IP sample (e.g., from CLIPper)
# OUTPUT_NORMALIZED_PEAK_BED: Output file for peaks with normalized scores
# CHROM_SIZES: File containing chromosome names and their sizes for the reference genome
IP_BAM="data/aligned/sample_ip.bam"
SMINPUT_BAM="data/aligned/sample_sm_input.bam"
RAW_PEAK_BED="results/peaks/sample_raw_peaks.bed"
OUTPUT_NORMALIZED_PEAK_BED="results/peaks/sample_normalized_peaks.bed"
CHROM_SIZES="references/hg38.chrom.sizes"

# The 'scripts/calculate_enrichment.py' script from the 'yeolab/skipper' pipeline
# performs the normalization described. It internally calculates usable reads
# from the provided BAM files and uses them to normalize peak scores (e.g., fold enrichment).
# The normalization factor is derived from the ratio of usable reads between IP and SMInput.

python skipper/scripts/calculate_enrichment.py \
  -i "$RAW_PEAK_BED" \
  -b "$IP_BAM" \
  -c "$SMINPUT_BAM" \
  -o "$OUTPUT_NORMALIZED_PEAK_BED" \
  -s "$CHROM_SIZES"

View on GitHub

Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5).

eCLIP v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already installed)
# pip install clipper # Or clone from GitHub and install
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Define input files and parameters
ECLIP_BAM="path/to/your/eclip_sample.bam"
SMINPUT_BAM="path/to/your/sm_input_control.bam"
OUTPUT_PREFIX="eclip_peaks"
FOLD_ENRICHMENT=8
P_VALUE=0.001 # 10^-3

# Reference genome size file (e.g., for hg38)
# This file typically contains chromosome names and their lengths.
# Example: chr1    248956422
#          chr2    242193529
# You can generate this from a .fai file of your reference genome:
# samtools faidx <reference.fasta>
# cut -f1,2 <reference.fasta>.fai > genome_sizes.txt
GENOME_SIZE_FILE="path/to/your/genome_sizes.txt"

# Ensure the output directory exists
mkdir -p ./clipper_output

# Run clipper
clipper -i "${ECLIP_BAM}" \
        -c "${SMINPUT_BAM}" \
        -o "./clipper_output/${OUTPUT_PREFIX}" \
        -s "${GENOME_SIZE_FILE}" \
        -fe "${FOLD_ENRICHMENT}" \
        -p "${P_VALUE}"

View on GitHub

Library strategy: enhanced CLIP

eCLIP vCWL workflow (yeolab/eclip) GitHub

$ Bash example

# --- Configuration ---
# Placeholder for reference genome and annotation. For human, hg38 is the latest assembly.
GENOME_FASTA="/path/to/hg38.fa"
GENOME_DIR="/path/to/STAR_genome_index/hg38" # Pre-built STAR index
GTF_FILE="/path/to/hg38.gtf"
BLACKLIST_BED="/path/to/hg38_blacklist.bed" # ENCODE blacklist regions

# Input FASTQ files (assuming paired-end for eCLIP, with replicates and input control)
READ1_IP_REP1="ip_replicate1_R1.fastq.gz"
READ2_IP_REP1="ip_replicate1_R2.fastq.gz"
READ1_IP_REP2="ip_replicate2_R1.fastq.gz"
READ2_IP_REP2="ip_replicate2_R2.fastq.gz"
READ1_INPUT="input_control_R1.fastq.gz"
READ2_INPUT="input_control_R2.fastq.gz"

OUTPUT_DIR="eclip_analysis"
mkdir -p "$OUTPUT_DIR"

# --- 1. Alignment with STAR (example for one IP replicate) ---
# conda install -c bioconda star
STAR --genomeDir "$GENOME_DIR" \
     --readFilesIn "$READ1_IP_REP1" "$READ2_IP_REP1" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/ip_rep1_star_aligned_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.5 \
     --outFilterMatchNminOverLread 0.5 \
     --alignIntronMax 1 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 60000000000 # Adjust based on available RAM (e.g., 60GB)

ALIGNED_BAM_IP_REP1="${OUTPUT_DIR}/ip_rep1_star_aligned_Aligned.sortedByCoord.out.bam"
# Similar STAR commands would be run for IP_REP2 and INPUT control samples.

# --- 2. Peak Calling with CLIPper (example for one IP replicate) ---
# conda install -c bioconda clipper
# Note: CLIPper often takes a control BAM for background subtraction in a full pipeline.
# This example shows calling peaks on IP only. A full pipeline would align and call peaks on input control as well.
clipper -b "$ALIGNED_BAM_IP_REP1" \
        -s hg38 \
        -o "${OUTPUT_DIR}/ip_rep1_clipper_peaks.bed" \
        -p 0.01 \
        -t 10 # Number of threads

# Assume similar steps for IP_REP2 and INPUT to get:
# ALIGNED_BAM_IP_REP2, ALIGNED_BAM_INPUT
# IP_REP2_PEAKS="${OUTPUT_DIR}/ip_rep2_clipper_peaks.bed"
# INPUT_PEAKS="${OUTPUT_DIR}/input_clipper_peaks.bed"

# --- 3. Identifying Reproducible Peaks (IDR) with merge_peaks ---
# This step requires peak files from multiple IP replicates and an input control.
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks && python setup.py install # Or ensure script is in PATH

# Placeholder for actual peak files from previous steps
# REP1_PEAKS="${OUTPUT_DIR}/ip_rep1_clipper_peaks.bed"
# REP2_PEAKS="${OUTPUT_DIR}/ip_rep2_clipper_peaks.bed"
# CONTROL_PEAKS="${OUTPUT_DIR}/input_clipper_peaks.bed" # Peaks from input control

# python /path/to/merge_peaks/merge_peaks.py \
#     -i "$REP1_PEAKS" "$REP2_PEAKS" \
#     -c "$CONTROL_PEAKS" \
#     -o "${OUTPUT_DIR}/eclip_reproducible_peaks.bed" \
#     -g hg38 \
#     --min_overlap 0.5 # Example parameter, adjust as needed

View on GitHub

Tools Used

STAR eCLIP

Raw Source Text

Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013).
Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.
Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron. Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction. Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput. Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5).
Assembly: hg19
Supplementary files format and content: bigwig files for input and IP samples as well as beds files for significant eCLIP peaks per experimental group
Library strategy: enhanced CLIP

← Back to Analysis