GSE201895 Processing Pipeline
Publication
MECP2-related pathways are dysregulated in a cortical organoid model of myotonic dystrophy.Science translational medicine (2022) — PMID 35767654
Dataset
GSE201895MECP2-related pathways are dysregulated in a cortical organoid model of Myotonic dystrophy [eCLIP]
Processing Steps
Generate Jupyter Notebook-
1
Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Prepare STAR index for RepBase v18.05 (conceptual step, assuming index is pre-built) --- # This step is typically done once for a given reference. # Assuming 'repbase_18.05_human.fa' contains the human-specific repetitive elements from RepBase v18.05. # STAR --runThreadN 8 --runMode genomeGenerate --genomeDir repbase_index_18.05 --genomeFastaFiles repbase_18.05_human.fa # --- Mapping reads to RepBase v18.05 using STAR --- # Define variables for clarity GENOME_DIR="repbase_index_18.05" # Directory containing the STAR index for RepBase v18.05 READS_R1="trimmed_reads_R1.fastq.gz" # Path to adapter-trimmed forward reads READS_R2="trimmed_reads_R2.fastq.gz" # Path to adapter-trimmed reverse reads (assuming paired-end reads) OUTPUT_PREFIX="mapped_to_repbase" # Prefix for output files (e.g., mapped_to_repbaseAligned.sortedByCoord.out.bam) STAR --runThreadN 8 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --alignIntronMax 1 \ --alignEndsType Local \ --outFilterMultimapNmax 100 \ --outFilterMismatchNmax 10 \ --outFilterMismatchNoverLmax 0.1 \ --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting (adjust based on available memory) -
2
Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Placeholder for the STAR genome index directory for hg19. # This directory must contain the pre-built STAR index files (e.g., SA, SAindex, genome, chrName.txt, etc.). # To build the index, you would typically run: # STAR --runThreadN <num_threads> --runMode genomeGenerate \ # --genomeDir /path/to/STAR_index/hg19 \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/hg19.gtf \ # --sjdbOverhang 100 # Adjust sjdbOverhang based on read length GENOME_DIR="/path/to/STAR_index/hg19_GRCh37" # Input FASTQ files. Adjust for single-end reads if necessary. READ1="input_reads_R1.fastq.gz" READ2="input_reads_R2.fastq.gz" # Prefix for output files (e.g., aligned_reads_Aligned.sortedByCoord.out.bam) OUTPUT_PREFIX="aligned_reads" # Number of threads to use for STAR THREADS=8 # Run STAR alignment # The description states "Repeat-mapping reads were removed" prior to mapping. # This implies the input FASTQ files are already filtered for such reads. # STAR will then map the remaining reads to the human genome assembly (hg19). # The --outFilterMultimapNmax parameter controls the maximum number of loci # a read is allowed to map to. A value of 1 would mean only uniquely mapping reads. # The default is 20. Here, we use 20 as a common setting, assuming pre-filtering # handled the "repeat-mapping reads" definition. STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outSAMunmapped Within -
3
PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.
$ Bash example
# Install umi_tools (e.g., via conda) # conda install -c bioconda umi-tools=1.1.2 # This command removes PCR duplicate reads based on UMI sequences. # It assumes UMIs have been extracted to the read names (e.g., by a prior umi_tools extract step) # from the 5' adapter region, as is common practice in eCLIP pipelines. # The --method=unique ensures only one read per unique UMI and mapping position is retained. # --paired-end is used as eCLIP data is typically paired-end. # Replace input.bam with your aligned BAM file and output.bam with your desired output file name. umi_tools dedup \ --extract-umis-from-read-names \ --method=unique \ --paired-end \ -I input.bam \ -S output.bam \ --output-stats output.dedup.stats -
4
Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron.
$ Bash example
# Install CLIPper (if not already installed) # git clone https://github.com/yeolab/clipper.git # cd clipper # # Ensure Python dependencies are met, e.g., numpy, scipy # Download GENCODE hg19 annotation (release 19 is commonly used with hg19 for hg19 builds) # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz # Define variables INPUT_BAM="path/to/usable_reads.bam" # Placeholder for the input BAM file containing usable reads OUTPUT_BED="clipper_peaks.bed" # Output file for called peaks GENCODE_GTF="gencode.v19.annotation.gtf" # GENCODE (hg19) annotation file GENOME_SIZE="hs" # Human genome size abbreviation for CLIPper (hg19) # Execute CLIPper to call peaks # The '-s hs' parameter tells CLIPper to use human genome specific settings for peak calling. # The '-g' parameter provides gene annotations which CLIPper can use for filtering or context. python CLIPper.py -b "${INPUT_BAM}" -g "${GENCODE_GTF}" -s "${GENOME_SIZE}" -o "${OUTPUT_BED}" # Note on gene region assignment: # The description mentions peaks were "assigned to gene regions annotated in GENCODE (hg19) # with the following descending priority order: CDS, 5’UTR, 3’UTR, proximal intron, and distal intron." # This hierarchical assignment is a downstream annotation step that typically uses a separate script # or tool (e.g., bedtools, custom Python script) to intersect the CLIPper output BED file # with the GENCODE GTF/BED annotation and apply the specified priority logic. CLIPper itself # outputs peak coordinates and does not directly perform this complex hierarchical assignment. -
5
Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction.
$ Bash example
# This script defines proximal intron regions as extending up to 500 bp from an exon-intron junction. # It requires a GTF annotation file for gene features. # --- Configuration Variables --- # Replace with your actual GTF file path (e.g., Ensembl GRCh38, release 109) GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Output BED file for the defined proximal intron regions OUTPUT_BED="proximal_introns_500bp.bed" # --- Installation (commented out) --- # # Install bedtools if not already available # conda install -c bioconda bedtools # --- Main Workflow --- # 1. Extract all transcripts as BED format (0-based start, 1-based end) # Fields: chr, start, end, name, score, strand awk ' BEGIN { OFS="\t" } $3 == "transcript" { print $1, $4-1, $5, "transcript_" NR, ".", $7 }' "$GTF_FILE" > transcripts.bed # 2. Extract all exons as BED format awk ' BEGIN { OFS="\t" } $3 == "exon" { print $1, $4-1, $5, "exon_" NR, ".", $7 }' "$GTF_FILE" > exons.bed # 3. Subtract exons from transcripts to get all introns # This generates a BED file where each entry is an intron. The strand information is preserved. bedtools subtract -a transcripts.bed -b exons.bed > all_introns.bed # 4. Define the first 500bp of each intron (proximal to the upstream exon-intron junction) # Handles strand correctly: for '+' strand introns, it takes from the start; for '-' strand introns, # it takes from the end (which is the 5' end of the intron on the reverse strand). awk 'BEGIN {OFS="\t"} { if ($6 == "+") { # Forward strand intron # Take from start (0-based) up to 500bp or intron end, whichever is smaller print $1, $2, ($2+500 < $3 ? $2+500 : $3), $4, $5, $6 } else { # Reverse strand intron # Take from end (0-based) back 500bp or intron start, whichever is larger print $1, ($3-500 > $2 ? $3-500 : $2), $3, $4, $5, $6 } }' all_introns.bed > proximal_introns_500bp_start.bed # 5. Define the last 500bp of each intron (proximal to the downstream exon-intron junction) # Handles strand correctly: for '+' strand introns, it takes from the end; for '-' strand introns, # it takes from the start (which is the 3' end of the intron on the reverse strand). awk 'BEGIN {OFS="\t"} { if ($6 == "+") { # Forward strand intron # Take from end (0-based) back 500bp or intron start, whichever is larger print $1, ($3-500 > $2 ? $3-500 : $2), $3, $4, $5, $6 } else { # Reverse strand intron # Take from start (0-based) up to 500bp or intron end, whichever is smaller print $1, $2, ($2+500 < $3 ? $2+500 : $3), $4, $5, $6 } }' all_introns.bed > proximal_introns_500bp_end.bed # 6. Combine the start and end proximal regions and merge any overlapping intervals # -s ensures merging is strand-specific. Sorting by chromosome, start, and strand is crucial for bedtools merge. cat proximal_introns_500bp_start.bed proximal_introns_500bp_end.bed \ | sort -k1,1 -k2,2n -k6,6 \ | bedtools merge -s -i - > "$OUTPUT_BED" # 7. Clean up intermediate files rm transcripts.bed exons.bed all_introns.bed proximal_introns_500bp_start.bed proximal_introns_500bp_end.bed -
6
Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput.
$ Bash example
# Installation (conceptual, assuming a Conda environment for skipper): # git clone https://github.com/yeolab/skipper.git # cd skipper # conda env create -f environment.yaml # conda activate skipper_env # Define placeholder paths for input and output files # IP_BAM: Immunoprecipitation (IP) aligned reads in BAM format # SMINPUT_BAM: Size-matched input (SMInput) aligned reads in BAM format # RAW_PEAK_BED: Peaks identified in the IP sample (e.g., from CLIPper) # OUTPUT_NORMALIZED_PEAK_BED: Output file for peaks with normalized scores # CHROM_SIZES: File containing chromosome names and their sizes for the reference genome IP_BAM="data/aligned/sample_ip.bam" SMINPUT_BAM="data/aligned/sample_sm_input.bam" RAW_PEAK_BED="results/peaks/sample_raw_peaks.bed" OUTPUT_NORMALIZED_PEAK_BED="results/peaks/sample_normalized_peaks.bed" CHROM_SIZES="references/hg38.chrom.sizes" # The 'scripts/calculate_enrichment.py' script from the 'yeolab/skipper' pipeline # performs the normalization described. It internally calculates usable reads # from the provided BAM files and uses them to normalize peak scores (e.g., fold enrichment). # The normalization factor is derived from the ratio of usable reads between IP and SMInput. python skipper/scripts/calculate_enrichment.py \ -i "$RAW_PEAK_BED" \ -b "$IP_BAM" \ -c "$SMINPUT_BAM" \ -o "$OUTPUT_NORMALIZED_PEAK_BED" \ -s "$CHROM_SIZES"
-
7
Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5).
$ Bash example
# Install clipper (if not already installed) # pip install clipper # Or clone from GitHub and install # git clone https://github.com/yeolab/clipper.git # cd clipper # python setup.py install # Define input files and parameters ECLIP_BAM="path/to/your/eclip_sample.bam" SMINPUT_BAM="path/to/your/sm_input_control.bam" OUTPUT_PREFIX="eclip_peaks" FOLD_ENRICHMENT=8 P_VALUE=0.001 # 10^-3 # Reference genome size file (e.g., for hg38) # This file typically contains chromosome names and their lengths. # Example: chr1 248956422 # chr2 242193529 # You can generate this from a .fai file of your reference genome: # samtools faidx <reference.fasta> # cut -f1,2 <reference.fasta>.fai > genome_sizes.txt GENOME_SIZE_FILE="path/to/your/genome_sizes.txt" # Ensure the output directory exists mkdir -p ./clipper_output # Run clipper clipper -i "${ECLIP_BAM}" \ -c "${SMINPUT_BAM}" \ -o "./clipper_output/${OUTPUT_PREFIX}" \ -s "${GENOME_SIZE_FILE}" \ -fe "${FOLD_ENRICHMENT}" \ -p "${P_VALUE}" -
8
Library strategy: enhanced CLIP
$ Bash example
# --- Configuration --- # Placeholder for reference genome and annotation. For human, hg38 is the latest assembly. GENOME_FASTA="/path/to/hg38.fa" GENOME_DIR="/path/to/STAR_genome_index/hg38" # Pre-built STAR index GTF_FILE="/path/to/hg38.gtf" BLACKLIST_BED="/path/to/hg38_blacklist.bed" # ENCODE blacklist regions # Input FASTQ files (assuming paired-end for eCLIP, with replicates and input control) READ1_IP_REP1="ip_replicate1_R1.fastq.gz" READ2_IP_REP1="ip_replicate1_R2.fastq.gz" READ1_IP_REP2="ip_replicate2_R1.fastq.gz" READ2_IP_REP2="ip_replicate2_R2.fastq.gz" READ1_INPUT="input_control_R1.fastq.gz" READ2_INPUT="input_control_R2.fastq.gz" OUTPUT_DIR="eclip_analysis" mkdir -p "$OUTPUT_DIR" # --- 1. Alignment with STAR (example for one IP replicate) --- # conda install -c bioconda star STAR --genomeDir "$GENOME_DIR" \ --readFilesIn "$READ1_IP_REP1" "$READ2_IP_REP1" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_DIR}/ip_rep1_star_aligned_" \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterScoreMinOverLread 0.5 \ --outFilterMatchNminOverLread 0.5 \ --alignIntronMax 1 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 60000000000 # Adjust based on available RAM (e.g., 60GB) ALIGNED_BAM_IP_REP1="${OUTPUT_DIR}/ip_rep1_star_aligned_Aligned.sortedByCoord.out.bam" # Similar STAR commands would be run for IP_REP2 and INPUT control samples. # --- 2. Peak Calling with CLIPper (example for one IP replicate) --- # conda install -c bioconda clipper # Note: CLIPper often takes a control BAM for background subtraction in a full pipeline. # This example shows calling peaks on IP only. A full pipeline would align and call peaks on input control as well. clipper -b "$ALIGNED_BAM_IP_REP1" \ -s hg38 \ -o "${OUTPUT_DIR}/ip_rep1_clipper_peaks.bed" \ -p 0.01 \ -t 10 # Number of threads # Assume similar steps for IP_REP2 and INPUT to get: # ALIGNED_BAM_IP_REP2, ALIGNED_BAM_INPUT # IP_REP2_PEAKS="${OUTPUT_DIR}/ip_rep2_clipper_peaks.bed" # INPUT_PEAKS="${OUTPUT_DIR}/input_clipper_peaks.bed" # --- 3. Identifying Reproducible Peaks (IDR) with merge_peaks --- # This step requires peak files from multiple IP replicates and an input control. # git clone https://github.com/yeolab/merge_peaks.git # cd merge_peaks && python setup.py install # Or ensure script is in PATH # Placeholder for actual peak files from previous steps # REP1_PEAKS="${OUTPUT_DIR}/ip_rep1_clipper_peaks.bed" # REP2_PEAKS="${OUTPUT_DIR}/ip_rep2_clipper_peaks.bed" # CONTROL_PEAKS="${OUTPUT_DIR}/input_clipper_peaks.bed" # Peaks from input control # python /path/to/merge_peaks/merge_peaks.py \ # -i "$REP1_PEAKS" "$REP2_PEAKS" \ # -c "$CONTROL_PEAKS" \ # -o "${OUTPUT_DIR}/eclip_reproducible_peaks.bed" \ # -g hg38 \ # --min_overlap 0.5 # Example parameter, adjust as needed
Raw Source Text
Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013). Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ. Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron. Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction. Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput. Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5). Assembly: hg19 Supplementary files format and content: bigwig files for input and IP samples as well as beds files for significant eCLIP peaks per experimental group Library strategy: enhanced CLIP