GSE221870 Processing Pipeline
Publication
HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence.Molecular cell (2023) — PMID 37421941
Dataset
GSE221870HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence
Processing Steps
Generate Jupyter Notebook-
1
Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013).
STAR v2.4.0i$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Placeholder for RepBase FASTA file (version 18.05, human-specific) # RepBase requires a license for download. This path assumes the file is already available. GENOME_FASTA="repbase_18.05_human.fasta" GENOME_DIR="star_index_repbase_18.05" # Placeholder for adapter-trimmed reads (from Cutadapt output) # Assuming paired-end reads, adjust if single-end READS_R1="trimmed_reads_R1.fastq.gz" READS_R2="trimmed_reads_R2.fastq.gz" OUTPUT_PREFIX="mapped_repbase" THREADS=8 # Placeholder for number of threads # 1. Build STAR index for RepBase (if not already built) # The --genomeSAindexNbases parameter is adjusted for smaller/repetitive genomes like RepBase STAR --runMode genomeGenerate \ --genomeDir $GENOME_DIR \ --genomeFastaFiles $GENOME_FASTA \ --runThreadN $THREADS \ --genomeSAindexNbases 10 # 2. Map adapter-trimmed reads to RepBase using STAR # --alignIntronMax 1 is crucial for repetitive elements as they typically do not contain introns, # effectively disabling splice-aware alignment. # --outFilterMultimapNmax 100 allows reads to map to multiple locations, common for repetitive elements. STAR --genomeDir $GENOME_DIR \ --readFilesIn $READS_R1 $READS_R2 \ --runThreadN $THREADS \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 100 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --alignIntronMax 1 \ --alignMatesGapMax 1000000 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 -
2
Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables GENOME_DIR="/path/to/STAR_index/hg19" # Placeholder for hg19 STAR index INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for input FASTQ file OUTPUT_PREFIX="aligned_reads" # Prefix for output files NUM_THREADS=8 # Number of threads to use # Check if STAR index exists (optional, for robustness) if [ ! -d "${GENOME_DIR}" ]; then echo "Error: STAR genome index not found at ${GENOME_DIR}" echo "Please build the STAR index for hg19 or provide the correct path." # Example for building index (requires hg19 FASTA and GTF) # STAR --runThreadN ${NUM_THREADS} --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/hg19.gtf \ # --sjdbOverhang 100 # Adjust sjdbOverhang based on read length - 1 exit 1 fi # Run STAR alignment # Parameters: # --genomeDir: Path to the STAR genome index # --readFilesIn: Input FASTQ file(s). Use 'zcat' with --readFilesCommand for gzipped files. # --outFileNamePrefix: Prefix for all output files # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file # --outFilterMultimapNmax 1: Keep only uniquely mapping reads (addresses "repeat-mapping reads were removed") # --outFilterType BySJout: Recommended for RNA-seq, filters out reads with non-canonical junctions # --outFilterMismatchNmax 999: Maximum number of mismatches per read (allows many) # --outFilterMismatchNoverLmax 0.04: Maximum fraction of mismatches relative to read length # --alignIntronMin 20: Minimum intron length # --alignIntronMax 1000000: Maximum intron length # --alignMatesGapMax 1000000: Maximum genomic distance between mates # --sjdbScore 1: Score for novel splice junctions # --readFilesCommand zcat: Command to decompress gzipped input files # --runThreadN: Number of threads to use STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_FASTQ}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterType BySJout \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --sjdbScore 1 \ --readFilesCommand zcat \ --runThreadN "${NUM_THREADS}" -
3
Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Path to the STAR genome index for hg19 (e.g., built with GENCODE v19 annotation) # Example: /path/to/STAR_index/hg19_gencode_v19 GENOME_DIR="/path/to/STAR_index/hg19_gencode_v19" # Input FASTQ file containing reads after repeat-mapping reads were removed READS="filtered_reads.fastq.gz" # Prefix for output files (e.g., mapped_reads_Aligned.sortedByCoord.out.bam) OUTPUT_PREFIX="mapped_reads_" # Number of threads to use THREADS=8 # Adjust as needed # Maximum RAM for sorting BAM (e.g., 30GB, adjust based on available memory) LIMIT_BAM_SORT_RAM=30000000000 # Optional: Example command to build STAR genome index (run once per genome/annotation) # STAR --runThreadN ${THREADS} --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \ # --sjdbOverhang 100 # Recommended: read length - 1 # Map reads to the human genome (hg19) with STAR STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS}" \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM "${LIMIT_BAM_SORT_RAM}" -
4
PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.
$ Bash example
# Install umi_tools (e.g., using conda) # conda install -c bioconda umi-tools=1.1.2 # This command assumes that UMIs have already been extracted from the 5' adapter # and appended to the read IDs (e.g., by a preceding 'umi_tools extract' step). # The 'directional' method is generally recommended for UMI-based deduplication. # Replace 'aligned.bam' with your actual input BAM file and 'deduplicated.bam' with your desired output file name. umi_tools dedup \ --method=directional \ --extract-umi-method=read_id \ --umi-separator='_' \ -I aligned.bam \ -S deduplicated.bam
-
5
Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron.
CLIPper vLovci et al., 2013 implementation$ Bash example
# Clone the CLIPper repository if not already available # git clone https://github.com/yeolab/clipper.git # cd clipper # It is recommended to use a Python 2.7 environment for CLIPper scripts. # conda create -n clipper_env python=2.7 # conda activate clipper_env # Define input and output file names INPUT_BAM="input_usable_reads.bam" # Placeholder for the input BAM file containing usable reads OUTPUT_PEAKS_BED="clipper_peaks.bed" ANNOTATED_PEAKS_TSV="clipper_annotated_peaks.tsv" # Download GENCODE v19 annotation for hg19 if not already available # This GTF file is required for assigning peaks to gene regions. # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz GTF_FILE="gencode.v19.annotation.gtf" # Step 1: Call peaks using CLIPper # The 'clipper.py' script should be in the current directory or in your PATH. # This command calls peaks from the input BAM file and outputs them in BED format. python clipper.py -s "${INPUT_BAM}" -o "${OUTPUT_PEAKS_BED}" # Step 2: Assign peaks to gene regions using assign_peaks_to_genes.py # The 'assign_peaks_to_genes.py' script should be in the current directory or in your PATH. # This command takes the called peaks and the GENCODE GTF, then assigns peaks to gene regions # based on the specified priority order: CDS, 5'UTR, 3'UTR, proximal intron, and distal intron. python assign_peaks_to_genes.py \ -g "${GTF_FILE}" \ -p "${OUTPUT_PEAKS_BED}" \ -o "${ANNOTATED_PEAKS_TSV}" \ --priority_order "CDS,5UTR,3UTR,proximal_intron,distal_intron" -
6
Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction.
$ Bash example
# Define reference files (replace with actual paths) # Chromosome sizes file (e.g., from UCSC or samtools faidx): CHROM_SIZES="GRCh38.chrom.sizes" # --- Step 0: Obtain a BED file of introns --- # The description defines "proximal intron regions", implying that intron coordinates # are already known. For this code block, we assume 'introns.bed' is available. # # Example of how 'introns.bed' could be generated from a GTF file: # (This is a complex process and often involves custom scripts or tools like gffread/bedops) # # # 1. Extract exons from GTF and convert to BED format (requires gtf2bed from bedops suite or custom script) # # awk '$3 == "exon"' "Homo_sapiens.GRCh38.109.gtf" | gtf2bed - > exons.bed # # # 2. Sort and merge exons by transcript to get contiguous exon blocks # # bedtools sort -i exons.bed | bedtools merge -s -c 4,5,6 -o distinct,first,first -i - > merged_exons.bed # # # 3. Generate introns by finding gaps between merged exons within each transcript (requires custom scripting) # # For simplicity, let's assume 'introns.bed' is pre-generated. # # # Placeholder for introns.bed (replace with your actual intron coordinates) # # Example format: chr\tstart\tend\tname\tscore\tstrand INTRONS_BED="introns.bed" # --- Step 1: Define proximal intron regions (extending up to 500 bp from exon-intron junctions) --- # This involves taking the first 500bp and the last 500bp of each intron. # Get the 5' proximal 500 bp of each intron # For '+' strand introns, this is the region from the start coordinate to start+500. # For '-' strand introns, this is the region from end-500 to the end coordinate. bedtools flank -i "$INTRONS_BED" -g "$CHROM_SIZES" -l 500 -r 0 -s > proximal_5prime_introns.bed # Get the 3' proximal 500 bp of each intron # For '+' strand introns, this is the region from end-500 to the end coordinate. # For '-' strand introns, this is the region from the start coordinate to start+500. bedtools flank -i "$INTRONS_BED" -g "$CHROM_SIZES" -l 0 -r 500 -s > proximal_3prime_introns.bed # Combine the 5' and 3' proximal regions # Sort and merge any overlapping regions to get the final set of unique proximal intron regions. cat proximal_5prime_introns.bed proximal_3prime_introns.bed | \ bedtools sort -i - | \ bedtools merge -i - > proximal_intron_regions_500bp.bed # Clean up intermediate files (uncomment if desired) # rm proximal_5prime_introns.bed proximal_3prime_introns.bed -
7
Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput.
$ Bash example
# This script performs peak normalization based on usable read counts from IP and SMInput samples. # The number of usable reads for IP and SMInput would typically be obtained from alignment statistics # (e.g., from samtools flagstat output or custom scripts that count uniquely mapped reads). # Placeholder for usable read counts (replace with actual values from your data) # Example: usable_reads_ip=$(cat path/to/ip_usable_reads.txt) # Example: usable_reads_sminput=$(cat path/to/sminput_usable_reads.txt) usable_reads_ip=10000000 # Example: Total usable reads from IP sample usable_reads_sminput=5000000 # Example: Total usable reads from SMInput sample # Input peak file (e.g., a BED file where the 5th column represents the peak score/count) input_peaks_file="input_peaks.bed" # Output normalized peak file output_normalized_peaks_file="normalized_peaks.bed" # Calculate the normalization factor: # fraction = (usable reads from immunoprecipitation) / (usable reads from SMInput) normalization_factor=$(awk "BEGIN { print $usable_reads_ip / $usable_reads_sminput }") echo "Normalization factor calculated: $normalization_factor" # Normalize each peak score in the input BED file. # This command assumes the 5th column of the BED file contains the score to be normalized. # It multiplies the existing score by the calculated normalization factor. awk -v factor="$normalization_factor" 'OFS="\t" { $5 = $5 * factor; print }' "$input_peaks_file" > "$output_normalized_peaks_file" echo "Peaks normalized and saved to $output_normalized_peaks_file" -
8
Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5).
$ Bash example
# Install clipper (assuming Python and pip are available) # git clone https://github.com/yeolab/clipper.git # cd clipper # pip install . # Placeholder for input files and genome # Replace with actual paths to your eCLIP BAM, SMInput BAM, and reference genome FASTA. # Example: hg38.fa for human genome. ECLIP_BAM="path/to/your/eclip_sample.bam" SMINPUT_BAM="path/to/your/sminput_sample.bam" GENOME_FASTA="path/to/your/genome.fa" OUTPUT_PREFIX="eclip_peaks" # Execute clipper for peak calling # Peaks were deemed significant at 8-fold enrichment and p-value 10-3 # (Chi-squared test, or Fisher’s exact test if the observed or expected read number in eCLIP or SMInput was below 5). # clipper automatically handles switching between Chi-squared and Fisher's exact test for low counts. python clipper/clipper.py \ -f 8 \ -p 0.001 \ "${ECLIP_BAM}" \ "${SMINPUT_BAM}" \ "${GENOME_FASTA}" \ "${OUTPUT_PREFIX}"
Raw Source Text
Reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i)(Dobin et al., 2013). Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ. Peaks were called on the usable reads by CLIPper (Lovci et al., 2013) and assigned to gene regions annotated in GENCODE (hg19) with the following descending priority order: CDS, 5âUTR, 3âUTR, proximal intron, and distal intron. Proximal intron regions are defined as extending up to 500 bp from an exon-intron junction. Each peak was normalized to the size-matched input (SMInput) by calculating the fraction of the number of usable reads from immunoprecipitation to that of the usable reads from the SMInput. Peaks were deemed significant at 8-fold enrichment and p-value10-3(Chi-squared test, or Fisherâs exact test if the observed or expected read number in eCLIP or SMInput was below 5). Assembly: hg19 Supplementary files format and content: bigwig files for input and IP samples as well as beds files for significant eCLIP peaks per experimental group