GSE230097 Processing Pipeline
RNA-Seq
code_examples
6 steps
Publication
ePRINT: exonuclease assisted mapping of protein-RNA interactions.Genome biology (2024) — PMID 38807229
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Adapters were removes using cutadapt and reads were mapped to hg19 using STAR.
$ Bash example
# Install cutadapt (e.g., using conda) # conda install -c bioconda cutadapt # Define variables INPUT_READS="input_reads.fastq.gz" TRIMMED_READS="trimmed_reads.fastq.gz" OUTPUT_DIR="star_output" STAR_INDEX_DIR="/path/to/hg19_STAR_index" # Placeholder for hg19 STAR index. Ensure this index is pre-built. NUM_THREADS=8 # Placeholder for number of threads # 1. Remove adapters using cutadapt # Using a common Illumina universal adapter sequence for single-end reads. # Adjust -a for specific adapters if known, or use -A for paired-end reads. # -q 20: Trim low-quality ends (below Q20) # -m 20: Discard reads shorter than 20 bp after trimming cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -q 20 -m 20 -o "${TRIMMED_READS}" "${INPUT_READS}" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Install STAR (e.g., using conda) # conda install -c bioconda star # 2. Map reads to hg19 using STAR # --genomeDir: Path to the STAR genome index for hg19 # --readFilesIn: Input FASTQ file (trimmed reads) # --outFileNamePrefix: Prefix for output files (e.g., aligned_Log.out, aligned_Aligned.sortedByCoordinate.out.bam) # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file # --runThreadN: Number of threads to use # --readFilesCommand zcat: Command to decompress gzipped input files STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${TRIMMED_READS}" \ --outFileNamePrefix "${OUTPUT_DIR}/aligned_" \ --outSAMtype BAM SortedByCoordinate \ --runThreadN "${NUM_THREADS}" \ --readFilesCommand zcat -
2
Initial peak calling was performed for each sample by CLIPper using only the r1 reads.
$ Bash example
# Clone the CLIPper repository # git clone https://github.com/yeolab/clipper.git # cd clipper # CLIPper requires Python 2.7 or 3.x and several libraries. It's recommended to set up a virtual environment. # Example dependencies (may vary, check CLIPper's documentation or requirements.txt if available): # pip install numpy scipy pysam pybedtools # Placeholder for reference genome and blacklist # Replace with actual paths to your reference files GENOME_FASTA="/path/to/your/reference/genome/hg38.fa" BLACKLIST_BED="/path/to/your/blacklist/hg38_blacklist.bed" # Ensure these files are properly indexed if required by CLIPper (e.g., .fai for fasta) # Input BAM file: The description specifies "using only the r1 reads". # This means the input BAM should already contain only R1 reads. # If starting from a paired-end BAM (e.g., aligned.bam), you might first filter for R1: # samtools view -b -f 0x40 aligned.bam > input_sample_r1.bam INPUT_BAM="input_sample_r1.bam" OUTPUT_DIR="clipper_peaks_output" SAMPLE_NAME="sample_name" # Used for output file naming # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Execute CLIPper for initial peak calling # Common parameters for CLIPper: # -i: Input BAM file (assumed to contain only R1 reads) # -g: Reference genome FASTA file # -b: Blacklist BED file # -o: Output directory # -p: P-value threshold (default 0.01) # -f: Fold enrichment threshold (default 2.0) # -a: Adjusted p-value threshold (default 0.05) # --output-prefix: Prefix for output files (e.g., sample_name_peaks.bed) python clipper.py \ -i "${INPUT_BAM}" \ -g "${GENOME_FASTA}" \ -b "${BLACKLIST_BED}" \ -o "${OUTPUT_DIR}" \ -p 0.01 \ -f 2.0 \ -a 0.05 \ --output-prefix "${SAMPLE_NAME}" -
3
Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER.
HOMER v4.12$ Bash example
# Install HOMER (if not already installed) # conda install -c bioconda homer # Or manual installation: # wget http://homer.ucsd.edu/homer/configureHomer.pl # perl configureHomer.pl -install # Define input peak files. These should be in BED format or HOMER's native peak format. # Replace with actual file paths for your ePRINT samples. # Example: # peak_file_1="path/to/sample1_ePRINT_peaks.bed" # peak_file_2="path/to/sample2_ePRINT_peaks.bed" # peak_file_3="path/to/sample3_ePRINT_peaks.bed" # Merge peaks from at least two ePRINT samples to define a candidate peak superset. # -bed: Assumes input files are in BED format. Adjust if using HOMER's native format. # -d 0: Merges only directly overlapping peaks (distance 0), which is a common interpretation of "genomic overlap". # -min 2: Only reports merged peaks that overlap with at least 2 input peak files, fulfilling the "at least two ePRINT samples" criteria. # Replace sampleX_ePRINT_peaks.bed with your actual input peak files. mergePeaks -bed -d 0 -min 2 sample1_ePRINT_peaks.bed sample2_ePRINT_peaks.bed sample3_ePRINT_peaks.bed > candidate_peak_superset.bed
-
4
These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples.
$ Bash example
# Install clipper if not already installed # pip install clipper # or # conda install -c bioconda clipper # Define input and output files EPRINT_BAM="ePRINT_sample.bam" # Placeholder for aligned ePRINT reads (IP sample) INPUT_BAM="corresponding_input_sample.bam" # Placeholder for aligned corresponding input reads OUTPUT_PREFIX="ePRINT_enriched_peaks" # Prefix for output peak files GENOME_SIZE="hs" # Placeholder: 'hs' for human, 'mm' for mouse, or a numerical value (e.g., 2.7e9 for hg38). # Run clipper to identify significantly enriched peaks in ePRINT samples compared to input samples. # The tool performs statistical testing to retain regions significantly enriched. clipper -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}" "${EPRINT_BAM}" "${INPUT_BAM}" -
5
The remaining high confidence peaks were then input to DESEQ2 for differential analysis.
$ Bash example
# Install DESeq2 (if not already installed) # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("DESeq2") # Create a placeholder R script for DESeq2 differential analysis cat << 'EOF' > run_deseq2.R library(DESeq2) # --- Configuration --- # # Input count matrix file (e.g., from featureCounts, htseq-count, or peak quantification) # Rows are features (peaks), columns are samples. count_file <- "peak_counts.tsv" # Sample metadata file # Must contain 'sample_id' column matching count matrix column names, # and a 'condition' column for differential analysis. metadata_file <- "sample_metadata.tsv" # Output directory for results output_dir <- "deseq2_results" dir.create(output_dir, showWarnings = FALSE) # Define the design formula (e.g., ~ condition) # Adjust based on your experimental design design_formula <- ~ condition # Define the contrast for differential analysis (e.g., c("condition", "treated", "control")) # This will compare 'treated' vs 'control' contrast_factors <- c("condition", "treated", "control") # --- Load Data --- # counts_data <- read.delim(count_file, row.names = 1, sep = "\t", header = TRUE) metadata <- read.delim(metadata_file, row.names = 1, sep = "\t", header = TRUE) # Ensure sample names match and are in the same order common_samples <- intersect(colnames(counts_data), rownames(metadata)) counts_data <- counts_data[, common_samples] metadata <- metadata[common_samples, ] # Ensure count data is integer matrix counts_data <- as.matrix(counts_data) mode(counts_data) <- "integer" # Ensure condition is a factor metadata$condition <- factor(metadata$condition) # --- Create DESeqDataSet object --- # dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = metadata, design = design_formula) # --- Run DESeq2 differential analysis --- # dds <- DESeq(dds) # --- Extract and save results --- # res <- results(dds, contrast = contrast_factors) # Order results by adjusted p-value res_ordered <- res[order(res$padj), ] # Save full results write.csv(as.data.frame(res_ordered), file.path(output_dir, "deseq2_differential_results.csv")) # Save significant results (e.g., padj < 0.05) significant_res <- subset(res_ordered, padj < 0.05) write.csv(as.data.frame(significant_res), file.path(output_dir, "deseq2_significant_results.csv")) message("DESeq2 analysis complete. Results saved in ", output_dir) EOF # Create placeholder input files (replace with your actual data) # Example peak_counts.tsv cat << 'EOF_COUNTS' > peak_counts.tsv peak_id sample1 sample2 sample3 sample4 peak1 100 150 50 75 peak2 20 30 10 15 peak3 500 600 200 250 peak4 10 12 5 6 peak5 80 90 40 45 EOF_COUNTS # Example sample_metadata.tsv cat << 'EOF_METADATA' > sample_metadata.tsv sample_id condition sample1 treated sample2 treated sample3 control sample4 control EOF_METADATA # Execute the R script Rscript run_deseq2.R -
6
Bedgraph files were generated from the r1 reads using bedtools
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools=2.30.0 # Define input and output files INPUT_BAM="r1.bam" # Assuming 'r1 reads' refers to aligned reads in BAM format OUTPUT_BEDGRAPH="r1.bedgraph" GENOME_CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes (e.g., from UCSC) # Generate bedgraph file from aligned reads bedtools genomeCoverageBed -ibam ${INPUT_BAM} -bg -g ${GENOME_CHROM_SIZES} > ${OUTPUT_BEDGRAPH}
Raw Source Text
Adapters were removes using cutadapt and reads were mapped to hg19 using STAR. Initial peak calling was performed for each sample by CLIPper using only the r1 reads. Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER. These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples. The remaining high confidence peaks were then input to DESEQ2 for differential analysis. Bedgraph files were generated from the r1 reads using bedtools Assembly: hg19 Supplementary files format and content: bedgraph