GSE230097 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

ePRINT: exonuclease assisted mapping of protein-RNA interactions.

Genome biology (2024) — PMID 38807229

Dataset

Exonuclease assisted mapping of protein-RNA interactions (ePRINT)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Adapters were removes using cutadapt and reads were mapped to hg19 using STAR.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define variables
INPUT_READS="input_reads.fastq.gz"
TRIMMED_READS="trimmed_reads.fastq.gz"
OUTPUT_DIR="star_output"
STAR_INDEX_DIR="/path/to/hg19_STAR_index" # Placeholder for hg19 STAR index. Ensure this index is pre-built.
NUM_THREADS=8 # Placeholder for number of threads

# 1. Remove adapters using cutadapt
# Using a common Illumina universal adapter sequence for single-end reads.
# Adjust -a for specific adapters if known, or use -A for paired-end reads.
# -q 20: Trim low-quality ends (below Q20)
# -m 20: Discard reads shorter than 20 bp after trimming
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -q 20 -m 20 -o "${TRIMMED_READS}" "${INPUT_READS}"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Install STAR (e.g., using conda)
# conda install -c bioconda star

# 2. Map reads to hg19 using STAR
# --genomeDir: Path to the STAR genome index for hg19
# --readFilesIn: Input FASTQ file (trimmed reads)
# --outFileNamePrefix: Prefix for output files (e.g., aligned_Log.out, aligned_Aligned.sortedByCoordinate.out.bam)
# --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
# --runThreadN: Number of threads to use
# --readFilesCommand zcat: Command to decompress gzipped input files
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${TRIMMED_READS}" \
     --outFileNamePrefix "${OUTPUT_DIR}/aligned_" \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN "${NUM_THREADS}" \
     --readFilesCommand zcat

View on GitHub

Initial peak calling was performed for each sample by CLIPper using only the r1 reads.

CLIPper vfrom source GitHub

$ Bash example

# Clone the CLIPper repository
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# CLIPper requires Python 2.7 or 3.x and several libraries. It's recommended to set up a virtual environment.
# Example dependencies (may vary, check CLIPper's documentation or requirements.txt if available):
# pip install numpy scipy pysam pybedtools

# Placeholder for reference genome and blacklist
# Replace with actual paths to your reference files
GENOME_FASTA="/path/to/your/reference/genome/hg38.fa"
BLACKLIST_BED="/path/to/your/blacklist/hg38_blacklist.bed"
# Ensure these files are properly indexed if required by CLIPper (e.g., .fai for fasta)

# Input BAM file: The description specifies "using only the r1 reads".
# This means the input BAM should already contain only R1 reads.
# If starting from a paired-end BAM (e.g., aligned.bam), you might first filter for R1:
# samtools view -b -f 0x40 aligned.bam > input_sample_r1.bam
INPUT_BAM="input_sample_r1.bam"
OUTPUT_DIR="clipper_peaks_output"
SAMPLE_NAME="sample_name" # Used for output file naming

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute CLIPper for initial peak calling
# Common parameters for CLIPper:
# -i: Input BAM file (assumed to contain only R1 reads)
# -g: Reference genome FASTA file
# -b: Blacklist BED file
# -o: Output directory
# -p: P-value threshold (default 0.01)
# -f: Fold enrichment threshold (default 2.0)
# -a: Adjusted p-value threshold (default 0.05)
# --output-prefix: Prefix for output files (e.g., sample_name_peaks.bed)
python clipper.py \
    -i "${INPUT_BAM}" \
    -g "${GENOME_FASTA}" \
    -b "${BLACKLIST_BED}" \
    -o "${OUTPUT_DIR}" \
    -p 0.01 \
    -f 2.0 \
    -a 0.05 \
    --output-prefix "${SAMPLE_NAME}"

View on GitHub

Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER.

HOMER v4.12

$ Bash example

# Install HOMER (if not already installed)
# conda install -c bioconda homer
# Or manual installation:
# wget http://homer.ucsd.edu/homer/configureHomer.pl
# perl configureHomer.pl -install

# Define input peak files. These should be in BED format or HOMER's native peak format.
# Replace with actual file paths for your ePRINT samples.
# Example:
# peak_file_1="path/to/sample1_ePRINT_peaks.bed"
# peak_file_2="path/to/sample2_ePRINT_peaks.bed"
# peak_file_3="path/to/sample3_ePRINT_peaks.bed"

# Merge peaks from at least two ePRINT samples to define a candidate peak superset.
# -bed: Assumes input files are in BED format. Adjust if using HOMER's native format.
# -d 0: Merges only directly overlapping peaks (distance 0), which is a common interpretation of "genomic overlap".
# -min 2: Only reports merged peaks that overlap with at least 2 input peak files, fulfilling the "at least two ePRINT samples" criteria.
# Replace sampleX_ePRINT_peaks.bed with your actual input peak files.
mergePeaks -bed -d 0 -min 2 sample1_ePRINT_peaks.bed sample2_ePRINT_peaks.bed sample3_ePRINT_peaks.bed > candidate_peak_superset.bed

These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples.

clipper (Inferred with models/gemini-2.5-flash) v5.0 GitHub

$ Bash example

# Install clipper if not already installed
# pip install clipper
# or
# conda install -c bioconda clipper

# Define input and output files
EPRINT_BAM="ePRINT_sample.bam" # Placeholder for aligned ePRINT reads (IP sample)
INPUT_BAM="corresponding_input_sample.bam" # Placeholder for aligned corresponding input reads
OUTPUT_PREFIX="ePRINT_enriched_peaks" # Prefix for output peak files
GENOME_SIZE="hs" # Placeholder: 'hs' for human, 'mm' for mouse, or a numerical value (e.g., 2.7e9 for hg38).

# Run clipper to identify significantly enriched peaks in ePRINT samples compared to input samples.
# The tool performs statistical testing to retain regions significantly enriched.
clipper -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}" "${EPRINT_BAM}" "${INPUT_BAM}"

View on GitHub

The remaining high confidence peaks were then input to DESEQ2 for differential analysis.

DESeq2 vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install DESeq2 (if not already installed)
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("DESeq2")

# Create a placeholder R script for DESeq2 differential analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# --- Configuration --- #
# Input count matrix file (e.g., from featureCounts, htseq-count, or peak quantification)
# Rows are features (peaks), columns are samples.
count_file <- "peak_counts.tsv"

# Sample metadata file
# Must contain 'sample_id' column matching count matrix column names,
# and a 'condition' column for differential analysis.
metadata_file <- "sample_metadata.tsv"

# Output directory for results
output_dir <- "deseq2_results"
dir.create(output_dir, showWarnings = FALSE)

# Define the design formula (e.g., ~ condition)
# Adjust based on your experimental design
design_formula <- ~ condition

# Define the contrast for differential analysis (e.g., c("condition", "treated", "control"))
# This will compare 'treated' vs 'control'
contrast_factors <- c("condition", "treated", "control")

# --- Load Data --- #
counts_data <- read.delim(count_file, row.names = 1, sep = "\t", header = TRUE)
metadata <- read.delim(metadata_file, row.names = 1, sep = "\t", header = TRUE)

# Ensure sample names match and are in the same order
common_samples <- intersect(colnames(counts_data), rownames(metadata))
counts_data <- counts_data[, common_samples]
metadata <- metadata[common_samples, ]

# Ensure count data is integer matrix
counts_data <- as.matrix(counts_data)
mode(counts_data) <- "integer"

# Ensure condition is a factor
metadata$condition <- factor(metadata$condition)

# --- Create DESeqDataSet object --- #
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = metadata,
                              design = design_formula)

# --- Run DESeq2 differential analysis --- #
dds <- DESeq(dds)

# --- Extract and save results --- #
res <- results(dds, contrast = contrast_factors)

# Order results by adjusted p-value
res_ordered <- res[order(res$padj), ]

# Save full results
write.csv(as.data.frame(res_ordered), file.path(output_dir, "deseq2_differential_results.csv"))

# Save significant results (e.g., padj < 0.05)
significant_res <- subset(res_ordered, padj < 0.05)
write.csv(as.data.frame(significant_res), file.path(output_dir, "deseq2_significant_results.csv"))

message("DESeq2 analysis complete. Results saved in ", output_dir)
EOF

# Create placeholder input files (replace with your actual data)
# Example peak_counts.tsv
cat << 'EOF_COUNTS' > peak_counts.tsv
peak_id	sample1	sample2	sample3	sample4
peak1	100	150	50	75
peak2	20	30	10	15
peak3	500	600	200	250
peak4	10	12	5	6
peak5	80	90	40	45
EOF_COUNTS

# Example sample_metadata.tsv
cat << 'EOF_METADATA' > sample_metadata.tsv
sample_id	condition
sample1	treated
sample2	treated
sample3	control
sample4	control
EOF_METADATA

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Bedgraph files were generated from the r1 reads using bedtools

bedtools v2.30.0 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools=2.30.0

# Define input and output files
INPUT_BAM="r1.bam" # Assuming 'r1 reads' refers to aligned reads in BAM format
OUTPUT_BEDGRAPH="r1.bedgraph"
GENOME_CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes (e.g., from UCSC)

# Generate bedgraph file from aligned reads
bedtools genomeCoverageBed -ibam ${INPUT_BAM} -bg -g ${GENOME_CHROM_SIZES} > ${OUTPUT_BEDGRAPH}

View on GitHub

Tools Used

STAR HOMER DESeq2

Raw Source Text

Adapters were removes using cutadapt and reads were mapped to hg19 using STAR.
Initial peak calling was performed for each sample by CLIPper using only the r1 reads. Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER. These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples. The remaining high confidence peaks were then input to DESEQ2 for differential analysis.
Bedgraph files were generated from the r1 reads using bedtools
Assembly: hg19
Supplementary files format and content: bedgraph

← Back to Analysis