GSE230097 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

ePRINT: exonuclease assisted mapping of protein-RNA interactions.

Genome biology (2024) — PMID 38807229

Dataset

GSE230097

Exonuclease assisted mapping of protein-RNA interactions (ePRINT)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Adapters were removes using cutadapt and reads were mapped to hg19 using STAR.

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt
    
    # Define variables
    INPUT_READS="input_reads.fastq.gz"
    TRIMMED_READS="trimmed_reads.fastq.gz"
    OUTPUT_DIR="star_output"
    STAR_INDEX_DIR="/path/to/hg19_STAR_index" # Placeholder for hg19 STAR index. Ensure this index is pre-built.
    NUM_THREADS=8 # Placeholder for number of threads
    
    # 1. Remove adapters using cutadapt
    # Using a common Illumina universal adapter sequence for single-end reads.
    # Adjust -a for specific adapters if known, or use -A for paired-end reads.
    # -q 20: Trim low-quality ends (below Q20)
    # -m 20: Discard reads shorter than 20 bp after trimming
    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -q 20 -m 20 -o "${TRIMMED_READS}" "${INPUT_READS}"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Install STAR (e.g., using conda)
    # conda install -c bioconda star
    
    # 2. Map reads to hg19 using STAR
    # --genomeDir: Path to the STAR genome index for hg19
    # --readFilesIn: Input FASTQ file (trimmed reads)
    # --outFileNamePrefix: Prefix for output files (e.g., aligned_Log.out, aligned_Aligned.sortedByCoordinate.out.bam)
    # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
    # --runThreadN: Number of threads to use
    # --readFilesCommand zcat: Command to decompress gzipped input files
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${TRIMMED_READS}" \
         --outFileNamePrefix "${OUTPUT_DIR}/aligned_" \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN "${NUM_THREADS}" \
         --readFilesCommand zcat
  2. 2

    Initial peak calling was performed for each sample by CLIPper using only the r1 reads.

    CLIPper vfrom source GitHub
    $ Bash example
    # Clone the CLIPper repository
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # CLIPper requires Python 2.7 or 3.x and several libraries. It's recommended to set up a virtual environment.
    # Example dependencies (may vary, check CLIPper's documentation or requirements.txt if available):
    # pip install numpy scipy pysam pybedtools
    
    # Placeholder for reference genome and blacklist
    # Replace with actual paths to your reference files
    GENOME_FASTA="/path/to/your/reference/genome/hg38.fa"
    BLACKLIST_BED="/path/to/your/blacklist/hg38_blacklist.bed"
    # Ensure these files are properly indexed if required by CLIPper (e.g., .fai for fasta)
    
    # Input BAM file: The description specifies "using only the r1 reads".
    # This means the input BAM should already contain only R1 reads.
    # If starting from a paired-end BAM (e.g., aligned.bam), you might first filter for R1:
    # samtools view -b -f 0x40 aligned.bam > input_sample_r1.bam
    INPUT_BAM="input_sample_r1.bam"
    OUTPUT_DIR="clipper_peaks_output"
    SAMPLE_NAME="sample_name" # Used for output file naming
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute CLIPper for initial peak calling
    # Common parameters for CLIPper:
    # -i: Input BAM file (assumed to contain only R1 reads)
    # -g: Reference genome FASTA file
    # -b: Blacklist BED file
    # -o: Output directory
    # -p: P-value threshold (default 0.01)
    # -f: Fold enrichment threshold (default 2.0)
    # -a: Adjusted p-value threshold (default 0.05)
    # --output-prefix: Prefix for output files (e.g., sample_name_peaks.bed)
    python clipper.py \
        -i "${INPUT_BAM}" \
        -g "${GENOME_FASTA}" \
        -b "${BLACKLIST_BED}" \
        -o "${OUTPUT_DIR}" \
        -p 0.01 \
        -f 2.0 \
        -a 0.05 \
        --output-prefix "${SAMPLE_NAME}"
  3. 3

    Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER.

    HOMER v4.12
    $ Bash example
    # Install HOMER (if not already installed)
    # conda install -c bioconda homer
    # Or manual installation:
    # wget http://homer.ucsd.edu/homer/configureHomer.pl
    # perl configureHomer.pl -install
    
    # Define input peak files. These should be in BED format or HOMER's native peak format.
    # Replace with actual file paths for your ePRINT samples.
    # Example:
    # peak_file_1="path/to/sample1_ePRINT_peaks.bed"
    # peak_file_2="path/to/sample2_ePRINT_peaks.bed"
    # peak_file_3="path/to/sample3_ePRINT_peaks.bed"
    
    # Merge peaks from at least two ePRINT samples to define a candidate peak superset.
    # -bed: Assumes input files are in BED format. Adjust if using HOMER's native format.
    # -d 0: Merges only directly overlapping peaks (distance 0), which is a common interpretation of "genomic overlap".
    # -min 2: Only reports merged peaks that overlap with at least 2 input peak files, fulfilling the "at least two ePRINT samples" criteria.
    # Replace sampleX_ePRINT_peaks.bed with your actual input peak files.
    mergePeaks -bed -d 0 -min 2 sample1_ePRINT_peaks.bed sample2_ePRINT_peaks.bed sample3_ePRINT_peaks.bed > candidate_peak_superset.bed
  4. 4

    These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples.

    clipper (Inferred with models/gemini-2.5-flash) v5.0 GitHub
    $ Bash example
    # Install clipper if not already installed
    # pip install clipper
    # or
    # conda install -c bioconda clipper
    
    # Define input and output files
    EPRINT_BAM="ePRINT_sample.bam" # Placeholder for aligned ePRINT reads (IP sample)
    INPUT_BAM="corresponding_input_sample.bam" # Placeholder for aligned corresponding input reads
    OUTPUT_PREFIX="ePRINT_enriched_peaks" # Prefix for output peak files
    GENOME_SIZE="hs" # Placeholder: 'hs' for human, 'mm' for mouse, or a numerical value (e.g., 2.7e9 for hg38).
    
    # Run clipper to identify significantly enriched peaks in ePRINT samples compared to input samples.
    # The tool performs statistical testing to retain regions significantly enriched.
    clipper -s "${GENOME_SIZE}" -o "${OUTPUT_PREFIX}" "${EPRINT_BAM}" "${INPUT_BAM}"
  5. 5

    The remaining high confidence peaks were then input to DESEQ2 for differential analysis.

    DESeq2 vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install DESeq2 (if not already installed)
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("DESeq2")
    
    # Create a placeholder R script for DESeq2 differential analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # --- Configuration --- #
    # Input count matrix file (e.g., from featureCounts, htseq-count, or peak quantification)
    # Rows are features (peaks), columns are samples.
    count_file <- "peak_counts.tsv"
    
    # Sample metadata file
    # Must contain 'sample_id' column matching count matrix column names,
    # and a 'condition' column for differential analysis.
    metadata_file <- "sample_metadata.tsv"
    
    # Output directory for results
    output_dir <- "deseq2_results"
    dir.create(output_dir, showWarnings = FALSE)
    
    # Define the design formula (e.g., ~ condition)
    # Adjust based on your experimental design
    design_formula <- ~ condition
    
    # Define the contrast for differential analysis (e.g., c("condition", "treated", "control"))
    # This will compare 'treated' vs 'control'
    contrast_factors <- c("condition", "treated", "control")
    
    # --- Load Data --- #
    counts_data <- read.delim(count_file, row.names = 1, sep = "\t", header = TRUE)
    metadata <- read.delim(metadata_file, row.names = 1, sep = "\t", header = TRUE)
    
    # Ensure sample names match and are in the same order
    common_samples <- intersect(colnames(counts_data), rownames(metadata))
    counts_data <- counts_data[, common_samples]
    metadata <- metadata[common_samples, ]
    
    # Ensure count data is integer matrix
    counts_data <- as.matrix(counts_data)
    mode(counts_data) <- "integer"
    
    # Ensure condition is a factor
    metadata$condition <- factor(metadata$condition)
    
    # --- Create DESeqDataSet object --- #
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = metadata,
                                  design = design_formula)
    
    # --- Run DESeq2 differential analysis --- #
    dds <- DESeq(dds)
    
    # --- Extract and save results --- #
    res <- results(dds, contrast = contrast_factors)
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj), ]
    
    # Save full results
    write.csv(as.data.frame(res_ordered), file.path(output_dir, "deseq2_differential_results.csv"))
    
    # Save significant results (e.g., padj < 0.05)
    significant_res <- subset(res_ordered, padj < 0.05)
    write.csv(as.data.frame(significant_res), file.path(output_dir, "deseq2_significant_results.csv"))
    
    message("DESeq2 analysis complete. Results saved in ", output_dir)
    EOF
    
    # Create placeholder input files (replace with your actual data)
    # Example peak_counts.tsv
    cat << 'EOF_COUNTS' > peak_counts.tsv
    peak_id	sample1	sample2	sample3	sample4
    peak1	100	150	50	75
    peak2	20	30	10	15
    peak3	500	600	200	250
    peak4	10	12	5	6
    peak5	80	90	40	45
    EOF_COUNTS
    
    # Example sample_metadata.tsv
    cat << 'EOF_METADATA' > sample_metadata.tsv
    sample_id	condition
    sample1	treated
    sample2	treated
    sample3	control
    sample4	control
    EOF_METADATA
    
    # Execute the R script
    Rscript run_deseq2.R
  6. 6

    Bedgraph files were generated from the r1 reads using bedtools

    bedtools v2.30.0 GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools=2.30.0
    
    # Define input and output files
    INPUT_BAM="r1.bam" # Assuming 'r1 reads' refers to aligned reads in BAM format
    OUTPUT_BEDGRAPH="r1.bedgraph"
    GENOME_CHROM_SIZES="hg38.chrom.sizes" # Placeholder for genome chromosome sizes (e.g., from UCSC)
    
    # Generate bedgraph file from aligned reads
    bedtools genomeCoverageBed -ibam ${INPUT_BAM} -bg -g ${GENOME_CHROM_SIZES} > ${OUTPUT_BEDGRAPH}

Tools Used

Raw Source Text
Adapters were removes using cutadapt and reads were mapped to hg19 using STAR.
Initial peak calling was performed for each sample by CLIPper using only the r1 reads. Peaks with genomic overlap in at least two ePRINT samples were identified and their genomic coordinates merged to define the candidate peak superset using HOMER. These peaks were then filtered using the input retaining those significantly enriched in the ePRINT samples in comparison to the corresponding input samples. The remaining high confidence peaks were then input to DESEQ2 for differential analysis.
Bedgraph files were generated from the r1 reads using bedtools
Assembly: hg19
Supplementary files format and content: bedgraph
← Back to Analysis