GSE240014 Processing Pipeline

RNA-Seq, 6 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature Communications (2024), PMID 39152130

Dataset

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [in_situ_STAMP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

remove adapter with Cutadapt

cutadapt v4.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file paths (placeholders)
INPUT_FASTQ="input.fastq.gz"
OUTPUT_TRIMMED_FASTQ="output_trimmed.fastq.gz"

# Define common Illumina adapter sequence (placeholder, adjust if a specific adapter is known)
# This example uses a common Illumina 3' adapter sequence.
ADAPTER_SEQUENCE="AGATCGGAAGAGC"

# Run Cutadapt to remove 3' adapters, perform quality trimming, and filter by minimum length.
# -a: Specifies a 3' adapter sequence.
# -q 20: Trims low-quality bases from the 3' end with a quality cutoff of 20.
# --minimum-length 20: Discards reads shorter than 20 bases after trimming.
# -o: Specifies the output file for trimmed reads.
cutadapt -a "${ADAPTER_SEQUENCE}" \
         -q 20 \
         --minimum-length 20 \
         -o "${OUTPUT_TRIMMED_FASTQ}" \
         "${INPUT_FASTQ}"
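
To confirm that trimming behaved as expected, it can help to compare read counts before and after Cutadapt. The sketch below assumes gzip-compressed FASTQ files with standard four-line records; `count_fastq_reads` and `report_trimming` are hypothetical helper names, not part of Cutadapt.

```shell
# Hypothetical helper: count reads in a gzip-compressed FASTQ file.
# ASSUMPTION: every record spans exactly four lines (no wrapped sequences).
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print int(NR / 4) }'
}

# Hypothetical helper: report how many reads were discarded by trimming.
# $1 = FASTQ before trimming, $2 = FASTQ after trimming.
report_trimming() {
    reads_before=$(count_fastq_reads "$1")
    reads_after=$(count_fastq_reads "$2")
    echo "reads_before=${reads_before} reads_after=${reads_after} discarded=$((reads_before - reads_after))"
}

# Example with the placeholder file names used in the Cutadapt step:
# report_trimming "input.fastq.gz" "output_trimmed.fastq.gz"
```

A large discarded count may indicate that the adapter sequence or the minimum-length cutoff does not suit the library.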

align to hg38 using STAR 2.4.0 (Homo sapiens) or mm10 using STAR 2.5.2 (Mus musculus)

STAR v2.4.0 (hg38) or v2.5.2 (mm10)

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.4.0

# Define variables
GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for hg38 STAR index
READ1="sample_R1.fastq.gz" # Placeholder for input read 1 FASTQ file
READ2="sample_R2.fastq.gz" # Placeholder for input read 2 FASTQ file (remove if single-end)
OUTPUT_PREFIX="sample_aligned"
NUM_THREADS=8 # Number of threads to use

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_PREFIX}_dir"

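# Run STAR alignment for paired-end reads
# --readFilesCommand zcat: decompress the gzipped FASTQ inputs on the fly
# (use "gunzip -c" where zcat is unavailable; drop the option for uncompressed input)
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \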
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_dir/${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 31000000000 # Example: 31GB RAM for sorting (adjust based on available RAM)
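
The step description pairs each species with its own assembly and STAR version. A minimal sketch of that mapping follows; `select_star_setup` is a hypothetical helper, and the `/path/to/STAR_index` layout in the usage comment is the same placeholder as in the STAR example.

```shell
# Hypothetical helper: map a species name to the assembly and STAR version
# named in the step description. Prints "<assembly> <STAR version>".
# Returns non-zero for any species the description does not mention.
select_star_setup() {
    case "$1" in
        "Homo sapiens") echo "hg38 2.4.0" ;;
        "Mus musculus") echo "mm10 2.5.2" ;;
        *) echo "unsupported species: $1" >&2; return 1 ;;
    esac
}

# Example: derive the index directory for a mouse sample.
# setup=$(select_star_setup "Mus musculus") || exit 1
# GENOME_DIR="/path/to/STAR_index/${setup%% *}"   # -> /path/to/STAR_index/mm10
```

Keeping the mapping in one place avoids aligning a sample against the wrong index when human and mouse samples are processed together.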

SAILOR analysis to call C-to-U edits and keep only sites with score >0.5 and edit fraction <80%

SAILOR v0.1.0

$ Bash example

# Install SAILOR (if not already installed)
# git clone https://github.com/YeoLab/sailor.git
# cd sailor
# Follow the repository README for the supported installation and execution route.

# Define input and output paths (placeholders)
INPUT_BAM="input.bam"
REFERENCE_FASTA="hg38.fa" # Reference genome FASTA (hg38 for human, mm10 for mouse)
OUTPUT_PREFIX="output_prefix"

# Illustrative invocation for calling C-to-U edits with the filters from the description
# (score > 0.5 and edit fraction < 80%).
# NOTE: the SAILOR.py script name and the -i/-r/-o/-s/-f flags are inferred placeholders,
# not a verified interface; confirm them against the SAILOR documentation before running.
python SAILOR.py \
    -i "${INPUT_BAM}" \
    -r "${REFERENCE_FASTA}" \
    -o "${OUTPUT_PREFIX}" \
    -s 0.5 \
    -f 0.8
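
The two thresholds in this step (score > 0.5 and edit fraction < 80%) can also be applied or re-checked on a tab-separated site table. The sketch below is not SAILOR's own filter: `filter_edit_sites` is a hypothetical helper, and the column positions (score in column 4, edit fraction on a 0 to 1 scale in column 5) are assumptions that must be checked against the actual output.

```shell
# Hypothetical helper: keep sites with score > min_score and edit fraction < max_fraction.
# ASSUMPTION: tab-separated input, score in column 4, edit fraction (0-1) in column 5.
# Usage: filter_edit_sites FILE [min_score] [max_fraction] [score_col] [fraction_col]
filter_edit_sites() {
    awk -F'\t' \
        -v min_score="${2:-0.5}" -v max_frac="${3:-0.8}" \
        -v score_col="${4:-4}" -v frac_col="${5:-5}" \
        '($score_col + 0) > (min_score + 0) && ($frac_col + 0) < (max_frac + 0)' "$1"
}

# Example (file names are placeholders):
# filter_edit_sites sites.bed 0.5 0.8 > sites.filtered.bed
```

Both comparisons are strict, so a site with a score of exactly 0.5 or an edit fraction of exactly 0.8 is dropped, matching the ">" and "<" in the description.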

FLARE analysis to call C-to-U edit clusters

FLARE, latest version (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install FLARE (if not already available in the environment)
# It's recommended to clone the repository and run from source or add to PATH:
# git clone https://github.com/yeolab/flare.git
# cd flare
# # Add the flare directory to your PATH or run scripts directly from here
# # export PATH=$(pwd):$PATH

# Define input and output paths
INPUT_BAM="aligned_reads.bam" # Replace with your actual aligned BAM file
REFERENCE_GENOME="GRCh38.fa" # Replace with your actual reference genome FASTA (e.g., from GENCODE, Ensembl)
OUTPUT_DIR="flare_output"
CHROM_SIZES="GRCh38.chrom.sizes" # Replace with your actual chromosome sizes file (e.g., from UCSC table browser)

# Create output directory
mkdir -p "${OUTPUT_DIR}"

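# Run FLARE analysis to call C-to-U edit clusters
# NOTE: the flare.py entry point and the flags below are inferred placeholders, not a
# verified interface; confirm them against the FLARE documentation before running.
# The deposited output name (*merged_sorted_peaks.fdr_0.1.d_15.scored.tsv) suggests
# the authors ran with fdr=0.1 and d=15.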
# -i: Input BAM file
# -g: Reference genome FASTA file
# -o: Output directory
# -c: Chromosome sizes file (optional but good practice for filtering)
# -s: Strand-specific (use if your library is strand-specific, e.g., dUTP)
# -m: Minimum coverage (e.g., 10 reads)
# -q: Minimum base quality (e.g., 20)
# -e: Minimum edit fraction (e.g., 0.1, meaning at least 10% of reads show the edit)
python flare/flare.py \
    -i "${INPUT_BAM}" \
    -g "${REFERENCE_GENOME}" \
    -o "${OUTPUT_DIR}" \
    -c "${CHROM_SIZES}" \
    -m 10 \
    -q 20 \
    -e 0.1 \
    -s # Use -s for strand-specific libraries
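
The deposited FLARE output is a .tsv file, while the intersection step that follows works on BED-style intervals, so a conversion may be needed in between. The sketch below shows one way to do that under loud assumptions: `clusters_tsv_to_bed` is a hypothetical helper, and it assumes the TSV has a single header line with chromosome, start, and end in columns 1 to 3. Check the real column layout before relying on it.

```shell
# Hypothetical helper: convert a cluster TSV into a coordinate-sorted BED3 stream.
# ASSUMPTIONS: one header line; columns 1-3 are chromosome, start, end.
clusters_tsv_to_bed() {
    tail -n +2 "$1" | cut -f1-3 | sort -k1,1 -k2,2n
}

# Example (file names are placeholders):
# clusters_tsv_to_bed rep1.scored.tsv > rep1_clusters.bed
```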

Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"

intersect_peaks.py (reportedly part of the yeolab/merge_peaks pipeline; version not specified)

$ Bash example

# Install Python (if not already available)
# conda create -n merge_peaks_env python=3.8
# conda activate merge_peaks_env

# Install pybedtools, a dependency for intersect_peaks.py
# conda install -c bioconda pybedtools

# Clone the merge_peaks repository if not already present
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

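# Assuming input edit cluster BED files are named rep1_clusters.bed, rep2_clusters.bed, rep3_clusters.bed
# and the script is in the current directory or accessible via PATH.
# NOTE: the script name and its flags are inferred placeholders, not a verified interface;
# confirm them against the repository before running.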
python intersect_peaks.py \
  --input_files rep1_clusters.bed rep2_clusters.bed rep3_clusters.bed \
  --min_replicates 3 \
  --output_file confident_peaks.bed
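
To make the three-replicate logic concrete, the sketch below keeps each replicate 1 cluster that overlaps at least one cluster in both replicate 2 and replicate 3. It is a plain awk illustration, not the authors' script: `overlapping` and `confident_clusters` are hypothetical helpers, the inputs are assumed to be tab-separated BED files with half-open coordinates, and reporting replicate 1 coordinates is itself an assumption about how "intersect" is meant. Where bedtools is available, chaining `bedtools intersect -u` calls is a common alternative.

```shell
# Hypothetical helper: print intervals from BED file $1 that overlap at least
# one interval in BED file $2 (same chromosome, half-open coordinates).
# Pass "-" as $1 to read the first set from standard input.
overlapping() {
    awk -F'\t' '
        # Comparison set (read first): store its intervals per chromosome.
        NR == FNR { n[$1]++; s[$1, n[$1]] = $2 + 0; e[$1, n[$1]] = $3 + 0; next }
        # Query set (read second): print an interval once if anything overlaps it.
        {
            for (i = 1; i <= n[$1]; i++)
                if (($2 + 0) < e[$1, i] && ($3 + 0) > s[$1, i]) { print; break }
        }
    ' "$2" "$1"
}

# Hypothetical helper: replicate 1 clusters supported by replicates 2 and 3.
confident_clusters() {
    overlapping "$1" "$2" | overlapping - "$3"
}

# Example with the placeholder file names used above:
# confident_clusters rep1_clusters.bed rep2_clusters.bed rep3_clusters.bed > confident_peaks.bed
```

The scan is quadratic per chromosome, which is acceptable for a sketch but slow for very large cluster sets.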

Subtract STAMP confident clusters to Buffer only control, which yields "*cleaned_confident_peaks.bed"

bedtools v2.30.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Define placeholder input files based on the description
# Replace these with actual file paths from your pipeline
STAMP_CONFIDENT_CLUSTERS_BED="stamp_confident_clusters.bed"
BUFFER_ONLY_CONTROL_BED="buffer_only_control.bed"

# Define the output file name as specified
CLEANED_CONFIDENT_PEAKS_BED="cleaned_confident_peaks.bed"

# Subtract the buffer-only control regions from the STAMP confident clusters
# -a: file from which features are subtracted (STAMP clusters)
# -b: file containing the features to subtract (Buffer-only control)
# By default only the overlapping portion of each -a feature is removed;
# add -A to drop any STAMP cluster that overlaps the control at all.
bedtools subtract -a "${STAMP_CONFIDENT_CLUSTERS_BED}" -b "${BUFFER_ONLY_CONTROL_BED}" > "${CLEANED_CONFIDENT_PEAKS_BED}"

Tools Used

STAR, SAILOR

Raw Source Text

remove adapter with Cutadapt
align to hg38 using STAR 2.4.0 (Homo sapiens) or mm10 using STAR 2.5.2 (Mus musculus)
SAILOR analysis to call C-to-U edits and keep only sites with score >0.5 and edit fraction <80%
FLARE analysis to call C-to-U edit clusters
Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"
Subtract STAMP confident clusters to Buffer only control, which yields "*cleaned_confident_peaks.bed"
Assembly: hg38
Assembly: mm10
Supplementary files format and content: SAILOR step yields bed file: *0.5Score0.8Fraction.fastqTr.sorted.STARUnmapped.out.sorted.STARAligned.out.sorted.bam.combined.readfiltered.formatted.varfiltered.snpfiltered.ranked.bed
Supplementary files format and content: FLARE step yields .tsv file: "*merged_sorted_peaks.fdr_0.1.d_15.scored.tsv"
Supplementary files format and content: Intersection step yields .bed file:  "*confident_peaks.bed"
Supplementary files format and content: Subtraction step yields .bed file:  "*cleaned_confident_peaks.bed"
