GSE240014 Processing Pipeline

RNA-Seq, 6 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature Communications (2024), PMID 39152130

Dataset

GSE240014

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [in_situ_STAMP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Remove adapters with Cutadapt

    cutadapt v4.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # Define input and output file paths (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_TRIMMED_FASTQ="output_trimmed.fastq.gz"
    
    # Define common Illumina adapter sequence (placeholder, adjust if a specific adapter is known)
    # This example uses a common Illumina 3' adapter sequence.
    ADAPTER_SEQUENCE="AGATCGGAAGAGC"
    
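    # Run Cutadapt to remove 3' adapters, perform quality trimming, and filter by minimum length.
    # The quality and length cutoffs below are illustrative; the source description does not state them.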
    # -a: Specifies a 3' adapter sequence.
    # -q 20: Trims low-quality bases from the 3' end with a quality cutoff of 20.
    # --minimum-length 20: Discards reads shorter than 20 bases after trimming.
    # -o: Specifies the output file for trimmed reads.
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -q 20 \
             --minimum-length 20 \
             -o "${OUTPUT_TRIMMED_FASTQ}" \
             "${INPUT_FASTQ}"
  2. Align to hg38 using STAR 2.4.0 (Homo sapiens) or to mm10 using STAR 2.5.2 (Mus musculus)

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star=2.4.0
    
    # Define variables
    GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for hg38 STAR index
    READ1="sample_R1.fastq.gz" # Placeholder for input read 1 FASTQ file
    READ2="sample_R2.fastq.gz" # Placeholder for input read 2 FASTQ file (remove if single-end)
    OUTPUT_PREFIX="sample_aligned"
    NUM_THREADS=8 # Number of threads to use
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_PREFIX}_dir"
    
    # Run STAR alignment for paired-end reads
    # --readFilesCommand zcat: decompress the gzipped FASTQ input on the fly
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_dir/${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignSJoverhangMin 8 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 31000000000 # Example: 31GB RAM for sorting (adjust based on available RAM)
    
  3. SAILOR analysis to call C-to-U edits, keeping only sites with score >0.5 and edit fraction <80%

    SAILOR v0.1.0
    $ Bash example
    # Install SAILOR (if not already installed)
    # git clone https://github.com/YeoLab/sailor.git
    # cd SAILOR
    # # It is recommended to create a conda environment for SAILOR:
    # # conda env create -f environment.yml
    # # conda activate SAILOR_env
    
    # Example usage for calling C-to-U edits with the stated filters.
    # NOTE: the script name and flags below are illustrative and unverified;
    # check the SAILOR documentation for the actual invocation.
    # -s 0.5 and -f 0.8 stand in for the description's criteria
    # (score >0.5 and edit fraction <80%).
    INPUT_BAM="input.bam"             # Placeholder: aligned, sorted BAM file
    REFERENCE_FASTA="reference.fasta" # Placeholder: e.g. hg38.fa for human
    OUTPUT_PREFIX="output_prefix"     # Placeholder: prefix for output files

    python SAILOR.py \
        -i "${INPUT_BAM}" \
        -r "${REFERENCE_FASTA}" \
        -o "${OUTPUT_PREFIX}" \
        -s 0.5 \
        -f 0.8
  4. FLARE analysis to call C-to-U edit clusters

    FLARE, latest version (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install FLARE (if not already available in the environment)
    # It's recommended to clone the repository and run from source or add to PATH:
    # git clone https://github.com/yeolab/flare.git
    # cd flare
    # # Add the flare directory to your PATH or run scripts directly from here
    # # export PATH=$(pwd):$PATH
    
    # Define input and output paths
    INPUT_BAM="aligned_reads.bam" # Replace with your actual aligned BAM file
    REFERENCE_GENOME="GRCh38.fa" # Replace with your actual reference genome FASTA (e.g., from GENCODE, Ensembl)
    OUTPUT_DIR="flare_output"
    CHROM_SIZES="GRCh38.chrom.sizes" # Replace with your actual chromosome sizes file (e.g., from UCSC table browser)
    
    # Create output directory
    mkdir -p "${OUTPUT_DIR}"
    
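    # Run FLARE analysis to call C-to-U edit clusters
    # NOTE: the script path and flags below are illustrative and unverified;
    # check the FLARE documentation for the actual invocation.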
    # -i: Input BAM file
    # -g: Reference genome FASTA file
    # -o: Output directory
    # -c: Chromosome sizes file (optional but good practice for filtering)
    # -s: Strand-specific (use if your library is strand-specific, e.g., dUTP)
    # -m: Minimum coverage (e.g., 10 reads)
    # -q: Minimum base quality (e.g., 20)
    # -e: Minimum edit fraction (e.g., 0.1, meaning at least 10% of reads show the edit)
    python flare/flare.py \
        -i "${INPUT_BAM}" \
        -g "${REFERENCE_GENOME}" \
        -o "${OUTPUT_DIR}" \
        -c "${CHROM_SIZES}" \
        -m 10 \
        -q 20 \
        -e 0.1 \
        -s # Use -s for strand-specific libraries
  5. Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"

    intersect_peaks.py (part of yeolab/merge_peaks pipeline), version not available
    $ Bash example
    # Install Python (if not already available)
    # conda create -n merge_peaks_env python=3.8
    # conda activate merge_peaks_env
    
    # Install pybedtools, a dependency for intersect_peaks.py
    # conda install -c bioconda pybedtools
    
    # Clone the merge_peaks repository if not already present
    # git clone https://github.com/yeolab/merge_peaks.git
    # cd merge_peaks
    
    # Assuming input edit cluster BED files are named rep1_clusters.bed, rep2_clusters.bed, rep3_clusters.bed
    # and the script is in the current directory or accessible via PATH.
    # NOTE: the script name and flags below are illustrative and unverified.
    python intersect_peaks.py \
      --input_files rep1_clusters.bed rep2_clusters.bed rep3_clusters.bed \
      --min_replicates 3 \
      --output_file confident_peaks.bed
  6. Subtract the Buffer-only control clusters from the STAMP confident clusters, which yields "*cleaned_confident_peaks.bed"

    bedtools v2.30.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Define placeholder input files based on the description
    # Replace these with actual file paths from your pipeline
    STAMP_CONFIDENT_CLUSTERS_BED="stamp_confident_clusters.bed"
    BUFFER_ONLY_CONTROL_BED="buffer_only_control.bed"
    
    # Define the output file name as specified
    CLEANED_CONFIDENT_PEAKS_BED="cleaned_confident_peaks.bed"
    
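    # Subtract the buffer-only control regions from the STAMP confident clusters
    # -a: the file from which features are subtracted (STAMP clusters)
    # -b: the file containing features to subtract (Buffer-only control)
    # By default only the overlapping portion of each cluster is removed; add -A to drop
    # any cluster that overlaps the control at all. The source does not say which was used.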
    bedtools subtract -a "${STAMP_CONFIDENT_CLUSTERS_BED}" -b "${BUFFER_ONLY_CONTROL_BED}" > "${CLEANED_CONFIDENT_PEAKS_BED}"
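
The site filter in step 3 (score >0.5 and edit fraction <80%) can also be written as a plain awk filter, which may help when re-checking a SAILOR output table by hand. This is a minimal sketch, not the authors' code: the function name `filter_edit_sites` is hypothetical, and the column positions (score in column 4, edit fraction in column 5) are assumptions that should be checked against the actual file before use.

```shell
# Hypothetical helper: keep tab-separated rows with score > 0.5 and
# edit fraction < 0.8. Column positions are ASSUMED (defaults: 4 and 5);
# check them against the actual SAILOR output before use.
filter_edit_sites() {
    bed_file="$1"
    score_col="${2:-4}"   # assumed column holding the confidence score
    frac_col="${3:-5}"    # assumed column holding the edit fraction (0 to 1)
    awk -F'\t' -v s="${score_col}" -v f="${frac_col}" \
        '($s + 0) > 0.5 && ($f + 0) < 0.8' "${bed_file}"
}

# Usage (file names are placeholders):
# filter_edit_sites sample.ranked.bed 4 5 > sample.filtered.bed
```

Both comparisons are strict, so a site with a score of exactly 0.5 or an edit fraction of exactly 0.8 is dropped.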

Raw Source Text
remove adapter with Cutadapt
align to hg38 using STAR 2.4.0 (Homo sapiens) or mm10 using STAR 2.5.2 (Mus musculus)
SAILOR analysis to call C-to-U edits and keep only sites with score >0.5 and edit fraction <80%
FLARE analysis to call C-to-U edit clusters
Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"
Subtract STAMP confident clusters to Buffer only control, which yields "*cleaned_confident_peaks.bed"
Assembly: hg38
Assembly: mm10
Supplementary files format and content: SAILOR step yields bed file: *0.5Score0.8Fraction.fastqTr.sorted.STARUnmapped.out.sorted.STARAligned.out.sorted.bam.combined.readfiltered.formatted.varfiltered.snpfiltered.ranked.bed
Supplementary files format and content: FLARE step yields .tsv file: "*merged_sorted_peaks.fdr_0.1.d_15.scored.tsv"
Supplementary files format and content: Intersection step yields .bed file:  "*confident_peaks.bed"
Supplementary files format and content: Subtraction step yields .bed file:  "*cleaned_confident_peaks.bed"
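
The intersection step listed above (clusters supported by all 3 replicates) can be sketched with awk alone. This is a naive illustration under stated assumptions, not the authors' script: the function name `intersect_replicates` is hypothetical, the inputs are assumed to be tab-separated BED files with half-open coordinates, strand is ignored, and the output keeps the first replicate's cluster coordinates unchanged. For real data, `bedtools intersect` is the more usual choice.

```shell
# Hypothetical helper: print clusters from the first BED file that overlap
# at least one cluster in EVERY other BED file given on the command line.
# Assumptions: tab-separated input, half-open coordinates, strand ignored.
intersect_replicates() {
    awk -F'\t' '
        FNR == 1 { file_idx++ }            # track which input file is being read
        file_idx == 1 {                    # store clusters of the first replicate
            n++
            chrom[n] = $1; lo[n] = $2 + 0; hi[n] = $3 + 0; line[n] = $0
            next
        }
        {                                  # other replicates: mark overlapped clusters
            for (i = 1; i <= n; i++)
                if (chrom[i] == $1 && lo[i] < $3 + 0 && $2 + 0 < hi[i])
                    hit[i, file_idx] = 1
        }
        END {
            for (i = 1; i <= n; i++) {
                ok = 1
                for (f = 2; f <= file_idx; f++)
                    if (!((i, f) in hit)) ok = 0
                if (ok) print line[i]      # supported by all replicates
            }
        }
    ' "$@"
}

# Usage (file names are placeholders):
# intersect_replicates rep1_clusters.bed rep2_clusters.bed rep3_clusters.bed > confident_peaks.bed
```

The all-pairs comparison is quadratic in the number of clusters, which is fine for a sanity check but slow on large files.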