GSE240326 Processing Pipeline

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature Communications (2024), PMID 39152130

Dataset

GSE240326

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Remove adapters with Cutadapt

    cutadapt v4.1
    $ Bash example
    # Install Cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=4.1
    
    # Define input and output files
    # Replace with your actual input FASTQ files
    INPUT_R1="path/to/your/input_read1.fastq.gz"
    INPUT_R2="path/to/your/input_read2.fastq.gz" # For paired-end reads. Remove if single-end.
    
    # Replace with your desired output FASTQ files
    OUTPUT_R1_TRIMMED="path/to/your/output_read1_trimmed.fastq.gz"
    OUTPUT_R2_TRIMMED="path/to/your/output_read2_trimmed.fastq.gz" # For paired-end reads. Remove if single-end.
    
    # Define a report file for Cutadapt's summary
    REPORT_FILE="cutadapt_trimming_report.txt"
    
    # Define adapter sequences
    # These are common Illumina TruSeq adapters. You MUST replace these with the actual adapter sequences
    # used in your library preparation. If you don't know them, you might need to auto-detect or consult
    # your sequencing provider/library prep kit documentation.
    # For single-end reads, typically only -a ADAPTER_R1 is needed.
    # For paired-end reads, -a ADAPTER_R1 for read 1 and -A ADAPTER_R2 for read 2.
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina TruSeq adapter as it appears at the 3' end of read 1
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example: Illumina TruSeq adapter as it appears at the 3' end of read 2
    
    # Run Cutadapt for paired-end reads
    # Adjust parameters like --minimum-length, --quality-cutoff, --cores as needed.
    # If processing single-end reads, remove -A, -p, and INPUT_R2.
    cutadapt \
      -a "${ADAPTER_R1}" \
      -A "${ADAPTER_R2}" \
      -o "${OUTPUT_R1_TRIMMED}" \
      -p "${OUTPUT_R2_TRIMMED}" \
      --minimum-length 18 \
      --quality-cutoff 20 \
      --cores 8 \
      "${INPUT_R1}" "${INPUT_R2}" > "${REPORT_FILE}" 2>&1
    
    # For single-end reads, the command would look like this:
    # cutadapt \
    #   -a "${ADAPTER_R1}" \
    #   -o "${OUTPUT_R1_TRIMMED}" \
    #   --minimum-length 18 \
    #   --quality-cutoff 20 \
    #   --cores 8 \
    #   "${INPUT_R1}" > "${REPORT_FILE}" 2>&1
  2. Align to hg38 using STAR 2.4.0 (Homo sapiens) or to mm10 using STAR 2.5.2 (Mus musculus)

    $ Bash example
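    # Install STAR (if not already installed)
    # The source reports STAR 2.4.0 for human and STAR 2.5.2 for mouse samples.
    # Whether these exact builds are available on bioconda is not confirmed here.
    # conda install -c bioconda star=2.4.0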
    
    # Define input and output variables
    # Replace with actual paths and filenames
    READ1="input_R1.fastq.gz"
    READ2="input_R2.fastq.gz" # Remove if single-end
    OUTPUT_PREFIX="aligned_output"
    NUM_THREADS=8 # Adjust as needed
    
    # Define genome index paths
    # Replace with actual paths to your STAR indices
    # For Homo sapiens (hg38)
    HG38_STAR_INDEX="/path/to/STAR_index/hg38"
    # For Mus musculus (mm10)
    MM10_STAR_INDEX="/path/to/STAR_index/mm10"
    
    # --- Choose the appropriate genome index based on species ---
    # For Homo sapiens (hg38):
    GENOME_DIR="${HG38_STAR_INDEX}"
    # For Mus musculus (mm10):
    # GENOME_DIR="${MM10_STAR_INDEX}"
    
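    # Run STAR alignment
    # --readFilesCommand is required because the input FASTQ files are gzip-compressed.
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand gunzip -c \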
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMstrandField intronMotif \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignSJoverhangMin 8 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outReadsUnmapped Fastx \
         --quantMode GeneCounts # Optional: needs gene annotations (a GTF given at index generation or via --sjdbGTFfile); remove if gene counts are not needed
  3. SAILOR analysis to call C-to-U edits, keeping only sites with score >0.5 and edit fraction <80%

    SAILOR v0.1.0
    $ Bash example
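    # Install SAILOR
    # NOTE: the conda package name and version below are unverified assumptions;
    # check the SAILOR repository for the supported installation route.
    # conda create -n sailor_env python=3.8
    # conda activate sailor_env
    # conda install -c bioconda sailor=0.1.0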
    
    # Define input and output files
    # Replace 'aligned_reads.bam' with your actual input BAM file containing aligned RNA-seq reads.
    INPUT_BAM="aligned_reads.bam"
    # Replace with the path to your reference genome FASTA file (e.g., GRCh38).
    REFERENCE_FASTA="path/to/human_genome/GRCh38.p13.genome.fa"
    # Replace with the path to a VCF file of known SNPs for the reference genome (e.g., dbSNP for GRCh38).
    KNOWN_SNPS_VCF="path/to/known_snps/dbSNP_153_GRCh38.vcf.gz"
    # Define the output file for the filtered C-to-U editing sites.
    OUTPUT_TSV="c_to_u_edits_filtered.tsv"
    
    # Call C-to-U edits and apply the filtering criteria.
    # NOTE: the `sailor call` subcommand and every flag below are illustrative placeholders,
    # not a confirmed SAILOR command line; map them onto the options SAILOR actually provides.
    # --min-score 0.5: keep sites with a score greater than 0.5.
    # --max-edit-fraction 0.8: keep sites with an edit fraction below 80% (0.8).
    # --fasta: reference genome FASTA file.
    # --vcf: VCF file of known SNPs to exclude from editing calls.
    sailor call \
        --min-score 0.5 \
        --max-edit-fraction 0.8 \
        --fasta "${REFERENCE_FASTA}" \
        --vcf "${KNOWN_SNPS_VCF}" \
        "${INPUT_BAM}" \
        > "${OUTPUT_TSV}"
  4. FLARE analysis to call C-to-U edit clusters

    FLARE v0.1.0
    $ Bash example
    # Clone the FLARE repository
    # git clone https://github.com/yeolab/FLARE.git
    # cd FLARE
    
    # Install dependencies as described in the repository's README
    # (the requirements.txt route below is an assumption)
    # pip install -r requirements.txt
    
    # Example usage of FLARE to call C-to-U edit clusters
    # Replace <input_bam>, <output_directory>, <reference_fasta>, and <gene_annotation> with actual paths.
    # Reference datasets: GRCh38 is used as a placeholder for human genome.
    # Gene annotation: A GTF file for GRCh38 is used as a placeholder.
    
    # Define placeholder variables
    INPUT_BAM="path/to/your/aligned.bam"
    OUTPUT_DIR="flare_c_to_u_edits"
    REFERENCE_FASTA="path/to/GRCh38.fa" # e.g., from Gencode or Ensembl
    GENE_ANNOTATION="path/to/GRCh38.gtf" # e.g., from Gencode or Ensembl
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute FLARE analysis
    # NOTE: the FLARE.py entry point, its -i/-o/-r/-g options, and the parameters listed
    # below are unverified placeholders; consult the FLARE repository for the real interface.
    # Parameters such a tool might expose include:
    # --min_reads 5 (minimum reads supporting an edit)
    # --min_edit_frac 0.1 (minimum fraction of reads supporting an edit)
    # --min_coverage 10 (minimum coverage at a site)
    # --min_base_qual 20 (minimum base quality)
    # --min_map_qual 20 (minimum mapping quality)
    # --blacklist (path to a blacklist BED file)
    # --known_edits (path to a VCF of known edits for filtering)
    
    python FLARE.py \
        -i "${INPUT_BAM}" \
        -o "${OUTPUT_DIR}" \
        -r "${REFERENCE_FASTA}" \
        -g "${GENE_ANNOTATION}"
  5. Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"

    merge_peaks (tool inferred with models/gemini-2.5-flash; version not available)
    $ Bash example
    # The source does not name the tool used for this step. This example uses bedtools
    # (an assumption) rather than merge_peaks. Install it if not already available:
    # conda install -c bioconda bedtools
    
    # Assume input edit cluster BED files are:
    # replicate1_edit_clusters.bed
    # replicate2_edit_clusters.bed
    # replicate3_edit_clusters.bed
    
    # Intersect the edit clusters from the first two replicates
    bedtools intersect -a replicate1_edit_clusters.bed -b replicate2_edit_clusters.bed > temp_intersect_1_2.bed
    
    # Intersect the result with the third replicate to find regions common to all three
    bedtools intersect -a temp_intersect_1_2.bed -b replicate3_edit_clusters.bed > confident_peaks.bed
    
    # Clean up temporary file
    rm temp_intersect_1_2.bed
  6. Subtract the Buffer-only control clusters from the STAMP confident clusters, which yields "*cleaned_confident_peaks.bed"

    bedtools v2.30.0 (tool and version inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools
    
    # Subtract the Buffer-only control regions from the STAMP confident clusters.
    # By default, bedtools subtract trims only the overlapping portion of each cluster;
    # add -A to drop any cluster that overlaps the control at all.
    # Assumes 'stamp_confident_clusters.bed' holds the STAMP confident clusters
    # and 'buffer_only_control.bed' holds the Buffer-only control regions.
    bedtools subtract -a stamp_confident_clusters.bed -b buffer_only_control.bed > cleaned_confident_peaks.bed
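
Step 3 keeps only sites with score >0.5 and edit fraction <80%. As a minimal sketch of that thresholding, the awk filter below reads a hypothetical tab-separated layout (chrom, start, end, score, edit fraction). The column order, the `filter_sites` helper, and the sample rows are all assumptions made for illustration; they are not SAILOR's documented output format or data from GSE240326.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input layout (an assumption, not SAILOR's real format):
#   col 1 = chrom, col 2 = start, col 3 = end, col 4 = score, col 5 = edit fraction
# Reads rows on stdin and prints those with score > min_score and fraction < max_fraction.
filter_sites() {
  local min_score="$1" max_fraction="$2"
  awk -F'\t' -v s="$min_score" -v f="$max_fraction" \
    '($4 + 0) > (s + 0) && ($5 + 0) < (f + 0)'
}

# Made-up example rows, only to show which ones survive the thresholds.
printf 'chr1\t100\t101\t0.90\t0.25\nchr1\t200\t201\t0.40\t0.10\nchr2\t300\t301\t0.99\t0.95\nchr2\t400\t401\t0.50\t0.50\n' \
  | filter_sites 0.5 0.8
```

Both comparisons are strict, so a site with a score of exactly 0.5 or an edit fraction of exactly 0.8 is dropped. Adjust the column numbers to match the real file before relying on the result.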

Raw Source Text
remove adapter with Cutadapt
align to hg38 using STAR 2.4.0 (Homo sapiens) or mm10 using STAR 2.5.2 (Mus musculus)
SAILOR analysis to call C-to-U edits and keep only sites with score >0.5 and edit fraction <80%
FLARE analysis to call C-to-U edit clusters
Intersect the edit clusters from 3 replicates, which yields "*confident_peaks.bed"
Subtract STAMP confident clusters to Buffer only control, which yields "*cleaned_confident_peaks.bed"
Assembly: hg38
Assembly: mm10
Supplementary files format and content: SAILOR step yields bed file: *0.5Score0.8Fraction.fastqTr.sorted.STARUnmapped.out.sorted.STARAligned.out.sorted.bam.combined.readfiltered.formatted.varfiltered.snpfiltered.ranked.bed
Supplementary files format and content: FLARE step yields .tsv file: "*merged_sorted_peaks.fdr_0.1.d_15.scored.tsv"
Supplementary files format and content: Intersection step yields .bed file:  "*confident_peaks.bed"
Supplementary files format and content: Subtraction step yields .bed file:  "*cleaned_confident_peaks.bed"
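
The file name patterns for the intersection and subtraction outputs overlap: the glob `*confident_peaks.bed` also matches `*cleaned_confident_peaks.bed`. The sketch below counts intervals per intersection output while skipping the cleaned files. The helper names and the directory argument are hypothetical, and the header-skipping rule is an assumption about how the BED files are written.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count data lines in a BED file, skipping blank lines, comments,
# and UCSC-style track/browser header lines.
count_intervals() {
  awk 'NF && $1 !~ /^(#|track|browser)/ { n++ } END { print n + 0 }' "$1"
}

# Print "<file name><TAB><interval count>" for each intersection output in a directory.
# Cleaned files are excluded explicitly because the glob would otherwise pick them up.
summarize_confident() {
  local dir="$1" f
  for f in "$dir"/*confident_peaks.bed; do
    [ -e "$f" ] || continue              # the glob matched nothing
    case "$f" in
      *cleaned_confident_peaks.bed) continue ;;
    esac
    printf '%s\t%s\n' "$(basename "$f")" "$(count_intervals "$f")"
  done
}

# Directory to scan; defaults to the current directory.
summarize_confident "${1:-.}"
```

Comparing these counts with the counts of the matching cleaned files gives a quick check of how many intervals the Buffer-only subtraction changed.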