GSE232514 Processing Pipeline

RNA-Seq, 5 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature communications (2024) — PMID 39152130

Dataset

GSE232514

Expanded repertoire of RNA-editing-based detection for RNA binding protein interactions (2)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Remove adapters with Cutadapt

    cutadapt v4.0
    $ Bash example
    # Install Cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output file paths (placeholders)
    INPUT_READ1="input_R1.fastq.gz"
    INPUT_READ2="input_R2.fastq.gz"
    OUTPUT_READ1="trimmed_R1.fastq.gz"
    OUTPUT_READ2="trimmed_R2.fastq.gz"
    REPORT_FILE="cutadapt_report.txt"
    
    # Define adapter sequences (standard Illumina TruSeq adapters shown as an example; adjust to the library prep actually used)
    # For single-end reads, only -a is needed. For paired-end reads, use -a for R1 and -A for R2.
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: TruSeq adapter found at the 3' end of read 1
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example: TruSeq adapter found at the 3' end of read 2
    
    # Define trimming parameters
    MIN_LENGTH=15 # Minimum read length after trimming
    QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming
    THREADS=8 # Number of CPU cores to use
    
    # Execute Cutadapt for paired-end reads
    cutadapt \
      --cores "${THREADS}" \
      -a "${ADAPTER_R1}" \
      -A "${ADAPTER_R2}" \
      -o "${OUTPUT_READ1}" \
      -p "${OUTPUT_READ2}" \
      -m "${MIN_LENGTH}" \
      -q "${QUALITY_CUTOFF}" \
      "${INPUT_READ1}" \
      "${INPUT_READ2}" \
      > "${REPORT_FILE}" 2>&1
  2. Align to the human genome using STAR 2.7.6a

    $ Bash example
    # Install STAR using conda
    # conda create -n star_env star=2.7.6a -c bioconda -c conda-forge
    # conda activate star_env
    
    # Define variables (replace with actual paths and filenames)
    # Reference genome: Human GRCh38 (hg38)
    # Source: GENCODE (for GTF) and UCSC/ENA/NCBI (for FASTA)
    GENOME_DIR="/path/to/STAR_human_GRCh38_index" # Path to pre-built STAR genome index for human GRCh38
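    GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Path to human GRCh38 GTF file (e.g., GENCODE v44); STAR needs it uncompressed, so gunzip it first
    READ1="trimmed_R1.fastq.gz" # Adapter-trimmed reads from step 1
    READ2="trimmed_R2.fastq.gz" # Remove this line (and ${READ2} in the command below) if reads are single-end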
    OUTPUT_PREFIX="sample_aligned"
    THREADS=8 # Number of threads to use
    
    # Example command to generate STAR genome index (run once per genome/GTF combination)
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/human_GRCh38.fa \
    #      --sjdbGTFfile ${GTF_FILE} \
    #      --sjdbOverhang 100 \
    #      --runThreadN ${THREADS}
    
    # Align reads to the human genome (GRCh38)
    # --outReadsUnmapped Fastx writes unaligned reads to <prefix>Unmapped.out.mate1/2 for use in step 3
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outReadsUnmapped Fastx \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --sjdbGTFfile "${GTF_FILE}" \
         --sjdbOverhang 100
  3. Align unmapped reads from step 2 to the 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star=2.7.6a
    
    # Unmapped reads from step 2. STAR writes them as uncompressed FASTQ files named
    # <prefix>Unmapped.out.mate1 (and mate2 for paired-end data) when run with --outReadsUnmapped Fastx.
    UNMAPPED_READ1="sample_aligned_Unmapped.out.mate1"
    UNMAPPED_READ2="sample_aligned_Unmapped.out.mate2" # Remove (here and in the command below) if reads are single-end
    
    # Placeholder for the STAR index directory for the 12X MS2 stem-loop reporter mRNA.
    # Build it beforehand with STAR --runMode genomeGenerate on the reporter sequence; for a reference
    # this short, lower --genomeSAindexNbases as described in the STAR manual.
    MS2_REPORTER_STAR_INDEX="ms2_reporter_star_index"
    
    # Output prefix (the trailing underscore separates it from STAR's own file names)
    OUTPUT_PREFIX="ms2_aligned_reads_"
    
    # Align unmapped reads to the 12X MS2 stem-loop reporter mRNA
    STAR \
        --genomeDir "${MS2_REPORTER_STAR_INDEX}" \
        --readFilesIn "${UNMAPPED_READ1}" "${UNMAPPED_READ2}" \
        --outFileNamePrefix "${OUTPUT_PREFIX}" \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMattributes Standard \
        --outFilterMultimapNmax 1 \
        --outFilterMismatchNmax 3 \
        --runThreadN 8 # Adjust thread count as needed
    
  4. SAILOR analysis of data for C-to-U edits

    $ Bash example
    # Install SAILOR following the instructions in its repository (https://github.com/YeoLab/sailor).
    # The executable name and flags below are unverified placeholders; confirm them against the SAILOR documentation.
    
    # Define input and output paths
    INPUT_BAM="sample_aligned_Aligned.sortedByCoord.out.bam" # Coordinate-sorted BAM from step 2 (use the step 3 BAM for the reporter)
    OUTPUT_PREFIX="sailor_c_to_u" # Prefix for output files
    REFERENCE_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" # Placeholder: Path to GRCh38 reference FASTA
    GENE_ANNOTATION_GTF="/path/to/gencode.v44.annotation.gtf" # Placeholder: Path to the same GENCODE annotation used in step 2
    
    # Run SAILOR for C-to-U editing analysis (illustrative invocation only)
    sailor -i "${INPUT_BAM}" \
           -o "${OUTPUT_PREFIX}" \
           -r "${REFERENCE_FASTA}" \
           -g "${GENE_ANNOTATION_GTF}" \
           --strand-specific # Placeholder option: use only if the library is strand-specific and SAILOR supports such a flag
  5. SAILOR analysis of data for A-to-I edits

    SAILOR v1.0
    $ Bash example
    # Clone the SAILOR repository
    # git clone https://github.com/YeoLab/sailor.git
    # cd sailor
    
    # Install SAILOR and its dependencies as described in the repository README.
    
    # Define input/output files and reference data
    INPUT_BAM="path/to/your_aligned_reads.bam" # Replace with the actual path to your input BAM file
    OUTPUT_PREFIX="sailor_a_to_i"
    
    # Reference genome and gene annotation (GRCh38, the assembly stated for this dataset)
    # Download from sources like GENCODE, Ensembl, or UCSC
    REFERENCE_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
    GENE_ANNOTATION_GTF="path/to/gencode.v38.annotation.gtf"
    
    # Strand setting (hypothetical encoding: 0 unstranded, 1 first-strand, 2 second-strand)
    # Check the library prep and the SAILOR documentation before relying on this value
    STRAND_INFO="1"
    
    # Run SAILOR analysis for A-to-I edits
    # The script name and flags below are unverified placeholders; confirm the real invocation in the SAILOR documentation
    python SAILOR.py -i "${INPUT_BAM}" -o "${OUTPUT_PREFIX}" -r "${REFERENCE_FASTA}" -g "${GENE_ANNOTATION_GTF}" -s "${STRAND_INFO}"
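
Because each step consumes the previous step's output, it can help to check how many reads survive trimming and how many are left unmapped before the reporter alignment. The following is a minimal sketch, not part of the published pipeline: the helper name `count_fastq_reads` is hypothetical, the file names are the placeholders used in the snippets above, and the unmapped-read file exists only if STAR was run with `--outReadsUnmapped Fastx`. It assumes the conventional four lines per FASTQ record.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_fastq_reads FILE   (hypothetical helper, not part of any tool above)
# Prints the number of records in a FASTQ file, plain or gzip-compressed,
# assuming exactly four lines per record.
count_fastq_reads() {
  local fastq="$1"
  if [[ "${fastq}" == *.gz ]]; then
    gzip -dc "${fastq}" | awk 'END { print NR / 4 }'
  else
    awk 'END { print NR / 4 }' "${fastq}"
  fi
}

# Placeholder file names from the steps above; files that do not exist are skipped.
for f in input_R1.fastq.gz trimmed_R1.fastq.gz sample_aligned_Unmapped.out.mate1; do
  if [[ -f "${f}" ]]; then
    printf '%s\t%s\n' "${f}" "$(count_fastq_reads "${f}")"
  fi
done
```

A large drop between the raw and trimmed counts, or an unexpectedly small unmapped set, is worth investigating before running SAILOR.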

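Step 3 depends on a STAR index built from the reporter sequence alone. For very small references, the STAR manual recommends scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1). The sketch below computes that value from a FASTA file; the function name `suggest_sa_index_nbases` and the file name `ms2_reporter.fa` are hypothetical, and the result is rounded down, which is an assumption about how to handle fractional values.

```shell
#!/usr/bin/env bash
set -euo pipefail

# suggest_sa_index_nbases FASTA   (hypothetical helper)
# Prints floor(min(14, log2(total_bases) / 2 - 1)), with a lower bound of 1,
# for an uncompressed FASTA file.
suggest_sa_index_nbases() {
  awk '
    !/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }   # sum sequence lengths, skipping headers
    END {
      if (total == 0) { print "no sequence found" > "/dev/stderr"; exit 1 }
      n = int(log(total) / log(2) / 2 - 1)
      if (n > 14) n = 14
      if (n < 1) n = 1
      print n
    }
  ' "$1"
}

# Hypothetical reporter FASTA name; skipped when the file is absent.
REPORTER_FASTA="ms2_reporter.fa"
if [[ -f "${REPORTER_FASTA}" ]]; then
  echo "--genomeSAindexNbases $(suggest_sa_index_nbases "${REPORTER_FASTA}")"
fi
```

The printed option can be added to the `STAR --runMode genomeGenerate` call used to build the reporter index.
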
Tools Used

Raw Source Text
remove adapter with Cutadapt
align to human genome using STAR 2.7.6a
align unmapped reads from above to 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a
SAILOR analysis of data for C-to-U edits
SAILOR analysis of data for A-to-I edits
Assembly: GRCh38
Assembly: 12X MS2 stem-loop reporter mRNA
Supplementary files format and content: The processed data files are in .bed format. They contain SAILOR-identified C-to-U or A-to-I edit information for endogenous transcripts (*non.construct.bed) and the 12X stem-loop-bearing reporter mRNA (*reporter.construct.bed).
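
As a quick first look at the processed .bed files described above, edit sites can be tallied per chromosome (or per reporter sequence). This is a minimal sketch that relies only on the first BED column being the sequence name, as in standard BED; the meaning of any further columns in the SAILOR output is not assumed here, and the function name `sites_per_chrom` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# sites_per_chrom BED_FILE   (hypothetical helper)
# Prints "sequence name<TAB>number of records", sorted by sequence name.
# Track, browser and comment lines are skipped, as are lines with fewer than 3 fields.
sites_per_chrom() {
  awk -F '\t' '
    /^(track|browser|#)/ { next }
    NF >= 3 { count[$1]++ }
    END { for (chrom in count) printf "%s\t%d\n", chrom, count[chrom] }
  ' "$1" | sort -k1,1
}

# Summarize any supplementary files present in the current directory.
for bed in *non.construct.bed *reporter.construct.bed; do
  [[ -f "${bed}" ]] || continue
  echo "== ${bed}"
  sites_per_chrom "${bed}"
done
```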