GSE232514 Processing Pipeline

RNA-Seq, 5 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature Communications (2024), PMID 39152130

Dataset

Expanded repertoire of RNA-editing-based detection for RNA binding protein interactions (2)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

remove adapter with Cutadapt

Cutadapt v4.0

$ Bash example

# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths (placeholders)
INPUT_READ1="input_R1.fastq.gz"
INPUT_READ2="input_R2.fastq.gz"
OUTPUT_READ1="trimmed_R1.fastq.gz"
OUTPUT_READ2="trimmed_R2.fastq.gz"
REPORT_FILE="cutadapt_report.txt"

# Define adapter sequences (standard Illumina TruSeq adapters shown as an example; adjust to match the library prep)
# For single-end reads, only -a is needed. For paired-end reads, use -a for R1 and -A for R2.
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina TruSeq adapter, Read 1
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example: Illumina TruSeq adapter, Read 2

# Define trimming parameters
MIN_LENGTH=15 # Minimum read length after trimming
QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming
THREADS=8 # Number of CPU cores to use

# Execute Cutadapt for paired-end reads
cutadapt \
  --cores "${THREADS}" \
  -a "${ADAPTER_R1}" \
  -A "${ADAPTER_R2}" \
  -o "${OUTPUT_READ1}" \
  -p "${OUTPUT_READ2}" \
  -m "${MIN_LENGTH}" \
  -q "${QUALITY_CUTOFF}" \
  "${INPUT_READ1}" \
  "${INPUT_READ2}" \
  > "${REPORT_FILE}" 2>&1
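
After paired-end trimming, it can be worth confirming that both output files still contain the same number of reads, since an aligner expects mates in matching order. The following is a minimal sketch of such a check. The function names `count_fastq_reads` and `check_pairs_in_sync` are hypothetical helpers introduced here, not part of Cutadapt, and the sketch assumes gzipped FASTQ files with four lines per record.

```shell
# Print the number of records in a gzipped FASTQ file (4 lines per record).
count_fastq_reads() {
  fastq_lines=$(gzip -dc "$1" | wc -l)
  echo $(( fastq_lines / 4 ))
}

# Compare read counts of the two mate files; return non-zero if they differ.
check_pairs_in_sync() {
  n_r1=$(count_fastq_reads "$1")
  n_r2=$(count_fastq_reads "$2")
  if [ "${n_r1}" -eq "${n_r2}" ]; then
    echo "OK: ${n_r1} read pairs"
  else
    echo "MISMATCH: R1=${n_r1} R2=${n_r2}" >&2
    return 1
  fi
}

# Example call with the placeholder file names used in the Cutadapt step:
# check_pairs_in_sync "trimmed_R1.fastq.gz" "trimmed_R2.fastq.gz"
```

Equal counts do not prove the mates are correctly ordered, but a mismatch is a quick signal that trimming was run on the two files separately rather than in paired mode.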

align to human genome using STAR 2.7.6a

STAR v2.7.6a

$ Bash example

# Install STAR using conda
# conda create -n star_env star=2.7.6a -c bioconda -c conda-forge
# conda activate star_env

# Define variables (replace with actual paths and filenames)
# Reference genome: Human GRCh38 (hg38)
# Source: GENCODE (for GTF) and UCSC/ENA/NCBI (for FASTA)
GENOME_DIR="/path/to/STAR_human_GRCh38_index" # Path to pre-built STAR genome index for human GRCh38
GTF_FILE="/path/to/gencode.v44.annotation.gtf" # Path to the uncompressed human GRCh38 GTF file (e.g., GENCODE v44); STAR does not read gzipped GTF files
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz" # Remove this line if reads are single-end
OUTPUT_PREFIX="sample_aligned"
THREADS=8 # Number of threads to use

# Example command to generate STAR genome index (run once per genome/GTF combination)
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/human_GRCh38.fa \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN ${THREADS}

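# Align reads to the human genome (GRCh38).
# --outReadsUnmapped Fastx writes the reads that fail to map to
# ${OUTPUT_PREFIX}_Unmapped.out.mate1 (and mate2), the input for the reporter alignment step.
STAR --genomeDir ${GENOME_DIR} \
     --outReadsUnmapped Fastx \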
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --sjdbGTFfile ${GTF_FILE} \
     --sjdbOverhang 100
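
The next step consumes the reads that did not map to the genome. STAR only writes them out when run with `--outReadsUnmapped Fastx`, in which case they land in `<prefix>Unmapped.out.mate1` (plus `mate2` for paired-end data) as uncompressed text. The sketch below compresses the mate-1 file into the form the reporter alignment expects. The function name `collect_unmapped` and the destination file name are assumptions made for illustration.

```shell
# Gzip the unmapped mate-1 reads that STAR wrote for a given output prefix.
# Assumes STAR was run with --outReadsUnmapped Fastx.
collect_unmapped() {
  star_prefix="$1"    # value passed to --outFileNamePrefix, e.g. "sample_aligned_"
  out_fastq_gz="$2"   # hypothetical destination, e.g. "unmapped_reads.fastq.gz"
  unmapped_mate1="${star_prefix}Unmapped.out.mate1"
  if [ ! -s "${unmapped_mate1}" ]; then
    echo "No unmapped reads found at ${unmapped_mate1}" >&2
    return 1
  fi
  gzip -c "${unmapped_mate1}" > "${out_fastq_gz}"
}

# Example call with the placeholder names used in this pipeline:
# collect_unmapped "sample_aligned_" "unmapped_reads.fastq.gz"
```

For paired-end data, the `mate2` file would need the same treatment, and both files would be passed to `--readFilesIn` in the reporter alignment.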

align unmapped reads from above to 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a

STAR v2.7.6a

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star=2.7.6a

# Placeholder for input unmapped reads (from a previous step)
# This file would contain reads that did not map to a primary genome.
UNMAPPED_READS="unmapped_reads.fastq.gz"

# Placeholder for STAR index directory for the 12X MS2 stem-loop reporter mRNA.
# This index needs to be pre-built using STAR --runMode genomeGenerate with the MS2 reporter sequence.
MS2_REPORTER_STAR_INDEX="ms2_reporter_star_index"

# Placeholder for output prefix
OUTPUT_PREFIX="ms2_aligned_reads"

# Align unmapped reads to the 12X MS2 stem-loop reporter mRNA
STAR \
    --genomeDir "${MS2_REPORTER_STAR_INDEX}" \
    --readFilesIn "${UNMAPPED_READS}" \
    --readFilesCommand zcat \
    --outFileNamePrefix "${OUTPUT_PREFIX}" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 3 \
    --runThreadN 8 # Adjust thread count as needed
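
The reporter index mentioned above covers a reference of only a few kilobases, far smaller than a genome. For small references, the STAR manual recommends scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1) when building the index. The sketch below computes that value from a FASTA file. The helper names and the file name `ms2_reporter.fa` are hypothetical, and the lower bound of 1 is a guard added here rather than part of STAR's recommendation.

```shell
# Total number of sequence characters in a FASTA file (header lines excluded).
fasta_length() {
  grep -v '^>' "$1" | tr -d '\n\r' | wc -c | tr -d ' '
}

# min(14, log2(genome_length)/2 - 1), rounded down, with a floor of 1.
sa_index_nbases() {
  awk -v n="$1" 'BEGIN {
    v = int(log(n) / log(2) / 2 - 1)
    if (v > 14) v = 14
    if (v < 1) v = 1
    print v
  }'
}

# Example index build (run once; ms2_reporter.fa is a placeholder):
# STAR --runMode genomeGenerate \
#      --genomeDir ms2_reporter_star_index \
#      --genomeFastaFiles ms2_reporter.fa \
#      --genomeSAindexNbases "$(sa_index_nbases "$(fasta_length ms2_reporter.fa)")"
```

For a 3,000-nucleotide reference this gives 4, while a human-sized genome stays at STAR's default of 14.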

SAILOR analysis of data for C-to-U edits

SAILOR v1.0

$ Bash example

# Install SAILOR via conda
# conda create -n sailor_env python=3.8
# conda activate sailor_env
# conda install -c bioconda sailor

# Define input and output paths
INPUT_BAM="input.bam" # Replace with your input RNA-seq BAM file
OUTPUT_PREFIX="sailor_output" # Prefix for output files
REFERENCE_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" # Placeholder: Path to GRCh38 reference FASTA
GENE_ANNOTATION_GTF="/path/to/gencode.v38.annotation.gtf" # Placeholder: Path to GENCODE v38 gene annotation GTF

# Run SAILOR for C-to-U editing analysis
# The flags below are illustrative placeholders rather than a verified interface; consult the SAILOR documentation for the actual invocation.
sailor -i "${INPUT_BAM}" \
       -o "${OUTPUT_PREFIX}" \
       -r "${REFERENCE_FASTA}" \
       -g "${GENE_ANNOTATION_GTF}" \
       --strand-specific # Example option: use if your RNA-seq library is strand-specific

SAILOR analysis of data for A-to-I edits

SAILOR v1.0

$ Bash example

# Clone the SAILOR repository (maintained by the Yeo lab)
# git clone https://github.com/YeoLab/sailor.git
# cd sailor

# Create a conda environment and install dependencies (pysam, numpy, scipy, matplotlib)
# conda create -n sailor_env python=3.8 -y
# conda activate sailor_env
# pip install pysam numpy scipy matplotlib

# Define input/output files and reference data
# Replace 'path/to/your_aligned_reads.bam' with the actual path to your input BAM file
INPUT_BAM="path/to/your_aligned_reads.bam"
OUTPUT_PREFIX="sailor_output"

# Reference genome and gene annotation (using GRCh38 as the latest assembly placeholder)
# Download from sources like GENCODE, Ensembl, or UCSC
REFERENCE_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
GENE_ANNOTATION_GTF="path/to/gencode.v38.annotation.gtf"

# Strand information: 0 for unstranded, 1 for first-strand, 2 for second-strand
# Assuming first-strand for typical RNA-seq data, adjust if your data is different
STRAND_INFO="1"

# Run SAILOR analysis for A-to-I edits
# The script name and flags below are illustrative placeholders rather than a verified interface; check the repository's documentation for the actual entry point
python SAILOR.py -i "${INPUT_BAM}" -o "${OUTPUT_PREFIX}" -r "${REFERENCE_FASTA}" -g "${GENE_ANNOTATION_GTF}" -s "${STRAND_INFO}"
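
Both SAILOR steps look for one specific mismatch between the aligned reads and the reference. In cDNA, uridine is read as T and inosine is read as G, so a C-to-U edit appears as C>T and an A-to-I edit as A>G for transcripts on the forward strand, and as the complementary change (G>A or T>C) for transcripts on the reverse strand. The lookup below sketches that mapping. The function name `expected_mismatch` is a hypothetical helper, not part of SAILOR.

```shell
# Print the reference>read mismatch expected on the forward reference strand
# for a given edit type and transcript strand.
expected_mismatch() {
  case "$1:$2" in
    "C-to-U:+") echo "C>T" ;;
    "C-to-U:-") echo "G>A" ;;
    "A-to-I:+") echo "A>G" ;;
    "A-to-I:-") echo "T>C" ;;
    *)
      echo "Unknown edit type or strand: $1 $2" >&2
      return 1
      ;;
  esac
}

# Example calls:
# expected_mismatch "C-to-U" "-"
# expected_mismatch "A-to-I" "+"
```

This is why strand information matters for edit calling: without it, a G>A mismatch could be a C-to-U edit on the reverse strand or an unrelated variant on the forward strand.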

Tools Used

Cutadapt STAR SAILOR

Raw Source Text

remove adapter with Cutadapt
align to human genome using STAR 2.7.6a
align unmapped reads from above to 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a
SAILOR analysis of data for C-to-U edits
SAILOR analysis of data for A-to-I edits
Assembly: GRCh38
Assembly: 12X MS2 stem-loop reporter mRNA
Supplementary files format and content: The processed data files are in .bed format. They contain SAILOR-identified C-to-U or A-to-I edit information for endogenous transcripts (*non.construct.bed) and the 12X stem-loop-bearing reporter mRNA (*reporter.construct.bed).
