GSE232513 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature communications (2024) — PMID 39152130

Dataset

Expanded repertoire of RNA-editing-based detection for RNA binding protein interactions (1)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

remove adapter with Cutadapt

cutadapt v4.0 GitHub

$ Bash example

# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file paths
INPUT_R1="reads_R1.fastq.gz"
INPUT_R2="reads_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences (common Illumina adapters, adjust as needed)
# For 3' adapter on Read 1
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# For 3' adapter on Read 2 (often the reverse complement of R1 adapter or a different sequence)
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define minimum read length after trimming
MIN_LENGTH=18

# Define quality cutoff
QUALITY_CUTOFF=20

# Define number of threads to use
THREADS=8

# Execute Cutadapt for paired-end reads
cutadapt \
  -j ${THREADS} \
  -q ${QUALITY_CUTOFF} \
  --minimum-length ${MIN_LENGTH} \
  -a ${ADAPTER_R1} \
  -A ${ADAPTER_R2} \
  -o ${OUTPUT_R1} \
  -p ${OUTPUT_R2} \
  ${INPUT_R1} ${INPUT_R2}

# For single-end reads, use a command like this:
# INPUT_SE="reads.fastq.gz"
# OUTPUT_SE="trimmed.fastq.gz"
# cutadapt \
#   -j ${THREADS} \
#   -q ${QUALITY_CUTOFF} \
#   --minimum-length ${MIN_LENGTH} \
#   -a ${ADAPTER_R1} \
#   -o ${OUTPUT_SE} \
#   ${INPUT_SE}

View on GitHub

align to genome using bowtie2 or bwa-mem

Bowtie2 v2.4.2 GitHub

$ Bash example

# Install Bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Define variables
# Placeholder for human genome GRCh38 index. Replace with your actual index path.
# The index should be built using bowtie2-build from a FASTA file (e.g., GRCh38.fa).
GENOME_INDEX="/path/to/your/genome/index/GRCh38" # Example: /data/indexes/bowtie2/GRCh38/GRCh38

# Placeholder for input FASTQ file(s). Replace with your actual file paths.
# For single-end reads:
INPUT_READS="input.fastq.gz"

# For paired-end reads, uncomment and set these variables:
# READ1="input_R1.fastq.gz"
# READ2="input_R2.fastq.gz"

OUTPUT_SAM="aligned.sam"
NUM_THREADS=8 # Number of threads to use

# Align single-end reads to the genome using Bowtie2
bowtie2 -x "${GENOME_INDEX}" \
        -U "${INPUT_READS}" \
        -S "${OUTPUT_SAM}" \
        -p "${NUM_THREADS}"

# To align paired-end reads, comment out the single-end command above and uncomment/use the following:
# bowtie2 -x "${GENOME_INDEX}" \
#         -1 "${READ1}" \
#         -2 "${READ2}" \
#         -S "${OUTPUT_SAM}" \
#         -p "${NUM_THREADS}"

# Optional: Convert SAM to BAM, sort, and index for downstream analysis (e.g., peak calling)
# # Install samtools (if not already installed)
# # conda install -c bioconda samtools
# samtools view -bS "${OUTPUT_SAM}" > "aligned.bam"
# samtools sort "aligned.bam" -o "aligned.sorted.bam"
# samtools index "aligned.sorted.bam"

View on GitHub

generate count tables along reporter sequence using Pysamstats

Pysamstats v1.2.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Pysamstats (if not already installed)
# conda install -c bioconda pysamstats

# Define input and output files
INPUT_BAM="aligned_reads.bam"
REPORTER_BED="reporter_sequences.bed"
GENOME_FASTA="hg38.fa" # Placeholder for reference genome FASTA
OUTPUT_COUNTS="reporter_sequence_counts.tsv"

# Generate count tables along reporter sequences using Pysamstats
# --type coverage_all: Calculates coverage for all reads (forward and reverse)
# --fasta: Provides the reference genome for sequence-based statistics (optional for simple coverage, but good practice)
# --regions: Specifies a BED file containing the reporter sequences to analyze
pysamstats --type coverage_all \
           --fasta ${GENOME_FASTA} \
           --regions ${REPORTER_BED} \
           ${INPUT_BAM} > ${OUTPUT_COUNTS}

Tools Used

Bowtie2

Raw Source Text

remove adapter with Cutadapt
align to genome using bowtie2 or bwa-mem
generate count tables along reporter sequence using Pysamstats
Assembly: 12X MS2 stem-loop reporter mRNA
Supplementary files format and content: pysamstats tables that contain the read counts for RNA bases  at each position along the reporter mRNA sequence

← Back to Analysis