GSE232513 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature communications (2024) — PMID 39152130

Dataset

GSE232513

Expanded repertoire of RNA-editing-based detection for RNA binding protein interactions (1)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    remove adapter with Cutadapt

    cutadapt v4.0 GitHub
    $ Bash example
    # Install Cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # Define input and output file paths
    INPUT_R1="reads_R1.fastq.gz"
    INPUT_R2="reads_R2.fastq.gz"
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    
    # Define adapter sequences (common Illumina adapters, adjust as needed)
    # For 3' adapter on Read 1
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    # For 3' adapter on Read 2 (often the reverse complement of R1 adapter or a different sequence)
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Define minimum read length after trimming
    MIN_LENGTH=18
    
    # Define quality cutoff
    QUALITY_CUTOFF=20
    
    # Define number of threads to use
    THREADS=8
    
    # Execute Cutadapt for paired-end reads
    cutadapt \
      -j ${THREADS} \
      -q ${QUALITY_CUTOFF} \
      --minimum-length ${MIN_LENGTH} \
      -a ${ADAPTER_R1} \
      -A ${ADAPTER_R2} \
      -o ${OUTPUT_R1} \
      -p ${OUTPUT_R2} \
      ${INPUT_R1} ${INPUT_R2}
    
    # For single-end reads, use a command like this:
    # INPUT_SE="reads.fastq.gz"
    # OUTPUT_SE="trimmed.fastq.gz"
    # cutadapt \
    #   -j ${THREADS} \
    #   -q ${QUALITY_CUTOFF} \
    #   --minimum-length ${MIN_LENGTH} \
    #   -a ${ADAPTER_R1} \
    #   -o ${OUTPUT_SE} \
    #   ${INPUT_SE}
  2. 2

    align to genome using bowtie2 or bwa-mem

    $ Bash example
    # Install Bowtie2 (if not already installed)
    # conda install -c bioconda bowtie2
    
    # Define variables
    # Placeholder for human genome GRCh38 index. Replace with your actual index path.
    # The index should be built using bowtie2-build from a FASTA file (e.g., GRCh38.fa).
    GENOME_INDEX="/path/to/your/genome/index/GRCh38" # Example: /data/indexes/bowtie2/GRCh38/GRCh38
    
    # Placeholder for input FASTQ file(s). Replace with your actual file paths.
    # For single-end reads:
    INPUT_READS="input.fastq.gz"
    
    # For paired-end reads, uncomment and set these variables:
    # READ1="input_R1.fastq.gz"
    # READ2="input_R2.fastq.gz"
    
    OUTPUT_SAM="aligned.sam"
    NUM_THREADS=8 # Number of threads to use
    
    # Align single-end reads to the genome using Bowtie2
    bowtie2 -x "${GENOME_INDEX}" \
            -U "${INPUT_READS}" \
            -S "${OUTPUT_SAM}" \
            -p "${NUM_THREADS}"
    
    # To align paired-end reads, comment out the single-end command above and uncomment/use the following:
    # bowtie2 -x "${GENOME_INDEX}" \
    #         -1 "${READ1}" \
    #         -2 "${READ2}" \
    #         -S "${OUTPUT_SAM}" \
    #         -p "${NUM_THREADS}"
    
    # Optional: Convert SAM to BAM, sort, and index for downstream analysis (e.g., peak calling)
    # # Install samtools (if not already installed)
    # # conda install -c bioconda samtools
    # samtools view -bS "${OUTPUT_SAM}" > "aligned.bam"
    # samtools sort "aligned.bam" -o "aligned.sorted.bam"
    # samtools index "aligned.sorted.bam"
  3. 3

    generate count tables along reporter sequence using Pysamstats

    Pysamstats v1.2.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Pysamstats (if not already installed)
    # conda install -c bioconda pysamstats
    
    # Define input and output files
    INPUT_BAM="aligned_reads.bam"
    REPORTER_BED="reporter_sequences.bed"
    GENOME_FASTA="hg38.fa" # Placeholder for reference genome FASTA
    OUTPUT_COUNTS="reporter_sequence_counts.tsv"
    
    # Generate count tables along reporter sequences using Pysamstats
    # --type coverage_all: Calculates coverage for all reads (forward and reverse)
    # --fasta: Provides the reference genome for sequence-based statistics (optional for simple coverage, but good practice)
    # --regions: Specifies a BED file containing the reporter sequences to analyze
    pysamstats --type coverage_all \
               --fasta ${GENOME_FASTA} \
               --regions ${REPORTER_BED} \
               ${INPUT_BAM} > ${OUTPUT_COUNTS}

Tools Used

Raw Source Text
remove adapter with Cutadapt
align to genome using bowtie2 or bwa-mem
generate count tables along reporter sequence using Pysamstats
Assembly: 12X MS2 stem-loop reporter mRNA
Supplementary files format and content: pysamstats tables that contain the read counts for RNA bases  at each position along the reporter mRNA sequence
← Back to Analysis