GSE107867 Processing Pipeline

RNA-Seq code_examples 14 steps

Publication

Widespread RNA editing dysregulation in brains from autistic individuals.

Nature neuroscience (2019) — PMID 30559470

Dataset

Widespread RNA editing dysregulation in Autism Spectrum Disorders [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran --no-unal --reorder --no-mixed

HISAT2 v2.2.1 GitHub

$ Bash example

# Install HISAT2 (if not already installed)
# conda install -c bioconda hisat2

# Define variables
# HISAT2_INDEX_PATH should point to the base name of your HISAT2 index files (e.g., /path/to/grch37_tran/genome_tran)
# This index is for GRCh37 (human genome build 37) and includes transcriptome information for splice-aware alignment.
HISAT2_INDEX_PATH="grch37_tran/genome_tran" 

# Input FASTQ files (adjust for single-end or paired-end)
# For single-end reads:
INPUT_FASTQ_SE="input_single_end.fastq.gz"
# For paired-end reads:
INPUT_FASTQ_PE1="input_read1.fastq.gz"
INPUT_FASTQ_PE2="input_read2.fastq.gz"

OUTPUT_SAM="output.sam" # Desired output SAM file name

# Execute HISAT2 mapping (example for single-end reads)
# Parameters:
# -q: Quiet mode (suppress verbose output)
# --phred64: Input quality scores are Phred+64 (default is Phred+33)
# -x: Path to the HISAT2 index
# -U: Input single-end FASTQ file
# --no-unal: Do not output unaligned reads
# --reorder: Output alignments in the same order as the input reads
# --no-mixed: Do not report paired-end alignments where one mate is unaligned
# -S: Output SAM file
hisat2 -q --phred64 \
       -x "${HISAT2_INDEX_PATH}" \
       -U "${INPUT_FASTQ_SE}" \
       --no-unal \
       --reorder \
       --no-mixed \
       -S "${OUTPUT_SAM}"

# To run with paired-end reads, uncomment and modify the following block:
# hisat2 -q --phred64 \
#        -x "${HISAT2_INDEX_PATH}" \
#        -1 "${INPUT_FASTQ_PE1}" \
#        -2 "${INPUT_FASTQ_PE2}" \
#        --no-unal \
#        --reorder \
#        --no-mixed \
#        -S "${OUTPUT_SAM}"

View on GitHub

Extract uniquely mapped reads from SAM files

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_SAM="input.sam"
OUTPUT_BAM="uniquely_mapped.bam"

# Extract uniquely mapped reads from SAM file
# -b: Output in BAM format
# -h: Include header in the output
# -F 0x4: Exclude unmapped reads
# -F 0x100: Exclude secondary alignments
# -F 0x800: Exclude supplementary alignments
# -q 30: Only output alignments with mapping quality (MAPQ) >= 30 (a common threshold for uniquely mapped reads)
samtools view -b -h -F 0x4 -F 0x100 -F 0x800 -q 30 "${INPUT_SAM}" > "${OUTPUT_BAM}"

View on GitHub

Remap unmapped reads with hyperediting pipeline.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define reference genome paths (replace with actual paths)
# Using hg38 as a placeholder for the latest human assembly
GENOME_DIR="/path/to/star_index/hg38"
GENOME_FASTA="/path/to/reference/hg38.fa"
GTF_FILE="/path/to/annotations/gencode.v38.annotation.gtf"

# Input unmapped reads file (replace with actual path, assuming single-end reads)
UNMAPPED_READS_FASTQ="unmapped_reads.fastq.gz"

# Output prefix for remapped files
OUTPUT_PREFIX="remapped_hyperediting_"

# Number of threads to use
NUM_THREADS=8

# --- Generate STAR genome index (run once per genome, if not already done) ---
# This step creates the necessary index files for STAR alignment.
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --runThreadN ${NUM_THREADS}

# --- Remap unmapped reads with STAR ---
# Parameters are chosen to be relatively permissive for potential hyperedited regions,
# which may contain a high number of mismatches (A-to-I edits).
# outFilterMismatchNmax 999 and outFilterMismatchNoverLmax 0.1 allow for many mismatches.
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${UNMAPPED_READS_FASTQ} \
     --runThreadN ${NUM_THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000

# Note: The primary output BAM file will be named ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam
# Unmapped reads from this remapping step will be in ${OUTPUT_PREFIX}Unmapped.out.mate1

View on GitHub

In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.

bwa mem (Inferred with models/gemini-2.5-flash) v0.7.17 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bwa
# conda install -c bioconda bwa

# Install samtools
# conda install -c bioconda samtools

# --- Reference Genome Preparation ---
# Download hg19 reference genome (if not already available)
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz

# Create a modified hg19 reference where all 'A' are replaced by 'G' (case-insensitive)
sed 's/A/G/gI' hg19.fa > hg19_A_to_G.fa

# Index the modified reference genome for BWA
bwa index hg19_A_to_G.fa

# --- Read Preparation and Alignment ---
# Assume input reads are in input_reads.fastq.gz
# Create modified reads where all 'A' are replaced by 'G' (case-insensitive) in the sequence lines.
# This awk command processes FASTQ files line by line, modifying only the sequence line (2nd line of each 4-line record).
gunzip -c input_reads.fastq.gz | awk 'BEGIN {OFS="\n"} {
    print $0; # Print header line
    getline; # Read sequence line into $0
    gsub(/A/,"G",$0); # Replace 'A' with 'G'
    gsub(/a/,"g",$0); # Replace 'a' with 'g'
    print $0; # Print modified sequence line
    getline; print $0; # Read and print plus line
    getline; print $0; # Read and print quality line
}' | gzip > modified_reads.fastq.gz

# Align modified reads to the modified hg19 genome
# -t 8: Use 8 threads for alignment
# -M: Mark shorter split hits as secondary (for Picard compatibility)
bwa mem -t 8 hg19_A_to_G.fa modified_reads.fastq.gz | samtools view -bS - | samtools sort -o aligned_A_to_G_sorted.bam -

# Index the resulting BAM file
samtools index aligned_A_to_G_sorted.bam

View on GitHub

Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge uniquely mapped reads from the hyperediting pipeline
# with uniquely mapped reads from the first round of mapping.
# Assuming input BAM files are 'hyperediting_uniquely_mapped.bam'
# and 'first_round_uniquely_mapped.bam'.
samtools merge combined_uniquely_mapped.bam hyperediting_uniquely_mapped.bam first_round_uniquely_mapped.bam

View on GitHub

Identify RNA-DNA differences between mapped reads and hg19 reference genome.

REDItools (Inferred with models/gemini-2.5-flash) v2.0

$ Bash example

# Install REDItools (if not already installed)
# conda create -n reditools_env python=3.8
# conda activate reditools_env
# conda install -c bioconda reditools=2.0

# Define input and reference files
# Placeholder for your aligned RNA-seq reads in BAM format
MAPPED_READS="input.bam" 
# Placeholder for the hg19 reference genome FASTA file
# Source: For example, from GATK resource bundle or UCSC goldenPath
REFERENCE_GENOME="/path/to/reference/GRCh37/hg19.fa" 
# Placeholder for a dbSNP VCF file for hg19 (to filter out known germline variants)
# Source: For example, from GATK resource bundle (e.g., dbsnp_138.hg19.vcf.gz)
DBSNP_VCF="/path/to/reference/GRCh37/dbsnp_138.hg19.vcf.gz" 
OUTPUT_PREFIX="rna_dna_differences"

# Ensure the input BAM file is indexed (if not already)
# samtools index "${MAPPED_READS}"

# Run REDItools RD.py to identify RNA-DNA differences
# -i: Input BAM file
# -f: Reference genome FASTA file
# -o: Output file prefix (will generate .txt and .vcf if -v is used)
# -S: dbSNP VCF file (highly recommended to filter out known DNA variants)
# -m: Minimum mapping quality (e.g., 20)
# -q: Minimum base quality (e.g., 20)
# -c: Minimum coverage at a site to consider (e.g., 10)
# -u: Minimum number of reads supporting the variant allele (e.g., 3)
# -v: Output results in VCF format
# -g: Specify the genomic region to analyze (optional, for parallelization or specific regions)
# -t: Number of threads (optional)

reditools RD.py \
    -i "${MAPPED_READS}" \
    -f "${REFERENCE_GENOME}" \
    -o "${OUTPUT_PREFIX}" \
    -S "${DBSNP_VCF}" \
    -m 20 \
    -q 20 \
    -c 10 \
    -u 3 \
    -v

Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al.

RDD v1.0

$ Bash example

# RDD is a Perl script. Ensure Perl is installed.
# conda install -c bioconda perl

# Download the RDD script (if not already available)
# wget https://raw.githubusercontent.com/yarden/RDD/master/RDD.pl
# chmod +x RDD.pl

# Define input and output files/paths
INPUT_BAM="aligned_reads.bam" # Replace with your input BAM file
OUTPUT_PREFIX="rdd_analysis" # Prefix for output files
REFERENCE_GENOME="path/to/human_genome.fa" # Placeholder: e.g., /path/to/GRCh38.fa

# Execute RDD to perform log-likelihood test for RNA degradation
perl RDD.pl -i "${INPUT_BAM}" -o "${OUTPUT_PREFIX}" -r "${REFERENCE_GENOME}"

PMID: 21960545)

CLIPper vOriginal (from PMID: 21960545) GitHub

$ Bash example

# CLIPper is a Python script. Ensure Python 3 is installed.
# Required libraries: pysam, numpy, scipy
# pip install pysam numpy scipy

# Clone the repository if you haven't already
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned eCLIP/PAR-CLIP reads
GENOME_SIZES="hg38.chrom.sizes" # Placeholder for your genome sizes file (e.g., from UCSC)
OUTPUT_PREFIX="clipper_peaks"

# Example: Download hg38 chrom.sizes if not available
# wget -O ${GENOME_SIZES} http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# Run CLIPper for peak calling
# This command uses common parameters. Adjust -p (p-value) and -c (min read count) as needed.
# Ensure you are in the directory where clipper.py is located or provide the full path to clipper.py
python clipper.py \
    -b "${INPUT_BAM}" \
    -s "${GENOME_SIZES}" \
    -o "${OUTPUT_PREFIX}" \
    -p 0.01 \
    -c 10 \
    -v

View on GitHub

Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al.

skipper (Inferred with models/gemini-2.5-flash) v0.1.0 GitHub

$ Bash example

# Install skipper (if not already installed, preferably in a conda environment)
# conda create -n skipper_env python=3.8
# conda activate skipper_env
# pip install git+https://github.com/yeolab/skipper.git@main

# Define variables for input/output files and reference data
INPUT_PEAKS="path/to/your/input_peaks.bed"
OUTPUT_FILTERED_PEAKS="path/to/your/output_filtered_peaks.bed"
RDD_SITES_BED="path/to/your/rdd_sites.bed" # Pre-computed RDD sites BED file (e.g., from Bahn et al. 2021 supplementary data)

# Example parameters (adjust as needed based on specific filtering criteria)
MIN_FOLD_ENRICHMENT=2.0
MIN_LOG2_FOLD_ENRICHMENT=1.0
MIN_PVALUE=0.05
MIN_FDR=0.05
MIN_PEAK_LENGTH=10
MIN_READS_IN_IP=5

# Execute the filtering script from skipper
# The script filter_rdd_sites.py is typically found within the skipper workflow/scripts directory.
# Ensure the script is executable or call it with python.
python -m skipper.workflow.scripts.filter_rdd_sites \
    --input_peaks "${INPUT_PEAKS}" \
    --output_peaks "${OUTPUT_FILTERED_PEAKS}" \
    --rdd_sites "${RDD_SITES_BED}" \
    --min_fold_enrichment "${MIN_FOLD_ENRICHMENT}" \
    --min_log2_fold_enrichment "${MIN_LOG2_FOLD_ENRICHMENT}" \
    --min_pvalue "${MIN_PVALUE}" \
    --min_fdr "${MIN_FDR}" \
    --min_peak_length "${MIN_PEAK_LENGTH}" \
    --min_reads_in_ip "${MIN_READS_IN_IP}"

View on GitHub

PMID: 21960545 and Lee et. al.

CLIPper v1.0 GitHub

$ Bash example

# Install CLIPper (example using git clone and python setup.py)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install --user # Install to user's local site-packages

# Define input files and reference genome
# Replace with actual paths to your experiment and control BAM files
EXPERIMENT_BAM="path/to/your/experiment.bam"
CONTROL_BAM="path/to/your/control.bam"

# Define output prefix for CLIPper results
OUTPUT_PREFIX="clipper_peaks"

# Define the path to the CLIPper script
# Adjust this path if CLIPper is installed in a different location or added to PATH
CLIPPER_SCRIPT="path/to/clipper/clipper.py"

# Run CLIPper for peak calling
# -s: Species (e.g., hg38, mm10). CLIPper uses this to find genome files if they are in a standard location.
#     Alternatively, you can provide a path to the genome FASTA if -s is not sufficient.
# -o: Output prefix for the generated files.
# -c: Path to the control BAM file.
python "${CLIPPER_SCRIPT}" \
  -s hg38 \
  -o "${OUTPUT_PREFIX}" \
  -c "${CONTROL_BAM}" \
  "${EXPERIMENT_BAM}"

View on GitHub

PMID: 23598527)

clipper (Inferred with models/gemini-2.5-flash) v0.0.1 GitHub

$ Bash example

# Install clipper (assuming Python environment is set up)
# It's recommended to clone the repository and run directly or install via pip if available.
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install .

# Example usage of clipper for eCLIP peak calling
# This command assumes pre-aligned BAM files for input and control, and reference genome/annotation files.
# Replace 'input.bam', 'control.bam', 'hg38.fa', and 'gencode.v38.annotation.gtf' with actual file paths.
# Reference genome: GRCh38 (hg38) - latest human assembly used as a placeholder.
# Annotation: GENCODE v38 - latest human annotation used as a placeholder.

python clipper.py \
  -s input.bam \
  -o peaks.bed \
  -c control.bam \
  -p 0.01 \
  -f 0.05 \
  -u 10 \
  -d 10 \
  -v 5 \
  -m 10 \
  -g hg38.fa \
  -a gencode.v38.annotation.gtf

View on GitHub

Retain only highly prevalent editing sites.

awk (Inferred with models/gemini-2.5-flash) vGNU Awk (gawk) system default

$ Bash example

# This command filters a tab-separated file of RNA editing sites.
# It assumes that one column (e.g., the 5th column here) contains a numerical score
# representing the prevalence (e.g., frequency, or count of samples where it's found).
# Users should adjust the column number and the threshold value based on their data format and definition of 'highly prevalent'.
#
# Example: Filter for sites with a prevalence score (e.g., frequency) of 0.8 or higher in the 5th column.
# Replace 'input_editing_sites.tsv' with your actual input file and 'highly_prevalent_sites.tsv' with your desired output file name.
# Replace '0.8' with your specific prevalence threshold.

awk -v THRESHOLD=0.8 '$5 >= THRESHOLD' input_editing_sites.tsv > highly_prevalent_sites.tsv

In particular, filter for editing sites that for at least 5 people have at least 5 total reads of which 2 reads are edited

bcftools (Inferred with models/gemini-2.5-flash) v1.18 GitHub

$ Bash example

# Install bcftools if not already installed
# conda install -c bioconda bcftools

# Define input and output files
# Assumes input_editing_sites.vcf.gz is a VCF file containing editing sites across multiple samples.
# The VCF should have FORMAT fields for DP (Total Depth) and AD (Allelic Depths).
INPUT_VCF="input_editing_sites.vcf.gz"
OUTPUT_VCF="filtered_editing_sites.vcf.gz"

# Filter for editing sites where at least 5 samples (people) meet the criteria:
# - Total Depth (DP) is at least 5
# - Alternate Allele Depth (AD[1], representing edited reads) is at least 2
# Reference dataset: Not explicitly mentioned. Assumes a standard human reference genome (e.g., GRCh38) was used during VCF generation.
bcftools view \
    --include 'N_PASS(DP>=5 && AD[1]>=2) >= 5' \
    --output-type z \
    --output "$OUTPUT_VCF" \
    "$INPUT_VCF"

View on GitHub

Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.

Ensembl v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem=1.3.3

# --- Reference Data Preparation (run once) ---
# Download hg19 reference genome FASTA (Ensembl GRCh37 release 75)
# wget -nc ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
# gunzip -f Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
# GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa"

# Download Ensembl hg19 GTF annotation (Ensembl GRCh37 release 75)
# wget -nc ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip -f Homo_sapiens.GRCh37.75.gtf.gz
# GTF_FILE="Homo_sapiens.GRCh37.75.gtf"

# Define RSEM index prefix
RSEM_INDEX_PREFIX="hg19_ensembl_rsem_index"

# Build RSEM reference index
# This step needs to be run only once for a given reference genome and annotation.
# rsem-prepare-reference --gtf "${GTF_FILE}" \
#                        --fasta "${GENOME_FASTA}" \
#                        "${RSEM_INDEX_PREFIX}"

# --- Gene Expression Quantification ---
# Define input BAM file (e.g., from STAR alignment, coordinate-sorted)
INPUT_BAM="path/to/your/aligned_reads.bam"

# Define output prefix for RSEM results
OUTPUT_PREFIX="sample_expression"

# Quantify gene expression using RSEM
# RSEM will output *.genes.results and *.isoforms.results files,
# which contain RPKM values (among other metrics like TPM, FPKM, counts).
# Adjust --num-threads based on available computational resources.
# RSEM automatically detects paired-end reads if the BAM file is properly formatted.
rsem-calculate-expression --bam \
                          --num-threads 8 \
                          "${INPUT_BAM}" \
                          "${RSEM_INDEX_PREFIX}" \
                          "${OUTPUT_PREFIX}"

View on GitHub

Raw Source Text

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran  --no-unal --reorder --no-mixed
Extract uniquely mapped reads from SAM files
Remap unmapped reads with hyperediting pipeline. In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.
Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping
Identify RNA-DNA differences between mapped reads and hg19 reference genome.
Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al. PMID: 21960545)
Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al. PMID: 21960545 and Lee et. al. PMID: 23598527)
Retain only highly prevalent editing sites. In particular, filter for editing sites that for at least 5 people have at least 5 total reads of which 2 reads are edited
Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.
Genome_build: hg19
Supplementary_files_format_and_content: gene expression (RPKM)

← Back to Analysis