GSE117776 Processing Pipeline

RNA-Seq code_examples 14 steps

Publication

Widespread RNA editing dysregulation in brains from autistic individuals.

Nature neuroscience (2019) — PMID 30559470

Dataset

Widespread RNA editing dysregulation in Autism Spectrum Disorders II

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran --no-unal --reorder --no-mixed

HISAT2 v2.2.1 GitHub

$ Bash example

# Install HISAT2 (if not already installed)
# conda install -c bioconda hisat2

# Define input FASTQ files (assuming paired-end reads based on --no-mixed flag)
# Replace with your actual input file paths
READ1="your_read1.fastq.gz"
READ2="your_read2.fastq.gz"

# Define output SAM file path
OUTPUT_SAM="aligned_reads.sam"

# Define HISAT2 index path
# The description specifies 'grch37_tran/genome_tran', indicating a GRCh37 (hg19) human genome index
# with transcriptome information. Ensure this path points to your HISAT2 index files.
HISAT2_INDEX="grch37_tran/genome_tran"

# Run HISAT2 to map fastq reads
hisat2 -q --phred64 \
       -x "${HISAT2_INDEX}" \
       -1 "${READ1}" \
       -2 "${READ2}" \
       --no-unal \
       --reorder \
       --no-mixed \
       -S "${OUTPUT_SAM}"

# Optional: Convert SAM to BAM and sort (common post-alignment steps)
# samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "aligned_reads.bam"
# samtools index "aligned_reads.bam"

View on GitHub

Extract uniquely mapped reads from SAM files

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_SAM="input.sam" # Replace with your actual input SAM file
OUTPUT_BAM="output_unique.bam" # Desired output BAM file for uniquely mapped reads

# Extract uniquely mapped reads from SAM file
# -F 4: Exclude unmapped reads
# -F 256: Exclude secondary alignments
# -F 2048: Exclude supplementary alignments
# -q 30: Filter reads with mapping quality less than 30 (a common threshold for uniquely mapped, high-quality reads)
# -b: Output in BAM format
# -S: Input is SAM format (optional, samtools can often infer)
# -o: Output file
samtools view -F 4 -F 256 -F 2048 -q 30 -bS "${INPUT_SAM}" -o "${OUTPUT_BAM}"

View on GitHub

Remap unmapped reads with hyperediting pipeline.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for STAR genome index directory (e.g., built from GRCh38)
GENOME_DIR="/path/to/STAR_genome_index/GRCh38"
# Placeholder for input unmapped reads (FASTQ format, can be gzipped)
# For paired-end reads, use --readFilesIn R1.fastq.gz R2.fastq.gz
INPUT_FASTQ="unmapped_reads.fastq.gz"
# Prefix for output files
OUTPUT_PREFIX="remapped_hyperediting"
# Number of threads to use
NUM_THREADS=8
# RAM limit for sorting BAM (e.g., 30GB), adjust based on system resources
LIMIT_BAM_SORT_RAM=30000000000

# Ensure the genome index exists. Example command to build a genome index for GRCh38:
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
#      --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \
#      --runThreadN ${NUM_THREADS}

# Run STAR alignment to remap unmapped reads.
# Parameters are adjusted to be permissive for RNA editing detection (e.g., relaxed mismatch filtering)
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --runThreadN ${NUM_THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM ${LIMIT_BAM_SORT_RAM}

View on GitHub

In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Reference Genome Preparation ---
# Download hg19 reference genome (if not already present)
# wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz

# Convert adenosines to guanosines in the reference genome
# This creates a new FASTA file with all 'A's replaced by 'G's
sed 's/A/G/g' hg19.fa > hg19_AG_modified.fa

# Create a directory for the STAR index
mkdir -p hg19_AG_STAR_index

# Build STAR index for the A->G modified genome
# Adjust --runThreadN based on available CPU cores
STAR --runMode genomeGenerate \
     --genomeDir hg19_AG_STAR_index \
     --genomeFastaFiles hg19_AG_modified.fa \
     --runThreadN 8

# --- Read Preparation and Alignment ---
# Define input FASTQ file(s)
# For single-end reads, set INPUT_READS="input_reads.fastq.gz"
# For paired-end reads:
INPUT_READS_R1="input_reads_R1.fastq.gz"
INPUT_READS_R2="input_reads_R2.fastq.gz"

# Convert adenosines to guanosines in the input reads (assuming FASTQ format)
# This awk command processes FASTQ files, replacing 'A' with 'G' only in sequence lines.
# Note: This simple substitution might affect quality scores if 'A' appears in the quality string,
# though this is rare for standard Phred scores. For more robust handling of FASTQ, consider specialized tools or custom scripts.
# For paired-end reads, apply to both R1 and R2.

# Process R1 reads
gunzip -c "${INPUT_READS_R1}" | \
awk '{
    if (NR%4 == 2) {
        gsub(/A/, "G", $0);
    }
    print $0;
}' | gzip > input_reads_R1_AG_modified.fastq.gz

# Process R2 reads
gunzip -c "${INPUT_READS_R2}" | \
awk '{
    if (NR%4 == 2) {
        gsub(/A/, "G", $0);
    }
    print $0;
}' | gzip > input_reads_R2_AG_modified.fastq.gz

# Align the A->G modified reads to the A->G modified genome using STAR
# Adjust --runThreadN based on available CPU cores
# For paired-end reads:
STAR --genomeDir hg19_AG_STAR_index \
     --readFilesIn input_reads_R1_AG_modified.fastq.gz input_reads_R2_AG_modified.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix alignment_AG_ \
     --runThreadN 8 \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting

View on GitHub

Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping

samtools (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Assume input BAM files are:
# hyperediting_pipeline_output.bam (e.g., coordinate-sorted)
# first_round_mapping_output.bam (e.g., coordinate-sorted)

# Filter uniquely mapped primary read pairs (MAPQ >= 20, properly paired, not secondary, not supplementary)
# -b: output BAM
# -h: include header
# -f 0x2: read is properly paired
# -F 0x100: exclude secondary alignment
# -F 0x800: exclude supplementary alignment
# -q 20: mapping quality >= 20 (common threshold for unique mapping)
samtools view -b -h -f 0x2 -F 0x100 -F 0x800 -q 20 hyperediting_pipeline_output.bam > hyperediting_uniquely_mapped.bam

# Filter uniquely mapped primary read pairs from first round mapping output
samtools view -b -h -f 0x2 -F 0x100 -F 0x800 -q 20 first_round_mapping_output.bam > first_round_uniquely_mapped.bam

# Combine the two sets of uniquely mapped reads into a single BAM file.
# Assuming input BAMs are coordinate-sorted, samtools merge will produce a coordinate-sorted output.
samtools merge combined_uniquely_mapped.bam hyperediting_uniquely_mapped.bam first_round_uniquely_mapped.bam

# Clean up intermediate files (optional)
# rm hyperediting_uniquely_mapped.bam first_round_uniquely_mapped.bam

View on GitHub

Identify RNA-DNA differences between mapped reads and hg19 reference genome.

REDItools (Inferred with models/gemini-2.5-flash) v2.0 GitHub

$ Bash example

# Install REDItools (if not already installed)
# conda install -c bioconda reditools

# Define input and reference files
INPUT_BAM="mapped_reads.bam" # Placeholder for mapped reads BAM file
REFERENCE_FASTA="/path/to/hg19.fa" # Placeholder for hg19 reference genome FASTA file
OUTPUT_FILE="rna_dna_differences.txt"

# Execute REDItools to identify RNA-DNA differences
# -i: Input BAM file
# -f: Reference FASTA file
# -o: Output file
# -t: Number of threads (example: 10)
# -p: Print progress
REDItools.py -i "${INPUT_BAM}" -f "${REFERENCE_FASTA}" -o "${OUTPUT_FILE}" -t 10 -p

View on GitHub

Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al.

RDD (Inferred with models/gemini-2.5-flash) v1.0

$ Bash example

# Installation (assuming Perl is already installed):
# git clone https://github.com/m-bahn/RDD.git
# cd RDD
# chmod +x RDD.pl

# Example usage of RDD.pl
# The 'reference_file' (-r) is a pre-computed file representing expected read distribution,
# often derived from a control sample or pooled samples, NOT a genome assembly.
# Replace placeholders with actual file paths and desired p-value.
# Input file (-i) is typically a BAM file of aligned reads.
perl RDD.pl \
    -i input.bam \
    -o rdd_output.txt \
    -r control_sample_rdd_reference.txt \
    -p 0.05

PMID: 21960545)

CLIPper (Inferred with models/gemini-2.5-flash) vv1.0 GitHub

$ Bash example

# Install CLIPper (if not already installed)
# conda install -c bioconda clipper

# Example execution command for CLIPper v1.0
# This command performs peak calling on a BAM file using parameters
# commonly found in the yeolab/eclip workflow for eCLIP assays.
# Replace 'input.bam', 'output.peaks.bed', and 'hg19' if different.
# The 'species' parameter can be 'hg19', 'mm10', or a path to a custom genome size file.
clipper \
    --bam_file input.bam \
    --species hg19 \
    --output_file output.peaks.bed \
    --threshold 0.01 \
    --window_size 25 \
    --step_size 1 \
    --min_peak_length 10 \
    --max_peak_length 250 \
    --pad 0 \
    --bonferroni

View on GitHub

Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al.

filter_rdd_sites.py (Inferred with models/gemini-2.5-flash) vNot explicitly versioned, part of yeolab/eclip pipeline

$ Bash example

# Install dependencies if needed (e.g., pysam)
# conda install -c bioconda pysam

# Replace input.bam with your actual input BAM file.
# Replace hg38_rdd_sites.bed with the path to your RDD sites BED file (e.g., from ENCODE or Yeo lab resources).
# Replace filtered.bam with your desired output BAM file name.

python filter_rdd_sites.py \
    --input_bam input.bam \
    --rdd_sites_bed hg38_rdd_sites.bed \
    --output_bam_name filtered.bam

PMID: 21960545 and Lee et. al.

CLIPper vNot explicitly versioned, associated with PMID 21960545 (2011 publication)

$ Bash example

# Install CLIPper (example using conda)
# conda install -c bioconda clipper

# Define input and output files (placeholders)
INPUT_BAM="sample_aligned.bam"
OUTPUT_BED="sample_peaks.bed"

# Define reference genome files (placeholders for hg38)
# Replace with actual paths to your genome size file (e.g., from UCSC goldenPath)
# Example for hg38: wget -qO- http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes > hg38.chrom.sizes
GENOME_SIZE_FILE="/path/to/hg38.chrom.sizes"

# Execute CLIPper for peak calling
# -i: Input BAM file containing aligned reads
# -o: Output BED file for identified peaks
# -s: Path to a file containing chromosome sizes (e.g., hg38.chrom.sizes)
clipper -i "${INPUT_BAM}" -o "${OUTPUT_BED}" -s "${GENOME_SIZE_FILE}"

PMID: 23598527)

clipper (Inferred with models/gemini-2.5-flash) vNot explicitly versioned GitHub

$ Bash example

# Install clipper (assuming Python environment)
# pip install git+https://github.com/yeolab/clipper.git

# Example usage of clipper for peak calling on CLIP-seq data.
# Reference datasets:
# - Genome assembly: hg38 (GRCh38) is used as a placeholder.
# - Gene annotation: GENCODE v44 (or latest) GTF for hg38 is used as a placeholder.

# Define input and output files
EXPERIMENT_BAM="experiment.bam" # Placeholder for aligned experiment BAM file
CONTROL_BAM="control.bam"       # Placeholder for aligned control BAM file
OUTPUT_PREFIX="clipper_peaks"
SPECIES="hg38"
ANNOTATION_FILE="/path/to/gencode.v44.annotation.gtf" # Placeholder for GTF annotation file

# Run clipper with example parameters
python clipper.py \
  -s "${SPECIES}" \
  -o "${OUTPUT_PREFIX}" \
  -e "${EXPERIMENT_BAM}" \
  -c "${CONTROL_BAM}" \
  -a "${ANNOTATION_FILE}" \
  -p 0.01 \
  -f 0.05

View on GitHub

Retain only highly prevalent editing sites.

bcftools (Inferred with models/gemini-2.5-flash) v1.18 GitHub

$ Bash example

# Install bcftools (if not already installed)
# conda install -c bioconda bcftools

# Retain only highly prevalent editing sites.
# This command assumes the input is a VCF file (input.vcf) containing RNA editing calls.
# It filters for sites with an Allele Frequency (AF) greater than 0.1 (10% editing frequency)
# and a minimum Depth (DP) of 20 reads. These thresholds are common for defining 'highly prevalent'
# and can be adjusted based on specific experimental design and desired stringency.
bcftools filter -i 'AF > 0.1 && DP > 20' input.vcf -o highly_prevalent_editing_sites.vcf

View on GitHub

In particular, filter for editing sites where both pooled fragile X and pooled controls have at least 10 reads each and where pooled fragile X or pooled controls have editing ratio at least 0.1

bcftools filter (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Input VCF file containing editing sites for pooled fragile X and pooled controls
INPUT_VCF="editing_sites.vcf"
OUTPUT_VCF="filtered_editing_sites.vcf"

# Filter for editing sites based on read depth and editing ratio
# Conditions:
# 1. Both 'pooled fragile X' and 'pooled controls' samples have at least 10 reads (DP >= 10)
# 2. At least one of the samples ('pooled fragile X' OR 'pooled controls') has an editing ratio (ALT_AD / DP) of at least 0.1
# Note: Replace 'FRAGILE_X_POOL' and 'CONTROL_POOL' with actual sample names from your VCF header.
# If sample names are not available, you might use sample indices (e.g., [0] and [1]) if their order is consistent.
bcftools filter \
    --samples FRAGILE_X_POOL,CONTROL_POOL \
    --include 'FORMAT/DP[FRAGILE_X_POOL] >= 10 && FORMAT/DP[CONTROL_POOL] >= 10 && ((FORMAT/AD[FRAGILE_X_POOL][1] / FORMAT/DP[FRAGILE_X_POOL]) >= 0.1 || (FORMAT/AD[CONTROL_POOL][1] / FORMAT/DP[CONTROL_POOL]) >= 0.1)' \
    --output-type v \
    --output "${OUTPUT_VCF}" \
    "${INPUT_VCF}"

View on GitHub

Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.

Ensembl v1.3.3 GitHub

$ Bash example

# Define variables
FASTQ_R1="sample_R1.fastq.gz"
FASTQ_R2="sample_R2.fastq.gz" # Optional, if paired-end
SAMPLE_NAME="sample_name"
THREADS=8

# Reference files for hg19 ENSEMBL (e.g., Ensembl release 75 for GRCh37)
# Download ENSEMBL hg19 GTF and FASTA files if not available.
# Example download commands for Ensembl release 75 (GRCh37):
# wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip Homo_sapiens.GRCh37.75.gtf.gz
# wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz

GTF_FILE="Homo_sapiens.GRCh37.75.gtf"
FASTA_FILE="Homo_sapiens.GRCh37.dna.primary_assembly.fa"
RSEM_REFERENCE_DIR="rsem_ensembl_hg19_ref"
RSEM_REFERENCE_NAME="ensembl_hg19"

# --- RSEM and STAR Installation (commented out) ---
# It is recommended to install RSEM and STAR via Conda:
# conda create -n rsem_env rsem star -y
# conda activate rsem_env

# 1. Prepare RSEM reference (if not already done)
# This step builds the STAR index and RSEM transcript models.
# It needs to be run only once per reference genome/annotation combination.
# The --star option tells RSEM to use STAR for alignment during quantification.
if [ ! -d "${RSEM_REFERENCE_DIR}" ]; then
    mkdir -p "${RSEM_REFERENCE_DIR}"
    rsem-prepare-reference \
        --gtf "${GTF_FILE}" \
        --star \
        --num-threads "${THREADS}" \
        "${FASTA_FILE}" \
        "${RSEM_REFERENCE_DIR}/${RSEM_REFERENCE_NAME}"
fi

# 2. Quantify gene expression using RSEM
# This command performs alignment (using STAR internally) and quantification.
# For paired-end reads (assuming FASTQ_R1 and FASTQ_R2 are defined):
rsem-calculate-expression \
    --star \
    --num-threads "${THREADS}" \
    --paired-end \
    "${FASTQ_R1}" \
    "${FASTQ_R2}" \
    "${RSEM_REFERENCE_DIR}/${RSEM_REFERENCE_NAME}" \
    "${SAMPLE_NAME}"

# For single-end reads, use the following command instead:
# rsem-calculate-expression \
#     --star \
#     --num-threads "${THREADS}" \
#     "${FASTQ_R1}" \
#     "${RSEM_REFERENCE_DIR}/${RSEM_REFERENCE_NAME}" \
#     "${SAMPLE_NAME}"

# The output file "${SAMPLE_NAME}.genes.results" will contain FPKM values.
# FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the standard metric for paired-end RNA-seq and is equivalent to RPKM for single-end data.
# The relevant column for gene-level quantification is 'FPKM' in the .genes.results file.

View on GitHub

Raw Source Text

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran  --no-unal --reorder --no-mixed
Extract uniquely mapped reads from SAM files
Remap unmapped reads with hyperediting pipeline. In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.
Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping
Identify RNA-DNA differences between mapped reads and hg19 reference genome.
Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al. PMID: 21960545)
Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al. PMID: 21960545 and Lee et. al. PMID: 23598527)
Retain only highly prevalent editing sites. In particular, filter for editing sites where both pooled fragile X and pooled controls have at least 10 reads each and where pooled fragile X or pooled controls have editing ratio at least 0.1
Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.
Genome_build: hg19
Supplementary_files_format_and_content: gene expression (RPKM)

← Back to Analysis