GSE107896 Processing Pipeline

GSE code_examples 14 steps

Publication

Widespread RNA editing dysregulation in brains from autistic individuals.

Nature neuroscience (2019) — PMID 30559470

Dataset

Widespread RNA editing dysregulation in Autism Spectrum Disorders

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran --no-unal --reorder --no-mixed

HISAT2 v2.1.0 (Inferred from yeolab/eclip workflow) GitHub

$ Bash example

# Install HISAT2 and Samtools (if not already installed)
# conda install -c bioconda hisat2 samtools

# Define variables
READS_R1="input_R1.fastq.gz" # Placeholder for forward reads
READS_R2="input_R2.fastq.gz" # Placeholder for reverse reads
INDEX_PREFIX="grch37_tran/genome_tran" # Path to HISAT2 index for GRCh37 genome and transcriptome
OUTPUT_BAM="aligned_reads.bam"
THREADS=8 # Number of threads to use for HISAT2

# Map fastq reads with HISAT2, convert to BAM, and sort
hisat2 \
    -q \
    --phred64 \
    -x "${INDEX_PREFIX}" \
    --no-unal \
    --reorder \
    --no-mixed \
    -1 "${READS_R1}" \
    -2 "${READS_R2}" \
    -p "${THREADS}" \
    | samtools view -bS - \
    | samtools sort -o "${OUTPUT_BAM}" -

# Note: The index 'grch37_tran/genome_tran' refers to a HISAT2 index built from the GRCh37 (hg19) human reference genome and its corresponding transcriptome.
# This index needs to be pre-built using `hisat2-build`.

View on GitHub

Extract uniquely mapped reads from SAM files

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Extract uniquely mapped reads from a SAM file.
# -F 2308: Exclude unmapped (4), secondary (256), and supplementary (2048) alignments.
# -q 30: Filter reads with a mapping quality score less than 30 (common threshold for uniquely mapped reads).
# -bS: Output in BAM format, input is SAM format.
# Replace 'input.sam' with your actual input SAM file and 'output_unique.bam' with your desired output file name.
samtools view -F 2308 -q 30 -bS input.sam > output_unique.bam

View on GitHub

Remap unmapped reads with hyperediting pipeline.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star
# Install samtools (if not already installed)
# conda install -c bioconda samtools

# --- Placeholder for reference genome and STAR index ---
# Replace with actual paths to your reference genome and STAR index.
# For example, using human genome hg38 as the latest assembly placeholder.
# The STAR index should be built for the chosen reference genome.
GENOME_DIR="/path/to/STAR_index_hg38"

# --- Input files ---
# Assuming 'original_alignment.bam' is the BAM file from a previous alignment
# where some reads were unmapped (flag 4).
INPUT_BAM="original_alignment.bam"

# --- Output files ---
UNMAPPED_BAM="unmapped_reads.bam"
UNMAPPED_R1_FASTQ="unmapped_R1.fastq"
UNMAPPED_R2_FASTQ="unmapped_R2.fastq"
REMAP_OUTPUT_PREFIX="remapped_hyperediting_"

# 1. Extract unmapped reads from the original alignment BAM
# Reads with flag 4 are unmapped. '-b' for BAM output.
samtools view -b -f 4 "${INPUT_BAM}" > "${UNMAPPED_BAM}"

# 2. Convert unmapped BAM to FASTQ format
# This assumes paired-end reads. Adjust for single-end if necessary.
# -F 0x900 filters out secondary and supplementary alignments.
# -s /dev/null discards singleton reads.
samtools fastq -F 0x900 -1 "${UNMAPPED_R1_FASTQ}" -2 "${UNMAPPED_R2_FASTQ}" -s /dev/null "${UNMAPPED_BAM}"

# 3. Remap unmapped reads using STAR
# Parameters are adjusted to be more permissive for hyperedited reads,
# which can contain a high number of mismatches (A-to-I editing appears as A-to-G).
# --outFilterMismatchNmax 20: Allows up to 20 mismatches per read. Default is 10.
#                             This is increased to capture hyperedited reads.
# --outFilterMultimapNmax 20: Allows reads to map to up to 20 loci. Default is 10.
#                             Can be increased for reads that might map ambiguously due to editing.
# --outSAMattributes All: Includes all possible SAM attributes for downstream analysis.
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${UNMAPPED_R1_FASTQ}" "${UNMAPPED_R2_FASTQ}" \
  --runThreadN 8 \
  --outFileNamePrefix "${REMAP_OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMismatchNmax 20 \
  --outFilterMultimapNmax 20 \
  --outSAMattributes All

# Clean up intermediate files (optional)
# rm "${UNMAPPED_BAM}" "${UNMAPPED_R1_FASTQ}" "${UNMAPPED_R2_FASTQ}"

View on GitHub

In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# --- Installation (commented out) ---
# conda install -c bioconda star
# conda install -c bioconda samtools

# --- Reference Genome Preparation ---
# Download hg19 reference genome (if not already available)
# mkdir -p reference/hg19
# wget -O reference/hg19/hg19.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip -f reference/hg19/hg19.fa.gz

# Create a modified hg19 genome (A->G substitution)
# This uses 'awk' to replace all 'A' with 'G' in the FASTA sequence lines, preserving header lines.
# Ensure the original hg19.fa is unzipped in 'reference/hg19/' before running.
GENOME_DIR="reference/hg19_AG_modified"
mkdir -p "${GENOME_DIR}"
awk '/^>/ {print; next} {gsub(/A/,"G"); print}' reference/hg19/hg19.fa > "${GENOME_DIR}/hg19_AG.fa"

# --- STAR Index Generation for Modified Genome ---
# Define STAR genome directory
STAR_GENOME_DIR="${GENOME_DIR}/star_index"
mkdir -p "${STAR_GENOME_DIR}"

# Generate STAR genome index. Adjust --runThreadN as needed.
STAR --runMode genomeGenerate \
     --genomeDir "${STAR_GENOME_DIR}" \
     --genomeFastaFiles "${GENOME_DIR}/hg19_AG.fa" \
     --runThreadN 8 # Use 8 threads for index generation

# --- Read Pre-processing (A->G substitution) ---
# Input FASTQ file (placeholder). For paired-end reads, repeat for R2.
INPUT_FASTQ="reads.fastq.gz"
OUTPUT_FASTQ_AG="reads_AG.fastq.gz"

# Convert A to G in FASTQ reads. Handles gzipped input/output.
zcat "${INPUT_FASTQ}" | \
awk '{
    if (NR % 4 == 2) { # Sequence line
        gsub(/A/, "G", $0);
    }
    print;
}' | gzip > "${OUTPUT_FASTQ_AG}"

# --- Alignment using STAR ---
OUTPUT_BAM="aligned_reads_AG.bam"

# Align modified reads to the modified genome using STAR.
# Adjust --runThreadN and --limitBAMsortRAM based on available resources.
STAR --genomeDir "${STAR_GENOME_DIR}" \
     --readFilesIn "${OUTPUT_FASTQ_AG}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "star_output_AG_" \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --runThreadN 8 \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # 30GB RAM for sorting

# Rename the output BAM to the desired name
mv star_output_AG_Aligned.sortedByCoord.out.bam "${OUTPUT_BAM}"

# Index the BAM file for downstream analysis
samtools index "${OUTPUT_BAM}"

View on GitHub

Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume input BAM files are sorted by coordinate
# hyperediting_unique.bam: Uniquely mapped reads from hyperediting pipeline
# first_round_unique.bam: Uniquely mapped reads from first round of read mapping
# combined_unique.bam: Output combined BAM file

samtools merge combined_unique.bam hyperediting_unique.bam first_round_unique.bam

View on GitHub

Identify RNA-DNA differences between mapped reads and hg19 reference genome.

REDItools (Inferred with models/gemini-2.5-flash) v2.0 GitHub

$ Bash example

# Install REDItools (if not already installed)
# conda install -c bioconda reditools

# --- Reference Data Setup ---
# Download hg19 reference genome (if not already available)
# Example using UCSC: 
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz
# samtools faidx hg19.fa
export REF_GENOME="/path/to/hg19.fa"

# Download dbSNP VCF for hg19 (e.g., from NCBI or GATK resource bundle)
# This is crucial for filtering out germline variants from true RNA edits.
# Example (dbSNP build 138 for hg19 from GATK resource bundle):
# wget https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz
# Or a more specific hg19 dbSNP VCF if available.
export DBSNP_VCF="/path/to/dbSNP_hg19.vcf.gz"

# --- Input and Output Files ---
export MAPPED_READS_BAM="input_mapped_reads.bam"
export OUTPUT_FILE="rna_dna_differences.txt"

# --- Run REDItools to identify RNA-DNA differences ---
# Parameters:
# -i: Input BAM/SAM file
# -f: Reference FASTA file
# -o: Output file
# -g: Known SNPs VCF file (highly recommended to filter germline variants)
# -t: Number of threads (e.g., 8)
# -m: Minimum coverage (e.g., 5 reads)
# -q: Minimum base quality (e.g., 25)
# -s: Strand specific (0: no, 1: yes, 2: auto). Assuming strand-specific RNA-seq (1).
REDItoolDnaRna.py \
  -i "${MAPPED_READS_BAM}" \
  -f "${REF_GENOME}" \
  -o "${OUTPUT_FILE}" \
  -g "${DBSNP_VCF}" \
  -t 8 \
  -m 5 \
  -q 25 \
  -s 1

View on GitHub

Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al.

RDD_analyzer (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Installation instructions (example, actual method may vary)
# RDD_analyzer is often a Python script. You might need to clone a repository or download the script directly.
# Example: git clone https://github.com/some_user/RDD_analyzer.git
# Example: cd RDD_analyzer
# Ensure Python environment is set up and dependencies like pysam are installed.
# Example: conda create -n rdd_env python=3.8 pysam
# Example: conda activate rdd_env

# Define input BAM files (replace with actual paths to your treatment and control samples)
INPUT_BAM="path/to/your/treatment_sample.bam"
CONTROL_BAM="path/to/your/control_sample.bam" # Or a background/reference distribution BAM

# Define reference genome FASTA (e.g., GRCh38/hg38)
# Replace with the actual path to your reference genome FASTA file
REFERENCE_FASTA="path/to/GRCh38.primary_assembly.genome.fa"

# Define output file for the RDD analysis results
OUTPUT_FILE="rdd_loglikelihood_results.txt"

# Execute the RDD_analyzer script
# The exact script name and parameters might vary depending on the specific implementation
# of RDD_analyzer you are using. This command assumes a common interface for
# comparing two BAM files (input vs. control) against a reference genome.
python RDD_analyzer.py \
    -i "${INPUT_BAM}" \
    -c "${CONTROL_BAM}" \
    -r "${REFERENCE_FASTA}" \
    -o "${OUTPUT_FILE}"

PMID: 21960545)

CLIPper (Inferred with models/gemini-2.5-flash) v1.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Clone the CLIPper repository
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Example usage of CLIPper for peak calling
# Replace 'input.bam' with your aligned reads BAM file.
# Replace 'hg38' with the appropriate genome assembly (e.g., mm10, hg19).
# Adjust p-value (-p) and cluster cutoff (-c) as needed.
python clipper.py \
  -s hg38 \
  -o ./clipper_output \
  -p 0.01 \
  -c 10 \
  --upstream_extension 20 \
  --downstream_extension 20 \
  input.bam

View on GitHub

Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al.

filter_rdd.py (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The filter_rdd.py script is part of the yeolab/eclip workflow.
# Clone the repository if you don't have it:
# git clone https://github.com/yeolab/eclip.git

# Assuming 'input_peaks.bed' is the result of a peak calling step
# and 'output_filtered_peaks.bed' will contain the filtered peaks.
# Adjust paths as necessary if the 'eclip' repository is not in the current directory.
python eclip/scripts/filter_rdd.py input_peaks.bed output_filtered_peaks.bed

PMID: 21960545 and Lee et. al.

CLIPper v1.0 GitHub

$ Bash example

# --- CLIPper Installation (Python 2.7 environment recommended) ---
# It is recommended to use a dedicated Python 2.7 environment (e.g., with conda)
# conda create -n clipper_env python=2.7
# conda activate clipper_env
# pip install numpy scipy

# Clone the CLIPper repository if not already done
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# # Ensure CLIPper.py is executable or in your PATH
# # chmod +x CLIPper.py
# # export PATH=$(pwd):$PATH
# -----------------------------------------------------------------

# --- Reference Data Setup ---
# Replace with actual paths to your genome FASTA and chromosome sizes file
# Example for human genome build hg38
GENOME_FASTA="/path/to/your/genome/hg38.fa"
CHROM_SIZES="/path/to/your/genome/hg38.chrom.sizes"

# If chrom.sizes is not available, it can be generated from the FASTA file:
# samtools faidx "$GENOME_FASTA"
# cut -f1,2 "$GENOME_FASTA".fai > "$CHROM_SIZES"
# -----------------------------------------------------------------

# --- Define Input and Output ---
# Replace with your actual input BAM file (e.g., from STAR/HISAT2 alignment)
INPUT_BAM="aligned_reads.bam"
# Prefix for output files (e.g., peak BED, peak sequences)
OUTPUT_PREFIX="clipper_peaks"
# Directory to store CLIPper output
OUTPUT_DIR="clipper_output"

mkdir -p "$OUTPUT_DIR"
# -----------------------------------------------------------------

# --- CLIPper Execution ---
# Run CLIPper for peak calling.
# This command uses default parameters (e.g., p-value threshold of 0.05).
# Adjust parameters like -p (p-value), -c (control BAM), -u (unique reads only) as needed.
# Ensure CLIPper.py is in your current directory or in your system's PATH.
python CLIPper.py \
    -b "$INPUT_BAM" \
    -s "$CHROM_SIZES" \
    -o "$OUTPUT_DIR/$OUTPUT_PREFIX" \
    -f "$GENOME_FASTA" # Optional: provide FASTA to extract peak sequences
# -----------------------------------------------------------------

View on GitHub

PMID: 23598527)

CLIPper v1.0.0 GitHub

$ Bash example

# Install CLIPper (example using pip, or manual clone and setup)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install # Or just run clipper.py directly

# Example usage of CLIPper for eCLIP peak calling
# Replace 'ip_unique.bam', 'ip_all.bam', 'control_unique.bam' with actual input BAM files.
# 'ip_unique.bam' typically contains uniquely mapped reads for the immunoprecipitated sample.
# 'ip_all.bam' can be the same as 'ip_unique.bam' or include multi-mapped reads if desired.
# 'control_unique.bam' contains uniquely mapped reads for the control sample (e.g., input, IgG).
# The species 'hg38' is used as a placeholder for the human genome, which CLIPper uses to retrieve chromosome sizes and gene annotations from UCSC.
# Output directory 'clipper_output' will be created.
# Thresholds for peak calling (e.g., -t 10, -p 0.01, -f 0.05) are example values and may need adjustment.

python clipper.py \
  -s hg38 \
  -u ip_unique.bam \
  -a ip_unique.bam \
  -c control_unique.bam \
  -o clipper_output \
  -t 10 \
  -p 0.01 \
  -f 0.05

View on GitHub

Retain only highly prevalent editing sites.

Custom Python script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script filters a tab-separated file of editing sites based on a prevalence score.
# It assumes the input file has a header and a specific column containing the numeric prevalence score.

# Define input and output file paths
INPUT_FILE="editing_sites.tsv" # Replace with your actual input file path
OUTPUT_FILE="highly_prevalent_editing_sites.filtered.tsv" # Replace with your desired output file path

# Parameters for filtering
# PREVALENCE_THRESHOLD: Minimum prevalence score to retain a site (e.g., 0.8 for 80%)
# PREVALENCE_COLUMN_INDEX: 1-based index of the column containing the prevalence score
PREVALENCE_THRESHOLD=0.8
PREVALENCE_COLUMN_INDEX=5 # Example: 5th column (adjust based on your file format)

# Python script to filter based on prevalence
# This script reads a tab-separated file, skips the header, and filters rows
# where the value in the specified column is >= PREVALENCE_THRESHOLD.
# It assumes the prevalence column contains numeric values.
python -c "
import sys
import os

input_file = sys.argv[1]
output_file = sys.argv[2]
prevalence_threshold = float(sys.argv[3])
prevalence_col_idx = int(sys.argv[4])

if not os.path.exists(input_file):
    sys.stderr.write(f'Error: Input file not found: {input_file}\n')
    sys.exit(1)

with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
    header = infile.readline()
    outfile.write(header) # Write header to output

    for line in infile:
        parts = line.strip().split('\t')
        if len(parts) < prevalence_col_idx:
            sys.stderr.write(f'Warning: Skipping line due to insufficient columns for index {prevalence_col_idx}: {line.strip()}\n')
            continue
        try:
            # Adjust for 0-indexing in Python (user provides 1-based index)
            prevalence_score = float(parts[prevalence_col_idx - 1])
            if prevalence_score >= prevalence_threshold:
                outfile.write(line)
        except ValueError:
            sys.stderr.write(f'Warning: Skipping line due to non-numeric prevalence score in column {prevalence_col_idx}: {line.strip()}\n')
        except IndexError:
            sys.stderr.write(f'Warning: Skipping line due to unexpected index error for column {prevalence_col_idx}: {line.strip()}\n')

sys.stderr.write(f'Filtering complete. Highly prevalent editing sites saved to {output_file}\n')
" "${INPUT_FILE}" "${OUTPUT_FILE}" "${PREVALENCE_THRESHOLD}" "${PREVALENCE_COLUMN_INDEX}"

In particular, filter for editing sites that for at least 5 people have at least 5 total reads of which 2 reads are edited

Custom script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script filters a table of RNA editing sites based on specified criteria.
# It assumes an input file (e.g., TSV) where each row represents an editing site,
# and there are columns for each sample's total reads and edited reads at that site.
#
# Example conceptual Python script (filter_editing_sites.py) to implement this logic:
# import pandas as pd
# import argparse
#
# def filter_sites(input_file, output_file, min_people, min_total_reads, min_edited_reads):
#     df = pd.read_csv(input_file, sep='\t')
#
#     # Identify sample-specific read columns (e.g., 'sample1_total_reads', 'sample1_edited_reads')
#     # This part needs to be adapted based on the actual column naming convention in the input file.
#     # For example, if columns are 'sample_A_total', 'sample_A_edited', 'sample_B_total', 'sample_B_edited'
#     sample_total_read_cols = [col for col in df.columns if '_total_reads' in col]
#     sample_names = [col.replace('_total_reads', '') for col in sample_total_read_cols]
#
#     filtered_rows = []
#     for index, row in df.iterrows():
#         people_meeting_criteria = 0
#         for sample in sample_names:
#             total_reads = row.get(f'{sample}_total_reads', 0) # Default to 0 if column missing
#             edited_reads = row.get(f'{sample}_edited_reads', 0) # Default to 0 if column missing
#
#             if total_reads >= min_total_reads and edited_reads >= min_edited_reads:
#                 people_meeting_criteria += 1
#
#         if people_meeting_criteria >= min_people:
#             filtered_rows.append(row)
#
#     pd.DataFrame(filtered_rows).to_csv(output_file, sep='\t', index=False)
#
# if __name__ == "__main__":
#     parser = argparse.ArgumentParser(description="Filter RNA editing sites based on read depth and sample count.")
#     parser.add_argument("--input", required=True, help="Path to the input TSV file of editing sites.")
#     parser.add_argument("--output", required=True, help="Path to the output filtered TSV file.")
#     parser.add_argument("--min_people", type=int, default=5, help="Minimum number of people meeting read criteria.")
#     parser.add_argument("--min_total_reads", type=int, default=5, help="Minimum total reads per person at an editing site.")
#     parser.add_argument("--min_edited_reads", type=int, default=2, help="Minimum edited reads per person at an editing site.")
#     args = parser.parse_args()
#
#     filter_sites(args.input, args.output, args.min_people, args.min_total_reads, args.min_edited_reads)
#
# # To install pandas if not already installed:
# # conda install pandas
# # or
# # pip install pandas

# Execute the filtering script (assuming 'filter_editing_sites.py' exists and is executable)
python filter_editing_sites.py \
    --input all_editing_sites.tsv \
    --output filtered_editing_sites.tsv \
    --min_people 5 \
    --min_total_reads 5 \
    --min_edited_reads 2

Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.

Ensembl vSubread 2.0.3 GitHub

$ Bash example

# This script quantifies gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.

# --- Configuration Variables ---
# Placeholder for your input BAM file. Replace with the actual path to your aligned reads.
INPUT_BAM="sample.bam"

# Output file names
ENSEMBL_GTF="Homo_sapiens.GRCh37.75.gtf"
ENSEMBL_GTF_GZ="${ENSEMBL_GTF}.gz"
ENSEMBL_GTF_URL="ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/${ENSEMBL_GTF_GZ}"
GENE_LENGTHS_FILE="gene_lengths.tsv"
OUTPUT_COUNTS="gene_counts.tsv"
RPKM_OUTPUT="gene_expression_rpkm.tsv"

# --- Step 1: Download Ensembl hg19 GTF (GRCh37 release 75) ---
if [ ! -f "${ENSEMBL_GTF}" ]; then
    echo "Downloading ${ENSEMBL_GTF_GZ}..."
    wget -q ${ENSEMBL_GTF_URL}
    gunzip -f ${ENSEMBL_GTF_GZ}
    echo "Download and decompression complete."
else
    echo "${ENSEMBL_GTF} already exists. Skipping download."
fi

# --- Step 2: Calculate gene lengths (union of exon lengths) from GTF ---
# This Python script parses the GTF, merges overlapping exons per gene, and sums their lengths.
# This provides a more accurate 'exon union' length for RPKM calculation.

if [ ! -f "${GENE_LENGTHS_FILE}" ]; then
    echo "Calculating gene lengths from ${ENSEMBL_GTF}..."
    cat << 'EOF' > calculate_gene_lengths.py
import sys
from collections import defaultdict

def calculate_gene_lengths(gtf_file, output_file):
    gene_exons = defaultdict(list)
    print(f"Parsing GTF: {gtf_file}", file=sys.stderr)

    with open(gtf_file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue
            feature_type = parts[2]
            if feature_type == 'exon':
                chrom = parts[0]
                start = int(parts[3])
                end = int(parts[4])
                strand = parts[6]
                attributes = parts[8]

                gene_id = None
                for attr in attributes.split(';'):
                    attr = attr.strip()
                    if attr.startswith('gene_id "'):
                        gene_id = attr.split('"')[1]
                        break
                
                if gene_id:
                    gene_exons[gene_id].append((chrom, start, end, strand))

    print("Calculating union exon lengths...", file=sys.stderr)
    gene_lengths = {}
    for gene_id, exons in gene_exons.items():
        # Group exons by chromosome and strand for merging
        chrom_strand_exons = defaultdict(list)
        for chrom, start, end, strand in exons:
            chrom_strand_exons[(chrom, strand)].append((start, end))
        
        total_length = 0
        for (chrom, strand), exon_coords in chrom_strand_exons.items():
            # Sort exons by start coordinate
            exon_coords.sort()
            
            merged_intervals = []
            if exon_coords:
                current_start, current_end = exon_coords[0]
                for i in range(1, len(exon_coords)): # Iterate from the second exon
                    next_start, next_end = exon_coords[i]
                    if next_start <= current_end: # Overlap or contiguous
                        current_end = max(current_end, next_end)
                    else:
                        merged_intervals.append((current_start, current_end))
                        current_start, current_end = next_start, next_end
                merged_intervals.append((current_start, current_end)) # Add the last merged interval
            
            for start, end in merged_intervals:
                total_length += (end - start + 1) # +1 for 1-based coordinates length

        gene_lengths[gene_id] = total_length

    print(f"Writing gene lengths to {output_file}", file=sys.stderr)
    with open(output_file, 'w') as out_f:
        out_f.write("gene_id\tlength\n")
        for gene_id, length in gene_lengths.items():
            out_f.write(f"{gene_id}\t{length}\n")

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: python calculate_gene_lengths.py <gtf_file> <output_file>", file=sys.stderr)
        sys.exit(1)
    calculate_gene_lengths(sys.argv[1], sys.argv[2])
EOF

    python calculate_gene_lengths.py "${ENSEMBL_GTF}" "${GENE_LENGTHS_FILE}"
    echo "Gene length calculation complete."
else
    echo "${GENE_LENGTHS_FILE} already exists. Skipping gene length calculation."
fi

# --- Step 3: Quantify raw gene counts using featureCounts ---
# # Install Subread (featureCounts) if not available
# # conda install -c bioconda subread

echo "Running featureCounts on ${INPUT_BAM}..."
featureCounts -p \
    -t exon \
    -g gene_id \
    -a "${ENSEMBL_GTF}" \
    -o "${OUTPUT_COUNTS}" \
    "${INPUT_BAM}"
echo "featureCounts complete. Raw counts saved to ${OUTPUT_COUNTS}"

# --- Step 4: Calculate RPKM values ---
# Get total mapped reads from samtools flagstat for the input BAM file
# # Install samtools if not available
# # conda install -c bioconda samtools
TOTAL_MAPPED_READS=$(samtools flagstat "${INPUT_BAM}" | grep "mapped (" | awk '{print $1}')

if [ -z "${TOTAL_MAPPED_READS}" ] || [ "${TOTAL_MAPPED_READS}" -eq 0 ]; then
    echo "Error: Could not determine total mapped reads from ${INPUT_BAM} using samtools flagstat or it is zero."
    echo "Please ensure samtools is installed and ${INPUT_BAM} is a valid BAM file."
    exit 1
fi
echo "Total mapped reads for RPKM calculation: ${TOTAL_MAPPED_READS}"


echo "Calculating RPKM values..."
awk -v total_reads="${TOTAL_MAPPED_READS}" \
    'BEGIN {
        FS="\t"; OFS="\t";
        print "gene_id", "raw_counts", "gene_length_bp", "RPKM";
    }
    FNR==NR { # Process gene_lengths.tsv first
        if (FNR==1) { next } # Skip header of gene_lengths.tsv
        gene_lengths[$1] = $2;
        next;
    }
    FNR==1 { next } # Skip header of featureCounts output
    { # Process featureCounts output
        gene_id = $1;
        raw_counts = $7; # Assuming 7th column is counts for the first sample
        
        if (gene_id in gene_lengths && gene_lengths[gene_id] > 0) {
            gene_len_bp = gene_lengths[gene_id];
            # RPKM = (raw_counts * 10^9) / (gene_len_bp * total_reads)
            rpkm = (raw_counts * 1000000000) / (gene_len_bp * total_reads);
            print gene_id, raw_counts, gene_len_bp, rpkm;
        } else {
            print gene_id, raw_counts, "0", "0" > "/dev/stderr"; # Output genes with no length to stderr
        }
    }' "${GENE_LENGTHS_FILE}" <(tail -n +2 "${OUTPUT_COUNTS}") > "${RPKM_OUTPUT}"

echo "RPKM quantification complete. Results saved to ${RPKM_OUTPUT}"

View on GitHub

Raw Source Text

Map fastq reads with HISAT2 hisat2 -q --phred64 -x grch37_tran/genome_tran  --no-unal --reorder --no-mixed
Extract uniquely mapped reads from SAM files
Remap unmapped reads with hyperediting pipeline. In brief, convert all adenosines to guanosines and align to modified hg19 genome where adenosines are also substituted by guanosines.
Obtain uniquely mapped reads pairs from hyperediting pipeline and combine with uniquely mapped reads from first round of read mapping
Identify RNA-DNA differences between mapped reads and hg19 reference genome.
Apply log-likelihood test to determine whether RDD is likely resulted from sequencing error (Bahn et. al. PMID: 21960545)
Apply posterior filters to remove RDD sites that are likely caused by technical artifats (Bahn et. al. PMID: 21960545 and Lee et. al. PMID: 23598527)
Retain only highly prevalent editing sites. In particular, filter for editing sites that for at least 5 people have at least 5 total reads of which 2 reads are edited
Also quantify gene expression as RPKM values over the exon union of ENSEMBL transcripts in hg19.
Genome_build: hg19
Supplementary_files_format_and_content: gene expression (RPKM)

← Back to Analysis