GSE176012 Processing Pipeline

RNA-Seq code_examples 12 steps

Publication

ADAR and hnRNPC deficiency synergize in activating endogenous dsRNA-induced type I IFN responses.

The Journal of experimental medicine (2021) — PMID 34297039

Dataset

RNAseq of hnRNPC and ADAR deficient THP-1

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites.

STAR v2.5.4a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index directory (replace with actual path to GrCH37 v13 index)
# The GrCH37 v13 (hg19) genome index needs to be generated prior to alignment using STAR --runMode genomeGenerate
GENOME_DIR="/path/to/GrCH37_v13_STAR_index"

# Placeholder for input FASTQ files (replace with actual paths)
# Assuming paired-end reads based on common RNA-seq practices
READS_R1="reads_R1.fastq.gz"
READS_R2="reads_R2.fastq.gz"

# Placeholder for output directory
OUTPUT_DIR="star_alignment_output"
mkdir -p "${OUTPUT_DIR}"

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_DIR}/" \
     --outFilterMultimapNmax 1000 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverReadLmax 0.1 \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outReadsUnmapped Fastx

View on GitHub

In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.

STAR (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Placeholder for STAR genome index generation (if not already done)
# STAR --runMode genomeGenerate \
#      --genomeDir GRCh38_STAR_index \
#      --genomeFastaFiles genome.fa \
#      --sjdbGTFfile annotations.gtf \
#      --runThreadN 8

# Placeholder input files
READ1="read1.fastq.gz"
READ2="read2.fastq.gz"
GENOME_DIR="GRCh38_STAR_index" # Using GRCh38 as a common latest assembly placeholder
OUTPUT_PREFIX="aligned_reads"
THREADS=8

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard

View on GitHub

Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id").

featureCounts v2.18.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Rsubread (if not already installed)
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Rsubread")'

# Download the Gencode V19 GTF annotation for GRCh37
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

# Assuming 'aligned_reads.bam' is your input BAM file(s)
# For multiple BAM files, list them space-separated after the options.

featureCounts \
  -a gencode.v19.annotation.gtf \
  -o gene_counts.txt \
  -p \
  -f exon \
  -g gene_id \
  -T 8 \
  aligned_reads.bam

We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).

samtools (Inferred with models/gemini-2.5-flash) v1.15 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Filter uniquely mapping reads with a minimum mapping quality score of 255
# Replace 'input_aligned.bam' with your aligned BAM file
# Replace 'output_unique_mq255.bam' with your desired output file name
samtools view -b -q 255 input_aligned.bam > output_unique_mq255.bam

View on GitHub

To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk âBEGINâ{OFS=â\tâ;} /appris/ {print}â).

GENCODE vV19 GitHub

$ Bash example

# Download GENCODE v19 GTF for GRCh37
# This is a placeholder; actual download command might vary.
# wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

# Define input GTF file
INPUT_GTF="gencode.v19.annotation.gtf"
OUTPUT_GTF="appris_filtered_transcripts.gtf"

# Extract appris annotated transcripts (as described)
awk 'BEGIN{OFS="\t";} /appris/ {print}' "${INPUT_GTF}" > "${OUTPUT_GTF}"

View on GitHub

Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed.

awk, bedtools vN/A (Inferred with models/gemini-2.5-flash), 2.30.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Input file: appris annotated transcripts (e.g., in GTF/GFF format)
INPUT_TRANSCRIPTS="appris_transcripts.gtf"
OUTPUT_EXONS_BED="merged_exons.bed"

# Extract exons from appris annotated transcripts, convert to BED, sort, and merge overlapping exons.
# awk (Inferred with models/gemini-2.5-flash)
# bedtools (Inferred with models/gemini-2.5-flash)
# conda install -c bioconda bedtools

awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}' "${INPUT_TRANSCRIPTS}" | \
bedtools sort -i - | \
bedtools merge -i - > "${OUTPUT_EXONS_BED}"

View on GitHub

Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.

bedtools v2.x GitHub

$ Bash example

# Define placeholder filenames
ORIGINAL_GTF="original.gtf" # Placeholder: e.g., gencode.v44.annotation.gtf
APPRIS_TRANSCRIPTS_GTF="appris_annotated_transcripts.gtf" # Placeholder: This file is assumed to contain appris-annotated transcripts, possibly a filtered GTF.

# --- Step 1: Extract unique gene IDs from appris-annotated transcripts ---
# This awk command extracts gene_id from GTF lines where feature is 'transcript'
# and then sorts and gets unique IDs.
# Example GTF attribute: gene_id "ENSG00000000003.15";
awk '$3 == "transcript" {
    match($0, /gene_id "([^"]+)"/);
    if (RSTART) {
        gene_id = substr($0, RSTART + 9, RLENGTH - 10);
        print gene_id;
    }
}' "${APPRIS_TRANSCRIPTS_GTF}" | sort -u > unique_appris_genes.txt

# --- Step 2: Extract annotations for these unique genes from the original GTF ---
# This filters the original GTF to only include features related to the identified unique genes.
# Note: For very large GTF files, `awk` might be more performant than `grep -Ff`.
grep -Ff unique_appris_genes.txt "${ORIGINAL_GTF}" > filtered_genes.gtf

# --- Step 3: Convert filtered GTF to BED format for genes and merged exons ---

# Convert gene features from filtered_genes.gtf to BED format.
# Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand
# Note: GTF coordinates are 1-based, BED coordinates are 0-based start.
awk '$3 == "gene" {
    match($0, /gene_id "([^"]+)"/);
    if (RSTART) {
        gene_id = substr($0, RSTART + 9, RLENGTH - 10);
        print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7;
    }
}' filtered_genes.gtf > genes.bed

# Convert exon features from filtered_genes.gtf to BED format, then merge overlapping exons.
# The description specifies "merged appris-annotated exons".
# Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand
awk '$3 == "exon" {
    match($0, /gene_id "([^"]+)"/);
    if (RSTART) {
        gene_id = substr($0, RSTART + 9, RLENGTH - 10);
        print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7;
    }
}' filtered_genes.gtf | sort -k1,1 -k2,2n | bedtools merge -s -i - > merged_appris_exons.bed

# --- Step 4: Subtract merged appris-annotated exons from genes to yield introns ---
# bedtools installation:
# conda install -c bioconda bedtools

# Ensure input BED files are sorted for optimal performance with bedtools subtract.
# sort -k1,1 -k2,2n genes.bed -o genes.bed
# sort -k1,1 -k2,2n merged_appris_exons.bed -o merged_appris_exons.bed

bedtools subtract -a genes.bed -b merged_appris_exons.bed > introns.bed

View on GitHub

Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.

featureCounts vnot specified

$ Bash example

# Install subread (which includes the featureCounts executable)
# conda install -c bioconda subread

# The description mentions converting bed files to SAF format.
# SAF format: GeneID\tChr\tStart\tEnd\tStrand
# Assuming 'intron_editing_regions.bed' contains the defined features (e.g., 4 columns: chrom, start, end, name)
# and assuming non-strand-specific features ('.') if strand information is not available in the BED file.
# If your BED file has strand information in the 6th column, adjust the awk command accordingly.
# awk 'BEGIN{OFS="\t"} {print $4, $1, $2+1, $3, "."}' intron_editing_regions.bed > intron_editing_regions.saf

# Define input BAM files (replace with actual file paths)
# For example, if you have multiple BAM files:
# BAM_FILES="sample1.bam sample2.bam sample3.bam"
BAM_FILES="input_aligned_reads.bam"

# Define the SAF annotation file containing intron- and editing-cluster-rich-region features
SAF_ANNOTATION_FILE="intron_editing_regions.saf"

# Define the output file for counts
OUTPUT_COUNTS_FILE="feature_counts.txt"

# Run featureCounts
# -a: Specify the annotation file
# -F SAF: Specify that the annotation file is in SAF format
# -o: Specify the output file for the read counts
# The description states "not using the meta-feature flag". When using SAF format,
# featureCounts directly counts reads to the features defined in the SAF file,
# which inherently means no meta-feature aggregation is performed by the tool itself.
featureCounts -a "${SAF_ANNOTATION_FILE}" \
              -F SAF \
              -o "${OUTPUT_COUNTS_FILE}" \
              ${BAM_FILES}

Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor).

SAILOR vNot specified GitHub

$ Bash example

# Clone the SAILOR repository if not already done
# git clone https://github.com/YeoLab/sailor.git
# cd sailor

# # Install dependencies (example using conda)
# conda create -n sailor_env python=3.7 -y
# conda activate sailor_env
# pip install -r requirements.txt

# # IMPORTANT: Review and modify sailor_config.py according to your specific needs and environment.
# # For example, you might need to adjust paths to external tools or specific parameters.
# # cp sailor_config.py.example sailor_config.py # If an example exists, otherwise modify the default.

# Define input and output paths
INPUT_BAM="/path/to/your/input.bam" # Replace with your aligned BAM file
OUTPUT_DIR="/path/to/your/output_sailor" # Replace with your desired output directory
GENOME_FASTA="/path/to/references/GRCh38/GRCh38.primary_assembly.genome.fa" # Placeholder for human GRCh38 genome FASTA
ANNOTATION_GTF="/path/to/references/GRCh38/gencode.v38.annotation.gtf" # Placeholder for Gencode v38 annotation GTF
REPEAT_MASKER_BED="/path/to/references/GRCh38/repeatmasker.bed" # Placeholder for RepeatMasker BED file
SAMPLE_NAME="sample1" # Replace with your actual sample name

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Execute SAILOR to discover editing sites
python sailor.py \
    -i "${INPUT_BAM}" \
    -o "${OUTPUT_DIR}" \
    -g "${GENOME_FASTA}" \
    -a "${ANNOTATION_GTF}" \
    -r "${REPEAT_MASKER_BED}" \
    -s "${SAMPLE_NAME}"

View on GitHub

Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites.

SAILOR vNot explicitly versioned, based on 2021 publication

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Define variables
INPUT_BAM="input.sorted.rmdup.readfiltered.bam"
EDITING_SITES_BED="editing_sites.bed" # Placeholder for the bed annotation of primary editing sites
REFERENCE_FASTA="hg38.fa" # Placeholder for human reference genome (e.g., GRCh38), inferred from -f flag
OUTPUT_BCF="output.raw.bcf"

# Command to generate pileup information with specified flags using samtools mpileup
# -B: Disable BAQ (Base Alignment Quality) recalculation
# -d 10000000: Maximum depth (set to a very high value to avoid truncation)
# -t DPR,INFO/DPR,DP4: Output format tags for depth (DPR) and allele counts (DP4)
# -u: Output uncompressed BCF
# -f <ref.fa>: Reference FASTA file (inferred from the -uf flag in the description)
# -l <bed>: Only include sites specified in the BED file
samtools mpileup -B -d 10000000 -t DPR,INFO/DPR,DP4 -u -f "${REFERENCE_FASTA}" -l "${EDITING_SITES_BED}" "${INPUT_BAM}" > "${OUTPUT_BCF}"

# Note: The SAILOR pipeline typically further processes this BCF output
# using bcftools call and custom scripts (e.g., awk) to extract the specific
# "Total or G counts" for A-to-I editing sites. The flags provided in the
# description are primarily for the samtools mpileup step, which generates
# the necessary data for subsequent extraction.

INFO/DPR counts were used to determine nucleotide counts.

bedtools (Inferred with models/gemini-2.5-flash) v2.31.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Example: Determine nucleotide counts for genomic regions against a reference genome.
# The description "INFO/DPR counts were used to determine nucleotide counts" is ambiguous.
# If "INFO/DPR counts" refers to specific regions or a filtered set of regions, 
# 'regions.bed' would be the result of processing those counts.
# 'regions.bed' could represent peaks, binding sites, or other genomic features (e.g., from an eCLIP or ChIP-seq peak calling step).
# 'genome.fa' is the reference genome FASTA file (e.g., hg38.fa for human, mm10.fa for mouse).
# This command calculates the A, C, G, T, N counts, GC content, and length for each region.
bedtools nuc -fi genome.fa -bed regions.bed -fo nucleotide_counts.txt

View on GitHub

T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.

bcftools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install bcftools (if not already installed)
# conda install -c bioconda bcftools

# Placeholder for reference genome
# Replace with actual path to reference genome FASTA (e.g., hg38.fa)
REFERENCE_GENOME="path/to/reference.fasta"

# Placeholder for input BAM file
# Replace with actual path to input BAM file (e.g., aligned_reads.bam)
INPUT_BAM="path/to/input.bam"

# Placeholder for output VCF file
# This VCF will contain initial variant calls with allele depths.
# The specific logic described in the step ("T and C counts for sense or A and G counts for antisense reads,
# respectively, were considered sequencing errors and added to reference counts")
# is a custom post-processing step that would typically involve parsing this VCF
# (or the raw mpileup output) and modifying allele counts based on strand information.
# bcftools itself does not have a direct command for this specific count manipulation.
OUTPUT_VCF="initial_variants.vcf"

# Generate variant calls with allele depths from the input BAM file.
# -f: Reference genome FASTA file.
# -m: Call multiallelic sites.
# -v: Output variant sites only (skip reference sites).
# -o: Output VCF file.
bcftools mpileup -f "${REFERENCE_GENOME}" "${INPUT_BAM}" | bcftools call -mv -o "${OUTPUT_VCF}"

# Note: The described logic of identifying specific sequencing errors (T/C on sense, A/G on antisense)
# and adding their counts to the reference allele count is a custom post-processing step.
# This would typically be implemented by a custom script (e.g., in Python using pysam or a VCF parser)
# that reads the VCF (or the raw mpileup output), extracts strand-specific base counts (if available
# or derived from the BAM), applies the described rule, and then modifies the AD (Allelic Depth) field
# or filters variants accordingly.

View on GitHub

Tools Used

STAR SAILOR

Raw Source Text

Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites. In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.
Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id"). We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).
To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk âBEGINâ{OFS=â\tâ;} /appris/ {print}â). Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed. Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.
Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.
Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor). Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites. INFO/DPR counts were used to determine nucleotide counts. T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.
Genome_build: GrCH37 v13
Supplementary_files_format_and_content: Raw gene counts, comma separated values
Supplementary_files_format_and_content: Raw intron counts, comma separated values
Supplementary_files_format_and_content: Raw counts of regions rich in RNA editing clusters (ECR), comma separated values
Supplementary_files_format_and_content: Raw G counts at unfiltered, putative editing sites, comma separated values
Supplementary_files_format_and_content: Raw total (G+H) counts at unfiltered, putative editing sites, comma separated values

← Back to Analysis