GSE176012 Processing Pipeline
Publication
ADAR and hnRNPC deficiency synergize in activating endogenous dsRNA-induced type I IFN responses.The Journal of experimental medicine (2021) — PMID 34297039
Processing Steps
Generate Jupyter Notebook-
1
Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Placeholder for STAR genome index directory (replace with actual path to GrCH37 v13 index) # The GrCH37 v13 (hg19) genome index needs to be generated prior to alignment using STAR --runMode genomeGenerate GENOME_DIR="/path/to/GrCH37_v13_STAR_index" # Placeholder for input FASTQ files (replace with actual paths) # Assuming paired-end reads based on common RNA-seq practices READS_R1="reads_R1.fastq.gz" READS_R2="reads_R2.fastq.gz" # Placeholder for output directory OUTPUT_DIR="star_alignment_output" mkdir -p "${OUTPUT_DIR}" STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_DIR}/" \ --outFilterMultimapNmax 1000 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverReadLmax 0.1 \ --twopassMode Basic \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outReadsUnmapped Fastx -
2
In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Placeholder for STAR genome index generation (if not already done) # STAR --runMode genomeGenerate \ # --genomeDir GRCh38_STAR_index \ # --genomeFastaFiles genome.fa \ # --sjdbGTFfile annotations.gtf \ # --runThreadN 8 # Placeholder input files READ1="read1.fastq.gz" READ2="read2.fastq.gz" GENOME_DIR="GRCh38_STAR_index" # Using GRCh38 as a common latest assembly placeholder OUTPUT_PREFIX="aligned_reads" THREADS=8 STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard -
3
Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id").
featureCounts v2.18.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Rsubread (if not already installed) # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Rsubread")' # Download the Gencode V19 GTF annotation for GRCh37 # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz # Assuming 'aligned_reads.bam' is your input BAM file(s) # For multiple BAM files, list them space-separated after the options. featureCounts \ -a gencode.v19.annotation.gtf \ -o gene_counts.txt \ -p \ -f exon \ -g gene_id \ -T 8 \ aligned_reads.bam -
4
We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools # Filter uniquely mapping reads with a minimum mapping quality score of 255 # Replace 'input_aligned.bam' with your aligned BAM file # Replace 'output_unique_mq255.bam' with your desired output file name samtools view -b -q 255 input_aligned.bam > output_unique_mq255.bam
-
5
To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk âBEGINâ{OFS=â\tâ;} /appris/ {print}â).
$ Bash example
# Download GENCODE v19 GTF for GRCh37 # This is a placeholder; actual download command might vary. # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz # Define input GTF file INPUT_GTF="gencode.v19.annotation.gtf" OUTPUT_GTF="appris_filtered_transcripts.gtf" # Extract appris annotated transcripts (as described) awk 'BEGIN{OFS="\t";} /appris/ {print}' "${INPUT_GTF}" > "${OUTPUT_GTF}" -
6
Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed.
awk, bedtools vN/A (Inferred with models/gemini-2.5-flash), 2.30.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Input file: appris annotated transcripts (e.g., in GTF/GFF format) INPUT_TRANSCRIPTS="appris_transcripts.gtf" OUTPUT_EXONS_BED="merged_exons.bed" # Extract exons from appris annotated transcripts, convert to BED, sort, and merge overlapping exons. # awk (Inferred with models/gemini-2.5-flash) # bedtools (Inferred with models/gemini-2.5-flash) # conda install -c bioconda bedtools awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}' "${INPUT_TRANSCRIPTS}" | \ bedtools sort -i - | \ bedtools merge -i - > "${OUTPUT_EXONS_BED}" -
7
Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.
$ Bash example
# Define placeholder filenames ORIGINAL_GTF="original.gtf" # Placeholder: e.g., gencode.v44.annotation.gtf APPRIS_TRANSCRIPTS_GTF="appris_annotated_transcripts.gtf" # Placeholder: This file is assumed to contain appris-annotated transcripts, possibly a filtered GTF. # --- Step 1: Extract unique gene IDs from appris-annotated transcripts --- # This awk command extracts gene_id from GTF lines where feature is 'transcript' # and then sorts and gets unique IDs. # Example GTF attribute: gene_id "ENSG00000000003.15"; awk '$3 == "transcript" { match($0, /gene_id "([^"]+)"/); if (RSTART) { gene_id = substr($0, RSTART + 9, RLENGTH - 10); print gene_id; } }' "${APPRIS_TRANSCRIPTS_GTF}" | sort -u > unique_appris_genes.txt # --- Step 2: Extract annotations for these unique genes from the original GTF --- # This filters the original GTF to only include features related to the identified unique genes. # Note: For very large GTF files, `awk` might be more performant than `grep -Ff`. grep -Ff unique_appris_genes.txt "${ORIGINAL_GTF}" > filtered_genes.gtf # --- Step 3: Convert filtered GTF to BED format for genes and merged exons --- # Convert gene features from filtered_genes.gtf to BED format. # Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand # Note: GTF coordinates are 1-based, BED coordinates are 0-based start. awk '$3 == "gene" { match($0, /gene_id "([^"]+)"/); if (RSTART) { gene_id = substr($0, RSTART + 9, RLENGTH - 10); print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7; } }' filtered_genes.gtf > genes.bed # Convert exon features from filtered_genes.gtf to BED format, then merge overlapping exons. # The description specifies "merged appris-annotated exons". # Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand awk '$3 == "exon" { match($0, /gene_id "([^"]+)"/); if (RSTART) { gene_id = substr($0, RSTART + 9, RLENGTH - 10); print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7; } }' filtered_genes.gtf | sort -k1,1 -k2,2n | bedtools merge -s -i - > merged_appris_exons.bed # --- Step 4: Subtract merged appris-annotated exons from genes to yield introns --- # bedtools installation: # conda install -c bioconda bedtools # Ensure input BED files are sorted for optimal performance with bedtools subtract. # sort -k1,1 -k2,2n genes.bed -o genes.bed # sort -k1,1 -k2,2n merged_appris_exons.bed -o merged_appris_exons.bed bedtools subtract -a genes.bed -b merged_appris_exons.bed > introns.bed -
8
Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.
featureCounts vnot specified$ Bash example
# Install subread (which includes the featureCounts executable) # conda install -c bioconda subread # The description mentions converting bed files to SAF format. # SAF format: GeneID\tChr\tStart\tEnd\tStrand # Assuming 'intron_editing_regions.bed' contains the defined features (e.g., 4 columns: chrom, start, end, name) # and assuming non-strand-specific features ('.') if strand information is not available in the BED file. # If your BED file has strand information in the 6th column, adjust the awk command accordingly. # awk 'BEGIN{OFS="\t"} {print $4, $1, $2+1, $3, "."}' intron_editing_regions.bed > intron_editing_regions.saf # Define input BAM files (replace with actual file paths) # For example, if you have multiple BAM files: # BAM_FILES="sample1.bam sample2.bam sample3.bam" BAM_FILES="input_aligned_reads.bam" # Define the SAF annotation file containing intron- and editing-cluster-rich-region features SAF_ANNOTATION_FILE="intron_editing_regions.saf" # Define the output file for counts OUTPUT_COUNTS_FILE="feature_counts.txt" # Run featureCounts # -a: Specify the annotation file # -F SAF: Specify that the annotation file is in SAF format # -o: Specify the output file for the read counts # The description states "not using the meta-feature flag". When using SAF format, # featureCounts directly counts reads to the features defined in the SAF file, # which inherently means no meta-feature aggregation is performed by the tool itself. featureCounts -a "${SAF_ANNOTATION_FILE}" \ -F SAF \ -o "${OUTPUT_COUNTS_FILE}" \ ${BAM_FILES} -
9
Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor).
$ Bash example
# Clone the SAILOR repository if not already done # git clone https://github.com/YeoLab/sailor.git # cd sailor # # Install dependencies (example using conda) # conda create -n sailor_env python=3.7 -y # conda activate sailor_env # pip install -r requirements.txt # # IMPORTANT: Review and modify sailor_config.py according to your specific needs and environment. # # For example, you might need to adjust paths to external tools or specific parameters. # # cp sailor_config.py.example sailor_config.py # If an example exists, otherwise modify the default. # Define input and output paths INPUT_BAM="/path/to/your/input.bam" # Replace with your aligned BAM file OUTPUT_DIR="/path/to/your/output_sailor" # Replace with your desired output directory GENOME_FASTA="/path/to/references/GRCh38/GRCh38.primary_assembly.genome.fa" # Placeholder for human GRCh38 genome FASTA ANNOTATION_GTF="/path/to/references/GRCh38/gencode.v38.annotation.gtf" # Placeholder for Gencode v38 annotation GTF REPEAT_MASKER_BED="/path/to/references/GRCh38/repeatmasker.bed" # Placeholder for RepeatMasker BED file SAMPLE_NAME="sample1" # Replace with your actual sample name # Create output directory mkdir -p "${OUTPUT_DIR}" # Execute SAILOR to discover editing sites python sailor.py \ -i "${INPUT_BAM}" \ -o "${OUTPUT_DIR}" \ -g "${GENOME_FASTA}" \ -a "${ANNOTATION_GTF}" \ -r "${REPEAT_MASKER_BED}" \ -s "${SAMPLE_NAME}" -
10
Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites.
SAILOR vNot explicitly versioned, based on 2021 publication$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Define variables INPUT_BAM="input.sorted.rmdup.readfiltered.bam" EDITING_SITES_BED="editing_sites.bed" # Placeholder for the bed annotation of primary editing sites REFERENCE_FASTA="hg38.fa" # Placeholder for human reference genome (e.g., GRCh38), inferred from -f flag OUTPUT_BCF="output.raw.bcf" # Command to generate pileup information with specified flags using samtools mpileup # -B: Disable BAQ (Base Alignment Quality) recalculation # -d 10000000: Maximum depth (set to a very high value to avoid truncation) # -t DPR,INFO/DPR,DP4: Output format tags for depth (DPR) and allele counts (DP4) # -u: Output uncompressed BCF # -f <ref.fa>: Reference FASTA file (inferred from the -uf flag in the description) # -l <bed>: Only include sites specified in the BED file samtools mpileup -B -d 10000000 -t DPR,INFO/DPR,DP4 -u -f "${REFERENCE_FASTA}" -l "${EDITING_SITES_BED}" "${INPUT_BAM}" > "${OUTPUT_BCF}" # Note: The SAILOR pipeline typically further processes this BCF output # using bcftools call and custom scripts (e.g., awk) to extract the specific # "Total or G counts" for A-to-I editing sites. The flags provided in the # description are primarily for the samtools mpileup step, which generates # the necessary data for subsequent extraction. -
11
INFO/DPR counts were used to determine nucleotide counts.
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Example: Determine nucleotide counts for genomic regions against a reference genome. # The description "INFO/DPR counts were used to determine nucleotide counts" is ambiguous. # If "INFO/DPR counts" refers to specific regions or a filtered set of regions, # 'regions.bed' would be the result of processing those counts. # 'regions.bed' could represent peaks, binding sites, or other genomic features (e.g., from an eCLIP or ChIP-seq peak calling step). # 'genome.fa' is the reference genome FASTA file (e.g., hg38.fa for human, mm10.fa for mouse). # This command calculates the A, C, G, T, N counts, GC content, and length for each region. bedtools nuc -fi genome.fa -bed regions.bed -fo nucleotide_counts.txt
-
12
T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.
$ Bash example
# Install bcftools (if not already installed) # conda install -c bioconda bcftools # Placeholder for reference genome # Replace with actual path to reference genome FASTA (e.g., hg38.fa) REFERENCE_GENOME="path/to/reference.fasta" # Placeholder for input BAM file # Replace with actual path to input BAM file (e.g., aligned_reads.bam) INPUT_BAM="path/to/input.bam" # Placeholder for output VCF file # This VCF will contain initial variant calls with allele depths. # The specific logic described in the step ("T and C counts for sense or A and G counts for antisense reads, # respectively, were considered sequencing errors and added to reference counts") # is a custom post-processing step that would typically involve parsing this VCF # (or the raw mpileup output) and modifying allele counts based on strand information. # bcftools itself does not have a direct command for this specific count manipulation. OUTPUT_VCF="initial_variants.vcf" # Generate variant calls with allele depths from the input BAM file. # -f: Reference genome FASTA file. # -m: Call multiallelic sites. # -v: Output variant sites only (skip reference sites). # -o: Output VCF file. bcftools mpileup -f "${REFERENCE_GENOME}" "${INPUT_BAM}" | bcftools call -mv -o "${OUTPUT_VCF}" # Note: The described logic of identifying specific sequencing errors (T/C on sense, A/G on antisense) # and adding their counts to the reference allele count is a custom post-processing step. # This would typically be implemented by a custom script (e.g., in Python using pysam or a VCF parser) # that reads the VCF (or the raw mpileup output), extracts strand-specific base counts (if available # or derived from the BAM), applies the described rule, and then modifies the AD (Allelic Depth) field # or filters variants accordingly.
Raw Source Text
Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites. In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.
Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id"). We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).
To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk âBEGINâ{OFS=â\tâ;} /appris/ {print}â). Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed. Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.
Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.
Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor). Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites. INFO/DPR counts were used to determine nucleotide counts. T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.
Genome_build: GrCH37 v13
Supplementary_files_format_and_content: Raw gene counts, comma separated values
Supplementary_files_format_and_content: Raw intron counts, comma separated values
Supplementary_files_format_and_content: Raw counts of regions rich in RNA editing clusters (ECR), comma separated values
Supplementary_files_format_and_content: Raw G counts at unfiltered, putative editing sites, comma separated values
Supplementary_files_format_and_content: Raw total (G+H) counts at unfiltered, putative editing sites, comma separated values