GSE176012 Processing Pipeline

RNA-Seq code_examples 12 steps

Publication

ADAR and hnRNPC deficiency synergize in activating endogenous dsRNA-induced type I IFN responses.

The Journal of experimental medicine (2021) — PMID 34297039

Dataset

GSE176012

RNAseq of hnRNPC and ADAR deficient THP-1

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index directory (replace with actual path to GrCH37 v13 index)
    # The GrCH37 v13 (hg19) genome index needs to be generated prior to alignment using STAR --runMode genomeGenerate
    GENOME_DIR="/path/to/GrCH37_v13_STAR_index"
    
    # Placeholder for input FASTQ files (replace with actual paths)
    # Assuming paired-end reads based on common RNA-seq practices
    READS_R1="reads_R1.fastq.gz"
    READS_R2="reads_R2.fastq.gz"
    
    # Placeholder for output directory
    OUTPUT_DIR="star_alignment_output"
    mkdir -p "${OUTPUT_DIR}"
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_R1}" "${READS_R2}" \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_DIR}/" \
         --outFilterMultimapNmax 1000 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverReadLmax 0.1 \
         --twopassMode Basic \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outReadsUnmapped Fastx
  2. 2

    In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.

    STAR (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index generation (if not already done)
    # STAR --runMode genomeGenerate \
    #      --genomeDir GRCh38_STAR_index \
    #      --genomeFastaFiles genome.fa \
    #      --sjdbGTFfile annotations.gtf \
    #      --runThreadN 8
    
    # Placeholder input files
    READ1="read1.fastq.gz"
    READ2="read2.fastq.gz"
    GENOME_DIR="GRCh38_STAR_index" # Using GRCh38 as a common latest assembly placeholder
    OUTPUT_PREFIX="aligned_reads"
    THREADS=8
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard
  3. 3

    Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id").

    featureCounts v2.18.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Rsubread (if not already installed)
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("Rsubread")'
    
    # Download the Gencode V19 GTF annotation for GRCh37
    # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    
    # Assuming 'aligned_reads.bam' is your input BAM file(s)
    # For multiple BAM files, list them space-separated after the options.
    
    featureCounts \
      -a gencode.v19.annotation.gtf \
      -o gene_counts.txt \
      -p \
      -f exon \
      -g gene_id \
      -T 8 \
      aligned_reads.bam
  4. 4

    We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).

    samtools (Inferred with models/gemini-2.5-flash) v1.15 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Filter uniquely mapping reads with a minimum mapping quality score of 255
    # Replace 'input_aligned.bam' with your aligned BAM file
    # Replace 'output_unique_mq255.bam' with your desired output file name
    samtools view -b -q 255 input_aligned.bam > output_unique_mq255.bam
  5. 5

    To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk ’BEGIN’{OFS=”\t”;} /appris/ {print}’).

    GENCODE vV19 GitHub
    $ Bash example
    # Download GENCODE v19 GTF for GRCh37
    # This is a placeholder; actual download command might vary.
    # wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip gencode.v19.annotation.gtf.gz
    
    # Define input GTF file
    INPUT_GTF="gencode.v19.annotation.gtf"
    OUTPUT_GTF="appris_filtered_transcripts.gtf"
    
    # Extract appris annotated transcripts (as described)
    awk 'BEGIN{OFS="\t";} /appris/ {print}' "${INPUT_GTF}" > "${OUTPUT_GTF}"
  6. 6

    Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed.

    awk, bedtools vN/A (Inferred with models/gemini-2.5-flash), 2.30.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Input file: appris annotated transcripts (e.g., in GTF/GFF format)
    INPUT_TRANSCRIPTS="appris_transcripts.gtf"
    OUTPUT_EXONS_BED="merged_exons.bed"
    
    # Extract exons from appris annotated transcripts, convert to BED, sort, and merge overlapping exons.
    # awk (Inferred with models/gemini-2.5-flash)
    # bedtools (Inferred with models/gemini-2.5-flash)
    # conda install -c bioconda bedtools
    
    awk 'BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}' "${INPUT_TRANSCRIPTS}" | \
    bedtools sort -i - | \
    bedtools merge -i - > "${OUTPUT_EXONS_BED}"
  7. 7

    Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.

    bedtools v2.x GitHub
    $ Bash example
    # Define placeholder filenames
    ORIGINAL_GTF="original.gtf" # Placeholder: e.g., gencode.v44.annotation.gtf
    APPRIS_TRANSCRIPTS_GTF="appris_annotated_transcripts.gtf" # Placeholder: This file is assumed to contain appris-annotated transcripts, possibly a filtered GTF.
    
    # --- Step 1: Extract unique gene IDs from appris-annotated transcripts ---
    # This awk command extracts gene_id from GTF lines where feature is 'transcript'
    # and then sorts and gets unique IDs.
    # Example GTF attribute: gene_id "ENSG00000000003.15";
    awk '$3 == "transcript" {
        match($0, /gene_id "([^"]+)"/);
        if (RSTART) {
            gene_id = substr($0, RSTART + 9, RLENGTH - 10);
            print gene_id;
        }
    }' "${APPRIS_TRANSCRIPTS_GTF}" | sort -u > unique_appris_genes.txt
    
    # --- Step 2: Extract annotations for these unique genes from the original GTF ---
    # This filters the original GTF to only include features related to the identified unique genes.
    # Note: For very large GTF files, `awk` might be more performant than `grep -Ff`.
    grep -Ff unique_appris_genes.txt "${ORIGINAL_GTF}" > filtered_genes.gtf
    
    # --- Step 3: Convert filtered GTF to BED format for genes and merged exons ---
    
    # Convert gene features from filtered_genes.gtf to BED format.
    # Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand
    # Note: GTF coordinates are 1-based, BED coordinates are 0-based start.
    awk '$3 == "gene" {
        match($0, /gene_id "([^"]+)"/);
        if (RSTART) {
            gene_id = substr($0, RSTART + 9, RLENGTH - 10);
            print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7;
        }
    }' filtered_genes.gtf > genes.bed
    
    # Convert exon features from filtered_genes.gtf to BED format, then merge overlapping exons.
    # The description specifies "merged appris-annotated exons".
    # Columns: chrom, start (0-based), end (1-based), name (gene_id), score (.), strand
    awk '$3 == "exon" {
        match($0, /gene_id "([^"]+)"/);
        if (RSTART) {
            gene_id = substr($0, RSTART + 9, RLENGTH - 10);
            print $1"\t"($4-1)"\t"$5"\t"gene_id"\t.\t"$7;
        }
    }' filtered_genes.gtf | sort -k1,1 -k2,2n | bedtools merge -s -i - > merged_appris_exons.bed
    
    # --- Step 4: Subtract merged appris-annotated exons from genes to yield introns ---
    # bedtools installation:
    # conda install -c bioconda bedtools
    
    # Ensure input BED files are sorted for optimal performance with bedtools subtract.
    # sort -k1,1 -k2,2n genes.bed -o genes.bed
    # sort -k1,1 -k2,2n merged_appris_exons.bed -o merged_appris_exons.bed
    
    bedtools subtract -a genes.bed -b merged_appris_exons.bed > introns.bed
  8. 8

    Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.

    featureCounts vnot specified
    $ Bash example
    # Install subread (which includes the featureCounts executable)
    # conda install -c bioconda subread
    
    # The description mentions converting bed files to SAF format.
    # SAF format: GeneID\tChr\tStart\tEnd\tStrand
    # Assuming 'intron_editing_regions.bed' contains the defined features (e.g., 4 columns: chrom, start, end, name)
    # and assuming non-strand-specific features ('.') if strand information is not available in the BED file.
    # If your BED file has strand information in the 6th column, adjust the awk command accordingly.
    # awk 'BEGIN{OFS="\t"} {print $4, $1, $2+1, $3, "."}' intron_editing_regions.bed > intron_editing_regions.saf
    
    # Define input BAM files (replace with actual file paths)
    # For example, if you have multiple BAM files:
    # BAM_FILES="sample1.bam sample2.bam sample3.bam"
    BAM_FILES="input_aligned_reads.bam"
    
    # Define the SAF annotation file containing intron- and editing-cluster-rich-region features
    SAF_ANNOTATION_FILE="intron_editing_regions.saf"
    
    # Define the output file for counts
    OUTPUT_COUNTS_FILE="feature_counts.txt"
    
    # Run featureCounts
    # -a: Specify the annotation file
    # -F SAF: Specify that the annotation file is in SAF format
    # -o: Specify the output file for the read counts
    # The description states "not using the meta-feature flag". When using SAF format,
    # featureCounts directly counts reads to the features defined in the SAF file,
    # which inherently means no meta-feature aggregation is performed by the tool itself.
    featureCounts -a "${SAF_ANNOTATION_FILE}" \
                  -F SAF \
                  -o "${OUTPUT_COUNTS_FILE}" \
                  ${BAM_FILES}
  9. 9

    Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor).

    SAILOR vNot specified GitHub
    $ Bash example
    # Clone the SAILOR repository if not already done
    # git clone https://github.com/YeoLab/sailor.git
    # cd sailor
    
    # # Install dependencies (example using conda)
    # conda create -n sailor_env python=3.7 -y
    # conda activate sailor_env
    # pip install -r requirements.txt
    
    # # IMPORTANT: Review and modify sailor_config.py according to your specific needs and environment.
    # # For example, you might need to adjust paths to external tools or specific parameters.
    # # cp sailor_config.py.example sailor_config.py # If an example exists, otherwise modify the default.
    
    # Define input and output paths
    INPUT_BAM="/path/to/your/input.bam" # Replace with your aligned BAM file
    OUTPUT_DIR="/path/to/your/output_sailor" # Replace with your desired output directory
    GENOME_FASTA="/path/to/references/GRCh38/GRCh38.primary_assembly.genome.fa" # Placeholder for human GRCh38 genome FASTA
    ANNOTATION_GTF="/path/to/references/GRCh38/gencode.v38.annotation.gtf" # Placeholder for Gencode v38 annotation GTF
    REPEAT_MASKER_BED="/path/to/references/GRCh38/repeatmasker.bed" # Placeholder for RepeatMasker BED file
    SAMPLE_NAME="sample1" # Replace with your actual sample name
    
    # Create output directory
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute SAILOR to discover editing sites
    python sailor.py \
        -i "${INPUT_BAM}" \
        -o "${OUTPUT_DIR}" \
        -g "${GENOME_FASTA}" \
        -a "${ANNOTATION_GTF}" \
        -r "${REPEAT_MASKER_BED}" \
        -s "${SAMPLE_NAME}"
  10. 10

    Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites.

    SAILOR vNot explicitly versioned, based on 2021 publication
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Define variables
    INPUT_BAM="input.sorted.rmdup.readfiltered.bam"
    EDITING_SITES_BED="editing_sites.bed" # Placeholder for the bed annotation of primary editing sites
    REFERENCE_FASTA="hg38.fa" # Placeholder for human reference genome (e.g., GRCh38), inferred from -f flag
    OUTPUT_BCF="output.raw.bcf"
    
    # Command to generate pileup information with specified flags using samtools mpileup
    # -B: Disable BAQ (Base Alignment Quality) recalculation
    # -d 10000000: Maximum depth (set to a very high value to avoid truncation)
    # -t DPR,INFO/DPR,DP4: Output format tags for depth (DPR) and allele counts (DP4)
    # -u: Output uncompressed BCF
    # -f <ref.fa>: Reference FASTA file (inferred from the -uf flag in the description)
    # -l <bed>: Only include sites specified in the BED file
    samtools mpileup -B -d 10000000 -t DPR,INFO/DPR,DP4 -u -f "${REFERENCE_FASTA}" -l "${EDITING_SITES_BED}" "${INPUT_BAM}" > "${OUTPUT_BCF}"
    
    # Note: The SAILOR pipeline typically further processes this BCF output
    # using bcftools call and custom scripts (e.g., awk) to extract the specific
    # "Total or G counts" for A-to-I editing sites. The flags provided in the
    # description are primarily for the samtools mpileup step, which generates
    # the necessary data for subsequent extraction.
  11. 11

    INFO/DPR counts were used to determine nucleotide counts.

    bedtools (Inferred with models/gemini-2.5-flash) v2.31.0 GitHub
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Example: Determine nucleotide counts for genomic regions against a reference genome.
    # The description "INFO/DPR counts were used to determine nucleotide counts" is ambiguous.
    # If "INFO/DPR counts" refers to specific regions or a filtered set of regions, 
    # 'regions.bed' would be the result of processing those counts.
    # 'regions.bed' could represent peaks, binding sites, or other genomic features (e.g., from an eCLIP or ChIP-seq peak calling step).
    # 'genome.fa' is the reference genome FASTA file (e.g., hg38.fa for human, mm10.fa for mouse).
    # This command calculates the A, C, G, T, N counts, GC content, and length for each region.
    bedtools nuc -fi genome.fa -bed regions.bed -fo nucleotide_counts.txt
  12. 12

    T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.

    bcftools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install bcftools (if not already installed)
    # conda install -c bioconda bcftools
    
    # Placeholder for reference genome
    # Replace with actual path to reference genome FASTA (e.g., hg38.fa)
    REFERENCE_GENOME="path/to/reference.fasta"
    
    # Placeholder for input BAM file
    # Replace with actual path to input BAM file (e.g., aligned_reads.bam)
    INPUT_BAM="path/to/input.bam"
    
    # Placeholder for output VCF file
    # This VCF will contain initial variant calls with allele depths.
    # The specific logic described in the step ("T and C counts for sense or A and G counts for antisense reads,
    # respectively, were considered sequencing errors and added to reference counts")
    # is a custom post-processing step that would typically involve parsing this VCF
    # (or the raw mpileup output) and modifying allele counts based on strand information.
    # bcftools itself does not have a direct command for this specific count manipulation.
    OUTPUT_VCF="initial_variants.vcf"
    
    # Generate variant calls with allele depths from the input BAM file.
    # -f: Reference genome FASTA file.
    # -m: Call multiallelic sites.
    # -v: Output variant sites only (skip reference sites).
    # -o: Output VCF file.
    bcftools mpileup -f "${REFERENCE_GENOME}" "${INPUT_BAM}" | bcftools call -mv -o "${OUTPUT_VCF}"
    
    # Note: The described logic of identifying specific sequencing errors (T/C on sense, A/G on antisense)
    # and adding their counts to the reference allele count is a custom post-processing step.
    # This would typically be implemented by a custom script (e.g., in Python using pysam or a VCF parser)
    # that reads the VCF (or the raw mpileup output), extracts strand-specific base counts (if available
    # or derived from the BAM), applies the described rule, and then modifies the AD (Allelic Depth) field
    # or filters variants accordingly.

Tools Used

Raw Source Text
Reads were aligned to GrCH37 v13 using STAR (V2.5.4a) with the basic two-pass mode allowing for a maximum number of multiple alignments of 1000 and 999 mismatches per read (--outFilterMultimapNmax 1000 --outFilterMismatchNmax 999), however 10% mismatches per read pair using --outFilterMismatchNoverReadLmax 0.1 to increase the detection of A-to-I editing sites. In addition, we allowed introns to be 20-1000000 nt and a 1000000 nt distance between read pair mates.
Aligned reads were summarized to gene level using the R Rsubread package featureCounts function in paired end mode (isPairedEnd=TRUE) based on the GrCH37 gencode V19 GTF annotation, using the default exons as features for counting and genes as meta-features (GTF.featureType="exon" and GTF.attrType="gene_id"). We only allowed uniquely mapping reads by setting the read-quality filter to 255 (minMQS=255).
To quantify introns, we first extracted intron ranges from appris annotation based on https://davetang.org/muse/2013/01/18/defining-genomic-regions/ from the GrCH37 gencode V19 GTF file using awk to filter for appris annotated transcripts (awk ’BEGIN’{OFS=”\t”;} /appris/ {print}’). Exons were extracted from the appris annotated transcripts into bed format using awk ('BEGIN{OFS="\t";} $3=="exon" {print $1,$4-1,$5,$10,$3}'), sorted using bedtools2 sortBed and overlapping exons merged with bedtools2 mergeBed. Next, unique genes of appris-annotated transcripts were extracted using awk as described above, their annotations extracted from the original GTF file, sorted and the merged appris-annotated exons subtracted from genes using bedtools2 subtractBed to yield introns delineated from appris annotated transcripts, saved in bed format for later quantification.
Reads mapping to intron- and editing-cluster-rich-region- features were counted using the R Rsubread package featureCounts function after converting bed files to SAF format, as described for gene quantification, not using the meta-feature flag.
Editing sites were discovered using the SAILOR pipeline (https://github.com/YeoLab/sailor). Total or G counts (indicating adenosine-to-inosine editing) were extracted from *.sorted.rmdup.readfiltered.bam intermediate files generated by the SAILOR pipeline using flags -B -d10000000 -t DPR,INFO/DPR,DP4 -uf, using the -l flag with a bed annotation of the primary selection of editing sites. INFO/DPR counts were used to determine nucleotide counts. T and C counts for sense or A and G counts for antisense reads, respectively, were considered sequencing errors and added to reference counts.
Genome_build: GrCH37 v13
Supplementary_files_format_and_content: Raw gene counts, comma separated values
Supplementary_files_format_and_content: Raw intron counts, comma separated values
Supplementary_files_format_and_content: Raw counts of regions rich in RNA editing clusters (ECR), comma separated values
Supplementary_files_format_and_content: Raw G counts at unfiltered, putative editing sites, comma separated values
Supplementary_files_format_and_content: Raw total (G+H) counts at unfiltered, putative editing sites, comma separated values
← Back to Analysis