GSE74583 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

PPAR-δ is repressed in Huntington's disease, is required for normal neuronal function and can be targeted therapeutically.

Nature medicine (2016) — PMID 26642438

Dataset

GSE74583

Next Generation Sequencing Investigation of altered transcripts in presence of dominant-negative transcription factor

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Illumina Casava1.7 software used for basecalling.

Illumina Casava v1.7

$ Bash example

# Illumina Casava 1.7 is a proprietary software suite primarily used for basecalling and demultiplexing on Illumina sequencing platforms.
# It is typically integrated into the instrument's control software or run on a dedicated Illumina analysis server.
# There is no publicly available command-line installation or execution for Casava 1.7 in the same way as open-source tools.
# The process converts raw intensity data (Bcl files) into FASTQ files.
# Example conceptual representation (not an actual executable command):
# casava_1.7_basecaller --input_bcl_directory /path/to/run/Data/Intensities/BaseCalls --output_fastq_directory /path/to/output/fastq

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata

Bowtie v0.12.2 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=0.12.2

# Assuming 'reads.fastq' is the input file after trimming and masking.
# Replace /path/to/bowtie_indexes/mm8 with the actual path to your Bowtie index for mm8.
# Replace reads.fastq with your actual input reads file.
# Replace output.sam with your desired output SAM file name.

bowtie -q -p 4 -e 100 -y -a -m 10 --best --strata /path/to/bowtie_indexes/mm8 reads.fastq > output.sam

View on GitHub

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

RPKM Calculator (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script calculates Reads Per Kilobase of exon per Megabase of library size (RPKM)
# based on the protocol described in Chepelev et al., Nucleic Acids Research, 2009.
#
# RPKM Formula: (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb)
#
# Input files required:
# 1. Counts file: Tab-separated file with two columns: Feature_ID, Read_Count
#    (e.g., output from featureCounts, HTSeq-count, or similar tools)
# 2. Lengths file: Tab-separated file with two columns: Feature_ID, Length_in_Bases
#    (e.g., derived from a genome annotation file like GTF/GFF for a specific assembly, e.g., GRCh38/hg38)
# 3. Total Mapped Reads file: A plain text file containing a single integer: Total_Number_of_Mapped_Reads
#    (e.g., extracted from samtools flagstat output)
#
# Output file: Tab-separated file with Feature_ID, Read_Count, Length_Bases, RPKM

# --- Placeholder for input files (replace with your actual data) ---
# For demonstration purposes, creating dummy input files:
echo -e "gene1\t1000" > counts.txt
echo -e "gene2\t2000" >> counts.txt
echo -e "gene3\t500" >> counts.txt

echo -e "gene1\t1500" > gene_lengths.txt # Length in bases
echo -e "gene2\t2500" >> gene_lengths.txt
echo -e "gene3\t800" >> gene_lengths.txt
echo -e "gene4\t1000" >> gene_lengths.txt # A gene not in counts

echo "10000000" > total_mapped_reads.txt # 10 million total mapped reads

# --- Python script for RPKM calculation ---
# (Python is typically pre-installed or easily installed via package managers)
# # conda install python # Uncomment to install Python if not available

cat << 'EOF' > rpkm_calculator.py
import sys

def calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file):
    """
    Calculates RPKM values based on read counts, feature lengths, and total mapped reads.
    Formula: RPKM = (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb)
    """
    
    # Read total mapped reads
    try:
        with open(total_mapped_reads_file, 'r') as f:
            total_mapped_reads = int(f.readline().strip())
    except FileNotFoundError:
        sys.stderr.write(f"Error: Total mapped reads file '{total_mapped_reads_file}' not found.\n")
        sys.exit(1)
    except ValueError:
        sys.stderr.write(f"Error: Invalid content in total mapped reads file '{total_mapped_reads_file}'. Expected an integer.\n")
        sys.exit(1)
    
    if total_mapped_reads == 0:
        sys.stderr.write("Error: Total mapped reads is zero. Cannot calculate RPKM.\n")
        sys.exit(1)

    # Read feature lengths
    feature_lengths = {}
    try:
        with open(lengths_file, 'r') as f:
            for line in f:
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    feature_id, length_bases_str = parts[0], parts[1]
                    try:
                        feature_lengths[feature_id] = int(length_bases_str)
                    except ValueError:
                        sys.stderr.write(f"Warning: Skipping malformed length (not an integer) in lengths file: {line.strip()}\n")
                else:
                    sys.stderr.write(f"Warning: Skipping malformed line in lengths file: {line.strip()}\n")
    except FileNotFoundError:
        sys.stderr.write(f"Error: Lengths file '{lengths_file}' not found.\n")
        sys.exit(1)

    # Calculate RPKM and write to output
    try:
        with open(counts_file, 'r') as infile, open(output_file, 'w') as outfile:
            outfile.write("Feature_ID\tRead_Count\tLength_Bases\tRPKM\n")
            for line in infile:
                parts = line.strip().split('\t')
                if len(parts) == 2:
                    feature_id, read_count_str = parts[0], parts[1]
                    try:
                        read_count = int(read_count_str)
                    except ValueError:
                        sys.stderr.write(f"Warning: Skipping malformed read count (not an integer) in counts file: {line.strip()}\n")
                        continue
                    
                    length_bases = feature_lengths.get(feature_id)
                    
                    if length_bases is None:
                        sys.stderr.write(f"Warning: Feature '{feature_id}' not found in lengths file. Skipping RPKM calculation for this feature.\n")
                        outfile.write(f"{feature_id}\t{read_count}\tN/A\tN/A\n")
                        continue
                    
                    if length_bases == 0:
                        rpkm = 0.0 # RPKM is 0 if feature has no length
                        sys.stderr.write(f"Warning: Feature '{feature_id}' has zero length. RPKM set to 0.\n")
                    else:
                        length_kb = length_bases / 1000.0
                        rpkm = (read_count * 10**9) / (total_mapped_reads * length_kb)
                    
                    outfile.write(f"{feature_id}\t{read_count}\t{length_bases}\t{rpkm:.4f}\n")
                else:
                    sys.stderr.write(f"Warning: Skipping malformed line in counts file: {line.strip()}\n")
    except FileNotFoundError:
        sys.stderr.write(f"Error: Counts file '{counts_file}' not found.\n")
        sys.exit(1)

if __name__ == "__main__":
    # Define input and output file names
    counts_file = "counts.txt"
    lengths_file = "gene_lengths.txt"
    total_mapped_reads_file = "total_mapped_reads.txt"
    output_file = "rpkm_output.txt"

    calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file)
    
    print(f"RPKM calculation complete. Output saved to {output_file}")
EOF

# --- Execute the Python script ---
python rpkm_calculator.py

# --- Clean up generated files ---
rm counts.txt gene_lengths.txt total_mapped_reads.txt rpkm_calculator.py

In short, exons from all isoforms of a gene were merged to create one meta-transcript.

gtf_to_genes_and_transcripts.py (Inferred with models/gemini-2.5-flash) vRSEM 1.3.3 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define input GTF and output prefix
INPUT_GTF="gencode.v45.annotation.gtf" # Example: Gencode human GRCh38, release 45
OUTPUT_PREFIX="gencode.v45.meta_transcripts"

# Placeholder for reference GTF download (uncomment and adjust if needed)
# wget -O ${INPUT_GTF}.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
# gunzip -f ${INPUT_GTF}.gz

# Locate the gtf_to_genes_and_transcripts.py script within your RSEM installation.
# If RSEM is installed via conda, it might be in $CONDA_PREFIX/opt/rsem-X.Y.Z/util/
# Replace "/path/to/rsem/util" with the actual path to the RSEM 'util' directory.
RSEM_UTIL_SCRIPT="/path/to/rsem/util/gtf_to_genes_and_transcripts.py"

# Execute the script to merge exons from all isoforms of a gene into one meta-transcript.
# The primary output file containing the meta-transcripts (union exon models) will be: ${OUTPUT_PREFIX}.genes.gtf
python ${RSEM_UTIL_SCRIPT} ${INPUT_GTF} ${OUTPUT_PREFIX}

View on GitHub

The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

featureCounts (Inferred with models/gemini-2.5-flash) v2.0.6

$ Bash example

# Install Subread (if not already installed)
# conda install -c bioconda subread

# Define input files and parameters
BAM_FILE="aligned_reads.bam"
# Placeholder for the GTF/GFF file defining the exons of the meta-transcript.
# Replace with your specific meta-transcript annotation file.
# Example: Using a generic human GTF for GRCh38 assembly.
GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
OUTPUT_COUNTS_PREFIX="meta_transcript_counts"
OUTPUT_NORMALIZED="meta_transcript_rpkm.tsv"

# 1. Count reads falling in exons of the meta-transcript
# -a: annotation file (GTF/GFF)
# -o: output file prefix
# -F GTF: specify GTF format
# -t exon: count reads for features of type 'exon'
# -g gene_id: group features by 'gene_id' (assuming meta-transcript is defined by gene_id)
# -s 0: strand specific (0=unstranded, 1=stranded, 2=reverse stranded). Adjust as needed based on library prep.
# -p: specify if reads are paired-end. Remove if single-end.
# --extra-attributes=gene_name: include gene names in output (optional)
# --fracOverlap 0.5: minimum fraction of read overlap with feature (optional, default 0)
# --minOverlap 1: minimum number of overlapping bases (optional, default 1)
featureCounts -a "${GTF_FILE}" -o "${OUTPUT_COUNTS_PREFIX}.txt" -F GTF -t exon -g gene_id -s 0 "${BAM_FILE}"

# 2. Normalize counts to RPKM (Reads Per Kilobase of transcript per Million mapped reads)
# RPKM = (read_count * 10^9) / (feature_length_in_bases * total_mapped_reads_in_library)

# Extract total mapped reads from the featureCounts summary file
TOTAL_MAPPED_READS=$(awk '/Total reads/{print $NF}' "${OUTPUT_COUNTS_PREFIX}.txt.summary")

# Check if total mapped reads were extracted successfully
if [ -z "$TOTAL_MAPPED_READS" ] || [ "$TOTAL_MAPPED_READS" -eq 0 ]; then
    echo "Warning: Could not extract total mapped reads from featureCounts summary or total reads are zero. Attempting to count from BAM."
    TOTAL_MAPPED_READS=$(samtools view -c -F 260 "${BAM_FILE}") # Count primary alignments
    if [ "$TOTAL_MAPPED_READS" -eq 0 ]; then
        echo "Error: Total mapped reads are zero. Cannot calculate RPKM."
        exit 1
    fi
fi

# Process the featureCounts output to calculate RPKM
# The featureCounts output file has columns: Geneid, Chr, Start, End, Strand, Length, Sample_Name
# We need Geneid, Length, and Sample_Name (read count)
echo -e "Geneid\tRPKM" > "${OUTPUT_NORMALIZED}"
awk -v total_reads="${TOTAL_MAPPED_READS}" '
BEGIN {
    OFS="\t";
    # Convert total_reads to millions for RPKM formula
    total_reads_in_millions = total_reads / 1000000;
}
NR==1 {next} # Skip header line
!/^#/ { # Skip comment lines
    gene_id = $1;
    feature_length = $6;
    read_count = $7; # Assuming single sample output, this is the count column

    if (feature_length > 0 && total_reads_in_millions > 0) {
        rpkm = read_count / (feature_length / 1000) / total_reads_in_millions;
        print gene_id, rpkm;
    } else {
        print gene_id, 0; # Assign 0 RPKM if length or total reads are zero
    }
}' "${OUTPUT_COUNTS_PREFIX}.txt" >> "${OUTPUT_NORMALIZED}"

# Clean up intermediate files (optional)
# rm "${OUTPUT_COUNTS_PREFIX}.txt" "${OUTPUT_COUNTS_PREFIX}.txt.summary"

Raw Source Text

Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: mm9
Supplementary_files_format_and_content: tab-delimited text files include RPKM values for each Sample

← Back to Analysis