GSE74583 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

PPAR-δ is repressed in Huntington's disease, is required for normal neuronal function and can be targeted therapeutically.

Nature medicine (2016) — PMID 26642438

Dataset

GSE74583

Next Generation Sequencing Investigation of altered transcripts in presence of dominant-negative transcription factor

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Illumina Casava1.7 software used for basecalling.

    Illumina Casava v1.7
    $ Bash example
    # Illumina Casava 1.7 is a proprietary software suite primarily used for basecalling and demultiplexing on Illumina sequencing platforms.
    # It is typically integrated into the instrument's control software or run on a dedicated Illumina analysis server.
    # There is no publicly available command-line installation or execution for Casava 1.7 in the same way as open-source tools.
    # The process converts raw intensity data (Bcl files) into FASTQ files.
    # Example conceptual representation (not an actual executable command):
    # casava_1.7_basecaller --input_bcl_directory /path/to/run/Data/Intensities/BaseCalls --output_fastq_directory /path/to/output/fastq
  2. 2

    Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata

    Bowtie v0.12.2 GitHub
    $ Bash example
    # Install Bowtie (if not already installed)
    # conda install -c bioconda bowtie=0.12.2
    
    # Assuming 'reads.fastq' is the input file after trimming and masking.
    # Replace /path/to/bowtie_indexes/mm8 with the actual path to your Bowtie index for mm8.
    # Replace reads.fastq with your actual input reads file.
    # Replace output.sam with your desired output SAM file name.
    
    bowtie -q -p 4 -e 100 -y -a -m 10 --best --strata /path/to/bowtie_indexes/mm8 reads.fastq > output.sam
  3. 3

    Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

    RPKM Calculator (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This script calculates Reads Per Kilobase of exon per Megabase of library size (RPKM)
    # based on the protocol described in Chepelev et al., Nucleic Acids Research, 2009.
    #
    # RPKM Formula: (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb)
    #
    # Input files required:
    # 1. Counts file: Tab-separated file with two columns: Feature_ID, Read_Count
    #    (e.g., output from featureCounts, HTSeq-count, or similar tools)
    # 2. Lengths file: Tab-separated file with two columns: Feature_ID, Length_in_Bases
    #    (e.g., derived from a genome annotation file like GTF/GFF for a specific assembly, e.g., GRCh38/hg38)
    # 3. Total Mapped Reads file: A plain text file containing a single integer: Total_Number_of_Mapped_Reads
    #    (e.g., extracted from samtools flagstat output)
    #
    # Output file: Tab-separated file with Feature_ID, Read_Count, Length_Bases, RPKM
    
    # --- Placeholder for input files (replace with your actual data) ---
    # For demonstration purposes, creating dummy input files:
    echo -e "gene1\t1000" > counts.txt
    echo -e "gene2\t2000" >> counts.txt
    echo -e "gene3\t500" >> counts.txt
    
    echo -e "gene1\t1500" > gene_lengths.txt # Length in bases
    echo -e "gene2\t2500" >> gene_lengths.txt
    echo -e "gene3\t800" >> gene_lengths.txt
    echo -e "gene4\t1000" >> gene_lengths.txt # A gene not in counts
    
    echo "10000000" > total_mapped_reads.txt # 10 million total mapped reads
    
    # --- Python script for RPKM calculation ---
    # (Python is typically pre-installed or easily installed via package managers)
    # # conda install python # Uncomment to install Python if not available
    
    cat << 'EOF' > rpkm_calculator.py
    import sys
    
    def calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file):
        """
        Calculates RPKM values based on read counts, feature lengths, and total mapped reads.
        Formula: RPKM = (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb)
        """
        
        # Read total mapped reads
        try:
            with open(total_mapped_reads_file, 'r') as f:
                total_mapped_reads = int(f.readline().strip())
        except FileNotFoundError:
            sys.stderr.write(f"Error: Total mapped reads file '{total_mapped_reads_file}' not found.\n")
            sys.exit(1)
        except ValueError:
            sys.stderr.write(f"Error: Invalid content in total mapped reads file '{total_mapped_reads_file}'. Expected an integer.\n")
            sys.exit(1)
        
        if total_mapped_reads == 0:
            sys.stderr.write("Error: Total mapped reads is zero. Cannot calculate RPKM.\n")
            sys.exit(1)
    
        # Read feature lengths
        feature_lengths = {}
        try:
            with open(lengths_file, 'r') as f:
                for line in f:
                    parts = line.strip().split('\t')
                    if len(parts) == 2:
                        feature_id, length_bases_str = parts[0], parts[1]
                        try:
                            feature_lengths[feature_id] = int(length_bases_str)
                        except ValueError:
                            sys.stderr.write(f"Warning: Skipping malformed length (not an integer) in lengths file: {line.strip()}\n")
                    else:
                        sys.stderr.write(f"Warning: Skipping malformed line in lengths file: {line.strip()}\n")
        except FileNotFoundError:
            sys.stderr.write(f"Error: Lengths file '{lengths_file}' not found.\n")
            sys.exit(1)
    
        # Calculate RPKM and write to output
        try:
            with open(counts_file, 'r') as infile, open(output_file, 'w') as outfile:
                outfile.write("Feature_ID\tRead_Count\tLength_Bases\tRPKM\n")
                for line in infile:
                    parts = line.strip().split('\t')
                    if len(parts) == 2:
                        feature_id, read_count_str = parts[0], parts[1]
                        try:
                            read_count = int(read_count_str)
                        except ValueError:
                            sys.stderr.write(f"Warning: Skipping malformed read count (not an integer) in counts file: {line.strip()}\n")
                            continue
                        
                        length_bases = feature_lengths.get(feature_id)
                        
                        if length_bases is None:
                            sys.stderr.write(f"Warning: Feature '{feature_id}' not found in lengths file. Skipping RPKM calculation for this feature.\n")
                            outfile.write(f"{feature_id}\t{read_count}\tN/A\tN/A\n")
                            continue
                        
                        if length_bases == 0:
                            rpkm = 0.0 # RPKM is 0 if feature has no length
                            sys.stderr.write(f"Warning: Feature '{feature_id}' has zero length. RPKM set to 0.\n")
                        else:
                            length_kb = length_bases / 1000.0
                            rpkm = (read_count * 10**9) / (total_mapped_reads * length_kb)
                        
                        outfile.write(f"{feature_id}\t{read_count}\t{length_bases}\t{rpkm:.4f}\n")
                    else:
                        sys.stderr.write(f"Warning: Skipping malformed line in counts file: {line.strip()}\n")
        except FileNotFoundError:
            sys.stderr.write(f"Error: Counts file '{counts_file}' not found.\n")
            sys.exit(1)
    
    if __name__ == "__main__":
        # Define input and output file names
        counts_file = "counts.txt"
        lengths_file = "gene_lengths.txt"
        total_mapped_reads_file = "total_mapped_reads.txt"
        output_file = "rpkm_output.txt"
    
        calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file)
        
        print(f"RPKM calculation complete. Output saved to {output_file}")
    EOF
    
    # --- Execute the Python script ---
    python rpkm_calculator.py
    
    # --- Clean up generated files ---
    rm counts.txt gene_lengths.txt total_mapped_reads.txt rpkm_calculator.py
  4. 4

    In short, exons from all isoforms of a gene were merged to create one meta-transcript.

    gtf_to_genes_and_transcripts.py (Inferred with models/gemini-2.5-flash) vRSEM 1.3.3 GitHub
    $ Bash example
    # Install RSEM (if not already installed)
    # conda install -c bioconda rsem
    
    # Define input GTF and output prefix
    INPUT_GTF="gencode.v45.annotation.gtf" # Example: Gencode human GRCh38, release 45
    OUTPUT_PREFIX="gencode.v45.meta_transcripts"
    
    # Placeholder for reference GTF download (uncomment and adjust if needed)
    # wget -O ${INPUT_GTF}.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz
    # gunzip -f ${INPUT_GTF}.gz
    
    # Locate the gtf_to_genes_and_transcripts.py script within your RSEM installation.
    # If RSEM is installed via conda, it might be in $CONDA_PREFIX/opt/rsem-X.Y.Z/util/
    # Replace "/path/to/rsem/util" with the actual path to the RSEM 'util' directory.
    RSEM_UTIL_SCRIPT="/path/to/rsem/util/gtf_to_genes_and_transcripts.py"
    
    # Execute the script to merge exons from all isoforms of a gene into one meta-transcript.
    # The primary output file containing the meta-transcripts (union exon models) will be: ${OUTPUT_PREFIX}.genes.gtf
    python ${RSEM_UTIL_SCRIPT} ${INPUT_GTF} ${OUTPUT_PREFIX}
  5. 5

    The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

    featureCounts (Inferred with models/gemini-2.5-flash) v2.0.6
    $ Bash example
    # Install Subread (if not already installed)
    # conda install -c bioconda subread
    
    # Define input files and parameters
    BAM_FILE="aligned_reads.bam"
    # Placeholder for the GTF/GFF file defining the exons of the meta-transcript.
    # Replace with your specific meta-transcript annotation file.
    # Example: Using a generic human GTF for GRCh38 assembly.
    GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
    OUTPUT_COUNTS_PREFIX="meta_transcript_counts"
    OUTPUT_NORMALIZED="meta_transcript_rpkm.tsv"
    
    # 1. Count reads falling in exons of the meta-transcript
    # -a: annotation file (GTF/GFF)
    # -o: output file prefix
    # -F GTF: specify GTF format
    # -t exon: count reads for features of type 'exon'
    # -g gene_id: group features by 'gene_id' (assuming meta-transcript is defined by gene_id)
    # -s 0: strand specific (0=unstranded, 1=stranded, 2=reverse stranded). Adjust as needed based on library prep.
    # -p: specify if reads are paired-end. Remove if single-end.
    # --extra-attributes=gene_name: include gene names in output (optional)
    # --fracOverlap 0.5: minimum fraction of read overlap with feature (optional, default 0)
    # --minOverlap 1: minimum number of overlapping bases (optional, default 1)
    featureCounts -a "${GTF_FILE}" -o "${OUTPUT_COUNTS_PREFIX}.txt" -F GTF -t exon -g gene_id -s 0 "${BAM_FILE}"
    
    # 2. Normalize counts to RPKM (Reads Per Kilobase of transcript per Million mapped reads)
    # RPKM = (read_count * 10^9) / (feature_length_in_bases * total_mapped_reads_in_library)
    
    # Extract total mapped reads from the featureCounts summary file
    TOTAL_MAPPED_READS=$(awk '/Total reads/{print $NF}' "${OUTPUT_COUNTS_PREFIX}.txt.summary")
    
    # Check if total mapped reads were extracted successfully
    if [ -z "$TOTAL_MAPPED_READS" ] || [ "$TOTAL_MAPPED_READS" -eq 0 ]; then
        echo "Warning: Could not extract total mapped reads from featureCounts summary or total reads are zero. Attempting to count from BAM."
        TOTAL_MAPPED_READS=$(samtools view -c -F 260 "${BAM_FILE}") # Count primary alignments
        if [ "$TOTAL_MAPPED_READS" -eq 0 ]; then
            echo "Error: Total mapped reads are zero. Cannot calculate RPKM."
            exit 1
        fi
    fi
    
    # Process the featureCounts output to calculate RPKM
    # The featureCounts output file has columns: Geneid, Chr, Start, End, Strand, Length, Sample_Name
    # We need Geneid, Length, and Sample_Name (read count)
    echo -e "Geneid\tRPKM" > "${OUTPUT_NORMALIZED}"
    awk -v total_reads="${TOTAL_MAPPED_READS}" '
    BEGIN {
        OFS="\t";
        # Convert total_reads to millions for RPKM formula
        total_reads_in_millions = total_reads / 1000000;
    }
    NR==1 {next} # Skip header line
    !/^#/ { # Skip comment lines
        gene_id = $1;
        feature_length = $6;
        read_count = $7; # Assuming single sample output, this is the count column
    
        if (feature_length > 0 && total_reads_in_millions > 0) {
            rpkm = read_count / (feature_length / 1000) / total_reads_in_millions;
            print gene_id, rpkm;
        } else {
            print gene_id, 0; # Assign 0 RPKM if length or total reads are zero
        }
    }' "${OUTPUT_COUNTS_PREFIX}.txt" >> "${OUTPUT_NORMALIZED}"
    
    # Clean up intermediate files (optional)
    # rm "${OUTPUT_COUNTS_PREFIX}.txt" "${OUTPUT_COUNTS_PREFIX}.txt.summary"
Raw Source Text
Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: mm9
Supplementary_files_format_and_content: tab-delimited text files include RPKM values for each Sample
← Back to Analysis