GSE74583 Processing Pipeline
Publication
PPAR-δ is repressed in Huntington's disease, is required for normal neuronal function and can be targeted therapeutically.Nature medicine (2016) — PMID 26642438
Dataset
GSE74583Next Generation Sequencing Investigation of altered transcripts in presence of dominant-negative transcription factor
Processing Steps
Generate Jupyter Notebook-
1
Illumina Casava1.7 software used for basecalling.
Illumina Casava v1.7$ Bash example
# Illumina Casava 1.7 is a proprietary software suite primarily used for basecalling and demultiplexing on Illumina sequencing platforms. # It is typically integrated into the instrument's control software or run on a dedicated Illumina analysis server. # There is no publicly available command-line installation or execution for Casava 1.7 in the same way as open-source tools. # The process converts raw intensity data (Bcl files) into FASTQ files. # Example conceptual representation (not an actual executable command): # casava_1.7_basecaller --input_bcl_directory /path/to/run/Data/Intensities/BaseCalls --output_fastq_directory /path/to/output/fastq
-
2
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata
$ Bash example
# Install Bowtie (if not already installed) # conda install -c bioconda bowtie=0.12.2 # Assuming 'reads.fastq' is the input file after trimming and masking. # Replace /path/to/bowtie_indexes/mm8 with the actual path to your Bowtie index for mm8. # Replace reads.fastq with your actual input reads file. # Replace output.sam with your desired output SAM file name. bowtie -q -p 4 -e 100 -y -a -m 10 --best --strata /path/to/bowtie_indexes/mm8 reads.fastq > output.sam
-
3
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.
RPKM Calculator (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This script calculates Reads Per Kilobase of exon per Megabase of library size (RPKM) # based on the protocol described in Chepelev et al., Nucleic Acids Research, 2009. # # RPKM Formula: (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb) # # Input files required: # 1. Counts file: Tab-separated file with two columns: Feature_ID, Read_Count # (e.g., output from featureCounts, HTSeq-count, or similar tools) # 2. Lengths file: Tab-separated file with two columns: Feature_ID, Length_in_Bases # (e.g., derived from a genome annotation file like GTF/GFF for a specific assembly, e.g., GRCh38/hg38) # 3. Total Mapped Reads file: A plain text file containing a single integer: Total_Number_of_Mapped_Reads # (e.g., extracted from samtools flagstat output) # # Output file: Tab-separated file with Feature_ID, Read_Count, Length_Bases, RPKM # --- Placeholder for input files (replace with your actual data) --- # For demonstration purposes, creating dummy input files: echo -e "gene1\t1000" > counts.txt echo -e "gene2\t2000" >> counts.txt echo -e "gene3\t500" >> counts.txt echo -e "gene1\t1500" > gene_lengths.txt # Length in bases echo -e "gene2\t2500" >> gene_lengths.txt echo -e "gene3\t800" >> gene_lengths.txt echo -e "gene4\t1000" >> gene_lengths.txt # A gene not in counts echo "10000000" > total_mapped_reads.txt # 10 million total mapped reads # --- Python script for RPKM calculation --- # (Python is typically pre-installed or easily installed via package managers) # # conda install python # Uncomment to install Python if not available cat << 'EOF' > rpkm_calculator.py import sys def calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file): """ Calculates RPKM values based on read counts, feature lengths, and total mapped reads. Formula: RPKM = (read_count * 10^9) / (total_mapped_reads * gene_length_in_kb) """ # Read total mapped reads try: with open(total_mapped_reads_file, 'r') as f: total_mapped_reads = int(f.readline().strip()) except FileNotFoundError: sys.stderr.write(f"Error: Total mapped reads file '{total_mapped_reads_file}' not found.\n") sys.exit(1) except ValueError: sys.stderr.write(f"Error: Invalid content in total mapped reads file '{total_mapped_reads_file}'. Expected an integer.\n") sys.exit(1) if total_mapped_reads == 0: sys.stderr.write("Error: Total mapped reads is zero. Cannot calculate RPKM.\n") sys.exit(1) # Read feature lengths feature_lengths = {} try: with open(lengths_file, 'r') as f: for line in f: parts = line.strip().split('\t') if len(parts) == 2: feature_id, length_bases_str = parts[0], parts[1] try: feature_lengths[feature_id] = int(length_bases_str) except ValueError: sys.stderr.write(f"Warning: Skipping malformed length (not an integer) in lengths file: {line.strip()}\n") else: sys.stderr.write(f"Warning: Skipping malformed line in lengths file: {line.strip()}\n") except FileNotFoundError: sys.stderr.write(f"Error: Lengths file '{lengths_file}' not found.\n") sys.exit(1) # Calculate RPKM and write to output try: with open(counts_file, 'r') as infile, open(output_file, 'w') as outfile: outfile.write("Feature_ID\tRead_Count\tLength_Bases\tRPKM\n") for line in infile: parts = line.strip().split('\t') if len(parts) == 2: feature_id, read_count_str = parts[0], parts[1] try: read_count = int(read_count_str) except ValueError: sys.stderr.write(f"Warning: Skipping malformed read count (not an integer) in counts file: {line.strip()}\n") continue length_bases = feature_lengths.get(feature_id) if length_bases is None: sys.stderr.write(f"Warning: Feature '{feature_id}' not found in lengths file. Skipping RPKM calculation for this feature.\n") outfile.write(f"{feature_id}\t{read_count}\tN/A\tN/A\n") continue if length_bases == 0: rpkm = 0.0 # RPKM is 0 if feature has no length sys.stderr.write(f"Warning: Feature '{feature_id}' has zero length. RPKM set to 0.\n") else: length_kb = length_bases / 1000.0 rpkm = (read_count * 10**9) / (total_mapped_reads * length_kb) outfile.write(f"{feature_id}\t{read_count}\t{length_bases}\t{rpkm:.4f}\n") else: sys.stderr.write(f"Warning: Skipping malformed line in counts file: {line.strip()}\n") except FileNotFoundError: sys.stderr.write(f"Error: Counts file '{counts_file}' not found.\n") sys.exit(1) if __name__ == "__main__": # Define input and output file names counts_file = "counts.txt" lengths_file = "gene_lengths.txt" total_mapped_reads_file = "total_mapped_reads.txt" output_file = "rpkm_output.txt" calculate_rpkm(counts_file, lengths_file, total_mapped_reads_file, output_file) print(f"RPKM calculation complete. Output saved to {output_file}") EOF # --- Execute the Python script --- python rpkm_calculator.py # --- Clean up generated files --- rm counts.txt gene_lengths.txt total_mapped_reads.txt rpkm_calculator.py -
4
In short, exons from all isoforms of a gene were merged to create one meta-transcript.
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem # Define input GTF and output prefix INPUT_GTF="gencode.v45.annotation.gtf" # Example: Gencode human GRCh38, release 45 OUTPUT_PREFIX="gencode.v45.meta_transcripts" # Placeholder for reference GTF download (uncomment and adjust if needed) # wget -O ${INPUT_GTF}.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz # gunzip -f ${INPUT_GTF}.gz # Locate the gtf_to_genes_and_transcripts.py script within your RSEM installation. # If RSEM is installed via conda, it might be in $CONDA_PREFIX/opt/rsem-X.Y.Z/util/ # Replace "/path/to/rsem/util" with the actual path to the RSEM 'util' directory. RSEM_UTIL_SCRIPT="/path/to/rsem/util/gtf_to_genes_and_transcripts.py" # Execute the script to merge exons from all isoforms of a gene into one meta-transcript. # The primary output file containing the meta-transcripts (union exon models) will be: ${OUTPUT_PREFIX}.genes.gtf python ${RSEM_UTIL_SCRIPT} ${INPUT_GTF} ${OUTPUT_PREFIX} -
5
The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
featureCounts (Inferred with models/gemini-2.5-flash) v2.0.6$ Bash example
# Install Subread (if not already installed) # conda install -c bioconda subread # Define input files and parameters BAM_FILE="aligned_reads.bam" # Placeholder for the GTF/GFF file defining the exons of the meta-transcript. # Replace with your specific meta-transcript annotation file. # Example: Using a generic human GTF for GRCh38 assembly. GTF_FILE="Homo_sapiens.GRCh38.109.gtf" OUTPUT_COUNTS_PREFIX="meta_transcript_counts" OUTPUT_NORMALIZED="meta_transcript_rpkm.tsv" # 1. Count reads falling in exons of the meta-transcript # -a: annotation file (GTF/GFF) # -o: output file prefix # -F GTF: specify GTF format # -t exon: count reads for features of type 'exon' # -g gene_id: group features by 'gene_id' (assuming meta-transcript is defined by gene_id) # -s 0: strand specific (0=unstranded, 1=stranded, 2=reverse stranded). Adjust as needed based on library prep. # -p: specify if reads are paired-end. Remove if single-end. # --extra-attributes=gene_name: include gene names in output (optional) # --fracOverlap 0.5: minimum fraction of read overlap with feature (optional, default 0) # --minOverlap 1: minimum number of overlapping bases (optional, default 1) featureCounts -a "${GTF_FILE}" -o "${OUTPUT_COUNTS_PREFIX}.txt" -F GTF -t exon -g gene_id -s 0 "${BAM_FILE}" # 2. Normalize counts to RPKM (Reads Per Kilobase of transcript per Million mapped reads) # RPKM = (read_count * 10^9) / (feature_length_in_bases * total_mapped_reads_in_library) # Extract total mapped reads from the featureCounts summary file TOTAL_MAPPED_READS=$(awk '/Total reads/{print $NF}' "${OUTPUT_COUNTS_PREFIX}.txt.summary") # Check if total mapped reads were extracted successfully if [ -z "$TOTAL_MAPPED_READS" ] || [ "$TOTAL_MAPPED_READS" -eq 0 ]; then echo "Warning: Could not extract total mapped reads from featureCounts summary or total reads are zero. Attempting to count from BAM." TOTAL_MAPPED_READS=$(samtools view -c -F 260 "${BAM_FILE}") # Count primary alignments if [ "$TOTAL_MAPPED_READS" -eq 0 ]; then echo "Error: Total mapped reads are zero. Cannot calculate RPKM." exit 1 fi fi # Process the featureCounts output to calculate RPKM # The featureCounts output file has columns: Geneid, Chr, Start, End, Strand, Length, Sample_Name # We need Geneid, Length, and Sample_Name (read count) echo -e "Geneid\tRPKM" > "${OUTPUT_NORMALIZED}" awk -v total_reads="${TOTAL_MAPPED_READS}" ' BEGIN { OFS="\t"; # Convert total_reads to millions for RPKM formula total_reads_in_millions = total_reads / 1000000; } NR==1 {next} # Skip header line !/^#/ { # Skip comment lines gene_id = $1; feature_length = $6; read_count = $7; # Assuming single sample output, this is the count column if (feature_length > 0 && total_reads_in_millions > 0) { rpkm = read_count / (feature_length / 1000) / total_reads_in_millions; print gene_id, rpkm; } else { print gene_id, 0; # Assign 0 RPKM if length or total reads are zero } }' "${OUTPUT_COUNTS_PREFIX}.txt" >> "${OUTPUT_NORMALIZED}" # Clean up intermediate files (optional) # rm "${OUTPUT_COUNTS_PREFIX}.txt" "${OUTPUT_COUNTS_PREFIX}.txt.summary"
Raw Source Text
Illumina Casava1.7 software used for basecalling. Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to mm8 whole genome using bowtie v0.12.2 with parameters -q -p 4 -e 100 -y -a -m 10 --best --strata Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library. Genome_build: mm9 Supplementary_files_format_and_content: tab-delimited text files include RPKM values for each Sample