GSE95156 Processing Pipeline
Publication
A role for alternative splicing in circadian control of exocytosis and glucose homeostasis.Genes & development (2020) — PMID 32616519
Dataset
GSE95156Pancreatic alpha and beta-cellular clocks have distinct molecular properties and impact on islet hormone secretion and gene expression
Processing Steps
Generate Jupyter Notebook-
1
Sequenced paired-end reads were mapped to the whole genome (mm10) using STAR (Dobin et al., 2013) with default parameters.
$ Bash example
# Install STAR # conda install -c bioconda star # Define paths and filenames GENOME_DIR="./mm10_star_index" FASTA_URL="ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz" GTF_URL="ftp://ftp.ensembl.org/pub/release-109/gtf/mus_musculus/Mus_musculus.GRCm38.109.gtf.gz" FASTA_GZ_FILE="Mus_musculus.GRCm38.dna.primary_assembly.fa.gz" GTF_GZ_FILE="Mus_musculus.GRCm38.109.gtf.gz" GENOME_FASTA="Mus_musculus.GRCm38.dna.primary_assembly.fa" GENOME_GTF="Mus_musculus.GRCm38.109.gtf" READ1="sample_R1.fastq.gz" # Placeholder for paired-end read 1 READ2="sample_R2.fastq.gz" # Placeholder for paired-end read 2 OUTPUT_PREFIX="sample_aligned_" NUM_THREADS=8 # Adjust based on available resources # 1. Create directory for genome index # mkdir -p "${GENOME_DIR}" # 2. Download reference genome FASTA and GTF files # wget -O "${FASTA_GZ_FILE}" "${FASTA_URL}" # wget -O "${GTF_GZ_FILE}" "${GTF_URL}" # 3. Decompress files # gunzip -f "${FASTA_GZ_FILE}" # gunzip -f "${GTF_GZ_FILE}" # 4. Generate STAR genome index (using common parameters for RNA-seq, sjdbOverhang is often read_length - 1, 100 is a common default) # STAR --runMode genomeGenerate \ # --genomeDir "${GENOME_DIR}" \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GENOME_GTF}" \ # --sjdbOverhang 100 \ # --runThreadN "${NUM_THREADS}" # 5. Perform paired-end alignment with STAR using default parameters (and common output format for pipelines) STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --runThreadN "${NUM_THREADS}" -
2
Genome index was built considering Ensembl annotation.
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define variables for genome and annotation files # Using Homo sapiens GRCh38 and Ensembl release 109 as a common example. GENOME_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz" GTF_FILE="Homo_sapiens.GRCh38.109.gtf.gz" GENOME_DIR="STAR_GRCh38_Ensembl109_Index" NUM_THREADS=8 # Adjust as needed # Download reference files (example, actual download method may vary) # mkdir -p reference_data # wget -P reference_data ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/${GENOME_FASTA} # wget -P reference_data ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/${GTF_FILE} # Create output directory for the genome index mkdir -p ${GENOME_DIR} # Build the STAR genome index STAR --runMode genomeGenerate \ --genomeDir ${GENOME_DIR} \ --genomeFastaFiles reference_data/${GENOME_FASTA} \ --sjdbGTFfile reference_data/${GTF_FILE} \ --sjdbOverhang 100 \ --runThreadN ${NUM_THREADS} -
3
For each Ensembl gene, all of the exons of the respective annotated protein-coding transcripts were considered.
Ensembl vRelease 111 (GRCh38) (Inferred from common usage and recent releases)$ Bash example
# This step describes the *selection criteria* for genomic features from Ensembl. # It implies using a GTF/GFF file derived from Ensembl annotations. # The following commands demonstrate how one might obtain and filter such a file # to focus on exons of protein-coding transcripts for Homo sapiens (GRCh38). # --- Installation (if needed, wget, gunzip, awk are usually pre-installed) --- # # conda install -c conda-forge wget # --- Define variables for species and Ensembl release --- SPECIES="Homo_sapiens" ASSEMBLY="GRCh38" ENSEMBL_RELEASE="111" # Example: Use a specific Ensembl release version # --- Download the Ensembl GTF file for the specified release and assembly --- # This URL pattern is common for Ensembl FTP archives. # Adjust the URL if a different mirror or specific file path is required. ENSEMBL_FTP_BASE="ftp://ftp.ensembl.org/pub/release-${ENSEMBL_RELEASE}/gtf/${SPECIES,,}" GTF_FILE="${SPECIES}.${ASSEMBLY}.${ENSEMBL_RELEASE}.gtf.gz" DOWNLOAD_URL="${ENSEMBL_FTP_BASE}/${GTF_FILE}" echo "Downloading Ensembl GTF file from: ${DOWNLOAD_URL}" wget -nc "${DOWNLOAD_URL}" # --- Decompress the GTF file --- gunzip -f "${GTF_FILE}" # --- Filter the GTF file to consider only exons from protein-coding transcripts --- # This command extracts lines where the feature type is 'exon' AND # the gene_biotype attribute is 'protein_coding'. INPUT_GTF="${SPECIES}.${ASSEMBLY}.${ENSEMBL_RELEASE}.gtf" OUTPUT_FILTERED_GTF="${SPECIES}.${ASSEMBLY}.${ENSEMBL_RELEASE}.protein_coding_exons.gtf" echo "Filtering GTF file for protein-coding exons..." awk '$3 == "exon" && $0 ~ /gene_biotype "protein_coding"/' "${INPUT_GTF}" > "${OUTPUT_FILTERED_GTF}" echo "Filtered GTF saved to: ${OUTPUT_FILTERED_GTF}" -
4
Reads up to one mismatch were counted considering only reads in the right gene orientation and mapped in the defined exons.
featureCounts (Inferred with models/gemini-2.5-flash) v2.0.6$ Bash example
# Install Subread (which includes featureCounts) # conda install -c bioconda subread # conda install -c bioconda samtools # Define input and output files RAW_ALIGNED_BAM="aligned_reads.bam" # Input BAM from alignment step FILTERED_ALIGNED_BAM="aligned_reads_mismatch_filtered.bam" ANNOTATION_GTF="Homo_sapiens.GRCh38.109.gtf" # Placeholder for annotation file (e.g., from Ensembl) OUTPUT_COUNTS="exon_counts.txt" FEATURE_TYPE="exon" # As specified in the description GENE_ID_ATTRIBUTE="gene_id" # Common attribute for grouping exons into genes # 1. Filter reads for up to one mismatch (NM:i:0 or NM:i:1) # This step uses samtools to convert BAM to SAM, then awk to filter based on the NM tag # (number of mismatches), and finally samtools to convert back to BAM. samtools view -h "${RAW_ALIGNED_BAM}" | \ awk 'BEGIN{FS="\t"} /^@/ || /NM:i:0/ || /NM:i:1/' | \ samtools view -b - > "${FILTERED_ALIGNED_BAM}" # 2. Count reads in defined exons with right gene orientation # -a: Annotation file # -o: Output file # -s 2: Strand-specific counting (reverse strand, common for many RNA-seq protocols). # "right gene orientation" implies either -s 1 or -s 2. We assume -s 2 as a common default. # -t exon: Count reads mapping to exons # -g gene_id: Group features by gene_id # -p: If paired-end reads (add if applicable, otherwise omit) featureCounts -a "${ANNOTATION_GTF}" \ -o "${OUTPUT_COUNTS}" \ -s 2 \ -t "${FEATURE_TYPE}" \ -g "${GENE_ID_ATTRIBUTE}" \ "${FILTERED_ALIGNED_BAM}" -
5
Read counts were normalized by the library size (EdgeR, Robinson et al.
edgeR v3.42.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install R and edgeR if not already present # conda install -c conda-forge r-base # conda install -c bioconductor bioconductor-edger # Create a dummy counts.tsv for demonstration purposes. # In a real pipeline, this file would be the output of a previous step (e.g., featureCounts, HTSeq). # It should contain gene/feature IDs in the first column and raw counts for each sample in subsequent columns. echo -e "Gene\tSample1\tSample2\tSample3" > counts.tsv echo -e "GeneA\t1000\t1200\t1100" >> counts.tsv echo -e "GeneB\t2000\t2500\t2200" >> counts.tsv echo -e "GeneC\t500\t600\t550" >> counts.tsv echo -e "GeneD\t100\t120\t110" >> counts.tsv # R script to perform edgeR library size normalization using TMM (Trimmed Mean of M-values). # The normalization factors are calculated and saved to 'normalization_factors.tsv'. # These factors are typically used in downstream differential expression analysis with edgeR. Rscript -e ' library(edgeR) # Read count data. Assumes first column is row names (e.g., gene IDs). counts_data <- read.delim("counts.tsv", row.names = 1, check.names = FALSE) # Create a DGEList object from the count matrix. dge <- DGEList(counts = counts_data) # Calculate normalization factors (TMM method is default for calcNormFactors). # These factors adjust for differences in library size and composition. dge <- calcNormFactors(dge) # Save the sample information, including the calculated normalization factors, to a TSV file. # The normalization factors are in the "norm.factors" column. write.table(dge$samples, "normalization_factors.tsv", sep = "\t", quote = FALSE, row.names = TRUE) # Optional: To get TMM-normalized counts (e.g., as Counts Per Million - CPM), # you can use the cpm() function with normalized.lib.sizes = TRUE. # normalized_cpm_counts <- cpm(dge, normalized.lib.sizes = TRUE) # write.table(normalized_cpm_counts, "normalized_cpm_counts.tsv", sep = "\t", quote = FALSE, row.names = TRUE) ' -
6
2010) and the average size of the transcripts.
$ Bash example
# Install seqkit (if not already installed) # conda install -c bioconda seqkit # Define reference paths # Replace with actual paths to your reference genome and annotation files REF_DIR="/path/to/reference/GRCh38" TRANSCRIPTOME_FASTA="${REF_DIR}/gencode.v44.transcripts.fa" # Placeholder for GRCh38 transcriptome FASTA # Ensure the reference transcriptome FASTA file exists # Example download (uncomment and modify as needed): # wget -P "${REF_DIR}" https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.transcripts.fa.gz # gunzip "${REF_DIR}/gencode.v44.transcripts.fa.gz" # Calculate the average size of transcripts using seqkit # seqkit fxlen outputs sequence ID and length. We then pipe to awk to calculate the average. seqkit fxlen "${TRANSCRIPTOME_FASTA}" | awk ' BEGIN { total_length = 0; transcript_count = 0; } NR > 1 { # Skip header line from seqkit fxlen output total_length += $2; # Second column is length transcript_count++; } END { if (transcript_count > 0) { print "Total transcripts: " transcript_count; print "Total length (bases): " total_length; print "Average transcript size (bases): " total_length / transcript_count; } else { print "No transcripts found or file is empty."; } }' > average_transcript_size.txt -
7
Normalized reads count (RPKM) were log2 transformed.
awk (Inferred with models/gemini-2.5-flash) vN/A (Generic mathematical operation)$ Bash example
# This script assumes RPKM values are in a tab-separated file, e.g., in the second column. # Adjust column number ($2) if RPKM values are in a different column. # Add a pseudocount (e.g., +1) before log2 transformation if RPKM values can be zero to avoid -Inf. # Example: input_rpkm.tsv contains gene_id\tRPKM_value # Example using awk for log2 transformation (adding a pseudocount of 1 to avoid log(0)) # Assuming RPKM values are in the second column awk 'BEGIN {OFS="\t"} {print $1, log($2 + 1)/log(2)}' input_rpkm.tsv > output_log2_rpkm.tsv # Alternative using R (requires R installation) # Rscript -e ' # data <- read.table("input_rpkm.tsv", header=FALSE, sep="\t") # # Add a pseudocount of 1 to avoid log(0) # data$V2_log2 <- log2(data$V2 + 1) # write.table(data, "output_log2_rpkm.tsv", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE) # ' # Alternative using Python (requires Python and pandas/numpy installation) # # conda install pandas numpy # python -c ' # import pandas as pd # import numpy as np # df = pd.read_csv("input_rpkm.tsv", sep="\t", header=None) # # Add a pseudocount of 1 to avoid log(0) # df[1] = np.log2(df[1] + 1) # df.to_csv("output_log2_rpkm.tsv", sep="\t", header=False, index=False) # '
Tools Used
Raw Source Text
Sequenced paired-end reads were mapped to the whole genome (mm10) using STAR (Dobin et al., 2013) with default parameters. Genome index was built considering Ensembl annotation. For each Ensembl gene, all of the exons of the respective annotated protein-coding transcripts were considered. Reads up to one mismatch were counted considering only reads in the right gene orientation and mapped in the defined exons. Read counts were normalized by the library size (EdgeR, Robinson et al. 2010) and the average size of the transcripts. Normalized reads count (RPKM) were log2 transformed. Genome_build: mm10 Supplementary_files_format_and_content: Tab-delimited text file include RPKM values for each sample