GSE277160 Processing Pipeline
Publication
Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.Molecular cell (2024) — PMID 39303722
Dataset
GSE277160Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [RNA-seq]
Processing Steps
Generate Jupyter Notebook-
1
We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.1a # Define paths and variables # Replace with actual paths to your STAR index and FASTQ files GENOME_DIR="/path/to/STAR_index/GRCh38_Gencode_v40" # Pre-built STAR index for GRCh38 and Gencode v40 READ1_FASTQ="sample_R1.fastq.gz" # Input FASTQ file (read 1) READ2_FASTQ="sample_R2.fastq.gz" # Input FASTQ file (read 2), remove if single-end OUTPUT_PREFIX="sample_aligned_" # Prefix for output files NUM_THREADS=8 # Number of threads to use, adjust as needed # Align reads to the human genome GRCh38 with Gencode v40 annotation STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.05 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --runThreadN "${NUM_THREADS}" -
2
Gene expression levels were quantified using RSEM (v1.3.0), in which the TPM values were used to calculate Pearson correlation coefficient between replicates.
$ Bash example
# Install RSEM (if not already installed) # For example, using conda: # conda install -c bioconda rsem=1.3.0 # --- IMPORTANT: RSEM requires a pre-built reference index. --- # Example command to build a reference (replace with actual paths and genome/transcriptome files): # rsem-prepare-reference --gtf genome.gtf --bowtie2 genome.fa rsem_reference_index_name # Define variables for reference and input files RSEM_REFERENCE_INDEX="path/to/rsem_indexed_reference/hg38_rsem_index" # Replace with actual path to RSEM indexed reference (e.g., built from GRCh38) READ1_FASTQ="sample_R1.fastq.gz" # Replace with actual path to R1 FASTQ file READ2_FASTQ="sample_R2.fastq.gz" # Replace with actual path to R2 FASTQ file OUTPUT_SAMPLE_NAME="sample_output" # Prefix for output files (e.g., sample_output.genes.results) # Quantify gene expression levels using RSEM # This command assumes paired-end, gzipped FASTQ files and uses RSEM's default aligner (often Bowtie2). # For single-end reads, remove --paired-end and provide only one FASTQ file. # For using STAR as the aligner, add --star option and ensure STAR is in your PATH # and the reference was built with the --star option. rsem-calculate-expression \ --paired-end \ --gzip-compressed \ "${READ1_FASTQ}" \ "${READ2_FASTQ}" \ "${RSEM_REFERENCE_INDEX}" \ "${OUTPUT_SAMPLE_NAME}" -
3
Samples meeting specific criteria were selected for further analysis: only those with unique aligned reads greater than 10M and a Pearson correlation coefficient between replicates of 0.9 or higher were considered.
QC filtering script (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# This script filters samples based on pre-computed Quality Control (QC) metrics. # It assumes an input file (e.g., 'qc_metrics.tsv') containing sample-specific # metrics, including unique aligned reads and replicate correlation. # Define filtering thresholds MIN_UNIQUE_ALIGNED_READS=10000000 # 10 million unique aligned reads MIN_PEARSON_CORRELATION=0.9 # Pearson correlation coefficient of 0.9 or higher # Placeholder for the actual input QC metrics file. # In a real bioinformatics pipeline, this file would be generated by # previous steps (e.g., alignment, QC metric calculation). INPUT_QC_FILE="qc_metrics.tsv" # Output file to store the selected sample IDs or their full QC rows OUTPUT_SELECTED_SAMPLES="selected_samples.tsv" # Filter samples using awk. # This example assumes the 'unique_aligned_reads' metric is in the 2nd column ($2) # and the 'replicate_correlation' metric is in the 3rd column ($3) of the # tab-separated INPUT_QC_FILE. # Adjust column numbers ($2, $3) based on the actual format of your 'qc_metrics.tsv'. # The first line (header) is printed without filtering. awk -v min_reads="$MIN_UNIQUE_ALIGNED_READS" -v min_corr="$MIN_PEARSON_CORRELATION" ' BEGIN { FS="\t"; OFS="\t" } NR==1 { print; next } # Print header line $2 >= min_reads && $3 >= min_corr { print } ' "$INPUT_QC_FILE" > "$OUTPUT_SELECTED_SAMPLES" -
4
Samples with fewer reads were re-sequenced to ensure an adequate count, while those with lower correlation were repeated.
Custom Quality Control Script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This step describes an experimental decision-making process based on quality control metrics (read counts and correlation), rather than a specific, universally recognized computational tool. Re-sequencing and repeating experiments are wet-lab actions. # A hypothetical custom script or pipeline module would typically perform checks like: # Example: Check read count (hypothetical threshold) # READ_COUNT=$(samtools flagstat input.bam | grep "total" | awk '{print $1}') # if [ "$READ_COUNT" -lt 10000000 ]; then # echo "Sample has fewer reads ($READ_COUNT), flagged for re-sequencing." # fi # Example: Check replicate correlation (hypothetical threshold, assuming deepTools or similar was used to generate metrics) # # CORRELATION_VALUE=$(python -c "import numpy as np; matrix=np.load('correlation_matrix.npz')['correlations']; print(matrix[0,1])") # Simplified # # if (( $(echo "$CORRELATION_VALUE < 0.8" | bc -l) )); then # # echo "Sample has lower correlation, flagged for repeating." # # fi # No direct execution command for the described 're-sequencing' or 'repeating' actions, as these are experimental decisions. -
5
Plus and minus strand bigwig files were generated using bedGraphToBigWig (v2.9).
$ Bash example
# Install UCSC tools (bedGraphToBigWig is part of this suite) # conda install -c bioconda ucsc-bedgraphtobigwig # conda install -c bioconda ucsc-fetchchromsizes # Placeholder for reference genome assembly (e.g., hg38) REFERENCE_GENOME="hg38" # Generate chrom.sizes file for the reference genome fetchChromSizes ${REFERENCE_GENOME} > ${REFERENCE_GENOME}.chrom.sizes # Generate plus strand bigWig file bedGraphToBigWig plus_strand.bedGraph ${REFERENCE_GENOME}.chrom.sizes plus_strand.bigWig # Generate minus strand bigWig file bedGraphToBigWig minus_strand.bedGraph ${REFERENCE_GENOME}.chrom.sizes minus_strand.bigWig -
6
Transcript abundance was quantified using Salmon (v1.9.0) with the âgcBias option.
$ Bash example
# Install Salmon (if not already installed) # conda install -c bioconda salmon # --- Placeholder for reference data --- # Salmon requires a transcriptome index. # Replace <TRANSCRIPTOME_FASTA> with the path to your transcriptome FASTA file (e.g., from GENCODE, Ensembl). # Replace <INDEX_DIR> with your desired output directory for the index. # Example for human GRCh38: # salmon index -t /path/to/human_grch38_transcriptome.fasta.gz -i /path/to/salmon_index/human_grch38 # --- Quantification command --- # Replace <INDEX_DIR> with the actual path to your Salmon index. # Replace <READ1_FASTQ> and <READ2_FASTQ> with your input FASTQ files. # If single-end reads, use -r <READS_FASTQ> instead of -1 and -2. # Replace <OUTPUT_DIR> with your desired output directory for quantification results. # -l A: Auto-detect library type (common default for RNA-seq). Adjust if a specific library type is known (e.g., -l ISR for paired-end stranded). # -p: Number of threads to use (adjust based on available resources). salmon quant \ -i /path/to/salmon_index/human_grch38 \ -l A \ -1 /path/to/reads/sample_R1.fastq.gz \ -2 /path/to/reads/sample_R2.fastq.gz \ --gcBias \ -o /path/to/output/sample_quant \ -p 8
-
7
CQN (v1.40.0) was employed to normalize gene-level GC content and length biases.
$ Bash example
# Install R and Bioconductor packages if not already installed # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("cqn")' # Placeholder for input files. Replace with actual paths. # gene_counts.tsv: Tab-separated file with gene IDs as row names and sample counts as columns. # Example: # GeneID Sample1 Sample2 # GeneA 100 150 # GeneB 200 220 # ... # # gene_features.tsv: Tab-separated file with gene IDs, GC_content, and Length columns. # These features are typically derived from a reference genome (e.g., hg38) # and a gene annotation file (e.g., GENCODE GTF). # Example: # GeneID GC_content Length # GeneA 0.45 1500 # GeneB 0.52 2000 # ... # Create an R script for CQN normalization cat << 'EOF' > run_cqn_normalization.R library(cqn) # --- Configuration --- counts_file <- "gene_counts.tsv" # Path to your raw gene counts matrix features_file <- "gene_features.tsv" # Path to your gene features (GC content, length) output_file <- "cqn_normalized_counts.tsv" # Output file for normalized counts # --- Load Data --- message(paste("Loading counts from:", counts_file)) counts_matrix <- read.delim(counts_file, row.names = 1, check.names = FALSE) message(paste("Loading features from:", features_file)) features_df <- read.delim(features_file, row.names = 1, check.names = FALSE) # Ensure gene IDs match and are in the same order common_genes <- intersect(rownames(counts_matrix), rownames(features_df)) if (length(common_genes) == 0) { stop("No common genes found between counts and features files. Please check gene IDs.") } counts_matrix <- counts_matrix[common_genes, ] features_df <- features_df[common_genes, ] message(paste("Found", length(common_genes), "common genes for normalization.")) # Extract GC content and lengths gc_content_vector <- features_df$GC_content gene_length_vector <- features_df$Length # Basic validation if (any(is.na(gc_content_vector)) || any(is.na(gene_length_vector))) { stop("NA values found in GC content or gene lengths. Please check your features file.") } if (any(gene_length_vector <= 0)) { stop("Gene lengths must be positive. Please check your features file.") } if (any(counts_matrix < 0)) { warning("Negative counts found in the matrix. CQN expects non-negative counts.") } # --- Perform CQN Normalization --- message("Performing CQN normalization...") # Using total counts as size factors. For more advanced normalization, # size factors from DESeq2/edgeR could be used for better library size adjustment. size_factors <- colSums(counts_matrix) # CQN function expects counts to be non-negative integers. # If your counts are already normalized or transformed, adjust accordingly. cqn_res <- cqn(counts = counts_matrix, x = gc_content_vector, lengths = gene_length_vector, sizeFactors = size_factors) # Get normalized counts (log2 scale) # The 'y' component is the log2(normalized_counts / sizeFactors) # The 'offset' component accounts for GC content and length biases. # Adding the offset to y gives the log2-transformed normalized expression values. normalized_log2_counts <- cqn_res$y + cqn_res$offset # Save results message(paste("Saving normalized counts to:", output_file)) write.table(normalized_log2_counts, output_file, sep = "\t", quote = FALSE, col.names = NA) message("CQN normalization complete.") EOF # Execute the R script Rscript run_cqn_normalization.R -
8
Transcript merging to genes was performed using Tximport (v1.22.0).
$ Bash example
# --- Installation (commented out) --- # Install R if not already installed # sudo apt-get update && sudo apt-get install -y r-base # Install BiocManager and tximport within R # R -q -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")' # R -q -e 'BiocManager::install("tximport")' # R -q -e 'install.packages("readr", repos="https://cloud.r-project.org")' # For write_tsv # --- Placeholder Data Generation (for demonstration) --- # In a real pipeline, these would be actual output from quantification tools (e.g., Salmon) # and a pre-generated tx2gene mapping. # Create dummy Salmon quantification files mkdir -p sample1 sample2 sample3 echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample1/quant.sf echo -e "ENST00000000001.1\t1000\t900\t10.5\t100" >> sample1/quant.sf echo -e "ENST00000000002.1\t1200\t1100\t20.3\t250" >> sample1/quant.sf echo -e "ENST00000000003.1\t800\t700\t5.1\t50" >> sample1/quant.sf echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample2/quant.sf echo -e "ENST00000000001.1\t1000\t900\t12.1\t120" >> sample2/quant.sf echo -e "ENST00000000002.1\t1200\t1100\t18.7\t230" >> sample2/quant.sf echo -e "ENST00000000003.1\t800\t700\t6.2\t60" >> sample2/quant.sf echo -e "Name\tLength\tEffectiveLength\tTPM\tNumReads" > sample3/quant.sf echo -e "ENST00000000001.1\t1000\t900\t9.8\t95" >> sample3/quant.sf echo -e "ENST00000000002.1\t1200\t1100\t22.5\t270" >> sample3/quant.sf echo -e "ENST00000000003.1\t800\t700\t4.9\t45" >> sample3/quant.sf # Create a dummy tx2gene mapping file # In a real scenario, this would be generated from a GTF/GFF annotation file # (e.g., using GenomicFeatures::makeTx2gene() in R). # Reference annotation source: Ensembl, GENCODE, UCSC (e.g., Homo_sapiens.GRCh38.109.gtf) echo -e "transcript_id\tgene_id" > tx2gene.tsv echo -e "ENST00000000001.1\tENSG00000000001.1" >> tx2gene.tsv echo -e "ENST00000000002.1\tENSG00000000001.1" >> tx2gene.tsv # Two transcripts map to one gene echo -e "ENST00000000003.1\tENSG00000000002.1" >> tx2gene.tsv # --- R script for Tximport --- # This script performs the transcript merging to genes cat << 'EOF' > run_tximport.R library(tximport) library(readr) # For write_tsv # --- Configuration --- # List of paths to transcript quantification files (e.g., Salmon quant.sf files) quant_files <- c( "sample1/quant.sf", "sample2/quant.sf", "sample3/quant.sf" ) names(quant_files) <- c("sample1", "sample2", "sample3") # Path to the transcript-to-gene mapping file # This file is typically generated from a reference genome annotation (e.g., GTF/GFF) # Example reference: Homo_sapiens.GRCh38.109.gtf from Ensembl tx2gene_file <- "tx2gene.tsv" # Output file for gene-level counts output_gene_counts_file <- "gene_level_counts.tsv" # --- Load tx2gene mapping --- tx2gene <- read.delim(tx2gene_file, header = TRUE, stringsAsFactors = FALSE) # --- Perform transcript merging to genes --- # 'type' should match the quantification tool used (e.g., "salmon", "kallisto", "rsem") # 'countsFromAbundance' specifies how to get gene-level counts from transcript abundances. # "lengthScaledTPM" is often recommended for differential expression analysis. txi <- tximport( files = quant_files, type = "salmon", # Assuming Salmon was used for quantification tx2gene = tx2gene, countsFromAbundance = "lengthScaledTPM", ignoreTxVersion = TRUE # Useful if transcript IDs in quant files have versions (e.g., ENST... .1) ) # Extract gene-level counts gene_counts <- as.data.frame(txi$counts) # Write gene-level counts to a TSV file write_tsv(gene_counts, output_gene_counts_file) message(paste("Gene-level counts saved to:", output_gene_counts_file)) EOF # --- Execution Command --- Rscript run_tximport.R # --- Clean up dummy data (optional) --- # rm -rf sample1 sample2 sample3 tx2gene.tsv run_tximport.R
Tools Used
Raw Source Text
We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a). Gene expression levels were quantified using RSEM (v1.3.0), in which the TPM values were used to calculate Pearson correlation coefficient between replicates. Samples meeting specific criteria were selected for further analysis: only those with unique aligned reads greater than 10M and a Pearson correlation coefficient between replicates of 0.9 or higher were considered. Samples with fewer reads were re-sequenced to ensure an adequate count, while those with lower correlation were repeated. Plus and minus strand bigwig files were generated using bedGraphToBigWig (v2.9). Transcript abundance was quantified using Salmon (v1.9.0) with the âgcBias option. CQN (v1.40.0) was employed to normalize gene-level GC content and length biases. Transcript merging to genes was performed using Tximport (v1.22.0). Assembly: GRCh38 Supplementary files format and content: Bam files and salmon quant files