GSE62023 Processing Pipeline
Publication
A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells. Cell Reports (2015), PMID 26440894
Processing Steps
-
1
Mean foreground intensities were obtained for each spot and imported into the mathematical software package "R", which is used for all data input, diagnostic plots, normalization and quality checking steps of the analysis process using scripts developed in-house by Peter White specifically for this analysis.
$ Bash example
# This step involves custom R scripts for data input, diagnostic plots,
# normalization, and quality checking.
# The exact script name and parameters are not provided in the description,
# as they are described as "in-house scripts developed by Peter White
# specifically for this analysis".
# A placeholder command is used to represent the execution of such a script.
Rscript peter_white_analysis.R \
  --input input_intensities.tsv \
  --output_normalized output_normalized_data.tsv \
  --output_qc_report output_qc_report.pdf
-
2
In outline, the Cy3 (green) intensities were not background corrected (this has been shown to only introduce noise), but they were corrected for the scanner offset (40 was subtracted from each intensity).
Custom Script (Inferred with models/gemini-2.5-flash) vN/A
$ Bash example
# Define input and output file names (placeholders)
INPUT_FILE="raw_intensities.tsv"
OUTPUT_FILE="corrected_intensities.tsv"

# Define the scanner offset to be subtracted
SCANNER_OFFSET=40

# Define the 1-based index of the column containing Cy3 intensities.
# This is an assumption; adjust 'CY3_COLUMN_INDEX' based on the actual file format.
CY3_COLUMN_INDEX=3

# Subtract the scanner offset from the specified Cy3 intensity column.
# This command assumes a tab-separated input file and preserves the header.
# If there is no header, remove 'NR==1 { print; next }'.
# If the file is not tab-separated, adjust 'FS' and 'OFS' accordingly.
awk -v offset="${SCANNER_OFFSET}" -v col_idx="${CY3_COLUMN_INDEX}" '
  BEGIN { FS=OFS="\t" }
  NR==1 { print; next }
  { $col_idx = $col_idx - offset; print }
' "${INPUT_FILE}" > "${OUTPUT_FILE}"
-
3
The dataset was filtered to remove positive control elements and any elements that had been flagged as bad.
$ Bash example
# Example: Assuming input, positive controls, and bad elements are BED files.
# Note: BED input is an assumption of this example. The description refers to
# array elements (spots), so a spot-level table would first need genomic
# coordinates, or would have to be filtered by other means.
# Replace 'input.bed', 'positive_controls.bed', and 'bad_elements.bed' with your actual file paths.
# The 'positive_controls.bed' and 'bad_elements.bed' files should contain the genomic regions to be removed.
# conda install -c bioconda bedtools

# Remove positive control elements from the input dataset
bedtools subtract -a input.bed -b positive_controls.bed > temp_filtered_controls.bed

# From the remaining dataset, remove elements that have been flagged as bad
bedtools subtract -a temp_filtered_controls.bed -b bad_elements.bed > filtered_output.bed

# Clean up the temporary file
rm temp_filtered_controls.bed
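Because the description refers to array elements (spots) rather than genomic intervals, the same filter can also be sketched directly on a spot-level table. The following is a minimal Python sketch under stated assumptions: the column names `ControlType` and `IsFlaggedBad`, the code `1` for a positive control, and the code `1` for a bad spot are all hypothetical and are not given in the description. Adapt them to the actual scanner or feature-extraction output.

```python
import csv
import io


def filter_spots(rows, control_col="ControlType", flag_col="IsFlaggedBad"):
    """Drop positive-control spots and spots flagged as bad.

    Assumptions (hypothetical, not from the source description):
    - control_col == "1" marks a positive control element
    - flag_col == "1" marks an element flagged as bad
    """
    kept = []
    for row in rows:
        if row[control_col] == "1":
            continue  # positive control element
        if row[flag_col] == "1":
            continue  # element flagged as bad
        kept.append(row)
    return kept


# Toy spot-level table used only for illustration
EXAMPLE_TSV = (
    "ProbeID\tControlType\tIsFlaggedBad\tCy3\n"
    "ProbeA\t0\t0\t100\n"
    "PosCtrl1\t1\t0\t5000\n"
    "ProbeB\t0\t1\t80\n"
    "NegCtrl1\t-1\t0\t3\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
kept = filter_spots(rows)
kept_ids = [row["ProbeID"] for row in kept]
print(kept_ids)
```

In this sketch the negative control spot is deliberately kept, since the next step uses the negative controls to determine the background threshold.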
-
4
Using the negative controls on the arrays, the background threshold was determined and all values less than this value were set to the threshold value.
Custom Python script (Inferred with models/gemini-2.5-flash) vN/A
$ Bash example
# Create dummy input files for demonstration
# This would be replaced by actual data in a real pipeline
mkdir -p data
echo -e "ProbeID\tSample1\tSample2" > data/input_array_data.tsv
echo -e "ProbeA\t100\t120" >> data/input_array_data.tsv
echo -e "ProbeB\t5\t8" >> data/input_array_data.tsv
echo -e "ProbeC\t150\t160" >> data/input_array_data.tsv
echo -e "NegControl1\t1\t2" >> data/input_array_data.tsv
echo -e "NegControl2\t3\t4" >> data/input_array_data.tsv
echo -e "ProbeD\t20\t25" >> data/input_array_data.tsv
echo -e "NegControl1\nNegControl2" > data/negative_control_probes.txt

# Save the Python script
cat << 'EOF' > background_correction.py
import sys

import numpy as np
import pandas as pd


def apply_background_threshold(data_file, negative_control_probes_file, output_file):
    """
    Determines background threshold from negative controls and floors values.

    Args:
        data_file (str): Path to the input data file (e.g., TSV).
        negative_control_probes_file (str): Path to a file listing negative
            control probe IDs (one per line).
        output_file (str): Path to save the processed data.
    """
    try:
        # Assuming TSV with first column as index (probe IDs)
        df = pd.read_csv(data_file, sep='\t', index_col=0)
    except Exception as e:
        print(f"Error reading data file {data_file}: {e}", file=sys.stderr)
        sys.exit(1)

    try:
        with open(negative_control_probes_file, 'r') as f:
            negative_control_probes = [line.strip() for line in f if line.strip()]
    except Exception as e:
        print(f"Error reading negative control probes file {negative_control_probes_file}: {e}", file=sys.stderr)
        sys.exit(1)

    # Filter for negative control probes present in the data
    common_negative_controls = [p for p in negative_control_probes if p in df.index]
    if not common_negative_controls:
        print("Warning: No negative control probes found in the data. "
              "Cannot perform background correction.", file=sys.stderr)
        sys.exit(1)

    negative_control_data = df.loc[common_negative_controls].values.flatten()

    # Determine background threshold (e.g., mean of positive values from negative controls).
    # The description does not say how the threshold was derived; the mean is an assumption.
    # Filter out non-positive values if they represent missing data or true zeros
    # that shouldn't contribute to the background mean.
    positive_negative_control_data = negative_control_data[negative_control_data > 0]
    if len(positive_negative_control_data) == 0:
        print("Warning: No positive values found in negative controls. "
              "Cannot determine background threshold.", file=sys.stderr)
        sys.exit(1)

    background_threshold = np.mean(positive_negative_control_data)
    print(f"Determined background threshold: {background_threshold:.4f}")

    # Set all values less than the threshold to the threshold value
    df_corrected = df.clip(lower=background_threshold)

    df_corrected.to_csv(output_file, sep='\t')
    print(f"Processed data saved to {output_file}")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python background_correction.py <input_data_file> <negative_control_probes_file> <output_file>", file=sys.stderr)
        sys.exit(1)

    input_data_file = sys.argv[1]
    negative_control_probes_file = sys.argv[2]
    output_file = sys.argv[3]

    apply_background_threshold(input_data_file, negative_control_probes_file, output_file)
EOF

# Install dependencies if not already present
# conda install -c anaconda pandas numpy

# Execute the background correction script
python background_correction.py data/input_array_data.tsv data/negative_control_probes.txt data/output_array_data_background_corrected.tsv
-
5
Finally, the data was normalized using the Limma Quantile Normalization package in "R" (Smyth 2004, Bolstad et al., 2003).
$ Bash example
# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("limma", update = FALSE, ask = FALSE)'

# --- Placeholder for input data ---
# In a real scenario, 'expression_matrix.tsv' would be your actual gene expression data.
# It should be a tab-separated file with gene identifiers in the first column
# and sample expression values in subsequent columns.
# Example:
# Gene   Sample1  Sample2  Sample3
# GeneA  100      120      90
# GeneB  50       60       40
# GeneC  200      210      180

# Create a dummy input file for demonstration purposes if it doesn't exist
if [ ! -f "expression_matrix.tsv" ]; then
  echo -e "Gene\tSample1\tSample2\tSample3" > expression_matrix.tsv
  echo -e "GeneA\t100\t120\t90" >> expression_matrix.tsv
  echo -e "GeneB\t50\t60\t40" >> expression_matrix.tsv
  echo -e "GeneC\t200\t210\t180" >> expression_matrix.tsv
  echo -e "GeneD\t70\t80\t60" >> expression_matrix.tsv
fi

# R script for Limma Quantile Normalization
# This script loads expression data, performs quantile normalization using the limma package,
# and saves the normalized data to a new file.
# It is written to a file (placeholder name) through a quoted heredoc so that
# the shell leaves the R quotes and backslashes untouched.
cat << 'EOF' > quantile_normalize.R
# Load the limma package
library(limma)

# --- Configuration ---
input_file <- "expression_matrix.tsv"              # Path to your input data file
output_file <- "normalized_expression_matrix.tsv"  # Path for the output normalized data file

# --- Main script ---
# 1. Load data
# Assumes input is a tab-separated matrix with gene IDs as row names (first column)
# and sample names as column headers.
# Adjust 'header', 'row.names', 'sep' parameters if your file format differs.
data_matrix <- read.delim(input_file, header = TRUE, row.names = 1, sep = "\t")

# Ensure the data matrix is numeric for normalization
data_matrix <- as.matrix(data_matrix)

# 2. Perform Quantile Normalization
# The 'normalizeBetweenArrays' function from limma is used with 'method = "quantile"'.
normalized_data <- normalizeBetweenArrays(data_matrix, method = "quantile")

# 3. Save normalized data
# Writes the normalized matrix to a new tab-separated file.
# 'col.names = NA' writes the row names (gene IDs) as the first column
# with an empty header cell for that column.
write.table(normalized_data, file = output_file, sep = "\t", quote = FALSE, col.names = NA)

message(paste("Quantile normalization complete. Normalized data saved to:", output_file))
EOF

# Execute the R script
Rscript quantile_normalize.R

# Optional: Clean up the dummy input file if it was created for demonstration
# rm expression_matrix.tsv
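To make the effect of quantile normalization concrete, the idea can be sketched in a few lines of plain Python. This is an illustrative sketch of the general technique, not the limma implementation: the function name `quantile_normalize` and the toy values are hypothetical, and tied values are simply ordered by position here, whereas a production implementation would typically treat ties specially.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list of values per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Simplification: ties are ordered by position.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across arrays at each rank
    reference = [sum(sc[i] for sc in sorted_cols) / len(columns) for i in range(n)]

    normalized = []
    for col in columns:
        # Indices of this column's values, ordered from smallest to largest
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy arrays with four probes each (illustrative values only)
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 6.0, 2.0],
]
normalized = quantile_normalize(arrays)
print(normalized)
```

After normalization both arrays contain exactly the same set of values, and only the assignment of those values to probes differs. That shared distribution is what makes intensities comparable across arrays.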
Tools Used
Raw Source Text
Mean foreground intensities were obtained for each spot and imported into the mathematical software package "R", which is used for all data input, diagnostic plots, normalization and quality checking steps of the analysis process using scripts developed in-house by Peter White specifically for this analysis. In outline, the Cy3 (green) intensities were not background corrected (this has been shown to only introduce noise), and corrected for the scanner offset (40 was subtracted for each intensity). The dataset was filtered to remove positive control elements and any elements that had been flagged as bad. Using the negative controls on the arrays, the background threshold was determined and all values less than this value were set to the threshold value. Finally, the data was normalized using the Limma Quantile Normalization package in "R" (Smyth 2004, Bolstad et al., 2003).