GSE62023 Processing Pipeline

Publication

A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells.

Cell Reports (2015), PMID 26440894

Dataset

GSE62023

Identification of Sox9/Pdx1-coregulated Genes During Pancreas Organogenesis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Mean foreground intensities were obtained for each spot and imported into the mathematical software package "R", which is used for all data input, diagnostic plots, normalization and quality checking steps of the analysis process using scripts developed in-house by Peter White specifically for this analysis.

R, version not specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# This step involves custom R scripts for data input, diagnostic plots, normalization, and quality checking.
# The exact script name and parameters are not provided in the description, as they are described as "in-house scripts developed by Peter White specifically for this analysis".
# A placeholder command is used to represent the execution of such a script.

Rscript peter_white_analysis.R \
    --input input_intensities.tsv \
    --output_normalized output_normalized_data.tsv \
    --output_qc_report output_qc_report.pdf

In outline, the Cy3 (green) intensities were not background corrected (this has been shown to only introduce noise), but were corrected for the scanner offset (40 was subtracted from each intensity).

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Define input and output file names (placeholders)
INPUT_FILE="raw_intensities.tsv"
OUTPUT_FILE="corrected_intensities.tsv"

# Define the scanner offset to be subtracted
SCANNER_OFFSET=40

# Define the 1-based index of the column containing Cy3 intensities.
# This is an assumption; adjust 'CY3_COLUMN_INDEX' based on the actual file format.
CY3_COLUMN_INDEX=3

# Subtract the scanner offset from the specified Cy3 intensity column.
# This command assumes a tab-separated input file and preserves the header.
# If there is no header, remove 'NR==1 { print; next }'.
# If the file is not tab-separated, adjust 'FS' and 'OFS' accordingly.
awk -v offset="${SCANNER_OFFSET}" -v col_idx="${CY3_COLUMN_INDEX}" '
BEGIN { FS=OFS="\t" }
NR==1 { print; next }
{
    $col_idx = $col_idx - offset;
    print
}' "${INPUT_FILE}" > "${OUTPUT_FILE}"

The dataset was filtered to remove positive control elements and any elements that had been flagged as bad.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Array elements are rows in a spot table, not genomic intervals, so this sketch
# filters rows with awk rather than with an interval tool such as bedtools.
# Assumptions (adjust to the actual file): tab-separated input with a header,
# a control-type column whose value is "positive" for positive controls, and
# a flag column where a non-zero value means the spot was flagged as bad.
INPUT_FILE="corrected_intensities.tsv"
OUTPUT_FILE="filtered_intensities.tsv"
CONTROL_TYPE_COLUMN=2
FLAG_COLUMN=4

# Keep the header, then keep only rows that are neither positive controls nor flagged.
awk -v ctrl="${CONTROL_TYPE_COLUMN}" -v flag="${FLAG_COLUMN}" '
BEGIN { FS=OFS="\t" }
NR==1 { print; next }
$ctrl != "positive" && $flag == 0 { print }
' "${INPUT_FILE}" > "${OUTPUT_FILE}"

Using the negative controls on the arrays, the background threshold was determined and all values less than this value were set to the threshold value.

Custom Python script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Create dummy input files for demonstration
# This would be replaced by actual data in a real pipeline
mkdir -p data
echo -e "ProbeID\tSample1\tSample2" > data/input_array_data.tsv
echo -e "ProbeA\t100\t120" >> data/input_array_data.tsv
echo -e "ProbeB\t5\t8" >> data/input_array_data.tsv
echo -e "ProbeC\t150\t160" >> data/input_array_data.tsv
echo -e "NegControl1\t1\t2" >> data/input_array_data.tsv
echo -e "NegControl2\t3\t4" >> data/input_array_data.tsv
echo -e "ProbeD\t20\t25" >> data/input_array_data.tsv

echo -e "NegControl1\nNegControl2" > data/negative_control_probes.txt

# Save the Python script
cat << 'EOF' > background_correction.py
import pandas as pd
import numpy as np
import sys

def apply_background_threshold(data_file, negative_control_probes_file, output_file):
    """
    Determines background threshold from negative controls and floors values.

    Args:
        data_file (str): Path to the input data file (e.g., TSV).
        negative_control_probes_file (str): Path to a file listing negative control probe IDs (one per line).
        output_file (str): Path to save the processed data.
    """
    try:
        df = pd.read_csv(data_file, sep='\t', index_col=0) # Assuming TSV with first column as index (probe IDs)
    except Exception as e:
        print(f"Error reading data file {data_file}: {e}", file=sys.stderr)
        sys.exit(1)

    try:
        with open(negative_control_probes_file, 'r') as f:
            negative_control_probes = [line.strip() for line in f if line.strip()]
    except Exception as e:
        print(f"Error reading negative control probes file {negative_control_probes_file}: {e}", file=sys.stderr)
        sys.exit(1)

    # Filter for negative control probes present in the data
    common_negative_controls = [p for p in negative_control_probes if p in df.index]
    if not common_negative_controls:
        print("Warning: No negative control probes found in the data. Cannot perform background correction.", file=sys.stderr)
        sys.exit(1)

    negative_control_data = df.loc[common_negative_controls].values.flatten()

    # Determine background threshold (e.g., mean of positive values from negative controls)
    # Filter out non-positive values if they represent missing data or true zeros that shouldn't contribute to background mean.
    positive_negative_control_data = negative_control_data[negative_control_data > 0]
    if len(positive_negative_control_data) == 0:
        print("Warning: No positive values found in negative controls. Cannot determine background threshold.", file=sys.stderr)
        sys.exit(1)

    background_threshold = np.mean(positive_negative_control_data)

    print(f"Determined background threshold: {background_threshold:.4f}")

    # Set all values less than the threshold to the threshold value.
    # DataFrame.clip is vectorized and avoids applymap, which is deprecated in recent pandas.
    df_corrected = df.clip(lower=background_threshold)

    df_corrected.to_csv(output_file, sep='\t')
    print(f"Processed data saved to {output_file}")

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python background_correction.py <input_data_file> <negative_control_probes_file> <output_file>", file=sys.stderr)
        sys.exit(1)

    input_data_file = sys.argv[1]
    negative_control_probes_file = sys.argv[2]
    output_file = sys.argv[3]

    apply_background_threshold(input_data_file, negative_control_probes_file, output_file)
EOF

# Install dependencies if not already present
# conda install -c anaconda pandas numpy

# Execute the background correction script
python background_correction.py data/input_array_data.tsv data/negative_control_probes.txt data/output_array_data_background_corrected.tsv
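
The three pre-normalization steps described so far (offset subtraction, filtering, flooring) can be chained in one pass. The sketch below is illustrative only: the spot records, the field names `control_type` and `flagged_bad`, and the use of the mean of the negative controls as the threshold are all assumptions, since the source does not say how the threshold was computed. It does show one ordering detail worth checking: negative controls have to survive the filter, because the threshold is derived from them.

```python
from statistics import mean

SCANNER_OFFSET = 40  # value stated in the source text

# Hypothetical spot records; field names and intensities are made up for illustration.
spots = [
    {"probe_id": "GeneA", "control_type": "none", "flagged_bad": False, "cy3": 540.0},
    {"probe_id": "GeneB", "control_type": "none", "flagged_bad": False, "cy3": 55.0},
    {"probe_id": "PosCtrl1", "control_type": "positive", "flagged_bad": False, "cy3": 9040.0},
    {"probe_id": "NegCtrl1", "control_type": "negative", "flagged_bad": False, "cy3": 60.0},
    {"probe_id": "NegCtrl2", "control_type": "negative", "flagged_bad": False, "cy3": 80.0},
    {"probe_id": "GeneC", "control_type": "none", "flagged_bad": True, "cy3": 300.0},
]

# 1. Subtract the scanner offset from every Cy3 intensity (no background correction).
corrected = [{**s, "cy3": s["cy3"] - SCANNER_OFFSET} for s in spots]

# 2. Drop positive controls and spots flagged as bad; negative controls are kept.
kept = [
    s for s in corrected
    if s["control_type"] != "positive" and not s["flagged_bad"]
]

# 3. Derive the background threshold from the negative controls.
#    Assumption: the mean is used here; the source does not specify the statistic.
threshold = mean(s["cy3"] for s in kept if s["control_type"] == "negative")

# 4. Raise every value below the threshold up to the threshold.
floored = [{**s, "cy3": max(s["cy3"], threshold)} for s in kept]

for s in floored:
    print(s["probe_id"], s["cy3"])
```

Whether the negative controls should be removed again before normalization is not stated in the source, so the sketch leaves them in.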

Finally, the data was normalized using the Limma Quantile Normalization package in "R" (Smyth 2004, Bolstad et al., 2003).

limma, version not specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("limma", update = FALSE, ask = FALSE)'

# --- Placeholder for input data ---
# In a real scenario, 'expression_matrix.tsv' would be your actual gene expression data.
# It should be a tab-separated file with gene identifiers in the first column
# and sample expression values in subsequent columns.
# Example:
# Gene    Sample1    Sample2    Sample3
# GeneA   100        120        90
# GeneB   50         60         40
# GeneC   200        210        180

# Create a dummy input file for demonstration purposes if it doesn't exist
if [ ! -f "expression_matrix.tsv" ]; then
    echo -e "Gene\tSample1\tSample2\tSample3" > expression_matrix.tsv
    echo -e "GeneA\t100\t120\t90" >> expression_matrix.tsv
    echo -e "GeneB\t50\t60\t40" >> expression_matrix.tsv
    echo -e "GeneC\t200\t210\t180" >> expression_matrix.tsv
    echo -e "GeneD\t70\t80\t60" >> expression_matrix.tsv
fi

# R script for Limma Quantile Normalization
# This script loads expression data, performs quantile normalization using the limma package,
# and saves the normalized data to a new file.
# Save the R script. A quoted heredoc keeps the R string quotes and "\t" intact,
# which a double-quoted shell variable would not.
cat << 'EOF' > quantile_normalize.R
# Load the limma package
library(limma)

# --- Configuration ---
input_file <- "expression_matrix.tsv" # Path to your input data file
output_file <- "normalized_expression_matrix.tsv" # Path for the output normalized data file

# --- Main script ---

# 1. Load data
# Assumes input is a tab-separated matrix with gene IDs as row names (first column)
# and sample names as column headers.
# Adjust 'header', 'row.names', 'sep' parameters if your file format differs.
data_matrix <- read.delim(input_file, header = TRUE, row.names = 1, sep = "\t")

# Ensure the data matrix is numeric for normalization
data_matrix <- as.matrix(data_matrix)

# 2. Perform Quantile Normalization
# The 'normalizeBetweenArrays' function from limma is used with method = "quantile".
normalized_data <- normalizeBetweenArrays(data_matrix, method = "quantile")

# 3. Save normalized data
# Writes the normalized matrix to a new tab-separated file.
# 'col.names = NA' ensures that the row names (gene IDs) are written as the first column
# with an empty header cell for that column.
write.table(normalized_data, file = output_file, sep = "\t", quote = FALSE, col.names = NA)

message(paste("Quantile normalization complete. Normalized data saved to:", output_file))
EOF

# Execute the R script
Rscript quantile_normalize.R

# Optional: Clean up the dummy input file if it was created for demonstration
# rm expression_matrix.tsv
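
To make the normalization step less of a black box, here is a minimal standard-library sketch of the basic quantile normalization idea (Bolstad et al., 2003). Each array's values are replaced, rank by rank, with the across-array mean of the values at that rank. This is not limma's implementation: it assumes complete data and breaks ties by input order, whereas `normalizeBetweenArrays` has its own handling of ties and missing values. The function name and the toy values (taken from the dummy matrix above) are illustrative.

```python
from statistics import mean


def quantile_normalize(columns):
    """Give every column the same empirical distribution.

    columns: list of equal-length lists of floats, one list per array/sample.
    Returns a new list of lists in the same layout.
    """
    n_rows = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across all arrays.
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_rows)]

    normalized = []
    for col in columns:
        # Row indices ordered by value; sorted() is stable, so ties keep input order.
        order = sorted(range(n_rows), key=col.__getitem__)
        out = [0.0] * n_rows
        for rank, row_index in enumerate(order):
            out[row_index] = reference[rank]
        normalized.append(out)
    return normalized


# Rows are GeneA, GeneB, GeneC, GeneD; one list per sample.
samples = [
    [100.0, 50.0, 200.0, 70.0],  # Sample1
    [120.0, 60.0, 210.0, 80.0],  # Sample2
    [90.0, 40.0, 180.0, 60.0],   # Sample3
]
normalized = quantile_normalize(samples)
for name, col in zip(["Sample1", "Sample2", "Sample3"], normalized):
    print(name, [round(v, 2) for v in col])
```

In this toy input the genes rank identically in every sample, so all three normalized columns come out the same; real arrays differ in rank order, and only the sorted values of each column coincide.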

Raw Source Text

Mean foreground intensities were obtained for each spot and imported into the mathematical software package "R", which is used for all data input, diagnostic plots, normalization and quality checking steps of the analysis process using scripts developed in-house by Peter White specifically for this analysis. In outline, the Cy3 (green) intensities were not background corrected (this has been shown to only introduce noise), and corrected for the scanner offset (40 was subtracted for each intensity). The dataset was filtered to remove positive control elements and any elements that had been flagged as bad. Using the negative controls on the arrays, the background threshold was determined and all values less than this value were set to the threshold value. Finally, the data was normalized using the Limma Quantile Normalization package in "R" (Smyth 2004, Bolstad et al., 2003).