GSE61948 Processing Pipeline

Publication

A Gene Regulatory Network Cooperatively Controlled by Pdx1 and Sox9 Governs Lineage Allocation of Foregut Progenitor Cells.

Cell Reports (2015), PMID 26440894

Dataset

GSE61948

Transcriptome and cistrome analysis reveals synergistic roles for Sox9 and Pdx1 in lineage allocation of foregut progenitor cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Quality Control.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 (Inferred with models/gemini-2.5-flash)

$ Bash example

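# Quality control for RNA-seq reads. The processing notes for this dataset
# describe only FastQC; the trimming and UMI steps are optional additions and
# not part of the documented pipeline:
# 1. Initial read quality assessment (FastQC)
# 2. Adapter trimming (e.g., Cutadapt), optional
# 3. UMI deduplication (e.g., UMI-tools dedup), only for UMI-tagged libraries, performed after alignment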

# --- Step 1: Initial Read Quality Assessment with FastQC ---
# FastQC provides a general overview of read quality, adapter content, GC content, etc.
# conda install -c bioconda fastqc=0.11.9

# Example: Run FastQC on raw paired-end FASTQ files
# Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your actual input filenames.
# The '-o .' option specifies the output directory (current directory).
fastqc -o . input_R1.fastq.gz input_R2.fastq.gz

# --- Step 2 (optional): Adapter Trimming with Cutadapt ---
# Cutadapt removes sequencing adapters from the reads.
# The processing notes for this dataset do not mention trimming, so treat this
# step and the adapter sequences as assumptions. The sequences below are the
# standard Illumina TruSeq adapters; confirm the library kit before using them.
# conda install -c bioconda cutadapt=3.4

# Define adapter sequences
# ADAPTER_R1: Illumina TruSeq adapter found at the 3' end of read 1
# ADAPTER_R2: Illumina TruSeq adapter found at the 3' end of read 2
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Example: Trim adapters from paired-end FASTQ files
# -a: 3' adapter for R1
# -A: 3' adapter for R2 (lowercase options apply to R1 only)
# -o: Output file for R1
# -p: Output file for R2
# --minimum-length 15: Discard read pairs in which a read is shorter than 15 bp after trimming
# --cores 8: Use 8 CPU cores for parallel processing
# > cutadapt.log 2>&1: Redirect stdout and stderr to a log file
cutadapt \
    -a "${ADAPTER_R1}" \
    -A "${ADAPTER_R2}" \
    -o trimmed_R1.fastq.gz \
    -p trimmed_R2.fastq.gz \
    --minimum-length 15 \
    --cores 8 \
    input_R1.fastq.gz input_R2.fastq.gz \
    > cutadapt.log 2>&1

# --- Step 3 (only for UMI-tagged libraries): UMI Deduplication with UMI-tools dedup ---
# The processing notes for this dataset do not mention UMIs (Unique Molecular
# Identifiers); skip this step unless the library was prepared with them.
# Deduplication is performed *after* alignment to the genome, as it operates on
# a coordinate-sorted, indexed BAM file. This example assumes the UMI sequence
# is already stored in the UR tag of each read.
# conda install -c bioconda umi_tools=1.1.2

# Example: Deduplicate reads in an aligned BAM file
# --extract-umi-method=tag: UMIs are already in a tag in the BAM file
# --umi-tag=UR: Tag containing the UMI sequence
# --method=unique: Reads are duplicates only if position and UMI match exactly
# --paired: Treat the input as paired-end (drop for single-end data)
# --output-stats=dedup_stats: Filename prefix for the deduplication statistics files
# -I: Input aligned BAM file
# -S: Output deduplicated BAM file
samtools index aligned_reads.bam
umi_tools dedup \
    --extract-umi-method=tag \
    --umi-tag=UR \
    --method=unique \
    --paired \
    --output-stats=dedup_stats \
    -I aligned_reads.bam \
    -S deduplicated_reads.bam

Quality of sequencing data is analyzed using the software FastQC v0.10.1.

FastQC v0.10.1

$ Bash example

# Install FastQC if not already installed
# conda install -c bioconda fastqc

# Example usage of FastQC
# Replace 'input.fastq.gz' with your actual input sequencing data file(s)
# Replace 'output_dir' with your desired output directory for reports
# FastQC does not create the output directory, so make it first
mkdir -p output_dir
fastqc input.fastq.gz -o output_dir

The results are examined to determine if samples are of questionable quality on an array of metrics.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9

$ Bash example

# Install FastQC if not already installed
# conda install -c bioconda fastqc

# Create an output directory for FastQC reports
mkdir -p fastqc_reports

# Run FastQC on input FASTQ files
# Replace sample_R1.fastq.gz and sample_R2.fastq.gz with actual input files
# The -o flag specifies the output directory
fastqc -o fastqc_reports sample_R1.fastq.gz sample_R2.fastq.gz
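
Examining the results across many samples can be partly scripted. The sketch below assumes the reports were unzipped (for example by running FastQC with --extract), so that each sample has a summary.txt file holding tab-separated status, module, and filename columns. The function name flag_fastqc_failures is hypothetical, and which modules matter for a given library is a judgment call the script does not make.

```shell
# Hypothetical helper: list FastQC modules that did not PASS.
# Assumes unzipped reports, i.e. <report_dir>/<sample>_fastqc/summary.txt,
# where each line is: status <TAB> module name <TAB> input filename.
flag_fastqc_failures() {
    report_dir="$1"
    find "${report_dir}" -name summary.txt -print | sort | while IFS= read -r summary; do
        # Print filename, status, and module for every WARN or FAIL line
        awk -F '\t' '$1 == "FAIL" || $1 == "WARN" { print $3 "\t" $1 "\t" $2 }' "${summary}"
    done
}

# Usage (directory name taken from the example above):
# flag_fastqc_failures fastqc_reports
```

An empty result means every module passed for every sample found under the directory.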

Mapping.

STAR (Inferred with models/gemini-2.5-flash) v2.7.9a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for STAR genome index for human (hg19, the build listed for this dataset)
GENOME_DIR="/path/to/STAR_genome_index/hg19"
INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for input FASTQ file
OUTPUT_PREFIX="mapped_reads_" # Prefix for output files
NUM_THREADS=8 # Example number of threads

# Run STAR alignment for RNA-seq reads. Alignment parameters are left at their
# defaults, as the dataset's processing notes describe, so spliced alignment
# stays enabled.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --runThreadN "${NUM_THREADS}"

Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e.

STAR v2.3.0e

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Prepare STAR genome index (run once per reference genome) ---
# This step needs to be performed before alignment. Replace paths and genome details as appropriate.
# Example for human hg19 (the build listed for this dataset):
# STAR --runThreadN 8 \
#      --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_genome_index/hg19 \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --sjdbOverhang 100 # Recommended for typical read lengths

# --- Alignment step ---
# Define variables
GENOME_DIR="/path/to/STAR_genome_index/hg19" # Placeholder for human hg19 genome index
READ_FILES="sample_R1.fastq.gz sample_R2.fastq.gz" # Placeholder for input FASTQ file(s) (e.g., paired-end)
OUTPUT_PREFIX="sample_aligned" # Prefix for all output files
THREADS=8 # Example number of threads

# Run STAR alignment
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ_FILES} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat \
     --outSAMattributes Standard \
     --quantMode GeneCounts # Optional: for gene quantification and outputting ReadsPerGene.out.tab

Parameters are set to default and reads are mapped to references along with splice junction databases.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Placeholder for Reference Data ---
# Genome FASTA: hg19 (GRCh37), the build listed for this dataset, from UCSC or Ensembl
# GTF Annotation: an annotation release built for hg19/GRCh37 (e.g., GENCODE v19)

# --- Step 1: Build STAR Genome Index (Run once per reference genome) ---
# This step uses the reference genome FASTA and a GTF file for known splice junctions.
# Replace the placeholder paths with your actual genome FASTA, GTF annotation, and index directory.
# mkdir -p /path/to/STAR_index/hg19
# STAR \
#   --runMode genomeGenerate \
#   --genomeDir /path/to/STAR_index/hg19 \
#   --genomeFastaFiles /path/to/hg19.fa \
#   --sjdbGTFfile /path/to/hg19_annotation.gtf \
#   --runThreadN 8 # Adjust number of threads as needed

# --- Step 2: Align Reads to Reference Genome ---
# This command maps paired-end RNA-seq reads to the pre-built STAR genome index.
# Replace the input files, index path, output directory, and output prefix below with your own.

# Example input files (replace with your actual data)
READ1="input_reads_R1.fastq.gz"
READ2="input_reads_R2.fastq.gz"

# Example reference index (replace with your actual index path)
STAR_INDEX="/path/to/STAR_index/hg19"

# Output directory and prefix
OUTPUT_DIR="aligned_output"
OUTPUT_PREFIX="sample_aligned"

mkdir -p "${OUTPUT_DIR}"

# Options used below:
#   --runThreadN 8: adjust the number of threads as needed
#   --readFilesCommand zcat: decompress gzipped FASTQ files on the fly
#   --outSAMtype BAM SortedByCoordinate: write a coordinate-sorted BAM file
#   --outBAMcompression 6: compression level for the BAM file
#   --quantMode GeneCounts: count reads per gene (optional, but common)
#   --twopassMode Basic: two-pass alignment for novel splice junction discovery
# The filtering and alignment values from --outFilterMismatchNmax onward follow
# the ENCODE settings listed in the STAR manual, not STAR's built-in defaults.
# Omit them to run with default parameters, as the dataset's notes describe.
STAR \
  --runThreadN 8 \
  --genomeDir "${STAR_INDEX}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outBAMcompression 6 \
  --quantMode GeneCounts \
  --twopassMode Basic \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000
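
After mapping, a quick check of the alignment rate helps confirm that the reference and parameters suit the data. STAR writes a Log.final.out file next to the BAM, with each statistic on a line of the form "label |<TAB>value". The sketch below pulls out the uniquely mapped percentage; the function name star_unique_rate is hypothetical, and what counts as an acceptable rate depends on the library and is not stated in the dataset's notes.

```shell
# Hypothetical helper: print the uniquely mapped read percentage (without the
# percent sign) from a STAR Log.final.out file.
star_unique_rate() {
    log_file="$1"
    awk -F '|' '/Uniquely mapped reads %/ {
        gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
        print $2
    }' "${log_file}"
}

# Usage (path follows the output prefix used in the example above):
# star_unique_rate aligned_output/sample_aligned_Log.final.out
```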

Gene Expression Quantification.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Salmon (if not already installed)
# conda install -c bioconda salmon

# The dataset's notes name Sailfish 0.6.3; Salmon is shown here as its maintained successor.

# Placeholder for reference transcriptome and index
# Replace 'hg19_transcriptome.fa' with your actual reference transcriptome FASTA file
# (hg19/GRCh37 is the build listed for this dataset).
# Replace 'hg19_salmon_index' with your desired index directory name.
# Example command to build index (uncomment and modify if needed):
# salmon index -t hg19_transcriptome.fa -i hg19_salmon_index

# Placeholder for input RNA-seq reads (paired-end assumed)
# Replace 'reads_1.fastq.gz' and 'reads_2.fastq.gz' with your actual input FASTQ files.
# Replace 'salmon_quant_output' with your desired output directory name.

# Run Salmon quantification
# -l A: detect the library type automatically
# --gcBias / --seqBias: correct for GC-content and sequence-specific biases
salmon quant \
  -i hg19_salmon_index \
  -l A \
  -1 reads_1.fastq.gz \
  -2 reads_2.fastq.gz \
  -o salmon_quant_output \
  --validateMappings \
  --gcBias \
  --seqBias

To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0).

Cufflinks v2.2.0

$ Bash example

# Install Cufflinks if not already installed
# conda install -c bioconda cufflinks=2.2.0

# Define input and output files/directories
# Replace 'reads.bam' with your actual aligned RNA-seq BAM file
# Replace 'genes.gtf' with your actual reference genome annotation GTF file (e.g., from Ensembl, GENCODE)
INPUT_BAM="reads.bam"
REFERENCE_GTF="genes.gtf"
OUTPUT_DIR="cufflinks_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run Cufflinks to quantify gene expression
# -o: Output directory
# -G: Reference annotation GTF file
cufflinks -o "${OUTPUT_DIR}" -G "${REFERENCE_GTF}" "${INPUT_BAM}"

Expression values are calculated for entries in the gene annotation references.

Salmon (Inferred with models/gemini-2.5-flash) v1.10.2

$ Bash example

# Install Salmon (example using Conda)
# conda install -c bioconda salmon

# Placeholder for Salmon index directory.
# This index is built from the 'gene annotation references' (e.g., a transcriptome FASTA file matching the hg19/GRCh37 build listed for this dataset).
# Example command to build index (run once per reference):
# salmon index -t "transcriptome_hg19.fa.gz" -i "salmon_index_hg19"
SALMON_INDEX_DIR="salmon_index_hg19"

# Placeholder for input FASTQ files (paired-end reads)
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"

# Placeholder for output directory where quantification results will be stored
OUTPUT_DIR="salmon_quant_results"

# Calculate expression values using Salmon
salmon quant -i "${SALMON_INDEX_DIR}" \
             -l A \
             -1 "${READS_R1}" \
             -2 "${READS_R2}" \
             -p 8 \
             --validateMappings \
             -o "${OUTPUT_DIR}"
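
The supplementary files deposited with this dataset report RPKMs per Ensembl identifier. As a reminder of what that unit means, the sketch below computes RPKM as read_count * 1e9 / (gene_length_bp * total_mapped_reads). The three-column input layout and the function name counts_to_rpkm are assumptions for illustration, and the sum of the counts in the table stands in for the total number of mapped reads. This is not the authors' procedure, which relied on Sailfish and Cufflinks.

```shell
# Hypothetical helper: compute RPKM from a tab-separated table with columns
# gene_id, gene_length_bp, read_count (this layout is an assumption).
# RPKM = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
counts_to_rpkm() {
    counts_file="$1"
    awk -F '\t' '
        # First pass over the file: sum all read counts
        NR == FNR { total += $3; next }
        # Second pass: print gene_id and RPKM, guarding against division by zero
        {
            rpkm = ($2 > 0 && total > 0) ? ($3 * 1e9) / ($2 * total) : 0
            printf "%s\t%.2f\n", $1, rpkm
        }
    ' "${counts_file}" "${counts_file}"
}

# Usage (file name is a placeholder):
# counts_to_rpkm gene_counts.tsv > gene_rpkm.tsv
```

The file is read twice so that the total is known before any value is printed, which keeps the helper free of temporary files.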

Tools Used

STAR Cufflinks

Raw Source Text

Quality Control. Quality of sequencing data is analyzed using the software FastQC v0.10.1. The results are examined to determine if samples are of questionable quality on an array of metrics.
Mapping. Alignment of sequencing data to reference genomes is performed with the software RNA-Star 2.3.0e. Parameters are set to default and reads are mapped to references along with splice junction databases.
Gene Expression Quantification. To obtain gene expression values, several quantification methods are used (Sailfish 0.6.3, Cufflinks 2.2.0). Expression values are calculated for entries in the gene annotation references.
Genome_build: hg19
Supplementary_files_format_and_content: Txt file of ensemble identifiers with RPKMs of samples used in study
