GSE148039 Processing Pipeline

GSE code_examples 7 steps

Publication

Context-dependent functional compensation between Ythdf m<sup>6</sup>A reader proteins.

Genes & development (2020) — PMID 32943573

Dataset

Ythdf m6A readers compensate each other in a context dependent manner

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

We used Illumina CASAVA 1.8.2 software to generate fastq files.

Illumina CASAVA v1.8.2

$ Bash example

# Illumina CASAVA 1.8.2 software was used to generate fastq files from BCL files.
# This typically involved using the 'configureBclToFastq.pl' script followed by 'make'.

# Replace <path_to_run_folder>, <path_to_output_dir>, <path_to_sample_sheet.csv>, and <num_cores> with actual values.

# Example installation (CASAVA was proprietary and typically bundled with Illumina systems):
# Modern bcl2fastq (successor to CASAVA's BCL to FASTQ conversion) can be installed via conda:
# conda install -c bioconda bcl2fastq2

# Step 1: Configure the conversion using configureBclToFastq.pl
configureBclToFastq.pl \
    --input-dir <path_to_run_folder>/Data/Intensities/BaseCalls \
    --output-dir <path_to_output_dir> \
    --sample-sheet <path_to_sample_sheet.csv>

# Step 2: Execute the conversion using make (run this command from the <path_to_output_dir> created above)
# cd <path_to_output_dir>
make -j <num_cores>

Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a âA{10}â -a âT{10}â -A âA{10}â -A âT{10}â âtimes 2 -q 20 -m 25).

cutadapt v1.0 GitHub

$ Bash example

# Install cutadapt using conda
# conda install -c bioconda cutadapt

# Note: 'ADAPTER1' should be replaced with the actual adapter sequence used.
# The presence of both -a (for forward read) and -A (for reverse read) parameters suggests paired-end reads.
cutadapt \
  -a ADAPTER1 \
  -a "A{10}" \
  -a "T{10}" \
  -A "A{10}" \
  -A "T{10}" \
  --times 2 \
  -q 20 \
  -m 25 \
  -o trimmed_reads_R1.fastq.gz \
  -p trimmed_reads_R2.fastq.gz \
  reads_R1.fastq.gz reads_R2.fastq.gz

View on GitHub

Reads were mapped to genome mm10 using STAR (Dobin et al.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.10a

# Define variables for paths and files
GENOME_DIR="/path/to/STAR_index/mm10" # Placeholder for STAR genome index directory
GENOME_FASTA="/path/to/mm10.fa"       # Placeholder for mm10 reference genome FASTA file
GTF_FILE="/path/to/mm10.gtf"           # Placeholder for mm10 gene annotation GTF file (recommended for RNA-seq/eCLIP)

READS_R1="reads_R1.fastq.gz"         # Placeholder for input forward reads file
READS_R2="reads_R2.fastq.gz"         # Placeholder for input reverse reads file (remove if single-end)
OUTPUT_PREFIX="aligned_reads"        # Prefix for output files
NUM_THREADS=8                        # Number of threads to use

# 1. Build STAR genome index (if not already built)
# This step only needs to be run once per genome version.
# If you already have an index, skip this block.
# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 \
#      --runThreadN ${NUM_THREADS}

# 2. Map reads to the mm10 genome using STAR
STAR --runMode alignReads \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --limitBAMsortRAM 60000000000 \
     --runThreadN ${NUM_THREADS}

# Note: For single-end reads, remove ${READS_R2} from --readFilesIn.

View on GitHub

2013) v2.4.2a (parameters: âalignEndsType EndToEnd, âoutFilterMismatchNoverLmax 0.05, âtwopassMode Basic).

STAR (Inferred with models/gemini-2.5-flash) v2.4.2a GitHub

$ Bash example

# Reference genome: hg38 (placeholder, replace with actual path to STAR index)
# Input reads: input_read1.fastq.gz, input_read2.fastq.gz (placeholders, replace with actual files)
# Output prefix: aligned_output_ (placeholder, replace as needed)

STAR --genomeDir "/path/to/STAR_genome_index/hg38" \
     --readFilesIn "input_read1.fastq.gz" "input_read2.fastq.gz" \
     --outFileNamePrefix "aligned_output_" \
     --alignEndsType EndToEnd \
     --outFilterMismatchNoverLmax 0.05 \
     --twopassMode Basic \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8

View on GitHub

Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# --- Reference Data Setup (Example paths, replace with actual paths) ---
# Download mm10 genome FASTA (e.g., from UCSC or NCBI)
# Example: wget -P /path/to/ref_data/mm10/ ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
# gunzip /path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Download mm10 RefSeq GTF (e.g., from NCBI RefSeq or UCSC)
# For RefSeq, NCBI is a primary source. Example for GRCm38.p6 (mm10):
# wget -P /path/to/ref_data/mm10/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/27_GRCm38.p6/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz
# gunzip /path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz

# Define paths for genome and GTF
GENOME_FASTA="/path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa"
GTF_FILE="/path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf" # This GTF contains RefSeq annotations

# --- STAR Genome Indexing (Run once per genome/annotation combination) ---
# This step is a prerequisite for alignment and quantification.
# Replace /path/to/STAR_index/mm10_refseq with your desired index directory.
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/mm10_refseq \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 \
#      --runThreadN 8

# --- Sample Counting using STAR (Alignment and Gene Quantification) ---
# Input FASTQ files (replace with actual sample files)
READ1="/path/to/sample/sample_R1.fastq.gz"
READ2="/path/to/sample/sample_R2.fastq.gz" # Remove if single-end

# Output directory for STAR results
OUTPUT_DIR="/path/to/output/sample_star_counts"
mkdir -p "${OUTPUT_DIR}"

# Path to the pre-built STAR genome index
STAR_INDEX_DIR="/path/to/STAR_index/mm10_refseq"

STAR --runMode alignReads \
     --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --quantMode GeneCounts \
     --outFileNamePrefix "${OUTPUT_DIR}/sample_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbGTFfile "${GTF_FILE}" \
     --runThreadN 8

View on GitHub

Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al.

DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# For R:
# sudo apt-get update
# sudo apt-get install r-base

# For Bioconductor and DESeq2:
# R -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create a placeholder R script for DESeq2 analysis
# This script assumes you have a counts matrix (e.g., counts.csv) and a sample information file (e.g., coldata.csv)
# counts.csv: rows are genes, columns are samples, values are raw counts
# coldata.csv: rows are samples, columns are sample metadata (e.g., condition, batch)

cat << 'EOF' > run_deseq2.R
library(DESeq2)

# --- Configuration --- #
# Input files (replace with your actual file paths)
counts_file <- "counts.csv"
sample_info_file <- "coldata.csv"

# Output file
results_file <- "deseq2_results.csv"

# Design formula (replace 'condition' with your primary variable of interest)
design_formula <- ~ condition

# --- Load Data --- #
# Load counts data
# Assuming counts.csv has gene IDs as the first column and sample names as headers
counts_data <- read.csv(counts_file, row.names = 1, check.names = FALSE)

# Load sample information
# Assuming coldata.csv has sample names as the first column and metadata as other columns
# Ensure row names match column names of counts_data
sample_info <- read.csv(sample_info_file, row.names = 1)

# Ensure sample order matches between counts and sample_info
sample_info <- sample_info[colnames(counts_data), , drop = FALSE]

# --- DESeq2 Analysis --- #
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                              colData = sample_info,
                              design = design_formula)

# Pre-filtering: remove genes with very low counts
dds <- dds[rowSums(counts(dds)) >= 10, ]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results (e.g., for the 'condition' variable, comparing 'treatment' vs 'control')
# Replace 'condition_treatment_vs_control' with your specific contrast if needed
res <- results(dds)

# Order results by adjusted p-value
res <- res[order(res$padj), ]

# --- Save Results --- #
write.csv(as.data.frame(res), file = results_file)

message(paste("DESeq2 analysis complete. Results saved to", results_file))
EOF

# Create dummy input files for demonstration
# In a real scenario, these would be your actual data files

# Dummy counts.csv
cat << 'EOF' > counts.csv
GeneID,Sample_A1,Sample_A2,Sample_B1,Sample_B2
Gene1,100,120,50,60
Gene2,20,25,80,90
Gene3,5,8,15,12
Gene4,150,160,140,130
Gene5,30,35,32,38
EOF

# Dummy coldata.csv
cat << 'EOF' > coldata.csv
Sample,condition
Sample_A1,control
Sample_A2,control
Sample_B1,treatment
Sample_B2,treatment
EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.4.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install BiocManager if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# Install DESeq2 package if not already installed
# BiocManager::install("DESeq2")

# Create dummy input files for demonstration
# In a real scenario, replace these with your actual count matrix and sample metadata.
# Example counts_matrix.csv (replace with your actual data)
cat 'gene_id,sample1,sample2,sample3,sample4
geneA,100,120,50,60
geneB,200,210,100,110
geneC,50,60,200,220
geneD,10,12,5,6' > counts_matrix.csv

# Example sample_info.csv (replace with your actual data)
cat 'sample_id,condition
sample1,control
sample2,control
sample3,treated
sample4,treated' > sample_info.csv

Rscript -e '
library(DESeq2)

# Load count data (replace with your actual path)
countData <- read.csv("counts_matrix.csv", row.names = 1)

# Load sample information (replace with your actual path)
colData <- read.csv("sample_info.csv", row.names = 1)

# Ensure sample names in colData match column names in countData
colData <- colData[colnames(countData), , drop = FALSE]

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = countData,
                              colData = colData,
                              design = ~ condition) # Adjust "condition" to your experimental design variable

# Run DESeq2 analysis with specified parameters
dds <- DESeq(dds,
             betaPrior = TRUE,
             cooksCutoff = FALSE,
             independentFiltering = FALSE)

# Get results
res <- results(dds)

# Save results to a CSV file
write.csv(as.data.frame(res), file = "deseq2_differential_expression_results.csv")

message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv")
'

View on GitHub

Tools Used

STAR DESeq2

Raw Source Text

We used Illumina CASAVA 1.8.2 software to generate fastq files.
Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a âA{10}â -a âT{10}â -A âA{10}â -A âT{10}â âtimes 2 -q 20 -m 25).
Reads were mapped to genome mm10 using STAR (Dobin et al. 2013) v2.4.2a (parameters: âalignEndsType EndToEnd, âoutFilterMismatchNoverLmax 0.05, âtwopassMode Basic).
Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.
Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al. 2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.
Genome_build: mm10
Supplementary_files_format_and_content: DESeq2 output

← Back to Analysis