GSE148039 Processing Pipeline
GSE
code_examples
7 steps
Publication
Context-dependent functional compensation between Ythdf m<sup>6</sup>A reader proteins.Genes & development (2020) — PMID 32943573
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
We used Illumina CASAVA 1.8.2 software to generate fastq files.
Illumina CASAVA v1.8.2$ Bash example
# Illumina CASAVA 1.8.2 software was used to generate fastq files from BCL files. # This typically involved using the 'configureBclToFastq.pl' script followed by 'make'. # Replace <path_to_run_folder>, <path_to_output_dir>, <path_to_sample_sheet.csv>, and <num_cores> with actual values. # Example installation (CASAVA was proprietary and typically bundled with Illumina systems): # Modern bcl2fastq (successor to CASAVA's BCL to FASTQ conversion) can be installed via conda: # conda install -c bioconda bcl2fastq2 # Step 1: Configure the conversion using configureBclToFastq.pl configureBclToFastq.pl \ --input-dir <path_to_run_folder>/Data/Intensities/BaseCalls \ --output-dir <path_to_output_dir> \ --sample-sheet <path_to_sample_sheet.csv> # Step 2: Execute the conversion using make (run this command from the <path_to_output_dir> created above) # cd <path_to_output_dir> make -j <num_cores> -
2
Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a âA{10}â -a âT{10}â -A âA{10}â -A âT{10}â âtimes 2 -q 20 -m 25).
$ Bash example
# Install cutadapt using conda # conda install -c bioconda cutadapt # Note: 'ADAPTER1' should be replaced with the actual adapter sequence used. # The presence of both -a (for forward read) and -A (for reverse read) parameters suggests paired-end reads. cutadapt \ -a ADAPTER1 \ -a "A{10}" \ -a "T{10}" \ -A "A{10}" \ -A "T{10}" \ --times 2 \ -q 20 \ -m 25 \ -o trimmed_reads_R1.fastq.gz \ -p trimmed_reads_R2.fastq.gz \ reads_R1.fastq.gz reads_R2.fastq.gz -
3
Reads were mapped to genome mm10 using STAR (Dobin et al.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.10a # Define variables for paths and files GENOME_DIR="/path/to/STAR_index/mm10" # Placeholder for STAR genome index directory GENOME_FASTA="/path/to/mm10.fa" # Placeholder for mm10 reference genome FASTA file GTF_FILE="/path/to/mm10.gtf" # Placeholder for mm10 gene annotation GTF file (recommended for RNA-seq/eCLIP) READS_R1="reads_R1.fastq.gz" # Placeholder for input forward reads file READS_R2="reads_R2.fastq.gz" # Placeholder for input reverse reads file (remove if single-end) OUTPUT_PREFIX="aligned_reads" # Prefix for output files NUM_THREADS=8 # Number of threads to use # 1. Build STAR genome index (if not already built) # This step only needs to be run once per genome version. # If you already have an index, skip this block. # mkdir -p ${GENOME_DIR} # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles ${GENOME_FASTA} \ # --sjdbGTFfile ${GTF_FILE} \ # --sjdbOverhang 100 \ # --runThreadN ${NUM_THREADS} # 2. Map reads to the mm10 genome using STAR STAR --runMode alignReads \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS_R1} ${READS_R2} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX}. \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --limitBAMsortRAM 60000000000 \ --runThreadN ${NUM_THREADS} # Note: For single-end reads, remove ${READS_R2} from --readFilesIn. -
4
2013) v2.4.2a (parameters: âalignEndsType EndToEnd, âoutFilterMismatchNoverLmax 0.05, âtwopassMode Basic).
$ Bash example
# Reference genome: hg38 (placeholder, replace with actual path to STAR index) # Input reads: input_read1.fastq.gz, input_read2.fastq.gz (placeholders, replace with actual files) # Output prefix: aligned_output_ (placeholder, replace as needed) STAR --genomeDir "/path/to/STAR_genome_index/hg38" \ --readFilesIn "input_read1.fastq.gz" "input_read2.fastq.gz" \ --outFileNamePrefix "aligned_output_" \ --alignEndsType EndToEnd \ --outFilterMismatchNoverLmax 0.05 \ --twopassMode Basic \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 8 -
5
Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # --- Reference Data Setup (Example paths, replace with actual paths) --- # Download mm10 genome FASTA (e.g., from UCSC or NCBI) # Example: wget -P /path/to/ref_data/mm10/ ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz # gunzip /path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz # Download mm10 RefSeq GTF (e.g., from NCBI RefSeq or UCSC) # For RefSeq, NCBI is a primary source. Example for GRCm38.p6 (mm10): # wget -P /path/to/ref_data/mm10/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/27_GRCm38.p6/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz # gunzip /path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz # Define paths for genome and GTF GENOME_FASTA="/path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa" GTF_FILE="/path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf" # This GTF contains RefSeq annotations # --- STAR Genome Indexing (Run once per genome/annotation combination) --- # This step is a prerequisite for alignment and quantification. # Replace /path/to/STAR_index/mm10_refseq with your desired index directory. # STAR --runMode genomeGenerate \ # --genomeDir /path/to/STAR_index/mm10_refseq \ # --genomeFastaFiles "${GENOME_FASTA}" \ # --sjdbGTFfile "${GTF_FILE}" \ # --sjdbOverhang 100 \ # --runThreadN 8 # --- Sample Counting using STAR (Alignment and Gene Quantification) --- # Input FASTQ files (replace with actual sample files) READ1="/path/to/sample/sample_R1.fastq.gz" READ2="/path/to/sample/sample_R2.fastq.gz" # Remove if single-end # Output directory for STAR results OUTPUT_DIR="/path/to/output/sample_star_counts" mkdir -p "${OUTPUT_DIR}" # Path to the pre-built STAR genome index STAR_INDEX_DIR="/path/to/STAR_index/mm10_refseq" STAR --runMode alignReads \ --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --quantMode GeneCounts \ --outFileNamePrefix "${OUTPUT_DIR}/sample_" \ --outSAMtype BAM SortedByCoordinate \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbGTFfile "${GTF_FILE}" \ --runThreadN 8 -
6
Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al.
$ Bash example
# Install R and Bioconductor if not already installed # For R: # sudo apt-get update # sudo apt-get install r-base # For Bioconductor and DESeq2: # R -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2")' # Create a placeholder R script for DESeq2 analysis # This script assumes you have a counts matrix (e.g., counts.csv) and a sample information file (e.g., coldata.csv) # counts.csv: rows are genes, columns are samples, values are raw counts # coldata.csv: rows are samples, columns are sample metadata (e.g., condition, batch) cat << 'EOF' > run_deseq2.R library(DESeq2) # --- Configuration --- # # Input files (replace with your actual file paths) counts_file <- "counts.csv" sample_info_file <- "coldata.csv" # Output file results_file <- "deseq2_results.csv" # Design formula (replace 'condition' with your primary variable of interest) design_formula <- ~ condition # --- Load Data --- # # Load counts data # Assuming counts.csv has gene IDs as the first column and sample names as headers counts_data <- read.csv(counts_file, row.names = 1, check.names = FALSE) # Load sample information # Assuming coldata.csv has sample names as the first column and metadata as other columns # Ensure row names match column names of counts_data sample_info <- read.csv(sample_info_file, row.names = 1) # Ensure sample order matches between counts and sample_info sample_info <- sample_info[colnames(counts_data), , drop = FALSE] # --- DESeq2 Analysis --- # # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = counts_data, colData = sample_info, design = design_formula) # Pre-filtering: remove genes with very low counts dds <- dds[rowSums(counts(dds)) >= 10, ] # Run DESeq2 analysis dds <- DESeq(dds) # Get results (e.g., for the 'condition' variable, comparing 'treatment' vs 'control') # Replace 'condition_treatment_vs_control' with your specific contrast if needed res <- results(dds) # Order results by adjusted p-value res <- res[order(res$padj), ] # --- Save Results --- # write.csv(as.data.frame(res), file = results_file) message(paste("DESeq2 analysis complete. Results saved to", results_file)) EOF # Create dummy input files for demonstration # In a real scenario, these would be your actual data files # Dummy counts.csv cat << 'EOF' > counts.csv GeneID,Sample_A1,Sample_A2,Sample_B1,Sample_B2 Gene1,100,120,50,60 Gene2,20,25,80,90 Gene3,5,8,15,12 Gene4,150,160,140,130 Gene5,30,35,32,38 EOF # Dummy coldata.csv cat << 'EOF' > coldata.csv Sample,condition Sample_A1,control Sample_A2,control Sample_B1,treatment Sample_B2,treatment EOF # Execute the R script Rscript run_deseq2.R -
7
2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.
DESeq2 (Inferred with models/gemini-2.5-flash) v1.4.1 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install BiocManager if not already installed # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # Install DESeq2 package if not already installed # BiocManager::install("DESeq2") # Create dummy input files for demonstration # In a real scenario, replace these with your actual count matrix and sample metadata. # Example counts_matrix.csv (replace with your actual data) cat 'gene_id,sample1,sample2,sample3,sample4 geneA,100,120,50,60 geneB,200,210,100,110 geneC,50,60,200,220 geneD,10,12,5,6' > counts_matrix.csv # Example sample_info.csv (replace with your actual data) cat 'sample_id,condition sample1,control sample2,control sample3,treated sample4,treated' > sample_info.csv Rscript -e ' library(DESeq2) # Load count data (replace with your actual path) countData <- read.csv("counts_matrix.csv", row.names = 1) # Load sample information (replace with your actual path) colData <- read.csv("sample_info.csv", row.names = 1) # Ensure sample names in colData match column names in countData colData <- colData[colnames(countData), , drop = FALSE] # Create DESeqDataSet object dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition) # Adjust "condition" to your experimental design variable # Run DESeq2 analysis with specified parameters dds <- DESeq(dds, betaPrior = TRUE, cooksCutoff = FALSE, independentFiltering = FALSE) # Get results res <- results(dds) # Save results to a CSV file write.csv(as.data.frame(res), file = "deseq2_differential_expression_results.csv") message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv") '
Raw Source Text
We used Illumina CASAVA 1.8.2 software to generate fastq files.
Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a âA{10}â -a âT{10}â -A âA{10}â -A âT{10}â âtimes 2 -q 20 -m 25).
Reads were mapped to genome mm10 using STAR (Dobin et al. 2013) v2.4.2a (parameters: âalignEndsType EndToEnd, âoutFilterMismatchNoverLmax 0.05, âtwopassMode Basic).
Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.
Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al. 2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.
Genome_build: mm10
Supplementary_files_format_and_content: DESeq2 output