GSE148039 Processing Pipeline

GSE code_examples 7 steps

Publication

Context-dependent functional compensation between Ythdf m<sup>6</sup>A reader proteins.

Genes & development (2020) — PMID 32943573

Dataset

GSE148039

Ythdf m6A readers compensate each other in a context dependent manner

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    We used Illumina CASAVA 1.8.2 software to generate fastq files.

    Illumina CASAVA v1.8.2
    $ Bash example
    # Illumina CASAVA 1.8.2 software was used to generate fastq files from BCL files.
    # This typically involved using the 'configureBclToFastq.pl' script followed by 'make'.
    
    # Replace <path_to_run_folder>, <path_to_output_dir>, <path_to_sample_sheet.csv>, and <num_cores> with actual values.
    
    # Example installation (CASAVA was proprietary and typically bundled with Illumina systems):
    # Modern bcl2fastq (successor to CASAVA's BCL to FASTQ conversion) can be installed via conda:
    # conda install -c bioconda bcl2fastq2
    
    # Step 1: Configure the conversion using configureBclToFastq.pl
    configureBclToFastq.pl \
        --input-dir <path_to_run_folder>/Data/Intensities/BaseCalls \
        --output-dir <path_to_output_dir> \
        --sample-sheet <path_to_sample_sheet.csv>
    
    # Step 2: Execute the conversion using make (run this command from the <path_to_output_dir> created above)
    # cd <path_to_output_dir>
    make -j <num_cores>
  2. 2

    Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a “A{10}” -a “T{10}” -A “A{10}” -A “T{10}” –times 2 -q 20 -m 25).

    cutadapt v1.0 GitHub
    $ Bash example
    # Install cutadapt using conda
    # conda install -c bioconda cutadapt
    
    # Note: 'ADAPTER1' should be replaced with the actual adapter sequence used.
    # The presence of both -a (for forward read) and -A (for reverse read) parameters suggests paired-end reads.
    cutadapt \
      -a ADAPTER1 \
      -a "A{10}" \
      -a "T{10}" \
      -A "A{10}" \
      -A "T{10}" \
      --times 2 \
      -q 20 \
      -m 25 \
      -o trimmed_reads_R1.fastq.gz \
      -p trimmed_reads_R2.fastq.gz \
      reads_R1.fastq.gz reads_R2.fastq.gz
  3. 3

    Reads were mapped to genome mm10 using STAR (Dobin et al.

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.10a
    
    # Define variables for paths and files
    GENOME_DIR="/path/to/STAR_index/mm10" # Placeholder for STAR genome index directory
    GENOME_FASTA="/path/to/mm10.fa"       # Placeholder for mm10 reference genome FASTA file
    GTF_FILE="/path/to/mm10.gtf"           # Placeholder for mm10 gene annotation GTF file (recommended for RNA-seq/eCLIP)
    
    READS_R1="reads_R1.fastq.gz"         # Placeholder for input forward reads file
    READS_R2="reads_R2.fastq.gz"         # Placeholder for input reverse reads file (remove if single-end)
    OUTPUT_PREFIX="aligned_reads"        # Prefix for output files
    NUM_THREADS=8                        # Number of threads to use
    
    # 1. Build STAR genome index (if not already built)
    # This step only needs to be run once per genome version.
    # If you already have an index, skip this block.
    # mkdir -p ${GENOME_DIR}
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles ${GENOME_FASTA} \
    #      --sjdbGTFfile ${GTF_FILE} \
    #      --sjdbOverhang 100 \
    #      --runThreadN ${NUM_THREADS}
    
    # 2. Map reads to the mm10 genome using STAR
    STAR --runMode alignReads \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READS_R1} ${READS_R2} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --alignSJDBoverhangMin 1 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.1 \
         --limitBAMsortRAM 60000000000 \
         --runThreadN ${NUM_THREADS}
    
    # Note: For single-end reads, remove ${READS_R2} from --readFilesIn.
  4. 4

    2013) v2.4.2a (parameters: –alignEndsType EndToEnd, –outFilterMismatchNoverLmax 0.05, –twopassMode Basic).

    STAR (Inferred with models/gemini-2.5-flash) v2.4.2a GitHub
    $ Bash example
    # Reference genome: hg38 (placeholder, replace with actual path to STAR index)
    # Input reads: input_read1.fastq.gz, input_read2.fastq.gz (placeholders, replace with actual files)
    # Output prefix: aligned_output_ (placeholder, replace as needed)
    
    STAR --genomeDir "/path/to/STAR_genome_index/hg38" \
         --readFilesIn "input_read1.fastq.gz" "input_read2.fastq.gz" \
         --outFileNamePrefix "aligned_output_" \
         --alignEndsType EndToEnd \
         --outFilterMismatchNoverLmax 0.05 \
         --twopassMode Basic \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8
  5. 5

    Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # --- Reference Data Setup (Example paths, replace with actual paths) ---
    # Download mm10 genome FASTA (e.g., from UCSC or NCBI)
    # Example: wget -P /path/to/ref_data/mm10/ ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
    # gunzip /path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
    
    # Download mm10 RefSeq GTF (e.g., from NCBI RefSeq or UCSC)
    # For RefSeq, NCBI is a primary source. Example for GRCm38.p6 (mm10):
    # wget -P /path/to/ref_data/mm10/ ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/27_GRCm38.p6/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz
    # gunzip /path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf.gz
    
    # Define paths for genome and GTF
    GENOME_FASTA="/path/to/ref_data/mm10/Mus_musculus.GRCm38.dna.primary_assembly.fa"
    GTF_FILE="/path/to/ref_data/mm10/GCF_000001635.27_GRCm38.p6_genomic.gtf" # This GTF contains RefSeq annotations
    
    # --- STAR Genome Indexing (Run once per genome/annotation combination) ---
    # This step is a prerequisite for alignment and quantification.
    # Replace /path/to/STAR_index/mm10_refseq with your desired index directory.
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_index/mm10_refseq \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_FILE}" \
    #      --sjdbOverhang 100 \
    #      --runThreadN 8
    
    # --- Sample Counting using STAR (Alignment and Gene Quantification) ---
    # Input FASTQ files (replace with actual sample files)
    READ1="/path/to/sample/sample_R1.fastq.gz"
    READ2="/path/to/sample/sample_R2.fastq.gz" # Remove if single-end
    
    # Output directory for STAR results
    OUTPUT_DIR="/path/to/output/sample_star_counts"
    mkdir -p "${OUTPUT_DIR}"
    
    # Path to the pre-built STAR genome index
    STAR_INDEX_DIR="/path/to/STAR_index/mm10_refseq"
    
    STAR --runMode alignReads \
         --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --readFilesCommand zcat \
         --quantMode GeneCounts \
         --outFileNamePrefix "${OUTPUT_DIR}/sample_" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterType BySJout \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJoverhangMin 8 \
         --alignSJDBoverhangMin 1 \
         --sjdbGTFfile "${GTF_FILE}" \
         --runThreadN 8
  6. 6

    Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al.

    DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor if not already installed
    # For R:
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # For Bioconductor and DESeq2:
    # R -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create a placeholder R script for DESeq2 analysis
    # This script assumes you have a counts matrix (e.g., counts.csv) and a sample information file (e.g., coldata.csv)
    # counts.csv: rows are genes, columns are samples, values are raw counts
    # coldata.csv: rows are samples, columns are sample metadata (e.g., condition, batch)
    
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # --- Configuration --- #
    # Input files (replace with your actual file paths)
    counts_file <- "counts.csv"
    sample_info_file <- "coldata.csv"
    
    # Output file
    results_file <- "deseq2_results.csv"
    
    # Design formula (replace 'condition' with your primary variable of interest)
    design_formula <- ~ condition
    
    # --- Load Data --- #
    # Load counts data
    # Assuming counts.csv has gene IDs as the first column and sample names as headers
    counts_data <- read.csv(counts_file, row.names = 1, check.names = FALSE)
    
    # Load sample information
    # Assuming coldata.csv has sample names as the first column and metadata as other columns
    # Ensure row names match column names of counts_data
    sample_info <- read.csv(sample_info_file, row.names = 1)
    
    # Ensure sample order matches between counts and sample_info
    sample_info <- sample_info[colnames(counts_data), , drop = FALSE]
    
    # --- DESeq2 Analysis --- #
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = sample_info,
                                  design = design_formula)
    
    # Pre-filtering: remove genes with very low counts
    dds <- dds[rowSums(counts(dds)) >= 10, ]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results (e.g., for the 'condition' variable, comparing 'treatment' vs 'control')
    # Replace 'condition_treatment_vs_control' with your specific contrast if needed
    res <- results(dds)
    
    # Order results by adjusted p-value
    res <- res[order(res$padj), ]
    
    # --- Save Results --- #
    write.csv(as.data.frame(res), file = results_file)
    
    message(paste("DESeq2 analysis complete. Results saved to", results_file))
    EOF
    
    # Create dummy input files for demonstration
    # In a real scenario, these would be your actual data files
    
    # Dummy counts.csv
    cat << 'EOF' > counts.csv
    GeneID,Sample_A1,Sample_A2,Sample_B1,Sample_B2
    Gene1,100,120,50,60
    Gene2,20,25,80,90
    Gene3,5,8,15,12
    Gene4,150,160,140,130
    Gene5,30,35,32,38
    EOF
    
    # Dummy coldata.csv
    cat << 'EOF' > coldata.csv
    Sample,condition
    Sample_A1,control
    Sample_A2,control
    Sample_B1,treatment
    Sample_B2,treatment
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
    
  7. 7

    2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.

    DESeq2 (Inferred with models/gemini-2.5-flash) v1.4.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install BiocManager if not already installed
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    
    # Install DESeq2 package if not already installed
    # BiocManager::install("DESeq2")
    
    # Create dummy input files for demonstration
    # In a real scenario, replace these with your actual count matrix and sample metadata.
    # Example counts_matrix.csv (replace with your actual data)
    cat 'gene_id,sample1,sample2,sample3,sample4
    geneA,100,120,50,60
    geneB,200,210,100,110
    geneC,50,60,200,220
    geneD,10,12,5,6' > counts_matrix.csv
    
    # Example sample_info.csv (replace with your actual data)
    cat 'sample_id,condition
    sample1,control
    sample2,control
    sample3,treated
    sample4,treated' > sample_info.csv
    
    Rscript -e '
    library(DESeq2)
    
    # Load count data (replace with your actual path)
    countData <- read.csv("counts_matrix.csv", row.names = 1)
    
    # Load sample information (replace with your actual path)
    colData <- read.csv("sample_info.csv", row.names = 1)
    
    # Ensure sample names in colData match column names in countData
    colData <- colData[colnames(countData), , drop = FALSE]
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = countData,
                                  colData = colData,
                                  design = ~ condition) # Adjust "condition" to your experimental design variable
    
    # Run DESeq2 analysis with specified parameters
    dds <- DESeq(dds,
                 betaPrior = TRUE,
                 cooksCutoff = FALSE,
                 independentFiltering = FALSE)
    
    # Get results
    res <- results(dds)
    
    # Save results to a CSV file
    write.csv(as.data.frame(res), file = "deseq2_differential_expression_results.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv")
    '
    

Tools Used

Raw Source Text
We used Illumina CASAVA 1.8.2 software to generate fastq files.
Reads were trimmed using cutadapt (Martin 2011) (parameters: -a ADAPTER1 -a “A{10}” -a “T{10}” -A “A{10}” -A “T{10}” –times 2 -q 20 -m 25).
Reads were mapped to genome mm10 using STAR (Dobin et al. 2013) v2.4.2a (parameters: –alignEndsType EndToEnd, –outFilterMismatchNoverLmax 0.05, –twopassMode Basic).
Sample counting was done using STAR, quantifying mm10 RefSeq annotated genes.
Normalization of the counts and differential expression analysis was performed using DESeq2 (Love et al. 2014) with the parameters: betaPrior=True, cooksCutoff=FALSE, independentFiltering=FALSE.
Genome_build: mm10
Supplementary_files_format_and_content: DESeq2 output
← Back to Analysis