GSE124072 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

GSE124072

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Illumina Casava1.7 software used for basecalling.

    Illumina Casava v1.7
    $ Bash example
    # Illumina Casava 1.7 is proprietary software used for basecalling on Illumina sequencers.
    # It converts raw intensity data (.bcl files) into base calls and quality scores, typically generating FASTQ files.
    # This process is usually integrated into the sequencing instrument's workflow and does not involve a user-executable command-line tool in the same way as open-source bioinformatics tools.
    # The output of this step would be FASTQ files.
  2. 2

    Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    # Path to the STAR genome index for hg19. This index needs to be pre-built using STAR's --runMode genomeGenerate.
    # Reference genome hg19 is specified in the description.
    STAR_INDEX_DIR="/path/to/STAR_index/hg19"
    
    # Input FASTQ files (assuming paired-end reads that have already been trimmed for adaptors and masked for low-complexity/quality, as per description)
    # Replace with actual input file names
    READ1="input_reads_R1.fastq.gz"
    READ2="input_reads_R2.fastq.gz"
    
    # Output directory and prefix
    OUTPUT_DIR="star_alignment_output"
    OUTPUT_PREFIX="${OUTPUT_DIR}/aligned_reads"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # STAR alignment command for RNA-based assays (e.g., eCLIP)
    # Parameters are common for sensitive alignment of RNA-seq data, including splice junction detection.
    STAR \
      --genomeDir "${STAR_INDEX_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --runThreadN 16 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes Standard \
      --outFilterType BySJout \
      --outFilterMultimapNmax 20 \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.04 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --alignSJoverhangMin 8 \
      --alignSJDBoverhangMin 1 \
      --sjdbScore 1 \
      --readFilesCommand zcat \
      --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)
    
    # The primary output will be "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam"
    # You might want to index the BAM file for downstream tools (e.g., samtools index)
    # samtools index "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam"
  3. 3

    Transcript abundance was calculated using Htseq

    HTSeq v0.13.5 (Inferred from common usage) GitHub
    $ Bash example
    # Install HTSeq (if not already installed)
    # It's recommended to install HTSeq in a dedicated Python environment.
    # conda create -n htseq_env python=3.8
    # conda activate htseq_env
    # pip install HTSeq
    
    # Placeholder for input files and output
    # Replace 'aligned_reads.bam' with the path to your actual alignment file (e.g., from STAR or HISAT2).
    # Replace 'genes.gtf' with the path to your gene annotation file.
    # A common source for human gene annotations is GENCODE (e.g., GRCh38 primary assembly).
    ALIGNED_BAM="aligned_reads.bam"
    GENE_ANNOTATION_GTF="genes.gtf"
    OUTPUT_COUNTS_FILE="gene_counts.txt"
    
    # Calculate transcript abundance using htseq-count
    # --format=bam: Specifies that the input alignment file is in BAM format.
    # --order=pos: Assumes reads in the BAM file are sorted by position. This is often required or recommended.
    # --mode=union: This is the default and most commonly used mode for handling reads that overlap multiple features.
    # --stranded=no: Assumes the RNA-seq data is unstranded. Use 'yes' or 'reverse' if your library preparation was stranded.
    # --type=exon: Specifies that features of type 'exon' in the GTF file should be counted.
    # --idattr=gene_id: Specifies that the 'gene_id' attribute in the GTF file should be used as the feature identifier.
    htseq-count \
      --format=bam \
      --order=pos \
      --mode=union \
      --stranded=no \
      --type=exon \
      --idattr=gene_id \
      "${ALIGNED_BAM}" \
      "${GENE_ANNOTATION_GTF}" \
      > "${OUTPUT_COUNTS_FILE}"

Tools Used

Raw Source Text
Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
Transcript abundance was calculated using Htseq
Genome_build: Homo sapiens UCSC hg19
← Back to Analysis