GSE134164 Processing Pipeline

RNA-Seq, code examples, 3 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell Stem Cell (2019), PMID 31588046

Dataset

GSE134164

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq2]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Illumina Casava1.7 software used for basecalling.

    CASAVA v1.7
    $ Bash example
    # Illumina CASAVA 1.7 was a proprietary software suite, so this step cannot be reproduced exactly with open tools.
    # Basecalling itself ran on the instrument in Illumina's Real-Time Analysis (RTA) software; CASAVA then turned the
    # resulting base calls into per-read sequence files and handled demultiplexing. There is no standalone,
    # user-executable bash command for "basecalling" within CASAVA 1.7.
    # Note: CASAVA releases before 1.8 wrote quality scores with a Phred+64 offset, not the Phred+33 offset used today.
    #
    # The following command is a modern stand-in for converting BCL (base call) files to FASTQ, not the command the authors ran.
    # bcl2fastq2 is distributed by Illumina; install it according to Illumina's instructions.
    bcl2fastq --runfolder-dir /path/to/illumina_run_folder \
              --output-dir /path/to/output_fastqs \
              --no-lane-splitting \
              --minimum-trimmed-read-length 35 \
              --mask-short-adapter-reads 35
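Because CASAVA releases before 1.8 used a Phred+64 quality offset, FASTQ files taken directly from this step may need their encoding checked before they are passed to tools that expect Phred+33 (reads downloaded from SRA are typically already normalized). The sketch below is a heuristic, not part of the authors' pipeline: the function name `guess_phred_offset` is hypothetical, and the cut-offs (lowest quality character below ASCII 59 implies Phred+33, at or above ASCII 64 implies Phred+64) are a common rule of thumb that can misclassify small or uniformly high-quality samples.

```shell
# Hypothetical helper: guess the quality offset of a FASTQ stream read from stdin.
# Assumes well-formed 4-line FASTQ records.
guess_phred_offset() {
  LC_ALL=C awk '
    BEGIN {
      # Build a character -> ASCII code lookup for printable characters.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
      min = 127
    }
    NR % 4 == 0 {
      # Every 4th line holds the quality string; track its lowest character code.
      n = length($0)
      for (j = 1; j <= n; j++) {
        ch = substr($0, j, 1)
        if ((ch in ord) && ord[ch] < min) min = ord[ch]
      }
    }
    END {
      if (min == 127)     print "unknown"    # no quality characters seen
      else if (min < 59)  print "phred33"    # too low for any +64 encoding
      else if (min >= 64) print "phred64"    # consistent with Illumina 1.3-1.7 output
      else                print "ambiguous"  # could be Solexa+64 or Phred+33
    }'
}

# Example usage (file name is a placeholder):
# gzip -cd input_reads.fastq.gz | guess_phred_offset
```

If the result is `phred64`, the downstream tool usually needs to be told so explicitly (fastp, for example, has a `--phred64` input option).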
  2.

    Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Placeholder for STAR index directory for hg19.
    # This index should be built using STAR's --runMode genomeGenerate command
    # with hg19 reference genome and GTF annotation (e.g., GENCODE v19).
    # Example command to build the index (run once):
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_index_hg19 \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
    #      --runThreadN 8
    
    STAR_INDEX_DIR="/path/to/STAR_index_hg19"
    INPUT_FASTQ="input_reads.fastq.gz"     # Assumes single-end reads. For paired-end data, pass both mate files to --readFilesIn.
    TRIMMED_FASTQ="trimmed_reads.fastq.gz" # Placeholder name for the pre-processed reads
    OUTPUT_PREFIX="mapped_reads"
    NUM_THREADS=8 # Adjust to the available resources
    
    # Note: The description says reads were "trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence",
    # but it does not name the tool or its settings. The fastp call below is an assumed stand-in for that pre-processing;
    # fastp trims and filters reads rather than masking bases, so it only approximates the described step.
    # fastp -i "${INPUT_FASTQ}" -o "${TRIMMED_FASTQ}" --low_complexity_filter --qualified_quality_phred 15 --length_required 20
    # If you run a pre-processing step, point --readFilesIn at "${TRIMMED_FASTQ}" instead of "${INPUT_FASTQ}".
    
    # The STAR options below are commonly used values, not settings reported by the authors.
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${INPUT_FASTQ}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}." \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --readFilesCommand zcat
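The step description says low-quality sequence was "masked", which differs from trimming: the read keeps its length and unreliable bases are replaced with N. The source does not say which tool or threshold was used, so the following is only a minimal sketch of the idea. The function name `mask_low_quality`, the default threshold of Q15, and the Phred+33 default offset are all assumptions made for illustration.

```shell
# Hypothetical helper: replace bases whose Phred score is below a threshold with N.
# Reads 4-line FASTQ records from stdin and writes masked FASTQ to stdout.
#   $1 = minimum Phred score to keep a base (assumed default: 15)
#   $2 = quality offset (assumed default: 33; use 64 for pre-1.8 CASAVA files)
mask_low_quality() {
  local min_q="${1:-15}" offset="${2:-33}"
  LC_ALL=C awk -v min_q="$min_q" -v offset="$offset" '
    BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { header = $0; next }
    NR % 4 == 2 { seq = $0; next }
    NR % 4 == 3 { plus = $0; next }
    {
      # 4th line of the record: walk the quality string and mask base by base.
      qual = $0
      masked = ""
      n = length(seq)
      for (j = 1; j <= n; j++) {
        q = ord[substr(qual, j, 1)] - offset
        masked = masked ((q < min_q) ? "N" : substr(seq, j, 1))
      }
      print header
      print masked
      print plus
      print qual
    }'
}

# Example usage (file names are placeholders):
# gzip -cd input_reads.fastq.gz | mask_low_quality 15 33 | gzip > masked_reads.fastq.gz
```

Masked bases count as mismatches or are otherwise penalized depending on the aligner's settings, so heavily masked reads may still fail to map.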
  3.

    Transcript abundance was calculated using Htseq

    HTSeq v2.0.2
    $ Bash example
    # Install HTSeq (e.g., using pip or conda)
    # pip install HTSeq
    # conda install -c bioconda htseq
    
    # Define input and output file paths
    INPUT_BAM="mapped_reads.Aligned.sortedByCoord.out.bam" # STAR output from step 2 when OUTPUT_PREFIX="mapped_reads"; replace with your aligned BAM file
    GENE_ANNOTATION_GTF="gencode.v19.annotation.gtf" # Placeholder: the annotation must match the hg19 build used for alignment; the authors' annotation source is not stated
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Count reads per gene using htseq-count
    # The option values below are assumptions; the source does not report them.
    # Parameters explained:
    # --format=bam: Specifies that the input alignment file is in BAM format.
    # --stranded=reverse: Assumes a reverse-stranded library preparation (common for many Illumina RNA-seq protocols). Use 'no' or 'yes' if your library is unstranded or forward-stranded; htseq-count defaults to 'yes'.
    # --mode=union: Assigns a read to a gene when the union of all features it overlaps contains exactly one gene. Reads overlapping several genes are reported as ambiguous, and reads overlapping none as no_feature.
    # --type=exon: Specifies that features of type 'exon' from the GTF file should be used for counting. Adjust if you need to count other feature types (e.g., 'gene', 'CDS').
    # --idattr=gene_id: Specifies the attribute in the GTF file that contains the feature identifier (e.g., gene_id, transcript_id). 'gene_id' is typical for gene-level counts.
    # --minaqual=10: Sets the minimum mapping quality (MAPQ). Alignments with a lower value are skipped.
    htseq-count \
      --format=bam \
      --stranded=reverse \
      --mode=union \
      --type=exon \
      --idattr=gene_id \
      --minaqual=10 \
      "${INPUT_BAM}" \
      "${GENE_ANNOTATION_GTF}" \
      > "${OUTPUT_COUNTS}"
    

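htseq-count writes a two-column, tab-delimited table of per-gene counts followed by special counter rows whose names start with `__` (for example `__no_feature` and `__ambiguous`). A quick summary of that file helps confirm that the strandedness and annotation choices above were sensible. The sketch below assumes that output layout; the function name `summarize_htseq_counts` is hypothetical, and the "assigned fraction" is only approximate because some special counters are tallied per alignment rather than per read.

```shell
# Hypothetical helper: summarize an htseq-count output file.
#   $1 = path to the tab-delimited counts file (gene_id<TAB>count)
summarize_htseq_counts() {
  awk -F'\t' '
    /^__/ { unassigned += $2; next }   # special counters such as __no_feature
    {
      assigned += $2                   # reads assigned to a gene
      genes++
      if ($2 > 0) detected++
    }
    END {
      total = assigned + unassigned
      printf "genes\t%d\n", genes
      printf "genes_with_reads\t%d\n", detected
      printf "assigned_reads\t%d\n", assigned
      printf "unassigned_reads\t%d\n", unassigned
      if (total > 0) printf "assigned_fraction\t%.4f\n", assigned / total
    }' "$1"
}

# Example usage (file name matches OUTPUT_COUNTS in the htseq-count example):
# summarize_htseq_counts gene_counts.txt
```

A very low assigned fraction with a large `__no_feature` count often points to the wrong `--stranded` setting or to chromosome names that differ between the BAM and the GTF.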
Raw Source Text
Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
Transcript abundance was calculated using Htseq
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: tab-delimited text files include raw count for each gene