GSE134164 Processing Pipeline

RNA-Seq, 3 processing steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell Stem Cell (2019), PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq2]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Illumina Casava 1.7 software was used for basecalling.

Casava v1.7

$ Bash example

# Illumina CASAVA 1.7 was a proprietary software suite. On Illumina instruments, basecalling is performed during the run by the Real-Time Analysis (RTA) software; CASAVA then processed the base-call output offline. The source does not give the CASAVA 1.7 command line used for this dataset, so it cannot be reproduced here.
#
# As a modern stand-in (an assumption, not the authors' command), bcl2fastq converts BCL base-call files to FASTQ.
# bcl2fastq is distributed by Illumina; install it from Illumina's download pages.
# The two read-length options are illustrative values, not settings reported by the authors.
bcl2fastq --runfolder-dir /path/to/illumina_run_folder --output-dir /path/to/output_fastqs \
  --no-lane-splitting --minimum-trimmed-read-length 35 --mask-short-adapter-reads 35

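After conversion, a quick sanity check on each FASTQ file can catch truncated output before alignment. The sketch below is not part of the published pipeline: `count_fastq_reads` is a hypothetical helper, and it assumes a gzip-compressed FASTQ file with the standard four lines per record.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original pipeline): count the reads in a
# gzip-compressed FASTQ file, assuming four lines per record.
count_fastq_reads() {
  local fastq_gz="$1"
  local n_lines
  # tr strips the padding that some wc implementations add around the number.
  n_lines=$(gzip -dc "$fastq_gz" | wc -l | tr -d ' ')
  # A line count that is not a multiple of 4 suggests a truncated file.
  if [ $((n_lines % 4)) -ne 0 ]; then
    echo "ERROR: $fastq_gz has $n_lines lines (not a multiple of 4)" >&2
    return 1
  fi
  echo $((n_lines / 4))
}

# Example call (placeholder path):
# count_fastq_reads /path/to/output_fastqs/sample_R1.fastq.gz
```

The function prints the read count on success and returns a non-zero status when the line count is inconsistent, so it can gate later steps in a script.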

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR

STAR v2.7.0f

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR index directory for hg19.
# This index should be built using STAR's --runMode genomeGenerate command
# with hg19 reference genome and GTF annotation (e.g., GENCODE v19).
# Example command to build the index (run once):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index_hg19 \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
#      --runThreadN 8

STAR_INDEX_DIR="/path/to/STAR_index_hg19"
INPUT_FASTQ="input_reads.fastq.gz" # Assuming single-end reads. For paired-end, use "read1.fastq.gz read2.fastq.gz"
OUTPUT_PREFIX="mapped_reads"
NUM_THREADS=8 # Adjust as needed based on available resources

# Note: The description states that reads were "trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence".
# The source does not name the tool used; such pre-processing is typically run before STAR with tools like fastp or Trim Galore!.
# Illustrative single-end pre-processing (an assumption, not the authors' command). Note that fastp filters out low-complexity reads rather than masking them:
# TRIMMED_FASTQ="trimmed_reads.fastq.gz"
# fastp -i "${INPUT_FASTQ}" -o "${TRIMMED_FASTQ}" --low_complexity_filter --qualified_quality_phred 15 --length_required 20

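# Parameter values are illustrative; the source reports only that STAR was used against hg19.
# ${INPUT_FASTQ} is left unquoted on purpose so that a paired-end value ("read1.fastq.gz read2.fastq.gz") splits into two arguments.
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn ${INPUT_FASTQ} \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}." \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 \
     --readFilesCommand zcat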

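STAR writes a summary file named `<prefix>Log.final.out` next to the BAM, which is a convenient place to check the mapping rate before counting. The sketch below is an addition, not part of the published pipeline: `star_unique_pct` is a hypothetical helper, and it assumes the usual `label | value` line layout of that log.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original pipeline): print the uniquely
# mapped read percentage from a STAR Log.final.out file.
# Assumes lines of the form "<label> |<TAB><value>".
star_unique_pct() {
  local log_file="$1"
  awk -F'|' '/Uniquely mapped reads %/ {
    gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
    print $2
  }' "$log_file"
}

# Example call (file name follows from OUTPUT_PREFIX="mapped_reads"):
# star_unique_pct mapped_reads.Log.final.out
```

A low percentage here is worth investigating (wrong genome index, untrimmed adapters, contamination) before moving on to read counting.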

Transcript abundance was calculated using HTSeq.

HTSeq v2.0.2

$ Bash example

# Install HTSeq (e.g., using pip or conda)
# pip install HTSeq
# conda install -c bioconda htseq

# Define input and output file paths
INPUT_BAM="input_aligned_reads.bam" # Placeholder: replace with your aligned BAM file (e.g., STAR's *.Aligned.sortedByCoord.out.bam)
GENE_ANNOTATION_GTF="gencode.v19.annotation.gtf" # Placeholder: the annotation must match the hg19 alignment (same assembly, same "chr"-prefixed contig names); a GRCh38 GTF would give wrong or empty counts
OUTPUT_COUNTS="gene_counts.txt"

# Count reads per gene with htseq-count (the source reports raw counts for each gene).
# The source does not list the htseq-count options used; the values below are illustrative.
# Parameters explained:
# --format=bam: The input alignment file is in BAM format.
# --stranded=reverse: Assumes a reverse-stranded library. The source does not state the strandedness; use 'no' for unstranded or 'yes' for forward-stranded libraries.
# --mode=union: A read is assigned to a gene when the union of all features it overlaps contains exactly one gene; reads overlapping several genes are reported as __ambiguous, and reads overlapping none as __no_feature.
# --type=exon: Features of type 'exon' in the GTF file are used for counting.
# --idattr=gene_id: The GTF attribute used as the feature identifier, so that exon-level hits are summed per gene.
# --minaqual=10: Alignments with a mapping quality (MAPQ) below this value are skipped.
htseq-count \
  --format=bam \
  --stranded=reverse \
  --mode=union \
  --type=exon \
  --idattr=gene_id \
  --minaqual=10 \
  "${INPUT_BAM}" \
  "${GENE_ANNOTATION_GTF}" \
  > "${OUTPUT_COUNTS}"

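htseq-count appends special counters whose IDs start with `__` (for example `__no_feature` and `__ambiguous`) to the end of its output. Comparing them with the per-gene counts gives a rough view of how well the strandedness and annotation settings fit the data. The sketch below is an addition to the pipeline: `htseq_summary` is a hypothetical helper, and the totals should be treated as approximate because multi-mapping alignments can inflate the special counters.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original pipeline): summarise one
# htseq-count output file (gene_id<TAB>count).
# Lines whose ID starts with "__" are htseq-count's special counters;
# all other lines are per-gene counts.
htseq_summary() {
  local counts_file="$1"
  awk -F'\t' '
    $1 ~ /^__/ { special += $2; next }
    { assigned += $2 }
    END {
      total = assigned + special
      pct = (total > 0) ? 100 * assigned / total : 0
      printf "assigned=%d\tunassigned=%d\tpct_assigned=%.1f\n", assigned, special, pct
    }' "$counts_file"
}

# Example call:
# htseq_summary gene_counts.txt
```

A very low assigned percentage with a large `__no_feature` count often points to a wrong `--stranded` setting or to an annotation that does not match the genome build.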

Tools Used

Casava

STAR

HTSeq

Raw Source Text

Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
Transcript abundance was calculated using Htseq
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: tab-delimited text files include raw count for each gene
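
The per-sample tab-delimited count files described above can be combined into a single gene-by-sample matrix for downstream analysis. The sketch below is an addition, not the authors' code: `merge_counts` is a hypothetical helper that assumes two-column files (gene ID, count), skips any `__`-prefixed htseq-count counters, and fills genes missing from a file with 0.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original pipeline): merge per-sample
# count files (gene_id<TAB>count) into one gene-by-sample matrix.
# Column headers are the input file names, in the order given.
merge_counts() {
  awk -F'\t' -v OFS='\t' '
    FNR == 1 { nfiles++; header = header OFS FILENAME }
    $1 ~ /^__/ { next }   # skip htseq-count special counters
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
      count[$1, nfiles] = $2
    }
    END {
      print "gene_id" header
      for (g = 1; g <= ngenes; g++) {
        row = order[g]
        for (f = 1; f <= nfiles; f++) {
          value = ((order[g], f) in count) ? count[order[g], f] : 0
          row = row OFS value
        }
        print row
      }
    }' "$@"
}

# Example call (placeholder file names):
# merge_counts sample1_counts.txt sample2_counts.txt > count_matrix.tsv
```

Genes appear in the order they are first seen, so passing files produced with the same annotation keeps the rows aligned with the original outputs.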
