GSE124072 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Illumina Casava1.7 software used for basecalling.

Illumina Casava v1.7

$ Bash example

# Illumina Casava 1.7 is proprietary software used for basecalling on Illumina sequencers.
# It converts raw intensity data (.bcl files) into base calls and quality scores, typically generating FASTQ files.
# This process is usually integrated into the sequencing instrument's workflow and does not involve a user-executable command-line tool in the same way as open-source bioinformatics tools.
# The output of this step would be FASTQ files.

Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR

STAR v2.7.9a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Path to the STAR genome index for hg19. This index needs to be pre-built using STAR's --runMode genomeGenerate.
# Reference genome hg19 is specified in the description.
STAR_INDEX_DIR="/path/to/STAR_index/hg19"

# Input FASTQ files (assuming paired-end reads that have already been trimmed for adaptors and masked for low-complexity/quality, as per description)
# Replace with actual input file names
READ1="input_reads_R1.fastq.gz"
READ2="input_reads_R2.fastq.gz"

# Output directory and prefix
OUTPUT_DIR="star_alignment_output"
OUTPUT_PREFIX="${OUTPUT_DIR}/aligned_reads"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# STAR alignment command for RNA-based assays (e.g., eCLIP)
# Parameters are common for sensitive alignment of RNA-seq data, including splice junction detection.
STAR \
  --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --runThreadN 16 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --alignSJoverhangMin 8 \
  --alignSJDBoverhangMin 1 \
  --sjdbScore 1 \
  --readFilesCommand zcat \
  --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

# The primary output will be "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam"
# You might want to index the BAM file for downstream tools (e.g., samtools index)
# samtools index "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam"

View on GitHub

Transcript abundance was calculated using Htseq

HTSeq v0.13.5 (Inferred from common usage) GitHub

$ Bash example

# Install HTSeq (if not already installed)
# It's recommended to install HTSeq in a dedicated Python environment.
# conda create -n htseq_env python=3.8
# conda activate htseq_env
# pip install HTSeq

# Placeholder for input files and output
# Replace 'aligned_reads.bam' with the path to your actual alignment file (e.g., from STAR or HISAT2).
# Replace 'genes.gtf' with the path to your gene annotation file.
# A common source for human gene annotations is GENCODE (e.g., GRCh38 primary assembly).
ALIGNED_BAM="aligned_reads.bam"
GENE_ANNOTATION_GTF="genes.gtf"
OUTPUT_COUNTS_FILE="gene_counts.txt"

# Calculate transcript abundance using htseq-count
# --format=bam: Specifies that the input alignment file is in BAM format.
# --order=pos: Assumes reads in the BAM file are sorted by position. This is often required or recommended.
# --mode=union: This is the default and most commonly used mode for handling reads that overlap multiple features.
# --stranded=no: Assumes the RNA-seq data is unstranded. Use 'yes' or 'reverse' if your library preparation was stranded.
# --type=exon: Specifies that features of type 'exon' in the GTF file should be counted.
# --idattr=gene_id: Specifies that the 'gene_id' attribute in the GTF file should be used as the feature identifier.
htseq-count \
  --format=bam \
  --order=pos \
  --mode=union \
  --stranded=no \
  --type=exon \
  --idattr=gene_id \
  "${ALIGNED_BAM}" \
  "${GENE_ANNOTATION_GTF}" \
  > "${OUTPUT_COUNTS_FILE}"

View on GitHub

Tools Used

STAR

Raw Source Text

Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
Transcript abundance was calculated using Htseq
Genome_build: Homo sapiens UCSC hg19

← Back to Analysis