GSE134164 Processing Pipeline
RNA-Seq
3 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis. Cell Stem Cell (2019), PMID 31588046
Dataset
GSE134164: The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq2]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1. Illumina Casava 1.7 software was used for basecalling.
$ Bash example
# Illumina Casava 1.7 was a proprietary software suite. Basecalling was an
# integrated, automated process performed by the Illumina instrument's
# Real-Time Analysis (RTA) software as part of the Casava workflow.
# There is no standalone, user-executable bash command for "basecalling"
# within Casava 1.7. In general terms, this step converts the instrument's raw
# intensity data into base calls (BCL files), which are then converted into
# FASTQ files.
#
# The following command is a modern equivalent for converting BCL files
# (the output of basecalling) to FASTQ. bcl2fastq is distributed by Illumina
# and must be installed separately. All paths are placeholders, and the
# read-length values are illustrative, not settings reported for this dataset.
bcl2fastq \
  --runfolder-dir /path/to/illumina_run_folder \
  --output-dir /path/to/output_fastqs \
  --no-lane-splitting \
  --minimum-trimmed-read-length 35 \
  --mask-short-adapter-reads 35
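Once FASTQ files exist, a quick record count is a cheap sanity check before trimming and mapping. The following is a minimal sketch, not part of the published pipeline: the function name `count_fastq_reads` is hypothetical, and it assumes the standard four-lines-per-record FASTQ layout with no wrapped sequence lines.

```shell
# count_fastq_reads FILE
# Prints the number of records in a FASTQ file (plain or gzip-compressed).
# Assumption: every record occupies exactly four lines.
count_fastq_reads() {
    fastq_path="$1"
    case "${fastq_path}" in
        *.gz) fastq_lines=$(gzip -dc "${fastq_path}" | wc -l) ;;
        *)    fastq_lines=$(wc -l < "${fastq_path}") ;;
    esac
    # Integer division: four lines per FASTQ record.
    echo $(( fastq_lines / 4 ))
}
```

For example, `count_fastq_reads input_reads.fastq.gz` prints the read count for the placeholder input file used in the mapping step. Comparing the count before and after trimming shows how many reads the pre-processing discarded.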
2. Sequenced reads were trimmed for adaptor sequence and masked for low-complexity or low-quality sequence, then mapped to the reference genome hg19 using STAR.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# STAR index directory for hg19 (placeholder path).
# Build the index once with STAR's --runMode genomeGenerate, using the hg19
# reference genome and a matching GTF annotation (e.g., GENCODE v19):
# STAR --runMode genomeGenerate \
#   --genomeDir /path/to/STAR_index_hg19 \
#   --genomeFastaFiles /path/to/hg19.fa \
#   --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
#   --runThreadN 8

STAR_INDEX_DIR="/path/to/STAR_index_hg19"
INPUT_FASTQ="input_reads.fastq.gz"  # Trimmed reads; assumes single-end data
OUTPUT_PREFIX="mapped_reads"
NUM_THREADS=8                       # Adjust to the available resources

# Note: the description says reads were "trimmed for adaptor sequence, and
# masked for low-complexity or low-quality sequence". The tool used is not
# named; such steps are typically run before STAR with a tool like fastp or
# Trim Galore!. Example pre-processing for single-end reads (an assumption,
# not the authors' command; fastp's --low_complexity_filter discards
# low-complexity reads rather than masking them):
# fastp -i raw_reads.fastq.gz -o "${INPUT_FASTQ}" \
#   --trim_poly_g --qualified_quality_phred 15 --length_required 20

# For paired-end data, pass both files: --readFilesIn read1.fastq.gz read2.fastq.gz
STAR --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --runThreadN "${NUM_THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}." \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 20 \
  --readFilesCommand zcat
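After mapping, STAR writes a summary file named `<prefix>Log.final.out` (here `mapped_reads.Log.final.out`), in which each statistic appears as `label | value`. A minimal sketch for pulling a single statistic out of that file is shown below; the helper name `star_log_value` is hypothetical, and it assumes the label is passed exactly as it is printed in the log.

```shell
# star_log_value LOG_FILE LABEL
# Prints the value that follows "LABEL |" in a STAR Log.final.out file.
# Assumption: LABEL matches the text printed in the log exactly,
# e.g. "Uniquely mapped reads %".
star_log_value() {
    awk -F'|' -v label="$2" '
        {
            key = $1
            val = $2
            # Strip the padding STAR adds around labels and values.
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == label) { print val; exit }
        }
    ' "$1"
}
```

For example, `star_log_value mapped_reads.Log.final.out "Uniquely mapped reads %"` prints the uniquely mapped percentage, which is a useful first check that the hg19 index and the trimmed reads are compatible.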
3. Transcript abundance was calculated using HTSeq.
$ Bash example
# Install HTSeq (e.g., using pip or conda)
# pip install HTSeq
# conda install -c bioconda htseq

# Input and output paths (placeholders)
INPUT_BAM="mapped_reads.Aligned.sortedByCoord.out.bam"  # Coordinate-sorted BAM from the STAR step
# The annotation must match the genome build used for mapping (hg19),
# e.g., the GENCODE v19 GTF used to build the STAR index.
GENE_ANNOTATION_GTF="/path/to/gencode.v19.annotation.gtf"
OUTPUT_COUNTS="gene_counts.txt"

# Count reads per gene with htseq-count.
# Parameters:
# --format=bam: the input alignment file is in BAM format.
# --order=pos: the BAM is sorted by coordinate (this matters only for
#   paired-end data; htseq-count assumes name order by default).
# --stranded=reverse: assumes a reverse-stranded library. The source does not
#   state strandedness; use 'no' or 'yes' for unstranded or forward-stranded
#   libraries.
# --mode=union: a read is assigned to a gene when the features it overlaps
#   all belong to that one gene; otherwise it is reported as ambiguous or as
#   having no feature.
# --type=exon: count over 'exon' features in the GTF file.
# --idattr=gene_id: group features by the 'gene_id' attribute, which yields
#   gene-level counts.
# --minaqual=10: skip reads with an alignment quality below 10 (the default).
htseq-count \
  --format=bam \
  --order=pos \
  --stranded=reverse \
  --mode=union \
  --type=exon \
  --idattr=gene_id \
  --minaqual=10 \
  "${INPUT_BAM}" \
  "${GENE_ANNOTATION_GTF}" \
  > "${OUTPUT_COUNTS}"
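The htseq-count output is a two-column, tab-delimited table of gene IDs and counts, followed by special counters whose names start with `__` (for example `__no_feature` and `__ambiguous`). The sketch below is a hypothetical helper, `summarize_htseq_counts`, that separates the two so the assigned fraction can be checked at a glance; it assumes the default output without additional attribute columns.

```shell
# summarize_htseq_counts COUNTS_FILE
# Prints three tab-delimited lines: the number of gene rows, the total count
# assigned to genes, and the total held in the special "__" counters.
# Assumption: column 1 is the feature ID and column 2 is the count.
summarize_htseq_counts() {
    awk -F'\t' '
        $1 ~ /^__/ { unassigned += $2; next }   # special counters
        { assigned += $2; genes++ }             # ordinary gene rows
        END {
            printf "genes\t%d\nassigned\t%d\nunassigned\t%d\n",
                   genes, assigned, unassigned
        }
    ' "$1"
}
```

Running `summarize_htseq_counts gene_counts.txt` after the counting step gives a quick indication of whether the strandedness setting was appropriate: a very low assigned total with a large `__no_feature` count often points to the wrong `--stranded` value. To obtain a table of gene rows only, the special counters can be dropped with `grep -v '^__' gene_counts.txt`.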
Tools Used
Raw Source Text
Illumina Casava1.7 software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
Transcript abundance was calculated using Htseq
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: tab-delimited text files include raw count for each gene