GSE124072 Processing Pipeline
RNA-Seq
code_examples
3 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.Cell stem cell (2019) — PMID 31588046
Dataset
GSE124072The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Illumina Casava1.7 software used for basecalling.
Illumina Casava v1.7$ Bash example
# Illumina Casava 1.7 is proprietary software used for basecalling on Illumina sequencers. # It converts raw intensity data (.bcl files) into base calls and quality scores, typically generating FASTQ files. # This process is usually integrated into the sequencing instrument's workflow and does not involve a user-executable command-line tool in the same way as open-source bioinformatics tools. # The output of this step would be FASTQ files.
-
2
Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Path to the STAR genome index for hg19. This index needs to be pre-built using STAR's --runMode genomeGenerate. # Reference genome hg19 is specified in the description. STAR_INDEX_DIR="/path/to/STAR_index/hg19" # Input FASTQ files (assuming paired-end reads that have already been trimmed for adaptors and masked for low-complexity/quality, as per description) # Replace with actual input file names READ1="input_reads_R1.fastq.gz" READ2="input_reads_R2.fastq.gz" # Output directory and prefix OUTPUT_DIR="star_alignment_output" OUTPUT_PREFIX="${OUTPUT_DIR}/aligned_reads" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # STAR alignment command for RNA-based assays (e.g., eCLIP) # Parameters are common for sensitive alignment of RNA-seq data, including splice junction detection. STAR \ --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --runThreadN 16 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --readFilesCommand zcat \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB) # The primary output will be "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam" # You might want to index the BAM file for downstream tools (e.g., samtools index) # samtools index "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam" -
3
Transcript abundance was calculated using Htseq
$ Bash example
# Install HTSeq (if not already installed) # It's recommended to install HTSeq in a dedicated Python environment. # conda create -n htseq_env python=3.8 # conda activate htseq_env # pip install HTSeq # Placeholder for input files and output # Replace 'aligned_reads.bam' with the path to your actual alignment file (e.g., from STAR or HISAT2). # Replace 'genes.gtf' with the path to your gene annotation file. # A common source for human gene annotations is GENCODE (e.g., GRCh38 primary assembly). ALIGNED_BAM="aligned_reads.bam" GENE_ANNOTATION_GTF="genes.gtf" OUTPUT_COUNTS_FILE="gene_counts.txt" # Calculate transcript abundance using htseq-count # --format=bam: Specifies that the input alignment file is in BAM format. # --order=pos: Assumes reads in the BAM file are sorted by position. This is often required or recommended. # --mode=union: This is the default and most commonly used mode for handling reads that overlap multiple features. # --stranded=no: Assumes the RNA-seq data is unstranded. Use 'yes' or 'reverse' if your library preparation was stranded. # --type=exon: Specifies that features of type 'exon' in the GTF file should be counted. # --idattr=gene_id: Specifies that the 'gene_id' attribute in the GTF file should be used as the feature identifier. htseq-count \ --format=bam \ --order=pos \ --mode=union \ --stranded=no \ --type=exon \ --idattr=gene_id \ "${ALIGNED_BAM}" \ "${GENE_ANNOTATION_GTF}" \ > "${OUTPUT_COUNTS_FILE}"
Tools Used
Raw Source Text
Illumina Casava1.7 software used for basecalling. Sequenced reads were trimmed for adaptor sequence, and masked for low-complexity or low-quality sequence, then mapped to reference genome hg19 using STAR Transcript abundance was calculated using Htseq Genome_build: Homo sapiens UCSC hg19