GSE78960 Processing Pipeline

RNA-Seq, 3 steps

Publication

Genomic analysis of the molecular neuropathology of tuberous sclerosis using a human stem cell model.

Genome Medicine (2016), PMID 27655340

Dataset

Modeling the Neuropathology of Tuberous Sclerosis with Human Stem Cells Reveals a Role for Inflammation and Angiogenic Growth Factors [Treatment]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Linker tags were removed from RNA sequencing and ribosome profiling reads by the FASTX Toolkit, v0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/)

FASTX Toolkit v0.0.13

$ Bash example

# Install FASTX Toolkit (if not already installed)
# conda install -c bioconda fastx_toolkit

# Define input and output file names
INPUT_FASTQ="input_reads.fastq"
OUTPUT_FASTQ="reads_linker_removed.fastq"

# Placeholder for the linker tag sequence. Replace it with the actual linker sequence used in library preparation.
# Illustrative value only: LINKER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
LINKER_SEQUENCE="YOUR_LINKER_TAG_SEQUENCE"

# Remove linker tags using fastx_clipper
# -a: adapter sequence to clip
# -i: input FASTQ file
# -o: output FASTQ file
# -Q 33: specify quality score format (Phred+33, common for Illumina)
fastx_clipper -a "${LINKER_SEQUENCE}" -i "${INPUT_FASTQ}" -o "${OUTPUT_FASTQ}" -Q 33
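
A quick sanity check after clipping is to compare read counts before and after. By default `fastx_clipper` discards reads that end up shorter than 5 nt and reads containing N, so some loss is expected. The helper below is a minimal sketch. The function name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records.

```bash
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes every record occupies exactly four lines (no wrapped sequence lines).
count_fastq_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Example usage with the file names from the example above:
# echo "before clipping: $(count_fastq_reads input_reads.fastq)"
# echo "after clipping:  $(count_fastq_reads reads_linker_removed.fastq)"
```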


All reads that mapped to rRNAs, tRNAs or mitochondrial rRNAs were removed, and the remaining reads were mapped to RefSeq (v38) by TopHat v2.0.13.
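
The description names a filtering step before alignment but does not say which tool performed it. As a minimal sketch, and purely as an assumption, the reads could be aligned with Bowtie2 against an index of rRNA, tRNA and mitochondrial rRNA sequences, keeping only the reads that fail to align. The function name `filter_contaminants` and the index name `contaminant_index` are hypothetical. The index would need to be built beforehand from a FASTA file of the contaminant sequences with `bowtie2-build`.

```bash
# Hypothetical helper: keep only reads that do NOT align to a contaminant index.
# Assumption: Bowtie2 is used for this step; the source does not name the tool.
filter_contaminants() {
  local reads="$1"          # linker-clipped FASTQ (single-end)
  local index_prefix="$2"   # Bowtie2 index of rRNA/tRNA/mitochondrial rRNA sequences
  local kept_reads="$3"     # output FASTQ containing the reads that failed to align
  # --un writes unaligned reads to a file; the alignments themselves are discarded.
  bowtie2 -x "${index_prefix}" -U "${reads}" --un "${kept_reads}" -S /dev/null
}

# Example usage (index path is a placeholder):
# filter_contaminants reads_linker_removed.fastq ref_data/contaminant_index filtered_reads.fastq
```

The output file name `filtered_reads.fastq` matches the placeholder input used in the TopHat example.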

TopHat v2.0.13

$ Bash example

# Create a directory for reference data
mkdir -p ref_data

# Download GRCh38.p13 genome (RefSeq assembly GCF_000001405.39)
# Note: the source metadata lists Genome_build GRCh37.p13, so confirm which assembly "RefSeq (v38)" refers to before use.
# wget -P ref_data https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
# gunzip ref_data/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
# mv ref_data/GCF_000001405.39_GRCh38.p13_genomic.fna ref_data/GRCh38.p13_genomic.fna

# Build Bowtie2 index for GRCh38.p13 (TopHat v2 uses Bowtie2 by default)
# bowtie2-build ref_data/GRCh38.p13_genomic.fna ref_data/grch38_refseq_index

# Download RefSeq GTF for GRCh38.p13
# wget -P ref_data https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
# gunzip ref_data/GCF_000001405.39_GRCh38.p13_genomic.gtf.gz
# mv ref_data/GCF_000001405.39_GRCh38.p13_genomic.gtf ref_data/grch38_refseq.gtf

# Installation of TopHat
# conda install -c bioconda tophat=2.0.13

# Define input and output
INPUT_READS="filtered_reads.fastq" # Placeholder for reads after removing rRNA/tRNA/mito
OUTPUT_DIR="tophat_output"
GENOME_INDEX_PREFIX="ref_data/grch38_refseq_index"
GTF_FILE="ref_data/grch38_refseq.gtf"

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# Execute TopHat
tophat -o "${OUTPUT_DIR}" \
       -G "${GTF_FILE}" \
       "${GENOME_INDEX_PREFIX}" \
       "${INPUT_READS}"


Finally all read counts that mapped uniquely to genes were extracted for expression analysis with the help of samtools, v1.1.

samtools v1.1

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.1

# Extract read counts per reference sequence (chromosome/contig).
# samtools idxstats reports, for each reference, its name, its length, and the numbers of mapped and unmapped reads.
# It counts all mapped reads per reference, not reads per gene, so the description "mapped uniquely to genes"
# implies further filtering or custom scripts that the source does not describe. Gene-level unique counts are
# typically produced with featureCounts or htseq-count on a BAM file restricted to uniquely aligned reads
# (for TopHat output, e.g., alignments carrying the NH:i:1 tag).
# Replace 'input.bam' with your actual alignment file.
# Replace 'output_read_counts.txt' with your desired output file name.
# idxstats requires a coordinate-sorted, indexed BAM file, so index it first.
samtools index input.bam
samtools idxstats input.bam > output_read_counts.txt
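
The idxstats example counts every mapped read, while the description refers to uniquely mapped reads. One way to approximate that restriction is sketched below. It is an assumption, not the authors' documented method. It relies on the `NH` tag that TopHat writes to each alignment (number of reported alignments for the read) and works on SAM text streamed from `samtools view -h`. The function names `filter_unique_sam` and `count_per_reference` are hypothetical.

```bash
# Hypothetical helper: keep SAM header lines and alignments whose NH tag is exactly 1,
# i.e. reads reported at a single location.
filter_unique_sam() {
  awk -F'\t' '
    /^@/ { print; next }
    {
      # Optional tags start at column 12 of a SAM record.
      for (i = 12; i <= NF; i++) {
        if ($i == "NH:i:1") { print; break }
      }
    }'
}

# Hypothetical helper: count alignment records per reference name (SAM column 3),
# skipping header lines and unmapped records.
count_per_reference() {
  awk -F'\t' '!/^@/ && $3 != "*" { n[$3]++ } END { for (r in n) print r "\t" n[r] }' | sort
}

# Example usage (file names are placeholders):
# samtools view -h input.bam | filter_unique_sam | count_per_reference > unique_read_counts.txt
```

The counts are per alignment record and per reference sequence. For paired-end data each mate is counted separately, and gene-level counts would still require an annotation-aware tool such as featureCounts or htseq-count.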


Tools Used

FASTX Toolkit, TopHat, samtools

Raw Source Text

Linker tags were removed from RNA sequencing and ribosome profiling reads by the FASTX Toolkit, v0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/)
All reads that mapped to rRNAs, tRNAs or mitochondrial rRNAs were removed, and the remaining reads were mapped to RefSeq (v38) by TopHat v2.0.13.
Finally all read counts that mapped uniquely to genes were extracted for expression analysis with the help of samtools, v1.1.
Genome_build: GRCh37.p13
Supplementary_files_format_and_content: .txt files report raw read counts that mapped uniquely to genes
