GSE78960 Processing Pipeline

RNA-Seq, 3 steps

Publication

Genomic analysis of the molecular neuropathology of tuberous sclerosis using a human stem cell model.

Genome Medicine (2016), PMID 27655340

Dataset

GSE78960

Modeling the Neuropathology of Tuberous Sclerosis with Human Stem Cells Reveals a Role for Inflammation and Angiogenic Growth Factors [Treatment]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Linker tags were removed from RNA sequencing and ribosome profiling reads by the FASTX Toolkit, v0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/)

    $ Bash example
    # Install FASTX Toolkit (if not already installed)
    # conda install -c bioconda fastx_toolkit
    
    # Define input and output file names
    INPUT_FASTQ="input_reads.fastq"
    OUTPUT_FASTQ="reads_linker_removed.fastq"
    
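    # Placeholder for the linker tag sequence. The source does not state it, so replace this with the actual linker.
    # Format example only (a common Illumina TruSeq adapter, not necessarily the linker used in this study):
    # LINKER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    LINKER_SEQUENCE="YOUR_LINKER_TAG_SEQUENCE"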
    
    # Remove linker tags using fastx_clipper
    # -a: adapter sequence to clip
    # -i: input FASTQ file
    # -o: output FASTQ file
    # -Q 33: specify quality score format (Phred+33, common for Illumina)
    fastx_clipper -a "${LINKER_SEQUENCE}" -i "${INPUT_FASTQ}" -o "${OUTPUT_FASTQ}" -Q 33
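
After clipping, it can help to confirm that read lengths changed as expected. The following is a minimal sketch using only awk and sort. It assumes a plain 4-line-per-record FASTQ file; the function name `fastq_length_hist` and the toy reads are hypothetical and stand in for `reads_linker_removed.fastq`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Summarize read lengths in a 4-line-per-record FASTQ file.
# Prints "<length><TAB><count>", sorted by length.
fastq_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' "$1" | sort -n
}

# Toy input (hypothetical data) standing in for reads_linker_removed.fastq
DEMO_FASTQ="$(mktemp)"
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nTTGGCCAA\n+\nIIIIIIII\n' > "${DEMO_FASTQ}"

LENGTH_HIST="$(fastq_length_hist "${DEMO_FASTQ}")"
echo "${LENGTH_HIST}"
rm -f "${DEMO_FASTQ}"
```

Running the same function on the FASTQ file before and after `fastx_clipper` gives a quick view of how many reads were shortened.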
    
  2.

    All reads that mapped to rRNAs, tRNAs or mitochondrial rRNAs were removed, and the remaining reads were mapped to RefSeq (v38) by TopHat v2.0.13.
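
The source does not say how the rRNA, tRNA and mitochondrial rRNA reads were identified or removed. As one possible sketch, assuming a list of read IDs that aligned to such a contaminant reference is already available (one ID per line), the matching records can be dropped from the FASTQ file with awk. The function name `filter_fastq_by_id` and the toy data are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Drop FASTQ records whose read ID appears in an ID list.
# $1: file with read IDs to remove (one per line)
# $2: 4-line-per-record FASTQ file
filter_fastq_by_id() {
    awk 'FILENAME == ARGV[1] { drop[$1] = 1; next }
         FNR % 4 == 1 { id = substr($1, 2); keep = !(id in drop) }
         keep' "$1" "$2"
}

# Toy inputs (hypothetical data): r2 is treated as an rRNA-derived read
RRNA_IDS="$(mktemp)"
DEMO_FASTQ="$(mktemp)"
printf 'r2\n' > "${RRNA_IDS}"
printf '@r1\nACGT\n+\nIIII\n@r2 len=4\nGGGG\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' > "${DEMO_FASTQ}"

FILTERED="$(filter_fastq_by_id "${RRNA_IDS}" "${DEMO_FASTQ}")"
printf '%s\n' "${FILTERED}"
rm -f "${RRNA_IDS}" "${DEMO_FASTQ}"
```

In practice an aligner can often write the unaligned reads directly, for example with Bowtie 2's `--un` option, which avoids the separate ID list; whether the authors did this is not stated.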

    $ Bash example
    # Create a directory for reference data
    mkdir -p ref_data
    
    # Download the GRCh37.p13 genome (RefSeq assembly GCF_000001405.25).
    # The GEO record lists Genome_build: GRCh37.p13; "RefSeq (v38)" in the step text is assumed
    # here to refer to the annotation release, not to GRCh38. Verify the paths before use.
    # wget -P ref_data https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
    # gunzip ref_data/GCF_000001405.25_GRCh37.p13_genomic.fna.gz
    # mv ref_data/GCF_000001405.25_GRCh37.p13_genomic.fna ref_data/GRCh37.p13_genomic.fna
    
    # Build Bowtie2 index for GRCh37.p13 (TopHat v2 uses Bowtie2 by default)
    # bowtie2-build ref_data/GRCh37.p13_genomic.fna ref_data/grch37_refseq_index
    
    # Download RefSeq GTF for GRCh37.p13
    # wget -P ref_data https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gtf.gz
    # gunzip ref_data/GCF_000001405.25_GRCh37.p13_genomic.gtf.gz
    # mv ref_data/GCF_000001405.25_GRCh37.p13_genomic.gtf ref_data/grch37_refseq.gtf
    
    # Installation of TopHat
    # conda install -c bioconda tophat=2.0.13
    
    # Define input and output
    INPUT_READS="filtered_reads.fastq" # Placeholder for reads after removing rRNA/tRNA/mito
    OUTPUT_DIR="tophat_output"
    GENOME_INDEX_PREFIX="ref_data/grch37_refseq_index"
    GTF_FILE="ref_data/grch37_refseq.gtf"
    
    # Create output directory
    mkdir -p "${OUTPUT_DIR}"
    
    # Execute TopHat
    tophat -o "${OUTPUT_DIR}" \
           -G "${GTF_FILE}" \
           "${GENOME_INDEX_PREFIX}" \
           "${INPUT_READS}"
  3.

    Finally all read counts that mapped uniquely to genes were extracted for expression analysis with the help of samtools, v1.1.

    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.1
    
    # samtools idxstats reports, per reference sequence (chromosome/contig): name, sequence length,
    # number of mapped reads, and number of unmapped reads. It does not produce gene-level counts.
    # The step text ("mapped uniquely to genes") implies filtering and gene assignment that the
    # source does not describe. Gene-level counts are typically obtained with tools such as
    # featureCounts or htseq-count, run on a BAM file restricted to unique alignments
    # (e.g., samtools view -F 0x104 drops unmapped and secondary records; in TopHat output,
    # multi-mapped reads also carry an NH:i value above 1).
    # Replace 'input.bam' with your alignment file (TopHat writes accepted_hits.bam)
    # and 'output_read_counts.txt' with your desired output file name.
    # idxstats needs a coordinate-sorted, indexed BAM, so index it first.
    samtools index input.bam
    samtools idxstats input.bam > output_read_counts.txt
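
To illustrate what "mapped uniquely to genes" could mean in practice, here is a rough sketch that works on SAM text with awk. It is not the authors' method. It assumes that unique alignments carry the `NH:i:1` tag (as TopHat writes), and it assigns a read to a gene when its leftmost alignment position falls inside the gene interval, ignoring strand, splicing and overlapping genes. The gene table format, the function name `count_unique_reads_per_gene` and the toy records are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Count uniquely aligned reads (NH:i:1) per gene.
# $1: gene table, tab-separated: gene, chrom, start, end (1-based, inclusive)
# $2: SAM text (header lines allowed)
count_unique_reads_per_gene() {
    awk -F '\t' '
        FILENAME == ARGV[1] { g[++n] = $1; chr[n] = $2; s[n] = $3; e[n] = $4; cnt[n] = 0; next }
        /^@/ { next }   # skip SAM header lines
        {
            unique = 0
            for (i = 12; i <= NF; i++) if ($i == "NH:i:1") unique = 1
            if (!unique) next   # drop multi-mapped or untagged records
            for (j = 1; j <= n; j++)
                if ($3 == chr[j] && $4 + 0 >= s[j] + 0 && $4 + 0 <= e[j] + 0) cnt[j]++
        }
        END { for (j = 1; j <= n; j++) printf "%s\t%d\n", g[j], cnt[j] }
    ' "$1" "$2"
}

# Toy inputs (hypothetical data)
GENES="$(mktemp)"
DEMO_SAM="$(mktemp)"
printf 'GENE_A\tchr1\t100\t200\nGENE_B\tchr2\t1\t100\n' > "${GENES}"
printf '@HD\tVN:1.0\tSO:coordinate\n' > "${DEMO_SAM}"
printf 'r1\t0\tchr1\t150\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "${DEMO_SAM}"
printf 'r2\t0\tchr1\t160\t3\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:2\n' >> "${DEMO_SAM}"
printf 'r3\t16\tchr1\t900\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "${DEMO_SAM}"
printf 'r4\t0\tchr2\t20\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "${DEMO_SAM}"
printf 'r5\t0\tchr1\t180\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "${DEMO_SAM}"

GENE_COUNTS="$(count_unique_reads_per_gene "${GENES}" "${DEMO_SAM}")"
echo "${GENE_COUNTS}"
rm -f "${GENES}" "${DEMO_SAM}"
```

On real data the SAM text could be produced with `samtools view -h input.bam`. For production analysis a dedicated counter such as featureCounts or htseq-count is the safer choice, since it handles spliced alignments and ambiguous overlaps.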

Tools Used

Raw Source Text
Linker tags were removed from RNA sequencing and ribosome profiling reads by the FASTX Toolkit, v0.0.13 (http://hannonlab.cshl.edu/fastx_toolkit/)
All reads that mapped to rRNAs, tRNAs or mitochondrial rRNAs were removed, and the remaining reads were mapped to RefSeq (v38) by TopHat v2.0.13.
Finally all read counts that mapped uniquely to genes were extracted for expression analysis with the help of samtools, v1.1.
Genome_build: GRCh37.p13
Supplementary_files_format_and_content: .txt files report raw read counts that mapped uniquely to genes