GSE77702 Processing Pipeline

RNA-Seq · code examples · 5 steps

Publication

Distinct and shared functions of ALS-associated proteins TDP-43, FUS and TAF15 revealed by multisystem analyses.

Nature Communications (2016), PMID 27378374

Dataset

Distinct and shared functions of ALS-associated TDP-43, FUS, and TAF15 revealed by comprehensive multi-system integrative analyses [RNA-Seq_human]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

RNA-seq libraries were first trimmed of polyA tails, adapters, and low-quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.

cutadapt v4.0 (Inferred)

$ Bash example

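# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders; single-end input is assumed here,
# since the source text does not state the library layout)
INPUT_FASTQ="raw_rna_seq.fastq.gz"
OUTPUT_FASTQ="trimmed_rna_seq.fastq.gz"

# Run cutadapt to trim polyA tails, adapters, and low-quality ends.
# -b: adapter that may match at either end of the read
# --times 2: try to remove adapters up to twice per read
# -e 0: allow no mismatches in adapter matches
# -O 5: require at least 5 nt of overlap between read and adapter
# --quality-cutoff 6: trim low-quality bases from the 3' end
# -m 18: discard reads shorter than 18 nt after trimming
# For paired-end data, -b applies to read 1 only: repeat each adapter with -B
# for read 2 and add -p for the second output file.
cutadapt \
  --match-read-wildcards \
  --times 2 \
  -e 0 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -b TCGTATGCCGTCTTCTGCTTG \
  -b ATCTCGTATGCCGTCTTCTGCTTG \
  -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
  -b TGGAATTCTCGGGTGCCAAGG \
  -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
  -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"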

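As a quick sanity check on the trimming step, the sketch below counts reads shorter than the -m 18 minimum in a FASTQ file; a file produced with -m 18 should report 0. The helper name count_short_reads is hypothetical and not part of the published pipeline, and the function assumes standard four-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper (not from the source): count reads shorter than a
# minimum length in a FASTQ file. Assumes four-line FASTQ records.
count_short_reads() {
  local fastq="$1"
  local min_len="${2:-18}"
  local reader="cat"
  # Decompress on the fly when the file name ends in .gz
  case "$fastq" in
    *.gz) reader="gzip -dc" ;;
  esac
  # Line 2 of every 4-line record is the sequence
  $reader "$fastq" | awk -v min="$min_len" \
    'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }'
}
```

For instance, count_short_reads trimmed_rna_seq.fastq.gz 18 (placeholder file name) can be run on the cutadapt output before moving on to alignment.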

Reads were then mapped against a database of repetitive elements derived from RepBase18.05.

bowtie v1.2.3 (Inferred)

$ Bash example

# Install bowtie (if not already installed)
# conda install -c bioconda bowtie

# Placeholder for the RepBase 18.05 index.
# The source gives no download location. Obtain the RepBase 18.05 FASTA from its
# distributor and build the bowtie index once. File names below are placeholders:
# bowtie-build RepBase18.05.fasta RepBase18.05_index

REPBASE_INDEX="RepBase18.05_index" # Basename of the bowtie index for RepBase 18.05
INPUT_FASTQ="input_reads.fastq" # Replace with your adapter-trimmed reads (decompress first if needed)
MAPPED_SAM="mapped_to_repeats.sam" # Output SAM file for reads mapping to repeats
UNMAPPED_FASTQ="unmapped_from_repeats.fastq" # Reads that did NOT map to repeats (bowtie 1 writes --un output uncompressed)

# Map reads to the repetitive elements database.
# The alignment options here are inferred; they differ from the parameters
# reported by the authors (-S -q -p 16 -e 100 -l 20).
# -v 2: Allow up to 2 mismatches
# -m 1: Suppress all alignments for reads with more than 1 reportable alignment
# --best --strata: Report only alignments from the best mismatch stratum
# -S: Output alignments in SAM format
# -p 8: Use 8 threads
# --un: Write reads that fail to align to the specified file
bowtie -v 2 -m 1 --best --strata -S -p 8 --un "${UNMAPPED_FASTQ}" "${REPBASE_INDEX}" "${INPUT_FASTQ}" > "${MAPPED_SAM}"

Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).

Bowtie v1.0.0

$ Bash example

# Install Bowtie 1.0.0 (example using conda)
# conda install -c bioconda bowtie=1.0.0

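# Align reads using Bowtie 1.0.0
# Assuming 'repbase_index' is the basename for the Bowtie index generated from Repbase sequences
# Assuming 'reads.fastq' is the input FASTQ file (uncompressed)
# -S: Output alignments in SAM format
# -q: Input reads are in FASTQ format
# -p 16: Use 16 threads
# -e 100: Maximum permitted sum of quality values at mismatched positions
# -l 20: Seed length of 20 nt
# Output will be in SAM format to 'output.sam'
bowtie -S -q -p 16 -e 100 -l 20 repbase_index reads.fastq > output.sam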

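The next step aligns only the reads that did not map to Repbase, but the reported Bowtie command writes all reads to a single SAM file. One way to recover the unmapped reads is to filter on SAM FLAG bit 0x4 (read unmapped) and write them back out as FASTQ. The following is a minimal sketch under stated assumptions, not the authors' method: the helper name sam_unmapped_to_fastq is hypothetical, reads are assumed to be single-end, and the SAM file is assumed to contain the unaligned reads.

```shell
# Hypothetical helper (not from the source): print unmapped reads from a
# SAM file as FASTQ records on standard output.
sam_unmapped_to_fastq() {
  awk -F '\t' '
    /^@/ { next }             # skip SAM header lines
    int($2 / 4) % 2 == 1 {    # FLAG bit 0x4 is set, so the read is unmapped
      printf "@%s\n%s\n+\n%s\n", $1, $10, $11
    }' "$1"
}
```

For example, sam_unmapped_to_fastq output.sam > reads_not_mapped_to_repbase.fastq would produce an input file for the genome alignment step (both file names are placeholders).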

Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.

STAR v2.3.0e

$ Bash example

# Install STAR (example using conda; this old release may not be packaged,
# in which case it has to be built from the archived STAR source)
# conda install -c bioconda star=2.3.0e

# Define variables
INPUT_READS="reads_not_mapped_to_repbase.fastq" # Placeholder; uncompressed FASTQ (gzipped input would need --readFilesCommand zcat)
GENOME_DIR="path/to/hg19_star_index" # Placeholder for the STAR genome index for hg19
OUTPUT_PREFIX="aligned_to_hg19_" # STAR appends names such as Aligned.out.sam to this prefix

# Reference Dataset: hg19 human genome (UCSC assembly)
# Download hg19 fasta from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz)
# Download a corresponding GTF file (e.g., from GENCODE or UCSC Table Browser) for splice junction annotation.

# Example command to build STAR genome index (run once).
# Option names follow current STAR documentation and are not verified against 2.3.0e:
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles hg19.fa \
#      --sjdbGTFfile genes.gtf \
#      --runThreadN 8 # Adjust number of threads as needed

# Align reads to hg19 using STAR
# --outSAMunmapped Within: keep unmapped reads in the alignment output
# --outFilterMultimapNmax 1: keep only reads that map to a single locus
# --outFilterMultimapScoreRange 1: score range below the best score within which alignments count as multimappers
# Note: --outSAMtype BAM SortedByCoordinate is omitted because it appears to be available only in
# later STAR releases; with 2.3.0e the alignments are written as SAM (${OUTPUT_PREFIX}Aligned.out.sam).
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 1 \
     --outFilterMultimapScoreRange 1 \
     --runThreadN 8 # Adjust number of threads as needed

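Because --outSAMunmapped Within keeps unmapped reads in the same output as the aligned ones, a simple tally of FLAG bit 0x4 gives a rough view of how many records mapped. The sketch below is an illustration only: the helper name sam_mapping_summary is hypothetical, and it counts SAM records rather than reads, which is equivalent only for single-end data restricted to unique alignments (as --outFilterMultimapNmax 1 implies).

```shell
# Hypothetical helper (not from the source): count mapped and unmapped
# records in a SAM file using FLAG bit 0x4.
sam_mapping_summary() {
  awk -F '\t' '
    /^@/ { next }                              # skip SAM header lines
    int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4 set
    { mapped++ }
    END { printf "mapped\t%d\nunmapped\t%d\n", mapped, unmapped }' "$1"
}
```

It could be applied to the STAR output, for example sam_mapping_summary aligned_to_hg19_Aligned.out.sam (placeholder file name), and compared against the summary that STAR itself writes to its log files.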

Counts of reads for each gene annotated in GENCODE v17 were calculated with featureCounts.

featureCounts v2.0.3 (Inferred)

$ Bash example

# Install Subread package (which includes featureCounts)
# conda install -c bioconda subread

# Placeholder for Gencode v17 annotation GTF file
# Download from Gencode archives if needed, e.g., for GRCh37/hg19
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_17/gencode.v17.annotation.gtf.gz
# gunzip gencode.v17.annotation.gtf.gz
GENCODE_GTF="gencode.v17.annotation.gtf"

# Placeholder for input alignment file (featureCounts accepts SAM or BAM)
INPUT_BAM="input_reads.bam"

# Output file for gene counts
OUTPUT_COUNTS="gene_counts.txt"

# Run featureCounts to calculate read counts for each gene.
# The source does not report featureCounts options; all settings below are assumptions.
# -a: Annotation file (GTF/GFF)
# -o: Output file for counts
# -F GTF: Specify annotation file format as GTF
# -t exon: Count reads overlapping 'exon' features
# -g gene_id: Aggregate counts by 'gene_id' attribute
# -s 0: Unstranded (0), forward (1), or reverse (2); unstranded is assumed because strandedness is not stated
# -T 8: Use 8 threads (adjust as needed)
# Multi-mapping options (-M, --fraction) are left out: the STAR step kept only
# uniquely mapping reads, so they would have no effect on these alignments.
featureCounts -a "${GENCODE_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              -F GTF \
              -t exon \
              -g gene_id \
              -s 0 \
              -T 8 \
              "${INPUT_BAM}"

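The supplementary file is described as a count file with read counts per sample, whereas featureCounts output also carries a leading comment line and annotation columns (Chr, Start, End, Strand, Length) before the counts. The sketch below reduces that output to a gene ID column plus the count columns. It is an illustration under assumptions: the helper name counts_to_matrix is hypothetical, and the layout assumed is the standard featureCounts table with counts starting in column 7.

```shell
# Hypothetical helper (not from the source): keep the gene ID column and
# the per-sample count columns from a featureCounts output table.
counts_to_matrix() {
  # Drop the leading '#' comment line, then keep column 1 and columns 7 onward
  grep -v '^#' "$1" | cut -f 1,7-
}
```

For example, counts_to_matrix gene_counts.txt > gene_counts.matrix.txt (placeholder file names) yields a tab-separated table with one row per gene.
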
Tools Used

cutadapt

Bowtie

STAR

featureCounts

Raw Source Text

RNA-seq libraries were first trimmed of polyA tails, adapters, and low quality ends using cutadapt with parameters --match-read-wildcards --times 2 -e 0 -O 5 --quality-cutoff 6 -m 18 -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b TGGAATTCTCGGGTGCCAAGG -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT.
Reads were then mapped against a database of repetitive elements derived from RepBase18.05. Bowtie version 1.0.0 with parameters -S -q -p 16 -e 100 -l 20 was used to align reads against an index generated from Repbase sequences (Langmead et al., 2009).
Reads not mapped to Repbase sequences were aligned to the hg19 human genome (UCSC assembly) using STAR (Dobin et al., 2013) version 2.3.0e with parameters --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1.
counts of reads for each gene annotated in gencode v17 were calculated from featureCounts
Genome_build: hg19
Supplementary_files_format_and_content: count file, contains counts of reads for each sample