GSE201897 Processing Pipeline

RNA-Seq, 3 steps

Publication

Science Translational Medicine (2022), PMID 35767654

Dataset

GSE201897

MECP2-related pathways are dysregulated in a cortical organoid model of Myotonic dystrophy [bulk RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).
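
The adapter-trimming part of this step could be sketched as follows. This is a minimal dry-run sketch, not the authors' command: the adapter sequence, minimum length, and file names are all hypothetical placeholders, because the source states only that Cutadapt v1.14 was used. The script prints the command for review and runs it only if `cutadapt` and the input files are present.

```shell
#!/usr/bin/env bash
set -euo pipefail

# All values below are assumptions for illustration; none are given in the source.
ADAPTER_FWD="AGATCGGAAGAGC"   # common Illumina adapter prefix; confirm for your library kit
ADAPTER_REV="AGATCGGAAGAGC"
MIN_LEN=18                    # discard reads shorter than this after trimming
RAW_R1="sample_R1.fastq.gz"
RAW_R2="sample_R2.fastq.gz"
TRIMMED_R1="sample_trimmed_R1.fastq.gz"
TRIMMED_R2="sample_trimmed_R2.fastq.gz"

# Build the command as an array so every argument stays correctly quoted.
cutadapt_cmd=(
  cutadapt
  -a "${ADAPTER_FWD}"    # 3' adapter on read 1
  -A "${ADAPTER_REV}"    # 3' adapter on read 2
  -m "${MIN_LEN}"        # minimum read length to keep
  -o "${TRIMMED_R1}"     # trimmed read 1 output
  -p "${TRIMMED_R2}"     # trimmed read 2 output
  "${RAW_R1}" "${RAW_R2}"
)

# Print the command so it can be reviewed before anything runs.
printf '%s\n' "${cutadapt_cmd[*]}"

# Execute only when cutadapt is installed and both inputs exist.
if command -v cutadapt >/dev/null 2>&1 && [[ -f "${RAW_R1}" && -f "${RAW_R2}" ]]; then
  "${cutadapt_cmd[@]}"
else
  echo "Dry run only: cutadapt or the input FASTQ files were not found." >&2
fi
```

For single-end data, drop the `-A` and `-p` options and the second input file.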

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star=2.4.0i

# --- Reference Data Preparation (Conceptual) ---
# The description states mapping to "human-specific repetitive elements from RepBase (version 18.05)".
# This implies a STAR genome index was generated using these sequences.
# Example command to generate such an index (assuming 'repbase_18.05_human_repeats.fasta' is the FASTA file):
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_RepBase_18.05_Index \
#      --genomeFastaFiles /path/to/repbase_18.05_human_repeats.fasta \
#      --runThreadN 8 # Adjust thread count as needed

# --- Alignment Step ---
# Assuming adapter-trimmed RNAseq reads are available (e.g., from Cutadapt)
# and a STAR genome index for RepBase 18.05 human repetitive elements has been generated.

# Define variables for clarity
STAR_GENOME_DIR="/path/to/STAR_RepBase_18.05_Index" # Placeholder for the RepBase genome index
TRIMMED_READS_R1="sample_trimmed_R1.fastq.gz" # Placeholder for trimmed forward reads
TRIMMED_READS_R2="sample_trimmed_R2.fastq.gz" # Placeholder for trimmed reverse reads (adjust for single-end if needed)
OUTPUT_PREFIX="sample_repbase_alignment"
NUM_THREADS=8 # Adjust thread count as needed

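# The inputs are gzip-compressed, so STAR needs --readFilesCommand to decompress them.
# --outReadsUnmapped Fastx (an assumption, not stated in the source) writes reads that did not
# align to the repeat index to ${OUTPUT_PREFIX}_Unmapped.out.mate1/mate2 for the next step.
STAR --genomeDir "${STAR_GENOME_DIR}" \
     --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx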

Repeat-mapping reads were removed, and the remaining reads were mapped to the human genome assembly (hg19) with STAR.
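
The source does not say how repeat-mapping reads were removed. One possible approach, sketched here on toy data, is to collect the IDs of reads that aligned to the repeat index and drop those records from the trimmed FASTQ. The function name `filter_repeat_reads`, the file names, and the reads themselves are hypothetical. In practice the ID list might come from the repeat alignment BAM (for example with `samtools view -F 4 repbase.bam | cut -f1 | sort -u`), or the filter could be skipped entirely by having STAR write unmapped reads with `--outReadsUnmapped Fastx`.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Toy FASTQ with three reads (made-up data, for illustration only).
cat > "${workdir}/trimmed.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTTTCCCC
+
IIIIIIII
@read3
GGGGAAAA
+
IIIIIIII
EOF

# IDs of reads that aligned to the repeat index, one per line (here: read2 only).
printf 'read2\n' > "${workdir}/repeat_ids.txt"

# Print only the 4-line FASTQ records whose read ID is absent from the ID list.
filter_repeat_reads() {
  local ids="$1" fastq="$2"
  awk -v ids="${ids}" '
    BEGIN {
      # Load the IDs to drop; an empty list simply drops nothing.
      while ((getline line < ids) > 0) drop[line] = 1
      close(ids)
    }
    FNR % 4 == 1 {                 # header line of each record
      id = substr($1, 2)           # strip the leading "@"
      sub(/\/[12]$/, "", id)       # strip an optional /1 or /2 mate suffix
      keep = !(id in drop)
    }
    keep                           # print every line of a kept record
  ' "${fastq}"
}

filter_repeat_reads "${workdir}/repeat_ids.txt" "${workdir}/trimmed.fastq" \
  > "${workdir}/repeat_filtered.fastq"
cat "${workdir}/repeat_filtered.fastq"
```

For paired-end data, the same ID list would be applied to both mate files so that the pairs stay in sync.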

STAR v2.7.10a (inferred; the source names only v2.4.0i, for the repeat-mapping step) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with the actual path to your pre-built hg19 STAR genome index
GENOME_DIR="/path/to/STAR_index/hg19"
# Replace with your input R1 FASTQ file (e.g., after repeat-mapping reads removal)
READS_R1="sample_R1.fastq.gz"
# Replace with your input R2 FASTQ file (remove this line if single-end reads)
READS_R2="sample_R2.fastq.gz"
OUTPUT_PREFIX="sample_aligned_"
THREADS=16 # Adjust based on available CPU cores

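# Note: "Repeat-mapping reads were removed" refers to discarding reads that aligned to the
# RepBase index in the previous step, so the input here should be the reads left after that filter.
# --outFilterMultimapNmax 1 (report uniquely mapping reads only) and the other filter and alignment
# parameters below are assumptions; the source does not list the STAR options used.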

# Run STAR alignment
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outFilterMultimapNmax 1 \
  --outFilterType BySJout \
  --outFilterScoreMinOverLread 0.3 \
  --outFilterMatchNminOverLread 0.3 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

Read counts for all genes annotated in GENCODE (hg19) were calculated using the read summarization program featureCounts (Liao et al., 2014).

featureCounts v1.4.6-p5

$ Bash example

# Install Subread (which includes featureCounts)
# conda install -c bioconda subread

# Download GENCODE hg19 annotation (if not already available)
# For example, from GENCODE archive:
# wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

GENCODE_GTF="gencode.v19.annotation.gtf" # Path to your GENCODE hg19 GTF file
INPUT_BAM="input.bam" # Placeholder for your aligned BAM file(s)
OUTPUT_COUNTS="gene_counts.txt"

# Calculate read counts for all genes using featureCounts
# -a: Annotation file
# -o: Output file
# -F GTF: Specify GTF format for the annotation file
# -t exon: Specify 'exon' as the feature type to count
# -g gene_id: Specify 'gene_id' as the attribute to group features by (summarizes exon counts to gene level)
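# -s 0: Assume unstranded data (1 = stranded, 2 = reversely stranded); strandedness is not stated in the source
# -T 8: Use 8 threads (adjust as needed for performance)
# For paired-end data, add -p to count fragments instead of reads (Subread >= 2.0.2 also needs --countReadPairs)
featureCounts -a "${GENCODE_GTF}" -o "${OUTPUT_COUNTS}" -F GTF -t exon -g gene_id -s 0 -T 8 "${INPUT_BAM}"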
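
The supplementary files for this dataset contain RPKMs alongside raw gene counts, but the source does not describe how the RPKMs were computed. A minimal sketch, assuming the standard definition RPKM = count × 10^9 / (gene length × total counts) and using the sum of assigned counts in the table as the library size, is given here. The toy table mimics the featureCounts output layout (a comment line, a header row, then `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, and one count column); the function name `compute_rpkm` and all values are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

counts_file="$(mktemp)"

# Toy table in featureCounts layout (tab-separated); all values are made up.
printf '%s\n' \
  '# Program:featureCounts (toy example)' \
  $'Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam' \
  $'GENE_A\tchr1\t100\t1099\t+\t1000\t500' \
  $'GENE_B\tchr1\t5000\t6999\t-\t2000\t1500' \
  $'GENE_C\tchr2\t300\t799\t+\t500\t0' > "${counts_file}"

# Compute RPKM for the first count column (field 7) of a featureCounts table.
compute_rpkm() {
  awk -F '\t' '
    /^#/ { next }                                  # skip the program/command comment line
    !header_seen { header_seen = 1; next }         # skip the column header row
    { gene[++n] = $1; len[n] = $6; cnt[n] = $7; total += $7 }
    END {
      print "Geneid\tLength\tCount\tRPKM"
      for (i = 1; i <= n; i++) {
        # Guard against division by zero for empty libraries or zero-length genes.
        rpkm = (total > 0 && len[i] > 0) ? cnt[i] * 1e9 / (len[i] * total) : 0
        printf "%s\t%d\t%d\t%.2f\n", gene[i], len[i], cnt[i], rpkm
      }
    }
  ' "$1"
}

rpkm_table="$(compute_rpkm "${counts_file}")"
printf '%s\n' "${rpkm_table}"
```

With a library size of 2000 assigned reads, GENE_A (500 reads over 1000 bp) comes out at 250000.00 and GENE_B (1500 reads over 2000 bp) at 375000.00. Using total mapped reads instead of assigned reads as the denominator is an equally common convention and would change the values.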

Tools Used

Cutadapt

STAR

featureCounts

Raw Source Text

RNAseq reads were adapter-trimmed using Cutadapt (v1.14) and mapped to human-specific repetitive elements from RepBase (version 18.05) by STAR (v2.4.0i) (Dobin et al., 2013).
Repeat-mapping reads were removed, and remaining reads were mapped to the human genome assembly (hg19) with STAR
Read counts for all genes annotated in GENCODE (hg19) were calculated using the read summarization program featureCounts (Liao et al., 2014).
Assembly: hg19
Supplementary files format and content: .txts with raw gene counts and RPKMs for each experimental group
