GSE210263 Processing Pipeline

RNA-Seq, 10 steps

Publication

Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.

Proceedings of the National Academy of Sciences of the United States of America (2024) — PMID 39141348

Dataset

Transcriptomic profiles of muscular dystrophy with myositis (mdm) in extensor digitorum longus, psoas, and soleus muscles from mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Post sequencing read quality was checked using the FastQC quality control tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for high throughput sequence data.

FastQC (version not specified)

$ Bash example

# Install FastQC (e.g., using conda)
# conda install -c bioconda fastqc

# Example usage: Check quality of a gzipped FASTQ file
# Replace 'your_reads.fastq.gz' with the actual input file(s)
fastqc your_reads.fastq.gz

If the per-base quality score was below 20 at any position along the 75 bp length stretch, those samples were processed using sliding window quality filtering (window size = 4 bp) in Trimmomatic v0.32.

Trimmomatic v0.32

$ Bash example

# Install Trimmomatic (if not already installed)
# conda install -c bioconda trimmomatic

# Assuming paired-end reads for a typical scenario. Adjust to SE for single-end if needed.
# Replace input_R1.fastq.gz, input_R2.fastq.gz with your actual input files.
# Replace output_R1_paired.fastq.gz, output_R1_unpaired.fastq.gz, etc., with your desired output file names.

# Trimmomatic v0.32 command for sliding-window quality filtering.
# SLIDINGWINDOW:4:20 scans each read from the 5' end with a 4-base window and cuts the read
# at the first window whose average quality falls below 20; the rest of the read is discarded.

java -jar /path/to/trimmomatic-0.32.jar PE \
    input_R1.fastq.gz input_R2.fastq.gz \
    output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
    output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
    SLIDINGWINDOW:4:20
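
The trigger for trimming (a per-base quality score below 20 at any position) could be checked automatically from FastQC's text report. The sketch below is illustrative and not the authors' procedure: the function name `needs_trimming` is hypothetical, it assumes the `fastqc_data.txt` file found inside each FastQC output archive, and it assumes the mean quality per position is the statistic of interest, which the source does not specify.

```shell
# Hypothetical helper: succeeds (exit status 0) if any position in the
# "Per base sequence quality" module has a mean quality below the threshold.
# $1: path to fastqc_data.txt   $2: quality threshold (default 20)
needs_trimming() {
    awk -F'\t' -v thr="${2:-20}" '
        /^>>Per base sequence quality/ { in_module = 1; next }
        /^>>END_MODULE/                { in_module = 0 }
        in_module && !/^#/ && ($2 + 0) < thr { low = 1 }   # column 2 is the mean quality
        END { exit (low ? 0 : 1) }
    ' "$1"
}

# Example usage (paths are placeholders):
# unzip -o -q sample_R1_fastqc.zip
# if needs_trimming sample_R1_fastqc/fastqc_data.txt; then
#     echo "sample_R1: run Trimmomatic"
# else
#     echo "sample_R1: use without filtering"
# fi
```

Returning an exit status rather than printing a value lets the function sit directly in an `if` statement when looping over samples.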

After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis.

repair.sh (Inferred with models/gemini-2.5-flash) v38.90

$ Bash example

# repair.sh installation (part of BBMap suite)
# Download BBMap suite:
# wget https://sourceforge.net/projects/bbmap/files/BBMap_38.90.tar.gz
# tar -xzf BBMap_38.90.tar.gz
# export PATH="/path/to/bbmap:$PATH" # Adjust path accordingly to where BBMap is extracted

# Define input files from the read-trimming step
# These are assumed to be the output from a previous trimming step, which might contain unpaired reads or out-of-sync pairs.
INPUT_TRIMMED_R1="trimmed_R1.fq.gz"
INPUT_TRIMMED_R2="trimmed_R2.fq.gz"

# Define output files for strictly paired reads
OUTPUT_PAIRED_R1="paired_R1.fq.gz"
OUTPUT_PAIRED_R2="paired_R2.fq.gz"

# Use repair.sh to ensure only paired-end reads are used for downstream analysis.
# This command takes potentially mixed paired/unpaired reads and outputs only strictly paired reads,
# discarding any reads that do not have a valid pair.
repair.sh in1="${INPUT_TRIMMED_R1}" in2="${INPUT_TRIMMED_R2}" \
          out1="${OUTPUT_PAIRED_R1}" out2="${OUTPUT_PAIRED_R2}" \
          overwrite=t # Overwrite existing output files if they exist

After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis (Table S3).

bash (Inferred with models/gemini-2.5-flash) v5.x

$ Bash example

# Assuming read-trimming produced paired-end files named sample_R1_paired.fastq.gz and sample_R2_paired.fastq.gz.
# These files are then designated for downstream analysis.

# Define variables for the paired-end read files that will be used as input for the next step.
# Replace 'sample_R1_paired.fastq.gz' and 'sample_R2_paired.fastq.gz' with actual file names from the read-trimming output.
READ1_PAIRED="sample_R1_paired.fastq.gz"
READ2_PAIRED="sample_R2_paired.fastq.gz"

# This step primarily describes the selection and availability of these files.
# The actual 'use' would be by a subsequent bioinformatics tool (e.g., an aligner).
echo "Selected paired-end reads: ${READ1_PAIRED} and ${READ2_PAIRED} for downstream analysis."
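
Because paired-end aligners expect the two mate files to list the same reads in the same order, a quick sanity check is to confirm that both files hold the same number of records. This is a minimal sketch with hypothetical helper names (`count_reads`, `pairs_in_sync`); it assumes gzipped FASTQ with four lines per record and does not verify that the read names match.

```shell
# Hypothetical helper: prints the number of records in a gzipped FASTQ file
# (assumes the standard four lines per record).
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Hypothetical helper: succeeds if both mate files contain the same number of records.
pairs_in_sync() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Example usage (file names are placeholders):
# if pairs_in_sync sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz; then
#     echo "Mate files are in sync"
# else
#     echo "Mate files differ in read count" >&2
# fi
```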

Original samples that showed a satisfactory per base quality score (>20) were used without filtering.

fastp (Inferred with models/gemini-2.5-flash) v0.23.2

$ Bash example

# Install fastp if not already installed
# conda install -c bioconda fastp

# Define input and output file paths
INPUT_FASTQ="original_sample.fastq.gz"
OUTPUT_FASTQ="processed_sample.fastq.gz"

# Execute fastp with its filters switched off, because these samples already meet the
# per-base quality criterion (>20) and were used without filtering.
# Quality filtering, length filtering, adapter trimming, and polyG trimming are disabled;
# polyX trimming is off by default in fastp, so it needs no flag.
fastp -i "${INPUT_FASTQ}" -o "${OUTPUT_FASTQ}" \
    --disable_quality_filtering \
    --disable_length_filtering \
    --disable_adapter_trimming \
    --disable_trim_poly_g

Sequencing adapters were trimmed while converting initial BCL data to fastq files from the sequencing center prior to receiving the data files and no adaptor contamination was detected in FASTQC analysis.

FastQC (version not specified)

$ Bash example

# Install FastQC (if not already installed)
# conda install -c bioconda fastqc

# Run FastQC on the fastq files to check for adapter contamination.
# The description states that adapters were trimmed prior to receiving the data,
# and FastQC was used to confirm no contamination remained.
# Assuming fastq files are in the current directory and end with .fastq.gz
fastqc *.fastq.gz

Prepared fastq files were aligned to the Mus musculus GRCm38.p4 genome annotation using the Tophat alignment tool.

TopHat v2.1.1

$ Bash example

# Install TopHat2 and Bowtie2 (both available via Bioconda; the TopHat package is named 'tophat')
# conda install -c bioconda tophat=2.1.1
# conda install -c bioconda bowtie2

# Create a directory for reference files
mkdir -p reference
cd reference

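# Download the Mus musculus GRCm38 primary assembly FASTA file
# Ensembl release 94 is used here as an example GRCm38 release; the source names GRCm38.p4
# but no annotation release, so substitute the release that matches your analysis.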
wget https://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Build Bowtie2 index for TopHat2
bowtie2-build Mus_musculus.GRCm38.dna.primary_assembly.fa GRCm38_p4_index

# Download Mus musculus GRCm38.94 GTF annotation file
wget https://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz
gunzip Mus_musculus.GRCm38.94.gtf.gz

cd ..

# Define input and output paths
READS_R1="input_R1.fastq.gz" # Replace with your actual read 1 FASTQ file
READS_R2="input_R2.fastq.gz" # Replace with your actual read 2 FASTQ file (the source describes paired-end reads)
BOWTIE2_INDEX="reference/GRCm38_p4_index"
GTF_FILE="reference/Mus_musculus.GRCm38.94.gtf"
OUTPUT_DIR="tophat_alignment_output"

# Run TopHat2 alignment
# -G: Supply known gene annotation (GTF) to guide spliced alignment
# -o: Output directory
tophat2 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${BOWTIE2_INDEX}" "${READS_R1}" "${READS_R2}"

In order to calculate the insert sizes between paired-end reads, a subset of 250,000 reads from each sample was aligned using the BWA-aln short read alignment tool available on the Galaxy web platform.

BWA v0.7.17

$ Bash example

# Install BWA, Samtools, and Seqtk (if not already installed)
# conda install -c bioconda bwa samtools seqtk

# Define variables
REF_GENOME="mm10.fa" # Placeholder for the mouse reference genome (mm10), standing in for Galaxy's built-in mm10 index
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"
NUM_READS=250000
OUTPUT_PREFIX="sample_subset"
THREADS=8 # Number of threads for BWA aln

# 1. Index the reference genome (if not already indexed)
# bwa index "${REF_GENOME}"

# 2. Subset 250,000 reads from each paired-end file
# Using seqtk sample with a seed for reproducibility
seqtk sample -s100 "${READS_R1}" "${NUM_READS}" > "${OUTPUT_PREFIX}_R1_subset.fastq"
seqtk sample -s100 "${READS_R2}" "${NUM_READS}" > "${OUTPUT_PREFIX}_R2_subset.fastq"

# 3. Align R1 reads using BWA-aln
bwa aln -t "${THREADS}" "${REF_GENOME}" "${OUTPUT_PREFIX}_R1_subset.fastq" > "${OUTPUT_PREFIX}_R1.sai"

# 4. Align R2 reads using BWA-aln
bwa aln -t "${THREADS}" "${REF_GENOME}" "${OUTPUT_PREFIX}_R2_subset.fastq" > "${OUTPUT_PREFIX}_R2.sai"

# 5. Generate paired-end SAM alignment using BWA sampe
bwa sampe "${REF_GENOME}" "${OUTPUT_PREFIX}_R1.sai" "${OUTPUT_PREFIX}_R2.sai" \
    "${OUTPUT_PREFIX}_R1_subset.fastq" "${OUTPUT_PREFIX}_R2_subset.fastq" > "${OUTPUT_PREFIX}.sam"

# 6. Convert SAM to BAM, sort, and index
samtools view -bS "${OUTPUT_PREFIX}.sam" | samtools sort -o "${OUTPUT_PREFIX}.bam"
samtools index "${OUTPUT_PREFIX}.bam"

# Clean up intermediate files (optional)
# rm "${OUTPUT_PREFIX}_R1_subset.fastq" "${OUTPUT_PREFIX}_R2_subset.fastq"
# rm "${OUTPUT_PREFIX}_R1.sai" "${OUTPUT_PREFIX}_R2.sai"
# rm "${OUTPUT_PREFIX}.sam"

The built-in reference mouse genome (mm10) was used to carry out the alignment under default settings.

STAR (Inferred with models/gemini-2.5-flash) v2.7.3a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for mm10 STAR index. This index needs to be pre-built using STAR's --runMode genomeGenerate command.
GENOME_DIR="/path/to/STAR_index/mm10" 
READ1="input_R1.fastq.gz" # Placeholder for input Read 1 FASTQ file
READ2="input_R2.fastq.gz" # Placeholder for input Read 2 FASTQ file (remove if single-end)
OUTPUT_PREFIX="star_output/" # Output directory and file prefix
NUM_THREADS=8 # Number of threads to use

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_PREFIX}"

# Note: the source text does not mention STAR. The sentence above refers to the BWA-aln pre-alignment
# on Galaxy (built-in mm10, default settings), so this STAR command is an inferred illustration only,
# and the non-default filtering options below are not taken from the source.
# Reference genome: mm10 (Mus musculus, GRCm38)
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting, adjust based on available memory

Alignment statistics for the pre-alignments were generated using CollectInsertSizeMetrics Picard tool (http://broadinstitute.github.io/picard/), and average insert sizes and standard deviations were fed into subsequent complete read alignments generated with Tophat v2.1.1

TopHat v2.1.1

$ Bash example

# Install TopHat and Bowtie2 (TopHat's aligner)
# conda install -c bioconda tophat=2.1.1
# conda install -c bioconda bowtie2

# --- Define input and reference files ---
# Placeholder for the reference genome Bowtie2 index (mouse GRCm38, as stated in the source)
# Replace with your actual path to the Bowtie2 index prefix
BOWTIE2_INDEX_PREFIX="path/to/your/genome/index/GRCm38"

# Placeholder for a GTF annotation file (an Ensembl GRCm38 GTF is assumed here; the source names no annotation release)
# Replace with your actual path to the GTF file
GTF_FILE="path/to/your/annotation/Mus_musculus.GRCm38.94.gtf"

# Placeholder for input paired-end FASTQ files
# Replace with your actual FASTQ file paths
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"

# --- Parameters derived from Picard CollectInsertSizeMetrics ---
# Take MEAN_INSERT_SIZE and STANDARD_DEVIATION from the metrics file of the BWA pre-alignment,
# round them to whole numbers (TopHat expects integers), and replace the example values below.
INSERT_SIZE_MEAN=200    # Example: MEAN_INSERT_SIZE from Picard
INSERT_SIZE_STDDEV=50   # Example: STANDARD_DEVIATION from Picard

# Note: TopHat's --mate-inner-dist expects the distance between the mates (fragment length
# minus both read lengths), whereas Picard reports the full insert size. The source does not say
# whether the authors subtracted the read lengths; doing so here is an assumption.
READ_LENGTH=75          # Read length stated in the source
MATE_INNER_DIST=$(( INSERT_SIZE_MEAN - 2 * READ_LENGTH ))

# --- TopHat alignment command ---
# Generates complete read alignments using TopHat v2.1.1
tophat2 \
  --mate-inner-dist "${MATE_INNER_DIST}" \
  --mate-std-dev "${INSERT_SIZE_STDDEV}" \
  -G "${GTF_FILE}" \
  -o tophat_output \
  "${BOWTIE2_INDEX_PREFIX}" \
  "${READS_R1}" "${READS_R2}"
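
The Picard step described above could be sketched as follows. This is a minimal sketch under stated assumptions: the helper name `insert_size_stats` and all file names are hypothetical, the Picard version is not given in the source, and the `CollectInsertSizeMetrics` call is shown only as a comment with placeholder paths.

```shell
# Collect insert-size metrics from the BWA pre-alignment (placeholder paths):
# java -jar /path/to/picard.jar CollectInsertSizeMetrics \
#     I=sample_subset.bam O=sample_insert_metrics.txt H=sample_insert_histogram.pdf

# Hypothetical helper: prints "<MEAN_INSERT_SIZE> <STANDARD_DEVIATION>" from the
# first data row of a Picard insert-size metrics file, locating columns by name.
insert_size_stats() {
    awk -F'\t' '
        /^## METRICS CLASS/ {
            if ((getline) <= 0) exit 1            # header row with the column names
            for (i = 1; i <= NF; i++) col[$i] = i
            if ((getline) <= 0) exit 1            # first data row
            print $(col["MEAN_INSERT_SIZE"]), $(col["STANDARD_DEVIATION"])
            exit
        }
    ' "$1"
}

# Example usage:
# set -- $(insert_size_stats sample_insert_metrics.txt)
# echo "mean insert size: $1, standard deviation: $2"
```

Looking columns up by name rather than by position keeps the helper working if the column order differs between Picard versions. The reported values are decimals, so they would need rounding before being passed to options that take integers.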

Tools Used

TopHat

Raw Source Text

Post sequencing read quality was checked using the FastQC quality control tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for high throughput sequence data.
If the per-base quality score was below 20 at any position along the 75 bp length stretch, those samples were processed using sliding window quality filtering (window size = 4 bp) in Trimmomatic v0.32. After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis
After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis (Table S3). Original samples that showed a satisfactory per base quality score (>20) were used without filtering. Sequencing adapters were trimmed while converting initial BCL data to fastq files from the sequencing center prior to receiving the data files and no adaptor contamination was detected in FASTQC analysis.
Prepared fastq files were aligned to the Mus musculus GRCm38.p4 genome annotation using the Tophat alignment tool. In order to calculate the insert sizes between paired-end reads,  a subset of 250,000 reads from each sample was aligned using the BWA-aln short read alignment tool available on the Galaxy web platform. The built-in reference mouse genome (mm10) was used to carry out the alignment under default settings.
Alignment statistics for the pre-alignments were generated using CollectInsertSizeMetrics Picard tool (http://broadinstitute.github.io/picard/), and average insert sizes and standard deviations were fed into subsequent complete read alignments generated with Tophat v2.1.1
Assembly: GRCm38.p4
Supplementary files format and content: tab delimited count file (.txt)
