GSE210263 Processing Pipeline
Publication
Zfp697 is an RNA-binding protein that regulates skeletal muscle inflammation and remodeling.
Proceedings of the National Academy of Sciences of the United States of America (2024), PMID 39141348
Dataset
GSE210263: Transcriptomic profiles of muscular dystrophy with myositis (mdm) in extensor digitorum longus, psoas, and soleus muscles from mice
Processing Steps
1
Post-sequencing read quality was checked using FastQC, a quality control tool for high-throughput sequence data (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
FastQC (version not specified)
$ Bash example
# Install FastQC (e.g., using conda)
# conda install -c bioconda fastqc

# Example usage: check the quality of a gzipped FASTQ file
# Replace 'your_reads.fastq.gz' with the actual input file(s)
fastqc your_reads.fastq.gz
2
If the per-base quality score was below 20 at any position along the 75 bp length stretch, those samples were processed using sliding window quality filtering (window size = 4 bp) in Trimmomatic v0.32.
Trimmomatic v0.32
$ Bash example
# Install Trimmomatic (if not already installed)
# conda install -c bioconda trimmomatic

# Paired-end (PE) mode is shown; use SE for single-end data.
# Replace the input and output file names with your own.

# SLIDINGWINDOW:4:20 scans each read from the 5' end with a 4-base window and
# cuts the read once the average quality within the window falls below 20.
# The window size (4) comes from the source text; the threshold of 20 is an
# assumption based on the Q20 criterion stated above.
java -jar /path/to/trimmomatic-0.32.jar PE \
  input_R1.fastq.gz input_R2.fastq.gz \
  output_R1_paired.fastq.gz output_R1_unpaired.fastq.gz \
  output_R2_paired.fastq.gz output_R2_unpaired.fastq.gz \
  SLIDINGWINDOW:4:20
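The decision described in this step (trim only when some position falls below Q20) can be automated from FastQC's text report. The following is a minimal sketch, not the authors' procedure: it assumes the "Mean" column of the "Per base sequence quality" module in fastqc_data.txt is the statistic compared against the threshold, which the source does not specify. The function name needs_trimming is hypothetical.

```shell
# needs_trimming <fastqc_data.txt> [threshold]
# Prints "trim" if any position in the "Per base sequence quality" module
# has a mean quality below the threshold (default 20), otherwise "keep".
# Assumption: the mean per-base quality is the relevant statistic.
needs_trimming() {
    local data_file="$1"
    local threshold="${2:-20}"
    awk -F'\t' -v thr="$threshold" '
        # Enter the module of interest at its header line
        /^>>Per base sequence quality/ { in_module = 1; next }
        # Leave the module at its end marker
        in_module && /^>>END_MODULE/ { in_module = 0 }
        # Data rows: column 1 is the position (or range), column 2 is the mean
        in_module && !/^#/ && ($2 + 0) < thr { low = 1 }
        END { print (low ? "trim" : "keep") }
    ' "$data_file"
}
```

Running fastqc with --extract unpacks the report, so for an input named sample.fastq.gz the call would be needs_trimming sample_fastqc/fastqc_data.txt. Samples reported as "trim" would go through Trimmomatic; samples reported as "keep" would be used as they are.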
3
After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis.
$ Bash example
# Note: Trimmomatic in PE mode already writes separate "paired" and "unpaired"
# output files, so the paired outputs can be used directly. The repair.sh call
# below is an optional re-synchronization step and an assumption, not something
# named in the source text.

# repair.sh installation (part of the BBMap suite)
# wget https://sourceforge.net/projects/bbmap/files/BBMap_38.90.tar.gz
# tar -xzf BBMap_38.90.tar.gz
# export PATH="/path/to/bbmap:$PATH"   # Adjust to where BBMap is extracted

# Input files from the read-trimming step (placeholders)
INPUT_TRIMMED_R1="trimmed_R1.fq.gz"
INPUT_TRIMMED_R2="trimmed_R2.fq.gz"

# Output files for strictly paired reads
OUTPUT_PAIRED_R1="paired_R1.fq.gz"
OUTPUT_PAIRED_R2="paired_R2.fq.gz"

# Keep only reads that still have a mate; reads without a valid pair are discarded.
# overwrite=t replaces existing output files.
repair.sh in1="${INPUT_TRIMMED_R1}" in2="${INPUT_TRIMMED_R2}" \
  out1="${OUTPUT_PAIRED_R1}" out2="${OUTPUT_PAIRED_R2}" \
  overwrite=t
4
After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis (Table S3).
$ Bash example
# Assuming read-trimming produced paired-end files named
# sample_R1_paired.fastq.gz and sample_R2_paired.fastq.gz.
# Replace these names with the actual files from the read-trimming output.
READ1_PAIRED="sample_R1_paired.fastq.gz"
READ2_PAIRED="sample_R2_paired.fastq.gz"

# This step only selects the files; they are consumed by a later tool (e.g., an aligner).
echo "Selected paired-end reads: ${READ1_PAIRED} and ${READ2_PAIRED} for downstream analysis."
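Before passing the paired files to an aligner, it can help to confirm that both mates contain the same number of records. The sketch below is an illustrative sanity check, not part of the published pipeline; the function names count_reads and check_paired are hypothetical, and the check compares record counts only (it does not verify that read names match).

```shell
# count_reads <file.fastq | file.fastq.gz>
# Prints the number of FASTQ records, assuming the standard 4 lines per record.
count_reads() {
    local fq="$1"
    case "$fq" in
        *.gz) gzip -cd "$fq" ;;
        *)    cat "$fq" ;;
    esac | awk 'END { print NR / 4 }'
}

# check_paired <R1> <R2>
# Prints "paired" if both files hold the same number of records, else "mismatch".
check_paired() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "paired"
    else
        echo "mismatch"
    fi
}
```

For example, check_paired sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz should print "paired" for files taken from Trimmomatic's paired outputs.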
5
Original samples that showed a satisfactory per-base quality score (>20) were used without filtering.
$ Bash example
# These samples were used without filtering, so the original FASTQ files can be
# passed to the aligner unchanged. The fastp call below is an optional
# pass-through and an assumption, not a tool named in the source text.

# Install fastp if not already installed
# conda install -c bioconda fastp

# Define input and output file paths (placeholders)
INPUT_FASTQ="original_sample.fastq.gz"
OUTPUT_FASTQ="processed_sample.fastq.gz"

# Run fastp with quality filtering, length filtering, adapter trimming, and
# polyG trimming all disabled. PolyX trimming is off by default in fastp.
fastp -i "${INPUT_FASTQ}" -o "${OUTPUT_FASTQ}" \
  --disable_quality_filtering \
  --disable_length_filtering \
  --disable_adapter_trimming \
  --disable_trim_poly_g
6
Sequencing adapters were trimmed by the sequencing center while converting the initial BCL data to FASTQ files, before the data files were received, and no adapter contamination was detected in the FastQC analysis.
$ Bash example
# Install FastQC (if not already installed)
# conda install -c bioconda fastqc

# Run FastQC on the FASTQ files to check for adapter contamination.
# Adapters were trimmed before the data were received; FastQC was used to
# confirm that no contamination remained.
# Assumes the FASTQ files are in the current directory and end with .fastq.gz
fastqc *.fastq.gz
7
Prepared FASTQ files were aligned to the Mus musculus GRCm38.p4 genome annotation using the TopHat alignment tool.
$ Bash example
# Install TopHat2 and Bowtie2 (both on Bioconda; the TopHat package is named 'tophat')
# conda install -c bioconda tophat bowtie2

# Create a directory for reference files
mkdir -p reference
cd reference

# Download the Mus musculus GRCm38 primary assembly FASTA file.
# Assumption: Ensembl release 94 is used here as a convenient GRCm38 release;
# it may not match the exact patch level (GRCm38.p4) used by the authors.
wget https://ftp.ensembl.org/pub/release-94/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Build the Bowtie2 index for TopHat2
bowtie2-build Mus_musculus.GRCm38.dna.primary_assembly.fa GRCm38_index

# Download the matching GTF annotation file
wget https://ftp.ensembl.org/pub/release-94/gtf/mus_musculus/Mus_musculus.GRCm38.94.gtf.gz
gunzip Mus_musculus.GRCm38.94.gtf.gz
cd ..

# Define input and output paths (placeholders)
READS_R1="input_R1.fastq.gz"
READS_R2="input_R2.fastq.gz"
BOWTIE2_INDEX="reference/GRCm38_index"
GTF_FILE="reference/Mus_musculus.GRCm38.94.gtf"
OUTPUT_DIR="tophat_alignment_output"

# Run TopHat2 alignment on the paired-end reads
# -G: supply known gene annotations (GTF)
# -o: output directory
tophat2 -G "${GTF_FILE}" -o "${OUTPUT_DIR}" "${BOWTIE2_INDEX}" "${READS_R1}" "${READS_R2}"
8
In order to calculate the insert sizes between paired-end reads, a subset of 250,000 reads from each sample was aligned using the BWA-aln short read alignment tool available on the Galaxy web platform.
$ Bash example
# Install BWA, Samtools, and Seqtk (if not already installed)
# conda install -c bioconda bwa samtools seqtk

# Define variables (file names are placeholders)
REF_GENOME="mm10.fa"   # Mouse reference genome, as stated in the next step
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"
NUM_READS=250000
OUTPUT_PREFIX="sample_subset"
THREADS=8              # Number of threads for bwa aln

# 1. Index the reference genome (if not already indexed)
# bwa index "${REF_GENOME}"

# 2. Subset 250,000 reads from each paired-end file.
# Using the same seed for both files keeps the mates in sync.
seqtk sample -s100 "${READS_R1}" "${NUM_READS}" > "${OUTPUT_PREFIX}_R1_subset.fastq"
seqtk sample -s100 "${READS_R2}" "${NUM_READS}" > "${OUTPUT_PREFIX}_R2_subset.fastq"

# 3. Align R1 reads using BWA-aln
bwa aln -t "${THREADS}" "${REF_GENOME}" "${OUTPUT_PREFIX}_R1_subset.fastq" > "${OUTPUT_PREFIX}_R1.sai"

# 4. Align R2 reads using BWA-aln
bwa aln -t "${THREADS}" "${REF_GENOME}" "${OUTPUT_PREFIX}_R2_subset.fastq" > "${OUTPUT_PREFIX}_R2.sai"

# 5. Generate the paired-end SAM alignment using bwa sampe
bwa sampe "${REF_GENOME}" "${OUTPUT_PREFIX}_R1.sai" "${OUTPUT_PREFIX}_R2.sai" \
  "${OUTPUT_PREFIX}_R1_subset.fastq" "${OUTPUT_PREFIX}_R2_subset.fastq" > "${OUTPUT_PREFIX}.sam"

# 6. Convert SAM to BAM, sort, and index
samtools view -b "${OUTPUT_PREFIX}.sam" | samtools sort -o "${OUTPUT_PREFIX}.bam" -
samtools index "${OUTPUT_PREFIX}.bam"

# Clean up intermediate files (optional)
# rm "${OUTPUT_PREFIX}_R1_subset.fastq" "${OUTPUT_PREFIX}_R2_subset.fastq"
# rm "${OUTPUT_PREFIX}_R1.sai" "${OUTPUT_PREFIX}_R2.sai"
# rm "${OUTPUT_PREFIX}.sam"
9
The built-in reference mouse genome (mm10) was used to carry out the alignment under default settings.
$ Bash example
# This step refers to the BWA-aln pre-alignment of the 250,000-read subsets.
# On Galaxy, the built-in mm10 index is selected in the tool form, so no index
# build is needed. A local equivalent is sketched below; mm10.fa and the
# subset file names are placeholders.

# Install BWA (if not already installed)
# conda install -c bioconda bwa

REF_GENOME="mm10.fa"
SUBSET_R1="sample_subset_R1_subset.fastq"
SUBSET_R2="sample_subset_R2_subset.fastq"

# Index the reference once (not needed with Galaxy's built-in genome)
# bwa index "${REF_GENOME}"

# Default settings: no extra options are passed to bwa aln or bwa sampe
bwa aln "${REF_GENOME}" "${SUBSET_R1}" > subset_R1.sai
bwa aln "${REF_GENOME}" "${SUBSET_R2}" > subset_R2.sai
bwa sampe "${REF_GENOME}" subset_R1.sai subset_R2.sai \
  "${SUBSET_R1}" "${SUBSET_R2}" > subset_mm10.sam
10
Alignment statistics for the pre-alignments were generated using the Picard CollectInsertSizeMetrics tool (http://broadinstitute.github.io/picard/), and average insert sizes and standard deviations were fed into subsequent complete read alignments generated with TopHat v2.1.1.
$ Bash example
# Install TopHat v2.1.1 and Bowtie2 (TopHat's aligner)
# conda install -c bioconda tophat=2.1.1 bowtie2

# --- Define input and reference files (all paths are placeholders) ---
# Bowtie2 index prefix for the mouse genome (GRCm38/mm10)
BOWTIE2_INDEX_PREFIX="path/to/your/genome/index/GRCm38"
# GTF annotation matching the genome build
GTF_FILE="path/to/your/annotation/Mus_musculus.GRCm38.94.gtf"
# Input paired-end FASTQ files
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"

# --- Parameters derived from Picard CollectInsertSizeMetrics ---
# Replace with MEAN_INSERT_SIZE and STANDARD_DEVIATION from the metrics file.
INSERT_SIZE_MEAN=200    # Example value
INSERT_SIZE_STDDEV=50   # Example value
READ_LENGTH=75          # Read length stated in the source text

# TopHat's --mate-inner-dist is the distance between the mates, not the full
# fragment length, so both read lengths are subtracted from the insert size.
# Assumption: the source does not state how the authors did this conversion.
MATE_INNER_DIST=$(( INSERT_SIZE_MEAN - 2 * READ_LENGTH ))

# --- TopHat alignment command ---
tophat2 \
  --mate-inner-dist "${MATE_INNER_DIST}" \
  --mate-std-dev "${INSERT_SIZE_STDDEV}" \
  -G "${GTF_FILE}" \
  -o tophat_output \
  "${BOWTIE2_INDEX_PREFIX}" \
  "${READS_R1}" "${READS_R2}"
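The hand-off from Picard to TopHat described in this step can be scripted rather than copied by hand. The sketch below is one possible way to do it, under stated assumptions: the function names insert_size_stats and mate_inner_dist are hypothetical, the first data row of the metrics table is taken as the relevant pair orientation, and the conversion to an inner distance (insert size minus twice the read length) is an assumption about how the values were fed to TopHat. The metrics file itself would come from a call such as java -jar picard.jar CollectInsertSizeMetrics I=sample_subset.bam O=insert_metrics.txt H=insert_hist.pdf.

```shell
# insert_size_stats <picard_insert_metrics.txt>
# Prints "<mean> <sd>" (rounded to integers) from the first data row of the
# InsertSizeMetrics table. Columns are located by header name because their
# order differs between Picard versions.
insert_size_stats() {
    awk -F'\t' '
        /^## METRICS CLASS/ { header_next = 1; next }
        header_next {
            for (i = 1; i <= NF; i++) col[$i] = i
            header_next = 0; data_next = 1; next
        }
        data_next {
            printf "%.0f %.0f\n", $(col["MEAN_INSERT_SIZE"]), $(col["STANDARD_DEVIATION"])
            exit
        }
    ' "$1"
}

# mate_inner_dist <mean_insert_size> <read_length>
# Converts a fragment (insert) size into the inner distance between mates.
mate_inner_dist() {
    echo $(( $1 - 2 * $2 ))
}
```

With a mean insert size of 312 and 75 bp reads, mate_inner_dist 312 75 gives 162, which would be passed to --mate-inner-dist, while the standard deviation goes to --mate-std-dev unchanged. If reads were shortened by trimming, the effective read length is smaller than 75, so the result is only an approximation for those samples.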
Tools Used
Raw Source Text
Post sequencing read quality was checked using the FastQC quality control tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) for high throughput sequence data. If the per-base quality score was below 20 at any position along the 75 bp length stretch, those samples were processed using sliding window quality filtering (window size = 4 bp) in Trimmomatic v0.32. After filtering, only the paired-end reads collected from the read-trimming were used for downstream data analysis (Table S3). Original samples that showed a satisfactory per base quality score (>20) were used without filtering. Sequencing adapters were trimmed while converting initial BCL data to fastq files from the sequencing center prior to receiving the data files and no adaptor contamination was detected in FASTQC analysis. Prepared fastq files were aligned to the Mus musculus GRCm38.p4 genome annotation using the Tophat alignment tool. In order to calculate the insert sizes between paired-end reads, a subset of 250,000 reads from each sample was aligned using the BWA-aln short read alignment tool available on the Galaxy web platform. The built-in reference mouse genome (mm10) was used to carry out the alignment under default settings. Alignment statistics for the pre-alignments were generated using CollectInsertSizeMetrics Picard tool (http://broadinstitute.github.io/picard/), and average insert sizes and standard deviations were fed into subsequent complete read alignments generated with Tophat v2.1.1.
Assembly: GRCm38.p4
Supplementary files format and content: tab delimited count file (.txt)