GSE72509 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.Science (New York, N.Y.) (2015) — PMID 26382853
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Illumina software CASAVA used for basecalling.
CASAVA vNot specified$ Bash example
# Illumina CASAVA software was used for basecalling and demultiplexing. # This process typically involves two main steps: # 1. Configuration using configureBclToFastq.pl to generate a Makefile. # 2. Execution of the Makefile using 'make' to perform the actual basecalling and demultiplexing. # Define placeholder paths for an Illumina run. # Replace these with your actual paths. INPUT_BCL_DIR="/path/to/illumina/run/folder/Data/Intensities/BaseCalls" # Directory containing BCL files SAMPLE_SHEET_PATH="/path/to/SampleSheet.csv" # Path to your SampleSheet.csv OUTPUT_FASTQ_DIR="/path/to/output/fastq_files" # Directory where FASTQ files will be generated # Create the output directory if it doesn't exist mkdir -p "${OUTPUT_FASTQ_DIR}" # Step 1: Configure basecalling and demultiplexing # This command generates a Makefile in the current working directory (or specified output dir). # It reads BCL files from the input directory and uses the sample sheet for demultiplexing. # The --mismatches parameter specifies the number of allowed mismatches for barcode demultiplexing. configureBclToFastq.pl --input-dir "${INPUT_BCL_DIR}" \ --output-dir "${OUTPUT_FASTQ_DIR}" \ --sample-sheet "${SAMPLE_SHEET_PATH}" \ --mismatches 1 # Step 2: Execute the generated Makefile # This command performs the actual basecalling and demultiplexing, producing FASTQ files. # Adjust the '-j N' parameter to specify the number of CPU cores to use for parallel processing. make -j 8 -
2
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
$ Bash example
# --- Installation (commented out) --- # conda install -c bioconda fastp # conda install -c bioconda bbmap # Provides bbduk.sh # conda install -c bioconda gmap # Provides gsnap # conda install -c bioconda samtools # --- Reference Data Preparation (commented out) --- # Download hg19 reference genome # mkdir -p references/hg19 # wget -P references/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip references/hg19/hg19.fa.gz # Build GSNAP index for hg19 # mkdir -p gsnap_indexes # gmap_build -d hg19 -D gsnap_indexes references/hg19/hg19.fa # --- Input Files --- # Assuming input reads are paired-end and gzipped READ1="input_R1.fastq.gz" READ2="input_R2.fastq.gz" OUTPUT_PREFIX="mapped_reads" # --- Pre-processing: Trimming and Quality Filtering with fastp --- # Trim adaptors, filter low quality reads, and remove polyG tails (common for Illumina) # Output will be gzipped fastp \ -i "${READ1}" \ -o "${OUTPUT_PREFIX}.trimmed_R1.fastq.gz" \ -I "${READ2}" \ -O "${OUTPUT_PREFIX}.trimmed_R2.fastq.gz" \ --detect_adapter_for_pe \ --qualified_quality_phred 15 \ --length_required 30 \ --low_complexity_filter \ --complexity_threshold 30 \ --thread 8 \ --json "${OUTPUT_PREFIX}.fastp.json" \ --html "${OUTPUT_PREFIX}.fastp.html" TRIMMED_R1="${OUTPUT_PREFIX}.trimmed_R1.fastq.gz" TRIMMED_R2="${OUTPUT_PREFIX}.trimmed_R2.fastq.gz" # --- Pre-processing: rRNA Contamination Filtering with bbduk.sh --- # Filter reads against an rRNA database. Reads matching rRNA are discarded. # Using BBMap's internal rRNA reference ('ref=ribo'). bbduk.sh \ in1="${TRIMMED_R1}" \ in2="${TRIMMED_R2}" \ out1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz" \ out2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz" \ ref=ribo \ k=31 \ hdist=1 \ stats="${OUTPUT_PREFIX}.bbduk_rRNA_stats.txt" \ threads=8 RRNA_FILTERED_R1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz" RRNA_FILTERED_R2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz" # --- Mapping to hg19 whole genome using GSNAP --- # -d hg19: Use the hg19 database # -D gsnap_indexes: Directory where the hg19 database was built # -A sam: Output in SAM format # -o: Output file name # --gunzip: Decompress gzipped input files on the fly # -t 8: Use 8 threads # --pairmax-dna=1000: Max insert size for DNA reads (adjust based on expected fragment size) # GSNAP is splice-aware, but for "whole genome" mapping, it's often used for general alignment. # If this were explicitly RNA-seq, additional splicing parameters might be considered. gsnap \ -d hg19 \ -D gsnap_indexes \ -A sam \ -o "${OUTPUT_PREFIX}.sam" \ --gunzip \ -t 8 \ --pairmax-dna=1000 \ "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}" # --- Post-processing: Convert SAM to BAM, sort, and index --- samtools view -bS "${OUTPUT_PREFIX}.sam" -o "${OUTPUT_PREFIX}.bam" samtools sort "${OUTPUT_PREFIX}.bam" -o "${OUTPUT_PREFIX}.sorted.bam" samtools index "${OUTPUT_PREFIX}.sorted.bam" # Clean up intermediate files (optional) # rm "${OUTPUT_PREFIX}.sam" # rm "${TRIMMED_R1}" "${TRIMMED_R2}" # rm "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}" -
3
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.
Custom script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# --- Installation (commented out) --- # # No specific tool installation for the RPKM calculation itself, # # as it's a formula often implemented via custom scripts (e.g., Python, R, Awk). # # Tools for upstream steps (e.g., samtools for total reads, featureCounts for counts) # # would be installed separately. # --- Define input files and parameters --- # Placeholder for a file containing gene/exon counts and their lengths. # This file is assumed to be pre-processed from alignment and annotation. # Format: gene_id <tab> read_count <tab> length_in_bases GENE_COUNTS_LENGTHS_FILE="path/to/your/gene_counts_and_lengths.tsv" # e.g., derived from featureCounts output # Placeholder for the total number of mapped reads in the library. # This value is typically obtained from the alignment summary or by counting reads in the BAM file. TOTAL_MAPPED_READS=50000000 # Example: Replace with actual total mapped reads (e.g., from samtools view -c -F 4 aligned.bam) # Placeholder for output file name OUTPUT_RPKM_FILE="sample_rpkm.tsv" # --- Reference Datasets (Implicitly used in upstream steps to generate inputs) --- # Reference Genome Assembly: hg38 (Human Genome build 38) - for alignment # Annotation Source: GENCODE (e.g., gencode.v44) - for gene/exon lengths and counting # --- Calculate RPKM using the formula from Chepelev et al., 2009 --- # RPKM = (10^9 * C) / (N * L) # Where: # C = Number of reads mapped to the gene (from GENE_COUNTS_LENGTHS_FILE, column 2) # N = Total number of mapped reads in the library (TOTAL_MAPPED_READS) # L = Length of the gene in bases (from GENE_COUNTS_LENGTHS_FILE, column 3) echo -e "gene_id\tRPKM" > "${OUTPUT_RPKM_FILE}" awk -v total_reads="${TOTAL_MAPPED_READS}" 'BEGIN { OFS="\t" } NR > 1 { gene_id = $1; count = $2; length_bases = $3; # Handle cases where length_bases might be zero to avoid division by zero if (length_bases == 0) { rpkm = 0; } else { # RPKM = (count * 10^9) / (total_reads * length_bases) rpkm = (count * 1000000000) / (total_reads * length_bases); } print gene_id, rpkm; }' "${GENE_COUNTS_LENGTHS_FILE}" >> "${OUTPUT_RPKM_FILE}" echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM_FILE}" -
4
In short, exons from all isoforms of a gene were merged to create one meta-transcript.
$ Bash example
# Install RSEM if not already installed # conda install -c bioconda rsem # Define input files and output prefix # Replace with actual GTF and FASTA files for your organism and assembly INPUT_GTF="Homo_sapiens.GRCh38.109.gtf" # Example: Ensembl GTF for human GRCh38 REFERENCE_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa" # Example: Primary assembly FASTA for human GRCh38 OUTPUT_PREFIX="GRCh38_rsem_reference" # Prefix for the generated RSEM index files # Prepare RSEM reference. This command internally processes the GTF to create # a 'union exon' model (conceptually a meta-transcript) for each gene, # which RSEM uses for accurate transcript quantification. rsem-prepare-reference \ --gtf "${INPUT_GTF}" \ "${REFERENCE_FASTA}" \ "${OUTPUT_PREFIX}" -
5
The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
$ Bash example
# Install RSEM (if not already installed) # conda install -c bioconda rsem # Define variables # RSEM requires a reference index, which is built once from a genome FASTA and GTF file. # Example command to build reference: # rsem-prepare-reference --gtf Homo_sapiens.GRCh38.109.gtf Homo_sapiens.GRCh38.dna.primary_assembly.fa human_hg38_rsem_ref # Reference genome FASTA: ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # GTF annotation: ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz RSEM_REFERENCE_PATH="/path/to/rsem_references/human_hg38_rsem_ref" # Placeholder for the indexed RSEM reference INPUT_BAM="aligned_reads.bam" # Input BAM file from a splice-aware aligner (e.g., STAR) SAMPLE_PREFIX="sample_id" # Prefix for output files OUTPUT_DIR="rsem_quantification" NUM_THREADS=8 # Number of threads for parallel processing # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run RSEM to calculate expression # This command processes an input BAM file (assumed to be genome-aligned) # and quantifies expression at both gene and isoform levels. # The description "normalized by the size of the meta-transcript and by the size of the library" # is directly addressed by RSEM's TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped reads) # outputs in the .genes.results and .isoforms.results files. # --bam: Specifies that the input is a BAM file. # --no-qualities: Speeds up processing by not reading quality scores (often not needed for quantification). # --num-threads: Utilizes multiple CPU threads for faster execution. # --paired-end: Add this flag if your data is paired-end. # --strand-specific: Add --forward-prob 0 or --reverse-prob 0 if your library is strand-specific. rsem-calculate-expression \ --bam \ --no-qualities \ --num-threads "${NUM_THREADS}" \ "${INPUT_BAM}" \ "${RSEM_REFERENCE_PATH}" \ "${OUTPUT_DIR}/${SAMPLE_PREFIX}" # The primary output files for this step will be: # - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.genes.results: Contains gene-level expected counts, TPM, FPKM. # - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.isoforms.results: Contains isoform-level expected counts, TPM, FPKM.
Raw Source Text
Illumina software CASAVA used for basecalling. Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP. Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library. Genome_build: GRCh37 (hg19) Supplementary_files_format_and_content: SLE_RPKMs.txt: Tab-delimited text file includes RPKM values for each patient and control.