GSE72502 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.Science (New York, N.Y.) (2015) — PMID 26382853
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Illumina software used for basecalling.
bcl2fastq (Inferred with models/gemini-2.5-flash) vv2.20 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install bcl2fastq (example using conda) # conda install -c bioconda bcl2fastq2 # Example usage of bcl2fastq for basecalling and demultiplexing # Replace /path/to/illumina/run_folder with the actual path to your Illumina run folder containing BCL files # Replace /path/to/output_fastq with the desired output directory for FASTQ files bcl2fastq --runfolder-dir /path/to/illumina/run_folder \ --output-dir /path/to/output_fastq \ --no-lane-splitting \ --minimum-trimmed-read-length 35 \ --mask-short-adapter-reads 35 \ --barcode-mismatches 1 -
2
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
GSNAP v2021-12-25 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install GSNAP (if not already installed) # conda install -c bioconda gsnap # Reference genome preparation (if not already done) # Download hg19 reference genome from UCSC # wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip hg19.fa.gz # Create GSNAP index for hg19. -k 15 is a common k-mer size for RNA-seq. # gsnap_build -d hg19_index -g hg19.fa -k 15 # Define variables GENOME_INDEX_DIR="/path/to/gsnap_indexes" # Placeholder: Directory containing the hg19_index GENOME_INDEX_NAME="hg19_index" INPUT_FASTQ="input_reads_preprocessed.fastq" # Placeholder: Reads after trimming, rRNA filtering, and quality filtering OUTPUT_SAM="aligned_reads.sam" THREADS=8 # Number of CPU threads to use # GSNAP alignment command # Description: Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, # filtered for low-complexity or low-quality sequence (these steps are assumed to be completed # and the output is INPUT_FASTQ). Then mapped to hg19 whole genome using GSNAP. # Parameters chosen are common for RNA-seq alignment with GSNAP, given the rRNA filtering step. # --orientation=fr is typically for paired-end reads, but can be used for single-end if library prep has specific orientation. # --novel-splice-sites=1 allows for detection of one novel splice site per read. # --max-mismatches=0.05 allows up to 5% mismatches relative to read length. # --quality-protocol=sanger specifies the quality score encoding. gsnap -d "${GENOME_INDEX_NAME}" \ -D "${GENOME_INDEX_DIR}" \ -t "${THREADS}" \ --orientation=fr \ --novel-splice-sites=1 \ --max-mismatches=0.05 \ --quality-protocol=sanger \ -A sam \ "${INPUT_FASTQ}" > "${OUTPUT_SAM}" # Optional: Convert SAM to BAM and sort (common post-alignment steps) # samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam" # samtools index "${OUTPUT_SAM%.sam}.bam" -
3
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.
Custom RPKM Script (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash)$ Bash example
# Install subread (which includes featureCounts) and samtools if not already available # conda install -c bioconda subread samtools # Define input and reference files # Replace 'aligned_reads.bam' with your actual aligned BAM file INPUT_BAM="aligned_reads.bam" # Placeholder for a human GRCh38 GTF file. Obtain from Ensembl, GENCODE, or UCSC. # Example: Homo_sapiens.GRCh38.109.gtf from Ensembl GENOME_GTF="Homo_sapiens.GRCh38.109.gtf" OUTPUT_COUNTS="gene_counts.txt" OUTPUT_RPKM="gene_rpkm.txt" # 1. Count reads per gene (aggregating exon counts to gene level) # -p: Assume paired-end reads (remove if single-end) # -t exon: Count features of type 'exon' # -g gene_id: Aggregate counts by 'gene_id' # -a: Annotation file in GTF/GFF format featureCounts -p -t exon -g gene_id -a "${GENOME_GTF}" -o "${OUTPUT_COUNTS}" "${INPUT_BAM}" # 2. Calculate RPKM using the formula from Chepelev et al., 2009 # RPKM = (number of reads mapped to a gene * 10^9) / (total number of reads in the library * gene length in kilobases) # Get total number of mapped reads in the library from the BAM file # -c: count reads # -F 260: Exclude unmapped (0x4), secondary alignment (0x100), and supplementary alignment (0x800) reads TOTAL_READS=$(samtools view -c -F 260 "${INPUT_BAM}") # Process the featureCounts output to calculate RPKM # The featureCounts output file has a header, then columns like: # Geneid Chr Start End Strand Length /path/to/aligned_reads.bam awk -v total_reads="${TOTAL_READS}" ' BEGIN { OFS="\t"; print "Geneid", "Length_Bases", "Raw_Counts", "RPKM"; } NR > 2 { # Skip the first two header lines from featureCounts output gene_id = $1; gene_length_bases = $6; # Gene length in bases (sum of exon lengths) raw_counts = $7; # Raw read counts for the gene if (gene_length_bases > 0 && total_reads > 0) { # Calculate RPKM rpkm = (raw_counts * 1000000000) / (total_reads * (gene_length_bases / 1000)); print gene_id, gene_length_bases, raw_counts, rpkm; } else { # Handle cases with zero length or zero total reads to avoid division by zero print gene_id, gene_length_bases, raw_counts, 0; } }' "${OUTPUT_COUNTS}" > "${OUTPUT_RPKM}" echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM}" -
4
In short, exons from all isoforms of a gene were merged to create one meta-transcript.
$ Bash example
# Install AGAT if not already installed # conda install -c bioconda agat # Define input GTF file (e.g., Gencode annotation) GTF_FILE="gencode.v44.annotation.gtf" # Placeholder for a reference GTF file (e.g., human GRCh38) # Define output GTF file for meta-transcripts OUTPUT_GTF="gene_meta_transcripts.gtf" # Step: Merge exons from all isoforms of a gene to create one meta-transcript # agat_sp_merge_annotations.pl: Script to merge annotations # --gtf: Input GTF file # --type exon: Specify that we want to merge 'exon' features # --group_by gene: Group features by their parent gene_id before merging # --output: Output GTF file agat_sp_merge_annotations.pl --gtf "${GTF_FILE}" --type exon --group_by gene --output "${OUTPUT_GTF}" echo "Meta-transcripts (merged exons per gene) saved to ${OUTPUT_GTF}" -
5
The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
$ Bash example
# Install RSEM (example using conda) # conda install -c bioconda rsem=1.3.6 # Define variables # Placeholder for the genome FASTA file used for alignment and RSEM reference preparation. GENOME_FASTA="path/to/reference_genome.fa" # Placeholder for the GTF annotation file. This file should define the "meta-transcripts" # and their exon structures. For a general case, this would be a standard annotation like Gencode. GTF_ANNOTATION="path/to/custom_or_gencode.gtf" # Name for the RSEM reference index. This will create a directory with this name. RSEM_REFERENCE_NAME="rsem_index_meta_transcripts" # Path to the input BAM file containing aligned reads. INPUT_BAM="path/to/aligned_reads.bam" # Prefix for the output quantification files (e.g., .genes.results, .isoforms.results). OUTPUT_PREFIX="quantification_results" # Number of threads to use for parallel processing. NUM_THREADS=8 # Strandedness of the library: 0 for unstranded, 0.5 for partially stranded, 1 for fully stranded. # Adjust based on your library preparation protocol. STRANDEDNESS_PROB=0.5 # Example: partially stranded # Step 1: Prepare RSEM reference (run once for a given genome and annotation) # This step builds the necessary index files for RSEM from the genome FASTA and GTF. # The GTF_ANNOTATION must contain the definitions of the "meta-transcripts" for them to be quantified. # rsem-prepare-reference --gtf "${GTF_ANNOTATION}" "${GENOME_FASTA}" "${RSEM_REFERENCE_NAME}" # Step 2: Calculate expression using RSEM # This command quantifies transcript and gene abundances from aligned reads. # It counts reads falling into exons of defined transcripts (including meta-transcripts from the GTF) # and normalizes by transcript length and library size (e.g., to FPKM/TPM). rsem-calculate-expression \ --bam \ --paired-end \ --forward-prob "${STRANDEDNESS_PROB}" \ --num-threads "${NUM_THREADS}" \ "${INPUT_BAM}" \ "${RSEM_REFERENCE_NAME}" \ "${OUTPUT_PREFIX}" # Output files will include: # ${OUTPUT_PREFIX}.genes.results: Contains gene-level quantification (e.g., FPKM, TPM, expected counts). # ${OUTPUT_PREFIX}.isoforms.results: Contains isoform/transcript-level quantification. # ${OUTPUT_PREFIX}.stat: Contains run statistics.
Raw Source Text
Illumina software used for basecalling. Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP. Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library. Genome_build: GRCh37 (hg19) Supplementary_files_format_and_content: Excel file includes RPKM values for each Sample.