GSE73853 Processing Pipeline
RNA-Seq
code_examples
3 steps
Publication
RNA-binding protein CPEB1 remodels host and viral RNA landscapes.Nature structural & molecular biology (2016) — PMID 27775709
Dataset
GSE73853Transcriptome analysis of diverse cell types infected with human cytomegalovirus [RNA-Seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions.
$ Bash example
# Install GSNAP (if not already installed) # conda install -c bioconda gsnap # conda install -c bioconda samtools # --- Reference Genome Preparation --- # Download human genome (hg19 build) FASTA from UCSC # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip hg19.fa.gz # Download HCMV Merlin genome (Genbank AY446894.2) FASTA from NCBI # wget -O HCMV_Merlin.fa "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=efetch&db=nuccore&retmode=text&rettype=fasta&id=AY446894.2" # Concatenate genomes into a single FASTA file for combined indexing # cat hg19.fa HCMV_Merlin.fa > combined_genome.fa # Build GSNAP index for the combined genome # mkdir -p gsnap_index # gmap_build -D gsnap_index -d combined_genome_index combined_genome.fa # --- GSNAP Alignment --- # Assuming trimmed_reads_R1.fastq.gz and trimmed_reads_R2.fastq.gz are the input trimmed RNA-seq reads # Replace with actual input file names gsnap \ --gunzip \ -D gsnap_index \ -d combined_genome_index \ -A sam \ --pairmax-dna=1000 \ --splice-mode=rna \ --orientation=fr \ --mask-low-complexity \ trimmed_reads_R1.fastq.gz \ trimmed_reads_R2.fastq.gz \ > aligned_reads.sam # Convert SAM to BAM and sort samtools view -bS aligned_reads.sam > aligned_reads.bam samtools sort aligned_reads.bam -o aligned_reads.sorted.bam samtools index aligned_reads.sorted.bam
-
2
Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively.
bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.1 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install bowtie2, samtools, and ncbi-datasets-cli if not already installed # conda install -c bioconda bowtie2 samtools ncbi-datasets-cli # --- Download HSV-2 reference genome (NC_001798.1) --- echo "Downloading HSV-2 reference genome (NC_001798.1)..." datasets download genome accession NC_001798.1 --filename HSV2.zip unzip -o HSV2.zip mv ncbi_dataset/data/NC_001798.1/NC_001798.1.fna HSV2.fasta rm -rf ncbi_dataset HSV2.zip # --- Build Bowtie2 index for HSV-2 --- echo "Building Bowtie2 index for HSV-2..." bowtie2-build HSV2.fasta HSV2_index # --- Download HIV-1 reference genome (K03455.1) --- echo "Downloading HIV-1 reference genome (K03455.1)..." datasets download genome accession K03455.1 --filename HIV1.zip unzip -o HIV1.zip mv ncbi_dataset/data/K03455.1/K03455.1.fna HIV1.fasta rm -rf ncbi_dataset HIV1.zip # --- Build Bowtie2 index for HIV-1 --- echo "Building Bowtie2 index for HIV-1..." bowtie2-build HIV1.fasta HIV1_index # --- Align HSV-2 sample reads --- # Assuming 'hsv2_reads.fastq' is the input FASTQ file for HSV-2 # If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq> echo "Aligning HSV-2 reads..." bowtie2 -x HSV2_index -U hsv2_reads.fastq -S hsv2_mapped.sam --threads 4 # Convert SAM to BAM, sort, and index samtools view -bS hsv2_mapped.sam | samtools sort -o hsv2_mapped.bam samtools index hsv2_mapped.bam rm hsv2_mapped.sam # Clean up SAM file # --- Align HIV-1 sample reads --- # Assuming 'hiv1_reads.fastq' is the input FASTQ file for HIV-1 # If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq> echo "Aligning HIV-1 reads..." bowtie2 -x HIV1_index -U hiv1_reads.fastq -S hiv1_mapped.sam --threads 4 # Convert SAM to BAM, sort, and index samtools view -bS hiv1_mapped.sam | samtools sort -o hiv1_mapped.bam samtools index hiv1_mapped.bam rm hiv1_mapped.sam # Clean up SAM file echo "Mapping complete. Output BAM files: hsv2_mapped.bam, hiv1_mapped.bam"
-
3
Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected.
$ Bash example
# Install bedtools # conda install -c bioconda bedtools # Install UCSC tools (specifically bedGraphToBigWig and fetchChromSizes) # conda install -c bioconda ucsc-bedgraphtobigwig ucsc-fetchchromsizes # Define input BAM file and output prefixes INPUT_BAM="aligned.bam" # Replace with your actual aligned BAM file OUTPUT_PREFIX="sample_coverage" # Prefix for output bigWig files GENOME="hg38" # Placeholder: Replace with your actual genome assembly (e.g., hg19, mm10) # Get chrom.sizes file for the specified genome # This file is essential for both bedtools genomeCoverage and bedGraphToBigWig CHROM_SIZES="${GENOME}.chrom.sizes" fetchChromSizes "${GENOME}" > "${CHROM_SIZES}" # Generate strand-specific bedGraph files using bedtools genomeCoverage # -ibam: Input BAM file # -g: Genome file (chrom.sizes) # -bg: Output in bedGraph format # -strand +: Calculate coverage for the positive strand # -strand -: Calculate coverage for the negative strand bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand + > "${OUTPUT_PREFIX}_plus.bedGraph" bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand - > "${OUTPUT_PREFIX}_minus.bedGraph" # Sort bedGraph files by chromosome and start position # This is a requirement for bedGraphToBigWig sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_plus.bedGraph" > "${OUTPUT_PREFIX}_plus_sorted.bedGraph" sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_minus.bedGraph" > "${OUTPUT_PREFIX}_minus_sorted.bedGraph" # Convert sorted bedGraph files to bigWig format using UCSC bedGraphToBigWig # Input bedGraph file, chrom.sizes file, output bigWig file bedGraphToBigWig "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_plus.bigWig" bedGraphToBigWig "${OUTPUT_PREFIX}_minus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_minus.bigWig" # Clean up intermediate files (optional) # rm "${OUTPUT_PREFIX}_plus.bedGraph" "${OUTPUT_PREFIX}_minus.bedGraph" # rm "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${OUTPUT_PREFIX}_minus_sorted.bedGraph" # rm "${CHROM_SIZES}"
Tools Used
Raw Source Text
RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions. Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively. Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected. Genome_build: hg19 Supplementary_files_format_and_content: bigWig (each .bw includes human transcriptome only)