GSE73853 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

RNA-binding protein CPEB1 remodels host and viral RNA landscapes.

Nature structural & molecular biology (2016) — PMID 27775709

Dataset

GSE73853

Transcriptome analysis of diverse cell types infected with human cytomegalovirus [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions.

    GSNAP v2023-09-27 GitHub
    $ Bash example
    # Install GSNAP (if not already installed)
    # conda install -c bioconda gsnap
    # conda install -c bioconda samtools
    
    # --- Reference Genome Preparation ---
    # Download human genome (hg19 build) FASTA from UCSC
    # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip hg19.fa.gz
    
    # Download HCMV Merlin genome (Genbank AY446894.2) FASTA from NCBI
    # wget -O HCMV_Merlin.fa "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=efetch&db=nuccore&retmode=text&rettype=fasta&id=AY446894.2"
    
    # Concatenate genomes into a single FASTA file for combined indexing
    # cat hg19.fa HCMV_Merlin.fa > combined_genome.fa
    
    # Build GSNAP index for the combined genome
    # mkdir -p gsnap_index
    # gmap_build -D gsnap_index -d combined_genome_index combined_genome.fa
    
    # --- GSNAP Alignment ---
    # Assuming trimmed_reads_R1.fastq.gz and trimmed_reads_R2.fastq.gz are the input trimmed RNA-seq reads
    # Replace with actual input file names
    
    gsnap \
      --gunzip \
      -D gsnap_index \
      -d combined_genome_index \
      -A sam \
      --pairmax-dna=1000 \
      --splice-mode=rna \
      --orientation=fr \
      --mask-low-complexity \
      trimmed_reads_R1.fastq.gz \
      trimmed_reads_R2.fastq.gz \
      > aligned_reads.sam
    
    # Convert SAM to BAM and sort
    samtools view -bS aligned_reads.sam > aligned_reads.bam
    samtools sort aligned_reads.bam -o aligned_reads.sorted.bam
    samtools index aligned_reads.sorted.bam
  2. 2

    Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively.

    bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install bowtie2, samtools, and ncbi-datasets-cli if not already installed
    # conda install -c bioconda bowtie2 samtools ncbi-datasets-cli
    
    # --- Download HSV-2 reference genome (NC_001798.1) ---
    echo "Downloading HSV-2 reference genome (NC_001798.1)..."
    datasets download genome accession NC_001798.1 --filename HSV2.zip
    unzip -o HSV2.zip
    mv ncbi_dataset/data/NC_001798.1/NC_001798.1.fna HSV2.fasta
    rm -rf ncbi_dataset HSV2.zip
    
    # --- Build Bowtie2 index for HSV-2 ---
    echo "Building Bowtie2 index for HSV-2..."
    bowtie2-build HSV2.fasta HSV2_index
    
    # --- Download HIV-1 reference genome (K03455.1) ---
    echo "Downloading HIV-1 reference genome (K03455.1)..."
    datasets download genome accession K03455.1 --filename HIV1.zip
    unzip -o HIV1.zip
    mv ncbi_dataset/data/K03455.1/K03455.1.fna HIV1.fasta
    rm -rf ncbi_dataset HIV1.zip
    
    # --- Build Bowtie2 index for HIV-1 ---
    echo "Building Bowtie2 index for HIV-1..."
    bowtie2-build HIV1.fasta HIV1_index
    
    # --- Align HSV-2 sample reads ---
    # Assuming 'hsv2_reads.fastq' is the input FASTQ file for HSV-2
    # If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq>
    echo "Aligning HSV-2 reads..."
    bowtie2 -x HSV2_index -U hsv2_reads.fastq -S hsv2_mapped.sam --threads 4
    
    # Convert SAM to BAM, sort, and index
    samtools view -bS hsv2_mapped.sam | samtools sort -o hsv2_mapped.bam
    samtools index hsv2_mapped.bam
    rm hsv2_mapped.sam # Clean up SAM file
    
    # --- Align HIV-1 sample reads ---
    # Assuming 'hiv1_reads.fastq' is the input FASTQ file for HIV-1
    # If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq>
    echo "Aligning HIV-1 reads..."
    bowtie2 -x HIV1_index -U hiv1_reads.fastq -S hiv1_mapped.sam --threads 4
    
    # Convert SAM to BAM, sort, and index
    samtools view -bS hiv1_mapped.sam | samtools sort -o hiv1_mapped.bam
    samtools index hiv1_mapped.bam
    rm hiv1_mapped.sam # Clean up SAM file
    
    echo "Mapping complete. Output BAM files: hsv2_mapped.bam, hiv1_mapped.bam"
  3. 3

    Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected.

    UCSC tools vNot specified GitHub
    $ Bash example
    # Install bedtools
    # conda install -c bioconda bedtools
    
    # Install UCSC tools (specifically bedGraphToBigWig and fetchChromSizes)
    # conda install -c bioconda ucsc-bedgraphtobigwig ucsc-fetchchromsizes
    
    # Define input BAM file and output prefixes
    INPUT_BAM="aligned.bam" # Replace with your actual aligned BAM file
    OUTPUT_PREFIX="sample_coverage" # Prefix for output bigWig files
    GENOME="hg38" # Placeholder: Replace with your actual genome assembly (e.g., hg19, mm10)
    
    # Get chrom.sizes file for the specified genome
    # This file is essential for both bedtools genomeCoverage and bedGraphToBigWig
    CHROM_SIZES="${GENOME}.chrom.sizes"
    fetchChromSizes "${GENOME}" > "${CHROM_SIZES}"
    
    # Generate strand-specific bedGraph files using bedtools genomeCoverage
    # -ibam: Input BAM file
    # -g: Genome file (chrom.sizes)
    # -bg: Output in bedGraph format
    # -strand +: Calculate coverage for the positive strand
    # -strand -: Calculate coverage for the negative strand
    bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand + > "${OUTPUT_PREFIX}_plus.bedGraph"
    bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand - > "${OUTPUT_PREFIX}_minus.bedGraph"
    
    # Sort bedGraph files by chromosome and start position
    # This is a requirement for bedGraphToBigWig
    sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_plus.bedGraph" > "${OUTPUT_PREFIX}_plus_sorted.bedGraph"
    sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_minus.bedGraph" > "${OUTPUT_PREFIX}_minus_sorted.bedGraph"
    
    # Convert sorted bedGraph files to bigWig format using UCSC bedGraphToBigWig
    # Input bedGraph file, chrom.sizes file, output bigWig file
    bedGraphToBigWig "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_plus.bigWig"
    bedGraphToBigWig "${OUTPUT_PREFIX}_minus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_minus.bigWig"
    
    # Clean up intermediate files (optional)
    # rm "${OUTPUT_PREFIX}_plus.bedGraph" "${OUTPUT_PREFIX}_minus.bedGraph"
    # rm "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${OUTPUT_PREFIX}_minus_sorted.bedGraph"
    # rm "${CHROM_SIZES}"

Tools Used

Raw Source Text
RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions. Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively.
Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected.
Genome_build: hg19
Supplementary_files_format_and_content: bigWig (each .bw includes human transcriptome only)
← Back to Analysis