GSE73853 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

RNA-binding protein CPEB1 remodels host and viral RNA landscapes.

Nature structural & molecular biology (2016) — PMID 27775709

Dataset

Transcriptome analysis of diverse cell types infected with human cytomegalovirus [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions.

GSNAP v2023-09-27 GitHub

$ Bash example

# Install GSNAP (if not already installed)
# conda install -c bioconda gsnap
# conda install -c bioconda samtools

# --- Reference Genome Preparation ---
# Download human genome (hg19 build) FASTA from UCSC
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz

# Download HCMV Merlin genome (Genbank AY446894.2) FASTA from NCBI
# wget -O HCMV_Merlin.fa "https://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?tool=efetch&db=nuccore&retmode=text&rettype=fasta&id=AY446894.2"

# Concatenate genomes into a single FASTA file for combined indexing
# cat hg19.fa HCMV_Merlin.fa > combined_genome.fa

# Build GSNAP index for the combined genome
# mkdir -p gsnap_index
# gmap_build -D gsnap_index -d combined_genome_index combined_genome.fa

# --- GSNAP Alignment ---
# Assuming trimmed_reads_R1.fastq.gz and trimmed_reads_R2.fastq.gz are the input trimmed RNA-seq reads
# Replace with actual input file names

gsnap \
  --gunzip \
  -D gsnap_index \
  -d combined_genome_index \
  -A sam \
  --pairmax-dna=1000 \
  --splice-mode=rna \
  --orientation=fr \
  --mask-low-complexity \
  trimmed_reads_R1.fastq.gz \
  trimmed_reads_R2.fastq.gz \
  > aligned_reads.sam

# Convert SAM to BAM and sort
samtools view -bS aligned_reads.sam > aligned_reads.bam
samtools sort aligned_reads.bam -o aligned_reads.sorted.bam
samtools index aligned_reads.sorted.bam

View on GitHub

Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.5.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bowtie2, samtools, and ncbi-datasets-cli if not already installed
# conda install -c bioconda bowtie2 samtools ncbi-datasets-cli

# --- Download HSV-2 reference genome (NC_001798.1) ---
echo "Downloading HSV-2 reference genome (NC_001798.1)..."
datasets download genome accession NC_001798.1 --filename HSV2.zip
unzip -o HSV2.zip
mv ncbi_dataset/data/NC_001798.1/NC_001798.1.fna HSV2.fasta
rm -rf ncbi_dataset HSV2.zip

# --- Build Bowtie2 index for HSV-2 ---
echo "Building Bowtie2 index for HSV-2..."
bowtie2-build HSV2.fasta HSV2_index

# --- Download HIV-1 reference genome (K03455.1) ---
echo "Downloading HIV-1 reference genome (K03455.1)..."
datasets download genome accession K03455.1 --filename HIV1.zip
unzip -o HIV1.zip
mv ncbi_dataset/data/K03455.1/K03455.1.fna HIV1.fasta
rm -rf ncbi_dataset HIV1.zip

# --- Build Bowtie2 index for HIV-1 ---
echo "Building Bowtie2 index for HIV-1..."
bowtie2-build HIV1.fasta HIV1_index

# --- Align HSV-2 sample reads ---
# Assuming 'hsv2_reads.fastq' is the input FASTQ file for HSV-2
# If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq>
echo "Aligning HSV-2 reads..."
bowtie2 -x HSV2_index -U hsv2_reads.fastq -S hsv2_mapped.sam --threads 4

# Convert SAM to BAM, sort, and index
samtools view -bS hsv2_mapped.sam | samtools sort -o hsv2_mapped.bam
samtools index hsv2_mapped.bam
rm hsv2_mapped.sam # Clean up SAM file

# --- Align HIV-1 sample reads ---
# Assuming 'hiv1_reads.fastq' is the input FASTQ file for HIV-1
# If reads are paired-end, use -1 <read1.fastq> -2 <read2.fastq>
echo "Aligning HIV-1 reads..."
bowtie2 -x HIV1_index -U hiv1_reads.fastq -S hiv1_mapped.sam --threads 4

# Convert SAM to BAM, sort, and index
samtools view -bS hiv1_mapped.sam | samtools sort -o hiv1_mapped.bam
samtools index hiv1_mapped.bam
rm hiv1_mapped.sam # Clean up SAM file

echo "Mapping complete. Output BAM files: hsv2_mapped.bam, hiv1_mapped.bam"

View on GitHub

Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected.

UCSC tools vNot specified GitHub

$ Bash example

# Install bedtools
# conda install -c bioconda bedtools

# Install UCSC tools (specifically bedGraphToBigWig and fetchChromSizes)
# conda install -c bioconda ucsc-bedgraphtobigwig ucsc-fetchchromsizes

# Define input BAM file and output prefixes
INPUT_BAM="aligned.bam" # Replace with your actual aligned BAM file
OUTPUT_PREFIX="sample_coverage" # Prefix for output bigWig files
GENOME="hg38" # Placeholder: Replace with your actual genome assembly (e.g., hg19, mm10)

# Get chrom.sizes file for the specified genome
# This file is essential for both bedtools genomeCoverage and bedGraphToBigWig
CHROM_SIZES="${GENOME}.chrom.sizes"
fetchChromSizes "${GENOME}" > "${CHROM_SIZES}"

# Generate strand-specific bedGraph files using bedtools genomeCoverage
# -ibam: Input BAM file
# -g: Genome file (chrom.sizes)
# -bg: Output in bedGraph format
# -strand +: Calculate coverage for the positive strand
# -strand -: Calculate coverage for the negative strand
bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand + > "${OUTPUT_PREFIX}_plus.bedGraph"
bedtools genomeCoverage -ibam "${INPUT_BAM}" -g "${CHROM_SIZES}" -bg -strand - > "${OUTPUT_PREFIX}_minus.bedGraph"

# Sort bedGraph files by chromosome and start position
# This is a requirement for bedGraphToBigWig
sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_plus.bedGraph" > "${OUTPUT_PREFIX}_plus_sorted.bedGraph"
sort -k1,1 -k2,2n "${OUTPUT_PREFIX}_minus.bedGraph" > "${OUTPUT_PREFIX}_minus_sorted.bedGraph"

# Convert sorted bedGraph files to bigWig format using UCSC bedGraphToBigWig
# Input bedGraph file, chrom.sizes file, output bigWig file
bedGraphToBigWig "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_plus.bigWig"
bedGraphToBigWig "${OUTPUT_PREFIX}_minus_sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_PREFIX}_minus.bigWig"

# Clean up intermediate files (optional)
# rm "${OUTPUT_PREFIX}_plus.bedGraph" "${OUTPUT_PREFIX}_minus.bedGraph"
# rm "${OUTPUT_PREFIX}_plus_sorted.bedGraph" "${OUTPUT_PREFIX}_minus_sorted.bedGraph"
# rm "${CHROM_SIZES}"

View on GitHub

Tools Used

UCSC tools

Raw Source Text

RNA-seq raw reads were trimmed for adaptor sequences or low-quality bases and mapped to both the human genome (hg19 build) and the HCMV Merlin genome (Genbank AY446894.2) with GSNAP, with additional filtering for repetitive regions. Virus-mapping reads were obtained from the HSV-2 and HIV-1 samples using Genbank accessions NC_001798.1 and K03455.1, respectively.
Genome coverage files for visualization (bigWig format) were generated using UCSC tools bedGraphToBigWig and genomeCoverageBed, and strand-corrected.
Genome_build: hg19
Supplementary_files_format_and_content: bigWig (each .bw includes human transcriptome only)

← Back to Analysis