GSE198899 Processing Pipeline

OTHER code_examples 6 steps

Publication

Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions.

Research square (2022) — PMID 35313591

Dataset

GSE198899

Discovery and functional interrogation of the virus and host RNA interactome of SARS-CoV-2 proteins [Ribo-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Library strategy: Ribo-Seq

Ribo-seq vNot specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Define variables
FASTQ_R1="sample_R1.fastq.gz"
FASTQ_R2="sample_R2.fastq.gz" # Remove if single-end
OUTPUT_DIR="ribo_seq_analysis_output"
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for human genome GRCh38
RRNA_FASTA="/path/to/rRNA_sequences.fasta" # Placeholder for rRNA sequences (e.g., from Ensembl or NCBI)
RRNA_INDEX_DIR="/path/to/STAR_rRNA_index" # Directory for STAR rRNA index

mkdir -p "${OUTPUT_DIR}"

# --- 1. Quality Control ---
# fastqc "${FASTQ_R1}" -o "${OUTPUT_DIR}"
# fastqc "${FASTQ_R2}" -o "${OUTPUT_DIR}" # If paired-end

# --- 2. Adapter Trimming and Quality Filtering ---
# Using Trim Galore! (wrapper for Cutadapt and FastQC)
# For single-end reads:
# trim_galore --quality 20 --length 20 --output_dir "${OUTPUT_DIR}" "${FASTQ_R1}"
# TRIMMED_FASTQ="${OUTPUT_DIR}/$(basename "${FASTQ_R1}" .fastq.gz)_trimmed.fq.gz"

# For paired-end reads:
trim_galore --quality 20 --length 20 --paired "${FASTQ_R1}" "${FASTQ_R2}" --output_dir "${OUTPUT_DIR}"
TRIMMED_FASTQ_R1="${OUTPUT_DIR}/$(basename "${FASTQ_R1}" .fastq.gz)_trimmed.fq.gz"
TRIMMED_FASTQ_R2="${OUTPUT_DIR}/$(basename "${FASTQ_R2}" .fastq.gz)_trimmed.fq.gz"

# --- 3. Depletion of rRNA reads ---
# Build rRNA index (run once if not already done):
# STAR --runMode genomeGenerate --genomeDir "${RRNA_INDEX_DIR}" --genomeFastaFiles "${RRNA_FASTA}" --runThreadN 8

# Align to rRNA and keep unmapped reads
STAR --genomeDir "${RRNA_INDEX_DIR}" \
     --readFilesIn "${TRIMMED_FASTQ_R1}" "${TRIMMED_FASTQ_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/rRNA_depleted_" \
     --outSAMtype BAM Unsorted \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.6 \
     --outFilterMatchNminOverLread 0.6 \
     --outReadsUnmapped Fastx \
     --runThreadN 8

RRNA_DEPLETED_R1="${OUTPUT_DIR}/rRNA_depleted_Unmapped.out.mate1"
RRNA_DEPLETED_R2="${OUTPUT_DIR}/rRNA_depleted_Unmapped.out.mate2"

# --- 4. Alignment to reference genome (e.g., human GRCh38) ---
# Build genome index (run once if not already done):
# STAR --runMode genomeGenerate --genomeDir "${GENOME_DIR}" --genomeFastaFiles /path/to/GRCh38.fa --sjdbGTFfile /path/to/GRCh38.gtf --runThreadN 8

STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${RRNA_DEPLETED_R1}" "${RRNA_DEPLETED_R2}" \
     --readFilesCommand cat \
     --outFileNamePrefix "${OUTPUT_DIR}/genome_aligned_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.6 \
     --outFilterMatchNminOverLread 0.6 \
     --alignIntronMax 1 \
     --alignEndsType EndToEnd \
     --runThreadN 8

ALIGNED_BAM="${OUTPUT_DIR}/genome_aligned_Aligned.sortedByCoord.out.bam"

# --- 5. Index the BAM file ---
samtools index "${ALIGNED_BAM}"

# --- 6. (Conceptual) P-site mapping, footprint length analysis, and quantification ---
# This step typically involves specialized Ribo-seq tools or custom scripts
# Example:
# ribo_profiling_tool --input_bam "${ALIGNED_BAM}" --output_prefix "${OUTPUT_DIR}/ribo_analysis" --genome_fasta /path/to/GRCh38.fa --gtf /path/to/GRCh38.gtf
# Further analysis might include differential translation, ORF calling, etc.

Raw sequencing reads were trimmed for adapter sequences using cutadapt (v1.14).

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=1.14

# Define input and output files
# Replace 'raw_reads.fastq.gz' with your actual input file.
# Replace 'trimmed_reads.fastq.gz' with your desired output file.
INPUT_READS="raw_reads.fastq.gz"
OUTPUT_TRIMMED_READS="trimmed_reads.fastq.gz"

# Define common Illumina adapter sequence (example for 3' end).
# Note: The specific adapter sequence used should be provided if known.
# This is a placeholder for a common Illumina universal adapter.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Trim adapter sequences and filter for minimum read length (e.g., 20 bp)
cutadapt -a "${ADAPTER_SEQUENCE}" \
         -m 20 \
         -o "${OUTPUT_TRIMMED_READS}" \
         "${INPUT_READS}"

View on GitHub

Trimmed reads were mapped against RepBase (v18.05) to remove reads mapping to repetitive sequences.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.3.4.1 GitHub

$ Bash example

# Define input and output file paths
TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz"
TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz"
# Output prefix for non-repetitive reads (bowtie2 will append .1.fastq.gz and .2.fastq.gz)
OUTPUT_NON_REP_PREFIX="non_repetitive_reads"
REPBASE_INDEX_PREFIX="repbase_v18.05_index" # This is the prefix for the bowtie2 index files

# --- Installation (commented out) ---
# # Install bowtie2
# # conda install -c bioconda bowtie2

# # Download RepBase (v18.05) and build bowtie2 index if not already available
# # RepBase requires login for direct download from girinst.org. 
# # Users typically obtain pre-indexed versions or use a local mirror/subscription.
# # For demonstration, here's a conceptual process:
# # mkdir -p repbase_index
# # cd repbase_index
# # # Example: Download RepBase fasta for a specific genome (e.g., hg38)
# # # wget -O RepBase_v18.05_hg38.fa.gz "https://www.girinst.org/server/RepBase/protected/RepBase_v18.05_hg38.fasta.gz" # Placeholder URL
# # # gunzip RepBase_v18.05_hg38.fa.gz
# # # bowtie2-build RepBase_v18.05_hg38.fasta ${REPBASE_INDEX_PREFIX}
# # # cd ..

# --- Execution Command ---
# Map trimmed reads against RepBase (v18.05) to identify and filter out repetitive sequences.
# Reads that do not map to RepBase are kept for downstream analysis.
# --threads $(nproc): Use all available CPU cores for mapping.
# --very-sensitive-local: Uses a local alignment strategy with very sensitive parameters,
#                         suitable for finding short repetitive elements.
# -x: Specifies the basename of the bowtie2 index files for RepBase.
# -1, -2: Input paired-end reads (R1 and R2).
# --un-conc-gz: Writes reads that did not align to the index to two gzipped files,
#               one for each mate, with .1.fastq.gz and .2.fastq.gz suffixes appended to the prefix.
# -S /dev/null: Suppresses SAM output to stdout, as we are only interested in the unmapped reads.
bowtie2 \
    --threads $(nproc) \
    --very-sensitive-local \
    -x "${REPBASE_INDEX_PREFIX}" \
    -1 "${TRIMMED_READS_R1}" \
    -2 "${TRIMMED_READS_R2}" \
    --un-conc-gz "${OUTPUT_NON_REP_PREFIX}" \
    -S /dev/null

View on GitHub

Remaining reads were mapped to the appropriate genome build using STAR aligner (v2.5.2b).

STAR v2.5.2b GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.5.2b

# --- Prepare STAR genome index (run once per genome) ---
# Replace 'path/to/genome_fasta.fa' with your genome FASTA file (e.g., hg38.fa, mm10.fa)
# Replace 'path/to/annotations.gtf' with your gene annotations GTF file (e.g., gencode.v38.annotation.gtf)
# Replace 'path/to/star_genome_index_dir' with your desired output directory for the index
# Adjust --sjdbOverhang based on your read length (read_length - 1)
# STAR --runMode genomeGenerate \
#      --genomeDir path/to/star_genome_index_dir \
#      --genomeFastaFiles path/to/genome_fasta.fa \
#      --sjdbGTFfile path/to/annotations.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN 8 # Adjust threads as needed

# --- Map reads using STAR ---
# Replace 'path/to/star_genome_index_dir' with the path to your STAR-indexed genome for the appropriate build.
# Replace 'path/to/reads_R1.fastq.gz' and 'path/to/reads_R2.fastq.gz' with your input FASTQ files.
# For single-end reads, only use --readFilesIn path/to/reads_R1.fastq.gz
# Replace 'output_prefix' with your desired prefix for output files (e.g., sample_name)

STAR --genomeDir path/to/star_genome_index_dir \
     --readFilesIn path/to/reads_R1.fastq.gz path/to/reads_R2.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix output_prefix. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outSAMunmapped Within \
     --quantMode GeneCounts # Optional: for gene counts quantification

View on GitHub

Raw read counts were generated from all mapped reads using featureCounts (v1.5.3) package.

featureCounts v1.5.3

$ Bash example

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Placeholder for annotation file (e.g., GENCODE human annotation)
# Download from: https://www.gencodegenes.org/human/
# For example, for human GRCh38, release 44:
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# gunzip gencode.v44.annotation.gtf.gz
ANNOTATION_FILE="gencode.v44.annotation.gtf"

# Placeholder for input BAM file(s) containing mapped reads
# Replace 'input_mapped_reads.bam' with your actual alignment file(s)
INPUT_BAM="input_mapped_reads.bam"

# Output file for raw read counts
OUTPUT_COUNTS="raw_read_counts.txt"

# Generate raw read counts from all mapped reads
# -a: specify the annotation file (GTF/GFF format)
# -o: specify the output file for the read counts
# -T: number of threads (adjust as needed, e.g., -T 8)
# -t exon: specify feature type to count (default is exon)
# -g gene_id: specify attribute to group features by (default is gene_id)
featureCounts -a "${ANNOTATION_FILE}" -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"

Normalized counts were derived from DESeq output.

DESeq2 v1.44.0 GitHub

$ Bash example

# Install R and DESeq2 if not already installed (example using Bioconda)
# conda create -n deseq2_env r-base bioconductor-deseq2 -c conda-forge -c bioconda
# conda activate deseq2_env

# Create an R script to perform DESeq2 normalization
cat << 'EOF' > deseq2_normalize.R
# Load DESeq2 library
library(DESeq2)

# --- User-defined parameters --- 
# Input count matrix file (e.g., from featureCounts, HTSeq-count, or tximport output).
# This file should have gene/feature IDs as row names and sample names as column names.
# Example format:
# gene_id\tsample1\tsample2\tsample3
# geneA\t100\t150\t120
# geneB\t50\t70\t60
count_matrix_file <- "path/to/your/raw_count_matrix.tsv"

# Sample information file (metadata) with experimental design.
# This file should have sample names as row names and experimental factors as column names.
# Example format:
# sample_id\tcondition\tbatch
# sample1\ttreated\tbatch1
# sample2\tcontrol\tbatch1
# sample3\ttreated\tbatch2
sample_info_file <- "path/to/your/sample_metadata.tsv"

# Output file for normalized counts
output_normalized_counts_file <- "normalized_counts.tsv"

# Design formula for DESeq2. Replace 'condition' with your actual experimental factor(s).
# For simple normalization, a design like '~ 1' or '~ condition' is common.
# If you have multiple factors, e.g., '~ batch + condition'
design_formula_str <- "~ condition" # IMPORTANT: Adjust this based on your experimental design

# Read count matrix
# Ensure the first column is row names (gene IDs) and subsequent columns are counts
count_data <- read.table(count_matrix_file, header=TRUE, row.names=1, sep="\t", check.names=FALSE)

# Ensure count data is integer (DESeq2 requires integer counts)
count_data <- round(count_data)
count_data[is.na(count_data)] <- 0 # Replace NA with 0 if any

# Read sample information
# Ensure the first column is row names (sample IDs) and subsequent columns are factors
sample_info <- read.table(sample_info_file, header=TRUE, row.names=1, sep="\t", check.names=FALSE)

# Ensure sample_info row names match count_data column names and are in the same order
sample_info <- sample_info[colnames(count_data), , drop=FALSE]

# Convert design formula string to an R formula object
design_formula <- as.formula(design_formula_str)

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = design_formula)

# Optional: Pre-filtering to remove rows with very low counts across all samples
# This can reduce memory usage and speed up calculations without losing much information.
# For example, keep only genes with at least 10 counts in total across all samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis to estimate size factors and dispersions
# Even if only normalized counts are needed, this step is required for proper normalization.
dds <- DESeq(dds)

# Extract normalized counts (size factor normalized counts)
normalized_counts <- counts(dds, normalized=TRUE)

# Save normalized counts to a TSV file
write.table(normalized_counts, file=output_normalized_counts_file, sep="\t", quote=FALSE, col.names=NA)

message(paste("Normalized counts saved to:", output_normalized_counts_file))
EOF

# Execute the R script
Rscript deseq2_normalize.R

View on GitHub

Tools Used

Ribo-seq STAR DESeq2

Raw Source Text

Library strategy: Ribo-Seq
Raw sequencing reads were trimmed for adapter sequences using cutadapt (v1.14). Trimmed reads were mapped against RepBase (v18.05) to remove reads mapping to repetitive sequences. Remaining reads were mapped to the appropriate genome build using STAR aligner (v2.5.2b). Raw read counts were generated from all mapped reads using featureCounts (v1.5.3) package.
Normalized counts were derived from DESeq output.
Assembly: hg19 (GRCh37)
Supplementary files format and content: counts_for_pub.txt contains featureCounts output in comma-delimited format.
Supplementary files format and content: diffexp.tsv files contain DESeq2 output in tab-separated format.

← Back to Analysis