GSE173507 Processing Pipeline

RNA-Seq, 4 steps

Publication

Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions.

Research Square (2022), PMID 35313591

Dataset

Discovery and functional interrogation of the virus and host RNA interactome of SARS-CoV-2 proteins [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


1
Raw reads were trimmed using cutadapt (v1.14) using the following parameters -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz

cutadapt v1.14
$ Bash example
```
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

cutadapt -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz
```
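A quick sanity check after trimming is to compare read counts before and after. The sketch below assumes standard four-line FASTQ records. `count_fastq_reads` is a hypothetical helper introduced here for illustration; it is not part of cutadapt or of the original pipeline.

```shell
# Hypothetical helper: count reads in a FASTQ file (plain or gzip-compressed),
# assuming exactly four lines per record.
count_fastq_reads() {
    local file="$1"
    local lines
    if [[ "${file}" == *.gz ]]; then
        lines=$(gzip -dc "${file}" | wc -l)
    else
        lines=$(wc -l < "${file}")
    fi
    echo $(( lines / 4 ))
}

# Example usage with the file names from the command above (runs only if both exist).
if [[ -f data.fastq.gz && -f data.fastqTr.fq ]]; then
    echo "raw reads:     $(count_fastq_reads data.fastq.gz)"
    echo "trimmed reads: $(count_fastq_reads data.fastqTr.fq)"
fi
```

Because `-m 18` discards reads shorter than 18 nt after trimming, the trimmed count is expected to be at most the raw count.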
2
Trimmed reads were mapped to and filtered of repeat elements (RepBase 18.05) with STAR (2.5.2) using the following parameters: --alignEndsType EndToEnd --genomeDir repbase --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix data --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --outStd Log --readFilesIn data.fastqTr.fq --runMode alignReads --runThreadN 8

STAR v2.5.2
$ Bash example
```
STAR --alignEndsType EndToEnd --genomeDir repbase --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix data --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --outStd Log --readFilesIn data.fastqTr.fq --runMode alignReads --runThreadN 8
```
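To see how many reads were absorbed by the repeat-element index, the summary that STAR writes next to the BAM file can be inspected. With `--outFileNamePrefix data` this file would be `dataLog.final.out`. The sketch below is an illustration only: `star_log_value` is a hypothetical helper, and the row labels are assumed to match STAR's usual `Log.final.out` layout.

```shell
# Hypothetical helper: print the value of one labelled row of a STAR Log.final.out file.
# Rows are assumed to look like "<label> |<TAB><value>".
star_log_value() {
    local log_file="$1" label="$2"
    awk -F'|' -v key="${label}" '
        {
            name = $1; value = $2
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # strip surrounding whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }' "${log_file}"
}

LOG="dataLog.final.out"   # assumed name, derived from --outFileNamePrefix data
if [[ -f "${LOG}" ]]; then
    total=$(star_log_value "${LOG}" "Number of input reads")
    unique=$(star_log_value "${LOG}" "Uniquely mapped reads number")
    multi=$(star_log_value "${LOG}" "Number of reads mapped to multiple loci")
    echo "input reads:                   ${total}"
    echo "reads mapped to repeat index:  $(( unique + multi ))"
fi
```

Reads that do not align here are written by `--outReadsUnmapped Fastx` to `dataUnmapped.out.mate1`, which serves as the input for the next step.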

3
Reads that did not map to repeat elements were then mapped to the human genome with STAR using the same parameters as in the previous step, with an hg19/ChlSab2 index in place of the repeat element index.

STAR (version not stated for this step; v2.5.2 is reported for the previous step)

$ Bash example

```
# Install STAR (example using conda)
# conda install -c bioconda star

# STAR_INDEX_DIR: placeholder path to a STAR index built from the hg19 and ChlSab2 assemblies.
#   hg19 (human genome assembly GRCh37) can be obtained from the UCSC Genome Browser
#   (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
#   ChlSab2 is the UCSC name for the green monkey (Chlorocebus sabaeus) assembly.
#   Such an index is typically built with STAR's --runMode genomeGenerate from both reference sequences.
STAR_INDEX_DIR="/path/to/hg19_ChlSab2_STAR_index"

# INPUT_FASTQ: reads that did not map to repeat elements in the previous step.
#   With --outReadsUnmapped Fastx and --outFileNamePrefix data, STAR writes them
#   to dataUnmapped.out.mate1 (uncompressed FASTQ).
INPUT_FASTQ="dataUnmapped.out.mate1"

# OUTPUT_PREFIX: placeholder prefix for all output files generated by STAR.
OUTPUT_PREFIX="genome_mapped_reads"

# N_THREADS: number of threads (CPU cores) to use for the alignment.
N_THREADS=8

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_PREFIX}_output"

# Run STAR alignment.
# The source states that this step reuses the parameters of the repeat-element step,
# so only the index, the input file, and the output prefix differ below.
STAR \
    --runMode alignReads \
    --runThreadN "${N_THREADS}" \
    --genomeDir "${STAR_INDEX_DIR}" \
    --genomeLoad NoSharedMemory \
    --readFilesIn "${INPUT_FASTQ}" \
    --alignEndsType EndToEnd \
    --outFilterMultimapNmax 10 \
    --outFilterMultimapScoreRange 1 \
    --outFilterScoreMin 10 \
    --outFilterType BySJout \
    --outReadsUnmapped Fastx \
    --outSAMattrRGline ID:foo \
    --outSAMattributes All \
    --outSAMmode Full \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outBAMcompression 10 \
    --outStd Log \
    --outFileNamePrefix "${OUTPUT_PREFIX}_output/"
```
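The source does not describe how the combined hg19/ChlSab2 index was built. One possible approach, sketched here purely as an assumption, is to pass both FASTA files to STAR's `genomeGenerate` mode after making the sequence names unique, since both UCSC assemblies use names such as `chr1`. The file names, the `chlSab2_` prefix, and the helper `prefix_fasta_headers` are hypothetical.

```shell
# Hypothetical helper: prepend a prefix to every FASTA header so that sequence
# names from two assemblies cannot collide (e.g. chr1 in both hg19 and ChlSab2).
prefix_fasta_headers() {
    local prefix="$1" fasta="$2"
    sed "s/^>/>${prefix}/" "${fasta}"
}

# Placeholder file names; adjust to the local copies of each assembly.
HG19_FA="hg19.fa"
CHLSAB2_FA="chlSab2.fa"
INDEX_DIR="hg19_ChlSab2_STAR_index"

# Build the index only if STAR and both FASTA files are available.
if command -v STAR >/dev/null 2>&1 && [[ -f "${HG19_FA}" && -f "${CHLSAB2_FA}" ]]; then
    prefix_fasta_headers "chlSab2_" "${CHLSAB2_FA}" > chlSab2.prefixed.fa
    mkdir -p "${INDEX_DIR}"
    STAR --runMode genomeGenerate \
         --runThreadN 8 \
         --genomeDir "${INDEX_DIR}" \
         --genomeFastaFiles "${HG19_FA}" chlSab2.prefixed.fa
fi
```

Leaving the hg19 names unchanged keeps them compatible with the Gencode v19 annotation used for counting, which refers to sequences as `chr1`, `chr2`, and so on.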


4
Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam) was used to count reads per feature using the human Gencode v19 annotation.

featureCounts (Subread; version not stated in the source)

$ Bash example

```
# Define input and output files
INPUT_BAM="data.bam" # Placeholder for your input BAM file
OUTPUT_COUNTS="counts.txt"

# Define annotation file
GENCODE_GTF="gencode.v19.annotation.gtf"
GENCODE_GTF_URL="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Download the Gencode v19 annotation if not already present
if [ ! -f "${GENCODE_GTF}" ]; then
    echo "Downloading Gencode v19 annotation..."
    wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
    gunzip "${GENCODE_GTF}.gz"
fi

# Run featureCounts
#   -s 2: count reads on the reverse strand relative to the feature
#   -p:   treat the input as paired-end
# Note: from Subread 2.0.2 onward, -p alone no longer counts fragments;
# --countReadPairs must be added for that. The source does not state the version used.
featureCounts -a "${GENCODE_GTF}" -s 2 -p -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"
```
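The featureCounts table starts with a `#` comment line and a header row, followed by one row per gene with the annotation columns (Geneid, Chr, Start, End, Strand, Length) and one count column per BAM file. A minimal sketch for pulling out gene IDs and counts is shown below; `extract_counts` is a hypothetical helper, and it assumes a single input BAM so that the counts sit in column 7.

```shell
# Hypothetical helper: print "gene_id<TAB>count" for the first sample column
# (column 7) of a featureCounts table, skipping the comment and header lines.
extract_counts() {
    local counts_file="$1"
    awk -F'\t' 'BEGIN { OFS = "\t" } !/^#/ && $1 != "Geneid" { print $1, $7 }' "${counts_file}"
}

# Example usage: list the ten genes with the highest counts (runs only if the file exists).
if [[ -f counts.txt ]]; then
    extract_counts counts.txt | sort -t $'\t' -k2,2nr | head -n 10
fi
```

featureCounts also writes a `counts.txt.summary` file alongside the table, which reports how many reads were assigned or left unassigned and why.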

Tools Used

cutadapt

STAR

featureCounts (Subread)

Raw Source Text

Raw reads were trimmed using cutadapt (v1.14) using the following parameters -O 5 -f  fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz
Trimmed reads were mapped to and filtered of repeat elements (RepBase 18.05) with STAR (2.5.2) using the following parameters: --alignEndsType  EndToEnd  --genomeDir  repbase  --genomeLoad  NoSharedMemory  --outBAMcompression  10  --outFileNamePrefix  data  --outFilterMultimapNmax  10  --outFilterMultimapScoreRange  1  --outFilterScoreMin  10  --outFilterType  BySJout  --outReadsUnmapped  Fastx  --outSAMattrRGline  ID:foo  --outSAMattributes  All  --outSAMmode  Full  --outSAMtype  BAM  Unsorted  --outSAMunmapped  Within  --outStd  Log  --readFilesIn data.fastqTr.fq  --runMode  alignReads  --runThreadN  8
Reads unmapped to repeat elements were mapped to the human genome with STAR using the same parameters as the previous step, using an hg19/ChlSab2 index in place of the repeat element index
Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam) was used to count features using human annotations (Gencode v19)
Genome_build: ChlSab2
Genome_build: hg19
Supplementary_files_format_and_content: BigWig
