GSE173507 Processing Pipeline
Publication
Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions. Research Square (2022). PMID 35313591
Dataset
GSE173507: Discovery and functional interrogation of the virus and host RNA interactome of SARS-CoV-2 proteins [RNA-Seq]
Processing Steps
1
Raw reads were trimmed with cutadapt (v1.14) using the following parameters: -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# The flags below follow the source, which used cutadapt v1.14.
cutadapt -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 \
  -o data.fastqTr.fq \
  -b TCGTATGCCGTCTTCTGCTTG \
  -b ATCTCGTATGCCGTCTTCTGCTTG \
  -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
  -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
  -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
  data.fastq.gz
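As a quick sanity check on the trimming step, the sketch below counts the reads in a trimmed FASTQ file and reports how many are shorter than the -m 18 minimum (ideally none). It assumes an uncompressed FASTQ with exactly four lines per record. The helper name fastq_length_summary is illustrative and is not part of the published pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the source): summarise read lengths in a FASTQ file.
# Assumes uncompressed input with four lines per record.
fastq_length_summary() {
  local fastq="$1" min_len="${2:-18}"
  awk -v min="$min_len" '
    NR % 4 == 2 {                      # the sequence line of each record
      total++
      if (length($0) < min) n_short++
    }
    END { printf "reads=%d shorter_than_%d=%d\n", total, min, n_short }
  ' "$fastq"
}

# Run the check only if the trimmed file from step 1 is present.
if [ -f data.fastqTr.fq ]; then
  fastq_length_summary data.fastqTr.fq 18
fi
```

A non-zero shorter_than_18 value would suggest that the file was not produced with the -m 18 setting described above.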
2
Trimmed reads were mapped to and filtered of repeat elements (RepBase 18.05) with STAR (2.5.2) using the following parameters: --alignEndsType EndToEnd --genomeDir repbase --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix data --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --outStd Log --readFilesIn data.fastqTr.fq --runMode alignReads --runThreadN 8
$ Bash example
STAR \
  --alignEndsType EndToEnd \
  --genomeDir repbase \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix data \
  --outFilterMultimapNmax 10 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMattrRGline ID:foo \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --outStd Log \
  --readFilesIn data.fastqTr.fq \
  --runMode alignReads \
  --runThreadN 8
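To see how many reads the repeat-element filter absorbed, one option is to read values from the Log.final.out summary that STAR writes next to its alignments. The sketch below assumes the usual "label | value" layout of that file and that, with --outFileNamePrefix data, it is named dataLog.final.out. The helper name star_log_value is illustrative, not part of the source pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the source): print the value for one label in a
# STAR Log.final.out file, assuming lines of the form "<label> |<TAB><value>".
star_log_value() {
  local log_file="$1" label="$2"
  awk -F '|' -v label="$label" '
    {
      key = $1; val = $2
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the label
      gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim padding around the value
      if (key == label) { print val; exit }
    }
  ' "$log_file"
}

# Report two summary values only if the step 2 log is present (assumed file name).
if [ -f dataLog.final.out ]; then
  star_log_value dataLog.final.out "Number of input reads"
  star_log_value dataLog.final.out "Uniquely mapped reads %"
fi
```

The labels must match the log text exactly. If a label is not found, the function prints nothing.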
3
Reads unmapped to repeat elements were mapped to the human genome with STAR using the same parameters as the previous step, using an hg19/ChlSab2 index in place of the repeat element index
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# STAR_INDEX_DIR: placeholder path to a STAR index built from the hg19 and ChlSab2 assemblies.
# hg19 (human genome assembly GRCh37) can be obtained from the UCSC Genome Browser
# (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
# ChlSab2 (UCSC name chlSab2) is the green monkey (Chlorocebus sabaeus) genome assembly.
# A combined index would typically be built with STAR --runMode genomeGenerate from both FASTA files.
STAR_INDEX_DIR="/path/to/hg19_ChlSab2_STAR_index"

# INPUT_FASTQ: reads that did not map to repeat elements in the previous step.
# With --outReadsUnmapped Fastx and --outFileNamePrefix data, STAR names this file
# dataUnmapped.out.mate1 (assumed here; adjust if the file was renamed).
INPUT_FASTQ="dataUnmapped.out.mate1"

# OUTPUT_PREFIX: placeholder prefix, chosen so that the step 2 outputs are not overwritten.
OUTPUT_PREFIX="genome_mapped_reads"

# N_THREADS: number of threads (CPU cores) to use for the alignment.
N_THREADS=8

# Run STAR alignment.
# The source states that the parameters are the same as in the previous step;
# only the index, the input reads, and (in this sketch) the output prefix differ.
STAR \
  --alignEndsType EndToEnd \
  --genomeDir "${STAR_INDEX_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 10 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMattrRGline ID:foo \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --outStd Log \
  --readFilesIn "${INPUT_FASTQ}" \
  --runMode alignReads \
  --runThreadN "${N_THREADS}"
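Because step 3 reuses the step 2 parameters, one way to keep the two runs in sync is to define the shared options once and vary only the index, input, and output prefix. The sketch below does this with a small wrapper that prints the command by default (DRY_RUN=1) and executes it when DRY_RUN=0. The wrapper name run_star, the index directory hg19_chlSab2, and the step 3 prefix are placeholders rather than values from the source.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not from the source): run_star INDEX_DIR INPUT_FASTQ OUTPUT_PREFIX
# Builds one STAR command from the parameters shared by steps 2 and 3.
# Prints the command when DRY_RUN=1 (the default here); executes it when DRY_RUN=0.
run_star() {
  local index_dir="$1" input_fastq="$2" prefix="$3"
  set -- STAR \
    --alignEndsType EndToEnd \
    --genomeLoad NoSharedMemory \
    --outBAMcompression 10 \
    --outFilterMultimapNmax 10 \
    --outFilterMultimapScoreRange 1 \
    --outFilterScoreMin 10 \
    --outFilterType BySJout \
    --outReadsUnmapped Fastx \
    --outSAMattrRGline ID:foo \
    --outSAMattributes All \
    --outSAMmode Full \
    --outSAMtype BAM Unsorted \
    --outSAMunmapped Within \
    --outStd Log \
    --runMode alignReads \
    --runThreadN 8 \
    --genomeDir "$index_dir" \
    --readFilesIn "$input_fastq" \
    --outFileNamePrefix "$prefix"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 2: repeat elements. Step 3: hg19/ChlSab2 (placeholder index and prefix).
run_star repbase data.fastqTr.fq data
run_star hg19_chlSab2 dataUnmapped.out.mate1 genome_mapped_reads
```

Printing the commands first makes it easy to confirm that the two runs differ only in the three arguments that are meant to change.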
4
Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam) was used to count features using human annotations (Gencode v19)
$ Bash example
# Define input and output files
INPUT_BAM="data.bam"          # Placeholder for your input BAM file
OUTPUT_COUNTS="counts.txt"

# Define annotation file
GENCODE_GTF="gencode.v19.annotation.gtf"
GENCODE_GTF_URL="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"

# Install featureCounts (part of the Subread package)
# conda install -c bioconda subread

# Download Gencode v19 annotation if not already present
if [ ! -f "${GENCODE_GTF}" ]; then
  echo "Downloading Gencode v19 annotation..."
  wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
  gunzip "${GENCODE_GTF}.gz"
fi

# Run featureCounts
featureCounts -a "${GENCODE_GTF}" -s 2 -p -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"
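Once counts.txt exists, a short summary helps confirm that counting worked. The sketch below assumes the standard featureCounts layout: a leading comment line starting with "#", a header row beginning with Geneid, and the counts for a single BAM file in column 7. The helper name summarize_counts is illustrative and is not part of the source pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the source): summarise a single-sample featureCounts table.
# Assumes tab-separated columns Geneid, Chr, Start, End, Strand, Length, <counts>.
summarize_counts() {
  local counts_file="$1"
  awk -F '\t' '
    /^#/ { next }                  # skip the leading comment line
    $1 == "Geneid" { next }        # skip the column header
    {
      genes++
      total += $7
      if ($7 > 0) detected++
    }
    END { printf "genes=%d detected=%d total_counts=%d\n", genes, detected, total }
  ' "$counts_file"
}

# Summarise the step 4 output only if it is present.
if [ -f counts.txt ]; then
  summarize_counts counts.txt
fi
```

Here genes is the number of annotated features in the table, detected is the number with at least one count, and total_counts is the sum of the count column.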
Tools Used
cutadapt (v1.14), STAR (2.5.2), Subread featureCounts (version not stated)
Raw Source Text
Raw reads were trimmed using cutadapt (v1.14) using the following parameters -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz
Trimmed reads were mapped to and filtered of repeat elements (RepBase 18.05) with STAR (2.5.2) using the following parameters: --alignEndsType EndToEnd --genomeDir repbase --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix data --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --outStd Log --readFilesIn data.fastqTr.fq --runMode alignReads --runThreadN 8
Reads unmapped to repeat elements were mapped to the human genome with STAR using the same parameters as the previous step, using an hg19/ChlSab2 index in place of the repeat element index
Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam) was used to count features using human annotations (Gencode v19)
Genome_build: ChlSab2
Genome_build: hg19
Supplementary_files_format_and_content: BigWig