GSE173507 Processing Pipeline

RNA-Seq, 4 steps

Publication

Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions.

Research Square (2022), PMID 35313591

Dataset

GSE173507

Discovery and functional interrogation of the virus and host RNA interactome of SARS-CoV-2 proteins [RNA-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Raw reads were trimmed with cutadapt (v1.14) using the following parameters: -O 5 -f fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (if not already installed).
    # The source used v1.14; pinning it avoids possible flag differences in newer releases.
    # conda install -c bioconda cutadapt=1.14
    
    # Trim adapters and poly-A/poly-T stretches; keep reads of at least 18 nt (-m 18).
    cutadapt \
        -O 5 \
        -f fastq \
        --match-read-wildcards \
        --times 2 \
        -e 0.0 \
        --quality-cutoff 6 \
        -m 18 \
        -o data.fastqTr.fq \
        -b TCGTATGCCGTCTTCTGCTTG \
        -b ATCTCGTATGCCGTCTTCTGCTTG \
        -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
        -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
        -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \
        -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT \
        data.fastq.gz
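As a quick check on the `-m 18` minimum-length filter used in this step, the sketch below counts reads in the trimmed FASTQ that are shorter than a given length. The function name `count_short_reads` is hypothetical and not part of cutadapt, and the sketch assumes an uncompressed FASTQ with exactly four lines per record, such as `data.fastqTr.fq`.

```shell
# Hypothetical helper (not part of cutadapt): count reads shorter than MIN_LEN
# in an uncompressed FASTQ with exactly four lines per record.
count_short_reads() {
    local fastq="$1"
    local min_len="${2:-18}"
    # Sequence lines are the 2nd line of every 4-line record.
    awk -v min="${min_len}" \
        'NR % 4 == 2 && length($0) < min { n++ } END { print n + 0 }' \
        "${fastq}"
}

# Example usage (a result of 0 is expected if the -m 18 filter was applied):
# count_short_reads data.fastqTr.fq 18
```

A non-zero result would suggest the file was not produced with the length filter shown in the command, or that it was modified afterwards.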
  2.

    Trimmed reads were mapped to repeat elements (RepBase 18.05) with STAR (2.5.2), and reads that mapped were filtered out. The following parameters were used: --alignEndsType EndToEnd --genomeDir repbase --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix data --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --outStd Log --readFilesIn data.fastqTr.fq --runMode alignReads --runThreadN 8

    $ Bash example
    # Map trimmed reads to the RepBase index; unmapped reads are written out for step 3.
    STAR \
        --alignEndsType EndToEnd \
        --genomeDir repbase \
        --genomeLoad NoSharedMemory \
        --outBAMcompression 10 \
        --outFileNamePrefix data \
        --outFilterMultimapNmax 10 \
        --outFilterMultimapScoreRange 1 \
        --outFilterScoreMin 10 \
        --outFilterType BySJout \
        --outReadsUnmapped Fastx \
        --outSAMattrRGline ID:foo \
        --outSAMattributes All \
        --outSAMmode Full \
        --outSAMtype BAM Unsorted \
        --outSAMunmapped Within \
        --outStd Log \
        --readFilesIn data.fastqTr.fq \
        --runMode alignReads \
        --runThreadN 8
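Because this step acts as a filter, it can be useful to see how many reads mapped to repeat elements. STAR writes a summary to `<prefix>Log.final.out` (here `dataLog.final.out`), with lines of the form `<metric name> |<TAB><value>`. The sketch below pulls one metric out of that file; the function name `star_log_metric` is hypothetical, and this check is not described in the source.

```shell
# Hypothetical helper: print one metric from a STAR Log.final.out file.
# Lines in that file have the form "<metric name> |<TAB><value>".
star_log_metric() {
    local log_file="$1"
    local metric="$2"
    awk -F '|' -v m="${metric}" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim padding around the name
            if (name == m) {
                value = $2
                gsub(/^[ \t]+|[ \t]+$/, "", value)
                print value
            }
        }' "${log_file}"
}

# Example usage with the output prefix from the command above:
# star_log_metric dataLog.final.out "Number of input reads"
# star_log_metric dataLog.final.out "Uniquely mapped reads %"
```

With `--outFilterMultimapNmax 10`, reads that hit several repeat loci are reported under the multi-mapping metrics rather than the unique ones, so both are worth checking.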
  3.

    Reads that did not map to repeat elements were mapped to the human genome with STAR, using the same parameters as the previous step but with an hg19/ChlSab2 index in place of the repeat element index.

    $ Bash example
    # Install STAR (example using conda); the source used STAR 2.5.2 in step 2.
    # conda install -c bioconda star
    
    # STAR_INDEX_DIR: path to a STAR index built from the hg19 and ChlSab2 assemblies.
    #   hg19 (human, GRCh37) is available from the UCSC Genome Browser
    #   (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).
    #   ChlSab2 is the green monkey (Chlorocebus sabaeus) genome assembly.
    #   The index would typically be built with STAR's --runMode genomeGenerate from both FASTA files.
    STAR_INDEX_DIR="/path/to/hg19_ChlSab2_STAR_index"
    
    # INPUT_FASTQ: reads that did not map to repeat elements in step 2.
    #   With --outReadsUnmapped Fastx and --outFileNamePrefix data, STAR writes them
    #   to dataUnmapped.out.mate1 (uncompressed).
    INPUT_FASTQ="dataUnmapped.out.mate1"
    
    # OUTPUT_PREFIX: prefix for all STAR output files (a placeholder; the source does not give one).
    OUTPUT_PREFIX="genome_mapped_reads"
    
    # N_THREADS: number of threads (CPU cores) to use for the alignment.
    N_THREADS=8
    
    # Run STAR with the same parameters as step 2, as the source describes.
    # Only the index, the input reads, and the output prefix differ.
    STAR \
        --alignEndsType EndToEnd \
        --genomeDir "${STAR_INDEX_DIR}" \
        --genomeLoad NoSharedMemory \
        --outBAMcompression 10 \
        --outFileNamePrefix "${OUTPUT_PREFIX}" \
        --outFilterMultimapNmax 10 \
        --outFilterMultimapScoreRange 1 \
        --outFilterScoreMin 10 \
        --outFilterType BySJout \
        --outReadsUnmapped Fastx \
        --outSAMattrRGline ID:foo \
        --outSAMattributes All \
        --outSAMmode Full \
        --outSAMtype BAM Unsorted \
        --outSAMunmapped Within \
        --outStd Log \
        --readFilesIn "${INPUT_FASTQ}" \
        --runMode alignReads \
        --runThreadN "${N_THREADS}"
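The source does not say how the hg19/ChlSab2 index was built. One plausible route, sketched below, is to pass both FASTA files to STAR's `genomeGenerate` mode. The function name `build_combined_index`, the file names, and the thread count are placeholders. The sketch also assumes that sequence names do not collide between the two assemblies; if both use names such as `chr1`, they would need to be renamed first.

```shell
# Hypothetical helper: build a single STAR index from two FASTA files.
# Assumes STAR is on PATH and that sequence names are unique across both files.
build_combined_index() {
    local index_dir="$1"    # output directory for the index
    local human_fa="$2"     # e.g. hg19.fa (uncompressed)
    local monkey_fa="$3"    # e.g. chlSab2.fa (uncompressed)
    local threads="${4:-8}"

    mkdir -p "${index_dir}"
    STAR \
        --runMode genomeGenerate \
        --genomeDir "${index_dir}" \
        --genomeFastaFiles "${human_fa}" "${monkey_fa}" \
        --runThreadN "${threads}"
}

# Example usage:
# build_combined_index /path/to/hg19_ChlSab2_STAR_index hg19.fa chlSab2.fa 8
```

Indexing a human-sized genome with STAR needs tens of gigabytes of memory, so this is usually run once and the resulting directory reused.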
  4.

    Features were counted against human annotations (Gencode v19) with Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam).

    featureCounts (version not stated in the source)
    $ Bash example
    # Define input and output files
    INPUT_BAM="data.bam" # Placeholder for your input BAM file
    OUTPUT_COUNTS="counts.txt"
    
    # Define annotation file
    GENCODE_GTF="gencode.v19.annotation.gtf"
    GENCODE_GTF_URL="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
    
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Download Gencode v19 annotation if not already present
    if [ ! -f "${GENCODE_GTF}" ]; then
        echo "Downloading Gencode v19 annotation..."
        wget -O "${GENCODE_GTF}.gz" "${GENCODE_GTF_URL}"
        gunzip "${GENCODE_GTF}.gz"
    fi
    
    # Run featureCounts
    featureCounts -a "${GENCODE_GTF}" -s 2 -p -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"
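The `counts.txt` file written by featureCounts starts with a `#` comment line and a header row, followed by one row per feature with annotation columns (Geneid, Chr, Start, End, Strand, Length) and then the counts, beginning in column 7. The sketch below reduces that file to a two-column table. The function name `extract_gene_counts` is hypothetical, and the sketch assumes a single BAM was counted, as in the command above.

```shell
# Hypothetical helper: reduce featureCounts output to "gene_id<TAB>count".
# Skips the leading '#' comment line and the header row; assumes one BAM,
# so the count is in column 7.
extract_gene_counts() {
    local counts_file="$1"
    awk -F '\t' '!/^#/ && $1 != "Geneid" { print $1 "\t" $7 }' "${counts_file}"
}

# Example usage: the ten features with the most assigned reads.
# extract_gene_counts counts.txt | sort -t "$(printf '\t')" -k2,2nr | head -n 10
```

If several BAM files are counted in one run, each gets its own column after column 7 and the `awk` expression would need to be extended accordingly.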


Raw Source Text
Raw reads were trimmed using cutadapt (v1.14) using the following parameters -O 5 -f  fastq --match-read-wildcards --times 2 -e 0.0 --quality-cutoff 6 -m 18 -o data.fastqTr.fq -b TCGTATGCCGTCTTCTGCTTG -b ATCTCGTATGCCGTCTTCTGCTTG -b CGACAGGTTCAGAGTTCTACAGTCCGACGATC -b GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -b AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA -b TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT data.fastq.gz
Trimmed reads were mapped to and filtered of repeat elements (RepBase 18.05) with STAR (2.5.2) using the following parameters: --alignEndsType  EndToEnd  --genomeDir  repbase  --genomeLoad  NoSharedMemory  --outBAMcompression  10  --outFileNamePrefix  data  --outFilterMultimapNmax  10  --outFilterMultimapScoreRange  1  --outFilterScoreMin  10  --outFilterType  BySJout  --outReadsUnmapped  Fastx  --outSAMattrRGline  ID:foo  --outSAMattributes  All  --outSAMmode  Full  --outSAMtype  BAM  Unsorted  --outSAMunmapped  Within  --outStd  Log  --readFilesIn data.fastqTr.fq  --runMode  alignReads  --runThreadN  8
Reads unmapped to repeat elements were mapped to the human genome with STAR using the same parameters as the previous step, using an hg19/ChlSab2 index in place of the repeat element index
Subread featureCounts (-a gencode.v19.annotation.gtf -s 2 -p -o counts.txt data.bam) was used to count features using human annotations (Gencode v19)
Genome_build: ChlSab2
Genome_build: hg19
Supplementary_files_format_and_content: BigWig