GSE207251 Processing Pipeline

RNA-Seq, 4 steps

Publication

Prions induce an early Arc response and a subsequent reduction in mGluR5 in the hippocampus.

Neurobiology of Disease (2022), PMID 35905927

Dataset

GSE207251

Prions induce an early Arc response in the hippocampus of prion-infected mice

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Adapters trimmed with cutadapt (v1.4.0)

    $ Bash example
    # Install cutadapt (if not already installed). The source gives the version as
    # v1.4.0; the exact version string available from a package channel may differ.
    # conda install -c bioconda cutadapt=1.4.0
    
    # ASSUMPTION: the source does not state the adapter sequence. The Illumina TruSeq
    # adapter (Read 1) is used here as a placeholder; confirm it against the library prep.
    ADAPTER_3_PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    
    # Placeholder input and output files
    INPUT_FASTQ="input_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads.fastq.gz"
    
    # Parameters (the source states only that adapters were trimmed; the thresholds are assumptions):
    # -a: adapter sequence to be removed from the 3' end of reads.
    # -q 20: Quality trim from the 3' end using a quality threshold of 20 (assumed).
    # -m 18: Discard reads shorter than 18 bp after trimming (assumed).
    # -o: Specify the output file.
    # Note: multi-core options such as --cores are not available in cutadapt v1.4.0.
    
    cutadapt -a "${ADAPTER_3_PRIME}" \
             -q 20 \
             -m 18 \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  2. Reads mapping to repetitive elements (RepBase v18.04) with STAR were removed

    $ Bash example
    # --- Installation (commented out) ---
    # conda create -n star_env star -y
    # conda activate star_env
    
    # --- Reference Data Setup (commented out, assuming RepBase index is pre-built) ---
    # Obtain the RepBase v18.04 FASTA from GIRI (https://www.girinst.org/repbase/); access may require registration.
    # The file names below are placeholders, not verified download paths.
    # gunzip repbase_v18.04.fasta.gz   # yields repbase_v18.04.fasta
    
    # Build STAR index for RepBase v18.04
    # REPBASE_FASTA="repbase_v18.04.fasta" # Path to RepBase v18.04 FASTA file
    # REPBASE_STAR_INDEX="/path/to/repbase_star_index" # Directory for the STAR index
    # mkdir -p "${REPBASE_STAR_INDEX}"
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${REPBASE_STAR_INDEX}" \
    #      --genomeFastaFiles "${REPBASE_FASTA}" \
    #      --genomeSAindexNbases 10 # Assumed value; small references need this reduced from the default of 14
    
    # --- Parameters ---
    # Adapter-trimmed reads from step 1. Single-end input is assumed to match steps 1 and 3;
    # for paired-end data, pass both files to --readFilesIn.
    READS="trimmed_reads.fastq.gz"
    REPBASE_STAR_INDEX="/path/to/repbase_star_index" # Path to the pre-built STAR index for RepBase v18.04
    OUTPUT_PREFIX="filtered_from_repbase" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # --- Execution Command ---
    # Align reads to the RepBase index and keep only the unmapped reads, which removes
    # reads that map to repetitive elements.
    # ASSUMPTION: the source states only that STAR was used. The filtering and alignment
    # thresholds below are illustrative settings, not the authors' documented parameters.
    STAR --genomeDir "${REPBASE_STAR_INDEX}" \
         --readFilesIn "${READS}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype None \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 100 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 1 \
         --outFilterScoreMinOverLread 0 \
         --outFilterMatchNminOverLread 0 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJoverhangMin 8 \
         --alignSJDBoverhangMin 1 \
         --sjdbScore 1 \
         --runThreadN "${THREADS}"
    
    # Output files:
    # - ${OUTPUT_PREFIX}_Unmapped.out.mate1: uncompressed FASTQ of reads that did NOT map to RepBase.
    # - ${OUTPUT_PREFIX}_Unmapped.out.mate2: written only for paired-end input.
    # These unmapped reads proceed to the next step of the pipeline (alignment to the main genome).
  3. Remaining reads were mapped to mouse genome mm9 with STAR (v2.4.0i)

    $ Bash example
    # Install STAR if not already installed. The source gives the version as v2.4.0i;
    # the exact version string available from a package channel may differ.
    # conda install -c bioconda star=2.4.0i
    
    # Define variables (adjust paths as needed)
    # Reads left after repeat filtering. STAR writes unmapped reads uncompressed, so no
    # --readFilesCommand is needed; add "--readFilesCommand zcat" if the input is gzipped.
    INPUT_FASTQ="filtered_from_repbase_Unmapped.out.mate1"
    # STAR genome index directory for mouse genome mm9 (placeholder path), pre-built with
    # STAR --runMode genomeGenerate. It should contain files such as Genome, SA, and SAindex.
    GENOME_DIR="/path/to/star_index/mm9"
    # Prefix for output files. STAR appends its file names directly to the prefix, so the
    # trailing dot yields mapped_reads.Aligned.sortedByCoord.out.bam.
    OUTPUT_PREFIX="mapped_reads."
    # Number of threads to use for alignment
    NUM_THREADS=8 # Adjust based on available CPU cores
    
    # Run STAR alignment (alignment parameters are not given in the source; STAR defaults are used)
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_FASTQ}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate # Coordinate-sorted BAM for downstream counting
  4. FeatureCounts was used to assign reads to genes (v1.5.0)

    $ Bash example
    # Install featureCounts (part of the Subread package)
    # conda install -c bioconda subread
    
    # Example: Assign reads from the mm9-aligned BAM file to genes using a GTF annotation.
    # This command assumes unstranded data (-s 0). Adjust -s 1 or -s 2 for stranded libraries.
    # 'mm9_genes.gtf' is a placeholder name: the source does not say which mouse mm9
    # gene annotation was used, so substitute the annotation that matches your mm9 index.
    # Replace 'mapped_reads.Aligned.sortedByCoord.out.bam' with your input BAM file(s).
    # Replace 'gene_counts.txt' with your desired output file name.
    featureCounts -a mm9_genes.gtf \
                  -o gene_counts.txt \
                  -F GTF \
                  -t exon \
                  -g gene_id \
                  -s 0 \
                  -T 8 \
                  mapped_reads.Aligned.sortedByCoord.out.bam
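
A simple sanity check across the steps above is to count how many reads remain after trimming and after repeat filtering. The following is a minimal sketch, not part of the published pipeline: the helper name `count_fastq_reads` is hypothetical, and it assumes standard four-line FASTQ records in plain or gzip-compressed files.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes the standard layout of four lines per record.
count_fastq_reads() {
    local fastq="$1"
    local lines
    if [[ "${fastq}" == *.gz ]]; then
        # Decompress to stdout and count lines without writing a temporary file.
        lines=$(gzip -dc "${fastq}" | wc -l)
    else
        lines=$(wc -l < "${fastq}")
    fi
    # Four lines per read: header, sequence, separator, qualities.
    echo $(( lines / 4 ))
}
```

For example, `count_fastq_reads trimmed_reads.fastq.gz` and `count_fastq_reads filtered_from_repbase_Unmapped.out.mate1` (the placeholder file names used in the steps above) give the read counts before and after repeat removal.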


Raw Source Text
Adapters trimmed with cutadapt (v1.4.0)
Reads mapping to repetitive elements (RepBase v18.04) with STAR were removed
Remaining reads were mapped to mouse genome mm9 with STAR (v2.4.0i)
FeatureCounts was used to assign reads to genes (v1.5.0)
Assembly: mm9
Supplementary files format and content: tpm_filtered.csv, comma separated file with transcripts per million calculated for all expressed genes.
Supplementary files format and content: deseq2_results.csv, comma separated file with results from differential expression analysis (prion vs mock) calculated with DESeq2
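
The source does not describe how the TPM values in tpm_filtered.csv were computed or filtered. As a rough sketch under stated assumptions, TPM can be derived from a single-sample featureCounts table by dividing each gene's count by its length in kilobases and scaling so the values sum to one million. The function name `counts_to_tpm`, the column positions (gene ID in column 1, length in column 6, counts in column 7), and the optional `min_tpm` cutoff for "expressed" genes are all assumptions for illustration, not the authors' method.

```shell
# Convert a single-sample featureCounts table to gene-level TPM (CSV on stdout).
# Usage: counts_to_tpm gene_counts.txt [min_tpm]
# Assumes tab-separated columns: Geneid, Chr, Start, End, Strand, Length, counts.
counts_to_tpm() {
    local counts_file="$1"
    local min_tpm="${2:-0}"   # hypothetical expression cutoff; genes at or below it are dropped
    # The file is read twice: pass 1 sums the length-normalised rates,
    # pass 2 scales each gene's rate to transcripts per million.
    awk -F'\t' -v OFS=',' -v min_tpm="${min_tpm}" '
        BEGIN { print "gene_id", "tpm" }
        /^#/ || $1 == "Geneid" { next }   # skip the comment and header lines
        $6 <= 0 { next }                  # skip genes without a usable length
        NR == FNR { total += $7 / ($6 / 1000); next }
        total > 0 {
            tpm = ($7 / ($6 / 1000)) / total * 1e6
            if (tpm > min_tpm) printf "%s,%.2f\n", $1, tpm
        }
    ' "${counts_file}" "${counts_file}"
}
```

Running `counts_to_tpm gene_counts.txt 1 > tpm_sketch.csv` would keep genes above 1 TPM; the threshold and output name are illustrative. Differential expression with DESeq2, as listed for deseq2_results.csv, would start from the raw counts rather than from TPM.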