GSE277161 Processing Pipeline

RNA-Seq, 4 steps

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular Cell (2024), PMID 39303722

Dataset

GSE277161

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [Ribo-STAMP cell lines]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a).

    $ Bash example
    # Install STAR (e.g., using Bioconda)
    # conda install -c bioconda star=2.7.1a
    
    # --- Reference Data Setup ---
    # The human genome GRCh38 and Gencode v40 annotation are required to build the STAR index.
    # Example commands to download and build the index (run once):
    # mkdir -p /path/to/STAR_genome_index_GRCh38_Gencode_v40
    # cd /path/to/STAR_genome_index_GRCh38_Gencode_v40
    # wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/GRCh38.primary_assembly.genome.fa.gz
    # wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gtf.gz
    # gunzip GRCh38.primary_assembly.genome.fa.gz
    # gunzip gencode.v40.annotation.gtf.gz
    #
    # STAR --runThreadN 8 --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_genome_index_GRCh38_Gencode_v40 \
    #      --genomeFastaFiles GRCh38.primary_assembly.genome.fa \
    #      --sjdbGTFfile gencode.v40.annotation.gtf \
    #      --sjdbOverhang 100   # Set to (read length - 1) for your data
    
    # --- Alignment Step ---
    # Define input files and genome directory
    INPUT_READS_R1="sample_R1.fastq.gz" # Replace with your actual R1 FASTQ file
    INPUT_READS_R2="sample_R2.fastq.gz" # Replace with your actual R2 FASTQ file (if paired-end)
    GENOME_INDEX_DIR="/path/to/STAR_genome_index_GRCh38_Gencode_v40" # Path to your pre-built STAR index
    OUTPUT_PREFIX="aligned_reads_"
    NUM_THREADS=8 # Number of threads to use
    
    # Align reads to the human genome (GRCh38 with Gencode v40 annotation)
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${GENOME_INDEX_DIR} \
         --readFilesIn ${INPUT_READS_R1} ${INPUT_READS_R2} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 6 \
         --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)
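
The index-building comments above say to set --sjdbOverhang to the read length minus 1. The following is a minimal sketch of how that value might be estimated from a FASTQ file. The helper name estimate_sjdb_overhang and the choice to sample the first 10,000 reads are assumptions made for illustration; they are not part of the published pipeline.

```shell
# Hypothetical helper: estimate --sjdbOverhang as (longest sampled read - 1)
# from the first N reads of a gzipped FASTQ file.
estimate_sjdb_overhang() {
    local fastq_gz="$1"
    local n_reads="${2:-10000}"   # number of reads to sample (assumption)
    gzip -dc "${fastq_gz}" \
        | head -n "$(( n_reads * 4 ))" \
        | awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
               END { if (max > 0) print max - 1; else exit 1 }'
}
```

The result could then be passed to the index build, for example --sjdbOverhang "$(estimate_sjdb_overhang sample_R1.fastq.gz)". If reads were trimmed to variable lengths, the longest sampled read determines the value.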
    
  2.

    BAM files were then filtered to retain only read 1 (first-in-pair) alignments using samtools (v1.16) with the option "view -hbf 64".

    $ Bash example
    # Input BAM (placeholder name): the coordinate-sorted alignments produced by STAR.
    INPUT_BAM="input.bam"
    
    # Output BAM (placeholder name): only alignments flagged as read 1 (first in pair).
    OUTPUT_BAM="read1_filtered.bam"
    
    # -h: include the header; -b: write BAM; -f 64: require SAM flag 0x40 (first in pair).
    samtools view -hbf 64 "${INPUT_BAM}" > "${OUTPUT_BAM}"
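
The -f 64 option keeps an alignment only if bit 0x40 ("first segment in the template") is set in its SAM FLAG field. As a small illustration of that bit test, the sketch below checks a few common paired-end FLAG values. The function name is_read1 is hypothetical and the script does not call samtools.

```shell
# Return success (exit status 0) if a SAM FLAG value has bit 0x40 (64) set,
# which is the condition that "samtools view -f 64" requires.
is_read1() {
    local flag="$1"
    [ $(( flag & 64 )) -ne 0 ]
}

# Common paired-end FLAG values: 99 and 83 are read 1; 147 and 163 are read 2.
for flag in 99 147 83 163; do
    if is_read1 "${flag}"; then
        echo "${flag}: read 1 (kept by -f 64)"
    else
        echo "${flag}: not read 1 (dropped by -f 64)"
    fi
done
```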
  3.

    C-to-U edit sites were obtained using SAILOR.

    $ Bash example
    # SAILOR is developed by the Yeo lab. The repository below is believed to be the
    # canonical one; verify the URL and release before use.
    # git clone https://github.com/YeoLab/sailor.git
    
    # The source gives no SAILOR version, parameters, or C-to-U-specific configuration,
    # so the exact command cannot be reconstructed here.
    # SAILOR is typically run as a containerized workflow configured through a job file
    # rather than as a standalone Python script; consult the repository documentation
    # for the invocation that matches your release.
    
    # Expected inputs (placeholders): presumably the read 1-filtered BAM from step 2
    # and the GRCh38 reference FASTA used to build the STAR index.
  4.

    Edits were divided by the featurecounts (v1.5.2) output for each gene’s exons to generate EPR values based on GENCODE v40 annotations.

    $ Bash example
    # Install featureCounts (part of Subread package)
    # conda install -c bioconda subread
    
    # Define variables
    # Placeholder for GENCODE v40 annotations. Replace with the actual path to your GTF file.
    ANNOTATION_GTF="/path/to/gencode.v40.annotation.gtf"
    # Placeholder for the input BAM file containing aligned reads.
    INPUT_BAM="input_aligned_reads.bam"
    # Output file for gene exon counts.
    OUTPUT_FILE="gene_exon_counts.txt"
    
    # Execute featureCounts to count reads over exons for each gene.
    # -a: Specify the annotation file (GTF/GFF).
    # -o: Specify the output file for counts.
    # -F GTF: Specify that the annotation file is in GTF format.
    # -t exon: Count features of type 'exon'.
    # -g gene_id: Aggregate counts by 'gene_id' (i.e., sum exon counts for each gene).
    # Note: Strandedness (-s 0/1/2) is not specified in the description. Adjust if your data is stranded.
    featureCounts -a "${ANNOTATION_GTF}" -o "${OUTPUT_FILE}" -F GTF -t exon -g gene_id "${INPUT_BAM}"
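
Step 4 describes EPR (edits per read) as edits divided by the featureCounts exon count for each gene. The sketch below shows one way that division might be done with awk. The helper name compute_epr and the two-column per-gene edit table (gene ID, edit count, no header) are assumptions; the source does not say how edits were assigned to genes. The featureCounts layout assumed here is the standard single-BAM output: a "#" comment line, a header row starting with "Geneid", and counts in column 7.

```shell
# Hypothetical helper: compute EPR = edits / exon read count for each gene.
# $1: two-column TSV (gene_id, edit count), no header; assumed format, must be non-empty
# $2: featureCounts output for a single BAM (read counts in column 7)
compute_epr() {
    local edits_tsv="$1"
    local counts_txt="$2"
    awk -F'\t' '
        # First file: remember the edit count for each gene
        NR == FNR { edits[$1] = $2; next }
        # Second file: skip the "#" comment line and the header row
        /^#/ || $1 == "Geneid" { next }
        # Only report genes with at least one counted read (avoids division by zero)
        $7 > 0 {
            e = ($1 in edits) ? edits[$1] : 0
            printf "%s\t%d\t%d\t%.6f\n", $1, e, $7, e / $7
        }
    ' "${edits_tsv}" "${counts_txt}"
}
```

The output columns are gene ID, edit count, read count, and EPR. Genes with zero counted reads are omitted rather than assigned a value, which is a design choice of this sketch and not something stated in the source.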

Raw Source Text
We aligned reads to the human genome version GRCh38 with annotation version Gencode v40 using STAR (v2.7.1a).
Bam files were then filtered to include only read1 values using samtools (v1.16) with option “view -hbf 64.”
C-to-U edit sites were obtained using SAILOR.
Edits were divided by the featurecounts (v1.5.2) output for each gene’s exons to generate EPR values based on GENCODE v40 annotations.
Assembly: GRCh38
Supplementary files format and content: Bam files and EPR (edits-per-read) quantification