GSE202881 Processing Pipeline

Publication

Pyruvate Kinase M (PKM) binds ribosomes in a poly-ADP ribosylation dependent manner to induce translational stalling.

Nucleic acids research (2023) — PMID 37224531

Dataset

GSE202881

Pyruvate Kinase M (PKM) binds ribosomes in a poly-ADP ribosylation dependent manner to induce translational stalling

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Ribosome profiling data were processed using RiboFlow.

    Ribo-seq v1.0.0
    $ Bash example
    # Install Nextflow (if not already installed)
    # curl -s https://get.nextflow.io | bash
    # mv nextflow /usr/local/bin/
    
    # Example command for running the RiboFlow pipeline.
    # RiboFlow is distributed as a Nextflow script (RiboFlow.groovy) in its GitHub
    # repository, so it is run from a local clone rather than pulled by pipeline name.
    # Inputs (FASTQ paths, Bowtie2 references, UMI and adapter settings) are set in a
    # YAML parameters file instead of command-line flags.
    # 'project.yaml' is a placeholder. The repository URL, script name, and the
    # 'docker_local' profile are taken from the RiboFlow README and should be
    # checked against the version you clone.
    # Docker must be installed and running for the Docker profile.
    
    git clone https://github.com/ribosomeprofiling/riboflow.git
    cd riboflow
    nextflow run RiboFlow.groovy \
        -params-file project.yaml \
        -profile docker_local
  2. 2

    We extracted the first 12 nucleotides from the 5’ end of the reads using UMI-tools with the following parameters: “umi_tools extract -p "^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$" --extract-method=regex”.

    UMI-tools v1.1.2 GitHub
    $ Bash example
    # Install UMI-tools (example using conda)
    # conda install -c bioconda umi-tools
    
    # Define input and output files (placeholders)
    INPUT_FASTQ="input_read.fastq.gz"
    OUTPUT_FASTQ="output_read.fastq.gz"
    
    # Extract UMIs from the 5' end of reads using the specified regex pattern.
    # The pattern captures the first 12 bases as the UMI (named 'umi_1') and
    # the subsequent 4 bases as a discardable sequence (named 'discard_1').
    # The UMI is then appended to the read header and removed from the read sequence.
    umi_tools extract \
        --extract-method=regex \
        -p "^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$" \
        -I "${INPUT_FASTQ}" \
        -S "${OUTPUT_FASTQ}"
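
The split described by the regex can be sketched on a single read with plain bash substring expansion. This is only a toy illustration, not UMI-tools itself: the read sequence and the variable names are made up for the example, and the header rewriting that `umi_tools extract` performs is not modelled.

```shell
#!/usr/bin/env bash
# Toy illustration of the split performed by the pattern
#   ^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$
# on one read sequence. The read below is hypothetical.
read_seq="ACGTACGTACGTGGGGTTTTCCCCAAAATTTTCCCC"

umi="${read_seq:0:12}"       # first 12 nt: the UMI (group umi_1)
discard="${read_seq:12:4}"   # next 4 nt: dropped (group discard_1)
insert="${read_seq:16}"      # remainder: kept as the read sequence

echo "UMI:     ${umi}"
echo "discard: ${discard}"
echo "insert:  ${insert}"
```

Only `insert` would remain in the output FASTQ sequence; the UMI is carried in the read name and the four discarded bases are lost.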
  3. 3

    The four nucleotides downstream of the UMIs are discarded as they are incorporated during the reverse transcription step.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.1 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
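    # The regex in step 2 already removes these four nucleotides through its
    # 'discard_1' group, so running this command on the output of step 2 would
    # trim four additional bases from the footprint.
    # Use it only if the UMI was extracted without a discard group, so that the
    # four nucleotides are still at the 5' end of the input reads.
    # It trims 4 bases from the 5' end of the reads.
    cutadapt -u 4 -o trimmed_reads.fastq.gz input_reads.fastq.gz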
  4. 4

    Next, we used cutadapt to clip the 3’ adapter AAAAAAAAAACAAAAAAAAAA.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # Clip the 3' adapter from a FASTQ file
    # Replace 'input.fastq.gz' with your actual input file and 'output.fastq.gz' with your desired output file.
    cutadapt -a AAAAAAAAAACAAAAAAAAAA -o output.fastq.gz input.fastq.gz
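
To make the effect of 3' clipping concrete, the sketch below removes the adapter and everything after it from one toy read using bash pattern removal. It is an exact-match stand-in, not cutadapt's algorithm: cutadapt also tolerates mismatches and partial adapters at the read end, which this does not. The read sequence and variable names are hypothetical.

```shell
#!/usr/bin/env bash
# Exact-match illustration of 3' adapter clipping on a single toy read.
adapter="AAAAAAAAAACAAAAAAAAAA"
footprint="TGCATGCATGCATGCATGCATGCATGCG"     # hypothetical 28-nt footprint
read_seq="${footprint}${adapter}GGC"         # footprint + adapter + trailing bases

# Remove the longest suffix that starts with the adapter, i.e. cut at the
# leftmost exact adapter occurrence and drop everything downstream of it.
clipped="${read_seq%%"${adapter}"*}"

echo "${clipped}"
```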
  5. 5

    After UMI extraction and adapter trimming, ribosomal and transfer RNAs were filtered by alignment using Bowtie2.

    Bowtie2 (version not specified; Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Bowtie2 (if not already installed)
    # conda install -c bioconda bowtie2
    
    # Define variables
    INPUT_FASTQ="trimmed_reads.fastq.gz" # Input FASTQ file after UMI extraction and adapter trimming
    OUTPUT_FASTQ="filtered_rRNA_tRNA_reads.fastq.gz" # Output FASTQ file containing reads with ribosomal and transfer RNAs removed
    RRNA_TRNA_INDEX_BASE="rRNA_tRNA_index" # Basename for the Bowtie2 index of ribosomal and transfer RNA sequences
    NUM_THREADS=8 # Number of threads to use for alignment
    
    # --- Reference Data Preparation (Example) ---
    # For human (e.g., hg38), ribosomal and transfer RNA sequences can be obtained from various sources:
    # - UCSC Genome Browser: Specific RNA files or extracted from repeatmasker tracks.
    # - NCBI RefSeq: Individual rRNA (e.g., NR_003286.2 for 18S, NR_003287.2 for 28S) and tRNA sequences.
    # - Rfam database: Comprehensive collection of RNA families (e.g., RF00001 for 5S rRNA, RF00005 for tRNA).
    # - A custom combined FASTA file of known ribosomal and transfer RNAs relevant to the organism being studied.
    #
    # Example command to build the Bowtie2 index (uncomment and modify if needed):
    # # Assuming you have a combined FASTA file named 'combined_rRNA_tRNA.fa'
    # # cat human_rRNA.fa human_tRNA.fa > combined_rRNA_tRNA.fa
    # bowtie2-build combined_rRNA_tRNA.fa ${RRNA_TRNA_INDEX_BASE}
    
    # Run Bowtie2 to filter ribosomal and transfer RNAs
    # Reads that align to the rRNA/tRNA index are considered ribosomal/transfer RNAs and are discarded.
    # Reads that do NOT align to the rRNA/tRNA index are kept and written to the output file.
    bowtie2 \
        -x "${RRNA_TRNA_INDEX_BASE}" \
        -U "${INPUT_FASTQ}" \
        --un-gz "${OUTPUT_FASTQ}" \
        -S /dev/null \
        -p "${NUM_THREADS}" \
        --very-fast # Using a fast preset like --very-fast is common for filtering steps
                    # to quickly identify and remove obvious matches. Other presets like
                    # --fast, --sensitive, or --very-sensitive can be used depending on
                    # the desired stringency and computational resources.
    
  6. 6

    The remaining reads were mapped to human transcriptome and alignments with mapping quality greater than two were retained.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub
    $ Bash example
    # Install STAR and Samtools if not already available
    # conda install -c bioconda star samtools
    
    # Define variables
    READS_FASTQ="remaining_reads.fastq.gz" # Placeholder for input reads (e.g., the rRNA/tRNA-filtered output of step 5)
    STAR_INDEX_DIR="/path/to/STAR_human_genome_and_transcriptome_index" # Placeholder for STAR index built from human genome FASTA and GTF
    OUTPUT_PREFIX="aligned_reads" # Prefix for output files
    THREADS=8 # Adjust as needed for available CPU cores
    
    # 1. Map reads to the human genome with GTF-guided splicing.
    # Note: the source states that reads were mapped to the transcriptome, and STAR is an
    # inferred tool here; the filter values below are illustrative, not taken from the source.
    # --readFilesCommand zcat: needed because the input FASTQ is gzip-compressed.
    # --outFilterMultimapNmax 1: Retain only uniquely mapping reads.
    # --outFilterMismatchNmax 3: Allow up to 3 mismatches.
    # --outFilterScoreMinOverLread 0.6 and --outFilterMatchNminOverLread 0.6: Require the
    #   alignment score and the number of matched bases to be at least 60% of the read length.
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${READS_FASTQ}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.6 \
         --outFilterMatchNminOverLread 0.6 \
         --outReadsUnmapped Fastx \
         --outSAMattributes All \
         --limitBAMsortRAM 30000000000 # Adjust RAM as needed (e.g., 30GB)
    
    # 2. Retain alignments with mapping quality greater than two (MAPQ > 2, i.e. -q 3).
    # With --outFilterMultimapNmax 1, STAR assigns MAPQ 255 to every reported alignment,
    # so this filter only removes reads if multimappers are allowed above.
    samtools view -b -h -q 3 "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam" > "${OUTPUT_PREFIX}.filtered.bam"
    
    # 3. Index the filtered BAM file for downstream processing
    samtools index "${OUTPUT_PREFIX}.filtered.bam"
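
The "mapping quality greater than two" rule can be checked on a few toy records without any alignment tools. The sketch below uses awk on made-up, truncated SAM lines (only the first five columns; column 5 is MAPQ); the record values and variable names are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Toy SAM-like records: QNAME, FLAG, RNAME, POS, MAPQ (tab-separated, hypothetical values).
sam_records=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    read1 0 TX1 101 0 \
    read2 0 TX1 250 2 \
    read3 0 TX2 17 3 \
    read4 0 TX3 88 42)

# Keep alignments with MAPQ strictly greater than 2, the same cutoff as `samtools view -q 3`,
# and join the surviving read names with commas.
kept=$(printf '%s\n' "${sam_records}" | awk -F'\t' '$5 > 2 { print $1 }' | paste -sd, -)

echo "${kept}"
```

A read with MAPQ exactly 2 (read2 here) is dropped, which is why the samtools threshold is written as 3 rather than 2.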
  7. 7

    UMIs were used for deduplication and .ribo files are created using RiboPy.

    RiboPy (Inferred with models/gemini-2.5-flash) v0.1.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
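    # Install RiboPy and UMI-tools (if not already installed)
    # conda create -n ribopy_env python=3.8
    # conda activate ribopy_env
    # pip install ribopy umi_tools
    
    # Assuming 'aligned_reads.filtered.bam' is the coordinate-sorted, indexed BAM from step 6.
    # umi_tools extract (step 2) appends the UMI to the read name, which is where
    # umi_tools dedup looks for it by default, so no BAM tag is needed.
    # Using umi_tools dedup here is an assumption; the source only says UMIs were used.
    umi_tools dedup \
        -I aligned_reads.filtered.bam \
        -S aligned_reads.dedup.bam
    
    # Create the .ribo file from the deduplicated alignments with 'ribopy create'.
    # Its inputs (such as the reference name, transcript lengths, region annotation,
    # and read-length range) are not given in the source; list the options with:
    ribopy create --help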
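
The idea behind UMI deduplication can be sketched as collapsing reads that share transcript, position, and UMI. The example below is an exact-match simplification on made-up records, not the method of any specific tool (UMI-tools, for instance, can also merge UMIs that differ by sequencing errors). It assumes the UMI is the last `_`-separated piece of the read name, as written by `umi_tools extract`; all names and values are hypothetical.

```shell
#!/usr/bin/env bash
# Toy alignments: read name (UMI after the last "_"), transcript, and start position.
alignments=$(printf '%s\t%s\t%s\n' \
    r1_ACGTACGTACGT TX1 101 \
    r2_ACGTACGTACGT TX1 101 \
    r3_TTTTGGGGCCCC TX1 101 \
    r4_ACGTACGTACGT TX2 101)

# Keep the first read seen for each (transcript, position, UMI) combination.
deduped=$(printf '%s\n' "${alignments}" | awk -F'\t' '
    {
        n = split($1, parts, "_")          # UMI is the last "_"-separated piece
        key = $2 FS $3 FS parts[n]
        if (!(key in seen)) { seen[key] = 1; print $1 }
    }' | paste -sd, -)

echo "${deduped}"
```

Here r2 is removed as a duplicate of r1, while r3 (different UMI) and r4 (different transcript) are kept as independent molecules.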
  8. 8

    Library strategy: Ribo-seq

    Ribo-seq (library strategy; no tool or version is given in the source)
    $ Bash example
    # "Library strategy: Ribo-seq" is a dataset metadata field describing the assay,
    # not a processing step, so no command corresponds to it.
    # The reads are processed by steps 1-7 above (RiboFlow, UMI-tools, cutadapt,
    # Bowtie2, RiboPy); the source text names no additional workflow.

Tools Used

Raw Source Text
Ribosome profiling data were processed using RiboFlow. We extracted the first 12 nucleotides from the 5’ end of the reads using UMI-tools with the following parameters: “umi_tools extract -p "^(?P<umi_1>.{12})(?P<discard_1>.{4}).+$" --extract-method=regex”. The four nucleotides downstream of the UMIs are discarded as they are incorporated during the reverse transcription step. Next, we used cutadapt to clip the 3’ adapter AAAAAAAAAACAAAAAAAAAA. After UMI extraction and adapter trimming, ribosomal and transfer RNAs were filtered by alignment using Bowtie2. The remaining reads were mapped to human transcriptome and alignments with mapping quality greater than two were retained. UMIs were used for deduplication and .ribo files are created using RiboPy.
Assembly: hg38
Supplementary files format and content: CSV file containing the RPKM value of detectable transcripts
Library strategy: Ribo-seq