GSE104501 Processing Pipeline

RNA-Seq, 3 steps

Publication

Short poly(A) tails are a conserved feature of highly expressed genes.

Nature Structural & Molecular Biology (2017), PMID 29106412

Dataset

Poly(A) Tail Length of L4 C. elegans

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Base calling, trimming of adapter sequences, removal of duplicated reads and determination of poly(A) tail sizes were performed by tailseeker2.

tailseeker2 v2.0.0

$ Bash example

# Install tailseeker2 (illustrative only; the package name and channel below are not confirmed,
# so follow the installation instructions in the tailseeker repository)
# conda create -n tailseeker2_env python=3.8
# conda activate tailseeker2_env
# pip install tailseeker2 # Or via bioconda if available

# Placeholder for input and output files
INPUT_FASTQ="raw_reads.fastq.gz"
OUTPUT_TSV="polyA_tail_lengths.tsv"
REFERENCE_GENOME="/path/to/c_elegans.WS247.genomic.fa" # C. elegans WS247, the genome build named in the source

# The description states: "Base calling, trimming of adapter sequences, removal of duplicated reads and determination of poly(A) tail sizes were performed by tailseeker2."
# The source attributes all four operations to tailseeker2, so no separate base-calling or de-duplication step is shown here.
# Because base calling is listed among tailseeker2's tasks, the real input is likely raw sequencer output rather than FASTQ; the FASTQ name above is only a placeholder.
# Caution: the invocation below is an unverified sketch. Its option names and thresholds are placeholders, not taken from the tailseeker2 documentation; check the tool's own usage notes and replace them before running.

tailseeker2 \
    --input "${INPUT_FASTQ}" \
    --output "${OUTPUT_TSV}" \
    --reference "${REFERENCE_GENOME}" \
    --trim-adapters \
    --min-tail-length 10 \
    --min-read-length 50 \
    --threads "$(nproc)"

Reads were analyzed by mapping to the WS247 assembly of the C. elegans genome using RNA-STAR.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# --- Reference Data Preparation (C. elegans WS247) ---
# Download C. elegans WS247 genome FASTA and GTF annotation files.
# WS247 is an older release. For current data, refer to WormBase (e.g., WS289).
# For WS247, you would typically access older WormBase FTP archives or specific project repositories.
# Example placeholder paths for downloaded files:
# GENOME_FASTA="/path/to/c_elegans.WS247.genomic.fa"
# GTF_ANNOTATION="/path/to/c_elegans.WS247.annotations.gtf"

# Create a directory for the STAR genome index
mkdir -p C_elegans_WS247_STAR_index

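# Build STAR genome index for C. elegans WS247
# Replace /path/to/c_elegans.WS247.genomic.fa and /path/to/c_elegans.WS247.annotations.gtf with actual paths
# --sjdbOverhang is ideally (read length - 1); the read length is not stated in the source, so the STAR default of 100 is used.
# For a genome of roughly 100 Mb such as C. elegans, the STAR manual's rule min(14, log2(GenomeLength)/2 - 1) suggests adding --genomeSAindexNbases 12.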
STAR --runMode genomeGenerate \
     --genomeDir C_elegans_WS247_STAR_index \
     --genomeFastaFiles /path/to/c_elegans.WS247.genomic.fa \
     --sjdbGTFfile /path/to/c_elegans.WS247.annotations.gtf \
     --sjdbOverhang 100 \
     --runThreadN 8

# --- Reads Alignment ---
# Replace reads_R1.fastq.gz and reads_R2.fastq.gz with your actual input FASTQ files.
# Assuming paired-end reads. For single-end, remove the second file from --readFilesIn.
STAR --runMode alignReads \
     --genomeDir C_elegans_WS247_STAR_index \
     --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --sjdbScore 1 \
     --runThreadN 8
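
STAR writes a coordinate-sorted BAM, whereas the BEDTools step in this pipeline is described in terms of BED records. One possible bridge, sketched below, is to convert the BAM with `bedtools bamtobed` and then copy each read's poly(A) length into the BED score column. The two-column lengths table (read ID, tail length) and the function name `attach_polya_to_bed` are assumptions made for illustration; the source does not describe the layout of the tailseeker2 output.

```shell
# attach_polya_to_bed LENGTHS_TSV < reads.bed > reads_with_polya.bed
#
# LENGTHS_TSV is a hypothetical tab-separated table: read_id <TAB> polyA_length.
# Standard input is BED6 (chrom, start, end, name, score, strand), for example
# the output of `bedtools bamtobed`.
attach_polya_to_bed() {
    local lengths_tsv="$1"
    awk -F'\t' -v OFS='\t' '
        # First input (the lengths table): remember the tail length per read ID.
        NR == FNR { len[$1] = $2; next }
        # Second input (BED6 on stdin).
        {
            id = $4
            # bamtobed appends /1 or /2 to paired-end read names; strip it.
            sub(/\/[12]$/, "", id)
            # Keep one record per read with a known tail length, so that the
            # two mates of a pair are not counted twice downstream.
            if ((id in len) && !(id in seen)) {
                seen[id] = 1
                $4 = id
                $5 = len[id]
                print
            }
        }
    ' "$lengths_tsv" -
}
```

A possible invocation, using the output name STAR derives from the prefix above: `bedtools bamtobed -i aligned_reads_Aligned.sortedByCoord.out.bam | attach_polya_to_bed read_lengths.tsv > mapped_sequences.bed`, where `read_lengths.tsv` is the hypothetical lengths table. Reads without an entry in that table are dropped.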


Poly(A) lengths were then assigned to individual coding genes by intersecting the mapped sequences with WormBase.org WS247 gene annotations using BEDTools.

bedtools v2.30.0

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# --- Reference Data Setup ---
# Download WormBase WS247 C. elegans gene annotations (GFF3 format).
# The specific URL for WS247 GFF3 might vary slightly or be hosted differently over time.
# This is a representative path for a past release.
# wget -O c_elegans.PRJNA13758.WS247.gff3.gz "ftp://ftp.wormbase.org/pub/wormbase/releases/WS247/GFF/c_elegans.PRJNA13758.WS247.gff3.gz"
# gunzip c_elegans.PRJNA13758.WS247.gff3.gz

# Convert GFF3 to BED format for bedtools intersect.
# This example keeps 'gene' features and converts GFF3's 1-based, inclusive coordinates to BED's 0-based, half-open coordinates.
# It writes chromosome, start-1, end, gene_id (from the ID= attribute, else Name=), score (placeholder), strand.
# Fields are split on tabs only, and match() is used because the ~ operator does not set RSTART and RLENGTH.
# To restrict the output to coding genes, as the description states, one could additionally require
# $9 ~ /biotype=protein_coding/, assuming this release's GFF3 carries that attribute.
# A more robust GFF3 to BED conversion might use tools like gff2bed from BEDOPS.
# awk -F'\t' -v OFS='\t' '$3 == "gene" {
#         gene_id = ".";
#         if (match($9, /ID=[^;]+/)) { gene_id = substr($9, RSTART+3, RLENGTH-3); }
#         else if (match($9, /Name=[^;]+/)) { gene_id = substr($9, RSTART+5, RLENGTH-5); }
#         print $1, $4-1, $5, gene_id, "0", $7
#     }' c_elegans.PRJNA13758.WS247.gff3 > wormbase_ws247_genes.bed

# --- Input Files ---
# Placeholder for mapped sequences (e.g., poly(A) site reads, or 3'UTR reads).
# These sequences should be in BED format (e.g., chr, start, end, name, score, strand).
# Chromosome names must match the annotation exactly; WormBase files use I, II, ... rather than chrI, chrII, ...
# Example: echo -e "I\t1000\t1050\tread1\t0\t+\nI\t2000\t2030\tread2\t0\t-" > mapped_sequences.bed
mapped_sequences_bed="mapped_sequences.bed"

# Placeholder for WormBase WS247 gene annotations in BED format.
# This file is assumed to have been prepared using the steps above or similar.
wormbase_genes_bed="wormbase_ws247_genes.bed"

# --- Execution Command ---
# Intersect mapped sequences with WormBase WS247 gene annotations.
# -wao: Write the original entry in A and B for each overlap, plus the number of overlapping bases.
# A entries without any overlap are also reported, with a null B entry and an overlap of 0.
# This output allows associating each poly(A) sequence with its overlapping gene and its attributes,
# which is crucial for subsequent assignment of poly(A) lengths to individual coding genes.
bedtools intersect -a "${mapped_sequences_bed}" -b "${wormbase_genes_bed}" -wao > polyA_sequences_and_overlapping_genes.bed

# Further processing (e.g., using awk, Python, or R) would be needed to group
# by gene (from the B file columns in the output) and calculate poly(A) lengths
# (derived from the A file columns in the output) per gene.
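
The supplementary files are described as holding, per gene, the number of hits and the mean, median, standard deviation, and range of poly(A) tail lengths for RNAs with more than 10 reads. A minimal sketch of that grouping step follows. It assumes the `-wao` output has 13 columns (BED6 from A, BED6 from B, overlap), that the poly(A) length of each read sits in column 5 (the A score column), and that the gene ID is in column 10; the function name `summarize_polya` and the choice of the sample standard deviation are also assumptions, not details from the source.

```shell
# summarize_polya [MIN_READS] < intersect_wao.tsv > per_gene_summary.tsv
#
# Reads `bedtools intersect -wao` output on stdin and prints one line per gene
# that has more than MIN_READS reads (default 10).
summarize_polya() {
    local min_reads="${1:-10}"
    # Keep real overlaps only, reduced to: gene_id <TAB> polyA_length.
    awk -F'\t' '$13 > 0 && $10 != "." { print $10 "\t" $5 }' \
    | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n \
    | awk -F'\t' -v OFS='\t' -v min_reads="$min_reads" '
        # Print the statistics for the gene collected so far, then reset.
        function emit(   i, mean, ss, sd, median) {
            if (n > min_reads) {
                mean = sum / n
                ss = 0
                for (i = 1; i <= n; i++) ss += (v[i] - mean) ^ 2
                sd = (n > 1) ? sqrt(ss / (n - 1)) : 0
                # Values arrive sorted, so the median is the middle element(s).
                median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
                printf "%s\t%d\t%.2f\t%.1f\t%.2f\t%d\t%d\n", gene, n, mean, median, sd, v[1], v[n]
            }
            n = 0; sum = 0
        }
        BEGIN { print "gene", "n_reads", "mean", "median", "sd", "min", "max" }
        $1 != gene { if (NR > 1) emit(); gene = $1 }
        { v[++n] = $2; sum += $2 }
        END { if (NR > 0) emit() }
    '
}
```

It could be run as `summarize_polya 10 < polyA_sequences_and_overlapping_genes.bed > per_gene_polyA_summary.tsv`. A read that overlaps two genes contributes to both; adding `-s` to the `bedtools intersect` call would limit matches to same-strand genes if the library is stranded.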


Tools Used

tailseeker2

STAR

BEDTools

Raw Source Text

Base calling, trimming of adapter sequences, removal of duplicated reads and determination of poly(A) tail sizes were performed by tailseeker2.
Reads were analyzed by mapping to the WS247 assembly of the C. elegans genome using RNA-STAR.
Poly(A) lengths were then assigned to individual coding genes by intersecting the mapped sequences with WormBase.org WS247 gene annotations using BEDTools.
Genome_build: WS247
Supplementary_files_format_and_content: Csv; contains the number of hits per gene, and the mean, median, standard deviation, and range of poly(A) tail lengths for RNAs that had more than 10 reads, as well as each individual poly(A) tail length that was read for each gene.
