GSE74672 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Stratification of enterochromaffin cells by single-cell expression analysis.

eLife (2025) — PMID 40184163

Dataset

Single-cell RNA-seq of mouse hypothalamus

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Read processing was performed as described (Islam et al., Nat Methods.

fastp (Inferred with models/gemini-2.5-flash) v0.23.2 GitHub

$ Bash example

# Install fastp if not already installed
# conda create -n fastp_env -c bioconda -c conda-forge fastp=0.23.2 -y
# conda activate fastp_env

# Define input and output file paths
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"
OUTPUT_READ1="processed_R1.fastq.gz"
OUTPUT_READ2="processed_R2.fastq.gz"
OUTPUT_JSON="fastp_report.json"
OUTPUT_HTML="fastp_report.html"
THREADS=8 # Adjust as needed

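# Run fastp for read processing (adapter trimming, quality filtering).
# The source does not specify trimming parameters; the values below are illustrative.
# Caution: --cut_front trims from the 5' end, where the source places the 6-base UMI, so extract the UMI before running this step.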
fastp \
    --in1 "${READ1}" \
    --in2 "${READ2}" \
    --out1 "${OUTPUT_READ1}" \
    --out2 "${OUTPUT_READ2}" \
    --json "${OUTPUT_JSON}" \
    --html "${OUTPUT_HTML}" \
    --thread "${THREADS}" \
    --detect_adapter_for_pe \
    --qualified_quality_phred 15 \
    --unqualified_percent_limit 40 \
    --length_required 18 \
    --low_complexity_filter \
    --complexity_threshold 30 \
    --trim_poly_g \
    --poly_g_min_len 10 \
    --trim_poly_x \
    --poly_x_min_len 10 \
    --cut_front \
    --cut_tail \
    --cut_window_size 4 \
    --cut_mean_quality 20

2014 Feb;11(2):163-6), except that we removed any RNA molecule (i.e.

bbduk (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Note: the source sentence refers to discarding "RNA molecules" (UMIs) supported by only a single read, which is a computational filter, not rRNA depletion.
# The bbduk example below is an inferred, optional step that filters reads matching common RNA contaminants (e.g., rRNA, tRNA); the source does not describe it.
# bbduk is part of the BBTools/BBMap suite.
# A FASTA file of rRNA/tRNA sequences is required as the reference.

# Example using bbduk to remove rRNA reads (assuming 'rRNA_references.fa' contains rRNA sequences)
# conda install -c bioconda bbmap

# Placeholder for input FASTQ files
INPUT_FASTQ_R1="input_R1.fastq.gz"
INPUT_FASTQ_R2="input_R2.fastq.gz"

# Placeholder for the rRNA reference file (hypothetical file name).
# You need to create or download a FASTA file of known rRNA sequences for your organism (the source specifies mouse, UCSC mm10), e.g. from NCBI or Ensembl.
# bbduk loads the reference k-mers at run time, so no separate indexing step is needed.

rRNA_REFERENCE="rRNA_references.fa"

# Output filtered FASTQ files
OUTPUT_FILTERED_R1="filtered_R1.fastq.gz"
OUTPUT_FILTERED_R2="filtered_R2.fastq.gz"
OUTPUT_DISCARDED_R1="discarded_rRNA_R1.fastq.gz"
OUTPUT_DISCARDED_R2="discarded_rRNA_R2.fastq.gz"

# Run bbduk to remove reads matching rRNA sequences
bbduk.sh in="${INPUT_FASTQ_R1}" in2="${INPUT_FASTQ_R2}" \
         out="${OUTPUT_FILTERED_R1}" out2="${OUTPUT_FILTERED_R2}" \
         outm="${OUTPUT_DISCARDED_R1}" outm2="${OUTPUT_DISCARDED_R2}" \
         ref="${rRNA_REFERENCE}" \
         k=31 hdist=1 stats=stats.txt \
         -Xmx4g t=8

# Explanation of parameters:
# in, in2: Input paired-end FASTQ files.
# out, out2: Output paired-end FASTQ files with rRNA reads removed.
# outm, outm2: Output paired-end FASTQ files containing the discarded rRNA reads (optional, useful for QC).
# ref: Path to the rRNA reference FASTA file.
# k=31: K-mer length for matching (a common choice for contaminant filtering; shorter k-mers are typical for adapter trimming).
# hdist=1: Hamming distance for matching (allows 1 mismatch).
# stats=stats.txt: Output statistics about the filtering process.
# -Xmx4g: Allocate 4GB of memory (adjust based on available RAM and dataset size).
# t=8: Use 8 threads (adjust based on available CPU cores).

# Note: The specific rRNA reference file and parameters might vary based on the organism and specific experimental design.

Unique Molecular Identifier) supported only by a single read ("singleton molecules").

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2

$ Bash example

# conda install -c bioconda umi_tools=1.1.2

# Example input: coordinate-sorted, indexed BAM with the UMI appended to each read name
# (e.g. by `umi_tools extract`, which uses "_" as the separator by default).
# Example output: deduplicated BAM, a log file, and per-UMI statistics tables.

# Placeholder variables for input/output files
INPUT_BAM="aligned_reads.bam"
OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
STATS_PREFIX="umi_dedup_stats"
DEDUP_LOG="umi_dedup.log"

# Run umi_tools dedup to collapse reads that share a mapping position and UMI.
# Note: dedup collapses PCR duplicates; it does not by itself discard molecules
# supported by a single read, which is the filter the source describes.
# --umi-separator must match how the UMI was attached to the read name.
# Add --paired only for paired-end alignments.
# --output-stats takes a file-name prefix for several statistics tables, not a single log file.
umi_tools dedup \
    -I "${INPUT_BAM}" \
    -S "${OUTPUT_DEDUP_BAM}" \
    --umi-separator "_" \
    --extract-umi-method read_id \
    --output-stats "${STATS_PREFIX}" \
    -L "${DEDUP_LOG}"

# Singleton molecules are not reported on a dedicated log line.
# To identify them, count the reads supporting each UMI, e.g. from the per-UMI tables written under ${STATS_PREFIX}.
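
The singleton filter described in the source (discarding any UMI supported by only one read) can be sketched directly in the shell. This is a minimal sketch, not the authors' implementation: the function name `filter_singleton_molecules` and the three-column input (cell, gene, UMI, one line per aligned read) are assumptions made for illustration.

```shell
# Hypothetical input: tab-separated file, one line per aligned read: cell, gene, UMI.
# Output: one line per molecule (cell, gene, UMI, read count) supported by
# at least MIN_READS reads (default 2, i.e. singleton molecules are dropped).
filter_singleton_molecules() {
    local min_reads="${2:-2}"
    awk -F'\t' -v OFS='\t' -v min_reads="${min_reads}" '
        { reads[$1 OFS $2 OFS $3]++ }     # count reads per (cell, gene, UMI)
        END {
            for (key in reads)
                if (reads[key] >= min_reads) print key, reads[key]
        }
    ' "$1" | sort                         # sort for a reproducible order
}

# Usage (hypothetical file names):
# filter_singleton_molecules reads_per_umi.tsv 2 > molecules.tsv
```

Counting the remaining lines per cell and gene would then give molecule counts that exclude singletons.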

This removed a large number of false positive molecules, artefacts that can arise by sequencing error, PCR-induced mutations or translocations and cross-contamination.

samtools (Inferred with models/gemini-2.5-flash) v1.9 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

# This command filters a BAM file on mapping quality and alignment flags:
# - -q 10 drops reads with mapping quality below 10 (ambiguously placed reads).
# - -F 1284 drops reads with any of these flags set: unmapped (4), secondary alignment (256),
#   PCR or optical duplicate (1024). The flags are combined into one value (4 + 256 + 1024 = 1284).
# Duplicate filtering only works on a BAM in which duplicates are already marked (e.g., by Picard MarkDuplicates).
# Note: this is generic alignment filtering. It does not implement the singleton-UMI removal
# that the source credits with eliminating false positive molecules.
# Replace 'input.mkdup.bam' with your actual input file and 'output.filtered.bam' with your desired output file name.
samtools view -b -q 10 -F 1284 -o output.filtered.bam input.mkdup.bam

The first 6 bases of each read represent the random Unique Molecular Identifier used for molecule counting.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

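# Extract the first 6 bases as the UMI and append it to the read name.
# With the "string" method, each N in --bc-pattern marks one UMI base.
# (The "regex" method would instead require a named group such as (?P<umi_1>.{6}).)
umi_tools extract \
    --extract-method=string \
    --bc-pattern=NNNNNN \
    -I input.fastq.gz \
    -S output_umi_extracted.fastq.gz \
    --log umi_extract.log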

The UMI is followed by three or more Gs, stemming from template switching at the mRNA 5' end during first-strand cDNA synthesis.

cutadapt (Inferred with models/gemini-2.5-flash) v4.1 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file paths
# Replace with actual input/output FASTQ file names
INPUT_FASTQ="input_read1.fastq.gz"
OUTPUT_FASTQ="output_read1_trimmed.fastq.gz"

# Trim the GGG left by template switching from the 5' end of reads.
# This assumes the 6-base UMI has already been removed, so the Gs sit at the very start of the read.
# -g "^GGG": Anchored 5' adapter; GGG is removed only when it occurs at the start of the read.
#            An unanchored "-g GGG" would match GGG anywhere in the read and remove everything before it.
#            Gs beyond the first three are left in place by this option.
# -q 20,20: Trims low-quality bases (quality score < 20) from both 5' and 3' ends.
# --minimum-length 20: Discards reads shorter than 20 bp after trimming.
cutadapt -g "^GGG" \
         -q 20,20 \
         --minimum-length 20 \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

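The read layout stated in the source (a 6-base UMI, then a run of three or more Gs, then the cDNA) can also be handled in one pass. The following is a minimal sketch under stated assumptions, not the authors' method: the function name `strip_umi_and_gs` is hypothetical, input is uncompressed single-end FASTQ with four lines per record, the UMI is appended to the read ID with "_" (mirroring the umi_tools convention), and reads without a G run are kept with only the UMI removed.

```shell
# Reads FASTQ on stdin and writes FASTQ on stdout.
# Argument 1: UMI length (default 6, as stated in the source).
strip_umi_and_gs() {
    local umi_len="${1:-6}"
    awk -v n="${umi_len}" '
        NR % 4 == 1 { name = $0; next }   # header line
        NR % 4 == 2 { seq = $0; next }    # sequence line
        NR % 4 == 3 { next }              # separator line, rewritten as "+"
        {
            qual = $0
            umi = substr(seq, 1, n)
            cut = n
            # Extend the cut over a run of three or more Gs directly after the UMI.
            if (match(substr(seq, n + 1), /^GGG+/)) cut += RLENGTH
            if (cut >= length(seq)) next  # nothing left after trimming: drop read
            sub(/^[^ ]+/, "&_" umi, name) # append UMI to the read ID
            print name
            print substr(seq, cut + 1)
            print "+"
            print substr(qual, cut + 1)
        }
    '
}

# Usage (hypothetical file names):
# gzip -dc input.fastq.gz | strip_umi_and_gs 6 | gzip > output_umi_stripped.fastq.gz
```

Unlike a fixed three-base adapter, the regular expression consumes the whole G run, which matches the "three or more Gs" wording of the source.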

Raw Source Text

Read processing was performed as described (Islam et al., Nat Methods. 2014 Feb;11(2):163-6), except that we removed any RNA molecule (i.e. Unique Molecular Identifier) supported only by a single read ("singleton molecules"). This removed a large number of false positive molecules, artefacts that can arise by sequencing error, PCR-induced mutations or translocations and cross-contamination.
The first 6 bases of each read represent the random Unique Molecular Identifier used for molecule counting. After follows three or more Gs, stemming from the template switching at the mRNA 5' end during first strand cDNA synthesis.
Genome_build: UCSC mm10
Supplementary_files_format_and_content: Tab-delimited file of mRNA molecule counts for each gene and cell. Cells with less than 1500 molecules have been removed from this file.
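
The cell-level cutoff mentioned above (cells with fewer than 1500 molecules removed) can be sketched as a column filter on a count table. This is an illustrative sketch only: the function name `filter_low_count_cells` is hypothetical, and the layout (one header row with cell IDs, one row per gene, gene name in the first column) is an assumption about the tab-delimited file, which the source does not describe in detail.

```shell
# Argument 1: tab-delimited count table (header row of cell IDs, gene name in column 1).
# Argument 2: minimum total molecules per cell (default 1500, as stated in the source).
filter_low_count_cells() {
    local matrix="$1" min_total="${2:-1500}"
    # The file is read twice: once to total each column, once to print.
    awk -F'\t' -v OFS='\t' -v min_total="${min_total}" '
        NR == FNR {                       # first pass: total molecules per cell
            if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i
            next
        }
        {                                 # second pass: keep gene column and passing cells
            line = $1
            for (i = 2; i <= NF; i++)
                if (total[i] >= min_total) line = line OFS $i
            print line
        }
    ' "${matrix}" "${matrix}"
}

# Usage (hypothetical file names):
# filter_low_count_cells molecule_counts.tsv 1500 > molecule_counts.filtered.tsv
```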
