GSE9306 Processing Pipeline

Publication

RNA sequence analysis defines Dicer's role in mouse embryonic stem cells.

Proceedings of the National Academy of Sciences of the United States of America (2007) — PMID 17989215

Dataset

Short RNA profiling of ES cells with and without Dicer

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Sequences were extracted from 454 reads that had 10 bp of perfect match to the ligated adaptors (measuring from the 3' end of the 5' adaptor and 5' end of the 3' adaptor).

cutadapt (Inferred with models/gemini-2.5-flash) v4.x

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define placeholder adaptor sequences.
# NOTE: these appear to be the Roche 454 GS 20/FLX primer A and B sequences (not Titanium)
# and are used here only as placeholders. The source refers to the adaptors ligated to the
# short RNAs, whose sequences are not given here; substitute those reported in the publication.
# ADAPTOR_5_PRIME_END: adaptor expected upstream (5') of the insert.
# ADAPTOR_3_PRIME_END: adaptor expected downstream (3') of the insert.
ADAPTOR_5_PRIME_END="GCCTCCCTCGCGCCATCAG"
ADAPTOR_3_PRIME_END="GCCTTGCCAGCCCGCTCAG"

# Define input and output file names
INPUT_FASTQ="raw_454_reads.fastq"
OUTPUT_FASTQ="trimmed_454_reads.fastq"

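# Use cutadapt to trim adaptors from 454 reads.
# -g: 5' adaptor; it is removed together with any sequence preceding it.
# -a: 3' adaptor; it is removed together with any sequence following it.
# -O 10: require at least 10 bp of overlap between the read and the adaptor.
# -e 0: allow no mismatches, since the source specifies a perfect match.
# -n 2: search each read twice so both adaptors can be removed (the default removes only one).
# --discard-untrimmed: drop reads in which no adaptor was found.
# Reads carrying only one adaptor are still kept; cutadapt's linked-adaptor syntax
# (ADAPTER1...ADAPTER2) may be a closer fit if both adaptors must be present.
cutadapt -g "$ADAPTOR_5_PRIME_END" -a "$ADAPTOR_3_PRIME_END" -O 10 -e 0 -n 2 --discard-untrimmed -o "$OUTPUT_FASTQ" "$INPUT_FASTQ"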
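
The source describes the match as 10 bp measured from the adaptor ends that flank the insert. As a minimal sketch of that rule, independent of cutadapt, the function below keeps only the sequence lying between an exact copy of the last 10 nt of the 5' adaptor and an exact copy of the first 10 nt of the 3' adaptor. The adaptor sequences, the variable names, and the `extract_insert` helper are all assumptions made for illustration; the original authors' extraction code is not described in the source.

```shell
#!/bin/sh
# Placeholder adaptor sequences (assumed; the source does not list them).
ADAPTOR_5="GCCTCCCTCGCGCCATCAG"
ADAPTOR_3="GCCTTGCCAGCCCGCTCAG"

# Anchors: the 10 nt of each adaptor that sit directly next to the insert.
ANCHOR_5=$(printf '%s' "$ADAPTOR_5" | tail -c 10)    # 3' end of the 5' adaptor
ANCHOR_3=$(printf '%s' "$ADAPTOR_3" | cut -c 1-10)   # 5' end of the 3' adaptor

# extract_insert READ
# Prints the sequence between the first exact 5' anchor and the next exact
# 3' anchor. Returns non-zero if either anchor is missing or the insert is empty.
# The body runs in a subshell, so its variables do not leak to the caller.
extract_insert() (
  read_seq=$1
  rest=${read_seq#*"$ANCHOR_5"}           # drop everything through the 5' anchor
  [ "$rest" != "$read_seq" ] || exit 1    # 5' anchor not found
  insert=${rest%%"$ANCHOR_3"*}            # drop the 3' anchor and everything after it
  [ "$insert" != "$rest" ] || exit 1      # 3' anchor not found
  [ -n "$insert" ] || exit 1              # nothing between the anchors
  printf '%s\n' "$insert"
)
```

For example, `extract_insert "TTGCGCCATCAGTGAGGTAGTAGGTTGTATAGTTGCCTTGCCAGCC"` prints the 22-nt insert between the two anchors, while a read with a mismatch inside either anchor is rejected. Plain string matching was chosen here because it makes the "perfect match" requirement explicit; it does not handle reverse-complemented reads.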

Contributed here are the unannotated sequence data from each library, including those sequences that did not match the mouse genome or any known non-coding RNAs.

Samtools (Inferred with models/gemini-2.5-flash) v1.17

$ Bash example

# Install Samtools
# conda install -c bioconda samtools

# Placeholder input BAM, assumed to hold the extracted sequences after alignment
# against the mouse genome and known non-coding RNAs.
# Reads that matched neither reference are expected to be marked as unmapped (FLAG 0x4).
INPUT_BAM="final_filtered_alignment.bam"
OUTPUT_ALL_FASTQ="all_sequences.fastq.gz"
OUTPUT_UNMATCHED_FASTQ="unmatched_sequences.fastq.gz"

# The deposited data are the unannotated sequences from each library, which include
# (but are not limited to) reads without a match. Export every read, gzip-compressed
# so that the content agrees with the .gz file name:
samtools fastq "$INPUT_BAM" | gzip > "$OUTPUT_ALL_FASTQ"

# Optionally, export only the unmapped reads (FLAG 0x4), i.e. the subset that did not
# match the mouse genome or any known non-coding RNA:
samtools fastq -f 4 "$INPUT_BAM" | gzip > "$OUTPUT_UNMATCHED_FASTQ"

# Reference datasets for the upstream alignment/filtering (not used by this Samtools step):
# - Mouse genome: the source does not name the assembly. GRCm39 (Ensembl or UCSC) is a
#   current choice, but it postdates the 2007 study, which used an earlier build.
# - Known non-coding RNAs: e.g., Ensembl ncRNA annotations or the Rfam database.
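
The `-f 4` filter selects records whose SAM FLAG has bit 0x4 (read unmapped) set. As a rough sketch of what that test amounts to, the helpers below check the bit with shell arithmetic and count such records in a plain-text SAM file with awk. The function names and the idea of working on an uncompressed SAM file are assumptions for illustration; in practice `samtools view -c -f 4` gives the same count directly from a BAM file.

```shell
#!/bin/sh
# is_unmapped FLAG
# Succeeds when bit 0x4 ("read unmapped") is set in a decimal SAM FLAG value.
is_unmapped() {
  [ $(( $1 & 4 )) -ne 0 ]
}

# count_unmapped SAM_FILE
# Counts alignment records (header lines starting with '@' are skipped) whose
# FLAG, the second tab-separated column, has bit 0x4 set.
# int(flag / 4) % 2 isolates that bit without needing awk bitwise functions.
count_unmapped() {
  awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$1"
}
```

A FLAG of 4 or 77 passes `is_unmapped`, whereas 0, 16, or 99 does not, because only the former have the 0x4 bit set.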
