GSE110701 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Disruption in A-to-I Editing Levels Affects C. elegans Development More Than a Complete Lack of Editing.

Cell reports (2019) — PMID 31018137

Dataset

Disruption in A-to-I editing levels affects C. elegans development more than a complete lack of editing

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Filter sequences with irrelevant barcodes

filter_reads_by_barcode.py (Inferred with models/gemini-2.5-flash) vN/A (Inferred from yeolab/eclip workflow)

$ Bash example

# Install dependencies if needed (e.g., Python)
# This script is typically part of the yeolab/eclip workflow and assumes a Python environment.

# Example: Create a dummy valid_barcodes.txt for demonstration.
# In a real scenario, this file would contain the expected barcodes for your experiment.
# For example, if your barcodes are 'AAAA' and 'TTTT':
# echo -e "AAAA\nTTTT" > valid_barcodes.txt

# Placeholder for input FASTQ file
INPUT_FASTQ="input_reads_with_barcodes.fastq.gz"
# Placeholder for output FASTQ file
OUTPUT_FASTQ="filtered_reads.fastq.gz"
# Placeholder for the file containing valid barcodes
VALID_BARCODES_FILE="valid_barcodes.txt"

# Ensure the script is available. It's usually found in the 'tools/' directory of the eclip repo.
# Assuming the script is in the current working directory or in PATH.
# If not, provide the full path: /path/to/yeolab/eclip/tools/filter_reads_by_barcode.py

python filter_reads_by_barcode.py \
    -i "${INPUT_FASTQ}" \
    -o "${OUTPUT_FASTQ}" \
    -b "${VALID_BARCODES_FILE}"

Split relevant sequences by barcode

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already available
# conda install -c bioconda umi_tools

# Extract Unique Molecular Identifiers (UMIs) from the 5' end of Read 1
# and append them to the read header. This is a common step in eCLIP
# to prepare reads for deduplication.
# This example assumes a 10bp UMI at the beginning of Read 1.
# Replace 'input_R1.fastq.gz' with your actual input file.
# Replace 'output_R1_umi_extracted.fastq.gz' with your desired output file.
# Adjust '--bc-pattern' if your UMI length or position is different.
umi_tools extract \
    --input input_R1.fastq.gz \
    --output output_R1_umi_extracted.fastq.gz \
    --bc-pattern=NNNNNNNNNN \
    --log umi_extract.log

View on GitHub

Remove the linker from each sequence.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# conda install -c bioconda cutadapt

# Define input and output files
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed_output.fastq.gz"

# Define the 3' linker sequence for eCLIP (example)
# This sequence might vary slightly depending on the specific library preparation kit.
# A common eCLIP 3' adapter sequence is AGATCGGAAGAGCACACGTCTGAACTCCAGTCA.
LINKER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Run cutadapt to remove the 3' linker
cutadapt -a "${LINKER_SEQUENCE}" \
         -o "${OUTPUT_FASTQ}" \
         --minimum-length 18 \
         --error-rate 0.1 \
         --cores 4 \
         --report=minimal \
         "${INPUT_FASTQ}"

View on GitHub

Trim additional 3 bases.

cutadapt (Inferred with models/gemini-2.5-flash) v1.18 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Trim additional 3 bases from the 5' end of the reads
# This example assumes trimming from the 5' end, a common step in eCLIP after UMI extraction or to remove fixed-length artifacts.
# If trimming from the 3' end is intended, use '-u -3' instead of '-u 3'.
cutadapt -u 3 -o trimmed_reads.fastq.gz input_reads.fastq.gz

View on GitHub

Merge identical reads.

picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 GitHub

$ Bash example

# Install Picard (if not already installed)
# conda install -c bioconda picard
# Or download the JAR directly:
# wget https://github.com/broadinstitute/picard/releases/download/2.27.4/picard.jar

# Assuming input.bam is a coordinate-sorted BAM file
# Replace input.bam with your actual input file path
# Replace output_deduplicated.bam with your desired output file path
# Replace deduplication_metrics.txt with your desired metrics file path
java -jar picard.jar MarkDuplicates \
    I=input.bam \
    O=output_deduplicated.bam \
    M=deduplication_metrics.txt \
    ASSUME_SORTED=true \
    REMOVE_DUPLICATES=true

View on GitHub

Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments.

Bowtie v1.2.3 GitHub

$ Bash example

# Install Bowtie
# conda install -c bioconda bowtie=1.2.3

# Define reference transcriptome and index prefix
# Using GRCh38 as a placeholder for the latest human assembly.
# Replace 'GRCh38_transcriptome.fa' with the actual path to your transcriptome FASTA file.
# Example source for human transcriptome: GENCODE (e.g., gencode.v44.transcripts.fa.gz)
TRANSCRIPTOME_FA="GRCh38_transcriptome.fa"
INDEX_PREFIX="GRCh38_transcriptome_index"

# Build Bowtie index for the transcriptome
# This step needs to be run once for the reference transcriptome.
# bowtie-build "${TRANSCRIPTOME_FA}" "${INDEX_PREFIX}"

# Define input FASTQ file and output SAM file
# Replace 'input_reads.fastq' with your actual input sequencing reads.
INPUT_FASTQ="input_reads.fastq"
OUTPUT_SAM="aligned_reads.sam"

# Align sequences to gene transcripts using Bowtie, allowing multiple alignments.
# -x: specifies the index prefix
# -q: indicates input reads are in FASTQ format
# -S: specifies the output SAM file
# -a: reports all valid alignments for each read
bowtie -x "${INDEX_PREFIX}" -q "${INPUT_FASTQ}" -S "${OUTPUT_SAM}" -a

View on GitHub

DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes.

Bowtie vDESeq2 v1.40.2, Bowtie2 v2.5.2

$ Bash example

# Install Bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Install Samtools (if not already installed)
# conda install -c bioconda samtools

# Install Subread (for featureCounts, if not already installed)
# conda install -c bioconda subread

# Install R and DESeq2 package (if not already installed)
# conda install -c conda-forge r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# --- Step 1: Build Bowtie2 index (if not already built) ---
# Replace 'genome.fa' with your reference genome FASTA file (e.g., hg38.fa)
# Replace 'index_name' with your desired index prefix (e.g., hg38_index)
# bowtie2-build genome.fa index_name

# --- Step 2: Align reads with Bowtie2 ---
# Replace 'index_name' with your Bowtie2 index prefix
# Replace 'sample1_R1.fastq.gz' and 'sample1_R2.fastq.gz' with your actual FASTQ files
# Replace 'sample1.sam' and 'sample1.sorted.bam' with your desired output file names
# For paired-end reads:
bowtie2 -x index_name -1 sample1_R1.fastq.gz -2 sample1_R2.fastq.gz -S sample1.sam

# Convert SAM to BAM, sort, and index
samtools view -bS sample1.sam > sample1.bam
samtools sort sample1.bam -o sample1.sorted.bam
samtools index sample1.sorted.bam

# Repeat for all samples (e.g., sample2, sample3, etc.)

# --- Step 3: Count reads per gene with featureCounts ---
# Replace 'annotation.gtf' with your gene annotation GTF file (e.g., gencode.v38.annotation.gtf)
# Replace 'output_counts.txt' with your desired output count matrix file name
# Replace 'sample*.sorted.bam' with all your sorted BAM files
featureCounts -p -a annotation.gtf -o output_counts.txt sample1.sorted.bam sample2.sorted.bam sample3.sorted.bam

# --- Step 4: Perform Differential Expression Analysis with DESeq2 in R ---
# Create a sample information file (e.g., 'sample_info.csv') like this:
# sample,condition
# sample1,control
# sample2,control
# sample3,treated
# sample4,treated

Rscript -e '
  library(DESeq2)
  
  # Load count data from featureCounts output
  # skip=1 to skip the header line from featureCounts that starts with #
  count_data <- read.table("output_counts.txt", header = TRUE, row.names = 1, skip = 1)
  
  # Remove the first 5 columns (Chr, Start, End, Strand, Length) which are metadata from featureCounts
  count_data <- count_data[, 6:ncol(count_data)]
  
  # Clean up column names to match sample names (e.g., remove .sorted.bam suffix)
  colnames(count_data) <- gsub(".sorted.bam", "", colnames(count_data))
  
  # Load sample information
  sample_info <- read.csv("sample_info.csv", row.names = 1)
  
  # Ensure sample names in colData match and are in the same order as countData columns
  sample_info <- sample_info[colnames(count_data), , drop = FALSE]
  
  # Create DESeq2 object
  dds <- DESeqDataSetFromMatrix(countData = count_data,
                                colData = sample_info,
                                design = ~ condition) # Adjust 'condition' to your experimental design variable
  
  # Run DESeq2 analysis
  dds <- DESeq(dds)
  
  # Get results (e.g., comparing 'treated' vs 'control')
  # Replace 'condition_treated_vs_control' with your actual contrast
  res <- results(dds, contrast=c("condition", "treated", "control"))
  
  # Order results by adjusted p-value
  res_ordered <- res[order(res$padj), ]
  
  # Save results
  write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")
  
  # Optional: Save normalized counts
  normalized_counts <- counts(dds, normalized=TRUE)
  write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")
'

Raw Source Text

Filter sequences with irrelevant barcodes
Split relevant sequences by barcode
Remove the linker from each sequence. Trim additional 3 bases.
Merge identical reads.
Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments.
DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes.
Genome_build: WS220/ce10

← Back to Analysis