GSE110701 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Disruption in A-to-I Editing Levels Affects C. elegans Development More Than a Complete Lack of Editing.

Cell reports (2019) — PMID 31018137

Dataset

GSE110701

Disruption in A-to-I editing levels affects C. elegans development more than a complete lack of editing

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Filter sequences with irrelevant barcodes

    filter_reads_by_barcode.py (Inferred with models/gemini-2.5-flash) vN/A (Inferred from yeolab/eclip workflow)
    $ Bash example
    # Install dependencies if needed (e.g., Python)
    # This script is typically part of the yeolab/eclip workflow and assumes a Python environment.
    
    # Example: Create a dummy valid_barcodes.txt for demonstration.
    # In a real scenario, this file would contain the expected barcodes for your experiment.
    # For example, if your barcodes are 'AAAA' and 'TTTT':
    # echo -e "AAAA\nTTTT" > valid_barcodes.txt
    
    # Placeholder for input FASTQ file
    INPUT_FASTQ="input_reads_with_barcodes.fastq.gz"
    # Placeholder for output FASTQ file
    OUTPUT_FASTQ="filtered_reads.fastq.gz"
    # Placeholder for the file containing valid barcodes
    VALID_BARCODES_FILE="valid_barcodes.txt"
    
    # Ensure the script is available. It's usually found in the 'tools/' directory of the eclip repo.
    # Assuming the script is in the current working directory or in PATH.
    # If not, provide the full path: /path/to/yeolab/eclip/tools/filter_reads_by_barcode.py
    
    python filter_reads_by_barcode.py \
        -i "${INPUT_FASTQ}" \
        -o "${OUTPUT_FASTQ}" \
        -b "${VALID_BARCODES_FILE}"
  2. 2

    Split relevant sequences by barcode

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools if not already available
    # conda install -c bioconda umi_tools
    
    # Extract Unique Molecular Identifiers (UMIs) from the 5' end of Read 1
    # and append them to the read header. This is a common step in eCLIP
    # to prepare reads for deduplication.
    # This example assumes a 10bp UMI at the beginning of Read 1.
    # Replace 'input_R1.fastq.gz' with your actual input file.
    # Replace 'output_R1_umi_extracted.fastq.gz' with your desired output file.
    # Adjust '--bc-pattern' if your UMI length or position is different.
    umi_tools extract \
        --input input_R1.fastq.gz \
        --output output_R1_umi_extracted.fastq.gz \
        --bc-pattern=NNNNNNNNNN \
        --log umi_extract.log
  3. 3

    Remove the linker from each sequence.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed_output.fastq.gz"
    
    # Define the 3' linker sequence for eCLIP (example)
    # This sequence might vary slightly depending on the specific library preparation kit.
    # A common eCLIP 3' adapter sequence is AGATCGGAAGAGCACACGTCTGAACTCCAGTCA.
    LINKER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Run cutadapt to remove the 3' linker
    cutadapt -a "${LINKER_SEQUENCE}" \
             -o "${OUTPUT_FASTQ}" \
             --minimum-length 18 \
             --error-rate 0.1 \
             --cores 4 \
             --report=minimal \
             "${INPUT_FASTQ}"
  4. 4

    Trim additional 3 bases.

    cutadapt (Inferred with models/gemini-2.5-flash) v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Trim additional 3 bases from the 5' end of the reads
    # This example assumes trimming from the 5' end, a common step in eCLIP after UMI extraction or to remove fixed-length artifacts.
    # If trimming from the 3' end is intended, use '-u -3' instead of '-u 3'.
    cutadapt -u 3 -o trimmed_reads.fastq.gz input_reads.fastq.gz
  5. 5

    Merge identical reads.

    picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 GitHub
    $ Bash example
    # Install Picard (if not already installed)
    # conda install -c bioconda picard
    # Or download the JAR directly:
    # wget https://github.com/broadinstitute/picard/releases/download/2.27.4/picard.jar
    
    # Assuming input.bam is a coordinate-sorted BAM file
    # Replace input.bam with your actual input file path
    # Replace output_deduplicated.bam with your desired output file path
    # Replace deduplication_metrics.txt with your desired metrics file path
    java -jar picard.jar MarkDuplicates \
        I=input.bam \
        O=output_deduplicated.bam \
        M=deduplication_metrics.txt \
        ASSUME_SORTED=true \
        REMOVE_DUPLICATES=true
  6. 6

    Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments.

    Bowtie v1.2.3 GitHub
    $ Bash example
    # Install Bowtie
    # conda install -c bioconda bowtie=1.2.3
    
    # Define reference transcriptome and index prefix
    # Using GRCh38 as a placeholder for the latest human assembly.
    # Replace 'GRCh38_transcriptome.fa' with the actual path to your transcriptome FASTA file.
    # Example source for human transcriptome: GENCODE (e.g., gencode.v44.transcripts.fa.gz)
    TRANSCRIPTOME_FA="GRCh38_transcriptome.fa"
    INDEX_PREFIX="GRCh38_transcriptome_index"
    
    # Build Bowtie index for the transcriptome
    # This step needs to be run once for the reference transcriptome.
    # bowtie-build "${TRANSCRIPTOME_FA}" "${INDEX_PREFIX}"
    
    # Define input FASTQ file and output SAM file
    # Replace 'input_reads.fastq' with your actual input sequencing reads.
    INPUT_FASTQ="input_reads.fastq"
    OUTPUT_SAM="aligned_reads.sam"
    
    # Align sequences to gene transcripts using Bowtie, allowing multiple alignments.
    # -x: specifies the index prefix
    # -q: indicates input reads are in FASTQ format
    # -S: specifies the output SAM file
    # -a: reports all valid alignments for each read
    bowtie -x "${INDEX_PREFIX}" -q "${INPUT_FASTQ}" -S "${OUTPUT_SAM}" -a
  7. 7

    DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes.

    Bowtie vDESeq2 v1.40.2, Bowtie2 v2.5.2
    $ Bash example
    # Install Bowtie2 (if not already installed)
    # conda install -c bioconda bowtie2
    
    # Install Samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Install Subread (for featureCounts, if not already installed)
    # conda install -c bioconda subread
    
    # Install R and DESeq2 package (if not already installed)
    # conda install -c conda-forge r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # --- Step 1: Build Bowtie2 index (if not already built) ---
    # Replace 'genome.fa' with your reference genome FASTA file (e.g., hg38.fa)
    # Replace 'index_name' with your desired index prefix (e.g., hg38_index)
    # bowtie2-build genome.fa index_name
    
    # --- Step 2: Align reads with Bowtie2 ---
    # Replace 'index_name' with your Bowtie2 index prefix
    # Replace 'sample1_R1.fastq.gz' and 'sample1_R2.fastq.gz' with your actual FASTQ files
    # Replace 'sample1.sam' and 'sample1.sorted.bam' with your desired output file names
    # For paired-end reads:
    bowtie2 -x index_name -1 sample1_R1.fastq.gz -2 sample1_R2.fastq.gz -S sample1.sam
    
    # Convert SAM to BAM, sort, and index
    samtools view -bS sample1.sam > sample1.bam
    samtools sort sample1.bam -o sample1.sorted.bam
    samtools index sample1.sorted.bam
    
    # Repeat for all samples (e.g., sample2, sample3, etc.)
    
    # --- Step 3: Count reads per gene with featureCounts ---
    # Replace 'annotation.gtf' with your gene annotation GTF file (e.g., gencode.v38.annotation.gtf)
    # Replace 'output_counts.txt' with your desired output count matrix file name
    # Replace 'sample*.sorted.bam' with all your sorted BAM files
    featureCounts -p -a annotation.gtf -o output_counts.txt sample1.sorted.bam sample2.sorted.bam sample3.sorted.bam
    
    # --- Step 4: Perform Differential Expression Analysis with DESeq2 in R ---
    # Create a sample information file (e.g., 'sample_info.csv') like this:
    # sample,condition
    # sample1,control
    # sample2,control
    # sample3,treated
    # sample4,treated
    
    Rscript -e '
      library(DESeq2)
      
      # Load count data from featureCounts output
      # skip=1 to skip the header line from featureCounts that starts with #
      count_data <- read.table("output_counts.txt", header = TRUE, row.names = 1, skip = 1)
      
      # Remove the first 5 columns (Chr, Start, End, Strand, Length) which are metadata from featureCounts
      count_data <- count_data[, 6:ncol(count_data)]
      
      # Clean up column names to match sample names (e.g., remove .sorted.bam suffix)
      colnames(count_data) <- gsub(".sorted.bam", "", colnames(count_data))
      
      # Load sample information
      sample_info <- read.csv("sample_info.csv", row.names = 1)
      
      # Ensure sample names in colData match and are in the same order as countData columns
      sample_info <- sample_info[colnames(count_data), , drop = FALSE]
      
      # Create DESeq2 object
      dds <- DESeqDataSetFromMatrix(countData = count_data,
                                    colData = sample_info,
                                    design = ~ condition) # Adjust 'condition' to your experimental design variable
      
      # Run DESeq2 analysis
      dds <- DESeq(dds)
      
      # Get results (e.g., comparing 'treated' vs 'control')
      # Replace 'condition_treated_vs_control' with your actual contrast
      res <- results(dds, contrast=c("condition", "treated", "control"))
      
      # Order results by adjusted p-value
      res_ordered <- res[order(res$padj), ]
      
      # Save results
      write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")
      
      # Optional: Save normalized counts
      normalized_counts <- counts(dds, normalized=TRUE)
      write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")
    '
    
Raw Source Text
Filter sequences with irrelevant barcodes
Split relevant sequences by barcode
Remove the linker from each sequence. Trim additional 3 bases.
Merge identical reads.
Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments.
DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes.
Genome_build: WS220/ce10
← Back to Analysis