GSE110701 Processing Pipeline
RNA-Seq
code_examples
7 steps
Publication
Disruption in A-to-I Editing Levels Affects C. elegans Development More Than a Complete Lack of Editing.Cell reports (2019) — PMID 31018137
Dataset
GSE110701Disruption in A-to-I editing levels affects C. elegans development more than a complete lack of editing
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Filter sequences with irrelevant barcodes
filter_reads_by_barcode.py (Inferred with models/gemini-2.5-flash) vN/A (Inferred from yeolab/eclip workflow)$ Bash example
# Install dependencies if needed (e.g., Python) # This script is typically part of the yeolab/eclip workflow and assumes a Python environment. # Example: Create a dummy valid_barcodes.txt for demonstration. # In a real scenario, this file would contain the expected barcodes for your experiment. # For example, if your barcodes are 'AAAA' and 'TTTT': # echo -e "AAAA\nTTTT" > valid_barcodes.txt # Placeholder for input FASTQ file INPUT_FASTQ="input_reads_with_barcodes.fastq.gz" # Placeholder for output FASTQ file OUTPUT_FASTQ="filtered_reads.fastq.gz" # Placeholder for the file containing valid barcodes VALID_BARCODES_FILE="valid_barcodes.txt" # Ensure the script is available. It's usually found in the 'tools/' directory of the eclip repo. # Assuming the script is in the current working directory or in PATH. # If not, provide the full path: /path/to/yeolab/eclip/tools/filter_reads_by_barcode.py python filter_reads_by_barcode.py \ -i "${INPUT_FASTQ}" \ -o "${OUTPUT_FASTQ}" \ -b "${VALID_BARCODES_FILE}" -
2
Split relevant sequences by barcode
$ Bash example
# Install umi_tools if not already available # conda install -c bioconda umi_tools # Extract Unique Molecular Identifiers (UMIs) from the 5' end of Read 1 # and append them to the read header. This is a common step in eCLIP # to prepare reads for deduplication. # This example assumes a 10bp UMI at the beginning of Read 1. # Replace 'input_R1.fastq.gz' with your actual input file. # Replace 'output_R1_umi_extracted.fastq.gz' with your desired output file. # Adjust '--bc-pattern' if your UMI length or position is different. umi_tools extract \ --input input_R1.fastq.gz \ --output output_R1_umi_extracted.fastq.gz \ --bc-pattern=NNNNNNNNNN \ --log umi_extract.log -
3
Remove the linker from each sequence.
$ Bash example
# conda install -c bioconda cutadapt # Define input and output files INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="trimmed_output.fastq.gz" # Define the 3' linker sequence for eCLIP (example) # This sequence might vary slightly depending on the specific library preparation kit. # A common eCLIP 3' adapter sequence is AGATCGGAAGAGCACACGTCTGAACTCCAGTCA. LINKER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Run cutadapt to remove the 3' linker cutadapt -a "${LINKER_SEQUENCE}" \ -o "${OUTPUT_FASTQ}" \ --minimum-length 18 \ --error-rate 0.1 \ --cores 4 \ --report=minimal \ "${INPUT_FASTQ}" -
4
Trim additional 3 bases.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt # Trim additional 3 bases from the 5' end of the reads # This example assumes trimming from the 5' end, a common step in eCLIP after UMI extraction or to remove fixed-length artifacts. # If trimming from the 3' end is intended, use '-u -3' instead of '-u 3'. cutadapt -u 3 -o trimmed_reads.fastq.gz input_reads.fastq.gz
-
5
Merge identical reads.
$ Bash example
# Install Picard (if not already installed) # conda install -c bioconda picard # Or download the JAR directly: # wget https://github.com/broadinstitute/picard/releases/download/2.27.4/picard.jar # Assuming input.bam is a coordinate-sorted BAM file # Replace input.bam with your actual input file path # Replace output_deduplicated.bam with your desired output file path # Replace deduplication_metrics.txt with your desired metrics file path java -jar picard.jar MarkDuplicates \ I=input.bam \ O=output_deduplicated.bam \ M=deduplication_metrics.txt \ ASSUME_SORTED=true \ REMOVE_DUPLICATES=true -
6
Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments.
$ Bash example
# Install Bowtie # conda install -c bioconda bowtie=1.2.3 # Define reference transcriptome and index prefix # Using GRCh38 as a placeholder for the latest human assembly. # Replace 'GRCh38_transcriptome.fa' with the actual path to your transcriptome FASTA file. # Example source for human transcriptome: GENCODE (e.g., gencode.v44.transcripts.fa.gz) TRANSCRIPTOME_FA="GRCh38_transcriptome.fa" INDEX_PREFIX="GRCh38_transcriptome_index" # Build Bowtie index for the transcriptome # This step needs to be run once for the reference transcriptome. # bowtie-build "${TRANSCRIPTOME_FA}" "${INDEX_PREFIX}" # Define input FASTQ file and output SAM file # Replace 'input_reads.fastq' with your actual input sequencing reads. INPUT_FASTQ="input_reads.fastq" OUTPUT_SAM="aligned_reads.sam" # Align sequences to gene transcripts using Bowtie, allowing multiple alignments. # -x: specifies the index prefix # -q: indicates input reads are in FASTQ format # -S: specifies the output SAM file # -a: reports all valid alignments for each read bowtie -x "${INDEX_PREFIX}" -q "${INPUT_FASTQ}" -S "${OUTPUT_SAM}" -a -
7
DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes.
Bowtie vDESeq2 v1.40.2, Bowtie2 v2.5.2$ Bash example
# Install Bowtie2 (if not already installed) # conda install -c bioconda bowtie2 # Install Samtools (if not already installed) # conda install -c bioconda samtools # Install Subread (for featureCounts, if not already installed) # conda install -c bioconda subread # Install R and DESeq2 package (if not already installed) # conda install -c conda-forge r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")' # --- Step 1: Build Bowtie2 index (if not already built) --- # Replace 'genome.fa' with your reference genome FASTA file (e.g., hg38.fa) # Replace 'index_name' with your desired index prefix (e.g., hg38_index) # bowtie2-build genome.fa index_name # --- Step 2: Align reads with Bowtie2 --- # Replace 'index_name' with your Bowtie2 index prefix # Replace 'sample1_R1.fastq.gz' and 'sample1_R2.fastq.gz' with your actual FASTQ files # Replace 'sample1.sam' and 'sample1.sorted.bam' with your desired output file names # For paired-end reads: bowtie2 -x index_name -1 sample1_R1.fastq.gz -2 sample1_R2.fastq.gz -S sample1.sam # Convert SAM to BAM, sort, and index samtools view -bS sample1.sam > sample1.bam samtools sort sample1.bam -o sample1.sorted.bam samtools index sample1.sorted.bam # Repeat for all samples (e.g., sample2, sample3, etc.) # --- Step 3: Count reads per gene with featureCounts --- # Replace 'annotation.gtf' with your gene annotation GTF file (e.g., gencode.v38.annotation.gtf) # Replace 'output_counts.txt' with your desired output count matrix file name # Replace 'sample*.sorted.bam' with all your sorted BAM files featureCounts -p -a annotation.gtf -o output_counts.txt sample1.sorted.bam sample2.sorted.bam sample3.sorted.bam # --- Step 4: Perform Differential Expression Analysis with DESeq2 in R --- # Create a sample information file (e.g., 'sample_info.csv') like this: # sample,condition # sample1,control # sample2,control # sample3,treated # sample4,treated Rscript -e ' library(DESeq2) # Load count data from featureCounts output # skip=1 to skip the header line from featureCounts that starts with # count_data <- read.table("output_counts.txt", header = TRUE, row.names = 1, skip = 1) # Remove the first 5 columns (Chr, Start, End, Strand, Length) which are metadata from featureCounts count_data <- count_data[, 6:ncol(count_data)] # Clean up column names to match sample names (e.g., remove .sorted.bam suffix) colnames(count_data) <- gsub(".sorted.bam", "", colnames(count_data)) # Load sample information sample_info <- read.csv("sample_info.csv", row.names = 1) # Ensure sample names in colData match and are in the same order as countData columns sample_info <- sample_info[colnames(count_data), , drop = FALSE] # Create DESeq2 object dds <- DESeqDataSetFromMatrix(countData = count_data, colData = sample_info, design = ~ condition) # Adjust 'condition' to your experimental design variable # Run DESeq2 analysis dds <- DESeq(dds) # Get results (e.g., comparing 'treated' vs 'control') # Replace 'condition_treated_vs_control' with your actual contrast res <- results(dds, contrast=c("condition", "treated", "control")) # Order results by adjusted p-value res_ordered <- res[order(res$padj), ] # Save results write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv") # Optional: Save normalized counts normalized_counts <- counts(dds, normalized=TRUE) write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv") '
Raw Source Text
Filter sequences with irrelevant barcodes Split relevant sequences by barcode Remove the linker from each sequence. Trim additional 3 bases. Merge identical reads. Sequences were aligned to gene transcripts using Bowtie, aloowing multiple alignments. DEseq package in R (http://www.r-project.org) was used from the data obtained from bowtie to identify differentially expressed genes. Genome_build: WS220/ce10