GSE232514 Processing Pipeline
RNA-Seq
5 steps
Publication
High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues. Nature Communications (2024), PMID 39152130
Dataset
GSE232514: Expanded repertoire of RNA-editing-based detection for RNA binding protein interactions (2)
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
remove adapter with Cutadapt
$ Bash example
# Install Cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths (placeholders)
INPUT_READ1="input_R1.fastq.gz"
INPUT_READ2="input_R2.fastq.gz"
OUTPUT_READ1="trimmed_R1.fastq.gz"
OUTPUT_READ2="trimmed_R2.fastq.gz"
REPORT_FILE="cutadapt_report.txt"

# Adapter sequences (placeholders: the standard Illumina TruSeq adapters; adjust to the library prep actually used)
# For single-end reads, only -a is needed. For paired-end reads, use -a for R1 and -A for R2.
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # TruSeq adapter read through at the 3' end of read 1
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # TruSeq adapter read through at the 3' end of read 2

# Trimming parameters (assumed values, not stated in the GEO record)
MIN_LENGTH=15      # Minimum read length after trimming
QUALITY_CUTOFF=20  # Quality cutoff for 3' end trimming
THREADS=8          # Number of CPU cores to use

# Run Cutadapt on paired-end reads; the trimming report goes to REPORT_FILE
cutadapt \
    --cores "${THREADS}" \
    -a "${ADAPTER_R1}" \
    -A "${ADAPTER_R2}" \
    -o "${OUTPUT_READ1}" \
    -p "${OUTPUT_READ2}" \
    -m "${MIN_LENGTH}" \
    -q "${QUALITY_CUTOFF}" \
    "${INPUT_READ1}" \
    "${INPUT_READ2}" \
    > "${REPORT_FILE}" 2>&1
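Paired-end trimming assumes that the R1 and R2 files hold the same number of reads in the same order. As a minimal sketch, the helpers below count records in gzipped FASTQ files so the two mates can be compared before Cutadapt is run. The function names `count_fastq_reads` and `check_pairs_in_sync` are hypothetical, and the count assumes exactly four lines per record (no wrapped sequences).

```shell
#!/usr/bin/env bash

# Count records in a gzipped FASTQ file, assuming exactly four lines per record.
count_fastq_reads() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Return 0 if both mate files hold the same number of records, 1 otherwise.
check_pairs_in_sync() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "Read count mismatch: $1 has $n1, $2 has $n2" >&2
        return 1
    fi
    echo "Both files contain $n1 reads"
}

# Usage (placeholder file names):
# check_pairs_in_sync input_R1.fastq.gz input_R2.fastq.gz
```

The same check can be repeated on the trimmed outputs, since paired-end filtering should keep or discard both mates of a pair together.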
2
align to human genome using STAR 2.7.6a
$ Bash example
# Install STAR using conda
# conda create -n star_env star=2.7.6a -c bioconda -c conda-forge
# conda activate star_env

# Define variables (replace with actual paths and filenames)
# Reference genome: human GRCh38 (hg38)
# Source: GENCODE (for GTF) and UCSC/ENA/NCBI (for FASTA)
GENOME_DIR="/path/to/STAR_human_GRCh38_index"   # Pre-built STAR genome index for human GRCh38
GTF_FILE="/path/to/gencode.v44.annotation.gtf"  # Uncompressed GTF (e.g., GENCODE v44); STAR does not read gzipped GTF files
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"                      # Remove this line (and "${READ2}" below) if reads are single-end
OUTPUT_PREFIX="sample_aligned"
THREADS=8                                       # Number of threads to use

# Example command to generate the STAR genome index (run once per genome/GTF combination)
# STAR --runMode genomeGenerate \
#     --genomeDir "${GENOME_DIR}" \
#     --genomeFastaFiles /path/to/human_GRCh38.fa \
#     --sjdbGTFfile "${GTF_FILE}" \
#     --sjdbOverhang 100 \
#     --runThreadN "${THREADS}"

# Align reads to the human genome (GRCh38).
# The filtering parameters are assumed ENCODE-style defaults, not values stated in the GEO record.
# --outReadsUnmapped Fastx writes unmapped reads to ${OUTPUT_PREFIX}_Unmapped.out.mate1 (and mate2),
# which the reporter alignment in the next step needs as input.
STAR --genomeDir "${GENOME_DIR}" \
    --readFilesIn "${READ1}" "${READ2}" \
    --readFilesCommand zcat \
    --runThreadN "${THREADS}" \
    --outFileNamePrefix "${OUTPUT_PREFIX}_" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard \
    --outReadsUnmapped Fastx \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverLmax 0.04 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --sjdbGTFfile "${GTF_FILE}" \
    --sjdbOverhang 100
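After alignment, STAR writes a summary to `<prefix>Log.final.out`, with one `label | value` pair per line. The sketch below pulls a single field out of that file, which is a quick way to check the mapping rate before moving on. The helper name `star_log_field` is hypothetical, and the label passed in must match the text in the log exactly.

```shell
#!/usr/bin/env bash

# Print the value of one field from a STAR Log.final.out file.
# $1: path to Log.final.out
# $2: exact field label as it appears to the left of the "|" separator
star_log_field() {
    awk -F'|' -v label="$2" '
        {
            key = $1
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", key)      # trim the label
            if (key == label) {
                val = $2
                gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)  # trim the value
                print val
            }
        }' "$1"
}

# Usage (placeholder file name):
# star_log_field sample_aligned_Log.final.out "Uniquely mapped reads %"
```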
3
align unmapped reads from above to 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a
$ Bash example
# Install STAR if not already installed
# conda install -c bioconda star=2.7.6a

# Placeholder for the unmapped reads from the previous step,
# i.e. reads that did not map to the human genome.
# If the file is uncompressed (as STAR's own Unmapped.out.mate1 output is), drop --readFilesCommand zcat below.
UNMAPPED_READS="unmapped_reads.fastq.gz"

# Placeholder for the STAR index directory of the 12X MS2 stem-loop reporter mRNA.
# The index must be pre-built with STAR --runMode genomeGenerate from the MS2 reporter sequence.
MS2_REPORTER_STAR_INDEX="ms2_reporter_star_index"

# Placeholder for the output prefix
OUTPUT_PREFIX="ms2_aligned_reads"

# Align unmapped reads to the 12X MS2 stem-loop reporter mRNA
# (the multimapping and mismatch limits are assumed values, not stated in the GEO record)
STAR \
    --genomeDir "${MS2_REPORTER_STAR_INDEX}" \
    --readFilesIn "${UNMAPPED_READS}" \
    --readFilesCommand zcat \
    --outFileNamePrefix "${OUTPUT_PREFIX}_" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes Standard \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 3 \
    --runThreadN 8  # Adjust thread count as needed
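Building a STAR index for a single reporter transcript differs from the genome case in one respect: the STAR manual advises scaling `--genomeSAindexNbases` down to min(14, log2(GenomeLength)/2 - 1) for small references. The sketch below computes that value from a FASTA file. The helper name `sa_index_nbases` and the file names are hypothetical, and rounding the result down is a conservative choice made here, not something the manual prescribes.

```shell
#!/usr/bin/env bash

# Suggest a --genomeSAindexNbases value for a small reference FASTA,
# using min(14, log2(length)/2 - 1), rounded down and clamped to at least 1.
sa_index_nbases() {
    awk '
        !/^>/ { gsub(/[[:space:]]/, ""); len += length($0) }  # sum sequence lengths, skip headers
        END {
            if (len == 0) { print "error: no sequence found" > "/dev/stderr"; exit 1 }
            n = int(log(len) / log(2) / 2 - 1)
            if (n > 14) n = 14
            if (n < 1) n = 1
            print n
        }' "$1"
}

# Usage (placeholder file and directory names):
# STAR --runMode genomeGenerate \
#     --genomeDir ms2_reporter_star_index \
#     --genomeFastaFiles ms2_reporter.fa \
#     --genomeSAindexNbases "$(sa_index_nbases ms2_reporter.fa)"
```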
4
SAILOR analysis of data for C-to-U edits
SAILOR v1.0
$ Bash example
# NOTE: the install command and the command-line flags below are illustrative and unverified.
# Check them against the SAILOR documentation before use.

# Install SAILOR via conda
# conda create -n sailor_env python=3.8
# conda activate sailor_env
# conda install -c bioconda sailor

# Define input and output paths
INPUT_BAM="input.bam"                                           # Replace with your input RNA-seq BAM file
OUTPUT_PREFIX="sailor_output"                                   # Prefix for output files
REFERENCE_FASTA="/path/to/GRCh38.primary_assembly.genome.fa"    # Placeholder: path to GRCh38 reference FASTA
GENE_ANNOTATION_GTF="/path/to/gencode.v38.annotation.gtf"       # Placeholder: path to GENCODE v38 annotation GTF

# Run SAILOR for C-to-U editing analysis.
# This command is a basic example; --strand-specific is shown as an example option
# for strand-specific RNA-seq libraries.
sailor -i "${INPUT_BAM}" \
    -o "${OUTPUT_PREFIX}" \
    -r "${REFERENCE_FASTA}" \
    -g "${GENE_ANNOTATION_GTF}" \
    --strand-specific
5
SAILOR analysis of data for A-to-I edits
SAILOR v1.0
$ Bash example
# NOTE: the script name and the command-line flags below are illustrative and unverified.
# Check them against the SAILOR documentation before use.

# Clone the SAILOR repository (SAILOR is maintained by the Yeo lab; verify the URL before use)
# git clone https://github.com/YeoLab/sailor.git
# cd sailor

# Create a conda environment and install dependencies (pysam, numpy, scipy, matplotlib)
# conda create -n sailor_env python=3.8 -y
# conda activate sailor_env
# pip install pysam numpy scipy matplotlib

# Define input/output files and reference data
# Replace 'path/to/your_aligned_reads.bam' with the actual path to your input BAM file
INPUT_BAM="path/to/your_aligned_reads.bam"
OUTPUT_PREFIX="sailor_output"

# Reference genome and gene annotation (GRCh38, as stated in the GEO record)
# Download from sources like GENCODE, Ensembl, or UCSC
REFERENCE_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
GENE_ANNOTATION_GTF="path/to/gencode.v38.annotation.gtf"

# Strand information (assumed encoding): 0 for unstranded, 1 for first-strand, 2 for second-strand
# First-strand is assumed here; adjust to match your library
STRAND_INFO="1"

# Run SAILOR analysis for A-to-I edits
# Run from the directory that contains SAILOR.py, or provide its full path
python SAILOR.py \
    -i "${INPUT_BAM}" \
    -o "${OUTPUT_PREFIX}" \
    -r "${REFERENCE_FASTA}" \
    -g "${GENE_ANNOTATION_GTF}" \
    -s "${STRAND_INFO}"
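The two SAILOR steps look for different mismatches in the aligned reads. After reverse transcription, uridine is read as T and inosine as G, so a C-to-U edit appears as C>T and an A-to-I edit as A>G on plus-strand transcripts, with the complementary changes on minus-strand transcripts. The sketch below encodes that mapping; the helper name `expected_mismatch` is hypothetical and is not part of SAILOR.

```shell
#!/usr/bin/env bash

# Print the reference>alternate mismatch expected in genome (plus-strand) coordinates.
# $1: edit type ("C-to-U" or "A-to-I")
# $2: strand of the edited transcript ("+" or "-")
expected_mismatch() {
    case "$1:$2" in
        "C-to-U:+") echo "C>T" ;;
        "C-to-U:-") echo "G>A" ;;
        "A-to-I:+") echo "A>G" ;;
        "A-to-I:-") echo "T>C" ;;
        *) echo "unknown edit type or strand: $1 $2" >&2; return 1 ;;
    esac
}

# Usage:
# expected_mismatch "C-to-U" "-"
```

This is why strand information matters for both analyses: without it, a G>A mismatch on a minus-strand transcript cannot be told apart from an unrelated plus-strand change.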
Raw Source Text
remove adapter with Cutadapt
align to human genome using STAR 2.7.6a
align unmapped reads from above to 12X MS2 stem-loop reporter mRNA using STAR 2.7.6a
SAILOR analysis of data for C-to-U edits
SAILOR analysis of data for A-to-I edits
Assembly: GRCh38
Assembly: 12X MS2 stem-loop reporter mRNA
Supplementary files format and content: The processed data files are in .bed format. They contain SAILOR-identified C-to-U or A-to-I edit information for endogenous transcripts (*non.construct.bed) and the 12X stem-loop-bearing reporter mRNA (*reporter.construct.bed).
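The processed .bed files described above can be given a first look without knowing their column layout beyond the standard BED convention that the first three tab-separated fields are chromosome, start, and end. The sketch below counts records per chromosome (or per reference name, for the reporter construct). The helper name `bed_sites_per_chrom` and the file name in the usage line are hypothetical, and no assumption is made about what the remaining SAILOR columns contain.

```shell
#!/usr/bin/env bash

# Count BED records per chromosome/reference name.
# Skips track/browser/comment lines and lines with fewer than three tab-separated fields.
bed_sites_per_chrom() {
    awk -F'\t' '
        /^(track|browser|#)/ || NF < 3 { next }
        { n[$1]++ }
        END { for (c in n) print c "\t" n[c] }' "$1" | sort
}

# Usage (placeholder file name):
# bed_sites_per_chrom sample.non.construct.bed
```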