GSE263870 Processing Pipeline

OTHER code_examples 9 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature communications (2024) — PMID 39152130

Dataset

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [isstamp_lr]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

CCS calling (v8.0.0)

ccs v8.0.0 GitHub

$ Bash example

# Install PacBio SMRT Link/SMRT Tools (which includes the ccs tool).
# This can typically be done by downloading the SMRT Link installer from PacBio's website
# or, for the `pbccs` standalone tool, via bioconda:
# conda install -c bioconda pbccs=8.0.0

# Example command for Circular Consensus Sequencing (CCS) calling.
# Replace 'input.subreads.bam' with the path to your PacBio subreads BAM file.
# Replace 'output.ccs.bam' with the desired path for the output CCS BAM file.
# The version 8.0.0 refers to the specific version of the ccs tool.
ccs input.subreads.bam output.ccs.bam

View on GitHub

read deconcatenation using skera(1.2.0)

skera v1.2.0

$ Bash example

# Install skera (if not already installed)
# conda install -c bioconda skera

# Example command for read deconcatenation using skera split
# This command assumes 'input.fastq.gz' is the concatenated file
# and will split it into multiple files with 'deconcatenated_reads_' prefix.
# The exact splitting logic (e.g., by barcode, by number of files) would depend on specific use case.
skera split input.fastq.gz --output-prefix deconcatenated_reads_

read trimming with lima(v2.9.0)

lima v2.9.0

$ Bash example

# Install lima (example using conda)
# conda install -c bioconda lima=2.9.0

# Example command for read trimming and demultiplexing with lima.
# This command assumes you have a primer FASTA file and an input FASTQ file.
# Adjust parameters like --min-score, --min-length, --max-length as needed for your data.
# Replace 'primers.fasta' with the path to your actual primer file.
# Replace 'input_reads.fastq' with the path to your actual input FASTQ file.
# The --split-fastq option will output demultiplexed and trimmed reads into the specified output directory.
lima --ccs \
     --peek-guess \
     --min-score 70 \
     --min-length 50 \
     --max-length 10000 \
     --split-fastq \
     --output-dir trimmed_demux_reads \
     primers.fasta \
     input_reads.fastq

refine reads with isoseq refine (v4.1.1)

IsoSeq v4.1.1

$ Bash example

# Install pb-isoseq-sm if not already installed
# conda install -c bioconda pb-isoseq-sm=4.1.1

# Refine reads using isoseq refine
# This command typically takes demultiplexed CCS reads (from lima) and a primer file.
# Replace 'input.lima.ccs.bam' with your actual demultiplexed CCS BAM file.
# Replace 'primers.fasta' with your actual primer FASTA file.
# Replace 'output.flnc.bam' with your desired output file name for full-length non-chimeric reads.
isoseq refine --primer primers.fasta input.lima.ccs.bam output.flnc.bam

read alignment with pbmm2 (v1.13.0)

pbmm2 vv1.13.0 GitHub

$ Bash example

# Install pbmm2 (e.g., via conda)
# conda install -c bioconda pbmm2=1.13.0

# Define variables (replace with actual paths and filenames)
REFERENCE_GENOME="/path/to/your/reference_genome.fasta" # e.g., hg38.fasta
INPUT_READS="/path/to/your/input_reads.bam" # or .fastq.gz, .fasta.gz
OUTPUT_ALIGNED_BAM="output_aligned_reads.bam"
NUM_THREADS=8 # Adjust as needed

# Perform read alignment with pbmm2
# Using --preset CCS for HiFi reads, adjust if using CLR reads (--preset SUBREAD)
pbmm2 align \
  --sort \
  -j ${NUM_THREADS} \
  --preset CCS \
  ${REFERENCE_GENOME} \
  ${INPUT_READS} \
  ${OUTPUT_ALIGNED_BAM}

View on GitHub

QC with NanoPlot (v1.32.1)

NanoPlot v1.32.1 GitHub

$ Bash example

# Install NanoPlot if not already installed
# conda install -c bioconda nanoplot

# Perform QC with NanoPlot
# INPUT_FASTQ: Path to the input FASTQ file (e.g., from Oxford Nanopore sequencing)
# OUTPUT_PREFIX: Prefix for output files (e.g., nanoplot_qc_results)
nanoplot --fastq "$INPUT_FASTQ" --prefix "$OUTPUT_PREFIX"

View on GitHub

filter unmapped reads, supplementary alignment reads, reads aligned to the incorrect strand, and reads having read quality below 20

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.19

# Define input and output file paths
INPUT_BAM="input.bam"
OUTPUT_BAM="filtered.bam"

# Filter reads based on the specified criteria:
# -F 4: Exclude unmapped reads (FLAG 4 is UNMAPPED)
# -F 2048: Exclude supplementary alignments (FLAG 2048 is SUPPLEMENTARY_ALIGNMENT)
# -F 16: Exclude reads aligned to the reverse strand (FLAG 16 is READ_REVERSE_STRAND). 
#        This interprets 'reads aligned to the incorrect strand' as filtering out reads 
#        that align to the reverse strand, assuming a preference for forward-strand alignments.
# -q 20: Exclude reads with mapping quality below 20
# -b: Output in BAM format
samtools view -b -F 4 -F 2048 -F 16 -q 20 "${INPUT_BAM}" > "${OUTPUT_BAM}"

# Index the filtered BAM file for downstream use
samtools index "${OUTPUT_BAM}"

View on GitHub

call edits using custom script

REDItools (Inferred with models/gemini-2.5-flash) v1.2.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This step uses a custom Python script (call_edits.py) from the Yeo lab eCLIP pipeline,
# which in turn calls REDItoolDnaRna.py to identify RNA editing sites.

# Install REDItools (example, adjust for specific environment)
# REDItools can be installed via pip or from source.
# pip install REDItools==1.2.1

# Define input and reference files (placeholders - replace with actual paths)
INPUT_BAM="path/to/merged_replicates.bam"
GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa" # Example: hg38 primary assembly
GENOME_GTF="path/to/gencode.v38.annotation.gtf" # Example: GENCODE v38 annotations
OUTPUT_PREFIX="sample_name_edits"

# Optional: Common variants VCF (e.g., dbSNP) to filter known SNPs
COMMON_VARIANTS_VCF="path/to/dbSNP_155_GRCh38.vcf.gz" # Example: dbSNP build 155 for hg38

# Optional: Known RNA editing sites (e.g., from REDIportal) to annotate known sites
KNOWN_RNA_EDITING_SITES="path/to/REDIportal_hg38_all_sites.txt" # Example: REDIportal sites for hg38

# Parameters for REDItoolDnaRna.py, as used in the eCLIP pipeline's call_edits.py script
MIN_READ_COVERAGE=5         # Minimum read coverage
MIN_BASE_QUALITY=20         # Minimum base quality
MIN_MAPPING_QUALITY=20      # Minimum mapping quality
MIN_VARIANT_FREQUENCY=0.05  # Minimum variant frequency
MAX_VARIANT_FREQUENCY=0.95  # Maximum variant frequency
MIN_VARIANT_READS=2         # Minimum number of reads supporting the variant
STRAND_SPECIFIC=1           # 0 for non-strand-specific, 1 for strand-specific (eCLIP is strand-specific)
FILTER_MULTIHIT_FLAG="-x"   # Filter reads mapping to multiple locations
FILTER_INDELS_FLAG="-i"     # Filter sites with indels
FILTER_REPEATS_FLAG="-r"    # Filter sites in repetitive regions

# Construct the REDItoolDnaRna.py command
REDItools_COMMAND="REDItoolDnaRna.py"
REDItools_COMMAND+=" -s ${STRAND_SPECIFIC}"
REDItools_COMMAND+=" -t 16" # Number of threads (hardcoded in the eCLIP script)
REDItools_COMMAND+=" -f ${GENOME_FASTA}"
REDItools_COMMAND+=" -o ${OUTPUT_PREFIX}"
REDItools_COMMAND+=" -g ${GENOME_GTF}"
REDItools_COMMAND+=" -q ${MIN_BASE_QUALITY}"
REDItools_COMMAND+=" -m ${MIN_MAPPING_QUALITY}"
REDItools_COMMAND+=" -c ${MIN_READ_COVERAGE}"
REDItools_COMMAND+=" -v ${MIN_VARIANT_FREQUENCY}"
REDItools_COMMAND+=" -V ${MAX_VARIANT_FREQUENCY}"
REDItools_COMMAND+=" -n ${MIN_VARIANT_READS}"
REDItools_COMMAND+=" ${FILTER_MULTIHIT_FLAG}"
REDItools_COMMAND+=" ${FILTER_INDELS_FLAG}"
REDItools_COMMAND+=" ${FILTER_REPEATS_FLAG}"

# Add optional filters if files are provided
if [ -f "${COMMON_VARIANTS_VCF}" ]; then
    REDItools_COMMAND+=" -d ${COMMON_VARIANTS_VCF}"
fi
if [ -f "${KNOWN_RNA_EDITING_SITES}" ]; then
    REDItools_COMMAND+=" -e ${KNOWN_RNA_EDITING_SITES}"
fi

REDItools_COMMAND+=" ${INPUT_BAM}"

# Execute the REDItools command
eval "${REDItools_COMMAND}"

# The eCLIP pipeline's call_edits.py script also performs post-processing
# to convert REDItools output to VCF and TSV formats, which would involve
# custom parsing scripts not shown here.

View on GitHub

remove SNPs

bcftools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# This script demonstrates how to remove known SNPs from a VCF file.
# It assumes you have an input VCF file (e.g., from variant calling)
# and a reference VCF file containing known SNPs (e.g., dbSNP).

# Define input and output files
INPUT_VCF="variants.vcf.gz" # Replace with your actual input VCF file
KNOWN_SNPS_VCF="dbSNP_GRCh38_common.vcf.gz" # Placeholder: common SNPs from dbSNP for hg38
OUTPUT_VCF="variants_no_known_snps.vcf.gz"

# Reference genome assembly (e.g., hg38)
# Ensure the dbSNP VCF matches the reference used for variant calling.
GENOME_ASSEMBLY="GRCh38"

# --- Installation (uncomment to install if needed) ---
# conda install -c bioconda bcftools
# bcftools --version # Check installed version

# --- Pre-requisites ---
# Ensure input VCF and known SNPs VCF are indexed (.csi or .tbi files).
# If not, uncomment and run the following:
# bcftools index "${INPUT_VCF}"
# bcftools index "${KNOWN_SNPS_VCF}"

# Step 1: Annotate the input VCF with information from the known SNPs VCF.
# This command adds an 'DB' flag to the INFO field of variants found in both files.
bcftools annotate -a "${KNOWN_SNPS_VCF}" -c ID,INFO/DB "${INPUT_VCF}" | \
# Step 2: Filter out variants that have the 'DB' flag (i.e., are known SNPs).
bcftools filter -e 'INFO/DB' -Oz -o "${OUTPUT_VCF}"

# Create index for the output VCF
bcftools index "${OUTPUT_VCF}"

echo "Known SNPs removed. Output VCF: ${OUTPUT_VCF}"

View on GitHub

Raw Source Text

CCS calling (v8.0.0)
read deconcatenation using skera(1.2.0)
read trimming with lima(v2.9.0)
refine reads with isoseq refine (v4.1.1)
read alignment with pbmm2 (v1.13.0)
QC with NanoPlot (v1.32.1)
filter unmapped reads, supplementary alignment reads, reads aligned to the incorrect strand, and reads having read quality below 20
call edits using custom script
remove SNPs
Assembly: hg38
Supplementary files format and content: isoforms level edit detection (.bed)

← Back to Analysis