GSE263870 Processing Pipeline
OTHER
code_examples
9 steps
Publication
High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.Nature communications (2024) — PMID 39152130
Dataset
GSE263870An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [isstamp_lr]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
CCS calling (v8.0.0)
$ Bash example
# Install PacBio SMRT Link/SMRT Tools (which includes the ccs tool). # This can typically be done by downloading the SMRT Link installer from PacBio's website # or, for the `pbccs` standalone tool, via bioconda: # conda install -c bioconda pbccs=8.0.0 # Example command for Circular Consensus Sequencing (CCS) calling. # Replace 'input.subreads.bam' with the path to your PacBio subreads BAM file. # Replace 'output.ccs.bam' with the desired path for the output CCS BAM file. # The version 8.0.0 refers to the specific version of the ccs tool. ccs input.subreads.bam output.ccs.bam
-
2
read deconcatenation using skera(1.2.0)
skera v1.2.0$ Bash example
# Install skera (if not already installed) # conda install -c bioconda skera # Example command for read deconcatenation using skera split # This command assumes 'input.fastq.gz' is the concatenated file # and will split it into multiple files with 'deconcatenated_reads_' prefix. # The exact splitting logic (e.g., by barcode, by number of files) would depend on specific use case. skera split input.fastq.gz --output-prefix deconcatenated_reads_
-
3
read trimming with lima(v2.9.0)
lima v2.9.0$ Bash example
# Install lima (example using conda) # conda install -c bioconda lima=2.9.0 # Example command for read trimming and demultiplexing with lima. # This command assumes you have a primer FASTA file and an input FASTQ file. # Adjust parameters like --min-score, --min-length, --max-length as needed for your data. # Replace 'primers.fasta' with the path to your actual primer file. # Replace 'input_reads.fastq' with the path to your actual input FASTQ file. # The --split-fastq option will output demultiplexed and trimmed reads into the specified output directory. lima --ccs \ --peek-guess \ --min-score 70 \ --min-length 50 \ --max-length 10000 \ --split-fastq \ --output-dir trimmed_demux_reads \ primers.fasta \ input_reads.fastq -
4
refine reads with isoseq refine (v4.1.1)
IsoSeq v4.1.1$ Bash example
# Install pb-isoseq-sm if not already installed # conda install -c bioconda pb-isoseq-sm=4.1.1 # Refine reads using isoseq refine # This command typically takes demultiplexed CCS reads (from lima) and a primer file. # Replace 'input.lima.ccs.bam' with your actual demultiplexed CCS BAM file. # Replace 'primers.fasta' with your actual primer FASTA file. # Replace 'output.flnc.bam' with your desired output file name for full-length non-chimeric reads. isoseq refine --primer primers.fasta input.lima.ccs.bam output.flnc.bam
-
5
read alignment with pbmm2 (v1.13.0)
$ Bash example
# Install pbmm2 (e.g., via conda) # conda install -c bioconda pbmm2=1.13.0 # Define variables (replace with actual paths and filenames) REFERENCE_GENOME="/path/to/your/reference_genome.fasta" # e.g., hg38.fasta INPUT_READS="/path/to/your/input_reads.bam" # or .fastq.gz, .fasta.gz OUTPUT_ALIGNED_BAM="output_aligned_reads.bam" NUM_THREADS=8 # Adjust as needed # Perform read alignment with pbmm2 # Using --preset CCS for HiFi reads, adjust if using CLR reads (--preset SUBREAD) pbmm2 align \ --sort \ -j ${NUM_THREADS} \ --preset CCS \ ${REFERENCE_GENOME} \ ${INPUT_READS} \ ${OUTPUT_ALIGNED_BAM} -
6
QC with NanoPlot (v1.32.1)
$ Bash example
# Install NanoPlot if not already installed # conda install -c bioconda nanoplot # Perform QC with NanoPlot # INPUT_FASTQ: Path to the input FASTQ file (e.g., from Oxford Nanopore sequencing) # OUTPUT_PREFIX: Prefix for output files (e.g., nanoplot_qc_results) nanoplot --fastq "$INPUT_FASTQ" --prefix "$OUTPUT_PREFIX"
-
7
filter unmapped reads, supplementary alignment reads, reads aligned to the incorrect strand, and reads having read quality below 20
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.19 # Define input and output file paths INPUT_BAM="input.bam" OUTPUT_BAM="filtered.bam" # Filter reads based on the specified criteria: # -F 4: Exclude unmapped reads (FLAG 4 is UNMAPPED) # -F 2048: Exclude supplementary alignments (FLAG 2048 is SUPPLEMENTARY_ALIGNMENT) # -F 16: Exclude reads aligned to the reverse strand (FLAG 16 is READ_REVERSE_STRAND). # This interprets 'reads aligned to the incorrect strand' as filtering out reads # that align to the reverse strand, assuming a preference for forward-strand alignments. # -q 20: Exclude reads with mapping quality below 20 # -b: Output in BAM format samtools view -b -F 4 -F 2048 -F 16 -q 20 "${INPUT_BAM}" > "${OUTPUT_BAM}" # Index the filtered BAM file for downstream use samtools index "${OUTPUT_BAM}" -
8
call edits using custom script
REDItools (Inferred with models/gemini-2.5-flash) v1.2.1 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# This step uses a custom Python script (call_edits.py) from the Yeo lab eCLIP pipeline, # which in turn calls REDItoolDnaRna.py to identify RNA editing sites. # Install REDItools (example, adjust for specific environment) # REDItools can be installed via pip or from source. # pip install REDItools==1.2.1 # Define input and reference files (placeholders - replace with actual paths) INPUT_BAM="path/to/merged_replicates.bam" GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa" # Example: hg38 primary assembly GENOME_GTF="path/to/gencode.v38.annotation.gtf" # Example: GENCODE v38 annotations OUTPUT_PREFIX="sample_name_edits" # Optional: Common variants VCF (e.g., dbSNP) to filter known SNPs COMMON_VARIANTS_VCF="path/to/dbSNP_155_GRCh38.vcf.gz" # Example: dbSNP build 155 for hg38 # Optional: Known RNA editing sites (e.g., from REDIportal) to annotate known sites KNOWN_RNA_EDITING_SITES="path/to/REDIportal_hg38_all_sites.txt" # Example: REDIportal sites for hg38 # Parameters for REDItoolDnaRna.py, as used in the eCLIP pipeline's call_edits.py script MIN_READ_COVERAGE=5 # Minimum read coverage MIN_BASE_QUALITY=20 # Minimum base quality MIN_MAPPING_QUALITY=20 # Minimum mapping quality MIN_VARIANT_FREQUENCY=0.05 # Minimum variant frequency MAX_VARIANT_FREQUENCY=0.95 # Maximum variant frequency MIN_VARIANT_READS=2 # Minimum number of reads supporting the variant STRAND_SPECIFIC=1 # 0 for non-strand-specific, 1 for strand-specific (eCLIP is strand-specific) FILTER_MULTIHIT_FLAG="-x" # Filter reads mapping to multiple locations FILTER_INDELS_FLAG="-i" # Filter sites with indels FILTER_REPEATS_FLAG="-r" # Filter sites in repetitive regions # Construct the REDItoolDnaRna.py command REDItools_COMMAND="REDItoolDnaRna.py" REDItools_COMMAND+=" -s ${STRAND_SPECIFIC}" REDItools_COMMAND+=" -t 16" # Number of threads (hardcoded in the eCLIP script) REDItools_COMMAND+=" -f ${GENOME_FASTA}" REDItools_COMMAND+=" -o ${OUTPUT_PREFIX}" REDItools_COMMAND+=" -g ${GENOME_GTF}" REDItools_COMMAND+=" -q ${MIN_BASE_QUALITY}" REDItools_COMMAND+=" -m ${MIN_MAPPING_QUALITY}" REDItools_COMMAND+=" -c ${MIN_READ_COVERAGE}" REDItools_COMMAND+=" -v ${MIN_VARIANT_FREQUENCY}" REDItools_COMMAND+=" -V ${MAX_VARIANT_FREQUENCY}" REDItools_COMMAND+=" -n ${MIN_VARIANT_READS}" REDItools_COMMAND+=" ${FILTER_MULTIHIT_FLAG}" REDItools_COMMAND+=" ${FILTER_INDELS_FLAG}" REDItools_COMMAND+=" ${FILTER_REPEATS_FLAG}" # Add optional filters if files are provided if [ -f "${COMMON_VARIANTS_VCF}" ]; then REDItools_COMMAND+=" -d ${COMMON_VARIANTS_VCF}" fi if [ -f "${KNOWN_RNA_EDITING_SITES}" ]; then REDItools_COMMAND+=" -e ${KNOWN_RNA_EDITING_SITES}" fi REDItools_COMMAND+=" ${INPUT_BAM}" # Execute the REDItools command eval "${REDItools_COMMAND}" # The eCLIP pipeline's call_edits.py script also performs post-processing # to convert REDItools output to VCF and TSV formats, which would involve # custom parsing scripts not shown here. -
9
remove SNPs
$ Bash example
# This script demonstrates how to remove known SNPs from a VCF file. # It assumes you have an input VCF file (e.g., from variant calling) # and a reference VCF file containing known SNPs (e.g., dbSNP). # Define input and output files INPUT_VCF="variants.vcf.gz" # Replace with your actual input VCF file KNOWN_SNPS_VCF="dbSNP_GRCh38_common.vcf.gz" # Placeholder: common SNPs from dbSNP for hg38 OUTPUT_VCF="variants_no_known_snps.vcf.gz" # Reference genome assembly (e.g., hg38) # Ensure the dbSNP VCF matches the reference used for variant calling. GENOME_ASSEMBLY="GRCh38" # --- Installation (uncomment to install if needed) --- # conda install -c bioconda bcftools # bcftools --version # Check installed version # --- Pre-requisites --- # Ensure input VCF and known SNPs VCF are indexed (.csi or .tbi files). # If not, uncomment and run the following: # bcftools index "${INPUT_VCF}" # bcftools index "${KNOWN_SNPS_VCF}" # Step 1: Annotate the input VCF with information from the known SNPs VCF. # This command adds an 'DB' flag to the INFO field of variants found in both files. bcftools annotate -a "${KNOWN_SNPS_VCF}" -c ID,INFO/DB "${INPUT_VCF}" | \ # Step 2: Filter out variants that have the 'DB' flag (i.e., are known SNPs). bcftools filter -e 'INFO/DB' -Oz -o "${OUTPUT_VCF}" # Create index for the output VCF bcftools index "${OUTPUT_VCF}" echo "Known SNPs removed. Output VCF: ${OUTPUT_VCF}"
Raw Source Text
CCS calling (v8.0.0) read deconcatenation using skera(1.2.0) read trimming with lima(v2.9.0) refine reads with isoseq refine (v4.1.1) read alignment with pbmm2 (v1.13.0) QC with NanoPlot (v1.32.1) filter unmapped reads, supplementary alignment reads, reads aligned to the incorrect strand, and reads having read quality below 20 call edits using custom script remove SNPs Assembly: hg38 Supplementary files format and content: isoforms level edit detection (.bed)