GSE166404 Processing Pipeline

OTHER code_examples 9 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [STAMP_pacbio]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

The double stranded cDNA for each sample was converted to sequencing libraries as recommended (PacBio SMRTbell Express Template Prep Kit 2.0) but with separate barcoded adapters for each sample (PacBio Barcoded Overhang Adapter Kit).

PacBio sequencing v2.0

$ Bash example

# This step describes a wet-lab library preparation protocol using PacBio kits.
# No computational command is directly associated with this step.
# The process involves converting double-stranded cDNA to sequencing libraries
# using the PacBio SMRTbell Express Template Prep Kit 2.0 and
# PacBio Barcoded Overhang Adapter Kit for separate barcoded adapters for each sample.

All of the samples were pooled in an equimolar fashion and sequenced on a SMRTcell 8M with the PacBio Sequel II instrument (2.0 chemistry/2.1 polymerase with 2 hour pre-extension and 30 hour movie times).

PacBio sequencing vChemistry 2.0/2.1

$ Bash example

# This block describes the parameters for a PacBio Sequel II sequencing run.
# The actual sequencing is performed on the instrument and generates raw data files.

# Sequencing Instrument: PacBio Sequel II
# SMRTcell Type: 8M
# Chemistry: 2.0
# Polymerase: 2.1
# Pre-extension time: 2 hours
# Movie time: 30 hours
# Pooling method: equimolar

# The raw sequencing data (e.g., subreads.bam files) are generated by the instrument's software.
# This command represents the successful completion and availability of the raw sequencing data.
echo "PacBio Sequel II sequencing completed with SMRTcell 8M, Chemistry 2.0/2.1, 2hr pre-extension, 30hr movie time. Raw data ready for downstream bioinformatics processing."

After barcodes were demultiplexed, the initial data was used to re-balance the pooling by barcode counts before further sequencing.

Custom Python script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Create a dummy input file for demonstration (replace with actual demultiplexed barcode counts)
echo -e "barcode1\t1000000" > barcode_counts.tsv
echo -e "barcode2\t500000" >> barcode_counts.tsv
echo -e "barcode3\t2000000" >> barcode_counts.tsv
echo -e "barcode4\t750000" >> barcode_counts.tsv
echo -e "barcode5\t1250000" >> barcode_counts.tsv

# Install pandas if not already installed
# conda install -c anaconda pandas
# pip install pandas

# Create the Python script for re-balancing
cat << 'EOF' > rebalance_pooling.py
import sys
import pandas as pd

def rebalance_pooling(input_counts_file, output_report_file, target_strategy="equal"):
    """
    Analyzes barcode counts from an initial sequencing run and suggests re-balancing factors.

    Args:
        input_counts_file (str): Path to a TSV file with barcode counts (e.g., 'barcode\tcount').
        output_report_file (str): Path to output the re-balancing report.
        target_strategy (str): Strategy for target pooling. Currently only 'equal' is supported.
    """
    try:
        df = pd.read_csv(input_counts_file, sep='\t', header=None, names=['barcode', 'count'])
    except FileNotFoundError:
        print(f"Error: Input file '{input_counts_file}' not found.", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading input file: {e}", file=sys.stderr)
        sys.exit(1)

    total_counts = df['count'].sum()
    if total_counts == 0:
        print("Warning: Total counts are zero. Cannot perform re-balancing.", file=sys.stderr)
        df['current_proportion'] = 0
        df['target_proportion'] = 0
        df['rebalancing_factor'] = 0
    else:
        df['current_proportion'] = df['count'] / total_counts

        if target_strategy == "equal":
            num_barcodes = len(df)
            df['target_proportion'] = 1 / num_barcodes
        else:
            print(f"Error: Unknown target strategy '{target_strategy}'.", file=sys.stderr)
            sys.exit(1)

        # Rebalancing factor: how much to adjust the concentration for the next pool
        # A factor > 1 means increase concentration, < 1 means decrease.
        # Handle division by zero if a barcode had 0 counts but should be included.
        df['rebalancing_factor'] = df.apply(
            lambda row: row['target_proportion'] / row['current_proportion'] if row['current_proportion'] > 0 else (
                float('inf') if row['target_proportion'] > 0 else 1
            ), axis=1
        )

    df.to_csv(output_report_file, sep='\t', index=False)
    print(f"Re-balancing report generated: {output_report_file}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python rebalance_pooling.py <input_counts_file> <output_report_file>", file=sys.stderr)
        sys.exit(1)
    
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    rebalance_pooling(input_file, output_file)
EOF

# Execute the re-balancing script
python rebalance_pooling.py barcode_counts.tsv rebalancing_report.tsv

# The 'rebalancing_report.tsv' will contain suggested adjustment factors for each barcode.
# This report is then used to manually or robotically adjust sample concentrations for the next sequencing run.

View on GitHub

In total, the samples were sequenced over 5 SMRTcell 8M.

PacBio Sequel System (Inferred with models/gemini-2.5-flash) vSMRTcell 8M

$ Bash example

# This step describes the generation of raw sequencing data using a PacBio Sequel System with SMRTcell 8M.
# It is a data generation step and does not correspond to a direct bash execution command within a bioinformatics pipeline.
# The output of this step would be raw sequencing data files (e.g., BAM files) ready for downstream computational analysis.

The PacBio Sequel II system was used for all sample sequencing.

PacBio sequencing vSequel II

$ Bash example

# This step describes the wet-lab sequencing process using the PacBio Sequel II system.
# Raw sequencing data (e.g., BAM/FASTQ files) would be generated from this step.
# There is no direct command-line execution for the physical sequencing instrument itself.

Following sequencing, the circular consensus sequence (CCS) reads for each set of technical replicates were processed using the Isoseq v3 pipeline (https://github.com/PacificBiosciences/IsoSeq) to generate full-length non-concatamer reads in fasta format.

IsoSeq vv3 GitHub

$ Bash example

# Install IsoSeq v3 (example using conda)
# conda create -n isoseq3_env -c bioconda -c conda-forge isoseq3
# conda activate isoseq3_env

# Assuming 'input_ccs.bam' contains the circular consensus sequence (CCS) reads
# and 'primers.fasta' contains the 5' and 3' cDNA primers used for library preparation.
# Replace 'input_ccs.bam' and 'primers.fasta' with actual file paths for each technical replicate.

# Step 1: Refine CCS reads to identify and remove primers, and filter for full-length non-concatamer (FLNC) reads.
# This step generates a BAM file containing FLNC reads.
isoseq3 refine --primer primers.fasta input_ccs.bam flnc.bam

# Step 2: Convert the FLNC BAM file to FASTA format, as specified in the description.
samtools fasta flnc.bam > flnc.fasta

View on GitHub

For this step, software package lima v2.0.0 was used with parameters: --isoseq and --dump-clips.

IsoSeq v2.0.0 GitHub

$ Bash example

# Install lima (often part of PacBio SMRT Link or pbccs package)
# conda install -c bioconda lima

# Execute lima with specified parameters
# Replace <input_file.bam> with your actual input BAM file (e.g., subreads.bam or ccs.bam)
# Replace <output_prefix> with your desired output file prefix
lima --isoseq --dump-clips <input_file.bam> <output_prefix>

View on GitHub

8
In addition, isoseq3 v3.4.0 refine was used with parameters: --require-polya.

IsoSeq v3.4.0 GitHub
$ Bash example
```
isoseq3 refine input.bam output.bam --require-polya
```
View on GitHub

Fasta files for each set of technical replicates were the pooled together and the full-length non-concatemer reads for each sample were aligned to the hg19 reference genome using minimap2 v2.17-r941 using parameters: --ax-splice, -uf, --secondary=no, -t 30.

minimap2 v2.17 GitHub

$ Bash example

# Install minimap2 if not already installed
# conda install -c bioconda minimap2

# Placeholder for reference genome index. Replace 'path/to/hg19.mmi' with the actual path.
# If an index does not exist, it can be created from the hg19 FASTA file:
# minimap2 -d path/to/hg19.mmi path/to/hg19.fa
REFERENCE_INDEX="path/to/hg19.mmi"

# Placeholder for input FASTA file containing pooled full-length non-concatemer reads.
# Replace 'path/to/input_reads.fasta' with the actual path to your input FASTA file.
INPUT_FASTA="path/to/input_reads.fasta"

# Placeholder for output BAM file. Replace 'path/to/output.bam' with your desired output path.
OUTPUT_BAM="path/to/output.bam"

# Align reads to the hg19 reference genome using minimap2
minimap2 --ax-splice -uf --secondary=no -t 30 "${REFERENCE_INDEX}" "${INPUT_FASTA}" | samtools view -bS -o "${OUTPUT_BAM}" -

View on GitHub

Tools Used

PacBio sequencing

Raw Source Text

The double stranded cDNA for each sample was converted to sequencing libraries as recommended (PacBio SMRTbell Express Template Prep Kit 2.0) but with separate barcoded adapters for each sample (PacBio Barcoded Overhang Adapter Kit).
All of the samples were pooled in an equimolar fashion and sequenced on a SMRTcell 8M with the PacBio Sequel II instrument (2.0 chemistry/2.1 polymerase with 2 hour pre-extension and 30 hour movie times). After barcodes were demultiplexed, the initial data was used to re-balance the pooling by barcode counts before further sequencing. In total, the samples were sequenced over 5 SMRTcell 8M. The PacBio Sequel II system was used for all sample sequencing.
Following sequencing, the circular consensus sequence (CCS) reads for each set of technical replicates were processed using the Isoseq v3 pipeline (https://github.com/PacificBiosciences/IsoSeq) to generate full-length non-concatamer reads in fasta format. For this step, software package lima v2.0.0 was used with parameters: --isoseq and --dump-clips. In addition, isoseq3 v3.4.0 refine was used with parameters: --require-polya. Fasta files for each set of technical replicates were the pooled together and the full-length non-concatemer reads for each sample were aligned to the hg19 reference genome using minimap2 v2.17-r941 using parameters: --ax-splice, -uf, --secondary=no, -t 30.
Genome_build: hg19
Supplementary_files_format_and_content: BED, txt

← Back to Analysis