GSE155726 Processing Pipeline

RNA-Seq code_examples 12 steps

Publication

Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.

Nature methods (2021) — PMID 33963355

Dataset

Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [scRNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 3.0.2)

Cell Ranger v3.0.2

$ Bash example

# Cell Ranger is typically installed by downloading the tarball from 10x Genomics and adding the executable to your PATH.
# For version 3.0.2, you would download the specific release from the 10x Genomics support site.
# Example (replace with actual download link and version):
# wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-3.0.2.tar.gz
# tar -xzf cellranger-3.0.2.tar.gz
# export PATH=/path/to/cellranger-3.0.2:$PATH

# Prepare reference transcriptome (example for human GRCh38)
# Download pre-built reference from 10x Genomics or build your own.
# Example download (replace with actual version if needed):
# wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
# tar -xzf refdata-gex-GRCh38-2020-A.tar.gz
# REF_TRANSCRIPTOME_PATH="./refdata-gex-GRCh38-2020-A"

# Define variables for the run
RUN_ID="my_cellranger_count_run"
FASTQ_DIR="/path/to/your/fastq_files_directory" # Directory containing FASTQ files (e.g., Sample1_S1_L001_R1_001.fastq.gz)
SAMPLE_NAME="my_sample" # Sample prefix used in FASTQ filenames (e.g., 'my_sample' for my_sample_S1_L001_R1_001.fastq.gz)
TRANSCRIPTOME_PATH="/path/to/your/10x_genomics_reference/refdata-gex-GRCh38-2020-A"

# Execute cellranger count
cellranger count \
  --id="${RUN_ID}" \
  --transcriptome="${TRANSCRIPTOME_PATH}" \
  --fastqs="${FASTQ_DIR}" \
  --sample="${SAMPLE_NAME}" \
  --expect-cells=3000 # Optional: Expected number of cells, adjust as needed

2
MD tags were added to alignments with samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam

Cell Ranger v3.0.0 GitHub
$ Bash example
```
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam
```
View on GitHub

Reads were split based on the CB:Z tag, resulting in one BAM file per barcode.

SplitBamByCell (Inferred with models/gemini-2.5-flash) v2.5.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install drop-seq-tools if not already installed
# conda install -c bioconda drop-seq-tools
# Or download the JAR file:
# wget https://github.com/broadinstitute/Drop-seq/releases/download/v2.5.0/drop-seq-2.5.0.jar

# Define input and output
INPUT_BAM="input.bam"
OUTPUT_DIR="split_bams"
DROPSEQ_TOOLS_JAR="drop-seq-2.5.0.jar" # Adjust path if installed via conda or different location

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run SplitBamByCell to split reads based on the CB:Z tag
# The tool will create one BAM file per unique CB:Z barcode in the specified output directory.
# The TAG parameter specifies the tag to use for splitting.
java -jar "${DROPSEQ_TOOLS_JAR}" SplitBamByCell \
    I="${INPUT_BAM}" \
    O="${OUTPUT_DIR}" \
    TAG="CB" \
    VALIDATION_STRINGENCY=SILENT

Edits were called on each "split-BAM" file using SAILOR (http://github.com/yeolab/sailor) using default parameters and dbSNP (v147) to remove known SNPs.

SAILOR vNot specified, inferred from GitHub repository (latest commit at time of execution) GitHub

$ Bash example

# SAILOR is a Python script. Clone the repository.
# git clone http://github.com/yeolab/sailor
# cd sailor
# No specific installation steps beyond cloning and ensuring Python is available.

# Define the path to the SAILOR script (adjust as necessary)
SAILOR_SCRIPT="/path/to/sailor/sailor.py" # e.g., if cloned to /opt/sailor

# Define the path to the dbSNP v147 VCF file.
# This is a placeholder. You would need to download the specific dbSNP v147 VCF for your reference genome (e.g., hg38 or hg19).
# Example for hg38:
DBSNP_VCF="/path/to/dbsnp_147_hg38.vcf.gz" 

# Placeholder for an input split-BAM file. This command would typically be run for each split-BAM file.
INPUT_BAM="your_split.bam"
OUTPUT_BAM="${INPUT_BAM%.bam}_sailor_filtered.bam"

# Execute SAILOR with default parameters and dbSNP v147
python "${SAILOR_SCRIPT}" -i "${INPUT_BAM}" -o "${OUTPUT_BAM}" -v "${DBSNP_VCF}"

View on GitHub

Coverage was generated from filtered BAM files (intermediates from SAILOR) for each cell barcode, using the following commands from BedTools (v2.27.1). :

SAILOR v2.27.1 GitHub

$ Bash example

# Install BedTools (if not already installed)
# conda install -c bioconda bedtools

# Placeholder for genome file (e.g., hg38 chromosome sizes)
# Download from UCSC or similar source if not already available.
# Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes

GENOME_FILE="hg38.chrom.sizes" # Replace with actual path to genome file

# Create a directory for output coverage files
mkdir -p coverage

# Example for a single filtered BAM file from SAILOR for a specific cell barcode
# Replace 'path/to/filtered_bams/cell_barcode_123_filtered.bam' with the actual input file path
INPUT_BAM="path/to/filtered_bams/cell_barcode_123_filtered.bam"
OUTPUT_BEDGRAPH="coverage/cell_barcode_123_coverage.bedgraph"

# Generate coverage in bedGraph format
bedtools genomecov -ibam "${INPUT_BAM}" -bg -g "${GENOME_FILE}" > "${OUTPUT_BEDGRAPH}"

# To process multiple files for each cell barcode, you might use a loop:
# for bam_file in path/to/filtered_bams/*_filtered.bam; do
#     # Extract cell barcode from filename (adjust pattern as needed)
#     cell_barcode=$(basename "${bam_file}" | sed 's/_filtered.bam//')
#     output_bedgraph="coverage/${cell_barcode}_coverage.bedgraph"
#     bedtools genomecov -ibam "${bam_file}" -bg -g "${GENOME_FILE}" > "${output_bedgraph}"
# done

View on GitHub

bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg

Cell Ranger v3.0.0

$ Bash example

# Install bedtools if not already available
# conda install -c bioconda bedtools

# Execute the bedtools genomecov command
bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg

7
bedtools sort -I fwd.bg > fwd.sorted.bg

bedtools v2.31.0 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# conda install -c bioconda bedtools
bedtools sort -I fwd.bg > fwd.sorted.bg
```
View on GitHub

bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw

Cell Ranger v3.0.0

$ Bash example

# Install UCSC tools (e.g., via conda)
# conda install -c bioconda ucsc-bedgraphtobigwig

# Reference genome: hg19 (from refdata-cellranger-hg19-3.0.0)
# Chromosome sizes file: refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt

bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw

bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg

Cell Ranger v3.0.0 GitHub

$ Bash example

# Install bedtools (example using conda)
# conda install -c bioconda bedtools=2.27.1

# Define reference genome path
# The 'refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt' file is expected to be part of a Cell Ranger hg19 reference data bundle.
# If not available, a placeholder for hg19 chromosome lengths can be obtained from UCSC:
# wget -O chrNameLength.txt https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes

# Execute bedtools genomecov to calculate genome coverage in bedGraph format
bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg

View on GitHub

bedtools sort -I rev.bg > rev.sorted.bg

bedtools v2.31.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Sorts the input BED/GFF/VCF file by chromosome and then by start position.
# The '-i' flag specifies the input file.
bedtools sort -i rev.bg > rev.sorted.bg

View on GitHub

bedGraphToBigWig rev.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt rev.sorted.bw

Cell Ranger v3.0.0 GitHub

$ Bash example

# Install UCSC tools (e.g., via conda)
# conda install -c bioconda ucsc-bedgraphtobigwig

# Define variables (adjust paths as needed)
INPUT_BEDGRAPH="rev.sorted.bg"
CHR_SIZES_FILE="refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt"
OUTPUT_BIGWIG="rev.sorted.bw"

# Execute bedGraphToBigWig command
bedGraphToBigWig "${INPUT_BEDGRAPH}" "${CHR_SIZES_FILE}" "${OUTPUT_BIGWIG}"

View on GitHub

For each barcode, edit sites were annotated across exons (Gencode v19) and the cumulative C>T depth across each gene was divided by the cumulative depth across C's as obtained from the previous step.

GENCODE vv19 GitHub

$ Bash example

# Download GENCODE v19 annotation GTF file
# wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz

# The annotation of edit sites across exons using Gencode v19 would typically involve:
# 1. Extracting exon features from the gencode.v19.annotation.gtf file (e.g., using gtf_to_bed or awk).
# 2. Intersecting detected edit sites (e.g., from a VCF or BED file from a previous step) with these exon features using a tool like bedtools.
#    Example (conceptual, assuming 'edit_sites.bed' and 'gencode_v19_exons.bed'):
#    bedtools intersect -a edit_sites.bed -b gencode_v19_exons.bed -wa -wb > annotated_edit_sites.bed

# The subsequent calculation step: "cumulative C>T depth across each gene was divided by the cumulative depth across C's"
# This is a custom calculation, likely performed with a scripting language (e.g., Python, R, or awk).
# It is assumed that a previous step has generated a file (e.g., 'gene_depths.tsv') containing:
# 1. Gene ID
# 2. Cumulative C>T depth for that gene
# 3. Cumulative C depth for that gene

# Example input file 'gene_depths.tsv':
# gene_id\tC_to_T_depth\tC_depth
# ENSG00000000003\t150\t1000
# ENSG00000000005\t200\t1200

# Perform the division: (cumulative C>T depth) / (cumulative C depth)
# This example uses awk, assuming the input file 'gene_depths.tsv' has columns for gene_id, C_to_T_depth, and C_depth.
# The output will be gene_id and the calculated ratio.
awk 'BEGIN {OFS="\t"} NR==1 {print $1, "C_to_T_ratio"; next} {if ($3 > 0) print $1, $2/$3; else print $1, 0}' gene_depths.tsv > gene_C_to_T_ratios.tsv

View on GitHub

Tools Used

SAILOR

Raw Source Text

Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 3.0.2)
MD tags were added to alignments with samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam
Reads were split based on the CB:Z tag, resulting in one BAM file per barcode.
Edits were called on each "split-BAM" file using SAILOR (http://github.com/yeolab/sailor) using default parameters and dbSNP (v147) to remove known SNPs.
Coverage was generated from filtered BAM files (intermediates from SAILOR) for each cell barcode, using the following commands from BedTools (v2.27.1). :
bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg
bedtools sort -I fwd.bg > fwd.sorted.bg
bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw
bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg
bedtools sort -I rev.bg > rev.sorted.bg
bedGraphToBigWig rev.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt rev.sorted.bw
For each barcode, edit sites were annotated across exons (Gencode v19) and the cumulative C>T depth across each gene was divided by the cumulative depth across C's as obtained from the previous step.
Genome_build: hg19
Supplementary_files_format_and_content: Edit fractions across barcodes with detectable edits were merged to obtain processed STAMP_edits.txt files for each sample.

← Back to Analysis