GSE155726 Processing Pipeline
Publication
Robust single-cell discovery of RNA targets of RNA-binding proteins and ribosomes.Nature methods (2021) — PMID 33963355
Dataset
GSE155726Robust single-cell discovery of RNA targets of RNA binding proteins and ribosomes [scRNA-seq]
Processing Steps
Generate Jupyter Notebook-
1
Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 3.0.2)
Cell Ranger v3.0.2$ Bash example
# Cell Ranger is typically installed by downloading the tarball from 10x Genomics and adding the executable to your PATH. # For version 3.0.2, you would download the specific release from the 10x Genomics support site. # Example (replace with actual download link and version): # wget https://cf.10xgenomics.com/releases/cell-ranger/cellranger-3.0.2.tar.gz # tar -xzf cellranger-3.0.2.tar.gz # export PATH=/path/to/cellranger-3.0.2:$PATH # Prepare reference transcriptome (example for human GRCh38) # Download pre-built reference from 10x Genomics or build your own. # Example download (replace with actual version if needed): # wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz # tar -xzf refdata-gex-GRCh38-2020-A.tar.gz # REF_TRANSCRIPTOME_PATH="./refdata-gex-GRCh38-2020-A" # Define variables for the run RUN_ID="my_cellranger_count_run" FASTQ_DIR="/path/to/your/fastq_files_directory" # Directory containing FASTQ files (e.g., Sample1_S1_L001_R1_001.fastq.gz) SAMPLE_NAME="my_sample" # Sample prefix used in FASTQ filenames (e.g., 'my_sample' for my_sample_S1_L001_R1_001.fastq.gz) TRANSCRIPTOME_PATH="/path/to/your/10x_genomics_reference/refdata-gex-GRCh38-2020-A" # Execute cellranger count cellranger count \ --id="${RUN_ID}" \ --transcriptome="${TRANSCRIPTOME_PATH}" \ --fastqs="${FASTQ_DIR}" \ --sample="${SAMPLE_NAME}" \ --expect-cells=3000 # Optional: Expected number of cells, adjust as needed -
2
MD tags were added to alignments with samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.9 samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam
-
3
Reads were split based on the CB:Z tag, resulting in one BAM file per barcode.
SplitBamByCell (Inferred with models/gemini-2.5-flash) v2.5.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install drop-seq-tools if not already installed # conda install -c bioconda drop-seq-tools # Or download the JAR file: # wget https://github.com/broadinstitute/Drop-seq/releases/download/v2.5.0/drop-seq-2.5.0.jar # Define input and output INPUT_BAM="input.bam" OUTPUT_DIR="split_bams" DROPSEQ_TOOLS_JAR="drop-seq-2.5.0.jar" # Adjust path if installed via conda or different location # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run SplitBamByCell to split reads based on the CB:Z tag # The tool will create one BAM file per unique CB:Z barcode in the specified output directory. # The TAG parameter specifies the tag to use for splitting. java -jar "${DROPSEQ_TOOLS_JAR}" SplitBamByCell \ I="${INPUT_BAM}" \ O="${OUTPUT_DIR}" \ TAG="CB" \ VALIDATION_STRINGENCY=SILENT -
4
Edits were called on each "split-BAM" file using SAILOR (http://github.com/yeolab/sailor) using default parameters and dbSNP (v147) to remove known SNPs.
$ Bash example
# SAILOR is a Python script. Clone the repository. # git clone http://github.com/yeolab/sailor # cd sailor # No specific installation steps beyond cloning and ensuring Python is available. # Define the path to the SAILOR script (adjust as necessary) SAILOR_SCRIPT="/path/to/sailor/sailor.py" # e.g., if cloned to /opt/sailor # Define the path to the dbSNP v147 VCF file. # This is a placeholder. You would need to download the specific dbSNP v147 VCF for your reference genome (e.g., hg38 or hg19). # Example for hg38: DBSNP_VCF="/path/to/dbsnp_147_hg38.vcf.gz" # Placeholder for an input split-BAM file. This command would typically be run for each split-BAM file. INPUT_BAM="your_split.bam" OUTPUT_BAM="${INPUT_BAM%.bam}_sailor_filtered.bam" # Execute SAILOR with default parameters and dbSNP v147 python "${SAILOR_SCRIPT}" -i "${INPUT_BAM}" -o "${OUTPUT_BAM}" -v "${DBSNP_VCF}" -
5
Coverage was generated from filtered BAM files (intermediates from SAILOR) for each cell barcode, using the following commands from BedTools (v2.27.1). :
$ Bash example
# Install BedTools (if not already installed) # conda install -c bioconda bedtools # Placeholder for genome file (e.g., hg38 chromosome sizes) # Download from UCSC or similar source if not already available. # Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes -O hg38.chrom.sizes GENOME_FILE="hg38.chrom.sizes" # Replace with actual path to genome file # Create a directory for output coverage files mkdir -p coverage # Example for a single filtered BAM file from SAILOR for a specific cell barcode # Replace 'path/to/filtered_bams/cell_barcode_123_filtered.bam' with the actual input file path INPUT_BAM="path/to/filtered_bams/cell_barcode_123_filtered.bam" OUTPUT_BEDGRAPH="coverage/cell_barcode_123_coverage.bedgraph" # Generate coverage in bedGraph format bedtools genomecov -ibam "${INPUT_BAM}" -bg -g "${GENOME_FILE}" > "${OUTPUT_BEDGRAPH}" # To process multiple files for each cell barcode, you might use a loop: # for bam_file in path/to/filtered_bams/*_filtered.bam; do # # Extract cell barcode from filename (adjust pattern as needed) # cell_barcode=$(basename "${bam_file}" | sed 's/_filtered.bam//') # output_bedgraph="coverage/${cell_barcode}_coverage.bedgraph" # bedtools genomecov -ibam "${bam_file}" -bg -g "${GENOME_FILE}" > "${output_bedgraph}" # done -
6
bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg
Cell Ranger v3.0.0$ Bash example
# Install bedtools if not already available # conda install -c bioconda bedtools # Execute the bedtools genomecov command bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg
-
7
bedtools sort -I fwd.bg > fwd.sorted.bg
$ Bash example
# conda install -c bioconda bedtools bedtools sort -I fwd.bg > fwd.sorted.bg
-
8
bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw
Cell Ranger v3.0.0$ Bash example
# Install UCSC tools (e.g., via conda) # conda install -c bioconda ucsc-bedgraphtobigwig # Reference genome: hg19 (from refdata-cellranger-hg19-3.0.0) # Chromosome sizes file: refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw
-
9
bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg
$ Bash example
# Install bedtools (example using conda) # conda install -c bioconda bedtools=2.27.1 # Define reference genome path # The 'refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt' file is expected to be part of a Cell Ranger hg19 reference data bundle. # If not available, a placeholder for hg19 chromosome lengths can be obtained from UCSC: # wget -O chrNameLength.txt https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes # Execute bedtools genomecov to calculate genome coverage in bedGraph format bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg
-
10
bedtools sort -I rev.bg > rev.sorted.bg
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Sorts the input BED/GFF/VCF file by chromosome and then by start position. # The '-i' flag specifies the input file. bedtools sort -i rev.bg > rev.sorted.bg
-
11
bedGraphToBigWig rev.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt rev.sorted.bw
$ Bash example
# Install UCSC tools (e.g., via conda) # conda install -c bioconda ucsc-bedgraphtobigwig # Define variables (adjust paths as needed) INPUT_BEDGRAPH="rev.sorted.bg" CHR_SIZES_FILE="refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt" OUTPUT_BIGWIG="rev.sorted.bw" # Execute bedGraphToBigWig command bedGraphToBigWig "${INPUT_BEDGRAPH}" "${CHR_SIZES_FILE}" "${OUTPUT_BIGWIG}" -
12
For each barcode, edit sites were annotated across exons (Gencode v19) and the cumulative C>T depth across each gene was divided by the cumulative depth across C's as obtained from the previous step.
$ Bash example
# Download GENCODE v19 annotation GTF file # wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip gencode.v19.annotation.gtf.gz # The annotation of edit sites across exons using Gencode v19 would typically involve: # 1. Extracting exon features from the gencode.v19.annotation.gtf file (e.g., using gtf_to_bed or awk). # 2. Intersecting detected edit sites (e.g., from a VCF or BED file from a previous step) with these exon features using a tool like bedtools. # Example (conceptual, assuming 'edit_sites.bed' and 'gencode_v19_exons.bed'): # bedtools intersect -a edit_sites.bed -b gencode_v19_exons.bed -wa -wb > annotated_edit_sites.bed # The subsequent calculation step: "cumulative C>T depth across each gene was divided by the cumulative depth across C's" # This is a custom calculation, likely performed with a scripting language (e.g., Python, R, or awk). # It is assumed that a previous step has generated a file (e.g., 'gene_depths.tsv') containing: # 1. Gene ID # 2. Cumulative C>T depth for that gene # 3. Cumulative C depth for that gene # Example input file 'gene_depths.tsv': # gene_id\tC_to_T_depth\tC_depth # ENSG00000000003\t150\t1000 # ENSG00000000005\t200\t1200 # Perform the division: (cumulative C>T depth) / (cumulative C depth) # This example uses awk, assuming the input file 'gene_depths.tsv' has columns for gene_id, C_to_T_depth, and C_depth. # The output will be gene_id and the calculated ratio. awk 'BEGIN {OFS="\t"} NR==1 {print $1, "C_to_T_ratio"; next} {if ($3 > 0) print $1, $2/$3; else print $1, 0}' gene_depths.tsv > gene_C_to_T_ratios.tsv
Tools Used
Raw Source Text
Alignment, cell barcode processing, umis processing, abundance measurements: cellranger count (version 3.0.2) MD tags were added to alignments with samtools calmd --threads 15 -rb possorted_genome_bam.bam refdata-cellranger-hg19-3.0.0/fasta/genome.fa > possorted_genome_bam_MD.bam Reads were split based on the CB:Z tag, resulting in one BAM file per barcode. Edits were called on each "split-BAM" file using SAILOR (http://github.com/yeolab/sailor) using default parameters and dbSNP (v147) to remove known SNPs. Coverage was generated from filtered BAM files (intermediates from SAILOR) for each cell barcode, using the following commands from BedTools (v2.27.1). : bedtools genomecov -split -strand - -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam fwd.bam > fwd.bg bedtools sort -I fwd.bg > fwd.sorted.bg bedGraphToBigWig fwd.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt fwd.sorted.bw bedtools genomecov -split -strand + -g refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt -bg -ibam rev.bam > rev.bg bedtools sort -I rev.bg > rev.sorted.bg bedGraphToBigWig rev.sorted.bg refdata-cellranger-hg19-3.0.0/star/chrNameLength.txt rev.sorted.bw For each barcode, edit sites were annotated across exons (Gencode v19) and the cumulative C>T depth across each gene was divided by the cumulative depth across C's as obtained from the previous step. Genome_build: hg19 Supplementary_files_format_and_content: Edit fractions across barcodes with detectable edits were merged to obtain processed STAMP_edits.txt files for each sample.