GSE145968 Processing Pipeline

Publication

A CRISPR RNA-binding protein screen reveals regulators of RUNX1 isoform generation.

Blood Advances (2021), PMID 33656539

Dataset

CRISPR/Cas9 screening of RNA binding proteins (RBPs) that regulate RUNX1 isoform production

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Library strategy: Amplicon-seq

QIIME2 v2023.9 (Inferred with models/gemini-2.5-flash)

$ Bash example

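# Caution: this QIIME2 workflow was inferred from the "Amplicon-seq" library strategy alone.
# QIIME2 targets microbial marker-gene surveys (e.g., 16S rRNA) and is unlikely to apply to the
# sgRNA amplicons of this CRISPR screen, which the source text processes with MAGeCK-VISPR.

# Install QIIME2 (example, uncomment if needed; confirm the environment file name in the QIIME 2 install docs)
# wget https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2023.9-py38-linux-conda.yml
# conda env create -n qiime2-amplicon-2023.9 --file qiime2-amplicon-2023.9-py38-linux-conda.yml
# conda activate qiime2-amplicon-2023.9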

# Placeholder for input data (e.g., demultiplexed fastq files in 'raw_data' directory)
# Example: raw_data/sample1_R1.fastq.gz, raw_data/sample1_R2.fastq.gz

# 1. Import data into QIIME2 artifact
# This example assumes Casava 1.8 paired-end demultiplexed fastq files
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw_data \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

# 2. Quality control and DADA2 denoising
# Adjust --p-trunc-len-f and --p-trunc-len-r based on quality plots (e.g., from qiime demux summarize)
# --p-trim-left-f and --p-trim-left-r remove primer sequences from the start of reads
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 200 \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

# 3. Taxonomic classification
# Reference database: SILVA 138 SSURef NR99 (V4 region 515F/806R primers used for classifier training)
# Download pre-trained classifier for QIIME2 2023.9:
# wget -O silva-138-99-515-806-nb-classifier.qza "https://data.qiime2.org/2023.9/common/silva-138-99-515-806-nb-classifier.qza"
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

# 4. Export results (optional)
qiime tools export \
  --input-path table.qza \
  --output-path exported_table
qiime tools export \
  --input-path rep-seqs.qza \
  --output-path exported_rep_seqs
qiime tools export \
  --input-path taxonomy.qza \
  --output-path exported_taxonomy

CRISPR screening data were aligned and processed using the published MAGeCK-VISPR pipeline (Li et al., Genome Biology 2015).

MAGeCK v0.5.5 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install MAGeCK (if not already installed)
# conda install -c bioconda mageck

# Define variables (replace with actual paths and names)
SGRNA_LIBRARY="sgrna_library.txt" # sgRNA library file (sgRNA ID, sequence, and gene for each guide)

# One FASTQ file per sample, in the same order as the sample labels below.
# mageck count takes files for different samples separated by spaces;
# commas join technical replicates of a single sample.
CONTROL_FASTQ_FILES=(control_sample_1.fastq.gz control_sample_2.fastq.gz)
TREATMENT_FASTQ_FILES=(treatment_sample_1.fastq.gz treatment_sample_2.fastq.gz)

CONTROL_SAMPLE_LABELS="control1,control2" # Comma-separated labels for control samples
TREATMENT_SAMPLE_LABELS="treat1,treat2" # Comma-separated labels for treatment samples
ALL_SAMPLE_LABELS="${CONTROL_SAMPLE_LABELS},${TREATMENT_SAMPLE_LABELS}"

OUTPUT_COUNT_PREFIX="crispr_screen_counts"
OUTPUT_TEST_PREFIX="crispr_screen_results"

# Step 1: Count sgRNA reads from FASTQ files. This step performs the "alignment" (matching reads to sgRNAs) and counting.
# The output is a count table (e.g., crispr_screen_counts.count.txt).
mageck count \
  -l "${SGRNA_LIBRARY}" \
  --fastq "${CONTROL_FASTQ_FILES[@]}" "${TREATMENT_FASTQ_FILES[@]}" \
  -n "${OUTPUT_COUNT_PREFIX}" \
  --sample-label "${ALL_SAMPLE_LABELS}"

# Step 2: Perform statistical testing for gene enrichment/depletion using the generated count table.
# mageck test reports RRA-based gene rankings; the beta scores mentioned in the
# supplementary files come from mageck mle instead.
mageck test \
  -k "${OUTPUT_COUNT_PREFIX}.count.txt" \
  -t "${TREATMENT_SAMPLE_LABELS}" \
  -c "${CONTROL_SAMPLE_LABELS}" \
  -n "${OUTPUT_TEST_PREFIX}"
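
The supplementary files describe beta scores for GFP high and low populations, which MAGeCK estimates with its mle subcommand and a design matrix. The following is a minimal sketch under stated assumptions: the sample labels (unsorted, gfp_high, gfp_low), the choice of an unsorted baseline sample, and all file names are hypothetical and must be replaced with the labels actually present in the count table. The mageck call runs only if the tool and the count table are available.

```bash
set -euo pipefail

# Hypothetical names; adjust to the real count table and sample labels.
COUNT_TABLE="crispr_screen_counts.count.txt"
DESIGN_MATRIX="designmatrix.txt"
OUTPUT_MLE_PREFIX="crispr_screen_mle"

# Tab-delimited design matrix for mageck mle:
#   column 1 = sample label (must match a count-table column name)
#   column 2 = baseline (always 1)
#   remaining columns = one 0/1 indicator per condition
{
  printf 'Samples\tbaseline\tGFP_high\tGFP_low\n'
  printf 'unsorted\t1\t0\t0\n'
  printf 'gfp_high\t1\t1\t0\n'
  printf 'gfp_low\t1\t0\t1\n'
} > "${DESIGN_MATRIX}"

# Estimate beta scores; results are written to <prefix>.gene_summary.txt.
if command -v mageck >/dev/null 2>&1 && [ -f "${COUNT_TABLE}" ]; then
  mageck mle -k "${COUNT_TABLE}" -d "${DESIGN_MATRIX}" -n "${OUTPUT_MLE_PREFIX}"
else
  echo "mageck or ${COUNT_TABLE} not found; wrote ${DESIGN_MATRIX} only" >&2
fi
```

Each non-baseline column yields one beta score per gene, so this layout would give separate scores for the GFP high and GFP low populations relative to the assumed baseline.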

For 3'READS data, reads were trimmed using Cutadapt then aligned to hg19 using bowtie2.

Bowtie2 v2.5.1 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Placeholder for reference genome index and input files
# Ensure the hg19 index is built or downloaded. For example:
# bowtie2-build hg19.fa hg19_index

REF_GENOME_INDEX="/path/to/hg19_index"
TRIMMED_READS="trimmed_reads.fastq.gz"
OUTPUT_SAM="aligned_reads.sam"

# Align trimmed reads to hg19 using Bowtie2
bowtie2 -x "${REF_GENOME_INDEX}" -U "${TRIMMED_READS}" -S "${OUTPUT_SAM}"
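
The step description also mentions Cutadapt trimming before alignment, and the bigWig step expects a BAM file rather than the SAM file written above. The sketch below shows one plausible order of operations (trim, align, sort, index). It is an illustration under assumptions, not the authors' exact commands: the adapter sequence, minimum length, and file names are placeholders that the source does not specify. Commands are only printed unless DRY_RUN=0 is set.

```bash
set -euo pipefail

# Placeholders (assumptions), not values from the study.
RAW_READS="raw_reads.fastq.gz"
TRIMMED_READS="trimmed_reads.fastq.gz"
ADAPTER_3P="REPLACE_WITH_ADAPTER"   # 3' adapter sequence; not given in the source
MIN_LENGTH=20                        # assumed minimum read length kept after trimming
REF_GENOME_INDEX="/path/to/hg19_index"
OUTPUT_SAM="aligned_reads.sam"
OUTPUT_BAM="aligned.bam"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

trim_align_sort() {
  # Remove the 3' adapter and drop reads shorter than MIN_LENGTH.
  run cutadapt -a "${ADAPTER_3P}" -m "${MIN_LENGTH}" -o "${TRIMMED_READS}" "${RAW_READS}"
  # Align the trimmed single-end reads to hg19.
  run bowtie2 -x "${REF_GENOME_INDEX}" -U "${TRIMMED_READS}" -S "${OUTPUT_SAM}"
  # Convert to a coordinate-sorted, indexed BAM for coverage tools.
  run samtools sort -o "${OUTPUT_BAM}" "${OUTPUT_SAM}"
  run samtools index "${OUTPUT_BAM}"
}

trim_align_sort
```

Running with DRY_RUN=0 executes the four commands in sequence once the placeholders have been replaced with real values.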

Following alignment, bigwig files were generated to visualize the peaks surrounding the gene-of-interest in the study.

UCSC tools v377

$ Bash example

# Install UCSC tools and bedtools if not already installed
# conda install -c bioconda ucsc-bedgraphtobigwig bedtools samtools

# Define input and output files
# 'aligned.bam' is assumed to be a coordinate-sorted BAM made from the alignment
# step's SAM output (e.g., with samtools sort).
INPUT_BAM="aligned.bam"
OUTPUT_BIGWIG="coverage.bw"

# Define the path to the chromosome sizes file for the reference genome.
# The reads were aligned to hg19, so the hg19 chrom.sizes file is used here:
# wget -O hg19.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
CHROM_SIZES="hg19.chrom.sizes"

# 1. Generate bedGraph from BAM using bedtools genomecov (legacy name: genomeCoverageBed).
# The -ibam option specifies the input BAM file, which must be sorted by position.
# The -bg option outputs a bedGraph file.
# No -g genome file is needed with -ibam; chromosome sizes are read from the BAM header.
bedtools genomecov -ibam "${INPUT_BAM}" -bg > "${INPUT_BAM%.bam}.bedGraph"

# 2. Sort the bedGraph file by chromosome and then by start position.
# bedGraphToBigWig requires this order; LC_ALL=C forces a byte-wise chromosome sort.
LC_ALL=C sort -k1,1 -k2,2n "${INPUT_BAM%.bam}.bedGraph" > "${INPUT_BAM%.bam}.sorted.bedGraph"

# 3. Convert the sorted bedGraph to bigWig using bedGraphToBigWig (conda package: ucsc-bedgraphtobigwig).
# The first argument is the sorted bedGraph file.
# The second argument is the chrom.sizes file.
# The third argument is the output bigWig file.
bedGraphToBigWig "${INPUT_BAM%.bam}.sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_BIGWIG}"

# Clean up intermediate files (optional)
# rm "${INPUT_BAM%.bam}.bedGraph" "${INPUT_BAM%.bam}.sorted.bedGraph"

Reads from the four peaks in this single gene were quantified using multiBigwigSummary from deepTools 2.0.

multiBigwigSummary (deepTools) v2.0

$ Bash example

# Install deepTools if not already installed
# conda install -c bioconda deeptools

# Define input bigWig files, one per sample (placeholder names; the number of samples is not stated).
# These would be the coverage tracks generated from the aligned reads in the previous step.
BIGWIG_FILE_1="sample_1.bw"
BIGWIG_FILE_2="sample_2.bw"
BIGWIG_FILE_3="sample_3.bw"
BIGWIG_FILE_4="sample_4.bw"

# Define the BED file containing the genomic regions for the "four peaks in this single gene".
# This file specifies the exact coordinates to quantify (one line per peak).
GENE_PEAKS_BED="gene_of_interest_peaks.bed"

# Define output file names
OUTPUT_NPZ="quantified_peak_data.npz"
OUTPUT_RAW_COUNTS="quantified_peak_scores.tsv"

# Summarize each bigWig file over the regions defined in the BED file.
# The BED-file subcommand restricts the summary to those regions.
# The --outRawCounts option writes the average score per region and sample to a tab-separated file.
multiBigwigSummary BED-file \
    --bwfiles "${BIGWIG_FILE_1}" "${BIGWIG_FILE_2}" "${BIGWIG_FILE_3}" "${BIGWIG_FILE_4}" \
    --labels "Sample1" "Sample2" "Sample3" "Sample4" \
    --BED "${GENE_PEAKS_BED}" \
    --outFileName "${OUTPUT_NPZ}" \
    --outRawCounts "${OUTPUT_RAW_COUNTS}"
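
The BED file passed to multiBigwigSummary needs one line per peak. The sketch below writes such a file and checks that it is well formed. The coordinates are placeholders chosen purely for illustration and are not the peak positions from the study; only the chr21 naming is grounded, since RUNX1 lies on chromosome 21 and UCSC hg19 uses "chr"-prefixed names.

```bash
set -euo pipefail

GENE_PEAKS_BED="gene_of_interest_peaks.bed"

# PLACEHOLDER coordinates (0-based start, end-exclusive); replace with the real peak intervals.
{
  printf 'chr21\t1000000\t1000200\tpeak1\n'
  printf 'chr21\t1005000\t1005200\tpeak2\n'
  printf 'chr21\t1010000\t1010200\tpeak3\n'
  printf 'chr21\t1015000\t1015200\tpeak4\n'
} > "${GENE_PEAKS_BED}"

# Sanity check: at least 3 tab-separated columns, integer coordinates, start < end.
awk -F'\t' '
  NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || ($2 + 0) >= ($3 + 0) {
    bad = 1
    print "invalid BED line " NR ": " $0 > "/dev/stderr"
  }
  END { exit bad }
' "${GENE_PEAKS_BED}"

echo "validated $(wc -l < "${GENE_PEAKS_BED}") regions in ${GENE_PEAKS_BED}"
```

The chromosome names in this file must match those in the bigWig files, otherwise the regions will be reported without signal.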

Tools Used

MAGeCK-VISPR, Cutadapt, Bowtie2, UCSC tools, deepTools

Raw Source Text

Library strategy: Amplicon-seq
CRISPR screening data were aligned and processed using the published MAGeCK-VISPR pipeline (Li et al. Genome Biology 2015).
For 3'READS data, reads were trimmed using Cutadapt then aligned to hg19 using bowtie2.
Following alignment, bigwig files were generated to visualize the peaks surrounding the gene-of-interest in the study. Reads from the four peaks in this single gene were quantified using multiBigwigSummary from DeepTools2.0
Genome_build: hg19
Supplementary_files_format_and_content: 'count' text file includes sgRNA counts per sample
Supplementary_files_format_and_content: 'summary' text file contains beta scores of enrichment in GFP high and low populations
Supplementary_files_format_and_content: 'countsummary' text file contains QC metrics
Supplementary_files_format_and_content: bigwig files show visualization of 3'READS peaks across the genome
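
A count file of the kind described above can be spot-checked with a few lines of awk. This is a sketch assuming the usual MAGeCK count-table layout (tab-separated, with sgRNA and Gene columns followed by one column per sample). The toy table, its sample labels, and the output file name are made up for illustration and are not data from the study.

```bash
set -euo pipefail

# Toy count table (hypothetical values) in the assumed MAGeCK layout.
TOY_COUNTS="toy.count.txt"
QC_SUMMARY="toy.count_qc.tsv"
{
  printf 'sgRNA\tGene\tgfp_high\tgfp_low\n'
  printf 's1\tGENE_A\t10\t0\n'
  printf 's2\tGENE_A\t5\t7\n'
  printf 's3\tGENE_B\t0\t0\n'
} > "${TOY_COUNTS}"

# Per sample: total read count and number of sgRNAs with zero reads.
awk -F'\t' '
  NR == 1 { ncol = NF; for (i = 3; i <= ncol; i++) name[i] = $i; next }
  { for (i = 3; i <= ncol; i++) { total[i] += $i; if ($i + 0 == 0) zero[i]++ } }
  END {
    printf "sample\ttotal_reads\tzero_count_sgRNAs\n"
    for (i = 3; i <= ncol; i++) printf "%s\t%d\t%d\n", name[i], total[i], zero[i]
  }
' "${TOY_COUNTS}" > "${QC_SUMMARY}"

cat "${QC_SUMMARY}"
```

Totals and zero-count guides are only two of the metrics a countsummary file would typically report, so this check complements that file rather than replacing it.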