GSE145968 Processing Pipeline
6 steps
Publication
A CRISPR RNA-binding protein screen reveals regulators of RUNX1 isoform generation. Blood Advances (2021), PMID 33656539
Dataset
GSE145968: CRISPR/Cas9 screening of RNA binding proteins (RBPs) that regulate RUNX1 isoform production
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Library strategy: Amplicon-seq
$ Bash example
# NOTE: QIIME2 is a microbiome (e.g. 16S rRNA) amplicon workflow. The amplicons in
# this dataset are sgRNA cassettes from a CRISPR screen, which are counted with
# MAGeCK (step 2), so treat this block as a generic Amplicon-seq example only.

# Install QIIME2 (example, uncomment if needed)
# conda env create -n qiime2-amplicon-2023.9 --file https://data.qiime2.org/distro/amplicon/qiime2-amplicon-2023.9-py38-linux-conda.yml
# conda activate qiime2-amplicon-2023.9

# Placeholder for input data (e.g., demultiplexed fastq files in 'raw_data' directory)
# Example: raw_data/sample1_R1.fastq.gz, raw_data/sample1_R2.fastq.gz

# 1. Import data into QIIME2 artifact
# This example assumes Casava 1.8 paired-end demultiplexed fastq files
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path raw_data \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

# 2. Quality control and DADA2 denoising
# Adjust --p-trunc-len-f and --p-trunc-len-r based on quality plots (e.g., from qiime demux summarize)
# --p-trim-left-f and --p-trim-left-r remove primer sequences from the start of reads
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 200 \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

# 3. Taxonomic classification
# Reference database: SILVA 138 SSURef NR99 (V4 region 515F/806R primers used for classifier training)
# Download pre-trained classifier for QIIME2 2023.9:
# wget -O silva-138-99-515-806-nb-classifier.qza "https://data.qiime2.org/2023.9/common/silva-138-99-515-806-nb-classifier.qza"
qiime feature-classifier classify-sklearn \
  --i-classifier silva-138-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs.qza \
  --o-classification taxonomy.qza

# 4. Export results (optional)
qiime tools export \
  --input-path table.qza \
  --output-path exported_table
qiime tools export \
  --input-path rep-seqs.qza \
  --output-path exported_rep_seqs
qiime tools export \
  --input-path taxonomy.qza \
  --output-path exported_taxonomy
2
CRISPR screening data were aligned and processed using the published MAGeCK-VISPR pipeline (Li et al., Genome Biology 2015).
MAGeCK v0.5.5 (inferred with models/gemini-2.5-flash)
$ Bash example
# Install MAGeCK (if not already installed)
# conda install -c bioconda mageck

# Define variables (replace with actual paths and names)
SGRNA_LIBRARY="sgrna_library.txt"   # sgRNA library file (sgRNA ID, sequence, gene)

# One FASTQ per sample, space-separated, in the same order as the labels below.
# (MAGeCK treats comma-separated FASTQ files as technical replicates of ONE sample.)
CONTROL_FASTQ_FILES="control_sample_1.fastq.gz control_sample_2.fastq.gz"
TREATMENT_FASTQ_FILES="treatment_sample_1.fastq.gz treatment_sample_2.fastq.gz"

CONTROL_SAMPLE_LABELS="control1,control2"     # Comma-separated labels for control samples
TREATMENT_SAMPLE_LABELS="treat1,treat2"       # Comma-separated labels for treatment samples
ALL_SAMPLE_LABELS="${CONTROL_SAMPLE_LABELS},${TREATMENT_SAMPLE_LABELS}"

OUTPUT_COUNT_PREFIX="crispr_screen_counts"
OUTPUT_TEST_PREFIX="crispr_screen_results"

# Step 1: Count sgRNA reads from FASTQ files. This step performs the "alignment"
# (matching reads to sgRNAs) and counting.
# The output is a count table (e.g., crispr_screen_counts.count.txt).
# The FASTQ variables are left unquoted on purpose so each file is a separate argument.
mageck count \
  -l "${SGRNA_LIBRARY}" \
  -n "${OUTPUT_COUNT_PREFIX}" \
  --sample-label "${ALL_SAMPLE_LABELS}" \
  --fastq ${CONTROL_FASTQ_FILES} ${TREATMENT_FASTQ_FILES}

# Step 2: Perform statistical testing for gene enrichment/depletion using the generated count table.
# -t and -c take the treatment and control sample labels from the count table.
mageck test \
  -k "${OUTPUT_COUNT_PREFIX}.count.txt" \
  -t "${TREATMENT_SAMPLE_LABELS}" \
  -c "${CONTROL_SAMPLE_LABELS}" \
  -n "${OUTPUT_TEST_PREFIX}"
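The dataset record describes its summary file as containing beta scores for the GFP high and GFP low populations. In MAGeCK, beta scores come from `mageck mle` rather than `mageck test`, so a run along the following lines may be closer to what was deposited. This is only a sketch: the sample labels (`unsorted`, `gfp_high`, `gfp_low`), the file names, and the design itself are hypothetical and are not taken from the record.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names: adjust to the labels used in your own count table.
COUNT_TABLE="crispr_screen_counts.count.txt"
DESIGN_MATRIX="design_matrix.txt"
OUTPUT_MLE_PREFIX="crispr_screen_mle"

# Design matrix (tab-separated): one row per sample label, a baseline column
# of 1s, and one 0/1 column per condition to be modelled.
printf 'Samples\tbaseline\tGFP_high\tGFP_low\n'  > "${DESIGN_MATRIX}"
printf 'unsorted\t1\t0\t0\n'                    >> "${DESIGN_MATRIX}"
printf 'gfp_high\t1\t1\t0\n'                    >> "${DESIGN_MATRIX}"
printf 'gfp_low\t1\t0\t1\n'                     >> "${DESIGN_MATRIX}"

# Run the model only when MAGeCK and the count table are both available.
if command -v mageck >/dev/null 2>&1 && [ -f "${COUNT_TABLE}" ]; then
  mageck mle -k "${COUNT_TABLE}" -d "${DESIGN_MATRIX}" -n "${OUTPUT_MLE_PREFIX}"
else
  echo "Skipping mageck mle: mageck or ${COUNT_TABLE} not found" >&2
fi
```

With this design, the gene summary written by `mageck mle` would carry one beta score column per condition (here `GFP_high|beta` and `GFP_low|beta`).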
3
Citation for the MAGeCK-VISPR pipeline used in step 2 (Li et al., Genome Biology 2015).
$ Bash example
# No command applies here: this entry is the citation that completes the step 2
# description, not a separate processing step.
4
For 3'READS data, reads were trimmed using Cutadapt then aligned to hg19 using bowtie2.
$ Bash example
# Install Bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Placeholder for reference genome index and input files
# Ensure the hg19 index is built or downloaded. For example:
# bowtie2-build hg19.fa hg19_index
REF_GENOME_INDEX="/path/to/hg19_index"
TRIMMED_READS="trimmed_reads.fastq.gz"
OUTPUT_SAM="aligned_reads.sam"

# Align trimmed reads to hg19 using Bowtie2
bowtie2 -x "${REF_GENOME_INDEX}" -U "${TRIMMED_READS}" -S "${OUTPUT_SAM}"
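The step description also mentions Cutadapt trimming before alignment, and the following step expects a BAM file rather than the SAM produced here. A minimal sketch of those two surrounding pieces is given below. The adapter sequence, minimum length, and file names are assumptions (the record does not state them), and the commands are only printed unless `DRY_RUN=0` is set.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Set DRY_RUN=0 to execute; by default the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholders (assumptions, not taken from the dataset record)
RAW_READS="raw_reads.fastq.gz"
TRIMMED_READS="trimmed_reads.fastq.gz"
ADAPTER_3PRIME="AGATCGGAAGAGC"   # common Illumina adapter prefix; replace with the library's adapter
MIN_LENGTH=20                    # discard reads shorter than this after trimming
SAM="aligned_reads.sam"
BAM="aligned.bam"

# 1. Trim the 3' adapter and drop very short reads
run cutadapt -a "${ADAPTER_3PRIME}" -m "${MIN_LENGTH}" -o "${TRIMMED_READS}" "${RAW_READS}"

# 2. (Run the bowtie2 alignment on TRIMMED_READS here to produce SAM)

# 3. Coordinate-sort and index the alignments for coverage tools
run samtools sort -o "${BAM}" "${SAM}"
run samtools index "${BAM}"
```

3'READS libraries may need additional protocol-specific trimming that is not shown here; the record only states that Cutadapt was used.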
5
Following alignment, bigwig files were generated to visualize the peaks surrounding the gene-of-interest in the study.
$ Bash example
# Install UCSC tools and bedtools if not already installed
# conda install -c bioconda ucsc-bedgraphtobigwig bedtools samtools

# Define input and output files
# 'aligned.bam' is assumed to be the coordinate-sorted BAM from the alignment step.
INPUT_BAM="aligned.bam"
OUTPUT_BIGWIG="coverage.bw"

# Chromosome sizes file for the reference genome (hg19, matching the alignment step).
# wget -O hg19.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
CHROM_SIZES="hg19.chrom.sizes"

# 1. Generate bedGraph from BAM using bedtools genomecov.
# -ibam: input BAM (must be position-sorted); chromosome sizes are read from the BAM header.
# -bg: report coverage in bedGraph format.
bedtools genomecov -ibam "${INPUT_BAM}" -bg > "${INPUT_BAM%.bam}.bedGraph"

# 2. Sort the bedGraph file by chromosome and then by start position.
# bedGraphToBigWig requires this order; LC_ALL=C gives byte-wise chromosome sorting.
LC_ALL=C sort -k1,1 -k2,2n "${INPUT_BAM%.bam}.bedGraph" > "${INPUT_BAM%.bam}.sorted.bedGraph"

# 3. Convert the sorted bedGraph to bigWig.
# Arguments: sorted bedGraph file, chrom.sizes file, output bigWig file.
bedGraphToBigWig "${INPUT_BAM%.bam}.sorted.bedGraph" "${CHROM_SIZES}" "${OUTPUT_BIGWIG}"

# Clean up intermediate files (optional)
# rm "${INPUT_BAM%.bam}.bedGraph" "${INPUT_BAM%.bam}.sorted.bedGraph"
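Because the quantification step uses deepTools, the bigWig files could also have been produced in one step with deepTools' `bamCoverage`. The record does not say which tool was used, so the sketch below is an alternative under that assumption; the bin size, thread count, and file names are placeholders, and the command is only printed unless `DRY_RUN=0` is set.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Set DRY_RUN=0 to execute; by default the command is only printed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "${DRY_RUN}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Placeholders (assumptions, not taken from the dataset record)
INPUT_BAM="aligned.bam"      # must be coordinate-sorted and indexed
OUTPUT_BIGWIG="coverage.bw"
BIN_SIZE=1                   # single-base resolution around the peaks
THREADS=4

# Compute per-bin read coverage and write it directly as bigWig
run bamCoverage -b "${INPUT_BAM}" -o "${OUTPUT_BIGWIG}" --binSize "${BIN_SIZE}" -p "${THREADS}"
```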
6
Reads from the four peaks in this single gene were quantified using multiBigwigSummary from DeepTools2.0
$ Bash example
# Install deepTools if not already installed
# conda install -c bioconda deeptools

# Input bigWig files, one per sample (generated upstream from the aligned reads).
# The four peaks are rows in the BED file below, not separate bigWig files.
BIGWIG_FILE_1="sample1_signal.bw"
BIGWIG_FILE_2="sample2_signal.bw"
BIGWIG_FILE_3="sample3_signal.bw"
BIGWIG_FILE_4="sample4_signal.bw"

# BED file with the coordinates of the four peaks in the gene of interest.
GENE_PEAKS_BED="gene_of_interest_peaks.bed"

# Define output file names
OUTPUT_NPZ="quantified_peak_data.npz"
OUTPUT_RAW_COUNTS="quantified_peak_raw_counts.tsv"

# Summarize the bigWig signal over the regions defined in the BED file.
# multiBigwigSummary reports the average score per region for each bigWig;
# --outRawCounts writes those values to a tab-separated file.
multiBigwigSummary BED-file \
  --bwfiles "${BIGWIG_FILE_1}" "${BIGWIG_FILE_2}" "${BIGWIG_FILE_3}" "${BIGWIG_FILE_4}" \
  --labels "Sample1" "Sample2" "Sample3" "Sample4" \
  --BED "${GENE_PEAKS_BED}" \
  --outFileName "${OUTPUT_NPZ}" \
  --outRawCounts "${OUTPUT_RAW_COUNTS}"
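Once the per-peak values are in a table, one plausible follow-up is to express each peak as a share of the total signal across the four peaks within each sample. The record does not say whether the authors normalized this way, so the sketch below is illustrative only. The input table is a made-up placeholder in the `--outRawCounts` layout (chr, start, end, then one column per sample), and its coordinates and values are not real data.

```bash
#!/usr/bin/env bash
set -euo pipefail

RAW_COUNTS="example_raw_counts.tsv"     # hypothetical file name
PEAK_FRACTIONS="peak_fractions.tsv"     # hypothetical file name

# Placeholder table in the multiBigwigSummary --outRawCounts layout. Not real data.
printf "#'chr'\t'start'\t'end'\t'Sample1'\t'Sample2'\n"  > "${RAW_COUNTS}"
printf 'chrN\t100\t200\t10\t30\n'                       >> "${RAW_COUNTS}"
printf 'chrN\t300\t400\t20\t30\n'                       >> "${RAW_COUNTS}"
printf 'chrN\t500\t600\t30\t20\n'                       >> "${RAW_COUNTS}"
printf 'chrN\t700\t800\t40\t20\n'                       >> "${RAW_COUNTS}"

# Two passes over the same file: the first sums each sample column,
# the second divides every value by its column total.
awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) for (i = 4; i <= NF; i++) total[i] += $i; next }
  FNR == 1  { print; next }
  {
    for (i = 4; i <= NF; i++)
      $i = (total[i] > 0) ? sprintf("%.3f", $i / total[i]) : "NA"
    print
  }
' "${RAW_COUNTS}" "${RAW_COUNTS}" > "${PEAK_FRACTIONS}"

cat "${PEAK_FRACTIONS}"
```

To use it on real output, point `RAW_COUNTS` at the file written by `--outRawCounts` and drop the `printf` lines that build the placeholder table.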
Tools Used
Raw Source Text
Library strategy: Amplicon-seq
CRISPR screening data were aligned and processed using the published MAGeCK-VISPR pipeline (Li et al. Genome Biology 2015).
For 3'READS data, reads were trimmed using Cutadapt then aligned to hg19 using bowtie2.
Following alignment, bigwig files were generated to visualize the peaks surrounding the gene-of-interest in the study.
Reads from the four peaks in this single gene were quantified using multiBigwigSummary from DeepTools2.0
Genome_build: hg19
Supplementary_files_format_and_content: 'count' text file includes sgRNA counts per sample
Supplementary_files_format_and_content: 'summary' text file contains beta scores of enrichment in GFP high and low populations
Supplementary_files_format_and_content: 'countsummary' text file contains QC metrics
Supplementary_files_format_and_content: bigwig files show visualization of 3'READS peaks across the genome