GSE142307 Processing Pipeline — Yeo Lab Publications

Publication

An in vivo genome-wide CRISPR screen identifies the RNA-binding protein Staufen2 as a key regulator of myeloid leukemia.

Nature Cancer (2020), PMID 34109316

Dataset

GSE142307

Effect of Stau2 knockdown on H3K4 methylation in human bcCML cells (K562)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1. Bowtie2: alignment tool

Bowtie2 v2.4.5

$ Bash example

# Install Bowtie2 and Samtools (if not already installed)
# conda install -c bioconda bowtie2=2.4.5 samtools=1.17 # Samtools version for compatibility

# Define variables
GENOME_INDEX="/path/to/genome/index/hg38" # Placeholder for hg38 genome index (e.g., from ENCODE)
INPUT_FASTQ="input.fastq.gz" # Single-end FASTQ file
OUTPUT_BAM="aligned_reads.bam"
THREADS=8 # Number of threads to use
SAMPLE_ID="sample_1" # Unique sample identifier
SAMPLE_NAME="MySample" # Sample name
LIBRARY_NAME="ChIP_Library" # Library name (this dataset is histone ChIP-seq, not eCLIP)
FLOWCELL_LANE="FCID_Lane1" # Flowcell ID and lane

# Align reads with Bowtie2 and pipe to samtools for BAM conversion
# -x: Path to the genome index basename
# -U: Path to the single-end FASTQ file
# -p: Number of threads
# --rg-id, --rg: Read group information for SAM/BAM header
# No -S option: Bowtie2 writes SAM to stdout by default, which is piped to samtools
# samtools view -b: Convert the SAM stream to BAM (the old -S input flag is no longer needed)
bowtie2 -x "${GENOME_INDEX}" \
        -U "${INPUT_FASTQ}" \
        -p "${THREADS}" \
        --rg-id "${SAMPLE_ID}" \
        --rg "SM:${SAMPLE_NAME}" \
        --rg "LB:${LIBRARY_NAME}" \
        --rg "PL:ILLUMINA" \
        --rg "PU:${FLOWCELL_LANE}" \
    | samtools view -b -o "${OUTPUT_BAM}" -
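
The PU read-group value above is a placeholder. As a minimal sketch, and only under the assumption that the FASTQ headers follow the Illumina layout `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`, the flowcell and lane could be read from the first header line. The helper name `platform_unit_from_header` and the demo read are hypothetical and not part of the original pipeline; files downloaded from SRA often carry `@SRR...` headers instead, in which case the helper falls back to `unknown`.

```shell
set -eu

# Hypothetical helper: read a FASTQ stream on stdin and print "<flowcell>.<lane>"
# taken from the first header line. Prints "unknown" if the header does not have
# at least the seven colon-separated fields of an Illumina-style header.
platform_unit_from_header() {
    awk -F: 'NR == 1 {
        if (NF >= 7) { print $3 "." $4 } else { print "unknown" }
        exit
    }'
}

# Demo on a single made-up read (not real data)
DEMO_FASTQ="$(mktemp)"
printf '@A00123:45:HXXXXXXX:2:1101:1000:2000 1:N:0:ACGT\nACGT\n+\nIIII\n' > "${DEMO_FASTQ}"

FLOWCELL_LANE="$(platform_unit_from_header < "${DEMO_FASTQ}")"
echo "PU:${FLOWCELL_LANE}"   # prints PU:HXXXXXXX.2

rm -f "${DEMO_FASTQ}"
```

For a gzipped input, the same helper could be fed with `gzip -cd "${INPUT_FASTQ}" | head -n 1 | platform_unit_from_header`.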

2. MACS2: peak calling

MACS2 v2.2.7.1

$ Bash example

# Install MACS2 (if not already installed)
# conda install -c bioconda macs2

# Define input files and parameters
TREATMENT_BAM="treatment.sorted.bam" # Path to the treatment BAM file (e.g., ChIP-seq IP sample)
CONTROL_BAM="control.sorted.bam"   # Path to the control BAM file (e.g., Input or IgG control)
GENOME_SIZE="hs"                     # Effective genome size. Shortcuts: 'hs' for human, 'mm' for mouse, 'ce' for C. elegans, 'dm' for D. melanogaster.
                                     # For other genomes, provide the estimated mappable genome size in base pairs; 'hs' itself stands for 2.7e9.
OUTPUT_PREFIX="my_chip_peaks"        # Prefix for all output files (e.g., my_chip_peaks_peaks.narrowPeak)
OUTPUT_DIR="macs2_output"            # Directory where all output files will be saved
Q_VALUE_CUTOFF="0.01"                # FDR cutoff for peak calling. Common values are 0.01 (1%) or 0.05 (5%).

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Execute MACS2 peak calling
# -t: Treatment file (ChIP-seq IP)
# -c: Control file (Input or IgG)
# -f: Format of input files (BAM for single-end BAM, BAMPE for paired-end BAM).
#     BAM is used here because the alignment step above takes single-end reads (-U).
# -g: Effective genome size
# -n: Name of the experiment, used as prefix for output files
# --outdir: Output directory
# -q: Q-value (FDR) cutoff for peak detection
# --keep-dup all: Keep all reads at the same genomic location (the MACS2 default is 1, i.e. one read per location)
# --verbose 2: Set verbose level to 2 for more detailed logging
macs2 callpeak \
    -t "${TREATMENT_BAM}" \
    -c "${CONTROL_BAM}" \
    -f BAM \
    -g "${GENOME_SIZE}" \
    -n "${OUTPUT_PREFIX}" \
    --outdir "${OUTPUT_DIR}" \
    -q "${Q_VALUE_CUTOFF}" \
    --keep-dup all \
    --verbose 2
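
Once peak calling has finished, the narrowPeak output can be summarized or filtered more strictly than the `-q` cutoff used above. The sketch below is not part of the original pipeline: the helper name `summarize_narrowpeak` and the three demo peaks are made up for illustration. It relies on the narrowPeak convention that column 9 holds -log10(q-value), so a threshold of 3 corresponds to q <= 0.001.

```shell
set -eu

# Hypothetical helper: print "<peak count><TAB><mean peak width>" for peaks whose
# -log10(q-value) (narrowPeak column 9) is at least the given threshold.
# narrowPeak columns: chrom start end name score strand signalValue pValue qValue summitOffset
summarize_narrowpeak() {
    peak_file="$1"
    min_neglog10_q="$2"
    awk -v minq="${min_neglog10_q}" 'BEGIN { FS = OFS = "\t" }
        $9 >= minq { n++; total += $3 - $2 }
        END { printf "%d\t%.1f\n", n, (n ? total / n : 0) }' "${peak_file}"
}

# Demo on three made-up peaks (not real results)
DEMO_PEAKS="$(mktemp)"
{
    printf 'chr1\t100\t300\tp1\t50\t.\t5.0\t4.0\t2.5\t100\n'
    printf 'chr1\t1000\t1400\tp2\t80\t.\t8.0\t6.0\t4.0\t150\n'
    printf 'chr2\t500\t600\tp3\t10\t.\t2.0\t1.5\t1.0\t40\n'
} > "${DEMO_PEAKS}"

# Only p2 has -log10(q) >= 3, and it is 400 bp wide
PEAK_SUMMARY="$(summarize_narrowpeak "${DEMO_PEAKS}" 3)"
echo "${PEAK_SUMMARY}"   # prints 1<TAB>400.0

rm -f "${DEMO_PEAKS}"
```

With the variables defined earlier, the real file would be `"${OUTPUT_DIR}/${OUTPUT_PREFIX}_peaks.narrowPeak"`.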

Tools Used

Bowtie2