GSE149767 Processing Pipeline

OTHER code_examples 14 steps

Publication

Footprinting SHAPE-eCLIP Reveals Transcriptome-wide Hydrogen Bonds at RNA-Protein Interfaces.

Molecular cell (2020) — PMID 33242392

Dataset

Footprinting SHAPE-eCLIP reveals transcriptome-wide hydrogen bonds at RNA-protein interfaces

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

library strategy: footprinting and selective 2â-hydroxyl acylation analyzed by primer extension (fSHAPE)

fSHAPE (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Example workflow for fSHAPE data analysis
# Reference genome: hg38 (placeholder)

# 1. Quality control (optional but recommended)
# fastqc fshape_sample.fastq.gz

# 2. Alignment to reference genome (e.g., using STAR)
# Install STAR: # conda install -c bioconda star
# Build STAR index (if not already done, replace paths with actual locations):
# STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index_hg38 --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/hg38.gtf --runThreadN 8

STAR --genomeDir /path/to/STAR_index_hg38 \
     --readFilesIn fshape_sample.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix fshape_sample_aligned_ \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --alignIntronMax 1 # Assuming no splicing for reactivity mapping, or mapping to a transcript sequence

# 3. Index the aligned BAM file
# samtools index fshape_sample_aligned_Aligned.sortedByCoord.out.bam

# 4. Calculate read depth/coverage at each position (e.g., using samtools)
# This output is crucial for identifying primer extension stops.
# samtools depth -a -o fshape_sample_depth.tsv fshape_sample_aligned_Aligned.sortedByCoord.out.bam

# 5. Custom script for fSHAPE reactivity calculation
# This step is highly specific to the fSHAPE protocol and involves:
# - Identifying primer extension stops from the depth file.
# - Normalizing counts (e.g., against control samples, total reads).
# - Calculating reactivity scores.
# - Often involves custom Python/R scripts.
# python custom_fshape_reactivity_calculator.py \
#     --depth_file fshape_sample_depth.tsv \
#     --output_reactivity fshape_sample_reactivity.tsv \
#     --control_depth_file control_sample_depth.tsv \
#     --genome_fasta /path/to/hg38.fa

View on GitHub

SHAPE libraries as follows:

ShapeMapper (Inferred with models/gemini-2.5-flash) v6.4

$ Bash example

# Install ShapeMapper (part of the RNAstructure suite)
# conda install -c bioconda rnastructure
# # Alternatively, download and compile from source or use pre-compiled binaries:
# # https://rna.urmc.rochester.edu/RNAstructure.html

# Example usage of ShapeMapper for SHAPE-MaP data analysis.
# This command assumes paired-end sequencing data and a reference RNA sequence.
# Replace 'reference.fasta', 'sample_R1.fastq.gz', 'sample_R2.fastq.gz' with actual file paths.
# The output will include reactivity profiles and other analysis files.

# Define input and output files
REFERENCE_SEQUENCE="reference.fasta" # Placeholder: Path to the reference RNA sequence in FASTA format
INPUT_FASTQ_R1="sample_R1.fastq.gz" # Placeholder: Path to the R1 FASTQ file
INPUT_FASTQ_R2="sample_R2.fastq.gz" # Placeholder: Path to the R2 FASTQ file (if paired-end)
OUTPUT_PREFIX="sample_shapemapper_output" # Prefix for output files

# Run ShapeMapper
# Note: ShapeMapper's exact command line interface can vary slightly depending on the version and specific analysis goals.
# This is a common pattern for processing SHAPE-MaP data, including alignment, read counting, and reactivity calculation.
ShapeMapper --sequence "${REFERENCE_SEQUENCE}" \
            --fastq "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" \
            --output-prefix "${OUTPUT_PREFIX}" \
            --min-read-length 20 \
            --max-read-length 200 \
            --min-mapq 20 \
            --num-threads 8

Full code at https://github.com/meracorley/fSHAPE

fSHAPE vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install fSHAPE dependencies (example using conda)
# conda env create -f environment.yaml
# conda activate fshape_env

# Ensure the SHAPE-MaP pipeline (https://github.com/Weeks-Lab/SHAPE-MaP_pipeline) is installed and configured.
# Its path and conda environment name will be specified in config.yaml.

# Create a config.yaml file based on your data and reference files
# Example config.yaml content (replace placeholders with actual paths and values):
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz" # Optional, if paired-end
#     control_R1: "path/to/control1_R1.fastq.gz"
#     control_R2: "path/to/control1_R2.fastq.gz" # Optional, if paired-end
# genome_fasta: "/path/to/reference/genome/hg38.fa" # Placeholder: use your specific genome fasta
# gtf: "/path/to/reference/annotation/hg38.gtf" # Placeholder: use your specific GTF annotation
# shape_mappers_path: "/path/to/SHAPE-MaP_pipeline" # Path to the SHAPE-MaP pipeline scripts
# shape_mappers_env: "shape_mappers_conda_env" # Conda environment name for SHAPE-MaP pipeline

# Execute the fSHAPE Snakemake pipeline
# Ensure you are in the directory containing the Snakefile and config.yaml
snakemake --snakefile Snakefile --configfile config.yaml --cores 8

View on GitHub

zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq

UMI-tools v1.0.1 GitHub

$ Bash example

# Install UMI-tools (e.g., via conda)
# conda install -c bioconda umi_tools

# Extract UMIs from FASTQ file
zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq

View on GitHub

cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq

cutadapt vNot specified in description GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output files (example placeholders)
# file="sample_prefix"

cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq

View on GitHub

STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate

STAR v2.7.0f GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (example placeholders)
# file="sample_id" # Replace with your actual file prefix, e.g., 'SRR1234567'
# genome_dir="/projects/ps-yeolab4/genomes/GRCh38/star/" # Ensure this path is correct and contains STAR index

STAR \
  --runThreadN 8 \
  --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ \
  --readFilesIn "$file".trim.fastq \
  --outFileNamePrefix "$file" \
  --outSAMstrandField intronMotif \
  --outFilterIntronMotifs RemoveNoncanonical \
  --outSAMtype BAM SortedByCoordinate

View on GitHub

umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam

umi-tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi-tools via conda
# conda install -c bioconda umi-tools=1.1.2

# Example usage:
# Assuming 'file' variable is set, e.g., file="sample_id"
umi_tools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam

View on GitHub

Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads.

bedtools v2.31.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Placeholder for genome file (e.g., hg38.chrom.sizes)
# This file should contain chromosome names and their sizes, tab-separated.
# Example:
# chr1    248956422
# chr2    242193529
# ...
# You can generate this from a .fasta file using:
# samtools faidx -i chromsizes genome.fasta > genome.txt
# If reads are mapped to a transcriptome FASTA, this file should contain the sizes of the transcriptome sequences.
GENOME_FILE="genome.txt"

# Placeholder for de-duplicated reads in BED format
# This file should contain the de-duplicated reads, typically after alignment and deduplication.
# Ensure these reads are in the same coordinate system as specified by GENOME_FILE.
DEDUPLICATED_READS_BED="deduplicated_reads.bed"

# --- Step 1: Calculate total coverage for plus strand --- 
# -split: Reports coverage for spliced alignments (e.g., RNA-seq) by splitting reads at 'N' gaps.
# -strand +: Reports coverage only for the plus strand.
# -dz: Reports zero-coverage regions as well.
bedtools genomecov -split -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_plus_coverage.bedgraph

# --- Step 2: Calculate total coverage for minus strand --- 
bedtools genomecov -split -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_minus_coverage.bedgraph

# --- Step 3: Calculate 5' end coverage for plus strand --- 
# -5: Reports coverage only for the 5' most base of each read.
bedtools genomecov -5 -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_plus_coverage.bedgraph

# --- Step 4: Calculate 5' end coverage for minus strand --- 
bedtools genomecov -5 -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_minus_coverage.bedgraph

# --- Step 5: Sum all four coverages at each position --- 
# Use bedtools unionbedg to combine the four bedgraph files, ensuring all positions are considered.
# Then, use awk to sum the coverage values from the four input files for each genomic position.
# The output of unionbedg will have columns: chrom, start, end, coverage_file1, coverage_file2, coverage_file3, coverage_file4.
bedtools unionbedg \
    -i total_plus_coverage.bedgraph \
    total_minus_coverage.bedgraph \
    fiveprime_plus_coverage.bedgraph \
    fiveprime_minus_coverage.bedgraph \
    -g "${GENOME_FILE}" \
    | awk 'BEGIN{OFS="\t"}{print $1, $2, $3, $4+$5+$6+$7}' \
    > combined_total_and_fiveprime_coverage.bedgraph

# Clean up intermediate files (optional)
# rm total_plus_coverage.bedgraph total_minus_coverage.bedgraph fiveprime_plus_coverage.bedgraph fiveprime_minus_coverage.bedgraph

View on GitHub

For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files.

SHAPE-seq vNot specified (Method-specific implementation)

$ Bash example

# Install necessary tools (if not already installed)
# conda install -c bioconda samtools bedtools

# --- Configuration ---
# Placeholder for human genome reference files.
# Replace with actual paths to your reference genome and annotation.
GENOME_FASTA="/path/to/human_genome.fa" # e.g., hg38
GENOME_GTF="/path/to/gencode.vXX.annotation.gtf" # e.g., GENCODE v44
GENOME_SIZES="/path/to/human_genome.sizes" # Chromosome sizes file (e.g., from ucsc-tools fetchChromSizes)

# Input BAM files for different SHAPE libraries (assuming these are already aligned)
MINUS_PROTEIN_BAM="sample_minus_protein.bam"
PLUS_PROTEIN_BAM="sample_plus_protein.bam"
UNTREATED_BAM="sample_untreated.bam"

OUTPUT_DIR="shape_reactivity_output"
mkdir -p "${OUTPUT_DIR}"

# --- Step 1: Calculate total read coverage for each library ---
# This provides the denominator for the (5'end coverage / total read coverage) ratio.
echo "Calculating total read coverage..."
samtools depth -a "${MINUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/minus_protein.total_coverage.tsv"
samtools depth -a "${PLUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/plus_protein.total_coverage.tsv"
samtools depth -a "${UNTREATED_BAM}" > "${OUTPUT_DIR}/untreated.total_coverage.tsv"

# --- Step 2: Calculate 5'end coverage and perform normalization ---
# The calculation of 5'end coverage and the subsequent normalization steps
# (fSHAPE, in vitro icSHAPE, in vivo icSHAPE reactivity) are highly specific
# to the SHAPE-seq protocol and typically implemented via custom scripts
# (e.g., Python or R) that process the aligned BAM files and reference annotations.
# There isn't a single, universally standardized command-line tool for these
# specific reactivity calculations and normalizations as described.

echo "Calculating 5'end coverage and normalizing reactivities using a custom script..."
echo "This step involves:"
echo "  1. Extracting 5' end positions from aligned reads in BAM files for each human transcript."
echo "  2. Calculating (5'end coverage / total read coverage) at each nucleotide."
echo "  3. Normalizing these ratios as follows:"
echo "     - fSHAPE reactivity: Normalized between -protein and +protein SHAPE libraries."
echo "     - in vitro icSHAPE reactivity: -protein SHAPE libraries normalized against untreated SHAPE libraries."
echo "     - in vivo icSHAPE reactivity: +protein SHAPE libraries normalized against untreated SHAPE libraries."
echo "Outputs will be generated as: ${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz, ${OUTPUT_DIR}/in_vitro_icshape.tar.gz, ${OUTPUT_DIR}/in_vivo_icshape.tar.gz"

# Placeholder for a conceptual custom script call.
# A real implementation would require a specific script (e.g., Python) that takes
# the BAM files, total coverage files, and GTF annotation to perform the detailed
# per-nucleotide calculations and normalizations.
#
# Example of how such a custom script might be invoked:
# python /path/to/custom_shape_seq_processor.py \
#     --minus_protein_bam "${MINUS_PROTEIN_BAM}" \
#     --plus_protein_bam "${PLUS_PROTEIN_BAM}" \
#     --untreated_bam "${UNTREATED_BAM}" \
#     --minus_protein_total_coverage "${OUTPUT_DIR}/minus_protein.total_coverage.tsv" \
#     --plus_protein_total_coverage "${OUTPUT_DIR}/plus_protein.total_coverage.tsv" \
#     --untreated_total_coverage "${OUTPUT_DIR}/untreated.total_coverage.tsv" \
#     --genome_gtf "${GENOME_GTF}" \
#     --output_fshape "${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz" \
#     --output_invitro "${OUTPUT_DIR}/in_vitro_icshape.tar.gz" \
#     --output_invivo "${OUTPUT_DIR}/in_vivo_icshape.tar.gz"

SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows:

eCLIP vlatest (as of 2020-10-27, last commit to yeolab/eclip) GitHub

$ Bash example

# This script executes the Yeo Lab eCLIP CWL workflow for processing eCLIP and its variants like SHAPE-eCLIP and fSHAPE-eCLIP.
# The workflow integrates tools for adapter trimming, alignment, deduplication, peak calling (CLIPper), and IDR (merge_peaks).

# Prerequisites:
# 1. cwltool: Install with `pip install cwltool`
# 2. Docker or Singularity: Required by cwltool to run containerized tools within the workflow.
# 3. Clone the eCLIP workflow repository:
#    git clone https://github.com/yeolab/eclip.git
#    cd eclip

# Reference Data Preparation (example for human hg38):
# - Genome FASTA (e.g., hg38.fa)
# - Genome GTF (e.g., hg38.gtf)
# - STAR index for the genome (build using STAR --runMode genomeGenerate)
# - rRNA FASTA (e.g., hg38_rRNA.fa)
# - STAR index for rRNA (build using STAR --runMode genomeGenerate)
# - Adapter FASTA (e.g., adapters.fa, containing common sequencing adapters)
# - Blacklist BED file (e.g., ENCODE hg38 blacklist)

# Example command to run the eCLIP CWL workflow:
# Replace placeholder paths and values with your actual data and desired settings.
# This command assumes you are in the 'eclip' directory after cloning the repository.

cwltool eclip.cwl \
    --fastq_files "path/to/your_sample_R1.fastq.gz" \
    --genome_fasta "path/to/reference/hg38.fa" \
    --genome_gtf "path/to/reference/hg38.gtf" \
    --star_index "path/to/reference/star_index_hg38/" \
    --blacklist_bed "path/to/reference/hg38_blacklist.bed" \
    --rRNA_fasta "path/to/reference/rRNA/hg38_rRNA.fa" \
    --rRNA_index "path/to/reference/rRNA/star_index_rRNA/" \
    --adapter_fasta "path/to/adapters.fa" \
    --umi_length 4 \
    --min_read_length 18 \
    --min_mapq 20 \
    --output_dir "eclip_analysis_output" \
    --sample_name "SHAPE_eCLIP_Sample" \
    --threads 8 \
    --memory "32G"

View on GitHub

Full code at https://github.com/meracorley/f-SHAPE-eCLIP

eCLIP vfrom f-SHAPE-eCLIP repository (based on Yeo Lab Skipper) GitHub

$ Bash example

# Clone the f-SHAPE-eCLIP pipeline repository
git clone https://github.com/meracorley/f-SHAPE-eCLIP.git
cd f-SHAPE-eCLIP

# --- Installation (commented out as per instructions) ---
# # Install Snakemake if not already installed
# conda install -c bioconda -c conda-forge snakemake

# # Create and activate the conda environment for the pipeline
# # This environment contains tools like cutadapt, STAR, samtools, picard, bedtools, clipper, etc.
# conda env create -f environment.yaml
# conda activate fshape_eclip

# --- Configuration ---
# # Configure the config.yaml file with your specific input files,
# # output directories, and reference genome paths.
# # Example configuration for hg38 reference genome:
# # (Modify config.yaml directly or create a new config file)
# #
# # Example snippet from config.yaml:
# # genome_fasta: "path/to/your/resources/hg38.fa"
# # gtf: "path/to/your/resources/gencode.v38.annotation.gtf"
# # star_index: "path/to/your/resources/STAR_index_hg38"
# # chrom_sizes: "path/to/your/resources/hg38.chrom.sizes"
# #
# # Placeholder for downloading reference genome (hg38) if not available:
# # mkdir -p resources
# # wget -P resources http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# # gunzip resources/hg38.fa.gz
# # wget -P resources https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# # gunzip resources/gencode.v38.annotation.gtf.gz
# #
# # # Build STAR index (if not pre-built)
# # STAR --runMode genomeGenerate --genomeDir resources/STAR_index_hg38 --genomeFastaFiles resources/hg38.fa --sjdbGTFfile resources/gencode.v38.annotation.gtf --runThreadN 8

# --- Execution ---
# # Run the f-SHAPE-eCLIP pipeline using Snakemake
# # Adjust --cores based on available resources.
# # The pipeline will use the conda environment defined in environment.yaml
# # and configured in config.yaml.
snakemake --use-conda --cores 8

View on GitHub

eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input)

STAR v2.7.9a GitHub

$ Bash example

# Install STAR (example using Conda)
# conda install -c bioconda star

# --- Reference Data Setup (example for GRCh38) ---
# This step is typically performed once to build the STAR genome index.
# Replace '/path/to/genome_index', '/path/to/GRCh38.primary_assembly.fa', and '/path/to/gencode.v38.annotation.gtf' with actual paths.
# mkdir -p /path/to/genome_index
# STAR --runThreadN 8 \
#      --runMode genomeGenerate \
#      --genomeDir /path/to/genome_index \
#      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
#      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
#      --sjdbOverhang 100 # Recommended value is (ReadLength - 1)

# --- STAR Alignment Command ---
# This command aligns sequencing reads to a reference genome using STAR.
# It assumes a pre-built STAR genome index is available at '/path/to/genome_index'.
# 'input_R1.fastq.gz' and 'input_R2.fastq.gz' are placeholder names for your input FASTQ files (paired-end example).
# For single-end reads, remove the second FASTQ file from --readFilesIn.
# Output BAM file will be 'aligned_reads_Aligned.sortedByCoord.out.bam'.
STAR --genomeDir /path/to/genome_index \
     --readFilesIn input_R1.fastq.gz input_R2.fastq.gz \
     --runThreadN 8 \
     --outFileNamePrefix aligned_reads_ \
     --outSAMtype BAM SortedByCoordinate \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

View on GitHub

Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads.

samtools and parse_mpileup.py (Inferred with models/gemini-2.5-flash) vsamtools 1.15.1, parse_mpileup.py from skipper workflow (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.15.1

# Download the parse_mpileup.py script from the Yeo lab skipper workflow.
# This script is essential for parsing samtools mpileup output to count mutations and coverage.
# wget https://raw.githubusercontent.com/yeolab/skipper/master/scripts/parse_mpileup.py
# chmod +x parse_mpileup.py

# Define input and reference files
DEDUP_BAM="input_deduplicated.bam" # Path to your de-duplicated BAM file
REFERENCE_FASTA="GRCh38.p14.genome.fa" # Placeholder: Path to your reference genome FASTA file (e.g., GRCh38.p14)
OUTPUT_MUTATIONS_TSV="output_mutations_coverage.tsv.gz"

# Define parameters (example values commonly used in Yeo lab eCLIP pipeline)
MIN_BASE_QUALITY=20       # Minimum base quality to consider a base
MIN_MAPPING_QUALITY=10    # Minimum mapping quality to consider a read
MAX_DEPTH=1000000         # Maximum per-BAM depth to avoid truncation for high coverage data

# Execute samtools mpileup to generate pileup data and pipe it to parse_mpileup.py
# The parse_mpileup.py script then processes this data to sum coverage and mutation counts at each position.
samtools mpileup \
  -q ${MIN_BASE_QUALITY} \
  -Q ${MIN_MAPPING_QUALITY} \
  -d ${MAX_DEPTH} \
  -A \
  -B \
  -f ${REFERENCE_FASTA} \
  ${DEDUP_BAM} \
| python parse_mpileup.py --output ${OUTPUT_MUTATIONS_TSV}

View on GitHub

For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files.

Custom SHAPE-MaP analysis script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

    # Define input files and reference data
    # Placeholder for human reference genome (e.g., GRCh38)
    REF_GENOME="/path/to/human_genome/GRCh38.primary_assembly.genome.fa"
    # Placeholder for human transcript annotation (e.g., Gencode v38)
    TRANSCRIPT_ANNOTATION="/path/to/human_genome/gencode.v38.annotation.gtf"

    # Input BAM files for -protein and +protein libraries (for fSHAPE)
    NEG_PROTEIN_BAM="/path/to/eclip_neg_protein_sample.bam"
    POS_PROTEIN_BAM="/path/to/eclip_pos_protein_sample.bam"

    # Input BAM files for treated and untreated libraries (for SHAPE)
    TREATED_BAM="/path/to/treated_sample.bam"
    UNTREATED_BAM="/path/to/untreated_sample.bam"

    # Output directory
    OUTPUT_DIR="./shape_reactivity_maps"
    mkdir -p ${OUTPUT_DIR}

    # Ensure BAM files are indexed if not already
    # samtools index ${NEG_PROTEIN_BAM}
    # samtools index ${POS_PROTEIN_BAM}
    # samtools index ${TREATED_BAM}
    # samtools index ${UNTREATED_BAM}

    # Placeholder for a custom script that calculates SHAPE/fSHAPE reactivity.
    # This script would typically:
    # 1. Process aligned reads (BAM files) to identify mutations or RT stops.
    # 2. Calculate mutation coverage / total read coverage at each nucleotide.
    # 3. Normalize reactivities between the specified control and experimental libraries.
    # 4. Generate per-nucleotide reactivity maps for each transcript.
    # 5. Archive the results into the specified .tar.gz files.

    # Example command for fSHAPE calculation (comparing -protein vs +protein libraries)
    # The actual script name and parameters would depend on the custom implementation.
    # This command generates 'fshape_clip_maps.tar.gz'
    python /path/to/custom_shape_analysis_script.py \
        --reference ${REF_GENOME} \
        --annotation ${TRANSCRIPT_ANNOTATION} \
        --control_bam ${NEG_PROTEIN_BAM} \
        --experimental_bam ${POS_PROTEIN_BAM} \
        --output_prefix ${OUTPUT_DIR}/fshape_output \
        --output_tar_gz ${OUTPUT_DIR}/fshape_clip_maps.tar.gz \
        --mode fSHAPE_protein_footprinting

    # Example command for SHAPE calculation (comparing treated vs untreated libraries)
    # This command generates 'shapeclip_maps.tar.gz' (or similar, matching the pattern *shapeclip_maps.tar.gz)
    python /path/to/custom_shape_analysis_script.py \
        --reference ${REF_GENOME} \
        --annotation ${TRANSCRIPT_ANNOTATION} \
        --control_bam ${UNTREATED_BAM} \
        --experimental_bam ${TREATED_BAM} \
        --output_prefix ${OUTPUT_DIR}/shape_output \
        --output_tar_gz ${OUTPUT_DIR}/shapeclip_maps.tar.gz \
        --mode SHAPE_treatment_effect

View on GitHub

Tools Used

STAR eCLIP

Raw Source Text

library strategy: footprinting and selective 2â-hydroxyl acylation analyzed by primer extension (fSHAPE)
SHAPE libraries as follows:
Full code at https://github.com/meracorley/fSHAPE
zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq
cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq
STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate
umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam
Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads.
For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files.
SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows:
Full code at https://github.com/meracorley/f-SHAPE-eCLIP
eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input)
Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads.
For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files.
Genome_build: Hg38 for icSHAPE and fSHAPE, Hg19 for SHAPE-eCLIP and fSHAPE-eCLIP
Supplementary_files_format_and_content: Tab-delimited .map files for multiple human transcripts: Nucleotide number in transcript, Average SHAPE or fSHAPE reactivity, Standard deviation, Base. Tab-delimited .rx files for multiple human transcripts: SHAPE or fSHAPE reactivity rep1, SHAPE or fSHAPE reactivity rep2.

← Back to Analysis