GSE149767 Processing Pipeline

OTHER code_examples 14 steps

Publication

Footprinting SHAPE-eCLIP Reveals Transcriptome-wide Hydrogen Bonds at RNA-Protein Interfaces.

Molecular cell (2020) — PMID 33242392

Dataset

GSE149767

Footprinting SHAPE-eCLIP reveals transcriptome-wide hydrogen bonds at RNA-protein interfaces

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    library strategy: footprinting and selective 2’-hydroxyl acylation analyzed by primer extension (fSHAPE)

    fSHAPE (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Example workflow for fSHAPE data analysis
    # Reference genome: hg38 (placeholder)
    
    # 1. Quality control (optional but recommended)
    # fastqc fshape_sample.fastq.gz
    
    # 2. Alignment to reference genome (e.g., using STAR)
    # Install STAR: # conda install -c bioconda star
    # Build STAR index (if not already done, replace paths with actual locations):
    # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index_hg38 --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/hg38.gtf --runThreadN 8
    
    STAR --genomeDir /path/to/STAR_index_hg38 \
         --readFilesIn fshape_sample.fastq.gz \
         --runThreadN 8 \
         --outFileNamePrefix fshape_sample_aligned_ \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --alignIntronMax 1 # Assuming no splicing for reactivity mapping, or mapping to a transcript sequence
    
    # 3. Index the aligned BAM file
    # samtools index fshape_sample_aligned_Aligned.sortedByCoord.out.bam
    
    # 4. Calculate read depth/coverage at each position (e.g., using samtools)
    # This output is crucial for identifying primer extension stops.
    # samtools depth -a -o fshape_sample_depth.tsv fshape_sample_aligned_Aligned.sortedByCoord.out.bam
    
    # 5. Custom script for fSHAPE reactivity calculation
    # This step is highly specific to the fSHAPE protocol and involves:
    # - Identifying primer extension stops from the depth file.
    # - Normalizing counts (e.g., against control samples, total reads).
    # - Calculating reactivity scores.
    # - Often involves custom Python/R scripts.
    # python custom_fshape_reactivity_calculator.py \
    #     --depth_file fshape_sample_depth.tsv \
    #     --output_reactivity fshape_sample_reactivity.tsv \
    #     --control_depth_file control_sample_depth.tsv \
    #     --genome_fasta /path/to/hg38.fa
  2. 2

    SHAPE libraries as follows:

    ShapeMapper (Inferred with models/gemini-2.5-flash) v6.4
    $ Bash example
    # Install ShapeMapper (part of the RNAstructure suite)
    # conda install -c bioconda rnastructure
    # # Alternatively, download and compile from source or use pre-compiled binaries:
    # # https://rna.urmc.rochester.edu/RNAstructure.html
    
    # Example usage of ShapeMapper for SHAPE-MaP data analysis.
    # This command assumes paired-end sequencing data and a reference RNA sequence.
    # Replace 'reference.fasta', 'sample_R1.fastq.gz', 'sample_R2.fastq.gz' with actual file paths.
    # The output will include reactivity profiles and other analysis files.
    
    # Define input and output files
    REFERENCE_SEQUENCE="reference.fasta" # Placeholder: Path to the reference RNA sequence in FASTA format
    INPUT_FASTQ_R1="sample_R1.fastq.gz" # Placeholder: Path to the R1 FASTQ file
    INPUT_FASTQ_R2="sample_R2.fastq.gz" # Placeholder: Path to the R2 FASTQ file (if paired-end)
    OUTPUT_PREFIX="sample_shapemapper_output" # Prefix for output files
    
    # Run ShapeMapper
    # Note: ShapeMapper's exact command line interface can vary slightly depending on the version and specific analysis goals.
    # This is a common pattern for processing SHAPE-MaP data, including alignment, read counting, and reactivity calculation.
    ShapeMapper --sequence "${REFERENCE_SEQUENCE}" \
                --fastq "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" \
                --output-prefix "${OUTPUT_PREFIX}" \
                --min-read-length 20 \
                --max-read-length 200 \
                --min-mapq 20 \
                --num-threads 8
  3. 3

    Full code at https://github.com/meracorley/fSHAPE

    fSHAPE vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install fSHAPE dependencies (example using conda)
    # conda env create -f environment.yaml
    # conda activate fshape_env
    
    # Ensure the SHAPE-MaP pipeline (https://github.com/Weeks-Lab/SHAPE-MaP_pipeline) is installed and configured.
    # Its path and conda environment name will be specified in config.yaml.
    
    # Create a config.yaml file based on your data and reference files
    # Example config.yaml content (replace placeholders with actual paths and values):
    # samples:
    #   sample1:
    #     R1: "path/to/sample1_R1.fastq.gz"
    #     R2: "path/to/sample1_R2.fastq.gz" # Optional, if paired-end
    #     control_R1: "path/to/control1_R1.fastq.gz"
    #     control_R2: "path/to/control1_R2.fastq.gz" # Optional, if paired-end
    # genome_fasta: "/path/to/reference/genome/hg38.fa" # Placeholder: use your specific genome fasta
    # gtf: "/path/to/reference/annotation/hg38.gtf" # Placeholder: use your specific GTF annotation
    # shape_mappers_path: "/path/to/SHAPE-MaP_pipeline" # Path to the SHAPE-MaP pipeline scripts
    # shape_mappers_env: "shape_mappers_conda_env" # Conda environment name for SHAPE-MaP pipeline
    
    # Execute the fSHAPE Snakemake pipeline
    # Ensure you are in the directory containing the Snakefile and config.yaml
    snakemake --snakefile Snakefile --configfile config.yaml --cores 8
  4. 4

    zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq

    UMI-tools v1.0.1 GitHub
    $ Bash example
    # Install UMI-tools (e.g., via conda)
    # conda install -c bioconda umi_tools
    
    # Extract UMIs from FASTQ file
    zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq
  5. 5

    cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq

    cutadapt vNot specified in description GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # Define input and output files (example placeholders)
    # file="sample_prefix"
    
    cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq
  6. 6

    STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables (example placeholders)
    # file="sample_id" # Replace with your actual file prefix, e.g., 'SRR1234567'
    # genome_dir="/projects/ps-yeolab4/genomes/GRCh38/star/" # Ensure this path is correct and contains STAR index
    
    STAR \
      --runThreadN 8 \
      --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ \
      --readFilesIn "$file".trim.fastq \
      --outFileNamePrefix "$file" \
      --outSAMstrandField intronMotif \
      --outFilterIntronMotifs RemoveNoncanonical \
      --outSAMtype BAM SortedByCoordinate
  7. 7

    umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam

    umi-tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi-tools via conda
    # conda install -c bioconda umi-tools=1.1.2
    
    # Example usage:
    # Assuming 'file' variable is set, e.g., file="sample_id"
    umi_tools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam
  8. 8

    Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads.

    bedtools v2.31.0 GitHub
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Placeholder for genome file (e.g., hg38.chrom.sizes)
    # This file should contain chromosome names and their sizes, tab-separated.
    # Example:
    # chr1    248956422
    # chr2    242193529
    # ...
    # You can generate this from a .fasta file using:
    # samtools faidx -i chromsizes genome.fasta > genome.txt
    # If reads are mapped to a transcriptome FASTA, this file should contain the sizes of the transcriptome sequences.
    GENOME_FILE="genome.txt"
    
    # Placeholder for de-duplicated reads in BED format
    # This file should contain the de-duplicated reads, typically after alignment and deduplication.
    # Ensure these reads are in the same coordinate system as specified by GENOME_FILE.
    DEDUPLICATED_READS_BED="deduplicated_reads.bed"
    
    # --- Step 1: Calculate total coverage for plus strand --- 
    # -split: Reports coverage for spliced alignments (e.g., RNA-seq) by splitting reads at 'N' gaps.
    # -strand +: Reports coverage only for the plus strand.
    # -dz: Reports zero-coverage regions as well.
    bedtools genomecov -split -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_plus_coverage.bedgraph
    
    # --- Step 2: Calculate total coverage for minus strand --- 
    bedtools genomecov -split -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_minus_coverage.bedgraph
    
    # --- Step 3: Calculate 5' end coverage for plus strand --- 
    # -5: Reports coverage only for the 5' most base of each read.
    bedtools genomecov -5 -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_plus_coverage.bedgraph
    
    # --- Step 4: Calculate 5' end coverage for minus strand --- 
    bedtools genomecov -5 -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_minus_coverage.bedgraph
    
    # --- Step 5: Sum all four coverages at each position --- 
    # Use bedtools unionbedg to combine the four bedgraph files, ensuring all positions are considered.
    # Then, use awk to sum the coverage values from the four input files for each genomic position.
    # The output of unionbedg will have columns: chrom, start, end, coverage_file1, coverage_file2, coverage_file3, coverage_file4.
    bedtools unionbedg \
        -i total_plus_coverage.bedgraph \
        total_minus_coverage.bedgraph \
        fiveprime_plus_coverage.bedgraph \
        fiveprime_minus_coverage.bedgraph \
        -g "${GENOME_FILE}" \
        | awk 'BEGIN{OFS="\t"}{print $1, $2, $3, $4+$5+$6+$7}' \
        > combined_total_and_fiveprime_coverage.bedgraph
    
    # Clean up intermediate files (optional)
    # rm total_plus_coverage.bedgraph total_minus_coverage.bedgraph fiveprime_plus_coverage.bedgraph fiveprime_minus_coverage.bedgraph
  9. 9

    For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files.

    SHAPE-seq vNot specified (Method-specific implementation)
    $ Bash example
    # Install necessary tools (if not already installed)
    # conda install -c bioconda samtools bedtools
    
    # --- Configuration ---
    # Placeholder for human genome reference files.
    # Replace with actual paths to your reference genome and annotation.
    GENOME_FASTA="/path/to/human_genome.fa" # e.g., hg38
    GENOME_GTF="/path/to/gencode.vXX.annotation.gtf" # e.g., GENCODE v44
    GENOME_SIZES="/path/to/human_genome.sizes" # Chromosome sizes file (e.g., from ucsc-tools fetchChromSizes)
    
    # Input BAM files for different SHAPE libraries (assuming these are already aligned)
    MINUS_PROTEIN_BAM="sample_minus_protein.bam"
    PLUS_PROTEIN_BAM="sample_plus_protein.bam"
    UNTREATED_BAM="sample_untreated.bam"
    
    OUTPUT_DIR="shape_reactivity_output"
    mkdir -p "${OUTPUT_DIR}"
    
    # --- Step 1: Calculate total read coverage for each library ---
    # This provides the denominator for the (5'end coverage / total read coverage) ratio.
    echo "Calculating total read coverage..."
    samtools depth -a "${MINUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/minus_protein.total_coverage.tsv"
    samtools depth -a "${PLUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/plus_protein.total_coverage.tsv"
    samtools depth -a "${UNTREATED_BAM}" > "${OUTPUT_DIR}/untreated.total_coverage.tsv"
    
    # --- Step 2: Calculate 5'end coverage and perform normalization ---
    # The calculation of 5'end coverage and the subsequent normalization steps
    # (fSHAPE, in vitro icSHAPE, in vivo icSHAPE reactivity) are highly specific
    # to the SHAPE-seq protocol and typically implemented via custom scripts
    # (e.g., Python or R) that process the aligned BAM files and reference annotations.
    # There isn't a single, universally standardized command-line tool for these
    # specific reactivity calculations and normalizations as described.
    
    echo "Calculating 5'end coverage and normalizing reactivities using a custom script..."
    echo "This step involves:"
    echo "  1. Extracting 5' end positions from aligned reads in BAM files for each human transcript."
    echo "  2. Calculating (5'end coverage / total read coverage) at each nucleotide."
    echo "  3. Normalizing these ratios as follows:"
    echo "     - fSHAPE reactivity: Normalized between -protein and +protein SHAPE libraries."
    echo "     - in vitro icSHAPE reactivity: -protein SHAPE libraries normalized against untreated SHAPE libraries."
    echo "     - in vivo icSHAPE reactivity: +protein SHAPE libraries normalized against untreated SHAPE libraries."
    echo "Outputs will be generated as: ${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz, ${OUTPUT_DIR}/in_vitro_icshape.tar.gz, ${OUTPUT_DIR}/in_vivo_icshape.tar.gz"
    
    # Placeholder for a conceptual custom script call.
    # A real implementation would require a specific script (e.g., Python) that takes
    # the BAM files, total coverage files, and GTF annotation to perform the detailed
    # per-nucleotide calculations and normalizations.
    #
    # Example of how such a custom script might be invoked:
    # python /path/to/custom_shape_seq_processor.py \
    #     --minus_protein_bam "${MINUS_PROTEIN_BAM}" \
    #     --plus_protein_bam "${PLUS_PROTEIN_BAM}" \
    #     --untreated_bam "${UNTREATED_BAM}" \
    #     --minus_protein_total_coverage "${OUTPUT_DIR}/minus_protein.total_coverage.tsv" \
    #     --plus_protein_total_coverage "${OUTPUT_DIR}/plus_protein.total_coverage.tsv" \
    #     --untreated_total_coverage "${OUTPUT_DIR}/untreated.total_coverage.tsv" \
    #     --genome_gtf "${GENOME_GTF}" \
    #     --output_fshape "${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz" \
    #     --output_invitro "${OUTPUT_DIR}/in_vitro_icshape.tar.gz" \
    #     --output_invivo "${OUTPUT_DIR}/in_vivo_icshape.tar.gz"
    
  10. 10

    SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows:

    eCLIP vlatest (as of 2020-10-27, last commit to yeolab/eclip) GitHub
    $ Bash example
    # This script executes the Yeo Lab eCLIP CWL workflow for processing eCLIP and its variants like SHAPE-eCLIP and fSHAPE-eCLIP.
    # The workflow integrates tools for adapter trimming, alignment, deduplication, peak calling (CLIPper), and IDR (merge_peaks).
    
    # Prerequisites:
    # 1. cwltool: Install with `pip install cwltool`
    # 2. Docker or Singularity: Required by cwltool to run containerized tools within the workflow.
    # 3. Clone the eCLIP workflow repository:
    #    git clone https://github.com/yeolab/eclip.git
    #    cd eclip
    
    # Reference Data Preparation (example for human hg38):
    # - Genome FASTA (e.g., hg38.fa)
    # - Genome GTF (e.g., hg38.gtf)
    # - STAR index for the genome (build using STAR --runMode genomeGenerate)
    # - rRNA FASTA (e.g., hg38_rRNA.fa)
    # - STAR index for rRNA (build using STAR --runMode genomeGenerate)
    # - Adapter FASTA (e.g., adapters.fa, containing common sequencing adapters)
    # - Blacklist BED file (e.g., ENCODE hg38 blacklist)
    
    # Example command to run the eCLIP CWL workflow:
    # Replace placeholder paths and values with your actual data and desired settings.
    # This command assumes you are in the 'eclip' directory after cloning the repository.
    
    cwltool eclip.cwl \
        --fastq_files "path/to/your_sample_R1.fastq.gz" \
        --genome_fasta "path/to/reference/hg38.fa" \
        --genome_gtf "path/to/reference/hg38.gtf" \
        --star_index "path/to/reference/star_index_hg38/" \
        --blacklist_bed "path/to/reference/hg38_blacklist.bed" \
        --rRNA_fasta "path/to/reference/rRNA/hg38_rRNA.fa" \
        --rRNA_index "path/to/reference/rRNA/star_index_rRNA/" \
        --adapter_fasta "path/to/adapters.fa" \
        --umi_length 4 \
        --min_read_length 18 \
        --min_mapq 20 \
        --output_dir "eclip_analysis_output" \
        --sample_name "SHAPE_eCLIP_Sample" \
        --threads 8 \
        --memory "32G"
  11. 11

    Full code at https://github.com/meracorley/f-SHAPE-eCLIP

    eCLIP vfrom f-SHAPE-eCLIP repository (based on Yeo Lab Skipper) GitHub
    $ Bash example
    # Clone the f-SHAPE-eCLIP pipeline repository
    git clone https://github.com/meracorley/f-SHAPE-eCLIP.git
    cd f-SHAPE-eCLIP
    
    # --- Installation (commented out as per instructions) ---
    # # Install Snakemake if not already installed
    # conda install -c bioconda -c conda-forge snakemake
    
    # # Create and activate the conda environment for the pipeline
    # # This environment contains tools like cutadapt, STAR, samtools, picard, bedtools, clipper, etc.
    # conda env create -f environment.yaml
    # conda activate fshape_eclip
    
    # --- Configuration ---
    # # Configure the config.yaml file with your specific input files,
    # # output directories, and reference genome paths.
    # # Example configuration for hg38 reference genome:
    # # (Modify config.yaml directly or create a new config file)
    # #
    # # Example snippet from config.yaml:
    # # genome_fasta: "path/to/your/resources/hg38.fa"
    # # gtf: "path/to/your/resources/gencode.v38.annotation.gtf"
    # # star_index: "path/to/your/resources/STAR_index_hg38"
    # # chrom_sizes: "path/to/your/resources/hg38.chrom.sizes"
    # #
    # # Placeholder for downloading reference genome (hg38) if not available:
    # # mkdir -p resources
    # # wget -P resources http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # # gunzip resources/hg38.fa.gz
    # # wget -P resources https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    # # gunzip resources/gencode.v38.annotation.gtf.gz
    # #
    # # # Build STAR index (if not pre-built)
    # # STAR --runMode genomeGenerate --genomeDir resources/STAR_index_hg38 --genomeFastaFiles resources/hg38.fa --sjdbGTFfile resources/gencode.v38.annotation.gtf --runThreadN 8
    
    # --- Execution ---
    # # Run the f-SHAPE-eCLIP pipeline using Snakemake
    # # Adjust --cores based on available resources.
    # # The pipeline will use the conda environment defined in environment.yaml
    # # and configured in config.yaml.
    snakemake --use-conda --cores 8
  12. 12

    eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input)

    $ Bash example
    # Install STAR (example using Conda)
    # conda install -c bioconda star
    
    # --- Reference Data Setup (example for GRCh38) ---
    # This step is typically performed once to build the STAR genome index.
    # Replace '/path/to/genome_index', '/path/to/GRCh38.primary_assembly.fa', and '/path/to/gencode.v38.annotation.gtf' with actual paths.
    # mkdir -p /path/to/genome_index
    # STAR --runThreadN 8 \
    #      --runMode genomeGenerate \
    #      --genomeDir /path/to/genome_index \
    #      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
    #      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
    #      --sjdbOverhang 100 # Recommended value is (ReadLength - 1)
    
    # --- STAR Alignment Command ---
    # This command aligns sequencing reads to a reference genome using STAR.
    # It assumes a pre-built STAR genome index is available at '/path/to/genome_index'.
    # 'input_R1.fastq.gz' and 'input_R2.fastq.gz' are placeholder names for your input FASTQ files (paired-end example).
    # For single-end reads, remove the second FASTQ file from --readFilesIn.
    # Output BAM file will be 'aligned_reads_Aligned.sortedByCoord.out.bam'.
    STAR --genomeDir /path/to/genome_index \
         --readFilesIn input_R1.fastq.gz input_R2.fastq.gz \
         --runThreadN 8 \
         --outFileNamePrefix aligned_reads_ \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)
    
  13. 13

    Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads.

    samtools and parse_mpileup.py (Inferred with models/gemini-2.5-flash) vsamtools 1.15.1, parse_mpileup.py from skipper workflow (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.15.1
    
    # Download the parse_mpileup.py script from the Yeo lab skipper workflow.
    # This script is essential for parsing samtools mpileup output to count mutations and coverage.
    # wget https://raw.githubusercontent.com/yeolab/skipper/master/scripts/parse_mpileup.py
    # chmod +x parse_mpileup.py
    
    # Define input and reference files
    DEDUP_BAM="input_deduplicated.bam" # Path to your de-duplicated BAM file
    REFERENCE_FASTA="GRCh38.p14.genome.fa" # Placeholder: Path to your reference genome FASTA file (e.g., GRCh38.p14)
    OUTPUT_MUTATIONS_TSV="output_mutations_coverage.tsv.gz"
    
    # Define parameters (example values commonly used in Yeo lab eCLIP pipeline)
    MIN_BASE_QUALITY=20       # Minimum base quality to consider a base
    MIN_MAPPING_QUALITY=10    # Minimum mapping quality to consider a read
    MAX_DEPTH=1000000         # Maximum per-BAM depth to avoid truncation for high coverage data
    
    # Execute samtools mpileup to generate pileup data and pipe it to parse_mpileup.py
    # The parse_mpileup.py script then processes this data to sum coverage and mutation counts at each position.
    samtools mpileup \
      -q ${MIN_BASE_QUALITY} \
      -Q ${MIN_MAPPING_QUALITY} \
      -d ${MAX_DEPTH} \
      -A \
      -B \
      -f ${REFERENCE_FASTA} \
      ${DEDUP_BAM} \
    | python parse_mpileup.py --output ${OUTPUT_MUTATIONS_TSV}
  14. 14

    For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files.

    Custom SHAPE-MaP analysis script (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
        # Define input files and reference data
        # Placeholder for human reference genome (e.g., GRCh38)
        REF_GENOME="/path/to/human_genome/GRCh38.primary_assembly.genome.fa"
        # Placeholder for human transcript annotation (e.g., Gencode v38)
        TRANSCRIPT_ANNOTATION="/path/to/human_genome/gencode.v38.annotation.gtf"
    
        # Input BAM files for -protein and +protein libraries (for fSHAPE)
        NEG_PROTEIN_BAM="/path/to/eclip_neg_protein_sample.bam"
        POS_PROTEIN_BAM="/path/to/eclip_pos_protein_sample.bam"
    
        # Input BAM files for treated and untreated libraries (for SHAPE)
        TREATED_BAM="/path/to/treated_sample.bam"
        UNTREATED_BAM="/path/to/untreated_sample.bam"
    
        # Output directory
        OUTPUT_DIR="./shape_reactivity_maps"
        mkdir -p ${OUTPUT_DIR}
    
        # Ensure BAM files are indexed if not already
        # samtools index ${NEG_PROTEIN_BAM}
        # samtools index ${POS_PROTEIN_BAM}
        # samtools index ${TREATED_BAM}
        # samtools index ${UNTREATED_BAM}
    
        # Placeholder for a custom script that calculates SHAPE/fSHAPE reactivity.
        # This script would typically:
        # 1. Process aligned reads (BAM files) to identify mutations or RT stops.
        # 2. Calculate mutation coverage / total read coverage at each nucleotide.
        # 3. Normalize reactivities between the specified control and experimental libraries.
        # 4. Generate per-nucleotide reactivity maps for each transcript.
        # 5. Archive the results into the specified .tar.gz files.
    
        # Example command for fSHAPE calculation (comparing -protein vs +protein libraries)
        # The actual script name and parameters would depend on the custom implementation.
        # This command generates 'fshape_clip_maps.tar.gz'
        python /path/to/custom_shape_analysis_script.py \
            --reference ${REF_GENOME} \
            --annotation ${TRANSCRIPT_ANNOTATION} \
            --control_bam ${NEG_PROTEIN_BAM} \
            --experimental_bam ${POS_PROTEIN_BAM} \
            --output_prefix ${OUTPUT_DIR}/fshape_output \
            --output_tar_gz ${OUTPUT_DIR}/fshape_clip_maps.tar.gz \
            --mode fSHAPE_protein_footprinting
    
        # Example command for SHAPE calculation (comparing treated vs untreated libraries)
        # This command generates 'shapeclip_maps.tar.gz' (or similar, matching the pattern *shapeclip_maps.tar.gz)
        python /path/to/custom_shape_analysis_script.py \
            --reference ${REF_GENOME} \
            --annotation ${TRANSCRIPT_ANNOTATION} \
            --control_bam ${UNTREATED_BAM} \
            --experimental_bam ${TREATED_BAM} \
            --output_prefix ${OUTPUT_DIR}/shape_output \
            --output_tar_gz ${OUTPUT_DIR}/shapeclip_maps.tar.gz \
            --mode SHAPE_treatment_effect

Tools Used

Raw Source Text
library strategy: footprinting and selective 2’-hydroxyl acylation analyzed by primer extension (fSHAPE)
SHAPE libraries as follows:
Full code at https://github.com/meracorley/fSHAPE
zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq
cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq
STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate
umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam
Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads.
For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files.
SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows:
Full code at https://github.com/meracorley/f-SHAPE-eCLIP
eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input)
Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads.
For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files.
Genome_build: Hg38 for icSHAPE and fSHAPE, Hg19 for SHAPE-eCLIP and fSHAPE-eCLIP
Supplementary_files_format_and_content: Tab-delimited .map files for multiple human transcripts: Nucleotide number in transcript, Average SHAPE or fSHAPE reactivity, Standard deviation, Base. Tab-delimited .rx files for multiple human transcripts: SHAPE or fSHAPE reactivity rep1, SHAPE or fSHAPE reactivity rep2.
← Back to Analysis