GSE149767 Processing Pipeline
Publication
Footprinting SHAPE-eCLIP Reveals Transcriptome-wide Hydrogen Bonds at RNA-Protein Interfaces.Molecular cell (2020) — PMID 33242392
Dataset
GSE149767Footprinting SHAPE-eCLIP reveals transcriptome-wide hydrogen bonds at RNA-protein interfaces
Processing Steps
Generate Jupyter Notebook-
1
library strategy: footprinting and selective 2â-hydroxyl acylation analyzed by primer extension (fSHAPE)
$ Bash example
# Example workflow for fSHAPE data analysis # Reference genome: hg38 (placeholder) # 1. Quality control (optional but recommended) # fastqc fshape_sample.fastq.gz # 2. Alignment to reference genome (e.g., using STAR) # Install STAR: # conda install -c bioconda star # Build STAR index (if not already done, replace paths with actual locations): # STAR --runMode genomeGenerate --genomeDir /path/to/STAR_index_hg38 --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/hg38.gtf --runThreadN 8 STAR --genomeDir /path/to/STAR_index_hg38 \ --readFilesIn fshape_sample.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix fshape_sample_aligned_ \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --alignIntronMax 1 # Assuming no splicing for reactivity mapping, or mapping to a transcript sequence # 3. Index the aligned BAM file # samtools index fshape_sample_aligned_Aligned.sortedByCoord.out.bam # 4. Calculate read depth/coverage at each position (e.g., using samtools) # This output is crucial for identifying primer extension stops. # samtools depth -a -o fshape_sample_depth.tsv fshape_sample_aligned_Aligned.sortedByCoord.out.bam # 5. Custom script for fSHAPE reactivity calculation # This step is highly specific to the fSHAPE protocol and involves: # - Identifying primer extension stops from the depth file. # - Normalizing counts (e.g., against control samples, total reads). # - Calculating reactivity scores. # - Often involves custom Python/R scripts. # python custom_fshape_reactivity_calculator.py \ # --depth_file fshape_sample_depth.tsv \ # --output_reactivity fshape_sample_reactivity.tsv \ # --control_depth_file control_sample_depth.tsv \ # --genome_fasta /path/to/hg38.fa -
2
SHAPE libraries as follows:
ShapeMapper (Inferred with models/gemini-2.5-flash) v6.4$ Bash example
# Install ShapeMapper (part of the RNAstructure suite) # conda install -c bioconda rnastructure # # Alternatively, download and compile from source or use pre-compiled binaries: # # https://rna.urmc.rochester.edu/RNAstructure.html # Example usage of ShapeMapper for SHAPE-MaP data analysis. # This command assumes paired-end sequencing data and a reference RNA sequence. # Replace 'reference.fasta', 'sample_R1.fastq.gz', 'sample_R2.fastq.gz' with actual file paths. # The output will include reactivity profiles and other analysis files. # Define input and output files REFERENCE_SEQUENCE="reference.fasta" # Placeholder: Path to the reference RNA sequence in FASTA format INPUT_FASTQ_R1="sample_R1.fastq.gz" # Placeholder: Path to the R1 FASTQ file INPUT_FASTQ_R2="sample_R2.fastq.gz" # Placeholder: Path to the R2 FASTQ file (if paired-end) OUTPUT_PREFIX="sample_shapemapper_output" # Prefix for output files # Run ShapeMapper # Note: ShapeMapper's exact command line interface can vary slightly depending on the version and specific analysis goals. # This is a common pattern for processing SHAPE-MaP data, including alignment, read counting, and reactivity calculation. ShapeMapper --sequence "${REFERENCE_SEQUENCE}" \ --fastq "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" \ --output-prefix "${OUTPUT_PREFIX}" \ --min-read-length 20 \ --max-read-length 200 \ --min-mapq 20 \ --num-threads 8 -
3
Full code at https://github.com/meracorley/fSHAPE
$ Bash example
# Install fSHAPE dependencies (example using conda) # conda env create -f environment.yaml # conda activate fshape_env # Ensure the SHAPE-MaP pipeline (https://github.com/Weeks-Lab/SHAPE-MaP_pipeline) is installed and configured. # Its path and conda environment name will be specified in config.yaml. # Create a config.yaml file based on your data and reference files # Example config.yaml content (replace placeholders with actual paths and values): # samples: # sample1: # R1: "path/to/sample1_R1.fastq.gz" # R2: "path/to/sample1_R2.fastq.gz" # Optional, if paired-end # control_R1: "path/to/control1_R1.fastq.gz" # control_R2: "path/to/control1_R2.fastq.gz" # Optional, if paired-end # genome_fasta: "/path/to/reference/genome/hg38.fa" # Placeholder: use your specific genome fasta # gtf: "/path/to/reference/annotation/hg38.gtf" # Placeholder: use your specific GTF annotation # shape_mappers_path: "/path/to/SHAPE-MaP_pipeline" # Path to the SHAPE-MaP pipeline scripts # shape_mappers_env: "shape_mappers_conda_env" # Conda environment name for SHAPE-MaP pipeline # Execute the fSHAPE Snakemake pipeline # Ensure you are in the directory containing the Snakefile and config.yaml snakemake --snakefile Snakefile --configfile config.yaml --cores 8
-
4
zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq
$ Bash example
# Install UMI-tools (e.g., via conda) # conda install -c bioconda umi_tools # Extract UMIs from FASTQ file zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq
-
5
cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt # Define input and output files (example placeholders) # file="sample_prefix" cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq
-
6
STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables (example placeholders) # file="sample_id" # Replace with your actual file prefix, e.g., 'SRR1234567' # genome_dir="/projects/ps-yeolab4/genomes/GRCh38/star/" # Ensure this path is correct and contains STAR index STAR \ --runThreadN 8 \ --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ \ --readFilesIn "$file".trim.fastq \ --outFileNamePrefix "$file" \ --outSAMstrandField intronMotif \ --outFilterIntronMotifs RemoveNoncanonical \ --outSAMtype BAM SortedByCoordinate
-
7
umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam
$ Bash example
# Install umi-tools via conda # conda install -c bioconda umi-tools=1.1.2 # Example usage: # Assuming 'file' variable is set, e.g., file="sample_id" umi_tools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam
-
8
Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads.
$ Bash example
# Install bedtools if not already installed # conda install -c bioconda bedtools # Placeholder for genome file (e.g., hg38.chrom.sizes) # This file should contain chromosome names and their sizes, tab-separated. # Example: # chr1 248956422 # chr2 242193529 # ... # You can generate this from a .fasta file using: # samtools faidx -i chromsizes genome.fasta > genome.txt # If reads are mapped to a transcriptome FASTA, this file should contain the sizes of the transcriptome sequences. GENOME_FILE="genome.txt" # Placeholder for de-duplicated reads in BED format # This file should contain the de-duplicated reads, typically after alignment and deduplication. # Ensure these reads are in the same coordinate system as specified by GENOME_FILE. DEDUPLICATED_READS_BED="deduplicated_reads.bed" # --- Step 1: Calculate total coverage for plus strand --- # -split: Reports coverage for spliced alignments (e.g., RNA-seq) by splitting reads at 'N' gaps. # -strand +: Reports coverage only for the plus strand. # -dz: Reports zero-coverage regions as well. bedtools genomecov -split -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_plus_coverage.bedgraph # --- Step 2: Calculate total coverage for minus strand --- bedtools genomecov -split -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > total_minus_coverage.bedgraph # --- Step 3: Calculate 5' end coverage for plus strand --- # -5: Reports coverage only for the 5' most base of each read. bedtools genomecov -5 -strand + -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_plus_coverage.bedgraph # --- Step 4: Calculate 5' end coverage for minus strand --- bedtools genomecov -5 -strand - -dz -i "${DEDUPLICATED_READS_BED}" -g "${GENOME_FILE}" > fiveprime_minus_coverage.bedgraph # --- Step 5: Sum all four coverages at each position --- # Use bedtools unionbedg to combine the four bedgraph files, ensuring all positions are considered. # Then, use awk to sum the coverage values from the four input files for each genomic position. # The output of unionbedg will have columns: chrom, start, end, coverage_file1, coverage_file2, coverage_file3, coverage_file4. bedtools unionbedg \ -i total_plus_coverage.bedgraph \ total_minus_coverage.bedgraph \ fiveprime_plus_coverage.bedgraph \ fiveprime_minus_coverage.bedgraph \ -g "${GENOME_FILE}" \ | awk 'BEGIN{OFS="\t"}{print $1, $2, $3, $4+$5+$6+$7}' \ > combined_total_and_fiveprime_coverage.bedgraph # Clean up intermediate files (optional) # rm total_plus_coverage.bedgraph total_minus_coverage.bedgraph fiveprime_plus_coverage.bedgraph fiveprime_minus_coverage.bedgraph -
9
For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files.
SHAPE-seq vNot specified (Method-specific implementation)$ Bash example
# Install necessary tools (if not already installed) # conda install -c bioconda samtools bedtools # --- Configuration --- # Placeholder for human genome reference files. # Replace with actual paths to your reference genome and annotation. GENOME_FASTA="/path/to/human_genome.fa" # e.g., hg38 GENOME_GTF="/path/to/gencode.vXX.annotation.gtf" # e.g., GENCODE v44 GENOME_SIZES="/path/to/human_genome.sizes" # Chromosome sizes file (e.g., from ucsc-tools fetchChromSizes) # Input BAM files for different SHAPE libraries (assuming these are already aligned) MINUS_PROTEIN_BAM="sample_minus_protein.bam" PLUS_PROTEIN_BAM="sample_plus_protein.bam" UNTREATED_BAM="sample_untreated.bam" OUTPUT_DIR="shape_reactivity_output" mkdir -p "${OUTPUT_DIR}" # --- Step 1: Calculate total read coverage for each library --- # This provides the denominator for the (5'end coverage / total read coverage) ratio. echo "Calculating total read coverage..." samtools depth -a "${MINUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/minus_protein.total_coverage.tsv" samtools depth -a "${PLUS_PROTEIN_BAM}" > "${OUTPUT_DIR}/plus_protein.total_coverage.tsv" samtools depth -a "${UNTREATED_BAM}" > "${OUTPUT_DIR}/untreated.total_coverage.tsv" # --- Step 2: Calculate 5'end coverage and perform normalization --- # The calculation of 5'end coverage and the subsequent normalization steps # (fSHAPE, in vitro icSHAPE, in vivo icSHAPE reactivity) are highly specific # to the SHAPE-seq protocol and typically implemented via custom scripts # (e.g., Python or R) that process the aligned BAM files and reference annotations. # There isn't a single, universally standardized command-line tool for these # specific reactivity calculations and normalizations as described. echo "Calculating 5'end coverage and normalizing reactivities using a custom script..." echo "This step involves:" echo " 1. Extracting 5' end positions from aligned reads in BAM files for each human transcript." echo " 2. Calculating (5'end coverage / total read coverage) at each nucleotide." echo " 3. Normalizing these ratios as follows:" echo " - fSHAPE reactivity: Normalized between -protein and +protein SHAPE libraries." echo " - in vitro icSHAPE reactivity: -protein SHAPE libraries normalized against untreated SHAPE libraries." echo " - in vivo icSHAPE reactivity: +protein SHAPE libraries normalized against untreated SHAPE libraries." echo "Outputs will be generated as: ${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz, ${OUTPUT_DIR}/in_vitro_icshape.tar.gz, ${OUTPUT_DIR}/in_vivo_icshape.tar.gz" # Placeholder for a conceptual custom script call. # A real implementation would require a specific script (e.g., Python) that takes # the BAM files, total coverage files, and GTF annotation to perform the detailed # per-nucleotide calculations and normalizations. # # Example of how such a custom script might be invoked: # python /path/to/custom_shape_seq_processor.py \ # --minus_protein_bam "${MINUS_PROTEIN_BAM}" \ # --plus_protein_bam "${PLUS_PROTEIN_BAM}" \ # --untreated_bam "${UNTREATED_BAM}" \ # --minus_protein_total_coverage "${OUTPUT_DIR}/minus_protein.total_coverage.tsv" \ # --plus_protein_total_coverage "${OUTPUT_DIR}/plus_protein.total_coverage.tsv" \ # --untreated_total_coverage "${OUTPUT_DIR}/untreated.total_coverage.tsv" \ # --genome_gtf "${GENOME_GTF}" \ # --output_fshape "${OUTPUT_DIR}/fshape_reactivity.maps.tar.gz" \ # --output_invitro "${OUTPUT_DIR}/in_vitro_icshape.tar.gz" \ # --output_invivo "${OUTPUT_DIR}/in_vivo_icshape.tar.gz" -
10
SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows:
$ Bash example
# This script executes the Yeo Lab eCLIP CWL workflow for processing eCLIP and its variants like SHAPE-eCLIP and fSHAPE-eCLIP. # The workflow integrates tools for adapter trimming, alignment, deduplication, peak calling (CLIPper), and IDR (merge_peaks). # Prerequisites: # 1. cwltool: Install with `pip install cwltool` # 2. Docker or Singularity: Required by cwltool to run containerized tools within the workflow. # 3. Clone the eCLIP workflow repository: # git clone https://github.com/yeolab/eclip.git # cd eclip # Reference Data Preparation (example for human hg38): # - Genome FASTA (e.g., hg38.fa) # - Genome GTF (e.g., hg38.gtf) # - STAR index for the genome (build using STAR --runMode genomeGenerate) # - rRNA FASTA (e.g., hg38_rRNA.fa) # - STAR index for rRNA (build using STAR --runMode genomeGenerate) # - Adapter FASTA (e.g., adapters.fa, containing common sequencing adapters) # - Blacklist BED file (e.g., ENCODE hg38 blacklist) # Example command to run the eCLIP CWL workflow: # Replace placeholder paths and values with your actual data and desired settings. # This command assumes you are in the 'eclip' directory after cloning the repository. cwltool eclip.cwl \ --fastq_files "path/to/your_sample_R1.fastq.gz" \ --genome_fasta "path/to/reference/hg38.fa" \ --genome_gtf "path/to/reference/hg38.gtf" \ --star_index "path/to/reference/star_index_hg38/" \ --blacklist_bed "path/to/reference/hg38_blacklist.bed" \ --rRNA_fasta "path/to/reference/rRNA/hg38_rRNA.fa" \ --rRNA_index "path/to/reference/rRNA/star_index_rRNA/" \ --adapter_fasta "path/to/adapters.fa" \ --umi_length 4 \ --min_read_length 18 \ --min_mapq 20 \ --output_dir "eclip_analysis_output" \ --sample_name "SHAPE_eCLIP_Sample" \ --threads 8 \ --memory "32G" -
11
Full code at https://github.com/meracorley/f-SHAPE-eCLIP
$ Bash example
# Clone the f-SHAPE-eCLIP pipeline repository git clone https://github.com/meracorley/f-SHAPE-eCLIP.git cd f-SHAPE-eCLIP # --- Installation (commented out as per instructions) --- # # Install Snakemake if not already installed # conda install -c bioconda -c conda-forge snakemake # # Create and activate the conda environment for the pipeline # # This environment contains tools like cutadapt, STAR, samtools, picard, bedtools, clipper, etc. # conda env create -f environment.yaml # conda activate fshape_eclip # --- Configuration --- # # Configure the config.yaml file with your specific input files, # # output directories, and reference genome paths. # # Example configuration for hg38 reference genome: # # (Modify config.yaml directly or create a new config file) # # # # Example snippet from config.yaml: # # genome_fasta: "path/to/your/resources/hg38.fa" # # gtf: "path/to/your/resources/gencode.v38.annotation.gtf" # # star_index: "path/to/your/resources/STAR_index_hg38" # # chrom_sizes: "path/to/your/resources/hg38.chrom.sizes" # # # # Placeholder for downloading reference genome (hg38) if not available: # # mkdir -p resources # # wget -P resources http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # # gunzip resources/hg38.fa.gz # # wget -P resources https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz # # gunzip resources/gencode.v38.annotation.gtf.gz # # # # # Build STAR index (if not pre-built) # # STAR --runMode genomeGenerate --genomeDir resources/STAR_index_hg38 --genomeFastaFiles resources/hg38.fa --sjdbGTFfile resources/gencode.v38.annotation.gtf --runThreadN 8 # --- Execution --- # # Run the f-SHAPE-eCLIP pipeline using Snakemake # # Adjust --cores based on available resources. # # The pipeline will use the conda environment defined in environment.yaml # # and configured in config.yaml. snakemake --use-conda --cores 8
-
12
eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input)
$ Bash example
# Install STAR (example using Conda) # conda install -c bioconda star # --- Reference Data Setup (example for GRCh38) --- # This step is typically performed once to build the STAR genome index. # Replace '/path/to/genome_index', '/path/to/GRCh38.primary_assembly.fa', and '/path/to/gencode.v38.annotation.gtf' with actual paths. # mkdir -p /path/to/genome_index # STAR --runThreadN 8 \ # --runMode genomeGenerate \ # --genomeDir /path/to/genome_index \ # --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \ # --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \ # --sjdbOverhang 100 # Recommended value is (ReadLength - 1) # --- STAR Alignment Command --- # This command aligns sequencing reads to a reference genome using STAR. # It assumes a pre-built STAR genome index is available at '/path/to/genome_index'. # 'input_R1.fastq.gz' and 'input_R2.fastq.gz' are placeholder names for your input FASTQ files (paired-end example). # For single-end reads, remove the second FASTQ file from --readFilesIn. # Output BAM file will be 'aligned_reads_Aligned.sortedByCoord.out.bam'. STAR --genomeDir /path/to/genome_index \ --readFilesIn input_R1.fastq.gz input_R2.fastq.gz \ --runThreadN 8 \ --outFileNamePrefix aligned_reads_ \ --outSAMtype BAM SortedByCoordinate \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB) -
13
Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads.
samtools and parse_mpileup.py (Inferred with models/gemini-2.5-flash) vsamtools 1.15.1, parse_mpileup.py from skipper workflow (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.15.1 # Download the parse_mpileup.py script from the Yeo lab skipper workflow. # This script is essential for parsing samtools mpileup output to count mutations and coverage. # wget https://raw.githubusercontent.com/yeolab/skipper/master/scripts/parse_mpileup.py # chmod +x parse_mpileup.py # Define input and reference files DEDUP_BAM="input_deduplicated.bam" # Path to your de-duplicated BAM file REFERENCE_FASTA="GRCh38.p14.genome.fa" # Placeholder: Path to your reference genome FASTA file (e.g., GRCh38.p14) OUTPUT_MUTATIONS_TSV="output_mutations_coverage.tsv.gz" # Define parameters (example values commonly used in Yeo lab eCLIP pipeline) MIN_BASE_QUALITY=20 # Minimum base quality to consider a base MIN_MAPPING_QUALITY=10 # Minimum mapping quality to consider a read MAX_DEPTH=1000000 # Maximum per-BAM depth to avoid truncation for high coverage data # Execute samtools mpileup to generate pileup data and pipe it to parse_mpileup.py # The parse_mpileup.py script then processes this data to sum coverage and mutation counts at each position. samtools mpileup \ -q ${MIN_BASE_QUALITY} \ -Q ${MIN_MAPPING_QUALITY} \ -d ${MAX_DEPTH} \ -A \ -B \ -f ${REFERENCE_FASTA} \ ${DEDUP_BAM} \ | python parse_mpileup.py --output ${OUTPUT_MUTATIONS_TSV} -
14
For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files.
$ Bash example
# Define input files and reference data # Placeholder for human reference genome (e.g., GRCh38) REF_GENOME="/path/to/human_genome/GRCh38.primary_assembly.genome.fa" # Placeholder for human transcript annotation (e.g., Gencode v38) TRANSCRIPT_ANNOTATION="/path/to/human_genome/gencode.v38.annotation.gtf" # Input BAM files for -protein and +protein libraries (for fSHAPE) NEG_PROTEIN_BAM="/path/to/eclip_neg_protein_sample.bam" POS_PROTEIN_BAM="/path/to/eclip_pos_protein_sample.bam" # Input BAM files for treated and untreated libraries (for SHAPE) TREATED_BAM="/path/to/treated_sample.bam" UNTREATED_BAM="/path/to/untreated_sample.bam" # Output directory OUTPUT_DIR="./shape_reactivity_maps" mkdir -p ${OUTPUT_DIR} # Ensure BAM files are indexed if not already # samtools index ${NEG_PROTEIN_BAM} # samtools index ${POS_PROTEIN_BAM} # samtools index ${TREATED_BAM} # samtools index ${UNTREATED_BAM} # Placeholder for a custom script that calculates SHAPE/fSHAPE reactivity. # This script would typically: # 1. Process aligned reads (BAM files) to identify mutations or RT stops. # 2. Calculate mutation coverage / total read coverage at each nucleotide. # 3. Normalize reactivities between the specified control and experimental libraries. # 4. Generate per-nucleotide reactivity maps for each transcript. # 5. Archive the results into the specified .tar.gz files. # Example command for fSHAPE calculation (comparing -protein vs +protein libraries) # The actual script name and parameters would depend on the custom implementation. # This command generates 'fshape_clip_maps.tar.gz' python /path/to/custom_shape_analysis_script.py \ --reference ${REF_GENOME} \ --annotation ${TRANSCRIPT_ANNOTATION} \ --control_bam ${NEG_PROTEIN_BAM} \ --experimental_bam ${POS_PROTEIN_BAM} \ --output_prefix ${OUTPUT_DIR}/fshape_output \ --output_tar_gz ${OUTPUT_DIR}/fshape_clip_maps.tar.gz \ --mode fSHAPE_protein_footprinting # Example command for SHAPE calculation (comparing treated vs untreated libraries) # This command generates 'shapeclip_maps.tar.gz' (or similar, matching the pattern *shapeclip_maps.tar.gz) python /path/to/custom_shape_analysis_script.py \ --reference ${REF_GENOME} \ --annotation ${TRANSCRIPT_ANNOTATION} \ --control_bam ${UNTREATED_BAM} \ --experimental_bam ${TREATED_BAM} \ --output_prefix ${OUTPUT_DIR}/shape_output \ --output_tar_gz ${OUTPUT_DIR}/shapeclip_maps.tar.gz \ --mode SHAPE_treatment_effect
Raw Source Text
library strategy: footprinting and selective 2â-hydroxyl acylation analyzed by primer extension (fSHAPE) SHAPE libraries as follows: Full code at https://github.com/meracorley/fSHAPE zcat "$file".fastq.gz | umi_tools extract --bc-pattern=NNNNCCCCNNNNN > "$file".umi.fastq cutadapt -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG --minimum-length=23 --overlap=5 -o "$file".trim.fastq "$file".umi.fastq STAR --runThreadN 8 --genomeDir /projects/ps-yeolab4/genomes/GRCh38/star/ --readFilesIn "$file".trim.fastq --outFileNamePrefix "$file" --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --outSAMtype BAM SortedByCoordinate umitools dedup -I "$file"Aligned.sortedByCoord.out.bam --spliced-is-unique -S "$file".dedup.bam Total coverage (bedtools genomecov -split -strand + -dz; bedtools genomecov -split -strand - -dz) and 5' end coverage (bedtools genomecov -5 -strand + -dz; bedtools genomecov -5 -strand - -dz) summed at each position in the transcriptome with de-duplicated reads. For each human transcript with read coverage, (5'end coverage / total read coverage) calculated at each position and normalized between -protein and +protein SHAPE libraries to calculate "fSHAPE reactivity" at each nucleotide in *maps.tar.gz files. -protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vitro icSHAPE reactivity" at each nucleotide in *vitro_icshape.tar.gz files. +protein SHAPE libraries normalized against untreated SHAPE libraries to produce "in vivo icSHAPE reactivity" at each nucleotide in *vivo_icshape.tar.gz files. SHAPE-eCLIP and fSHAPE-eCLIP libraries as follows: Full code at https://github.com/meracorley/f-SHAPE-eCLIP eCLIP pipeline (reads mapped with bowtie and STAR, de-duplicated, peaks called for regions with read pileup in IP samples relative to Input) Total coverage and mutation counts summed at each position in the transcriptome with de-duplicated reads. For each human transcript with read coverage, (mutation coverage / total read coverage) calculated at each position and normalized between -protein and +protein libraries, or treated and untreated libraries, to calculate "fSHAPE" or "SHAPE" reactivity at each nucleotide in fshape_clip_maps.tar.gz and *shapeclip_maps.tar.gz files. Genome_build: Hg38 for icSHAPE and fSHAPE, Hg19 for SHAPE-eCLIP and fSHAPE-eCLIP Supplementary_files_format_and_content: Tab-delimited .map files for multiple human transcripts: Nucleotide number in transcript, Average SHAPE or fSHAPE reactivity, Standard deviation, Base. Tab-delimited .rx files for multiple human transcripts: SHAPE or fSHAPE reactivity rep1, SHAPE or fSHAPE reactivity rep2.