GSE118347 Processing Pipeline
Publication
Gain-of-function cardiomyopathic mutations in RBM20 rewire splicing regulation and re-distribute ribonucleoprotein granules within processing bodies.Nature communications (2021) — PMID 34732726
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: PAR-CLIP
PAR-CLIP vv1.0$ Bash example
# Install PARalyzer (if not already installed) # Download from SourceForge: https://sourceforge.net/projects/paralyzer/ # wget https://sourceforge.net/projects/paralyzer/files/PARalyzer_v1.0.tar.gz # tar -xzf PARalyzer_v1.0.tar.gz # cd PARalyzer_v1.0 # Ensure 'perl' is installed and PARalyzer.pl is executable and in your PATH or specified by full path. # Placeholder for reference genome FASTA file (e.g., human hg38) GENOME_FASTA="/path/to/reference/genome/hg38.fa" # Placeholder for input aligned reads in BED format. # This file typically results from aligning PAR-CLIP reads (e.g., with STAR) and converting the BAM to BED. # Example command to generate this (assuming STAR alignment and samtools/bedtools): # STAR --runThreadN 8 --genomeDir /path/to/genome/index/hg38 --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \ # --outFileNamePrefix sample_parclip. --outSAMtype BAM SortedByCoordinate # samtools view -b -F 0x4 sample_parclip.Aligned.sortedByCoord.out.bam | bedtools bamtobed -i stdin > aligned_reads.bed INPUT_BED_FILE="aligned_reads.bed" # Output directory for PARalyzer results OUTPUT_DIR="paralyzer_output" mkdir -p ${OUTPUT_DIR} # Create a minimal PARalyzer configuration file. # This file defines parameters for the analysis, including genome path, input files, and thresholds. # Real-world usage may require more detailed and specific configuration based on experimental design. CONFIG_FILE="paralyzer_config.txt" echo "[General]" > ${CONFIG_FILE} echo "genome_fasta = ${GENOME_FASTA}" >> ${CONFIG_FILE} echo "output_dir = ${OUTPUT_DIR}" >> ${CONFIG_FILE} echo "[Sample1]" >> ${CONFIG_FILE} echo "bed_file = ${INPUT_BED_FILE}" >> ${CONFIG_FILE} echo "name = sample_parclip" >> ${CONFIG_FILE} echo "min_read_length = 15" >> ${CONFIG_FILE} # Minimum length of reads to consider echo "min_cluster_size = 5" >> ${CONFIG_FILE} # Minimum number of reads to form a cluster (peak) echo "min_t_to_c_ratio = 0.1" >> ${CONFIG_FILE} # Minimum ratio of T-to-C mutations within a cluster echo "min_t_to_c_count = 2" >> ${CONFIG_FILE} # Minimum number of T-to-C mutations within a cluster # Execute PARalyzer using the generated configuration file. # The '-o' flag specifies the prefix for output files. perl PARalyzer.pl -c ${CONFIG_FILE} -o ${OUTPUT_DIR}/sample_parclip_results -
2
Sequencing data was converted using the bcl2fastq tool from the Illumina CASAVA package, followed by demultiplexing and 3â adapter trimming using flexbar and Cutadapt .
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt # Define input and output files (placeholders) INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="output.fastq.gz" # Define the 3' adapter sequence (common Illumina adapter) ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Execute cutadapt for 3' adapter trimming cutadapt -a "${ADAPTER_SEQUENCE}" \ -o "${OUTPUT_FASTQ}" \ --minimum-length 18 \ "${INPUT_FASTQ}" -
3
To remove PCR duplicates, reads with identical sequence were collapsed, then we used FASTX Trimmer tool to remove random nucleotides from both ends of collapsed reads.
$ Bash example
# Install FASTX-Toolkit # conda install -c bioconda fastx_toolkit # Assuming input is a FASTQ file, convert to FASTA for fastx_collapser fastq_to_fasta -i input.fastq -o input.fasta # Collapse reads with identical sequences to remove PCR duplicates fastx_collapser -i input.fasta -o collapsed.fasta # Remove random nucleotides from both ends of the collapsed reads # Example parameters based on eCLIP workflows: trim 5 bases from the 5' end and 5 bases from the 3' end fastx_trimmer -f 5 -l 5 -i collapsed.fasta -o trimmed.fasta
-
4
Reads shorter than 14 nucleotides were discarded.
fastp (Inferred with models/gemini-2.5-flash) v0.23.4 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install fastp (e.g., via conda) # conda install -c bioconda fastp # Example for single-end reads: # Replace input.fastq.gz and output.fastq.gz with your actual file names fastp \ --in1 input.fastq.gz \ --out1 output.fastq.gz \ --length_required 14 \ --json fastp.json \ --html fastp.html # Example for paired-end reads (uncomment and adjust as needed): # fastp \ # --in1 input_R1.fastq.gz \ # --in2 input_R2.fastq.gz \ # --out1 output_R1.fastq.gz \ # --out2 output_R2.fastq.gz \ # --length_required 14 \ # --json fastp.json \ # --html fastp.html
-
5
BWA-based PARA-Suite aligner was used to align reads to GRCh38 genome and to a custom database of exon-exon junctions generated from Ensembl 89 transcriptome annotation.
BWA v0.7.17$ Bash example
# Install BWA (e.g., using conda) # conda install -c bioconda bwa # Define paths for reference genomes and input reads GRCH38_GENOME="/path/to/GRCh38/GRCh38.fa" # Source: GRCh38 genome JUNCTION_DB="/path/to/junctions/ensembl_89_junctions.fa" # Source: Custom database from Ensembl 89 transcriptome annotation COMBINED_REFERENCE="combined_grch38_junctions.fa" INPUT_READS="reads.fastq.gz" # Placeholder for input FASTQ reads OUTPUT_SAM="aligned_reads.sam" # Concatenate GRCh38 and exon-exon junction sequences into a single reference # This approach allows BWA to align reads against both simultaneously. cat "${GRCH38_GENOME}" "${JUNCTION_DB}" > "${COMBINED_REFERENCE}" # Index the combined reference genome for BWA bwa index "${COMBINED_REFERENCE}" # Align reads using BWA MEM to the combined reference # -t: Number of threads to use (adjust as needed for your system) # -M: Mark shorter split hits as secondary (recommended for Picard compatibility) bwa mem -t 8 "${COMBINED_REFERENCE}" "${INPUT_READS}" > "${OUTPUT_SAM}" -
6
We handled multi-mapping reads by keeping only those that mapped once with at least one T-C mismatch: mapping positions indicating T-C transitions were chosen as the correct ones.
Custom script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# Define input and output file paths INPUT_BAM="aligned_reads.bam" # Input BAM file, potentially containing multi-mapping reads OUTPUT_BAM="filtered_unique_tc_reads.bam" # Output BAM file with filtered reads REFERENCE_FASTA="/path/to/your/reference_genome.fa" # Path to the reference genome FASTA file (e.g., GRCh38.p14.genome.fa) # This step requires a custom script to implement the described logic: # 1. Read the input BAM file. # 2. For each read, identify if it is uniquely mapped (e.g., by checking the 'NH' tag for number of hits, or a high MAPQ score). # 3. For uniquely mapped reads, parse the alignment (CIGAR string and MD tag) to identify mismatches. # 4. Compare the read sequence at mismatch positions with the reference genome sequence # to determine if the mismatch represents a T-C transition (reference T, read C). # 5. Keep only those uniquely mapped reads that contain at least one T-C mismatch. # Example conceptual command for a Python script that would perform this filtering. # This script would need to be developed and would typically use libraries like pysam and pyfaidx. python filter_multi_mapping_by_tc_mismatch.py \ --input_bam "$INPUT_BAM" \ --output_bam "$OUTPUT_BAM" \ --reference_fasta "$REFERENCE_FASTA" \ --min_mapq 30 # Example parameter for mapping quality threshold for unique mapping -
7
Alignment files from each pair of replicates were combined and T-C transition were called using Bmix tool.
Bmix vNot specified, used within Yeo lab eCLIP pipeline (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install Miniconda or Anaconda if not already installed # conda create -n bmix_env python=3.8 # conda activate bmix_env # Install bmix Python package # pip install bmix # Clone the Yeo lab eCLIP pipeline to obtain the bmix_call_transitions.py script # git clone https://github.com/yeolab/eclip.git # Download reference genome (e.g., hg38) and chromosome sizes # mkdir -p reference_data # wget -O reference_data/hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # gunzip reference_data/hg38.fa.gz # wget -O reference_data/hg38.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # Define input and reference files INPUT_BAM="combined_replicates.bam" # Placeholder for the combined alignment file GENOME_FASTA="reference_data/hg38.fa" # Placeholder for the reference genome FASTA CHROM_SIZES="reference_data/hg38.chrom.sizes" # Placeholder for the chromosome sizes file OUTPUT_PREFIX="tc_transitions" # Run the bmix_call_transitions.py script from the eCLIP pipeline to call T-C transitions # Adjust the path to bmix_call_transitions.py if 'eclip' directory is not in the current working directory python eclip/scripts/bmix_call_transitions.py \ --bam "${INPUT_BAM}" \ --genome "${GENOME_FASTA}" \ --output "${OUTPUT_PREFIX}" \ --chrom_sizes "${CHROM_SIZES}" \ --min_read_count 5 \ --min_base_qual 20 \ --min_map_qual 20 \ --min_prob 0.8 \ --strand_specific -
8
Only transitions supported by at least two reads from both replicates were kept.
$ Bash example
# This is a conceptual implementation based on the description. # The actual input file format and column indices would depend on the upstream steps. # Assuming an input file 'transitions_with_counts.tsv' where each row represents a transition # and includes read counts for at least two replicates. For example: # transition_id\treplicate1_reads\treplicate2_reads\t... # chr1:100:A>G\t5\t3\t... # chr2:200:C>T\t1\t2\t... # chr3:300:G>A\t3\t1\t... # chr4:400:T>C\t2\t2\t... INPUT_FILE="transitions_with_counts.tsv" OUTPUT_FILE="filtered_transitions.tsv" MIN_READS_PER_REPLICATE=2 # Using awk to filter based on the specified criteria. # This command assumes that the read counts for replicate 1 are in the 2nd column ($2) # and for replicate 2 are in the 3rd column ($3). Adjust column indices as needed # based on the actual input file format. awk -v min_reads="$MIN_READS_PER_REPLICATE" 'NR==1 || ($2 >= min_reads && $3 >= min_reads)' "$INPUT_FILE" > "$OUTPUT_FILE"
-
9
Annotation of T-C transitions with respect to their position within Ensembl 89 gene bodies was performed using BEDTools intersect tool.
$ Bash example
# Install bedtools (if not already installed) # conda install -c bioconda bedtools # Placeholder for T-C transitions file (e.g., a BED file of variant positions or specific sites) # This file should contain the genomic coordinates of the T-C transitions. TC_TRANSITIONS_BED="tc_transitions.bed" # Placeholder for Ensembl 89 gene bodies file. # This file can be generated by downloading the Ensembl 89 GTF for the relevant species # (e.g., Homo_sapiens.GRCh38.89.gtf.gz from ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/) # and then extracting gene features and converting them to BED format. # Example conversion (simplified): # wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz # gunzip Homo_sapiens.GRCh38.89.gtf.gz # grep -P "\tgene\t" Homo_sapiens.GRCh38.89.gtf | awk 'BEGIN{OFS="\t"} {print $1, $4-1, $5, $10, $6, $7}' | sed 's/;//g' | sed 's/"//g' > ensembl_89_gene_bodies.bed ENSEMBL_89_GENE_BODIES_BED="ensembl_89_gene_bodies.bed" # Output file for annotated T-C transitions OUTPUT_ANNOTATED_BED="tc_transitions_annotated_ensembl89.bed" # Perform annotation using bedtools intersect. # -a: Input file A (T-C transitions) # -b: Input file B (Ensembl 89 gene bodies) # -wao: Write the original entry in A, the original entry in B, and the number of overlapping bases. # If there is no overlap, a "." is written for the B entry and 0 for the overlap count. # This allows for annotating each T-C transition with gene body information if it overlaps. bedtools intersect -a "${TC_TRANSITIONS_BED}" -b "${ENSEMBL_89_GENE_BODIES_BED}" -wao > "${OUTPUT_ANNOTATED_BED}" -
10
Each column reports the fraction of T-C transitions in each gene body region that was calculated by dividing the number of T converted to C by the total number of T in that region.
Custom script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# Install necessary tools (example) # conda install -c bioconda samtools bedtools python pysam pyfaidx pandas # Define input files and parameters # Placeholder for reference genome (e.g., GRCh38) # Download from NCBI or UCSC if not available locally # wget -O GRCh38.p13.genome.fa.gz "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GRCh38.p13_GCA_000001405.28/GCF_000001405.28_GRCh38.p13_genomic.fna.gz" # gunzip GRCh38.p13.genome.fa.gz # samtools faidx GRCh38.p13.genome.fa REFERENCE_FASTA="GRCh38.p13.genome.fa" # Placeholder for gene body regions (e.g., from GENCODE or RefSeq) # This would typically be a BED file containing chromosome, start, end, and gene name. # Example: A BED file of protein-coding gene bodies. GENE_BODY_BED="gencode.v38.gene_bodies.bed" # Example, needs to be generated or downloaded # Input BAM file (aligned reads) INPUT_BAM="aligned_reads.bam" # Output file OUTPUT_TSV="gene_body_t_to_c_fractions.tsv" # --- Conceptual Python script for T->C transition calculation --- # This script iterates through gene body regions, # fetches reference sequence, identifies 'T' bases, # and then uses pysam to query aligned reads at those positions # to count 'C' bases and total coverage. # The full script is not provided here, but its execution is simulated. # Create a dummy Python script for demonstration purposes # In a real scenario, this script would be pre-written and robust. cat << 'EOF' > calculate_t_to_c_transitions.py import pysam import pyfaidx import pandas as pd import argparse def calculate_t_to_c_fractions(bam_file, ref_file, bed_file, output_file): samfile = pysam.AlignmentFile(bam_file, 'rb') genome = pyfaidx.Fasta(ref_file) results = [] with open(bed_file, 'r') as f: for line in f: parts = line.strip().split('\t') if len(parts) < 4: # Basic check for valid BED format continue chrom, start_str, end_str, name = parts[0], parts[1], parts[2], parts[3] try: start, end = int(start_str), int(end_str) except ValueError: continue # Skip malformed lines total_ref_T_in_region = 0 total_T_to_C_reads_in_region = 0 total_T_covered_reads_in_region = 0 # Iterate through each position in the gene body region for pos in range(start, end): try: ref_base = genome[chrom][pos].seq.upper() except KeyError: # Chromosome not found in reference continue if ref_base == 'T': total_ref_T_in_region += 1 # Fetch reads covering this specific 'T' position # min_base_quality=0 to include all bases regardless of quality # ignore_overlaps=False to count all reads covering the position for pileupcolumn in samfile.pileup(chrom, pos, pos + 1, stepper='all', min_base_quality=0, ignore_overlaps=False): if pileupcolumn.pos == pos: # Ensure we are at the correct position for pileupread in pileupcolumn.pileups: if not pileupread.is_del and not pileupread.is_refskip and pileupread.query_position is not None: read_base = pileupread.alignment.query_sequence[pileupread.query_position].upper() total_T_covered_reads_in_region += 1 if read_base == 'C': total_T_to_C_reads_in_region += 1 fraction = 0.0 if total_T_covered_reads_in_region > 0: fraction = total_T_to_C_reads_in_region / total_T_covered_reads_in_region results.append({'gene_region': name, 'fraction_T_to_C': fraction}) df = pd.DataFrame(results) df.to_csv(output_file, sep='\t', index=False) samfile.close() genome.close() if __name__ == '__main__': parser = argparse.ArgumentParser(description='Calculate T->C transition fractions in gene body regions.') parser.add_argument('--bam', required=True, help='Path to the input BAM file.') parser.add_argument('--reference', required=True, help='Path to the reference FASTA file.') parser.add_argument('--regions', required=True, help='Path to the gene body regions BED file.') parser.add_argument('--output', required=True, help='Path to the output TSV file.') args = parser.parse_args() calculate_t_to_c_fractions(args.bam, args.reference, args.regions, args.output) EOF # Execute the Python script python calculate_t_to_c_transitions.py \ --bam "$INPUT_BAM" \ --reference "$REFERENCE_FASTA" \ --regions "$GENE_BODY_BED" \ --output "$OUTPUT_TSV"
Tools Used
Raw Source Text
Library strategy: PAR-CLIP Sequencing data was converted using the bcl2fastq tool from the Illumina CASAVA package, followed by demultiplexing and 3â adapter trimming using flexbar and Cutadapt . To remove PCR duplicates, reads with identical sequence were collapsed, then we used FASTX Trimmer tool to remove random nucleotides from both ends of collapsed reads. Reads shorter than 14 nucleotides were discarded. BWA-based PARA-Suite aligner was used to align reads to GRCh38 genome and to a custom database of exon-exon junctions generated from Ensembl 89 transcriptome annotation. We handled multi-mapping reads by keeping only those that mapped once with at least one T-C mismatch: mapping positions indicating T-C transitions were chosen as the correct ones. Alignment files from each pair of replicates were combined and T-C transition were called using Bmix tool. Only transitions supported by at least two reads from both replicates were kept. Annotation of T-C transitions with respect to their position within Ensembl 89 gene bodies was performed using BEDTools intersect tool. Each column reports the fraction of T-C transitions in each gene body region that was calculated by dividing the number of T converted to C by the total number of T in that region. Genome_build: hg19 Supplementary_files_format_and_content: tab separated values