GSE118347 Processing Pipeline

OTHER code_examples 10 steps

Publication

Gain-of-function cardiomyopathic mutations in RBM20 rewire splicing regulation and re-distribute ribonucleoprotein granules within processing bodies.

Nature communications (2021) — PMID 34732726

Dataset

GSE118347

Mutant FUS and ELAVL4 (HuD) aberrant crosstalk in Amyotrophic Lateral Sclerosis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: PAR-CLIP

    $ Bash example
    # Install PARalyzer (if not already installed)
    # Download from SourceForge: https://sourceforge.net/projects/paralyzer/
    # wget https://sourceforge.net/projects/paralyzer/files/PARalyzer_v1.0.tar.gz
    # tar -xzf PARalyzer_v1.0.tar.gz
    # cd PARalyzer_v1.0
    # Ensure 'perl' is installed and PARalyzer.pl is executable and in your PATH or specified by full path.
    
    # Placeholder for reference genome FASTA file (e.g., human hg38)
    GENOME_FASTA="/path/to/reference/genome/hg38.fa"
    
    # Placeholder for input aligned reads in BED format.
    # This file typically results from aligning PAR-CLIP reads (e.g., with STAR) and converting the BAM to BED.
    # Example command to generate this (assuming STAR alignment and samtools/bedtools):
    # STAR --runThreadN 8 --genomeDir /path/to/genome/index/hg38 --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    #      --outFileNamePrefix sample_parclip. --outSAMtype BAM SortedByCoordinate
    # samtools view -b -F 0x4 sample_parclip.Aligned.sortedByCoord.out.bam | bedtools bamtobed -i stdin > aligned_reads.bed
    INPUT_BED_FILE="aligned_reads.bed"
    
    # Output directory for PARalyzer results
    OUTPUT_DIR="paralyzer_output"
    mkdir -p ${OUTPUT_DIR}
    
    # Create a minimal PARalyzer configuration file.
    # This file defines parameters for the analysis, including genome path, input files, and thresholds.
    # Real-world usage may require more detailed and specific configuration based on experimental design.
    CONFIG_FILE="paralyzer_config.txt"
    
    echo "[General]" > ${CONFIG_FILE}
    echo "genome_fasta = ${GENOME_FASTA}" >> ${CONFIG_FILE}
    echo "output_dir = ${OUTPUT_DIR}" >> ${CONFIG_FILE}
    echo "[Sample1]" >> ${CONFIG_FILE}
    echo "bed_file = ${INPUT_BED_FILE}" >> ${CONFIG_FILE}
    echo "name = sample_parclip" >> ${CONFIG_FILE}
    echo "min_read_length = 15" >> ${CONFIG_FILE} # Minimum length of reads to consider
    echo "min_cluster_size = 5" >> ${CONFIG_FILE} # Minimum number of reads to form a cluster (peak)
    echo "min_t_to_c_ratio = 0.1" >> ${CONFIG_FILE} # Minimum ratio of T-to-C mutations within a cluster
    echo "min_t_to_c_count = 2" >> ${CONFIG_FILE} # Minimum number of T-to-C mutations within a cluster
    
    # Execute PARalyzer using the generated configuration file.
    # The '-o' flag specifies the prefix for output files.
    perl PARalyzer.pl -c ${CONFIG_FILE} -o ${OUTPUT_DIR}/sample_parclip_results
  2. 2

    Sequencing data was converted using the bcl2fastq tool from the Illumina CASAVA package, followed by demultiplexing and 3’ adapter trimming using flexbar and Cutadapt .

    cutadapt vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output files (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="output.fastq.gz"
    
    # Define the 3' adapter sequence (common Illumina adapter)
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Execute cutadapt for 3' adapter trimming
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -o "${OUTPUT_FASTQ}" \
             --minimum-length 18 \
             "${INPUT_FASTQ}"
  3. 3

    To remove PCR duplicates, reads with identical sequence were collapsed, then we used FASTX Trimmer tool to remove random nucleotides from both ends of collapsed reads.

    fastx_collapser (Inferred with models/gemini-2.5-flash), fastx_trimmer vFASTX-Toolkit 0.0.14 GitHub
    $ Bash example
    # Install FASTX-Toolkit
    # conda install -c bioconda fastx_toolkit
    
    # Assuming input is a FASTQ file, convert to FASTA for fastx_collapser
    fastq_to_fasta -i input.fastq -o input.fasta
    
    # Collapse reads with identical sequences to remove PCR duplicates
    fastx_collapser -i input.fasta -o collapsed.fasta
    
    # Remove random nucleotides from both ends of the collapsed reads
    # Example parameters based on eCLIP workflows: trim 5 bases from the 5' end and 5 bases from the 3' end
    fastx_trimmer -f 5 -l 5 -i collapsed.fasta -o trimmed.fasta
  4. 4

    Reads shorter than 14 nucleotides were discarded.

    fastp (Inferred with models/gemini-2.5-flash) v0.23.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install fastp (e.g., via conda)
    # conda install -c bioconda fastp
    
    # Example for single-end reads:
    # Replace input.fastq.gz and output.fastq.gz with your actual file names
    fastp \
      --in1 input.fastq.gz \
      --out1 output.fastq.gz \
      --length_required 14 \
      --json fastp.json \
      --html fastp.html
    
    # Example for paired-end reads (uncomment and adjust as needed):
    # fastp \
    #   --in1 input_R1.fastq.gz \
    #   --in2 input_R2.fastq.gz \
    #   --out1 output_R1.fastq.gz \
    #   --out2 output_R2.fastq.gz \
    #   --length_required 14 \
    #   --json fastp.json \
    #   --html fastp.html
  5. 5

    BWA-based PARA-Suite aligner was used to align reads to GRCh38 genome and to a custom database of exon-exon junctions generated from Ensembl 89 transcriptome annotation.

    BWA v0.7.17
    $ Bash example
    # Install BWA (e.g., using conda)
    # conda install -c bioconda bwa
    
    # Define paths for reference genomes and input reads
    GRCH38_GENOME="/path/to/GRCh38/GRCh38.fa" # Source: GRCh38 genome
    JUNCTION_DB="/path/to/junctions/ensembl_89_junctions.fa" # Source: Custom database from Ensembl 89 transcriptome annotation
    COMBINED_REFERENCE="combined_grch38_junctions.fa"
    INPUT_READS="reads.fastq.gz" # Placeholder for input FASTQ reads
    OUTPUT_SAM="aligned_reads.sam"
    
    # Concatenate GRCh38 and exon-exon junction sequences into a single reference
    # This approach allows BWA to align reads against both simultaneously.
    cat "${GRCH38_GENOME}" "${JUNCTION_DB}" > "${COMBINED_REFERENCE}"
    
    # Index the combined reference genome for BWA
    bwa index "${COMBINED_REFERENCE}"
    
    # Align reads using BWA MEM to the combined reference
    # -t: Number of threads to use (adjust as needed for your system)
    # -M: Mark shorter split hits as secondary (recommended for Picard compatibility)
    bwa mem -t 8 "${COMBINED_REFERENCE}" "${INPUT_READS}" > "${OUTPUT_SAM}"
    
  6. 6

    We handled multi-mapping reads by keeping only those that mapped once with at least one T-C mismatch: mapping positions indicating T-C transitions were chosen as the correct ones.

    Custom script (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
        # Define input and output file paths
        INPUT_BAM="aligned_reads.bam" # Input BAM file, potentially containing multi-mapping reads
        OUTPUT_BAM="filtered_unique_tc_reads.bam" # Output BAM file with filtered reads
        REFERENCE_FASTA="/path/to/your/reference_genome.fa" # Path to the reference genome FASTA file (e.g., GRCh38.p14.genome.fa)
    
        # This step requires a custom script to implement the described logic:
        # 1. Read the input BAM file.
        # 2. For each read, identify if it is uniquely mapped (e.g., by checking the 'NH' tag for number of hits, or a high MAPQ score).
        # 3. For uniquely mapped reads, parse the alignment (CIGAR string and MD tag) to identify mismatches.
        # 4. Compare the read sequence at mismatch positions with the reference genome sequence
        #    to determine if the mismatch represents a T-C transition (reference T, read C).
        # 5. Keep only those uniquely mapped reads that contain at least one T-C mismatch.
    
        # Example conceptual command for a Python script that would perform this filtering.
        # This script would need to be developed and would typically use libraries like pysam and pyfaidx.
        python filter_multi_mapping_by_tc_mismatch.py \
          --input_bam "$INPUT_BAM" \
          --output_bam "$OUTPUT_BAM" \
          --reference_fasta "$REFERENCE_FASTA" \
          --min_mapq 30 # Example parameter for mapping quality threshold for unique mapping
    
  7. 7

    Alignment files from each pair of replicates were combined and T-C transition were called using Bmix tool.

    Bmix vNot specified, used within Yeo lab eCLIP pipeline (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Miniconda or Anaconda if not already installed
    # conda create -n bmix_env python=3.8
    # conda activate bmix_env
    
    # Install bmix Python package
    # pip install bmix
    
    # Clone the Yeo lab eCLIP pipeline to obtain the bmix_call_transitions.py script
    # git clone https://github.com/yeolab/eclip.git
    
    # Download reference genome (e.g., hg38) and chromosome sizes
    # mkdir -p reference_data
    # wget -O reference_data/hg38.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip reference_data/hg38.fa.gz
    # wget -O reference_data/hg38.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    
    # Define input and reference files
    INPUT_BAM="combined_replicates.bam" # Placeholder for the combined alignment file
    GENOME_FASTA="reference_data/hg38.fa" # Placeholder for the reference genome FASTA
    CHROM_SIZES="reference_data/hg38.chrom.sizes" # Placeholder for the chromosome sizes file
    OUTPUT_PREFIX="tc_transitions"
    
    # Run the bmix_call_transitions.py script from the eCLIP pipeline to call T-C transitions
    # Adjust the path to bmix_call_transitions.py if 'eclip' directory is not in the current working directory
    python eclip/scripts/bmix_call_transitions.py \
        --bam "${INPUT_BAM}" \
        --genome "${GENOME_FASTA}" \
        --output "${OUTPUT_PREFIX}" \
        --chrom_sizes "${CHROM_SIZES}" \
        --min_read_count 5 \
        --min_base_qual 20 \
        --min_map_qual 20 \
        --min_prob 0.8 \
        --strand_specific
    
  8. 8

    Only transitions supported by at least two reads from both replicates were kept.

    Custom Script (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # This is a conceptual implementation based on the description.
    # The actual input file format and column indices would depend on the upstream steps.
    
    # Assuming an input file 'transitions_with_counts.tsv' where each row represents a transition
    # and includes read counts for at least two replicates. For example:
    # transition_id\treplicate1_reads\treplicate2_reads\t...
    # chr1:100:A>G\t5\t3\t...
    # chr2:200:C>T\t1\t2\t...
    # chr3:300:G>A\t3\t1\t...
    # chr4:400:T>C\t2\t2\t...
    
    INPUT_FILE="transitions_with_counts.tsv"
    OUTPUT_FILE="filtered_transitions.tsv"
    MIN_READS_PER_REPLICATE=2
    
    # Using awk to filter based on the specified criteria.
    # This command assumes that the read counts for replicate 1 are in the 2nd column ($2)
    # and for replicate 2 are in the 3rd column ($3). Adjust column indices as needed
    # based on the actual input file format.
    awk -v min_reads="$MIN_READS_PER_REPLICATE" 'NR==1 || ($2 >= min_reads && $3 >= min_reads)' "$INPUT_FILE" > "$OUTPUT_FILE"
  9. 9

    Annotation of T-C transitions with respect to their position within Ensembl 89 gene bodies was performed using BEDTools intersect tool.

    bedtools v2.29.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install bedtools (if not already installed)
    # conda install -c bioconda bedtools
    
    # Placeholder for T-C transitions file (e.g., a BED file of variant positions or specific sites)
    # This file should contain the genomic coordinates of the T-C transitions.
    TC_TRANSITIONS_BED="tc_transitions.bed"
    
    # Placeholder for Ensembl 89 gene bodies file.
    # This file can be generated by downloading the Ensembl 89 GTF for the relevant species
    # (e.g., Homo_sapiens.GRCh38.89.gtf.gz from ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/)
    # and then extracting gene features and converting them to BED format.
    # Example conversion (simplified):
    # wget ftp://ftp.ensembl.org/pub/release-89/gtf/homo_sapiens/Homo_sapiens.GRCh38.89.gtf.gz
    # gunzip Homo_sapiens.GRCh38.89.gtf.gz
    # grep -P "\tgene\t" Homo_sapiens.GRCh38.89.gtf | awk 'BEGIN{OFS="\t"} {print $1, $4-1, $5, $10, $6, $7}' | sed 's/;//g' | sed 's/"//g' > ensembl_89_gene_bodies.bed
    ENSEMBL_89_GENE_BODIES_BED="ensembl_89_gene_bodies.bed"
    
    # Output file for annotated T-C transitions
    OUTPUT_ANNOTATED_BED="tc_transitions_annotated_ensembl89.bed"
    
    # Perform annotation using bedtools intersect.
    # -a: Input file A (T-C transitions)
    # -b: Input file B (Ensembl 89 gene bodies)
    # -wao: Write the original entry in A, the original entry in B, and the number of overlapping bases.
    #       If there is no overlap, a "." is written for the B entry and 0 for the overlap count.
    #       This allows for annotating each T-C transition with gene body information if it overlaps.
    bedtools intersect -a "${TC_TRANSITIONS_BED}" -b "${ENSEMBL_89_GENE_BODIES_BED}" -wao > "${OUTPUT_ANNOTATED_BED}"
  10. 10

    Each column reports the fraction of T-C transitions in each gene body region that was calculated by dividing the number of T converted to C by the total number of T in that region.

    Custom script (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # Install necessary tools (example)
    # conda install -c bioconda samtools bedtools python pysam pyfaidx pandas
    
    # Define input files and parameters
    # Placeholder for reference genome (e.g., GRCh38)
    # Download from NCBI or UCSC if not available locally
    # wget -O GRCh38.p13.genome.fa.gz "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GRCh38.p13_GCA_000001405.28/GCF_000001405.28_GRCh38.p13_genomic.fna.gz"
    # gunzip GRCh38.p13.genome.fa.gz
    # samtools faidx GRCh38.p13.genome.fa
    REFERENCE_FASTA="GRCh38.p13.genome.fa"
    
    # Placeholder for gene body regions (e.g., from GENCODE or RefSeq)
    # This would typically be a BED file containing chromosome, start, end, and gene name.
    # Example: A BED file of protein-coding gene bodies.
    GENE_BODY_BED="gencode.v38.gene_bodies.bed" # Example, needs to be generated or downloaded
    
    # Input BAM file (aligned reads)
    INPUT_BAM="aligned_reads.bam"
    
    # Output file
    OUTPUT_TSV="gene_body_t_to_c_fractions.tsv"
    
    # --- Conceptual Python script for T->C transition calculation ---
    # This script iterates through gene body regions,
    # fetches reference sequence, identifies 'T' bases,
    # and then uses pysam to query aligned reads at those positions
    # to count 'C' bases and total coverage.
    # The full script is not provided here, but its execution is simulated.
    
    # Create a dummy Python script for demonstration purposes
    # In a real scenario, this script would be pre-written and robust.
    cat << 'EOF' > calculate_t_to_c_transitions.py
    import pysam
    import pyfaidx
    import pandas as pd
    import argparse
    
    def calculate_t_to_c_fractions(bam_file, ref_file, bed_file, output_file):
        samfile = pysam.AlignmentFile(bam_file, 'rb')
        genome = pyfaidx.Fasta(ref_file)
    
        results = []
    
        with open(bed_file, 'r') as f:
            for line in f:
                parts = line.strip().split('\t')
                if len(parts) < 4: # Basic check for valid BED format
                    continue
                chrom, start_str, end_str, name = parts[0], parts[1], parts[2], parts[3]
                try:
                    start, end = int(start_str), int(end_str)
                except ValueError:
                    continue # Skip malformed lines
    
                total_ref_T_in_region = 0
                total_T_to_C_reads_in_region = 0
                total_T_covered_reads_in_region = 0
    
                # Iterate through each position in the gene body region
                for pos in range(start, end):
                    try:
                        ref_base = genome[chrom][pos].seq.upper()
                    except KeyError: # Chromosome not found in reference
                        continue
    
                    if ref_base == 'T':
                        total_ref_T_in_region += 1
                        # Fetch reads covering this specific 'T' position
                        # min_base_quality=0 to include all bases regardless of quality
                        # ignore_overlaps=False to count all reads covering the position
                        for pileupcolumn in samfile.pileup(chrom, pos, pos + 1, stepper='all', min_base_quality=0, ignore_overlaps=False):
                            if pileupcolumn.pos == pos: # Ensure we are at the correct position
                                for pileupread in pileupcolumn.pileups:
                                    if not pileupread.is_del and not pileupread.is_refskip and pileupread.query_position is not None:
                                        read_base = pileupread.alignment.query_sequence[pileupread.query_position].upper()
                                        total_T_covered_reads_in_region += 1
                                        if read_base == 'C':
                                            total_T_to_C_reads_in_region += 1
                
                fraction = 0.0
                if total_T_covered_reads_in_region > 0:
                    fraction = total_T_to_C_reads_in_region / total_T_covered_reads_in_region
                
                results.append({'gene_region': name, 'fraction_T_to_C': fraction})
    
        df = pd.DataFrame(results)
        df.to_csv(output_file, sep='\t', index=False)
    
        samfile.close()
        genome.close()
    
    if __name__ == '__main__':
        parser = argparse.ArgumentParser(description='Calculate T->C transition fractions in gene body regions.')
        parser.add_argument('--bam', required=True, help='Path to the input BAM file.')
        parser.add_argument('--reference', required=True, help='Path to the reference FASTA file.')
        parser.add_argument('--regions', required=True, help='Path to the gene body regions BED file.')
        parser.add_argument('--output', required=True, help='Path to the output TSV file.')
        args = parser.parse_args()
    
        calculate_t_to_c_fractions(args.bam, args.reference, args.regions, args.output)
    EOF
    
    # Execute the Python script
    python calculate_t_to_c_transitions.py \
      --bam "$INPUT_BAM" \
      --reference "$REFERENCE_FASTA" \
      --regions "$GENE_BODY_BED" \
      --output "$OUTPUT_TSV"

Tools Used

Raw Source Text
Library strategy: PAR-CLIP
Sequencing data was converted using the bcl2fastq tool from the Illumina CASAVA package, followed by demultiplexing and 3’ adapter trimming using flexbar and Cutadapt . To remove PCR duplicates, reads with identical sequence were collapsed, then we used FASTX Trimmer tool to remove random nucleotides from both ends of collapsed reads. Reads shorter than 14 nucleotides were discarded.
BWA-based PARA-Suite aligner was used to align reads to GRCh38 genome and to a custom database of exon-exon junctions generated from Ensembl 89 transcriptome annotation. We handled multi-mapping reads by keeping only those that mapped once with at least one T-C mismatch: mapping positions indicating T-C transitions were chosen as the correct ones.
Alignment files from each pair of replicates were combined and  T-C transition were called using Bmix tool. Only transitions supported by at least two reads from both replicates were kept.
Annotation of T-C transitions with respect to their position within Ensembl 89 gene bodies was performed using BEDTools intersect tool.
Each column reports the fraction of T-C transitions in each gene body region that was calculated by dividing the number of T converted to C by the total number of T in that region.
Genome_build: hg19
Supplementary_files_format_and_content: tab separated values
← Back to Analysis