GSE249246 Processing Pipeline

RIP-Seq, code examples, 3 steps

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular Cell (2024), PMID 39303722

Dataset

GSE249246

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [eCLIP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Step 1

    *library strategy: eCLIP-seq

    $ Bash example
    # This script demonstrates how to run the eCLIP CWL workflow.
    # It assumes 'cwltool' is installed and the 'yeolab/eclip' repository is cloned.
    
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Clone the eCLIP CWL workflow repository if not already done
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Define placeholder paths for input data and reference files.
    # Replace these with your actual data and reference genome paths.
    # For eCLIP, a comprehensive reference genome (e.g., hg38) typically requires:
    # - Genome FASTA file (e.g., hg38.fa)
    # - STAR index built from the genome FASTA
    # - GTF annotation file (e.g., gencode.v38.annotation.gtf)
    # - Blacklist regions BED file
    # - RepeatMasker BED file
    # - Chromosome sizes file
    # - Adapter sequences FASTA file
    
    OUTPUT_DIR="eclip_results"
    mkdir -p "${OUTPUT_DIR}"
    
    # Create a placeholder inputs file for the CWL workflow.
    # The key names below are illustrative, not the workflow's actual input names;
    # a real job file lists every input declared by the workflow definition you run
    # (FASTQ files for replicates and controls, reference files, and parameters).
    cat << 'EOF' > eclip_inputs.yaml
    # Minimal example of an inputs.yaml for the eCLIP CWL workflow
    # Replace with actual file paths and parameters for your experiment.
    # Example for a single replicate and control, paired-end reads:
    fastq_r1_rep1: {class: File, path: "path/to/replicate1_R1.fastq.gz"}
    fastq_r2_rep1: {class: File, path: "path/to/replicate1_R2.fastq.gz"}
    fastq_r1_control: {class: File, path: "path/to/control_R1.fastq.gz"}
    fastq_r2_control: {class: File, path: "path/to/control_R2.fastq.gz"}
    
    # Reference genome components (using hg38 as a placeholder)
    genome_fasta: {class: File, path: "path/to/hg38.fa"}
    genome_star_index_dir: {class: Directory, path: "path/to/hg38_star_index"}
    genome_gtf: {class: File, path: "path/to/gencode.v38.annotation.gtf"}
    blacklist_bed: {class: File, path: "path/to/hg38_blacklist.bed"}
    repeatmasker_bed: {class: File, path: "path/to/hg38_repeatmasker.bed"}
    chrom_sizes: {class: File, path: "path/to/hg38.chrom.sizes"}
    adapter_fasta: {class: File, path: "path/to/adapters.fa"}
    
    # Pipeline parameters
    output_prefix: "my_eclip_experiment"
    threads: 8
    memory: 32G
    # ... other parameters as required by the workflow
    EOF
    
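    # Execute the eCLIP CWL workflow using cwltool.
    # 'eclip/workflow.cwl' is a placeholder path: substitute the workflow definition
    # actually provided in your clone of the repository, and make sure the keys in
    # eclip_inputs.yaml match the inputs that workflow declares.
    cwltool --outdir "${OUTPUT_DIR}" eclip/workflow.cwl eclip_inputs.yaml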
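
Because the job file above is built entirely from placeholder paths, a pre-flight check can catch missing files before a long workflow run starts. The following is a minimal sketch, not part of the published pipeline: the helper names `collect_paths` and `missing_inputs` are hypothetical, and the job keys simply mirror the illustrative ones used above. It relies only on the CWL convention that file inputs are objects with `class: File` or `class: Directory` and a `path`.

```python
import json
from pathlib import Path


def collect_paths(node):
    """Recursively yield the 'path' of every CWL File/Directory entry."""
    if isinstance(node, dict):
        if node.get("class") in ("File", "Directory") and "path" in node:
            yield node["path"]
        for value in node.values():
            yield from collect_paths(value)
    elif isinstance(node, list):
        for item in node:
            yield from collect_paths(item)


def missing_inputs(job):
    """Return the sorted list of referenced paths that do not exist on disk."""
    return sorted(p for p in set(collect_paths(job)) if not Path(p).exists())


if __name__ == "__main__":
    # Hypothetical job object; keys mirror the placeholder inputs above.
    job = {
        "fastq_r1_rep1": {"class": "File", "path": "path/to/replicate1_R1.fastq.gz"},
        "genome_star_index_dir": {"class": "Directory", "path": "path/to/hg38_star_index"},
        "threads": 8,
    }
    for path in missing_inputs(job):
        print(f"missing: {path}")
    # cwltool accepts a JSON job file as well as YAML.
    Path("eclip_inputs.json").write_text(json.dumps(job, indent=2))
```

Running the check before `cwltool` keeps path typos from surfacing only after alignment has already started.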
  2. Step 2

    eCLIP data was analyzed using Skipper with default settings.

    $ Bash example
    # Skipper is distributed as a Snakemake workflow (YeoLab/skipper on GitHub);
    # clone the repository rather than installing a package:
    # git clone https://github.com/YeoLab/skipper.git
    # cd skipper
    
    # Before running, edit the configuration file shipped with the repository so
    # that it points to your sample manifest (IP and input FASTQ files for each
    # replicate), the hg38 genome and STAR index, and the annotation files.
    # The repository README documents the exact variable names and manifest
    # columns; they are not reproduced here. All other parameters are left at
    # their defaults, as stated in the description.
    
    # Dry run first to confirm Snakemake can resolve every target, then execute.
    # 'Skipper.py' is assumed to be the Snakefile name; verify it in your clone.
    # Adjust --cores as needed.
    snakemake -s Skipper.py --cores 8 -n
    snakemake -s Skipper.py --cores 8
  3. Step 3

    The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.

    RangeMatcher (Inferred with models/gemini-2.5-flash) v1.0 GitHub
    $ Bash example
    # Assume input files:
    # orf_data.tsv: ORF_ID<tab>kDa
    # sample_ranges.tsv: Sample_ID<tab>Min_kDa<tab>Max_kDa
    
    # Example content for orf_data.tsv:
    # ORF1    78
    # ORF2    120
    # ORF3    50
    
    # Example content for sample_ranges.tsv:
    # SampleA 75      150
    # SampleB 50      100
    # SampleC 10      70
    # SampleD 100     200
    
    # Create dummy input files for demonstration
    echo -e "ORF1\t78\nORF2\t120\nORF3\t50" > orf_data.tsv
    echo -e "SampleA\t75\t150\nSampleB\t50\t100\nSampleC\t10\t70\nSampleD\t100\t200" > sample_ranges.tsv
    
    # Python script implementing one reading of the matching rule:
    # for each ORF, consider only the sample ranges that contain its kDa value and
    # pick the one whose lower bound (Min_kDa) is closest to the ORF's kDa.
    # ORFs that fall outside every range are reported as No_Match.
    python3 -c "
    def parse_data(filename, delimiter='\t'):
        # Read a delimited file into a list of rows, skipping blank lines.
        with open(filename) as f:
            return [line.rstrip('\n').split(delimiter) for line in f if line.strip()]
    
    def main():
        orfs = parse_data('orf_data.tsv')
        sample_ranges = [(r[0], float(r[1]), float(r[2])) for r in parse_data('sample_ranges.tsv')]
    
        with open('orf_sample_matches.tsv', 'w') as outfile:
            outfile.write('ORF_ID\tORF_kDa\tMatched_Sample_ID\tMatched_Min_kDa\tMatched_Max_kDa\n')
            for row in orfs:
                orf_id, orf_kDa = row[0], float(row[1])
                # Ranges containing the ORF; the largest lower bound is the closest start.
                containing = [r for r in sample_ranges if r[1] <= orf_kDa <= r[2]]
                if containing:
                    sample_id, min_kDa, max_kDa = max(containing, key=lambda r: r[1])
                    outfile.write(f'{orf_id}\t{orf_kDa}\t{sample_id}\t{min_kDa}\t{max_kDa}\n')
                else:
                    outfile.write(f'{orf_id}\t{orf_kDa}\tNo_Match\tN/A\tN/A\n')
    
    if __name__ == '__main__':
        main()
    "
    # Clean up dummy files (uncomment to enable cleanup after execution)
    # rm orf_data.tsv sample_ranges.tsv
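
The matching rule can also be written as a small reusable function. The sketch below is one interpretation of "the closest input sample beginning at that range", not the authors' code: it sorts the input ranges by lower bound and uses a binary search to find the last range that starts at or below the ORF's mass, then confirms the ORF also falls under that range's upper bound. The function name `match_input` and all sample ranges other than 75 to 150 kDa are hypothetical.

```python
from bisect import bisect_right


def match_input(orf_kda, inputs):
    """Return the input range whose lower bound is closest to, and not above, orf_kda.

    inputs: list of (sample_id, min_kda, max_kda) tuples.
    Returns None when no range contains orf_kda.
    """
    ordered = sorted(inputs, key=lambda r: r[1])
    lower_bounds = [r[1] for r in ordered]
    # Index of the last range that starts at or below orf_kda.
    i = bisect_right(lower_bounds, orf_kda) - 1
    # Walk back towards smaller lower bounds until a range also covers orf_kda.
    while i >= 0:
        if orf_kda <= ordered[i][2]:
            return ordered[i]
        i -= 1
    return None


if __name__ == "__main__":
    # Hypothetical input fractions; only the 75-150 kDa range comes from the source text.
    inputs = [
        ("SampleA", 75, 150),
        ("SampleB", 50, 100),
        ("SampleC", 10, 70),
        ("SampleD", 100, 200),
    ]
    for orf_id, kda in [("ORF1", 78), ("ORF2", 120), ("ORF3", 50)]:
        print(orf_id, kda, match_input(kda, inputs))
```

With these ranges the 78 kDa ORF is paired with the 75 to 150 kDa input, which agrees with the example in the description.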
    


Raw Source Text
*library strategy: eCLIP-seq
eCLIP data was analyzed using Skipper with default settings. The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.
Assembly: hg38
Supplementary files format and content: Reproducible enriched peaks output by Skipper