GSE249246 Processing Pipeline

Publication

Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation.

Molecular Cell (2024), PMID 39303722

Dataset

Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [eCLIP]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

*library strategy: eCLIP-seq

eCLIP v1.0.0

$ Bash example

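# This script sketches how to run the eCLIP CWL workflow.
# It assumes 'cwltool' is installed and the 'yeolab/eclip' repository is cloned.
# The input keys and the workflow file name used below are placeholders, not the repository's actual schema.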

# Install cwltool if not already installed
# pip install cwltool

# Clone the eCLIP CWL workflow repository if not already done.
# Stay in the parent directory so that the 'eclip/' path used below resolves.
# git clone https://github.com/yeolab/eclip.git

# Define placeholder paths for input data and reference files.
# Replace these with your actual data and reference genome paths.
# For eCLIP, a complete reference set (e.g., for hg38) typically includes:
# - Genome FASTA file (e.g., hg38.fa)
# - STAR index built from the genome FASTA
# - GTF annotation file (e.g., gencode.v38.annotation.gtf)
# - Blacklist regions BED file
# - RepeatMasker BED file
# - Chromosome sizes file
# - Adapter sequences FASTA file

OUTPUT_DIR="eclip_results"
mkdir -p "${OUTPUT_DIR}"

# Create a placeholder inputs.yaml file for the CWL workflow.
# A real job file lists all input FASTQ files (replicates, controls),
# reference genome components, and pipeline parameters, using the input
# names defined by the workflow file in the 'yeolab/eclip' repository.
# The key names below are illustrative only.
cat << 'EOF' > eclip_inputs.yaml
# Minimal example of an inputs.yaml for the eCLIP CWL workflow
# Replace with actual file paths and parameters for your experiment.
# Example for a single replicate and control, paired-end reads:
fastq_r1_rep1: {class: File, path: "path/to/replicate1_R1.fastq.gz"}
fastq_r2_rep1: {class: File, path: "path/to/replicate1_R2.fastq.gz"}
fastq_r1_control: {class: File, path: "path/to/control_R1.fastq.gz"}
fastq_r2_control: {class: File, path: "path/to/control_R2.fastq.gz"}

# Reference genome components (using hg38 as a placeholder)
genome_fasta: {class: File, path: "path/to/hg38.fa"}
genome_star_index_dir: {class: Directory, path: "path/to/hg38_star_index"}
genome_gtf: {class: File, path: "path/to/gencode.v38.annotation.gtf"}
blacklist_bed: {class: File, path: "path/to/hg38_blacklist.bed"}
repeatmasker_bed: {class: File, path: "path/to/hg38_repeatmasker.bed"}
chrom_sizes: {class: File, path: "path/to/hg38.chrom.sizes"}
adapter_fasta: {class: File, path: "path/to/adapters.fa"}

# Pipeline parameters
output_prefix: "my_eclip_experiment"
threads: 8
memory: 32G
# ... other parameters as required by the workflow
EOF

# Execute the eCLIP CWL workflow using cwltool.
# 'eclip/workflow.cwl' is a placeholder; check the cloned repository for the actual workflow file and point cwltool at it.
cwltool --outdir "${OUTPUT_DIR}" eclip/workflow.cwl eclip_inputs.yaml
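
Before launching a long workflow run, it can help to confirm that every file and directory named in the job file exists. The following is a minimal Python sketch of such a pre-flight check. It is not part of the eCLIP pipeline. `missing_inputs` is a hypothetical helper, and the keys in the demo are the same placeholders used above. It works on a job object that has already been loaded into a dict; reading the YAML file itself would need a third-party parser such as PyYAML.

```python
import os
import tempfile


def missing_inputs(job, base_dir="."):
    """Return (key, path) pairs for File/Directory inputs that do not exist.

    `job` is a CWL job object already loaded into a dict. Only top-level
    entries and top-level lists of entries are inspected.
    """
    missing = []
    for key, value in job.items():
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            if not isinstance(entry, dict):
                continue  # scalar parameters such as threads or memory
            if entry.get("class") not in ("File", "Directory"):
                continue
            path = entry.get("path")
            if path is None or not os.path.exists(os.path.join(base_dir, path)):
                missing.append((key, path))
    return missing


if __name__ == "__main__":
    # Self-contained demo: one FASTQ exists, its mate does not.
    demo_dir = tempfile.mkdtemp()
    open(os.path.join(demo_dir, "replicate1_R1.fastq.gz"), "w").close()
    demo_job = {
        "fastq_r1_rep1": {"class": "File", "path": "replicate1_R1.fastq.gz"},
        "fastq_r2_rep1": {"class": "File", "path": "replicate1_R2.fastq.gz"},
        "threads": 8,
    }
    print(missing_inputs(demo_job, base_dir=demo_dir))
    # [('fastq_r2_rep1', 'replicate1_R2.fastq.gz')]
```

An empty list means all referenced paths were found. Relative paths are resolved against `base_dir`, which should be the directory containing the job file.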

eCLIP data was analyzed using Skipper with default settings.

Skipper v0.1.0

$ Bash example

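# Obtain Skipper. The install commands below are unverified placeholders:
# Skipper is distributed by the Yeo lab as a Snakemake workflow, so check its
# README for the actual installation steps and dependencies.
# pip install skipper
# conda install -c bioconda skipper

# Create a configuration file (config.yaml) specifying samples, genome, and annotation.
# Per the description, all other parameters are left at Skipper's defaults.
# Example config.yaml content (illustrative keys, not Skipper's actual schema):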
# samples:
#   sample1_rep1:
#     R1: "path/to/sample1_rep1_R1.fastq.gz"
#     R2: "path/to/sample1_rep1_R2.fastq.gz" # Optional, if paired-end
#   sample1_rep2:
#     R1: "path/to/sample1_rep2_R1.fastq.gz"
#     R2: "path/to/sample1_rep2_R2.fastq.gz" # Optional, if paired-end
#   input_rep1:
#     R1: "path/to/input_rep1_R1.fastq.gz"
#     R2: "path/to/input_rep1_R2.fastq.gz" # Optional, if paired-end
#   input_rep2:
#     R1: "path/to/input_rep2_R1.fastq.gz"
#     R2: "path/to/input_rep2_R2.fastq.gz" # Optional, if paired-end
#
# genome: "hg38" # Placeholder: Human genome assembly (e.g., hg38, mm10)
# annotation: "gencode_v38" # Placeholder: GENCODE annotation version (e.g., gencode_v38, gencode_m25)

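# Execute Skipper with the configuration file. The 'skipper run' command and
# its flags are illustrative and unverified; see the Skipper README for the
# actual invocation. Adjust the core count as needed.
skipper run --config config.yaml --cores 8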

The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.

RangeMatcher (inferred with models/gemini-2.5-flash) v1.0

$ Bash example

# Assume input files:
# orf_data.tsv: ORF_ID<tab>kDa
# sample_ranges.tsv: Sample_ID<tab>Min_kDa<tab>Max_kDa

# Example content for orf_data.tsv:
# ORF1    78
# ORF2    120
# ORF3    50

# Example content for sample_ranges.tsv:
# SampleA 75      150
# SampleB 50      100
# SampleC 10      70
# SampleD 100     200

# Create dummy input files for demonstration (printf is more portable than echo -e)
printf 'ORF1\t78\nORF2\t120\nORF3\t50\n' > orf_data.tsv
printf 'SampleA\t75\t150\nSampleB\t50\t100\nSampleC\t10\t70\nSampleD\t100\t200\n' > sample_ranges.tsv

# Python script to perform the matching logic described:
# For each ORF, find the sample ranges that contain its kDa value.
# If several ranges contain it, select the one whose 'Min_kDa' is closest to (but not greater than) the ORF's kDa.
# Candidates that tie on closeness necessarily share the same 'Min_kDa'; the first one listed is kept.
# The quoted heredoc delimiter passes the script to Python without any shell expansion.
python3 - << 'PY'
def parse_data(filename, delimiter='\t'):
    """Read a delimited file into a list of rows, skipping blank lines."""
    data = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(line.split(delimiter))
    return data

def main():
    orf_data_file = 'orf_data.tsv'
    sample_ranges_file = 'sample_ranges.tsv'
    output_file = 'orf_sample_matches.tsv'

    orfs = parse_data(orf_data_file)
    sample_ranges = parse_data(sample_ranges_file)

    with open(output_file, 'w') as outfile:
        outfile.write('ORF_ID\tORF_kDa\tMatched_Sample_ID\tMatched_Min_kDa\tMatched_Max_kDa\n')
        for orf_parts in orfs:
            orf_id = orf_parts[0]
            orf_kDa = float(orf_parts[1])

            best_match = None
            min_diff = float('inf')  # orf_kDa - min_kDa for the best match so far

            for sample_parts in sample_ranges:
                sample_id = sample_parts[0]
                min_kDa = float(sample_parts[1])
                max_kDa = float(sample_parts[2])

                # Only ranges that contain the ORF's kDa are candidates
                if min_kDa <= orf_kDa <= max_kDa:
                    current_diff = orf_kDa - min_kDa
                    # Strict '<' keeps the first-listed range on ties
                    if current_diff < min_diff:
                        best_match = (sample_id, min_kDa, max_kDa)
                        min_diff = current_diff

            if best_match:
                outfile.write(f'{orf_id}\t{orf_kDa}\t{best_match[0]}\t{best_match[1]}\t{best_match[2]}\n')
            else:
                outfile.write(f'{orf_id}\t{orf_kDa}\tNo_Match\tN/A\tN/A\n')

if __name__ == '__main__':
    main()
PY
# Clean up dummy files (uncomment to enable cleanup after execution)
# rm orf_data.tsv sample_ranges.tsv
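
If the input samples cover consecutive, non-overlapping size ranges, the matching rule described above reduces to finding the last range whose lower bound is at or below the ORF's mass. The following minimal sketch implements that reading with a binary search. The function name `match_input` is hypothetical, and every example range other than 75 to 150 kDa is invented for illustration.

```python
from bisect import bisect_right


def match_input(orf_kda, ranges):
    """Return the (low, high) range that begins closest at or below orf_kda.

    `ranges` is a list of (low, high) tuples in kDa. Returns None when
    orf_kda lies below every range or above the selected range's upper bound.
    """
    ordered = sorted(ranges)
    lows = [low for low, _ in ordered]
    # Index of the last range whose lower bound is <= orf_kda
    index = bisect_right(lows, orf_kda) - 1
    if index < 0:
        return None
    low, high = ordered[index]
    return (low, high) if orf_kda <= high else None


# Hypothetical size ranges; only 75-150 kDa appears in the source text.
example_ranges = [(25, 75), (75, 150), (150, 250)]
print(match_input(78, example_ranges))  # (75, 150)
print(match_input(75, example_ranges))  # (75, 150)
print(match_input(10, example_ranges))  # None
```

A mass that falls exactly on a shared boundary, such as 75 kDa here, is assigned to the range that begins at that value, in line with the "beginning at that range" wording.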

Tools Used

eCLIP, Skipper

Raw Source Text

*library strategy: eCLIP-seq
eCLIP data was analyzed using Skipper with default settings. The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.
Assembly: hg38
Supplementary files format and content: Reproducible enriched peaks output by Skipper
