GSE249246 Processing Pipeline
RIP-Seq
3 steps
Publication
Integrated multi-omics analysis of zinc-finger proteins uncovers roles in RNA regulation. Molecular Cell (2024), PMID 39303722
Dataset
GSE249246: Integrated multi-omics analysis of zinc finger proteins uncovers roles in RNA regulation [eCLIP]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
*library strategy: eCLIP-seq
$ Bash example
# Sketch of running an eCLIP CWL workflow with cwltool.
# The source text for this step only gives the library strategy; it does not
# say that this workflow was used, so treat everything below as a template.
# Assumes 'cwltool' is installed and the 'yeolab/eclip' repository is cloned.

# Install cwltool if not already installed
# pip install cwltool

# Clone the eCLIP CWL workflow repository if not already done
# git clone https://github.com/yeolab/eclip.git

# Replace the placeholder paths below with your actual data and reference files.
# For eCLIP, a reference genome (e.g., hg38) typically requires:
# - Genome FASTA file (e.g., hg38.fa)
# - STAR index built from the genome FASTA
# - GTF annotation file (e.g., gencode.v38.annotation.gtf)
# - Blacklist regions BED file
# - RepeatMasker BED file
# - Chromosome sizes file
# - Adapter sequences FASTA file

OUTPUT_DIR="eclip_results"
mkdir -p "${OUTPUT_DIR}"

# Write a placeholder inputs file for the CWL workflow.
# The key names are illustrative and are NOT the workflow's actual input
# schema; take the real input names from the CWL file you intend to run.
cat << 'EOF' > eclip_inputs.yaml
# Example for a single replicate and control, paired-end reads:
fastq_r1_rep1: {class: File, path: "path/to/replicate1_R1.fastq.gz"}
fastq_r2_rep1: {class: File, path: "path/to/replicate1_R2.fastq.gz"}
fastq_r1_control: {class: File, path: "path/to/control_R1.fastq.gz"}
fastq_r2_control: {class: File, path: "path/to/control_R2.fastq.gz"}

# Reference genome components (using hg38 as a placeholder)
genome_fasta: {class: File, path: "path/to/hg38.fa"}
genome_star_index_dir: {class: Directory, path: "path/to/hg38_star_index"}
genome_gtf: {class: File, path: "path/to/gencode.v38.annotation.gtf"}
blacklist_bed: {class: File, path: "path/to/hg38_blacklist.bed"}
repeatmasker_bed: {class: File, path: "path/to/hg38_repeatmasker.bed"}
chrom_sizes: {class: File, path: "path/to/hg38.chrom.sizes"}
adapter_fasta: {class: File, path: "path/to/adapters.fa"}

# Pipeline parameters
output_prefix: "my_eclip_experiment"
threads: 8
memory: 32G
# ... other parameters as required by the workflow
EOF

# Execute the workflow with cwltool.
# 'eclip/workflow.cwl' is a placeholder path; substitute the workflow file
# provided in the cloned repository.
cwltool --outdir "${OUTPUT_DIR}" eclip/workflow.cwl eclip_inputs.yaml
2
eCLIP data was analyzed using Skipper with default settings.
$ Bash example
# NOTE: the install commands, config keys, and 'skipper run' invocation below
# are illustrative placeholders, not a verified Skipper interface. Follow the
# Skipper documentation for the actual installation and run instructions.

# Install Skipper (placeholder commands)
# pip install skipper
# conda install -c bioconda skipper

# Create a configuration file (config.yaml) specifying samples, genome, and annotation.
# Per the description, all other parameters stay at Skipper's defaults.
# Example config.yaml content (R2 entries are optional, for paired-end data):
# samples:
#   sample1_rep1:
#     R1: "path/to/sample1_rep1_R1.fastq.gz"
#     R2: "path/to/sample1_rep1_R2.fastq.gz"
#   sample1_rep2:
#     R1: "path/to/sample1_rep2_R1.fastq.gz"
#     R2: "path/to/sample1_rep2_R2.fastq.gz"
#   input_rep1:
#     R1: "path/to/input_rep1_R1.fastq.gz"
#     R2: "path/to/input_rep1_R2.fastq.gz"
#   input_rep2:
#     R1: "path/to/input_rep2_R1.fastq.gz"
#     R2: "path/to/input_rep2_R2.fastq.gz"
#
# genome: "hg38"             # The source text gives the assembly as hg38
# annotation: "gencode_v38"  # Placeholder: GENCODE annotation version

# Run Skipper with the configuration file (placeholder command). Adjust --cores as needed.
skipper run --config config.yaml --cores 8
3
The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.
$ Bash example
# Pair each ORF with an input sample by molecular weight.
# Assumed (hypothetical) input files, tab-separated and without headers:
#   orf_data.tsv:      ORF_ID  kDa
#   sample_ranges.tsv: Sample_ID  Min_kDa  Max_kDa

# Create small dummy input files for demonstration.
printf 'ORF1\t78\nORF2\t120\nORF3\t50\n' > orf_data.tsv
printf 'SampleA\t75\t150\nSampleB\t50\t100\nSampleC\t10\t70\nSampleD\t100\t200\n' > sample_ranges.tsv

# One reading of the rule in the description: among the input ranges that
# contain the ORF's kDa, choose the one whose lower bound (Min_kDa) is
# closest to the ORF's kDa. ORFs outside every range are reported as No_Match.
python3 - <<'PY'
import csv


def read_tsv(path):
    """Return the non-empty rows of a tab-separated file."""
    with open(path, newline='') as handle:
        return [row for row in csv.reader(handle, delimiter='\t') if row]


orfs = read_tsv('orf_data.tsv')
ranges = [(row[0], float(row[1]), float(row[2]))
          for row in read_tsv('sample_ranges.tsv')]

with open('orf_sample_matches.tsv', 'w') as out:
    out.write('ORF_ID\tORF_kDa\tMatched_Sample_ID\tMatched_Min_kDa\tMatched_Max_kDa\n')
    for row in orfs:
        orf_id, kda = row[0], float(row[1])
        # Keep only the ranges that contain this ORF's kDa.
        hits = [r for r in ranges if r[1] <= kda <= r[2]]
        if hits:
            # The largest lower bound is the closest range start at or below kda.
            sample_id, lo, hi = max(hits, key=lambda r: r[1])
            out.write(f'{orf_id}\t{kda}\t{sample_id}\t{lo}\t{hi}\n')
        else:
            out.write(f'{orf_id}\t{kda}\tNo_Match\tN/A\tN/A\n')
PY

# Clean up dummy files (uncomment to enable cleanup after execution)
# rm orf_data.tsv sample_ranges.tsv
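The same pairing rule can also be written as a small reusable function. This is a minimal sketch under assumptions, not the authors' code: the function name match_input, the tuple layout, and every size range except the 75 to 150 kDa range quoted in the description are hypothetical. It sorts the inputs by lower bound and uses bisect to find the last range that starts at or below the ORF's kDa.

```python
from bisect import bisect_right


def match_input(orf_kda, inputs):
    """Return the name of the input whose range starts closest below orf_kda.

    inputs is a list of (name, min_kda, max_kda) tuples (hypothetical layout).
    Returns None when no range contains orf_kda.
    """
    ordered = sorted(inputs, key=lambda item: item[1])
    starts = [item[1] for item in ordered]
    # Index of the last range whose lower bound is <= orf_kda.
    idx = bisect_right(starts, orf_kda) - 1
    # Walk back until a range also covers orf_kda at its upper bound.
    while idx >= 0:
        name, _lo, hi = ordered[idx]
        if orf_kda <= hi:
            return name
        idx -= 1
    return None


# Hypothetical size ranges; only 75-150 kDa appears in the source text.
inputs = [
    ("input_low", 0, 75),
    ("input_75_150", 75, 150),
    ("input_high", 150, 300),
]
print(match_input(78, inputs))
```

With these assumed ranges, a 78 kDa ORF resolves to the 75 to 150 kDa input, matching the example in the description. A value that sits exactly on a shared boundary goes to the range that starts there.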
Raw Source Text
*library strategy: eCLIP-seq
eCLIP data was analyzed using Skipper with default settings.
The kDa of each ORF was matched to the closest input sample beginning at that range; for example, a 78 kDa ORF would be paired with the input sample covering the range of 75 to 150 kDa.
Assembly: hg38
Supplementary files format and content: Reproducible enriched peaks output by Skipper