GSE266924 Processing Pipeline

Publication

ePRINT: exonuclease assisted mapping of protein-RNA interactions.

Genome Biology (2024), PMID 38807229

Dataset

Exonuclease assisted mapping of protein-RNA interactions (ePRINT)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Read1 data was processed using Skipper, available at: https://github.com/yeolab/skipper

Skipper (latest version; Snakemake-based workflow) GitHub

$ Bash example

# Install Snakemake (if not already installed)
# conda install -c bioconda -c conda-forge snakemake

# Clone the Skipper workflow repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a Conda environment if the repository provides an environment file (optional, but recommended for reproducibility)
# NOTE: the file name 'environment.yaml' and the environment name 'skipper_env' are assumptions; check the repository README.
# conda env create -f environment.yaml
# conda activate skipper_env

# Assuming 'Read1.fastq.gz' is the input file for Read1 data.
# Skipper is driven by its own configuration file and sample manifest. The file names,
# columns, and keys below are illustrative placeholders, not Skipper's documented schema.
# Example of a minimal sample sheet (hypothetical columns, single-end reads):
# sample_id	fastq_r1
# my_sample	Read1.fastq.gz

# Example of a minimal config.yaml (hypothetical keys; adjust to the workflow's actual settings):
# outdir: results
# genome: GRCh38 # Assembly reported for this dataset
# annotation: gencode.v38 # Placeholder, replace with the actual annotation if known
# adapters: # Placeholder sequences, replace with the adapters used in the library prep
#   - AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
#   - AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

# To run the workflow for processing Read1 data (sample inputs are declared in the
# configuration, since Snakemake itself takes no sample-sheet argument):
# snakemake --use-conda --cores 8 --configfile config.yaml

Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.

eCLIP v0.2.2 GitHub

$ Bash example

# Install skewer (if not already installed)
# conda install -c bioconda skewer

# Define input and output files
INPUT_FASTQ="sample.fastq.gz"
OUTPUT_PREFIX="sample_trimmed"
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Placeholder: Illumina TruSeq-style 3' adapter; replace with the adapter used in the library prep
MIN_READ_LENGTH=18 # Minimum read length to keep after trimming (placeholder value)
END_QUALITY=20     # 3' end quality threshold (placeholder value)
THREADS=8          # Number of threads to use

# Trim adapter and barcode sequences using skewer
# -x: adapter sequence to trim
# -l: minimum read length allowed after trimming
# -q: trim the 3' end until this quality or higher is reached (use -Q for a mean-quality filter)
# -t: number of threads
# -z: compress output in gzip format
# -o: output file prefix
skewer -x "${ADAPTER_SEQUENCE}" -l "${MIN_READ_LENGTH}" -q "${END_QUALITY}" -t "${THREADS}" -z -o "${OUTPUT_PREFIX}" "${INPUT_FASTQ}"

# With -z, the output file for single-end reads will be named like sample_trimmed-trimmed.fastq.gz

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp

fastp v0.23.2 (Inferred with models/gemini-2.5-flash)

$ Bash example

# fastp installation (example using conda)
# conda install -c bioconda fastp

# Define input and output file names
# Replace with actual raw sequencing read files (e.g., from a sequencing run)
READ1_IN="raw_read1.fastq.gz"
READ2_IN="raw_read2.fastq.gz" # Remove this line and corresponding -I/-O if single-end reads

# Define output file names for UMI-extracted and quality-controlled reads
READ1_OUT="umi_extracted_read1.fastq.gz"
READ2_OUT="umi_extracted_read2.fastq.gz" # Remove this line and corresponding -I/-O if single-end reads

# Define UMI parameters
# UMI_LOCATION: Specifies where the UMI is located. fastp accepts:
#   'read1': UMI is at the beginning of Read 1.
#   'read2': UMI is at the beginning of Read 2.
#   'per_read': UMIs are taken from the beginning of both Read 1 and Read 2.
#   'index1': UMI is in the first index read.
#   'index2': UMI is in the second index read.
#   'per_index': UMIs are taken from both index reads.
# UMI_LENGTH: The length of the UMI in base pairs (required when the UMI is in the reads). This value is critical and depends on the library preparation protocol.
# The following values are placeholders and MUST be replaced with the actual experimental parameters.
UMI_LOCATION="read1" # Example: UMI is at the beginning of Read 1
UMI_LENGTH=10        # Example: UMI is 10 bp long

# Run fastp for UMI extraction (enabled by --umi), along with its default quality filtering and adapter trimming
fastp \
  -i "${READ1_IN}" \
  -o "${READ1_OUT}" \
  -I "${READ2_IN}" \
  -O "${READ2_OUT}" \
  --umi \
  --umi_loc "${UMI_LOCATION}" \
  --umi_len "${UMI_LENGTH}" \
  --json "fastp_report.json" \
  --html "fastp_report.html"
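
To make concrete what UMI extraction does to each read, the sketch below mimics the effect in plain awk: the first UMI_LENGTH bases are moved from the sequence into the read ID, and the matching quality characters are dropped. This is an illustration only, not fastp's implementation. The function name extract_umi, the 10 bp UMI length, the ':' separator, and the toy read are all assumptions made for the example.

```shell
#!/bin/sh
set -eu

# Hypothetical UMI length; replace with the value from the library protocol.
UMI_LENGTH=10

# extract_umi N: read FASTQ on stdin, write FASTQ on stdout with the first
# N bases of every read moved into the read ID (illustrative helper).
extract_umi() {
  awk -v n="$1" '
    NR % 4 == 1 { header = $0; next }          # hold the header until the UMI is known
    NR % 4 == 2 {
      umi = substr($0, 1, n)                   # leading N bases are the UMI
      seq = substr($0, n + 1)                  # remainder is the insert
      split_at = index(header, " ")            # read ID ends at the first space
      if (split_at > 0)
        header = substr(header, 1, split_at - 1) ":" umi substr(header, split_at)
      else
        header = header ":" umi
      print header
      print seq
      next
    }
    NR % 4 == 3 { print "+"; next }
    NR % 4 == 0 { print substr($0, n + 1) }    # trim quality to match the sequence
  '
}

# Toy single-read demonstration (made-up read, not from the dataset)
printf '@read1 1:N:0\nACGTACGTACTTTTGGGGCCCC\n+\nIIIIIIIIIIJJJJJJJJJJJJ\n' | extract_umi "${UMI_LENGTH}"
```

For the toy read, the header becomes `@read1:ACGTACGTAC 1:N:0` and the sequence and quality lines shrink to the last 12 characters. Keeping the UMI in the read ID is what lets downstream steps collapse PCR duplicates after alignment.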

Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# A STAR genome index must exist before alignment. A sketch of building one
# (paths are placeholders; GRCh38 is the assembly reported for this dataset):
# STAR --runMode genomeGenerate --genomeDir /path/to/STAR_genome_dir \
#   --genomeFastaFiles /path/to/GRCh38.fa --sjdbGTFfile /path/to/annotation.gtf --runThreadN 8

# Example: Create a dummy gzipped input FASTQ file (printf is used so that \n becomes a real newline)
# printf '@read1\nAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT\n+\n################################\n' | gzip > input.fq.gz

STAR \
  --alignEndsType EndToEnd \
  --genomeDir /path/to/STAR_genome_dir \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix aligned_reads_ \
  --winAnchorMultimapNmax 100 \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outSAMmultNmax 1 \
  --outMultimapperOrder Random \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --limitOutSJcollapsed 5000000 \
  --outReadsUnmapped None \
  --outSAMattrRGline ID:sample1_rep1 \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --readFilesCommand zcat \
  --outStd Log \
  --readFilesIn input.fq.gz \
  --runMode alignReads \
  --runThreadN 8

Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper

Skipper (version not specified) GitHub

$ Bash example

# This code block assumes you have cloned the Skipper repository and are in its root directory.
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# It also assumes you have Snakemake installed and configured, or that you will use
# Snakemake's --use-conda feature to manage environments for the workflow's tools.
# Example Snakemake installation:
# conda create -n snakemake_env snakemake -y
# conda activate snakemake_env

# Before running, you would typically configure the workflow by editing its
# configuration file and sample manifest to specify your input files, genome assembly
# (GRCh38 for this dataset), and other parameters relevant to reproducible enriched
# windows and repetitive element analysis.
# The paths below are hypothetical placeholders; use the file names given in the Skipper README.
# cp config/config.yaml.example config/config.yaml
# cp config/samples.tsv.example config/samples.tsv
# (Edit the copied files with your data and settings)

# Execute the Skipper Snakemake workflow.
# This command will trigger the analysis steps, including those for
# identifying reproducible enriched windows and analyzing repetitive elements,
# based on your configuration.
SNAKEFILE="workflow/Snakefile" # Placeholder: set to the Snakefile path given in the Skipper README
snakemake --cores 8 --use-conda --snakefile "${SNAKEFILE}"

Tools Used

Skipper eCLIP STAR

Raw Source Text

Read1 data was processed using Skipper, available at: https://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched re files contain p values, q values, enrichment, and annotations for significantly bound repetitive elements
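
Because the supplementary files are described as tables with p values, q values, enrichment, and annotations, a common first step is to subset them by q value. The sketch below assumes a tab-separated file with a header row. The column name qvalue, the 0.05 cutoff, the helper name filter_by_column, and the toy table are assumptions for illustration, so check the real header of the downloaded file before use.

```shell
#!/bin/sh
set -eu

# filter_by_column FILE COLUMN_NAME MAX_VALUE
# Print the header plus every row whose named column is numerically <= MAX_VALUE.
# The column is located by its header name, so its position does not matter.
filter_by_column() {
  awk -F '\t' -v col="$2" -v max="$3" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($idx + 0) <= (max + 0)
  ' "$1"
}

# Toy table with made-up values (not from the dataset), used only to show the call
demo=$(mktemp)
printf 'name\tpvalue\tqvalue\tenrichment\n' > "$demo"
printf 'window_1\t0.0001\t0.01\t3.2\n' >> "$demo"
printf 'window_2\t0.04\t0.30\t1.1\n' >> "$demo"
filter_by_column "$demo" qvalue 0.05
rm -f "$demo"
```

On the toy table this keeps the header and window_1 only. The same call should work for the repetitive element files if they share the column layout, which is also an assumption.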
