GSE213867 Processing Pipeline

RNA-Seq, 5 steps

Publication

Skipper analysis of eCLIP datasets enables sensitive detection of constrained translation factor binding sites.

Cell Genomics (2023), PMID 37388912

Dataset

Skipper analysis of RNA-protein interactions highlights depletion of genetic variation in translation factor binding sites

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Data was processed using Skipper, available at: http://github.com/yeolab/skipper

Skipper (version not specified)

$ Bash example

# Clone the Skipper repository (if not already cloned)
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Install Snakemake and dependencies (if not already installed)
# conda create -n skipper_env snakemake mamba -c conda-forge -c bioconda
# conda activate skipper_env

# Placeholder for configuration:
# A 'config.yaml' file is typically required to specify input files,
# reference genomes, and other parameters for the Skipper workflow.
# Example content for config.yaml (refer to Skipper's documentation for details):
# samples:
#   sample1:
#     R1: "path/to/sample1_R1.fastq.gz"
#     R2: "path/to/sample1_R2.fastq.gz" # Optional for single-end
# genome: "path/to/reference_genome.fa" # Placeholder for reference genome
# annotation: "path/to/genome_annotation.gtf" # Placeholder for genome annotation
# output_dir: "results"

# Run the Skipper Snakemake workflow
# Adjust --cores based on available resources.
# Ensure 'config.yaml' is properly configured for your data and references.
snakemake --use-conda --cores 8 --configfile config.yaml

Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.

skewer v0.2.2 (inferred from the yeolab/skipper eCLIP workflow)

$ Bash example

# Install skewer (e.g., via conda)
# conda create -n skewer_env skewer=0.2.2 -c bioconda -c conda-forge
# conda activate skewer_env

# Placeholder for adapter and barcode sequences. These would typically be provided in a FASTA file.
# Example content for adapters.fa (replace with actual sequences):
# >adapter1
# AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# >barcode_sequence_example
# GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

# Define input and output file names
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"
ADAPTER_FILE="adapters.fa"
THREADS=8
MIN_LENGTH=18
MIN_QUALITY=20

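# Trim reads for adapter and barcode sequences using skewer
# -m pe : paired-end mode ('any' is a single-end mode and does not apply to two input files)
# -z    : gzip-compress the output so the file names match the .gz names listed below
skewer -x "${ADAPTER_FILE}" \
       -m pe \
       -l "${MIN_LENGTH}" \
       -q "${MIN_QUALITY}" \
       -t "${THREADS}" \
       -z \
       -o "${OUTPUT_PREFIX}" \
       "${INPUT_R1}" \
       "${INPUT_R2}"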

# Expected output files:
# trimmed_reads-trimmed-pair1.fastq.gz
# trimmed_reads-trimmed-pair2.fastq.gz
# trimmed_reads-trimmed.log

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp.

fastp v0.23.2 (version inferred with models/gemini-2.5-flash)

$ Bash example

# Install fastp (if not already installed)
# conda install -c bioconda fastp

# Define input and output file names (placeholders)
# Replace with your actual input and desired output file names
INPUT_R1="raw_reads_R1.fastq.gz"
INPUT_R2="raw_reads_R2.fastq.gz"
OUTPUT_R1="umi_extracted_R1.fastq.gz"
OUTPUT_R2="umi_extracted_R2.fastq.gz"
REPORT_JSON="fastp_report.json"
REPORT_HTML="fastp_report.html"

# Extract UMIs from raw sequencing reads using fastp
# This command assumes UMIs are at the beginning of Read 1 and are 10 bp long.
# --umi enables UMI processing; without it the --umi_* options have no effect.
# The UMI is moved from the read sequence to the read name. fastp joins the prefix
# and the UMI with an underscore, so the prefix "UMI" yields "UMI_<sequence>".
# Basic quality control (trimming low-quality bases, adapter trimming, filtering short reads) is also included.
# The quality and length thresholds are placeholders; tune them to your read lengths.
fastp \
  --in1 "${INPUT_R1}" \
  --out1 "${OUTPUT_R1}" \
  --in2 "${INPUT_R2}" \
  --out2 "${OUTPUT_R2}" \
  --umi \
  --umi_loc read1 \
  --umi_len 10 \
  --umi_prefix "UMI" \
  --qualified_quality_phred 20 \
  --unqualified_percent_limit 50 \
  --length_required 50 \
  --detect_adapter_for_pe \
  --json "${REPORT_JSON}" \
  --html "${REPORT_HTML}"
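
To make the effect of UMI extraction concrete, here is a minimal sketch in plain awk (not fastp itself) that moves the first bases of each read into the read name. The helper name extract_umi, the 10 bp UMI length, the ":UMI_<sequence>" name layout, and the toy record are assumptions for illustration; fastp's exact read-name layout may differ, and real data should go through fastp.

```shell
set -eu

# Hypothetical helper: move the first N bases of each read into the read name.
# Reads FASTQ on stdin and writes FASTQ on stdout. Illustration only.
extract_umi() {
  awk -v n="${1:-10}" '
    NR % 4 == 1 { header = $0; next }            # read name line, held until the UMI is known
    NR % 4 == 2 {                                 # sequence line
      umi = substr($0, 1, n)
      split_at = index(header, " ")               # keep any comment after the first space
      if (split_at > 0)
        header = substr(header, 1, split_at - 1) ":UMI_" umi substr(header, split_at)
      else
        header = header ":UMI_" umi
      print header
      print substr($0, n + 1)                     # sequence with the UMI removed
      next
    }
    NR % 4 == 3 { print; next }                   # separator line
    NR % 4 == 0 { print substr($0, n + 1) }       # quality string trimmed to match
  '
}

# Toy record, made up for illustration
umi_result="$(printf '%s\n' \
  '@read1 1:N:0:ACGT' \
  'AAACCCGGGTTTTGCATGCA' \
  '+' \
  'IIIIIIIIIIFFFFFFFFFF' | extract_umi 10)"
printf '%s\n' "$umi_result"
```

Keeping the UMI in the read name lets it survive alignment, so a later step can collapse PCR duplicates that share both a mapping position and a UMI.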

Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}

STAR v2.7.x

$ Bash example

# Install STAR using conda
# conda install -c bioconda star

# Define variables (replace with actual paths and values)
genome_dir="/path/to/STAR_genome_index/GRCh38"
output_prefix="sample_"  # STAR appends names such as Aligned.out.bam directly to this prefix
input_fastq="sample.fastq.gz"
num_threads=8
replicate_id="sample1_rep1"

# Run STAR alignment
STAR \
  --alignEndsType EndToEnd \
  --genomeDir "${genome_dir}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${output_prefix}" \
  --winAnchorMultimapNmax 100 \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outSAMmultNmax 1 \
  --outMultimapperOrder Random \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --limitOutSJcollapsed 5000000 \
  --outReadsUnmapped None \
  --outSAMattrRGline ID:"${replicate_id}" \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --readFilesCommand zcat \
  --outStd Log \
  --readFilesIn "${input_fastq}" \
  --runMode alignReads \
  --runThreadN "${num_threads}"

Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper.

Skipper Snakemake workflow (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create a Conda environment for Snakemake
# conda create -n skipper_env snakemake -y
# conda activate skipper_env

# --- Configuration for Skipper workflow ---
# Create a config.yaml file with your sample and genome information.
# Example config.yaml:
# samples:
#   sample_rep1:
#     r1: "/path/to/sample_rep1_R1.fastq.gz"
#     r2: "/path/to/sample_rep1_R2.fastq.gz" # Optional, if paired-end
#   sample_rep2:
#     r1: "/path/to/sample_rep2_R1.fastq.gz"
#     r2: "/path/to/sample_rep2_R2.fastq.gz" # Optional, if paired-end
#   input_rep1: # Control/Input sample
#     r1: "/path/to/input_rep1_R1.fastq.gz"
#     r2: "/path/to/input_rep1_R2.fastq.gz" # Optional, if paired-end
#   input_rep2: # Control/Input sample
#     r1: "/path/to/input_rep2_R1.fastq.gz"
#     r2: "/path/to/input_rep2_R2.fastq.gz" # Optional, if paired-end
#
# genome_assembly: "hg38" # e.g., hg38, mm10
# genome_dir: "/path/to/STAR/index/for/hg38" # Path to STAR genome index
# gtf_file: "/path/to/gencode.vXX.annotation.gtf" # Path to GTF annotation file
# repeat_masker_file: "/path/to/hg38.fa.out.bed" # Path to RepeatMasker BED file
#
# # Ensure all necessary reference files (STAR index, GTF, RepeatMasker BED) are prepared.
# # For hg38, these can be downloaded from ENCODE or UCSC.
# # Example for hg38:
# # STAR index: build with STAR --runMode genomeGenerate from the genome FASTA (e.g., hg38.fa.gz under https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/)
# # GTF: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# # RepeatMasker BED: Can be generated from UCSC Table Browser (track: RepeatMasker, output format: BED) or downloaded from specific resources.

# Execute the Skipper Snakemake workflow.
# The 'all' target is assumed here to include the steps that call reproducible
# enriched windows and repetitive elements; check the repository for the actual target names.
snakemake --snakefile Snakefile --configfile config.yaml --cores 8 all
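
The source does not spell out what makes an enriched window "reproducible". As a rough sketch of one simple reading, a window counts as reproducible if it is called enriched in at least two replicates. The helper name reproducible_windows, the one-ID-per-line file format, and the toy window IDs are hypothetical; Skipper's own criterion and file formats may differ.

```shell
set -eu

# Hypothetical helper: list window IDs found in at least MIN_REPS of the given files.
# Each file holds one enriched window ID per line (made-up format, not Skipper output).
reproducible_windows() {
  min_reps="$1"
  shift
  for f in "$@"; do
    sort -u "$f"    # count each window at most once per replicate
  done | sort | uniq -c | awk -v m="$min_reps" '$1 >= m { print $2 }'
}

# Toy per-replicate calls, made up for illustration
workdir="$(mktemp -d)"
printf 'w1\nw2\nw3\n' > "$workdir/rep1.txt"
printf 'w2\nw3\nw4\n' > "$workdir/rep2.txt"
printf 'w3\nw5\nw5\n' > "$workdir/rep3.txt"

repro="$(reproducible_windows 2 "$workdir/rep1.txt" "$workdir/rep2.txt" "$workdir/rep3.txt")"
printf '%s\n' "$repro"
rm -rf "$workdir"
```

With these toy inputs only w2 and w3 are reported; w5 appears twice but within a single replicate, so it is not counted as reproducible.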

Tools Used

Skipper, skewer, fastp, STAR

Raw Source Text

Data was processed using Skipper, available at: http://github.com/yeolab/skipper
Reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using skewer.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with fastp
Extracted reads were aligned using STAR: --alignEndsType EndToEnd --genomeDir {params.star_sjdb} --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix {params.outprefix} --winAnchorMultimapNmax 100 --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outSAMmultNmax 1 --outMultimapperOrder Random --outFilterScoreMin 10 --outFilterType BySJout --limitOutSJcollapsed 5000000 --outReadsUnmapped None --outSAMattrRGline ID:{wildcards.replicate_label} --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped Within --readFilesCommand zcat --outStd Log --readFilesIn {input.fq} --runMode alignReads --runThreadN {threads}
Custom scripts called reproducible enriched windows and repetitive elements as part of Skipper
Assembly: GRCh38
Supplementary files format and content: reproducible enriched window files contain p values, q values, enrichment, and annotations for significantly bound transcriptome tiled windows
Supplementary files format and content: reproducible enriched re files contain p values, q values, enrichment, and annotations for significantly bound repetitive elements
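
As a starting point for working with these supplementary tables, the sketch below keeps only rows under a q-value cutoff. The helper name filter_by_qvalue, the tab-delimited layout, the column names (window, enrichment, pvalue, qvalue), the 0.05 cutoff, and the toy rows are all hypothetical; inspect the header of the actual files before adapting it.

```shell
set -eu

# Hypothetical helper: keep the header plus rows whose q-value column is below a cutoff.
# Usage: filter_by_qvalue FILE QVALUE_COLUMN_NAME CUTOFF
filter_by_qvalue() {
  awk -F '\t' -v col="$2" -v cutoff="$3" '
    NR == 1 {
      # Locate the q-value column by name instead of hard-coding its position
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      print
      next
    }
    ($idx + 0) < (cutoff + 0)    # numeric comparison, also handles values like 1e-6
  ' "$1"
}

# Toy table, made up for illustration
toy_table="$(mktemp)"
{
  printf 'window\tenrichment\tpvalue\tqvalue\n'
  printf 'win_a\t3.2\t1e-6\t0.0004\n'
  printf 'win_b\t1.1\t0.2\t0.6\n'
  printf 'win_c\t2.5\t0.001\t0.03\n'
} > "$toy_table"

significant="$(filter_by_qvalue "$toy_table" qvalue 0.05)"
printf '%s\n' "$significant"
rm -f "$toy_table"
```

Looking the column up by name keeps the filter working if the real files order their columns differently, as long as the name passed in matches the header.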
