GSE78507 Processing Pipeline


Publication

Enhanced CLIP Uncovers IMP Protein-RNA Targets in Human Pluripotent Stem Cells Important for Cell Adhesion and Survival.

Cell reports (2016) — PMID 27068461

Dataset

GSE78507

Enhanced CLIP uncovers IMP protein-RNA targets in human pluripotent stem cells important for cell adhesion and survival [eCLIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: eCLIP-Seq

    $ Bash example
    # Install necessary tools (example using conda)
    # conda create -n eclip_env star clipper python=3.8 -y
    # conda activate eclip_env
    # pip install git+https://github.com/yeolab/merge_peaks.git
    
    # --- Placeholder for reference genome and annotation ---
    # Replace with actual paths to your reference files. hg38 is an inferred choice here;
    # check the GEO record for the assembly the authors used before comparing coordinates.
    GENOME_FASTA="path/to/hg38.fa" # Inferred: hg38 assembly
    GENOME_GTF="path/to/gencode.v38.annotation.gtf" # Inferred: matching GENCODE annotation for hg38
    STAR_INDEX_DIR="path/to/STAR_genome_index_hg38"
    
    # Build STAR genome index (if not already built)
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GENOME_GTF}" \
    #      --runThreadN 8
    
    # --- Alignment with STAR (splice-aware aligner) ---
    # Assuming paired-end reads for eCLIP. Replace with actual input FASTQ files and output directory.
    INPUT_FASTQ_R1="sample_eCLIP_R1.fastq.gz"
    INPUT_FASTQ_R2="sample_eCLIP_R2.fastq.gz"
    CONTROL_FASTQ_R1="sample_control_R1.fastq.gz"
    CONTROL_FASTQ_R2="sample_control_R2.fastq.gz"
    OUTPUT_BASE_DIR="eclip_analysis"
    
    mkdir -p "${OUTPUT_BASE_DIR}/alignment"
    
    echo "Running STAR alignment for eCLIP sample..."
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_BASE_DIR}/alignment/eCLIP_sample_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --runThreadN 8
    
    eCLIP_BAM="${OUTPUT_BASE_DIR}/alignment/eCLIP_sample_Aligned.sortedByCoord.out.bam"
    
    echo "Running STAR alignment for control sample..."
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${CONTROL_FASTQ_R1}" "${CONTROL_FASTQ_R2}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_BASE_DIR}/alignment/control_sample_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --runThreadN 8
    
    CONTROL_BAM="${OUTPUT_BASE_DIR}/alignment/control_sample_Aligned.sortedByCoord.out.bam"
    
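    # --- Peak Calling with CLIPper ---
    mkdir -p "${OUTPUT_BASE_DIR}/peaks"
    OUTPUT_PEAKS_BED="${OUTPUT_BASE_DIR}/peaks/eCLIP_sample_peaks.bed"
    # CLIPper's -s option names a species/annotation build, not a control BAM, and CLIPper
    # takes no genome FASTA. The build name below is an assumption: it must be one bundled
    # with your CLIPper install and must match the assembly used for alignment.
    CLIPPER_SPECIES="hg38"
    
    echo "Running CLIPper peak calling..."
    # Normalizing peaks against the control (CONTROL_BAM) is a separate downstream step.
    clipper -b "${eCLIP_BAM}" \
            -s "${CLIPPER_SPECIES}" \
            -o "${OUTPUT_PEAKS_BED}"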
    
    # --- IDR (Identifying Reproducible Peaks) with merge_peaks ---
    # Assuming two replicates for IDR. Replace with actual peak files from replicates.
    REPLICATE1_PEAKS="path/to/replicate1_peaks.bed"
    REPLICATE2_PEAKS="path/to/replicate2_peaks.bed"
    OUTPUT_IDR_PEAKS="${OUTPUT_BASE_DIR}/idr/reproducible_peaks.bed"
    mkdir -p "${OUTPUT_BASE_DIR}/idr"
    
    # NOTE: the invocation below is an inferred placeholder and is left commented out.
    # The flag names are unverified; consult the merge_peaks README for the real interface.
    # echo "Running merge_peaks for IDR..."
    # merge_peaks --replicate1 "${REPLICATE1_PEAKS}" \
    #             --replicate2 "${REPLICATE2_PEAKS}" \
    #             --output "${OUTPUT_IDR_PEAKS}" \
    #             --idr_threshold 0.1 # Example IDR threshold
    
  2. 2

    Takes output from raw files.

    Unknown (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # The description "Takes output from raw files." is too generic to infer a specific tool or command.
    # This step likely involves initial processing, quality control, or format conversion depending on the specific assay and raw file type.
    # Please provide more context (e.g., assay type, file format, desired output) for a more specific inference.
    # Example placeholder for a common initial step like quality control for sequencing data:
    # fastqc input_raw_file.fastq -o qc_output_directory
    # Or for file format conversion:
    # samtools view -bS input.sam > output.bam
  3. 3

    Run to trim off both 5’ and 3’ adapters on both reads.

    cutadapt (Inferred with models/gemini-2.5-flash) v2.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=2.10
    
    # Define input and output file paths
    READ1_INPUT="input_read1.fastq.gz"
    READ2_INPUT="input_read2.fastq.gz"
    READ1_OUTPUT="trimmed_read1.fastq.gz"
    READ2_OUTPUT="trimmed_read2.fastq.gz"
    
    # Define adapter sequences (common Illumina adapters for eCLIP, 3' for Read 1 and Read 2)
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Number of CPU cores to use (adjust as needed)
    NUM_CORES=$(nproc)
    
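    # Run cutadapt to trim 3' adapters, perform quality trimming, and filter short reads.
    # Note: only 3' adapters are trimmed here; the 5' adapters named in the step description
    # would need additional -g/-G options with the library's actual sequences.
    # -j: Number of CPU cores
    # -a: 3' adapter for Read 1
    # -A: 3' adapter for Read 2
    # -q: Quality-trim the 3' end of each read at a cutoff of 20 (a single value trims the 3' end only)
    # --minimum-length: Discard reads shorter than 18 bp after trimming
    # --nextseq-trim: NextSeq-specific 3' quality trimming that also removes trailing dark-cycle G bases; drop it for other instruments
    # --pair-filter=any: Discard read pairs if either read becomes too short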
    # -o: Output file for Read 1
    # -p: Output file for Read 2
    cutadapt \
        -j "${NUM_CORES}" \
        -a "${ADAPTER_R1}" \
        -A "${ADAPTER_R2}" \
        -q 20 \
        --minimum-length 18 \
        --nextseq-trim 20 \
        --pair-filter=any \
        -o "${READ1_OUTPUT}" \
        -p "${READ2_OUTPUT}" \
        "${READ1_INPUT}" "${READ2_INPUT}"
  4. 4

    Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

    cutadapt (command truncated in the source record) vNot specified, part of Yeo Lab eCLIP pipeline GitHub
    $ Bash example
    # The recorded command has lost its beginning: "quality-cutoff 6" is the tail of
    # cutadapt's --quality-cutoff option, not a standalone program. Options that preceded
    # it in the original command are not recoverable from this record.
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output paths
    R1_INPUT="/full/path/to/files/file_R1.C01.fastq.gz"
    R2_INPUT="/full/path/to/files/file_R2.C01.fastq.gz"
    R1_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
    R2_OUTPUT="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
    METRICS_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"
    
    # Main 3' adapter for R1
    ADAPTER_R1="NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    # 5' adapters for R1
    ADAPTER_5P_1="CTTCCGATCTACAAGTT"
    ADAPTER_5P_2="CTTCCGATCTTGGTCCT"
    # 3' adapters for R2 (passed with -A)
    ADAPTER_R2_LIST=(
      "AACTTGTAGATCGGA"
      "AGGACCAAGATCGGA"
      "ACTTGTAGATCGGAA"
      "GGACCAAGATCGGAA"
      "CTTGTAGATCGGAAG"
      "GACCAAGATCGGAAG"
      "TTGTAGATCGGAAGA"
      "ACCAAGATCGGAAGA"
      "TGTAGATCGGAAGAG"
      "CCAAGATCGGAAGAG"
      "GTAGATCGGAAGAGC"
      "CAAGATCGGAAGAGC"
      "TAGATCGGAAGAGCG"
      "AAGATCGGAAGAGCG"
      "AGATCGGAAGAGCGT"
      "GATCGGAAGAGCGTC"
      "ATCGGAAGAGCGTCG"
      "TCGGAAGAGCGTCGT"
      "CGGAAGAGCGTCGTG"
      "GGAAGAGCGTCGTGT"
    )
    
    # Build the -A arguments as an array so each adapter is passed as its own word
    ADAPTER_A_ARGS=()
    for adapter in "${ADAPTER_R2_LIST[@]}"; do
      ADAPTER_A_ARGS+=(-A "${adapter}")
    done
    
    cutadapt --quality-cutoff 6 -m 18 \
      -a "${ADAPTER_R1}" \
      -g "${ADAPTER_5P_1}" \
      -g "${ADAPTER_5P_2}" \
      "${ADAPTER_A_ARGS[@]}" \
      -o "${R1_OUTPUT}" \
      -p "${R2_OUTPUT}" \
      "${R1_INPUT}" \
      "${R2_INPUT}" \
      > "${METRICS_OUTPUT}"
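    The 20 -A adapters in the command above look like 15-nt windows tiled along two longer sequences that share the same 3' portion. That is an observation about the list itself; the pipeline record does not state it. A minimal sketch, assuming bash and two hypothetical full-length sequences (FULL_ADAPTERS) reconstructed from the windows, shows how such a list could be generated rather than typed by hand. The helper name tile_adapters is made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: rebuild the -A adapter list by tiling 15-nt windows along two sequences.
# FULL_ADAPTERS holds hypothetical full-length sequences inferred from the list itself;
# they are not stated anywhere in the pipeline record.
FULL_ADAPTERS=(
  "AACTTGTAGATCGGAAGAGCGTCGTGT"
  "AGGACCAAGATCGGAAGAGCGTCGTGT"
)

# tile_adapters WINDOW SEQ...
# Prints each distinct WINDOW-nt substring, walking all sequences in step
# (offset 0 of every sequence, then offset 1, and so on).
tile_adapters() {
  local window=$1
  shift
  local seen=" " maxlen=0 i seq kmer
  for seq in "$@"; do
    if (( ${#seq} > maxlen )); then maxlen=${#seq}; fi
  done
  for (( i = 0; i + window <= maxlen; i++ )); do
    for seq in "$@"; do
      if (( i + window > ${#seq} )); then continue; fi
      kmer=${seq:i:window}
      # Skip windows already emitted (the two sequences share their 3' portion)
      case "$seen" in
        *" ${kmer} "*) continue ;;
      esac
      seen+="${kmer} "
      printf '%s\n' "$kmer"
    done
  done
}

# Turn the windows into cutadapt -A arguments (windows contain no whitespace,
# so plain word splitting of the command substitution is safe here)
ADAPTER_ARGS=()
for kmer in $(tile_adapters 15 "${FULL_ADAPTERS[@]}"); do
  ADAPTER_ARGS+=(-A "$kmer")
done

printf '%s ' "${ADAPTER_ARGS[@]}"
echo
```

    The array can then be expanded as "${ADAPTER_ARGS[@]}" in a cutadapt call. Compare the printed list with the recorded command before relying on it.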
  5. 5

    Takes output from cutadapt round 1.

    cutadapt v3.4 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=3.4
    
    # Define input and output files
    # INPUT_FASTQ: Output from cutadapt round 1, typically a FASTQ file.
    INPUT_FASTQ="input_round1.fastq.gz"
    OUTPUT_FASTQ="trimmed_round2.fastq.gz"
    
    # ADAPTER: Placeholder for the 3' adapter sequence. 
    # For eCLIP, this is often the Illumina universal adapter or a specific small RNA adapter.
    # Replace with the actual adapter sequence used in your experiment.
    ADAPTER="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" 
    
    # Execute cutadapt for 3' adapter trimming
    # -a: 3' adapter sequence
    # -o: Output file for trimmed reads
    # -m: Minimum length of reads to keep after trimming (e.g., 18 for eCLIP)
    # -q: Quality cutoff (e.g., 20 for Phred score 20)
    # -j: Number of CPU cores to use
    cutadapt \
      -a "${ADAPTER}" \
      -o "${OUTPUT_FASTQ}" \
      -m 18 \
      -q 20 \
      -j 8 \
      "${INPUT_FASTQ}"
  6. 6

    Run to trim off the 3’ adapters on read 2, to control for double ligation events.

    cutadapt (Inferred with models/gemini-2.5-flash) v1.18 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=1.18
    
    # Define input and output files (placeholders)
    INPUT_R2="read2.fastq.gz"
    OUTPUT_TRIMMED_R2="trimmed_read2.fastq.gz"
    
    # Define the 3' adapter sequence for read 2 (placeholder for the double-ligation adapter).
    # This sequence is inferred and has not been checked against the authors' workflow;
    # the recorded round-2 command passes a list of 15-nt adapters with -A instead.
    ADAPTER_R2="AAAAAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
    
    # Run cutadapt to trim the 3' adapter from read 2
    # -a: Specifies a 3' adapter sequence to be removed from the 3' end of the reads.
    # -o: Specifies the output file for the trimmed reads.
    cutadapt -a "${ADAPTER_R2}" -o "${OUTPUT_TRIMMED_R2}" "${INPUT_R2}"
  7. 7

    Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

    cutadapt v3.4 GitHub
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt
    # Note: -f/--format comes from older cutadapt releases; newer ones auto-detect the input
    # format and may warn about or reject the option, in which case drop "-f fastq".
    
    cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 \
      -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
      -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
      -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
      -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
      -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
      -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
      -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
      /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
      /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
      > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
  8. 8

    Takes output from cutadapt round 2.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    # Input files are inferred from "Takes output from cutadapt round 2"
    INPUT_R1="round2_trimmed_R1.fastq.gz"
    INPUT_R2="round2_trimmed_R2.fastq.gz"
    OUTPUT_R1="round3_trimmed_R1.fastq.gz"
    OUTPUT_R2="round3_trimmed_R2.fastq.gz"
    
    # Define adapters (example Illumina TruSeq adapters; adjust for the specific library prep)
    # For eCLIP, the adapters depend on the library preparation protocol.
    # Both sequences below are trimmed as 3' adapters: -a applies to read 1 and -A to read 2.
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example TruSeq adapter found at the 3' end of read 1
    ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example TruSeq adapter found at the 3' end of read 2 (not a 5' adapter, despite the variable name)
    
    # Define minimum read length (common value for eCLIP)
    MIN_LENGTH=18
    
    # Number of cores to use
    NUM_CORES=8
    
    # Execute cutadapt for adapter trimming and quality filtering
    cutadapt \
        -a "${ADAPTER_3PRIME}" \
        -A "${ADAPTER_5PRIME}" \
        -o "${OUTPUT_R1}" \
        -p "${OUTPUT_R2}" \
        --minimum-length "${MIN_LENGTH}" \
        --cores "${NUM_CORES}" \
        --max-n 0.1 \
        --trim-n \
        "${INPUT_R1}" "${INPUT_R2}"
  9. 9

    Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.0f GitHub
    $ Bash example
    # Reference data: A FASTA file containing human rRNA, tRNA, snRNA, snoRNA, and other common repetitive elements (e.g., from RepBase).
    # For eCLIP, this is often a custom "decoy" genome, which can be generated by concatenating relevant repeat sequences.
    # Placeholder for reference FASTA: human_rRNA_tRNA_snRNA_snoRNA_repeats.fasta
    # Placeholder for STAR index directory: /path/to/star_repeat_index
    # Placeholder for input reads: reads.fastq.gz
    # Placeholder for output filtered reads: reads_filtered_repeats.fastq.gz
    
    # Step 1: Build STAR index for repetitive elements (if not already built)
    # This step is typically performed once for a given set of repeat sequences.
    # mkdir -p /path/to/star_repeat_index
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/star_repeat_index \
    #      --genomeFastaFiles human_rRNA_tRNA_snRNA_snoRNA_repeats.fasta \
    #      --runThreadN 8 # Adjust thread count as needed
    
    # Step 2: Align reads to the repetitive element index and extract unmapped reads
    # This effectively filters out reads that map to repetitive elements, leaving only non-repetitive reads.
    STAR --genomeDir /path/to/star_repeat_index \
         --readFilesIn reads.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix repeat_filtered_ \
         --outSAMtype None \
         --outReadsUnmapped Fastx \
         --outStd Log \
         --runThreadN 8 # Adjust thread count as needed
    
    # STAR writes the unmapped reads uncompressed to repeat_filtered_Unmapped.out.mate1 (single-end)
    # or to repeat_filtered_Unmapped.out.mate1 and repeat_filtered_Unmapped.out.mate2 (paired-end).
    # Compress while renaming so the .gz extension matches the file contents
    gzip -c repeat_filtered_Unmapped.out.mate1 > reads_filtered_repeats.fastq.gz
  10. 10

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Define variables for input/output files and reference genome directory
    # Replace with actual paths relevant to your environment.
    # The REF_GENOME_DIR should point to a STAR-indexed genome directory for RepBase human sequences.
    REF_GENOME_DIR="/path/to/RepBase_human_database_file"
    READ1_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    READ2_FILE="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    OUTPUT_BAM_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    # Note: The --outFileNamePrefix is set to the same path as the redirected BAM output.
    # This means auxiliary files (Log.out, SJ.out.tab, Unmapped.out.mate1, etc.) will be named with this full path as a prefix.
    OUTPUT_PREFIX_FOR_AUX_FILES="${OUTPUT_BAM_FILE}"
    
    # Execute STAR alignment
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${REF_GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${READ1_FILE}" "${READ2_FILE}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX_FOR_AUX_FILES}" \
      --outSAMattributes All \
      --readFilesCommand zcat \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd \
      > "${OUTPUT_BAM_FILE}"
  11. 11

    Takes output from STAR rmRep.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    GENOME_DIR="/path/to/genome/index/hg38" # Placeholder for human genome hg38
    READ1="input_R1.fastq.gz"
    READ2="input_R2.fastq.gz" # Optional, if paired-end
    OUTPUT_PREFIX="aligned_rmRep_"
    THREADS=8 # Adjust as needed
    
    # Run STAR alignment against the genome.
    # "rmRep" is the pipeline's label for the repeat-removal alignment, so the inputs here
    # should be the reads that alignment left unmapped.
    # Setting --outFilterMultimapNmax 1 reports only uniquely mapping reads.
    # --readFilesCommand zcat is needed for gzip-compressed input; omit it for plain FASTQ.
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READ1} ${READ2} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1
  12. 12

    Maps unique reads to the human genome.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Replace with actual paths and filenames
    GENOME_DIR="/path/to/STAR_human_GRCh38_index" # Pre-built STAR index for human GRCh38
    INPUT_FASTQ="input_reads.fastq.gz" # Input FASTQ file
    OUTPUT_PREFIX="aligned_unique_reads" # Prefix for output files
    THREADS=8 # Number of threads to use, adjust as needed
    
    # Note: The STAR genome index must be pre-built using the 'STAR --runMode genomeGenerate' command.
    # Example for generating index (run once):
    # STAR --runThreadN ${THREADS} --runMode genomeGenerate --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/human_GRCh38.fa \
    #      --sjdbGTFfile /path/to/human_GRCh38.gtf \
    #      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length - 1
    
    # Map unique reads to the human genome using STAR
    # --outFilterMultimapNmax 1 ensures only uniquely mapping reads are reported.
    # --readFilesCommand zcat is needed because the input FASTQ is gzip-compressed.
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --readFilesCommand zcat \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 10 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66
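    The index-building comment above suggests setting --sjdbOverhang to the read length minus 1. A minimal sketch of that arithmetic, assuming bash with awk and a standard four-line-per-record FASTQ: it scans the reads, takes the longest one, and subtracts 1. The helper names max_read_length and sjdb_overhang are made up for this illustration.

```shell
#!/usr/bin/env bash
# max_read_length: read FASTQ text on stdin and print the longest sequence length.
# Sequence lines are the 2nd line of every 4-line FASTQ record.
max_read_length() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# sjdb_overhang: print (longest read length - 1), or 0 when no reads were seen.
sjdb_overhang() {
  local len
  len=$(max_read_length)
  echo $(( len > 0 ? len - 1 : 0 ))
}

# Example with two toy reads of 8 and 4 nt
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGT\n+\nIIII\n' | sjdb_overhang
```

    On real data, something like `zcat input_reads.fastq.gz | head -n 400000 | sjdb_overhang` samples the first 100,000 reads. Trimmed reads vary in length, so the longest read is the relevant one.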
  13. 13

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

    STAR vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables for clarity
    # Placeholder STAR index path; the recorded command uses /path/to/STAR_database_file and
    # does not name the assembly, so GRCh38/GENCODE v38 here is an assumption.
    GENOME_DIR="/path/to/STAR_indices/GRCh38_GENCODE_v38"
    INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
    INPUT_R2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    # This prefix is used for auxiliary STAR output files (e.g., Log.out, SJ.out.tab).
    # The primary BAM output is redirected to OUTPUT_BAM.
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${INPUT_R1}" "${INPUT_R2}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMattributes All \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd > "${OUTPUT_BAM}"
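    The recorded commands name every intermediate file by appending a suffix to the read-1 FASTQ name, and STAR's unmapped reads inherit the full BAM path as their prefix (hence "rep.bamUnmapped.out.mate1"). A minimal sketch, assuming bash, derives those names from a single input path so the chain is easier to follow. The function name stage_paths and the labels in the first column are made up for this illustration; the suffixes are taken from the commands in this pipeline.

```shell
#!/usr/bin/env bash
# stage_paths R1_FASTQ
# Prints "label<TAB>path" for each intermediate file derived from the read-1 FASTQ name,
# following the suffixes seen in the recorded commands. The labels are hypothetical.
stage_paths() {
  local r1=$1
  local trim1="${r1}.adapterTrim"
  local trim2="${trim1}.round2"
  local rep="${trim2}.rep.bam"
  local rmrep="${trim2}.rmRep.bam"
  local rmdup="${trim2}.rmRep.rmDup"
  # printf reuses the two-field format for each successive label/path pair
  printf '%s\t%s\n' \
    trim_round1      "${trim1}.fastq.gz" \
    trim_round2      "${trim2}.fastq.gz" \
    repeat_bam       "${rep}" \
    repeat_unmapped1 "${rep}Unmapped.out.mate1" \
    repeat_unmapped2 "${rep}Unmapped.out.mate2" \
    genome_bam       "${rmrep}" \
    dedup_bam        "${rmdup}.bam" \
    sorted_bam       "${rmdup}.sorted.bam" \
    sorted_bai       "${rmdup}.sorted.bam.bai"
}

stage_paths /full/path/to/files/file_R1.C01.fastq.gz
```

    Setting --outFileNamePrefix to the BAM path is what produces the fused "bamUnmapped" names. A prefix ending in "." or "_" would give cleaner file names but would no longer match the recorded commands.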
  14. 14

    Takes output from STAR genome mapping.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for human hg38 genome index
    READ1_FASTQ="input_R1.fastq.gz" # Placeholder for input R1 FASTQ file
    READ2_FASTQ="input_R2.fastq.gz" # Placeholder for input R2 FASTQ file (remove if single-end)
    OUTPUT_PREFIX="aligned_sample" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Run STAR genome mapping
    # --readFilesCommand zcat is needed for gzip-compressed input; omit it for plain FASTQ.
    STAR \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
      --readFilesCommand zcat \
      --runThreadN "${THREADS}" \
      --outFileNamePrefix "${OUTPUT_PREFIX}." \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterMultimapNmax 1 \
      --outFilterMismatchNmax 3 \
      --outFilterScoreMinOverLread 0.66 \
      --outFilterMatchNminOverLread 0.66
  15. 15

    Custom random-mer-aware script for PCR duplicate removal.

    umi_tools dedup (Inferred with models/gemini-2.5-flash; a stand-in, not the custom script itself) v1.0.0 GitHub
    $ Bash example
    # The custom script named by this pipeline is barcode_collapse_pe.py (its recorded command
    # follows in the next step). The example here swaps in umi_tools dedup as a generic
    # random-mer-aware alternative; it is not the authors' script.
    
    # Install umi_tools if not available
    # conda install -c bioconda umi_tools=1.0.0
    
    # Input: aligned_reads.bam, coordinate-sorted and indexed, with the random-mer appended
    #        to each read name after "_" (e.g., "readname_UMI"); this naming is an assumption.
    # Output: deduplicated_reads.bam, deduplication_stats_* files, deduplication.log
    
    umi_tools dedup \
      -I aligned_reads.bam \
      -S deduplicated_reads.bam \
      --output-stats deduplication_stats \
      -L deduplication.log \
      --umi-separator "_" \
      --extract-umi-method read_id \
      --paired
  16. 16

    Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

    barcode_collapse_pe.py vN/A (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Clone the eCLIP pipeline repository
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Create and activate the conda environment (assuming environment.yml is present in the cloned repo)
    # conda env create -f environment.yml
    # conda activate eclip
    
    # Execute the barcode_collapse_pe.py script
    # Assuming the 'eclip' repository was cloned into the current directory.
    python eclip/scripts/barcode_collapse_pe.py \
        --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
        --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
        --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
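    The idea behind random-mer-aware duplicate removal is that two reads count as PCR duplicates only when they share both their mapping position and their random-mer. A minimal sketch of that principle, assuming bash with awk and a hypothetical tab-separated table (chrom, start, strand, random-mer, read name) in place of a BAM file. It does not reproduce the logic of barcode_collapse_pe.py, which works on paired-end alignments.

```shell
#!/usr/bin/env bash
# collapse_duplicates: keep the first read for each (chrom, start, strand, random-mer) key.
# Input (stdin): tab-separated chrom, start, strand, random-mer, read name.
# This toy layout is hypothetical; the real script works on paired-end BAM records.
collapse_duplicates() {
  awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# read_a and read_b share position and random-mer, so read_b is dropped as a PCR duplicate.
# read_c has the same position but a different random-mer, so it is kept.
printf 'chr1\t100\t+\tACGTACGTAC\tread_a\nchr1\t100\t+\tACGTACGTAC\tread_b\nchr1\t100\t+\tTTGCATGCAA\tread_c\nchr2\t500\t-\tACGTACGTAC\tread_d\n' \
  | collapse_duplicates
```

    Collapsing on position alone would also discard read_c, which is the loss of independent molecules that the random-mer is meant to prevent.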
  17. 17

    Takes output from barcode collapse PE.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.10a
    
    # Define variables
    # STAR genome index for hg38 (GRCh38) should be pre-built using STAR's genomeGenerate command.
    # Example: STAR --runMode genomeGenerate --genomeDir /path/to/star_index/hg38 --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf --runThreadN <num_threads>
    STAR_INDEX="/path/to/star_index/hg38" # Placeholder for STAR genome index
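    # The recorded barcode collapse command writes a BAM file, so this re-alignment from FASTQ is
    # an inferred illustration; the FASTQ names below are placeholders, not outputs of that step.
    READ1="deduplicated_reads_R1.fastq.gz" # Placeholder input FASTQ R1
    READ2="deduplicated_reads_R2.fastq.gz" # Placeholder input FASTQ R2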
    OUTPUT_PREFIX="aligned_deduplicated" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Run STAR alignment for paired-end reads, typical for eCLIP pipelines
    STAR --genomeDir "${STAR_INDEX}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --readFilesCommand zcat \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --alignSJoverhangMin 8 \
         --alignSJDBoverhangMin 1 \
         --sjdbScore 1 \
         --outSAMattributes NH HI AS NM MD \
         --limitBAMsortRAM 60000000000 # Adjust based on available RAM (e.g., 60GB for human genome)
    
  18. 18

    Sorts resulting bam file for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.10
    
    # Sort the BAM file
    # Replace 'input.bam' with your unsorted BAM file
    # Replace 'output.sorted.bam' with the desired name for the sorted BAM file
    # Adjust '-@' for the number of threads/CPUs to use
    samtools sort -@ 8 -o output.sorted.bam input.bam
  19. 19

    Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

    Picard v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
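    # Install Picard (e.g., via conda)
    # conda install -c bioconda picard
    # Note: the recorded command runs the legacy class net.sf.picard.sam.SortSam from a GATK
    # Queue.jar; a conda Picard install does not provide that jar, and current Picard exposes
    # the tool as "picard SortSam". -XX:+UseParallelOldGC is also rejected by recent JVMs.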
    
    # Define variables for paths and files
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    TMP_DIR="/full/path/to/files/.queue/tmp"
    GATK_QUEUE_JAR="/path/to/gatk/dist/Queue.jar" # Path to GATK Queue.jar, which might contain bundled Picard
    
    # Create temporary directory if it doesn't exist
    mkdir -p "${TMP_DIR}"
    
    # Execute Picard SortSam via Java
    java -Xmx2048m \
         -XX:+UseParallelOldGC \
         -XX:ParallelGCThreads=4 \
         -XX:GCTimeLimit=50 \
         -XX:GCHeapFreeLimit=10 \
         -Djava.io.tmpdir="${TMP_DIR}" \
         -cp "${GATK_QUEUE_JAR}" \
         net.sf.picard.sam.SortSam \
         INPUT="${INPUT_BAM}" \
         TMP_DIR="${TMP_DIR}" \
         OUTPUT="${OUTPUT_BAM}" \
         VALIDATION_STRINGENCY=SILENT \
         SO=coordinate \
         CREATE_INDEX=true
  20. 20

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assuming 'sorted.bam' is the output from sortSam
    samtools index sorted.bam
  21. 21

    Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.10
    
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"
    
    samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
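When a pipeline is re-run, it can help to skip indexing if an up-to-date index is already present. The following is a minimal sketch under that assumption; `index_is_fresh` is a hypothetical helper that compares file modification times and does not inspect the index contents.

```shell
# Hypothetical helper: succeed only if "<bam>.bai" exists and is not older
# than the BAM file itself (based on modification times only).
index_is_fresh() {
    local bam="$1"
    local bai="${bam}.bai"
    [ -f "${bai}" ] && [ ! "${bai}" -ot "${bam}" ]
}

# Intended use, assuming samtools is installed:
#   index_is_fresh file.sorted.bam || samtools index file.sorted.bam file.sorted.bam.bai
```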
  22. 22

    Takes inputs from multiple final bam files.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Example: Merge multiple BAM files into a single BAM file.
    # This command takes multiple sorted BAM files as input and merges them.
    # The output BAM file will be sorted by coordinate.
    # Replace rep1.bam, rep2.bam, rep3.bam with your actual input BAM files.
    # Replace merged_output.bam with your desired output file name.
    samtools merge merged_output.bam rep1.bam rep2.bam rep3.bam
  23. 23

    Merges the two technical replicates for further downstream analysis.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.10
    
    # Merge technical replicates (e.g., BAM files) into a single file
    # Replace 'replicate1.bam' and 'replicate2.bam' with actual input file names
    # Replace 'merged_replicates.bam' with the desired output file name
    samtools merge merged_replicates.bam replicate1.bam replicate2.bam
    
    # Index the merged BAM file for efficient access by downstream tools
    samtools index merged_replicates.bam
  24. 24

    Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

    samtools v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.19
    
    # Define input and output paths
    OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
    INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    
    # Merge sorted BAM files
    samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"
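The recorded merge command follows a regular naming pattern in which only the barcode well ID (C01, D08) differs between replicates. The sketch below assembles the command text from that pattern for review before running it. `replicate_bam` and `merge_command` are hypothetical helper names, and paths containing spaces are not handled.

```shell
# Hypothetical helper: build a per-replicate sorted BAM path from a directory
# and a barcode well ID, following the file naming in the recorded command.
replicate_bam() {
    local dir="$1" well="$2"
    printf '%s/file_R1.%s.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam\n' "${dir}" "${well}"
}

# Hypothetical helper: print (do not execute) a samtools merge command for
# the given output file and any number of well IDs.
merge_command() {
    local dir="$1" out="$2" cmd well
    shift 2
    cmd="samtools merge ${out}"
    for well in "$@"; do
        cmd="${cmd} $(replicate_bam "${dir}" "${well}")"
    done
    printf '%s\n' "${cmd}"
}

# Dry run for the two technical replicates named in the recorded command.
merge_command /full/path/to/files /full/path/to/files/CombinedID.merged.bam C01 D08
```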
  25. 25

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assuming 'sorted.bam' is the output from sortSam
    samtools index sorted.bam
  26. 26

    Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
  27. 27

    Takes output from sortSam.

    samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Example: Index a sorted BAM file
    # Input: sorted.bam (output from sortSam)
    # Output: sorted.bam.bai (BAM index file)
    samtools index sorted.bam
  28. 28

    Only outputs the second read in each pair for use with single stranded peak caller.

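    samtools (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Keep only the second read of each pair (-f 128), matching the pipeline's samtools view step.
    # merged.bam and merged.r2.bam are placeholder file names.
    samtools view -hb -f 128 merged.bam > merged.r2.bam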
  29. 29

    This is the final bam file to perform analysis on.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # This step assumes an input BAM file (e.g., aligned_reads.bam) has been generated
    # and is being prepared as the 'final.bam' for downstream analysis.
    # For this dataset the reads were aligned to the hg19 human genome build.
    
    # Sort the BAM file by coordinate, which is often a prerequisite for many downstream analyses.
    samtools sort -o final.bam aligned_reads.bam
    
    # Index the sorted BAM file, which is necessary for quick random access to reads (e.g., by region).
    samtools index final.bam
  30. 30

    Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.10
    
    # Filter BAM file to keep only second-in-pair reads (-f 128) and output in BAM format (-b) with header (-h)
    samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
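The `-f 128` filter keeps alignments whose SAM FLAG has the 0x80 bit set, which marks the second read of a pair. As a small illustration, the sketch below tests that bit with shell arithmetic. `is_second_in_pair` is a hypothetical helper, and the four FLAG values are common paired-end examples rather than values taken from this dataset.

```shell
# Hypothetical helper: succeed if a SAM FLAG value has the 0x80 (128) bit set,
# i.e. the read is marked "second in pair", which is what -f 128 selects.
is_second_in_pair() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# Classify a few typical paired-end FLAG values.
for flag in 99 147 83 163; do
    if is_second_in_pair "${flag}"; then
        echo "${flag}: kept by -f 128"
    else
        echo "${flag}: dropped by -f 128"
    fi
done
```

FLAGs 147 and 163 include the 128 bit and are kept, while 99 and 83 carry the 64 bit (first in pair) instead and are dropped.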
  31. 31

    Takes results from samtools view.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (example using conda)
    # conda install -c bioconda samtools=1.19
    
    # Sort the BAM produced by the preceding 'samtools view' step (e.g., the read-2-only BAM).
    # Coordinate sorting is typically required before indexing or peak calling.
    # Input: 'filtered_reads.bam' (placeholder for the 'samtools view' output).
    # Output: 'sorted_reads.bam' (coordinate-sorted).
    
    samtools sort -o sorted_reads.bam filtered_reads.bam
  32. 32

    Calls peaks on those files.

    clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install clipper (if not already installed)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # pip install .
    # cd ..
    
    # Placeholder for the input BAM file (the read-2-only BAM from the previous steps)
    # Replace with the actual path to your aligned IP BAM file.
    INPUT_BAM="path/to/your/ip_sample.r2.bam"
    
    # Species / genome build passed to clipper (this dataset was processed against hg19)
    SPECIES="hg19"
    
    # Output BED file for the called peaks
    OUTPUT_BED="eclip_peaks.bed"
    
    # Execute clipper to call peaks, mirroring the options of the pipeline's recorded command
    # -b: IP sample BAM file
    # -s: species (genome build), not a chromosome-sizes file
    # -o: output BED file
    # --bonferroni, --superlocal, --threshold-method binomial, --save-pickle:
    #     passed as in the recorded command; check 'clipper --help' for their exact meaning
    clipper -b "${INPUT_BAM}" \
            -s "${SPECIES}" \
            -o "${OUTPUT_BED}" \
            --bonferroni \
            --superlocal \
            --threshold-method binomial \
            --save-pickle
  33. 33

    Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

    CLIPper vNot specified GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # The pip/conda package name is not confirmed here; installing from the
    # YeoLab GitHub repository is one option:
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper && pip install .
    
    # Define input and output paths (adjust as needed)
    INPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"
    OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed"
    GENOME_ASSEMBLY="hg19" # Reference genome assembly
    
    # Run CLIPper peak calling
    clipper -b "${INPUT_BAM}" \
            -s "${GENOME_ASSEMBLY}" \
            -o "${OUTPUT_BED}" \
            --bonferroni \
            --superlocal \
            --threshold-method binomial \
            --save-pickle
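Once peaks are called, a quick per-strand tally of the BED file can flag an obviously lopsided result. The sketch below assumes a tab-separated BED file with the strand in column 6; the exact column layout of CLIPper output should be verified before relying on it. `peaks_per_strand` is a hypothetical helper name.

```shell
# Hypothetical helper: count peaks on the + and - strands of a BED file,
# assuming tab-separated columns with the strand in column 6.
peaks_per_strand() {
    awk -F'\t' '
        NF >= 6 { n[$6]++ }
        END { printf "+\t%d\n-\t%d\n", n["+"], n["-"] }
    ' "$1"
}

# Intended use on the peak file written by clipper:
#   peaks_per_strand /full/path/to/files/CombinedID.merged.r2.peaks.bed
```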

Tools Used

Raw Source Text
Library strategy: eCLIP-Seq
Takes output from raw files.  Run to trim off both 5’ and 3’ adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3’ adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding, txt format contains counts of reads for both IP and Input for each gene in subtranscriptomic region, bigWigs are read densities for positive and negative strand genome wide