GSE80039 Processing Pipeline

OTHER code_examples 33 steps

Publication

Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).

Nature Methods (2016), PMID 27018577

Dataset

GSE80039

Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [eCLIP - Hep…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: eCLIP-seq

    eCLIP vCWL workflow (yeolab/eclip) GitHub
    $ Bash example
    # This command is a placeholder for running the eCLIP CWL workflow.
    # It assumes 'eclip.cwl' is the main workflow definition file
    # and 'eclip_inputs.yaml' contains paths to input FASTQ files,
    # genome reference (e.g., hg38), and other necessary parameters.
    #
    # Example 'eclip_inputs.yaml' content for a human (hg38) sample:
    # fastq_r1: { class: File, path: "sample_R1.fastq.gz" }
    # fastq_r2: { class: File, path: "sample_R2.fastq.gz" }
    # genome_fasta: { class: File, path: "/path/to/hg38.fa" }
    # genome_star_index: { class: Directory, path: "/path/to/hg38_star_index" }
    # adapter_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example adapter
    # output_dir: "eclip_results"
    #
    # For detailed setup and execution, refer to the yeolab/eclip GitHub repository:
    # https://github.com/yeolab/eclip/
    #
    # Installation of cwltool (if not already installed):
    # conda install -c conda-forge cwltool
    # or
    # pip install cwltool
    #
    # Clone the eCLIP CWL workflow repository:
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    #
    # Execute the eCLIP CWL workflow:
    # Replace 'eclip.cwl' and 'eclip_inputs.yaml' with actual paths.
    cwltool eclip.cwl eclip_inputs.yaml
  2. 2

    Takes output from raw files.

    Trim Galore! (Inferred with models/gemini-2.5-flash) v0.6.7 GitHub
    $ Bash example
    # Install Trim Galore! (if not already installed)
    # conda install -c bioconda trim-galore
    
    # Define input raw FASTQ files (replace with actual file paths)
    # Assuming paired-end raw FASTQ files as common input for many pipelines
    INPUT_FASTQ_R1="sample_R1.fastq.gz"
    INPUT_FASTQ_R2="sample_R2.fastq.gz"
    
    # Define output directory for trimmed FASTQ files
    OUTPUT_DIR="./trimmed_fastq"
    mkdir -p "${OUTPUT_DIR}"
    
    # Run Trim Galore! for adapter trimming and quality filtering
    # This command processes paired-end reads, automatically detects adapters,
    # and places the trimmed files in the specified output directory.
    # Trim Galore! internally uses Cutadapt for trimming.
    trim_galore --paired \
                --output_dir "${OUTPUT_DIR}" \
                "${INPUT_FASTQ_R1}" \
                "${INPUT_FASTQ_R2}"
  3. 3

    Run to trim off both 5’ and 3’ adapters on both reads.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output file paths
    READ1_IN="read1.fastq.gz"
    READ2_IN="read2.fastq.gz"
    READ1_OUT="trimmed_read1.fastq.gz"
    READ2_OUT="trimmed_read2.fastq.gz"
    REPORT_FILE="cutadapt_report.txt"
    
    # Define adapter sequences (example Illumina TruSeq adapters from Yeo lab's skipper workflow)
    # IMPORTANT: Replace these with the actual adapter sequences used in your library preparation.
    # If distinct 5' adapters are used, replace ADAPTER_FWD_5PRIME and ADAPTER_REV_5PRIME accordingly.
    ADAPTER_FWD_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_REV_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    # For 5' adapter trimming, often the same sequence or a specific 5' adapter is used.
    # Using the same sequence as a placeholder if no distinct 5' adapter is specified.
    ADAPTER_FWD_5PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder, replace if distinct 5' adapter exists
    ADAPTER_REV_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Placeholder, replace if distinct 5' adapter exists
    
    # Run cutadapt to trim both 5' and 3' adapters from both reads
    # -a ADAPTER_FWD_3PRIME: 3' adapter for R1
    # -A ADAPTER_REV_3PRIME: 3' adapter for R2
    # -g ADAPTER_FWD_5PRIME: 5' adapter for R1
    # -G ADAPTER_REV_5PRIME: 5' adapter for R2
    # -n 2: Remove up to two adapters per read; with the default (-n 1) only the
    #       best-matching adapter is removed, so a 5' and a 3' adapter would not both be trimmed
    # -q 20: Trim low-quality bases from the 3' end (Phred score < 20)
    # --minimum-length 15: Discard read pairs in which a read is shorter than 15 bp after trimming
    # -e 0.1: Maximum error rate for adapter matching
    # -o: Output file for R1
    # -p: Output file for R2
    cutadapt \
      -a "${ADAPTER_FWD_3PRIME}" \
      -A "${ADAPTER_REV_3PRIME}" \
      -g "${ADAPTER_FWD_5PRIME}" \
      -G "${ADAPTER_REV_5PRIME}" \
      -n 2 \
      -q 20 \
      --minimum-length 15 \
      -e 0.1 \
      -o "${READ1_OUT}" \
      -p "${READ2_OUT}" \
      "${READ1_IN}" "${READ2_IN}" > "${REPORT_FILE}" 2>&1
  4. 4

    Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

    cutadapt (tool inferred: the recorded command begins mid-line at cutadapt's --quality-cutoff option) GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt
    
    # The command recorded for this step begins mid-line at "quality-cutoff 6". It is read here
    # as a cutadapt call: --quality-cutoff is a cutadapt option and the remaining flags
    # (-a/-g/-A, -o/-p) follow cutadapt's paired-end syntax. Any options that preceded
    # --quality-cutoff are missing from the source and are not reconstructed; the round-2
    # command in step 7 shows which general options may apply.
    R1="/full/path/to/files/file_R1.C01.fastq.gz"
    R2="/full/path/to/files/file_R2.C01.fastq.gz"
    
    cutadapt --quality-cutoff 6 -m 18 \
      -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
      -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT \
      -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
      -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
      -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
      -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
      -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
      -o "${R1}.adapterTrim.fastq.gz" \
      -p "${R2}.adapterTrim.fastq.gz" \
      "${R1}" "${R2}" \
      > "${R1}.adapterTrim.metrics"
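
The two `-g` (5' adapter) sequences in the command for this step are the reverse complements of the first 17 nt of the sequences that the `-A` windows slide over. The sketch below checks that relationship with a small helper. The name `revcomp` is hypothetical, and the 17-nt inputs are an assumption reconstructed by overlapping the listed `-A` windows, not sequences stated in the source.

```shell
#!/bin/sh
# revcomp SEQ: print the reverse complement of a DNA sequence (hypothetical helper).
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' | awk '{
        out = ""
        # Walk the complemented sequence backwards to reverse it.
        for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
        print out
    }'
}

# Inputs: barcode-proximal 17 nt reconstructed from the -A windows (assumed).
revcomp AACTTGTAGATCGGAAG   # prints CTTCCGATCTACAAGTT
revcomp AGGACCAAGATCGGAAG   # prints CTTCCGATCTTGGTCCT
```

Both outputs match the `-g` arguments of the recorded command, which suggests that the read 1 5' adapters and the read 2 3' adapters describe the same ligation junction seen from opposite strands.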
  5. 5

    Takes output from cutadapt round 1.

    cutadapt v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already available
    # conda install -c bioconda cutadapt=1.18
    
    # Generic placeholder for a follow-up trimming pass on the round-1 output.
    # The options below (quality trimming, length filtering, removal of reads
    # containing N, poly-A trimming) are illustrative assumptions. They differ from the
    # recorded round-2 command in step 7, which is paired-end and trims read 2 with -A.
    cutadapt \
      -q 20,20 \
      -m 18 \
      --max-n 0 \
      -a "A{10}" \
      -o output_cutadapt_round2.fastq.gz \
      input_from_cutadapt_round1.fastq.gz
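
The `-q` / `--quality-cutoff` option used in these trimming steps removes low-quality 3' ends. As a rough illustration of the algorithm described in the cutadapt documentation (subtract the cutoff from every quality, take running sums from the 3' end, cut where the sum is smallest), the sketch below works on a plain list of Phred scores rather than on FASTQ records. The function name `quality_trim_length` is hypothetical and this is not cutadapt's own code.

```shell
#!/bin/sh
# quality_trim_length CUTOFF Q1 Q2 ...: print how many bases remain after
# quality trimming of the read end. Hypothetical helper for illustration.
quality_trim_length() {
    cutoff=$1
    shift
    printf '%s\n' "$*" | awk -v cutoff="$cutoff" '{
        sum = 0; min = 0; keep = NF      # keep every base unless a negative sum is found
        for (i = NF; i >= 1; i--) {
            sum += $i - cutoff           # running sum, starting at the last base
            if (sum > 0) break           # good bases outweigh the poor tail: stop
            if (sum < min) { min = sum; keep = i - 1 }
        }
        print keep
    }'
}

# Ten toy Phred scores, trimmed with a cutoff of 10 and with the cutoff of 6
# that appears in the recorded eCLIP commands.
quality_trim_length 10 42 40 26 27 8 7 11 4 2 3   # prints 4
quality_trim_length 6 42 40 26 27 8 7 11 4 2 3    # prints 7
```

A lower cutoff keeps more of the read: with a cutoff of 6 only the last three bases of the toy read are removed, whereas a cutoff of 10 removes the last six.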
  6. 6

    Run to trim off the 3’ adapters on read 2, to control for double ligation events.

    cutadapt (Inferred with models/gemini-2.5-flash) v2.10 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    READ1_INPUT="sample_R1.fastq.gz"
    READ2_INPUT="sample_R2.fastq.gz"
    READ1_TRIMMED="sample_R1_trimmed.fastq.gz"
    READ2_TRIMMED="sample_R2_trimmed.fastq.gz"
    ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGT" # Read 2 3' adapter core shared by the -A sequences in step 7; replace with your own
    
    # Run cutadapt to trim 3' adapters from Read 2, outputting both R1 and R2
    # -A ADAPTER: Specifies the 3' adapter to trim from the second read (R2); R1 passes through untrimmed
    # -o: Output file for R1
    # -p: Output file for R2
    # -m 18: Discard read pairs in which either read is shorter than 18 bp after trimming, as used in the eCLIP pipeline
    cutadapt -A "${ADAPTER_SEQUENCE}" \
             -o "${READ1_TRIMMED}" \
             -p "${READ2_TRIMMED}" \
             -m 18 \
             "${READ1_INPUT}" \
             "${READ2_INPUT}"
  7. 7

    Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

    cutadapt v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
    INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
    OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"
    
    # Define adapters (each -A flag and its sequence are separate array elements,
    # so "${ADAPTERS[@]}" expands to correctly split arguments)
    ADAPTERS=(
        -A AACTTGTAGATCGGA
        -A AGGACCAAGATCGGA
        -A ACTTGTAGATCGGAA
        -A GGACCAAGATCGGAA
        -A CTTGTAGATCGGAAG
        -A GACCAAGATCGGAAG
        -A TTGTAGATCGGAAGA
        -A ACCAAGATCGGAAGA
        -A TGTAGATCGGAAGAG
        -A CCAAGATCGGAAGAG
        -A GTAGATCGGAAGAGC
        -A CAAGATCGGAAGAGC
        -A TAGATCGGAAGAGCG
        -A AAGATCGGAAGAGCG
        -A AGATCGGAAGAGCGT
        -A GATCGGAAGAGCGTC
        -A ATCGGAAGAGCGTCG
        -A TCGGAAGAGCGTCGT
        -A CGGAAGAGCGTCGTG
        -A GGAAGAGCGTCGTGT
    )
    
    # Execute cutadapt command
    cutadapt -f fastq \
        --match-read-wildcards \
        --times 1 \
        -e 0.1 \
        -O 5 \
        --quality-cutoff 6 \
        -m 18 \
        "${ADAPTERS[@]}" \
        -o "${OUTPUT_R1}" \
        -p "${OUTPUT_R2}" \
        "${INPUT_R1}" \
        "${INPUT_R2}" \
        > "${METRICS_FILE}"
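
The twenty `-A` sequences in this step are overlapping 15-nt windows, each shifted by one base. A minimal sketch, assuming two full-length sequences reconstructed by overlapping those windows (`SEQ_A` and `SEQ_B` are hypothetical names and the reconstruction is an inference, not something the source states), regenerates the list instead of typing it by hand:

```shell
#!/bin/sh
# windows SEQ K: print every K-nt window of SEQ, one per line (hypothetical helper).
windows() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

# Assumed full-length sequences, reconstructed by overlapping the -A windows.
SEQ_A="AACTTGTAGATCGGAAGAGCGTCGTGT"
SEQ_B="AGGACCAAGATCGGAAGAGCGTCGTGT"

# Collect the windows of both sequences, dropping repeats but keeping first-seen order.
ADAPTER_WINDOWS=$( { windows "$SEQ_A" 15; windows "$SEQ_B" 15; } | awk '!seen[$0]++' )

# Turn the windows into "-A SEQ" arguments on a single line.
ADAPTER_ARGS=$(printf '%s\n' "$ADAPTER_WINDOWS" |
    awk '{ printf "%s-A %s", (NR > 1 ? " " : ""), $0 } END { print "" }')

printf '%s\n' "$ADAPTER_ARGS"
```

The result contains the same twenty sequences as the recorded command, in a different order (all windows of the first sequence, then the windows unique to the second). Because it is a plain string, `$ADAPTER_ARGS` would have to be expanded unquoted when passed to cutadapt.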
  8. 8

    Takes output from cutadapt round 2.

    cutadapt v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=1.18
    
    # Define input and output files
    # INPUT_FASTQ is the output from cutadapt round 2
    INPUT_FASTQ="sample_R1_trimmed_adapter.fastq.gz"
    OUTPUT_FASTQ="sample_R1_trimmed_polyA.fastq.gz"
    
    # Run cutadapt for poly-A trimming (illustrative; no poly-A pass appears in the commands recorded in steps 4 and 7)
    # -a A{100}: Trims a poly-A tail of up to 100 A's
    # -q 10: Trims low-quality bases from the 3' end with a quality cutoff of 10
    # --minimum-length 18: Discards reads shorter than 18 bp after trimming
    # -e 0.1: Maximum error rate of 10% for adapter matching
    # --overlap 3: Minimum overlap of 3 bases for adapter matching
    # -j 8: Use 8 CPU cores for parallel processing
    cutadapt -a A{100} \
             -q 10 \
             --minimum-length 18 \
             -e 0.1 \
             --overlap 3 \
             -j 8 \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  9. 9

    Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

    BBDuk (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install BBTools suite (which includes BBDuk)
    # conda install -c bioconda bbmap
    
    # Define variables
    # Replace with your actual input FASTQ file(s)
    INPUT_READS="input_reads.fastq.gz" 
    OUTPUT_FILTERED_READS="filtered_reads.fastq.gz"
    
    # This FASTA file contains sequences of human-specific repetitive elements from RepBase.
    # It needs to be prepared beforehand, e.g., by extracting sequences from the RepBase database
    # (Genetic Information Research Institute - GIRI) or by extracting repeat sequences
    # identified by RepeatMasker on the human reference genome (e.g., GRCh38).
    # For example, a combined FASTA of human rRNA, tRNA, and other RepBase elements.
    HUMAN_REPBASE_FASTA="path/to/human_repbase_elements.fa"
    
    # Run BBDuk to remove reads that match human-specific repetitive elements.
    # BBDuk filters reads by shared k-mers with the reference FASTA rather than by alignment;
    # the recorded pipeline performs this step with STAR instead (see step 10).
    # in: Input FASTQ file (use in2= for the second file of a pair).
    # ref: Reference FASTA file containing repetitive element sequences.
    # out: Output FASTQ file(s) with repetitive reads removed.
    # k: K-mer length for matching (set to 31 here; BBDuk's default is 27).
    # hdist: Hamming distance for k-mer matching (1 allows one mismatch; the default is 0).
    # stats: Output statistics about filtered reads to a specified file.
    # overwrite: Allow overwriting output files if they exist.
    # The description mentions "rRNA (& other) repetitive reads". If the HUMAN_REPBASE_FASTA
    # includes rRNA sequences, then this single step can handle both rRNA and other RepBase elements.
    
    bbduk.sh in="${INPUT_READS}" \
               ref="${HUMAN_REPBASE_FASTA}" \
               out="${OUTPUT_FILTERED_READS}" \
               k=31 \
               hdist=1 \
               stats="${OUTPUT_FILTERED_READS}.stats" \
               overwrite=true
  10. 10

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

    STAR vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Reference genome directory for RepBase human repeats.
    # This is a placeholder. Replace with the actual path to your STAR-indexed RepBase human genome directory.
    GENOME_DIR="/path/to/RepBase_human_database_file"
    
    # Input FASTQ files
    READ1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    READ2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    
    # Output BAM file prefix (for auxiliary files like Log.out, SJ.out.tab) and the final redirected BAM file.
    # Note: The main alignment output is sent to stdout (--outStd BAM_Unsorted) and then redirected to FINAL_BAM_OUTPUT.
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    
    STAR --runMode alignReads \
         --runThreadN 16 \
         --genomeDir "${GENOME_DIR}" \
         --genomeLoad LoadAndRemove \
         --readFilesIn "${READ1}" "${READ2}" \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 30 \
         --outFilterMultimapScoreRange 1 \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMattributes All \
         --readFilesCommand zcat \
         --outStd BAM_Unsorted \
         --outSAMtype BAM Unsorted \
         --outFilterType BySJout \
         --outReadsUnmapped Fastx \
         --outFilterScoreMin 10 \
         --outSAMattrRGline ID:foo \
         --alignEndsType EndToEnd \
         > "${FINAL_BAM_OUTPUT}"
    
  11. 11

    Takes output from STAR rmRep.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define variables (replace with actual paths and filenames)
    # GENOME_DIR: Path to the STAR genome index (e.g., for hg38).
    # READ1: Path to the input FASTQ file for read 1.
    # READ2: Path to the input FASTQ file for read 2 (optional, remove if single-end).
    # OUTPUT_PREFIX: Prefix for output files.
    # THREADS: Number of threads to use for STAR alignment.
    GENOME_DIR="/path/to/STAR_genome_index/hg38"
    READ1="input_read1.fastq.gz"
    READ2="input_read2.fastq.gz" # Remove this line if single-end reads
    OUTPUT_PREFIX="aligned_reads"
    THREADS=8
    
    # 1. Align reads with STAR
    # This command aligns RNA-based assay reads (like eCLIP) to a reference genome.
    # Parameters are adapted from the Yeo lab eCLIP CWL workflow (https://github.com/yeolab/eclip).
    # --runThreadN: Number of threads.
    # --genomeDir: Path to the STAR genome index.
    # --readFilesIn: Input FASTQ files. Use only READ1 if single-end.
    # --outFileNamePrefix: Prefix for output files.
    # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file.
    # --outFilterMultimapNmax 1: Consider only uniquely mapping reads (common for eCLIP).
    # --outFilterMismatchNmax 3: Max number of mismatches per read.
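    # --alignIntronMax 1: Effectively disables spliced alignments. eCLIP reads derive from RNA and can span splice junctions, and the recorded genome-mapping command (step 13) does not set this option, so treat it as an assumption.
    # --alignEndsType Local: Local alignment with soft-clipping; the recorded command (step 13) uses EndToEnd instead.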
    # --outFilterScoreMinOverLread 0.66 --outFilterMatchNminOverLread 0.66: Filtering parameters.
    # --outFilterMatchNmin 20: Minimum number of matched bases.
    # --limitBAMsortRAM 30000000000: Limit RAM for BAM sorting (30GB).
    
    STAR \
      --runThreadN ${THREADS} \
      --genomeDir ${GENOME_DIR} \
      --readFilesIn ${READ1} ${READ2} \
      --outFileNamePrefix ${OUTPUT_PREFIX}_ \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterMultimapNmax 1 \
      --outFilterMismatchNmax 3 \
      --alignIntronMax 1 \
      --alignEndsType Local \
      --outFilterScoreMinOverLread 0.66 \
      --outFilterMatchNminOverLread 0.66 \
      --outFilterMatchNmin 20 \
      --limitBAMsortRAM 30000000000
    
    # The above command produces a sorted BAM file: ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
    
    # 2. Optional generic duplicate removal with samtools markdup
    # Note: "rmRep" refers to the removal of repetitive-element reads (steps 9 and 10), not of duplicates;
    # the recorded pipeline removes PCR duplicates with barcode_collapse_pe.py (step 16).
    # markdup relies on mate tags added by 'samtools fixmate -m', which needs name-sorted input.
    # -r: Remove duplicate reads (rather than just marking them).
    
    samtools sort -n ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam \
      | samtools fixmate -m - - \
      | samtools sort - \
      | samtools markdup -r - ${OUTPUT_PREFIX}_deduplicated.bam
    
    # Index the deduplicated BAM file for downstream processing
    samtools index ${OUTPUT_PREFIX}_deduplicated.bam
  12. 12

    Maps unique reads to the human genome.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.10a
    
    # --- Reference Data Setup (Example using GRCh38 and GENCODE v38) ---
    # Download human genome primary assembly FASTA (e.g., from UCSC or NCBI)
    # wget -P /path/to/references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
    # gunzip /path/to/references/hg38.fa.gz
    
    # Download GENCODE v38 GTF annotation (e.g., from GENCODE)
    # wget -P /path/to/references/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
    # gunzip /path/to/references/gencode.v38.annotation.gtf.gz
    
    # Create STAR genome index (run once per reference genome)
    # mkdir -p /path/to/STAR_index/GRCh38_gencode_v38
    # STAR --runThreadN 8 \
    # --runMode genomeGenerate \
    # --genomeDir /path/to/STAR_index/GRCh38_gencode_v38 \
    # --genomeFastaFiles /path/to/references/hg38.fa \
    # --sjdbGTFfile /path/to/references/gencode.v38.annotation.gtf \
    # --sjdbOverhang 100 # Recommended for RNA-seq, typically read length - 1
    
    # --- Alignment Command ---
    # Maps unique reads to the human genome (GRCh38) using STAR
    # Input: input_reads.fastq.gz (replace with your actual FASTQ file)
    # Output: output_prefix_Aligned.sortedByCoord.out.bam (BAM file sorted by coordinate)
    #         output_prefix_ReadsPerGene.out.tab (Gene counts, if --quantMode GeneCounts is used)
    STAR --runThreadN 8 \
    --genomeDir /path/to/STAR_index/GRCh38_gencode_v38 \
    --readFilesIn input_reads.fastq.gz \
    --outFileNamePrefix output_prefix_ \
    --outSAMtype BAM SortedByCoordinate \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 10 \
    --outFilterScoreMinOverLread 0.66 \
    --outFilterMatchNminOverLread 0.66 \
    --quantMode GeneCounts
  13. 13

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Replace with your actual STAR genome directory (e.g., for hg38 or mm10).
    # For eCLIP/RNA-based assays, hg38 is a common reference.
    GENOME_DIR="/path/to/your/STAR_index/hg38" 
    
    # Input FASTQ files: the unmapped mates written by the repeat-mapping STAR run in step 10 (--outReadsUnmapped Fastx)
    INPUT_READS_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
    INPUT_READS_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
    
    # Output BAM file (the alignment output is redirected to this file from stdout)
    OUTPUT_BAM_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    
    # Prefix for other STAR output files (e.g., Log.out, SJ.out.tab, etc.)
    # Note: The original description uses the .bam file name as a prefix, which will result in files like "your.bamLog.out".
    # If you prefer a cleaner prefix, you might change this to something like "/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep."
    OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    
    # Run STAR alignment
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${INPUT_READS_MATE1}" "${INPUT_READS_MATE2}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_FILE_PREFIX}" \
      --outSAMattributes All \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd \
      > "${OUTPUT_BAM_FILE}"
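
The file names in the recorded commands follow one chain of suffixes, which makes it easy to see which step consumes which output. The sketch below derives every name from the raw read 1 path with shell parameter expansion. The variable names are hypothetical; the resulting paths are the ones that appear in the recorded commands, including the `Unmapped.out.mate1` and `Unmapped.out.mate2` suffixes that STAR appends to `--outFileNamePrefix` when `--outReadsUnmapped Fastx` is set.

```shell
#!/bin/sh
# Derive the pipeline's intermediate file names from the raw read 1 FASTQ path.
R1="/full/path/to/files/file_R1.C01.fastq.gz"

TRIM1="${R1}.adapterTrim.fastq.gz"              # cutadapt round 1 output
TRIM2="${TRIM1%.fastq.gz}.round2.fastq.gz"      # cutadapt round 2 output
REP_BAM="${TRIM2%.fastq.gz}.rep.bam"            # STAR alignment against RepBase
UNMAPPED_MATE1="${REP_BAM}Unmapped.out.mate1"   # reads that did not map to repeats
UNMAPPED_MATE2="${REP_BAM}Unmapped.out.mate2"
RMREP_BAM="${REP_BAM%.rep.bam}.rmRep.bam"       # STAR alignment against the genome
RMDUP_BAM="${RMREP_BAM%.bam}.rmDup.bam"         # after PCR duplicate removal
SORTED_BAM="${RMDUP_BAM%.bam}.sorted.bam"       # after coordinate sorting
SORTED_BAI="${SORTED_BAM}.bai"                  # BAM index

printf '%s\n' "$TRIM1" "$TRIM2" "$REP_BAM" "$UNMAPPED_MATE1" "$UNMAPPED_MATE2" \
    "$RMREP_BAM" "$RMDUP_BAM" "$SORTED_BAM" "$SORTED_BAI"
```

Read 2 files follow the same pattern for the two trimming rounds; from the repeat-mapping step onward, all names are built on the read 1 path.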
    
  14. 14

    Takes output from STAR genome mapping.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Replace with actual paths and filenames
    GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for GRCh38 genome index
    READ1_FASTQ="input_R1.fastq.gz"
    READ2_FASTQ="input_R2.fastq.gz" # Omit if single-end
    OUTPUT_PREFIX="mapped_reads"
    THREADS=8 # Number of threads to use
    
    # Run STAR mapping
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # ~30GB RAM for sorting, adjust as needed based on available memory
  15. 15

    Custom random-mer-aware script for PCR duplicate removal.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools=1.1.2
    
    # Example: Deduplicate a BAM file using UMIs embedded in read IDs.
    # This command assumes UMIs have been extracted and appended to read IDs
    # in a previous step (e.g., using 'umi_tools extract') and are separated by an underscore '_'.
    # The 'directional' method is commonly used for eCLIP data to handle PCR duplicates.
    
    umi_tools dedup \
        --input input.sorted.bam \
        --output output.dedup.bam \
        --extract-umi-method=read_id \
        --umi-separator='_' \
        --method=directional \
        --log dedup.log
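
"Random-mer-aware" duplicate removal means that two read pairs count as PCR duplicates only if they share both their mapping position and their random-mer. The toy sketch below illustrates that idea on tab-separated records. The record layout (chromosome, strand, read 1 start, read 2 start, random-mer) and the function name `collapse_duplicates` are assumptions made for illustration; this is not the logic of the pipeline's own script, which operates on BAM files.

```shell
#!/bin/sh
# collapse_duplicates: keep the first record for each combination of
# chromosome, strand, read 1 start, read 2 start and random-mer.
# Reads tab-separated toy records on stdin. Hypothetical helper.
collapse_duplicates() {
    awk -F '\t' '{ key = $1 FS $2 FS $3 FS $4 FS $5 } !seen[key]++'
}

# Four toy read pairs: the second is an exact duplicate of the first, the third
# shares the position but has a different random-mer, the fourth starts 1 nt later.
TOY_READS=$(printf '%s\t%s\t%s\t%s\t%s\n' \
    chr1 + 100 180 ACGTACGTAC \
    chr1 + 100 180 ACGTACGTAC \
    chr1 + 100 180 TTGCATGCAA \
    chr1 + 101 180 ACGTACGTAC)

printf '%s\n' "$TOY_READS" | collapse_duplicates
```

Three of the four records survive. A position-only approach would also have discarded the third record, even though its distinct random-mer indicates an independent molecule.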
  16. 16

    Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

    eCLIP pipeline (Yeo Lab) (Inferred with models/gemini-2.5-flash) vv1.0.1 GitHub
    $ Bash example
    # Clone the eCLIP pipeline repository
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Create and activate the conda environment (if using conda) from the provided environment.yml
    # conda env create -f environment.yml
    # conda activate eclip
    
    # Set the path to the eCLIP scripts directory
    # Adjust this path to where you cloned the 'eclip' repository
    ECLIP_SCRIPTS_DIR="/path/to/cloned/eclip/scripts"
    
    # Define input and output file paths
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"
    
    # Execute the barcode_collapse_pe.py script
    python "${ECLIP_SCRIPTS_DIR}/barcode_collapse_pe.py" \
        --bam "${INPUT_BAM}" \
        --out_file "${OUTPUT_BAM}" \
        --metrics_file "${METRICS_FILE}"
    
  17. 17

    Takes output from barcode collapse PE.

    cutadapt (Inferred with models/gemini-2.5-flash) v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Define input files (expected output from a barcode collapse PE step)
    # These are placeholder names; adjust to actual file names from the previous step.
    INPUT_R1="collapsed_reads_r1.fastq.gz"
    INPUT_R2="collapsed_reads_r2.fastq.gz"
    
    # Define output files for trimmed reads
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    
    # Define adapter sequences commonly used in eCLIP (Illumina TruSeq adapters)
    # These specific adapters are used in the Yeo lab eCLIP pipeline.
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Execute cutadapt for paired-end adapter trimming
    # -a: 3' adapter for R1
    # -A: 3' adapter for R2
    # -o: Output file for R1
    # -p: Output file for R2
    # -m 18: Discard reads shorter than 18 bp after trimming (as used in eCLIP pipeline)
    cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \
             -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
             -m 18 \
             "${INPUT_R1}" "${INPUT_R2}"
  18. 18

    Sorts resulting bam file for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Define input and output file names
    INPUT_BAM="input.bam"
    OUTPUT_SORTED_BAM="${INPUT_BAM%.bam}.sorted.bam"
    
    # Sort the BAM file by coordinate
    # -o: output file
    # -@: number of threads (adjust as needed)
    # -m: memory per thread (adjust as needed, e.g., 2G for 2GB)
    samtools sort -o "${OUTPUT_SORTED_BAM}" -@ 8 -m 2G "${INPUT_BAM}"
  19. 19

    Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

    Picard vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install Picard (example using conda)
    # conda install -c bioconda picard
    
    # Define variables for paths and files
    GATK_QUEUE_JAR="/path/to/gatk/dist/Queue.jar" # Adjust this path to your GATK Queue.jar
    DATA_DIR="/full/path/to/files" # Base directory for input/output files
    INPUT_BAM="$DATA_DIR/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    OUTPUT_BAM="$DATA_DIR/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    TMP_DIR="$DATA_DIR/.queue/tmp"
    
    # Create temporary directory if it doesn't exist
    mkdir -p "$TMP_DIR"
    
    # Execute Picard SortSam via GATK Queue.jar
    java -Xmx2048m \
         -XX:+UseParallelOldGC \
         -XX:ParallelGCThreads=4 \
         -XX:GCTimeLimit=50 \
         -XX:GCHeapFreeLimit=10 \
         -Djava.io.tmpdir="$TMP_DIR" \
         -cp "$GATK_QUEUE_JAR" \
         net.sf.picard.sam.SortSam \
         INPUT="$INPUT_BAM" \
         TMP_DIR="$TMP_DIR" \
         OUTPUT="$OUTPUT_BAM" \
         VALIDATION_STRINGENCY=SILENT \
         SO=coordinate \
         CREATE_INDEX=true
  20. 20

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assuming 'sorted.bam' is the output from sortSam
    samtools index sorted.bam
  21. 21

    Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

    samtools v1.15.1 (Inferred from Skipper workflow) GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define input and output paths
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"
    
    # Execute samtools index command
    samtools index "$INPUT_BAM" "$OUTPUT_BAI"
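When a pipeline is rerun, it can help to skip indexing if an up-to-date index already exists. A minimal sketch of such a check follows; `needs_index` is a hypothetical helper, and the `.bai` suffix is taken from the recorded command.

```shell
# Hypothetical helper: succeed (exit 0) when the BAM has no index, or when the
# BAM is newer than its index, i.e. when `samtools index` should be (re)run.
needs_index() {
    bam="$1"
    bai="${2:-$bam.bai}"
    [ ! -e "$bai" ] || [ "$bam" -nt "$bai" ]
}

# Usage sketch (paths are placeholders):
# if needs_index sample.sorted.bam; then samtools index sample.sorted.bam; fi
```

The comparison relies on file modification times, so it is only a heuristic: copying files without preserving timestamps can defeat it.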
  22. 22

    Takes inputs from multiple final bam files.

    samtools merge (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Example: Merging multiple final BAM files into a single BAM file.
    # This is a common step for combining technical or biological replicates.
    # Replace 'replicate1.bam', 'replicate2.bam', etc., with your actual input BAM file names.
    # Replace 'merged_replicates.bam' with your desired output BAM file name.
    samtools merge -o merged_replicates.bam replicate1.bam replicate2.bam replicate3.bam
  23. 23

    Merges the two technical replicates for further downstream analysis.

    samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Define input and output file names (replace with actual file paths)
    INPUT_REPLICATE_1="replicate1.bam"
    INPUT_REPLICATE_2="replicate2.bam"
    OUTPUT_MERGED_BAM="merged_replicates.bam"
    
    # Merge the two technical replicates into a single BAM file
    samtools merge -o "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE_1}" "${INPUT_REPLICATE_2}"
  24. 24

    Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Define input and output files
    OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
    INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    
    # Merge sorted BAM files
    samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"
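The recorded merge call can be wrapped in a few guards before it runs. The sketch below is an assumption-laden convenience, not part of the original pipeline: `merge_replicates` and the `DRY_RUN` variable are hypothetical names, and the wrapper uses the same positional `samtools merge OUT IN1 IN2` form as the recorded command.

```shell
# Hypothetical wrapper around `samtools merge OUT IN1 IN2 ...`.
# It refuses to clobber an existing output and checks that every input exists.
# Set DRY_RUN=1 to print the command instead of running it.
merge_replicates() {
    out="$1"; shift
    if [ "$#" -lt 2 ]; then
        echo "need at least two input BAMs" >&2; return 1
    fi
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2; return 1
    fi
    for bam in "$@"; do
        [ -r "$bam" ] || { echo "missing input: $bam" >&2; return 1; }
    done
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo samtools merge "$out" "$@"
    else
        samtools merge "$out" "$@"
    fi
}

# Usage sketch (placeholder names):
# DRY_RUN=1 merge_replicates CombinedID.merged.bam rep1.sorted.bam rep2.sorted.bam
```

Running once with `DRY_RUN=1` shows exactly which files would be combined before any output is written.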
  25. 25

    Takes output from sortSam, makes bam index for use downstream.

    samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.19
    
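    # At this point in the pipeline the file being indexed is the merged BAM
    # (see the recorded command in the next step); 'merged.bam' is a placeholder name.
    samtools index merged.bam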
  26. 26

    Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

    samtools v1.x GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Execute samtools index command
    samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
  27. 27

    Takes output from sortSam.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Example: Index a sorted BAM file
    # The source attaches this sentence to the read-2 extraction (samtools view -f 128),
    # whose input is the merged, coordinate-sorted BAM. Indexing is shown here only as an
    # inferred illustration; 'sorted.bam' is a placeholder name.
    samtools index sorted.bam
  28. 28

    Only outputs the second read in each pair for use with single stranded peak caller.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub
    $ Bash example
    # Install samtools (example using conda)
    # conda install -c bioconda samtools=1.10
    
    # Example: Keep only the second read from each pair in an aligned BAM file
    # -f 0x80 (decimal 128) selects alignments with the 'second in pair' flag set;
    # -h keeps the header and -b writes BAM, so the alignments are preserved for peak calling.
    # Input: aligned.bam (BAM file containing paired-end reads)
    # Output: second_reads.bam (BAM file containing only the second read of each pair)
    samtools view -hb -f 0x80 aligned.bam > second_reads.bam
  29. 29

    This is the final bam file to perform analysis on.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # In the source, the final BAM is the read-2-only file written by 'samtools view -hb -f 128'
    # (see the recorded command in the next step). The sort and index below are an inferred,
    # optional illustration; replace 'input.bam' and 'final.bam' with your own file names.
    samtools sort -o final.bam input.bam
    samtools index final.bam
  30. 30

    Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Define input and output file paths
    INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
    OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"
    
    # Extract reads that are the second in a pair (flag 128) and output as BAM
    samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"
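The `-f 128` filter keeps an alignment only if bit 0x80 ("second in pair") is set in its SAM FLAG field. A small sketch of that bit test, using a hypothetical helper name and four common paired-end FLAG values:

```shell
# Hypothetical helper: succeed when a SAM FLAG value has the 0x80 (128,
# "second in pair") bit set, which is what `samtools view -f 128` requires.
is_read2() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# 99 and 83 are typical read-1 flags; 147 and 163 are typical read-2 flags.
for flag in 99 147 83 163; do
    if is_read2 "$flag"; then
        echo "$flag: read 2 (kept)"
    else
        echo "$flag: not read 2 (dropped)"
    fi
done
```

Only the 0x80 bit matters to this filter; strand and proper-pair bits are ignored by it.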
  31. 31

    Takes results from samtools view.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (example using conda)
    # conda install -c bioconda samtools=1.19
    
    # In the source, the BAM produced by 'samtools view' is passed directly to the peak caller.
    # The coordinate sort below is an inferred illustration, not a recorded pipeline step;
    # 'input.bam' and 'output.sorted.bam' are placeholders.
    samtools sort -o output.sorted.bam input.bam
  32. 32

    Calls peaks on those files.

    clipper (Inferred with models/gemini-2.5-flash), installed from source (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install clipper (if not already available)
    # clipper is a Python script, often run directly or installed via pip.
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # pip install .
    
    # Define input and output files (placeholders)
    # The recorded pipeline passes a single BAM (the read-2-only merged file) to clipper
    IP_BAM="path/to/your/merged.r2.bam"
    OUTPUT_BED="path/to/your/merged.r2.peaks.bed"
    SPECIES="hg19" # Genome build recorded for this dataset
    
    # Execute clipper to call peaks, using the flags from the recorded command
    # Ensure the 'clipper' executable is in your PATH
    clipper -b "${IP_BAM}" \
            -s "${SPECIES}" \
            -o "${OUTPUT_BED}" \
            --bonferroni \
            --superlocal \
            --threshold-method binomial \
            --save-pickle
  33. 33

    Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

    CLIPper (version unknown) GitHub
    $ Bash example
    # Installation instructions for CLIPper.
    # It is recommended to use a virtual environment (e.g., conda or venv).
    #
    # Example using conda (the Python version is an assumption; check the repository's
    # requirements), installing from source:
    # conda create -n clipper_env python=3.7
    # conda activate clipper_env
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # pip install .
    #
    # Reference genome: hg19. Ensure the necessary genome files (e.g., FASTA, gene annotations)
    # for hg19 are configured or available in the environment where CLIPper is run.
    
    clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
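After peak calling, a quick sanity check is to count peaks per strand in the output BED. The sketch below assumes the file is tab-separated with the strand in column 6, as in the standard BED layout; `count_peaks_by_strand` is a hypothetical helper name.

```shell
# Hypothetical helper: count peaks per strand in a tab-separated BED file,
# assuming the strand ('+' or '-') is stored in column 6.
count_peaks_by_strand() {
    awk -F'\t' '{ n[$6]++ } END { for (s in n) print s "\t" n[s] }' "$1" | LC_ALL=C sort
}

# Usage sketch (placeholder path):
# count_peaks_by_strand CombinedID.merged.r2.peaks.bed
```

An empty result or a count on only one strand is worth investigating before moving on to downstream analysis.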

Tools Used

Raw Source Text
Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5’ and 3’ adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3’ adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bigWig, bigBed, bed (col1: chrom, col2: chromStart, col3: chromEnd, col4: -log10 pvalue, col5: log2 fold enrichment above input, col6: strand) format, contains clusters of predicted RBP binding
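
Given the column layout described above (col4: -log10 p-value, col5: log2 fold enrichment above input), the supplementary BED files can be filtered with a short awk call. This is a sketch only: `filter_clusters` is a hypothetical helper, the file is assumed to be tab-separated, and the default cutoffs of 3 are illustrative assumptions rather than values stated in this record.

```shell
# Hypothetical filter: keep clusters whose -log10(p-value) (col 4) and log2 fold
# enrichment over input (col 5) both meet a minimum. Defaults are illustrative.
filter_clusters() {
    bed="$1"
    min_logp="${2:-3}"
    min_l2fc="${3:-3}"
    awk -F'\t' -v p="$min_logp" -v f="$min_l2fc" \
        '($4 + 0) >= (p + 0) && ($5 + 0) >= (f + 0)' "$bed"
}

# Usage sketch (placeholder path and cutoffs):
# filter_clusters clusters.bed 3 3 > clusters.filtered.bed
```

Passing the cutoffs as arguments keeps the thresholds visible at the call site instead of buried in the awk expression.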