GSE107768 Processing Pipeline

GSE code_examples 33 steps

Publication

A protein-RNA interaction atlas of the ribosome biogenesis factor AATF.

Scientific reports (2019) — PMID 31363146

Dataset

GSE107768

Best practices for eCLIP experiments and analysis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: eCLIP-seq

    eCLIP vlatest (as of 2020) GitHub
    $ Bash example
    # Install cwltool and other dependencies
    # pip install cwltool
    # conda install -c bioconda star clipper # For underlying tools if running manually
    # pip install git+https://github.com/yeolab/merge_peaks.git # For underlying tools if running manually
    
    # Define input files and parameters for the eCLIP CWL workflow
    # Replace with actual paths and values
    ECLIP_WORKFLOW_DIR="path/to/yeolab/eclip/workflow" # Clone https://github.com/yeolab/eclip.git
    ECLIP_CWL="${ECLIP_WORKFLOW_DIR}/eclip.cwl"
    INPUT_FASTQ_R1="path/to/sample_R1.fastq.gz"
    INPUT_FASTQ_R2="path/to/sample_R2.fastq.gz" # Optional, if paired-end
    CONTROL_FASTQ_R1="path/to/control_R1.fastq.gz"
    CONTROL_FASTQ_R2="path/to/control_R2.fastq.gz" # Optional, if paired-end
    GENOME_FASTA="path/to/genome.fa" # e.g., hg38
    GENOME_GTF="path/to/annotations.gtf" # e.g., GENCODE v38
    STAR_INDEX="path/to/star_index" # Pre-built STAR index for the genome
    OUTPUT_DIR="eclip_analysis_output"
    SAMPLE_ID="my_eclip_sample"
    
    # Create a CWL input YAML file
    # This is a simplified example; the actual workflow might require more inputs.
    # Refer to the eclip.cwl and eclip-inputs.yaml in the yeolab/eclip repository.
    cat << EOF > eclip_inputs.yaml
    fastq_r1:
      class: File
      path: ${INPUT_FASTQ_R1}
    # fastq_r2: # Uncomment if paired-end
    #   class: File
    #   path: ${INPUT_FASTQ_R2}
    control_fastq_r1:
      class: File
      path: ${CONTROL_FASTQ_R1}
    # control_fastq_r2: # Uncomment if paired-end
    #   class: File
    #   path: ${CONTROL_FASTQ_R2}
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_gtf:
      class: File
      path: ${GENOME_GTF}
    star_index_dir:
      class: Directory
      path: ${STAR_INDEX}
    output_directory: ${OUTPUT_DIR}
    sample_id: ${SAMPLE_ID}
    # Add other parameters as required by the eclip.cwl workflow,
    # such as adapter sequences, minimum read length, etc.
    EOF
    
    # Execute the eCLIP CWL workflow
    # This workflow internally uses tools like STAR for alignment, CLIPper for peak calling,
    # and merge_peaks for IDR.
    cwltool --outdir "${OUTPUT_DIR}" "${ECLIP_CWL}" eclip_inputs.yaml
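    Before writing eclip_inputs.yaml, it can help to confirm that every placeholder path has been replaced with something that exists. A minimal sketch, assuming the variable names used above; the helper name missing_inputs is hypothetical and is not part of the yeolab/eclip workflow:

    ```shell
    # Print each path that does not exist (file or directory).
    # Returns non-zero if at least one path is missing.
    missing_inputs() {
        rc=0
        for p in "$@"; do
            if [ ! -e "$p" ]; then
                printf 'missing: %s\n' "$p"
                rc=1
            fi
        done
        return "$rc"
    }

    # Example call (commented out so the sketch has no side effects):
    # missing_inputs "${INPUT_FASTQ_R1}" "${CONTROL_FASTQ_R1}" "${GENOME_FASTA}" "${STAR_INDEX}" || exit 1
    ```

    Running the check before cwltool surfaces a forgotten placeholder immediately instead of partway through the workflow.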
  2. 2

    Takes output from raw files.

    Tool and version not specified (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # This step describes the input to the pipeline.
    # No specific tool or command can be inferred from "Takes output from raw files."
    # Assuming raw sequencing data in FASTQ format as input for subsequent steps.
    
    # Define variables for raw input files (example placeholders)
    RAW_FASTQ_R1="sample_R1.fastq.gz"
    RAW_FASTQ_R2="sample_R2.fastq.gz" # For paired-end data
    # Or for single-end data:
    # RAW_FASTQ="sample.fastq.gz"
    
    # Further pipeline steps would then process these files.
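    For paired-end input, the read 2 file name can usually be derived from the read 1 name. A small sketch, assuming the _R1/_R2 naming seen in the recorded commands (for example file_R1.C01.fastq.gz); the function name mate2_for is hypothetical:

    ```shell
    # Derive the read 2 file name from a read 1 file name by replacing the
    # first "_R1" that is followed by "." or "_" with "_R2".
    mate2_for() {
        printf '%s\n' "$1" | sed 's/_R1\([._]\)/_R2\1/'
    }

    mate2_for "sample_R1.fastq.gz"      # prints sample_R2.fastq.gz
    mate2_for "file_R1.C01.fastq.gz"    # prints file_R2.C01.fastq.gz
    ```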
  3. 3

    Run to trim off both 5’ and 3’ adapters on both reads.

    fastp (Inferred with models/gemini-2.5-flash) v0.23.2 GitHub
    $ Bash example
    # Install fastp if not already installed
    # conda install -c bioconda fastp=0.23.2
    
    # Define input and output file names
    READ1_IN="read1.fastq.gz"
    READ2_IN="read2.fastq.gz"
    READ1_OUT="trimmed_read1.fastq.gz"
    READ2_OUT="trimmed_read2.fastq.gz"
    JSON_REPORT="fastp_report.json"
    HTML_REPORT="fastp_report.html"
    THREADS=8
    QUAL_THRESHOLD=15 # Example quality threshold for filtering low quality bases
    MIN_LENGTH=20     # Example minimum read length after trimming
    
    # Run fastp to trim adapters and perform quality filtering.
    # fastp is an inferred stand-in; the recorded pipeline command for this stage (step 4) is not a fastp call.
    # --detect_adapter_for_pe: Automatically detect adapters for paired-end reads
    # --trim_poly_g: Trim polyG tails (common in Illumina NextSeq/NovaSeq)
    # --trim_poly_x: Trim polyX tails (any base)
    # --correction: Enable base correction for overlapping reads
    # --cut_front/--cut_tail: Sliding-window quality cutting from the 5' and 3' ends
    #   (current names for the options formerly called --cut_by_quality5/3)
    # --cut_window_size: Window size for quality cutting
    # --cut_mean_quality: Mean quality requirement in the window
    # --low_complexity_filter: Filter reads with low complexity
    # --complexity_threshold: Threshold for low complexity filtering
    fastp \
        --in1 "${READ1_IN}" \
        --in2 "${READ2_IN}" \
        --out1 "${READ1_OUT}" \
        --out2 "${READ2_OUT}" \
        --json "${JSON_REPORT}" \
        --html "${HTML_REPORT}" \
        --detect_adapter_for_pe \
        --thread "${THREADS}" \
        --qualified_quality_phred "${QUAL_THRESHOLD}" \
        --length_required "${MIN_LENGTH}" \
        --trim_poly_g \
        --trim_poly_x \
        --correction \
        --cut_front \
        --cut_tail \
        --cut_window_size 4 \
        --cut_mean_quality 20 \
        --low_complexity_filter \
        --complexity_threshold 30
  4. 4

    Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

    cutadapt (Inferred with models/gemini-2.5-flash), version not stated GitHub
    $ Bash example
    # Install cutadapt (e.g., via conda)
    # conda install -c bioconda cutadapt
    
    # The recorded command is truncated at its start: the program name and any
    # options before "quality-cutoff 6" are missing. The remaining options
    # (-m, -a, -g, -A, -o, -p) are cutadapt options, and step 7 records the same
    # pattern as "cutadapt ... --quality-cutoff 6", so cutadapt is assumed here.
    # The missing leading options are not reconstructed.
    # "-A CTTGT AGATCGGAAG" is treated as the single adapter CTTGTAGATCGGAAG
    # (as written in step 7), with the stray space removed.
    cutadapt --quality-cutoff 6 \
        -m 18 \
        -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
        -g CTTCCGATCTACAAGTT \
        -g CTTCCGATCTTGGTCCT \
        -A AACTTGTAGATCGGA \
        -A AGGACCAAGATCGGA \
        -A ACTTGTAGATCGGAA \
        -A GGACCAAGATCGGAA \
        -A CTTGTAGATCGGAAG \
        -A GACCAAGATCGGAAG \
        -A TTGTAGATCGGAAGA \
        -A ACCAAGATCGGAAGA \
        -A TGTAGATCGGAAGAG \
        -A CCAAGATCGGAAGAG \
        -A GTAGATCGGAAGAGC \
        -A CAAGATCGGAAGAGC \
        -A TAGATCGGAAGAGCG \
        -A AAGATCGGAAGAGCG \
        -A AGATCGGAAGAGCGT \
        -A GATCGGAAGAGCGTC \
        -A ATCGGAAGAGCGTCG \
        -A TCGGAAGAGCGTCGT \
        -A CGGAAGAGCGTCGTG \
        -A GGAAGAGCGTCGTGT \
        -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
        -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
        /full/path/to/files/file_R1.C01.fastq.gz \
        /full/path/to/files/file_R2.C01.fastq.gz \
        > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
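    The -A values in the recorded command look like 15-nt windows slid one base at a time along a barcode tail joined to the read 2 adapter. That pattern is one reason to read "CTTGT AGATCGGAAG" as a single adapter containing a stray space. The sketch below regenerates one such series; the joined input string is inferred from the -A values rather than taken from documentation, and the function name tile_kmers is hypothetical:

    ```shell
    # Print every k-length window of a sequence, sliding one base at a time.
    tile_kmers() {
        awk -v seq="$1" -v k="$2" 'BEGIN {
            for (i = 1; i + k - 1 <= length(seq); i++) print substr(seq, i, k)
        }'
    }

    # Barcode tail AACTTGT joined to the adapter start AGATCGGAAGAGCGTCGTGT
    # (an inferred input, chosen to reproduce the recorded -A values).
    tile_kmers "AACTTGTAGATCGGAAGAGCGTCGTGT" 15
    ```

    The third window printed is CTTGTAGATCGGAAG, which matches the unbroken form of the adapter given in step 7.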
  5. 5

    Takes output from cutadapt round 1.

    cutadapt v1.18 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda create -n cutadapt_env cutadapt=1.18
    # conda activate cutadapt_env
    
    # Define input and output files
    INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
    OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"
    REPORT_TXT="cutadapt_round2_report.txt"
    
    # Define adapter sequence (placeholder: Illumina TruSeq 3' adapter for read 1).
    # For eCLIP, use the library-specific adapters instead; the recorded round 2
    # command is shown in step 7.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Execute cutadapt for adapter and quality trimming
    # -a: 3' adapter sequence to remove
    # -q: Trim low-quality ends (e.g., 20 for phred score 20)
    # --minimum-length: Discard reads shorter than this length after trimming (e.g., 18 for eCLIP reads)
    # -o: Output file for trimmed reads
    # cutadapt 1.18 has no --json option (it was added in a later release);
    # the text report is printed to standard output, so redirect it to a file.
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -q 20 \
             --minimum-length 18 \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}" \
             > "${REPORT_TXT}"
  6. 6

    Run to trim off the 3’ adapters on read 2, to control for double ligation events.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt (e.g., via conda)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output file paths
    INPUT_R1="read1.fastq.gz"
    INPUT_R2="read2.fastq.gz"
    OUTPUT_R1="trimmed_read1.fastq.gz"
    OUTPUT_R2="trimmed_read2.fastq.gz"
    
    # Placeholder 3' adapter for read 2 (Illumina TruSeq read 2 adapter).
    # The recorded command (step 7) instead passes a series of 15-nt -A adapters
    # to control for double ligation events.
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Run cutadapt to trim 3' adapters from Read 2
    # -A: Specifies a 3' adapter for the second read in a paired-end library.
    # -o: Output file for the first read.
    # -p: Output file for the second read.
    # --minimum-length 18: Discard reads shorter than 18 bp after trimming. This is a common setting for eCLIP to remove very short fragments.
    # --cores 4: Use 4 CPU cores for faster processing (adjust based on available resources).
    cutadapt -A "${ADAPTER_R2}" \
             -o "${OUTPUT_R1}" \
             -p "${OUTPUT_R2}" \
             "${INPUT_R1}" \
             "${INPUT_R2}" \
             --minimum-length 18 \
             --cores 4
  7. 7

    Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

    cutadapt vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install cutadapt (e.g., via conda)
    # conda install -c bioconda cutadapt
    
    cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 \
        -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
        -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
        -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
        -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
        -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
        -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
        -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
        /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
        /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
        > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
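    The recorded commands build every output name by appending a suffix to the read 1 FASTQ name. The sketch below lists those names for one input so that later steps can be wired together consistently; the suffixes are copied from the recorded commands in this pipeline, while the function name eclip_names is hypothetical:

    ```shell
    # Print the output names that the recorded commands derive from one read 1 FASTQ.
    eclip_names() {
        base="$1"
        printf '%s\n' \
            "${base}.adapterTrim.fastq.gz" \
            "${base}.adapterTrim.round2.fastq.gz" \
            "${base}.adapterTrim.round2.rep.bam" \
            "${base}.adapterTrim.round2.rmRep.bam" \
            "${base}.adapterTrim.round2.rmRep.rmDup.bam" \
            "${base}.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    }

    eclip_names "file_R1.C01.fastq.gz"
    ```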
  8. 8

    Takes output from cutadapt round 2.

    cutadapt v3.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    # INPUT_FASTQ should be the output file from cutadapt round 2 (step 7).
    INPUT_FASTQ="input_from_cutadapt_round2.fastq.gz"
    OUTPUT_FASTQ="output_after_cutadapt_round2.fastq.gz"
    LOG_FILE="cutadapt_after_round2.log"
    
    # Define trimming parameters (example values, adjust as needed; the recorded
    # commands use --quality-cutoff 6 and -m 18)
    # For cutadapt version 3.4, --minimum-length, --quality-cutoff, --nextseq-trim are standard.
    MIN_READ_LENGTH=18 # Minimum read length after trimming
    QUALITY_CUTOFF=20  # Quality cutoff for 3' end trimming (e.g., Phred score 20)
    NEXTSEQ_TRIM=20    # Quality cutoff for NextSeq-specific 3' end trimming (e.g., Phred score 20)
    
    # If a 5' adapter needs to be trimmed, define it here.
    # This is often for random Ns (e.g., UMI/barcode) or a specific linker.
    # Example: ADAPTER_5_PRIME="NNNNNNNNNN" # For 10 random Ns at the 5' end
    # Example: ADAPTER_5_PRIME="AAGCAGTGGTATCAACGCAGAGTAC" # A common 5' adapter sequence
    # If no 5' adapter trimming is required, leave ADAPTER_5_PRIME empty.
    ADAPTER_5_PRIME="" # Placeholder: Set to your specific 5' adapter sequence if needed
    
    # Execute an additional cutadapt pass on the round 2 output: quality trimming,
    # minimum length filtering, and optional 5' adapter trimming.
    # In the recorded pipeline the round 2 output goes directly to STAR (step 10),
    # so this pass is optional. This command assumes single-end reads.
    cutadapt \
        --minimum-length "${MIN_READ_LENGTH}" \
        --quality-cutoff "${QUALITY_CUTOFF}" \
        --nextseq-trim="${NEXTSEQ_TRIM}" \
        ${ADAPTER_5_PRIME:+-g "${ADAPTER_5_PRIME}"} \
        -o "${OUTPUT_FASTQ}" \
        "${INPUT_FASTQ}" \
        > "${LOG_FILE}" 2>&1
  9. 9

    Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

    bbduk.sh (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install BBMap suite (contains bbduk.sh)
    # conda install -c bioconda bbmap
    
    # Placeholder for human RepBase and rRNA sequences. These files need to be prepared.
    # A common approach is to combine known human repetitive elements (e.g., from RepBase) and rRNA sequences.
    # Example (hypothetical links, actual RepBase access may require license):
    # wget -O human_repeats.fasta "https://www.girinst.org/repbase/update/human_repeats.fasta"
    # wget -O hg38_rRNA.fasta "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.rRNA.fa.gz"
    # cat human_repeats.fasta hg38_rRNA.fasta > human_repbase_rRNA.fasta
    
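    # Run bbduk.sh to remove reads matching human repetitive elements and rRNA.
    # This is an inferred alternative; the recorded pipeline command (step 10)
    # aligns to a RepBase index with STAR instead.
    # With ref= and out=, bbduk.sh writes the reads that do not share a k-mer with the reference.
    bbduk.sh \
      in=input_reads.fastq.gz \
      out=filtered_reads.fastq.gz \
      ref=human_repbase_rRNA.fasta \
      k=31 \
      hdist=1 \
      stats=bbduk_repeat_rRNA_stats.txt \
      minlen=15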
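    To see how much the repeat filter removes, read counts before and after can be compared. A minimal sketch, assuming gzip-compressed FASTQ with four lines per record; the function names count_reads and percent_removed are hypothetical:

    ```shell
    # Count reads in a gzip-compressed FASTQ file (4 lines per record).
    count_reads() {
        gzip -dc < "$1" | awk 'END { print NR / 4 }'
    }

    # Percentage of reads removed between two read counts, to one decimal place.
    percent_removed() {
        awk -v before="$1" -v after="$2" 'BEGIN {
            if (before == 0) { print "NA"; exit }
            printf "%.1f\n", 100 * (before - after) / before
        }'
    }

    percent_removed 200 150    # prints 25.0
    ```

    In practice the two arguments would come from count_reads on the input and filtered files, for example percent_removed "$(count_reads input_reads.fastq.gz)" "$(count_reads filtered_reads.fastq.gz)".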
  10. 10

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

    STAR vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Define input and output paths
    READ1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    READ2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    
    # Define the genome directory for RepBase
    # This directory should contain a STAR index built from a RepBase FASTA file.
    # Example for building a RepBase index (adjust paths and RepBase FASTA as needed):
    # wget -O RepBase_human.fasta.gz "https://www.girinst.org/server/RepBase/protected/RepBase20.09.fasta.gz" # (Requires login)
    # gunzip RepBase_human.fasta.gz
    # STAR --runMode genomeGenerate --genomeDir /path/to/RepBase_human_database_file --genomeFastaFiles RepBase_human.fasta --genomeSAindexNbases 10 # Adjust index size if RepBase is small
    GENOME_DIR="/path/to/RepBase_human_database_file"
    
    # Execute STAR alignment command
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${READ1}" "${READ2}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMattributes All \
      --readFilesCommand zcat \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd > "${OUTPUT_BAM}"
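    With --outReadsUnmapped Fastx, STAR writes the reads that did not align to the repeat index as Unmapped.out.mate1 and Unmapped.out.mate2, prefixed with the value of --outFileNamePrefix. Those files are the input to the genome alignment in step 13. A small sketch that derives the names from the prefix; the function name unmapped_mate is hypothetical:

    ```shell
    # Build the name of STAR's unmapped-reads file for mate 1 or mate 2.
    unmapped_mate() {
        printf '%sUnmapped.out.mate%s\n' "$1" "$2"
    }

    PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    unmapped_mate "$PREFIX" 1
    unmapped_mate "$PREFIX" 2
    ```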
    
  11. 11

    Takes output from STAR rmRep.

    $ Bash example
    # Install STAR and Samtools
    # conda install -c bioconda star samtools
    
    # Define reference paths (using hg38 as a placeholder)
    GENOME_FASTA="/path/to/references/GRCh38/GRCh38.primary_assembly.genome.fa"
    GTF_FILE="/path/to/references/GRCh38/gencode.v44.annotation.gtf"
    STAR_INDEX_DIR="/path/to/references/GRCh38/STAR_index"
    
    # Create STAR genome index if it doesn't exist (optional, usually done once)
    # STAR --runMode genomeGenerate \
    #      --genomeDir "${STAR_INDEX_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_FILE}" \
    #      --sjdbOverhang 100 \
    #      --runThreadN 8
    
    # Define input and output files
    # "STAR rmRep" is the repeat-removal alignment (step 10). Its unmapped mates,
    # written by --outReadsUnmapped Fastx, are the input here.
    INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
    INPUT_R2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
    OUTPUT_PREFIX="sample_aligned"
    
    # 1. Align reads with STAR (generic parameters; the recorded command is in step 13)
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${INPUT_R1}" "${INPUT_R2}" \
         --runThreadN 8 \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped None \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)
    
    # Output from STAR is ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
    
    # 2. Index the coordinate-sorted BAM file.
    # "rmRep" means repeat-removed, not duplicate-removed. PCR duplicates are
    # handled later by the random-mer-aware barcode collapse (steps 15 and 16).
    samtools index "${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam"
  12. 12

    Maps unique reads to the human genome.

    bwa (Inferred with models/gemini-2.5-flash) v0.7.17 GitHub
    $ Bash example
    # Install BWA and Samtools if not already present
    # conda install -c bioconda bwa samtools
    
    # Define variables
    REFERENCE_GENOME_PREFIX="human_genome_hg38" # Placeholder for indexed human genome reference (e.g., hg38)
    READ1="input_read1.fastq.gz"
    READ2="input_read2.fastq.gz"
    OUTPUT_BAM="mapped_reads.bam"
    NUM_THREADS=8
    READ_GROUP="@RG\tID:sample_id\tSM:sample_name\tPL:ILLUMINA\tLB:library_name"
    
    # Index the reference genome (if not already indexed). This creates .sa, .pac, .bwt, .ann, .amb files.
    # bwa index ${REFERENCE_GENOME_PREFIX}.fa
    
    # Map reads to the human genome using BWA-MEM, then convert SAM to BAM and sort
    # -t: Number of threads
    # -M: Mark shorter split hits as secondary (recommended for Picard compatibility)
    # -R: Read group header (important for downstream processing like GATK)
    bwa mem -t "${NUM_THREADS}" -M -R "${READ_GROUP}" "${REFERENCE_GENOME_PREFIX}.fa" "${READ1}" "${READ2}" | \
    samtools view -b - | \
    samtools sort -o "${OUTPUT_BAM}" -
    
    # Optional: Index the sorted BAM file
    # samtools index ${OUTPUT_BAM}
    
    # Optional: Filter for uniquely mapped reads (e.g., MAPQ >= 20 and primary alignment)
    # samtools view -b -q 20 -F 0x100 ${OUTPUT_BAM} > ${OUTPUT_BAM%.bam}.unique.bam
  13. 13

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

    STAR vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Reference genome directory: /path/to/STAR_database_file
    # Input files: /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1, /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2
    # Output file: /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
    
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir /path/to/STAR_database_file \
      --genomeLoad LoadAndRemove \
      --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
      --outSAMattributes All \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd \
      > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
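    Because --outFilterMultimapNmax 1 keeps only uniquely mapping reads, the unique-mapping rate is a useful sanity check after this step. STAR writes a summary named Log.final.out under the output prefix. The sketch below pulls one value from it, assuming the usual "label | value" layout of that file; the function name uniquely_mapped_pct is hypothetical:

    ```shell
    # Extract the "Uniquely mapped reads %" value from a STAR Log.final.out file.
    uniquely_mapped_pct() {
        awk -F '|' '/Uniquely mapped reads %/ {
            gsub(/[ \t%]/, "", $2)
            print $2
        }' "$1"
    }

    # Example call (commented out because it needs a real STAR log):
    # uniquely_mapped_pct "/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bamLog.final.out"
    ```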
    
  14. 14

    Takes output from STAR genome mapping.

    $ Bash example
    # Installation (commented out)
    # conda install -c bioconda star=2.5.2b
    
    # Placeholder for genome index directory
    GENOME_DIR="/path/to/STAR_index/GRCh38"
    # Placeholder for input FASTQ file (single-end as commonly used in eCLIP)
    READ_FILE="sample.fastq.gz"
    # Placeholder for output prefix
    OUTPUT_PREFIX="sample_aligned_"
    # Number of threads
    THREADS=8
    
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ_FILE}" \
         --readFilesCommand zcat \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM 30000000000 # 30GB
  15. 15

    Custom random-mer-aware script for PCR duplicate removal.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools
    # conda install -c bioconda umi-tools=1.1.2
    
    # Example: Deduplicate a BAM file using umi_tools, assuming UMIs are in read names.
    # This is an inferred stand-in for the custom script, which the pipeline
    # records as barcode_collapse_pe.py (step 16).
    # umi_tools dedup expects a coordinate-sorted, indexed BAM file and, by default,
    # the UMI appended to the read name after an underscore.
    # --output-stats takes a file name prefix.
    
    umi_tools dedup \
        --stdin input.bam \
        --stdout deduplicated.bam \
        --method directional \
        --paired \
        --output-stats deduplication_stats \
        --log deduplication.log
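    The idea behind random-mer-aware duplicate removal is that two reads count as PCR duplicates only when they share both their mapping position and their random-mer. The toy sketch below shows that keying rule on a tab-separated table. It is an illustration only, not the logic of barcode_collapse_pe.py or umi_tools, and the column layout and function name are assumptions:

    ```shell
    # Keep the first record for each (chrom, start, strand, random-mer) key.
    # Input columns (tab-separated): chrom, start, strand, random-mer, read name.
    collapse_by_randomer() {
        awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++'
    }

    printf 'chr1\t100\t+\tACGTA\tread1\nchr1\t100\t+\tACGTA\tread2\nchr1\t100\t+\tTTGCA\tread3\nchr1\t250\t-\tACGTA\tread4\n' |
        collapse_by_randomer
    ```

    Here read2 is dropped as a duplicate of read1, while read3 survives despite sharing the position because its random-mer differs.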
  16. 16

    Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

    barcode_collapse_pe.py v0.1.0 (part of Skipper pipeline) (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Miniconda or Anaconda if not already installed
    # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
    # source $HOME/miniconda/bin/activate
    # conda init bash
    
    # Clone the Skipper repository to get the script and environment file
    # git clone https://github.com/yeolab/skipper.git
    # cd skipper
    
    # Create and activate the conda environment using the provided environment.yml
    # conda env create -f environment.yml
    # conda activate skipper_env
    
    # Navigate to the scripts directory (assuming you are in the 'skipper' directory)
    # cd scripts
    
    # Execute the barcode collapse command
    python barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
  17. 17

    Takes output from barcode collapse PE.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output file names (placeholders)
    # Note: barcode_collapse_pe.py (step 16) writes a BAM file, which the recorded
    # pipeline passes to SortSam (step 19). cutadapt reads FASTQ, so the FASTQ names
    # below are hypothetical and apply only if reads are converted back to FASTQ.
    INPUT_R1="barcode_collapsed_reads_R1.fastq.gz"
    INPUT_R2="barcode_collapsed_reads_R2.fastq.gz"
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    OUTPUT_LOG="cutadapt.log"
    
    # Define Illumina TruSeq 3' adapters for paired-end reads (placeholders).
    # The recorded eCLIP commands (steps 4 and 7) use their own adapter lists.
    ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # TruSeq 3' adapter for read 1
    ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # TruSeq 3' adapter for read 2
    
    # Execute cutadapt for adapter trimming on paired-end reads.
    # -a: 3' adapter for R1
    # -A: 3' adapter for R2
    # -o: output R1 file
    # -p: output R2 file
    # -m 18: Discard reads shorter than 18 bp after trimming.
    # -q 20: Trim low-quality bases from 3' end using a quality cutoff of 20.
    # -e 0.1: Maximum error rate of 10% for adapter matching.
    cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \
             -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
             -m 18 -q 20 -e 0.1 \
             "${INPUT_R1}" "${INPUT_R2}" > "${OUTPUT_LOG}" 2>&1
  18. 18

    Sorts resulting bam file for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Sort the BAM file by coordinate
    # -o: Output file name
    # -@: Number of threads to use (adjust as needed)
    # -m: Maximum memory per thread (e.g., 2G for 2GB, adjust as needed)
    # Replace 'input.bam' with the path to your unsorted BAM file.
    # Replace 'output.bam' with the desired name for your sorted BAM file.
    samtools sort -o output.bam -@ 8 -m 2G input.bam
  19. 19

    Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

    Picard vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install Picard (often bundled with GATK or available standalone)
    # conda install -c bioconda picard
    # If using GATK, it's included:
    # conda install -c bioconda gatk4
    
    # Define variables for paths
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    TMP_DIR="/full/path/to/files/.queue/tmp"
    GATK_DIST_PATH="/path/to/gatk/dist" # Path where Queue.jar is located
    
    # Create temporary directory if it doesn't exist
    mkdir -p "${TMP_DIR}"
    
    # Execute Picard SortSam
    java -Xmx2048m \
         -XX:+UseParallelOldGC \
         -XX:ParallelGCThreads=4 \
         -XX:GCTimeLimit=50 \
         -XX:GCHeapFreeLimit=10 \
         -Djava.io.tmpdir="${TMP_DIR}" \
         -cp "${GATK_DIST_PATH}/Queue.jar" \
         net.sf.picard.sam.SortSam \
         INPUT="${INPUT_BAM}" \
         TMP_DIR="${TMP_DIR}" \
         OUTPUT="${OUTPUT_BAM}" \
         VALIDATION_STRINGENCY=SILENT \
         SO=coordinate \
         CREATE_INDEX=true
  20. 20

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Example: Assuming 'sorted.bam' is the output from sortSam
    # Replace 'sorted.bam' with the actual path to your sorted BAM file
    samtools index sorted.bam
  21. 21

    Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Create a BAM index file
    samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
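    A common pitfall is an index that predates its BAM file, for example after the BAM is regenerated but the .bai is left in place. The sketch below checks for that using file modification times; the function name index_is_fresh is hypothetical, and the default index name assumes the .bam.bai convention used in the command above:

    ```shell
    # Succeed only if the index exists and is newer than its BAM file.
    # Usage: index_is_fresh file.bam [file.bam.bai]
    index_is_fresh() {
        bam="$1"
        bai="${2:-${bam}.bai}"
        [ -f "$bai" ] && [ -n "$(find "$bai" -newer "$bam")" ]
    }

    # Example call (commented out because it needs real files):
    # index_is_fresh sorted.bam || samtools index sorted.bam
    ```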
  22. 22

    Takes inputs from multiple final bam files.

    samtools (Inferred with models/gemini-2.5-flash) v1.19.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.19.2
    
    # This command merges multiple final BAM files into a single output BAM file.
    # This is a common step when combining technical replicates or data from different lanes of the same sample.
    # Replace 'input1.bam', 'input2.bam', 'input3.bam' with the actual paths to your final BAM files.
    # Replace 'merged_output.bam' with the desired name for the merged file.
    samtools merge merged_output.bam input1.bam input2.bam input3.bam
  23. 23

    Merges the two technical replicates for further downstream analysis.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.10
    
    # Merge two technical replicate BAM files
    # Replace replicate1.bam and replicate2.bam with actual input file names
    # Replace merged_replicates.bam with the desired output file name
    samtools merge merged_replicates.bam replicate1.bam replicate2.bam
    
  24. 24

    Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define input and output files
    INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
    
    # Execute samtools merge command
    samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"
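The two replicate inputs above differ only in their well ID (C01 and D08). As a sketch, the input list could be built from those IDs rather than typed out. The helper name `replicate_bam` and the `DATA_DIR` variable are assumptions for illustration; the command is printed rather than executed so the example runs without samtools.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the per-replicate BAM path from a well ID,
# following the file naming used in the recorded merge command.
replicate_bam() {
    local dir="$1" well="$2"
    printf '%s/file_R1.%s.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam\n' "${dir}" "${well}"
}

DATA_DIR="/full/path/to/files"   # placeholder directory from the source text
inputs=()
for well in C01 D08; do
    inputs+=("$(replicate_bam "${DATA_DIR}" "${well}")")
done

# Print the merge command for inspection; remove the printf wrapper to run it.
printf 'samtools merge %s' "${DATA_DIR}/CombinedID.merged.bam"
printf ' %s' "${inputs[@]}"
printf '\n'
```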
  25. 25

    Takes output from sortSam, makes bam index for use downstream.

    samtools index (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
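    # Replace 'sorted.bam' with your coordinate-sorted BAM file. At this point in the
    # pipeline that is the merged replicate BAM (see the command in step 26).
    samtools index sorted.bam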
  26. 26

    Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

    samtools (version not specified) GitHub
    $ Bash example
    # Install samtools (e.g., via conda)
    # conda install -c bioconda samtools
    
    # Create a BAM index file
    samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
  27. 27

    Takes output from sortSam.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Input BAM file (output from sortSam)
    INPUT_BAM="sorted.bam"
    
    # Index the sorted BAM file to allow for fast random access
    samtools index "${INPUT_BAM}"
  28. 28

    Only outputs the second read in each pair for use with single stranded peak caller.

    samtools (inferred from the samtools view command recorded in step 30) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # The recorded pipeline command for this step is `samtools view -hb -f 128`,
    # which operates on the aligned BAM rather than on FASTQ files.
    # -f 128 keeps only alignments flagged as second in pair (read 2);
    # -h includes the header and -b writes BAM output.
    # Replace 'merged.bam' and 'merged.r2.bam' with your actual file names.
    samtools view -hb -f 128 merged.bam > merged.r2.bam
  29. 29

    This is the final bam file to perform analysis on.

    N/A (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # No command applies here: the description names the resulting file
    # (the final BAM used for analysis), not a processing step.
  30. 30

    Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.10
    
    samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
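The `-f 128` filter keeps alignments whose SAM FLAG has bit 0x80 ("second in pair") set. The sketch below shows that bit test on a few common paired-end FLAG values; the helper name `is_read2` is hypothetical and the values are only examples.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the SAM FLAG has bit 0x80 (128,
# "second in pair") set, which is the bit that `samtools view -f 128` requires.
is_read2() {
    local flag="$1"
    (( (flag & 128) != 0 ))
}

# Common FLAG values for properly paired reads
for flag in 83 99 147 163; do
    if is_read2 "${flag}"; then
        echo "${flag}: read 2 (kept by -f 128)"
    else
        echo "${flag}: read 1 (dropped by -f 128)"
    fi
done
```

FLAGs 83 and 99 carry bit 0x40 (first in pair) and are dropped, while 147 and 163 carry bit 0x80 and are kept.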
  31. 31

    Takes results from samtools view.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # The description "Takes results from samtools view" refers to the input of the
    # next step: the read-2-only BAM written by the samtools view command in step 30.
    # No new file is produced here. As a generic illustration of samtools view:
    # -b: Output BAM format
    # -S: Input is SAM (ignored by samtools 1.x, which auto-detects the format)
    # Replace input.sam and output.bam with your actual file names.
    samtools view -bS input.sam > output.bam
    
    # Another common use: keep only mapped reads from a BAM file
    # samtools view -b -F 4 input.bam > mapped_reads.bam
    # -F 4: Exclude reads whose FLAG marks them as unmapped (0x4)
  32. 32

    Calls peaks on those files.

    clipper (Inferred with models/gemini-2.5-flash) v1.0.1 GitHub
    $ Bash example
    # Installation (example)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # Follow the repository README to create the environment and install the package.
    
    # Define input and output files (placeholders)
    # Replace 'ip_sample.r2.bam' with your read-2-only IP BAM file.
    # Ensure the BAM file is coordinate-sorted and indexed (.bai file exists).
    IP_BAM="ip_sample.r2.bam"
    OUTPUT_BED="clipper_peaks.bed"
    
    # Species / genome build passed to clipper; the recorded pipeline command uses hg19.
    SPECIES="hg19"
    
    # Execute clipper for peak calling.
    # The flags below are the ones recorded for this dataset (see step 33);
    # the recorded command takes a single BAM and no control-BAM argument.
    clipper \
        -b "${IP_BAM}" \
        -s "${SPECIES}" \
        -o "${OUTPUT_BED}" \
        --bonferroni \
        --superlocal \
        --threshold-method binomial \
        --save-pickle
    
    # Output: clipper_peaks.bed (BED file of called peaks)
  33. 33

    Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

    CLIPper (version not specified) GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # It's recommended to install CLIPper in a dedicated conda environment.
    # git clone https://github.com/yeolab/clipper.git
    # Then follow the repository README for the supported Python version and install steps.
    
    # Define input and output paths
    INPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"
    OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed"
    GENOME_ASSEMBLY="hg19" # Reference genome assembly
    
    # Run CLIPper peak calling
    clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}" \
            --bonferroni --superlocal --threshold-method binomial --save-pickle
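Once peaks are called, a quick sanity check is to count clusters per strand. The sketch below assumes the output is a tab-separated BED file with the strand in column 6; the helper name `count_peaks_by_strand` and the three toy records are illustrative and are not results from this dataset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts BED records per strand (column 6),
# skipping track/browser/comment lines.
count_peaks_by_strand() {
    awk -F'\t' '
        NF >= 6 && $1 !~ /^(track|browser|#)/ { n[$6]++ }
        END { printf "plus=%d minus=%d\n", n["+"], n["-"] }
    ' "$1"
}

# Intended use:
#   count_peaks_by_strand "${OUTPUT_BED}"

# Self-contained demonstration on toy records (not real peaks)
demo_bed="$(mktemp)"
printf 'chr1\t100\t150\tpeak_a\t0\t+\n'  > "${demo_bed}"
printf 'chr1\t400\t460\tpeak_b\t0\t-\n' >> "${demo_bed}"
printf 'chr2\t900\t980\tpeak_c\t0\t+\n' >> "${demo_bed}"
count_peaks_by_strand "${demo_bed}"
rm -f "${demo_bed}"
```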

Raw Source Text
Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5’ and 3’ adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT  AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3’ adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding