GSE107766 Processing Pipeline

OTHER code_examples 33 steps

Publication

A protein-RNA interaction atlas of the ribosome biogenesis factor AATF.

Scientific reports (2019) — PMID 31363146

Dataset

GSE107766

Best practices for eCLIP experiments and analysis [uncertain quality]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: eCLIP-seq

    eCLIP (based on the yeolab/eclip workflow, last updated 2020)
    $ Bash example
    # --- Installation (commented out) ---
    # conda create -n eclip_env python=3.8 r-base=4.0 star cutadapt picard-tools samtools bedtools
    # conda activate eclip_env
    # pip install clipper
    # git clone https://github.com/yeolab/merge_peaks.git
    # export PATH=$PATH:$(pwd)/merge_peaks # Add merge_peaks to PATH
    
    # --- Configuration ---
    GENOME_DIR="/path/to/hg38_star_index" # Placeholder: Pre-built STAR index for hg38
    GENOME_FASTA="/path/to/hg38.fa" # Placeholder: hg38 reference genome FASTA
    GTF_FILE="/path/to/hg38.gtf" # Placeholder: hg38 gene annotation GTF
    GENOME_SIZE_FILE="/path/to/hg38.chrom.sizes" # Placeholder: hg38 chromosome sizes (e.g., from UCSC goldenPath)
    ADAPTERS="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Common Illumina adapters, adjust if known
    OUTPUT_DIR="eclip_output"
    mkdir -p "$OUTPUT_DIR"
    
    # --- Input Files (Placeholders) ---
    R1="sample_R1.fastq.gz"
    R2="sample_R2.fastq.gz"
    CONTROL_R1="input_R1.fastq.gz" # Assuming an input control for peak calling
    CONTROL_R2="input_R2.fastq.gz"
    
    # --- Step 1: Adapter Trimming (using cutadapt) ---
    echo "Step 1: Adapter Trimming with Cutadapt"
    cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
             -o "$OUTPUT_DIR/trimmed_R1.fastq.gz" \
             -p "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
             "$R1" "$R2" > "$OUTPUT_DIR/cutadapt_report.txt"
    
    cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
             -o "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" \
             -p "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
             "$CONTROL_R1" "$CONTROL_R2" > "$OUTPUT_DIR/control_cutadapt_report.txt"
    
    # --- Step 2: Alignment (using STAR) ---
    echo "Step 2: Alignment with STAR"
    STAR --genomeDir "$GENOME_DIR" \
         --readFilesIn "$OUTPUT_DIR/trimmed_R1.fastq.gz" "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
         --readFilesCommand zcat \
         --outFileNamePrefix "$OUTPUT_DIR/sample_" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.6 \
         --outFilterMatchNminOverLread 0.6 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --runThreadN 8
    
    STAR --genomeDir "$GENOME_DIR" \
         --readFilesIn "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
         --readFilesCommand zcat \
         --outFileNamePrefix "$OUTPUT_DIR/control_" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.6 \
         --outFilterMatchNminOverLread 0.6 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --runThreadN 8
    
    # Index BAM files
    samtools index "$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam"
    samtools index "$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam"
    
    # --- Step 3: Deduplication (using Picard MarkDuplicates) ---
    echo "Step 3: Deduplication with Picard MarkDuplicates"
    java -jar /path/to/picard.jar MarkDuplicates \
             I="$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam" \
             O="$OUTPUT_DIR/sample_dedup.bam" \
             M="$OUTPUT_DIR/sample_dedup_metrics.txt" \
             REMOVE_DUPLICATES=true
    
    java -jar /path/to/picard.jar MarkDuplicates \
             I="$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam" \
             O="$OUTPUT_DIR/control_dedup.bam" \
             M="$OUTPUT_DIR/control_dedup_metrics.txt" \
             REMOVE_DUPLICATES=true
    
    samtools index "$OUTPUT_DIR/sample_dedup.bam"
    samtools index "$OUTPUT_DIR/control_dedup.bam"
    
    # --- Step 4: Peak Calling (using CLIPPER) ---
    echo "Step 4: Peak Calling with CLIPPER"
    # CLIPPER reads the aligned, deduplicated BAM directly, so no BED conversion is needed.
    # It calls peaks on the IP sample alone; comparison against the input control
    # is a separate downstream step.
    # The species label must match an annotation available to your CLIPPER install;
    # "hg38" is a placeholder here.
    clipper -b "$OUTPUT_DIR/sample_dedup.bam" \
            -s hg38 \
            -o "$OUTPUT_DIR/sample_peaks.bed"
    
    # --- Step 5: IDR (using merge_peaks) ---
    echo "Step 5: IDR with merge_peaks"
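    # IDR compares peak rankings between replicates, so at least two replicates are required.
    # The single peak file used here is only a placeholder to show the directory layout.
    # In a real analysis, run CLIPPER on each replicate first and then merge the peak sets.
    # The merge_peaks.py call and its flags below are illustrative; check them against
    # the merge_peaks repository documentation before use.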
    
    mkdir -p "$OUTPUT_DIR/clipper_peaks"
    cp "$OUTPUT_DIR/sample_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep1_peaks.bed"
    # If you had a second replicate, you would copy it here:
    # cp "$OUTPUT_DIR/sample_rep2_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep2_peaks.bed"
    
    python merge_peaks.py -i "$OUTPUT_DIR/clipper_peaks" \
                          -o "$OUTPUT_DIR/idr_output" \
                          -idr 0.05 \
                          -s "$GENOME_SIZE_FILE" \
                          --prefix "eCLIP_IDR"
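Before running a multi-tool script like the one above, it can help to confirm that each external program is on PATH. The following is a minimal sketch and not part of the original pipeline; the helper name `missing_tools` and the tool list are assumptions made for illustration.

```bash
# Print each requested tool that is NOT found on PATH, one per line.
# `missing_tools` is a hypothetical helper, not part of any eCLIP package.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Core tools invoked by the example above (java stands in for Picard).
REQUIRED="cutadapt STAR samtools java clipper"

# Word splitting of $REQUIRED is intended here.
MISSING=$(missing_tools $REQUIRED)
if [ -n "$MISSING" ]; then
    echo "Missing tools:" $MISSING
else
    echo "All required tools found."
fi
```

The check only reports what is missing; it does not verify tool versions.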
  2. 2

    Takes output from raw files.

    FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub
    $ Bash example
    # Install FastQC (if not already installed)
    # conda install -c bioconda fastqc
    
    # Example usage: Run FastQC on raw FASTQ files
    # Replace sample_R1.fastq.gz and sample_R2.fastq.gz with your actual raw file names
    # The -o . option specifies the output directory (current directory in this case)
    fastqc -o . sample_R1.fastq.gz sample_R2.fastq.gz
  3. 3

    Run to trim off both 5’ and 3’ adapters on both reads.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files (placeholders)
    READ1_IN="input_R1.fastq.gz"
    READ2_IN="input_R2.fastq.gz"
    READ1_OUT="output_R1.trimmed.fastq.gz"
    READ2_OUT="output_R2.trimmed.fastq.gz"
    LOG_FILE="cutadapt.log"
    
    # Define adapter sequences (example Illumina TruSeq adapters)
    # -a: 3' adapter on the forward read
    # -A: 3' adapter on the reverse read
    ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Define trimming parameters
    MIN_LEN=18       # Minimum length of read to keep
    QUALITY_CUTOFF=20 # Quality cutoff at 3' end
    ERROR_RATE=0.1   # Maximum error rate
    OVERLAP=3        # Minimum overlap between read and adapter
    THREADS=8        # Number of CPU threads
    
    # Run cutadapt to trim adapters from paired-end reads
    cutadapt \
        -a "${ADAPTER_FWD}" \
        -A "${ADAPTER_REV}" \
        -m "${MIN_LEN}" \
        -q "${QUALITY_CUTOFF}" \
        -e "${ERROR_RATE}" \
        --overlap "${OVERLAP}" \
        -j "${THREADS}" \
        -o "${READ1_OUT}" \
        -p "${READ2_OUT}" \
        "${READ1_IN}" "${READ2_IN}" \
        > "${LOG_FILE}" 2>&1
  4. 4

    Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

    cutadapt (Inferred with models/gemini-2.5-flash), version not specified
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    cutadapt \
      --quality-cutoff=6 \
      --minimum-length=18 \
      -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
      -g CTTCCGATCTACAAGTT \
      -g CTTCCGATCTTGGTCCT \
      -A AACTTGTAGATCGGA \
      -A AGGACCAAGATCGGA \
      -A ACTTGTAGATCGGAA \
      -A GGACCAAGATCGGAA \
      -A CTTGTAGATCGGAAG \
      -A GACCAAGATCGGAAG \
      -A TTGTAGATCGGAAGA \
      -A ACCAAGATCGGAAGA \
      -A TGTAGATCGGAAGAG \
      -A CCAAGATCGGAAGAG \
      -A GTAGATCGGAAGAGC \
      -A CAAGATCGGAAGAGC \
      -A TAGATCGGAAGAGCG \
      -A AAGATCGGAAGAGCG \
      -A AGATCGGAAGAGCGT \
      -A GATCGGAAGAGCGTC \
      -A ATCGGAAGAGCGTCG \
      -A TCGGAAGAGCGTCGT \
      -A CGGAAGAGCGTCGTG \
      -A GGAAGAGCGTCGTGT \
      -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
      -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
      /full/path/to/files/file_R1.C01.fastq.gz \
      /full/path/to/files/file_R2.C01.fastq.gz \
      > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
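Read carefully, the twenty `-A` sequences in the command above are the overlapping 15-nt windows of two 27-nt sequences that differ only in their first seven bases. The sketch below regenerates that list; the two full-length sequences are reconstructed from the windows rather than stated in the source, and the helper name `tile_kmers` is hypothetical.

```bash
# Print every overlapping window of length K from a sequence, one per line.
# `tile_kmers` is a hypothetical helper written for this illustration.
tile_kmers() {
    seq_in=$1
    k=$2
    len=${#seq_in}
    start=1
    while [ $((start + k - 1)) -le "$len" ]; do
        printf '%s\n' "$seq_in" | cut -c "${start}-$((start + k - 1))"
        start=$((start + 1))
    done
}

# Assumption: these 27-nt sequences are reconstructed from the -A list above.
ADAPTER_A="AACTTGTAGATCGGAAGAGCGTCGTGT"
ADAPTER_B="AGGACCAAGATCGGAAGAGCGTCGTGT"

# Tile both sequences, drop the windows they share, and emit "-A <15-mer>" arguments.
A_ARGS=$( { tile_kmers "$ADAPTER_A" 15; tile_kmers "$ADAPTER_B" 15; } \
          | awk '!seen[$0]++ { printf "-A %s ", $0 }' )

echo "$A_ARGS"
```

The output lists the same set of 15-mers as the command, though in a different order (all windows of the first sequence, then the unique windows of the second).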
  5. 5

    Takes output from cutadapt round 1.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    # INPUT_FASTQ is the output from a previous cutadapt round (e.g., adapter trimming)
    INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
    OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"
    
    # Execute cutadapt for poly-A trimming, quality filtering, and minimum length filtering.
    # This is a common second trimming step in eCLIP workflows after initial adapter trimming.
    # -a "A{100}": Trims poly-A tails (up to 100 'A's).
    # -q 20: Trims low-quality bases from the 3' end with a quality cutoff of 20.
    # -m 18: Discards reads shorter than 18 bp after trimming.
    cutadapt \
      -a "A{100}" \
      -q 20 \
      -m 18 \
      -o "${OUTPUT_FASTQ}" \
      "${INPUT_FASTQ}"
    
  6. 6

    Run to trim off the 3’ adapters on read 2, to control for double ligation events.

    cutadapt (Inferred with models/gemini-2.5-flash) v3.4 GitHub
    $ Bash example
    # Install cutadapt via conda
    # conda install -c bioconda cutadapt=3.4
    
    # Define input and output files
    INPUT_R2="read2.fastq.gz"
    OUTPUT_R2="read2_trimmed.fastq.gz"
    
    # Define adapter sequence for 3' end of Read 2.
    # This sequence (Illumina Small RNA 3' Adapter or similar) is commonly found
    # at the 3' end of Read 2 in eCLIP and small RNA-seq libraries, often due to
    # double ligation events or as part of the library preparation.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Define trimming parameters
    THREADS=4 # Number of CPU cores to use for parallel processing
    MIN_READ_LENGTH=18 # Discard reads shorter than this length after trimming
    QUALITY_CUTOFF=20 # Trim low-quality bases from the 3' end using a Phred score cutoff
    
    # Run cutadapt to trim the specified 3' adapter from Read 2
    cutadapt \
      -a "${ADAPTER_SEQUENCE}" \
      -o "${OUTPUT_R2}" \
      --cores "${THREADS}" \
      --minimum-length "${MIN_READ_LENGTH}" \
      --quality-cutoff "${QUALITY_CUTOFF}" \
      "${INPUT_R2}"
  7. 7

    Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

    cutadapt (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda install -c bioconda cutadapt
    
    cutadapt \
      -f fastq \
      --match-read-wildcards \
      --times 1 \
      -e 0.1 \
      -O 5 \
      --quality-cutoff 6 \
      -m 18 \
      -A AACTTGTAGATCGGA \
      -A AGGACCAAGATCGGA \
      -A ACTTGTAGATCGGAA \
      -A GGACCAAGATCGGAA \
      -A CTTGTAGATCGGAAG \
      -A GACCAAGATCGGAAG \
      -A TTGTAGATCGGAAGA \
      -A ACCAAGATCGGAAGA \
      -A TGTAGATCGGAAGAG \
      -A CCAAGATCGGAAGAG \
      -A GTAGATCGGAAGAGC \
      -A CAAGATCGGAAGAGC \
      -A TAGATCGGAAGAGCG \
      -A AAGATCGGAAGAGCG \
      -A AGATCGGAAGAGCGT \
      -A GATCGGAAGAGCGTC \
      -A ATCGGAAGAGCGTCG \
      -A TCGGAAGAGCGTCGT \
      -A CGGAAGAGCGTCGTG \
      -A GGAAGAGCGTCGTGT \
      -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
      -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
      /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
      /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
      > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
  8. 8

    Takes output from cutadapt round 2.

    cutadapt v4.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    # INPUT_FASTQ is the output from cutadapt round 1
    INPUT_FASTQ="sample_R1_round1_trimmed.fastq.gz"
    OUTPUT_FASTQ="sample_R1_round2_trimmed.fastq.gz"
    
    # Define trimming parameters for cutadapt round 2
    # These are placeholders. Actual values depend on the specific adapters/barcodes
    # to be removed in this second trimming step for eCLIP.
    # For example, if round 1 removed sequencing adapters, round 2 might remove random barcodes.
    # Replace 'YOUR_3PRIME_ADAPTER_SEQUENCE' and 'YOUR_5PRIME_ADAPTER_SEQUENCE' with actual sequences.
    ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE" # Example: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    ADAPTER_5PRIME="YOUR_5PRIME_ADAPTER_SEQUENCE" # Example: GTTCAGAGTTCTACAGTCCGACGATC
    TRIM_5PRIME_BASES=0 # Number of fixed bases to remove from 5' end
    TRIM_3PRIME_BASES=0 # Number of fixed bases to remove from 3' end (use negative value for cutadapt)
    QUALITY_THRESHOLD=20 # Phred quality score threshold for trimming
    MIN_READ_LENGTH=18 # Minimum read length after trimming
    THREADS=4 # Number of CPU threads to use
    
    # Execute cutadapt round 2
    cutadapt \
        -a "${ADAPTER_3PRIME}" \
        -g "${ADAPTER_5PRIME}" \
        -u "${TRIM_5PRIME_BASES}" \
        -u "-${TRIM_3PRIME_BASES}" \
        -q "${QUALITY_THRESHOLD}" \
        -m "${MIN_READ_LENGTH}" \
        --discard-untrimmed \
        -j "${THREADS}" \
        -o "${OUTPUT_FASTQ}" \
        "${INPUT_FASTQ}"
  9. 9

    Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

    RepeatMasker (Inferred with models/gemini-2.5-flash) v4.1.2-p1 GitHub
    $ Bash example
    # Install RepeatMasker (if not already installed)
    # RepeatMasker requires a repeat library, typically RepBase, which is often installed with it or separately configured. Ensure RepBase is configured for human species.
    # For example, using conda:
    # conda install -c bioconda repeatmasker
    
    # Define input and output files
    # Replace 'human_genome.fasta' with the actual path to your unmasked human reference genome (e.g., GRCh38/hg38).
    GENOME_FASTA="human_genome.fasta"
    OUTPUT_DIR="repeatmasker_output"
    NUM_THREADS=8 # Adjust based on available CPU cores
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run RepeatMasker to identify and mask repetitive elements in the human genome
    # -species human: Uses human-specific repeat libraries from RepBase.
    # -pa ${NUM_THREADS}: Specifies the number of processors to use for parallel execution.
    # -dir ${OUTPUT_DIR}: Sets the directory where all output files will be stored.
    # -xsmall: Soft-masks identified repeats (writes them in lowercase) instead of replacing them with 'N's.
    # -gff: Outputs the repeat annotations in GFF format, in addition to the standard .out and .tbl files.
    RepeatMasker -species human -pa "${NUM_THREADS}" -dir "${OUTPUT_DIR}" -xsmall -gff "${GENOME_FASTA}"
    
    # Expected output files in ${OUTPUT_DIR}:
    # - human_genome.fasta.masked: The masked genome FASTA file with repeats in lowercase (soft-masked).
    # - human_genome.fasta.out: A detailed report of all identified repeats.
    # - human_genome.fasta.tbl: A summary table of repeat classes and families.
    # - human_genome.fasta.gff: Repeat annotations in GFF format (if -gff was used).
  10. 10

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

    $ Bash example
    # Install STAR using conda
    # conda install -c bioconda star
    
    # Define variables for input, output, and reference files
    # The --genomeDir expects a STAR index built from the RepBase human sequences.
    # Replace with the actual path to your STAR index.
    GENOME_DIR="/path/to/RepBase_human_STAR_index"
    READ1_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    READ2_FILE="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    
    # The --outFileNamePrefix is used for other STAR output files (e.g., Log.out, SJ.out.tab).
    # In this command, it's set to the same path as the output BAM, which means other files
    # will be named like '/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamLog.out'.
    # If you prefer these auxiliary files in a separate directory or with a different prefix,
    # adjust this variable accordingly.
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
    
    # Execute the STAR alignment command
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${READ1_FILE}" "${READ2_FILE}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 30 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMattributes All \
      --readFilesCommand zcat \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd > "${OUTPUT_BAM}"
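With `--outReadsUnmapped Fastx`, STAR also writes the reads that did not align to the repeat index as `Unmapped.out.mate1` and `Unmapped.out.mate2`, appended directly to `--outFileNamePrefix` with no separator. Those files are what remains after repeat removal. A small sketch of the resulting names, using a hypothetical helper `star_unmapped_mate`:

```bash
# Build the path STAR gives to unmapped reads when --outReadsUnmapped Fastx is set:
# the prefix is concatenated with "Unmapped.out.mate<N>" and nothing in between.
# `star_unmapped_mate` is a hypothetical helper for illustration only.
star_unmapped_mate() {
    prefix=$1
    mate=$2
    printf '%sUnmapped.out.mate%s\n' "$prefix" "$mate"
}

# Same value as --outFileNamePrefix in the command above.
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

MATE1=$(star_unmapped_mate "$REP_PREFIX" 1)
MATE2=$(star_unmapped_mate "$REP_PREFIX" 2)

echo "Reads left after repeat removal:"
echo "  $MATE1"
echo "  $MATE2"
```

Because the prefix here ends in `.rep.bam`, the unmapped-read files end in `.rep.bamUnmapped.out.mate1` and `.rep.bamUnmapped.out.mate2`.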
    
  11. 11

    Takes output from STAR rmRep.

    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.10
    
    # Placeholder for input BAM from STAR alignment (e.g., from a previous STAR alignment step)
    INPUT_BAM="star_aligned_reads.bam"
    OUTPUT_DEDUP_BAM="star_aligned_reads.dedup.bam"
    
    # Generic position-based duplicate removal with samtools. By default this does not
    # use the eCLIP random-mers, so it only approximates random-mer-aware deduplication.
    # 1. Sort by read name
    samtools sort -n "${INPUT_BAM}" -o "${INPUT_BAM%.bam}.namesort.bam"
    # 2. Fixmate information
    samtools fixmate -m "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam"
    # 3. Sort by coordinate
    samtools sort "${INPUT_BAM%.bam}.fixmate.bam" -o "${INPUT_BAM%.bam}.positionsort.bam"
    # 4. Mark and remove PCR duplicates
    samtools markdup -r "${INPUT_BAM%.bam}.positionsort.bam" "${OUTPUT_DEDUP_BAM}"
    # 5. Index the deduplicated BAM file
    samtools index "${OUTPUT_DEDUP_BAM}"
    
    # Optional: Clean up intermediate files
    # rm "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam" "${INPUT_BAM%.bam}.positionsort.bam"
  12. 12

    Maps unique reads to the human genome.

    BWA (Burrows-Wheeler Aligner) (Inferred with models/gemini-2.5-flash) v0.7.17 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install BWA and Samtools
    # conda install -c bioconda bwa samtools
    
    # Define reference genome and read files
    REFERENCE_GENOME="/path/to/human_genome/GRCh38.fa"
    READS_R1="reads_1.fastq.gz"
    READS_R2="reads_2.fastq.gz"
    OUTPUT_BAM="aligned_reads.sorted.bam"
    
    # Index the reference genome only if the BWA index is missing
    # (this step only needs to be run once per reference genome)
    [ -f "${REFERENCE_GENOME}.bwt" ] || bwa index "${REFERENCE_GENOME}"
    
    # Map reads to the human genome using BWA-MEM
    # -t 8: Use 8 threads for alignment
    # Pipe the SAM output directly to samtools for conversion to BAM, sorting, and indexing
    bwa mem -t 8 "${REFERENCE_GENOME}" "${READS_R1}" "${READS_R2}" | \
      samtools view -bS - | \
      samtools sort -o "${OUTPUT_BAM}" -
    samtools index "${OUTPUT_BAM}"
  13. 13

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

    STAR (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Reference genome directory placeholder. Replace with your actual STAR index path.
    # Example: /path/to/your/STAR_index/GRCh38
    GENOME_DIR="/path/to/STAR_database_file"
    
    # Input read files. These appear to be unmapped reads from a previous BAM file.
    # Replace with your actual input file paths.
    READ_FILE_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
    READ_FILE_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
    
    # Output prefix and final BAM file name.
    # As in the original command, the prefix equals the BAM path, so auxiliary STAR files
    # are named like '...rmRep.bamLog.out'. The alignments themselves go to stdout,
    # which the command redirects to the final BAM file.
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${READ_FILE_MATE1}" "${READ_FILE_MATE2}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMattributes All \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd \
      > "${FINAL_OUTPUT_BAM}"
    
  14. 14

    Takes output from STAR genome mapping.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables (replace with actual paths and names)
    GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder: hg38 genome index
    READ1_FASTQ="input_read1.fastq.gz"
    READ2_FASTQ="input_read2.fastq.gz" # Omit if single-end
    OUTPUT_PREFIX="aligned_reads"
    NUM_THREADS=8 # Adjust as needed
    
    # Run STAR genome mapping for paired-end reads
    STAR \
      --runThreadN ${NUM_THREADS} \
      --genomeDir ${GENOME_DIR} \
      --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
      --outFileNamePrefix ${OUTPUT_PREFIX} \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMattributes Standard \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 20 \
      --outFilterMismatchNmax 999 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --sjdbScore 1 \
      --readFilesCommand zcat
    
    # The output will be ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam
  15. 15

    Custom random-mer-aware script for PCR duplicate removal.

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools
    # conda install -c bioconda umi_tools
    
    # Example usage for PCR duplicate removal using random-mers (UMIs).
    # This command assumes that UMIs have already been extracted from the reads
    # and appended to the read names (e.g., by a preceding `umi_tools extract` step).
    # The UMI is expected to be separated from the original read name by a colon.
    # The 'directional' method is generally recommended for its robustness.
    # Replace 'input.bam' with your aligned BAM file containing UMIs in read names.
    # Replace 'output.dedup.bam' with your desired output deduplicated BAM file name.
    # Replace 'output.dedup.log' with your desired log file name.
    
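    # umi_tools dedup takes the input BAM via -I/--stdin and writes via -S/--stdout;
    # the input must be coordinate-sorted and indexed.
    umi_tools dedup \
        -I input.bam \
        -S output.dedup.bam \
        --umi-separator ":" \
        --method directional \
        --paired \
        --log output.dedup.log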
  16. 16

    Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

    barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash), version not specified
    $ Bash example
    # Install Miniconda if not already installed
    # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
    # export PATH="$HOME/miniconda/bin:$PATH"
    # conda init bash
    # source ~/.bashrc
    
    # Obtain barcode_collapse_pe.py.
    # Assumption: the script is a Yeo lab eCLIP utility distributed with the lab's
    # gscripts code rather than with Skipper; verify its location before use.
    # git clone https://github.com/YeoLab/gscripts.git
    
    # Define variables for input and output files
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"
    
    # Execute the barcode_collapse_pe.py script
    # Assuming barcode_collapse_pe.py is in the current directory; otherwise give its full path
    python barcode_collapse_pe.py \
      --bam "${INPUT_BAM}" \
      --out_file "${OUTPUT_BAM}" \
      --metrics_file "${METRICS_FILE}"
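The idea behind random-mer-aware duplicate removal can be sketched on a toy table: two reads count as PCR duplicates only if they share mapping position, strand, and random-mer. This is a simplified illustration and not the logic of `barcode_collapse_pe.py` itself; the tab-separated layout (random-mer, chromosome, start, strand) and the helper name `collapse_randomers` are assumptions made for the example.

```bash
# Keep the first read seen for each (chromosome, start, strand, random-mer) key.
# Input: tab-separated lines of  randomer <TAB> chrom <TAB> start <TAB> strand
# `collapse_randomers` is a hypothetical helper for illustration only.
collapse_randomers() {
    awk -F '\t' '!seen[$2 FS $3 FS $4 FS $1]++'
}

# Toy alignments: reads 1 and 2 are duplicates (same position and random-mer);
# read 3 shares the position but carries a different random-mer, so it is kept.
TOY_READS=$(printf '%s\t%s\t%s\t%s\n' \
    ACGTA chr1 1000 + \
    ACGTA chr1 1000 + \
    TTGCA chr1 1000 + \
    ACGTA chr1 2000 +)

printf '%s\n' "$TOY_READS" | collapse_randomers
```

A purely position-based method would have collapsed the first three reads into one; keying on the random-mer as well preserves the third read as an independent molecule.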
  17. 17

    Takes output from barcode collapse PE.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star=2.7.9a
    
    # Define variables
    # Placeholder for STAR genome index, typically built from a reference genome like hg38.
    # Example: /path/to/STAR_index/hg38_gencode_v38
    GENOME_DIR="/path/to/STAR_index/hg38" 
    READ1="sample_R1.collapsed.fastq.gz" # Output from barcode collapse PE
    READ2="sample_R2.collapsed.fastq.gz" # Output from barcode collapse PE
    OUTPUT_PREFIX="sample_aligned" # Prefix for output files
    THREADS=8 # Adjust based on available resources
    
    # Run STAR alignment for paired-end reads
    # This command aligns the collapsed paired-end reads to the reference genome.
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ1}" "${READ2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --readFilesCommand zcat \
         --twopassMode Basic \
         --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting, adjust as needed
  18. 18

    Sorts resulting bam file for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.19
    
    # Sort the BAM file by coordinate (default behavior)
    samtools sort -o sorted_output.bam input.bam
  19. 19

    Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

    Picard v2.x.x (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install Picard (e.g., via conda or by downloading the jar)
    # conda install -c bioconda picard
    # Or download the latest release from https://github.com/broadinstitute/picard/releases
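    # Ensure Java is installed. The -XX:+UseParallelOldGC flag used below was removed in JDK 15,
    # so run this command with an older JDK (e.g., Java 8) or drop that flag.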
    
    # Define variables for clarity
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    TMP_DIR="/full/path/to/files/.queue/tmp"
    QUEUE_JAR_PATH="/path/to/gatk/dist/Queue.jar"
    
    # Create temporary directory if it doesn't exist
    mkdir -p "${TMP_DIR}"
    
    # Execute Picard SortSam command
    java -Xmx2048m \
      -XX:+UseParallelOldGC \
      -XX:ParallelGCThreads=4 \
      -XX:GCTimeLimit=50 \
      -XX:GCHeapFreeLimit=10 \
      -Djava.io.tmpdir="${TMP_DIR}" \
      -cp "${QUEUE_JAR_PATH}" \
      net.sf.picard.sam.SortSam \
      INPUT="${INPUT_BAM}" \
      TMP_DIR="${TMP_DIR}" \
      OUTPUT="${OUTPUT_BAM}" \
      VALIDATION_STRINGENCY=SILENT \
      SO=coordinate \
      CREATE_INDEX=true
  20. 20

    Takes output from sortSam, makes bam index for use downstream.

    samtools index (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assuming 'sorted.bam' is the output from sortSam
    # Replace 'sorted.bam' with the actual path to your sorted BAM file
    samtools index sorted.bam
  21. 21

    Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.19
    
    samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
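    Because an index is only valid for the BAM it was built from, it can help to rebuild it whenever the BAM is newer. The sketch below is one possible guard, not part of the original pipeline. The function name needs_index is hypothetical, and it assumes the index sits next to the BAM with a .bai suffix unless a second argument says otherwise.

```shell
# Hypothetical helper: succeed (exit 0) when a BAM index should be (re)built,
# i.e. the .bai file is missing or older than the BAM it describes.
needs_index() {
    bam=$1
    bai=${2:-$bam.bai}
    [ ! -e "$bai" ] && return 0
    # find prints the BAM path only if the BAM is newer than the index
    [ -n "$(find "$bam" -newer "$bai" 2>/dev/null)" ]
}

# Typical use (requires samtools; the path is a placeholder):
# if needs_index sample.sorted.bam; then samtools index sample.sorted.bam; fi
```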
  22. 22

    Takes inputs from multiple final bam files.

    samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.19
    
    # Merge multiple BAM files into a single output BAM file.
    # Replace input1.bam, input2.bam, etc., with your actual BAM file paths.
    # Replace merged_output.bam with your desired output file name.
    samtools merge merged_output.bam input1.bam input2.bam input3.bam
  23. 23

    Merges the two technical replicates for further downstream analysis.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Merge two technical replicate BAM files
    # Replace replicate1.bam and replicate2.bam with actual input files
    # Replace merged_replicates.bam with the desired output file name
    samtools merge merged_replicates.bam replicate1.bam replicate2.bam
  24. 24

    Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Merge multiple sorted BAM files into a single sorted BAM file
    samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
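    The merge command above follows a fixed naming pattern, so it can be rebuilt from a data directory, a combined ID, and the replicate well IDs. The sketch below only prints the command for review. The function names replicate_bam and merge_command are hypothetical, and the sketch assumes paths without spaces.

```shell
# Hypothetical helpers that rebuild the merge command from its parts.
# The file-name pattern is copied from the command above.
replicate_bam() {
    # $1 = data directory, $2 = well ID (for example C01)
    printf '%s/file_R1.%s.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam' "$1" "$2"
}

merge_command() {
    # $1 = data directory, $2 = combined ID, remaining args = well IDs
    dir=$1
    id=$2
    shift 2
    if [ "$#" -lt 2 ]; then
        echo "merge_command: need at least two replicates" >&2
        return 1
    fi
    cmd="samtools merge $dir/$id.merged.bam"
    for well in "$@"; do
        cmd="$cmd $(replicate_bam "$dir" "$well")"
    done
    printf '%s\n' "$cmd"
}

# Print the command for review; run it only after checking the paths.
merge_command /full/path/to/files CombinedID C01 D08
```

    Note that samtools merge refuses to overwrite an existing output file unless -f is given, so a rerun needs either that flag or a cleaned-up output path.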
  25. 25

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.10
    
    # Create a BAM index. In the raw source text this step indexes the merged
    # BAM (CombinedID.merged.bam); 'sorted.bam' below is only a placeholder.
    samtools index sorted.bam
  26. 26

    Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
  27. 27

    Takes output from sortSam.

    samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.19.1
    
    # In the raw source text, "Takes output from sortSam" introduces the read-2
    # extraction (samtools view -hb -f 128) rather than a separate indexing step.
    # The inferred indexing command is kept here as a placeholder only.
    # Replace 'sorted.bam' with the actual name of your sorted BAM file.
    samtools index sorted.bam
  28. 28

    Only outputs the second read in each pair for use with single stranded peak caller.

    samtools view GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # The raw source text performs this step with samtools view, not BBMap.
    # File names below are placeholders.
    INPUT_BAM="merged.bam"
    OUTPUT_R2_BAM="merged.r2.bam"
    
    # Only outputs the second read in each pair for use with single stranded peak caller.
    # -f 128 keeps reads flagged "second in pair", -h keeps the header, -b writes BAM.
    samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_R2_BAM}"
  29. 29

    This is the final bam file to perform analysis on.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Assume 'final.bam' is the input BAM file that has undergone alignment, sorting, and duplicate removal.
    # Index the final BAM file to enable fast random access for downstream analysis tools.
    samtools index final.bam
  30. 30

    Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

    samtools v1.10+ GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
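    The -f 128 filter keeps records whose FLAG field has bit 0x80 set, which the SAM specification defines as the last segment in the template (read 2 of a pair). The sketch below reproduces that bit test in plain shell arithmetic on a few common paired-end FLAG values. The function name is_second_in_pair is hypothetical.

```shell
# Hypothetical helper: succeed when a SAM FLAG value has bit 0x80
# (decimal 128, read 2 of a pair) set. This mirrors the test that
# samtools view -f 128 applies to every record.
is_second_in_pair() {
    [ $(( $1 & 128 )) -ne 0 ]
}

for flag in 83 99 147 163; do
    if is_second_in_pair "$flag"; then
        echo "$flag: read 2 (kept by -f 128)"
    else
        echo "$flag: read 1 (dropped by -f 128)"
    fi
done
```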
  31. 31

    Takes results from samtools view.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define input and output file paths
    INPUT_BAM="input_aligned_reads.bam" # Placeholder, e.g., the read-2 BAM written by the previous samtools view step
    OUTPUT_SAM="output_viewed_reads.sam" # Placeholder for an output SAM file
    
    # The raw source text gives no separate command for this step: "Takes results
    # from samtools view" describes the input to the peak caller.
    # The inferred command below only converts a BAM file to SAM, including the
    # header (-h), which can be useful for text-based inspection.
    samtools view -h "${INPUT_BAM}" > "${OUTPUT_SAM}"
  32. 32

    Calls peaks on those files.

    clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install clipper (if not already installed)
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install # Or ensure clipper.py is executable and in PATH
    
    # Placeholder for input BAM file (aligned reads) and output peak file
    INPUT_BAM="aligned_reads.bam"
    OUTPUT_PEAKS="peaks.bed"
    
    # clipper's -s option takes a species/assembly name, not a genome size.
    # The raw source text reports Genome_build hg19 and runs clipper with -s hg19.
    SPECIES="hg19"
    
    # Execute clipper to call peaks
    # Assuming the clipper executable is on the system PATH
    clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"
  33. 33

    Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

    CLIPper (version not specified) GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # CLIPper is a Python-based tool. Installation typically involves cloning the repository
    # and running the setup script, or ensuring the main script is executable and in your PATH.
    # Example installation (adjust if 'clipper' is not directly in PATH):
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install --user # Install to user's local site-packages
    # # Or, if you just want to run the script directly:
    # # python /path/to/clipper/clipper.py ...
    
    clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
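    The peak file is described as BED format containing clusters of predicted RBP binding, so a quick summary can be computed with awk before any downstream analysis. The sketch below is a minimal example and not part of the original pipeline. The function name peak_summary is hypothetical, and it assumes standard BED columns (1 = chrom, 2 = start, 3 = end, 6 = strand).

```shell
# Hypothetical helper: summarize a BED file of peak calls.
# Assumes standard BED columns: 1 = chrom, 2 = start, 3 = end, 6 = strand.
peak_summary() {
    awk -F'\t' '
        /^(#|track|browser)/ || NF < 3 { next }
        {
            n++
            width += $3 - $2
            if ($6 == "+") plus++
            else if ($6 == "-") minus++
        }
        END {
            printf "peaks=%d plus=%d minus=%d mean_width=%.1f\n",
                   n, plus, minus, (n ? width / n : 0)
        }
    ' "$1"
}

# Typical use (the path is a placeholder):
# peak_summary CombinedID.merged.r2.peaks.bed
```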

Raw Source Text
Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5’ and 3’ adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3’ adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding