GSE135435 Processing Pipeline
Publication
The <i>Thermus thermophilus</i> DEAD-box protein Hera is a general RNA binding protein and plays a key role in tRNA metabolism.RNA (New York, N.Y.) (2020) — PMID 32669294
Dataset
GSE135435The Thermus thermophilus DEAD-box protein Hera is a general RNA chaperone and plays a key role in tRNA metabolism
Processing Steps
Generate Jupyter Notebook-
1
Takes output from raw files.
$ Bash example
# Install FastQC (example using conda) # conda install -c bioconda fastqc=0.11.9 # Define input and output file paths # Assuming 'raw_files' refers to FASTQ sequencing reads. # Replace 'input_raw_reads.fastq.gz' with your actual raw sequencing file. INPUT_FASTQ="input_raw_reads.fastq.gz" OUTPUT_DIR="fastqc_reports" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run FastQC to generate quality control reports from raw sequencing reads fastqc "${INPUT_FASTQ}" -o "${OUTPUT_DIR}" -
2
Run to trim off both 5â and 3â adapters on both reads.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=3.4 # Define input and output file paths READ1_IN="input_read1.fastq.gz" READ2_IN="input_read2.fastq.gz" READ1_OUT="trimmed_read1.fastq.gz" READ2_OUT="trimmed_read2.fastq.gz" # Define adapter sequences # These are common Illumina adapters. Adjust if specific adapters are known for your assay. # For 3' adapters (adapters at the 3' end of the read) ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina universal adapter ADAPTER_3PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina TruSeq adapter, often used for R2 # For 5' adapters (adapters at the 5' end of the read, e.g., primer remnants or adapter dimers) # If specific 5' adapters are known, replace these placeholders. # Using the same sequences as 3' adapters is a common fallback if specific 5' adapters are not provided, # as it helps remove adapter dimers or partial adapters that might appear at the 5' end. ADAPTER_5PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" ADAPTER_5PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Number of threads for parallel processing THREADS=8 # Quality trimming threshold (Phred score) QUALITY_THRESHOLD=20 # Minimum read length after trimming MIN_LENGTH=20 # Execute cutadapt for paired-end trimming cutadapt \ -j "${THREADS}" \ -q "${QUALITY_THRESHOLD}","${QUALITY_THRESHOLD}" \ --minimum-length "${MIN_LENGTH}" \ --pair-filter=any \ -a "${ADAPTER_3PRIME_R1}" \ -A "${ADAPTER_3PRIME_R2}" \ -g "${ADAPTER_5PRIME_R1}" \ -G "${ADAPTER_5PRIME_R2}" \ -o "${READ1_OUT}" \ -p "${READ2_OUT}" \ "${READ1_IN}" \ "${READ2_IN}" -
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics
$ Bash example
# Install Skipper (if not already installed) # It's recommended to install Skipper in a dedicated conda environment. # conda create -n skipper_env python=3.8 # conda activate skipper_env # pip install skipper # Define input/output files and parameters INPUT_R1="/full/path/to/files/file_R1.X1A.fastq.gz" INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz" OUTPUT_R1="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz" OUTPUT_R2="/full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz" METRICS_FILE="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics" QUALITY_THRESHOLD="6" MIN_LENGTH="18" ADAPTER_R1_1="NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" ADAPTER_R1_2="CTTCCGATCTACAAGTT" ADAPTER_R1_3="CTTCCGATCTTGGTCCT" # Adapters for R2 ADAPTER_R2_LIST=( "AACTTGTAGATCGGA" "AGGACCAAGATCGGA" "ACTTGTAGATCGGAA" "GGACCAAGATCGGAA" "CTTGTAGATCGGAAG" "GACCAAGATCGGAAG" "TTGTAGATCGGAAGA" "ACCAAGATCGGAAGA" "TGTAGATCGGAAGAG" "CCAAGATCGGAAGAG" "GTAGATCGGAAGAGC" "CAAGATCGGAAGAGC" "TAGATCGGAAGAGCG" "AAGATCGGAAGAGCG" "AGATCGGAAGAGCGT" "GATCGGAAGAGCGTC" "ATCGGAAGAGCGTCG" "TCGGAAGAGCGTCGT" "CGGAAGAGCGTCGTG" "GGAAGAGCGTCGTGT" ) # Construct the -A arguments for R2 ADAPTER_R2_ARGS="" for adapter in "${ADAPTER_R2_LIST[@]}"; do ADAPTER_R2_ARGS+=" -A ${adapter}" done # Execute the command quality-cutoff "${QUALITY_THRESHOLD}" \ -m "${MIN_LENGTH}" \ -a "${ADAPTER_R1_1}" \ -g "${ADAPTER_R1_2}" \ -g "${ADAPTER_R1_3}" \ ${ADAPTER_R2_ARGS} \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" > "${METRICS_FILE}" -
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (e.g., via conda) # conda create -n cutadapt_env cutadapt=4.0 # conda activate cutadapt_env # Define input and output files # Assuming input is a single-end FASTQ file that has already undergone a first round of trimming INPUT_FASTQ="sample_R1_trimmed_round1.fastq.gz" OUTPUT_FASTQ="sample_R1_trimmed_round2.fastq.gz" REPORT_JSON="sample_R1_trimmed_round2_report.json" # Define adapter for round 2 (e.g., poly-A tail, common in eCLIP after initial adapter trimming) # This is an inferred adapter for a second round of trimming in eCLIP workflows (e.g., Skipper) POLY_A_ADAPTER="A{30}" # Execute cutadapt for the second round of trimming # Parameters: -a for adapter, -m for minimum length, -q for quality trimming, --cores for parallel processing, --json for report cutadapt \ -a "${POLY_A_ADAPTER}" \ -m 18 \ -q 20 \ --cores 8 \ --json "${REPORT_JSON}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
5
Run to trim off the 3â adapters on read 2, to control for double ligation events.
cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=4.0 # Define input and output file paths (placeholders) INPUT_R1="read1.fastq.gz" INPUT_R2="read2.fastq.gz" OUTPUT_R1="trimmed_read1.fastq.gz" OUTPUT_R2="trimmed_read2.fastq.gz" # Define the 3' adapter sequence for eCLIP (common Illumina TruSeq adapter) # Source: https://github.com/yeolab/skipper/blob/master/config/config.yaml ADAPTER_3_PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Run cutadapt to trim 3' adapters from read 2 # -A: 3' adapter for read 2 # -q 20: Trim low-quality bases from 3' end (Phred score < 20) # -m 18: Discard reads shorter than 18 bp after trimming # -o: Output file for read 1 # -p: Output file for read 2 cutadapt -A "${ADAPTER_3_PRIME}" \ -q 20 \ -m 18 \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" -
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (e.g., via conda) # conda install -c bioconda cutadapt=4.1 cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics
-
7
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt # Define input and output files # INPUT_FASTQ represents the output from cutadapt round 2 INPUT_FASTQ="sample_r2.fastq.gz" OUTPUT_FASTQ="sample.r3.fastq.gz" # Define parameters based on the Yeo Lab eCLIP workflow (cutadapt_r3 step) # Adapter sequence: Illumina TruSeq Small RNA 3' adapter ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" MINIMUM_LENGTH=18 ERROR_RATE=0.1 NEXTSEQ_TRIM=20 # Trim low-quality ends from NextSeq data (default in eCLIP workflow) QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming CORES=4 # Number of CPU cores to use (adjust based on available resources) cutadapt \ -a "${ADAPTER_SEQUENCE}" \ --nextseq-trim="${NEXTSEQ_TRIM}" \ -q "${QUALITY_CUTOFF}" \ -m "${MINIMUM_LENGTH}" \ -e "${ERROR_RATE}" \ -j "${CORES}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
8
Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
RepeatMasker (Inferred with models/gemini-2.5-flash) v4.1.5 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install RepeatMasker (example using conda) # conda install -c bioconda repeatmasker # Download human genome assembly (e.g., GRCh38/hg38) as a placeholder if not already available # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # gunzip hg38.fa.gz # Run RepeatMasker to identify and mask repetitive elements in the human genome. # The -species human flag ensures the use of the human-specific RepBase library. # -pa 8: Use 8 parallel processors for faster execution. # -gff: Output results in GFF format, which is useful for downstream analysis. # -dir repeatmasker_output: Specify an output directory for all generated files. # -s: Perform a sensitive search, which is generally recommended for thorough masking. # -xsmall: Mask simple repeats and low complexity regions with 'x' instead of 'N'. # hg38.fa: Placeholder for the input human genome FASTA file. RepeatMasker -species human -pa 8 -gff -dir repeatmasker_output hg38.fa
-
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables for clarity # GENOME_DIR is a placeholder for a STAR index built from RepBase for human GRCh38. # RepBase is a database of repetitive DNA elements. GENOME_DIR="/path/to/STAR_index_RepBase_human_GRCh38" READ1="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz" READ2="/full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz" OUTPUT_BAM="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam" # Note: The --outFileNamePrefix parameter will set the prefix for other STAR output files # (e.g., Log.out, SJ.out.tab). Since --outStd BAM_Unsorted is used, the primary BAM output # is redirected to stdout and then to the OUTPUT_BAM file. OUTPUT_PREFIX="${OUTPUT_BAM}" STAR --runMode alignReads \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --genomeLoad LoadAndRemove \ --readFilesIn "${READ1}" "${READ2}" \ --outSAMunmapped Within \ --outFilterMultimapNmax 10 \ --outFilterMultimapScoreRange 1 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMattributes All \ --readFilesCommand zcat \ --outStd BAM_Unsorted \ --outSAMtype BAM Unsorted \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outFilterScoreMin 10 \ --outSAMattrRGline ID:foo \ --alignEndsType EndToEnd > "${OUTPUT_BAM}" -
10
Takes output from STAR rmRep.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.7.10a # Install samtools (example using conda) # conda install -c bioconda samtools # Define variables INPUT_FASTQ="input.fastq.gz" GENOME_DIR="/path/to/STAR_genome_index/hg38" # Replace with actual path to STAR genome index GTF_FILE="/path/to/gencode.v38.annotation.gtf" # Replace with actual path to GTF annotation file OUTPUT_PREFIX="aligned_reads" THREADS=8 # 1. STAR Alignment # This step aligns reads to the reference genome and generates a sorted BAM file. STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${INPUT_FASTQ}" \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --quantMode GeneCounts \ --sjdbGTFfile "${GTF_FILE}" # 2. Remove PCR duplicates (inferred "rmRep" step using samtools markdup) # This step removes PCR duplicates from the aligned BAM file, which is a common post-alignment step in eCLIP. # First, fixmate information is added, then sorted by query name, then duplicates are marked. # The output is a deduplicated, coordinate-sorted BAM file. samtools fixmate -m "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam" "${OUTPUT_PREFIX}.fixmate.bam" samtools sort -n "${OUTPUT_PREFIX}.fixmate.bam" -o "${OUTPUT_PREFIX}.queryname_sorted.bam" samtools markdup -r -s "${OUTPUT_PREFIX}.queryname_sorted.bam" "${OUTPUT_PREFIX}.dedup.bam" # Index the deduplicated BAM file samtools index "${OUTPUT_PREFIX}.dedup.bam" # Clean up intermediate files (optional) # rm "${OUTPUT_PREFIX}.fixmate.bam" "${OUTPUT_PREFIX}.queryname_sorted.bam" -
11
Maps reads to the HB27 (GCF_000008125.1) genome.
$ Bash example
# Install bwa and samtools if not already present # conda install -c bioconda bwa samtools # Define reference genome and output paths GENOME_ACCESSION="GCF_000008125.1" GENOME_NAME="GRCh37.p13" GENOME_FASTA_GZ="${GENOME_ACCESSION}_${GENOME_NAME}_genomic.fna.gz" GENOME_DIR="ref_genome" INDEX_PREFIX="${GENOME_DIR}/human_hb27_grch37" DECOMPRESSED_FASTA="${INDEX_PREFIX}.fna" # Create directory for reference genome mkdir -p "${GENOME_DIR}" # Download reference genome if not exists if [ ! -f "${GENOME_DIR}/${GENOME_FASTA_GZ}" ]; then echo "Downloading reference genome ${GENOME_FASTA_GZ}..." wget -O "${GENOME_DIR}/${GENOME_FASTA_GZ}" "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/${GENOME_ACCESSION}/${GENOME_ACCESSION}_${GENOME_NAME}/${GENOME_FASTA_GZ}" fi # Decompress the genome for indexing if [ ! -f "${DECOMPRESSED_FASTA}" ]; then echo "Decompressing genome..." gunzip -c "${GENOME_DIR}/${GENOME_FASTA_GZ}" > "${DECOMPRESSED_FASTA}" fi # Build BWA index # Check if index files exist (e.g., .bwt) if [ ! -f "${INDEX_PREFIX}.bwt" ]; then echo "Building BWA index for ${DECOMPRESSED_FASTA}..." bwa index -p "${INDEX_PREFIX}" "${DECOMPRESSED_FASTA}" fi # Define input FASTQ files (replace with actual filenames) # Assuming paired-end reads for a robust example. Adjust if single-end. READ1="input_reads_R1.fastq.gz" READ2="input_reads_R2.fastq.gz" # Define output BAM file SORTED_BAM="aligned_reads.sorted.bam" echo "Mapping reads to ${GENOME_NAME} (${GENOME_ACCESSION}) using BWA MEM..." # Map reads using bwa mem, pipe to samtools for conversion to BAM, sorting, and indexing bwa mem -t 8 "${INDEX_PREFIX}" "${READ1}" "${READ2}" | \ samtools view -Sb - | \ samtools sort -@ 8 -o "${SORTED_BAM}" - samtools index "${SORTED_BAM}" echo "Alignment complete. Sorted BAM: ${SORTED_BAM}" -
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
$ Bash example
# Define variables for paths # Replace with actual paths to your STAR genome index and input/output files/directories GENOME_DIR="/path/to/STAR_genome_indices/GRCh38" # Example: /path/to/STAR_genome_indices/hg38_gencode_v38 READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1" READ_FILE_2="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2" OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # This will be the prefix for STAR's auxiliary output files (e.g., Log.out, SJ.out.tab, etc.) FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.X1AUnmappedAligned.out.bam" # This is the primary alignment BAM output, redirected from stdout # Install STAR (e.g., via conda) # conda install -c bioconda star STAR \ --runMode alignReads \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --genomeLoad LoadAndRemove \ --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \ --outSAMunmapped Within \ --outFilterMultimapNmax 10 \ --outFilterMultimapScoreRange 1 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMattributes All \ --outStd BAM_Unsorted \ --outSAMtype BAM Unsorted \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outFilterScoreMin 10 \ --outSAMattrRGline ID:foo \ --alignEndsType EndToEnd \ > "${FINAL_BAM_OUTPUT}" -
13
Takes output from STAR genome alignments.
$ Bash example
# Install STAR (example using conda) # conda create -n star_env star=2.7.10a -c bioconda -c conda-forge # conda activate star_env # --- Placeholder Variables --- # Path to the STAR genome index directory (e.g., built from hg38/GRCh38) # To build a STAR index, you would typically run: # STAR --runMode genomeGenerate --genomeDir /path/to/star_index/hg38 \ # --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \ # --runThreadN <num_threads> GENOME_DIR="/path/to/star_index/hg38" # Input FASTQ files (gzipped) FASTQ_R1="sample_R1.fastq.gz" FASTQ_R2="sample_R2.fastq.gz" # Use "" if single-end # Prefix for output files OUTPUT_PREFIX="aligned_sample_" # Output directory OUTPUT_DIR="star_output" mkdir -p "${OUTPUT_DIR}" # --- STAR Alignment Command --- STAR \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 20 \ --alignSJDBoverhangMin 8 \ --alignSJoverhangMin 8 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --outFilterScoreMinOverLread 0.0 \ --outFilterMatchNminOverLread 0.0 \ --limitBAMsortRAM 30000000000 \ --readFilesCommand zcat -
14
Run samtools sort.
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools # Sort a BAM file by coordinate. This is the default sorting order for samtools sort. # Replace 'input.bam' with the path to your unsorted BAM file. # Replace 'output.sorted.bam' with the desired name for the sorted output BAM file. samtools sort -o output.sorted.bam input.bam
-
15
Command: samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
$ Bash example
samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
-
16
Takes output from sortSam, makes bam index for use downstream.
samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.19 # Assume the output from sortSam is a sorted BAM file, e.g., 'sample_sorted.bam' # This command creates an index file (e.g., 'sample_sorted.bam.bai') for the BAM file. samtools index sample_sorted.bam
-
17
Command: samtools index /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam
$ Bash example
bash # Install samtools (e.g., using conda) # conda install -c bioconda samtools=1.19 samtools index /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam
-
18
Takes inputs from multiple final bam files.
samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not available # conda install -c bioconda samtools # This step takes multiple final BAM files and merges them into a single BAM file. # This is a common operation when combining technical or biological replicates # before downstream analyses like peak calling or quantification. # Replace 'input_file_1.bam input_file_2.bam input_file_3.bam' with your actual input BAM file paths. # Replace 'merged_output.bam' with your desired output merged BAM file name. # The '-@' flag specifies the number of threads to use. samtools merge -@ 4 merged_output.bam input_file_1.bam input_file_2.bam input_file_3.bam
-
19
Merges the two technical replicates for further downstream analysis.
samtools (Inferred with models/gemini-2.5-flash) v1.x (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Merge technical replicates (e.g., BAM files) # Assuming replicate1.bam and replicate2.bam are the input BAM files # and merged_replicates.bam is the output merged BAM file. samtools merge merged_replicates.bam replicate1.bam replicate2.bam
-
20
Command: samtools merge HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam
$ Bash example
# Install samtools (e.g., using conda) # conda install -c bioconda samtools=1.19 # Define input and output files INPUT_BAM_1="/full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam" INPUT_BAM_2="/full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam" OUTPUT_BAM="HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam" # Execute samtools merge samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}" -
21
Takes output from samtools sort, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.19 # Create a BAM index for the sorted BAM file # input.sorted.bam is the output from samtools sort samtools index input.sorted.bam
-
22
Command: samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
$ Bash example
# Install samtools (e.g., via conda) # conda install -c bioconda samtools=1.9 samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
-
23
Takes output from samtools sort.
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.10 # This step takes a sorted BAM file as input and creates an index (.bai) file. # Replace 'input.sorted.bam' with your actual sorted BAM file name. samtools index input.sorted.bam
-
24
Only outputs the second read in each pair for use with single stranded peak caller.
$ Bash example
# Install bbmap suite (which includes reformat.sh) # conda install -c bioconda bbmap # This command takes paired-end FASTQ files (input_R1.fastq.gz and input_R2.fastq.gz) # and outputs a new FASTQ file (output_R2_only.fastq.gz) containing only the second reads. # 'r1=f' ensures read 1 is not outputted, and 'r2=t' ensures read 2 is outputted. reformat.sh in1=input_R1.fastq.gz in2=input_R2.fastq.gz out=output_R2_only.fastq.gz r1=f r2=t
-
25
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Define placeholder for the input BAM file before final processing # Replace 'input_raw.bam' with the actual path to your unsorted or intermediate BAM file. INPUT_BAM="input_raw.bam" # Define the name for the final sorted and indexed BAM file FINAL_BAM="final.bam" # Sort the BAM file by coordinate # The -o option specifies the output file name samtools sort -o "${FINAL_BAM}" "${INPUT_BAM}" # Index the sorted BAM file # This creates an index file (e.g., final.bam.bai) in the same directory samtools index "${FINAL_BAM}" -
26
Command: samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
$ Bash example
# Install samtools (e.g., via conda) # conda install -c bioconda samtools=1.10 samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
-
27
Takes results from samtools view.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Example: Convert a BAM file to SAM format, including the header, # to "view" the alignment results. This command takes an input BAM file # and outputs its content in SAM format to standard output, redirected to a file. # Replace 'input.bam' with your actual input alignment file. # Replace 'output.sam' with your desired output SAM file name. samtools view -h input.bam > output.sam
-
28
Calls peaks on those files.
clipper (Inferred with models/gemini-2.5-flash) vv1.0.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install clipper (if not already installed) # pip install clipper # Define input files and genome assembly IP_BAM="ip_sample.bam" # Placeholder: Replace with the path to your IP sample BAM file CONTROL_BAM="control_sample.bam" # Placeholder: Replace with the path to your control sample BAM file (e.g., input, IgG) GENOME_ASSEMBLY="hg38" # Placeholder: Use the appropriate genome assembly (e.g., hg19, mm10) OUTPUT_PREFIX="eclip_peaks" # Execute clipper to call peaks # Common parameters: -b (IP BAM), -c (Control BAM), -s (genome assembly), -o (output prefix) # Optional parameters might include -p (p-value threshold, default 0.01), -w (window size, default 10) clipper -b "${IP_BAM}" -c "${CONTROL_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_PREFIX}" -
29
Command: clipper --bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam --species hb27 --outfile /full/path/to/files/CombinedID.merged.r2.peaks.bed --maxgenes 1000000 --save-pickle
CLIPper vNot specified$ Bash example
# Installation instructions (example, adjust path as needed) # git clone https://github.com/yeolab/clipper.git # cd clipper # # Ensure Python environment is set up, e.g., with conda # # conda create -n clipper_env python=3.8 # # conda activate clipper_env # # pip install -r requirements.txt # If a requirements.txt exists # # Add clipper to your PATH or run directly using python # # export PATH=$PATH:$(pwd) # Define input and output files INPUT_BAM="HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam" OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed" SPECIES_ID="hb27" # Custom species identifier, likely referring to a specific human genome build and annotation used by the Yeo lab. # Execute CLIPper peak calling clipper --bam "${INPUT_BAM}" \ --species "${SPECIES_ID}" \ --outfile "${OUTPUT_BED}" \ --maxgenes 1000000 \ --save-pickle
Tools Used
Raw Source Text
Takes output from raw files. Run to trim off both 5â and 3â adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3â adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam Takes output from STAR rmRep. Maps reads to the HB27 (GCF_000008125.1) genome. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam Takes output from STAR genome alignments. Run samtools sort. Command: samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis. Command: samtools merge HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam Takes output from samtools sort, makes bam index for use downstream. Command: samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam Takes output from samtools sort. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on. Command: samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam Takes results from samtools view. Calls peaks on those files. Command: clipper --bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam --species hb27 --outfile /full/path/to/files/CombinedID.merged.r2.peaks.bed --maxgenes 1000000 --save-pickle Genome_build: hb27 Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding