GSE181140 Processing Pipeline
Publication
Identification of the global miR-130a targetome reveals a role for TBL1XR1 in hematopoietic stem cell self-renewal and t(8;21) AML.Cell reports (2022) — PMID 35263585
Dataset
GSE181140Identification of the Global miR-130a Targetome Reveals a Novel Role for TBL1XR1 in Hematopoietic Stem Cell Self-Renewal and t(8;21) AML
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install UMI-tools via conda # conda create -n umi_tools_env python=3.8 # conda activate umi_tools_env # conda install -c bioconda umi_tools=1.0.0 # This command extracts Unique Molecular Identifiers (UMIs) from the start of Read 1 # (assuming a 10bp UMI) and appends them to the read headers of both Read 1 and Read 2. # Replace 'input_R1.fastq.gz', 'input_R2.fastq.gz' with your actual input files # and 'output_R1.fastq.gz', 'output_R2.fastq.gz' with your desired output file names. # Adjust '--bc-pattern' if your UMI length or location is different (e.g., 'NNNNNNNNNNNN' for 12bp, or 'NNNNNNNNNN(BC)' if a cell barcode follows). # If UMIs are in a separate index read, use '--bc-file' instead of including the pattern in --input-fastq. umi_tools extract \ --input-fastq=input_R1.fastq.gz \ --output-fastq=output_R1.fastq.gz \ --read2-in=input_R2.fastq.gz \ --read2-out=output_R2.fastq.gz \ --extract-method=string \ --bc-pattern=NNNNNNNNNN \ --log=umi_tools_extract.log -
2
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
$ Bash example
# Install umi_tools (e.g., using conda) # conda install -c bioconda umi_tools=1.1.2 # Example usage of umi_tools extract with specified arguments # This command extracts Unique Molecular Identifiers (UMIs) from reads # and places them in the read name, based on the provided barcode pattern. # Replace 'input.fastq.gz' with your actual input FASTQ file # and 'output_extracted.fastq.gz' with your desired output file name. umi_tools extract \ --random-seed 1 \ --bc-pattern NNNNNNNNNN \ --input input.fastq.gz \ --output output_extracted.fastq.gz \ --log umi_tools_extract.log -
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.14 # Define input and output files INPUT_READS="input_reads.fastq.gz" TRIMMED_READS="trimmed_reads.fastq.gz" # Define common adapter sequences (replace with actual adapters if known) # For Illumina, common adapters include: # TruSeq Universal Adapter: AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCT # TruSeq Small RNA 3' Adapter: TGGAATTCTCGGGTGCCAAGGAACTCCAG # TruSeq Small RNA 5' Adapter: GTTCAGAGTTCTACAGTCCGACGATC ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Example Illumina adapter # Trim reads using cutadapt # -a: 3' adapter sequence # -q 20: Trim low-quality bases from 5' and 3' ends with quality score below 20 # -m 20: Discard reads shorter than 20 bp after trimming # -o: Output file cutadapt -a "${ADAPTER_SEQUENCE}" -q 20 -m 20 -o "${TRIMMED_READS}" "${INPUT_READS}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# This step appears to be a custom read processing or UMI extraction/deduplication step within the eCLIP pipeline. # The arguments provided do not directly map to a standard bioinformatics tool like umi_tools or cutadapt, # but rather suggest a specialized script or internal tool within the YeoLab eCLIP framework. # The tool name "eCLIP" is provided, implying this is a command within the overall eCLIP context. # Installation (assuming a CWL/Snakemake environment for the eCLIP pipeline): # git clone https://github.com/YeoLab/eclip.git # cd eclip # # Follow instructions to set up CWL environment, e.g., using conda or pip # # conda create -n eclip_env cwltool # # conda activate eclip_env # Download reference sequences (InvRNA.fasta is an example input from the eclip repository) # wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA.fasta -O InvRNA.fasta # Placeholder for input FASTQ file (replace with actual input, e.g., raw reads) INPUT_FASTQ="input_reads.fastq" # Execute the eCLIP read processing step # Note: The command 'eCLIP' here is a placeholder for the specific executable/script # that would accept these arguments within the YeoLab eCLIP pipeline. # The actual command might be a specific CWL tool invocation or a custom Python script. eCLIP --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA.fasta "${INPUT_FASTQ}" > processed_reads.fastq -
5
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt 1.14 # conda install -c bioconda cutadapt=1.14 # Define input and output files INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="trimmed_output.fastq.gz" # Define adapter sequence for double-ligation events (Illumina universal adapter) # This adapter sequence is commonly used to remove adapter contamination, # including instances where adapters ligate to each other or to very short inserts. ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Trim reads with cutadapt to remove the specified adapter sequence # -a: 3' adapter sequence # --minimum-length: Discard reads shorter than this length after trimming (common practice) cutadapt -a "${ADAPTER_SEQUENCE}" \ --minimum-length 18 \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install umi_tools if not available # conda install -c bioconda umi_tools # Download reference genome Ril19.fasta from YeoLab eCLIP example inputs # wget https://github.com/YeoLab/eclip/raw/master/example/inputs/Ril19.fasta # Note: The provided arguments are a mix of parameters typically used by 'umi_tools extract' # (e.g., --match-read-wildcards) and 'umi_tools dedup' (e.g., -e, -m, --quality-cutoff). # The argument '--times 1' is not a standard umi_tools parameter, and '-a Ril19.fasta' # uses a FASTA file (a reference genome) which is not a typical input for umi_tools extract/dedup. # This suggests the command might be for a custom wrapper script within the eCLIP pipeline # that combines UMI processing with other functionalities, or a highly specific, # potentially deprecated, tool. The version 'umi_tools 1.1.2' is inferred from the # YeoLab eCLIP CWL workflow, which uses umi_tools for UMI processing. # For demonstration, we use a placeholder script 'eclip_umi_processor' that accepts all given arguments. # Placeholder input: 'input_reads.bam' (assuming reads are already aligned and UMIs extracted/marked) # Placeholder output: 'deduplicated_reads.bam' eclip_umi_processor --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta -i input_reads.bam -o deduplicated_reads.bam
-
7
For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt # Define input and output file names INPUT_FASTQ="input_reads.fastq.gz" OUTPUT_FASTQ="filtered_reads.fastq.gz" PRIMER_SEQUENCE="CAGUGCAAUGUUAAAAGGGCAU" # Filter out reads that do not contain the primer sequence. # -b PRIMER_SEQUENCE: Search for the primer sequence anywhere in the read. # --discard-untrimmed: Discard reads where the primer sequence was not found. # The output file will contain only reads that had the primer sequence. cutadapt -b "${PRIMER_SEQUENCE}" \ --discard-untrimmed \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
8
For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools.
$ Bash example
# Install UMI-tools (e.g., via conda) # conda install -c bioconda umi-tools # Define input and output file names (placeholders) INPUT_READ1="read1.fastq.gz" INPUT_READ2="read2.fastq.gz" OUTPUT_READ1="trimmed_read1.fastq.gz" OUTPUT_READ2="trimmed_read2.fastq.gz" # Trim 10nt from the 3' end of Read1 by extracting it as a UMI. # The pattern `^(?P<discard_1>.*)(?P<umi_1>.{10})$` captures the last 10 bases of Read1 as the UMI # and moves it to the read header, effectively trimming it from the sequence. umi_tools extract \ --bc-pattern="^(?P<discard_1>.*)(?P<umi_1>.{10})$" \ -I "${INPUT_READ1}" \ --read2-in "${INPUT_READ2}" \ -S "${OUTPUT_READ1}" \ --stdout-read2 "${OUTPUT_READ2}" -
9
Args: -u -9
$ Bash example
# Install zip if not available # conda install -c conda-forge zip # # Or on Debian/Ubuntu: # sudo apt-get install zip # Example usage of zip with -u (update) and -9 (maximum compression). # This command updates an existing archive 'output.zip' with 'input_file.txt' # or creates it if it doesn't exist, using maximum compression. zip -u -9 output.zip input_file.txt
-
10
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Define variables READS="trimmed_filtered_reads.fastq.gz" # Placeholder for trimmed and filtered input reads GENOME_DIR="path/to/repbase_18.05_star_index" # Placeholder for STAR index built from RepBase 18.05 OUTPUT_PREFIX="repbase_mapping_" # Prefix for output files (e.g., repbase_mapping_Aligned.sortedByCoord.out.bam) THREADS=8 # Number of threads to use # Note: The STAR genome index for RepBase 18.05 must be pre-built in ${GENOME_DIR}. # Example command to build the index (uncomment and modify if needed): # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles path/to/RepBase18.05.fasta \ # --runThreadN ${THREADS} \ # --genomeSAindexNbases 10 # Adjust for smaller genomes/repeat databases if default (14) is too large # Map trimmed and filtered reads with STAR against the repeat element database STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 -
11
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Note: 'human_repbase' is assumed to be a pre-built STAR genome index directory. # If not, it needs to be generated first using STAR --runMode genomeGenerate. STAR \ --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
-
12
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19).
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.7.6a # --- Reference Genome Preparation (hg19) --- # STAR requires a pre-built genome index. Below are example steps to download hg19 # genome and GTF annotation (e.g., from UCSC and GENCODE) and build the index. # Create directory for STAR index # mkdir -p star_index/hg19 # Download hg19 genome FASTA (UCSC) # wget -P star_index/hg19 https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip star_index/hg19/hg19.fa.gz # Download hg19 GTF annotation (GENCODE v19) # wget -P star_index/hg19 https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz # gunzip star_index/hg19/gencode.v19.annotation.gtf.gz # Build STAR genome index (adjust --runThreadN and --sjdbOverhang as needed) # STAR --runMode genomeGenerate \ # --genomeDir star_index/hg19 \ # --genomeFastaFiles star_index/hg19/hg19.fa \ # --sjdbGTFfile star_index/hg19/gencode.v19.annotation.gtf \ # --sjdbOverhang 100 \ # --runThreadN 8 # --- STAR Alignment Command --- # Define input and output paths INPUT_FASTQ="unmapped_reads.fastq" # Placeholder for your input FASTQ file GENOME_DIR="star_index/hg19" # Path to your pre-built STAR genome index OUTPUT_PREFIX="mapped_reads" # Prefix for output files (e.g., mapped_reads_Aligned.sortedByCoord.out.bam) # Execute STAR alignment STAR --runThreadN 8 \ # Use 8 threads for alignment (adjust as per available resources) --genomeDir "${GENOME_DIR}" \ # Path to the STAR genome index for hg19 --readFilesIn "${INPUT_FASTQ}" \ # Input FASTQ file containing unmapped reads --outFileNamePrefix "${OUTPUT_PREFIX}_" \ # Prefix for all output files --outSAMtype BAM SortedByCoordinate \ # Output sorted BAM file --outBAMcompression 1 # Compression level for BAM output -
13
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) vLatest stable version (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Placeholder for STAR installation (e.g., conda install -c bioconda star) # Define input and output paths GENOME_DIR="/path/to/star_genome_index" READ_FILE_1="/path/to/your/read1.fastq.gz" OUT_PREFIX="aligned_reads" # Execute STAR alignment STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILE_1}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
14
Aligned reads were sorted with samtools (1.6)
samtools v1.6 -
15
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools if not already installed # conda create -n umi_tools_env umi_tools=1.0.0 -c bioconda -c conda-forge # conda activate umi_tools_env # Placeholder for input and output files # The input file is expected to be a BAM/SAM file with UMIs in the read IDs or as a tag, and sorted. INPUT_BAM="sorted_reads.bam" OUTPUT_BAM="deduplicated_reads.bam" STATS_FILE="deduplication_stats.tsv" # Collapse sorted reads using umi_tools dedup # The 'directional' method is commonly used for UMI-based deduplication to account for sequencing errors. # Assuming UMIs are already present in the read IDs (e.g., from a previous umi_tools extract step). umi_tools dedup \ --input "${INPUT_BAM}" \ --output "${OUTPUT_BAM}" \ --output-stats "${STATS_FILE}" \ --method directional \ --log "umi_tools_dedup.log" -
16
Args: --random-seed 1 --method unique
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools # Execute umi_tools dedup # This command assumes 'input.bam' is the aligned BAM file containing UMIs in read names # and will produce 'output.dedup.bam' as the deduplicated output. umi_tools dedup \ --random-seed 1 \ --method unique \ -I input.bam \ -S output.dedup.bam -
17
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper (assuming Python 3 is available) # git clone https://github.com/yeolab/clipper.git # cd clipper # Placeholder for reference genome (e.g., hg38) # Download hg38 fasta if not available # wget -O hg38.fa.gz http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # gunzip hg38.fa.gz # samtools faidx hg38.fa # Generate chromosome sizes file for hg38 # awk 'BEGIN{FS="\t"}; {print $1 FS $2}' hg38.fa.fai > hg38.chrom.sizes # Input BAM file (replace with actual path) INPUT_BAM="input.bam" # Chromosome sizes file CHROM_SIZES="hg38.chrom.sizes" # Output prefix OUTPUT_PREFIX="output_peaks" # Run CLIPper to identify peak clusters # Using default threshold (0.05) and window size (20) as no specific parameters were provided. python clipper/clipper.py -b "${INPUT_BAM}" -s "${CHROM_SIZES}" -o "${OUTPUT_PREFIX}" --threshold 0.05 --window 20 -
18
Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
$ Bash example
# Install clipper (if not already installed) # clipper is a Python script. It can be installed via pip or run directly from the cloned repository. # pip install clipper # Reference dataset: Human genome assembly hg19. # This typically refers to a pre-indexed genome or a genome FASTA file # that clipper uses internally or expects to be configured in its environment. # Example source for hg19 genome data: UCSC Genome Browser (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/) clipper.py --species hg19 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
-
19
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# The peaksnormalize.pl script is part of the eclip toolkit. # Ensure the eclip repository is cloned and its 'bin' directory is in your PATH. # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:$(pwd)/eclip/bin # Define input files and output prefix # Replace with your actual peak cluster file (e.g., BED or narrowPeak format) PEAK_FILE="your_peak_clusters.bed" # Replace with your actual IP BAM file IP_BAM="your_ip_sample.bam" # Replace with your actual INPUT BAM file INPUT_BAM="your_input_sample.bam" # Define the output prefix for normalized peak files OUTPUT_PREFIX="normalized_peaks" # Execute the peaksnormalize.pl script # Usage: peaksnormalize.pl <peak_file> <IP_bam> <INPUT_bam> <output_prefix> peaksnormalize.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}" -
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install Perl (if not already available) and clone the eCLIP repository # conda create -n eclip_env perl # conda activate eclip_env # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:$(pwd)/eclip/scripts # Define input peak files (replace with actual paths to normalized peak BED files for replicates) INPUT_PEAK_FILE_REP1="sample_rep1_normalized_peaks.bed" INPUT_PEAK_FILE_REP2="sample_rep2_normalized_peaks.bed" # Define output merged peak file OUTPUT_MERGED_PEAKS="sample_merged_peaks.bed" # Execute the script to merge overlapping normalized peak regions # The script is part of the eCLIP pipeline, ensure it's in your PATH or specify its full path. # The -f parameter specifies the minimum fraction of overlap for merging (default is 0.5). # Adjust -f or -d (maximum distance) if specific values were used in the original analysis. compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ -i "${INPUT_PEAK_FILE_REP1}" \ -i "${INPUT_PEAK_FILE_REP2}" \ -o "${OUTPUT_MERGED_PEAKS}" \ -f 0.5 -
21
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed) # conda install -c bioconda idr # Install merge_peaks pipeline (if not already installed) # git clone https://github.com/yeolab/merge_peaks.git # export PATH=$PATH:/path/to/merge_peaks/scripts # Assuming scripts are in a 'scripts' subdirectory # Define input files (placeholders for normalized peak files) INPUT_PEAKS_REP1="replicate1_normalized_peaks.compressed.bed" INPUT_PEAKS_REP2="replicate2_normalized_peaks.compressed.bed" OUTPUT_PREFIX="reproducible_peaks_idr" # Step 1: Rank normalized peak files by entropy score using make_informationcontent_from_peaks.pl # This script, part of the merge_peaks pipeline, prepares inputs for IDR by adding an entropy score. # IDR will then typically use the p-value (or other specified rank) from these processed files. perl make_informationcontent_from_peaks.pl "${INPUT_PEAKS_REP1}" > "${INPUT_PEAKS_REP1%.compressed.bed}_ranked.bed" perl make_informationcontent_from_peaks.pl "${INPUT_PEAKS_REP2}" > "${INPUT_PEAKS_REP2%.compressed.bed}_ranked.bed" RANKED_PEAKS_REP1="${INPUT_PEAKS_REP1%.compressed.bed}_ranked.bed" RANKED_PEAKS_REP2="${INPUT_PEAKS_REP2%.compressed.bed}_ranked.bed" # Step 2: Run IDR (Irreproducible Discovery Rate) to determine reproducible peaks # The --rank p.value parameter is commonly used in the merge_peaks pipeline for IDR. idr --samples "${RANKED_PEAKS_REP1}" "${RANKED_PEAKS_REP2}" \ --output-file "${OUTPUT_PREFIX}" \ --rank p.value \ --plot \ --log-output-file "${OUTPUT_PREFIX}.log" -
22
To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)
seqtk seq, ctk fasta2collapse.pl v1.3-r106, v1.0.8$ Bash example
# Install seqtk # conda install -c bioconda seqtk # Install CTK (CLIP Tool Kit) # git clone https://github.com/yeolab/ctk.git # export PATH=$PATH:/path/to/ctk # Add CTK scripts to PATH # Assuming input file is 'trimmed_reads.fastq.gz' # And output files will be 'trimmed_reads.fasta' and 'collapsed_unique_reads.fasta' # Convert trimmed fastq to fasta using seqtk seq seqtk seq -a trimmed_reads.fastq.gz > trimmed_reads.fasta # Collapse fasta into unique sequences using ctk fasta2collapse.pl ctk fasta2collapse.pl -i trimmed_reads.fasta -o collapsed_unique_reads.fasta
-
23
Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)
$ Bash example
# Install Bowtie2 # conda install -c bioconda bowtie2=2.2.6 # --- Reference Data Preparation --- # Download mature miRNA sequences from miRBase (example for a specific version, adjust as needed) # For instance, from miRBase FTP (replace CURRENT with a specific version like v22): # wget ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz # gunzip mature.fa.gz # mv mature.fa miRBase_mature.fasta # Build Bowtie2 index for mature miRNA sequences # miRBase_mature.fasta: Path to the FASTA file containing mature miRNA sequences from miRBase # miRBase_mature_index: Prefix for the generated Bowtie2 index files bowtie2-build miRBase_mature.fasta miRBase_mature_index # --- Mapping Step --- # Map collapsed sequences (reads) to the mature miRNA index # -x miRBase_mature_index: Specifies the Bowtie2 index to use # -U collapsed_sequences.fastq: Specifies the input file containing unaligned reads (collapsed sequences) # -S output_alignment.sam: Specifies the output SAM file for alignments # The description does not specify additional parameters like --local, -L, -D, -R, or -p (threads). bowtie2 -x miRBase_mature_index -U collapsed_sequences.fastq -S output_alignment.sam
-
24
Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables INPUT_FASTQ="reassigned_reads.fastq" # Placeholder for sequences re-assigned after miR mapping GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for human genome (GRCh38) STAR index OUTPUT_PREFIX="aligned_to_genome_" NUM_THREADS=8 # Adjust as needed # Create STAR genome index if it doesn't exist (example command, run once) # STAR --runMode genomeGenerate --genomeDir ${GENOME_DIR} --genomeFastaFiles /path/to/GRCh38.fa --sjdbGTFfile /path/to/GRCh38.gtf --runThreadN ${NUM_THREADS} # Align re-assigned sequences to the human genome STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --runThreadN ${NUM_THREADS} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000
Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out. For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools. Args: -u -9 Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.6) Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8) Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6) Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described. Genome_build: hg19 Supplementary_files_format_and_content: gk.Au9.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed and gk.Au13.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed contain enriched regions from the miR-130a targeted chimeric eCLIP dataset Supplementary_files_format_and_content: cellsB_rep1.vs.cellsB_rep2.bed contains reproducible total chimeric eCLIP peaks across replicates Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities