GSE180129 Processing Pipeline
Publication
Identification of the global miR-130a targetome reveals a role for TBL1XR1 in hematopoietic stem cell self-renewal and t(8;21) AML.Cell reports (2022) — PMID 35263585
Dataset
GSE180129Global Identification of the miR-130a targetome in human HSPC Reveals a Novel Role for TBL1XR1 in Hematopoietic Stem Cell Self-Renewal and t(8;21) A…
Processing Steps
Generate Jupyter Notebook-
1
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi-tools # Example: Extract 10bp UMI from the start of each read and add it to the read header. # Replace 'input.fastq.gz' with your actual input file. # Replace 'output.fastq.gz' with your desired output file. # Adjust '--bc-pattern' based on the actual UMI length and position in your reads. # For example, if the UMI is 10bp at the start of the read, use NNNNNNNNNN. # If the UMI is extracted using a regex, use --extract-method=regex and --bc-pattern="^(?P<umi_1>.{10})" # If working with paired-end reads, use --read2-in and --read2-out parameters. umi_tools extract \ --bc-pattern=NNNNNNNNNN \ --stdin=input.fastq.gz \ --stdout \ --log=umi_tools_extract.log \ --output-stat=umi_tools_extract.tsv \ > output.fastq.gz -
2
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi-tools # Example usage of umi_tools extract # Replace input.fastq.gz and output.fastq.gz with actual file paths umi_tools extract \ --random-seed 1 \ --bc-pattern NNNNNNNNNN \ -I input.fastq.gz \ -S output.fastq.gz \ --log umi_tools_extract.log -
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt # Define input and output files # Replace 'input.fastq.gz' with your actual input file INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="trimmed.fastq.gz" # Define adapter sequence (replace with the actual adapter sequence used) # A common Illumina 3' adapter sequence is: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA # If a 5' adapter was used, use -g instead of -a. # If paired-end reads, use -A and -G for the respective adapters. ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Run cutadapt to trim adapters and perform quality trimming # -a: 3' adapter sequence # -m: Minimum length of reads to keep after trimming (e.g., 15 bp) # -q: Quality cutoff (e.g., 20 for Phred score 20) for 5' and 3' ends # -o: Output file cutadapt -a "${ADAPTER_SEQUENCE}" \ -m 15 \ -q 20 \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt # conda install -c bioconda cutadapt=1.18 # Define input and output files INPUT_FASTQ="input.fastq.gz" # Placeholder for your input FASTQ file (e.g., SRR1042713_1.fastq.gz from YeoLab/eclip example inputs) OUTPUT_FASTQ="trimmed_output.fastq.gz" # Placeholder for your output trimmed FASTQ file ADAPTER_FASTA="InvRNA.fasta" # Adapter sequences from YeoLab/eclip example inputs # Download adapter file if not present (example) # wget -O ${ADAPTER_FASTA} https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA.fasta # Execute cutadapt command cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a file:${ADAPTER_FASTA} \ -o ${OUTPUT_FASTQ} \ ${INPUT_FASTQ} -
5
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt=1.14 # Define input and output files INPUT_FASTQ="input_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads.fastq.gz" # Placeholder for the adapter sequence. This needs to be replaced with the actual adapter sequence # that causes double-ligation events in your specific library preparation. # Common adapter sequences include Illumina TruSeq adapters, but it depends on the kit. # Example: ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" ADAPTER_SEQUENCE="YOUR_DOUBLE_LIGATION_ADAPTER_SEQUENCE" # Trim reads to remove double-ligation events (adapter sequences) # -a ADAPTER_SEQUENCE: Removes the 3' adapter sequence. If the adapter is present anywhere, it will be removed. # -o: Specifies the output file. cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}" -
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install dependencies (example, adjust as needed) # The eCLIP script requires Python, cutadapt, STAR, samtools, bedtools, R, pysam, pybedtools, numpy, scipy. # conda create -n eclip_env python=3.8 # conda activate eclip_env # conda install -c bioconda cutadapt star samtools bedtools # pip install pysam pybedtools numpy scipy # Clone the eCLIP repository if eclip.py is not already available # git clone https://github.com/YeoLab/eclip.git # cd eclip # # Make sure eclip.py is executable or run with python # chmod +x eclip.py # # Add to PATH or use full path to eclip.py # export PATH=$(pwd):$PATH # Download the adapter sequence wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/Ril19.fasta -O Ril19.fasta # Execute the eCLIP command (assuming 'eclip' is in PATH or using 'python eclip.py') eclip --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta
-
7
For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.
cutadapt (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt # Define primer sequence PRIMER_SEQUENCE="CAGUGCAAUGUUAAAAGGGCAU" # Define input and output files (placeholders) INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="filtered_output.fastq.gz" # Filter out reads that do not contain the 5' primer sequence cutadapt -g "${PRIMER_SEQUENCE}" --discard-untrimmed -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}" -
8
For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools.
$ Bash example
# Install umi-tools via conda # conda install -c bioconda umi-tools=1.1.2 # Example usage: Trim 10nt from the 3' end of Read1 using umi_tools extract. # This command assumes the 10nt at the 3' end of Read1 constitutes the UMI, # which is extracted and subsequently removed from the read sequence, then appended to the read header. # Replace input_R1.fastq.gz and output_R1_trimmed.fastq.gz with actual file names. umi_tools extract \ --bc-pattern=.*(?P<umi_1>.{10})$ \ --extract-method=regex \ -I input_R1.fastq.gz \ -S output_R1_trimmed.fastq.gz -
9
Args: -u -9
Data Processing Script (Inferred with models/gemini-2.5-flash) vCustom$ Bash example
# The arguments '-u' and '-9' are highly generic and do not point to a specific, well-known bioinformatics tool or a standard combination of flags for a single tool. # Without further context (e.g., assay type, specific pipeline, or surrounding commands), it is impossible to definitively identify the tool. # It is inferred to be a custom data processing script where '-u' and '-9' serve specific, custom-defined purposes (e.g., unique processing, a numerical threshold, or a specific mode). # Placeholder command for a hypothetical custom script: # Assuming 'my_pipeline_step_script.sh' is the tool that interprets these arguments. # Replace 'input_file.txt' and 'output_file.txt' with actual file names. my_pipeline_step_script.sh -u -9 input_file.txt > output_file.txt
-
10
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Assuming paired-end reads. Adjust if single-end (e.g., READS_R2="") READS_R1="trimmed_filtered_reads_R1.fastq.gz" READS_R2="trimmed_filtered_reads_R2.fastq.gz" # Placeholder for the STAR index of the RepBase 18.05 database # This index needs to be built once using STAR --runMode genomeGenerate REPEAT_DB_INDEX="/path/to/STAR_index/RepBase_18.05" OUTPUT_DIR="star_repeat_alignment_output" NUM_THREADS=8 # Example number of threads # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run STAR alignment against the repeat element database # Parameters are chosen to be suitable for mapping to repetitive sequences: # --outFilterMultimapNmax: Allows reads to map to multiple locations (important for repeats) # --alignIntronMax 1: Disables splicing detection, as repeats are not typically spliced units STAR \ --runThreadN "${NUM_THREADS}" \ --genomeDir "${REPEAT_DB_INDEX}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_DIR}/" \ --outSAMtype BAM SortedByCoordinate \ --outReadsUnmapped Fastx \ --outFilterMultimapNmax 100 \ --alignIntronMax 1 -
11
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # GENOME_DIR should be the path to your pre-built STAR genome index for 'human_repbase'. # For example, if 'human_repbase' refers to a GRCh38-based index: # GENOME_DIR="/path/to/STAR_indices/human_repbase_GRCh38" GENOME_DIR="human_repbase" READ_FILE="path/to/read1" OUTPUT_PREFIX="out_prefix" # Execute STAR alignment STAR --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILE}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
12
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.6a # Define variables STAR_VERSION="2.7.6a" GENOME_DIR="/path/to/your/hg19_STAR_index" # Placeholder for hg19 STAR index INPUT_FASTQ="unmapped_reads_filtered.fastq.gz" # Placeholder for input reads (e.g., from previous filtering step) OUTPUT_PREFIX="mapped_reads" # Note: The STAR genome index for hg19 must be pre-built using STAR --runMode genomeGenerate. # Example command to build the index (commented out): # mkdir -p ${GENOME_DIR} # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/hg19.fa \ # --sjdbGTFfile /path/to/hg19.gtf \ # --runThreadN 8 # Adjust threads as needed # Run STAR alignment # The description implies mapping against a human genome (hg19). STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 8 # Adjust threads as needed based on available resources # Rename output BAM for clarity mv ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam mapped_reads.bam -
13
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables (placeholders) GENOME_DIR="/path/to/GRCh38_STAR_index" # Example: STAR index for human GRCh38 READ_FILE_1="/path/to/sample_R1.fastq.gz" OUT_PREFIX="sample_aligned_" # Execute STAR alignment command STAR \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILE_1}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
14
Aligned reads were sorted with samtools (1.6)
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.6 # Sort aligned reads (e.g., a BAM file) # Replace input.bam with your actual input file and output.bam with your desired output file name samtools sort -o sorted_reads.bam aligned_reads.bam
-
15
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools (if not already installed) # conda install -c bioconda umi-tools # Example: Collapse sorted reads with umi_tools dedup # This assumes UMIs have already been extracted and stored in a BAM tag (e.g., 'xf') # Replace input.sorted.bam and output.deduplicated.bam with actual file names. # Adjust --umi-tag and --extract-method if UMIs are stored differently (e.g., in read names). umi_tools dedup \ --extract-method=xf \ --umi-tag=xf \ --output-stats=dedup_stats.txt \ -I input.sorted.bam \ -S output.deduplicated.bam -
16
Args: --random-seed 1 --method unique
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools # Example usage: Deduplicate reads in a BAM file # Replace input.bam and output.bam with actual file names umi_tools dedup \ --random-seed 1 \ --method unique \ -I input.bam \ -S output.bam -
17
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper (if not already installed) # conda install -c bioconda clipper # Run CLIPper to identify peak clusters from a BAM file # Replace 'input.bam' with your actual input BAM file. # Replace 'output_peaks' with your desired output file prefix. # '-s 3.1e9' specifies the genome size for human (hg38). Adjust if using a different organism or assembly. clipper.py -b input.bam -s 3.1e9 -o output_peaks
-
18
Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
(Inferred with models/gemini-2.5-flash)$ Bash example
# This script processes a BAM file, potentially performing gene-level analysis or filtering. # The tool is inferred based on arguments like --maxgenes and --save-pickle, suggesting a custom Python script. # The reference genome hg19 (GRCh37) is specified. # Example command for the inferred tool python inferred_tool_script.py \ --species hg19 \ --bam path/to/input.bam \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile path/to/output.bam -
19
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Ensure Perl and necessary modules are installed. # Clone the eCLIP repository and add its 'bin' directory to your PATH. # For example: # git clone https://github.com/yeolab/eclip.git # export PATH=$PATH:$(pwd)/eclip/bin # Define input files (placeholders) IP_BAM="ip_sample.bam" INPUT_BAM="input_sample.bam" PEAK_FILE="peak_clusters.bed" # Assuming peak clusters are in BED format OUTPUT_PREFIX="normalized_peaks" # Normalize peak clusters using peaksnormalize.pl # This script internally uses overlap_peakfi_with_bam_PE.pl as described. peaksnormalize.pl \ --peak_file "${PEAK_FILE}" \ --ip_bam "${IP_BAM}" \ --input_bam "${INPUT_BAM}" \ --output_prefix "${OUTPUT_PREFIX}" -
20
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Installation of eCLIP pipeline (if needed, typically via git clone) # git clone https://github.com/yeolab/eclip.git # cd eclip # This script is located in the 'scripts' directory of the eCLIP repository. # Define input peak files (these would be the normalized peak regions from replicates) # Replace with actual paths and filenames of your normalized peak BED files INPUT_PEAK_FILE_REP1="sample_A_replicate1_normalized_peaks.bed" INPUT_PEAK_FILE_REP2="sample_A_replicate2_normalized_peaks.bed" # Add more replicate files if available, e.g., INPUT_PEAK_FILE_REP3="sample_A_replicate3_normalized_peaks.bed" # Define the output prefix for the merged peak file OUTPUT_PREFIX="sample_A_merged_replicate_peaks" # Path to the script within the eCLIP repository # Assuming the script is run from the eCLIP base directory or its path is known ECLIP_SCRIPT_PATH="scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl" # Execute the script to merge overlapping normalized peak regions from replicates perl "${ECLIP_SCRIPT_PATH}" \ "${OUTPUT_PREFIX}" \ "${INPUT_PEAK_FILE_REP1}" \ "${INPUT_PEAK_FILE_REP2}" # Add more "${INPUT_PEAK_FILE_REP_N}" here if more replicates are used -
21
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (version 2.0.2) # conda install -c bioconda idr=2.0.2 # Install merge_peaks dependencies (Perl script) # No specific installation for make_informationcontent_from_peaks.pl, it's a Perl script. # Ensure Perl is available. # Define input peak files (placeholders) REP1_PEAKS="rep1.compressed.bed" REP2_PEAKS="rep2.compressed.bed" OUTPUT_PREFIX="idr_output" # Define reference genome files (placeholders, required by make_informationcontent_from_peaks.pl) # Replace with actual paths to your genome fasta and chromosome sizes file. GENOME_FASTA="path/to/genome.fa" CHROM_SIZES="path/to/chrom.sizes" # Step 1: Rank normalized peak files by entropy score # make_informationcontent_from_peaks.pl is part of the yeolab/merge_peaks pipeline. # It takes a BED file and adds an entropy score as the 5th column (0-indexed). # The script requires a genome fasta and chrom sizes. # Assuming make_informationcontent_from_peaks.pl is in the current directory or PATH perl make_informationcontent_from_peaks.pl -i "${REP1_PEAKS}" -o "${REP1_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}" perl make_informationcontent_from_peaks.pl -i "${REP2_PEAKS}" -o "${REP2_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}" # Step 2: Sort the entropy-ranked peak files by the entropy score (5th column, descending) # IDR expects peaks to be sorted by score. sort -k5,5nr "${REP1_PEAKS%.bed}.entropy.bed" > "${REP1_PEAKS%.bed}.entropy.sorted.bed" sort -k5,5nr "${REP2_PEAKS%.bed}.entropy.bed" > "${REP2_PEAKS%.bed}.entropy.sorted.bed" # Step 3: Run IDR (Irreproducible Discovery Rate) to determine reproducible peaks # IDR version 2.0.2 # The --rank parameter specifies which column to use for ranking (default is 5 for narrowPeak/broadPeak). # Since we sorted by entropy score into the 5th column, we use --rank 5. # A common IDR threshold is 0.05. The description doesn't specify, so a common default is used. idr --samples "${REP1_PEAKS%.bed}.entropy.sorted.bed" "${REP2_PEAKS%.bed}.entropy.sorted.bed" \ --output-file "${OUTPUT_PREFIX}.bed" \ --rank 5 \ --plot \ --log-output-file "${OUTPUT_PREFIX}.log" \ --soft-idr-threshold 0.05 -
22
To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)
seqtk, ctk fasta2collapse.pl v1.3-r106, v1.0.8$ Bash example
# Install seqtk (if not already installed) # conda install -c bioconda seqtk # Install ctk fasta2collapse.pl (if not already installed) # This script is typically part of a larger pipeline or custom scripts, often found in CLIP-seq toolkits. # It might require Perl and specific modules. For eCLIP assays, it's often found within the Yeo lab's eCLIP pipeline. # Example: git clone https://github.com/yeolab/eclip.git # Then ensure ctk_fasta2collapse.pl is in your PATH or specify its full path. # Placeholder for input trimmed FASTQ file(s) # Assuming 'trimmed_reads.fastq' as a single input file for demonstration. # If multiple FASTQ files (e.g., paired-end reads), this step would typically be applied per file or after merging. INPUT_FASTQ="trimmed_reads.fastq" # Placeholder for output collapsed FASTA file OUTPUT_FASTA="collapsed_unique_reads.fasta" # Step 1: Convert trimmed FASTQ to FASTA using seqtk # The output FASTA file will have the same base name as the input FASTQ, but with a .fasta extension. seqtk seq -a "${INPUT_FASTQ}" > "${INPUT_FASTQ%.fastq}.fasta" # Step 2: Collapse unique sequences from the FASTA file using ctk fasta2collapse.pl # This script identifies and collapses identical sequences, often counting their occurrences, which is crucial for chimeric read analysis. ctk fasta2collapse.pl "${INPUT_FASTQ%.fastq}.fasta" > "${OUTPUT_FASTA}" -
23
Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)
$ Bash example
# Install Bowtie2 (example using conda) # conda install -c bioconda bowtie2=2.2.6 # --- Reference Data Preparation (prior step) --- # Download mature miRNA sequences from miRBase (e.g., mature.fa from miRBase FTP) # Example: wget http://www.mirbase.org/ftp/CURRENT/mature.fa.gz # gunzip mature.fa.gz # Build Bowtie2 index for mature miRNA sequences # bowtie2-build mature.fa miRBase_mature_miRNA_index # --- Mapping Step --- # Map collapsed sequences to the miRBase mature miRNA index # Input: collapsed_sequences.fastq (your collapsed sequence file) # Reference Index: miRBase_mature_miRNA_index (the index built from miRBase mature miRNA sequences) # Output: mapped_miRNA.sam (SAM format alignment file) bowtie2 -x miRBase_mature_miRNA_index -U collapsed_sequences.fastq -S mapped_miRNA.sam
-
24
Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Define variables # Placeholder for human GRCh38 genome index. This directory should contain files like SA, SAindex, genome, etc. GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for the input FASTQ file containing sequences re-assigned after miR mapping. INPUT_FASTQ="reassigned_reads.fastq.gz" # Prefix for output files (e.g., reassigned_to_genome_Aligned.out.bam) OUTPUT_PREFIX="reassigned_to_genome" # Number of threads to use for STAR THREADS=8 # Align sequences to the human genome # This command assumes the STAR genome index for GRCh38 has already been built. # If not, you would first run: # STAR --runThreadN ${THREADS} --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \ # --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \ # --sjdbOverhang 100 # Adjust sjdbOverhang based on read length (e.g., read_length - 1) STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped None
Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out. For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools. Args: -u -9 Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.6) Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8) Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6) Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described. Genome_build: hg19 Supplementary_files_format_and_content: gk.Au9.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed and gk.Au13.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed contain enriched regions from the miR-130a targeted chimeric eCLIP dataset Supplementary_files_format_and_content: cellsB_rep1.vs.cellsB_rep2.bed contains reproducible total chimeric eCLIP peaks across replicates Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities