GSE193134 Processing Pipeline
Publication
LARP4 is an RNA-binding protein that binds nuclear-encoded mitochondrial mRNAs to promote mitochondrial function. RNA (New York, N.Y.) (2024), PMID 38164626
Dataset
GSE193134: LARP4 Is an RNA-Binding Protein That Binds Nuclear-Encoded Mitochondrial mRNAs To Promote Mitochondrial Function
Processing Steps
-
1
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install UMI-tools (e.g., via conda)
# conda install -c bioconda umi-tools=1.0.0
# Example: extract 10-nt UMIs (randomers) from the start of Read 1 and add them to the headers of both reads.
# Replace 'input_read1.fastq.gz' and 'input_read2.fastq.gz' with your actual input files.
# Adjust '--bc-pattern' if the UMI length or position differs (one N per UMI base).
# If the UMIs are in Read 2, pass Read 2 to --stdin and Read 1 to --read2-in instead.
umi_tools extract \
  --stdin input_read1.fastq.gz \
  --stdout output_read1_umi.fastq.gz \
  --read2-in input_read2.fastq.gz \
  --read2-out output_read2_umi.fastq.gz \
  --extract-method=string \
  --bc-pattern=NNNNNNNNNN \
  --log umi_tools_extract.log
-
2
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
umi_tools extract (Inferred with models/gemini-2.5-flash) v1.1.2
$ Bash example
# Install umi_tools if not already installed
# conda install -c bioconda umi-tools
# Example usage of umi_tools extract with the provided arguments.
# This command assumes input FASTQ files (read1.fastq.gz, read2.fastq.gz)
# and outputs UMI-extracted FASTQ files (umi_extracted_read1.fastq.gz, umi_extracted_read2.fastq.gz).
# Adjust input/output filenames and which read contains the UMI (--stdin) as per your library preparation.
umi_tools extract \
  --random-seed 1 \
  --bc-pattern NNNNNNNNNN \
  --stdin read1.fastq.gz \
  --read2-in read2.fastq.gz \
  --stdout umi_extracted_read1.fastq.gz \
  --read2-out umi_extracted_read2.fastq.gz
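To make concrete what "reformatted to include randomers in read headers" means, here is a minimal awk sketch that moves the first 10 bases of each read into its name. It is an illustration only, not a replacement for umi_tools: the function name `move_umi_to_header` and the toy read are hypothetical, and the `<name>_<UMI>` header layout is assumed to mirror what umi_tools extract produces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move the first N bases (the randomer/UMI) of every read
# into the read name, and trim the same N positions from the quality string.
move_umi_to_header() {
  local umi_len="${1:-10}"
  awk -v n="$umi_len" '
    NR % 4 == 1 { header = $0; next }          # remember the header line
    NR % 4 == 2 {                              # sequence line
      umi = substr($0, 1, n)
      space = index(header, " ")               # keep any comment after the read name
      if (space > 0)
        print substr(header, 1, space - 1) "_" umi substr(header, space)
      else
        print header "_" umi
      print substr($0, n + 1)
      next
    }
    NR % 4 == 3 { print; next }                # separator line
    { print substr($0, n + 1) }                # quality line, trimmed to match
  '
}

# Toy single-read FASTQ: 10-nt randomer followed by 12 nt of insert.
printf '@read1 1:N:0\nACGTACGTACTTTTGGGGCCCC\n+\nIIIIIIIIIIJJJJJJJJJJJJ\n' \
  | move_umi_to_header 10
# Prints:
# @read1_ACGTACGTAC 1:N:0
# TTTTGGGGCCCC
# +
# JJJJJJJJJJJJ
```

Keeping the randomer in the read name lets it survive alignment, so the later collapsing step can read it back from the BAM record names.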
-
3
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.14
# Generic single-end example: 3' adapter removal plus quality trimming.
# The adapter shown is a common Illumina sequence used as a placeholder; the adapters and arguments actually used for this dataset are listed in the next step.
# Replace 'input.fastq.gz' with your actual input file and 'trimmed.fastq.gz' with your desired output file.
# For paired-end reads, add -A <read 2 adapter>, -p <read 2 output file>, and the second input file.
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -q 20,20 -m 20 -o trimmed.fastq.gz input.fastq.gz
-
4
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt
# conda install -c bioconda cutadapt=1.14
# Adapter FASTA files (InvRNA*.fasta) are available from:
# https://github.com/YeoLab/eclip/tree/master/example/inputs/
# 'InvRNA.fasta' below is a placeholder for the InvRNA*.fasta file that matches your library.
# The 'file:' prefix tells cutadapt to read adapter sequences from a FASTA file.
# Replace 'input_reads.fastq.gz' with your actual input file
# and 'trimmed_reads.fastq.gz' with your desired output file.
cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -a file:InvRNA.fasta \
  -o trimmed_reads.fastq.gz \
  input_reads.fastq.gz
-
5
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=1.14
# Define input and output file names
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed_output.fastq.gz"
# Placeholder for the adapter sequence that causes double-ligation events.
# This sequence would be specific to the library preparation protocol.
# Example: A common Illumina TruSeq adapter sequence is used here.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Trim reads with cutadapt to remove double-ligation events
# -a ADAPTER_SEQUENCE: Trims the 3' adapter sequence
# -o: Specifies the output file
cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"
-
6
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
eCLIP v1.0.1
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.14
# Ril19.fasta is listed at: https://github.com/YeoLab/eclip/tree/master/example/inputs/
# These arguments belong to the second cutadapt (1.14) pass that removes double-ligation events.
# The 'file:' prefix tells cutadapt to read adapter sequences from a FASTA file.
# Replace 'trimmed_reads.fastq.gz' with the output of the first trimming pass
# and 'trimmed_round2.fastq.gz' with your desired output file.
cutadapt \
  --match-read-wildcards \
  -O 5 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -a file:Ril19.fasta \
  -o trimmed_round2.fastq.gz \
  trimmed_reads.fastq.gz
-
7
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).
$ Bash example
# Install STAR (example using conda)
# conda create -n star_env star=2.7.6a -c bioconda -y
# conda activate star_env
# Define variables
STAR_VERSION="2.7.6a"
REPBASE_FASTA="RepBase18.05.fasta"  # Placeholder: Replace with the actual path to the RepBase 18.05 FASTA file
GENOME_DIR="star_repbase_index"
READS_R1="trimmed_filtered_R1.fastq.gz"  # Placeholder: Replace with the actual path to your trimmed and filtered R1 reads
READS_R2="trimmed_filtered_R2.fastq.gz"  # Placeholder: Replace with the actual path to your trimmed and filtered R2 reads (remove if single-end)
OUTPUT_PREFIX="repbase_aligned"
THREADS=8  # Adjust based on available CPU cores
# 1. Build STAR index for RepBase (if not already built)
# This step assumes you have the RepBase FASTA file.
# For smaller reference sequences like repeat databases, 'genomeSAindexNbases' might need to be adjusted from the default 14.
# A value of 10 is often suitable for smaller references.
echo "Building STAR index for RepBase..."
mkdir -p "${GENOME_DIR}"
STAR --runMode genomeGenerate \
  --genomeDir "${GENOME_DIR}" \
  --genomeFastaFiles "${REPBASE_FASTA}" \
  --runThreadN "${THREADS}" \
  --genomeSAindexNbases 10  # Recommended for smaller reference sequences like repeat databases
# 2. Map reads with STAR
# Assuming paired-end reads. For single-end reads, remove the second file from --readFilesIn.
echo "Mapping reads with STAR..."
STAR --version  # Confirm STAR version
STAR --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}." \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 100 \
  --alignIntronMax 1 \
  --alignSJDBoverhangMin 1 \
  --outReadsUnmapped Fastx  # Output unmapped reads to a separate file
-
8
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6a
# Note: 'human_repbase' refers to a pre-built STAR index directory.
# Per the step description, reads are mapped against a repeat element database (RepBase 18.05),
# so the index is assumed to be built from human repeat sequences rather than from the full genome.
# Example for index generation (the FASTA path is a placeholder):
# STAR --runThreadN 16 \
#   --runMode genomeGenerate \
#   --genomeDir human_repbase \
#   --genomeFastaFiles /path/to/repbase_sequences.fa
STAR \
  --runThreadN 16 \
  --genomeDir human_repbase \
  --readFilesIn path/to/read1 \
  --outFileNamePrefix out_prefix \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd
-
9
Reads that did not map to repeat elements were then mapped with STAR (2.7.6a) against the human genome (hg19).
$ Bash example
# Install STAR (version 2.7.6a)
# conda install -c bioconda star=2.7.6a
# Reference Genome: Human genome (hg19)
# Download hg19 FASTA from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# Download hg19 GTF from GENCODE (e.g., v19): https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# Define input and output paths
INPUT_READS="filtered_unmapped_reads.fastq.gz"  # Placeholder for input FASTQ file
STAR_INDEX_DIR="path/to/STAR_hg19_index"  # Directory containing STAR index for hg19
OUTPUT_PREFIX="mapped_reads_hg19_"  # Prefix for output files
# Run STAR alignment
STAR --runMode alignReads \
  --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${INPUT_READS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 10 \
  --outFilterScoreMinOverLread 0.66 \
  --outFilterMatchNminOverLread 0.66 \
  --limitBAMsortRAM 30000000000  # Example: 30GB RAM for sorting, adjust as needed
-
10
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star
STAR \
  --runThreadN 16 \
  --genomeDir /path/to/STAR_genome_index \
  --readFilesIn /path/to/read1.fastq.gz \
  --outFileNamePrefix my_output_prefix \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd
-
11
Aligned reads were sorted with samtools (1.6).
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.6
# Sort aligned reads
# Replace 'aligned_reads.bam' with your actual input BAM file
# Replace 'sorted_reads.bam' with your desired output sorted BAM file
samtools sort -o sorted_reads.bam aligned_reads.bam
-
12
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools (if not already installed)
# conda install -c bioconda umi-tools=1.0.0
# Define input and output file names
INPUT_BAM="sorted_reads.bam"  # Placeholder for the sorted input BAM file
OUTPUT_BAM="collapsed_reads.bam"  # Placeholder for the output collapsed BAM file
# Collapse reads using umi_tools dedup
# The input BAM must be coordinate-sorted and indexed (samtools index).
# This command assumes UMIs are present in the read IDs (e.g., after umi_tools extract)
# If UMIs are in a SAM tag (e.g., 'XN'), use --extract-umi-method=tag --umi-tag=XN
# If reads are paired-end, add --paired
# If reads are spliced (e.g., RNA-seq), consider --spliced-is-unique
umi_tools dedup \
  --stdin="${INPUT_BAM}" \
  --stdout="${OUTPUT_BAM}" \
  --extract-umi-method=read_id
-
13
Args: --random-seed 1 --method unique
umi_tools dedup (1.0.0)
$ Bash example
# These arguments belong to the umi_tools (1.0.0) collapsing step described above, not to a separate custom script.
# Replace the BAM file names with your own; the input must be coordinate-sorted and indexed.
umi_tools dedup \
  --random-seed 1 \
  --method unique \
  -I sorted_reads.bam \
  -S collapsed_reads.bam
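As a rough illustration of what `--method unique` implies, the sketch below keeps one record per distinct combination of position, strand, and exact UMI sequence, with no merging of UMIs that differ by a mismatch. It works on a hypothetical four-column table (chrom, start, strand, UMI) rather than on BAM records, and it keeps the first record seen, whereas umi_tools selects its own representative read.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collapse reads that share chromosome, start, strand and
# an identical UMI. Input is tab-separated: chrom, start, strand, UMI.
collapse_unique() {
  awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Toy input: the first two reads are PCR duplicates (same position, same UMI);
# the third has a one-base UMI difference and is therefore kept.
printf 'chr1\t100\t+\tAAAAACCCCC\nchr1\t100\t+\tAAAAACCCCC\nchr1\t100\t+\tAAAAACCCCG\nchr1\t250\t-\tAAAAACCCCC\n' \
  | collapse_unique
# Prints three lines: the duplicate of the first read is dropped.
```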
-
14
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
# Install CLIPper and its dependencies (e.g., pysam, numpy, scipy, pybedtools)
# pip install pysam numpy scipy pybedtools
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# Check out the release or commit that corresponds to version 1.2.2, then install it.
# The genome build for this dataset is hg19.
# Placeholder input and output file names
INPUT_BAM="collapsed_reads.bam"
OUTPUT_PEAKS="peak_clusters.bed"
# Call peak clusters on the collapsed BAM file.
# The full argument list used for this dataset is given in the next step.
clipper \
  --species hg19 \
  --bam "${INPUT_BAM}" \
  --outfile "${OUTPUT_PEAKS}"
-
15
Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
$ Bash example
# These arguments belong to the Clipper (1.2.2) peak-calling step described above, not to a separate custom script.
# Paths are placeholders, kept as given in the source.
# The source names the output path/to/output.bam; the deposited peak cluster files are in BED format.
clipper \
  --species hg19 \
  --bam path/to/input.bam \
  --timeout 3600000 \
  --maxgenes 1000000 \
  --save-pickle \
  --outfile path/to/output.bam
-
16
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
$ Bash example
# Clone the eclip repository if not already available
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# Assuming eclip/bin is in your PATH or you provide the full path to the script
# Define placeholder variables
PEAK_CLUSTER_FILE="your_peak_clusters.bed"  # Peak clusters from Clipper
IP_BAM_FILE="your_ip_sample.bam"
INPUT_BAM_FILE="your_input_sample.bam"
NORMALIZED_PEAK_PREFIX="normalized_peaks"
# Execute the normalization script
# NOTE: the source does not give this script's arguments. The option names below are
# illustrative placeholders, not its confirmed interface; check the script's usage message before running.
# Ensure peaksnormalize.pl is executable and in your PATH, or use its full path
peaksnormalize.pl \
  --peak_file "${PEAK_CLUSTER_FILE}" \
  --ip_bam "${IP_BAM_FILE}" \
  --input_bam "${INPUT_BAM_FILE}" \
  --output_prefix "${NORMALIZED_PEAK_PREFIX}"
-
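The idea behind normalizing IP against INPUT can be sketched as a log2 fold enrichment of read fractions within a peak. This is an assumption about the general form of the calculation, suggested by the "l2foldenr" script names; the function name and counts below are hypothetical, and any pseudocounts or significance tests the real script applies are omitted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: log2 fold enrichment of IP over INPUT for one peak.
# Args: IP reads in peak, total IP reads, INPUT reads in peak, total INPUT reads.
l2fold_enrichment() {
  local ip_peak="$1" ip_total="$2" in_peak="$3" in_total="$4"
  awk -v a="$ip_peak" -v at="$ip_total" -v b="$in_peak" -v bt="$in_total" 'BEGIN {
    if (a <= 0 || b <= 0 || at <= 0 || bt <= 0) { print "NA"; exit }  # undefined without a pseudocount
    printf "%.3f\n", log((a / at) / (b / bt)) / log(2)
  }'
}

l2fold_enrichment 80 1000000 10 1000000   # 8-fold enrichment, prints 3.000
```

Dividing by the library totals first means the result does not depend on how deeply the IP and INPUT samples were sequenced.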
17
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+.
$ Bash example
# The script compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl is, per the step description, included within eclip-0.1.5+.
# Installation (commented out as it's usually part of a larger pipeline setup or cloned directly):
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# Define input and output file names based on the description.
# "Overlapping normalized peak regions" implies an input BED file that has already undergone normalization.
# The deposited files (*.peakClusters.normed.compressed.bed) suggest one compressed file per replicate.
INPUT_NORMALIZED_PEAKS="rep1.peakClusters.normed.bed"
OUTPUT_COMPRESSED_PEAKS="rep1.peakClusters.normed.compressed.bed"
# Execute the Perl script once per replicate.
# NOTE: the source does not give this script's arguments. The positional input/output order below is an
# assumption, not its confirmed interface; check the script's usage message before running.
# Adjust the path to the script if it's not in the current directory or in your PATH.
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
  "${INPUT_NORMALIZED_PEAKS}" \
  "${OUTPUT_COMPRESSED_PEAKS}"
-
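For intuition about merging overlapping peak regions, here is a generic interval-merge sketch over a six-column BED file. It is not the eCLIP script's actual rule, which the source does not describe: the function name is hypothetical, and the choice to keep the highest score among merged peaks is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge overlapping BED6 intervals on the same chromosome
# and strand, keeping the largest score (column 5) of each merged group.
merge_overlapping() {
  LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n | awk -F'\t' -v OFS='\t' '
    NR > 1 && $1 == chrom && $6 == strand && $2 + 0 < end + 0 {
      if ($3 + 0 > end + 0) end = $3           # extend the current region
      if ($5 + 0 > score + 0) score = $5       # keep the best score
      next
    }
    NR > 1 { print chrom, start, end, ".", score, strand }
    { chrom = $1; start = $2; end = $3; score = $5; strand = $6 }
    END { if (NR > 0) print chrom, start, end, ".", score, strand }
  '
}

# Toy input: p1 and p2 overlap on the + strand; p4 is on the - strand.
printf 'chr1\t100\t200\tp1\t3.0\t+\nchr1\t150\t300\tp2\t5.0\t+\nchr1\t400\t500\tp3\t2.0\t+\nchr1\t120\t180\tp4\t4.0\t-\n' \
  | merge_overlapping
# Prints (tab-separated):
# chr1  100  300  .  5.0  +
# chr1  400  500  .  2.0  +
# chr1  120  180  .  4.0  -
```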
18
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed)
# conda install -c bioconda idr=2.0.2
# Install merge_peaks pipeline (which includes make_informationcontent_from_peaks.pl)
# git clone https://github.com/yeolab/merge_peaks.git
# export PATH=$PATH:$(pwd)/merge_peaks/bin
# Define input peak files (the normalized, compressed peaks from the previous step)
# These are placeholders; replace with actual file paths.
INPUT_PEAKS_REP1="rep1.compressed.bed"
INPUT_PEAKS_REP2="rep2.compressed.bed"
# Define output files for entropy-ranked peaks
ENTROPY_RANKED_REP1="rep1.entropy_ranked.bed"
ENTROPY_RANKED_REP2="rep2.entropy_ranked.bed"
# Define output prefix for IDR results
IDR_OUTPUT_PREFIX="reproducible_peaks"
# Step 1: Rank normalized peak files by entropy score using make_informationcontent_from_peaks.pl
# NOTE: the source does not give this script's arguments. The -i/-o options below are
# illustrative placeholders, not its confirmed interface; check the script's usage message before running.
make_informationcontent_from_peaks.pl -i "${INPUT_PEAKS_REP1}" -o "${ENTROPY_RANKED_REP1}"
make_informationcontent_from_peaks.pl -i "${INPUT_PEAKS_REP2}" -o "${ENTROPY_RANKED_REP2}"
# Step 2: Run IDR (2.0.2) using the entropy-ranked peaks as inputs.
# --input-file-type bed tells IDR the inputs are BED rather than the default narrowPeak.
# --rank score ranks peaks by the BED score column (column 5), assumed here to hold the entropy score.
idr --samples "${ENTROPY_RANKED_REP1}" "${ENTROPY_RANKED_REP2}" \
  --input-file-type bed \
  --rank score \
  --output-file "${IDR_OUTPUT_PREFIX}" \
  --plot \
  --log-output-file "${IDR_OUTPUT_PREFIX}.log"
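The source does not define the entropy score. A minimal sketch, assuming it has the relative-entropy form p × log2(p / q), where p and q are the fractions of IP and INPUT reads falling in a peak, might look like this; the function name and counts are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: information content of one peak, assumed to be
# p * log2(p / q) with p = IP read fraction and q = INPUT read fraction.
# Args: IP reads in peak, total IP reads, INPUT reads in peak, total INPUT reads.
information_content() {
  local ip_peak="$1" ip_total="$2" in_peak="$3" in_total="$4"
  awk -v a="$ip_peak" -v at="$ip_total" -v b="$in_peak" -v bt="$in_total" 'BEGIN {
    if (a <= 0 || b <= 0 || at <= 0 || bt <= 0) { print "NA"; exit }
    p = a / at
    q = b / bt
    printf "%.6f\n", p * log(p / q) / log(2)
  }'
}

information_content 80 1000000 10 1000000   # p = 8e-5, log2(p/q) = 3, prints 0.000240
```

Unlike fold enrichment alone, a score of this form also rewards peaks that contain more reads, so a strongly enriched peak with many reads ranks above an equally enriched peak with few.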
Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14).
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19).
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0).
Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2).
Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Genome_build: hg19
Supplementary_files_format_and_content: Hek293_WT_P_rep_1.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.bed and Hek293_WT_P_rep_2.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.bed contain size matched input normalized eCLIP peaks from rep1 and rep2
Supplementary_files_format_and_content: LARP4_hek_WT.bed contains reproducible total eCLIP peaks across replicates (IDR peaks)
Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities
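RPM (reads per million) normalization, as mentioned for the BigWig files, scales each raw coverage value by one million divided by the number of usable reads in the library. The sketch below applies that scaling to a bedGraph-style table; the function name, the toy interval, and the read total are hypothetical, and the source does not state which read count was used as the denominator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert raw coverage (column 4 of a tab-separated
# bedGraph-style table) to reads per million.
# Arg: total number of usable reads in the library.
rpm_scale() {
  local total_reads="$1"
  awk -F'\t' -v total="$total_reads" \
    '{ printf "%s\t%s\t%s\t%.4f\n", $1, $2, $3, $4 * 1e6 / total }'
}

# A raw coverage of 5 in a library of 2,000,000 reads is 2.5 RPM.
printf 'chr1\t0\t100\t5\n' | rpm_scale 2000000
# Prints: chr1  0  100  2.5000 (tab-separated)
```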