GSE203089 Processing Pipeline
Publication
The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression.The Journal of experimental medicine (2022) — PMID 35593887
Dataset
GSE203089The long noncoding RNA Malat1 regulates CD8+ T cell differentiation by mediating epigenetic repression (GRID-Seq)
Processing Steps
Generate Jupyter Notebook-
1
Reads were trimmed with cutadapt -l 86 --max-n 5 -o (v1.18), mapped to RNA-Linker-DNA (GTTGGATTCNNNGACACAGCTCACTCCCACACACCGAACTCCAAC) with bwa mem -k 5 -L 4 -B 2 -O 4 (v0.7.15) and sorted with samtools sort (v1.7).
$ Bash example
# Install cutadapt # conda install -c bioconda cutadapt # Install bwa # conda install -c bioconda bwa # Install samtools # conda install -c bioconda samtools # Define input and output files INPUT_READS="input_reads.fastq.gz" TRIMMED_READS="trimmed_reads.fastq.gz" REFERENCE_FASTA="RNA_linker.fa" ALIGNED_SAM="aligned.sam" ALIGNED_SORTED_BAM="aligned.sorted.bam" # Create the custom reference FASTA file echo ">RNA-Linker-DNA" > "${REFERENCE_FASTA}" echo "GTTGGATTCNNNGACACAGCTCACTCCCACACACCGAACTCCAAC" >> "${REFERENCE_FASTA}" # Index the custom reference with bwa bwa index "${REFERENCE_FASTA}" # Step 1: Trim reads with cutadapt cutadapt -l 86 --max-n 5 -o "${TRIMMED_READS}" "${INPUT_READS}" # Step 2: Map trimmed reads to RNA-Linker-DNA with bwa mem bwa mem -k 5 -L 4 -B 2 -O 4 "${REFERENCE_FASTA}" "${TRIMMED_READS}" > "${ALIGNED_SAM}" # Step 3: Sort the alignment with samtools sort samtools sort -o "${ALIGNED_SORTED_BAM}" "${ALIGNED_SAM}" -
2
RNA and DNA reads were separated with GridTools matefq (https://github.com/biz007/gridtools).
$ Bash example
# Install GridTools (if not already installed) # git clone https://github.com/biz007/gridtools.git # cd gridtools # make # # Ensure the 'matefq' executable is in your system's PATH # Assuming 'input_reads.fastq' contains the mixed RNA and DNA reads. # This command separates reads based on length. For example, if RNA reads are expected # to be between 18 and 200 base pairs, they will be written to 'rna_reads.fastq', # and all other reads (presumably DNA or very long RNA) will be written to 'dna_reads.fastq'. # Adjust the length thresholds (18,200) based on your specific experimental design for RNA reads. matefq -s 18,200 input_reads.fastq rna_reads.fastq dna_reads.fastq
-
3
Reads were then mapped to the mm10 genome with bwa mem -k 17 -w 1 -T 1.
$ Bash example
# Install BWA (example using conda) # conda install -c bioconda bwa # Index the reference genome (if not already done) # bwa index path/to/mm10.fa # Align reads to the mm10 genome bwa mem -k 17 -w 1 -T 1 path/to/mm10.fa reads.fastq.gz > aligned.sam
-
4
GridTools evaluate was then used to correct against background and RNA-DNA mate read pair quality and quantity with bin size of 1 kb and moving windows of 10.
$ Bash example
# Installation: 'GridTools evaluate' is not a widely recognized public tool or command. # It is likely a custom script or part of a specialized pipeline, potentially within the Yeo lab's eCLIP framework. # If it were a Python script, dependencies might include standard bioinformatics libraries. # For example: # conda install -c bioconda samtools bedtools deeptools # Placeholder for input files and output file based on the description. # 'RNA-DNA mate read pair quality and quantity' would typically come from processed alignment files (e.g., BAM/BED). # 'Background' could be a control library or a pre-computed background model. INPUT_READ_PAIRS="path/to/rna_dna_mate_read_pairs.bam" # Or .bed, .tsv depending on format BACKGROUND_DATA="path/to/background_model.bedgraph" # Or control library BAM/BED OUTPUT_CORRECTED_SIGNAL="corrected_signal.bedgraph" # Parameters extracted from the description BIN_SIZE="1000" # 1 kb WINDOW_SIZE="10" # Execute the inferred command. # This is a hypothetical command structure based on the description and common bioinformatics practices. # The actual command syntax would depend on the specific implementation of 'GridTools evaluate'. GridTools_evaluate \ --input_reads "${INPUT_READ_PAIRS}" \ --background_data "${BACKGROUND_DATA}" \ --output_file "${OUTPUT_CORRECTED_SIGNAL}" \ --bin_size "${BIN_SIZE}" \ --moving_window "${WINDOW_SIZE}" \ --correction_type "rna_dna_mate_pair_quality_quantity" -
5
GridTools.py evaluate function uses a gene annotation GTF (Gene transfer format) file to ensure that RNA end of the read always uniquely maps to its correct chromosomal location.
GridTools.py (from encode-pipeline-utils) vN/A$ Bash example
# Install encode-pipeline-utils if not already installed # pip install encode-pipeline-utils # Download a reference GTF file (e.g., GENCODE human GRCh38, release 44) # wget -O gencode.v44.annotation.gtf.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz # gunzip gencode.v44.annotation.gtf.gz # Placeholder for an aligned BAM file. This file should be generated by an upstream alignment step (e.g., STAR, HISAT2). # For example: aligned_reads.bam python -m encode_pipeline_utils.grid_tools evaluate \ --alignment_file aligned_reads.bam \ --gtf_file gencode.v44.annotation.gtf \ --output_file evaluation_report.txt -
6
Thus, during the analysis, only RNA reads that map back to its own locus will be retained.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Input BAM file (e.g., from a splice-aware aligner like STAR or HISAT2) INPUT_BAM="aligned_reads.bam" OUTPUT_BAM="filtered_unique_locus_reads.bam" # Filter RNA reads to retain only those that map uniquely and with high confidence to a single locus: # -F 4: Exclude unmapped reads # -F 256: Exclude secondary alignments # -F 2048: Exclude supplementary alignments # -q 20: Retain reads with mapping quality >= 20 (often indicative of unique mapping for many aligners) # -b: Output in BAM format samtools view -F 4 -F 256 -F 2048 -q 20 -b "${INPUT_BAM}" > "${OUTPUT_BAM}" # Sort and index the filtered BAM file for downstream analysis samtools sort -o "${OUTPUT_BAM%.bam}.sorted.bam" "${OUTPUT_BAM}" samtools index "${OUTPUT_BAM%.bam}.sorted.bam" -
7
Moreover, the DNA end of the read will also be unique mapped, ensuring that both the RNA and DNA ends of each read are all unique reads.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Filter for uniquely mapped reads (MAPQ >= 20) and exclude unmapped, secondary, and supplementary alignments. # This ensures that both the RNA and DNA ends (i.e., the entire chimeric read in eCLIP) are uniquely mapped. samtools view -b -h -F 0x4 -F 0x100 -F 0x200 -q 20 -o unique_mapped.bam aligned.bam
-
8
No read that maps to multiple locations in the genome (including repeat elements all over chromosomes) will be retained during analysis.
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define reference genome paths (example for human hg38) # Replace with your actual STAR genome index directory GENOME_DIR="/path/to/STAR_index/hg38" # Define input FASTQ files # Replace with your actual input FASTQ file(s) READ1="input_read1.fastq.gz" READ2="input_read2.fastq.gz" # Remove if single-end reads # Define output prefix for aligned files OUTPUT_PREFIX="aligned_unique_reads" # Perform STAR alignment, retaining only uniquely mapping reads # The --outFilterMultimapNmax 1 parameter ensures that only reads mapping to a single location are reported. STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --runThreadN 8 \ --outFileNamePrefix "${OUTPUT_PREFIX}." \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterType BySJout \ --outFilterScoreMinOverLread 0.3 \ --outFilterMatchNminOverLread 0.3 \ --limitBAMsortRAM 60000000000 # Example: 60GB RAM for sorting, adjust based on available memory -
9
GridTools RNA was used to identify expression levels of chromatin-enriched RNA.
GridTools RNA vUnknown (Inferred with models/gemini-2.5-flash)$ Bash example
# GridTools RNA is not a widely recognized public tool, so installation instructions are not available. # This is a placeholder command based on the description's intent. # Replace 'input_aligned_reads.bam' with your actual input file (e.g., aligned RNA-seq reads). # Replace 'hg38_reference_annotation.gtf' with your actual genome annotation file. # Replace 'chromatin_enriched_rna_expression.tsv' with your desired output file name. # Example command (highly speculative due to lack of tool specifics): GridTools_RNA --input input_aligned_reads.bam --annotation hg38_reference_annotation.gtf --output chromatin_enriched_rna_expression.tsv --mode expression_quantification
-
10
GridTools matrix was used to construct an interaction contact matrix of chromatin-associated RNA within specified genomic bin sizes of 1 kb or 100 kb.
$ Bash example
# Install cooler if not already installed # conda install -c bioconda cooler=0.9.1 # Placeholder for input data and genome assembly # INPUT_PAIRS: A tab-separated file of paired-end reads, typically generated from aligned BAM files # using tools like 'pairtools parse' and 'pairtools sort'. # Each line represents an interaction, e.g., chr1 pos1 chr2 pos2 strand1 strand2. INPUT_PAIRS="input_rna_chromatin_interactions.pairs.gz" # Example: gzipped pairs file # GENOME_ASSEMBLY: The reference genome assembly used for alignment (e.g., hg38, mm10) GENOME_ASSEMBLY="hg38" # CHROM_SIZES_FILE: A tab-separated file containing chromosome names and their lengths. # This file is crucial for defining the genomic bins. # Example for hg38: # chr1 248956422 # chr2 242193529 # ... CHROM_SIZES_FILE="${GENOME_ASSEMBLY}.chrom.sizes" # --- Example of how to obtain a chrom.sizes file (uncomment and adapt if needed) --- # # Download from UCSC Genome Browser (replace hg38 with your assembly) # # wget -O ${CHROM_SIZES_FILE} http://hgdownload.soe.ucsc.edu/goldenPath/${GENOME_ASSEMBLY}/bigZips/${GENOME_ASSEMBLY}.chrom.sizes # # Or from a FASTA file: # # samtools faidx reference.fasta # # cut -f1,2 reference.fasta.fai > ${CHROM_SIZES_FILE} # ----------------------------------------------------------------------------------- # Define bin sizes as specified in the description BIN_SIZE_1KB=1000 BIN_SIZE_100KB=100000 # Construct interaction contact matrix for 1 kb bin size # 'cooler cload pairs' takes a chrom.sizes file, a bin size, a pairs file, and an output .cool file. # The pairs file should be sorted. echo "Constructing 1 kb contact matrix..." cooler cload pairs "${CHROM_SIZES_FILE}:${BIN_SIZE_1KB}" "${INPUT_PAIRS}" rna_chromatin_contact_matrix_1kb.cool # Construct interaction contact matrix for 100 kb bin size echo "Constructing 100 kb contact matrix..." cooler cload pairs "${CHROM_SIZES_FILE}:${BIN_SIZE_100KB}" "${INPUT_PAIRS}" rna_chromatin_contact_matrix_100kb.cool # Optional: View the generated cooler files # cooler info rna_chromatin_contact_matrix_1kb.cool # cooler dump -t pixels rna_chromatin_contact_matrix_1kb.cool | head -
11
All lncRNAs with reads per kilobase DNA read density ³10 on any genomic region were filtered from the interaction contact matrix.
$ Bash example
# Define input files and parameters LNC_RNA_BED="lncRNA_annotations.bed" # Path to a BED file defining lncRNA genomic regions (e.g., from GENCODE) ALIGNED_BAM="aligned_reads.bam" # Path to the aligned reads BAM file (e.g., from RNA-seq or eCLIP) INTERACTION_MATRIX="interaction_contact_matrix.tsv" # Path to the input interaction contact matrix OUTPUT_MATRIX="filtered_interaction_contact_matrix.tsv" # Path for the filtered output matrix RPKD_THRESHOLD=10 # Reads per kilobase DNA read density threshold # Placeholder for reference genome assembly (not directly used in this specific calculation but good practice) GENOME_ASSEMBLY="hg38" # --- Install necessary tools if not already present --- # # Conda installation for bedtools # # conda install -c bioconda bedtools # --- Step 1: Calculate RPKD for each lncRNA region --- # Count reads overlapping each lncRNA region using bedtools coverage. # Output format of bedtools coverage -counts: chr, start, end, name, score, strand, count echo "Counting reads overlapping lncRNA regions..." bedtools coverage -a "${LNC_RNA_BED}" -b "${ALIGNED_BAM}" -counts > lncRNA_read_counts.tsv # Calculate RPKD (Reads Per Kilobase DNA read density) and filter lncRNAs based on the threshold. # RPKD = (read_count * 1000) / (region_length) # Assuming lncRNA_read_counts.tsv has columns: chr, start, end, name, score, strand, count echo "Calculating RPKD and filtering lncRNAs with RPKD >= ${RPKD_THRESHOLD}..." awk -v threshold="${RPKD_THRESHOLD}" 'BEGIN {OFS="\t"} { region_length = $3 - $2; if (region_length > 0) { rpks = ($7 * 1000) / region_length; if (rpks >= threshold) { print $4; # Output only the lncRNA name if it passes the filter } } }' lncRNA_read_counts.tsv > lncRNAs_passing_rpks_filter.txt # --- Step 2: Filter the interaction contact matrix --- # This step assumes the interaction matrix is a tab-separated file where the first two columns # represent interacting entities (e.g., lncRNA_A \t lncRNA_B \t score). # The script will keep only those interactions where *both* entities are present in the # 'lncRNAs_passing_rpks_filter.txt' list. If the matrix contains other types of entities, # or if the filtering logic is different (e.g., only one entity needs to be an lncRNA and pass the filter), # this Python script would need to be adjusted. echo "Filtering the interaction contact matrix..." python -c " import sys def filter_interaction_matrix(lnc_rna_list_file, interaction_matrix_file, output_matrix_file): with open(lnc_rna_list_file, 'r') as f: lnc_rnas_to_keep = {line.strip() for line in f} with open(interaction_matrix_file, 'r') as infile, open(output_matrix_file, 'w') as outfile: for line in infile: parts = line.strip().split('\t') if not parts: continue # Skip empty lines # Assuming the first two columns are the interacting entities (IDs) # Adjust indices if lncRNA IDs are in different columns entity1 = parts[0] entity2 = parts[1] if len(parts) > 1 else None # Keep the interaction if entity1 is in the allowed list # And if entity2 exists, it must also be in the allowed list # This interpretation assumes both interacting partners are lncRNAs subject to the filter. keep_line = True if entity1 not in lnc_rnas_to_keep: keep_line = False if entity2 is not None and entity2 not in lnc_rnas_to_keep: keep_line = False if keep_line: outfile.write(line) if __name__ == '__main__': filter_interaction_matrix( 'lncRNAs_passing_rpks_filter.txt', '${INTERACTION_MATRIX}', '${OUTPUT_MATRIX}' ) " # Clean up intermediate files rm lncRNA_read_counts.tsv lncRNAs_passing_rpks_filter.txt -
12
A 1 kb interaction matrix was directly visualized on a heatmap using the ComplexHeatmap package (v2.2.0).
$ Bash example
# Create an R script for heatmap visualization cat << 'EOF' > visualize_heatmap.R # Install ComplexHeatmap if not already installed # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("ComplexHeatmap", version = "3.12", update = FALSE, ask = FALSE) # Specify version if needed, otherwise latest # install.packages("circlize", repos = "http://cran.us.r-project.org", update = FALSE) # Load the ComplexHeatmap library library(ComplexHeatmap) library(circlize) # --- Input Data Loading --- # This script expects an interaction matrix as input. # The description mentions a "1 kb interaction matrix". # Replace the dummy matrix generation below with actual data loading from your file. # Example for loading a tab-separated file: # interaction_matrix_file <- "path/to/your/1kb_interaction_matrix.tsv" # interaction_matrix <- as.matrix(read.delim(interaction_matrix_file, row.names = 1, header = TRUE)) # For demonstration purposes, a dummy matrix is created: set.seed(123) interaction_matrix <- matrix(rnorm(100), nrow = 10, ncol = 10) colnames(interaction_matrix) <- paste0("Region_", 1:10) rownames(interaction_matrix) <- paste0("Region_", 1:10) # Define output file name output_pdf_file <- "1kb_interaction_heatmap.pdf" # Open PDF device for saving the heatmap pdf(output_pdf_file, width = 8, height = 8) # Visualize the interaction matrix as a heatmap # The 'ComplexHeatmap' package (v2.2.0) is used as specified. # Parameters like 'col', 'cluster_rows', 'cluster_columns' can be adjusted # based on the specific visualization requirements of the interaction matrix. Heatmap(interaction_matrix, name = "Interaction Score", # Name of the heatmap legend col = colorRamp2(c(min(interaction_matrix), 0, max(interaction_matrix)), c("blue", "white", "red")), # Example color scale cluster_rows = TRUE, cluster_columns = TRUE, show_row_names = FALSE, # Set to TRUE if row names are meaningful and few show_column_names = FALSE, # Set to TRUE if column names are meaningful and few heatmap_legend_param = list(title = "Interaction Value") ) # Close the PDF device dev.off() message(paste("Heatmap saved to:", output_pdf_file)) EOF # Execute the R script using Rscript Rscript visualize_heatmap.R -
13
PCA, k-means clustering set to 3 clusters, and pearson correlation analysis were performed on the interaction matrix with the R stats package (v3.6.2) removing all bins with zero interactions.
$ Bash example
# Install R (version 3.6.2) if not already available. Specific version installation might vary by OS/package manager. # Example for Conda: # conda create -n r_env r-base=3.6.2 r-essentials # conda activate r_env # Create a placeholder interaction matrix file for demonstration # In a real scenario, 'interaction_matrix.tsv' would be your input file. cat << EOF > interaction_matrix.tsv Bin1 Bin2 Bin3 Bin4 FeatureA 10 20 30 40 FeatureB 0 0 0 0 FeatureC 50 60 70 80 FeatureD 5 10 15 20 FeatureE 0 0 0 0 EOF Rscript -e ' # Load the interaction matrix # Assuming the matrix is tab-separated and has a header, with the first column as row names (bin identifiers) interaction_matrix <- read.table("interaction_matrix.tsv", sep="\t", header=TRUE, row.names=1) # Convert to a numeric matrix for analysis numeric_matrix <- as.matrix(interaction_matrix) # Remove all bins (rows) with zero interactions # This identifies rows where all values are zero rows_to_keep <- apply(numeric_matrix, 1, function(x) !all(x == 0)) filtered_matrix <- numeric_matrix[rows_to_keep, ] # Check if there is enough data after filtering for PCA and correlation if (nrow(filtered_matrix) < 2 || ncol(filtered_matrix) < 2) { stop("Not enough data after filtering for PCA and correlation analysis.") } cat("\n--- Filtered Matrix (first 5 rows/cols) ---\n") print(filtered_matrix[1:min(5, nrow(filtered_matrix)), 1:min(5, ncol(filtered_matrix))]) # Perform PCA # scale. = TRUE and center. = TRUE are standard for PCA to normalize variables pca_results <- prcomp(filtered_matrix, scale. = TRUE, center. = TRUE) cat("\n--- PCA Results (Summary) ---\n") print(summary(pca_results)) # Optionally, save PCA results # write.csv(pca_results$x, "pca_components.csv") # Principal components (scores) # write.csv(pca_results$rotation, "pca_loadings.csv") # Loadings (eigenvectors) # Perform k-means clustering set to 3 clusters # nstart = 25 runs the algorithm 25 times with different initial centroids and picks the best one kmeans_results <- kmeans(filtered_matrix, centers = 3, nstart = 25) cat("\n--- K-means Clustering Results (3 clusters) ---\n") print(kmeans_results) # Optionally, save cluster assignments # write.csv(data.frame(Bin = rownames(filtered_matrix), Cluster = kmeans_results$cluster), "kmeans_clusters.csv") # Perform Pearson correlation analysis # This computes the Pearson correlation matrix between columns (features/bins) correlation_matrix <- cor(filtered_matrix, method = "pearson") cat("\n--- Pearson Correlation Matrix (first 5 rows/cols) ---\n") print(correlation_matrix[1:min(5, nrow(correlation_matrix)), 1:min(5, ncol(correlation_matrix))]) # Optionally, save the correlation matrix # write.csv(correlation_matrix, "pearson_correlation_matrix.csv") ' # Clean up placeholder file rm interaction_matrix.tsv -
14
The correlation matrix was visualized with corrplot package (v0.89) and circos plots with the circlize package (v0.42).
R (Inferred with models/gemini-2.5-flash) vcorrplot v0.89, circlize v0.42$ Bash example
# Install R if not already installed (e.g., on Ubuntu) # sudo apt-get update # sudo apt-get install r-base # Install R packages (if not already installed) # R -e 'install.packages("corrplot", repos="http://cran.us.r-project.org")' # R -e 'install.packages("circlize", repos="http://cran.us.r-project.org")' # R -e 'install.packages("RColorBrewer", repos="http://cran.us.r-project.org")' # R -e 'install.packages("ComplexHeatmap", repos="http://cran.us.r-project.org")' # Often used with circlize for color mapping # R script to visualize a correlation matrix Rscript -e ' # Load required libraries library(corrplot) library(circlize) library(RColorBrewer) # Commonly used for color palettes with corrplot library(ComplexHeatmap) # Provides colorRamp2 function used with circlize # --- Create a dummy correlation matrix for demonstration --- # In a real scenario, this matrix would be loaded from a file (e.g., CSV, TSV) set.seed(123) M <- cor(matrix(rnorm(100*10), ncol=10)) colnames(M) <- paste0("Gene", 1:10) rownames(M) <- paste0("Gene", 1:10) # --- Visualize with corrplot package (v0.89) --- # Output to PDF file pdf("correlation_matrix_corrplot.pdf", width=7, height=7) corrplot(M, method="circle", type="upper", tl.col="black", tl.srt=45, col=brewer.pal(n=8, name="RdBu")) dev.off() message("corrplot visualization saved to correlation_matrix_corrplot.pdf") # --- Visualize with circlize package (v0.42) --- # For circlize, we often convert the correlation matrix into a list of links # For demonstration, let\'s create a data frame of significant correlations # (e.g., absolute correlation > 0.7 and remove self-loops) cor_df <- as.data.frame(as.table(M)) colnames(cor_df) <- c("from", "to", "correlation") significant_cor <- subset(cor_df, abs(correlation) > 0.7 & from != to) # Prepare data for circlize links pdf("correlation_matrix_circlize.pdf", width=8, height=8) circos.clear() circos.par(gap.degree = 5) # Adjust gap between sectors # Define sectors (all unique genes involved in significant correlations) all_genes_for_circlize <- unique(c(as.character(significant_cor$from), as.character(significant_cor$to))) if (length(all_genes_for_circlize) > 0) { circos.initialize(factors = all_genes_for_circlize, xlim = c(0, 1)) # Dummy xlim # Add a track for gene names circos.trackPlotRegion(ylim = c(0, 1), track.height = 0.1, bg.border = NA, panel.fun = function(x, y) { sector.name = get.cell.meta.data("sector.index") circos.text(CELL_META$xcenter, CELL_META$ylim[2] + mm_y(2), sector.name, facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5)) }) # Define color function for links based on correlation value col_fun = colorRamp2(c(-1, 0, 1), c("blue", "white", "red")) # Add links based on significant correlations for(i in 1:nrow(significant_cor)) { from_gene <- as.character(significant_cor$from[i]) to_gene <- as.character(significant_cor$to[i]) cor_val <- significant_cor$correlation[i] # Draw link from the middle of \'from_gene\' sector to the middle of \'to_gene\' sector circos.link(from_gene, 0.5, to_gene, 0.5, col = col_fun(cor_val), lwd = abs(cor_val) * 3) # Line width proportional to correlation strength } } else { message("No significant correlations found to plot with circlize.") } circos.clear() dev.off() message("circlize visualization saved to correlation_matrix_circlize.pdf") ' -
15
Differential RNA chromatin interaction regions were determined by taking the average RNA interaction level of each genomic bin for all lncRNAs in a cluster.
Custom Script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# This step describes a custom analysis involving data aggregation and differential comparison. # It is assumed that interaction levels for each lncRNA in genomic bins are available as input. # A custom Python or R script would typically be used for such an analysis. # Example installation for a Python environment with common data science libraries: # conda create -n rna_chromatin_env python=3.9 # conda activate rna_chromatin_env # pip install pandas numpy scipy statsmodels pybedtools # Assume the following input files are prepared: # 1. `genomic_bins.bed`: A BED file defining the genomic bins (e.g., 10kb or 100kb windows) for the reference genome (e.g., hg38). # 2. `lncrna_cluster_manifest.tsv`: A tab-separated file linking lncRNA IDs to their respective clusters and experimental conditions. # Example columns: lncRNA_ID, Cluster_ID, Condition, Path_to_Interaction_BedGraph # 3. `lncRNA_interaction_data/`: A directory containing bedgraph or bigwig files for each lncRNA, representing RNA interaction levels across genomic bins. # Placeholder for a custom Python script that performs the described analysis. # This script would typically: # - Load the genomic bin definitions. # - Read the lncRNA cluster manifest to understand groupings and conditions. # - For each cluster, iterate through its associated lncRNAs. # - For each lncRNA, load its interaction data (e.g., bedgraph) and map it to the predefined genomic bins. # - Calculate the average RNA interaction level for each genomic bin across all lncRNAs within a given cluster. # - Perform differential analysis (e.g., t-test, ANOVA, or a regression model) on these averaged interaction levels # between different conditions or clusters to identify statistically significant differential regions. # - Output the identified differential regions and associated statistics. # Example execution command for an inferred Python script: python analyze_rna_chromatin_interactions.py \ --genomic_bins /path/to/reference/genomic_bins_hg38_10kb.bed \ --lncrna_manifest /path/to/lncrna_cluster_manifest.tsv \ --input_dir /path/to/lncRNA_interaction_data/ \ --output_prefix differential_rna_chromatin_regions \ --bin_size 10000 \ --min_lncrnas_per_cluster 3 \ --p_value_threshold 0.05 \ --log2fc_threshold 1.0 \ --genome_assembly hg38 -
16
These averaged genomic bins were annotated to the transcription start site of the nearest gene using the ChIPpeakAnno.
$ Bash example
# Install R and Bioconductor if not already installed # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("ChIPpeakAnno")' # R -e 'BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")' # R -e 'BiocManager::install("rtracklayer")' # For importing BED files # Create a dummy input file for demonstration if 'averaged_genomic_bins.bed' does not exist. # In a real scenario, this file would be provided as input. if [ ! -f "averaged_genomic_bins.bed" ]; then echo -e "chr1\t1000\t1500\tbin1\t0\t+" > averaged_genomic_bins.bed echo -e "chr1\t5000\t5500\tbin2\t0\t-" >> averaged_genomic_bins.bed echo -e "chr2\t10000\t10500\tbin3\t0\t+" >> averaged_genomic_bins.bed fi # R script to perform annotation of genomic bins to the nearest gene's transcription start site (TSS) Rscript -e ' library(ChIPpeakAnno) library(TxDb.Hsapiens.UCSC.hg38.knownGene) library(rtracklayer) # Required for importing BED files # Define input and output file names input_bed_file <- "averaged_genomic_bins.bed" output_csv_file <- "annotated_genomic_bins_tss.csv" # Load genomic bins from the input BED file peaks <- import(input_bed_file, format="BED") # Load gene annotations for the human hg38 assembly txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Get all genes as GRanges object genes_gr <- genes(txdb) # Define Transcription Start Site (TSS) regions as 1bp regions at the start of each gene tss_features <- promoters(genes_gr, upstream=0, downstream=1) # Add gene IDs to the TSS features, which will be used in the annotation output mcols(tss_features)$gene_id <- names(genes_gr) # Annotate peaks to the nearest gene TSS # The "output = \"nearestLocation\"" parameter ensures that the closest TSS is identified for each peak. annotated_peaks_gr <- annotatePeak( peaks, AnnotationData = tss_features, output = "nearestLocation" ) # Convert the annotated GRanges object to a data frame for easier viewing and saving annotated_df <- as.data.frame(annotated_peaks_gr) # Save the annotation results to a CSV file write.csv(annotated_df, output_csv_file, row.names=FALSE) message(paste("Annotation complete. Results saved to", output_csv_file)) ' -
17
Consecutive genomic bins annotated to a single gene with greater RNA interaction for each genomic bin in one cluster relative to another cluster were considered differentially interacting.
$ Bash example
# This step describes a conceptual analysis for identifying differentially interacting genomic regions between two clusters. # It typically involves custom scripts or specialized tools that take binned interaction data and gene annotations as input. # The exact tool and parameters would depend on the specific methodology used for differential interaction analysis (e.g., statistical tests, thresholds). # Placeholder for input files (replace with actual paths to binned interaction data for each cluster) # These files would typically be generated from eCLIP data, quantifying RNA interaction per genomic bin. CLUSTER1_BINNED_INTERACTIONS="cluster1_binned_interactions.bedgraph" CLUSTER2_BINNED_INTERACTIONS="cluster2_binned_interactions.bedgraph" # Reference dataset: Gene annotation file (e.g., GTF/GFF) for annotating genomic bins to genes. # Using Homo sapiens GRCh38 as a common reference assembly. GENE_ANNOTATION_FILE="Homo_sapiens.GRCh38.109.gtf" # Source: Ensembl or UCSC # Placeholder for output file OUTPUT_DIFFERENTIAL_INTERACTIONS="differential_interactions_cluster1_vs_cluster2.tsv" # Example conceptual command for a custom script that performs differential interaction analysis. # This script would compare interaction levels in consecutive genomic bins annotated to a single gene # between two clusters, applying statistical tests and thresholds to identify significant differences. # Parameters like --min-consecutive-bins, --fold-change, and --p-value are illustrative. # Install necessary dependencies (example, actual dependencies depend on the inferred tool/script) # conda install -c bioconda bedtools # conda install -c conda-forge r-base r-deseq2 # pip install pandas numpy scipy python custom_differential_interaction_analysis.py \ --cluster1-data "${CLUSTER1_BINNED_INTERACTIONS}" \ --cluster2-data "${CLUSTER2_BINNED_INTERACTIONS}" \ --gene-annotation "${GENE_ANNOTATION_FILE}" \ --output "${OUTPUT_DIFFERENTIAL_INTERACTIONS}" \ --min-consecutive-bins 3 \ --min-interaction-fold-change 1.5 \ --max-p-value 0.05 \ --statistical-method "wilcoxon_rank_sum" # Example: could be t-test, DESeq2-like, etc. -
18
Differential RNA chromatin interaction regions were annotated with ChiPseeker.
$ Bash example
# Install R and Bioconductor if not already present # For Debian/Ubuntu: # sudo apt-get update # sudo apt-get install r-base # Install BiocManager and ChIPseeker package within R # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("ChIPseeker")' # R -e 'BiocManager::install("TxDb.Hsapiens.UCSC.hg38.knownGene")' # For human hg38 genome annotation # R -e 'BiocManager::install("org.Hs.eg.db")' # For gene symbol/name mapping # Example: Create a dummy input file for demonstration purposes # In a real scenario, 'differential_regions.bed' would be the output of a previous step # echo -e "chr1\t10000\t10500\tregion1\t0\t+\nchr1\t20000\t20800\tregion2\t0\t-" > differential_regions.bed Rscript -e ' library(ChIPseeker) library(TxDb.Hsapiens.UCSC.hg38.knownGene) # Using human hg38 as the latest assembly placeholder library(org.Hs.eg.db) # For gene symbol/name mapping # Define the input file containing differential RNA chromatin interaction regions # This file is expected to be in a BED-like format (e.g., chr, start, end) peakFile <- "differential_regions.bed" # Load the TxDb object for genome annotation txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene # Annotate the peaks with respect to genomic features (e.g., promoters, exons, introns) # tssRegion defines the region around the Transcription Start Site to consider as promoter peakAnno <- annotatePeak(peakFile, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Hs.eg.db") # Use org.Hs.eg.db for gene symbol/name mapping # Convert the annotation result to a data frame and save it to a TSV file df <- as.data.frame(peakAnno) write.table(df, file="differential_regions_annotated.tsv", sep="\t", quote=FALSE, row.names=FALSE) # Optional: Print a summary of the annotation # print(peakAnno) # plotAnnoBar(peakAnno) # plotDistToTSS(peakAnno) ' -
19
none provided by the submitter
Raw Source Text
Reads were trimmed with cutadapt -l 86 --max-n 5 -o (v1.18), mapped to RNA-Linker-DNA (GTTGGATTCNNNGACACAGCTCACTCCCACACACCGAACTCCAAC) with bwa mem -k 5 -L 4 -B 2 -O 4 (v0.7.15) and sorted with samtools sort (v1.7). RNA and DNA reads were separated with GridTools matefq (https://github.com/biz007/gridtools). Reads were then mapped to the mm10 genome with bwa mem -k 17 -w 1 -T 1. GridTools evaluate was then used to correct against background and RNA-DNA mate read pair quality and quantity with bin size of 1 kb and moving windows of 10. GridTools.py evaluate function uses a gene annotation GTF (Gene transfer format) file to ensure that RNA end of the read always uniquely maps to its correct chromosomal location. Thus, during the analysis, only RNA reads that map back to its own locus will be retained. Moreover, the DNA end of the read will also be unique mapped, ensuring that both the RNA and DNA ends of each read are all unique reads. No read that maps to multiple locations in the genome (including repeat elements all over chromosomes) will be retained during analysis. GridTools RNA was used to identify expression levels of chromatin-enriched RNA. GridTools matrix was used to construct an interaction contact matrix of chromatin-associated RNA within specified genomic bin sizes of 1 kb or 100 kb. All lncRNAs with reads per kilobase DNA read density ³10 on any genomic region were filtered from the interaction contact matrix. A 1 kb interaction matrix was directly visualized on a heatmap using the ComplexHeatmap package (v2.2.0). PCA, k-means clustering set to 3 clusters, and pearson correlation analysis were performed on the interaction matrix with the R stats package (v3.6.2) removing all bins with zero interactions. The correlation matrix was visualized with corrplot package (v0.89) and circos plots with the circlize package (v0.42). Differential RNA chromatin interaction regions were determined by taking the average RNA interaction level of each genomic bin for all lncRNAs in a cluster. These averaged genomic bins were annotated to the transcription start site of the nearest gene using the ChIPpeakAnno. Consecutive genomic bins annotated to a single gene with greater RNA interaction for each genomic bin in one cluster relative to another cluster were considered differentially interacting. Differential RNA chromatin interaction regions were annotated with ChiPseeker. none provided by the submitter Assembly: mm10 Supplementary files format and content: Expression and scope Cis and Trans of all chromatin enriched RNAs