Text-based experiment retrieval in genomic databases

Abstract

With the growing volume of genomic data in public repositories, efficient search methodologies have become a basic need for reaching relevant genomic data. Current repositories cannot meet this need because their search options are limited to lexical matching of the textual descriptions or metadata of the experiments. This technique is insufficient for detecting similarities between experiments within a large data collection. To address this limitation, in this study we develop a text-based experiment retrieval framework that uses both lexical and semantic similarity approaches to find similarities between experiments, and we compare their retrieval performance. This study is the first attempt to use text-driven semantic analysis approaches for developing a retrieval framework for experiments. An empirical study was conducted on a large collection of textual descriptions of Arabidopsis microarray experiments from the Gene Expression Omnibus database. In the proposed model, Jaccard similarity was used as a lexical similarity approach; Latent Semantic Analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet allocation were used as semantic similarity approaches to detect similarities between the textual descriptions of the experiments. According to the experimental results, text-driven semantic similarity approaches retrieve relevant experiments more successfully than the lexical similarity approach.

Keywords

Information retrieval; lexical similarity; microarray experiments; semantic similarity; text-based retrieval

1. Introduction

Over recent years, with the help of developments in computational biology, the accumulation of genomic data has been increasing rapidly. The genomic data are stored in various kinds of data formats such as sequences, networks and experimental measurements. Gene Expression Omnibus (GEO) [1], ArrayExpress [2], GenBank [3] and Arabidopsis Information Resource (TAIR) [4] are widely used public data repositories that contain high-throughput biological experiments. Accessing and organising these experiments is a significant task for researchers to obtain hypotheses from the retrieved information. They also need efficient and fast access tools that make their task easy, so recently the development of such retrieval methods has been of interest to researchers.

Experimental studies are stored in data repositories together with metadata, also known as textual annotations, which contain brief information about the experiment such as organism name, author name, laboratory design and free-text descriptions of the experimental setup. These annotations can be considered text documents associated with the related experiment. To search for an experiment in a data repository, users generally rely on metadata because current data repositories only provide metadata-based search using lexical matching or similarity of the textual annotations of the experiments within large data collections. In addition, when searching through a database, logical operators such as AND and OR can be used to combine attribute values. Although such a search is easily implemented, it has major limitations, such as inconveniencing users who want to pose a semantic query. Besides this search option, keyword-based search can be used to search for an experiment. It is more convenient because it uses the text descriptions during the search, but text annotations sometimes do not contain enough information about the experiment. Moreover, lexical similarity does not account for the polysemy and synonymy that are inherent in natural language, whereas semantic similarity between texts can be found by considering the synonyms and multiple senses of each term in the texts. Therefore, semantic-based search is a more powerful technique for overcoming current search limitations.

To date, retrieving relevant experiments has been addressed by different studies focusing on various data formats such as complementary DNA (cDNA) microarrays [5,6], time-series microarrays [7] and metagenome-sequencing samples [8]. There are studies that use query-by-example retrieval in which an abstract content is constructed to represent each experiment within the data collection and then retrieved relevant experiments by using similarity between obtained abstract contents to overcome the limitations of the lexical metadata-based retrieval [9]. In the study by Şener and Oğul [10], a content-based retrieval framework was proposed by using a dataset of Arabidopsis microarrays from the GEO database. They represented each experiment by a fingerprint as the content of the experiment, then an overlap score was calculated between these fingerprints, and the Jaccard coefficient was used to determine the gene set similarity of the experiments which is defined as the relevant information of the compared experiments. It was stated in their study that similarities between experiments can be successfully detected by the proposed framework. In addition to this study, Açıcı et al. [11] proposed a computational framework for detecting similarities between microRNA (miRNA) experiments. A normal-uniform mixture model was adopted to detect differentially expressed miRNAs; then, binarised real-valued fingerprints were obtained using a rank-based threshold. They also introduced an efficient similarity metric to find similarities between obtained fingerprints of the experiments. An open-source platform, called miRWalk, was developed to predict miRNA-binding sites of known genes of species including human, mouse, rat, etc. They combined different predictive algorithms to improve target prediction, and miRNA-target gene interaction can be visualised via a network graph [12]. 
In addition, there is a study in which an R package and web application were developed for retrieving information from the repository of miRNA sequences and annotation data [13]. With the developed retrieval systems, a user can query information related to the name, accession, sequence, species, version and family of a miRNA. In another study conducted by Şener and Oğul [14], a computational content-based framework was developed for finding relevant experiments from the data collection. In the model, different fingerprinting strategies were used and an empirical study was conducted on the microarray experiments of Arabidopsis thaliana. According to the results, the relevant experiments can be successfully detected by the proposed content-based retrieval framework. Besides this study, Sener et al. [15] developed a retrieval framework for whole-metagenome sequencing sample retrieval. In their study, different fingerprinting approaches were used to get a representative fingerprint for the content of the sequencing samples. They also applied feature extraction and selection methods to reduce the computational complexity of the proposed system. Furthermore, there are also software tools that are applicable for content-based retrieval focusing on extracting signatures based on differentially expressed genes [16–19]. CellMontage [16] was developed as a tool for searching for an experiment within a data collection. Spearman’s rank correlation coefficient was used to find similarities between differentially expressed profiles of the compared experiments. Moreover, ProfileChaser, proposed by Engreitz et al. [17], is one of the most widely used tools based on differential expression to construct fingerprints of the experiments. SPIEDw [18] is another search engine in which a gene list with expression values is taken as a query against the given data repository, and relevant experiments having expression values similar to the query experiment are retrieved.
In addition, SEEK [19] is a query-based search engine developed for transcriptomic data collection mostly including microarray and sequencing experiments. Besides these software tools, a web portal [20] is also available for visual identification of associations between signatures and searching for similar experiments using signatures within the data collections.

In this article, unlike the current studies and approaches, we propose a text-based experiment retrieval framework for genomic databases using lexical and semantic similarity approaches. An experimental study was conducted on the textual information of a microarray experiment collection from the GEO database. This study, to the best of our knowledge, is the first attempt to apply semantic analysis approaches to the textual information of gene expression experiments to find similarities between experiments in a data collection. According to the experimental results, the proposed system is successful in detecting similarities between experiments. Moreover, it can easily be adapted to different types of genomic data.

The article is structured as follows: ‘Introduction’, ‘Methods’, ‘Results and Discussion’ and ‘Conclusion’.

2. Methods

2.1. Retrieval framework

The proposed framework aims to find similarities between experiments within a given data collection (Figure 1). To do so, metadata or textual descriptions of each experiment in the collection are taken as input for the retrieval methods. Lexical similarity and semantic similarity approaches were used as retrieval methods in the model. To adapt these methods to our study, textual descriptions of experiments are represented as documents, and words in the related description are represented as the terms. Retrieved experiments are listed based on the similarity score between the query experiment and other experiments in the data collection.

Figure 1.

Overview of the proposed framework.

2.2. Jaccard similarity

Finding similarity between documents within a data collection depends on a lexical match or similarity between the words in the user’s query document and those in the remaining documents. Lexical similarity is a measure of how similar two sets of words, here called documents, are. There are two main approaches for finding lexical similarity between compared sets of words: character-based and term-based similarity approaches [21]. Jaccard similarity is a term-based approach defined as the size of the intersection of two sets of words divided by the size of their union [22]. Let A and B be two documents to be compared; the Jaccard similarity between them is calculated using formula (1). The similarity score ranges between 0 and 1, where 0 represents no match and 1 represents a perfect match between the compared documents.

J (A, B) = \frac{| A \cap B |}{| A \cup B |} = \frac{| A \cap B |}{| A | + | B | - | A \cap B |}

(1)
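Formula (1) can be illustrated with a short Python sketch. The tokeniser, the function names and the two example sentences are assumptions made for illustration only; the preprocessing actually applied in this study is described in Section 3.1.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Formula (1): |A ∩ B| / |A ∪ B|. Two empty sets score 0 by convention."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Hypothetical descriptions: 3 shared words out of 9 distinct words.
doc_a = tokenize("auxin response in lateral root initiation")
doc_b = tokenize("lateral root development under auxin treatment")
score = jaccard(doc_a, doc_b)
```

Because only exact word overlap counts, two descriptions that express the same idea with different vocabulary receive a low score, which is the limitation the semantic approaches below try to address.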

2.3. Latent Semantic Analysis

Latent Semantic Analysis (LSA) is a corpus-based semantic similarity method that is widely used for detecting relationships between documents and terms. It was introduced by Landauer et al. [23] to gain insight into information retrieval problems by reducing dimensionality. LSA has four main steps:

Creating a term-document matrix; this matrix represents a collection of documents in which rows represent words and columns represent documents (sets of words). The frequency of each word in each document is stored in the corresponding cell of the matrix. In LSA, a bag-of-words representation is used because the order of words is ignored.

Transforming the term-document matrix; after obtaining the raw term frequency matrix, the matrix is transformed using an inverse document frequency or entropy-based weighting.

Applying Singular Value Decomposition (SVD); SVD is performed and the k largest singular values are kept. Each document and term is then represented by a k-dimensional vector.

Retrieval in the reduced space; similarities are computed among a set of documents in the reduced dimension. Cosine similarity, which is based on the angle between two vectors, is used to compare the term and document vectors.
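The four steps can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the study: the smoothed inverse-document-frequency weighting, the function names `lsa_document_vectors` and `cosine`, and the toy documents are all hypothetical, and only document vectors are compared.

```python
import numpy as np

def lsa_document_vectors(docs: list[list[str]], k: int) -> np.ndarray:
    """Return one k-dimensional LSA vector per document (one row each)."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}

    # Step 1: raw term-document counts (rows = terms, columns = documents).
    counts = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            counts[index[term], j] += 1

    # Step 2: inverse-document-frequency weighting (smoothed; an assumed variant).
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = np.log((1 + len(docs)) / (1 + doc_freq)) + 1
    weighted = counts * idf[:, None]

    # Step 3: SVD, keeping only the k largest singular values.
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    return (np.diag(s[:k]) @ vt[:k]).T

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Step 4: cosine similarity between two vectors in the reduced space."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical, already tokenised descriptions.
docs = [
    ["auxin", "root", "cell", "division"],
    ["auxin", "root", "lateral", "initiation"],
    ["light", "leaf", "growth", "response"],
    ["light", "leaf", "shade", "response"],
]
vectors = lsa_document_vectors(docs, k=2)
sim_related = cosine(vectors[0], vectors[1])    # two root/auxin descriptions
sim_unrelated = cosine(vectors[0], vectors[2])  # root/auxin vs light/leaf
```

In this toy corpus the two root-related descriptions end up close together in the reduced space while the light-related one does not, even though the first pair shares only two of its four words.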

2.4. Probabilistic latent semantic analysis

Probabilistic latent semantic analysis (pLSA) is a topic modelling technique developed by Hofmann [24] in 1999. Unlike LSA, which is derived from linear algebra techniques and uses occurrence tables, pLSA is based on a mixture model decomposition. It starts with an aspect model, which relates co-occurrence data to an unobserved class variable. The data in the model are represented by three sets of variables, namely documents, words and topics, defined as follows:

A document is denoted by $d$, and there are $N$ documents: $d \in D = \{d_{1}, d_{2}, \dots, d_{N}\}$.

A word is denoted by $w$, and there are $M$ words: $w \in W = \{w_{1}, w_{2}, \dots, w_{M}\}$.

A topic, which is a latent variable, is denoted by $z$, and there are $K$ topics: $z \in Z = \{z_{1}, z_{2}, \dots, z_{K}\}$.

The generative process is as follows. First, a document $d_{n}$ is selected with probability $P(d)$. Then, for each word in the document, a topic $z_{i}$ is selected from a multinomial distribution conditioned on that document with probability $P(z | d_{n})$. Finally, a word $w_{i}$ is selected with probability $P(w | z_{i})$, conditioned on the chosen topic. In addition, each document is assumed to be an unordered collection of words, called a bag-of-words. The generative model can be completely described by the following formulas [25]. Formula (2) is the mathematical representation of the mixture model, and formula (3) gives the factorisation of the full co-occurrence data $P(w, d)$.

P(w | d) = \sum_{z \in Z} P(w | z) P(z | d)

(2)

P(w, d) = \sum_{z \in Z} P(z) P(d | z) P(w | z)

(3)
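The parameters $P(w | z)$ and $P(z | d)$ of the mixture in formula (2) are commonly estimated with the expectation-maximisation (EM) algorithm. The source does not state which implementation was used, so the following NumPy sketch is only an assumed, minimal version; the function name `plsa`, the iteration count, the random initialisation and the toy count matrix are all hypothetical.

```python
import numpy as np

def plsa(counts: np.ndarray, n_topics: int, n_iter: int = 50, seed: int = 0):
    """Fit P(z|d) and P(w|z) to a document-word count matrix with plain EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
        post = p_z_d[:, :, None] * p_w_z[None, :, :]   # shape (docs, topics, words)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        # M-step: re-estimate both distributions from expected counts.
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Hypothetical counts: 4 descriptions (rows) over a 4-word vocabulary (columns).
counts = np.array([[4, 3, 0, 0],
                   [3, 4, 0, 0],
                   [0, 0, 5, 2],
                   [0, 0, 2, 5]], dtype=float)
p_z_d, p_w_z = plsa(counts, n_topics=2)
p_w_d = p_z_d @ p_w_z   # formula (2): P(w | d) = sum_z P(w | z) P(z | d)
```

Each row of `p_z_d` is a topic distribution for one document, which is the kind of representation that can then be compared between a query and the other experiments.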

2.5. Latent Dirichlet allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic topic model that aims to detect latent topics, defined as natural groups of the documents in the data collection. LDA is an unsupervised machine learning algorithm introduced by Blei et al. [26]. It assumes that a document is a bag-of-words and that each document consists of more than one topic based on the words in it. The basic terms are defined as follows:

A word is the basic unit of the vocabulary. The vocabulary is indexed by $\{1, \dots, V\}$, and the $v$th word in the vocabulary is given by the V-vector $w$ such that $w^{v} = 1$ and $w^{u} = 0$ for $u \neq v$.

A document is a sequence of $K$ words represented by $w = (w_{1}, w_{2}, \dots, w_{K})$.

There are $N$ documents in the corpus, represented by $D = \{w_{1}, w_{2}, \dots, w_{N}\}$.

To adapt the LDA model to our study, the descriptions of experiments (DEs) were represented as documents. Furthermore, there are $T$ topics in our data collection, and a word or term is represented by $w$. An experiment description consists of $K$ words, given as $d = \{w_{1}, w_{2}, \dots, w_{K}\}$. A topic is described as a distribution over $K$ words. The probability distribution in formula (4) is used to describe each DE in the collection.

P(w_{i}) = \sum_{j = 1}^{T} P(w_{i} | z = z_{j}) P(z = z_{j})

(4)

$P(w_{i})$ represents the probability of a word $w_{i}$ in each DE, while $P(z = z_{j})$ represents the probability of selecting a word from topic $z_{j}$ for the current DE. In addition, $P(w_{i} | z = z_{j})$ is the probability of sampling the word $w_{i}$ given the topic $z_{j}$. When fitting the LDA model, different numbers of topics were tried because there is no strict rule for selecting the optimal number of topics. The topic distribution generated by the model was then used to represent each experiment description in the collection.
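As a worked illustration of formula (4), the following Python sketch computes the word probabilities of one description from a topic mixture. All numbers are made up for the example ($T = 2$ topics, a three-word vocabulary); they are not values from the study.

```python
# Hypothetical topic mixture for one experiment description: P(z = z_j).
topic_prior = [0.7, 0.3]

# Hypothetical word distributions per topic: P(w | z = z_j).
word_given_topic = [
    {"root": 0.5, "auxin": 0.4, "light": 0.1},   # topic z_1
    {"root": 0.1, "auxin": 0.1, "light": 0.8},   # topic z_2
]

def word_probability(word: str) -> float:
    """Formula (4): sum over topics j of P(w_i | z = z_j) * P(z = z_j)."""
    return sum(p_w[word] * p_z for p_w, p_z in zip(word_given_topic, topic_prior))

probs = {word: word_probability(word) for word in word_given_topic[0]}
# root: 0.5 * 0.7 + 0.1 * 0.3 = 0.38
```

The sum runs over the topics, not over the words, so the resulting word probabilities of a description add up to 1 whenever each topic's word distribution does.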

3. Results and discussion

3.1. Dataset

In this study, we used the same dataset and ground truth as the study of Şener and Oğul [14]. Overall, 120 Arabidopsis time-course experiments from GEO were used in the retrieval model. Each experiment has textual annotations or descriptions of the experimental study. Figure 2 shows an example of an experiment used in the study. As shown in the figure, detailed descriptions are given with status, title, organism, experiment type, summary and overall design. We used the ‘Summary’ and ‘Overall design’ descriptions to detect similarities between compared experiments. Preprocessing steps were applied to transform the data into a usable format. In these steps, stop words and punctuation marks were eliminated, abbreviations and numbers that did not make sense to the user were removed, and the stemmed words were used as the main dataset. All preprocessed data can be accessed from the link below.

Figure 2.

An example experiment.

https://github.com/dygdedesener/data/blob/main/dataset_text_based_exp_retrieval.xlsx.
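The preprocessing steps described in this section can be sketched as follows. This is a rough stand-in, not the pipeline used to build the dataset: the stop-word list is a tiny hypothetical sample, single-letter tokens are dropped as a crude proxy for abbreviations, and `naive_stem` is a simplistic suffix stripper used in place of a real stemming algorithm.

```python
import re

# A tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "of", "in", "and", "to", "a", "was", "were",
              "for", "with", "by", "on", "at"}

def naive_stem(word: str) -> str:
    """Strip a few common English suffixes (crude stand-in for a stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text: str) -> list[str]:
    """Lower-case, drop punctuation, numbers and stop words, then stem."""
    # Keeping alphabetic runs only removes punctuation marks and numbers.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS and len(t) > 1]

# Hypothetical sentence in the style of an 'Overall design' field.
tokens = preprocess(
    "Roots were sampled at 0, 2 and 6 h; transcripts of sorted cells were profiled."
)
```

The resulting token lists are what the lexical and semantic methods of Section 2 would take as their documents.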

3.2. Experimental setup and evaluation

A retrieval framework based on the lexical and semantic similarity for descriptions of microarray experiments was proposed to get relevant experiments within the data collection. Textual descriptions of each experiment were used as a query, and a ranked list was obtained from the collection. The retrieved experiment with a high similarity score is expected to be more likely relevant to the query experiment.

As shown in Figure 3, relevance between the experiments was defined by a ground truth that describes whether compared experiments are truly relevant or not. As mentioned previously, we used the same ground truth as the study of Şener and Oğul [14]. In that study, Gene Set Enrichment Analysis (GSEA) [27] was performed to set an acceptable ground truth for the compared experiments. The idea behind using GSEA as ground truth is that biologically relevant experiments should have common enriched gene sets because similar behaviour can be detected in response to an environmental factor during the experiment. The Jaccard index (formula (1)) was used to calculate gene set–based similarity: if A and B are the enriched gene sets obtained from two compared experiments, the Jaccard index is the number of gene sets they have in common divided by the number of gene sets in their union. A threshold on this index determines relevance. Assuming a Gaussian distribution of the scores of all experiment pairs, the threshold was computed as the sum of the mean and the standard deviation of all values in the dataset, which gave 0.35. If two compared experiments have a Jaccard index greater than the threshold, they are called relevant; otherwise, they are called irrelevant.
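The thresholding rule can be written down in a few lines of Python. The pair scores below are invented for illustration, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say whether the population or sample estimate was used.

```python
from statistics import mean, pstdev

def relevance_threshold(scores: list[float]) -> float:
    """Threshold = mean + standard deviation of all pairwise Jaccard scores."""
    return mean(scores) + pstdev(scores)

def label_pairs(pair_scores: dict[tuple[str, str], float]) -> dict[tuple[str, str], bool]:
    """Mark a pair as relevant when its score exceeds the threshold."""
    threshold = relevance_threshold(list(pair_scores.values()))
    return {pair: score > threshold for pair, score in pair_scores.items()}

# Hypothetical gene-set Jaccard scores for four experiment pairs.
pair_scores = {
    ("E1", "E2"): 0.50,
    ("E1", "E3"): 0.10,
    ("E2", "E3"): 0.20,
    ("E1", "E4"): 0.20,
}
threshold = relevance_threshold(list(pair_scores.values()))  # 0.25 + 0.15
labels = label_pairs(pair_scores)
```

With these toy scores only the pair ("E1", "E2") lies above the threshold and would be treated as relevant.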

Figure 3.

Comparison of retrieval performance of similarity approaches.

After obtaining relevant experiments based on the similarity approaches, the results should be validated biologically. To do so, Gene Ontology (GO) analysis was performed for selected query experiments and the experiments retrieved by the system [28]. We first applied differential expression analysis to get upregulated and downregulated gene lists for each experiment using the DEBrowser tool [29]. The analysis was performed using the DESeq2 R package [30] with the criteria absolute fold change > 2 and p-value < 0.05. Then, GO analysis results were obtained for each experiment with the software tool g:Profiler [31]. This analysis returned GO terms that belong to the Biological Process (BP), Molecular Function (MF) and Cellular Component (CC) ontologies for the query and the retrieved experiment. In our study, GO terms from the BP ontology were used for identifying common GO terms between the query experiment and the retrieved experiment. Having at least one common GO term between compared experiments is taken as evidence that they are also biologically similar.

3.3. Empirical results

The retrieval task was conducted by taking each experiment in the collection as a query in the proposed framework. A ranked list was obtained for each query experiment based on the similarity scores generated for the compared experiments. The expected result is higher ranks for truly relevant experiments and lower ranks for the other experiments.

Receiver Operating Characteristic (ROC) curves were used for evaluating the system performance. In the curve, the x-axis represents the false positive rate (FPR) (formula (5)), and the y-axis represents the true positive rate (TPR) (formula (6)). In the given formulas, TP is true positives, FP is false positives, TN is true negatives and FN is false negatives. For a ranked retrieved list, a positive experiment means a relevant experiment and a negative experiment means an irrelevant experiment. For each query experiment, a ranked list based on the similarity scores is obtained; thereafter, an ROC curve is created by varying the decision threshold over the ranked list. An area under the curve (AUC) score for the query is calculated as the area under its ROC curve, and the average AUC over all query experiments is calculated for each similarity approach. Rather than reporting only an average AUC score for each method, the number of experiments that have AUC scores equal to or greater than a given score was used to visualise the performance of the methods. A higher average AUC value indicates better retrieval performance.

FPR = \frac{FP}{FP + TN}

(5)

TPR = \frac{TP}{TP + FN}

(6)
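A per-query AUC can be computed directly from formulas (5) and (6) by sweeping a threshold down the ranked list. The sketch below is one straightforward way to do this in plain Python; the function names and the example scores are assumptions for illustration, not the evaluation code of the study.

```python
def roc_auc(scores: list[float], relevant: list[bool]) -> float:
    """Area under the ROC curve for one query's ranked list (trapezoidal rule)."""
    n_pos = sum(relevant)
    n_neg = len(relevant) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one relevant and one irrelevant item")

    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    area = 0.0
    i = 0
    while i < len(ranked):
        # Lower the threshold past every item tied at this score in one step.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr = tp / n_pos   # formula (6): TP / (TP + FN)
        fpr = fp / n_neg   # formula (5): FP / (FP + TN)
        area += (fpr - prev_fpr) * (tpr + prev_tpr) / 2
        prev_tpr, prev_fpr = tpr, fpr
    return area

def count_at_least(aucs: list[float], cutoff: float) -> int:
    """Number of queries whose AUC is equal to or greater than the cutoff."""
    return sum(a >= cutoff for a in aucs)

# Hypothetical similarity scores and ground-truth labels for one query.
auc = roc_auc([0.9, 0.8, 0.4, 0.2], [True, False, True, False])
```

In this example three of the four relevant/irrelevant pairs are ordered correctly, so the AUC is 0.75; `count_at_least` gives the quantity plotted against the AUC score when methods are compared.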

Jaccard similarity was used for finding lexical similarity, while the LSA, pLSA and LDA methods were used for finding semantic similarity between textual descriptions of experiments. The retrieval performance of these approaches, given in Figure 3, was compared by plotting, for each AUC score, the number of experiments that achieved that score or better. A higher curve indicates more effective retrieval. As can be seen from the figure, biologically relevant experiments can be detected successfully by the LSA method for many queries. An AUC larger than 0.7 is achieved for two out of three of the experiments. In addition, the average AUC score is 0.79 for the LSA method, 0.77 for LDA, 0.77 for pLSA and 0.73 for Jaccard similarity (lexical similarity). We also compared the performance of the proposed framework with the study of Şener and Oğul [14], which is a retrieval framework based on the content similarity of the same microarray experiments used in our study. They achieved an average AUC score of 0.74, while we obtained an average AUC score of 0.79 with the LSA method. This suggests that a retrieval framework based on the semantic similarity of textual descriptions of the experiments is more successful than a content-based retrieval model in detecting similarities between experiments.

To investigate how the dimensionality of LSA vectors affects the retrieval performance, we ran the method with different vector dimensions. Figure 4 shows the effect of the dimension on the retrieval performance of the LSA method. We simply computed the average AUC scores for different numbers of dimensions. As shown in the figure, the highest average AUC score was obtained with a vector dimension of 5; hence, 5 was selected as the vector dimension for the LSA method.

Figure 4.

Effect of the number of dimensions in the LSA method.

After observing the performance of the retrieval methods, we applied statistical significance tests to determine whether the performance difference between the most successful method (LSA) and the other methods was significant. To do so, a paired t-test and a non-parametric Wilcoxon signed-rank test were performed on the pairwise differences between the AUC scores of the methods. The p-values are below 0.05 in both tests, except for the Wilcoxon result for LSA versus pLSA. For the Wilcoxon and paired t-test, respectively, the p-values are 5.83734E−11 and 2.34E−11 for Jaccard, 0.054 and 0.018 for pLSA, and 0.003 and 0.009 for LDA (Table 1). The results indicate that the improvement of the LSA method in finding relevant experiments, in terms of AUC scores, is statistically significant against Jaccard and LDA in both tests and against pLSA in the paired t-test only.

Table 1.

Significance tests for the pairwise comparisons of used methods.

Comparison	Wilcoxon signed-rank test (p-value)	Paired t-test (p-value)
LSA–Jaccard	5.83734E−11	2.34E−11
LSA–pLSA	0.054	0.018
LSA–LDA	0.003	0.009

LSA: Latent Semantic Analysis; pLSA: Probabilistic Latent Semantic Analysis; LDA: Latent Dirichlet allocation.
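To make the paired comparison concrete, the sketch below computes the paired t statistic from per-query AUC scores of two methods using only the Python standard library. The AUC values are invented, and the sketch stops at the test statistic: obtaining the p-value requires the t distribution with the returned degrees of freedom, for which a statistics library (for example `scipy.stats.ttest_rel`, or `scipy.stats.wilcoxon` for the non-parametric test) would normally be used.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return the paired t statistic and its degrees of freedom (n - 1)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    # t = mean difference / standard error of the mean difference.
    # (Raises ZeroDivisionError if all differences are identical.)
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return t_stat, len(diffs) - 1

# Hypothetical per-query AUC scores for two retrieval methods.
auc_method_a = [0.80, 0.75, 0.90, 0.70]
auc_method_b = [0.70, 0.70, 0.80, 0.65]
t_stat, dof = paired_t_statistic(auc_method_a, auc_method_b)
```

The test is paired because both methods are evaluated on the same query experiments, so each difference compares the two methods on one query.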

In addition to the indirect evaluation based on gene set similarity, the retrieval ability of the proposed system was evaluated by manually examining the textual annotations of the experiments. For this, two query experiments were selected as samples to observe the relevance between their annotations and those of the retrieved experiments. The selection criterion is that, apart from the query itself, at least two relevant experiments are retrieved from the dataset with a high similarity score. The first sample query experiment was performed to observe the expression data from xylem-pole pericycle cells of Arabidopsis roots undergoing lateral root initiation [32]. In the experiment, transcript profiling was used on sorted pericycle cells undergoing lateral root initiation to define the receptor-like kinase ACR4 of Arabidopsis as a key factor. The authors also used FACS on Arabidopsis roots containing a green fluorescent protein (GFP) marker for xylem-pole pericycle cells at different time points including the arrest of G1-to-S transition, auxin signalling, and progression through G1-to-S and G2-to-M transition or asymmetric cell division. The accession number of this experiment is GSE6349. The second sample query experiment was performed to observe the growth inhibitory effects of the volatiles of certain rhizobacteria on Arabidopsis thaliana. Its accession number is GSE35325-2, where the number after ‘-’ indicates the experiment number within the same GEO entry.

Table 2 gives the retrieval results of the selected sample query experiments. For each query, the two relevant experiments (the first two rows) and two irrelevant experiments retrieved by the system are listed with their lexical and semantic similarity scores. The ground truth score obtained by gene-set-based similarity, known as true relevance, is also given in the same table. The results suggest that the scores of the semantic similarity approaches agree with the true relevance of the retrieved relevant and irrelevant experiments. By contrast, lexical similarity does not succeed in detecting similarities between experiments because low similarity scores were obtained even for relevant experiments.

Table 2.

Most relevant (first two rows for each query) and least relevant experiments for sample queries.

Compared experiments		Lexical similarity	Semantic similarity			True relevance
Query experiment	Retrieved experiment	Jaccard score	LSA score	pLSA score	LDA score	Ground truth score
GSE6349	GSE30098	0.099	0.956	0.608	0.809	0.449
GSE6349	GSE3350-2	0.1	0.964	0.612	0.808	0.525
GSE6349	GSE18975-6	0.067	0.144	1.90E−08	0.046	0.258
GSE6349	GSE19271-7	0.029	0.102	1.21E−08	0.042	0.264
GSE35325-2	GSE18985-2	0.178	0.985	0.998	0.469	0.468
GSE35325-2	GSE19261-1	0.136	0.994	0.987	0.365	0.357
GSE35325-2	GSE576-1	0.068	0.173	0.003	0.052	0.316
GSE35325-2	GSE577-4	0.05	0.013	0.002	0.059	0.189

LSA: Latent Semantic Analysis; pLSA: Probabilistic Latent Semantic Analysis; LDA: Latent Dirichlet allocation.

For the first query experiment, GSE30098 is the most relevant experiment retrieved by the system. This experiment is about the expression analysis of Arabidopsis plant roots during sulphur deficiency. The experiment was conducted to understand the link between development and stress in Arabidopsis root by using genome-wide assays [33]. The most notable similarity between the query and this experiment is that both observe plant root development processes at different time periods and in response to environmental factors. The second most relevant experiment, GSE3350-2, is a study of the mechanisms behind auxin-induced cell division using lateral root initiation [34]. The similarity between the experiments is that both investigate the auxin hormone and cell division. This indicates that the proposed system can infer the relevance between experiments with the same objective and the same stimulus.

For the second query experiment, numbered GSE35325-2, the most relevant experiment is GSE18985-2, which identifies early gibberellin (GA) responsive genes in the roots of an Arabidopsis GA-deficient mutant. The most notable similarity between experiments is that the effects of growth inhibitory bacteria and a supportive hormone were observed in these experiments. The second most relevant experiment was GSE19261-1, which shows an interesting relevance. The retrieved experiment was conducted to observe the effects of light on the growth process of the plant, whereas the query experiment also presents the results of a study on the inhibitory effect of the volatile of a certain bacterium on plant growth. Evidently, both the query and retrieved experiment are the results of studies about observing plant growth for the same objective with a different stimulus.

The relevance between selected query experiments and retrieved experiments can also be observed by GO analysis. We observed that four common GO terms are enriched for both the query and the relevant experiments; these terms and the related p-values are given in Tables 3 and 4. As can be seen from the tables, the relevant experiments have common GO terms with low p-values, which is the main criterion for evaluating GO analysis results.

Table 3.

Common GO terms enriched for query experiment (GSE6349) and the first relevant experiment (GSE30098).

GO term	p-value
Alpha-amino acid metabolic process	1.123 × 10⁻⁴
Organic acid metabolic process	3.682 × 10⁻²
Cellular amino acid metabolic process	1.143 × 10⁻⁸
Oxoacid metabolic process	4.014 × 10⁻²

GO: Gene Ontology.

Table 4.

Common GO terms enriched for query experiment (GSE35325-2) and the first relevant experiment (GSE18985-2).

GO term	p-value
Cellular response to chemical stimulus	2.784 × 10⁻³
Response to chemical	1.365 × 10⁻²
Response to abiotic stimulus	3.667 × 10⁻²
Response to stimulus	4.446 × 10⁻²

GO: Gene Ontology.

4. Conclusion

With the exponential growth of genomic data in public repositories, efficient search methodologies to get relevant data from the repositories have become a significant need for researchers. Current data repositories provide only metadata-based searching approaches that mainly use lexical matching of textual descriptions of the experiments. This technique fails to retrieve relevant experiments from large data collections. Owing to the existing limitation of current data repositories’ searching options, in this study, we aim to develop a retrieval framework using semantic similarity approaches in data repositories. Similarities between experiments were found by using both lexical and semantic similarity approaches and their retrieval performance was compared. In the proposed model, Jaccard similarity was used as a lexical similarity approach, and LSA, pLSA and LDA were used as semantic similarity approaches to detect similarities between the textual descriptions of the experiments.

We used textual annotations of GEO experimental datasets to test the performance of the proposed system. This study, to the best of our knowledge, is the first attempt to use semantic similarity approaches for detecting similarities between textual descriptions of microarray experiments. The proposed system was evaluated by the ROC performance of the similarity approaches, and high AUC scores were obtained for most of the experiments. To evaluate the results further, we selected sample query experiments and manually compared their textual annotations with those of the retrieved experiments. The inferences made by the system are supported by a biological justification of the results. According to the experimental results, semantic similarity approaches retrieve relevant experiments more successfully than the lexical similarity approach. Although text-driven semantic similarity approaches were successful in finding similarities between experiments in our study, they have some drawbacks. Topic models lack interpretable embeddings because the generated topics are not known in advance; they are hidden variables of the documents. In addition, the models are built on a bag-of-words representation, which assumes that the terms in a text are exchangeable; sentence structure is not modelled, which may limit their ability to represent information carried by the sentence structure of a text. Finally, LSA involves SVD, which increases the computational cost of the system as the number of experiments in the collection grows.

In conclusion, text-driven semantic similarity approaches show promise for retrieving relevant experiments, and they allow us to capture biological similarity between the query experiment and the retrieved experiments. We also expect that the proposed system will provide insight for future studies that go beyond simple metadata-based approaches. The results also encourage the use of data mining techniques for retrieving relevant experiments from genomic data collections. Finally, the proposed retrieval framework is expected to be applicable not only to microarray experiments but also to other types of genomic data that have textual annotations.

Footnotes

Author’s Note

Duygu Dede Şener is also affiliated with the Department of Computer Engineering, Baskent University, Ankara, Turkey.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Duygu Dede Şener

References

1. Barrett, Edgar. Mining microarray data at NCBI’s Gene Expression Omnibus (GEO). Methods Mol Biol 2006; 338: 175–190.

2. Brazma, Parkinson, Sarkans, et al. ArrayExpress – a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 2003; 31(1): 68–71.

3. Benson, Cavanaugh, Clark, et al. GenBank. Nucleic Acids Res 2013; 41(D1): D36–D42.

4. Rhee, Beavis, Berardini, et al. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology research materials and community. Nucleic Acids Res 2003; 31(1): 224–228.

5. Caldas, Gehlenborg, Faisal, et al. Probabilistic retrieval and visualization of biologically relevant microarray experiments. Bioinformatics 2009; 25(12): i145–i153.

6. Engreitz, Morgan, Dudley, et al. Content-based microarray search using differential expression profiles. BMC Bioinf 2010; 11: 603–614.

7. Hayran, Ogul, Ozkoc. Content-based search on time-series microarray databases. In: 2014 25th International workshop on database and expert systems applications, Munich, 1–5 September 2014, pp. 89–93. New York: IEEE.

8. Seth, Välimäki, Kaski, et al. Exploration and retrieval of whole-metagenome sequencing samples. Bioinformatics 2014; 30(17): 2471–2479.

9. Oğul. Content-based retrieval of microarray experiments. In: Elloumi, Iliopoulos, Wang JTL, et al. (eds) Pattern recognition in computational molecular biology: techniques and approaches. Hoboken, NJ: John Wiley & Sons, 2015, pp. 315–334.

10. Şener, Ogul. Inferring similarity between time-series microarrays: a content-based approach. In: 2015 IEEE 2nd international conference on cybernetics (CYBCONF), Gdynia, 24–26 June 2015.

11. Açıcı, Terzi, Oğul. Retrieving relevant experiments: the case of microRNA microarrays. BioSystems 2015; 134: 71–78.

12. Sticht, De La Torre, Parveen, et al. miRWalk: an online resource for prediction of microRNA binding sites. PLoS ONE 2018; 13(10): e0206239.

13. Liu et al. miRBaseConverter: an R/Bioconductor package for converting and retrieving miRNA name, accession, sequence and family information in different versions of miRBase. BMC Bioinf 2018; 19: 514.

14. Şener, Oğul. Retrieving relevant time-course experiments: a study on Arabidopsis microarrays. IET Syst Biol 2016; 10(3): 87–93.

15. Şener, Santoni, Felici, et al. A content-based retrieval framework for whole metagenome sequencing samples. J Integr Bioinf 2018; 15(4): 20170067.

16. Fujibuchi, Kiseleva, Taniguchi, et al. CellMontage: similar expression profile search server. Bioinformatics 2007; 23(22): 3103–3104.

17. Engreitz, Chen, Morgan, et al. ProfileChaser: searching microarray repositories based on genome-wide patterns of differential expression. Bioinformatics 2011; 27(23): 3317–3318.

18. Williams. SPIEDw: a searchable platform-independent expression database web tool. BMC Genomics 2013; 14: 765–770, https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-765

19. Zhu, Wong, Krishnan, et al. Targeted exploration and analysis of large cross-platform human transcriptomic compendia. Nat Methods 2015; 12: 211–214.

20. Wang, Monteiro, Jagodnik, et al. Extraction and analysis of signatures from the Gene Expression Omnibus by the crowd. Nat Commun 2016; 7: 12846.

21. Gomaa, Fahmy. A survey of text similarity approaches. Int J Comput Appl 2013; 68(13): 13–18.

22. Glen. Jaccard Index / Similarity coefficient – StatisticsHowTo.com: elementary statistics for the rest of us! https://www.statisticshowto.com/jaccard-index/ (2016, accessed 4 August 2021).

23. Landauer, Foltz, Laham. An introduction to latent semantic analysis. Discourse Process 1998; 25: 259–284.

24. Hofmann. Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, CA, 15–19 August 1999, pp. 50–57. New York: Association for Computing Machinery.

25. Oneata. Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth conference on Uncertainty, Sweden, 1999, pp. 1–7. Morgan Kaufmann Publishers.

26. Blei, Ng, Jordan. Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993–1022.

27. Subramanian, Tamayo, Mootha, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci 2005; 102(43): 15545–15550.

28. Ashburner, Ball, Blake, et al. Gene ontology: tool for the unification of biology. Nat Genet 2000; 25(1): 25–29.

29. Kucukural, Yukselen, Ozata, et al. DEBrowser: interactive differential expression analysis and visualization tool for count data. BMC Genomics 2019; 20(1): 6.

30. Love, Huber, Anders. Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biol 2014; 15: 550–570.

31. Raudvere, Kolberg, Kuzmin, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res 2019; 47(W1): W191–W198.

32. Smet, Vassileva, De Rybel, et al. Receptor-like kinase ACR4 restricts formative cell divisions in the Arabidopsis root. Science 2008; 322(5901): 594–597.

33. Pascuzzi ASI, Jackson, Cui, et al. Cell identity regulators link development and stress responses in the Arabidopsis root. Dev Cell 2011; 21(4): 770–782.

34. Vanneste, De Rybel, Beemster, et al. Cell cycle progression in the pericycle is not sufficient for SOLITARY ROOT/IAA14-mediated lateral root initiation in Arabidopsis thaliana. Plant Cell 2005; 17(11): 3035–3050.