Abstract
Word embeddings are powerful for many tasks in natural language processing. In this work, we learn word embeddings using weighted graphs from word association norms (WAN) with the node2vec algorithm. Although building WAN is a difficult and time-consuming task, training the vectors from these resources is a fast and efficient process. This allows us to obtain good quality word embeddings from small corpora. We evaluate our word vectors in two ways: intrinsic and extrinsic. The intrinsic evaluation was performed with several word similarity benchmarks, WordSim-353, MC30, MTurk-287, MEN-TR-3k, SimLex-999, MTurk-771 and RG-65, and different similarity measures achieving better results than those obtained with word2vec, GloVe, and fastText, trained on a huge corpus. The extrinsic evaluation was done by measuring the quality of sentence embeddings using transfer tasks: sentiment analysis, paraphrase detection, natural language inference, and semantic textual similarity. The word vectors learned from the WAN are available on our Github page.
Introduction
Semantic representation of words in a vector space has been an active research field over the past decades. Computational models like singular value decomposition (SVD) and latent semantic analysis (LSA) [19] are able to model continuous representations of words (embeddings) from term-document matrices. Both methods are able to reduce an n-dimensional term-document matrix using only the most important dimensions.
More recently, Mikolov et al. [40] introduced a new theoretical approach, word2vec, inspired by the distributional hypothesis, which states that words in similar contexts tend to have similar meanings [31]. This technique uses a neural network to learn lexical vector representations by predicting other words in the neighborhood. Distributed vector representations obtained with word2vec have the capability to preserve linear regularities between words.
Some alternative methods have been designed to improve the performance of word2vec. GloVe [47] aims to be an efficient vector model by training nonzero elements in a word-word co-occurrence matrix. fastText [6] is an approach based on the skip-gram model, with the difference that in this method every word is represented as a bag of character n-grams. Mikolov et al. [41] claim that the models trained with fastText exhibit the best degree of accuracy compared to other systems, becoming the new state of the art in distributed representations of words.
In order to build a suitable and reliable semantic vector space model able to capture semantic similarities and linear regularities of words, huge volumes of text are necessary. Although methods for learning word vectors are fast and efficient to train, and pretrained embeddings are usually available online, it is still computationally expensive to train huge volumes of data in non-commercial environments, i.e., personal computers. Acquisition of semantic representations require large handcrafted knowledge bases or huge volumes of text to be exploited. Some word2vec embeddings are trained on billions of words, which implies a computational cost that is not very realistic from the cognitive point of view either. In addition to this, only really big sources like Wikipedia, an encyclopedia, can provide the number of lexical items necessary to perform such processes. Encyclopedias have plenty of terms that are not part of common language (i.e., Latin names for plants and animals, geographical names, etc.), introducing some unneeded noise to the vectors.
Even though millions of different words can be computed for languages like English, this is not the actual size of the vocabulary an English speaker uses in daily life. Psycholinguists do not agree on the number of words that comprise the core of a language. In the thirties of the 20th Century, Ogden [45] suggested that there is a small set of 850 items that constitute what he calls “basic English”. West [56], elaborated a list with the 2000 more frequent words in English that were extracted from a corpus. The resource, called General Service List (GSL), was oriented to the use of ESL learners and teachers. This was updated in 2013 by Browne et al. [8], who published the New General Service List (NGSL). The authors tested both lists in the Cambridge English Corpus (CEC), concluding that the GSL covered 84% and the NGSL 92% of the words of the resource [36].
Leaving aside the field of English as a Second Language, the theory explained above has been broadly criticized [22]. However, the vocabulary of daily life is limited compared to encyclopedic resources, and the idea to restrict to the most common words is still followed in several publications of lexicography. As an example, the Longman Dictionary of Contemporary English (LDOCE) [38] uses a list of the two thousand most used words in English to define the lemmas of the whole dictionary, 230,000 in the 2003 edition.1
Summing up, it seems that having lexical resources with a more restricted number of words is not necessarily a limitation, although the ideal coverage of the vocabulary to balance complexity and utility is controversial.
In another vein, methods based on co-occurrences of words, like word2vec, provide a good idea on the syntagmatic and paradigmatic relations of words, but fail to show their psychological and associative connections. When envisioning some writing assistants and applications to help with real problems of dysnomia, these latter types of relations cannot be dismissed.
Keeping the previous ideas in mind, it makes sense to look for models that can work with more restricted sets of both tokens and types. Somehow, these methods can decrease the recall, but the computational complexity and the efficiency will be increased. Moreover, it is highly recommended to consider corpora that can provide other information on how the lexicon is structured and accessed, rather than corpora where the relation between words are only based on co-occurrence. Wordnet [42], semantic networks or word association networks are possible sources for these experiments. In this paper, we evaluate the convenience of using word association norms as a basis for learning relations between words.
Word association (WA) tests are an experimental technique for discovering the way that human minds structure knowledge [16]. In free association experiments, a person is asked to respond with the first word that comes to mind in response to a given stimulus word. Free WA tests are able to produce rich types of associations that can reflect both semantic and episodic memory contents [7] in the form of general word association norms (WAN).
The goal of this paper is to learn continuous feature representations for nodes (that represent words) in WAN. Our hypothesis is that word embeddings learned from WAN map the organized lexical items arranged in the lexicon to a vector space, thus learning richer representations. Grover and Leskovec [27] introduced an algorithm called node2vec that is able to learn mappings of nodes to a continuous vector space taking into account the network neighborhoods of nodes. The algorithm performs a biased random walk to explore different neighborhoods in order to capture not only the structural roles of nodes (words) in the network but also the communities they belong to. Our proposal is called wan2vec, because vectors are learned from a network built upon a WAN, by means of the node2vec algorithm. The wan2vec embeddings are available on Github.2
The rest of the paper is organized as follows. Section 2, discusses some related work. In Section 3, we present a well-known compilation of word association norms, the Edinburgh Association Thesaurus, and we describe the methodological framework for learning word embeddings from WAN. In Section 4, we present the evaluation of the generated vectors using standard datasets for word similarity in English. In Section 5, an extrinsic evaluation is conducted, testing the performance of wan2vec with other tasks related to NLP. Finally, in Section 6 we draw some conclusions and point to possible directions of future work.
The idea of word associations was first proposed by psychologists to tackle disorders like dementia or aphasia. Within cognitive psychology, Collins [12] applied them to simulate memory processes. From psycho-linguistics, Clark [11] presents free associations as an ability that can reveal some properties of the mechanisms of language.
Linguistics and Psycholinguistics used semantic networks [55], which are graphs relating words [2], not only to study the organization of the vocabulary, but also to approach the structure of knowledge.
Word association norms are corpora of free association words. One of the first examples is provided by Kent & Rosanoff [34], who used this method for comparisons of words, introducing 100 emotionally neutral test words. They conducted the first large scale study with 1000 test subjects, and concluded that there was uniformity in the organization of associations and that people shared stable networks of connections among words [33].
Many languages have corpora of WAN. In the past decades, some other association lists were elaborated with the collaboration of a large number of volunteers. In recent years, the web has become the natural way to get data to build such resources. Jeux de Mots3
Among the best-known resources available on the web for English are the Edinburgh Associative Thesaurus5
Borge-Holthoefer & Arenas [7] introduced a Random Inheritance Model (RIM) for extracting semantic similarity relations from free association information. The obtained vectors were compared with LSA-based vector representations and the WAS (word association space) model. Their results indicate that RIM can successfully extract word feature vectors from a free association network.
In recent years, Bel-Enguix et al. [4] used graph analysis techniques to compute associations from large texts collections. Furthermore, Garimella et al. [24] introduced a demographic-aware word association model based on a neural net skip-gram architecture, outperforming generic methods for computing associations that do not take into account writer’s demographics.
Sinopalnikova and Smrz [53] presented a methodological framework for building and extending semantic networks with word association thesaurus (WAT), including a comparison of quality and information provided by WAT vs. other language resources. The authors showed that WATs are comparable to balanced text corpora and can replace them in case of absence of a corpus.
In a recent work by De Deyne et al. [17] the authors introduced a spreading activation approach in order to encode a semantic structure from a word association network, specifically part of the Small World of Words. The word association-based model was compared with a word embeddings model (word2vec) using relatedness and similarity judgments from humans, obtaining an average of 13% of improvement over the word2vec model.
The task of measuring the semantic similarity and relatedness has been a key problem in NLP, with multiple implications in lexical semantics [1]. To approach the problem, some authors [5,10,20,29,46] have used pre-existing knowledge resources like Wordnet [42]. Other approaches are based on distributional semantics. The vector-based methods, which calculate the semantic realatedness by means of the extraction of characteristic vectors for words, typically use the cosine similarity measure [23,28]. Recently, some methods have been introduced that use different algorithms for obtaining the similarity between two given vectors; among them APSyn [50,51], that ranks the most associated contexts of two words and computes the extent of the intersection between them.
In this work, we aim at learning vector representation of words using as a basic resource the EAT [35], a compilation of word association norms in English. We claim that, in spite of the limited vocabulary, word associations capture more efficiently lexical connections and imply a lower complexity in both processing time and space. De Deyne et al. developed a similar idea in [17], but unlike their work, we use the complete WAN corpus (not only the largest connected component), increasing in this way the coverage of the vocabulary by 5% and maintaining high quality vectors.
The EAT was mainly collected with undergraduate students from different British universities. The participants were age between 17 and 22, among which 64% were males and 36% were females. Every informant gave responses for 100 words, and every word was given to 100 participants. The resource was elaborated between 1968 and 1971, and published in 1973. Although it could be said that EAT presents an “old stage” of the language, it is the most complete and balanced WAN elaborated so far.
We used an XML version of the resource,7
The graph representation of the EAT corpus is formally defined by
The graph is undirected, so that every stimulus is connected to every associated word without any precedence order.
Two different functions (ϕ) were used to assign weight to the edges: Frequency and Association Strength [44]. Only the former is included in the data provided by the EAT. The latter was calculated from the results of frequency for every word associated to its stimulus.
Counts the number of occurrences of every response associated to its stimulus in the whole corpus. In order to use this information, we calculated the inverse frequency (
Establishes a relation between the frequency (
For our experiments, we used the inverse association strength,
Note that in this work we use
The node2vec algorithm [27] finds a mapping
The sampling strategy designed in node2vec enables to explore neighborhoods with flexible biased-random walks. Parameters p and q control the switching between breath-first (BFS) and depth-first (DFS) graph searches. Parameter p controls the likelihood of immediately revisiting a node in the walk and parameter q allows the search to differentiate between “inward” and “outward” nodes. The values used in this work were
In this work, we used the implementation available on the website8
The word embeddings learned from the WAN are available on our Github page.9
There are several evaluation methods for unsupervised embedding techniques [52], categorized as intrinsic and extrinsic. The extrinsic evaluation (that will be tested in Section 5) consists in evaluating the quality of the word vectors in Natural Language Processing (NLP) tasks [25,26]. The improvement is measured in the performance metric specific to the evaluated task. Intrinsic evaluations test the ability of word vectors to capture syntactic or semantic relationships [3] of words. The hypothesis of the intrinsic evaluation states that similar words should have similar representations. This section is devoted to perform the intrinsic evaluation of wan2vec.

Projection of word vectors for 5 semantic groups (ten words each). Colors are codified as follows: animals – red, transportation – black, body parts – blue, household appliances – green, clothes – pink.
Thus, to evaluate similarity we first performed a visualization of a sample of words using a T-SNE10
When the
In addition, we evaluated the ability of wan2vec to capture semantic relations through a similarity task. We used two subsets of the WordSim-353 [21] benchmark comprised of 353 semantically related term pairs with similarity scores given by humans. This list does not distinguish between the concepts of similarity and relatedness. Aguirre et al. [1] split the list into two different sets, one with relatedness scores, containing 252 pairs [EN-WS-353-REL], and another with 203 pairs linked by similarity [EN-WS-353-SIM]. There is an overlap between the lists, all of the term pairs belonging to the 353 pairs of WordSim 353.
The WordSim-353 is based on works from the nineties elaborated in the USA. Because of the time and geographical differences, several words from this list are not included in the EAT, a British collection from the early seventies. This fact is not due to a lack of expressivity in WAN, but it is caused because some objects, people and ideas did not exist when the EAT was collected, or they were used in very different contexts. An example can be the word hardware, a neologism from the nineties, or Maradona, the soccer star from the eighties. As a result, 183 pairs in the list of similarity are in the WAN corpus, while 214 out of 253 are in the relatedness compilation. To deal with the non-inclusion of every word of the testing datasets in our EAT WAN, we introduced the concept of overlap in the experiments and calculated the total number of common words between the lists that are being compared. The others are excluded from the evaluation. In principle, having large overlaps is a positive feature for wan2vec.
We have also tested our method with SimLex-999 [SimLex-999] [32], a resource that contains 999 pairs of words and explicitly quantifies similarity in a way that pairs related by association or relatedness have a low rating. The overlap between the EAT and SimLex-999 is 968, which means that almost every word in the test data is covered by wan2vec.
The rest of the datasets we used for our experiments, measure the relatedness of words more than the similarity.
The Amazon Mechanical Turk data [MTurk-287] [48] consists of 287 word pairs, elaborated with the collaboration of the Amazon’s Mechanical Turk workers. The overlap with EAT is 203.
MEN [MEN-TR-3k] [9] is an evaluation benchmark that contains 3,000 pairs of randomly selected words. Each pair is scored on a [0, 1] normalized semantic relatedness scale via ratings obtained by crowdsourcing on the Amazon Mechanical Turk. We have a very high overlap with this corpus, 2,727 out of 3,000 word pairs.
The MTURK-771 dataset [MTurk-771] [30] evaluates the relatedness between 771 word pairs. It was obtained with the evaluation of Amazon’s Mechanical Turk workers. EAT covers 698 word pairs of this dataset.
The RG65 [RG-65] [49] collected data for 65 pairs of non-technical words. Their objective was to evaluate the judgments and perceptions about synonymy, and because of these, they used different pairs, in a range from highly synonymous to totally unrelated words. Every word pair in the RG65 is also in the EAT.
Spearman rank order correlations between wan2vec predictions (based on cosine similarity) and the evaluation benchmarks. The WAN graph was built using the
as weighting function
Spearman rank order correlations between wan2vec predictions (based on cosine similarity) and the evaluation benchmarks. The WAN graph was built using the
Finally, we evaluated the wan2vec embeddings with the MC-30 [MC-30] [42] benchmark. This list contains 30 word pairs, all of them included in the EAT.
Table 1 reports the Spearman rank order cosine correlations between the previously described datasets and the word embeddings of different dimensions learned on undirected graphs of the WAN EAT, being the weight of the edges given by the
Spearman rank order correlations between wan2vec predictions (based on cosine similarity) and the evaluation benchmarks. The WAN graph was built using the
From these tables it can be seen that, in general, the wan2vec embeddings learned on the graph weighted with the
As we have stated before, we are testing only the words in the intersection between the test lists and the WAN. The column n in Tables 1 and 2 corresponds to the number of pairs in the benchmarks, while the third column,
Spearman rank order correlation between wan2vec predictions (based on APSyn) and the evaluation benchmarks. The WAN graph was built using
Concerning the dimensions of the vectors, 50 and 100 proved to be the most efficient. With
From the point of view of the benchmarks, the similarities obtained with wan2vec achieved correlations of above 0.8 with the MC-30, RG-65, and MEN-TR-3k using both
On the other hand, the worst results are achieved with the SimLex-999 resource. This is not because of an small overlap – this is one of the best (96%). Given that the aim of the SimLex dataset is to measure only similarity, the words psychologically related have the lowest scores. We believe that this issue affected the performance on this particular resource since in a WAN, words are supposed to be associated by relatedness. However, this idea is not consistent with the results we obtained with WS-353-SIM and WS-353-REL, where the former has outperformed the latter in every experiment.
In general, the higher scores are achieved with the MC-30. The fact that every word in this list is also in the EAT is not related to the result because it is also the best when the missing pairs are removed.
More research should be done about these results in the future in order to know the dynamics of the associations and to adjust the settings of the experiments.
An additional experiment has been conducted to test the performance and the adaptation of wan2vec to different similarity algorithms. We have reproduced the same tests that we have just described; but, this time, instead of measuring the correlation of two vectors by means of the well-known method of the cosine similarity, we have used an implementation of APSyn used for count-based vectors, as it is established in [50]. This technique is defined as the extent of the weighted intersection between the top most salient contexts of the target words, weighting it by the average rank of the intersected features in the PPMI-sorted contexts lists of the target words. Therefore, the method is using a completely different algorithm to calculate the similarity between two vectors, and should provide a different perspective of the behavior of our node-based model.
Spearman rank order correlation between wan2vec predictions (based on APsyn) and the evaluation benchmarks. The WAN was built using
Tables 3 and 4 show the results of APSyn with the same benchmarks that have been tested before, and with the same weighting functions

Spearman rank order correlations obtained with different walk length using both graph weighting functions
Regarding the benchmarks, the best results are still the ones obtained with the list MC-30, but this time with
We performed an additional analysis which consisted in looking for the optimal walk length of the node2vec algorithm. The walk length indicates how deep the algorithm will walk through the graph in order to obtain the node’s corresponding embedding. The default value is 80, and this is the one we used in the above experiments.
We measured the performance of the wan2vec embeddings carrying out the same experiments reported in the previous section, but systematically changing the length of the walk the node2vec algorithm traverses for finding the mapping for each node. We evaluated the walk length values from 20 to 200 in intervals of 10. Figure 2 presents the results of the experiments with two datasets the WS-353-REL (Relatedness) and the WS-353-SIM (Similarity).
Description of the pretrained vectors of the three evaluated word embeddings models
Description of the pretrained vectors of the three evaluated word embeddings models
Cosine-based Spearman correlation between pretrained embeddings models, wan2vec with dimension 300 and wan2vec with its best correlation value. All of them regarding the benchmarks
In every case, the best results are achieved after the walk length of 60, reaching the best performance around 120. However, after this point, the improvement of the quality of the vectors is not remarkable and it can increase the time and complexity for training the model.
In order to test and compare the quality of our wan2vec vectors, we also performed the experiments with pretrained vectors, which are presented in Section 4.1. We selected three word embeddings models: word2vec,11
Table 5 presents the general features of the evaluated resources. In all cases the embeddings’ dimension is 300 (dimensions), the difference of these three models basically lies in the type and size of the training corpus (Corpus (size)), the total amount of different words (Vocab. size), the training algorithm, and the context windows size. Data about the author of these models is also presented. The n (Overlap) column here is the same as in Tables 1–4, in this way we performed a fair evaluation using only the words that are available in all embedding models (including wan2vec).
Table 6 shows the Spearman rank order correlations between the evaluated pretrained embeddings models and the human judgments (available in the benchmarks). The two last columns show the results obtained with wan2vec. One retrieves the outcomes of the method with vectors of dimension 300 with
In average, the similarity obtained with the GloVe embeddings obtained the worst correlation scores, whereas the similarity obtained with fastText achieved of 0.7032 of Spearman correlation. These outcomes outperformed by wan2vec both with dimension 300 (0.7136) and 100 (0.7242).
The performance of fastText is better than wan2vec with dimension 300, with WS-353-SIM, MTurk-287, WS-353-REL, MEN-TR-3k and RG-65, whereas the vectors wan2vec with dimension 300 outperforms fastText with MC-30, SimLex-999 and MTurk-771. However, taking our best dimension, 100, wan2vec achieves better results than fastText except for MTurk-28 and RG-65. With these benchmarks, fastText gets always the best scores.
APSyn-based Spearman correlation between pretrained embeddings models, wan2vec with dimension 300 and wan2vec with its best correlation value. All of them regarding the evaluation benchmarks
As for our wan2vec embeddings, the highest correlation was achieved with the MC-30 (300) and the MEN-TR-3k (100) datasets, obtaining correlations of 0.8582 and 0.8075, respectively. The pretrained vectors achieved the highest correlation of 0.8119 on the MC-30 with fastText embeddings and 0.8171 on the RG-65 dataset with GloVe 42B.
It is also interesting to note that the correlation with the SimLex-999 corpus is the lowest with all embeddings models, being the best the 0.51 of wan2vec-300 and the 0.49 of wan2vec-100, among the other methods.
Finally, the same tests have been performed with the APSyn similarity measure, with the results that are shown in Table 7. The columns follow the same structure than Table 6. In general, the results are very low, as it happened with wan2vec when using this measure (Tables 4 and 3). Besides this, the main difference between Tables 6 and 7 is that, in the latter, GloVe 42B vectors are more competitive. Both GloVe 42B and word2vec achieve better results than fastText. wan2vec with dimension 300 has consistent results, obtaining the best score with SimLex-999. And finally, if we take wan2vec with our best dimension, we obtain the highest scores with the benchmark MEN-TR-3k. However, GloVe 42B wins with WS-353-SIM and MTurk-287, whereas word2vec has the best results with WS-353-REL, MTurk-771 and RG-65. Finally, fastText achieve the best scores with MC-30.
Therefore, it is clear that the similarity measure between vectors we use has a clear impact on the results obtained. The models that are being compared show similar behaviors under similar tests, showing a decreasing and unexpected performance under APSyn measure. Indeed, we see an inversion of the scores obtained using cosine similarity. wan2vec and fastText, that worked better with cosine, are the worst with APSyn, while GloVe and word2vec, the worst systems with cosine, obtain better numbers in APSyn.
In order to evaluate the wan2vec embeddings in downstream tasks, we used SentEval14
Sentiment analysis tasks (SST-2, SST-5): both binary and fine-grained SST [54].
Paraphrase detection task (MRPC): aims at identifying if a pair of sentences captures a paraphrase/semantic equivalence relationship [18].
Natural language inference task (SICK-E): consists in predicting whether two input sentences are entailed, neutral or contradictory [39].
Semantic textual similarity tasks (STS’12-16): consists in evaluating how the cosine distance between two sentences correlate with a human-labeled similarity score through Pearson correlations.
For the classification tasks, the SentEval toolkit generates sentence vectors and a logistic regression classifier on top. For the unsupervised tasks (STS), it also generates sentence embeddings from the word embeddings. In both cases, the sentences vectors are the average of the word embeddings in each sentence. Although, there are other universal sentence encoders methods [14], the average of word embeddings as sentence representation is a competitive baseline for comparing the quality of word vectors.
Table 8 shows the evaluation of pretrained word vector models (GloVe and fastText) and our wan2vec vectors on several text classification tasks (SST, MRCP, and SICK-E). The results of wan2vec are around 10 points lower than the ones achieved by GloVe and fastText. In order to test if this is because the small coverage of vocabulary in our system, we have performed the evaluation with a lexical set for GloVe and fastText that was equivalent to wan2vec. As can be seen, the results are now comparable and, for tasks MRPC and SICK-E our system, with the weighting function
This experiment highlights that, although for the intrinsic evaluation the reduced vocabulary was not a problem, for the tasks contained in the SentEval toolkit this is a clear limitation.
Transfer test results for pretrained embeddings models and wan2vec embeddings. All vectors have 300 dimensions
Table 9 shows the evaluation of pretrained word vector models (GloVe and fastText) and our wan2vec vectors on the semantic textual similarity benchmarks (STS’12, STS’13, STS’14, STS’15, STS’16). The tests show the same general behaviour than the previous ones. With the entire vocabulary, our system is not competitive versus GloVe and fastText, while, with the same vocabulary, wan2vec reaches the best scores in every task except STS’13. It is worth to note that, following Tables 8 and 9, the weighting function
Evaluation of sentence representations on the semantic textual similarity benchmarks. The average of Pearson correlations is used for STS’12 to STS’16 which are composed of several subtasks. All vectors have 300 dimensions
SentEval also includes a series of probing15
Surface information: evaluates the ability of the sentence embedding for preserving the surface properties of the original sentence. The sentence length (
Syntactic information: evaluates the ability of sentence embeddings for preserving the syntactic properties of the sentences they encode. The bigram shift (
Semantic information: the tasks in these group require to understand the meaning of a sentence. The
The evaluation on the probing tasks are presented in Table 10. Again, the general tendency is that wan2vec only obtains comparable results when GloVe and fastText use the same amount of vocabulary; and, in the last case, fastText is always slightly better than wan2vec. Only for the task
Probing tasks accuracy with logistic regression. All vectors have 300 dimensions
We want to note that we only tested wan2vec with dimension 300, which is not our best dimension. Probably using other dimensions that have obtained better results in intrinsic evaluation, such as 100, the behaviour of the model would improve. In spite of this, as a conclusion, the extrinsic evaluation has demonstrated that it is necessary to have large vocabularies to obtain competitive results with every type of task.
In this paper, we introduce wan2vec a word embeddings model learned from Word Association Norms instead of large corpora. We applied the algorithm node2vec to a graph built over the Edinburgh Associative Thesaurus (EAT), a collection with 23,218 nodes. The result is a set of trained vectors that achieved better correlation with human judgments than other embedding models in the task of similarity and relatedness prediction.
The node2vec algorithm learn embeddings using a random walk that explores the neighborhoods over the nodes to capture a better representation of the graph structure and the diversity of the connectivity.
As for the weight of the edges, we took into account two different functions: inverse frequency (
We experimented with different walk length distances of node2vec, showing that the correlation tends to grow with higher distances and stabilizes above the length of 60. We believe that the default parameter of 80 is fair enough to have representative embeddings since higher values increase the complexity time of the experiments. Finally, we also evaluated a nodes pruning strategy, which consisted in keeping only those nodes that were strongly connected. Following the previous work by Deyne et al. [17] we kept only responses that also occur as stimuli and stimuli that were also given as responses. It resulted in a high loss of outcoming nodes and a high reduced overlap, that did not let us give a strong enough comparative benchmark.
The results we are reporting clearly outperform the ones obtained with some well-know pretrained vectors (word2vec, GloVe, FasText) trained in large corpora, like Wikipedia. This results are in line with the work by Deyne et al. [17], reaffirming the importance of the Word Association Norms as a resource for natural language processing tasks. This kind of resources reflect, to some extent, the mental representation of the words. Additionally, they are valuable for learning distributional representation of words.
On the contrary, the worst performance of the model has been reported when dealing with extrinsic evaluation. The tests we have conducted lead us to think that the reduced vocabulary is the main reason of these results. Therefore, this is one of the main issues the model has to deal with, as we explain in what follows.
One of the most important limitations of the model is its dependency of WAN compilations. There are two important issues related to that. First, collecting WANs is a hard and time consuming task. Not every language has this type of resources and the capability to collect it. Anyway, other methods based in corpus can take advantage of large digital collections of texts, like Wikipedia, online newspapers, etc., while WANs need an experimental approach, planning and very a specific treatment. Second, the number of words in WANs is very small compared with other corpora. As we have explained, in the larger resources based in associations, like EAT, the overlap with the benchmarks is high, but smaller than the obtained in methods that use large corpora. When comparing the performance of wan2vec with the others, it seems clear that the total number of tokens is not necessarily an advantage. Moreover, corpora like Wikipedia have many noisy elements that do not improve the behaviour of the vectors. However, it would be optimal to have models that could include a larger set of different words.
A possible solution for this problem would be to automatically generate word association norms between pairs retrieved by a medium-size corpus (i.e., BNC), and build a new resource that can account for syntactic, semantic and cognitive connections between words. There are some contributions [53] that have introduced interesting ideas on how a system can learn word associations from a WAN and extend it to larger collections. This is a task that we plan to develop in the future.
Footnotes
Acknowledgements
This research has been supported by projects PAPIIT IA400117 and IA401219 from the Universidad Nacional Autónoma de México; and CONACYT Fronteras de la Ciencia 2016-01-2225.
