Abstract
An important problem in large symbolic music collections is the low availability of high-quality metadata, which is essential for various information retrieval tasks. Traditionally, systems have addressed this by relying either on costly human annotations or on rule-based systems at a limited scale. Recently, embedding strategies have been exploited for representing latent factors in graphs of connected nodes. In this work, we propose MIDI2vec, a new approach for representing MIDI files as vectors based on graph embedding techniques. Our strategy consists of representing the MIDI data as a graph, including the information about tempo, time signature, programs and notes. Next, we run and optimise node2vec for generating embeddings using random walks in the graph. We demonstrate that the resulting vectors can successfully be employed for predicting the musical genre and other metadata such as the composer, the instrument or the movement. In particular, we conduct experiments using those vectors as input to a Feed-Forward Neural Network and we report accuracy scores comparable to those of other approaches relying purely on symbolic music, while avoiding feature engineering and producing highly scalable and reusable models with low dimensionality. Our proposal has real-world applications in automated metadata tagging for symbolic music, for example in digital libraries for musicology, datasets for machine learning, and knowledge graph completion.
Introduction
High-quality metadata is a prerequisite in many music information retrieval (MIR) tasks, like accessing symbolic music collections, music recommender systems and discovery [8]. Historically, the availability of this metadata has depended on manual human annotation labour, which is typically costly. Consequently, several systems have been proposed that automatically analyse music in order to annotate high-level features, some being abstract enough to be close to traditional metadata [5]. Most of these systems target the so-called content-based features, extracted directly from the audio signal.
In this work, we focus on symbolic music in MIDI format [4] due to its high availability on the Web and its popularity in tasks like music generation [50] and music knowledge graphs [34]. This popularity has also given birth to large MIDI datasets that need not only high-quality annotations but also scalable approaches for producing them. Despite their natural fit as a transcription format for music, MIDI files alone have several limitations, all concerning the fact that track information (instruments, vocals, metadata) is not standardized and often only implicit [43]. One way of addressing this is to express MIDI information using RDF and ontologies [34], allowing for the re-use and standardization of missing MIDI metadata at Web scale. The Lakh MIDI dataset [43] is another important MIR benchmark used in e.g. automatic audio transcription; however, only 31,034 of its 176,581 MIDI files (17.57%) are aligned with metadata from the Million Song Dataset [41].
Reconstructing this kind of high-level metadata from symbolic musical content, automatically and at scale, is a challenging task due to the existing semantic gap between the desired metadata and low-level music descriptors [10]. One way of addressing this is by analysing the content of the symbolic notation for predicting higher-level metadata, i.e. identifying symbolic patterns in melody, harmony, rhythm, structure, etc. that are characteristic of a certain genre or composer. So far, research has mainly focused on genre [9,32], emotion, and composer [17] classification, mostly using supervised machine learning techniques. However, traditional machine learning algorithms are limited by the need to perform some feature selection in the so-called feature engineering process [22]. In this domain, feature engineering consists of identifying and extracting from raw music data high-level features like pitch distribution, rhythmic patterns, and represented chords, with the intention of making them good predictors towards a target variable of interest [14].
In order to overcome this, embedding-based methods have been proposed. Embeddings are mappings that transform the symbolic representations of discrete variables (elements that cannot be naturally ordered such as words or songs) into numeric vectors. Each dimension of an embedding vector represents a latent feature that has been automatically learned. Avoiding more expensive representations such as the one-hot encoding, embedding vectors are useful at reducing the dimensionality of the input data of neural networks, a typical choice to address a learning task such as automatic classification. Such embedding-based methods have been successfully applied to textual [35] and graph data [21] and notably also on MIDI data for the task of automated music generation through Recurrent Neural Networks (RNN) [24,64] and self-learning techniques such as Variational Autoencoders (VAE) [50]. Graph embeddings have become a widely used and effective way to represent graph information in a way neural networks can easily process it [49].
In this paper, we propose MIDI2vec, a method for representing MIDI data as vector-space embeddings for automated metadata classification. First, we express MIDI files as graphs, assigning unique identifiers to specific characteristics and their values such as tempo, time signature, programs (instruments), and notes. Therefore, two MIDI files will be connected in the graph if they share the same resources (e.g. same instruments, chords, tempo, etc.). Second, we use graph embedding techniques – in particular, the node2vec algorithm – to learn a vector representation for each MIDI file from random walks over this graph. We leave the interesting alternative of traversing these MIDI files sequentially, i.e. following the trail of temporal event occurrence instead of contextual co-occurrence, as future work.
In other words, we assume the distributional semantics hypothesis [23] over MIDI features: notes or groups of notes used in similar contexts will tend to have similar meanings, and in particular, will be associated with similar features, even for high-level metadata such as genre, composer or instrument. More specifically, our contributions are:
– the conceptualisation of relevant symbolic features (pitch, timbre, tempo, time signature) of the MIDI space into a graph representation;
– the systematic application of a well-known graph embedding generation method to generate vector representations of whole MIDI files;
– the use of such learned embeddings to predict metadata for three datasets, achieving comparable accuracy to symbolic feature-based approaches without the need of feature engineering, scaling to Web-size datasets, and with one order of magnitude fewer dimensions.
To the best of our knowledge, this is the first time that graph embedding approaches are used for representing a whole symbolic music track, and for reliably predicting symbolic music metadata.
The rest of the paper is organised as follows. In Section 2, we survey related work. In Section 3, we briefly introduce graph embeddings. In Section 4, we describe our strategy to extract relevant symbolic data from MIDI files, to represent them as graphs, and to use these graphs to build MIDI embeddings. In Section 5, we run an experiment to predict genre and other metadata on three different datasets using the MIDI embeddings. Finally, we conclude and outline future work in Section 6.
The extraction of high-level metadata from musical content is a long-standing goal of the MIR community [8], and one of the purposes of the Essentia library [5]. Fulfilling this task is a step towards the automatic reconstruction of music metadata, although the semantic gap [10] between content and metadata shows that purely bottom-up approaches are hard. Nevertheless, automating the generation of high-quality metadata would be a great benefit to tasks like music knowledge graph completion and music emotion detection, for which MIDI has already been used [1,34].
In [13], a comprehensive study surveys techniques for genre classification based on symbolic music. Although the performance scores of the state of the art set the baseline to outperform, many are computed on different sets of classes, monophonic MIDIs, or genre-specific datasets (e.g. folk music). Nonetheless, the survey finds a large number of methods based on machine learning. For example, the instance-based nearest neighbour (NN) and k-nearest neighbours (kNN) classifiers are applied to genre prediction from MIDI in [32]. This work is further extended in [9] with linear discriminant classifiers (LDC) and by combining MIDI and audio features. Different data sources – audio, symbolic music, lyrics – are instead combined in [30]. However, these classical machine learning approaches suffer from the need for feature selection, which is costly and can overfit models [22].
Recent developments in neural networks have boosted work in vector space-based music metadata classification, using vectors computed from the audio signal. Some examples are genre-agnostic key classification [26] and jazz solo instrument classification [18]. However, these approaches still tackle the classification of mostly content-based features (e.g. timbre), and not high-level metadata (e.g. genre).
Both feature engineering – common in pre-deep-learning machine learning and purely symbolic approaches – and vector-space methods have advantages and limitations, and model overfitting can occur in both [22,61]. Because it is based on provided knowledge, feature engineering is faster than learning features from data – a process that can be computationally expensive – and is easily understandable to humans thanks to its intrinsically symbolic representation. On the other hand, vector space representations have the advantage of capturing latent features that might be hard or impossible for humans to describe symbolically; such a fuzzier representation can be an advantage when music information is ambiguous and loosely defined (e.g. “Allegro”, “Prelude”) [62]. Moreover, vector-space methods scale very well with the dataset size and the number of classification classes, where current feature-engineered methods show pitfalls [30]. Therefore, one of the aims of this work is to gain a better understanding of how feature-based and embedding-based techniques compare and perform at scale in the task of symbolic music metadata prediction.
While recent MIR research relies mostly on audio analysis for metadata prediction, symbolic notation is largely used for automated music generation with these kinds of models. An example is MusicVAE [50], a hierarchical variational autoencoder (VAE) that learns a latent space of musical sequences. Similarly, in [24] MIDI embeddings are employed for automatic music generation, representing all the notes played together at regular time steps. In [64], MIDI files are used for learning a set of embeddings representing different aspects of a pitch, during the training of an RNN for music generation.
Several efforts to build Knowledge Graphs about music have been described in the literature. Ontologies have been designed to represent music metadata about works, performances and tracks; some notable examples are the Music Ontology [44] – further extended with other modules such as the Audio Effects Ontology (AUFX-O) [63] and the Audio Features Ontology [3] –, DOREMUS [2], the Performed Music Ontology (PMO) [52], and CoMus [55]. Researching strategies for publishing and exploiting music knowledge is the main focus of several projects such as TROMPA [60] and Polifonia.
Vector embedding similarities on various semantic descriptors are applied in music recommendation in [29]. These vector space representations, which tie related artists, works and performances closer, eventually surface terminologies and ultimately linked vocabularies for music metadata [28].
Besides enabling different modes of interactive musical creation, embedding-based approaches do not require feature selection and can be mapped to other latent spaces, e.g. texts [59] or documents.
Graph embeddings are the result of the transposition of word embedding techniques – notably word2vec [35] and GloVe [38] – to networks. According to [15], a graph can be defined as a set of vertices (nodes) together with a set of edges connecting them.
In 2014, Perozzi et al. published DeepWalk [39]. The core idea of this work is the use of random walks in the graph to generate sequences of nodes. The number and the length of link paths between two nodes impact the probability of those two nodes being selected together in a random walk. In other words, the more connections two nodes share, the more likely they are to appear in the same walk. Two nodes are connected if there exists one or more paths of edges between them; a connection can consist of a single edge, in which case we speak of directly connected nodes.
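The random-walk idea can be sketched in a few lines of Python. The adjacency-dict representation and the node names below are illustrative toys of ours, not the actual DeepWalk implementation:

```python
import random

def random_walks(graph, num_walks=10, walk_length=5, seed=42):
    """Generate DeepWalk-style uniform random walks.

    `graph` is an adjacency dict: node -> list of neighbour nodes.
    Each node starts `num_walks` walks of up to `walk_length` nodes.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = graph[walk[-1]]
                if not neighbours:  # dead end: stop the walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy graph: two MIDI nodes sharing the feature node "tempo_120"
graph = {
    "midi_1": ["tempo_120", "note_60"],
    "midi_2": ["tempo_120"],
    "tempo_120": ["midi_1", "midi_2"],
    "note_60": ["midi_1"],
}
walks = random_walks(graph, num_walks=2, walk_length=4)
```

Because walks move only along edges, nodes that share many connections co-occur often in the generated sequences, which is exactly the signal the subsequent embedding step exploits.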
DeepWalk has been extended by node2vec, which biases the random walks with a return parameter p and an in-out parameter q, allowing the exploration to interpolate between breadth-first and depth-first behaviour.
Other notable embedding-based techniques have been proposed for representing nodes in a graph, such as rdf2vec [49], entity2vec [37], and graph2vec [36], among many others; a regularly updated list of software for embedding generation is available online.
The MIDI format does not present a graph structure: it consists of a time-based linear succession of events, called messages.
MIDI to graph
We propose a preliminary conversion of a MIDI file into a graph. As shown in Fig. 1, a MIDI file is represented as a node, connected to the nodes representing its tempo, time signature, programs (instruments), and groups of notes.

Schema of the graph generated from MIDI.
In the context of graph embeddings computation, the practice is to take into account only connections between entities, i.e. nodes represented by identifiers. Literal values (text, numbers, etc.) are normally ignored [49], or some shrewdness is applied, such as the use of contiguity windows [25]. In fact, literals can uncontrollably increase the number of nodes, in particular in the presence of continuous values or entity-specific textual annotations. The effect is a very sparse graph – a graph is considered sparse when its number of edges is much smaller than the maximum possible number of edges.
The continuous values are then discretised in partitions, each one representing a range of 10 bpm. For example, a tempo of 125 bpm falls in the 120–130 partition.
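A minimal sketch of this discretisation; the `tempo_` node naming convention is a hypothetical choice of ours:

```python
def tempo_partition(bpm):
    """Map a continuous tempo value (bpm) to a 10-bpm-wide partition node id."""
    low = int(bpm // 10) * 10
    return f"tempo_{low}-{low + 10}"

node = tempo_partition(125)  # a 125 bpm tempo joins the 120-130 partition
```

Binning keeps the number of tempo nodes bounded, so that many MIDI files share each tempo node instead of every file introducing its own literal value.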
The full list of MIDI programs is available at
Each group of simultaneous notes is represented as a node, described by:
– the maximum duration of the notes in the group, discretised in classes of 100 ms;
– their average velocity;
– all the pitches, identified by their standard MIDI numbers.
Each group has an identifier that is deterministically computed from its content using a hash function. These groups are then linked to the relative MIDI node. This ensures that two identical groups of notes have the same identifier, linked to all MIDI nodes in which this group appears, without any further effort.
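This hashing scheme can be sketched as follows; the field separator and the use of MD5 are our assumptions, as any stable hash over the normalised group content works:

```python
import hashlib

def group_id(duration_class, avg_velocity, pitches):
    """Deterministic identifier for a group of simultaneous notes.

    The id depends only on the group content, so identical chords in
    different MIDI files map to the same graph node. Pitches are sorted
    so that the id is independent of the order in which notes appear.
    """
    payload = f"{duration_class}|{avg_velocity}|{sorted(pitches)}"
    return "group_" + hashlib.md5(payload.encode()).hexdigest()[:12]

# The same C-major triad, listed in two different orders, gets one node id
a = group_id(100, 64, [60, 64, 67])
b = group_id(100, 64, [67, 64, 60])
```

Sorting before hashing is the key design choice: it makes the identifier a pure function of the chord content rather than of the note ordering in the file.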
MIDI2vec does not encode any information concerning time, in particular about sequences of notes occurring in the same channel. This information is undoubtedly relevant and crucial in music representation. Nevertheless, we decided not to include the time dimension in this experiment for two main reasons. First, the inclusion of order and sequentiality in a graph representation is not very common, and few works have addressed this topic so far [45,54]. Second, it would require some design choices, among them the number of consecutive notes to be grouped and whether or not to represent pauses in the graph. For these reasons, we leave the encoding of subsequent notes for future work.
The connections between nodes link each MIDI file to the feature nodes describing it.
Formally, we realise a graph G = (V, E), in which the set of vertices V contains the MIDI files and their feature values (tempo, time signature, programs, note groups), while the set of edges E contains the connections between each MIDI file and its features.

MIDI graph example, showing possible connections between MIDI #1, #2, and #3.
In Fig. 2, an excerpt of an example graph is provided, in particular showing the connections between MIDI files through complete note groups, single notes, duration, etc. In this kind of graph, two MIDI tracks sharing multiple chords (or other elements) are more likely to appear in the same random walk. This representation aims to track at the same time the presence of specific chords and of quick and long notes, which can respectively characterise a more virtuosic or lyrical composition.
The graph is saved in the edgelist format, which includes all couples of connected nodes. In the edgelist, each line of text represents an edge and contains the identifiers of the two involved nodes separated by blank spaces. In practice, the MIDI files are read and messages are sequentially converted and appended to the edgelist. This edgelist is the output of this first conversion and feeds the second part of the approach, described in the next section. The graph thus obtained contains only information extracted from the MIDI content, not including any kind of external metadata.
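The conversion to the edgelist format can be sketched as below; the `midi_features` mapping and the node naming scheme are hypothetical stand-ins for the output of the parsing step described above:

```python
def to_edgelist(midi_features, path=None):
    """Serialise per-file feature links as edgelist lines.

    `midi_features` maps a MIDI node id to the feature node ids it is
    connected to (tempo partition, time signature, programs, note groups).
    Each output line is "<node_a> <node_b>", one edge per line.
    """
    lines = []
    for midi_node, feature_nodes in midi_features.items():
        for feature_node in feature_nodes:
            lines.append(f"{midi_node} {feature_node}")
    if path is not None:
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")
    return lines

midi_features = {
    "midi_1": ["tempo_120-130", "ts_4/4", "program_0", "group_ab12"],
    "midi_2": ["tempo_120-130", "ts_3/4", "program_48"],
}
edges = to_edgelist(midi_features)
```

Note that the shared `tempo_120-130` node appears in edges of both files; such shared feature nodes are what connect MIDI files to each other in the resulting graph.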
Embeddings are computed on the output graph of the previous process with the node2vec algorithm.
In practice, each node in the graph is selected as the starting node for a random walk, occupying its first slot. The second slot will be occupied by one of the nodes directly linked to the first node, according to the probability function. Iteratively, every slot of the random walk is occupied by one of the neighbours of the previous node. The number of walks to be produced for each node and their length are given as parameters. These walks are then processed by word2vec [35], which treats each walk as a sentence and each node identifier as a word, learning a vector for each node.
Following this procedure, a 100-dimensional embedding vector is computed for each node (vertex) of the input graph. Each dimension of the vector cannot be attributed to a specific feature of the described item – for example, the tempo – but rather represents latent features learned by the embedding algorithm. We apply a post-processing step in order to keep only the vectors representing MIDI files, discarding those of the feature nodes.
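Assuming MIDI nodes are distinguishable by a hypothetical `midi_` identifier prefix, this post-processing step reduces to a simple filter over the embedding dictionary:

```python
def keep_midi_vectors(embeddings, midi_prefix="midi_"):
    """Discard feature-node vectors, keeping only MIDI-file embeddings."""
    return {node: vec for node, vec in embeddings.items()
            if node.startswith(midi_prefix)}

# Toy 2-dimensional embeddings for illustration
embeddings = {
    "midi_1": [0.1, 0.2],
    "midi_2": [0.3, 0.4],
    "tempo_120-130": [0.5, 0.6],
}
midi_vectors = keep_midi_vectors(embeddings)
```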
The vectors thus obtained can then be used as input to algorithms for tasks such as classification and clustering. In the experiment detailed in the following section, we use them as input to a neural network for classification. All vectors used in our experiment have been computed using the same configuration of node2vec.
Evaluation
We evaluate this strategy in relation to three different goals, involving three different MIDI datasets. These goals are detailed in the next sections and consist respectively in predicting the genre, some high-quality metadata, and some user-defined tags.
For each goal, we perform an experiment that relies on a common procedure. MIDI embeddings are generated on the dataset using MIDI2vec. A Feed-Forward Neural Network receives the MIDI embeddings as input (100 dimensions) in batches of size 32. The network is detailed in Fig. 3. The choice of a Feed-Forward NN is due to the fact that we do not capture the MIDI content as a temporal series – which would be necessary for working with CNNs – but as a list of connections.

Scheme of the neural network.
The set of labels used for training and testing changes according to each experiment. However, it is worth reminding the reader that those labels have not been used in the embedding task and, consequently, they are not directly included in the embedding information. The neural network consists of 3 dense layers. The hidden layers count 100 neurons each and use the rectified linear unit (ReLU), a classic activation function in Deep Learning which returns 0 when the input is negative and the input value itself when it is positive; ReLU is widely used because of its simplicity and its empirically demonstrated fast convergence. The output layer uses the sigmoid function, which transforms its input into a value between 0 and 1. The network is evaluated in 10-fold cross-validation: the dataset is split into 10 groups, choosing in sequence each of them as test set (10%) and the remaining 9 as training set (90%).
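The forward pass of such a network can be sketched in plain Python for illustration. We assume here two 100-unit ReLU hidden layers feeding a sigmoid output, and a 5-class output as in the genre experiment; the exact layer arrangement, the toy random initialisation, and the class count are our assumptions, and a real experiment would train this with a deep learning framework:

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def dense(v, weights, bias):
    # weights has shape out_dim x in_dim; returns a vector of out_dim values
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def init(out_dim, in_dim, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

rng = random.Random(0)
w1, b1 = init(100, 100, rng)  # hidden layer 1: 100 ReLU units
w2, b2 = init(100, 100, rng)  # hidden layer 2: 100 ReLU units
w3, b3 = init(5, 100, rng)    # output layer: one sigmoid unit per class

def forward(embedding):
    """Map a 100-dimensional MIDI embedding to per-class scores."""
    h1 = relu(dense(embedding, w1, b1))
    h2 = relu(dense(h1, w2, b2))
    return sigmoid(dense(h2, w3, b3))

scores = forward([0.01] * 100)
```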
For the first two goals (Sections 5.1 and 5.2), a further experiment requires a preliminary split of the dataset into 10 equal folds, in order to alternately use 1 of them as test set and the remaining 9 as training set, implementing a complete 10-fold cross-validation (CCV). The embeddings are generated using exclusively the training set, while the vectors representing the MIDI files in the test set are computed as the average of the embeddings of the feature nodes they are connected to.
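The inference of test-set vectors can be sketched as an average over trained feature-node embeddings (2-dimensional toy vectors here; feature nodes never seen in the training graph are skipped):

```python
def infer_midi_vector(feature_nodes, trained_embeddings, dims=100):
    """Approximate an unseen MIDI file's vector by averaging the trained
    embeddings of the feature nodes it is connected to. Feature nodes
    absent from the training graph contribute nothing."""
    known = [trained_embeddings[n] for n in feature_nodes
             if n in trained_embeddings]
    if not known:
        return [0.0] * dims
    return [sum(col) / len(known) for col in zip(*known)]

trained = {"tempo_120-130": [1.0, 3.0], "ts_4/4": [3.0, 5.0]}
vec = infer_midi_vector(["tempo_120-130", "ts_4/4", "group_unseen"],
                        trained, dims=2)
```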
Currently, the library deals with the unpitched notes in MIDI Channel 10 (reserved for percussion by specification) as if they had a pitch. Musical instruments can be classified as pitched or unpitched; when we ignore Channel 10 and use only programs representing pitched instruments, the results do not change significantly.
These experiments are available as notebooks at
In [30], McKay et al. perform a genre classification task on the SLAC dataset, published together with that work.

From left to right and top to bottom, notes, instruments, tempo and time signature of the MIDIs in the SLAC dataset (249 files). The average note is B3 (we have omitted all drum events in channel 10); the most frequent instruments (peaks) are acoustic grand piano, string ensemble 1, and distortion guitar. The average tempo is 203.16 bpm, estimated with [42]. The most common time signature is 4/4.
We perform a 5-class genre classification experiment as well as a 10-class experiment on the same dataset. In [30], the authors use different inputs for predicting the genre: symbolic music (S) –which is the MIDI content–, lyrics (L), audio (A), cultural features (C) (tags extracted from the Web) and the multi-modal combination of all of these features (SLAC). In particular, the symbolic information is organised around 111 features (1021 dimensions). Their work has been extended and improved in a more recent paper [31] including, among others, features about chords and simultaneous notes, for a total of 172 features (1497 dimensions). We will compare our approach with these works, taking into account these 5 variants of features being used. The results are reported in Table 1.
Accuracy of the genre classification. The reported values are the average (and standard deviation) of the cross-fold validation. In *N, *P, *T, *TS the embeddings have been computed on the sole notes, programs, tempos and time signature, while ALL includes all of them and *300 uses only the first 300 note-groups. Under CCV, the results of the complete cross-fold validation are reported.
Our approach slightly outperforms [30] when only symbolic data are used as input (S), with an accuracy of 86% for 5-class and 67% for 10-class prediction. In addition, our method also outperforms other variants, namely lyrics (in both settings) and audio (in the 5-class setting). The improvements made in [31] increase these scores by a few percentage points. We believe that the combination of melodic and chord features was crucial in this case and worth investigating in future work.
The same Table 1 also shows the accuracy scores obtained with different variations of the complete model (ALL); these variations compute the embeddings on the note nodes alone (*N), program nodes (*P), tempo nodes (*T), and time signature nodes (*TS). None of these single features reaches the accuracy score of their combination. It is not surprising that *N reaches the highest accuracy among those models, having been computed on a more populated graph – the number of note-related nodes largely exceeds that of the other node types.
Given the close results between the two best variations (ALL and *N), we studied their statistical significance, to understand if these two variations are likely to have the same accuracy mean. In order to do so, we extracted a t statistic computed on a 10 × 10-fold cross-fold validation, according to [6]. Applying this statistic to a Student’s t-test, we obtain their p-values, comparing them with the common significance level of 0.05.
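For illustration, a plain paired t statistic over matched fold accuracies looks as follows; note that [6] prescribes a variance-corrected variant for repeated cross-validation, which this simplified sketch does not reproduce:

```python
import math

def paired_t(acc_a, acc_b):
    """Plain (uncorrected) paired t statistic over matched accuracy scores.

    acc_a and acc_b are accuracies of two model variants on the same folds.
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Toy fold accuracies for two variants of the model
acc_all = [0.86, 0.84, 0.88, 0.85]
acc_n = [0.83, 0.84, 0.85, 0.82]
t = paired_t(acc_all, acc_n)
```

The resulting statistic is then compared against the Student's t distribution with n − 1 degrees of freedom to obtain a p-value.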
In the complete cross-fold validation (CCV) experiment, the accuracy scores are around 10% lower. This decrease is mostly due to the difference in computing the vectors of the nodes in the train set (embedding algorithm) and the test set (average of other nodes). However, the relative results across configurations remain consistent.

Confusion matrices of midi2vec predictions for the SLAC dataset.
Figure 5 shows the confusion matrix between the real and the predicted values (configuration ALL); no strong error patterns emerge. To inspect the embedding space visually, we project it in two dimensions with t-SNE. Similarly to Principal Components Analysis (PCA), t-SNE maps a high-dimensional space into a low-dimensional one; the algorithm computes the probability that a point A would choose point B as its neighbour, according to a Gaussian probability distribution centred at A.

2D representation of the embedding space learned by midi2vec from the SLAC dataset.
This task consists in predicting a set of metadata from the MIDI, namely the composer, the genre, the instruments, and the movement label.
We started by downloading a corpus of 438 MIDI files from MuseData, which is available on the old MuseData website.

From left to right and top to bottom, notes, instruments, tempo and time signature of the MIDIs in the MuseData dataset (439 files). The average note is E4 (we have omitted all drum events in channel 10); the most frequent instruments (peaks) are string ensemble 1, the acoustic grand piano, and the harpsichord. The average tempo is 188.96 bpm, estimated with [42]. The most common time signature is 4/4, with 3/4 also relatively frequent.

Confusion matrices of midi2vec predictions for the MuseData dataset. For the instrument predictions (c), the labels are Instrument, Orchestra, Orchestra + Soloist instrument, Orchestra + Voice, Piano, Voice.
MuseData also provides some metadata, like the composer name, the scholarly catalogue number, and a label for the movement. In order to obtain further information (i.e. the genre and the played instruments), we have interlinked each composition with the DOREMUS knowledge base [2], a dataset specialised in classical music metadata.
The interlinking process consists of three successive steps:
interlinking of the composer through the exact match on the full name. This limits the candidates for the composition interlinking to the sole compositions of the interlinked composer;
interlinking of the composition through the exact match on the catalogue number;
if no catalogue number match is found, the titles are involved in the process. A title can often contain other kinds of information, such as key, instruments, opus number, etc. For example, the title “Symphony No. 3 in E-flat Major” contains the order number and the key. For this reason, titles are tokenised through empirical methods based on regular expressions to separate the different parts of the string, which are used as input of the Extended Jaccard Measure [57].
Every composition can be linked to more than one MIDI file, in the case of works made of multiple movements. The movement labels have been cleaned by removing the order number, the key, the instruments, and any comments in parentheses.
The interlinking gives access to more specific metadata, mostly coming from controlled vocabularies [28], in particular composers (4 classes, i.e. Bach, Beethoven, Haydn, and Mozart), genres (10 classes), and instruments. For this last dimension, given the very large number of possibilities, we decided to reduce the number of classes to 6: instrument, orchestra, orchestra + soloist instrument, orchestra + voice, piano, and voice.
Furthermore, we also consider the movement label as a feature to predict, keeping only the labels occurring more than 10 times. Those labels include tempo indications (e.g. Allegro).
For each kind of metadata feature, the table reports the number of items, the number of distinct classes, and the average (and standard deviation) accuracy score for midi2vec, midi2vec with complete cross-validation, and jSymbolic.
The final accuracy (average of all the fold scores) is reported in Table 2. The best results are achieved for composer and genre prediction, and good results can be observed for all metadata. Looking at the confusion matrices:
For the composers, the best results belong to Bach (the most present in the dataset). The two Austrian composers Mozart and Haydn are, not surprisingly, quite confused with one another, as both belong to Classicism, unlike Beethoven (Classic-Romantic) and Bach (Baroque) [51]. The score for Beethoven reflects his under-representation (only 10 tracks) in the dataset (Fig. 8a);
The genres are much more specific than the ones investigated in Section 5.1. As a consequence, the greatest confusion occurs between couples of very similar genres;
While the instrument prediction performs well in identifying works for orchestra, piano solo, or a small ensemble of instruments, it reveals some unreliable classifications for voice-only pieces, probably due to the under-representation of the class in the dataset. In the same way, the approach is not able to distinguish compositions for orchestra, orchestra and voice, and orchestra and soloist, all classified under the class Orchestra;
Even if the movement labels have heterogeneous meanings, the network correctly predicts 7 out of 10 items, and some confusion patterns can be spotted.
Also in this case, we lose some accuracy (around 12-15%) when applying the complete cross-validation strategy. For comparison, we replicated the classification experiment described in [31], applying to MuseData an SVM classifier trained on the feature vectors computed by jSymbolic. We also used the jSymbolic vectors in combination with a Neural Network, but obtained worse performance scores.

2D representation of the embedding space learned by midi2vec from the MuseData dataset.
The Lakh MIDI Dataset (LMD) is a collection of 176,581 MIDI files, 31,034 of which are matched and aligned with entries of the Million Song Dataset [41].

From left to right and top to bottom, notes, instruments, tempo and time signature of the MIDIs in the Lakh dataset (31,036 files). The average note is B3 (we have omitted all drum events in channel 10); the most frequent instruments (peaks) are the acoustic grand piano, string ensemble 1, and acoustic guitar (steel). The average tempo is 213.15 bpm, estimated with [42]. The most common time signature is 4/4; 2/4 is also relatively frequent.
We extracted the tags associated with each matched track from two sources: MusicBrainz and The Echo Nest.
Differently from the experiment in Section 5.2, the size of the dataset allows us to further filter the data and extract a balanced dataset. In particular, we select all distinct classes which are represented by at least 50 instances. For each of these classes, we randomly select 50 instances. Table 3 shows the accuracy for the predictions, measured through 10-fold cross-validation, together with the number of classes (distinct tags) against which we run the classification. In addition, the table reports the accuracy scores obtained by an SVM classifier built on top of jSymbolic features. The comparison of these results reveals that latent features can largely boost the performance in tag classification, with evident benefit in real-world scenarios like automatic tag prediction.
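The balancing step can be sketched as follows; the `(item, label)` pair representation is an assumption of ours:

```python
import random
from collections import defaultdict

def balanced_subset(items, min_count=50, per_class=50, seed=0):
    """Keep only classes with at least `min_count` instances and sample
    `per_class` instances from each, yielding a balanced dataset.

    `items` is an iterable of (item, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in items:
        by_class[label].append(item)
    subset = []
    for label, members in by_class.items():
        if len(members) >= min_count:
            subset.extend((m, label) for m in rng.sample(members, per_class))
    return subset

# Toy data: class "rock" is kept (60 >= 50), class "ska" is dropped (30 < 50)
items = [(f"midi_r{i}", "rock") for i in range(60)] + \
        [(f"midi_s{i}", "ska") for i in range(30)]
subset = balanced_subset(items)
```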
For each kind of tag feature, the table reports the number of items, the number of distinct classes, and the average (and standard deviation) accuracy score for midi2vec and jSymbolic.
The confusion matrices are shown in Fig. 11. Among the most frequently mispredicted MusicBrainz classes, we find a strong presence of nationality tags.

Confusion matrices of midi2vec predictions for the Lakh dataset.
Most frequent prediction errors for tag classification, with the percentage of cases over the total of the class
In this paper, we hypothesise that symbolic music content in MIDI files, and its embedding representation in vector space, are a powerful tool for automated metadata classification tasks. Traditionally, applications of machine learning to this problem have encountered limitations in feature selection, and more recent embedding-based techniques have been only used for other tasks (e.g. music generation) or on different data (e.g. music metadata). In this paper, we propose MIDI2vec, a method to represent MIDI content as a graph and, subsequently, in a vector space through learning graph embeddings. We gather evidence that our hypothesis holds: MIDI2vec embeddings were successful in metadata classification, obtaining performances comparable to state-of-the-art methods based on feature extraction from symbolic music, with the added advantages of scalability, automating feature engineering, and reducing the required dimensions by one order of magnitude.
We plan on improving this work in various ways. Even if experiments revealed that the impact of unpitched notes in Channel 10 is minimal, we intend to assign a separate branch of the graph to percussion notes in the future, in order to distinguish them from the others while still taking them into account, with the goal of improving the overall performance.
When transformed into a flat graph, the MIDI content loses all time-based information in MIDI2vec, with the only exception of simultaneity. Given the importance of melodic patterns in a music piece, future work will investigate how this work can deal with note sequences. We plan to investigate a few different strategies. First, sequences of notes may be encoded similarly to the simultaneous note groups and included in MIDI2vec as a fifth node type. A second strategy may rely upon the inclusion in the graph embedding process of sequence embeddings like the Sequence Graph Transform (SGT), which is capable of capturing short- and long-term sequence features [45]. It is also possible to think of a MIDI file as a sequence of messages, analogous to a text document, making sequence-embedding techniques directly applicable.
Another study may combine MIDI2vec with other feature extraction techniques – e.g. the previously cited [31] – in an ensemble system, to exploit the best of the two methods. Finally, we would like to investigate if the learned features can be employed in predicting other kinds of data, for example the category of a videogame (adventure, shooting, fighting) from a dataset of soundtracks in MIDI [16].
A MIDI ontology and a corpus of over 300 thousand MIDI files in RDF format have been presented in [33]. Despite being an interesting target for MIDI2vec, the extraction of crucial information (like the duration of a note) from the dataset is hard. In the current version, the ontology faithfully reproduces the event structure of the MIDI files, while significant edges – e.g. among simultaneous notes or consecutive events – are missing. Still, the MIDI2vec approach does not exploit edges between consecutive groups of notes, while they may potentially impact performance. We plan to extend or map the MIDI ontology to solve this issue and enable MIDI2vec to work on such a corpus, e.g. to perform link discovery and knowledge graph completion. In particular, crowdsourcing strategies on these resources may be applied to create large annotated datasets to use as ground truth. Moreover, it would be interesting to extend this approach to other symbolic music notation formats, namely MusicXML.
According to some intuition from other works in the genre classification field [9], the computation should not necessarily involve the full length of the track. Experiments with different time spans or sample sizes among the graph edges can help in finding a trade-off between performance and embedding computation time. Recent approaches for including literal values in graph embeddings [11,27] could be included in MIDI2vec, to avoid the arbitrary choices that value-partitioning implies. Finally, we will use MIDI2vec in more applied contexts, such as the task of knowledge graph completion in knowledge bases with incomplete metadata entries [34].
Acknowledgements
This work is partly supported by the CLARIAH project funded by the Dutch Research Council NWO. This work is part of a project that has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 101004746 (Polifonia: a digital harmoniser for musical heritage knowledge, H2020-SC6-TRANSFORMATIONS).
