Abstract
The purpose of this research is to present a natural language processing-based approach to symbolic music analysis. We propose Mel2Word, a text-based representation that includes pitch and rhythm information, and a new natural language processing-based melody segmentation algorithm. We first show how to create a melody dictionary using Byte Pair Encoding (BPE), which finds and merges the most frequent pairs that appear in a collection of melodies in a data-driven manner. The dictionary is then used to tokenize or segment a given melody. Utilizing various symbolic melody datasets, we conduct an exploratory analysis and evaluate the classification performance of melody representation models on the MTC-ANN dataset. A comparison with existing segmentation algorithms is also carried out. The results show that the proposed model significantly improves classification performance compared with various melodic features and several existing segmentation algorithms.
Keywords
Introduction
Music and Language
Although no definitive conclusion has been reached, there is little doubt that music and language share some cognitive mechanisms and structural aspects. Despite differences in semantics and cultural function, a growing body of evidence suggests that language and music are more closely related than previously believed (Patel, 2003). From anthropological claims that language and music evolved from the same ancestral system (Brown, 2001) to empirical evidence that certain parts of the human brain system responsible for language transcend the two domains (Patel, 2003; Chiang et al., 2018; Maess et al., 2001; Koelsch et al., 2004; Tillmann et al., 2003; Brown et al., 2006), various scientific findings have revealed common features between music and language. The two share structural features and generate similar expectations for listeners (Besson and Schön, 2001). Studies of music and language take different perspectives, including speech and musical sound (Tervaniemi et al., 1999), melodic and rhythmic patterns (Patel et al., 1998), structural analysis (Lerdahl & Jackendoff, 1996; Patel, 2003; Fadiga et al., 2009; Fedorenko et al., 2009), and linguistic syntax (Patel, 2003; Hoch et al., 2011; Jung et al., 2015; Slevc et al., 2009). In the field of music cognition, music and language are considered to have common ground in their structural aspects; both are human universals in which perceptually discrete elements (e.g., words, chords) are organized into hierarchically structured sequences (e.g., sentences, melodies) according to syntactic principles (Lerdahl & Jackendoff, 1996; Patel, 2003; Fadiga et al., 2009; Fedorenko et al., 2009). As Ycart and Benetos (2020) pointed out, music and language both: 1) are continuous sounds; 2) can be transcribed in symbolic forms such as text and music scores; 3) possess a sequential structure; 4) follow a set of special rules, such as the grammar of natural language and music theory; 5) use these basic rule sets to fill in missing parts of a sequence; and 6) use these rules to decide whether a sequence is valid or not. Most notable in the relationship between music and language is that they are the only long-standing human creative activities that are communicated and used through symbolic representations originating in ancient times (Wołkowicz & Keselj, 2010).
MIR and NLP
These commonalities between music and language have facilitated natural language processing (NLP) approaches to music processing and analysis. NLP is a field of artificial intelligence (AI) that enables computers to analyze and understand human language. The NLP approach to music can be traced back to pioneering studies that introduced formal grammar. In the early stages, theoretical attempts were made to establish a musical language model and structure similar to the syntactic grammar of natural language (Roads & Wieneke, 1979; Smoliar, 1976). In particular, deeply influenced by Chomsky's generative grammar, the generative theory of tonal music (GTTM) (Lerdahl & Jackendoff, 1996) was developed based on a similar tree-structured hierarchical organization uniting musical "phrase groupings". This model, the first systematic application of language theory to music, had a great influence on music theory, music psychology, and cognitive musicology. Since then, various computational methods derived from GTTM have been developed for symbolic music processing, such as melodic segmentation, generation, and reduction (Hirai & Sawada, 2019; Hamanaka et al., 2008, 2019; Frankland & Cohen, 2004; Tsushima et al., 2020; Abdallah et al., 2016).
Especially in the field of music information retrieval (MIR), various attempts have been made to formulate musicological tasks as natural language processing problems that exploit the linguistic features of music. Some early approaches used traditional NLP models such as probabilistic grammars (Bod et al., 2001; Gilbert & Conklin, 2007; García Salas et al., 2011) and N-gram models (Wołkowicz et al., 2008). Several studies defined the similarity or distance between two musical excerpts using edit distance (Mongeau & Sankoff, 1990; Crawford, 1998) or N-gram measures (Downie, 1999; Uitdenbogerd, 2002) for melodic similarity and classification. Researchers have also increasingly used string-based methods from the NLP community for practical tasks such as pattern discovery (Conklin, 2002), melody reduction (Groves, 2016), composer recognition of musical pieces (Wołkowicz et al., 2008), and prediction (Conklin & Witten, 1995; Pearce, 2005).
Recently, NLP has drawn more attention by revealing a new potential in music analysis and music generation through the breakthrough of deep learning. In particular, the necessity of good word representation to achieve implicit contextual understanding in NLP has become apparent (Turian et al., 2010). NLP techniques such as word2vec (Mikolov et al., 2013) paved the way to represent words in a vector space. Using these methods, words are mapped into vectors in a data-driven manner, and the relationship between words can be captured according to the distance in this vector space. These word embedding techniques have been successfully absorbed into the MIR field as a new type of music representation of chords (Huang et al., 2016; Madjiheurem et al., 2016; Brunner et al., 2017), melody (Shin et al., 2017; Alvarez & Gómez-Martin, 2019), and other note patterns (Herremans & Chuan, 2017).
Research Question
While MIR has adopted a variety of NLP techniques, a fundamental and important issue in dealing with music as a language has been relatively overlooked: how can we define basic musical units for computational analysis that correspond to words in language? Just as sentences in a language consist of a hierarchical structure of clauses, phrases, words, and morphemes, melodies in music can be segmented in different ways according to theoretical approach and hierarchical level. In music, however, segmentation involves more complex issues: the terms we use for musical articulations (e.g., periods, phrases, motifs, and notes) reflect a linguistic borrowing of great antiquity, but the hierarchy is not always similar (Lidov et al., 2005), and there is no single and unambiguous set of terms (Cenkerová, 2017). Moreover, although motifs are generally considered one of the most fundamental units of music (Lerdahl & Jackendoff, 1996), it is not known how many notes a motif is made up of, and there are theoretically an infinite number of possible motifs (Sawada et al., 2020). Thus, melodic segmentation, which divides a sequence of notes into meaningful units, has been considered an important but complicated task in the field of MIR.
Sequence segmentation has been an important topic in the field of NLP as well. Because new words are continuously coined or casually created, NLP systems cannot handle all the vocabulary in the world. Thus, they divide words into meaningful sub-units so that all possible words can be represented by their combinations, avoiding the out-of-vocabulary (OOV) problem. One notable work is the approach of Sennrich et al. (2015), which utilized Byte Pair Encoding (BPE) (Gage, 1994). BPE is a data compression algorithm that iteratively replaces pairs of bytes that frequently occur adjacently with an unused byte. Sennrich et al. made rare and unseen words fit into neural network models by representing them as sequences of sub-word units through BPE. This approach has been actively used in numerous NLP studies since it was successfully applied to unsupervised word segmentation.
The main contribution of this study lies in the application of natural language processing (NLP) techniques to melodies, enabling the segmentation of melodies into novel encoding units suitable for semantic analysis. To this end, we introduce Mel2Word (M2W), a text-based melody representation. Mel2Word is composed of two processing steps. The first step converts the melody into a sequence of morpheme-level units, which consist of pitch intervals and inter-onset intervals (IOI). The second step segments this sequence into word-level units, which are combinations of morpheme-level Mel2Words obtained using BPE. Using these "musical words", we conduct melody classification and compare it to previous NLP-based methods. We adopted music classification in this study because it is related to various symbolic MIR tasks such as music similarity (Bountouridis et al., 2017; Park et al., 2019), pattern discovery (Conklin, 2009; Boot et al., 2016), genre recognition (Li & Sleep, 2004), and cultural origin prediction (Rodríguez López & Volk, 2015).
This paper proceeds as follows: we begin by discussing studies on vector representations of words in NLP and MIR, as well as existing work on melodic segmentation methods. We then present our proposed method, carry out exploratory data analysis on several datasets, and perform quantitative evaluation experiments on tune classification with the MTC-ANN dataset. We also compare this approach to existing segmentation methods to demonstrate its applicability.
Related Work
How Words Represent Meanings in NLP
The representation of word meanings is a fundamental challenge in NLP, and the vectorization of words aims to enable machines to comprehend word meanings. This section reviews two main approaches, 1) the vector space model and 2) word embedding, and discusses how they have been applied to melodies in the field of music research.
The idea of the Vector Space Model (VSM) is to represent each document in a collection as a point in a vector space (Turney & Pantel, 2010). In this model, points that are close together are semantically similar, and points that are far apart are semantically distant. The VSM rests on the statistical semantics hypothesis that statistical patterns of human word usage can be used to figure out what people mean. Two main assumptions underlie this statistical information: 1) the "bag of words" assumption, and 2) the distributional hypothesis. The first is the widely shared practice of treating the linguistic contexts in which a word occurs as unordered sets of words. Under this so-called bag-of-words assumption, the linguistic context of any given word is defined by which words co-occur with it and with what frequency. This assumption disregards sequential and syntactic information; however, the order in which words occur, the argument structure, and general syntactic relationships within sentences all provide important information about the meaning of words, which limits the extent to which semantic information can be extracted from text (Andrews & Vigliocco, 2010). The second assumption is motivated by the so-called distributional hypothesis (Harris, 1954), which proposes that the meaning of a word can be derived from the linguistic contexts in which it occurs. That is, similar words tend to occur in similar contexts; thus, the rows of such co-occurrence matrices can be used for estimating word similarities (Deerwester et al., 1990). Based on these assumptions, the VSM maps a document to a large number of content-related words or phrases, successfully translating textual document comparison into vector calculation. However, the VSM has limitations because it discards the original semantic relations of the text and abandons a large number of connections between words. Because this results in high-dimensional sparsity and high information loss, word embedding was proposed to deal with these issues (Bengio et al., 2003; Li & Yang, 2018; Ran & Han, 2020; Kazhuparambil & Kaushik, 2020).
Word embeddings are dense, distributed, fixed-length word vectors built from word co-occurrence statistics following the distributional hypothesis (Almeida & Xexéo, 2019). Their main goal is to map textual words or phrases into a low-dimensional continuous space to alleviate data sparseness and the small disjunct problem (Li & Yang, 2018). Since recent word embedding techniques generally induce a reduced, fixed number of dimensions, computation becomes more efficient compared to prior VSM approaches (Mikolov et al., 2013). The principle behind word embedding approaches remains the distributional hypothesis and the vector space model of meaning. Word embeddings have evolved into a research field in their own right because of their enhanced efficiency and several conceptual and practical advantages, such as the fact that they encode remarkably accurate syntactic and semantic word relationships (Mikolov et al., 2013). Word embedding encodes the semantic and syntactic information of words, where semantic information mainly correlates with the meaning of words, while syntactic information refers to their structural roles (Li & Yang, 2018). Word embeddings are commonly categorized into two types, 1) prediction-based and 2) count-based models, depending on the strategies used to induce them (Almeida & Xexéo, 2019; Baroni et al., 2014; Pennington et al., 2014). Embedding models derived from neural network language models are called prediction-based models since they usually leverage language models to predict the next word based on local data (e.g., a word's context). On the other hand, matrix-based models that use global information, generally corpus-wide statistics such as co-occurrence counts and frequencies, are called count-based models. Table 1 shows examples of models that have been actively used to infer the semantic meaning of words in NLP.
Representative vector representation models for inferring semantic word information in the field of NLP.
In terms of music, MIR research has embraced classical VSM methods from the beginning. A wide variety of MIR tasks based on VSM approaches include similarity (Marolt, 2008), classification (Madhusudhan & Chowdhary, 2019; Çoban, 2017), plagiarism (Müllensiefen & Pendzich, 2009), phrase recognition (Gulati et al., 2016), music structure analysis (Maddage et al., 2006), segmentation (Rodríguez López et al., 2014; Neve & Orio, 2005), feature extraction (Yanase et al., 1999), and music information retrieval (Melucci & Orio, 1999).
Word embedding techniques, which emerged more recently, have been successfully absorbed into the MIR field and have been actively employed as a new type of music representation. The word2vec approach was used in early research to learn music embeddings by predicting musical symbols based on neighboring symbols (Alvarez & Gómez-Martin, 2019; Chuan et al., 2020; Herremans & Chuan, 2017; Huang et al., 2016; Hirai & Sawada, 2019; Madjiheurem et al., 2016). In these studies, low-dimensional embeddings of symbolic melodies treated the concept of a 'word' as note-event-based or motif-based, resulting in a huge vocabulary with a long-tail distribution of words; hence only embeddings of the most frequently used terms were trained (Liang et al., 2020). Progress has since been made toward music-specific embedding models that are trained to contain structural information (e.g., measures, position) and various other types of information (e.g., tempo, instrument, pitch) at the note level (Liang et al., 2020; Zeng et al., 2021; Chou et al., 2021). A brief summary of current music embeddings is listed in Table 2.
A brief summary of music embedding studies.
As stated, to encode the semantic information of music, studies on music embedding have applied a variety of new input representations, parameters, and models. However, the input units are generally note-level events (single or simultaneous notes) or units of fixed size that lack inherent musical meaning, with the exception of Hirai and Sawada (2019), who utilized the GPR rules. On this basis, we propose a textual representation that can serve as an initial seed for NLP-based models that aim to understand semantic meaning, as they do for language. By applying approaches that infer the meaning of words to melodies, we aim to fundamentally improve how melodic terms are defined and vectorized so that they reflect musical semantics and contextual information.
Melody Segmentation
Segmenting a melody, commonly referred to as grouping or segmentation (Lerdahl & Jackendoff, 1996; Cambouropoulos, 2001), is a fundamental processing step for symbolic music analysis and its applications (Pearce et al., 2010). Just as speech is perceptually segmented into phonemes or words, which subsequently provide the building blocks for the perception of phrases and complete utterances, the low-level organization of melody into groups allows primitive perceptual units to be used in more complex structural music analysis and may alleviate demands on memory (Pearce et al., 2010). A number of models have been proposed for melody segmentation so far. We review some of the notable ones, dividing them into psychological models and computational models.
Psychological Models
The application of Gestalt psychology's perceptual grouping mechanism to musical perception is an essential method for melody segmentation in music theory and cognitive psychology. In this regard, the GTTM model and the Implication-Realization (I-R) model have had a strong influence on empirical music studies as well as melody segmentation and grouping.
Computational Models
Computational segmentation models in the symbolic domain are divided into two categories: rule-based and data-driven models (Ellis, 1996; Rodríguez López, 2016). Rule-based models apply predefined assumptions to determine whether sequences should be grouped or segmented, whereas data-driven models rely on the assumption that a series of events is likely to be grouped by a listener if they occur frequently; in other words, statistical regularities determine perceptual grouping.
These models have studied melody segmentation primarily at the phrase level rather than at lower levels in the structural hierarchy, such as motifs (Cenkerová et al., 2018). The efforts have mainly focused on the perception of melody boundaries or on understanding the structural segmentation of music; hence, the performance of the proposed models has mostly been evaluated as the degree of agreement with human ratings. A challenge here, however, is the subjective nature of human perception, in that there is no "correct" segmentation (Cenkerová et al., 2018), as well as the fact that only a few task-oriented empirical attempts at melodic analysis have been carried out. We believe that melody grouping based on subjective human perception should be distinguished from melody segmentation for MIR tasks. Therefore, rather than finding perceptually recognizable phrases, the goal of this study is to segment melody in a data-driven way and to find representations suited to symbolic MIR tasks.
Proposed Method
We introduce a novel text-based melody representation, which we term Mel2Word. The key idea is to segment melody into meaningful units, which are equivalent to words in language. This section describes the processing steps.
Morpheme-Level Text Encoding
The first step of Mel2Word is to encode a melody into a sequence of text units. The text unit satisfies the following requirements:
(1) It is notated in a textual format for melodic analysis using NLP techniques.
(2) It contains the core information about melodic features: pitch and rhythm.
(3) It is invariant to key, time signature, or other variables except for the melodic features.
(4) It represents melodies in the same context in an identical notation.
(5) The length of a single unit is fixed.
As indicated in requirement (1), the purpose of this encoding is to represent music in a textual format to facilitate the application of NLP methods. As indicated in requirement (2), this representation carries two main elements of the melody: pitch and rhythm information. To meet requirement (3), pitch and rhythm are notated as pitch interval and IOI rather than absolute pitch and duration, so that they are invariant to key and tempo. This also addresses requirement (4), whereby melodies in the same context are notated with an identical representation. Finally, for requirement (5), the length of a single text unit is fixed to a constant number of characters, a combination of digits and letters representing the pitch interval and IOI. Since this single text unit is the smallest unit that carries musical meaning, we call it a "morpheme-level" text unit.
An example of text encoding is shown in Figure 1. The details of the encoding rules are as follows. Each text unit consists of 1) a pitch feature that shows the direction and size of the interval, and 2) a rhythm feature represented by the IOI between two consecutive notes. The unit corresponds to two consecutive notes, which is regarded as the smallest musical unit with expressive meaning (Stein, 1979). Pitch information is represented by one of three letters, 'U' (up), 'D' (down), or 'E' (equal), denoting the direction, followed by a number between '0' and '12' indicating the size of the interval. To cut down on the number of redundant melodic units that are rarely used, intervals exceeding 12 are substituted by 12. To fix the length, the numbers from 0 to 9 are notated as two digits (for example, 1 is represented as '01'). For rhythm information, three-digit numbers are used to represent the inter-onset interval (IOI). When the IOI is converted to text, even small changes in value can be treated very differently; therefore, in order to minimize noise derived from minor variations, we quantize the rhythm. Quantization can be done with various rhythmic units, such as eighth notes, sixteenth notes, and thirty-second notes, each with its pros and cons. For example, the finer the unit, the better it can reflect rhythmic details such as dotted notes and triplets, but this may also result in an overly large number of rhythmic terms, and complex features may increase the computational load (e.g., when training the embedding model). Therefore, in this study, we quantize rhythm to a minimum unit of a 16th note, which is very common in general melodies. For better readability, the quarter note representing one beat is set to the default unit of 1, and the final rhythm value is multiplied by 100. For example, the number '100' represents a quarter note, and '200' represents a half note. In this study, the maximum rhythm value is set to 4 quarter notes, so the IOI quantized by 16th notes ranges from a minimum of '025' (16th note, original 0.25) to a maximum of '400' (whole note, original 4). According to these principles, the example on the left of Figure 1 has a pitch feature of 'E00' (Equal '00'), indicating that two consecutive notes are identical, and a rhythm feature of '100', which represents the value of a single beat (a quarter note). The example on the right has a pitch feature of 'D02', indicating that the melody moves down one whole tone, and a rhythm feature of '050', indicating an eighth note. Pitch and rhythm can be encoded either together or separately; the table in Figure 1 shows how pitch and rhythm can be encoded individually.

An example of melody notated in Mel2Word.
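To make the encoding concrete, the following is a minimal sketch of how such morpheme-level units could be produced. It is an illustrative re-implementation under the rules described above, not the authors' code; function and variable names are our own.

```python
# A minimal, illustrative sketch of morpheme-level Mel2Word encoding following
# the rules above. Function and variable names are our own assumptions.

def encode_pair(pitch_prev, pitch_next, ioi_beats):
    """Encode two consecutive notes as one fixed-length text unit.

    pitch_prev, pitch_next: MIDI pitch numbers.
    ioi_beats: inter-onset interval in quarter-note beats.
    """
    interval = pitch_next - pitch_prev
    direction = 'U' if interval > 0 else 'D' if interval < 0 else 'E'
    size = min(abs(interval), 12)                  # intervals above 12 are capped at 12
    # Quantize the IOI to 16th notes (0.25 beats), clip to [0.25, 4] beats,
    # and multiply by 100 so that a quarter note becomes '100'.
    ioi_q = min(max(round(ioi_beats * 4) / 4, 0.25), 4.0)
    return f"{direction}{size:02d}{int(round(ioi_q * 100)):03d}"

def melody_to_morphemes(pitches, onsets):
    """Convert parallel lists of pitches and onset times (in beats) into morpheme units."""
    return [encode_pair(p0, p1, o1 - o0)
            for (p0, p1), (o0, o1) in zip(zip(pitches, pitches[1:]),
                                          zip(onsets, onsets[1:]))]
```

Applied to the two cases in Figure 1, this sketch would produce 'E00100' and 'D02050', matching the combined pitch-and-rhythm notation used in the remainder of the paper.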
Morpheme-Level Units to Word-Level Units
The second step is to merge the encoded morpheme-level units into word-level units. This step consists of dictionary generation and tokenization using BPE.
Dictionary Generation
We use BPE for dictionary generation. BPE is a bottom-up approach that gradually builds a vocabulary from character-level units. It was originally devised as a data compression technique that iteratively replaces the most frequently occurring pair of bytes in a sequence with a single unused byte (Gage, 1994). This data-driven approach has been effectively applied to word segmentation: Sennrich et al. (2015) proposed BPE-based segmentation to acquire a vocabulary that provides a good compression rate of the text by merging character sequences. The basic process of BPE in Sennrich's work is as follows:
1) The symbol vocabulary is initialized with a character-level vocabulary. A special word-end symbol '.', a period mark, is appended to each word so that the original word-level tokenization can be restored.
2) Every pair of symbols is counted iteratively, and each occurrence of the most frequent pair is replaced with its concatenation as a new symbol. For example, 'A' and 'B' are replaced with 'AB'.
3) Each merge operation creates a new symbol representing a character N-gram, and the final symbol vocabulary size is the number of merge operations in addition to the initial vocabulary size.
The process of creating a dictionary using the BPE algorithm is demonstrated with a toy example in Figure 2. The training data initially contains 'low', 'lower', 'newest', and 'widest'. Each word occurs the number of times indicated next to it.

An example of sub-word dictionary generation through BPE.
The first step is to split all the words in a dictionary into single characters and take the union as vocabulary (Figure 2(a)). Given that a pair of ‘e’ and ‘s’ is the most frequent, it is merged into ‘es’ (Figure 2(b)). When this operation is repeated twice, a pair of ‘es’ and ‘t’ becomes ‘est’ (Figure 2(c)). When repeated ten times, the dictionary and vocabulary set become as shown in Figure 2(d).
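The merge loop of this toy example can be sketched as follows. This is a simplified, Sennrich-style re-implementation for illustration only; the word frequencies below are hypothetical stand-ins for the counts shown in Figure 2.

```python
from collections import Counter

# A simplified Sennrich-style BPE sketch of the toy example in Figure 2
# (illustrative only; word frequencies are hypothetical).
vocab = {'l o w .': 5, 'l o w e r .': 2, 'n e w e s t .': 6, 'w i d e s t .': 3}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation in all words."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[' '.join(merged)] = freq
    return new_vocab

for _ in range(10):          # ten merge operations, as in Figure 2(d)
    vocab = merge_pair(most_frequent_pair(vocab), vocab)
print(vocab)                 # e.g. {'low.': 5, 'low e r .': 2, 'newest.': 6, ...}
```

With the hypothetical counts above, the first merge joins 'e' and 's' into 'es' and the second joins 'es' and 't' into 'est', mirroring steps (b) and (c) of the figure.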
We adapt the BPE technique to the text-based melody representation as follows:
1) The melody is converted into a morpheme-level text sequence; this text unit corresponds to the character-level unit in Figure 2.
2) The most frequent pair of two consecutive morpheme-level text units is selected.
3) The selected pair is merged with the '_' symbol.
As an example, consider a melody sequence M0 = ABABABC, where each character represents a morpheme-level text unit. This sequence is initialized as single units: M1 = A,B,A,B,A,B,C. In the first iteration of merging, we convert A,B (which occurs most frequently in the sequence) to A_B to obtain the segmented sequence M2 = A_B,A_B,A_B,C. Repeating this process once again, the most frequently occurring consecutive pair is A_B,A_B. Thus, the melody sequence M0 = ABABABC is segmented into the melody sequence M3 = A_B_A_B,A_B,C after two iterations. Figure 3 shows an example of segmenting a melody and creating a Mel2Word dictionary (the characters used in the preceding explanation are marked in brackets).

An example of creating word-level units of Mel2Word using pitch features for dictionary generation through BPE.
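Under the three steps above, dictionary generation can be sketched as follows. This is our own simplification for illustration: the maximum-length constraint of 11 morpheme-level units anticipates the setting described later in the Dictionary Generation experiment, while names and implementation details are assumptions.

```python
from collections import Counter

# An illustrative sketch of Mel2Word dictionary generation through BPE
# (our own simplification; not the authors' implementation).
MAX_LEN = 11   # a single Mel2Word spans at most 11 morpheme-level units (12 notes)

def merge_once(melody, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` in one melody."""
    out, i = [], 0
    while i < len(melody):
        if i < len(melody) - 1 and (melody[i], melody[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(melody[i])
            i += 1
    return out

def build_dictionary(corpus, num_merges):
    """corpus: list of melodies, each a list of morpheme-level units."""
    dictionary = {}                                   # word-level unit -> frequency
    for _ in range(num_merges):
        pairs = Counter()
        for melody in corpus:
            pairs.update(zip(melody, melody[1:]))
        # Pick the most frequent pair whose merge stays within MAX_LEN morphemes.
        valid = [(p, c) for p, c in pairs.most_common()
                 if '_'.join(p).count('_') + 1 <= MAX_LEN]
        if not valid:
            break
        (a, b), count = valid[0]
        new_token = a + '_' + b
        dictionary[new_token] = count
        corpus = [merge_once(m, (a, b), new_token) for m in corpus]
    return dictionary

# build_dictionary([['A', 'B', 'A', 'B', 'A', 'B', 'C']], 2) first merges
# ('A', 'B') into 'A_B' and then ('A_B', 'A_B') into 'A_B_A_B', as in the text.
```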
Tokenization
Once the dictionary is generated, each melody is segmented into the word-level units of Mel2Word, analogous to the tokenization process in NLP. This process is based on two criteria: 1) the length of a single text unit, and 2) the frequency of occurrence. The BPE algorithm implicitly assumes that longer or more frequent units are more essential. Following the first criterion, a melody is segmented in descending order of unit length, with longer Mel2Word units taking precedence. The second criterion is applied when the lengths of Mel2Word units are equal, so the melody is tokenized in order of frequency. For example, to tokenize a raw melody with a dictionary whose maximum unit length is 11, the melody is scanned from beginning to end for spans that match the most frequent Mel2Word phrase of length 11, then for the next most frequent Mel2Word of length 11, and so on. If no matching phrase of length 11 is found, the tokenization process moves on to the most frequent Mel2Word of length 10. This sequential procedure preferentially merges longer and more frequent phrases, matching all word-level Mel2Word phrases between the maximum length of 11 and the minimum length of 2 in the Mel2Word dictionary. The tokenization process ends when only morpheme-level units of Mel2Word without any matching phrases remain. Figure 4 shows an example of the process of tokenizing a melody with the generated Mel2Word. As shown, since the example phrase 'U03100_D02050_D01050_D02050' appears inside the melody, the matching parts (in the red dotted box) are merged and tokenized into Mel2Word as one phrase.

An example of segmenting a melody with the word-level units of Mel2Word.
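The greedy matching procedure described above can be sketched as follows, reusing a dictionary of the form produced by the previous sketch. Again, this is an illustrative re-implementation; the authors' tokenizer may differ in detail.

```python
# An illustrative sketch of the greedy tokenization described above
# (our own re-implementation; names are assumptions).
def tokenize(morphemes, dictionary):
    """morphemes: list of morpheme-level units; dictionary: word-level unit -> frequency."""
    tokens = list(morphemes)
    # Longer word-level units first; among equal lengths, more frequent units first.
    entries = sorted(dictionary.items(),
                     key=lambda kv: (kv[0].count('_') + 1, kv[1]),
                     reverse=True)
    for word, _freq in entries:
        parts = word.split('_')
        n = len(parts)
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + n] == parts:       # spans merged earlier no longer match
                out.append(word)
                i += n
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

For instance, tokenizing the morpheme sequence of Figure 4 with a dictionary containing 'U03100_D02050_D01050_D02050' would replace the matching span with that single word-level token.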
Experiments
Datasets
We collected symbolic monophonic melody datasets to examine various musical sequences from different genres and to construct a Mel2Word dictionary. Table 3 gives a brief overview of the datasets we used. Out of a total of 3,044 digitally encoded songs in the full corpus, we discarded melodies that could not be read with music21.
A summary of the monophonic melody datasets used for dictionary generation.
Dictionary Generation
We converted all the melodic sequences in the datasets from MIDI to morpheme-level text units and then performed BPE to build the Mel2Word dictionary. Each iteration of BPE produced a new word-level Mel2Word unit, and the iteration stopped when there were no more recurring Mel2Word pairs. Following the claim of Stein (1979) that a meaningful phrase spans 2 to 12 notes, the maximum length of a single Mel2Word was set to 11 units (i.e., 12 notes). If the merged pair was longer than 11, the second most common pair was chosen instead. Since the morpheme-level units in Mel2Word can encode pitch and rhythm together or independently, we can use these features to build three different types of dictionaries (pitch only, rhythm only, or both). We created dictionaries using all three features, and whenever a new Mel2Word was generated, its frequency of occurrence in the dataset was also saved in the dictionary for tokenization. Table 4 shows a summary of the dictionaries built in this experiment.
A summary of generated dictionaries.
Tokenization
Since the Mel2Word dictionary includes the frequency of occurrence of each term, it is possible to tokenize melodies with dictionaries of various sizes based on term prevalence. For example, using the 100 most frequently used Mel2Word phrases, we can create a dictionary of size 100.
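In terms of the earlier sketches, restricting the dictionary to its N most frequent entries before tokenizing might look like this (variable names are hypothetical, continuing the previous code):

```python
# Keep only the N most frequent word-level units, then tokenize with them
# (continuing the earlier sketches; variable names are hypothetical).
N = 100
top_n = dict(sorted(dictionary.items(), key=lambda kv: kv[1], reverse=True)[:N])
segments = tokenize(morphemes, top_n)
```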
With this approach, melodies can be divided into segments at different levels. Figure 5 demonstrates tokenizing melodies with two dictionaries of different sizes. A two-bar melody with 12 pitch intervals is tokenized into 7 segments with a small dictionary (N = 100) or 4 segments with a large dictionary (N = 5,000). For exploratory data analysis and evaluation, we tokenized melodies with dictionaries of different sizes, from small to large. The largest dictionary was built with the Mel2Word phrases that occurred at least 10 times (full-token N = 3529, 2043, and 5026 for pitch, rhythm, and pitch+rhythm, respectively).

Examples of tokenization according to different Mel2Word dictionaries (A melody from MTC-ANN).
Exploratory Data Analysis
We examined two datasets for exploratory data analysis to compare Mel2Word tokens across different genres. We chose the annotated Meertens Tune Collection (MTC-ANN), version 2.0.1 (van Kranenburg et al., 2016), a folk-tune dataset containing 360 melodies divided into 26 tune families annotated by musicological experts, and the Weimar Jazz Database (WJazzD) (Pfleiderer et al., 2017), a database of jazz solo transcriptions, as representative examples. Figure 6 shows the statistical distributions. The left side shows the number of songs by melody length (i.e., the number of segments in one song) when pitch intervals are used. It indicates that the range of segment lengths becomes narrower as the dictionary size grows. The same trend is found when rhythm intervals or both pitch and rhythm intervals are used. The right side shows the distribution of Mel2Word tokens by length when each dataset is tokenized with the Mel2Word dictionary. In both datasets, the combined (pitch+rhythm) feature has the most frequent Mel2Word at a length of 2, while pitch alone has the most at a length of 4. For rhythm alone, the lengths of Mel2Word are evenly distributed since repetition is more frequent.

Statistical distribution of datasets (Left: the number of songs according to the length of the melody. Right: the number of Mel2Word phrases according to the length of the Mel2Word).
Figure 7 is a visualization of these two different datasets using a wordcloud.

Wordclouds created by different datasets.

Examples of Mel2Word tokens in different datasets.
Evaluation
To empirically evaluate our approach, we conducted a tune classification experiment using the MTC-ANN dataset, which serves as a benchmark for music analysis in various studies (Boot et al., 2016; Bountouridis et al., 2017; Walshaw, 2017; Park et al., 2019). In this experiment, our primary focus was to demonstrate the effectiveness of our tokenization method, utilizing minimal features (pitch and rhythm) and a straightforward similarity evaluation. We compared the results of our proposed method with a basic baseline using raw morpheme-level representations, thereby emphasizing the robustness and significance of our approach without the need for more advanced algorithms or complex feature sets. To achieve this, we conducted the classification task using different input units corresponding to various dictionary sizes.
Mel2Word Embedding
For semantic vectorization of Mel2Word, we utilized word2vec, a widely used NLP technique to represent a word in a distributed vector space (Mikolov et al., 2013). The learned representation is obtained based on the affinity of words within a context window; as a result, words with similar meanings are located close together in the vector space. Word2vec has been actively exploited as a new representation in music analysis to capture semantic meaning in musical corpora (Alvarez & Gómez-Martin, 2019; Chuan et al., 2020; Herremans & Chuan, 2017; Huang et al., 2016; Hirai & Sawada, 2019; Madjiheurem et al., 2016). We also applied word2vec to Mel2Word tokens to investigate melodic similarity. For word2vec training, we used Gensim, a Python implementation of word2vec (Rehurek & Sojka, 2010). We chose the skip-gram model (predicting context words from the target word) over the CBOW model (predicting the target word from the context) because it is considered to better represent sparse words (Landgraf & Bellay, 2017). To construct a comprehensive and versatile embedding model for melodic representations, we utilized a melody corpus composed of the various datasets that were also used for dictionary generation (refer to Table 3). We tokenized all of the melodies using dictionaries of different sizes and trained several models to examine each dictionary. The basic model parameters were set to 512 dimensions, a context window size of 10, and 300 training iterations, based on our preliminary experiments. These models were used to obtain the vector representation of each Mel2Word, and a single melody vector was derived by averaging the vectors of all the Mel2Word phrases in the melody. Figure 9 shows the t-SNE representation of all training melodies vectorized by the word2vec model with full tokens. Here, we can observe that the melodies are well clustered by genre. Accordingly, we assessed tune classification performance on MTC-ANN using these models built with varying dictionary sizes and segmentation models.

t-SNE visualization for training melodies for word2vec (full-token).
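The embedding setup described above can be reproduced roughly as follows with the gensim word2vec implementation. The parameter values (skip-gram, 512 dimensions, window of 10, 300 iterations) follow the paper; the toy corpus and helper function are our own placeholders.

```python
import numpy as np
from gensim.models import Word2Vec

# A sketch of the word2vec setup described above, using the gensim 4.x API.
# Parameter values follow the paper; the corpus below is placeholder data.
tokenized_corpus = [['U02100_D02100', 'U01050', 'D02050'],
                    ['D02050', 'U01050', 'U02100_D02100']]

model = Word2Vec(sentences=tokenized_corpus,
                 sg=1,              # skip-gram, as chosen in the paper
                 vector_size=512,   # embedding dimension
                 window=10,         # context window size
                 epochs=300,        # training iterations
                 min_count=1)

def melody_vector(tokens, model):
    """Represent a melody as the average of its Mel2Word vectors."""
    return np.mean([model.wv[t] for t in tokens if t in model.wv], axis=0)
```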
Segmentation Algorithms
For comparative evaluation of the proposed algorithm, we performed melody classification using melody tokens segmented by existing segmentation models. First, we employed three segmentation algorithms implemented in the MIDI Toolbox (Eerola & Toiviainen, 2004): 1) the Gestalt-based approach described in Tenney and Polansky (1980), 2) the Local Boundary Detection Model (LBDM) proposed by Cambouropoulos (2001), and 3) the Markov model presented in Bod et al. (2001). Briefly, the Gestalt-based method detects "clang" changes, which correspond to large pitch intervals and large inter-onset intervals (IOIs). The Markov-model-based algorithm uses probabilities derived from the analysis of melodies; in this technique, the probabilities of phrase boundaries are derived from pitch class, interval, and duration distributions at the segment boundaries in the Essen folk song collection. The LBDM derives boundary strengths that are proportional to the degree of change between two intervals. For the segmentation boundaries, we used the default parameters of the toolbox for the Gestalt-based and Markov models: the Gestalt-based values are binary values of 0 and 1 and were used as-is, and for the Markov-model-based approach a boundary was set where the value exceeded 0; for the LBDM, a boundary was set where the value exceeded a threshold of 0.1, based on our preliminary analysis. Since these algorithms are defined in terms of individual notes rather than the relative values of intervals and IOI employed in Mel2Word, we set the segmentation boundary at the second note of two consecutive notes. The melodies were also segmented into fixed sizes of 2-gram and 3-gram. In addition, we implemented the GPR rules following Hirai and Sawada (2019): considering a series of four notes (n1, n2, n3, and n4), a boundary is placed between n2 and n3 if IOI(n2, n3) > IOI(n1, n2) and IOI(n2, n3) > IOI(n3, n4).
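As a small illustration, the GPR-style boundary condition quoted above can be sketched as follows; the function is our own and covers only this IOI-based rule, not the other segmentation algorithms mentioned.

```python
# A sketch of the GPR-style boundary rule quoted above (following the condition
# from Hirai and Sawada, 2019, as summarized in the text; details are assumptions).
def gpr_boundaries(onsets):
    """Given note onset times, return 0-based indices of notes that start a new segment."""
    iois = [b - a for a, b in zip(onsets, onsets[1:])]   # iois[i] = IOI between notes i and i+1
    boundaries = []
    for i in range(1, len(iois) - 1):
        # Boundary between n2 and n3 when IOI(n2, n3) exceeds both neighbouring IOIs.
        if iois[i] > iois[i - 1] and iois[i] > iois[i + 1]:
            boundaries.append(i + 1)                     # note n3 opens a new segment
    return boundaries
```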
We used music21, written in Python, whereas the existing segmentation methods are implemented in the MIDI Toolbox, written in MATLAB. The two libraries produced different melodic features for some melodies, presumably because they parse MIDI files differently (for example, the MIDI Toolbox misses some repeated notes, and they read multi-channel notes differently). Therefore, we excluded melodies whose pitch and rhythm features differed between the two libraries. As a result, we used a total of 2,752 melodies for the comparative analysis.
Similarity Metrics
In our study, to compare the similarity of two melodies, each melody was represented by the average of the vectors of all its Mel2Word segments in the word2vec embedding space. Cosine similarity was employed to determine vector similarity, as it is the most common way to measure the similarity of two frequency vectors (Turney & Pantel, 2010). Despite its limitation of discarding positional information, as discussed in Le and Mikolov (2014), we chose this approach considering its common use in measuring vector similarity, and it provides a straightforward and effective way to assess similarity in our context.
For two melodic sequences of Mel2Word tokens, Ma = (a1, a2, …, an) and Mb = (b1, b2, …, bm), let A and B be the averaged Mel2Word vectors of Ma and Mb, and let Ai and Bi be the components of vectors A and B, respectively. The cosine similarity cos(θ) is then expressed using the dot product and magnitude as:

$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \, \sqrt{\sum_{i} B_i^{2}}}$$
Evaluation Metrics
To evaluate classification performance, the similarity between the melodies of MTC-ANN, which has 26 classes (or tune families), was calculated, and the following three evaluation metrics were utilized:
Results
Comparison of Dictionary Size
Figure 10 shows the classification performance for different melodic features and dictionary sizes. As shown, all evaluation metrics indicate that the proposed method outperformed morpheme-level text units in tune classification. Even with a small dictionary (e.g., N = 100), performance improved considerably. For the pitch and combined (pitch+rhythm) features, performance improves modestly as the dictionary size increases, with the best performance obtained when full tokens are used. However, a larger dictionary does not necessarily increase classification accuracy. With a larger dictionary, we obtain richer word-level Mel2Word variants, but this can also have a negative impact because the frequency of particular terms decreases across the entire dataset. If a melody is tokenized into fewer segments, the amount of data (or segments) available for training is also reduced. Our results show that performance does not improve much as the dictionary size continues to increase, and in the case of rhythm, performance becomes lower. Therefore, we believe that further research with a larger dataset is needed to find the optimal dictionary size.

The classification result of the MTC-ANN dataset, where N denotes the dictionary size for melody tokenization.
The Gensim library allows us to find the word most similar to a given word, or to derive similarity values between words, through distances between the embedding vectors of the trained model. Here, a closer vector representation of two Mel2Word units indicates that the word2vec algorithm has modeled them as semantically closer in context. To show how the word vectors and embedding spaces differ across dictionaries, Figure 11 presents two examples of finding the most similar words through word2vec embedding models obtained from melodies tokenized with several dictionaries (N = 100, 500, and full). For example, for 'U01050_U02050' in the example on the left, the most similar word in the vector space trained on melodies tokenized with a dictionary of size N = 100 is 'U02050_U01050', with a cosine similarity of 0.55. We also used PCA to visualize the 3-D space of the embedding model built with each dictionary (left: N = 100; middle: N = 500; right: full token; red: the original phrase; blue: the most similar Mel2Word; yellow: all other Mel2Word phrases). We can see that as the size of the dictionary increases, more word-level Mel2Word variants are created and densely distributed. In addition, the Mel2Word phrases retrieved as most similar have similar pitch directions and contain or share the same pitch and rhythm content as the original phrases. This demonstrates that word embedding training can be used to vectorize melodic phrases that occur in similar contexts. Further investigation is required to determine whether these Mel2Word units are used more frequently in the same musical context and whether phrases that are close in the embedding space are perceived as more similar than distant phrases. This analysis is particularly important because such phrases would have been difficult to identify if pitch and rhythm had been split into discrete units. We consider this exploration a promising first step toward understanding the semantic meaning of melodic phrases and their perceptual relationships.

Examples of finding the most similar Mel2Word based on word2vec vectorization.
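Continuing the earlier gensim sketch, nearest-neighbour queries of this kind can be issued as follows; the query token is taken from the example in the text, and the reported neighbour and score are those of the N = 100 case in Figure 11.

```python
# Querying the trained model for the Mel2Word phrases most similar to a token
# (continuing the earlier word2vec sketch; assumes the token is in the vocabulary).
for token, score in model.wv.most_similar('U01050_U02050', topn=5):
    print(f"{token}\tcosine similarity = {score:.2f}")
# In the paper's N = 100 example, the nearest neighbour is 'U02050_U01050' (0.55).
```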
Comparison of Melodic Features
Figure 12 illustrates the classification performance of different melodic features, showing t-SNE visualizations of models trained with morpheme-level pitch, rhythm, and combined features, respectively, from left to right. The right-most visualization is from a model trained with the combined feature at the full-token level. In the lower right corner, we also report the feature-by-feature performance at the morpheme level and the full-token level to show how the performance of each feature varies. As shown, Mel2Word performed best with the combined feature at both the morpheme and full-token levels, while Mel2Word with rhythm alone performed worst. The t-SNE visualizations demonstrate that appropriate melodic features and tokenizations can facilitate more effective clustering, emphasizing the need for precise melodic representations.

t-SNE visualizations and feature-to-feature performance at morpheme-level and full-token level on the classification evaluation of the MTC-ANN dataset.
Comparison of Segmentation Models
The comparison results for the segmentation algorithms are shown in Figure 13, arranged from left to right in ascending order of AUC performance. The existing segmentation algorithms perform better than the morpheme-level units, but the proposed method showed the highest performance and outperformed all existing algorithms. Even more remarkable is that this approach builds more compactly compressed and effective melodic units. Table 5 summarizes the segmentation of the melody corpus using the existing segmentation algorithms. As shown, most existing algorithms generated much larger dictionaries containing many uninformative segments that appeared only once in the entire dataset, suggesting that they might not be the ideal approach. With our approach, even when all tokens were used, only 25 segments appeared once in the whole dataset. We expect that this approach will enable current NLP-based algorithms to be employed in a variety of existing MIR tasks by allowing small-to-large dictionary sets to be used for music analysis.

Comparison between segmentation algorithms.
Summary of tokenization of the melody corpus with existing segmentation algorithms (pitch+rhythm).
Conclusions
In this study, we introduced Mel2Word, a text-based representation with pitch and rhythm information, and a natural language processing-based approach for tokenizing melodies into word units. We applied our method to build Mel2Word dictionaries for exploratory data analysis across multiple datasets and performed tune classification experiments with word2vec word embeddings on the MTC-ANN dataset. We also conducted a comparative analysis with representative existing segmentation methods in the area of symbolic music research. As a result, the proposed method improved performance over basic melodic features across all evaluation metrics and outperformed the existing segmentation methods used in this experiment, showing that our approach can be employed efficiently and effectively for NLP-based word embeddings. One benefit of this data-driven method is that it can establish simple, concise, yet powerful text-based musical words that would be difficult to construct with a rule-based method. As demonstrated, the empirical experiments showed that classification performance is improved even with a small dictionary size (e.g., N = 100), indicating the potential applicability of Mel2Word to other kinds of computational music analysis. We also consider this method applicable to practical tasks such as automatic music composition: since it can identify distinct phrases and patterns for particular classes, it can be extended to genres, artists, emotions, and so on, and used to generate melodies based on them.
Beyond performance, the most promising aspect of this research is that it can serve as a starting point for comprehending the semantic relationships of melodic terms. Defining musical words as basic units for computational analysis of melodies also means that they can be "vectorized" as meaningful units. When we understand a language, we do not break words into individual characters or fixed-length chunks. We perceive the meanings of words as similar or different depending on their context and relationship to other words. For example, "word" and "vocabulary" differ in length by more than a factor of two, yet their meanings are similar. Likewise, as we observed in our example (Figure 11), "mi-fa-sol" and "re-mi-fa-sol" may have semantic meanings in music that are more closely related than other melodic segments. If such semantic relationships across melodic terms can be understood, we believe that a more qualitative level of musical analysis will become possible, helping to address some of the critical and challenging topics in computational music analysis. Consider, for example, the issue of plagiarism. While similarity analysis of symbolic melodies has great potential in the context of plagiarism (Yin et al., 2022), how would we define the creativity (or uniqueness) of a melody? In language, examining the similarity of "words" in documents enables the commercial and academic deployment of plagiarism detection systems (e.g., Turnitin). How about musical emotions? Psychologists have defined emotional vocabularies in language, and a quantitative analysis of these vocabularies helps to give a quantitative measure of emotion (Danner et al., 2001). Thus, we believe this simple music-to-NLP scheme could be a powerful means to employ the most recent advances in natural language processing, opening up potential for fundamental and practical music analysis and applications.
Footnotes
Action Editor
David Meredith, Aalborg University, Department of Architecture, Design and Media Technology
Peer Review
Aitor Arronte-Alvarez, University of Hawai’i at Mānoa, Center for Language and Technology
One anonymous reviewer
Contributorship
Saebyul Park: main author, writing, code implementation. Eunjin Choi: code development, data organization. Jeounghoon Kim: research questions, supervision. Juhan Nam: research direction, corresponding author, review editing.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Approval
This research did not require ethics committee or IRB approval. This research did not involve the use of personal data, fieldwork, or experiments involving human or animal participants, or work with children, vulnerable individuals, or clinical populations.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
