Abstract
Intonation is a complex suprasegmental phenomenon essential for speech processing. However, it is still largely understudied, especially in the case of under-resourced languages, such as Lithuanian. The current paper focuses on intonation in Lithuanian, a Baltic pitch-accent language with free stress and tonal variations on accented heavy syllables. Due to historical circumstances, the description and analysis of Lithuanian intonation were carried out within different theoretical frameworks and in several languages, which makes them hardly accessible to the international research community. This paper is the first attempt to gather research on Lithuanian intonation from both the Lithuanian and the Western traditions, the structuralist and generativist points of view, and the linguistic and modelling perspectives. The paper identifies issues in existing research that require special attention and proposes directions for future investigations both in linguistics and modelling.
Introduction
Alongside word stress and rhythm, intonation is an essential element of linguistic prosody. Intonation has several functions, such as highlighting grammatical structure, conveying emotion and focusing attention on relevant portions of the spoken message. Importantly, ample evidence from psycholinguistic studies has shown that intonation is used to identify phrasal boundaries (prosodic phrases). Hence, intonation plays a crucial role in parsing continuous speech both in infants during early language acquisition (Jusczyk
In recent years, intonation has attracted attention due to the rise of modern speech technologies. It has been confirmed that the incorporation of intonational information improves keyword extraction performance (Lezhenin
Solutions based on artificial neural networks, which are now widely used in speech signal synthesis, are essentially an aggregative approach – all knowledge of the generated speech signal is obtained by examining many language examples that might strongly differ in the contextual information provided in them. Thus, the speech signal generated by such a synthesizer will be mainly focused on the phonetic lexical content, not the suprasegmental properties of the language (including intonation). Moreover, as prosodic information varies at a slower rate in comparison to other content in the acoustic signal (e.g. segmental information and background noise), modelling intonation from conventional speech datasets is problematic (Hodari
It is important to note that models of intonation have mostly been developed for a few of the most widely used languages, such as English (Ronanki
This paper focuses on intonation in one such language, namely Lithuanian, an Indo-European language from the Baltic group. This language is spoken by fewer than 4 million people around the world (according to the official information from Statistics Lithuania), and can therefore be considered an under-resourced language. From the perspective of linguistic description and study, Lithuanian has had a somewhat paradoxical fate, receiving major attention for its accentuation system while remaining understudied in most other respects. In the wake of historical linguistics in the 19th century, Lithuanian reached the peak of its linguistic glory, becoming the reference language for F. de Saussure, who based Saussure’s Law on the Lithuanian pitch accent, which he considered the missing link of Indo-European linguistic history (Joseph, 2009; de Saussure, 1879, 1894, 1896). Ever since, Lithuanian has been perceived as the most “archaic of modern Indo-European languages”, attracting the attention of many eminent historical and structuralist linguists, such as Antoine Meillet and Nikolai Trubetzkoy (Michelini, 2000; Petit, 2020). Later on, prominent generative phonologists, such as Morris Halle and Juliette Blevins, also worked on its accentuation system (Blevins, 1993; Halle and Vergnaud, 1987). Nevertheless, due to historical circumstances in the region in the second half of the 20th century, international scientific collaborations became impossible, and most of the research on this language’s prosody was published in Lithuanian or Russian and followed theoretical approaches other than those developing in the Western world. This resulted in a major linguistic and theoretical gap between prosodic research on Lithuanian in Lithuania and that carried out abroad. Prosodic phenomena other than the pitch accent, such as sentence intonation, received little attention and were often studied indirectly.
A new opportunity to continue the study of Lithuanian intonation has been brought by the surge of speech and language technologies, which accentuated the urgent need to enhance linguistic knowledge of Lithuanian prosody. As is the case for other languages, these technologies necessitate integrated theoretical and applied linguistic knowledge, which has the potential to boost research in linguistics (Botinis
Therefore, this paper is the first attempt to gather research on Lithuanian intonation from both the Lithuanian and the Western traditions, the structuralist and generativist points of view, and the linguistic and modelling perspectives. We included in this overview all papers, books and conference proceedings published in English, Lithuanian and French dealing directly or indirectly with intonation in Lithuanian. Several unpublished but relevant theses on the topic are also discussed in this paper. We hope that this critical multidisciplinary overview will help summarize the somewhat fragmented research in the field in order to gain a better understanding of Lithuanian intonation and identify potential directions for future research both in linguistics and modelling.
The remainder of the paper is organized as follows: Section 2 provides a clarification of the terminology describing intonation. Section 3 briefly introduces intonation in terms of prosodic typology. Section 4 reviews Lithuanian intonation from the linguistic perspective by first situating the Lithuanian language within the word-prosodic typology (4.1), then discussing findings on sentence-level intonation in phonetic (4.2) and phonological (4.3) research. Section 5 reviews various techniques and models for prosody modelling. Two main approaches to intonation modelling are overviewed: directly modelling pitch and trying to simulate the pitch production process. Section 6 presents a critical discussion of the reviewed literature and provides suggestions of the major directions and further research in this field, with an emphasis on the role of intonation modelling for the development of linguistic theory and speech technologies.
What is Intonation? – Clearing up Terminological Ambiguities
Variations of pitch are used across languages and can thus be considered a universal feature. Nevertheless, this prosodic phenomenon is highly complex, as, despite its universal use, it has language-specific characteristics linked to the whole phonetic-phonological system of a given language. Moreover, it involves several levels of language, from lexical, to post-lexical and extralinguistic ones.
Before turning to the overview of the main typological distinctions of pitch and intonation, we would like to make a brief point about the somewhat confusing terminology used in this field of research. Depending on the authors and their theoretical background, words such as “intonation”, “prosody”, “tone”, and “pitch variations” either denote distinct phenomena or are used as synonyms (Botinis

Fig. 1. A schematic representation of the different aspects of intonation examined in the literature.
Historically, intonation has received little attention and has long been seen merely as a means of expressing the speaker’s attitudes and emotions. For this reason, it was considered to be a non-grammatical “form of animal communication” (Gussenhoven, 2004) which was “around the edge of language” (Bolinger, 1964). However, later studies and evidence showed that besides this extra- or para-linguistic function, intonation has a much more structured grammatical aspect which plays an important role in the world languages (see the autosegmental-metrical approach, Ladd, 1996; Pierrehumbert, 1980). Within the grammatical realm, intonation is still interpreted in two different ways (Hirst and Di Cristo, 1998). Under one interpretation, intonation is used interchangeably with the term “prosody” to define a wide range of suprasegmental phenomena, describing both the lexical identity of words (stress, tone) and the post-lexical or sentence-level concepts (intonational phrases etc.). Under the second interpretation, the term intonation only denotes the second, post-lexical aspect. The final source of ambiguity comes from the distinction between the physical/formal level of description and the linguistic/abstract one. While the physical approach attempts to identify and measure the acoustic and perceptual characteristics of intonation, the linguistic approach aims at examining the speakers’ abstract cognitive representations of this physical manifestation (Hirst and Di Cristo, 1998). Although attempts have been made to find a one-to-one correspondence between these abstract prosodic phenomena and their acoustic realizations, it has become clear that the same acoustic means are used to express different abstract linguistic elements (Féry, 2017). 
Thus, known markers of sentence-level intonation, such as variations of fundamental frequency (F0), intensity, duration and rhythm, also play a role in the realization of word-level intonation and stress, as well as in the expression of extralinguistic intonation. For this reason, references to F0 or “tone” might denote a variety of distinct prosodic phenomena. Finally, according to the autosegmental-metrical approach, the same abstract phonological melody can have different realizations (Arvaniti, 2022). Hence the need to decompose the observable continuous pitch contours into a series of primitives in order to accurately describe and model intonation across languages (Arvaniti and Ladd, 2009). The main focus of this paper is grammatical intonation in Lithuanian. Although our primary concern is sentence-level intonation, work on other relevant lexical prosodic phenomena, especially pitch accent, will be discussed, as it is a necessary prerequisite for post-lexical intonation description and modelling.
As discussed in the previous section, melodic rise or fall can be characteristic of different grammatical features (Féry, 2017). At the word level, intonation in some languages is specified lexically. In these languages (e.g. Mandarin, Thai or Vietnamese), variations of pitch indicate changes in the lexical meaning of a word. At the sentence level, intonation is understood as the use of tone for non-lexical purposes, such as syntactic demarcation, differentiation of sentence types, and signalling of information structure (Zerbian, 2010). Because word-intonation and sentence-intonation, as well as other prosodic phenomena, share many characteristics on the physical level, they can hardly be described in isolation, and their interaction is of typological interest (Hirst and Di Cristo, 1998; Ladd, 2001). Moreover, word- and sentence-level prosodic phenomena are related from the point of view of phonological structure. Specifically, sentence intonation is associated with feet and, thus, with stressed syllables (Arvaniti, 2022).

Fig. 2. A schematic representation of the prosodic typology with examples of languages belonging to the three classes.
As no clear intonation typology has been established so far, classical prosodic typologies based on word-prosodic characteristics are still widely used for the description of cross-linguistic distinctions of post-lexical intonation. Many typologies distinguish stress, tone and pitch-accent languages (see Fig. 2) (Jun, 2005). Stress languages, such as English, German or Dutch, have lexical stress but do not use tone at the lexical level. This means that tones do not affect word meaning in these languages, as it is determined by the segmental content and stress. In such languages, lexical stress is obligatory and marks the most prominent syllable, while different tones or melodies are assigned post-lexically (only at the sentence level) (Hyman, 2006). Stressed syllables serve as the location in the phonological structure where the intonation changes (shaping the intonation contour) occur (Gussenhoven, 2004). On the other hand, in tone languages, such as Mandarin, Thai or Vietnamese, the variations of pitch are specified lexically and are phonologically contrastive. However, sentence intonation is also used in these languages to a certain extent (Abramson, 1962; Connell
Word-Level Intonation in Lithuanian
As mentioned before, a number of studies have shown that Lithuanian is typologically a pitch-accent language, with tonal variations on heavy (i.e. bimoraic) syllables (Blevins, 1993; Girdenis, 2003). Fig. 3 is a schema of the prosodic syllable types in Lithuanian. Stress in Lithuanian is free (its position varies across words and forms of the same word). It is important to note that Lithuanian is one of the most highly inflected Indo-European languages, in which the location of stress depends on the underlying specifications of both the stem and the inflectional affix of every inflected nominal (Savičiūtė

Fig. 3. Prosodic syllable types of standard Lithuanian, from Girdenis (2003).
Due to historical circumstances in Eastern Europe in the second half of the 20th century, Lithuanian prosody, especially at the sentence level, has not received sufficient attention in linguistics. This resulted in major gaps in the description of intonational patterns not only of Baltic languages (Kundrotas, 2017) but also of the larger Slavic languages (Malisz and Żygis, 2017). This lack of investigations also explains why Lithuanian has not been described in the context of broader intonation typologies (Hirst and Di Cristo, 1998).
Lithuanian intonation at the sentence level was mostly studied by Lithuanian linguists from two standpoints: the syntactic and the phonetic. In his historical overview of research on Lithuanian intonation, Kundrotas (2020) notes that Lithuanian intonation at the sentence level was approached for the first time and in the most extensive manner by syntacticians (Balkevičius, 1963, 1998; Talandienė, 1970). As this research focused on the functional role of intonation rather than on its phonology and phonetics, its overview is out of the scope of the current paper. Phonological research was, for the most part, carried out by a single author, Kundrotas (2008, 2009, 2017, 2020).
Following the papers in syntax, the first, also rather sporadic, attempts to provide a phonetic description of Lithuanian intonation were published. Most of this research dealt with the phonetic description of sentence types. This direction of research was probably chosen due to the preliminary observations provided by studies on this topic in syntax. First, we can distinguish works aimed at identifying the acoustic correlates of sentence intonation in Lithuanian. Krapikaitė (2009, 2011) examined three possible phonetic correlates of intonation (F0, duration and intensity) in three sentence types (statements, questions and exclamations). She concluded that F0 is the main distinctive marker of sentence type in Lithuanian. Preliminary evidence from these studies shows that intensity modulations can signal focus-marking, while duration could be considered as a secondary marker of sentence intonation. The findings on the role of F0 have been confirmed in a pilot study by Kazlauskienė and Sabonytė (2018), who found that pitch variations in Lithuanian best reflect sentence intonation, i.e. sentence type and focus but are not a marker of stress.

Fig. 4 (A, B, C). Annotated examples of: A. a statement phrase; B. a question; C. an exclamation, from Krapikaitė (2009). The stressed syllables of focused words are in bold.
The majority of papers dedicated to sentence types in Lithuanian aimed at providing a phonetic description of pitch variations in these sentences. For instance, Krapikaitė (2009) recorded statements, questions and exclamations produced by multiple speakers of Standard Lithuanian. As can be seen in an annotated example of a statement phrase (Fig. 4A), intonation starts low at the beginning of the phrase, then rises and peaks on the third and stressed syllable of the first word. It gradually decreases in the second word and reaches a low plateau.
In the question phrase (Fig. 4B), the intonation of the first word increases slightly until the third, stressed syllable of the first word. It then drops significantly on the first, stressed syllable of the second word and rises again to reach the intonational boundary peak. As noted by the author, the suffix –si is a reflexive marker which often undergoes reduction to –s. Therefore, the final fall of intonation in this particular phrase should not be considered a trait of questions in Lithuanian. Thus, questions seem to follow a rising-falling-rising pattern. In the exclamation (Fig. 4C), intonation rises to its peak on the first, stressed, syllable of the second word and gradually falls towards the end of the word. The amplitude of this fall is the highest of all three sentence types. In addition, the overall pitch in the exclamatory sentence is higher than in declarative sentences and questions (this has also been found by Kazlauskienė and Sabonytė, 2018). Note that the melodic contour is opposite in questions vs exclamations: while in the question intonation is at its lowest point on the first syllable of the second word, it reaches its peak on the same syllable in the exclamation. Krapikaitė (2009) concludes that this difference in the F0 pattern on the same stressed syllables in different sentence types is an indication that, in Lithuanian, the melodic contour of intonation depends on sentence focus rather than on lexical stress. Specifically, the intonational peak is located on the focused words, and distinctive pitch events happen in or after the stressed syllables of these words. For instance, in the statement phrase
Although most of this work was of phonetic nature, in Krapikaitė (2015), an attempt to provide some phonological generalizations using ToBI (Tones and Break Indices, Beckman and Ayers, 1994) was made. The author identifies possible pitch accent tones in different sentence types, using the same sentence examples as in her previous work: a rising peak accent L+H* was identified on the last syllable –ta of the first word in the statement sentence (Fig. 4A); a low tone L* on the first syllable of the second word in the question (Fig. 4B); and a high pitch accent H* on the same syllable in the exclamation phrase (Fig. 4C). Krapikaitė suggests that each sentence type contains a single pitch accent and can thus be distinguished based on them. However, this paper uses a very limited set of tonal events proposed by ToBI, making the analysis incomplete in at least two aspects. First, it seems necessary to include additional pitch accents to account for the melodic contours of the given sentence types. Second, in order to make generalizations on markers of sentence type, it is important to address phrasal tones. For instance, we propose that the statement type should include L-L% final phrase boundaries, which have been widely attested cross-linguistically in declarative sentences (Gussenhoven, 2016). This would result in a L+H* L-L% melody for the statement phrase given in the examples.
In order to obtain the high-low-high melodic contour attested in Krapikaitė (2009, 2011) for the question phrase, the description should include an H* pitch accent in the first word (on the last syllable –ta). This would lead to H* L*, and final tones H-H%, which, again, are cross-linguistically attested as markers of yes-no questions (Hedberg
Finally, the exclamation phrase in the given example could be analysed as a sequence of an H* pitch accent in the first word, followed by an H* in the second and an L-L% final tone combination.
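The tonal analyses proposed in the preceding paragraphs can be collected in a small data structure. The sketch below is purely illustrative: the names `Melody`, `PROPOSED_MELODIES` and `to_tobi_string` are hypothetical, and the tone sequences are only the ones suggested above for the three example phrases, not a validated ToBI inventory for Lithuanian.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Melody:
    """A sentence melody: pitch accents plus a final phrase accent and boundary tone."""
    pitch_accents: Tuple[str, ...]
    phrase_accent: str
    boundary_tone: str

    def to_tobi_string(self) -> str:
        # Pitch accents are space-separated; the phrase accent and the
        # boundary tone are written together, as in "L-L%".
        return " ".join(self.pitch_accents) + f" {self.phrase_accent}{self.boundary_tone}"


# Tone sequences suggested in the text for the three example phrases (assumed,
# illustrative labels; not an established annotation standard for Lithuanian).
PROPOSED_MELODIES = {
    "statement": Melody(("L+H*",), "L-", "L%"),
    "question": Melody(("H*", "L*"), "H-", "H%"),
    "exclamation": Melody(("H*", "H*"), "L-", "L%"),
}

for sentence_type, melody in PROPOSED_MELODIES.items():
    print(f"{sentence_type}: {melody.to_tobi_string()}")
```

Written this way, the statement and the exclamation share the final L-L% sequence and differ only in their pitch accents, while the question is set apart by its H-H% ending.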
A more detailed analysis of question intonation was proposed by Kazlauskienė and Dereškevičiūtė (2018). They studied three types of question sentences and simple statements in order to identify the main intonational patterns of these sentence types. The authors used ToBI, but, again, chose to use a simplified notation with single-tone H* and L* pitch accents and H% and L% boundary notations. The first type of questions to be analysed was the yes/no question without the particle “ar”, which has been shown to be the most used yes/no question formulation in spoken Lithuanian (Balčiūnienė and Simonavičienė, 2009). These questions are comparable to questions without morphosyntactic question markers in English (Grabe and Karpinski, 2003). Two patterns were identified:
The first one follows the typologically common low-rise nuclear contour, which has also been observed by Krapikaitė (2009, described above). The second, the falling pattern, reflects the focus placement on the second word “eiti”, ‘can we GO home’?
The second type of questions are yes/no questions starting with the interrogative particle ‘ar’.
Similarly to pattern 1, pattern 3 ends in a low-rise nuclear contour, but the first word is assigned a high pitch accent. It is possible that the presence of the interrogative particle in the clause-initial position affects this change in pitch accent. Note also that while in pattern 1 and 2 the only marker of questions was intonation, in pattern 4 the particle “ar” unambiguously indicates the sentence type from the very beginning of the phrase. Thus, one could expect a smaller intonation rise in yes-no questions with the particle. Although the amplitude of the final rise was twice as high in pattern 2 compared to pattern 3, the rise is still clearly present in pattern 3. Kazlauskienė and Dereškevičiūtė further investigated the interaction between the particle and intonation as markers of question type. They tested the participants’ perception of sentences with and without the particle “ar” by cutting it out from the recordings and asking participants to identify the sentence type. Participants were more than 90 % correct in identifying questions even when the particle was removed, pointing to the fact that it is indeed intonation that conveys most information about the sentence type. Comparable evidence comes from Polish, where a similar particle “czy” is used in yes-no questions. Mikoś (1976) found a small, although non-significant difference in questions with vs without the particle in Polish and concluded that “czy” is an optional particle, which could be treated as a redundant feature. Turning back to the last two patterns of Lithuanian yes-no questions with interrogative particles, pattern 4 describes the case with a focus on the phrase-initial word ‘CAN [we] go home?’, which conveys impatience. Pattern 5 mirrors pattern 2 and emphasizes the second word ‘go’. Thus, this evidence points to the role of intonation in marking yes-no questions both in sentences with and without the interrogative particle “ar”. 
More studies are needed in order to shed light on whether this particle is a redundant marker of question type, or whether it has an impact on the phrase-initial intonational contour and, possibly, on the amplitude of the final rise. Finally, the rising contour seems to best describe the simple yes-no question, while the falling contour is used to move focus away from the last to the first or second syntagma in the phrase.
Kazlauskienė & Dereškevičiūtė end their study by analysing wh-questions:
In both cases, wh-questions have a falling contour, but they differ in focus marking: in pattern 6 the wh-word is marked, while in pattern 7, the second word ‘can we’ is marked. Interestingly, when the wh-word was excised from the recordings, participants’ accuracy in identifying the sentence type fell drastically. This result is expected for pattern 6, where the removal of the wh-word leaves the sentence with no other lexical markers of question type (in Lithuanian, there is no subject-auxiliary inversion). Note, though, that the L*H% pattern at the end of this phrase could potentially be used to identify the sentence type. The result of the perceptual experiment is even more surprising for pattern 7, as even without the wh-word, the sentence has the H*L*L*L% contour similar to pattern 4, which should be indicative of question type, as High-Fall contours are attested to be a cross-linguistically common pattern for wh-questions (Hedberg
As can be seen, the above-cited phonetic studies target sentence type intonation, although focus marking is also addressed in a less direct manner. These studies provide the first glimpse into the peculiarities of sentence intonation in Lithuanian and raise many interesting considerations as to the presence of certain cross-linguistically attested intonational patterns or the complex interaction of intonation with other syntactic markers (e.g. particles). Note, however, that these studies had small sample sizes and did not carry out (or did not report) statistical tests. Therefore, their findings are of a rather preliminary nature. Moreover, none of them discussed the interaction between the lexical pitch accent and the described post-lexical pitch variations. Finally, some of the studies made a first attempt to use an internationally recognized framework for the development of prosodic annotation, i.e. ToBI, and to adapt it to the Lithuanian language. In these papers, only some fragments of the ToBI system were used, making the results rather incomplete and lacking a deeper phonological analysis. The following subsection will review studies that explicitly targeted the phonology of sentence-level intonation in Lithuanian.
Intonation in Lithuanian has so far been largely overlooked in the field of phonology. Some researchers working on Lithuanian phonology and accentology mention the major functions of sentence intonation but consider this phenomenon to be out of the scope of their research (Pakerys, 2003; Kazlauskienė, 2012). Others, such as the most prominent Lithuanian phonologist Aleksas Girdenis, acknowledge the importance of intonation but do not study this phenomenon because of its complexity. For this reason, intonation is studied only in a fragmentary manner, when it is needed to explain some suprasegmental elements of a word (Girdenis, 2003). Finally, a prevailing view in some research circles is that more phonetic description is needed in order to be able to make generalizations about the more abstract phonological level (Kundrotas, 2020).
Perhaps the most extensive phonological analysis of Lithuanian intonation was conducted by G. Kundrotas. It is important to note that Kundrotas carried out significant research in documenting intonational patterns in varying contexts and across speech styles, as well as testing their validity experimentally. The author follows Trubetzkoy’s (1969) method and seeks to identify meaningful phonemic oppositions in Lithuanian intonation. He distinguishes three such types of oppositions (the stressed syllables of focused words are in bold):
It is important to mention that this researcher follows the theoretical framework of the Prague structuralist school and posits a holistic view of intonation (Bolinger, 1964; Liberman, 1975; Liberman and Sag, 1974). Therefore, Kundrotas argues that the intonation contour expressed by F0 is the main and indecomposable intonation unit in a language (Kundrotas, 2008, 2009). According to the author, the contour varies according to two phonologically meaningful dimensions: the tone of the intonational nucleus (focus or sentence stress) can be either rising or falling; the height of the contour following a nuclear syllable can be higher or lower than its level before the nuclear syllable. Based on these assumptions and experimental studies, Kundrotas (2008, 2009) identifies seven intonational contours in Lithuanian, which will be presented below.
The first intonation contour (IC1) is characterized by a gradual lowering of the tone in the nucleus, the post-nuclear part of the contour being lower than the pre-nuclear part. This intonation contour is mostly used in declarative sentences:
IC2 is similar to the first one, as it also starts at the normal pitch level of the speaker, then gradually falls in the nucleus and further falls in the post-nuclear part of the contour. However, the pitch in the nucleus is slightly higher than in IC1. IC2 is used in:
The main distinctive feature of IC3 is the gradually rising or the rising-falling tone in the nucleus. In the post-nuclear position, the tone gradually falls and reaches a lower position than in the pre-nuclear part of the contour. According to Siniova and Kundrotas (2014), this contour is widely used in Lithuanian in a variety of contexts:
IC4 is characterized by a falling-rising pitch. Specifically, the tone slightly falls at the beginning of the nucleus but gradually rises throughout the nucleus and in the post-nuclear part of the contour. The post-nuclear part of the contour is higher than the pre-nuclear part. This intonation contour is used in:
IC5 has two intonation centres, with a rising tone in the first and a falling tone in the second one. Both the intensity and duration of the syllables which constitute the intonation centres are higher compared to other stressed syllables. The two intonation centres can either be located close to each other in the utterance or can be separated by several syllables. Intonation between the two centres is higher than in the pre-nuclear part of the contour but lower in the post-nuclear part. This intonation contour is used in:
IC6 is characterized by a gradually rising pitch in the nucleus. The intonation remains higher in the post-nuclear position compared to the pre-nuclear part of the contour. The duration of the nuclear syllable is longer than the duration of other stressed syllables in the phrase. This duration difference is the main distinctive feature of IC6, compared to IC3, which has also a rising pitch in the nucleus. It is used to:
IC7 is characterized by a rising or rising-falling contour in the nucleus. Pitch falls in the post-nuclear position and becomes lower than in the pre-nuclear part. The distinctive feature of IC7 is the glottalization of the nucleus. It is used in exclamatory, imperative and interrogative sentences as a means of expressive negation by using an opposing meaning to what has been said:
Our review of the studies by Kundrotas reveals that this author provided the most exhaustive description and phonological analysis of Lithuanian sentence-level intonation to date. In his numerous papers, the researcher not only identifies major melodic contours used in Lithuanian but also points to the fact that the same tune might be used in different contexts, and several tunes might be used to express the same function. Unfortunately, the holistic view of intonation posited by the author makes the further development of Lithuanian intonational theory somewhat problematic. Specifically, the view that the intonation contour cannot and should not be divided into smaller units makes it hardly possible to explain the “systematic variation” observed in the different realizations of the same contour (Arvaniti and Ladd, 2009). Moreover, in order to capture the diversity of the remaining intonation patterns occurring in the language and their variations, one would have to come up with a very large number of such indecomposable melodies. In addition, the theoretical framework followed in these studies does not make it possible to take into account intonation at the word level (pitch accents) and its interaction with sentence-level intonation. Finally, it impedes the comparison of these results with studies on other languages carried out by the international research community in recent decades, which mostly follow the now widely accepted autosegmental-metrical framework. The following section will introduce this model as well as several other theoretical and computational models of intonation. The possibilities of applying these models to the Lithuanian language will be discussed in the Discussion section.
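The seven contours reviewed above can be restated compactly as bundles of the properties mentioned in their descriptions. The following sketch is only a reading aid under our own assumptions: the class `Contour`, the field names and the short labels are hypothetical, the entries paraphrase the prose descriptions of IC1 to IC7, and the post-nuclear level of IC5 is left unspecified because the description does not state it unambiguously.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Contour:
    nucleus_tone: str     # pitch movement in the nucleus
    post_vs_pre: str      # post-nuclear level relative to the pre-nuclear level
    extra_cue: str = ""   # additional distinctive property mentioned in the text


# Paraphrase of the prose descriptions; labels are illustrative assumptions.
CONTOURS = {
    "IC1": Contour("falling", "lower"),
    "IC2": Contour("falling", "lower", "nucleus slightly higher than in IC1"),
    "IC3": Contour("rising or rising-falling", "lower"),
    "IC4": Contour("falling-rising", "higher"),
    "IC5": Contour("rising, then falling", "unspecified", "two intonation centres"),
    "IC6": Contour("rising", "higher", "lengthened nuclear syllable"),
    "IC7": Contour("rising or rising-falling", "lower", "glottalized nucleus"),
}


def group_by_pitch_shape(contours):
    """Group contours that share both nucleus tone and post-nuclear level."""
    groups = defaultdict(list)
    for name, contour in contours.items():
        groups[(contour.nucleus_tone, contour.post_vs_pre)].append(name)
    return dict(groups)


for shape, names in group_by_pitch_shape(CONTOURS).items():
    print(shape, names)
```

Under this encoding, IC1 and IC2 fall into one group and IC3 and IC7 into another, which suggests that, on these descriptions alone, the two pitch dimensions do not separate all seven contours without the additional cues (nucleus height and glottalization).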
Intonation Modelling
Approaches to Intonation Modelling
From the modelling perspective, intonation is understood as the variation of fundamental frequency F0 (in terms of speech production) or pitch (in terms of speech perception). One way of capturing intonation could be by making physiological measurements of the articulation process – the relationships between F0 and articulatory gestures and the physiological properties of the vocal tract are obvious and undisputed. However, such measurements of the articulation process are complex and difficult to implement, thus making acoustic modelling more appealing. The principles of acoustic analysis enable the exploration of frequency properties (F0 among them) by measuring the wave of the speech signal. The obtained measurements and their derivatives are much more related to perceptible manifestations of intonation. Therefore, most of the modelling techniques are based on acoustic concepts and measurable speech signal properties. Various techniques and models were proposed for prosody modelling. According to Xu (2015), a three-way division can be made across the prosodic theories: linear vs superpositional, formal vs functional, and acoustic vs articulatory. There are two main approaches to intonation modelling: directly modelling the pitch and trying to simulate the pitch production process (Honnet, 2017). Nevertheless, as noted above, certain F0 variations are language-dependent, and this should be considered when employing various intonational models.
Fujisaki Model
The Fujisaki model (Fujisaki and Hirose, 1984) was proposed considering that the modelling of the F0 contour (which is the primary indicator of the intonation process) could not be described using a simple combination of straight, declining and rising approximations of the F0 segments. The authors claimed that natural and high-quality prosody in synthesized speech could be obtained only with complex functional dependencies.

Graphical representation of the Fujisaki model.
Their argument was based on the continuous nature of F0 contour despite discrete prosodic information elements in human speech.
The main assumption of the model was the presence of two components in the analysed contour: the phrase-level (global, sentence level) element and the accent-level (local, word level) element (Fig. 5). Both components are modelled using a second-order linear system with different excitation signals. In the case of the phrase-level component, an impulse is proposed as the system input. For intra-phrase variations of F0, positive impulses are employed, while for rapid falls (e.g. at the end of a phrase), a negative impulse is suggested. In the case of the accent-level component, the system is excited with a step function. The summed response of the two linear systems describes the F0 contour as a continuous variation of F0 with a baseline, maximal level and other characteristics, making the intonation model smooth and close to natural intonation.
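The superposition of the two components can be illustrated with a short Python sketch. It assumes the usual formulation of the model in the logarithmic F0 domain, with a critically damped second-order impulse response for phrase commands and a ceiling-limited step response for accent commands. The function names, the constants alpha, beta and gamma, and the command values are illustrative assumptions and are not taken from the reviewed studies.

```python
import math

def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase component: alpha^2 * t * exp(-alpha * t) for t >= 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent component, limited by the ceiling gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, magnitude) impulses
    accent_cmds: list of (onset_time, offset_time, amplitude) steps
    """
    log_f0 = math.log(base_hz)  # baseline, in the log domain
    for t0, magnitude in phrase_cmds:
        log_f0 += magnitude * phrase_response(t - t0)
    for t1, t2, amplitude in accent_cmds:
        # A step that switches on at t1 and off at t2
        log_f0 += amplitude * (accent_response(t - t1) - accent_response(t - t2))
    return math.exp(log_f0)

# Hypothetical utterance: one phrase command and one accent command
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
contour = [fujisaki_f0(i * 0.01, 100.0, phrase_cmds, accent_cmds) for i in range(150)]
```

In this sketch, a negative phrase magnitude placed at the end of the utterance would produce the rapid final fall mentioned above.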
It is important to note that the Fujisaki model was initially designed for Japanese, which is considered a pitch-accent language, i.e. the F0 (tone) can convey both lexical meaning and phrase-level patterns. Therefore, the assumptions for the Fujisaki model were inspired by the F0 variation-rich Japanese language, and this fact should be treated with care in the intonational analysis of non-tonal, stress-accent languages. In the case of Lithuanian, though, this characteristic can be useful.
The main idea of the Tilt intonation model (Taylor, 1998) is to analyse intonation as a sequence of intonation events. The author defines two kinds of acoustic events: pitch-accents (segmental level) and boundary tones (suprasegmental level). Each event is characterized by the rise, fall and other varying F0 shape properties (so-called tilt parameters). Some events can consist of the rise part only, and some can consist of the fall part only. Every event is described with a particular parameter set: the amplitude of the tilt, the duration, the tilt (the value +1 defines the rise, −1 defines the fall), and the F0 position.
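As an illustration, the parameter set of a single event can be derived from its rise and fall parts. The Python sketch below assumes the commonly cited formulation in which the tilt is the mean of an amplitude-based ratio and a duration-based ratio; the function name and the example values are hypothetical.

```python
def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Return (amplitude, duration, tilt) for one intonation event.

    rise_amp, fall_amp: F0 excursions in Hz (the sign is ignored)
    rise_dur, fall_dur: durations in seconds
    """
    a_rise, a_fall = abs(rise_amp), abs(fall_amp)
    amplitude = a_rise + a_fall
    duration = rise_dur + fall_dur
    # Each ratio is +1 for a pure rise and -1 for a pure fall
    tilt_amp = (a_rise - a_fall) / amplitude if amplitude > 0 else 0.0
    tilt_dur = (rise_dur - fall_dur) / duration if duration > 0 else 0.0
    tilt = 0.5 * (tilt_amp + tilt_dur)
    return amplitude, duration, tilt

# Hypothetical events: a pure rise, a pure fall and a symmetric rise-fall
pure_rise = tilt_parameters(30.0, 0.0, 0.2, 0.0)
pure_fall = tilt_parameters(0.0, -30.0, 0.0, 0.2)
rise_fall = tilt_parameters(20.0, -20.0, 0.1, 0.1)
```

The F0 position of the event, which is the fourth parameter listed above, is stored separately and is not computed here.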
The Tilt model is closely related to the Rise/Fall/Connection (RFC) model (Taylor, 1994). In addition to rises and falls, the latter model includes connections for non-intonational segments with a constant value of F0. There, intonation is also modelled as a linear combination (sequence) of rises, falls, and connections. For example, the pitch accent (stressed syllable) is modelled as a set of a single rise and a single fall. As in the Tilt intonational model, parameters such as rise amplitude or duration are also considered in the RFC model. The Tilt intonation model is similar to the Fujisaki model in that both are used to describe the variation (rises and falls) of F0.
IPO Model
Again, this model is based on acoustic and perceptual aspects of the speech signal and relies on the approximation of the F0 contour (Hart
The F0 contour is modelled in the following steps. First of all, the F0 contour is extracted from the speech. All the detected phenomena of micro-intonation (i.e. micro fluctuations) are removed in order to get relevant and meaningful intonation patterns. After preprocessing, the F0 contour is “stylized”, i.e. the F0 contour is analysed in segments and approximated by lines (Fig. 6). The obtained piecewise linear approximation of the contour is described using F0 patterns such as falls and rises. Although the model was developed using the Dutch language and tested for English, the model itself is based on F0 estimation and approximation and ignores language-specific presumptions and aspects.
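The stylization step can be sketched in Python as follows. The sketch uses a generic recursive splitting scheme with a fixed tolerance in Hz, which is an assumption made for illustration; it is not the perceptually validated procedure of the IPO approach, and all names and values are hypothetical.

```python
def stylize(times, f0, tol):
    """Return breakpoint indices of a piecewise-linear approximation of f0."""
    def split(lo, hi):
        t0, t1, y0, y1 = times[lo], times[hi], f0[lo], f0[hi]
        worst, worst_i = 0.0, None
        for i in range(lo + 1, hi):
            # Deviation of the sample from the straight line lo -> hi
            line = y0 + (y1 - y0) * (times[i] - t0) / (t1 - t0)
            dev = abs(f0[i] - line)
            if dev > worst:
                worst, worst_i = dev, i
        if worst_i is None or worst <= tol:
            return [lo, hi]  # one line segment is accurate enough
        return split(lo, worst_i)[:-1] + split(worst_i, hi)
    return split(0, len(times) - 1)

def label_movements(f0, breaks):
    """Describe each line segment as a rise, a fall or a level stretch."""
    labels = []
    for a, b in zip(breaks, breaks[1:]):
        if f0[b] > f0[a]:
            labels.append("rise")
        elif f0[b] < f0[a]:
            labels.append("fall")
        else:
            labels.append("level")
    return labels

# Hypothetical smoothed contour (Hz) sampled at equal intervals
times = list(range(11))
f0 = [100, 110, 120, 130, 140, 150, 140, 130, 120, 110, 100]
breaks = stylize(times, f0, tol=1.0)
movements = label_movements(f0, breaks)
```

Fluctuations smaller than the tolerance are absorbed into a single line segment, which loosely mirrors the removal of micro-intonation before stylization.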

Linear approximation of F0 contour.

PENTA model.
The lexical and intonational nature of F0 variations is addressed in the Parallel Encoding and Target Approximation (PENTA) model (Fig. 7). Two assumptions are at the basis of this model (a detailed figure and description can be found in Xu, 2004). First of all, during speech production, parallel communicative functions (lexical, sentential, topical, grouping, emotional, etc.) are conveyed through individual encoding schemes. These schemes can be language-specific or universal and determine the target parameters (pitch target, pitch range, etc.) controlling the approximating articulatory process. This parallel encoding part is the PEN part of the model, and target parameters are supposed to define the so-called melodic (tonal) primitives. The second assumption is related to the approximation process: various physical and neurological properties limit the generation of the F0 contour, resulting in F0 target approximation (the TA part of the model). Taking defined tonal primitives, the articulatory system approximates the F0 targets, each synchronized with a syllable.
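The target approximation idea can be illustrated with a deliberately simplified Python sketch. It assumes static pitch targets and a first-order exponential approach towards each target within its syllable, with the F0 value reached at the end of one syllable carried over to the next. This is not the formulation given in Xu (2004); the function name, the rate constant and the target values are hypothetical.

```python
import math

def target_approximation(onset_f0, syllables, rate=25.0, step=0.005):
    """Return a list of (time_s, f0_hz) samples.

    onset_f0: F0 at the start of the utterance (Hz)
    syllables: list of (duration_s, target_hz), one static target per syllable
    rate: approach rate in 1/s (assumed value)
    """
    contour = []
    current = onset_f0
    t_abs = 0.0
    for duration, target in syllables:
        start = current  # state transferred from the previous syllable
        n = int(round(duration / step))
        for k in range(1, n + 1):
            t = k * step
            # F0 moves towards the target but may not reach it in time
            current = target + (start - target) * math.exp(-rate * t)
            contour.append((t_abs + t, current))
        t_abs += duration
    return contour

# Hypothetical two-syllable sequence: a high target followed by a low one
contour = target_approximation(100.0, [(0.2, 120.0), (0.2, 90.0)])
```

In such a scheme, shortening a syllable leaves the target further from being reached, so the same underlying targets can surface as different F0 shapes.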
These assumptions set the PENTA model apart from the aforementioned intonation models, which combine nucleus F0 elements to shape a complex F0 contour. In comparison, the PENTA model is more abstract. Nevertheless, its complex nature enables the inclusion of specific language properties, phonological factors, etc.
The Autosegmental-Metrical (AM) approach originated as a result of a few influential studies and doctoral theses in intonational phonology, including works by Liberman (1975), Bruce (1977), Pierrehumbert (1980) and Ladd (1996). This phonological model represents intonation as a combination (sequence) of low and high tones – only two different tonal levels are considered. Between these tonal events, the F0 (pitch) contour is undefined and can be modelled as an interpolation between these events. The AM model relates these events with particular points in the utterance: prominent syllables in the segmental string and utterance boundaries. The tones associated with the syllables are called pitch accents, and the boundary-associated points are called edge tones. Edge tones mark the boundaries of the intonational phrase and are independent of pitch accents. Pierrehumbert (1980) defined two types of edge tones: final boundary tones and phrase accents. The distinguishing property of the AM model is that it considers tonal events as the linguistically important part and the remaining part of the F0 (pitch) contour only as a transition between events. This is the main difference from the IPO model.
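The relation between sparse tonal events and a continuous contour can be sketched in Python. The sketch assumes linear interpolation between successive tonal targets and a fixed mapping of the two tonal levels to F0 values; the mapping, the labels and the timings are hypothetical, and only monotonal labels are handled.

```python
# Hypothetical speaker-specific scaling of the two tonal levels (Hz)
TONE_LEVELS_HZ = {"L": 90.0, "H": 140.0}

def am_contour(tones, points_per_segment=10):
    """Return (time_s, f0_hz) samples interpolated between tonal targets.

    tones: time-ordered list of (time_s, label), where a label such as
    "H*" marks a pitch accent and "L%" marks an edge tone.
    """
    # Strip the association diacritics to obtain the tonal level itself
    targets = [(t, TONE_LEVELS_HZ[label.strip("*%-")]) for t, label in tones]
    contour = []
    for (t0, y0), (t1, y1) in zip(targets, targets[1:]):
        for k in range(points_per_segment):
            frac = k / points_per_segment
            # The stretch between two events is only a transition
            contour.append((t0 + frac * (t1 - t0), y0 + frac * (y1 - y0)))
    contour.append(targets[-1])
    return contour

# Hypothetical phrase: low accent, high accent, low final boundary tone
contour = am_contour([(0.0, "L*"), (0.4, "H*"), (1.0, "L%")])
```

Only the three tonal events carry linguistic information here; every other sample is derived from them.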
Intensive research on AM model ideas has led to the development of ToBI (Tone and Break Index), a framework for the development of prosodic annotation systems (Beckman
Alternative Techniques
For application purposes, generative frameworks were applied to synthesize intonational and emotional speech signals. These techniques focus on the acoustic realization of the prosody (intonation), avoiding formal phonological analysis or explicit theoretical justification. Their relation with formal intonation models is weak, and it is limited to the use of certain terms and principles (such as the rise and fall of the F0).
Two different paradigms are followed here: the modification of F0 contour in synthesized speech (i.e. re-synthesis in order to provide intonation) or the generation of an intonational contour “from scratch”. In formant-based synthesis, a rule-based modification of fundamental and formant frequencies is applied (Kohler, 1991; Carlson
Later, data-driven techniques emerged and began to dominate. In concatenative synthesis, the Hidden Markov Model framework was the main intonation modelling technique. Here, corpus data-based intonation is modelled both at the phonetic unit level (e.g. syllable, phoneme) (Boidin and Boeffard, 2008) and at the utterance level (Ni
The proliferation of artificial neural network-based techniques in speech synthesis has brought new opportunities. Neural networks are well known for their ability to capture hidden and very complex dependencies, which makes them well suited to intonation modelling. During the past decade, numerous studies were published on neural network-based modelling of intonation (Fan
Lithuanian Intonation Modelling
An overview of the development of intonation models for Lithuanian speech is given in this section. The works related to both word- and sentence-level phenomena are considered.
Sentence-Level Intonation Modelling
Regarding the issue of sentence-level intonation, a computational model of the fundamental frequency for the Lithuanian language was proposed by Vaičiūnas
Among attempts to create a model of Lithuanian sentence-level intonation, the work of Leonavičius (2006) can be included, in which the author modelled pitch variations of melismas. In principle, variations of pitch are considered as melismas in musical notation and as intonation in speech signal processing. The aim of this work was to synthesize melismas found in Lithuanian folk songs by applying Artificial Neural Networks. More than 500 melismas were used in the experiment. As a result, original mathematical models of all four kinds of melismas were created. Unfortunately, no subsequent publications by the author can be found that continue this promising research.
Finally, articles that focus on phone duration models of Lithuanian, which is one of the possible phonetic correlates of sentence-level intonation, could be mentioned. A review of existing models of duration prediction is given by Norkevičius and his colleagues (Norkevičius
With regard to intonation annotation, there have been studies related to ToBI, an international standard for annotation and transcription of prosodic events. An investigation of whether this transcription system is applicable to the prosodic annotation of Lithuanian is presented in the works of Krapikaitė (2014, 2015). The papers presented the use of ToBI transcription in the Lithuanian Prosodic Corpus, which is being built at Vytautas Magnus University. The results of the analysis show that the intonational contour in Lithuanian can be transcribed separately in individual tiers representing independent structural types.
Word-Level Intonation Modelling
There have also been some attempts to model intonation at the word level. In this context, prosodic phenomena, such as the pitch-accent, are considered. Lithuanian pitch accents were modelled by Dogil and Möhler (1998) using a parametric phonetic description of characteristic F0 shapes. The parametric model was represented by a polynomial function. Two contrastive accents of Lithuanian were investigated: the acute accent [´] and the circumflex accent [˜]. The aim of the study was to check which of the pitch-accent parameters is responsible for the prosodic salience of Lithuanian accents. The results of the approximation of the polynomial function on Lithuanian pitch-accents showed that the acute accent could be quite precisely approximated given the set of five parameters, whereas the circumflex accent defies such an approximation (Dogil and Möhler, 1998). The acute accent is a highly invariant pitch-accent with a clearly definable form, alignment point within a mora and a precisely defined slope, and it is characterized by a large F0 amplitude. The situation is different with the circumflex accent. The low F0 amplitude and the highly variant contour and alignment of the circumflex accent make it a very indeterminate representative of pitch-accents as a phonetic category.
Another paper that deals with Lithuanian pitch accent modelling was published by Paulikas and Navakauskas (2005). They aimed to create a general model that could represent the voiced speech signal with the pitch accent. The term “general” means that the authors do not distinguish between models according to the actual accent types but rather according to how their parameters were estimated. The proposed model is incorporated into the restoration algorithm and used in the homograph restoration process of Lithuanian words. Several accent models were developed, and their performance was compared in the experimental part of the paper. The third-order polynomial model that includes the polynomial approximation of normalized intensity and the period of the fundamental frequency as parameters was shown to be the best. According to the authors, thorough experimentation on a wider set of homographs is necessary to evaluate the proposed polynomial model and select optimal approximation orders. It should be noted that the model does not incorporate the duration of a missing accent as a parameter. It was assumed that the duration was known.
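The kind of polynomial approximation used in these studies can be illustrated with a generic least-squares fit. The Python sketch below fits a third-order polynomial to a contour over normalized time by solving the normal equations; it is not the estimation procedure of either paper, and the function names and the synthetic contour are hypothetical.

```python
def fit_polynomial(xs, ys, order=3):
    """Least-squares coefficients [c0, c1, ...] of c0 + c1*x + c2*x^2 + ..."""
    n = order + 1
    # Normal equations: a[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i)
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - rest) / a[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Value of the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Synthetic accent-like contour (Hz) over normalized time 0..1
xs = [i / 20 for i in range(21)]
ys = [100 + 40 * x - 60 * x ** 2 + 10 * x ** 3 for x in xs]
coeffs = fit_polynomial(xs, ys, order=3)
```

For a contour that a low-order polynomial describes well, the residual of such a fit is small; a highly variable contour would leave a large residual at the same order.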
Interestingly, literature analysis showed that, although some works state that they only deal with the stress of Lithuanian words, they, in fact, also address the issue of pitch-accents. An example of such works may include studies on the automatic stressing of Lithuanian texts. The rules of stressing are given in many grammars of the Lithuanian language (e.g. Ambrazas
In several studies, researchers have attempted to synthesize text with stress. These studies fall into two groups: either the stressed sounds are created during the synthesis process by modifying their duration and fundamental frequency, or the stressed sounds are stored separately and then concatenated with unstressed sounds in order to obtain the synthesized text with the stress. An example of stressed text synthesis employing modifications during the synthesis process is given in the paper by Pyž
Based on analysed papers, it can be concluded that both word- and sentence-level phenomena received some but not much attention from researchers. The reviewed studies cover the topic in quite a fragmented manner. In order to draw a full picture of Lithuanian intonation modelling, papers related to the stress of words and phone duration models that address the issue of pitch-accents and sentence-level intonation, respectively, were also analysed in this section. A summary of the work on intonation modelling, in terms of prosodic phenomena and methods used, is given in Table 1.
Lithuanian Intonation modelling.
This paper provides a first interdisciplinary overview of existing research carried out on Lithuanian intonation. It shows that in the last three decades, several attempts have been made to tackle this topic both from the linguistic perspective and the perspective of computer-aided modelling. But perhaps unsurprisingly, the study of such a complex phenomenon with its inherent ambiguities resulted in rather fragmented research. Moreover, it is important to note the lack of cooperation between linguists and speech engineers working on technological applications such as automatic speech synthesis. This clearly hinders the development of a common theory of Lithuanian intonation. The above-presented insights reveal that the applicability of linguistic findings to real-life modelling is quite limited.
The Linguistic Perspective
The above-reviewed phonetic research on Lithuanian intonation mostly concentrated on sentence types and only occasionally examined other phenomena, such as focus. These studies made the first attempt to use ToBI in order to provide a clear and more generalizable description of intonational patterns in Lithuanian that would be understandable to the international research community. However, the system is not used thoroughly, and only the major principles are employed without making use of all the functionalities that would allow accounting for more detailed aspects of Lithuanian intonation. The phonological work by Kundrotas, on the other hand, is very rich in the documentation of a variety of intonational patterns and their variations. Moreover, he provides an in-depth analysis of possible uses of these patterns across linguistic styles and Lithuanian dialects. Nevertheless, the theoretical approach adopted in this research, specifically the holistic view of indecomposable melodic contours, has a number of issues. For instance, it does not capture the diversity of phonetic realizations of the same contour and complicates the cross-linguistic comparison of Lithuanian intonation with intonation patterns of other languages. Last but not least, the framework used for the phonological descriptions of Lithuanian intonation can hardly be understood and used by the community of computational linguists and speech engineers. Thus, there is an urgent need to apply well-known frameworks and models (both phonological and computational) to further describe the phonetics of Lithuanian intonation and provide consistent phonological generalizations.
The above-described evidence about Lithuanian word-level and sentence-level prosodic features brings up an important question about the place of Lithuanian in the intonational typology and the practical implications it can have on its modelling. As mentioned before, from the point of view of word-prosodic typology, Lithuanian is considered to be a pitch-accent language and thus word-level prosodic specificities should influence the description of its prosodic characteristics at the sentence level. However, from the acoustic point of view, the main phonetic correlates of accent variations in the acute and the circumflex patterns have been shown to be intensity and duration, and not F0, which is used as a secondary marker only. Conversely, F0 variations best reflect sentence intonation. Thus, as the difference between the two contours lies in the location of the accent within the bimoraic syllable (in an acute syllable – on the left-hand mora, in a circumflex syllable – on the right) rather than in its tonal quality, the distinction in the notation could be reduced to a single accent mark with an indication of place (Kushnir, 2019). In this way, if sentence intonation is interpreted as being primarily marked by F0 variations, then Lithuanian could be considered to be closer to stress languages, where intonation is not defined at the word level. This would certainly facilitate Lithuanian intonation modelling. Note, though, that in Lithuanian, the secondary markers of sentence intonation (duration and intensity) correspond to the primary markers of its accent. Thus, possible interactions could arise from this overlap, which should be examined in future studies. Based on these overall observations, we can return to the reflection on whether it would be preferable to think of languages as forming a continuous scale between tone and stress languages instead of defining them in terms of three distinct categories (Hirst and Di Cristo, 1998).
The Modelling Perspective
From the modelling perspective, there have been very few attempts to develop intonation models for Lithuanian compared to other languages, especially so for sentence-level intonation. The reviewed studies cover the topic in quite a fragmented manner. Regarding word-level intonation, there are more papers that have approached this issue. However, there are still many unresolved tasks. It is also important to note that articles on modelling were published quite a long time ago, and this reaffirms that the subject is complex and that it does not attract sufficient attention from the research community. Turning to speech synthesis, the most developed text-to-speech system for Lithuanian is based on the concatenation of natural speech segments stored in a database. However, worldwide dominant phonological intonation models have not been used by researchers working on Lithuanian speech synthesis either.
The possibilities for the modelling of Lithuanian intonation should be explored in light of the well-known models introduced above. First of all, there is one common feature to the models presented (except the AM model). Specifically, the Fujisaki, the Tilt and IPO models are formulated in terms of acoustics: intonation is described as a variation of the fundamental frequency F0, which can be expressed as a sequence of atomic variation elements (e.g. the F0 value rises and falls). On the one hand, this makes all these models purely descriptive: we can analyse and describe only the acoustic realization of intonation. The models do not have phonological justifications, which means that we do not have any mapping between acoustic events (F0 variation associated with both local and sentence-level intonation events) and linguistic structures. The absence of this mapping results in the inability of these models to synthesize intonation based on linguistic information (e.g. text).
On the other hand, model independence from phonological differences and subtleties of different languages makes these models potentially adaptable to different languages. As none of the three models (Fujisaki, IPO model or Tilt model) has a phonological part, some mapping rules can be integrated into them. The PENTA model contains a parallel encoding part, defining encoding functions as lexical or emotional, which can be considered as language-specific. The potential adaptability of the models can be illustrated by examples of the application of the Fujisaki model covering both stress languages such as English (Moberg and Parssinen, 2004), German (Mixdorff, 2000), Italian (Rossi
Considering more phonologically-oriented models, neither AM nor any other major phonological model of intonation has been applied to Lithuanian. It is true that in order to make phonological generalizations, it is necessary to already have a certain level of phonetic analysis of the language. This would allow researchers to identify which parts of the acoustic signal can be assigned to more abstract contrastive phonological units (Zerbian, 2010). Phonetic research on Lithuanian has started, but there is certainly much more research to be done. Nevertheless, the existing material could already be used in applying the Autosegmental-Metrical model (AM), which has been successfully applied to many typologically different languages (Jun, 2005). The analysis of melodic contours based on smaller discrete linguistic primitives would help to solve the issue of variability in the realizations raised above. Importantly, AM would allow us to finally examine in greater depth the relationship between Lithuanian sentence-level intonation and the Lithuanian pitch accent. Specifically, in this framework, both word-level and sentence-level intonation are treated as a sequence of the same type of tones whose function is specified in each language’s grammar. Thus, a single model could be used to describe both levels of intonation.
Note, though, that some researchers suggest choosing models from languages that are close to Lithuanian based on morphological rather than phonological characteristics. For example, Vaičiūnas
Prospects of Using New Modelling Approaches
It is important to note that the main disadvantage of the models described above is their requirement of much a priori knowledge about the phenomenon to be modelled, or “expert knowledge”. Therefore, nowadays, scientists have come up with new language prosody modelling approaches, where models are able to learn the patterns from speech data. Specifically, these are models based on neural networks. The main difference from traditional ones is that the networks are able to learn patterns, i.e. the algorithms themselves construct rules when analysing the data, and then construct patterns based on the rules observed. In such a way, the algorithms learn the patterns from the raw audio data. Models based on deep learning show good results in intonation modelling and could be a solution, but it should be noted that these methods require large amounts of annotated training material. WaveNet network (Oord
The Interdisciplinary Benefits of Intonational Research
As stated, Lithuanian is spoken by fewer than four million speakers worldwide and can therefore be considered an under-resourced language. As a result, it is very difficult to build large enough corpora for speech recognition and synthesis purposes. Due to the small number of Lithuanian speakers, large language technology companies, such as Google, are not interested in investing in it. Therefore, it is largely up to scientists and local businesses to develop fundamental and applied research on the Lithuanian language and its modelling. This is valid for almost all under-resourced languages (Besacier
Intonation models could contribute to the theoretical description of Lithuanian and also widen the typological understanding of Lithuanian among other languages. They have the potential to help identify and classify a wider range of intonational patterns. Moreover, the interrelation of acoustic correlates other than F0 that are used in the realization of intonation patterns could be uncovered.
Intonation models would also benefit Lithuanian speech technologies. First, they would fill in the current theoretical gap and provide the necessary theoretical description of the Lithuanian language, which would be directly applicable to language technologies. Second, they would increase the accuracy of speech recognition and help to identify speakers’ emotions and dialects. Lithuanian has 2 dialects and 6 subdialects which have differences in their phonetics, syntax, morphology and lexis. But the main differences lie in their vocalism and accent-intonation systems (Girdenis and Zinkevičius, 1966), which could be reflected in the created intonation models. Third, intonation models would help to increase the naturalness of synthetic speech and foster ease of processing for listeners.
However, the main problem, which must be solved in the future, is the interdisciplinary nature of speech research and modelling. People working in this field should have a background in two main fields – linguistics and computer science – while knowledge of psycholinguistics would clearly be beneficial, as well. Until now, most experts working on Lithuanian language modelling had a background in only one of the fields, and Lithuanian universities clearly lack such an interdisciplinary curriculum. Interdisciplinary collaborations would benefit not only the theoretical description but also the application of language technologies. Finally, the use of widely recognized state-of-the-art theoretical frameworks and methods, as well as the use of the English language in publishing, would contribute to attracting international researchers to further develop this topic.
Turning back to the Lithuanian language, there are still many important issues that have not been addressed in order to generate contours of Lithuanian intonation for a given input and which can be considered as future challenges. First of all, the prominent intonation models reported in this paper, i.e. the Fujisaki, IPO, Tilt, PENTA, and AM models, could be applied to Lithuanian to evaluate their suitability for this language. Since there are no publications on this issue, it is possible that these models have not been investigated at all. It is only known that the Fujisaki model was used in the speech synthesizer developed by Kasparaitis and his colleagues. Unfortunately, the synthesizer was used on the Text-Talk website until 2010, and the details of the fundamental frequency modelling are not published by the authors.
Directions for Future Research
Considering the above-described issue, we can formulate potential directions for Lithuanian language intonation modelling as follows:
Further phonetic and phonological description of intonation patterns in Lithuanian. This work should be based on widely acknowledged theoretical frameworks, which leads to the following point.
Phonology-based modelling of intonation. In this direction, phonology knowledge (intonation, stress, etc.) in the form of rules and restrictions would be integrated or joined with the selected acoustic intonation model. The possible result of such integration would be rule-based intonation models. An inevitable condition for this work is interdisciplinary collaboration.
Data-driven intonation models. They would be obtained by applying statistical data analysis techniques. In these models, the concept of intonation patterns could be established and explored. The presence of such intonation patterns and their sets would mean the possibility to express all the intonation sequences in a time-aligned combination of patterns. The synthesis task would be to select and join particular patterns into a new intonation sequence (phrase, sentence, etc.). The formulation of data-driven models will require large labelled datasets, again hardly possible without a multidisciplinary collaboration.
Creation of a Lithuanian intonation corpus with the help of automatic annotation techniques. This approach would necessitate a close collaboration with expert phoneticians. The topic could be further explored with the help of supervised machine-learning techniques to discover new patterns.
Overall, as linguistic research on intonation was a precursor to modelling research on the topic, descriptive linguistic literature can bring valuable evidence for modelling the main acoustic correlates of intonation, its types and functions in the language. Linguistics, on the other hand, could benefit from modelling research through the identification and classification of more types of intonation patterns. Moreover, the use of a clear, single and internationally-recognized framework would facilitate the collaboration between researchers in Lithuania and abroad, as well as the inclusion of the computational modelling community.
No ethics approval was required given that no new data were collected, nor new analyses were conducted on existing data.
