Abstract
This research explores the potential of Long Short-Term Memory (LSTM) networks and Natural Language Processing (NLP) for automated essay generation. The goal of the study is to create a model that produces high-quality essays that are not only grammatically correct but also semantically meaningful and contextually relevant. The rise of NLP and deep learning has made it possible to generate text that is coherent and semantically sound. This research leverages the ability of LSTMs to capture long-term dependencies and context within text and combines it with NLP techniques, such as word embeddings, to process and encode textual data. Experimental results show that the proposed model can effectively generate essays that are coherent, contextually relevant, and semantically meaningful. This is a significant advancement in the field of text generation and has potential applications in areas such as education, content creation, and language translation. In education, for example, the model could be used to generate essays for language proficiency tests or as a writing aid for students. In content creation, it could be used to generate articles, blog posts, and other written content. In language translation, the model could be used to generate essays in the target language that are semantically and contextually equivalent to the source-language essay. The findings of this study contribute to the advancement of NLP and deep learning techniques in the area of text generation and open up new avenues for future research. Shortly after the proposed model was deployed, it was found to outscore Multi-Topic-Aware LSTM (MTA-LSTM), Topic-Attention LSTM (TAT-LSTM), and Topic-Averaged LSTM (TAV-LSTM) in human evaluation by 6.84 percent, 25.40 percent, and 34.94 percent, respectively. Furthermore, it improved automatic BLEU score evaluation by 11.68 percent, 26.23 percent, and 54.11 percent over MTA-LSTM, TAT-LSTM, and TAV-LSTM, respectively.
Keywords
Introduction
Natural Language Generation (NLG) [1], a subset of Artificial Intelligence [2], is a method for converting data into plain language. The technology may be used to write semantic sentences and paragraphs that tell stories, much like an analyst would. For a long time, communicating the ideas hidden in data has required human analysts; computers can now extract concepts from data at vast scale with great precision and eloquence. Productivity increases when computers automate analytical and communication tasks, allowing employees to dedicate more time to higher-value activities. Natural language has been proposed as a means of communicating with visual interfaces in a range of applications. Dashboards may be visually appealing, but in terms of information density they fall short of language: a rich and sophisticated storyline can be told in a single paragraph or a few bullet points.
However, the task of essay generation is currently beset by a slew of issues. To begin with, when taken as a whole, the sentences in an essay generated by an algorithm are often unrelated to the theme words employed. For example, if the topic words are 'A day in the park,' one line of the essay might describe a day, such as the sun shining, while another might define what a park is. This problem is referred to as topic relevance. Also, if one statement is something like "We had a lovely day," the next sentence is frequently unconnected to it, such as "Our Ford car choked." Output essays must be fluent to achieve human-level performance. There is also a lack of cohesiveness among the components. Coherence in writing refers to the reader's ability to understand what the writer is trying to say: making everything flow smoothly is what coherence is all about. Everything should be semantically structured and related, and the reader should be able to see that the overall relevance to the theme words is maintained. The essay's integrity is equally important: integrity refers to the degree of vocabulary and phrasing utilised to communicate sentiment in the essay.
The proposed method addresses the issues of coherence, topic integrity, and relevance by using three distinct mechanisms: the pool hike mechanism, the topic integrity mechanism, and the coherence mechanism. Pretrained GloVe [3] and trained word2vec embeddings, together with the use of context vectors in attention processes, each target one or more of the above concerns, allowing the model to be controlled to produce better essays. To be precise, a pool hike mechanism was incorporated into the model to address the problem of repetition in text production, while the inclusion of pretrained GloVe and trained word2vec [4, 5, 6] embeddings ensures the essay's topic relevance and integrity. The use of context vectors in attention [7, 8, 9] mechanisms addresses the problem of repetition and maintains text coherence.
TAV-LSTM extracts topic word embeddings, averages them, and runs the result through a recurrent neural network, whereas TAT-LSTM extracts the embedding of every topic word and then applies attention over them. MTA-LSTM additionally includes a topic coverage vector to help guide the attention mechanism.
The following are the contributions of this research work:
A comprehensive review of prior research in the essay generation domain, highlighting its shortcomings; Topic-Attention LSTM (TAT-LSTM) and Topic-Averaged LSTM (TAV-LSTM) are compared, and it is found that TAT-LSTM outperforms TAV-LSTM because it enhances integrity and uses a context vector in the attention [7, 8, 9] mechanism. The integration of the pool hike mechanism, the topic integrity mechanism, and the coherence mechanism into a single model that focuses on the aforementioned aspects of coherence, topic integrity, and relevance; the use of context vectors in attention processes, trained word2vec [4, 5, 6] embeddings, and pretrained GloVe embeddings addresses one or more of these issues, allowing the model to be tuned to create better essays. The development of a comprehensive model for automated essay generation that combines Long Short-Term Memory (LSTM) and Natural Language Processing (NLP) techniques to produce grammatically correct, contextually relevant, and semantically meaningful essays, with emphasis placed on enhancing coherence, integrity, relevance, fluency, and diversity.
In Section 2, prior efforts in the domain of essay generation are examined, along with the improved adaptations of models that followed them. The next section covers the task definition, data collection, and overall construction of the proposed approach. Then, in Section 4, the methodology used in the proposed model for writing the essay is discussed, taking into account all of the aspects that go into writing a good essay. The experimental settings and assessment criteria used to evaluate the generated essays are discussed in the following subsections. The experimental results are reviewed in Section 7, and the conclusion and future work are presented in Section 8.
In 2016, Kiddon et al. [10] attempted to address these problems by employing a planned list of agenda items; based on this list, two attention processes were introduced into the model: one monitoring used agenda items and one monitoring unused agenda items.
In 2019, Welleck et al. [11] combined likelihood and unlikelihood training. Likelihood training assigns an output probability to the next token based on how frequently that token appears, whereas unlikelihood training assigns a probability to the token not being the following word.
In 2018, Feng et al. [12] used attention and word embeddings, which improved topic relevance and integrity. TAV-LSTM, TAT-LSTM, and MTA-LSTM were the three architectures described in the study. In TAV-LSTM, the topic word embeddings are extracted and averaged before being fed through a Recurrent Neural Network (RNN) [13, 14, 15]. Because the final averaged vector can be identical for different word sequences and has low complexity, this approach may be the least accurate of the three. In TAT-LSTM, the embedding of each topic word is extracted, attention is applied over them, and the result is forwarded to the RNN.
TAT-LSTM surpasses TAV-LSTM because topic words are given more attention and weight based on where they appear in the essay; as a result, the integrity of the essay is enhanced. In MTA-LSTM, the authors take TAT-LSTM and add a topic coverage vector, which records how much each topic has been covered over time. The procedures adopted address relevance, coherence, and integrity issues to some extent, but they do not significantly promote diversity.
The strategy promoted diversity by expanding the pool of words utilised to generate text (common sense words).
Lin et al. [18] addressed the issue in 2020 by using GloVe and BERT [19, 20] representations in self-attention-guided generation, together with a target-side contextual history mechanism, which resolves repetition and poor attention retention. GloVe word vectors and BERT embeddings are used as contextual embeddings [21], and the two are linked via a dynamic weighted sum to help with the imbalance of linguistic information. The architecture has two parts. First, an encoder that uses contextual embeddings to reduce the information disparity and contains the pre-trained language model BERT: GloVe word vectors and BERT embeddings are coupled and delivered to the encoder. By combining the hidden states of BERT across encoder levels, the semantics of the topic words are enriched, as is the exchange between multiple themes; a shifting weighted sum is also employed, further closing the semantic gap. Second, a target-side contextual history mechanism in a self-attention network guides the generation. The quality of the generated text is improved by the context-aware generator, which accounts for the problems of duplication and poor attention retention.
The authors of [22] presented COMMONGEN, a novel constrained text generation task for generative commonsense reasoning built on a large dataset. They examined the task's main obstacles, including compositional generalisation [23] and relational reasoning [24], and analysed in-depth experiments with contemporary language-generation models on the task in a consistent manner. They found that performance is not up to par with that of humans, with grammatically correct but nonsensical utterances being generated.
In 2020, the authors of [25] proposed a new task, Open Domain Event Text Generation (ODETG), which can be applied in settings where traditional generation tasks are ineffective. The authors developed the WikiEvent dataset, used to evaluate models addressing the ODETG problem; it contains 34,000 (entity chain, text) pairs, each consisting of an entity chain and its description text. Furthermore, a framework of three components was presented: an encoder, a retriever, and a decoder. The encoder encodes the entity chain into a hidden representation; to improve the generation procedure, the retriever retrieves related data; the decoder then generates the text sequence from the entity chain with the additional retrieved data and an uneven drop component. The proposed model outperforms several baselines in terms of delivering better event text on the WikiEvent dataset.
In 2020, the authors of [26] developed the Plug and Play Language Model (PPLM) for controlled text generation, which flexibly combines a large, pre-trained Language Model (LM) with a Bag of Words (BoW) [27] or a small, easy-to-train discriminator. They also discussed how controlled LMs should behave ethically. With the use of a gradient-based sampling technique, PPLM achieves fine-grained control of attributes; in addition, PPLM can steer generation while preserving fluency, holding promise for the next generation of controllable language models.
In [28], the authors addressed a challenge called Generalized Few-Shot Intent Detection, which aims to discriminate within a joint label space of previously seen intents, which have rich annotation, and novel intents, for which only a few examples are available. To tackle this goal, CG-BERT (Conditional Text Generation with BERT) was proposed, a model that generates new utterances conditioned on an intent. The suggested approach outperforms the competition on two real-world intent detection datasets.
The authors of [29] introduced BLEURT, a reference-based metric for English text generation. The metric can accurately mimic human judgement because it is trained on human ratings.
A few problems in current essay generation approaches were found after completing a thorough examination of the literature. Few designs explicitly address all of the issues associated with essay generation, and the integrity and coherence of the generated essays are not at all comparable to human performance. Coherence, integrity, relevance, fluency, and diversity are all concerns that the proposed model tries to address. Furthermore, pre-training boosts these measures, in terms of both domain coverage and quality.
From the literature review, it was concluded that essay generation technologies continue to have problems.
Only a few systems specifically address all aspects of essay generation. In comparison to human performance, the coherence and integrity of the generated essays are far from satisfactory. Relevance, integrity, fluency, coherence, and diversity are all concerns that the proposed approach tries to address.
Methodology
Task definition
The goal of essay creation is to generate an article (a paragraph) within the theme of a given set of topic words.
Data collection and preprocessing
A concatenated corpus of highly perceptive writings by Paul Graham [30], which includes a variety of essays on various topics, was used. The dataset contains 452,944 words and has a vocabulary of 30,064 terms. Because this corpus is in the form of continuous text, it was predetermined that five words would be used to produce the next word, four words would be used to forecast the next word, and so on. All punctuation except full stops and commas is removed from the text, all letters are converted to lowercase, and all noise (URLs, font-size markup, references, etc.) is removed from the dataset.
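To make the preprocessing concrete, the following is a minimal Python sketch of the cleaning and windowing steps described above (Python, Keras, and Gensim are named later in the experimental settings). The inline sample text and the five-word window are illustrative; the actual corpus file handling and batching details are not specified here.

```python
import re
import string

def preprocess(text):
    """Lowercase, strip URLs, and drop all punctuation except full stops and commas."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove URLs
    keep = {".", ","}
    drop = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", drop)).split()

# In practice the concatenated Paul Graham corpus would be read from disk;
# a short inline sample keeps this sketch self-contained.
sample = "A day in the park. We had a lovely day, see http://example.com!"
tokens = preprocess(sample)

# Sliding windows: five preceding words are used to predict the next word.
window = 5
pairs = [(tokens[i:i + window], tokens[i + window])
         for i in range(len(tokens) - window)]
print(pairs[0])
```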
Fig. 1. Diagram representing the pool hike mechanism.
This technique, as shown in Fig. 1, was used to address the issue of duplication and ensure the diversity of the output articles. The process increases the input size by a scalar multiple; hence, the proposed model has more words to draw from when producing output essays, leading to the production of more varied essays. For each word in the input, the words closest to it in the embedding space are retrieved using a similarity measure and appended to the input pool. Cosine similarity is used in the final model, although Euclidean distance and Jaccard similarity were also evaluated (see Table 2); the Euclidean distance between two word vectors $u$ and $v$ is

$$d(u, v) = \sqrt{\sum_{i}(u_i - v_i)^2}.$$

With this method, the input size is increased by the chosen scalar factor, giving the model a richer pool of candidate words during generation.
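The following is a minimal sketch of the pool hike idea under the interpretation given above: each input word is expanded with its nearest neighbours under cosine similarity in the embedding space. The toy corpus, the `pool_hike` function name, and the expansion factor of 3 are illustrative assumptions, not details taken from the paper.

```python
from gensim.models import Word2Vec

# Toy corpus purely for illustration; the Paul Graham corpus would be used in practice.
sentences = [["go", "house", "park"], ["a", "day", "in", "the", "park"],
             ["we", "had", "a", "lovely", "day"]]
wv = Word2Vec(sentences, vector_size=50, min_count=1, window=3).wv

def pool_hike(input_words, wv, expansion_factor=3):
    """Return the input words plus their most similar neighbours (cosine similarity)."""
    pool = []
    for word in input_words:
        pool.append(word)
        if word in wv:
            neighbours = wv.most_similar(word, topn=expansion_factor - 1)
            pool.extend(w for w, _ in neighbours)
    return pool

# With a properly trained embedding, ["go", "house", "park"] might expand to words
# such as "open", "family", "palace", "grass", and "swings", as in the worked example.
print(pool_hike(["go", "house", "park"], wv))
```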
Fig. 2. Figure representing the use of pretrained GloVe and trained word2vec embeddings.
Word embeddings are vector representations of input words, as shown in Fig. 2, in which comparable words have similar vectors and therefore a small cosine distance between them. These embeddings are employed in the suggested model because they provide finer semantics than tokenisation alone, which merely converts words to numbers. Two types of word embeddings are used: word2vec embeddings trained on the corpus and pretrained GloVe embeddings. GloVe embeddings capture fine semantics of words because they are pre-trained over a vast amount of text, so including them alongside the word2vec embeddings considerably improves the model's semantic understanding. For instance, depending on the dataset, if the word "good" is provided as input, the trained embeddings may have learnt related words such as "day" or "car," but by linking them with the reliable pretrained GloVe embeddings, the large understanding gap between the suggested model and humans is narrowed. Even if it was not learnt from the dataset, the model will therefore also be aware that "human," "grief," and "good" can co-occur.
First, word2vec embeddings are constructed from the corpus, and then the pretrained GloVe embeddings of the same words are extracted; the two representations are concatenated for each word in the pool.
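A minimal sketch of this step is given below, assuming Gensim's packaged GloVe download (`glove-wiki-gigaword-300`) stands in for the pretrained vectors used in the paper; the zero-vector fallback for out-of-vocabulary words is also an assumption.

```python
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

# Corpus-trained word2vec vectors (toy corpus for illustration).
sentences = [["go", "house", "park"], ["we", "had", "a", "lovely", "day"]]
w2v = Word2Vec(sentences, vector_size=300, min_count=1).wv

# Pretrained GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-300")

def combined_embedding(word):
    """Concatenate the trained word2vec vector with the pretrained GloVe vector."""
    w2v_vec = w2v[word] if word in w2v else np.zeros(300, dtype=np.float32)
    glove_vec = glove[word] if word in glove else np.zeros(300, dtype=np.float32)
    return np.concatenate([w2v_vec, glove_vec])   # 600-dimensional representation

print(combined_embedding("park").shape)           # (600,)
```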
Fig. 3. Diagram representing the context vector mechanism.
When performing prediction tasks, the attention mechanism allows the neural network to concentrate on the relevant components of the input rather than treating relevant and irrelevant components equally.
Attention mirrors a crucial process that humans engage in. For instance, when manually translating a lengthy sentence, individuals often focus more intently on the specific word or phrase they are currently translating, without necessarily considering its location within the input sentence. For neural networks, attention recreates this mechanism.
In sequence-to-sequence models, the attention mechanism is frequently employed. Without it, the model would have to summarise the entire input sentence or sequence in a single hidden state, which is far from ideal, and this drawback becomes more pronounced the longer the input sequence is.
The attention mechanism strengthens this paradigm by permitting the decoder to "glance back" at the input sentence at each stage of decoding: each decoder output then depends on a weighted average of all the input states rather than just the most recent state.
But there are also disadvantages. The generated essay may concentrate on a small number of words, since the decoder skips past information that previously attracted attention, causing some theme words to appear far more often than others. To keep track of how often each topic word has been used, a topic coverage context vector c is maintained; it denotes the degree to which each topic word should still be conveyed in subsequent generation, and the attention weights are scaled by it so that the model gives more weight to topic words that have so far been under-expressed. This is additionally safeguarded by a parameter that controls the strength of the coverage adjustment. The coverage vector is updated after every decoding step, and as a result the probability of the next word is computed from the decoder state together with the coverage-adjusted attention context.
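To make the coverage idea concrete, here is a simplified NumPy sketch of attention scaled by a topic-coverage vector. It illustrates the mechanism described above (topic words that have already attracted attention are progressively discounted) but is not the authors' exact formulation; the update rule and clipping are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coverage_attention(decoder_state, topic_embeddings, coverage):
    """decoder_state: (d,); topic_embeddings: (k, d); coverage: (k,), starts at ones."""
    scores = topic_embeddings @ decoder_state           # raw attention scores per topic word
    weights = softmax(scores * coverage)                 # coverage rescales the attention
    context = weights @ topic_embeddings                 # weighted sum of topic embeddings
    coverage = np.clip(coverage - weights, 0.0, None)    # attended topics are discounted
    return context, coverage

k, d = 5, 300
coverage = np.ones(k)                                    # every topic word still "unspoken"
state, topics = np.random.randn(d), np.random.randn(k, d)
context, coverage = coverage_attention(state, topics, coverage)
print(context.shape, coverage)
```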
The only method for converting a discrete feature into a vector format is through embeddings. Every machine learning algorithm utilises a vector as input and produces predictions as output. Therefore, when dealing with a categorical feature, the only viable approach for incorporating it into a machine learning model is by embedding it into a vector.
The most basic form of embedding is one-hot encoding. For instance, a categorical feature with three possible values can be replaced with the vectors [1, 0, 0], [0, 1, 0], and [0, 0, 1] without losing any information.
These vectors have as many elements as the number of values of the categorical feature. When the categorical feature has a lot of possible values, it is often better to replace it with embeddings with lower dimensionality.
Lower dimensionality gives two advantages:
It is more computationally efficient, because smaller embeddings require less memory. It regularizes the model, because the smaller number of parameters the model has, the better it is regularized.
Embeddings are often used to map words to vectors in NLP systems; words represented as vectors can be used as an input for recurrent neural networks.
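As a brief illustration of the lower-dimensional alternative to one-hot vectors, the following Keras snippet (Keras is named later in the experimental settings) maps integer word indices to dense vectors; the vocabulary size matches the corpus described earlier, while the token ids are arbitrary example values.

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 30064, 300
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

token_ids = np.array([[12, 7, 301, 42, 9]])   # one 5-word input sequence
vectors = embedding(token_ids)                # dense vectors instead of one-hot
print(vectors.shape)                          # (1, 5, 300)
```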
The suggested paradigm is succinctly depicted in Fig. 4, and its workings are described below. A "pool hike mechanism" takes an array of input words and expands the pool size of the word array by a scalar factor; the expanded pool is then embedded and passed through the rest of the model.
Suppose the input sentence is 'go house park', each input word is expanded with two similar words, and the embedding dictionary looks like this:
['go': [e1, e2, e3], 'open': [e10, e11, e12], 'beginning': [e13, e14, e15], 'House': [e4, e5, e6], 'family': [e17, e18, e19], 'palace': [e20, e21, e22], 'Park': [e7, e8, e9], 'grass': [e23, e24, e25], 'swings': [e26, e27, e28]]
Then the embeddings of the input words are:
[[e1, e2, e3], [e4, e5, e6], [e7, e8, e9]]
After the pool hike mechanism, the expanded array becomes:
[[e1, e2, e3], [e10, e11, e12], [e13, e14, e15], [e17, e18, e19], [e20, e21, e22], [e23, e24, e25], [e26, e27, e28]]
Its first dimension has grown from the original three input words to the size of the expanded pool.
The input received by this mechanism is the output of the pool hike mechanism, i.e.:
[[e1, e2, e3], [e10, e11, e12], [e13, e14, e15], [e17, e18, e19], [e20, e21, e22], [e23, e24, e25], [e26, e27, e28]]
In this step, another embedding matrix is created, in which mappings are learnt from the corpus only. Suppose this looks like:
['go': [pe1, pe2, pe3], 'open': [pe10, pe11, pe12], 'beginning': [pe13, pe14, pe15], 'House': [pe4, pe5, pe6], 'family': [pe17, pe18, pe19], 'palace': [pe20, pe21, pe22], 'Park': [pe7, pe8, pe9], 'grass': [pe23, pe24, pe25], 'swings': [pe26, pe27, pe28]]
These are concatenated with the previous embeddings of the expanded pool, giving:
[[e1, e2, e3, pe1, pe2, pe3], [e10, e11, e12, pe10, pe11, pe12], [e13, e14, e15, pe13, pe14, pe15], [e4, e5, e6, pe4, pe5, pe6], [e17, e18, e19, pe17, pe18, pe19], [e20, e21, e22, pe20, pe21, pe22], [e7, e8, e9, pe7, pe8, pe9], [e23, e24, e25, pe23, pe24, pe25], [e26, e27, e28, pe26, pe27, pe28]]
This is then passed into the next mechanism, which outputs a probability for each word present in the vocabulary; the output dimension is [word_vocab, 1].
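Putting the pieces together, the sketch below shows one plausible Keras realisation of the pipeline described above: pre-computed concatenated embeddings (600 dimensions per word in this example) feed three stacked LSTM layers of 300 units and a softmax over the vocabulary, following the sizes quoted in the experimental settings. The pool length, optimiser, and loss are assumptions.

```python
import tensorflow as tf

vocab_size, pool_len, embed_dim = 30064, 9, 600   # expanded pool of 9 words in the example

inputs = tf.keras.Input(shape=(pool_len, embed_dim))          # concatenated GloVe + word2vec
x = tf.keras.layers.LSTM(300, return_sequences=True)(inputs)
x = tf.keras.layers.LSTM(300, return_sequences=True)(x)
x = tf.keras.layers.LSTM(300)(x)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)  # next-word probabilities

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```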
Human evaluation
Twenty participants with a good understanding of the English language were given 25 distinct essays from each model. The essays were graded on "Topic-Integrity," "Topical-Relevance," "Fluency," and "Coherence" by these individuals. Each component of the essay is given a score ranging from 1 to 5, with 5 being the highest. Finally, all the scores were added up and averaged to get a final score. Although the scores were not always stable, they were found to follow a consistent pattern.
BLEU score
As a metric for automatic evaluation, the Bilingual Evaluation Understudy (BLEU) [34, 35] score was employed. This measure is commonly used for automatic evaluation in machine translation systems. For automatic evaluation, the original essays were used as references to compute a BLEU-2 score.
BLEU measures the discrepancies between an automatic translation and one or more reference translations of the same source sentence that were written by humans.
The BLEU algorithm calculates the number of matches in a weighted manner by comparing the consecutive phrases of the automatic translation with the consecutive phrases it discovers in the reference translation; these matches are position-independent. A higher match degree indicates a higher score and greater similarity to the reference translation. Grammar and intelligibility are not taken into consideration.
The advantage of BLEU is that it correlates well with human judgement by averaging out individual sentence judgement errors throughout a test corpus rather than attempting to simulate every single phrase’s identical human judgement.
The amount of data that has to be trained with and the consistency of the test data with the training & tuning data set all have a significant impact on the BLEU outcomes. One can anticipate a high BLEU score if models were trained on a certain domain and the training and test sets of data matched.
BLEU looks at words, phrases, word breaks (relative to the reference sets), and word order to generate its score. The more closely the machine translation matches the human translation on these aspects, the higher the score.
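For reference, a BLEU-2 score as used here can be computed with NLTK as follows; the two sentences are made up for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["we spent a lovely day walking in the park".split()]
candidate = "we had a lovely day in the park".split()

# BLEU-2: equal weights on unigram and bigram precision, with smoothing for short texts.
score = sentence_bleu(reference, candidate,
                      weights=(0.5, 0.5),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```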
Experimental settings
In this study, the Paul Graham essay dataset has been used, which contains a collection of Paul Graham's works on diverse topics. The top 50,000 words are chosen for training and testing. The suggested model employs word2vec and GloVe embeddings, each with a dimensionality of 300. The first five words of each batch in the dataset are utilised during training to predict the next word. The suggested model was developed using the Keras API and TensorFlow, as well as other libraries such as Gensim, NumPy, and the standard string utilities. Each of the three LSTM layers in this suggested model has 300 hidden units. The value of the pool expansion factor used in the pool hike mechanism was varied in the experiments described below.
Experimental result and discussion
Table 1. Human evaluation scores of this model with different values of the pool expansion factor in the pool hike mechanism.
Experimentation was conducted with several values of the pool expansion factor (Table 1).
The following is an explanation for this behaviour:
Word embeddings are employed to locate words that are similar to those in the input. Diversity grows at first because the model has a larger pool of vectors to draw on during prediction, but as the expansion factor grows too large, increasingly less related words enter the pool and the scores begin to fall.
Table 2. Comparison of various similarity measures.
In Table 2, it is discernible that cosine similarity yields the most favourable outcomes among the three similarity metrics considered (Euclidean, Jaccard, cosine). Given its superior average score, cosine similarity was adopted as the similarity measure. Table 2 is reproduced from a prior paper [36].
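For completeness, the three similarity measures compared in Table 2 can be computed as below; cosine and Euclidean operate on word vectors, while Jaccard is shown here on token sets. This is an illustrative sketch, not the evaluation code used to produce the table.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

u, v = np.random.rand(300), np.random.rand(300)
print(cosine_similarity(u, v), euclidean_distance(u, v))
print(jaccard_similarity("a day in the park".split(), "a walk in the park".split()))
```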
Table 3. Human evaluation scores of various models.
Table 4. BLEU scores of various models.
Table 3 shows the findings of the human evaluation, from which similar conclusions can be drawn. With a 12.5 percent improvement in integrity and a 7.8 percent improvement in topic relevance, it is clear that the model surpasses the baselines by a large margin. Fluency improved by 4 percent, coherence by 11 percent, and diversity by 22 percent. All of this can be attributed to the several methodologies merged in this model, each focused on one or more metrics. Table 3 is referenced from another paper [36].
Table 4 shows the findings of the BLEU score evaluation. As can be seen, the model outperforms the comparison models on practically every metric by a large margin, indicating how effective the approach is and how much it improves the quality of the generated essays.
This comprehensive study has delved into a wide array of techniques to advance text generation, addressing vital aspects like coherence, topic integrity, and relevance. Through the integration of methods such as the pool hike mechanism, trained word2vec embeddings, and context vectors within attention processes, remarkable results have been achieved. When compared to three prominent models (TAV-LSTM, TAT-LSTM, and MTA-LSTM), the model demonstrates superiority across nearly all evaluated criteria. In particular, it excels in categories such as coherence, topic relevance, and integrity, marking a significant leap in text generation. In terms of future work, the incorporation of BERT embeddings, renowned for their reliability, is envisioned as a way to further enhance results. Additionally, a shift towards an iterative process of finding relevant words throughout generation, as opposed to only at the outset, could introduce more diversity into the generated content.
Moreover, the integration of adversarial training shows substantial promise for further enhancing the model’s performance. Through systematic implementation of these improvements, the aim is to push the boundaries of text generation, making substantial contributions to the domains of Natural Language Processing and Deep Learning. While this study represents a significant milestone, it merely marks the inception of a journey towards more advanced and versatile text generation systems.
