Abstract
This paper combines Long Short-Term Memory (LSTM) networks with a Self-Attention mechanism to predict melody direction in vocal performances, and explores its application to folk music to enhance expressiveness and creative diversity. Using the MAESTRO dataset, which contains paired audio and MIDI data, the continuous note sequence is divided into fixed-length segments (20 notes each) via a sliding window. The LSTM captures temporal dependencies in the sequence, while Self-Attention assigns varying weights to inputs across time steps to better capture global context. To overcome the limitations of traditional pitch prediction methods, model performance is evaluated with 10-fold cross-validation. Experimental results show that with a window size of 20, the model achieves a Mean Squared Error (MSE) of 0.023 with a training time of 86 minutes, the most balanced result across all configurations. Compared with Bi-LSTM and MusicTransformer, the proposed model excels in pitch prediction accuracy, achieving an average Mean Absolute Error (MAE) of 0.022, an R² of 0.895, and a Pearson Correlation Coefficient (PCC) of 0.925. The model is also tested on the creation of six types of folk music; while it reduces creation time, its harmonic consistency slightly lags behind manually composed melodies. The model shows significant potential in pitch prediction and folk music creation, offering a practical tool for music composition.
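The abstract does not give implementation details, but the sliding-window segmentation it describes can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the step size of one note, and the use of the note immediately following each window as the prediction target are all assumptions, not details from the paper.

```python
def sliding_windows(notes, window_size=20, step=1):
    """Split a continuous note sequence into fixed-length segments.

    Each window of `window_size` pitches would serve as a model input;
    here the note immediately after the window is taken as the target
    (an assumed supervision scheme for next-pitch prediction).
    """
    pairs = []
    for start in range(0, len(notes) - window_size, step):
        window = notes[start:start + window_size]
        target = notes[start + window_size]
        pairs.append((window, target))
    return pairs

# Example: a toy sequence of 25 MIDI pitch numbers
pitches = list(range(60, 85))
samples = sliding_windows(pitches, window_size=20)
# 25 notes with a 20-note window yield 5 (input, target) pairs
```

With this framing, each 20-note input segment is fed to the LSTM + Self-Attention model, and the regression metrics reported above (MSE, MAE) are computed against the target pitch.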
