Improving text relationship modelling with artificial data

Abstract

Data augmentation uses artificially created examples to support supervised machine learning, adding robustness to the resulting models and helping to account for the limited availability of labelled data. We apply and evaluate a synthetic data approach to relationship classification in digital libraries, generating artificial books with relationships that are common in digital libraries but not easily inferred from existing metadata. Artificial books are generated by remixing existing texts into synthetically constructed formats. We find that for classification of whole–part relationships between books, synthetic data improves the performance of a deep neural network classifier by 91%. Furthermore, we consider whether a classifier can learn a useful new text relationship class from fully artificial training data.
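The remixing idea can be illustrated with a minimal sketch. It is not the implementation evaluated in the article: it assumes books are available as plain strings keyed by an identifier, builds an artificial "whole" volume by concatenating a few existing books, and labels each (book, artificial volume) pair. The function names and the labels `PART_OF` and `UNRELATED` are hypothetical and chosen only for illustration.

```python
import random


def make_anthology(books, k, rng):
    """Concatenate k randomly chosen books into one artificial 'whole' volume.

    `books` maps a (hypothetical) book id to its full text.
    """
    part_ids = rng.sample(sorted(books), k)           # ids of the contained books
    text = "\n\n".join(books[p] for p in part_ids)    # remix existing texts
    return part_ids, text


def synthetic_pairs(books, n_wholes=2, k=2, seed=0):
    """Build labelled (book_id, whole_id, label) examples from artificial volumes."""
    rng = random.Random(seed)                          # seeded for reproducibility
    wholes = {}
    examples = []
    for i in range(n_wholes):
        part_ids, text = make_anthology(books, k, rng)
        whole_id = f"synthetic-{i}"                    # hypothetical id scheme
        wholes[whole_id] = text
        for book_id in books:
            # Illustrative labels; a real label set may differ.
            label = "PART_OF" if book_id in part_ids else "UNRELATED"
            examples.append((book_id, whole_id, label))
    return examples, wholes


if __name__ == "__main__":
    toy_books = {
        "a": "Text of book A.",
        "b": "Text of book B.",
        "c": "Text of book C.",
        "d": "Text of book D.",
    }
    examples, wholes = synthetic_pairs(toy_books)
    for example in examples:
        print(example)
```

Because the artificial volumes are constructed rather than found, the label of every pair is known by construction, which is what makes this kind of data usable for supervised training without manual annotation.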

Keywords

Data augmentation; digital libraries; machine learning
