Abstract
With a mass digitisation programme underway and the addition of non-print legal deposit and web archive collections, the National Library of Scotland is now both producing and collecting data at an unprecedented rate, with over 5PB of storage in the Library’s data centres. As well as opportunities to support large-scale analysis of the collections, this presents new challenges around data management, storage, rights, formats, skills and access. Furthermore, by assuming the role of both creators and collectors, libraries face broader questions about the concepts of ‘collections’ and ‘heritage’, and the ethical implications of collecting practices. While the ‘collections as data’ movement has encouraged cultural heritage organisations to present collections in machine-readable formats, new services, processes and tools need to be established to enable these emerging forms of research, and new modes of working adopted to take into account an increasing need for transparency around the creation and presentation of digital collections. This commentary explores the National Library of Scotland’s new digital scholarship service, the implications of this new activity and the obstacles that libraries encounter when navigating a world of Big Data.
Digital scholarship is bringing about a paradigm shift in cultural heritage organisations. Librarians have always worked with data: from the card catalogue to modern Library Management Systems, libraries have created and curated data, with the aim of making collections available and discoverable. However, the increasing use of computational techniques and methods in research practice, and resulting changes to how libraries curate, deliver and even create collections, mean that they are now working with data at an unprecedented scale. While these challenges of volume and velocity are encouraging libraries to reconsider practical issues such as digitisation processes, storage, discovery and skills, they also problematise the very idea of ‘collections’ themselves and how we create and frame the concept of ‘heritage’. Big Data in the library emphasises existing problems and brings about new questions of ethics, methodologies and transparency. It demonstrates the tensions within libraries as they increasingly become both collectors and creators, and it leads to new approaches towards digitisation and access. Digital scholarship is a disruptive influence in the cultural heritage sector.
What is Big Data in a library context? How do libraries adapt to collections as data and computational uses of the collections? What are the challenges of working in a world of future heritage, and what are the broader implications of these changes for heritage institutions? This article explores the challenges presented by cultural heritage Big Data in the context of the National Library of Scotland’s development of a digital scholarship service, which aims to encourage and enable the use of computational methods with the collections. This work involves making collections available in machine-readable format and establishing a culture in the Library which can support this. Yet equipping a library for digital scholarship, navigating a world of Big Data and understanding the implications of this shift require organisation-wide re-evaluation.
Big Data and the library
Founded in 1925, with collections deriving from the Faculty of Advocates Library, which was established in 1682, the National Library of Scotland is one of the UK’s six legal deposit libraries. This means it is entitled to claim a copy of all new print and electronic publications from the UK through the Legal Deposit Libraries Act 2003 and the Legal Deposit Libraries (Non Print Works) Regulations 2013, alongside the British Library, the National Library of Wales, Oxford University, Cambridge University and Trinity College Dublin (UK Government, 2003, 2013). Also included in these regulations is the ability to harvest UK websites. Now totalling over 31 million items, the Library’s collections are rapidly expanding as both electronic and print material are acquired through legal deposit and focused collecting methods.
The Library’s digital collections have increased rapidly in recent years, with a strategic aim to have one-third of the total collections available in digital form by 2025 (National Library of Scotland, 2015). To date, over 5.2 million e-journals and e-books have been deposited via non-print legal deposit into shared infrastructure. This has contributed substantially to the increasing percentage of collections in digital format, which, at the start of 2020, sits at approximately 22%. In parallel, an in-house mass-digitisation programme, operating a two-shift-per-day pattern to maximise camera use, has produced large amounts of newly digitised material, with 128,810 items digitised in 2017/18 and 201,679 in 2018/19, including books, manuscripts, maps, pamphlets, magazines, posters and sound and moving image files, with the main focus on out-of-copyright material.
Following Laney’s (2001) definition of Big Data, the digital objects and data collected and created through non-print legal deposit and digitisation bear the three traits of volume (hundreds of terabytes), velocity (many items ingested daily) and variety (from many collections). Kitchin and McArdle (2016) assert that the defining characteristics of Big Data are velocity and exhaustivity, yet it has taken the Library over ten years to make 22% of the collection available in digital format. In practice, however, Big Data is often relative to an organisation’s starting position: it begins at the point where handling data using existing methods is no longer viable. Big Data for an individual (terabytes) will be different to that of a national library (petabytes), which in turn will be different for a facility such as CERN (zettabytes).
This corpus of digital collections provides greater opportunities for data-intensive analysis than ever before; yet it is not always possible to use them for computational research. Since 2014, the UK has provided a copyright exception for text and data mining (TDM) for non-commercial research, enabling TDM of in-copyright material, provided that the data is not transferred to researchers (Intellectual Property Office, 2014). However, as described by Gooding et al. (2019), TDM cannot be undertaken on non-print legal deposit, leaving challenges for libraries as they seek to make collections available at scale.
Collections as data
Alongside this increasing digital availability of collections, the Collections as Data movement has gained significant momentum in recent years, largely due to the Always Already Computational project, which ran from 2016 to 2018: ‘While cultural heritage practitioners have broad experience replicating the analog experience of watching, viewing, and reading in a digital environment’, the project’s report states, ‘they less commonly share the experience of supporting users who want to work with collections as data – a conceptual orientation to collections that renders them as ordered information, stored digitally, that are inherently amenable to computation’ (Padilla et al., 2019a: 7). This project, along with its Mellon-funded successor, Collections as Data: Part to Whole (Padilla et al., 2019b), and the broader OpenGlam movement (which seeks to make Gallery, Library, Archive and Museum collections openly available to users), has advocated for organisations to make collections available openly and in machine-readable formats, and has explored potential approaches for doing so, as well as the challenges (Fauconnier, 2019; Valeonti et al., 2019). This activity is set amidst a burgeoning context of transnational funding streams, such as the joint UK Arts and Humanities Research Council (AHRC) and US National Endowment for the Humanities call for UK–US Collaboration for Digital Scholarship in Cultural Institutions, which includes a focus on machine learning and ‘unlocking new data’ (AHRC, 2020). Furthermore, universities are increasingly offering data science and digital heritage-related courses. Within Scotland, there has been increasing investment in data science, including Scottish Government City Deal funding with a significant emphasis on data-driven innovation (Scottish Government, 2018).
Set against this shifting national and international backdrop, the mass-digitisation programme at the National Library of Scotland lends itself well to this computational turn in cultural heritage. A new digital scholarship service at the Library (with ‘digital scholarship’ here defined as ‘the use of computational methods, with National Library of Scotland collections, to enable new forms of research’ (Ames, 2020a)) has given initial priority to providing digitised collections as datasets, with future plans to publish metadata, maps-as-data, audiovisual material, web archive and organisational data: digitised print collections are only part of a much bigger landscape within which digital scholarship operates. These datasets are made available in consistent formats, with clear rights information, on the Data Foundry (National Library of Scotland, 2019a), the service’s open data platform. The Data Foundry is a place where collections as datasets can be accessed or downloaded, melted down and welded back together again to produce new outcomes, findings or analysis, and in future potentially re-ingested into the Library’s collections for others to reuse. This has enabled staged levels of service, from making data and tools available, to consultation time with Library subject-matter experts, to funded collaboration and partnership on projects.
Equipping the library for cultural heritage Big Data
Presenting collections as data has implications across the Library, from the ways in which, and reasons why, libraries digitise, to the files they produce and preserve, the infrastructure required for storing digital objects, and levels of collaborative working and skillsets needed. Crafting heritage as Big Data is changing libraries as organisations and requiring fresh approaches to existing practice. Digital scholarship, for example, brings a new use-case to digitisation: where, previously, the Library’s digitisation programme has focused on producing images for online galleries, datasets require additional file formats and metadata, as well as considerations around structures, storage and persistent identifiers, to make collections available at scale.
Providing datasets on the Data Foundry as simple downloads (with future plans for API access to some collections) reduces technical barriers to use, but involves designing consistent directory and file structures and including an inventory file and a readme file, which contains high-level information about the dataset, such as the number and formats of files, the date of publication and subsequent revisions, and rights information. Including METS (a Library of Congress standard used to describe digital objects), ALTO (a Library of Congress standard for storing layout information), OCR (Optical Character Recognition) text and image files has led to some datasets exceeding 40 GB when uncompressed; storing and making them available in the cloud ensures fast downloads. Once a dataset is compiled, zipped and published online, a DOI is added as a persistent identifier to enable citation.
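To give a sense of how such a dataset might be consumed once downloaded, the sketch below extracts plain text from the ALTO XML files in an unzipped dataset directory. It is a minimal illustration rather than a supported workflow: the directory name is a placeholder, and because ALTO namespace versions vary between datasets, String elements are matched by local name rather than by a specific namespace.

```python
"""Extract plain text from the ALTO XML files in a downloaded,
unzipped dataset. A minimal sketch: 'dataset' is a placeholder
directory, and String elements are matched by local name because
ALTO namespace versions differ between collections.
"""
from pathlib import Path
import xml.etree.ElementTree as ET


def alto_to_text(alto_path: Path) -> str:
    """Join the CONTENT attribute of every ALTO String element."""
    tree = ET.parse(alto_path)
    words = [
        el.attrib["CONTENT"]
        for el in tree.getroot().iter()
        # Guard against comments, whose tag is not a string.
        if isinstance(el.tag, str)
        and el.tag.endswith("}String")
        and "CONTENT" in el.attrib
    ]
    return " ".join(words)


# Walk the unzipped download and report a word count per page file.
for xml_file in sorted(Path("dataset").rglob("*.xml")):
    text = alto_to_text(xml_file)
    print(xml_file.name, len(text.split()), "words")
```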
The Data Foundry website was developed with an emphasis on simplicity and clarity, following three principles: openness, transparency and practicality. With digitised collections going through a prior rights assessment process, all datasets on the Data Foundry have clear rights statements, and the Library does not assert further copyright over collections (National Library of Scotland, 2019b). Datasets on the Data Foundry include information about how and why they have been produced, and are presented in a number of file formats to encourage broad use. Each dataset is introduced with key information including the number of files and words; level of OCR clean-up; date range; and a curator-led introduction. Together, the features and files of the Data Foundry are designed to ensure minimal effort is needed to gain an overview of the data.
This presentation of collections has jump-started a programme of broader change within the Library. Both the process of providing access to data collections and the potential uses of these datasets mean rethinking core library skillsets. A Research Libraries UK survey reported a skills gap in libraries in areas such as text analysis, data visualisation and data analysis (Greenhall, 2019); the Ligue des Bibliothèques Européennes de Recherche (LIBER) noted a gap around ‘Technical knowledge – such as coding or tool expertise’ (Wilms et al., 2019: 21). In the USA, the Association of Research Libraries found concern about ‘both capacity and sustainability’ of supporting digital scholarship, identifying skill shortages including data visualisation, text analysis and data curation (Mulligan, 2016: 8). To support digital scholarship activity, and to understand the implications of computational access to collections, the Library has collaborated with the University of Edinburgh and the Software Sustainability Institute to offer Library Carpentry data skills training, from spreadsheets through to Python and R (Cope and Baker, 2018).
Collecting and creating future heritage
This focus on presenting collections in machine-readable form, and at scale, presents a number of challenges for libraries: the dichotomy between heritage collections and new computational techniques used with data collections can be difficult to navigate. Most immediately, digital scholarship disrupts existing library services and the way collections are delivered, resulting in a changing emphasis of workplans and core activity. Yet from a broader perspective, this disruption can be even more challenging: presenting collections as data brings collections, collecting and the concept of ‘cultural heritage’ itself into question. How libraries collect, curate and, to an extent, create the future of cultural heritage generates issues of bias and narrative, as well as determining future heritage legacies.
Firstly, what do collections even look like as data? While plain text files and XML (Extensible Markup Language) may be recognisable outputs from OCR, the possibilities extend beyond text alone. How do we extract ‘data’ from maps (places, landmarks, topography, streets) in an automated way, and how would we represent this data once we have done so? The GB1900 crowdsourcing project compiled a gazetteer of Ordnance Survey maps through manual methods (Southall et al., 2017), but how could the creation of such a dataset be automated, and what would the implications be for the presentation of spatial information? What do audiovisual collections look like as data, and how can libraries present and publish these datasets? How can we extract information about scenery in films, for example, and how could we integrate this data back into catalogue data to enhance discovery? The solutions to such problems could lie in new applications of Artificial Intelligence to cultural heritage collections and data. However, as recent articles by Cordell (2020) and Padilla (2019) explain, this brings a number of difficulties for institutions to balance: shortages of staff expertise and time to experiment; the quantity of ground truth data required, and the challenge of counteracting any cultural or social bias within it; and a rigorous understanding of ethics within this domain and how these align with organisational values. Collaborations between organisations, as well as with academic projects, are one way of sharing resources and expertise, becoming ‘critical friends’. Amidst these challenges lie potential solutions for improving access to collections, presenting them in new ways and bringing to the foreground themes that have previously been left out of the cataloguing process for practical – or other – reasons.
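As one small, hedged illustration of this direction of travel, named-entity recognition over OCR text is a common first step towards deriving structured place data from digitised collections. The sketch below uses spaCy with its small English model; the model name and the sample text are assumptions for illustration (the model must be installed separately, and OCR noise will degrade results on real collections).

```python
"""One illustrative route from OCR output to structured 'data':
named-entity recognition with spaCy. A sketch, not the Library's
method. Assumes the en_core_web_sm model has been installed
(python -m spacy download en_core_web_sm); real OCR noise will
reduce accuracy.
"""
import spacy

nlp = spacy.load("en_core_web_sm")

# A hypothetical snippet of OCR text from a digitised volume.
ocr_text = (
    "The road from Edinburgh to Leith crosses the Water of Leith "
    "near Canonmills, to the north of Princes Street."
)

doc = nlp(ocr_text)
# Keep place-like entities as candidate gazetteer entries.
places = [(ent.text, ent.label_) for ent in doc.ents
          if ent.label_ in {"GPE", "LOC", "FAC"}]
print(places)  # e.g. [('Edinburgh', 'GPE'), ('Leith', 'GPE'), ...]
```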
Clarity about the data itself – what it is and why it is – is also a fundamental challenge. Zwitter (2014) identifies three categories of ‘Big Data stakeholders’ – collectors, utilisers and generators – noting the ‘power relationships’ between the three and the role of the collector in determining what is collected in the first place (p. 3). Libraries increasingly hold two of these three positions, as collectors and generators, both determining and creating the collections, and so carry weighty responsibilities around how data is put forward to utilisers. This might include publishing details of OCR quality (illustrated in the sketch below), or conveying information about any data clean-up: the accuracy of text has a significant impact on natural language processing (Alex et al., 2012; van Strien et al., 2020) and, for more general uses, readers without a detailed understanding of OCR may not realise that a word which returns no matches in a search may nonetheless be present in the item, simply mistranscribed. It also means being transparent about the provenance of data collections: now becoming producers of their own collections, with data from scanners, digital production tools and metadata all included in datasets, libraries need to ensure that they retain their position as a trusted source of information by declaring why and how material was turned into a dataset (Ames, 2020b). What decision-making processes led to one collection being digitised and presented as data over another, and what technical processes did it undergo? What (or who) funded the digitisation of this collection? With increasing emphasis on reproducible research, libraries, too, should provide information which enables users to retrace the thread from the dataset to the physical object. While METS enables the recording of technical provenance information, there is currently no standard for presenting information about why an item was digitised: as an intermediary solution, collections on the Data Foundry include unstructured text in an ‘Other’ field to ensure this information remains with the dataset.
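What ‘publishing details of OCR quality’ might mean in practice remains an open question; one crude, illustrative proxy is the proportion of tokens found in a reference wordlist. The sketch below is an assumption-laden example rather than a Library method: the lexicon is a tiny stand-in, and any real measure would use a full, period-appropriate dictionary and document its method alongside the dataset.

```python
"""A crude, illustrative proxy for OCR quality: the share of
alphabetic tokens found in a reference wordlist. A sketch only;
LEXICON is a tiny stand-in for a full (ideally period-appropriate)
dictionary, and any published measure should document its method
alongside the dataset.
"""
import re

# Stand-in lexicon; substitute a full dictionary in practice.
LEXICON = {"the", "road", "from", "edinburgh", "to", "leith",
           "crosses", "water", "of", "near"}


def dictionary_rate(text: str) -> float:
    """Return the fraction of tokens that appear in LEXICON."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in LEXICON for t in tokens) / len(tokens)


# OCR errors ('r0ad', 'frum') lower the score.
print(f"{dictionary_rate('The r0ad frum Edinburgh to Leith'):.2f}")
```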
More broadly, given the scale of library collections, placing digitised or data collections in context becomes increasingly important. From the entire ‘pool’ of a library’s collections (itself informed by often problematic historic collecting practices), only some items are digitised: a result of factors including copyright, conservation and internal selection processes. From this subset of the collection, only some items are then presented as datasets: again, depending on resource, OCR quality or copyright. How to present these collections in context, and how these thinned-down collections could come to stand for a broader, tacit understanding of ‘culture’, is problematic. Given current copyright laws and the standardised format of many nineteenth-century collections, numerous data collections (and research projects) have a strong emphasis on this period; with many other audiences holding a stake in the availability of digital and data collections, what could this mean for creative outputs, school learning or business analysis? In their study of newspaper digitisation practices, Hauswedell et al. (2020) set out a series of recommendations for digitisers, including ‘engag[ing] in critical (self-)reflection on the implicit and explicit selection criteria that shape their collections’, as well as providing information about selection rationale and funding, all as a part of the digital archive rather than in addition to it. From a research perspective, Bonacchi and Krzyzanska (2019) note the ‘duty [of heritage researchers] to study the ways in which heritage is made and assessed online’ (p. 9) and the importance of the ‘ability to provide counter-narratives’ (p. 8). Libraries, too, bear responsibility for how these decision-making processes are recorded, and how (and whether) this information is subsequently conveyed to their users: libraries must become partners in this ‘digital heritage activism’ (p. 8) to encourage critical engagement with the collections. Defining and identifying ‘lack’ has not traditionally been part of library practice: we must step away from the idea of libraries as the summation of their collections (and, to a remote audience, their digital collections), placing emphasis on what is not there as well as what is.
Furthermore, presenting collections as data also means rethinking what cultural heritage organisations collect, and what constitutes a ‘collection’. Should libraries collect the outputs of digital scholarship projects if they are published online? Should modified collections data be re-ingested into the collections, and what should the criteria for this be? Where is the boundary between the responsibilities of university and national libraries? What infrastructure would be needed for this kind of collecting? As McDonald et al. (2020) point out, ‘There have always been more things that could be collected and kept than are actually preserved for the future’ (p. 424), but the shift towards digital demands a weightier focus on this area. An Alan Turing Institute white paper (McGillivray et al., 2020) recommends that cultural heritage organisations move towards shared infrastructures to ‘democratise access to digital resources, and to guarantee their continued maintenance and improvement’ (p. 18), pointing towards more porous boundaries between the traditional responsibilities of the university and the national collecting institution, and the crumbling of silos. And, as platforms providing digital resources fragment and portions of collections are made available behind paywalls or under embargo, what does this mean for the idea of a ‘collection’? Conversely, with users increasingly able to curate their own collections from vast digital libraries such as HathiTrust, a collection increasingly becomes a matter of individual perspective. While libraries make challenging decisions about what becomes heritage, the dispersal of the resulting digital ‘heritage’ means that epistemological decisions are, simultaneously, increasingly being made by the user, pointing towards the subjectivity of collections, collecting and the cyclical ways in which we create and perceive the world through these practices.
Navigating a world of Big Data
Digital scholarship is a disruptive influence in the library. Its extensive scale and scope leave few areas of the Library untouched, whether in the way collections are described or presented; the changing focus of existing services and the ways they are delivered; modes of access to the collections, and what collections are; or staff skills. In many ways, digital scholarship is the Library, just seen from a different perspective. Laying foundations for libraries to support digital scholarship involves bringing disparate teams together to work in new ways, with all the practical challenges this entails around changes to processes, workflows, skills and culture. But it also means rethinking what libraries are in the first place and who their audiences are, and resituating them in a world of Big Data. Supporting digital scholarship within the library means looking again at libraries’ structures, removing the rigidity of divisions and hierarchies, and reconsidering the roles that libraries recruit for, to enable fresh ideas and practices to circulate.
It also means rethinking the levels of transparency between library and user, and the implications of this. In a world of Big Data, where libraries increasingly curate and create information themselves, how do we ensure that libraries do not compound a data-driven epistemological crisis? Remaining an authoritative information source requires libraries to collect and convey information about how and why particular items, above others, become part of the national collections – and available in digital galleries and as datasets for global access – to enable and encourage critical engagement with the concepts of ‘cultural heritage’ and ‘collections’, and to avoid themselves becoming a black box. Libraries should become champions of transparency, embracing openness – open source, open access and open licensing.
Inhabiting this world of Big Data and supporting digital scholarship should also enable us to explore the diversity of our collections and to consider what is missing, and what these absences can tell us – and, crucially, to act on this. It also means rethinking discovery and how people find our collections, and what they find when they get there. Releasing catalogue data will enable libraries to explore where the biases in collecting practices lie and acknowledge and address them, as well as communicating them to users. By navigating the ethics and practicalities of curating, creating and presenting Big Data, libraries – and their roles and responsibilities – are themselves adapting, reassessing and transforming as future heritage makers.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
