Sage Journals: Discover world-class research

Abstract

This paper examines optical character recognition (OCR) through the lens of archival ethics, as outlined in the Society of American Archivists (SAA) Core Values Statement and Code of Ethics, in light of current debates surrounding artificial intelligence (AI). A literature review highlights persistent challenges of authenticity and integrity, transparency and accountability, access and equity, and responsible stewardship and sustainability, as well as new concerns about bias, sustainability, and accountability raised by the use of large language models (LLMs). A case study describes systematic testing of LLM, transformer model (TM), and neural network (NN) architectures and examines the challenges of creating a reliable, scalable in-house OCR tool named Opticolumn. The case study finds that NN approaches align better with archival ethics than LLM tools, which may generate fabrications, but that the choice of OCR tool will depend on the capacities and preferences of individual institutions.

Keywords

optical character recognition, archival ethics, digital preservation, artificial intelligence, accessibility, digital stewardship, sustainable digital practices, computer vision

References

Allen, Marie. 1987. “Optical Character Recognition: Technology with New Relevance for Archival Automation Projects.” The American Archivist 50 (1): 94–6. doi:10.17723/aarc.50.1.8j8m4l8q8q2h5p7.

Archives and Records Association (UK). 2024. “Code of Ethics.” https://www.archives.org.uk/ara-code-of-ethics.

Blanke, Tobias, Michael Bryant, and Mark Hedges. 2011. “Ocropodium: Open Source OCR for Small-Scale Historical Archives.” Journal of Information Science 38 (1): 65–76. doi:10.1177/0165551511429418.

Blanke, Tobias, Michael Bryant, and Mark Hedges. 2012. “Open Source Optical Character Recognition for Historical Research.” Journal of Documentation 68 (5): 613–27. doi:10.1108/00220411211256021.

Breeding, Marshall. 2025. “Mergers.” Library Technology Guides. https://librarytechnology.org/mergers/ (accessed October 1, 2025).

Burchardt, Jørgen. 2023. “Are Searches in OCR-Generated Archives Trustworthy? An Analysis of Digital Newspaper Archives.” Economic History Yearbook 64 (1): 1–25. doi:10.1515/jbwg-2023-0003.

Bushey, Jessica, et al. 2025. “Report on the Survey ‘Digitization and Artificial Intelligence for Archives and Documentary Heritage Materials.’” InterPARES Trust AI, May 2025. https://interparestrustai.org/assets/public/dissemination/RA03-InterPARESAI-Survey_Report_FINAL.pdf.

Chiron, Guillaume, Antoine Doucet, Mickael Coustaty, and Jean-Philippe Moreux. 2017. “Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information.” In Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 277–80. New York, NY: ACM Press. doi:10.1109/JCDL.2017.7991582.

Clova AI Research. 2019. Clovaai/Deep-Text-Recognition-Benchmark. Version 1.0. April 3, 2019. https://github.com/clovaai/deep-text-recognition-benchmark.

Cordell, Ryan. 2017. “‘Q i-jtb the Raven’: Taking Dirty OCR Seriously.” Book History 20: 188–225. doi:10.1353/bh.2017.0006.

Davet, Jeremy, Karine Langelier, Inge Angevaare, and Sabine Mas. 2023. “Archivist in the Machine: Paradata for AI-Based Automation in the Archives.” Archival Science 23: 1–25. doi:10.1007/s10502-023-09408-8.

de Oliveira, Lucas Lima, Gabriel Simões, Rafael Dueire Lins, Steven J. Simske, and Jian Fan. 2023. “Evaluating and Mitigating the Impact of OCR Errors on Information Retrieval.” International Journal on Digital Libraries 24: 1–22. doi:10.1007/s00799-023-00345-6.

Gallagher, Patrick, and Ray Griffin. 2025. “Algorithmic Profiling of the Unemployed.” In Digital Public Employment Services in Action, edited by Ray Griffin, Didier Demazière, Janine Leschke, and Magnus Paulsen Hansen, 28–43. Bristol: Bristol University Press. doi:10.2307/jj.18323746.7.

Ghaseminejad Raeini, Mohammad. 2025. “The Evolution of Language Models: From N-Grams to LLMs, and Beyond.” Natural Language Processing Journal 12: 100168. doi:10.1016/j.nlp.2025.100168.

Google. 2025. Gemma-3-27b-It. September 12, 2025. https://huggingface.co/google/gemma-3-27b-it.

Hickey, Michael. 2024. “How Higher Ed Institutions Are Responding to Google Storage Limits.” EdTech, October 14, 2024. https://edtechmagazine.com/higher/article/2023/01/how-higher-ed-institutions-are-responding-google-storage-limits.

Hilton, Michael L., Jeffrey M. Goessling, Leah M. Knezevich, and Jane M. Downer. 2022. “Utility of Machine Learning for Segmenting Camera Trap Time-Lapse Recordings.” Wildlife Society Bulletin 46 (4): 1–15. doi:10.1002/wsb.1342.

Huang, Lei, Weijiang Yu, Weitao Ma, et al. 2025. “A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.” ACM Transactions on Information Systems 43 (2): 42:1–55. doi:10.48550/arXiv.2311.05232.

Hugging Face. 2025. “TrOCR: Transformer-Based Optical Character Recognition.” https://huggingface.co/docs/transformers/en/model_doc/trocr (accessed October 14, 2025).

International Council on Archives. 1996. “Code of Ethics.” https://www.ica.org/app/uploads/2023/12/ICA_1996-09-06_code-of-ethics_EN.pdf.

Jaided AI. 2020. JaidedAI/EasyOCR. Version 1.7. March 14, 2020. https://github.com/JaidedAI/EasyOCR.

Jaillant, Lise, and Arran Rees. 2023. “Applying AI to Digital Archives: Trust, Collaboration and Shared Professional Ethics.” Digital Scholarship in the Humanities 38 (2): 571–85. doi:10.1093/llc/fqac073.

Karppinen, Marko. 2016. “How I Ended Up Paying $150 for a Single 60GB Download from Amazon Glacier.” Medium, January 17, 2016. https://medium.com/@karppinen/how-i-ended-up-paying-150-for-a-single-60gb-download-from-amazon-glacier-6cb77b288c3e.

Keinan-Schoonbaert, Adi. 2025. “Automatic Text Recognition (OCR/HTR): A LIBER Digital Scholarship & Data Science Topic Guide for Library Professionals.” Digital Scholarship & Data Science Topic Guides. Published electronically May 14, 2025. doi:10.23636/k8qs-wc65.

Khan, Arsh, Sanket Biswas, and Huseyin Tolu. 2024. “OCR Approaches for Humanities: Applications of Artificial Intelligence/Machine Learning on Transcription and Transliteration of Historical Documents.” Digital Studies in Language and Literature 1 (1–2): 1–18. doi:10.1515/dsll-2024-0013.

Kiessling, Benjamin. 2025. The Kraken OCR System. Version 6.0. August 2025. https://kraken.re.

Koenecke, Allison, Anna Seo Gyeong Choi, Katelyn X. Mei, Hilke Schellmann, and Mona Sloane. 2024. “Careless Whisper: Speech-to-Text Hallucination Harms.” In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’24). New York, NY: Association for Computing Machinery. doi:10.1145/3630106.3658996.

Machidon, O. M., and A. L. Machidon. 2025. “Comparing OCR Pipelines for Folkloristic Text Digitization.” Digital Heritage. Published electronically. doi:10.48550/arXiv.2507.19092.

Mannheimer, Sara, Devin Higgins, Kaylin Nelson, Allison Tanner, Leila Sterman, Jonathan Cain, and Benjamin Reese. 2024. “Responsible AI Practice in Libraries and Archives: A Review of the Literature.” Information Technology and Libraries 43 (3): 1–22. doi:10.5860/ital.v43i3.17245.

McLeod, Julie, and Brianna Gormly. 2017. “Using the Cloud for Records Storage: Issues of Trust.” Archival Science 17: 1–22. doi:10.1007/s10502-017-9280-5.

Mindee. 2021. Mindee/Doctr. Version 1.0. January 8, 2021. https://github.com/mindee/doctr.

Mita, Amanda, Caryn Radick, Ann Weller, and Hsuan Shao. 2018. “CONTENTdm to Digital Commons: Considerations and Workflows.” Journal of Archival Organization 15 (1–2): 1–15. doi:10.1080/15332748.2019.1609308.

OpenGVLab. 2023. OpenGVLab/InternVL. Version 1.0. November 22, 2023. https://github.com/OpenGVLab/InternVL.

PaddlePaddle. 2020. PaddlePaddle/PaddleOCR. Version 2.6. May 8, 2020. https://github.com/PaddlePaddle/PaddleOCR.

Reitshamer, Stefan. 2017. “Amazon Glacier Pricing Changes and Retrieval Tiers.” Arq Blog, February 3, 2017. https://www.arqbackup.com/blog/amazon-glacier-pricing-changes-and-retrieval-tiers/.

Ringel, Sharon, and Angela Woodall. 2019. “A Public Record at Risk: The Dire State of News Archiving in the Digital Age.” Columbia Journalism Review, March 28, 2019. https://www.cjr.org/tow_center_reports/the-dire-state-of-news-archiving-in-the-digital-age.php.

Slater, Kailyn “Kay.” 2025. “Against AI: Critical Refusal in the Library.” Library Trends 73 (4): 588–608. doi:10.1353/lib.2025.a968497.

Society of American Archivists. 2020. “SAA Core Values Statement and Code of Ethics.” https://www2.archivists.org/statements/saa-core-values-statement-and-code-of-ethics.

Society of American Archivists. 2024. “American Archivist Generative AI Statement.” https://www2.archivists.org/american-archivist/AI-statement (accessed February 15, 2024).

Society of American Archivists. 2026. “Committee on Ethics and Professional Conduct.” https://www2.archivists.org/groups/committee-on-ethics-and-professional-conduct (accessed February 22, 2026).

Tesseract OCR Team. 2021. Tesseract-OCR/Tesseract. Version 5.0. https://github.com/tesseract-ocr/tesseract (accessed November 30, 2021).

Traub, Myriam C., Jacco van Ossenbruggen, and Lynda Hardman. 2015. “Impact Analysis of OCR Quality on Research Tasks in Digital Archives.” In Research and Advanced Technology for Digital Libraries: 19th International Conference on Theory and Practice of Digital Libraries, TPDL 2015, Limerick, Ireland, September 7–11, 2015, edited by Lora Aroyo, Natalie Collier, Anne R. Dieke, and Christoph Lange, 283–94. Cham: Springer. doi:10.1007/978-3-319-24592-8_19.

U.S. Small Business Administration, Office of Advocacy. 2026. “Justice Department Finalizes Rule Requiring State and Local Governments to Make Their Websites Accessible.” https://advocacy.sba.gov/2024/04/25/justice-department-finalizes-rule-requiring-state-and-local-governments-to-make-their-websites-accessible/ (accessed February 22, 2026).

University of Idaho Library. 2025a. “About, Dr. Richard B. Wells Collection.” Dr. Richard B. Wells Collection, University of Idaho Digital Collections. https://www.lib.uidaho.edu/digital/wells/about.html.

University of Idaho Library. 2025b. “Digital Collections Search.” https://digital.lib.uidaho.edu (accessed October 1, 2025).

University of Idaho Library. 2026a. “Digital Collections.” https://www.lib.uidaho.edu/digital/.

University of Idaho Library. 2026b. “Taylor Wilderness Research Station Archive.” https://www.lib.uidaho.edu/digital/taylor-archive/ (accessed February 22, 2026).

W3C Web Accessibility Initiative (WAI). 2025. “Understanding Success Criterion 1.3.1: Info and Relationships.” https://www.w3.org/WAI/WCAG21/Understanding/info-and-relationships.html (accessed October 1, 2025).

Warner, Lia, and Polina Belova. 2025. “Issues in Representing Archives: AI, Ethics, and Metadata: An Interview with Lia Warner.” Comparative Literature Studies 62 (3): 493–9. doi:10.5325/complitstudies.62.3.0493.

Williamson, Evan, Olivia Wilke, Klytie, Kevin Dobbins, and Andrew Weymouth. 2025. “Processing Documents.” Digital Collections Docs. https://uidaholib.github.io/digital-collections-docs/content/processing/06-documents-processing.html (accessed October 1, 2025).

Transparent Practices: OCR and AI in the Archives
