Abstract
Accurate industrial classification of firms forms the backbone of business surveys, economic policymaking, and international trade analysis. However, national statistics institutes (NSIs) worldwide grapple with the labor-intensive manual assignment of International Standard Industrial Classification (ISIC) codes: a process prone to human error, inconsistent across regions, and particularly burdensome for developing economies. This study confronts these challenges by assessing the performance of four text-matching methods, namely token overlap (Jaccard), TF-IDF cosine similarity, edit distance (fuzzy matching), and SBERT embeddings, against human-coded ground truth in classifying firms. Using a dataset of 6588 firms, performance diverges sharply: SBERT attains Accuracy
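The token-overlap (Jaccard) baseline mentioned above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the two ISIC entries are a hypothetical subset of the full classification used only for the example.

```python
# Hedged sketch of a token-overlap (Jaccard) classifier: each firm
# description is matched to the ISIC code whose official description
# shares the largest fraction of word tokens with it.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(description: str, isic_descriptions: dict) -> str:
    """Return the ISIC code whose description maximizes Jaccard overlap."""
    tokens = set(description.lower().split())
    return max(
        isic_descriptions,
        key=lambda code: jaccard(tokens, set(isic_descriptions[code].lower().split())),
    )

# Illustrative subset of ISIC classes (descriptions abbreviated).
isic = {
    "1071": "manufacture of bakery products",
    "4711": "retail sale in non-specialized stores with food predominating",
}

print(classify("bakery products manufacture and sale", isic))  # → 1071
```

In practice one would tokenize more carefully (stemming, stop-word removal) and compare against all ISIC descriptions; TF-IDF cosine similarity and SBERT embeddings replace the set-overlap score with weighted or learned vector similarities.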