Abstract
Accurate industrial classification of firms forms the backbone of business surveys, economic policymaking, and international trade analysis. However, national statistics institutes (NSIs) worldwide grapple with the labor-intensive manual assignment of International Standard Industrial Classification (ISIC) codes: a process prone to human error, inconsistent across regions, and particularly burdensome for developing economies. This study confronts these challenges by assessing the performance of four text-matching methods, namely token overlap (Jaccard), TF-IDF cosine similarity, edit distance (fuzzy matching), and SBERT embeddings, against human-coded ground truth in classifying firms. Using a dataset of 6588 firms, performance diverges sharply: SBERT attains Accuracy
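The token-overlap (Jaccard) baseline mentioned above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the two ISIC entries are a hypothetical subset of the full classification used only for the example.

```python
# Hedged sketch of a token-overlap (Jaccard) classifier: each firm
# description is matched to the ISIC code whose official description
# shares the largest fraction of word tokens with it.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(description: str, isic_descriptions: dict) -> str:
    """Return the ISIC code whose description maximizes Jaccard overlap."""
    tokens = set(description.lower().split())
    return max(
        isic_descriptions,
        key=lambda code: jaccard(tokens, set(isic_descriptions[code].lower().split())),
    )

# Illustrative subset of ISIC classes (descriptions abbreviated).
isic = {
    "1071": "manufacture of bakery products",
    "4711": "retail sale in non-specialized stores with food predominating",
}

print(classify("bakery products manufacture and sale", isic))  # → 1071
```

In practice one would tokenize more carefully (stemming, stop-word removal) and compare against all ISIC descriptions; TF-IDF cosine similarity and SBERT embeddings replace the set-overlap score with weighted or learned vector similarities.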