Abstract
Occupation coding encodes job titles into standard occupation labels, which is effective for data processing but tedious. Research proves that classic machine learning is effective, but accuracy needs further improvement. We construct a real data set with 881 occupation categories, including 41,297 pairs of job titles and corresponding labels. We design a hierarchy-aware heterogeneous graph neural network, combining prior knowledge from occupation category trees and synonyms. Results show our model outperforms other methods by 7.62% on micro-F1. It also alleviates the dependence on data as it achieves 52.28% on micro-F1 with only 30% of the original training data set.
Keywords
1. Introduction
Encoding collected pieces of free text into standard coded labels with many categories is a prevalent task in research studies (Linneberg and Korsgaard 2019), such as occupation coding. There are many open-ended questions in survey questionnaires asking about respondents’ jobs, which are more informative than closed questions. To be analyzed, these answers usually have to be coded or aggregated into predefined categorical occupation systems (Elliott 2018). For example, in a social questionnaire, some possible question wordings are “What work do you do?” or “Does your work have a special name?” (Scholz and Wasmer 2009). Occupation coding has applications in statistical, social, and epidemiology studies. Researchers collect respondents’ occupation information in the form of questionnaire texts and then formulate classification rules to categorize the textual information for subsequent research, such as classifying the social status (Connelly et al. 2016) or employment status (Montebruno et al. 2020) between occupation categories, determining which category of work affects physical health (Tannis et al. 2020).
Besides the application in categorizing textual data from surveys, occupation title coding is also critical in job market analysis. Government agencies, companies, and vocational schools need to code various job titles that employees and employers fill out and that are mentioned in various job advertisements. Classifying job titles according to a standardized taxonomy makes occupations in different areas comparable (Boselli et al. 2018). The comparability allows the study of current occupational vacancies, labor market changes, and job requirements.
Although textual job titles contain a wealth of helpful information, most researchers find it complicated to interpret and analyze such information in an automated manner (Reja et al. 2003), especially with extensive datasets. Normalizing and coding textual data costs more time and labor than handling structured numerical data (such as ages and income). Respondents provide their job titles informally in various styles, whether through handwritten responses on paper, typed entries in online surveys, or verbal descriptions recorded by interviewers in administered surveys. Researchers subsequently classify textual occupation information into specific categories according to the occupation category tree released by officials or defined by researchers. This manual classification procedure is tedious and demanding. First, the accuracy must be high because errors can significantly affect the subsequent statistical analysis. Second, it is difficult for people without domain knowledge to make the correct classification because they might not clearly understand the textual data or the category tree. The number of expert-designed categories ranges from several to thousands, making it more difficult to classify correctly (Alazaidah and Ahmad 2016).
Given these difficulties, the existing literature has applied several methodologies to tackle the problem. There exist different approaches both within and across disciplines. At first, social scientists mainly used rule-based methods based on a dictionary of occupation-related keywords. Then, with the rapid development in machine learning, many social scientists started to adopt text classification methods in machine learning to solve this problem (Uysal and Günal 2014). Many researchers view occupation coding as a high-dimension text classification task, involving a large feature space or numerous categories like the 881 classes in our dataset, or a multi-label task where a single job title, such as “software developer” and “project manager” in hybrid roles, may carry multiple labels (Keller 2017). Some classical machine learning methods have been employed for this task and have proven to be efficient in saving labor costs. Supervised algorithms learn the distribution of human-labeled textual data pairs (free job title and corresponding answer) in the training set and give predictions for the test set, significantly saving time and labor. In more detail, we review these methods and the corresponding literature in Section 2.
While the advance brought by machine learning is evident, this task still presents several challenges to researchers and practitioners. Firstly, the accuracy needs further improvement. Although machine learning-based methods outperform traditional rule-based methods, they still cannot completely replace manual classification because the classification accuracy is not high enough for research purposes (Basit 2003). Instead, machine learning methods reduce the workload in transferring text to qualitative data by giving several possible categories as hints for coders to choose.
Secondly, occupation coding is different from text classification. Current studies directly employ text classification models, which is not entirely reasonable. The difficulty of the occupation coding task comes from the fact that short texts with the same semantics have a high degree of variety because of synonyms, adjectives, syntax, abbreviations, grammatical forms (or order of sequences), singular and plural, and typos (Chen et al. 2018). These informal texts from respondents are short and variable. For example, “scholar” and “research scientist” are semantically similar and belong to the label “researchers.” “Senior software developer” and “junior software developer” have different adjectives, but they are both software developers. In ordinary text classification tasks, by contrast, we are more concerned with understanding semantics through context, negation, synonyms, and antonyms. At the same time, occupation coding data sets have more categories and are more imbalanced. The number of categories in our collected data set is 881, much larger than in widely used benchmark data sets (such as WOS-46985 (Kowsari et al. 2017), which contains 46,985 documents with 134 categories). Some common categories, such as “software developers,” have about 3,000 samples, while the less common categories, such as “mathematics,” have only about ten samples.
Thirdly, annotated data sets are limited when using automatic coding in practice. The labor market changes rapidly (Colombo et al. 2019), and occupation title annotation rules quickly become obsolete and need frequent updating. Consequently, an outdated labeled data set is not universally applicable in current social science. Apart from that, social scientists may label the data according to their research objectives. For instance, if social scientists want to find out how managers influence enterprise culture, the job title “CTO” might be regarded as “chief execution.” However, if social scientists pay more attention to technicians, this job title can be labeled “technical specialist.” Since the annotation rules change and the annotation intention varies, many social scientists still manually label their own data sets. The state-of-the-art methods in occupation coding are supervised machine learning methods, which rely heavily on the labeled data set (Bethmann et al. 2014).
To solve these problems, we propose a novel graph neural network, a hierarchy-aware heterogeneous graph neural network (HHGNN). Most studies only directly apply text classification to occupation title coding without considering the features of this task. The motivation is that human coders often refer to category trees when coding. The categories are so many that coders cannot remember every category. So they use a complex hierarchy to find the primary and minor categories. As for the synonym, it is apparent that many occupation titles have synonyms with their corresponding labels. A better classification model for occupation coding should identify not only the vocabulary similarity but also the semantic similarity. Recently, graph neural networks have had successful applications in classification tasks (Malekzadeh et al. 2021) and can express a wide range of heterogeneous information (Zhang et al. 2019), so we adopt this technique to construct a two-layer hierarchy-aware heterogeneous graph neural network (HHGNN) to extract valuable prior knowledge and further improve classification performance. We use the edges in the graph to mimic the domain knowledge, the category tree, and the synonym. The edges in the graph link texts that are connected in a hierarchy tree or semantically similar.
We illustrate our proposed model by comparing it with state-of-the-art methods from different disciplines, using a data set of 41,297 job titles from LinkedIn collected in 2021. We compare it with current occupation coding methods and advanced text classification models, such as Bidirectional Encoder Representations from Transformers (BERT) and graph neural networks.
We contribute to the existing literature by proposing a new effective model and comparing it with current occupation coding and text classification methods. Our comparison experiments serve as a reference for a subsequent social scientist to choose an appropriate method. Moreover, we incorporate expert knowledge into HHGNN, and the ablation experiment, where components are systematically removed to assess their impact, illustrates the validity of this expert knowledge, which encourages further exploration of hierarchical category trees and synonyms. This paper proposes an approach that integrates the features of occupation coding, posing this problem as a multi-category classification. We collected raw data from social science research: two trained master students coded the variable job titles of respondents to standard occupation labels. Duplicate records are removed, and all samples are different to avoid duplication between the training and test data sets. To summarize, our contributions are as follows:
We construct a data set comprising 41,297 pairs of job titles and corresponding labels, which can reflect the current labor market.
We design a novel graph neural network with a unique structure, which is more automatic and achieves higher accuracy.
Our proposed HHGNN adopts a hierarchy tree and semantic features to alleviate the dependency on labels, demonstrating the importance of expert knowledge, category tree, and semantic similarity.
Our paper is organized as follows. Section 2 reviews the primary and advanced approaches applied to occupation coding. Section 3 presents the details of the proposed graph neural network. Section 4 provides critical information about the experiment, such as the data set and metrics. In Section 5, we conduct extensive experiments to test the effectiveness of the proposed method and other algorithms. Finally, we summarize the conclusions and implications in Section 6.
2. Related Work
Nowadays, the main procedures of occupation coding are as follows: specially designed computer programs integrate free texts into tables for professional coders. Then, several coders independently assign a category to each entry based on their judgment, and the program chooses categories according to their frequencies. For example, if two of three coders assign “software developer” and the other coder chooses “web designer” instead, then the program selects “software developer” as the final answer.
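The frequency-based selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the program used in practice; the function name `majority_code` and the handling of ties (returning `None` so that the entry can be reviewed manually) are assumptions, since the text does not say how ties are resolved.

```python
from collections import Counter

def majority_code(assignments):
    """Return the most frequent code among the coders, or None if there is no clear winner."""
    if not assignments:
        return None
    counts = Counter(assignments).most_common()
    # Assumption: a tie between the two most frequent codes is left unresolved.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# The example from the text: two of three coders choose "software developer".
votes = ["software developer", "software developer", "web designer"]
print(majority_code(votes))
```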
Many researchers mainly explored the robust and classic methods in machine learning to get more satisfying accuracy and cost savings. Domain approaches in the occupation coding task can be categorized as rule-based, unsupervised, and supervised learning methods (Nelson et al. 2021). It is worth noting that many studies use job titles and job descriptions (or job advertisements) for occupation classification (Boselli et al. 2018), but in considerable research, there is no job description. For example, we cooperated with social scientists and knew that they collected job titles from LinkedIn resumes. Most users only write down their job titles without descriptions of their occupations. Therefore, we focus on studies that only use job titles to classify occupations.
2.1. Rule-Based Methods
The rule-based method, also known as the knowledge-based or dictionary-based method, is a straightforward and widely used automated textual analysis tool that analyzes linguistic phenomena and uses syntactic, semantic, and discourse information. The most straightforward logical rule is known as the code index, whereby a code is assigned if the free text is the same as the index entry for that code. For example, the corresponding code is given when a (preprocessed) answer is identical to a given string. There are more complicated rules. For instance, a rule may assign each word a category and then calculate the probability that a sentence belongs to a particular category. Rules are refined progressively for coverage and accuracy (Bundesagentur für Arbeit 2011), which provides a recognized framework. Researchers often adapt or select subsets of rules based on specific datasets or research questions.
This category of approaches needs a rich dictionary of synonyms, as there are often several ways to express the same concept. Tarrow (1995) applies the linguistic approach to survey data analysis, using dictionaries to remove stop-words and unify singular and plural. Standard dictionaries such as the Linguistic Inquiry and Word Count (Tausczik and Pennebaker 2010) are reliable, but only in limited domains (Franzosi 1989). Bao develops rule-based methods into two steps: the search and filter stages. The former recommends a series of possible codes for a textual answer, and the filter selects a single NOC (National Occupational Classification) code from a list of candidate codes (Bao et al. 2020). This method is applied to about 500 manually coded jobs, and the accuracy rate at the four-level code level is 58.7%. Rule-based methods are still labor-intensive and cannot cover some types of textual data, as some textual information may not match the criteria for any rule. With the rapid progress of the natural language processing (NLP) discipline, social scientists apply the foremost NLP tools to support the analysis of textual data. Crowston combines tokenization (Webster and Kit 1992) and part-of-speech tagging (Voutilainen 2004) techniques with dictionaries to analyze raw text (Crowston et al. 2010), which further automates the process of building rules (Crowston et al. 2012). The rule-based methods are far from automatic because they require a large amount of domain knowledge to build the dictionary and cannot find the corresponding labels for every job title.
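As a rough illustration of the code-index rule, the sketch below assigns a code only when the preprocessed answer is identical to an index entry. The preprocessing steps in `normalize`, the function names, and the placeholder codes `CODE-A` and `CODE-B` are assumptions made for the example and are not taken from any published coding index.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace (assumed preprocessing)."""
    text = re.sub(r"[^a-z0-9+# ]", " ", text.lower())
    return " ".join(text.split())

def code_index_lookup(answer, index):
    """Return the code whose index entry equals the preprocessed answer, else None."""
    return index.get(normalize(answer))

# Hypothetical index entries; the code strings are placeholders, not real occupation codes.
raw_index = {"Software Developer": "CODE-A", "Web Designer": "CODE-B"}
index = {normalize(entry): code for entry, code in raw_index.items()}

print(code_index_lookup("  software   developer. ", index))
print(code_index_lookup("scholar", index))
```

An answer such as "scholar" that matches no entry is left uncoded, which is the coverage gap that motivates the richer rules and dictionaries discussed in this section.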
2.2. Unsupervised Learning Method
Unsupervised learning methods, which identify patterns or categories in data without labeled examples, have been applied to occupation coding to reduce manual effort. Similarity-based approaches, leveraging cosine similarity, automatically exploit linguistic relations between informal job titles and target categories by assigning the most similar class to a textual response. Jung et al. (2008) demonstrate that this method outperforms dictionary-based and multinomial regression techniques, achieving approximately 73% accuracy on a dataset with 450 standard occupation codes, while the PACE system further employs cosine similarity within a k-nearest neighbor framework. Meanwhile, nearest-neighbor approaches, using Jaccard similarity, classify text into the Standard Occupational Classification (SOC) scheme. For instance, a job title like “software engineer” might be assigned the “software developer” code if their TF-IDF (Term Frequency-Inverse Document Frequency) vector representations exhibit the highest Jaccard similarity, reflecting shared terms. Russ et al. (2014) report a 64% agreement rate at the 3-digit SOC level in a small-scale study, and Gweon et al. (2017) adapt this method for questionnaire coding, representing texts as TF-IDF vectors and achieving 65% accuracy on 9,137 observations across 399 occupation codes, where a low similarity score may indicate a new category requiring human intervention. These methods save labor and cover more samples, but their accuracy requires further improvement.
2.3. Supervised Learning Method
Supervised learning methods, which train models on labeled data to predict categories, have been widely adopted by social scientists for occupation coding. The Vector Space Model (VSM), introduced for information retrieval (Salton et al. 1975), encodes counts of occurrences of single terms in documents. The VSM provides the foundation for machine learning in the occupation coding task. Early applications primarily focused on Bayesian methods. Bethmann et al. (2014) employ two machine learning algorithms, Naive Bayes and Bayesian Multinomial, on a data set of 300,000 coded job titles. Schierholz converts text to verbatim vectors and uses coding by duplicates, Naive Bayes, a Bayesian approach using Dirichlet priors, and a gradient boosting model (Friedman 2001). The study concludes that the Bayesian methods reach similar accuracy when low production rates and high precision are desired. Kirby combines edit distance, logistic regression using stochastic gradient descent (Komárek 2004), and Naive Bayes into an ensemble model by majority voting (Kirby et al. 2015). The performance of the Naive Bayes classifier is much better than that of the other individual classifiers. The majority voting ensemble does not improve on Naive Bayes because of cases in which the individual classifiers do not agree.
With advancements in algorithms, research expanded to include more complex models. Nahoomi (2018) conducts an experiment on 65,962 SOC-coded job titles at all four levels of the hierarchy, broken down into 23 major groups, 97 minor groups, 461 broad groups, and 840 detailed occupations. The study uses uni-gram features to represent texts and reports that support vector machines and convolutional neural networks (CNNs) perform similarly. The support vector machine achieves 0.55 micro-F1 and 0.48 macro-F1, and the CNN arrives at 0.61 micro-F1 and 0.43 macro-F1 (metrics detailed in Section 4.2). Both methods are better than Naive Bayes, with only 0.46 micro-F1 and 0.41 macro-F1. This aligns with our experimental findings.
Some research explored flat versus hierarchical classification methods. Nahoomi reports the performances of four models, with Naïve Bayes, Maximum Entropy (Nigam et al. 1999), Support Vector Machines, and Convolutional Neural Networks, in flat and hierarchical methods. They train a binary classifier for each node of the hierarchy except the root. If the leaf node and its ancestor nodes create a path from the first to the last level of the hierarchy, the path is assigned to the record. The conclusion is that flat and hierarchical methods perform similarly, with flat methods having a higher recall and the hierarchical approach having a higher precision. The disadvantage of hierarchical classification is that this method has a longer running time, and errors in the higher hierarchy levels are propagated to the rest of the classification (Nahoomi 2018). In summary, the supervised method further enhances the accuracy, and the above mentioned experiments show the trend: Compared with logistic regression and the Bayesian approach, Support Vector Machine (SVM) and CNN have higher accuracy.
Coding textual information into numerous categories is a common and vital practice in social science, and the number of categories increases the difficulty of choosing the correct label. One characteristic of such a data set is that many categories have sparse examples in the training data set. Supervised machine learning usually performs better on labels with more instances but poorly on categories with sparse samples in the data set (Zhou 2018); this is the data sparsity problem. The ideal training data would need to contain examples for every possible category more than once, requiring tens of thousands of observations, a number rarely collected in typical surveys. Considering that occupation category trees are largely shared across studies, Schierholz and Schonlau (2021) collect data from multiple surveys to alleviate this concern. It is a practical solution but does not solve the problem fundamentally.
The existing research shows that only the basic classification models are applied, and the models are not adjusted according to the characteristics of this problem. Therefore, the motivation is to implement a more sophisticated classification model, combined with the prior knowledge from the occupation coding task, to see whether advances in machine learning and domain knowledge contribute to improvement in this task. At the same time, graph neural networks have been proven to be very effective in text classification (Chen et al. 2021), recommendation (Wu et al. 2022), etc. Graph neural networks can alleviate data sparsity by propagation through edges (Rossi et al. 2016). So in this paper, we combine the features of occupation coding and the more advanced graph neural network techniques to propose a heterogeneous GNN for better accuracy.
3. Methodology: Hierarchy-Aware Heterogeneous Graph Neural Network
We first introduce notations and define the occupation title coding task. We then describe the heterogeneous graph construction in Section 3.2 and Section 3.3: the hierarchical edges, semantic similarity edges, and node TF-IDF features are built using predefined occupational classification trees and unsupervised cosine similarity calculations, integrating unsupervised prior knowledge. Subsequently, we present the HHGNN architecture and its loss function, which operates within a supervised learning framework through labeled data to enhance prediction accuracy.
3.1. Problem Definition
We suppose that the total number of occupation labels is
3.2. Graph Construction
Figure 1 shows the graph in HHGNN. There are two types of nodes. Blue nodes represent texts, and green nodes represent labels. There are three types of edges: green edges exist only between labels and are used to mimic the category tree to make full use of the established categorical structure; blue edges only exist between texts and are artificially constructed to show the similarity between texts; red edges indicate that the text has synonyms with labels. There are two reasons for the construction of these edges. First, these edges compensate for the shortness of context information in the graph neural network. Second, we use these edges to absorb the patterns we find in the real data set. To be more accurate, we refer to three kinds of edges: hierarchical edges, similarity edges, and synonym edges. We refer this graph as

Figure 1. Graph of HHGNN.
Category trees are often used to express the hierarchical relationship between classes when the number of categories is large. In our experiment, we use the Standard Occupational Classification (SOC) system, released by United States federal government agencies to cover all occupations. A partial example of the category tree structure of the data set we collected is shown in Figure 2. The figure describes a 4-level hierarchical category tree whose last level contains the actual labels, and the number of labels is as high as 881. HHGNN leverages the prior knowledge of label correlations regarding the predefined hierarchy by constructing hierarchical edges. If two label nodes are connected in the category tree, then there is a green hierarchical edge in the graph between the two nodes.
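The construction of hierarchical edges can be sketched as follows, assuming the category tree is available as a child-to-parent mapping. The function name `hierarchical_edges`, the mapping format, and the toy category names are illustrative assumptions; they are not taken from the collected data set, and the sketch treats intermediate categories as label nodes.

```python
def hierarchical_edges(parent_of):
    """Return one undirected edge for every parent-child link in the category tree."""
    edges = set()
    for child, parent in parent_of.items():
        if parent is not None:  # the top-level category has no parent
            edges.add(tuple(sorted((child, parent))))
    return edges

# Toy four-level fragment of a category tree (illustrative names only).
parent_of = {
    "computer and mathematical occupations": None,
    "computer occupations": "computer and mathematical occupations",
    "software and web developers": "computer occupations",
    "software developers": "software and web developers",
    "web developers": "software and web developers",
}
edges = hierarchical_edges(parent_of)
print(len(edges))
```

Storing each edge as a sorted tuple keeps the edge set undirected, so a parent-child link is recorded once regardless of the direction in which it is encountered.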

Figure 2. Hierarchical category tree in standard occupational classification.
All job titles are denoted as
The TF (Term Frequency) measures how frequently a token
We find that texts in the same category tend to have many same words, so we build similarity edges to emphasize relationships between texts. Table 1 shows some texts with high similarity. The cosine similarity metric is adopted to quantify the semantic relatedness between the TF-IDF vector representations of textual nodes. This approach is grounded in information retrieval theory, where cosine similarity serves as a well-established measure for comparing document vectors in high-dimensional spaces (Salton et al. 1983). It is defined as the cosine of the angle between two vectors, thereby providing a measure of orientation similarity that is invariant to their magnitudes. This property is particularly advantageous for textual similarity tasks, as it focuses on the shared discriminative terms between documents while mitigating the influence of variable text length. Bao et al. (2020) also leveraged TF-IDF and cosine similarity for matching noisy job titles to standard classifications, demonstrating its practical efficacy in a closely related domain.
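A minimal sketch of how similarity edges could be derived from cosine similarity is given below. The sparse dictionary representation, the toy weights standing in for TF-IDF values, and the threshold of 0.3 are all assumptions made for illustration; they are not the values used in our experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_edges(vectors, threshold):
    """Link every pair of text nodes whose cosine similarity reaches the threshold."""
    names = sorted(vectors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if cosine(vectors[a], vectors[b]) >= threshold]

# Toy weights standing in for TF-IDF values (illustrative numbers only).
vectors = {
    "senior software developer": {"senior": 0.8, "software": 0.4, "developer": 0.4},
    "junior software developer": {"junior": 0.8, "software": 0.4, "developer": 0.4},
    "research scientist": {"research": 0.7, "scientist": 0.7},
}
edges = similarity_edges(vectors, threshold=0.3)  # hypothetical threshold
print(edges)
```

Because the measure depends only on the angle between vectors, multiplying every weight of a vector by a constant leaves its similarity to any other vector unchanged, which is the magnitude invariance noted above.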
Table 1. Samples of Job Titles with High Similarity.
We introduce
In our data set, every sample has not only the respondent’s job title but also the respondent’s industry. At first, we tried to use the similarity between respondents’ industry information and third-level category labels. However, this similarity harms the performance of our model. That is because people write their industries according to their corporations instead of their occupations. For instance, C++ developers who work at a bank probably write their industries as “banking” instead of “software,” which is misleading in the occupation coding task.
Synonym edges only exist between a pair of a text node and a label node. If the text and the label have common synonyms, then there will be a synonym edge. Table 2 gives examples of synonym edges. The free texts “scholar” and “research fellow” and the label “researchers” are synonyms. WordNet, accessed through the NLTK corpus reader in Python, is used to find synonyms.
Table 2. Examples of Synonym Edges.
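The rule for synonym edges can be sketched as below. To keep the example self-contained, a small hand-written synonym dictionary stands in for the WordNet lookup; the dictionary contents and the function names `synonym_set` and `synonym_edges` are assumptions made for illustration.

```python
def synonym_set(phrase, synonyms):
    """The phrase itself together with its listed synonyms."""
    return {phrase} | set(synonyms.get(phrase, ()))

def synonym_edges(texts, labels, synonyms):
    """Connect a text node and a label node when their synonym sets overlap."""
    return [(text, label) for text in texts for label in labels
            if synonym_set(text, synonyms) & synonym_set(label, synonyms)]

# Toy synonym dictionary standing in for a WordNet lookup (illustrative entries only).
synonyms = {
    "scholar": {"researcher", "academic"},
    "research fellow": {"researcher"},
    "researchers": {"researcher", "investigator"},
}
texts = ["scholar", "research fellow", "c++ developer"]
labels = ["researchers", "software developers"]
edges = synonym_edges(texts, labels, synonyms)
print(edges)
```

Edges are only ever created between one text node and one label node, matching the restriction stated above; no text-text or label-label synonym edges are produced.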
3.3. Neighbor Sampling
We adopt a general inductive framework, which leverages nodes’ local information to allow batch training on the large-scale graph. We sample

Figure 3. Neighbors of node v.
These connections and neighbors suggest that neighbor sampling can be used to alleviate data sparsity. We further examine whether our proposed method alleviates the data sparsity problem in Section 5.3.
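One possible form of neighbor sampling is sketched below: for a given node, at most k neighbors are drawn for each edge type. The adjacency format, the function name `sample_neighbors`, the value of k, and the choice to keep all neighbors when fewer than k exist are assumptions for illustration and may differ from the sampling scheme used in HHGNN.

```python
import random

def sample_neighbors(adjacency, node, k, rng):
    """Sample at most k neighbors of `node` for each edge type."""
    sampled = {}
    for edge_type, neighbors in adjacency.get(node, {}).items():
        pool = sorted(neighbors)  # sorted for reproducibility with a seeded generator
        sampled[edge_type] = pool if len(pool) <= k else rng.sample(pool, k)
    return sampled

# Toy adjacency lists grouped by edge type (illustrative names only).
adjacency = {
    "software engineer": {
        "similarity": {"senior software engineer", "software developer", "software architect"},
        "synonym": {"software developers"},
    }
}
rng = random.Random(0)
batch = sample_neighbors(adjacency, "software engineer", k=2, rng=rng)
print(batch)
```

Bounding the number of sampled neighbors keeps the cost of each training batch fixed regardless of how many edges a frequent node has, while a rare text still receives information from whatever neighbors it is linked to.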
3.4. Structure of HHGNN
To embody multiple edge-type information, we employ an edge-level heterogeneous network to learn specific parameters for every kind of edge and apply an edge-level attention structure to integrate information from different edges.
Heterogeneous edge type
The HHGNN processes node information through edge-type-specific transformations to account for the unique semantics of hierarchical, similarity, and synonym edges. Vector
For the first layer, the edge-type-specific embedding is computed in Equation 3.
Equation 4 shows the structure of the second layer.
The neighbor nodes, linked with the same type of edges, are aggregated by mean aggregation, as shown in Equation 5.
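The mean aggregation step can be illustrated with plain Python lists. The sketch assumes an unweighted element-wise mean over the neighbors reached through each edge type and omits the learned edge-type-specific transformations; the function names and the toy two-dimensional embeddings are hypothetical.

```python
def mean_aggregate(neighbor_vectors):
    """Element-wise mean of equally sized neighbor embeddings."""
    n = len(neighbor_vectors)
    return [sum(column) / n for column in zip(*neighbor_vectors)]

def aggregate_by_edge_type(neighbors_by_type):
    """Aggregate neighbors separately for every edge type that has at least one neighbor."""
    return {edge_type: mean_aggregate(vectors)
            for edge_type, vectors in neighbors_by_type.items() if vectors}

# Toy two-dimensional embeddings (illustrative numbers only).
neighbors_by_type = {
    "hierarchical": [[1.0, 0.0], [3.0, 2.0]],
    "similarity": [[0.5, 0.5]],
    "synonym": [],
}
aggregated = aggregate_by_edge_type(neighbors_by_type)
print(aggregated)
```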
Edge-Level attention aggregation
To integrate the edge-type-specific
As shown in Equation 7, we use an attention mechanism to measure the importance of different types of edges by calculating the similarity between mapped edge-specific embedding
Loss function
The output of the model can be represented as
The loss function is defined in Equation 9. We utilize a cross-entropy function on the training dataset
4. Experiments
We evaluate the performance of HHGNN on data from social science research. This data is collected to analyze the influence of management and technical talent flow on the strategic flexibility of enterprises. This empirical analysis uses LinkedIn’s online resume. It needs to match every candidate’s informal job title to the corresponding standard occupation labels, majors to the standard major category, school name to the corresponding university name, etc. Textual data was crawled from LinkedIn, where individuals described their occupations in various expressions. The category tree in our experiment, released by the United States federal government, is modified by the social scientists to fit the social science research purpose.
Two trained master students code the informal text into qualitative data, as Table 3 shows: they assign each descriptive job title to a specific occupation code based on the expert-designed categories. More specifically, the two human coders give their answers. If the answers are consistent, then the standard occupation is found. Otherwise, the two coders review the inconsistent samples and relabel them after discussing and agreeing.
Table 3. Samples of Original Data Set.
We conduct experiments on the job title data set to test our model’s performance. First, we describe our data sets which are based on manual proofreading. Then we choose some other well-known baselines and variations of HHGNN for comparative experiments.
4.1. Dataset Description
The descriptive statistics of the dataset are given in Table 4. The first row, “job title,” refers to the raw write-in responses provided by respondents, while “categories (labels)” refers to the standardized occupation codes they are mapped to. The dataset contains 41,297 samples and covers 881 categories, divided into the training set, validation set, and test set according to the ratio of 8:1:1. The average length of job titles is 2.54 characters, and the average length of category names is 2.69 characters. Job titles contain up to 16 characters, while category names can contain up to 10 characters.
Table 4. Descriptive Statistics of the Dataset.
The sample size under different labels varies significantly in the labeled data set, as shown in Figure 4. “Software developer” is the most frequently occurring job title among respondents, which corresponds to the most frequently occurring standard label “Software Developers.” Popular occupations such as “software developers” and “computer and information system managers” have 2,918 and 2,507 samples, respectively. Occupations such as “mathematics” and “life scientist” appear only once.
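An 8:1:1 split of this size can be sketched as follows. The random shuffle, the seed, the function name `split_8_1_1`, and the integer rounding of the split sizes are assumptions; the sketch does not reproduce the exact partition used in the experiments, and it does not stratify by category.

```python
import random

def split_8_1_1(samples, seed=0):
    """Shuffle the samples and split them into training, validation, and test sets (8:1:1)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Integer ids stand in for the 41,297 (job title, label) pairs.
train_set, valid_set, test_set = split_8_1_1(range(41297))
print(len(train_set), len(valid_set), len(test_set))
```

With 881 categories and some categories containing very few samples, a purely random split can leave rare categories absent from the validation or test set, which is worth keeping in mind when reading per-category metrics.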

Figure 4. Sample sizes of every category.
According to Table 5, the hierarchical category tree consists of four levels, and the quantity of every level is 23, 98, 459, and 881, respectively. The number of the total labels in this data set is 881.
Table 5. Descriptive Statistics of the Revised Hierarchical Category Tree.
4.2. Metrics
Standard evaluation metrics, including micro-F1 and macro-F1, are employed to evaluate our model. Considering the imbalance of samples among different categories, macro accuracy can better evaluate the model’s performance from the perspective of not focusing on frequent labels to a certain degree. For category
TopK accuracy is also one of our evaluation metrics (Baeza-Yates and Ribeiro-Neto 1999), where K is a positive integer. It is the fraction of job titles for which the correct answer is included in the recommendation list of length K.
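The metrics in this section can be computed as in the sketch below for single-label predictions. The function names and the toy labels are assumptions; macro-F1 is averaged here over every label that appears in either the true or the predicted labels, which is one common convention and may differ from the one used in our evaluation.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 score from true positive, false positive, and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Micro-F1 pools the counts over categories; macro-F1 averages per-category F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    labels = set(y_true) | set(y_pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

def top_k_accuracy(y_true, ranked_predictions, k):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    hits = sum(truth in ranked[:k] for truth, ranked in zip(y_true, ranked_predictions))
    return hits / len(y_true)

# Toy imbalanced example: a model that always predicts the frequent category.
y_true = ["developer", "developer", "developer", "mathematician"]
y_pred = ["developer"] * 4
ranked = [["developer", "mathematician"]] * 4
print(micro_macro_f1(y_true, y_pred))
print(top_k_accuracy(y_true, ranked, k=1), top_k_accuracy(y_true, ranked, k=2))
```

In this toy case micro-F1 stays high while macro-F1 drops sharply, because the rare category contributes an F1 of zero to the macro average. This is why macro-F1 is informative on an imbalanced data set.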
4.3. Compared Methods
We select effective methods from two disciplines: occupation coding and text classification. The practical methods of occupation coding include TF-IDF+cosin, Naive Bayes, SVM, and Bidirectional Recurrent Neural Network (BiRNN). We also choose well-known and advanced text classification methods in machine learning, including fastText, bert+cls, and TextGNN. TextGNN and our proposed method both use graph neural networks as the primary network architecture, but they have different edges and contain different information. We briefly introduce the compared methods as follows:
TF-IDF+cosin
We apply TF-IDF (H. C. Wu et al. 2008) as a two-step method, a flat but intuitive baseline. First, job titles and categories are represented by TF-IDF vectors. Then, each job title is assigned to the class with the highest cosine similarity.
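The two steps of this baseline can be sketched in pure Python. The whitespace tokenizer, the smoothed IDF formula, the function names, and the toy label list are assumptions made for illustration and are not necessarily the configuration used in our experiments.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Map each document to a sparse {token: tf * idf} vector."""
    tokenized = {doc: doc.lower().split() for doc in documents}
    n_docs = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized.values() for tok in set(toks))
    vectors = {}
    for doc, toks in tokenized.items():
        counts = Counter(toks)
        # Assumption: smoothed inverse document frequency, log((1 + N) / (1 + df)) + 1.
        vectors[doc] = {tok: (c / len(toks)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
                        for tok, c in counts.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(job_title, labels):
    """Step 1: vectorize the title and all labels. Step 2: pick the most similar label."""
    vectors = tfidf_vectors(labels + [job_title])
    title_vec = vectors[job_title]
    return max(labels, key=lambda label: cosine(title_vec, vectors[label]))

# Toy label list (illustrative only).
labels = ["software developers", "web designers", "researchers"]
print(classify("senior software developer", labels))
```

Because the match relies on shared tokens, a title such as "scholar" has zero similarity to every label in this toy list and the choice becomes arbitrary. This is the kind of case that the synonym edges in HHGNN are meant to address.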
Naive Bayes
Naive Bayes is a conditional probability model based on Bayes’ theorem. Despite its simplifying assumption that features are conditionally independent given the class, this classifier has worked well in many classification tasks (McCallum and Nigam 1998).
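A minimal sketch of a multinomial Naive Bayes classifier for job titles is given below. It assumes whitespace tokens as features and Laplace smoothing. The class name `TinyNaiveBayes`, the smoothing constant, and the four training pairs are hypothetical and serve only to show the mechanics.

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Multinomial Naive Bayes over word tokens with Laplace smoothing."""

    def fit(self, titles, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for title, label in zip(titles, labels):
            tokens = title.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, title):
        # Words never seen in training are ignored.
        tokens = [t for t in title.lower().split() if t in self.vocab]
        n_samples = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / n_samples)
            for t in tokens:
                likelihood = (self.word_counts[label][t] + self.alpha) / (
                    total + self.alpha * len(self.vocab))
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

titles = ["software engineer", "java developer", "account manager", "sales manager"]
labels = ["Software Developers", "Software Developers", "Managers", "Managers"]
model = TinyNaiveBayes().fit(titles, labels)
print(model.predict("senior java engineer"))
```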
SVM
SVMs (Cortes and Vapnik 1995) are among the most robust prediction methods. They learn a maximum-margin decision boundary and can use the kernel trick to implicitly map inputs into high-dimensional feature spaces.
BiRNN
BiRNN processes text sequences in both forward and backward directions. This allows the network to capture contextual information from both past and future words at any given point, creating a more comprehensive sentence representation for classification (Schuster and Paliwal 1997).
fastText
fastText often achieves scalable solutions for text classification tasks while processing large datasets quickly. Hierarchical softmax in fastText proves to be very efficient when there are many categories (Joulin et al. 2016).
Siamese
The model combines a stack of character-level bidirectional Long Short-Term Memory (LSTM) layers with a siamese architecture (Neculoiu et al. 2016). We apply the siamese neural network as a two-step method. First, the siamese network is trained to learn the text similarity between job titles and labels. Next, each job title is coded to the category with the highest text similarity.
bert+cls
Large pre-trained language models, such as BERT (Bidirectional Encoder Representations from Transformers), have proven helpful. BERT (Devlin et al. 2018) is a strong baseline for various language tasks and multiple languages. The first token of the input sequence is always [CLS], whose representation serves as the classification embedding (Sun et al. 2019), which makes it suitable for the text classification task on our data set.
TextGNN
TextGNN applies graph convolutional networks to text classification and builds a single text graph for a corpus based on word co-occurrence and document-word relations (Yao et al. 2019). It then jointly learns the embeddings for both words and documents, supervised by the known class labels of the documents.
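The graph construction described above can be sketched as follows. This is a simplified illustration and not the reference implementation: it assumes TF-IDF weights on document-word edges and positive pointwise mutual information (PMI) on word-word edges, and it treats each short job title as a single co-occurrence window. The function name and the toy titles are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def build_text_graph(documents):
    """Return (doc_word_edges, word_word_edges) as dicts keyed by node pairs."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    word_df = Counter()  # number of titles containing each word
    pair_df = Counter()  # number of titles containing each word pair
    for tokens in tokenized:
        unique = sorted(set(tokens))
        word_df.update(unique)
        pair_df.update(combinations(unique, 2))

    # Document-word edges weighted by TF-IDF (smoothed IDF, an assumed variant).
    doc_word = {}
    for i, tokens in enumerate(tokenized):
        for word, tf in Counter(tokens).items():
            idf = math.log((1 + n_docs) / (1 + word_df[word])) + 1.0
            doc_word[(f"doc{i}", word)] = tf * idf

    # Word-word edges weighted by PMI; only positive values are kept.
    word_word = {}
    for (a, b), joint in pair_df.items():
        pmi = math.log(joint * n_docs / (word_df[a] * word_df[b]))
        if pmi > 0:
            word_word[(a, b)] = pmi
    return doc_word, word_word

docs = ["software engineer", "software developer", "sales manager"]
doc_word, word_word = build_text_graph(docs)
print(sorted(word_word))
```

Words such as "software" connect several titles, which is how label information can travel between documents that share vocabulary.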
4.4. Settings
For HHGNN, we set the embedding size of the first convolution layer to 300 and the hidden dimension to 512. We tuned the other parameters to the following values: a learning rate of 0.01 (the step size for updating model weights during optimization), an L2 loss weight of 1e-5 (the regularization parameter for the L2 penalty), and a batch size of 256 (the number of samples processed per training iteration). Section 5.3 shows how the essential parameters are determined.
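For reference, the reported values can be gathered in a small configuration object. The sketch below is illustrative only: the class name, the field names, and the `l2_penalty` helper are hypothetical, and the helper assumes the L2 weight multiplies the sum of squared weights.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class HHGNNConfig:
    """Hyperparameter values reported in this section; field names are assumed."""
    embedding_dim: int = 300     # embedding size of the first convolution layer
    hidden_dim: int = 512        # hidden dimension
    learning_rate: float = 0.01  # step size for weight updates
    l2_weight: float = 1e-5      # weight of the L2 penalty
    batch_size: int = 256        # samples per training iteration

def l2_penalty(weights, l2_weight):
    """L2 regularization term added to the loss (assumed form)."""
    return l2_weight * sum(w * w for w in weights)

config = HHGNNConfig()
print(asdict(config))
print(l2_penalty([1.0, 2.0], config.l2_weight))
```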
All experiments are performed on an Intel® Core™ i5-9400F CPU @ 2.90 GHz with 16.0 GB RAM. The operating system and software platforms are Windows 10 Professional x64 Edition, PyTorch 1.11.0, and Python 3.8.
5. Result
The results are shown in Table 6. Every row represents a metric, and every column shows the performance of a corresponding algorithm. For every metric, the best results are in bold, and the second-best results are underlined.
Test Performance.
Note. Bold values indicate the best performance, and underlined values indicate the second-best performance.
5.1. Performance on Fourth Level
For the topK accuracy metric, the score cannot decrease as K grows, because a longer recommendation list contains every label of a shorter one. The TF-IDF+cosine method provides a flat baseline, which shows that the TF-IDF similarity between job titles and category names is already useful for coding free-text job titles into standard labels. However, relying on cosine similarity alone is insufficient. Naive Bayes achieves better results than TF-IDF+cosine. The traditional deep neural network and SVM further improve the performance, and these methods are comparable on this data set. FastText performs better on topK accuracy, micro-F1, and macro-F1, while BiRNN performs better on the remaining metrics. Our results show that SVM and BiRNN perform similarly and are better than Naive Bayes, which agrees with the experiments in the literature review.
The models with more complicated structures, TextGNN, siamese, bert+cls, and HHGNN, achieve better results. The metrics of these four models are showcased in Figure 5.

TopK accuracy of TextGNN, siamese, bert+cls, and HHGNN.
When K is small (K = 1), HHGNN outperforms the other models, followed by bert+cls as a strong baseline. For K = 2 and 3, TextGNN and siamese show stronger performance than bert+cls, though HHGNN remains the leader. Correspondingly, HHGNN and bert+cls also outperform the other models in micro-F1. When K is larger (K = 10, 20), siamese and TextGNN achieve the best results, HHGNN scores lower than these two, and bert+cls is inferior to all of the others. In practice, we place more weight on topK accuracy with small K, because long recommendation lists contain redundant information and take longer to read, so coders spend more time on hints when annotating. Thus, we consider that HHGNN outperforms the compared methods with respect to topK accuracy.
For macro-F1, the ranking in descending order is TextGNN, siamese, HHGNN, and bert+cls. Siamese likely performs well at the macro level because its training objective is to measure the similarity between the text information and the category name rather than to predict the one-hot label of the text information. TextGNN builds category-word edges and word-information edges, which capture word co-occurrence. Word nodes act as bridges or critical paths in the graph, so that label information can be propagated to the entire graph. Bert appears to be a strong baseline but focuses more on the categories with more samples. Our proposed HHGNN achieves a good balance between macro and micro indicators. HHGNN shares the advantage of TextGNN: the word nodes and label nodes absorb information from their neighbors, and three different types of edges propagate information and enrich minority categories.
It is worth noting that the micro metrics are more critical when applying automatic machine learning-based methods. Micro metrics and topK accuracy directly reflect the accuracy of the different methods: a method with higher micro-F1 and topK accuracy provides a higher proportion of accurate hints for human coders. Macro-F1 is calculated as the average of the per-category scores, so it reflects the difference in performance between majority and minority categories. Therefore, from a practical point of view, bert and HHGNN have more practical value (especially HHGNN), while siamese and TextGNN perform better in minority categories. To analyze the effectiveness of HHGNN, we conduct more in-depth experiments.
Case Study
To gain deeper insight into the decision-making mechanism of the HHGNN model, we analyze how its output responds to variations in the TF-IDF input features. We remove key terms from a job title and observe how the corresponding changes in the TF-IDF vector affect the prediction distribution.
In Table 7, the job title “software trainee engineer” (correctly classified as “intern part time contractors”) has a sparse TF-IDF vector [0.239, 0.463, 0.297], yielding HHGNN’s top-5 predictions: {intern part time contractors, software developers, web developers, computer hardware engineers, computer programmers}, with the correct category ranked first. The high TF-IDF value for “trainee” highlights its importance. Simplifying to “trainee engineer” adjusts the top-5 predictions to {intern part time contractors, computer hardware engineers, systems engineers, engineers all other, computer network architects}, retaining the correct category while emphasizing “engineer” in the top-2 and top-3 positions. Further simplifying to “software engineer,” the top-5 predictions are {software developers, computer information systems managers, engineers all other, software quality assurance analysts testers, industrial engineers}, where the higher weight of “software” drives predictions toward software-related categories. These observations demonstrate that HHGNN effectively adapts to TF-IDF vector changes, ensuring predictions align with input semantics.
TF-IDF Vectors and Test Performance of HHGNN.
To further validate its practical utility, we conducted a user study, detailed in the Appendix, in which two master’s students trained in professional coding annotated job titles using HHGNN-generated top-5 hints. The results indicated that the hints reduced the average annotation time.
5.2. Performance on First Level
We further compared the predicted distributions with the true distribution at the first level of occupational categories. This comparison evaluates how well each method reflects the actual data distribution and helps identify systematic biases or overfitting to specific categories.
As shown in Figure 6, the true distribution is highly imbalanced, with only three categories accounting for approximately 80% of the samples: Computer and Mathematical Occupations (Category 5), Management Occupations (Category 15), and Business and Financial Operations Occupations (Category 3). HHGNN produces a distribution that closely matches the true one in these major categories, showing a reasonable overall shape. TextGNN achieves the lowest KL divergence (0.1065), indicating the closest fit to the true distribution. It also maintains noticeable probability over several low-frequency categories, demonstrating an ability to capture long-tail trends. In contrast, bert and siamese show substantial deviation from the true distribution. Bert yields a high KL divergence of 1.2122, with predictions overly concentrated in a few categories, suggesting limited generalization ability.

Prediction distribution on first level.
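The comparison between a predicted and the true first-level distribution can be sketched as below. This is a minimal illustration under assumptions: the divergence is taken as KL(true || predicted), a small constant is added to avoid division by zero for categories a model never predicts, and the toy label lists are made up (only the category numbers 3, 5, and 15 come from the text above).

```python
import math
from collections import Counter

def distribution(labels, categories, eps=1e-9):
    """Relative frequency of each category, smoothed by eps and renormalized."""
    counts = Counter(labels)
    raw = [counts[c] / len(labels) + eps for c in categories]
    total = sum(raw)
    return [r / total for r in raw]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

categories = [3, 5, 15]
true_labels = [5] * 6 + [15] * 3 + [3] * 1
close_preds = [5] * 6 + [15] * 2 + [3] * 2  # roughly the right shape
collapsed_preds = [5] * 10                  # everything in one category

p_true = distribution(true_labels, categories)
kl_close = kl_divergence(p_true, distribution(close_preds, categories))
kl_collapsed = kl_divergence(p_true, distribution(collapsed_preds, categories))
print(f"close: {kl_close:.4f}, collapsed: {kl_collapsed:.4f}")
```

A model whose predictions collapse onto a few categories receives a much larger divergence than one that keeps some probability mass on the less frequent categories.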
5.3. Parameter Sensitivity
Table 8 shows micro-F1 and macro-F1 for different numbers of neighbors. Both metrics increase as the model uses more neighbors but stop increasing once the number of neighbors exceeds 3. Therefore, we take N = 3 as the optimal number of neighbors. This suggests that too few neighbors cannot provide sufficient global information in the graph, while too many neighbors may add information that is only loosely related.
Test Performance of HHGNN in Different Numbers of Neighbors N.
Table 9 reports the classification performance of HHGNN with different dimensions D of the hidden layer. Embeddings with too few dimensions may not propagate label information to the whole graph well, and higher-dimensional embeddings improve classification performance only slightly. Expanding the dimension further brings minor gains, while the training time and memory usage increase significantly. There is therefore no need to keep increasing the dimension beyond 512, and we choose 512 as the dimension of the hidden layer.
Test Performance of HHGNN in Different Dimensions of the Hidden Layer D.
5.4. Effects of the Size of Labeled Data
To evaluate the extent to which different methods depend on labeled data, we test several of the best-performing models with different proportions of the training data. Figure 7 reports micro and macro metrics with 30%, 50%, 70%, and 100% of the original training data set. Every line represents an algorithm, and the shaded band represents the range of results. We note that HHGNN still achieves the best micro-F1 with limited labeled data. GNN models, especially HHGNN, show little variance in micro-F1. TextGNN and siamese still perform better at the macro level but are more variable. These micro-F1 results suggest that HHGNN alleviates the dependency on data compared with other methods, as it still performs well and stably with a small proportion of labeled data.

Test performance with varying training data proportions.
Even though HHGNN performs better only on micro metrics, recall from Section 5.1 that we regard micro-F1 as the more informative metric in practical applications and therefore pay more attention to it. From a practical perspective, HHGNN thus helps to alleviate the dependency on labeled data: it still outperforms the other methods with a smaller proportion of labeled data, and the variation in its results is relatively small.
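The protocol of training on 30%, 50%, 70%, and 100% of the labeled data and reporting a range of results can be sketched as follows. This is an assumed setup rather than our exact procedure: the number of repetitions, the random seeds, and the `majority_baseline` scorer are hypothetical placeholders, and any function that trains on pairs and returns a test score could be passed in instead.

```python
import random
from collections import Counter

def subsample(train_pairs, proportion, seed):
    """Draw a random subset of the training pairs without replacement."""
    rng = random.Random(seed)
    size = max(1, round(len(train_pairs) * proportion))
    return rng.sample(train_pairs, size)

def learning_curve(train_pairs, test_pairs, train_and_score,
                   proportions=(0.3, 0.5, 0.7, 1.0), seeds=(0, 1, 2)):
    """Map each proportion to (min, mean, max) of the scores over the seeds."""
    curve = {}
    for proportion in proportions:
        scores = [train_and_score(subsample(train_pairs, proportion, seed), test_pairs)
                  for seed in seeds]
        curve[proportion] = (min(scores), sum(scores) / len(scores), max(scores))
    return curve

def majority_baseline(train_pairs, test_pairs):
    """Placeholder scorer: always predict the most frequent training label."""
    most_common = Counter(label for _, label in train_pairs).most_common(1)[0][0]
    return sum(label == most_common for _, label in test_pairs) / len(test_pairs)

# Toy (job title, label) pairs, made up for illustration.
train = [(f"title{i}", "dev") for i in range(7)] + [(f"title{i}", "mgr") for i in range(7, 10)]
test = [("x", "dev"), ("y", "dev"), ("z", "mgr")]
curve = learning_curve(train, test, majority_baseline)
for proportion, (low, mean, high) in curve.items():
    print(f"{proportion:.0%}: min={low:.2f} mean={mean:.2f} max={high:.2f}")
```

The spread between the minimum and maximum at each proportion corresponds to the shaded band described for Figure 7.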
6. Conclusion and Implications
This paper applies advanced machine learning to occupation title coding. The results show that graph neural networks, followed by sophisticated neural networks and traditional classifiers, improve accuracy. Notably, SVM’s performance is comparable with that of deep neural networks on this data set. The results confirm that our proposed HHGNN robustly outperforms the other models on micro-F1 and topK accuracy (K ≤ 5), while siamese has advantages on other metrics. The test performance of the HHGNN variants indicates that it is worthwhile to incorporate heterogeneous information, the hierarchical category tree, and synonym information into graph neural networks. At the same time, our method eases the dependency on training data, as it still achieves relatively good results with only 30% of the original labeled data.
Our research has both theoretical and practical implications. One theoretical implication is that expanding the range of advanced methods applied to occupation coding can produce significant performance gains. Previous research applies only traditional machine learning methods, whereas we test effective methods from both occupation coding and text classification and find that different methods have their own advantages. Researchers can therefore look for the best-performing algorithm according to their needs. Our research also highlights that HHGNN is effective and performs well in micro-F1, which encourages further exploration of graph neural networks, hierarchical category trees, and synonyms in this task. As for practical implications, given an adequate supply of previously labeled data, researchers can use this method to support human annotators in coding free texts into standard occupation titles in a fine-grained category tree. Human coders can use the hints given by HHGNN to save time and labor. However, because average accuracy still needs improvement, current methods cannot completely replace manual coding.
As for future research, exploring weak supervision or cold-start strategies is a promising direction. Such approaches could leverage the unsupervised components, for instance by performing label propagation via “synonym edges” and “semantic similarity edges” in the graph, or by generating pseudo-labels with a nearest-neighbor classifier. These methods could initialize the supervised parameter training process, thereby improving adaptability when labeled training data are not available.
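As a rough illustration of this direction, the sketch below propagates a few seed labels along synonym edges. It is a deliberately simplified, hypothetical variant: each unlabeled node takes the label of the first labeled node that reaches it in a breadth-first traversal, whereas a practical method would likely use weighted or probabilistic propagation. The function name, edges, and seed labels are invented for illustration.

```python
from collections import deque

def propagate_labels(edges, seed_labels):
    """Spread seed labels over an undirected graph by breadth-first search."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    labels = dict(seed_labels)
    queue = deque(seed_labels)  # start from every seed node
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in labels:  # the first label to arrive wins
                labels[neighbor] = labels[node]
                queue.append(neighbor)
    return labels

synonym_edges = [
    ("programmer", "coder"),
    ("coder", "software developer"),
    ("bookkeeper", "accountant"),
]
seeds = {"software developer": "Software Developers", "accountant": "Accountants"}
pseudo_labels = propagate_labels(synonym_edges, seeds)
print(pseudo_labels)
```

The resulting pseudo-labels could then serve as a starting point for supervised training when no labeled job titles are available.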
Appendix
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.
Received: June 11, 2025
Accepted: March 10, 2026
