Hierarchy-Aware Heterogeneous Graph Neural Network for Occupation Title Coding

Abstract

Occupation coding encodes job titles into standard occupation labels, which is effective for data processing but tedious. Research proves that classic machine learning is effective, but accuracy needs further improvement. We construct a real data set with 881 occupation categories, including 41,297 pairs of job titles and corresponding labels. We design a hierarchy-aware heterogeneous graph neural network, combining prior knowledge from occupation category trees and synonyms. Results show our model outperforms other methods by 7.62% on micro-F1. It also alleviates the dependence on data as it achieves 52.28% on micro-F1 with only 30% of the original training data set.

Keywords

textual data; hierarchical category tree; occupation title coding; heterogeneous graph neural network

1. Introduction

Encoding collected pieces of free text into standard coded labels with many categories is a prevalent task in research studies (Linneberg and Korsgaard 2019), such as occupation coding. There are many open-ended questions in survey questionnaires asking about respondents’ jobs, which are more informative than closed questions. To be analyzed, these answers usually have to be coded or aggregated into predefined categorical occupation systems (Elliott 2018). For example, in a social questionnaire, some possible question wordings are “What work do you do?” or “Does your work have a special name?” (Scholz and Wasmer 2009). Occupation coding has applications in statistical, social, and epidemiology studies. Researchers collect respondents’ occupation information in the form of questionnaire texts and then formulate classification rules to categorize the textual information for subsequent research, such as classifying the social status (Connelly et al. 2016) or employment status (Montebruno et al. 2020) between occupation categories, determining which category of work affects physical health (Tannis et al. 2020).

Besides the application in categorizing textual data from surveys, occupation title coding is also critical in job market analysis. Government agencies, companies, and vocational schools need to code various job titles that employees and employers fill out and that are mentioned in various job advertisements. Classifying job titles according to a standardized taxonomy makes occupations in different areas comparable (Boselli et al. 2018). The comparability allows the study of current occupational vacancies, labor market changes, and job requirements.

Although textual job titles contain a wealth of helpful information, most researchers find it complicated to interpret and analyze such information in an automated manner (Reja et al. 2003), especially with extensive datasets. Normalizing and coding textual data costs more time and labor than handling structured numerical data (such as age and income). Respondents provide their job titles informally in various styles, whether through handwritten responses on paper, typed entries in online surveys, or verbal descriptions recorded by interviewers in administered surveys. Researchers subsequently classify textual occupation information into specific categories according to the occupation category tree released by officials or defined by researchers. This manual classification procedure is tedious and demanding. First, the accuracy must be high because coding errors can significantly affect the subsequent statistical analysis. Second, it is difficult for people without domain knowledge to make the correct classification because they might not clearly understand the textual data or the category tree. The number of expert-designed categories ranges from several to thousands, making it more difficult to classify correctly (Alazaidah and Ahmad 2016).

Given these difficulties, the existing literature has applied several methodologies to tackle the problem. There exist different approaches both within and across disciplines. At first, social scientists mainly used rule-based methods based on a dictionary of occupation-related keywords. Then, with the rapid development in machine learning, many social scientists started to adopt text classification methods in machine learning to solve this problem (Uysal and Günal 2014). Many researchers view occupation coding as a high-dimension text classification task, involving a large feature space or numerous categories like the 881 classes in our dataset, or a multi-label task where a single job title, such as “software developer” and “project manager” in hybrid roles, may carry multiple labels (Keller 2017). Some classical machine learning methods have been employed for this task and have proven to be efficient in saving labor costs. Supervised algorithms learn the distribution of human-labeled textual data pairs (free job title and corresponding answer) in the training set and give predictions for the test set, significantly saving time and labor. In more detail, we review these methods and the corresponding literature in Section 2.

While the advance brought by machine learning is evident, this task still presents several challenges to researchers and practitioners. Firstly, the accuracy needs further improvement. Although machine learning-based methods outperform traditional rule-based methods, they still cannot completely replace manual classification because the classification accuracy is not high enough for research purposes (Basit 2003). Instead, machine learning methods reduce the workload of converting text to qualitative data by suggesting several possible categories as hints for coders to choose from.

Secondly, occupation coding is different from general text classification. Current studies directly employ text classification models, which is not entirely reasonable. The difficulty of occupation coding comes from the fact that short texts with the same semantics vary widely because of synonyms, adjectives, syntax, abbreviations, grammatical forms (or word order), singular and plural forms, and typos (Chen et al. 2018). These informal texts from respondents are short and variable. For example, “scholar” and “research scientist” are semantically similar and belong to the label “researchers.” “Senior software developer” and “junior software developer” have different adjectives, but they are both software developers. In ordinary text classification tasks, by contrast, we are more concerned with understanding semantics through context, negation, synonyms, and antonyms. At the same time, occupation coding data sets have more categories and are more imbalanced. The number of categories in our collected data set is 881, much larger than in widely used benchmark data sets (such as WOS-46985 (Kowsari et al. 2017), which contains 46,985 documents with 134 categories). Some common categories, such as “software developers,” have 3,000 samples, while less common categories, such as “mathematics,” have only about ten samples.

Thirdly, annotated data sets are limited when using automatic coding in practice. The labor market changes rapidly (Colombo et al. 2019), and occupation title annotation rules quickly become obsolete and need frequent updating. Consequently, an outdated labeled data set is not universally applicable in current social science. Apart from that, social scientists may label the data according to their research objectives. For instance, if social scientists want to find out how managers influence enterprise culture, the job title “CTO” might be regarded as “chief executive.” However, if social scientists pay more attention to technicians, this job title can be labeled “technical specialist.” Since annotation rules change and annotation intentions vary, many social scientists still manually label their own data sets. The state-of-the-art methods in occupation coding are supervised machine learning methods, which rely heavily on labeled data (Bethmann et al. 2014).

To solve these problems, we propose a novel graph neural network, a hierarchy-aware heterogeneous graph neural network (HHGNN). Most studies only directly apply text classification to occupation title coding without considering the features of this task. The motivation is that human coders often refer to category trees when coding. The categories are so many that coders cannot remember every category. So they use a complex hierarchy to find the primary and minor categories. As for the synonym, it is apparent that many occupation titles have synonyms with their corresponding labels. A better classification model for occupation coding should identify not only the vocabulary similarity but also the semantic similarity. Recently, graph neural networks have had successful applications in classification tasks (Malekzadeh et al. 2021) and can express a wide range of heterogeneous information (Zhang et al. 2019), so we adopt this technique to construct a two-layer hierarchy-aware heterogeneous graph neural network (HHGNN) to extract valuable prior knowledge and further improve classification performance. We use the edges in the graph to mimic the domain knowledge, the category tree, and the synonym. The edges in the graph link texts that are connected in a hierarchy tree or semantically similar.

We evaluate our proposed model against state-of-the-art methods from different disciplines, using a data set of 41,297 job titles from LinkedIn collected in 2021. We compare it with current occupation coding methods and advanced text classification models, such as Bidirectional Encoder Representations from Transformers (BERT) and graph neural networks.

We contribute to the existing literature by proposing a new effective model and comparing it with current occupation coding and text classification methods. Our comparison experiments serve as a reference for a subsequent social scientist to choose an appropriate method. Moreover, we incorporate expert knowledge into HHGNN, and the ablation experiment, where components are systematically removed to assess their impact, illustrates the validity of this expert knowledge, which encourages further exploration of hierarchical category trees and synonyms. This paper proposes an approach that integrates the features of occupation coding, posing this problem as a multi-category classification. We collected raw data from social science research: two trained master students coded the variable job titles of respondents to standard occupation labels. Duplicate records are removed, and all samples are different to avoid duplication between the training and test data sets. To summarize, our contributions are as follows:

We construct a data set comprising 41,297 pairs of job titles and corresponding labels, which can reflect the current labor market.

We design a novel graph neural network with a unique structure, which is more automatic and achieves higher accuracy.

Our proposed HHGNN adopts a hierarchy tree and semantic features to alleviate the dependency on labels, demonstrating the importance of expert knowledge, category tree, and semantic similarity.

Our paper is organized as follows. Section 2 reviews the primary and advanced approaches applied to occupation coding. Section 3 presents the details of the proposed graph neural network. Section 4 provides critical information about the experiment, such as the data set and metrics. In Section 5, we conduct extensive experiments to test the effectiveness of the proposed method and other algorithms. Finally, we summarize the conclusions and implications in Section 6.

2. Related Work

Nowadays, the main procedures of occupation coding are as follows: specially designed computer programs integrate free texts into tables for professional coders. Then, several coders independently assign a category to each entry based on their judgment, and the program chooses categories according to their frequencies. For example, if two of three coders assign “software developer” and the other coder chooses “web designer” instead, then the program selects “software developer” as the final answer.
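The frequency-based selection among coders can be illustrated with a minimal sketch. The function name `majority_code` and the tie rule (returning `None` so that the entry can be reviewed manually) are assumptions made for illustration; the text above does not say how ties are resolved.

```python
from collections import Counter


def majority_code(assignments):
    """Return the most frequently assigned code, or None when the top count is tied."""
    counts = Counter(assignments).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: ties are left for manual review
    return counts[0][0]


# Two of three coders choose "software developer", so it becomes the final answer.
final = majority_code(["software developer", "software developer", "web designer"])
```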

Researchers have mainly explored robust, classic machine learning methods to obtain more satisfying accuracy and cost savings. Approaches to the occupation coding task can be categorized as rule-based, unsupervised, and supervised learning methods (Nelson et al. 2021). It is worth noting that many studies use both job titles and job descriptions (or job advertisements) for occupation classification (Boselli et al. 2018), but in a considerable amount of research no job description is available. For example, we cooperated with social scientists and learned that they collected job titles from LinkedIn resumes. Most users only write down their job titles without descriptions of their occupations. Therefore, we focus on studies that use only job titles to classify occupations.

2.1. Rule-Based Methods

The rule-based method, also known as the knowledge-based or dictionary-based method, is a straightforward and widely used automated textual analysis tool that analyzes linguistic phenomena and uses syntactic, semantic, and discourse information. The most straightforward logical rule is known as the code index, whereby a code is assigned if the free text is the same as the code. For example, the corresponding code is given when a (preprocessed) answer is identical to a given string. There are more complicated rules. For instance, a rule may assign each word a category and then calculate the probability that a sentence belongs to a particular category. Rules are refined progressively for coverage and accuracy (Bundesagentur für Arbeit 2011), which provides a recognized framework. Researchers often adapt or select subsets of rules based on specific datasets or research questions.

This category of approaches needs a rich dictionary of synonyms, as there are often several ways to express the same concept. Tarrow (1995) applies the linguistic approach to survey data analysis, using dictionaries to remove stop-words and unify singular and plural forms. Standard dictionaries such as the Linguistic Inquiry and Word Count (Tausczik and Pennebaker 2010) are reliable, but only in limited domains (Franzosi 1989). Bao et al. (2020) divide the rule-based method into two stages: search and filter. The search stage recommends a series of possible codes for a textual answer, and the filter stage selects a single NOC (National Occupational Classification) code from the list of candidate codes. This method is applied to about 500 manually coded jobs, and the accuracy rate at the four-level code level is 58.7%. Rule-based methods are still labor-intensive and cannot cover some types of textual data, as some textual information may not match the criteria of any rule. With the rapid progress of natural language processing (NLP), social scientists apply leading NLP tools to support the analysis of textual data. Crowston et al. (2010) combine tokenization (Webster and Kit 1992) and part-of-speech tagging (Voutilainen 2004) techniques with dictionaries to analyze raw text, which further automates the process of building rules (Crowston et al. 2012). Rule-based methods remain far from automatic because they require a large amount of domain knowledge to build the dictionary and cannot find the corresponding label for every job title.

2.2. Unsupervised Learning Method

Unsupervised learning methods, which identify patterns or categories in data without labeled examples, have been applied to occupation coding to reduce manual effort. Similarity-based approaches, leveraging cosine similarity, automatically exploit linguistic relations between informal job titles and target categories by assigning the most similar class to a textual response. Jung et al. (2008) demonstrate that this method outperforms dictionary-based and multinomial regression techniques, achieving approximately 73% accuracy on a dataset with 450 standard occupation codes, while the PACE system further employs cosine similarity within a k-nearest neighbor framework. Meanwhile, nearest-neighbor approaches, using Jaccard similarity, classify text into the Standard Occupational Classification (SOC) scheme. For instance, a job title like “software engineer” might be assigned the “software developer” code if their TF-IDF (Term Frequency-Inverse Document Frequency) vector representations exhibit the highest Jaccard similarity, reflecting shared terms. Russ et al. (2014) report a 64% agreement rate at the 3-digit SOC level in a small-scale study, and Gweon et al. (2017) adapt this method for questionnaire coding, representing texts as TF-IDF vectors and achieving 65% accuracy on 9,137 observations across 399 occupation codes, where a low similarity score may indicate a new category requiring human intervention. These methods save labor and cover more samples, but their accuracy requires further improvement.

2.3. Supervised Learning Method

Supervised learning methods, which train models on labeled data to predict categories, have been widely adopted by social scientists for occupation coding. The Vector Space Model (VSM), introduced for information retrieval (Salton et al. 1975), encodes counts of occurrences of single terms in documents. The VSM provides the foundation for machine learning in the occupation coding task. Early applications primarily focused on Bayesian methods. Bethmann et al. (2014) employ two machine learning algorithms, Naive Bayes and Bayesian Multinomial, on a data set of 300,000 coded job titles. Schierholz converts text to verbatim vectors and uses coding by duplicates, Naive Bayes, a Bayesian approach using Dirichlet priors, and a gradient boosting model (Friedman 2001). They conclude that the Bayesian methods achieve similar accuracy when low production rates and high precision are desired. Kirby et al. (2015) combine edit distance, logistic regression using stochastic gradient descent (Komárek 2004), and Naive Bayes into an ensemble model by majority voting. The performance of the Naive Bayes classifier is much better than that of the other individual classifiers. The majority-voting ensemble does not improve on Naive Bayes because of cases in which the individual classifiers reach no agreement.

With advancements in algorithms, research expanded to include more complex models. Nahoomi (2018) conducts an experiment on 65,962 SOC-coded job titles at all four levels of the hierarchy, broken down into 23 major groups, 97 minor groups, 461 broad groups, and 840 detailed occupations. They use uni-gram features to represent texts. They report that support vector machines and convolutional neural networks (CNNs) perform similarly. The support vector machine achieves 0.55 micro-F1 and 0.48 macro-F1, and the CNN arrives at 0.61 micro-F1 and 0.43 macro-F1 (metrics detailed in Section 4.2). Both methods are better than Naive Bayes, with only 0.46 micro-F1 and 0.41 macro-F1. This aligns with our experimental findings.

Some research explored flat versus hierarchical classification methods. Nahoomi reports the performance of four models, namely Naive Bayes, Maximum Entropy (Nigam et al. 1999), Support Vector Machines, and Convolutional Neural Networks, in flat and hierarchical settings. They train a binary classifier for each node of the hierarchy except the root. If the leaf node and its ancestor nodes create a path from the first to the last level of the hierarchy, the path is assigned to the record. The conclusion is that flat and hierarchical methods perform similarly, with flat methods having a higher recall and the hierarchical approach having a higher precision. The disadvantage of hierarchical classification is that it has a longer running time, and errors at the higher hierarchy levels are propagated to the rest of the classification (Nahoomi 2018). In summary, supervised methods further enhance the accuracy, and the above-mentioned experiments show a trend: compared with logistic regression and the Bayesian approach, the Support Vector Machine (SVM) and CNN have higher accuracy.

Textual information with numerous categories is common and vital in social science, and the number of categories increases the difficulty of choosing the correct label. One characteristic of such a data set is that many categories have sparse examples in the training data set. Supervised machine learning usually performs better on labels with more instances but poorly on categories with sparse samples (Zhou 2018), a problem known as data sparsity. The ideal training data would need to contain more than one example for every possible category, requiring tens of thousands of observations, a number rarely collected in typical surveys. Because the same occupation category trees are used across studies, Schierholz and Schonlau (2021) collect data from multiple surveys to alleviate this concern. It is a practical solution but does not solve the problem at its root.

The existing research shows that only the basic classification models are applied, and the models are not adjusted according to the characteristics of this problem. Therefore, the motivation is to implement a more sophisticated classification model, combined with the prior knowledge from the occupation coding task, to see whether advances in machine learning and domain knowledge contribute to improvement in this task. At the same time, graph neural networks have been proven to be very effective in text classification (Chen et al. 2021), recommendation (Wu et al. 2022), etc. Graph neural networks can alleviate data sparsity by propagation through edges (Rossi et al. 2016). So in this paper, we combine the features of occupation coding and the more advanced graph neural network techniques to propose a heterogeneous GNN for better accuracy.

3. Methodology: Hierarchy-Aware Heterogeneous Graph Neural Network

We first introduce notations and define the occupation title coding task. We then describe the heterogeneous graph construction (Sections 3.2 and 3.3): the hierarchical edges, semantic similarity edges, and node TF-IDF features are built using predefined occupational classification trees and unsupervised cosine similarity calculations, integrating unsupervised prior knowledge. Subsequently, we present the HHGNN architecture and its loss function, which operates within a supervised learning framework through labeled data to enhance prediction accuracy.

3.1. Problem Definition

We suppose that the total number of occupation labels is $T$ . Then an informal text $v$ is represented as a TF-IDF vector $F_{v}$ (detailed in Section 3.2), and the corresponding output after the proposed model is defined as $h_{v}$ $(h_{v} \in R^{T \times 1})$ . The label of this text is a one-hot vector $y_{v} (y_{v} \in R^{T \times 1})$ . We aim to predict the label of the text $v$ correctly by minimizing the difference between $h_{v}$ and $y_{v}$ .

3.2. Graph Construction

Figure 1 shows the graph in HHGNN. There are two types of nodes. Blue nodes represent texts, and green nodes represent labels. There are three types of edges: green edges exist only between labels and mimic the category tree to make full use of the established categorical structure; blue edges exist only between texts and are artificially constructed to show the similarity between texts; red edges indicate that a text shares synonyms with a label. There are two reasons for the construction of these edges. First, these edges compensate for the limited context information available to the graph neural network. Second, we use these edges to absorb the patterns we find in the real data set. To be more precise, we refer to the three kinds of edges as hierarchical edges, similarity edges, and synonym edges. We refer to this graph as $G = (V, E, M)$ , where $V$ (node $v \in V$ ) and $E$ are the sets of nodes and edges, respectively, and $M$ is the set of edge types ( $m \in M$ ), $M = \{$ hierarchical edges, similarity edges, synonym edges $\}$ .

Figure 1.

Graph of HHGNN.
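The graph $G = (V, E, M)$ can be held in a small typed adjacency structure. The sketch below is illustrative only: the class name `HeteroGraph`, the undirected storage, and the validation rule are assumptions, not the data structure used in the experiments. It enforces the constraints stated above (hierarchical edges between labels, similarity edges between texts, synonym edges between a text and a label).

```python
from collections import defaultdict

EDGE_TYPES = ("hierarchical", "similarity", "synonym")

# Node kinds each edge type may connect, following the description of Figure 1.
ALLOWED_KINDS = {
    "hierarchical": {"label"},         # label <-> label
    "similarity": {"text"},            # text <-> text
    "synonym": {"text", "label"},      # text <-> label
}


class HeteroGraph:
    """Hypothetical container: one adjacency map per edge type."""

    def __init__(self):
        self.node_kind = {}  # node id -> "text" or "label"
        self.adj = {m: defaultdict(set) for m in EDGE_TYPES}

    def add_node(self, node, kind):
        if kind not in ("text", "label"):
            raise ValueError(f"unknown node kind: {kind}")
        self.node_kind[node] = kind

    def add_edge(self, u, v, edge_type):
        kinds = {self.node_kind[u], self.node_kind[v]}
        if kinds != ALLOWED_KINDS[edge_type]:
            raise ValueError(f"{edge_type} edge cannot connect {u!r} and {v!r}")
        # Assumption: edges are stored as undirected.
        self.adj[edge_type][u].add(v)
        self.adj[edge_type][v].add(u)

    def neighbors(self, node, edge_type):
        return sorted(self.adj[edge_type][node])


g = HeteroGraph()
g.add_node("Researchers", "label")
g.add_node("Scholar", "text")
g.add_node("Research scientist", "text")
g.add_node("Staff scientist", "text")
g.add_edge("Scholar", "Researchers", "synonym")
g.add_edge("Research scientist", "Staff scientist", "similarity")
```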

Category trees are often used to express the hierarchical relationship between classes when the number of categories is large. In our experiment, we use the Standard Occupational Classification (SOC) system, released by United States federal government agencies to cover all occupations. A fragment of the category tree of the data set we collected is shown in Figure 2. The figure depicts a 4-level hierarchy whose last level contains the actual labels, and the number of labels is as high as 881. HHGNN leverages the prior knowledge of label correlations in the predefined hierarchy by constructing hierarchical edges. If two label nodes are connected in the category tree, then there is a green hierarchical edge between the two nodes in the graph.

Figure 2.

Hierarchical category tree in standard occupational classification.
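Hierarchical edges can be read off the category tree with a simple traversal. The sketch below assumes the tree is available as nested dictionaries; the function name `hierarchical_edges` and the flattened two-level `toy_tree` are hypothetical and stand in for the 4-level SOC tree.

```python
def hierarchical_edges(tree, parent=None):
    """Return one (parent, child) pair for every link in a nested-dict category tree."""
    edges = []
    for name, children in tree.items():
        if parent is not None:
            edges.append((parent, name))
        edges.extend(hierarchical_edges(children, name))
    return edges


# Toy fragment for illustration only; the real tree has four levels and 881 leaf labels.
toy_tree = {
    "Business and Financial Operations Occupations": {
        "Loan officers": {},
        "Tax examiners and collectors": {},
    },
    "Computer and Mathematical Occupations": {
        "Software developers": {},
    },
}

edges = hierarchical_edges(toy_tree)
```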

All job titles are denoted as $W$ . Text and label nodes are tokenized and then represented as TF-IDF vectors $F_{v}$ , where $F_{vj}$ is the TF-IDF of token $j$ in document $v$ from corpus $W$ . $F_{vj}$ is defined as follows:

\begin{matrix} F_{vj} = TF (j, v) \times IDF (j, W) = \frac{count (j, v)}{| v |} \times \log \frac{| W |}{1 + | \{ v^{'} \in W : j \in v^{'} \} |} \end{matrix}

(1)

The TF (Term Frequency) measures how frequently a token $j$ appears in a document $v$ , normalized by the length of the document. The IDF (Inverse Document Frequency) down-weights tokens that occur in many documents across the corpus, thereby highlighting terms that are more distinctive to a specific document. The TF-IDF representation $F_{vj}$ emphasizes tokens that are frequent within a particular document while being rare in the broader corpus. This makes it especially suitable for capturing characteristic patterns in short texts such as job titles, where certain key terms serve as strong discriminative indicators.
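As a minimal sketch of Equation 1, the function below computes the weights with the standard library only, reading TF as the token count divided by the document length and IDF as the logarithm of the corpus size over one plus the document frequency. The function name `tfidf_vectors` and the toy corpus are illustrative assumptions. Note that with the $1 +$ term in the denominator, a token that appears in every document receives a slightly negative weight.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of non-empty token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            token: (count / len(doc)) * math.log(n_docs / (1 + doc_freq[token]))
            for token, count in counts.items()
        })
    return vectors


toy_corpus = [
    ["senior", "software", "engineer"],
    ["software", "developer"],
    ["staff", "scientist"],
    ["loan", "officer"],
]
vectors = tfidf_vectors(toy_corpus)
```

In this toy corpus, "senior" occurs in one title and "software" in two, so "senior" receives the larger weight in the first title.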

We find that texts in the same category tend to share many words, so we build similarity edges to emphasize relationships between texts. Table 1 shows some texts with high similarity. The cosine similarity metric is adopted to quantify the semantic relatedness between the TF-IDF vector representations of textual nodes. This approach is grounded in information retrieval theory, where cosine similarity serves as a well-established measure for comparing document vectors in high-dimensional spaces (Salton et al. 1983). It is defined as the cosine of the angle between two vectors, thereby providing a measure of orientation similarity that is invariant to their magnitudes. This property is particularly advantageous for textual similarity tasks, as it focuses on the shared discriminative terms between documents while mitigating the influence of variable text length. Bao et al. (2020) also leveraged TF-IDF and cosine similarity for matching noisy job titles to standard classifications, demonstrating its practical efficacy in a closely related domain.

Table 1.

Samples of Job Titles with High Similarity.

Occupation code	Job titles
Software developers	Senior software engineer (b2b)
	b2b software engineer
	Android & ios developer
	ios & android developer
	Software developer staff
	Staff c# developer

We introduce $S$ as the cosine similarity matrix. A text node is linked with the three text nodes that have the highest cosine similarity to it. $S_{vu}$ denotes the cosine similarity between job title $v$ and job title $u$ .

\begin{matrix} S_{vu} = \frac{| F_{v} \cdot F_{u} |}{\| F_{v} \| \cdot \| F_{u} \|}, S \in R^{| W | \times | W |}, \\ v, u \in W \end{matrix}

(2)
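The top-three linking rule can be sketched as follows. The function names, the sparse dictionary representation, and the choice to skip pairs with zero similarity are assumptions for illustration; the toy vectors use raw term counts rather than TF-IDF weights to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {token: weight}."""
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_edges(vectors, k=3):
    """Link every text node to its k most similar text nodes (undirected, deduplicated)."""
    edges = set()
    for v, f_v in enumerate(vectors):
        scored = [(cosine(f_v, f_u), u) for u, f_u in enumerate(vectors) if u != v]
        top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
        for score, u in top:
            if score > 0:  # assumption: titles with no shared token are not linked
                edges.add((min(v, u), max(v, u)))
    return sorted(edges)


toy_vectors = [
    {"android": 1, "ios": 1, "developer": 1},  # "Android & ios developer"
    {"ios": 1, "android": 1, "developer": 1},  # "ios & android developer"
    {"staff": 1, "developer": 1},              # hypothetical title sharing one token
    {"loan": 1, "officer": 1},                 # unrelated title
]
sim_edges = similarity_edges(toy_vectors)
```

Word order does not affect the vectors, so the first two titles are maximally similar, while the unrelated title stays isolated.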

In our data set, every sample not only has respondents’ job titles but also their industries. At first, we try to use the similarity between respondents’ industry information with third-level category labels. However, this similarity harms the performance of our model. That is because people write their industries according to their corporations instead of their occupations. For instance, c++ developers who work at a bank probably write their industries as “banking” instead of “software,” which is misleading in the occupation coding task.

Synonym edges only exist between a text node and a label node. If the text and the label have common synonyms, then there is a synonym edge between them. Table 2 gives examples of synonym edges. The free texts “scholar” and “research fellow” and the label “researchers” are synonyms. WordNet, accessed through the NLTK corpus reader in Python, is used to find synonyms.

Table 2.

Examples of Synonym Edges.

Occupation code	Job titles
Researchers	Scholar
	Staff scientist
	Research scientist
	Postdoctoral research fellow
	Postdoctoral research associate
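The synonym-edge rule can be sketched with a plain lookup table standing in for WordNet, so that the example runs without downloading a corpus. The table `toy_synonyms` and the function name `synonym_edges` are hypothetical and do not reproduce WordNet's actual synonym sets. Each text is expanded to its own words plus their listed synonyms, so a literal shared word also produces an edge under this assumption.

```python
def synonym_edges(titles, labels, synonyms):
    """Return (title, label) pairs whose expanded word sets overlap."""

    def expand(text):
        words = text.lower().split()
        expanded = set(words)
        for word in words:
            expanded |= synonyms.get(word, set())
        return expanded

    return [
        (title, label)
        for title in titles
        for label in labels
        if expand(title) & expand(label)
    ]


# Hypothetical synonym table for illustration; in practice a lexical resource fills this role.
toy_synonyms = {
    "scholar": {"researcher"},
    "fellow": {"researcher"},
    "researchers": {"researcher"},
}

syn_edges = synonym_edges(
    ["Scholar", "Postdoctoral research fellow", "Loan officer"],
    ["Researchers", "Software developers"],
    toy_synonyms,
)
```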

3.3. Neighbor Sampling

We adopt a general inductive framework, which leverages nodes’ local information to allow batch training on the large-scale graph. For every node, we sample $N$ neighbors per edge type within two hops (a hop is one step between nodes in the graph). The neighbors of node $v$ within two hops are illustrated in Figure 3. The effect of parameter $N$ will be explored in Section 5. Neighbor sampling matters because a GNN represents a node by combining the node’s own information with that of its neighbors. Less frequent categories can therefore use their neighbors to enrich their information. More specifically, the category “loan officers” has only ten samples in the data set. At the same time, the 4-level category “loan officers” is linked with some neighbors according to the category tree, including the 4-level categories “tax examiners and collectors” and “revenue agents,” and the 3-level category “Business and Financial Operations Occupations.” The node “loan officers” thus carries both its literal meaning and the information from its neighbors.

Figure 3.

Neighbors of node v.

These connections suggest that neighbor sampling can be used to alleviate data sparsity. We further examine whether our proposed method alleviates the data sparsity problem in Section 5.3.
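A minimal sketch of two-hop sampling per edge type is given below. The function names, the adjacency format, sampling without replacement, and keeping all neighbors when fewer than $N$ exist are assumptions for illustration; the exact sampling scheme is not specified in this section.

```python
import random


def pick(candidates, n, rng):
    """Keep all candidates if there are at most n, otherwise sample n without replacement."""
    candidates = list(candidates)
    if len(candidates) <= n:
        return candidates
    return rng.sample(candidates, n)


def sample_neighbors(adj, node, n, rng):
    """adj: {edge_type: {node: [neighbors]}}.

    Returns {edge_type: (first_hop, second_hop)}, where first_hop is a list of
    sampled neighbors and second_hop maps each of them to its own sampled neighbors.
    """
    sampled = {}
    for edge_type, neighbors in adj.items():
        first_hop = pick(neighbors.get(node, []), n, rng)
        second_hop = {u: pick(neighbors.get(u, []), n, rng) for u in first_hop}
        sampled[edge_type] = (first_hop, second_hop)
    return sampled


toy_adj = {
    "hierarchical": {
        "loan officers": ["business and financial operations"],
        "business and financial operations": [
            "loan officers",
            "tax examiners and collectors",
        ],
    },
    "similarity": {},
    "synonym": {},
}
sampled = sample_neighbors(toy_adj, "loan officers", n=2, rng=random.Random(0))
```

In this toy graph, the rare label "loan officers" reaches its sibling label through the shared parent in the second hop.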

3.4. Structure of HHGNN

To embody multiple edge-type information, we employ an edge-level heterogeneous network to learn specific parameters for every kind of edge and apply an edge-level attention structure to integrate information from different edges.

Heterogeneous edge type

The HHGNN processes node information through edge-type-specific transformations to account for the unique semantics of hierarchical, similarity, and synonym edges. For each node $v$ in layer $l$ ( $l \in \{1, 2\}$ ), we compute an edge-type-specific embedding $h_{v}^{l, m}$ , where $m \in$ {hierarchical, similarity, synonym} denotes the edge type. This design enables the model to tailor representations to the specific relationships in the graph: hierarchical edges link label nodes to their parent categories (e.g., “software developers” to “Computer and Mathematical Occupations”), similarity edges capture lexical overlap (e.g., between “software engineer” and “senior software engineer”), and synonym edges encode semantic equivalence (e.g., between “scholar” and “researchers”).

For the first layer, the edge-type-specific embedding is computed in Equation 3. $h_{v}^{0}$ represents the initial feature vector of node $v$ (a TF-IDF vector for text nodes, as described in Section 3.2). $W_{1}^{l, m}$ , $W_{2}^{l, m}$ , and $W_{3}^{l, m}$ are learnable parameters specific to layer $l$ and edge type $m$ . $W_{1}^{l, m}$ transforms the information of node $v$ itself, $W_{2}^{l, m}$ transforms the information from its neighbors, and $W_{3}^{l, m}$ combines the two. $N_{m} (v)$ is the set of neighbors of node $v$ connected by edge type $m$ . More specifically, $N_{m}^{1} (v)$ includes the first-hop neighbors of node $v$ connected by edge type $m$ , and $N_{m}^{2} (v)$ includes the second-hop neighbors of node $v$ connected by edge type $m$ . ReLU is the activation function, which introduces non-linearity, enabling the model to capture complex semantic patterns. The term “vocab” denotes the total number of unique terms in the vocabulary used to initialize node features, “hid_size” specifies the dimensionality of the hidden layer, and “out_size” corresponds to the dimensionality of the output layer.

\begin{matrix} l = 1 \\ h_{v}^{1, m} = \mathrm{ReLU} \left( W_{3}^{1, m} \, \mathrm{concat} \left[ W_{1}^{1, m} h_{v}^{0}, \; W_{2}^{1, m} \, \mathrm{Agg}_{μ \in N_{m}^{1} (v)} (h_{μ}^{0}) \right] \right) \\ W_{1}^{1, m} \in R^{| \mathrm{vocab} | \times \mathrm{hid\_size}}, \quad W_{2}^{1, m} \in R^{| \mathrm{vocab} | \times \mathrm{hid\_size}}, \quad W_{3}^{1, m} \in R^{\mathrm{hid\_size} \times (2 \cdot \mathrm{hid\_size})} \end{matrix}

(3)

Equation 4 shows the structure of the second layer.

\begin{matrix} l = 2 \\ h_{v}^{2, m} = W_{3}^{2, m} \, \mathrm{concat} \left[ W_{1}^{2, m} h_{v}^{1}, \; W_{2}^{2, m} \, \mathrm{Agg}_{μ \in N_{m}^{2} (v)} (h_{μ}^{1}) \right] \\ W_{1}^{2, m} \in R^{\mathrm{hid\_size} \times \mathrm{out\_size}}, \quad W_{2}^{2, m} \in R^{\mathrm{hid\_size} \times \mathrm{out\_size}}, \quad W_{3}^{2, m} \in R^{\mathrm{out\_size} \times (2 \cdot \mathrm{out\_size})} \end{matrix}

(4)

The neighbor nodes linked by the same type of edge are aggregated by mean aggregation, as shown in Equation 5.

\begin{matrix} \mathrm{Agg}_{μ \in N_{m}^{l} (v)} (h_{μ}^{l - 1}) = \frac{1}{| N_{m}^{l} (v) |} \sum_{μ \in N_{m}^{l} (v)} h_{μ}^{l - 1} \end{matrix}

(5)
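
The computation in Equations 3 to 5 can be sketched for a single node and a single edge type as follows. This is a minimal illustration in plain Python, not the authors' implementation: the helper names (`matvec`, `mean_agg`, `edge_type_layer`) are hypothetical, the weight matrices are stored as (output × input) lists of rows, and the node is assumed to have at least one neighbor of the given edge type.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_agg(vectors):
    """Mean aggregation over neighbor embeddings (Equation 5)."""
    n = len(vectors)  # assumption: n >= 1
    return [sum(col) / n for col in zip(*vectors)]


def relu(x):
    return [max(0.0, xi) for xi in x]


def edge_type_layer(h_self, h_neighbors, W1, W2, W3, activate=True):
    """Edge-type-specific embedding for one node and one edge type m.

    W1 transforms the node's own vector, W2 transforms the mean of its
    neighbors, and W3 combines the concatenation of both results.
    activate=True mirrors Equation 3 (ReLU); activate=False mirrors
    Equation 4, which has no activation.
    """
    own = matvec(W1, h_self)
    nbr = matvec(W2, mean_agg(h_neighbors))
    out = matvec(W3, own + nbr)  # list "+" is concatenation
    return relu(out) if activate else out


# Toy sizes (assumed for illustration): |vocab| = 3, hid_size = 2.
W1 = [[1, 0, 0], [0, 1, 0]]
W2 = [[0, 0, 1], [1, 0, 0]]
W3 = [[1, 0, 1, 0], [0, 1, 0, -1]]
h_v = [1, 2, 3]
neighbors = [[2, 0, 4], [4, 0, 0]]
print(edge_type_layer(h_v, neighbors, W1, W2, W3))
```

In the full model this step would be repeated for each of the three edge types, with first-hop neighbors in layer 1 and second-hop neighbors in layer 2.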

Edge-Level attention aggregation

To integrate the edge-type-specific embeddings $h_{v}^{hierarchical}, h_{v}^{synonym}, h_{v}^{similarity}$ into a unified node representation, we employ an attention mechanism that dynamically weighs the importance of each edge type. This approach is motivated by the varying contributions of edge types in occupation coding. For instance, synonym edges may be more critical for aligning “scholar” with “researchers,” while similarity edges are more important for linking “accounting” with “account.” Another reason is that concatenating the $h_{v}^{m}$ would involve more learnable parameters, which either requires a larger data set or risks over-fitting. In Equation 6, $α_{v}^{m}$ represents the normalized weight coefficient of edge type $m$, while $h_{v}^{l}$ denotes the final weighted fusion representation vector for node $v$ at layer $l$.

\begin{matrix} h_{v}^{l} = \sum_{m \in M} α_{v}^{m} h_{v}^{l, m} \end{matrix}

(6)

As shown in Equation 7, we use an attention mechanism to measure the importance of the different edge types by scoring each mapped edge-specific embedding $h_{v}^{l, m}$. A learnable weight matrix $W_{atten}^{l}$, with dimensions $\mathrm{attn\_size} \times \mathrm{hid\_size}$, projects the edge-specific embedding $h_{v}^{l, m}$ into a lower-dimensional attention space of dimension $\mathrm{attn\_size}$, so that edge types can be compared efficiently with fewer parameters. The nonlinear activation function $\mathrm{Tanh} (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}$ compresses input values into the range (−1, 1), facilitating the modeling of complex relationships. A learnable vector $w_{proj}^{l}$ with dimensions $1 \times \mathrm{attn\_size}$ then maps the tanh-activated vector in the attention space to a scalar score $e_{v}^{m}$. Subsequently, an exponential is applied to $e_{v}^{m}$ to ensure positive values, followed by normalization over the edge types to yield a probability distribution that reflects the relative importance of each edge type. The term “attn_size” indicates the dimension of the attention space.

\begin{matrix} e_{v}^{m} = w_{proj}^{l} \, \mathrm{Tanh} (W_{atten}^{l} h_{v}^{l, m}), \quad α_{v}^{m} = \frac{\exp (e_{v}^{m})}{\sum_{m' \in M} \exp (e_{v}^{m'})} \\ W_{atten}^{l} \in R^{\mathrm{attn\_size} \times \mathrm{hid\_size}}, \quad w_{proj}^{l} \in R^{1 \times \mathrm{attn\_size}} \end{matrix}

(7)
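
The fusion in Equations 6 and 7 can be illustrated with a small sketch. The function name `attention_fuse` and the toy weights are assumptions made for illustration. The maximum score is subtracted before exponentiation, a standard numerical-stability step that leaves the normalized weights unchanged.

```python
import math


def attention_fuse(embeddings, W_att, w_proj):
    """Fuse edge-type-specific embeddings of one node (Equations 6 and 7).

    embeddings: dict mapping edge type -> embedding vector h_v^{l,m}
    W_att:      attn_size x hid_size matrix (list of rows)
    w_proj:     vector of length attn_size
    Returns the fused vector h_v^l and the attention weights alpha_v^m.
    """
    scores = {}
    for m, h in embeddings.items():
        projected = [math.tanh(sum(w * x for w, x in zip(row, h)))
                     for row in W_att]
        scores[m] = sum(w * p for w, p in zip(w_proj, projected))

    # Softmax over edge types, shifted by the max score for stability.
    top = max(scores.values())
    exp_scores = {m: math.exp(s - top) for m, s in scores.items()}
    total = sum(exp_scores.values())
    alpha = {m: e / total for m, e in exp_scores.items()}

    dim = len(next(iter(embeddings.values())))
    fused = [sum(alpha[m] * embeddings[m][i] for m in embeddings)
             for i in range(dim)]
    return fused, alpha


emb = {
    "hierarchical": [1.0, 0.0],
    "similarity": [0.0, 1.0],
    "synonym": [1.0, 1.0],
}
fused, alpha = attention_fuse(emb, W_att=[[1.0, 0.0]], w_proj=[1.0])
print(alpha)
```

Because the weights are computed per node, two nodes can rely on different edge types, which is the behavior the "scholar" and "accounting" examples describe.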

Loss function

The output of the model can be represented as $h_{v}^{L}$ ($h_{v}^{L} \in R^{T \times 1}$). The one-hot vector $y_{v}$ ($y_{v} \in R^{T \times 1}$) is the ground truth label of node $v$. $T$ is the dimension of the output features $h_{v}^{L}$, which is equal to the number of classes. In Equation 8, we transform $h_{vt}^{L}$ into a probability distribution using the softmax function, where ${\hat{y}}_{vt}$ represents the predicted probability that node $v$ belongs to category $t$.

\begin{matrix} {\hat{y}}_{vt} = \frac{\exp (h_{vt}^{L})}{\sum_{t' = 1}^{T} \exp (h_{vt'}^{L})} \end{matrix}

(8)

The loss function is defined in Equation 9. We use a cross-entropy function on the training dataset $V_{train}$ to train the model. The cross-entropy measures the dissimilarity between the predicted probability distribution ${\hat{y}}_{vt}$ and the true distribution $y_{vt}$. Since $y_{v}$ is a one-hot vector (with 1 at the true category and 0 elsewhere), the loss for node $v$ simplifies to $- \ln ({\hat{y}}_{vc})$, where $c$ is the true category of $v$. This means the loss decreases as the predicted probability of the true class approaches 1.

\begin{matrix} Loss = - \frac{1}{| V_{train} |} \sum_{v \in V_{train}} \sum_{t = 1}^{T} y_{vt} \ln ({\hat{y}}_{vt}) \end{matrix}

(9)
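
As a small numerical illustration of Equations 8 and 9, the sketch below computes the softmax and the mean cross-entropy for a handful of nodes. The function names are hypothetical, and labels are given as class indices rather than one-hot vectors, which yields the same value because only the true-class term of the inner sum is non-zero.

```python
import math


def softmax(logits):
    """Equation 8, shifted by the max logit for numerical stability."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits_batch, true_classes):
    """Equation 9: mean of -ln(predicted probability of the true class)."""
    losses = [-math.log(softmax(logits)[c])
              for logits, c in zip(logits_batch, true_classes)]
    return sum(losses) / len(losses)


# With T = 4 classes and uninformative outputs, the loss equals ln(4).
print(cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2]))
# A confident, correct prediction drives the loss toward 0.
print(cross_entropy([[0.0, 0.0, 8.0, 0.0]], [2]))
```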

4. Experiments

We evaluate the performance of HHGNN on data from social science research. The data were collected to analyze the influence of management and technical talent flow on the strategic flexibility of enterprises. This empirical analysis uses LinkedIn online resumes and needs to match every candidate’s informal job title to the corresponding standard occupation label, majors to the standard major category, school names to the corresponding university names, etc. Textual data were crawled from LinkedIn, where individuals describe their occupations in various expressions. The category tree in our experiment, released by the United States federal government, was modified by social scientists to fit the purpose of the social science research.

Two trained master’s students coded the informal text into qualitative data, as Table 3 shows: each descriptive job title was assigned to a specific occupation code based on the expert-designed category tree. More specifically, the two human coders each gave their answer. If the answers were consistent, the standard occupation was taken as found. Otherwise, the two coders reviewed the inconsistent samples and re-labeled them after discussing and agreeing.

Table 3.

Samples of Original Data Set.

Occupation code	Job title
Cost estimators	Quantity estimator @nawa international
IT consultants	Sr. technical analyst
Software developers	Software engineer
Management analysts	Associate consultant
Software developers	Software developer

We conduct experiments on the job title data set to test our model’s performance. First, we describe our data sets which are based on manual proofreading. Then we choose some other well-known baselines and variations of HHGNN for comparative experiments.

4.1. Dataset Description

The descriptive statistics of the dataset are given in Table 4. The first row, “job title,” refers to the raw write-in responses provided by respondents, while “categories (labels)” refers to the standardized occupation codes they are mapped to. The dataset contains 41,297 samples and covers 881 categories, and it is divided into training, validation, and test sets in the ratio 8:1:1. The average length of a job title is 2.54 words, and the average length of a category name is 2.69 words. Job titles contain up to 16 words, while category names contain up to 10 words.

Table 4.

Descriptive Statistics of the Dataset.

Text	Unique	Average length (words)	Maximum length (words)	Example
Job title	41,297	2.54	16	Advertising ba program
Categories (labels)	881	2.69	10	Advertising promotions managers
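
For readers who want to reproduce an 8:1:1 partition of this kind, one possible sketch is shown below. The paper does not state how the split was drawn, so the shuffling, the fixed seed, and the function name `split_8_1_1` are assumptions. A plain random split is used rather than a stratified one because some categories have a single sample.

```python
import random


def split_8_1_1(samples, seed=0):
    """Shuffle and split samples into train/validation/test (8:1:1)."""
    rng = random.Random(seed)  # fixed seed (assumed) for reproducibility
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_8_1_1(range(41297))
print(len(train), len(val), len(test))
```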

The sample size under different labels varies significantly in the labeled data set, as shown in Figure 4. “Software developer” is the most frequently occurring job title among respondents, and it corresponds to the most frequently occurring standard label, “Software Developers.” Popular occupations such as “software developers” and “computer and information system managers” have 2,918 and 2,507 samples, respectively. Occupations such as “mathematics” and “life scientist” appear only once.

Figure 4.

Sample sizes of every category.

According to Table 5, the hierarchical category tree consists of four levels, with 23, 98, 459, and 881 categories at the first to fourth levels, respectively. The fourth level provides the 881 labels of this data set.

Table 5.

Descriptive Statistics of the Revised Hierarchical Category Tree.

Hierarchical structure	Level	First level	Second level	Third level	Fourth level (label)
Quantity (count)	4	23	98	459	881

4.2. Metrics

Standard evaluation metrics, including micro-F1 and macro-F1, are employed to evaluate our model. Considering the imbalance of samples among different categories, macro-F1 complements micro-F1 by weighting every category equally, so that the evaluation is not dominated by frequent labels. For category $t$, ${TP}_{t}$, ${FP}_{t}$, and ${FN}_{t}$ denote the numbers of true positives, false positives, and false negatives (Manning et al. 2008). Micro-F1 and macro-F1 (Sokolova and Lapalme 2009) are defined in Equations 10 and 11.

\begin{matrix} P = \frac{\sum_{t = 1}^{T} {TP}_{t}}{\sum_{t = 1}^{T} ({TP}_{t} + {FP}_{t})}, \quad R = \frac{\sum_{t = 1}^{T} {TP}_{t}}{\sum_{t = 1}^{T} ({TP}_{t} + {FN}_{t})}, \quad \text{micro-F1} = \frac{2 PR}{P + R} \end{matrix}

(10)

\begin{matrix} P_{t} = \frac{{TP}_{t}}{{TP}_{t} + {FP}_{t}}, \quad R_{t} = \frac{{TP}_{t}}{{TP}_{t} + {FN}_{t}}, \quad \text{macro-F1} = \frac{1}{T} \sum_{t = 1}^{T} \frac{2 P_{t} R_{t}}{P_{t} + R_{t}} \end{matrix}

(11)
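
The two definitions can be checked on a toy example. The sketch below is a plain-Python illustration with a hypothetical function name. Treating an undefined precision, recall, or F1 (zero denominator) as 0 is a convention assumed here, not one stated in the text.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred, labels):
    """Compute micro-F1 (Equation 10) and macro-F1 (Equation 11)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        precision = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        recall = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro, macro = micro_macro_f1(y_true, y_pred, labels=["a", "b"])
print(micro, macro)
```

In single-label classification every error counts as exactly one false positive and one false negative, so micro-precision, micro-recall, and micro-F1 all equal plain accuracy.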

TopK accuracy is also used as an evaluation metric (Baeza-Yates and Ribeiro-Neto 1999), where K is a positive integer. It is the fraction of job titles for which the correct answer is included in the recommendation list of length K. TopK accuracy is defined as:

\begin{matrix} \text{top}K \ \text{accuracy} = \frac{\# \ \text{of true positive samples in top}K \ \text{list}}{\# \ \text{of samples}} \end{matrix}

(12)
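
A minimal sketch of Equation 12 follows. The input layout (one dictionary of category scores per job title) and the function name are assumptions made for illustration.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of samples whose true label is among the k best-scored labels.

    score_rows:  list of dicts mapping category -> model score
    true_labels: list of true categories, aligned with score_rows
    """
    hits = 0
    for scores, true in zip(score_rows, true_labels):
        ranked = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += true in ranked
    return hits / len(true_labels)


score_rows = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.1, "b": 0.2, "c": 0.7},
]
true_labels = ["b", "a"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(score_rows, true_labels, k))
```

Since the top-K list always contains the top-(K−1) list, the metric cannot decrease as K grows.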

4.3. Compared Methods

We select effective methods from two disciplines: occupation coding and text classification. The practical methods of occupation coding include TF-IDF+cosine, Naive Bayes, SVM, and Bidirectional Recurrent Neural Network (BiRNN). We also choose well-known and advanced text classification methods in machine learning, including fastText, siamese, bert+cls, and TextGNN. TextGNN and our proposed method both use graph neural networks as the primary network architecture, but their graphs have different edges and contain different information. We briefly introduce the compared methods as follows:

TF-IDF+cosine

We apply TF-IDF (H. C. Wu et al. 2008) as a two-step method, a flat but intuitive baseline. First, job titles and categories are represented by TF-IDF vectors. Then, each job title is assigned to the class with the highest cosine similarity.
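
The two steps can be sketched in plain Python as below. This is an illustration under stated assumptions, not the baseline's actual configuration: the smoothed IDF formula, whitespace tokenization, and fitting the IDF weights on the category names plus the single query title are all choices made here for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict token -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant): ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({tok: cnt / len(toks) * idf[tok]
                        for tok, cnt in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def code_by_cosine(job_title, categories):
    """Step 1: vectorize. Step 2: pick the most similar category name."""
    vecs = tfidf_vectors(categories + [job_title])
    title_vec, cat_vecs = vecs[-1], vecs[:-1]
    sims = [cosine(title_vec, c) for c in cat_vecs]
    best = max(range(len(categories)), key=sims.__getitem__)
    return categories[best]


categories = ["software developers", "management analysts", "cost estimators"]
print(code_by_cosine("senior software engineer", categories))
```

The example matches only through the shared token "software"; a title with no token in common with any category name gets similarity 0 everywhere, which illustrates why this baseline is weak on informal titles.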

Naive Bayes

Naive Bayes is a conditional probability model based on Bayes’ theorem. Despite its oversimplified assumptions, this classifier has worked well in many classification tasks (McCallum and Nigam 1998).

SVM

SVMs (Cortes and Vapnik 1995) are one of the most robust prediction methods, using the kernel trick to map inputs into high-dimensional feature spaces.

BiRNN

BiRNN processes text sequences in both forward and backward directions. This allows the network to capture contextual information from both past and future words at any given point, creating a more comprehensive sentence representation for classification (Schuster and Paliwal 1997).

fastText

fastText often achieves scalable solutions for text classification tasks while processing large datasets quickly. Hierarchical softmax in fastText proves to be very efficient when there are many categories (Joulin et al. 2016).

Siamese

The model combines a stack of character-level bidirectional Long Short-Term Memory (LSTM) with a siamese architecture (Neculoiu et al. 2016). We apply the siamese neural network as a two-step method. First, the siamese network is trained to learn the text similarity between job title and labels. Next, each job title is coded to the category with the highest text similarity.

bert+cls

Large pre-trained language models, such as BERT (Bidirectional Encoder Representations from Transformers), have proven helpful. BERT (Devlin et al. 2018) is a strong baseline for various language tasks and multiple languages. The first token of the input sequence is always [CLS], whose embedding serves as the classification representation (Sun et al. 2019), which makes it suitable for text classification on our data set.

TextGNN

TextGNN applies graph convolutional networks to text classification and builds a single text graph for a corpus based on word co-occurrence and document-word relations (Yao et al. 2019). It then jointly learns the embeddings for both words and documents, supervised by the known class labels for documents.

4.4. Settings

For HHGNN, we set the embedding size of the first convolution layer to 300 and the hidden dimension to 512. We set the other parameters after tuning: a learning rate of 0.01 (the step size for updating model weights during optimization), an L2 loss weight of 1e-5 (the regularization parameter for the L2 penalty), and a batch size of 256 (the number of samples processed per training iteration). We further demonstrate how essential parameters are determined in Section 5.3.

All experiments are performed on an Intel® Core™ i5-9400F CPU @ 2.90 GHz with 16.0 GB RAM. The operating system and software platforms are Windows 10 Professional x64 Edition, PyTorch 1.11.0, and Python 3.8.

5. Result

The results are shown in Table 6. Every row represents a metric, and every column shows the performance of a corresponding algorithm. For every metric, the best results are in bold, and the second-best results are underlined.

Table 6.

Test Performance.

Metrics	TF-IDF+cosine	Bayes	SVM	BiRNN	fastText	siamese	bert+cls	TextGNN	HHGNN
Top1 accuracy	0.0866	0.2622	0.3281	0.3384	0.3884	0.5516	0.5636	0.5300	0.6398
Top2 accuracy	0.1392	0.3214	0.3684	0.5428	0.5349	0.7170	0.6714	0.7145	0.7832
Top3 accuracy	0.2137	0.3743	0.4274	0.6403	0.6213	0.7991	0.7006	0.7935	0.8324
Top4 accuracy	0.2441	0.4141	0.4633	0.6965	0.6732	0.8425	0.7306	0.8384	0.8563
Top5 accuracy	0.2707	0.4415	0.4984	0.7358	0.7106	0.8674	0.7509	0.8661	0.8663
Top10 accuracy	0.3313	0.5560	0.6180	0.8230	0.7896	0.9233	0.8071	0.9215	0.9007
Top20 accuracy	0.3565	0.7168	0.7506	0.8864	0.8497	0.9554	0.8661	0.9443	0.9263
Micro-F1	0.0866	0.2622	0.3281	0.3384	0.3884	0.5516	0.5636	0.5300	0.6398
Macro-F1	0.0522	0.3326	0.0780	0.0812	0.0921	0.2652	0.0752	0.2775	0.2113

Note. Bold values indicate the best performance, and underlined values indicate the second-best performance.

5.1. Performance on Fourth Level

As for the topK accuracy metric, the score increases with K, since a longer list can only add correct answers. The TF-IDF+cosine method provides a flat baseline, which shows that the lexical overlap between the TF-IDF vectors of job titles and category names carries some signal for coding free job titles to standard labels. However, relying on cosine similarity alone is insufficient. Bayes achieves better results than TF-IDF+cosine. SVM and the traditional deep neural networks further improve the performance, and these methods are comparable on this data set. fastText performs better on top-1 accuracy, micro-F1, and macro-F1, while BiRNN performs better on topK accuracy for K ≥ 2. Our results show that SVM and BiRNN have similar performance and are better than Bayes, which agrees with the experiments in the literature review.

The models with more complicated structures, TextGNN, siamese, bert+cls, and HHGNN, achieve better results. The metrics of these four models are showcased in Figure 5.

Figure 5.

TopK accuracy of TextGNN, siamese, bert+cls, and HHGNN.

When K is small (K = 1), HHGNN outperforms the other models, followed by bert+cls as a strong baseline. For K = 2 and 3, TextGNN and siamese perform better than bert+cls, though HHGNN remains the leader. Correspondingly, HHGNN and bert+cls also outperform the other models in micro-F1. When K is larger (K = 10, 20), siamese and TextGNN achieve the best results, HHGNN ranks below them, and bert+cls is inferior to the other three. In practice, we place more emphasis on topK accuracy with small K, because long recommendation lists contain redundant information and take coders more time to read when annotating. Thus, we consider that HHGNN outperforms the compared methods in terms of topK accuracy for small K.

As for macro-F1, the ranking in descending order is TextGNN, siamese, HHGNN, and bert+cls. Siamese likely performs well at the macro level because its training goal is to measure the similarity between the text information and the category name, rather than to predict the one-hot label of the text information. TextGNN builds category-word edges and word-information edges, which capture word co-occurrence. Word nodes act as bridges or critical paths in the graph so that label information can be propagated to the entire graph. It seems that BERT is a strong baseline but focuses more on the categories with more samples. Our proposed HHGNN achieves a good balance between macro and micro indicators. HHGNN has a similar advantage to TextGNN: the word nodes and label nodes absorb information from their neighbors, and three different types of edges propagate information and enrich minority categories.

It is worth noting that the micro metrics are more critical when applying automatic machine learning-based methods. Micro metrics and topK accuracy directly reflect the accuracy of different methods. Higher micro and topK accuracy indicate that a method can provide a higher proportion of accurate hints for human coders. The macro-F1 is calculated as the average of each category’s F1 score, reflecting the difference in performance between the majority and minority categories. Therefore, from a practical point of view, BERT and HHGNN have more practical value (especially HHGNN), while siamese and TextGNN perform better in minority categories. To analyze the effectiveness of HHGNN, we conduct more in-depth experiments.

Case Study

To gain deeper insight into the decision-making mechanism of the HHGNN model, we analyze how its output responds to variations in the TF-IDF input features. This is achieved by removing key terms and observing how the resulting changes in the TF-IDF vector affect the prediction distribution.

In Table 7, the job title “software trainee engineer” (correctly classified as “intern part time contractors”) has a sparse TF-IDF vector [0.239, 0.463, 0.297], yielding HHGNN’s top-5 predictions: {intern part time contractors, software developers, web developers, computer hardware engineers, computer programmers}, with the correct category ranked first. The high TF-IDF value for “trainee” highlights its importance. Simplifying to “trainee engineer” adjusts the top-5 predictions to {intern part time contractors, computer hardware engineers, systems engineers, engineers all other, computer network architects}, retaining the correct category while emphasizing “engineer” in the top-2 and top-3 positions. Further simplifying to “software engineer,” the top-5 predictions are {software developers, computer information systems managers, engineers all other, software quality assurance analysts testers, industrial engineers}, where the higher weight of “software” drives predictions toward software-related categories. These observations demonstrate that HHGNN effectively adapts to TF-IDF vector changes, ensuring predictions align with input semantics.

Table 7.

TF-IDF Vectors and Test Performance of HHGNN.

Input text	TF-IDF vector (sparse form)	Top-5 predicted categories
Software trainee engineer	[0.239, 0.463, 0.297]	1. Intern part time contractors
		2. Software developers
		3. Web developers
		4. Computer hardware engineers
		5. Computer programmers
Trainee engineer	[0.660, 0.340]	1. Intern part time contractors
		2. Computer hardware engineers
		3. Systems engineers
		4. Engineers all other
		5. Computer network architects
Software engineer	[0.554, 0.446]	1. Software developers
		2. Computer information systems managers
		3. Engineers all other
		4. Software quality assurance analysts testers
		5. Industrial engineers

To further validate its practical utility, we conducted a user study, detailed in the Appendix, in which two master’s students trained in professional coding annotated job titles using HHGNN-generated top-5 hints. The results indicated that the hints reduced average annotation time.

5.2. Performance on First Level

We further compared the predicted distributions with the true distribution at the first level of occupational categories. This comparison evaluates how well each method reflects the actual data distribution and helps identify systematic biases or overfitting to specific categories.

As shown in Figure 6, the true distribution is highly imbalanced, with only three categories accounting for approximately 80% of the samples: Computer and Mathematical Occupations (Category 5), Management Occupations (Category 15), and Business and Financial Operations Occupations (Category 3). HHGNN produces a distribution that closely matches the true one in these major categories, showing a reasonable overall shape. TextGNN achieves the lowest KL divergence (0.1065), indicating the closest fit to the true distribution. It also maintains noticeable probability over several low-frequency categories, demonstrating an ability to capture long-tail trends. In contrast, bert and siamese show substantial deviation from the true distribution. Bert yields a high KL divergence of 1.2122, with predictions overly concentrated in a few categories, suggesting limited generalization ability.

Figure 6.

Prediction distribution on first level.
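
A comparison of this kind can be sketched as follows. The text does not specify the direction of the KL divergence or how empty categories are handled, so the sketch assumes KL(true ‖ predicted) and floors predicted probabilities at a small `eps`. Both choices, and the function name, are assumptions.

```python
import math
from collections import Counter


def kl_divergence(true_labels, pred_labels, categories, eps=1e-9):
    """KL(true || predicted) between first-level category distributions."""
    n_true, n_pred = Counter(true_labels), Counter(pred_labels)
    kl = 0.0
    for c in categories:
        p = n_true[c] / len(true_labels)
        q = max(n_pred[c] / len(pred_labels), eps)  # avoid log of zero
        if p > 0:  # terms with p = 0 contribute nothing
            kl += p * math.log(p / q)
    return kl


true_first_level = ["x", "x", "y", "y"]
pred_first_level = ["x", "x", "x", "y"]
print(kl_divergence(true_first_level, pred_first_level, ["x", "y"]))
```

A value of 0 means the predicted category shares match the true ones exactly. The measure compares aggregate distributions only, so it can be low even when individual predictions are wrong.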

5.3. Parameter Sensitivity

Table 8 shows micro-F1 and macro-F1 for different numbers of neighbors. Both metrics increase with the number of neighbors up to N = 3 and do not improve further for larger N. Therefore, we take N = 3 as the optimal number of neighbors. This suggests that too few neighbors cannot provide sufficient global information from the graph, while too many neighbors may add information that is only loosely related.

Table 8.

Test Performance of HHGNN in Different Numbers of Neighbors N.

N	1	2	3	4	5
Micro-F1	0.5676	0.5875	0.6398	0.6241	0.6338
Macro-F1	0.1620	0.1843	0.2113	0.1911	0.2020

Table 9 reports the classification performance of HHGNN with different dimensions D of the hidden layer. Embeddings of too low a dimension may not propagate label information well across the whole graph, and increasing the dimension improves classification performance. However, the gain from further expansion becomes minor (from 256 to 512, micro-F1 rises only from 0.6368 to 0.6398), while the training time and memory usage increase significantly. There is therefore no need to increase the dimension beyond 512, and we choose 512 as the dimension of the hidden layer.

Table 9.

Test Performance of HHGNN in Different Dimensions of the Hidden Layer D.

D	64	128	256	512
Micro-F1	0.5542	0.5983	0.6368	0.6398
Macro-F1	0.1620	0.1843	0.2049	0.2113

5.4. Effects of the Size of Labeled Data

To evaluate the extent to which different methods depend on labeled data, we test several of the best-performing models with different proportions of the training data. Figure 7 reports micro and macro metrics with 30%, 50%, 70%, and 100% of the original training data set. Every line represents an algorithm, and the shaded area represents the range of results. We note that HHGNN still achieves the best micro-F1 with limited labeled data. GNN models, especially HHGNN, show small variance in micro-F1. TextGNN and siamese still perform better at the macro level but are more variable. These micro-F1 results suggest that HHGNN alleviates the dependency on data compared with other methods, as it can still perform well and stably with a small proportion of labeled data.

Figure 7.

Test performance with varying training data proportions.

Even though HHGNN performs better only on micro metrics, recall from Section 5.1 that we regard micro-F1 as the more relevant metric in practical applications. Therefore, from a practical perspective, HHGNN helps to alleviate the dependency on labeled data: it still outperforms the other methods with a smaller proportion of labeled data, and the variation in its results is relatively small.

6. Conclusion and Implications

This paper applies advanced machine learning to occupation title coding. The results show that graph neural networks, followed by sophisticated neural networks and traditional classifiers, generate improvements in accuracy. It is noticeable that SVM’s performance is comparable with that of deep neural networks on this data set. The results confirm that our proposed HHGNN robustly outperforms the other models in micro-F1 and topK accuracy for small K (K ≤ 4), while siamese has advantages on the other metrics. The test performance of the variations of HHGNN indicates that it is worthwhile to incorporate heterogeneous information, the hierarchical category tree, and synonym information into graph neural networks. At the same time, our method also eases the dependency on training data, as it still achieves relatively good results with only 30% of the original labeled data.

Our research has both theoretical and practical implications. One theoretical implication is that expanding the range of advanced methods applied to occupation coding can produce significant performance gains. Previous research applies only traditional machine learning methods, while we test effective methods from both occupation coding and text classification and find that different methods have their own advantages. Therefore, researchers can actively look for the best-performing algorithm according to their needs. Our research also highlights that HHGNN is effective and performs well in micro-F1, which encourages subsequent exploration of graph neural networks, hierarchical category trees, and synonyms in this task. As for practical implications, given an adequate supply of previously labeled data, researchers can incorporate this method to support human annotators in coding free texts into standard occupation titles in a fine-grained category tree. Human coders can use the hints given by HHGNN to save time and labor. However, since average accuracy still needs improvement, current methods cannot completely replace manual coding.

As for future research, exploring weak supervision or cold-start strategies represents a promising direction. Such approaches could leverage the unsupervised components—for instance, performing label propagation via “synonym edges” and “semantic similarity edges” in the graph, or generating pseudo-labels using a nearest-neighbor classifier. These methods could initialize the supervised parameter training process, thereby enhancing the adaptability when training data are not available.

Appendix

ORCID iD

Yi Xie

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Received: June 11, 2025

Accepted: March 10, 2026

References

Alazaidah; Ahmad, F. K. 2016. “Trending Challenges in Multi-Label Classification.” International Journal of Advanced Computer Science and Applications 7 (10): 127–31. DOI: https://doi.org/10.14569/IJACSA.2016.071017.

Baeza-Yates; Ribeiro-Neto. 1999. Modern Information Retrieval: The Concepts and Technology Behind Search. ACM Press/Addison-Wesley.

Bao; Baker, C. J.; Adisesh. 2020. “Occupation Coding of Job Titles: Iterative Development of an Automated Coding Algorithm for the Canadian National Occupation Classification (ACA-NOC).” JMIR Formative Research 4 (7): e16422. DOI: https://doi.org/10.2196/16422.

Basit. 2003. “Manual or Electronic? The Role of Coding in Qualitative Data Analysis.” Educational Research 45 (2): 143–54. DOI: https://doi.org/10.1080/0013188032000133548.

Bethmann; Schierholz; Wenzig; Zielonka. 2014. “Automatic Coding of Occupations.” Proceedings of Statistics Canada Symposium: Beyond Traditional Survey Taking – Adapting to a Changing World, Ottawa, Canada.

Boselli; Cesarini; Marrara; et al. 2018. “WoLMIS: A Labor Market Intelligence System for Classifying Web Job Vacancies.” Journal of Intelligent Information Systems 51: 477–502. DOI: https://doi.org/10.1007/s10844-017-0488-x.

Bundesagentur für Arbeit. 2011. “Klassifikation der Berufe: Definitorischer und beschreibender Teil.” https://doku.iab.de/fdz/reporte/2013/MR_08-13.pdf.

Chen; Cai; Chen. 2021. “HHGN: A Hierarchical Reasoning-Based Heterogeneous Graph Neural Network for Fact Verification.” Information Processing & Management 58 (5): 102659. DOI: https://doi.org/10.1016/j.ipm.2021.102659.

Chen, N.-C.; Drouhard; Kocielnik; Suh; Aragon, C. R. 2018. “Using Machine Learning to Support Qualitative Coding in Social Science: Shifting the Focus to Ambiguity.” ACM Transactions on Interactive Intelligent Systems (TiiS) 8 (2): Article 1–20. DOI: https://doi.org/10.1145/3185515.

Colombo; Mercorio; Mezzanzanica. 2019. “AI Meets Labor Market: Exploring the Link Between Automation and Skills.” Information Economics & Policy 47: 27–37. DOI: https://doi.org/10.1016/j.infoecopol.2019.05.003.

Connelly; Gayle; Lambert, P. S. 2016. “A Review of Occupation-Based Social Classifications for Social Survey Research.” Methodological Innovations 9: Article 2059799116638003. DOI: https://doi.org/10.1177/2059799116638003.

Cortes; Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20 (3): 273–97. DOI: https://doi.org/10.1007/BF00994018.

Crowston; Liu; Allen, E. E. 2010. “Machine Learning and Rule-Based Automated Coding of Qualitative Data.” Proceedings of the American Society for Information Science and Technology 47 (1): 1–4. DOI: https://doi.org/10.1002/meet.14504701328.

Crowston; Allen, E. E.; Heckman. 2012. “Using Natural Language Processing Technology for Qualitative Data Analysis.” International Journal of Social Research Methodology 15 (6): 523–43. DOI: https://doi.org/10.1080/13645579.2011.625764.

Devlin; Chang, M.-W.; Lee; Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint. https://arxiv.org/abs/1810.04805.

Elliott. 2018. “Thinking About the Coding Process in Qualitative Data Analysis.” The Qualitative Report 23 (11): 2850–61. DOI: https://doi.org/10.46743/2160-3715/2018.3560.

Franzosi. 1989. “From Words to Numbers: A Generalized and Linguistics-Based Coding Procedure for Collecting Textual Data.” Sociological Methodology 19: 263–98. DOI: https://doi.org/10.2307/270955.

Friedman, J. H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5): 1189–232. DOI: https://doi.org/10.1214/aos/1013203451.

Gweon; Schonlau; Kaczmirek; Blohm; Steiner. 2017. “Three Methods for Occupation Coding Based on Statistical Learning.” Journal of Official Statistics 33 (1): 101–22. DOI: https://doi.org/10.1515/jos-2017-0006.

Joulin; Grave; Bojanowski; Mikolov. 2016. “Bag of Tricks for Efficient Text Classification.” arXiv preprint. https://arxiv.org/abs/1607.01759.

Jung; Yoo; Myaeng, S.-H.; Han, D.-C. 2008. “A Web-Based Automated System for Industry and Occupation Coding.” In Web Information Systems Engineering – WISE 2008, edited by J. Bailey; Maier; Schewe, K. D.; Thalheim; Wang, X. S. Springer.

Keller. 2017. “How to Gauge the Relevance of Codes in Qualitative Data Analysis? – A Technique Based on Information Retrieval.” https://aisel.aisnet.org/wi2017/track11/paper/1/.

Kirby; Carson; Dunlop; et al. 2015. “Automatic Methods for Coding Historical Occupation Descriptions to Standard Classifications.” In Population Reconstruction, edited by G. Bloothooft; Christen; Mandemakers; Schraagen. Springer.

Komárek. 2004. “Logistic Regression for Data Mining and High-Dimensional Classification.” PhD dissertation, Carnegie Mellon University.

Kowsari; Brown, D. E.; Heidarysafa; Meimandi, K. J.; Gerber, M. S.; Barnes, L. E. 2017. “HDLTex: Hierarchical Deep Learning for Text Classification.” Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, December 18–21. https://ieeexplore.ieee.org/abstract/document/8260658.

Linneberg, M. S.; Korsgaard. 2019. “Coding Qualitative Data: A Synthesis Guiding the Novice.” Qualitative Research Journal 19 (3): 259–70. DOI: https://doi.org/10.1108/QRJ-12-2018-0012.

Malekzadeh; Hajibabaee; Heidari; Zad; Uzuner; Jones, J. H. 2021. “Review of Graph Neural Network in Text Classification.” Proceedings of the 2021 IEEE 12th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, December 1–4. DOI: https://doi.org/10.1109/UEMCON53757.2021.9666633.

Manning, C. D.; Raghavan; Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

McCallum, A. K.; Nigam. 1998. “Employing EM and Pool-Based Active Learning for Text Classification.” Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98), Madison, WI, USA, July 24–27.

Montebruno; Bennett, R. J.; Smith; Van Lieshout. 2020. “Machine Learning Classification of Entrepreneurs in British Historical Census Data.” Information Processing & Management 57: 102210. DOI: https://doi.org/10.1016/j.ipm.2020.102210.

Nahoomi. 2018. “Automatically Coding Occupation Titles to a Standard Occupation Classification.” PhD thesis, University of Guelph.

Neculoiu; Versteegh; Rotaru. 2016. “Learning Text Similarity with Siamese Recurrent Networks.” Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, August 11.

Nelson, L. K.; Burk; Knudsen; McCall. 2021. “The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods.” Sociological Methods & Research 50 (1): 202–37. DOI: https://doi.org/10.1177/0049124118769114.

Nigam; Lafferty; McCallum. 1999. “Using Maximum Entropy for Text Classification.” IJCAI-99 Workshop on Machine Learning for Information Filtering, Stockholm, Sweden.

Reja; Manfreda, K. L.; Hlebec; Vehovar. 2003. “Open-Ended vs. Close-Ended Questions in Web Questionnaires.” Developments in Applied Statistics (Metodološki zvezki) 19: 159–77. https://begrijpelijkeformulieren.org/sites/begrijpelijkeformulieren/files/Reja_e.a._Open-ended_vs._Close-ended_Questions_in_Web.pdf.

36.

Rossi

R. G.

de Andrade Lopes

Rezende

S. O.

2016. “Optimization and Label Propagation in Bipartite Heterogeneous Networks to Improve Transductive Classification of Texts.” Information Processing & Management 52 (2): 217–57. DOI: https://doi.org/10.1016/j.ipm.2015.07.004.

37.

Russ

D. E.

K.-Y.

Johnson

C. A.

Friesen

M. C.

2014. “Computer-Based Coding of Occupation Codes for Epidemiological Analyses.” Proceedings of the 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), New York, NY, USA, May 27–29.

38.

Salton

Fox

E. A.

1983. “Extended Boolean Information Retrieval.” Communications of the ACM 26 (11): 1022–36. DOI: https://doi.org/10.1145/182.358466.

39.

Salton

Wong

Yang

C.-S.

1975. “A Vector Space Model for Automatic Indexing.” Communications of the ACM 18 (11): 613–20. DOI: https://doi.org/10.1145/361219.361220.

40.

Schierholz

Schonlau

2021. “Machine Learning for Occupation Coding—A Comparison Study.” Journal of Survey Statistics and Methodology 9 (5): 1013–34. DOI: https://doi.org/10.1093/jssam/smaa023.

41.

Scholz

Wasmer

2009. “German General Social Survey 2006: English Translation of the German ‘ALLBUS’ Questionnaire.” GESIS-Technical Reports, 2009/06, GESIS—Leibniz-Institute for the Social Sciences, Mannheim. https://www.ssoar.info/ssoar/bitstream/handle/document/20703/ssoar-2009-scholz_et_al-german_general_social_survey_2006.pdf?sequence=1&isAllowed=y&lnkname=ssoar-2009-scholz_et_al-german_general_social_survey_2006.pdf.

42.

Schuster

Paliwal

K. K.

1997. “Bidirectional Recurrent Neural Networks.” IEEE Transactions on Signal Processing 45 (11): 2673–81. DOI: https://doi.org/10.1109/78.650093.

43.

Sokolova

Lapalme

2009. “A Systematic Analysis of Performance Measures for Classification Tasks.” Information Processing & Management 45 (4): 427–37. DOI: https://doi.org/10.1016/j.ipm.2009.03.002.

44.

Sun

Qiu

Huang

2019. “How to Fine-Tune BERT for Text Classification?” In Proceedings of the China National Conference on Chinese Computational Linguistics, edited by Sun

Huang

Liu

Springer.

45.

Tannis

Chernov

Perlman

McKelvey

Toprani

2020. “Cardiovascular Health Risk Behaviors by Occupation in the NYC Labor Force.” Journal of Occupational and Environmental Medicine 62 (9): 757–63. DOI: https://doi.org/10.1097/JOM.0000000000001960.

46.

Tarrow

1995. “Bridging the Quantitative-Qualitative Divide in Political Science.” American Political Science Review 89 (2): 471–74. DOI: https://doi.org/10.2307/2082444.

47.

Tausczik

Y. R.

Pennebaker

J. W.

2010. “The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods.” Journal of Language and Social Psychology 29 (1): 24–54. DOI: https://doi.org/10.1177/0261927X09351676.

48.

Uysal

A. K.

Günal

2014. “The Impact of Preprocessing on Text Classification.” Information Processing & Management 50: 104–12. DOI: https://doi.org/10.1016/j.ipm.2013.08.006.

49.

Voutilainen

2004. “Part-of-Speech Tagging.” In The Oxford Handbook of Computational Linguistics, edited by Mitkov

Oxford University Press.

50.

Webster

J. J.

Kit

1992. “Tokenization as the Initial Phase in NLP.” Proceedings of the 14th International Conference on Computational Linguistics (COLING ’92), Nantes, France, July 23–28. https://www.aclweb.org/anthology/C92-4173/.

51.

H. C.

Luk

R. W. P.

Wong

K. F.

Kwok

K. L.

2008. “Interpreting TF-IDF Term Weights as Making Relevance Decisions.” ACM Transactions on Information Systems (TOIS) 26 (3): Article 13. DOI: https://doi.org/10.1145/1361684.1361686.

52.

Sun

Zhang

Xie

Cui

2022. “Graph Neural Networks in Recommender Systems: A Survey.” ACM Computing Surveys 55 (5): 1–37. DOI: https://doi.org/10.1145/3535101.

53.

Yao

Mao

Luo

2019. “Graph Convolutional Networks for Text Classification.” Proceedings of the AAAI Conference on Artificial Intelligence 33 (1): 7370–7. DOI: https://doi.org/10.1609/aaai.v33i01.33017370.

54.

Zhang

Song

Huang

Swami

Chawla

N. V.

2019. “Heterogeneous Graph Neural Network.” Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19), Anchorage, AK, USA, August 4–8. DOI: https://doi.org/10.1145/3292500.3330961.

55.

Zhou

Z.-H.

2018. “A Brief Introduction to Weakly Supervised Learning.” National Science Review 5 (1): 44–53. DOI: https://doi.org/10.1093/nsr/nwx106.