Sage Journals: Discover world-class research

Abstract

The rapid development of welding fabrication demands guidance data to support expert decision-making, retroactive process, and production pattern upgrades. Yet, the data is lacking. This paper developed the guidance corpus for welding manufacturing. The unstructured document guiding actual production was collected, and the data were segmented into words employing a conditional random field model. In addition, we organized an annotation team and developed the tool to accomplish data annotation using a Begin-Inside-Outside method. Machine learning and deep learning models were trained for the named entity recognition task under supervised learning to accomplish the challenge of corpus scalability. The corpus contained 19,410 labeled pairs of samples and involved multiple entity categories, that is, “standard,”“technology,”“design,”“department,”“manufacturing,” and “quality.” The baseline of the model on Precision, Recall, and F₁-score metrics were listed to provide a reference for welding production research. Furthermore, we also gave some examples of task assignments and knowledge retrieval in conjunction with real production problems. The long-term goal was continually to enhance the corpus through more researchers, suggesting that a robust corpus would support more research and engineering applications in the welding domain.

Keywords

Welding fabrication guidance corpus machine learning deep learning

Introduction

As a critical manufacturing method, the welding technique has gradually developed in the past decades, especially in smart manufacturing. Some intelligent methods based on essential domain data have been applied for process selection,¹ fault diagnosis,² and expert decision-making.³ Most of the data are stored in structured databases, requiring strict form and correspondence. This makes data access difficult, and the application scope is limited due to the missing amount of data. Extracting data from unstructured text to support domain operations can effectively compensate for the shortage of structured data. However, the data from unstructured files are lacking in the manufacturing domain,⁴ and even no relevant research has been performed in the field of welding. The main content of our research is how to automatically extract data from critical unstructured files to form the supporting data.

Meaningful data have a positive impact on engineering applications. In the welding process, domain experts must design the product, develop the strategy, guide the production, inspect the quality, and guarantee maintenance according to the guiding documents, such as standards, regulations, and requirements. The welding instruction documents cover the whole life cycle of welding and are characterized by multiple types, specialized content, and complex knowledge. Therefore, the guidance document was chosen as the data source due to its great engineering value. A large amount of knowledge in unstructured guidance documents requires domain experts to acquire proper knowledge through extensive searching and thinking. In addition, different areas and types of work involve different document contents, making the manual acquisition of practical knowledge more complex. These challenges make us focus on Natural Language Processing (NLP) technology. Building a corpus and extracting information with Named Entity Recognition (NER)⁵ is essential for research and applications in welding engineering. Entity-based data weaken the limitations of content and format in different schemas. The corpus gathers discrete data in a standard form and can be used as a data driver for knowledge acquisition, decision-making, etc.

Corpus plays a vital role in knowledge extraction and application as the database of NLP technology. In recent years, many corpus pieces of research have focused on corpus types,^6,7 construction methods,⁸ and applications.^9,10 Some of the studies are listed in Table 1. Yet, the corpus already contains monolingual corpus,¹¹ bilingual corpus,¹² and even multilingual corpus.^13,14 NER is always an effective method for construction to develop and improve the corpus. Model learning methods are used as entity recognition to automate data collection. Cho et al.¹⁵ proposed extracting different character-level representations from CNN and bi-LSTM based on bi-LSTM with conditional random field (CRF) and introducing an attention mechanism to improve the performance of the recognition model. Moreover, long-short dependencies¹⁶ and transfer learning¹⁷ were also used to refine and improve the NER technology. The NER approach is also used to define multiclass entities, which are essential components of the corpus. Typically, multi-entity classification focuses on the classification of the open domain^18,19 (e.g. time, organization, position, etc.) and domain-specific entities²⁰ (e.g. disease, drugs, diet, etc.). These classified entities have been applied in biomedical,²¹ linguistics,²² etc. Based on these methods, more and more corpora are being built and used to analyze and solve practical problems. eHealth-KD corpus,²³ NCBI Disease Corpus,²⁴ Classical Music Corpus,²⁵ and Spanish AzBio Sentence Corpus²⁶ were developed to support research and applications in related domains. However, despite recent advances, the corpus is lacking in industries, especially the welding field, and little relevant research has focused on the welding corpus and the classification of welding entities. Perhaps there are several reasons for this: (i) the specialization and complexity of knowledge make it challenging to identify and tag the knowledge; (ii) the missing open web data or APIs make metadata challenging to obtain in the welding process; and (iii) weak connection between traditional welding domain and novel intelligent technology also make it challenging.

Table 1.

Corpus-related studies.

Corpus research	Research examples
Corpus categories	Monolingual corpus¹¹ Bilingual corpus¹² Multilingual corpus^13,14
Corpus construction methods	CNN and LSTM¹⁵ long-short dependencies¹⁶ and transfer learning¹⁷ Conditional random field²⁸
Corpus applications	eHealth-KD corpus²³ NCBI disease corpus²⁴ Classical music corpus²⁵ Spanish AzBio Sentence corpus²⁶

Combined with the characteristics of the corpus, we found that it has positive effects on the rapid acquisition and application of welding guidance knowledge. A Chinese welding corpus was created based on the actual production domain of welding. In corpus building, original data were gathered from many unstructured guidance documents for high-speed train fabrication. The CRF model is used as a Chinese word separation²⁷ and Begin-Inside-Outside (BIO) annotation method as an entity definition. Furthermore, we have organized teams and developed the tag tool to make the markup process more accurate and efficient. On this basis, the tagged data are trained via Hidden Markov Model (HMM), CRF,²⁸ Bidirectional Long Short-Term Memory (BILSTM), and Bidirectional Long Short-Term Memory and CRF (BILSTM + CRF). Furthermore, Precision, Recall, and F₁-score (PRF) metrics are employed to evaluate models which provide a baseline on the corpus. We describe two use cases based on the corpus to improve knowledge acquisition efficiency and assist welding production. The research contribution can be described as follows:

(i) Collecting unstructured guidance data to form a corpus positively impacts the welding manufacturing domain. Original data were collected from unstructured welding instruction documents, and we manually annotated these words to construct the corpus. This work provides data support for knowledge engineering research in the welding domain.

(ii) Machine learning and deep learning methods are used as corpus data training, and baselines are listed via training results. The scalability of the corpus is achieved while providing a reference for subsequent research.

(iii) To express the importance of the corpus more visually, novel cases on welding task assignment and rapid welding knowledge acquisition are described based on the corpus. The work is significant for practical productivity improvement, knowledge reasoning, process decision-making, etc.

Our primary purpose is to supplement the lack of a welding corpus, so we build a corpus of welding process guidance documents, which provides supporting data for the research and application of knowledge during the welding process. The further aim is to enhance and improve the corpus. Thus, a complete and accurate corpus will benefit the research and application of welding knowledge. Furthermore, machine learning and deep learning methods were used for entity extraction, and the model baseline was released as a reference for subsequent research.

Material and methods

Data and annotations

The purpose of knowledge acquisition, efficient expert decision-making, and production model upgrade push us to focus our attention on guidance documents. Hence, we collected a large number of unstructured documents such as requirements, regulations, and standards. These documents have the characteristics of high standardization, guidance, and recognition. In addition, a data structure covering all sub-processes of the welding process might help establish a complete corpus, so the application of the data was also considered a fundamental principle for selection information. According to the actual welding production process, we selected guiding documents on product design, process development, assembly production, quality inspection, and others and changed these data to the initial form we needed.

For building an efficient corpus, a uniform annotation specification was used as a convention document. We organized the annotation team, which consisted of two annotators and a decision-maker. The annotators and the decision-maker have at least 3 or 4 years of welding learning experience. In addition, annotators can determine the content of the marker with an argument. In case of disagreement, the text will be marked according to the decision-maker. The welding data annotation task is the process of assigning a corresponding label to each word. Therefore, the BIO annotation method was used as data annotation. The beginning of an entity can be labeled by “B” (Begin) and belongs to the entity category; the rest of the entity can be labeled as “I” (Inside) and its relevant type. For inter-entity, connectives can be tagged with “O” (Outside).

Word segmentation

Currently, there are no apparent separator features in a sentence for the Chinese corpus, so probabilistic statistical models are often considered as an approach to separate words.²⁹ The primary purpose of this statistical analysis is to determine the most appropriate way of segmenting words from the multiple word combinations in a sentence. The most significant probability of distribution accompanies the optimal situation.

The CRF was proposed to describe the corresponding distribution probability relationship between two sequence sets. Chinese word segmentation based on the CRF iteratively calculates the maximum conditional distribution probability through observation and state sequences. The calculation principle is shown in Figure 1.

Figure 1.

CRF word segmentation principle.

Calculating the distribution probability is the key to obtain the optimal separator words. Here, two random variables, X and Y were defined. Where X takes the value x, and the conditional probability Y takes the value y [P(y|x)]. The calculation formula is shown in equations (1) and (2).

P (y | x) = \frac{1}{Z (x)} \exp (\sum_{i, k} λ_{k} t_{k} (y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l} (y_{i}, x, i))

(1)

Z (x) = \sum_{y} \exp (\sum_{i, k} λ_{k} t_{k} (y_{i - 1}, y_{i}, x, i) + \sum_{i, l} u_{l} s_{l} (y_{i}, x, i))

(2)

Where t_k and s_l are characteristic functions; λ_k and u_l are corresponding weights; and Z(x) is the normalization factor.

Named entity recognition

Named entity recognition as a corpus subtask enables the automatic recognition and extraction of entity information. Rule-based, machine learning and deep learning are the frequently used methods to achieve entity recognition. The rule-based approach is a process of entity matching according to dictionary entity-specific rules based on a data dictionary. Machine learning-based methods mainly rely on probabilistic statistical methods such as HMM and CRF. The deep learning approach focuses on the data as a whole, with powerful vector representation and computational power. During the welding guidance corpus research, several classical machine learning (HMM and CRF) and deep learning (BILSTM and BILSTM + CRF) models were chosen to perform entity recognition tasks.

HMM is a statistically based learning model, and its labeled sequences are calculated according to five elements, that is, (N, M, A, B, and π). Where N denotes the word labels, M refers to the pronoun itself, A represents as the transfer probability matrix, B refers to the observation probability matrix, and π denotes the initial probability matrix. Defining the input sequence x = {x₁, …, x_N}, the status sequence y = {y₁, …, y_N}, and the joint distribution probability can be expressed under an equation (3). Where t represents any node state, p(y₁) can be calculated according to π because of no precursor node, and p(y₁_|y_t−1) and p(x_t_|y_t) can be calculated separately based on A and B matrices.

p (x, y) = p (y_{1}) Π_{t = 2}^{N} p (y_{t} | y_{t - 1}) Π_{t = 1}^{N} p (x_{t} | y_{t})

(3)

CRF is also a commonly used statistical model in the NER task. Defining the conditional probability model P(Y|X), X is the state sequence, and Y is the observation sequence. The model can be obtained based on the maximum likelihood estimates and data samples. We can predict the labels based on the training model and the given input sequence. The calculation formula is shown in equations (1) and (2).

Long Short-Term Memory Network (LSTM)³⁰ involves three unique structures (forget gate, input gate, and output gate), which are used to refine the long-range dependence problem based on the Recurrent Neural Network (RNN). Inter-gate action can help to obtain the desired state and achieve long-distance dependence. In the NER task, inter-semantic dependencies as non-negligible features cannot be adequately expressed via a single LSTM model. Therefore, BILSTM models are proposed to obtain bidirectional semantic features to improve model performance. In addition, BILSTM + CRF is employed as a synergistic BILSTM and CRF model for enhancing model outcome prediction, also widely used in NER tasks. The schematic of BILSTM + CRF is expressed in Figure 2, which includes the BILSTM model action process and the principle of LSTM.

Figure 2.

The schematic of BILSTM + CRF.

Method process

Corpus building and entity identification can be subdivided into several steps: data collection, sentence-level data collation, word segmentation, data annotation, model training, etc. The documents from welding production are collected and processed as sentence-level data. The CRF model is employed to segment sentences into multiple words. We also manually annotated these words by the BIO method. In addition, multiple NER models were trained to extract information and evaluate the corpus. The technical route of the task is illustrated in Figure 3.

Figure 3.

Overall research program.

Experiments

Data categories

According to the actual production application, we divided our data into six categories, namely “standard,”“technology,”“design,”“department,”“manufacturing,” and “quality.”“Standard” includes production standards, requirements, and other guidance documents. “Technology” covers the production process and technical means. “Design” contains product geometry information and performance design. “Department” includes related personnel and institutions. “Manufacturing” covers production materials and the production process. “Quality” provides quality and testing requirements. Detailed information is described in Table 2.

Table 2.

Data category information.

No	Categories	Example
1	Standard	Number and name about the Standard, name for production requirements, etc.
2	Technology	Welding methods, auxiliary welding materials, related technologies, etc.
3	Design	Welding joint size, bevel form, welding position, weld performance design, etc.
4	Department	The manufacturing department, manufacturer, user, Testing department, personnel, etc.
5	Manufacture	products, equipment, tooling, etc.
6	Quality	Product quality, defect detection, checking means, etc.

Data processing

To quickly complete the annotation task, the simple model is trained through small batches of data samples, used for initial data annotation, and then artificially perfected. We developed a corresponding entity annotation tool containing essential functions, annotation fields, label editing, etc. Considering that the datasets contain Chinese characters, the basic function module developed the word separation function. The word segmentation module was constructed based on the trained CRF model. The specialized nature of welding data motivated us to introduce a dictionary of welding terms to avoid invalid segmentation. Therefore, we defined some labels in the label editing module according to the text category requirements and marked datasets in the annotation area via clicking defined tags. The tool function interface is shown in Figure 4.

Figure 4.

Annotation tool interface.

Dataset segmentation divides data according to the sample size to support model training, validation, and evaluation. For data volume, a total of 19,410 sample pairs of data were obtained in the labeling task. We initially divided the training set, test set, and validation set according to the ratio of 8:1:1. We manually checked and adjusted some of the data to ensure that each category was evenly distributed among the division data. The training set contained 15,113 pairs of samples for the modified data set, the test set involved 2350 pairs of samples, and the validation set had 1947 pairs of samples. The detailed data without the label “O” are shown in Table 3.

Table 3.

Category data information.

No.	Categories	Train	Test	Validation	Total
1	Standard	1412	222	256	1890
2	Technology	967	161	164	1292
3	Design	1642	119	96	1857
4	Department	825	116	81	1022
5	Manufacture	1998	384	215	2597
6	Quality	1522	179	229	1930
7	Total	8366	1181	1041	10,588

Model training

Statistical model-based methods (HMM and CRF) and deep learning-based methods (BILSTM and BILSTM + CRF) were trained under supervised conditions. Furthermore, we run the model program via the Python programing language (version: 3.7) in the Scikit-Learn framework (version: 0.24.2). Moreover, the parameters of BILSTM were set to batch size of 64, the learning rate of 0.001, epochs size of 100, while the parameters of BILSTM + CRF only differed in epochs size of 200. For other model parameters, the default value was adopted.

Evaluations

The prediction samples were classified as true positive (TP), true negative (TN), false positive (FP), and false-negative (FN) according to the model prediction results. Precision, Recall, and F₁-score were used as model evaluation metrics. Here, Precision is the proportion of all true samples to the total sample, and Recall represents the proportion of true values in all positive samples. The overall performance of models was evaluated with the F₁-score metric, based on Precision and Recall. These evaluation indicators are expressed as follows:

Precision = \frac{TP}{TP + FP}; Recall = \frac{TP}{TP + FN}

(4)

F_{1} - score = \frac{2 \times Recall \times Precision}{Recall + Precision}

(5)

Results and discussion

Results and baseline

Based on a manually annotated corpus, four models were trained. The training process was analyzed, and the confusion matrix is plotted in Figure 5. In which the horizontal and vertical coordinates indicate the target category and output category, where the numbers 1–13 stand for “B-Technology,”“B-Manufacture,”“I-Manufacture,”“B-Design,”“B-Department,”“O,”“B-Standard,”“I-Design,”“I-Standard,”“I-Quality,”“B-Quality,”“I-Department,” and “I-Technology,” respectively, whereas (a), (b), (c), and (d) denote the confusion matrix of HMM, CRF, BILSTM, and BILSTM-CRF models, respectively. Any cell in the confusion matrix can be interpreted as the count of output categories predicted to be the target category.

Figure 5.

Confusion matrix. (a), (b), (c), and (d) denote the confusion matrices for the HMM, CRF, BILSTM, and BILSTM-CRF models, respectively.

As described in Figure 5, the darker-colored cells formed the diagonal of the matrix, and the values of the diagonal cells indicated the number of predicted and actual results as similar outcomes, therefore indicating that the four models have some degree of accuracy. In addition, we found that category “O” had different degrees of influence on the other categories. Furthermore, 1 and 2, 2 and 3, and 10 and 11 occurred in a more significant confusion than the other categories. The main reason for these results could be considered under the following points: (i) The characteristics of welding professional vocabulary lead to unclear category characteristics. For example, both “size of weld” and “sanding the weld” contained the keyword “weld,” but their categories were different and related to the different contexts. (ii) Confusion among the constituent entity units is also considered one of the factors affecting the model metrics.

The training process was analyzed, and the optimization process of the loss functions on BILSTM and BILSTM + CRF was obtained as described in Figure 6. We found that the model optimization process had a good convergence and no overfitting, making the training and parameter settings of our model reasonable.

Figure 6.

Models optimization process.

As a result, 19,410 pairs of annotated data sets contained six entity categories in the corpus. The NER method was used as an entity collection to improve the corpus scalability further, and the baseline was listed based on the Precision, Recall, and F1-score as described in Table 4.

Table 4.

Model results and baseline.

Model	Metrics	Technology	Manufacture	Design	Department	Standard	Quality
HMM	Precision	0.5115	0.5687	0.5933	0.5564	0.9049	0.7129
	Recall	0.5281	0.4738	0.7426	0.5904	0.8803	0.5986
	F1-score	0.5185	0.5169	0.6462	0.5693	0.8924	0.6500
CRF	Precision	0.7829	0.6861	0.7152	0.7220	0.9351	0.7579
	Recall	0.6028	0.4292	0.7882	0.5005	0.8906	0.6712
	F1-score	0.6777	0.5268	0.7483	0.5863	0.9123	0.7097
BiLSTM	Precision	0.6127	0.4763	0.5921	0.6619	0.9372	0.5691
	Recall	0.3855	0.4630	0.5422	0.4973	0.8184	0.4973
	F1-score	0.4731	0.4685	0.5348	0.5679	0.8725	0.5301
BiLSTM + CRF	Precision	0.7782	0.5915	0.6451	0.6681	0.9148	0.7480
	Recall	0.5407	0.5179	0.5085	0.4617	0.8486	0.5524
	F1-score	0.6380	0.5522	0.5679	0.5379	0.8774	0.6321

As shown in Table 4, the metrics of the other five categories were lower compared to the standard. The reason is that the standard categories showed evident characteristics, including keywords like standard, requirement, and document. In addition, the annotation quality and feature acquisition methods in five other categories were considered important factors in enhancing the metrics.

Corpus-based task allocation

A welding project or task often requires multiple departments to work together, so a reasonable task allocation is vital for efficient production. Latent Dirichlet allocation (LDA)³¹ model is often used for topic determination. The basic process is to select a topic with a certain probability and choose words from the topic with a certain probability. However, the identification of invalid terms causes a waste of resources. Therefore, we combined the NER focus on significant entities to avoid retrieving irrelevant words.

In the NER task, we divided the entities into six categories, where design, technology, manufacturing, and quality corresponded to several important departments in actual production. However, the constraints of production conditions and product requirements make the assignment of tasks a matter of thematic correspondence. They are considered the unique performance requirements of the workpiece, and the tasks that initially belong to the process need to be assigned to the design department for completion. In addition, production equipment and conditions also influence the distribution of welding tasks. To address the issue of task assignment in actual fabrication, we collected a dictionary of terms that is important in the description of tasks in different sectors from the corpus. Task topics can be defined based on dictionaries and thresholds. For example, the keywords “design,”“structure,”“performance,” and “size” are important task characteristics for the design department, and we can assign task texts to design departments when the number of these words exceeds a certain empirical threshold. The task assignment process is shown in Figure 7.

Figure 7.

Corpus-based task allocation process.

Corpus data are divided into entity data and dictionary data, which can provide data support for task allocation. For unstructured texts, important entity information is selected through the corpus and used as the definition of task categories. Some task descriptions in actual production are exemplified, and the corresponding data used to support classification are also listed in Table 5.

Table 5.

Task assignment data based on the corpus.

Task texts	Corpus data		Departments
	Entities	Dictionaries
Laser undercutting by the manufacturing department for plate thickness below 18 mm.	Laser undercutting, plate thickness below 18 mm	Manufacturing	Manufacture
The welding performance level of weldments shall be defined according to the safety level and stress level at the design stage.	Welding performance level, safety level, stress level	Performance design stage	Design
The welding current shall be selected according to the base metal plate thickness, welding position, welding speed, material, and other parameters.	Welding current, base metal plate thickness, welding position, welding speed, material	/	Technology
……	……	……	……
Bottom cover and lower cover welding seams require magnetic particle detection and ultrasonic detection	Bottom cover, lower cover, seams, magnetic particle detection, ultrasonic detection	Detection	Quality

Corpus-based knowledge graph

There has always been a knowledge discontinuity in actual fabrication because of staff turnover. Application changes can also limit the utilization of knowledge by domain experts. In addition, small and medium-sized enterprises have weak knowledge systems due to the lack of resources. Thus, it is crucial to achieving fast queries and access to knowledge systematically. The rapid retrieval and application of knowledge based on knowledge graphs³² have gradually drawn attention. The construction of knowledge graphs based on corpus positively affects realizing the rapid acquisition of knowledge of welding guidance documents. Critical entities in unstructured texts are defined based on corpus data and form relational triples to support knowledge graph construction. Some specific data are shown in Table 6.

Table 6.

The extracted data based on the corpus.

Unstructured texts	Corpus entities	Relationships
Welding of rail vehicles and their components should comply with the recommendations of the series standard EN1011 during welding.	Welding of rail vehicles and their components, EN1011	<Welding of rail vehicles and their components, reference, EN1011>
Stud welding can be used for steel welding.	Stud welding, steel welding	<Stud welding, applicable to steel welding>
The same parts of the same model shall be interchangeable.	Same parts, interchangeable	<Same parts, requirement, interchangeable>
The qualification level depends on the highest welding performance level of the component.	Qualification level, the highest welding performance level	<Qualification level based on the highest welding performance level>
……	……	……
Width of inspection area including the weld.	Width of inspection area, the weld	< Welding belongs to the width of inspection area>

In the corpus research, we manually annotated the entity information and achieved automatic entity recognition. The research considered relational extraction (RE) methods to verify the corpus’s scalability and build a knowledge acquisition system. We summarized six relationships (“belong to,”“reference,”“requirement,”“based on,”“applicable to” and “unknown”) based on corpus data. Welding process document data containing the entire life cycle of welding were selected in this research. The corpus identified significant entities, and entity verification and relationship annotation were done manually. In addition, entities and relationships were characterized by “<entity, relationship, entity>” and are embedded in the knowledge graph, as shown in Figure 8.

Figure 8.

Corpus-based knowledge graph.

Consequently, as shown in Figure 8, we searched for information about check grades, then the basics and associated knowledge were quickly displayed in the form of a knowledge relationship diagram. Therefore, we obtained several meaningful details such as: (i) check grade containing CT 1, CT 2, CT 3, and CT 4; (ii) check grade was determined according to the quality level of the weld, and the quality control department completes; and (iii) check grade-related tasks.

Conclusion

The main objective of the proposed work is to extract data from significant unstructured documents and build a corpus to support engineering applications. The major conclusion remarks are summarized as follows: (i) Unstructured welding instruction information is automatically extracted to form the corpus, which can effectively compensate for the limitations of the structured data. (ii) To automatically acquire entity data, we trained four classical models for the named entity recognition task and published baselines with Precision, Recall, and F1-score metrics. (iii) We provide two use cases for task assignment and knowledge acquisition based on the corpus for actual fabrication.

This essential work can provide data support for engineering and manufacturing, especially for data-driven production operations. The main limitations of this study are its small sample from small-scale practice and the inherent limitations of qualitative research. Therefore, more research is needed to expand the data volume, refine the classification granularity, and design a more reasonable corpus structure. We predict that the research and data will support and contribute to welding knowledge acquisition, expert decision-making, and production pattern upgrade. Thus, improving and enriching the corpus, optimizing the entity recognition methods, and expanding the practical application scenarios of the corpus as new concerns.

Footnotes

Author statement

Kainan Guan: Conceptualization, Methodology, Software, Data Curation. Zhengguang Li: Methodology, Supervision. Li Zou: Methodology, Formal analysis. Yu Zhang: Software, Validation. Xinhua Yang: Writing-Reviewing and Editing, Conceptualization, Funding acquisition.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Grant numbers: 51875072 and 52005071) and the Foundation for Overseas Talents Training Project in Liaoning Colleges and Universities (Grant number: 2018LNGXGJWPY-YB012).

Availability of data and material

The datasets used or analyzed during the current study are available from the corresponding author upon reasonable request.

Code availability

The codes involved in the paper are available upon reasonable request to the corresponding author.

Ethics approval

Not applicable.

Consent to participate

All authors listed have reviewed and approved the attached manuscript.

Consent for publication

The manuscript is approved by all authors for publication.

ORCID iD

Kainan Guan

References

Perzyk

Biernacki

Kozlowski

Data mining in manufacturing: significance analysis of process parameters. Proc IMechE, Part B: J Engineering Manufacture 2008; 222(11): 1503–1516.

Angeli

Chatzinikolaou

Prediction and diagnosis of faults in hydraulic systems. Proc IMechE, Part B: J Engineering Manufacture 2002; 216(2): 293–297.

Rao

RV.

Evaluation of environmentally conscious manufacturing programs using multiple attribute decision-making methods. Proc IMechE, Part B: J Engineering Manufacture 2008; 222(3): 441–451.

Kumar

Starly

“FabNER”: information extraction from manufacturing process science domain literature using named entity recognition. J Intell Manuf 2022; 33(8): 2393–2407.

Lample

Ballesteros

Subramanian

, et al. Neural architectures for named entity recognition. 2016. ArXiv preprint arXiv:1603.01360.

Chung

SF.

Uses of ter- in Malay: A corpus-based study. J Pragmat 2011; 43(3): 799–813.

Bader

Häussler

Word order in German: a corpus study. Lingua 2010; 120(3): 717–762.

Lopes

Teixeira

Gonçalo Oliveira

Comparing different methods for named entity recognition in portuguese neurology text. J Med Syst 2020; 44(4): 77.

Islamaj

Wilbur

Xie

, et al. PubMed text similarity model and its application to curation efforts in the conserved domain database. Database 2019; 2019: baz064.

10.

Suh

Hansen

JH.

Acoustic hole filling for sparse enrollment data using a cohort universal corpus for speaker recognition. J Acoust Soc Am 2012; 131(2): 1515–1528.

11.

Mosavi Miangah

. FarsiSpell: a spell-checking system for Persian using a large monolingual corpus. Lit Linguist Comput 2014; 29(1): 56–73.

12.

Carpuat

. Word sense alignment using bilingual corpora. MPhil Thesis, Hong Kong University of Science and Technology, China, 2002

13.

Berkson

de Jong

Lulich

, et al. Building a multilingual ultrasound corpus. J Acoust Soc Am 2018; 144(3): 1717–1717.

14.

Fierrez

Ortega-Garcia

Torre Toledano

, et al. BioSec baseline corpus: a multimodal biometric database. Pattern Recognit 2007; 40(4): 1389–1392.

15.

Cho

Park

, et al. Combinatorial feature embedding based on CNN and LSTM for biomedical named entity recognition. J Biomed Inform 2020; 103: 103381.

16.

Derbel

Chaibi

Ben Ghezala

HH.

Disease named entity recognition using long-short dependencies. J Bioinform Comput Biol 2020; 18(3): 2050015.

17.

Giorgi

Bader

GD.

Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 2018; 34(23): 4087–4094.

18.

Nadia

Omar

Malay named entity recognition using rule based approach. Asia-Pacific Journal of Information Technology & Multimedia 2019; 08: 37–47.

19.

Todorovic

Rancic

Markovic

, et al. Named entity recognition and classification using context Hidden Markov Model. In: NEUREL 2008: Ninth symposium on neural network applications in electrical engineering, proceedings, 2008. New York: IEEE.

20.

Batbaatar

Ryu

KH.

Ontology-based healthcare named entity recognition from Twitter messages using a recurrent neural network approach. Int J Environ Res Public Health 2019; 16(19): 3628.

21.

Patrick

Wang

. Biomedical named entity recognition system, In: Proceedings of the Tenth Australasian Document Computing Symposium, Sydney, Australia, 2005, pp. 64–71.

22.

Dalen-Oskam

Does

Marx

, et al. Named entity recognition and resolution for literary studies. Comput Linguist Netherlands J 2014; 4: 121–136.

23.

Piad-Morffis

Gutiérrez

Muñoz

A corpus to support ehealth knowledge discovery technologies. J Biomed Inform 2019; 94: 103172.

24.

Doğan

Leaman

NCBI disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform 2014; 47: 1–10.

25.

London

Building a representative corpus of classical music. Music Percept 2013; 31(1): 68–90.

26.

Rivas

Perkins

Rivas

, et al. Development and validation of the Spanish azbio sentence corpus. Otol Neurotol 2021; 42(1): 154–158.

27.

Zhu

Liu

, et al. Chinese word segmentation research based on conditional random field. Comput Eng Appl 2016; 34–47.

28.

Lafferty

McCallum

Pereira

FCN.

Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning, 2001. Morgan Kaufmann Publishers Inc.

29.

Huang

Zhao

Chinese word segmentation: a decade review. J Chinese Inf Process 2007; 21(3): 8–20.

30.

Hochreiter

Schmidhuber

Long short-term memory. Neural Comput 1997; 9(8): 1735–1780.

31.

Blei

Jordan

MI.

Latent Dirichlet allocation. J Mach Learn Res 2003; 3: 993–1022.

32.

Chen

Luo

An automatic literature knowledge graph and reasoning network modeling framework based on ontology and natural language processing. Adv Eng Inform 2019; 42: 100959.

Information extraction and application for constructing guidance corpus of welding fabrication