
Abstract

Background

Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline for accomplishing a structured format transformation strategy.

Objectives

In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them further to a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how with the help of software programming techniques the readability of rule-based (RB) information extraction (IE) pipelines can be improved, and therefore minimize the time to correct or update the rules, and efficiently import them to another programming language.

Methods

The extraction rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. First, we implemented each rule directly in one line for each extraction item. Second, we reprogrammed them in a readable fashion through decomposition and intention-revealing names for the variable declarations. To assess the impact of both methods, we measured the time for fine-tuning and programming the extractions on test data of semi-structured and unstructured texts.

Results

We analyze the benefits of improving the readability of written rules through parallel programming with regular expressions (REGEX) and the Apache UIMA Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated REGEX rules that were decomposed and given intention-revealing declarations were reprogrammed in AURL in 5 min.

Conclusion

We discuss the importance of readability as a factor and how it can be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can prove beneficial for future maintenance and offer an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.

Keywords

Natural language processing, clinical information systems, rule-based information extraction, extract-transform-load, electronic health record

Introduction

The HiGHmed consortium combines and integrates the competencies of nine university hospitals. Currently, over 20 partners participate in the project, including medical faculties, research institutions, and industry specialists.

HiGHmed’s fundamental purpose is to develop an open platform for the infrastructural and innovative exchange of information, thus allowing a nationwide health information exchange interoperability that enables the development of sophisticated new solutions in precision medicine, medical data analytics, health sharing, and medical education.^1,2

For each location, various diversified information systems (IS) exist that store their data in structured, unstructured, and semi-structured data sources, and extract, transform, load (ETL) processes are required for the data integration task. HiGHmed's goal is to simplify and support the ETL and data mapping processes through sophisticated interoperable information exchange approaches.

In the transformation task of the ETL processes, extracting particular terms and their respective values from Hospital Information Systems (HIS) that contain information from unstructured or semi-structured sources requires the implementation of an NLP method.

Locke et al., define NLP in medicine as: “a form of ML which enables the processing and analysis of free texts. When used with medical notes, it can aid in the prediction of patient outcomes, augment hospital triage systems, and generate diagnostic models that detect early-stage chronic disease.”³

In this paper, we apply the RB method of NLP to computationally extract clinical concepts from documents through the implementation of hard-coded rules,⁴ and thereby study manual coding with REGEX and AURL. The main aim is to encourage developers to consider coding techniques that improve readability, and not to evaluate their extraction pipelines only by precision and recall.

TNM staging system

The TNM staging system is an internationally recognized standard that, through the definition of three primary codes along with additional modifiers, describes the amount and spread of cancer in a patient’s body. The T defines the size of the tumor or the spread of cancer into nearby tissue. The N describes the spread of cancer to nearby lymph nodes. The M stands for metastasis and describes the spread of cancer to other parts of the body.⁴

The objectives of the TNM code for the tumor classification are:⁵

■ Aid treatment planning.

■ Indicate a prognosis.

■ Assist in the evaluation of treatment results.

■ Facilitate the exchange of information between treatment centers.

■ Contribute to continuing investigations of human malignancies.

■ Support cancer control activities, including through cancer registries.

Related work

Methods of NLP provide the means to extract various information from clinical texts and for information exchange in general while their efficiency is proven in the scientific community when parsing unstructured texts.^6–9

Current state-of-the-art NLP techniques for text extraction are either RB, machine learning-based (MLB), or deep learning (DL) methods. Even though many attempts are made to create sophisticated hybrid methods that combine RB with MLB or DL, with promising results, still 60% of the studies and solutions refer to RB solutions in which the rules are manually written.¹⁰

In the study from Michael et al.,¹¹ in which they focused their research on a more practical approach regarding personal opinions and difficulties that developers face when using REGEX, the following critical points were mentioned:

■ The crypticness of the syntax.

■ Difficulty to validate the syntax during the design time.

■ A little cheat sheet is required that outlines what each symbol does.

■ Hard to validate and document.

The concepts of readability and maintainability in RB IE pipelines are considered important for the life cycle of the rules.¹²

Heitlager et al.¹³ relate the amount of effort to maintain any software solution with the coding quality of the source code.

Some examples of tools that support the programming of RB extraction pipelines are:

■ Unstructured Information Management Architecture (UIMA)

■ Apache OpenNLP library

■ spaCy library

■ CoreNLP

UIMA is an architecture platform designed for Java applications that lets developers define custom annotators with built-in custom expressions and embed them, through the included analysis engine, in external Java projects for IE pipelines.¹⁴

Apache OpenNLP is a Java-based library for ML tasks. The library supports the definition of manually annotated terms and the training of data for building named entity recognition (NER) classification models. Li et al. define NER as “the task to identify mentions of rigid designators from text belonging to predefined semantic types such as person, location, etc.”¹⁵ In addition, extraction rules that were programmed in UIMA can be embedded for RB feature extraction and IE.¹⁶

The spaCy library for the Python programming language is built for general NLP extraction tasks, such as the pre-processing of large text volumes and feature extraction. In addition, the built-in token-based matching with REGEX through spaCy’s RB matcher engine allows the definition of specific string-matching commands.¹⁷

The CoreNLP library is a Java-based collection of various NLP processing tools that use RB, ML, and DL components. The UIMA analysis engine can also be embedded for RB feature extraction.¹⁸

Objectives

Using MLB methods requires the continuous expertise of ML specialists for maintaining and retraining the domain adaptation. In addition, the lack of sufficient training data, especially in medical domains and particularly in German, calls for non-MLB approaches.

In our project, we decided to design a framework independent of ML specialists and to apply manual RB extraction approaches for extracting TNM codes in text-based unstructured pathological reports that are manually entered by medical practitioners. Our main research is focused on the two following questions:

■ What is the time to correct wrong rules or include any exceptions since a TNM code can be written in various ways?

■ If we decide to change the programming language, can we implement the rules efficiently?

Research tools

Manual writing of extraction rules with REGEX has been the dominant approach for text extraction of unstructured and semi-structured texts.¹⁹ However, the AURL is an all-in-one solution that includes a workbench for the writing and unit-testing ²⁰ of the extraction scripts and closes the gap between code readability and maintainability.²¹

Regular expressions

Briefly, REGEX is a formal language that contains various lookup expressions for string pattern matching. REGEX is applied in various NLP tasks like morphology, text analytics, speech recognition, and IE. REGEX is included in many software tools and programming languages.²²

In the preparation part, a developer usually reads the specification documents and identifies and annotates the rules needed. In the composition part, the rules are written and tested, for example through an external online tool such as regex101.com²³ or through programmed unit tests, in which each extraction rule has to be tested separately.

In the implementation part, the REGEX is then encapsulated in a programming language that supports REGEX. Scripts that do not produce the required extraction results re-enter the composition part for correction until the extraction is satisfactory.

For our research with REGEX, we used the Python programming language and the spaCy library for programming the extraction pipeline.

Apache UIMA ruta language

The AURL is an imperative RB language that extends the UIMA framework for the mapping of expressions and annotations and enables the rapid creation, editing, and debugging of extraction scripts while REGEX can be also included.

The included Eclipse plugin, the UIMA Ruta Workbench, validates the script commands at design time, enables the import of files for testing, and visualizes the extraction results of multiple rules in the same environment, which significantly helps reduce the time for writing and validating the rules.

Immediate visualization helps to solve the problem of the meaningfulness of the annotations: through visual cues, users can understand the function of the coding components, improve and optimize existing rules, and efficiently add new rules for unseen documents on which the current extraction rules fail.^21,24

In AURL, the composition and implementation parts are processed together since the included Java Eclipse workbench allows the parallel editing, correction, and visualization of the results from multiple rules.

Methods

This section outlines script coding factors for writing readable and maintainable manual RB pipelines. The writing of readable code offers a level of abstraction and provides an understanding of the intended purpose of the extraction script. For our study, we measure how readable and non-readable rules have an impact on importing into another language and especially on maintaining RB IE pipelines.

We program two separate pipelines, one in REGEX with spaCy and one in AURL. Each rule is programmed once in an out-of-the-box style and once in a readable style. We measure the total time required to update faulty rules in both pipelines and in each style. The time was recorded with a handheld start-stop timer and logged in an Excel table.

Computation performance, portability, and accuracy extraction are outside the scope of this paper.^21,25

Code readability

Code readability as a software quality metric has an immediate effect that can positively or negatively influence maintainability and reusability.²⁶

Reusability requires software design and testing principles ²⁷ and is also an achievement concept that can be accomplished through readable code.²⁶

Code readability cannot be easily quantified and measured by a deterministic function; nonetheless, maintainability can be measured by summing, either separately or together, the durations for editing, modifying, testing, and validating the results of the extraction rules.²⁸

When writing RB extraction scripts, we relate readability with the understanding of the extraction logic by reading without the presence of a knowledge engineer or developer. As script extraction logic, we refer to the steps for parsing and allocating portions of a search string from a given sequence of symbols.

Table 1 shows an example of a simple script both in REGEX and AURL for the extraction of the primary tumor T entry string from the classification for tumors from a semi-structured pathological medical report.

Table 1.

Simple REGEX and AURL script for an easy extraction rule.

Search text	REGEX script	AURL script
13. pT1b, pNx, L0, V0, Pn0, R0; G2	T\d	DECLARE tSite; ("T" NUM){-> tSite};
Result	T1	T1

The extraction logic of both scripts in Table 1 is readable: in the search text “13. pT1b, pNx, L0, V0, Pn0, R0; G2”, allocate the portions that satisfy the condition of a string that starts with an uppercase “T” character followed by a number.
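As a minimal sketch, the REGEX column of Table 1 can be tried out with Python's standard `re` module. The function name `extract_t` is hypothetical and only serves the illustration; the pattern and the search text are taken from Table 1.

```python
import re

# Search text and pattern as given in Table 1.
SEARCH_TEXT = "13. pT1b, pNx, L0, V0, Pn0, R0; G2"
T_PATTERN = re.compile(r"T\d")  # an uppercase "T" followed by one digit


def extract_t(text):
    """Return the first 'T<digit>' substring, or None if there is none."""
    match = T_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_t(SEARCH_TEXT))  # T1
```

The pattern stops after the first digit, so the suffix "b" of "pT1b" is not part of the result, which agrees with the result row of Table 1.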

Standard rules of thumb for writing readable programming code are decomposition and redundancy²⁹ and declaring rules with intention-revealing names.²⁸

In the context of writing extraction scripts, decomposition is the ability to split complicated rules into more manageable and readable sub-rules. Decomposition relates also to redundancy and helps avoid the repetition of rules by removing duplicate ones. Redundant rules can be reusable and through proper definition, they can be shared and included between various extraction pipelines.

Also according to the survey of Tashtoush et al.,³⁰ intention-revealing names along with the spacing of programming commands have an important impact factor on code readability.

Code maintainability

Akour et al.³¹ comment that “developers spend most of their time trying to read and understand the addressed code during the maintenance phase.”

Maintainability is not a design pattern but a software achievement concept and belongs to one of the most important software quality factors in general.³²

Especially for RB systems, in which many rules are to be written, tested, and corrected, if the concept of readability is not considered as a guideline, it can negatively influence the correction or the addition of new rules in unseen data.

For RB coding, maintainability is affected also by the available tools and methods in which a rule programmer can easily edit, control, evaluate, and update extraction scripts.

An RB script can only be maintainable if it is readable through coding design principles and the application of methods/tools that allow the editing, evaluation of the test rules, and visualization of the results for correction.

Research dataset

For building the rules, a data specification document was composed through team meetings between one data curator and one domain expert containing selected examples of texts for coding the rules that extract values from the TNM classification. The domain expert was a medical practitioner with experience in tumor diagnosis. In Table 2, some examples from the specification document can be seen.

Table 2.

10 examples selected in the specifications document between the data curator and domain expert for building the rules.

Example input
UICC-Klassifikation (8. Auflage, 2017) 13. pT2 (2.1 cm), pNx, pMx, L0, V1, Pn0, R1; G3
UICC-Klassifikation (7. Auflage, 2010) 13. pT1b, pNx, L0, V0, Pn0, Rx, G2
UICC-Klassifikation (8. Auflage, 2017) 16. pT2, pN1 (2/22), G2, L1, V0, Pn1, R0 (lokal)
UICC-Klassifikation (7. Auflage, 2010) 13. pT1b, pNx, L0, V0, Pn0, R0; G2
UICC-Klassifikation (7. Auflage, 2010) 13. pT1b, pNx, L0, V0, Pn0, R0; G2
UICC-Klassifikation (8. Auflage, 2017): 14. pT2, pN1 (3/13, max. 1.8 cm), L1, V0, Pn1, R1 (dorsal), CRM positiv
UICC-Klassifikation (8. Auflage, 2017): 20. pT3, pN2 (4/21), pMx - G2, L1, V1, Pn1, R1
UICC-Klassifikation (8. Auflage, 2017): 20. pT3, pN0 (0/15), L1, V0, Pn1, Rx
UICC-Klassifikation (8. Auflage, 2017): 19. pT3 (5 cm), pN1 (2/12), Pn1, L0, V1, G2, R1 (Gallengangs- und Pankreasabsetzungsrand)
UICC-Klassifikation (8. Auflage, 2017): 20. pT3, pN0 (0/32), G2, L0, V1, Pn1, R1

A total of 113 semi-structured cases containing TNM strings from the Oncology department of XXX1 were available for optimizing the rules in the German language.

In Table 3, the terms of the TNM parameter are shown, with their possible values of the specification document and a tag id for assigning the annotation term.

Table 3.

Example of the table of tags for the manual assignment of the documents and for writing the export rules. The third column lists the various extraction strings for each term. The underlined values represent the required string to be extracted.

Term	Tag id	Ex. Possible values
Number of examined lymph nodes	Lymphnode.S	(4/21)
Number of tumor-affected lymph nodes	Lymphnode.C	(4/21)
Morphology code ICD-O	Morphology.C	8140/3
Topography code ICD-O	Morphology.T	C25.9
Primary tumor	TNM.T	pT3,pT1b,rpT2
Prefix primary tumor	TNM.pre	rpT2
Regional lymph nodes	TNM.N	pN1
Distant metastasis	TNM.M	pM1
Histopathological grading	TNM.G	G1,G2,G3,G4
Residual tumor	TNM.R	R1,RX,R0,R2
Lymphatic invasion	TNM.L	L1,LX,L0
Vein invasion	TNM.V	V1,VX,V0,V2
Perineural invasion	TNM.Pn	Pn1,PnX,Pn0
Multiple primary tumors	TNM.m	pT2m, pT2m ( n = 2), pT2(m) or pT1 (m, n = 4)
Multimodal therapy	TNM.y	Yp
Relapse	TNM.r	Rp
TNM edition	TNM.Version	7,8
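The two lymph node tags in Table 3 share one source string such as "(4/21)". The following sketch shows how such a string could be split into both values. It assumes that the first number is the count of tumor-affected nodes and the second the count of examined nodes; this order is an assumption and is not stated in the table. The function name is hypothetical.

```python
import re

# Assumption: "(a/b)" gives tumor-affected nodes first, examined nodes second.
LYMPH_NODE_COUNTS = re.compile(r"\((\d+)/(\d+)")


def extract_lymph_node_counts(text):
    """Map the first '(a/b' occurrence to the two tag ids of Table 3."""
    match = LYMPH_NODE_COUNTS.search(text)
    if match is None:
        return None
    affected, examined = (int(group) for group in match.groups())
    return {"Lymphnode.C": affected, "Lymphnode.S": examined}


print(extract_lymph_node_counts("14. pT2, pN1 (3/13, max. 1.8 cm), L1"))
```

Parenthesized sizes such as "(2.1 cm)" are skipped because the pattern requires a slash directly after the first number.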

After the rules for both REGEX and AURL were written and corrected, 146 random unstructured cases were chosen that contain TNM strings. In the semi-structured texts, the TNM codes were written in numbered paragraphs while in the unstructured ones the codes were entered more sparsely. An example of the texts with the annotations to be extracted marked can be seen in Table 4.

Table 4.

Table containing example texts in the German language from semi-structured and unstructured documents and for writing the export rules for the TNM.

Semi-structured examples dataset	Unstructured examples dataset
13. pT1b, pNx, L0, V0, Pn0, R0; G2	pT2 (siehe Kommentar), pN1 (4/5), L0, V1, Pn1, R1 (Weichgewebsresektionsrand im Hilusbereich); G2
13. pT1b, pNx, L0, V0, Pn0, G2, R0	bislang pT1a, pNx, pMx, L0, Vx, Pn0, G3, R0 (lokoregionär marginal, <0.1 cm)
2. Grading (1/2/3) G2	mit Infiltration durch ein multifokales (bifokales) hepatozelluläres Karzinom mäßiger Differenzierung (HCC, G2)
2. Grading (1/2/3) G2	durchmessenden hepatozellulären Karzinom guter Differenzierung (G1), lokoregionär in toto reseziert (R0)
18. N-Status (Zahl gesamt n/n) Pankreas: pN1 (1/9)	Ein tumorfreier Lymphknoten (0/1)

Through the unstructured examples, we were able to exemplarily study the difference in factors of readability and maintainability when writing the extraction scripts.

Results

Two NLP developers implemented the pipelines in AURL and REGEX. The experience level of the AURL developer was intermediate, while the level of the REGEX developer was advanced.

In the programming phase of the rules, the visualization of the results in the AURL workbench proved valuable, as it helped us to immediately identify missed or wrongly annotated parameters. The source code for the 20 extraction rules comprised 73 lines of commands in total.

The multi-extraction of the TNM parameters including the test/maintenance in AURL was programmed in 8 h.

With REGEX and spaCy, programming and testing the scripts took 16 h. The source code for the 20 rules comprised 205 lines in total.

Table 5 reports the time for correcting rules that were written with out-of-the-box programming compared to readable rules. The time for correcting faulty rules was significantly reduced.

Table 5.

Comparison of correcting out-of-the-box programmed rules with readable ones.

Language	Number of rules	Out-of-the-box programming of each rule	Readable programming of each rule
REGEX with spaCy	20	5–10 min	2–5 min
AURL	20	3–5 min	Max. 2 min

The following examples were chosen from the development phase of the NLP pipeline to discuss the efficiency of both methods from a developer and non-developer perspective and to serve as a paradigm for a better practical understanding of the factors readability and maintainability. A REGEX rule, shown in coding example 1, was directly implemented in AURL within 10 min. After decomposing the rule and declaring intention-revealing names, the rule was reprogrammed in AURL in 5 min.

Examples in readability

The next code in example 1 demonstrates the extraction of the primary tumor string from an unstructured pathological medical report through a REGEX script.

This example demonstrates the density, non-intuitive syntax, and crypticness of REGEX. Developers without REGEX experience need a cheat sheet to understand the intent, in addition to the extra step of entering the text and testing it with a REGEX parser. The expression is also non-redundant. Decomposition and redundancy are only possible in the host programming language, as in the pseudo-code of coding example 2.

Coding example 3 shows an AURL script that extracts the direct extent of the primary tumor from a TNM classification.

Furthermore, the script can be decomposed, like in code example 4.

The previous extraction rule is now decomposed into two rules: the prefix of the primary tumor that is declared as a variable and is now redundant and the primary tumor. Because of the decomposition and redundancy, the long rule is now in smaller parts and the extraction logic can be better read and understood.

The extraction logic is as follows: allocate portions of text that contain the string of prefixes with a “p” or “y” character that starts after a comma character or not, if the condition is fulfilled and a “T” character follows with a number then extract.

The decomposition of the extraction pattern, through the declaration of rules, actions, and conditions, in AURL, offers a more intuitive declaration that helps to eliminate the gap of syntax crypticness and understand the intent of the rule.

Long rules are decomposed into shorter ones, and declaring intention-revealing names can simplify the process of reading and understanding the extraction logic. In reusability, repeated rules or complicated ones that contain many symbols can be included in one file as a library or class object.

Examples in maintainability

AURL offers a workbench in which multiple rules can be written and tested in one batch directly on test texts. The results of the rules are visualized, each rule can be selected, and the text it annotates is marked visually.

In three cases of the unstructured data, the histopathological grading was entered as “G2-3”; this form of the parameter was not specified in the data modeling document. The batch visualization of the results proved helpful for identifying the missed annotation and for easily defining and changing the rule.

Because the tokenization of TNM strings is ambiguous, adapting the spaCy and REGEX pipeline to the multiple variations of the same extraction rule that were found during testing proved suboptimal.

In five cases of unstructured data, the value of the primary tumor was entered with a space in between.

Maintaining the previous extraction rule required writing an extra pattern for spaCy’s rule-matching parser that handles a space following the “T”. Coding example 5 shows this issue, in which an extra command was added.

The first matching command searches for the string that contains a space in between, while the second is for numbers directly following the T character.

When programming with REGEX, each rule was tested separately for each document, and the applied rules could not be visualized in a batch. This factor can have a significant negative influence on maintainability in general.

Discussion and Conclusion

The structured approach of composing and decomposing the rules together with the visualization of the extraction results can help to provide a clear intent and understanding of the written syntax. For the programming of new rules or editing, the time factor can be reduced.

Readable composing of the commands allows the exploitation of domain knowledge on a programming and readable level that helps developers immediately identify portions of the commands and the intention of the extraction code while offering a level of interpretability.

In this research study, while seeking a reliable method for programming a readable and maintainable RB IE pipeline, the application of the AURL closed the main gap of programming with REGEX, which is the general lack of abstraction.³³

However, since REGEX is used in various programming languages for IE pipelines, a readable code strategy can help to achieve a level of abstraction and also make it maintainable.

The combination of readable and structured writing of extraction strings parallel to an immediate visualization of multiple rule results can accomplish a maintainable and portable pipeline and should be considered independent of the development platform in every IE framework.

When developing rules, a challenging aspect is the various ways to tokenize an input expression due to the morphological variability in real-world clinical text. For instance, the expression “cT2N1” is equivalent to “cT2 N1,” but the first expression is usually not tokenized at all by general-domain tokenizers. Therefore, exceptions need to be added manually as tokenization rules and different variants of the same matcher have to be considered.
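The tokenization problem can be illustrated with a small Python sketch based on the standard `re` module. The token pattern and the function name `split_tnm` are hypothetical and are not taken from our pipelines; they only show that a dedicated rule yields the same tokens for both spellings.

```python
import re

# Hypothetical token pattern: optional lowercase prefix, a T/N/M letter,
# a digit or X, and an optional lowercase suffix such as the "b" in "pT1b".
TNM_TOKEN = re.compile(r"[a-z]*[TNM][0-9Xx][a-z]?")


def split_tnm(expression):
    """Split a TNM expression into its codes, with or without spaces."""
    return TNM_TOKEN.findall(expression)


print(split_tnm("cT2N1"))   # ['cT2', 'N1']
print(split_tnm("cT2 N1"))  # ['cT2', 'N1']
```

Keeping the token pattern in one named constant means that a newly observed variant only requires a change in that single place.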

Such exceptions, when decomposed into smaller tokenization rules, can therefore be changed easily without having to debug and update a more complex rule. This programming approach can also be applied to other RB IE applications in which items from other domains need to be extracted. Where medical departments allow practitioners to write freely without observing the notation standards of a coding system (for example, they can enter spaces, commas, or other characters), it would be advantageous to design such rules from the beginning so that they can be changed easily.

In the case of reconfiguring IE pipelines after some time to add new rules on unseen data or error correction, the constructive composing of rules through the principles of readability can provide the means to easily reprogram the pipeline and save development time for production environments, and reduce the overall cost.

Coding Examples

Coding example 1: REGEX script example for extracting the primary tumor site from a German unstructured text.

Extraction string:

I. Ein hyperplastischer Lymphknoten

mit einer 0,7 cm durchmessenden Metastase (T1) eines gering differenzierten Adenokarzinoms UICC-Klassifikation (8. Auflage 2017): \.br\ ,pT2, pN1 (1/2), pMx, L1, V1, Pn1, G3, R0 (klinische Residualtumorfreiheit vorausgesetzt)\.br\ \.br\

REGEX rule script:

(?<=;|,)(p|y)?T[1-3]|(p|y)?T[1-3](?=,)

Result: T2

Coding example 2: Pseudo-code of decomposing a REGEX in a programming language.

Var tNode = "T[1-3]";

Var prefixBeforeSpecial = "(?<=;|,)";

Var prefix = "(p|y)?";

Var regexTNode = prefixBeforeSpecial + prefix + tNode + "|" + prefix + tNode;
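The decomposition of coding example 2 could look roughly as follows in Python. This is a sketch using the standard `re` module and is not the code of our pipeline: the constant and function names are hypothetical, the lookbehind is written as `[;,]` (equivalent to `;|,`), the trailing lookahead of coding example 1 is included, and a capture group is added so that only the "T" part is returned.

```python
import re

# Named fragments of the rule in coding example 1 (names are hypothetical).
T_NODE = r"T[1-3]"
PREFIX = r"(?:p|y)?"
AFTER_SEPARATOR = r"(?<=[;,])"  # equivalent to (?<=;|,)
BEFORE_COMMA = r"(?=,)"

# Either a separator precedes the code, or a comma follows it.
PRIMARY_TUMOR = re.compile(
    f"{AFTER_SEPARATOR}{PREFIX}({T_NODE})|{PREFIX}({T_NODE}){BEFORE_COMMA}"
)


def extract_primary_tumor(text):
    """Return the primary tumor code without its prefix, or None."""
    match = PRIMARY_TUMOR.search(text)
    if match is None:
        return None
    return match.group(1) or match.group(2)


print(extract_primary_tumor("Metastase (T1) eines Adenokarzinoms: ,pT2, pN1 (1/2), pMx"))  # T2
```

The mention "(T1)" in the running text is skipped because it is neither preceded by a separator nor followed by a comma, so the result is the classification value "T2".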

Coding Example 3: Example of the size or direct extent of the primary tumor in a TNM classification from a German unstructured text with an AURL script.

Extraction string:

I. Ein hyperplastischer Lymphknoten mit einer 0,7 cm durchmessenden Metastase (T1) eines gering differenzierten Adenokarzinoms UICC-Klassifikation (8. Auflage 2017): \.br\ ,pT2, pN1 (1/2), pMx, L1, V1, Pn1, G3, R0 (klinische Residualtumorfreiheit vorausgesetzt)\.br\ \.br\

AURL rule script:

DECLARE TNode; ((SPECIAL (“p” | “y”)) | (“p” | “y”)) (“T” NUM){->TNode}

T“ NUM,(“T“ NUM){->TNode}.

Result: T2

Coding Example 4: Decomposing the rule in AURL script for the size or direct extent of the primary tumor in TNM classification.

AURL rule script:

DECLARE TNode;

DECLARE Tprefix;

("p" | "y"){-> Tprefix};

((SPECIAL Tprefix) | Tprefix) ("T" NUM){-> TNode};

Result: T2

Coding Example 5: Example code of the addition of an extra rule for parsing a variation found on testing the REGEX with spaCy pipeline when extracting the primary tumor string value.

Extraction string 1:

ypT 1

Extraction string 2:

ypT1

REGEX script in spaCy matcher:

self.matcher.add("T",None,[{"TEXT":{"REGEX":'(?<![A-Za-z0-9])'+ r"[yra]{0,3}[upc]?T" + '$'}},

{"TEXT": {"REGEX": r"[0-4] "}}])

self.matcher.add("T",None,[{"TEXT":{"REGEX":'(?<![A-Za-z0-9])'+ r"[yra]{0,3}[upc]?T" + r"[0-4]" + '$'}}])

Result 1: ypT 1

Result 2: ypT1
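For comparison, the same two variants can be covered by a single decomposed rule when plain Python `re` is applied to the raw text instead of spaCy's token-based matcher. This is a swapped-in technique for illustration and not the code of our spaCy pipeline; the fragment and function names are hypothetical, while the prefix and lookbehind fragments are copied from coding example 5.

```python
import re

# Fragments copied from coding example 5; the names are hypothetical.
NOT_AFTER_ALNUM = r"(?<![A-Za-z0-9])"
PREFIX = r"[yra]{0,3}[upc]?"
T_VALUE = r"T\s?[0-4]"  # optional whitespace covers "T 1" and "T1"

PRIMARY_TUMOR = re.compile(NOT_AFTER_ALNUM + PREFIX + T_VALUE)


def find_primary_tumor(text):
    """Return the prefixed primary tumor string, or None."""
    match = PRIMARY_TUMOR.search(text)
    return match.group(0) if match else None


print(find_primary_tumor("ypT 1"))  # ypT 1
print(find_primary_tumor("ypT1"))   # ypT1
```

Because the whitespace handling lives in the `T_VALUE` fragment alone, a further spelling variant would only require changing that fragment.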

Footnotes

Acknowledgements

The project within which this work was done is funded by the German Ministry of Education and Research.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This publication is funded by the Deutsche Forschungsgemeinschaft (DFG) as part of the “Open Access Publikationskosten” program.

Availability

The source codes in both AURL and spaCy with REGEX associated with our research for extracting TNM staging strings in German medical texts are freely available at https://gitlab.plri.de/NektariosLadas/tnm-extraction-with-apache-uima-ruta-language/-/blob/master/uima_ruta/tnmextract_nodes.ruta and .

Approval statement

The complete content and data of this study are approved for publication by Hannover Medical School, Hannover, Germany. Approval code: 8411_BO_K_2019.

ORCID iD

Nektarios Ladas
