Abstract
For many people, the Internet is their primary source of knowledge in today's world. Internet users frequently seek health-related information to better understand a health problem, seek guidance, or diagnose symptoms. Unfortunately, much of this information is inaccurate or unreliable, making it difficult for ordinary users to discern reliable sources of information. Determining the reliability of an online source requires specific knowledge and domain expertise. Researchers in health informatics have studied a number of linguistic and non-linguistic indicators to help ordinary individuals judge the credibility of medical web pages. This study proposes a method that automates the assessment of the reliability of online medical sites based on textual and non-textual characteristics. To evaluate the proposed approach, we developed a real-world dataset of Arabic web pages that provide medical information; it is the first Arabic medical web page dataset for content credibility evaluation. The hybrid approach was assessed on this dataset using multiple machine learning and deep learning algorithms, achieving an accuracy of 79% and an F1-score of 77%. We also identify the most observable patterns that help evaluate or detect unreliable web pages written in Arabic.
Introduction
The Internet has become people's first adviser as a result of the massive growth in the number of Internet users and the increase in web content throughout the world. This information-loaded environment greatly influences people's lives, as many users rely on the web as their primary information source when making decisions. This publicly available web-based information affects many aspects of individuals' lives, including their purchasing behavior, lifestyle, and political choices. The open nature of the web, where anyone can produce and share content with others, means that a great deal of information is distributed without rigorous fact-checking.
The issue of information credibility validation is even more important for medical websites that people might rely on to diagnose specific symptoms or to learn more about prescribed or over-the-counter medication. The Internet also has countless articles about alternative or fringe medicine (i.e. unproven practices claimed to achieve the healing effects of medication). 1 The lack of credibility assessment systems has contributed to the spread of fake news, rumors, and non-credible information in many domains on the web. Despite this tremendous amount of misinformation, many Internet users still rely heavily on web content without further investigating its veracity. For these reasons, there is a crucial need to automatically evaluate the credibility of information emerging on the Internet.
Credibility assessment has been an important topic in many domains such as health and social media. Considerable attention has been paid to medical/health information credibility assessment, which led to the establishment of the HON 2 foundation. HON, a non-governmental organization (NGO) internationally known for its pioneering work in the field of health information ethics, introduced eight principles to assess the credibility of health information on the Internet. Despite the strong interest in assessing the credibility of health and medical web content, only a few studies have investigated the evaluation of health information credibility. Most of these studies have relied on limited features that were not acquired in a fully automated manner, and only a small number of studies3–5 have proposed the automatic evaluation of health content on the web. The entire body of literature on health credibility evaluation focuses on English-language websites, and there have been no studies targeting Arabic-language websites, despite the increased number of Arabic web users and the vast amount of non-credible Arabic health content available on the web.
Several studies show the significance of acquiring medical knowledge from the web. A study found that 59% of Americans use the Internet frequently to search for health-related information, and about 35% of the study sample had made health decisions based on information from the Internet. 6 Other studies show that 72% of US users and 75% of UK users search the Internet for medical advice related to illness, treatments, and medical procedures.7,8 In a more recent study, results indicate that one in every three adults in the United States uses the Internet to diagnose or learn about a health-related problem. 9 Another European-based study found that about 6 of every 10 Europeans had sought health information on the Internet and that about 90% had used the Internet to advance their health knowledge. 10 To evaluate the impact of online health information-seeking practices on Arabic-speaking users, we conducted a survey in 2018 of 263 random participants from Saudi Arabia. Participants ranged in age from 18 to 55 and had varying levels of education (i.e. undergraduate to graduate). The survey results showed that 77.9% of the sample had sought health information on the Internet, and 40% confirmed that they had been unable to distinguish whether the health information they read was credible or not. The survey results also showed that 35.6% of participants had used a medical prescription from the Internet. These various studies confirm that health information credibility is an important and common problem worldwide.
Assessing the credibility of health web pages is a challenging task for individuals with no medical expertise, and the amount of unreliable health information on the Internet is increasing significantly. Although a great deal of Arabic health content is available on the Internet, Arabic health websites have not been classified according to international criteria, and no previous study has investigated the automatic assessment of Arabic health content on the web. Therefore, in this paper, we propose a machine learning/deep learning-based approach to automatically assess the credibility of Arabic health web pages by extracting and analyzing their textual and non-textual features. To achieve this, we first collected a dataset of credible and non-credible real-world Arabic health web pages annotated by multiple medical experts. This dataset was created because no benchmark health web page credibility datasets exist. Then, we implemented several machine learning and deep learning models using different features to automatically assess Arabic medical web page credibility.
Accordingly, the contributions of this research are threefold. First, this research is the first to explore the impact of both non-linguistic and linguistic features on health-related web page credibility prediction. Second, it is the first to investigate the credibility assessment of Arabic web pages. Third, we developed the first dataset of credible and non-credible Arabic web pages. The rest of the paper is organized as follows. The related works section reviews prior work on credibility prediction. The methodology section describes the proposed approach, followed by the results and discussion sections. Finally, the conclusion summarizes the proposed study and outlines future directions in the domain of credibility assessment.
Related works
Most of the literature on credibility assessment has focused on understanding the factors correlated with the credibility of any given information source.11–14 Understanding these factors is essential in building a predictive model that automatically assesses the credibility of a given content, as these factors will be represented as features or patterns that the model will be built upon. Many studies also have utilized different sets of features and feature selection strategies to assess the credibility of information found on the web. This section discusses research studies that have used textual, non-textual, and hybrid-based features (i.e. both textual and non-textual features) to assess information credibility in general. Then, a summary of studies that have analyzed web health content credibility is given.
Textual features
Several credibility detection approaches have focused on analyzing the textual features of online content, showing that the credibility of information can be assessed from the textual features of a given document. Fusilier et al. 15 and Feng et al. 16 proposed deception detection models based on the textual content of online reviews, evaluated their approaches on hotel review datasets using different machine learning algorithms, and achieved high accuracy. Jaworski et al. 17 examined several hypotheses postulating that a web page's credibility depends heavily on its textual content and that the credibility of a text is a function of the statements that compose it. Singh et al. 18 proposed a model that takes several textual features (keywords) as inputs to evaluate the credibility of a given web page, with an overall accuracy of 83%.
Non-textual features
Despite the importance of web page metadata, much of the literature has focused on textual or hybrid (textual and non-textual) features for credibility evaluation, and no study has utilized metadata features alone for information credibility assessment. Kakol et al. 19 studied the impact and significance of using multiple non-textual features for credibility evaluation on a large dataset of 5543 web pages grouped into five categories: medicine, healthy lifestyle, politics and economy, personal finance, and entertainment. Based on a set of factors that correlated highly with credibility, a predictive model for web content credibility was built, providing an accuracy of 70%.
Hybrid features
Unlike previous works that focused only on either textual or non-textual features to assess the credibility of information, several studies proposed models that combine textual features and source reliability features. Popat et al. 20 proposed a novel model that assesses the credibility of textual claims using two types of features: linguistic features and source reliability features. The model was evaluated with different feature configurations: textual only, source reliability only, and both types combined. The best result was obtained when combining textual and source reliability features, which emphasizes the importance of integrating both in credibility prediction.
Another research study 21 was conducted to evaluate the credibility of emerging textual claims on the web and social media. The authors considered different features to assess the credibility of a given web page and used linguistic features including assertive verbs, active verbs, hedges, implicatives, report verbs, discourse markers, subjectivity, and bias. They also used a stance classifier that predicted whether the text supported or refuted a claim. The approach was evaluated on two different datasets, Snopes and Wikipedia. The performance results showed good accuracy, with 73.69% on the Snopes dataset and 80% on the Wikipedia dataset. Despite this high accuracy, the source reliability features used in this research, such as PageRank and Alexa rank, evaluated web pages from the perspectives of popularity and authority rather than credibility. For example, Alexa rank is based on the number of website visitors who have the Alexa toolbar installed, which makes these source reliability features weak evidence of a web page's credibility. In the same manner, Olteanu et al. 10 attempted to automatically assess web page credibility by applying a supervised learning algorithm to a real-world dataset using several groups of features: text-based, social (i.e. the popularity of a website), content-based, appearance, and metadata features. Their model achieved 70% accuracy in predicting the credibility of web pages.
Health web page credibility assessment
Credibility assessment studies have focused on a range of domains such as health, finance, politics, and the environment. This section focuses on studies that target health-related content, as this is the focus of this research.
Several organizations have defined sets of criteria that determine the credibility of information on health-related web pages. The HON Foundation, in collaboration with medical experts and the International Organization for Standardization (ISO), defines a number of standards that qualify the credibility of medical websites, including authority, complementary information, privacy, attribution, justifiability, transparency, financial disclosure (i.e. if any organization has received any support or contributions, this should be stated clearly), and advertising policy (i.e. if advertising is a source of funding, it should be clearly stated). 22 Evaluations using these standards are typically conducted manually. Likewise, the Agency for Health Care Policy and Research (AHCPR), now known as the Agency for Healthcare Research and Quality (AHRQ), 23 established seven criteria to evaluate the quality of health information on the web, which include credibility (i.e. source credibility, author qualifications, information relevance, and a review process for the information), content (i.e. complete and accurate content), disclosure (i.e. clear details on the site about financial sources), links, interactivity, and caveats (i.e. the purpose of the website, whether commercial or otherwise, must be clarified). Similarly, the DISCERN24,25 instrument defines a set of standardized criteria for evaluating the quality of public health information about treatment choices. The DISCERN questionnaire provides guidelines for appraising information and outlines standards for information producers that can be used by health professionals, patients, and the general public. The questionnaire consists of 16 questions assessing aspects of the information content such as explicit sources, dates, bias, and the risks and benefits of treatment. Reliability, credibility, and overall publication quality are assessed from content only; content presentation and delivery are not covered by the questionnaire.
Several studies have analyzed the credibility of health information and the factors influencing the credibility of information on health-related web pages. Flanagin et al. 26 demonstrated that website interface design is strongly related to health website credibility. Similarly, Eysenbach et al. 27 conducted empirical research investigating health website credibility, and the findings showed that several characteristics of a website influence its credibility, such as completeness (i.e. topics are covered completely), readability (i.e. ease of reading and understanding), and design (i.e. the interface design of the website). Another study by Rains et al. 28 showed that structural features (e.g. images, design, statistics, author details, and privacy policy statements) strongly correlate with credibility. Irwansyah et al. 29 defined several features considered significant for evaluating health website credibility (e.g. accuracy, authority, objectivity, and information content). These previous studies show that features proven to impact the credibility of health-related web pages can be utilized to build an automatic credibility assessment model.
Only a few approaches have explored the automatic evaluation of health information on the web. Mukherjee et al. 30 produced a probabilistic graphical model based on the premise that information credibility depends heavily on the linguistic style and objectivity of the text and on the author's trustworthiness. These three features were jointly used to evaluate the credibility of different health content from the healthboard.com website. The stylistic and affective characteristics of the text were manually extracted along with the author features from the web pages. User authority was measured according to multiple factors such as age, gender, country, number of posts and answers, thanks received, and the length of the author's articles. The model was evaluated in multiple experiments, achieving an accuracy of 56.7% in predicting web page credibility.
Efforts to evaluate health websites were made by Weitzel et al., 2 who introduced a new framework that assesses the credibility of health-related web pages. The authors defined several quality indicator metrics such as source reference (i.e. author qualifications), HON certification (i.e. does the site have a HON certification?), and funding and sponsorship (i.e. funding may indicate conflicts of interest for a site's owner). These metrics are combined by a function whose result gives a trust score indicating the credibility of the website. Validation was carried out by comparing the trust scores of several websites with their Alexa ranks, and the results showed that the trust scores were very close to the Alexa rankings. In recent years, there has been a growing number of publications analyzing factors that affect the credibility of health information on the web. Comparatively, only a few studies have proposed an automatic evaluation of health-related content, and to the best of our knowledge, no study has proposed an automatic evaluation of Arabic health websites. Thus, this study aims to build the first Arabic health website credibility model.
Methodology
As mentioned earlier, determining the credibility of health-related web pages is crucial, as it impacts people's lives. The existing literature on content credibility assessment, as discussed previously, lacks a focus on automatically evaluating the credibility of web page content in general and of health-related web pages in particular. These facts encouraged us to develop the first approach for automatically assessing the credibility of Arabic health web pages. Since no available datasets matched the aim of this study, a new dataset of health web pages was created. Multiple features were extracted from the health web pages according to standards defined by researchers in the medical field, in addition to features observed to be significant during the classification process. In particular, we proposed a hybrid approach that combines a predefined set of source reliability indicators with linguistic features obtained from the web page content. Labeling the web pages was delegated to field experts to guarantee a high-quality annotation process. Several machine learning algorithms were trained and tested, and their performance was evaluated.
Data collection
To evaluate the performance of our proposed model, we built a dataset consisting of 500 web pages gathered from various Arabic health websites. To build the dataset, we started by randomly collecting a wide range of websites containing health-related content, searching in particular with the following Arabic queries: "medical advice," "medical information," and "health information." We then filtered the retrieved list down to 100 websites containing only Arabic health-related content. Health-relatedness was determined from the content of the website: websites whose pages did not contain information about medical treatment, medication, illness, medical procedures, or medical advice were excluded from the pool. The resulting list of websites covered a range of health topics (e.g. general health, diabetes, cancer, infection, mental health, etc.) and included both governmental and commercial websites.
We then used web crawlers, also known as spiders or automatic indexers, to extract information from the websites. Web crawlers are Internet bots that crawl web pages and collect information about a website. A web crawler starts with a list of URLs to visit, called "seeds"; as the crawler visits these URLs, it identifies all the hyperlinks on each page and adds them to its URL list. We used the Scrapy framework, which was developed for building web scrapers, and wrote a Python program that takes the URL of the initial website and automatically scrapes its web pages.
We fed the spider with a start URL list taken from the main URL so that the spider could return a list of all the web pages; the number of web pages scraped can also be specified. We then defined several items to be extracted from each web page: the title of the article (from the heading tags); the text, that is, the health information or main content of the web page (from the body tag); the author or writer of the article; the date on which the article was written; the reviewer who reviewed and confirmed the health information in the article; the information source, that is, the references for the article's information; and the existence of advertisements on the web page. Figure 1 shows an example of a web page and the extracted data (the title, author, date, content, references, and ads).
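The sketch below illustrates the kind of Scrapy spider described above. The seed URL, CSS selectors, field names, and the page-count cap are illustrative assumptions; real sites use different markup, so the selectors would have to be adapted per website.

```python
import scrapy


class HealthPageSpider(scrapy.Spider):
    """Illustrative spider: selectors and field names are assumptions, not actual site markup."""
    name = "health_pages"
    start_urls = ["https://example-health-site.example/articles"]  # hypothetical seed URL
    custom_settings = {"CLOSESPIDER_PAGECOUNT": 20}  # cap on the number of pages scraped per site

    def parse(self, response):
        # Follow hyperlinks found on the listing page to the individual article pages
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Extract the items described in the text; one dictionary per web page,
        # later exported as one row in the per-website CSV file.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "text": " ".join(response.css("article p::text").getall()),
            "author": response.css(".author::text").get(),
            "date": response.css("time::attr(datetime)").get(),
            "reviewer": response.css(".reviewer::text").get(),
            "source": response.css(".references a::attr(href)").getall(),
            "has_ads": bool(response.css(".ad, [id*='ad']")),
        }
```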
Some websites protect their data by blocking spiders, using defensive mechanisms that limit repeated requests from the same IP address. To handle this issue, we used an XML and HTML parser that extracts website data by parsing HTML or XML documents into a tree structure; the same features specified earlier were also defined in the parser. Only a few websites in our dataset were scraped using this method, with the Python library BeautifulSoup. 1 In addition, web crawlers can only interpret HTML files, which makes it challenging to crawl JavaScript-rendered pages; for these pages, we used a headless browser controlled through a command-line interface.
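A minimal sketch of this fallback parsing path is shown below. The tag and class names are hypothetical; in practice they were chosen per website, and setting a browser-like User-Agent is an assumption about how blocked requests were handled.

```python
import requests
from bs4 import BeautifulSoup


def parse_page(url):
    """Fallback extraction for sites that block the crawler; tag/class names are illustrative."""
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")  # parse the document into a tree structure
    title = soup.find("h1")
    author = soup.find(class_="author")
    return {
        "title": title.get_text(strip=True) if title else None,
        "text": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
        "author": author.get_text(strip=True) if author else None,
    }
```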
After scraping web pages from each website using the two techniques described previously, we ended up with 30 CSV files, one file per website. It is worth noting that our initial list of websites shrank to only 30 because a large number of websites had closed (e.g. because of expired licenses) or their content had changed to non-health-related content.
The CSV files were merged into one file consisting of over 1000 web pages, but we had to reduce the size of the data because of the annotation effort required. We kept 20 web pages from each website, ending up with 500 web pages and their corresponding features. The features for each web page were extracted from its URL and saved in CSV format, and the per-website CSV files were later combined to create the final dataset. Each feature was extracted by feeding the crawler the HTML tags of that feature, and the extracted information was added to the CSV file. Figure 2 shows part of our Arabic health website credibility dataset: each row represents a web page with its extracted non-textual and textual features (i.e. author, title, content, etc.) and the labels provided by three annotators.
Annotation
The labeling task was delegated to annotators from the medical field, as health information is sensitive and judging it requires expertise. All annotators were native Arabic speakers, held a degree in medicine or nursing, had over five years of experience in the medical field, and had good computer skills. The annotators received forms in which they clicked on the link to the web page, read the article, and then recorded their evaluation; pre-annotation instructions were given to guide them through the process. Labels were divided into four categories: Credible, if all information on the web page is valid; Non-credible, if the web page contains inaccurate information; Unknown, if the annotator cannot determine the validity of the information on the web page; and URL not working, which was included because a large number of website links had expired or their content had changed. Each annotator labeled every web page by reading it and assigning one of these classes.
After the annotation, we had three labels per web page from three different annotators. The most common label was taken as the final label for the article. If a web page was labeled differently by all the annotators, we excluded it to avoid any confusion, and we also excluded web pages labeled as not working. After this step, the size of our dataset decreased to 463 web pages. The final dataset had 236 credible web pages, 93 labeled unknown, and 136 labeled non-credible. Figure 3 shows the distribution of the three classes (credible, non-credible, and unknown) in the dataset.
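A small pandas sketch of this aggregation step is given below. The file and column names are hypothetical, and dropping a page as soon as any annotator marked it "URL not working" is an assumption about the exclusion rule.

```python
import pandas as pd

df = pd.read_csv("annotated_pages.csv")  # hypothetical file with one row per web page
cols = ["annotator_1", "annotator_2", "annotator_3"]  # hypothetical annotator columns

# Exclude pages marked as not working by any annotator (assumed rule)
df = df[~df[cols].eq("URL not working").any(axis=1)]


def majority(row):
    # Majority vote; return None when all three annotators disagree
    counts = row.value_counts()
    return counts.index[0] if counts.iloc[0] >= 2 else None


df["label"] = df[cols].apply(majority, axis=1)
df = df.dropna(subset=["label"])  # drop pages with no majority label
```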
We calculated the inter-rater agreement between the three annotators using the Cohen kappa coefficient, obtaining an agreement of 0.28. This value indicates how complex and difficult content reliability evaluation is, even for experts, let alone for regular users. The large number of non-credible websites also highlights the importance of a credibility prediction model.
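Cohen's kappa is defined for pairs of raters; one common way to report agreement for three annotators is to average the pairwise scores, as sketched below. The paper does not specify the exact procedure, so this averaging and the column names are assumptions.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("annotated_pages.csv")  # hypothetical file name
cols = ["annotator_1", "annotator_2", "annotator_3"]  # hypothetical columns

# Average Cohen's kappa over the three annotator pairs
pairwise = [cohen_kappa_score(df[a], df[b]) for a, b in combinations(cols, 2)]
print("mean pairwise kappa:", sum(pairwise) / len(pairwise))
```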
Feature selection
Different types of features have been explored for website credibility evaluation in the literature and by health organizations (e.g. HON). These features can be grouped into two categories: content (textual) features present on the web page and meta-information features. In this research, we investigate both types, non-textual and textual, to assess web page credibility, focusing on features that can be acquired automatically from the web page properties/metadata or content to support the automation of the model.
Non-textual features
For the non-textual features, we focused on the features corresponding to HON standards that can be acquired automatically; HON evaluates health information manually using human reviewers, and only some of its standards can be extracted automatically. The HON standardized evaluation criteria were chosen in this study because they are not restricted to a specific type of content (unlike DISCERN, which focuses only on treatment-related content) and because they address both the content and the presentation of online resources.
Table 1. Types of non-textual features in the proposed hybrid-based model.
Textual features
To extract the textual features, we first performed text preprocessing and text representation on the web page's content. Raw text contains much unnecessary information that can negatively impact the performance of machine learning models, such as Arabic stop words, HTML tags, and special characters. Text preprocessing is therefore very important in our model: we eliminated Arabic stop words (e.g. prepositions and pronouns) and cleaned the text of HTML tags, diacritics, non-Arabic words, and special characters using the PyArabic library. Finally, we removed words of three letters or fewer. After this step, the text features were ready for classification. To transform the raw textual data into a representation that machine learning and deep learning algorithms can handle, we used word distributional representations, which vectorize and encode textual data into numerical vectors that capture the semantic properties of words. Two word representation models were used in this study: term frequency-inverse document frequency (TF-IDF) and neural word embedding. TF-IDF is one of the simplest and most popular text representation models; it describes the occurrence of a word in a document without taking the structure or order of words into account. Unlike bag-of-words and TF-IDF, neural word embedding (word2vec) captures the context of a word in a document, and we used AraVec, 31 an Arabic word2vec embedding model. Both text representations were used in separate experiments to assess the impact of the text features and their representation on our model's performance.
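A minimal sketch of this preprocessing pipeline is shown below. The specific PyArabic calls and the use of NLTK's Arabic stop word list are assumptions; the paper only names the PyArabic library and the cleaning steps.

```python
import re

from nltk.corpus import stopwords
from pyarabic import araby

ARABIC_STOPWORDS = set(stopwords.words("arabic"))  # requires nltk.download("stopwords"); an assumed source of stop words


def preprocess(text):
    """Apply the cleaning steps described above to one article."""
    text = re.sub(r"<[^>]+>", " ", text)             # strip leftover HTML tags
    text = araby.strip_tashkeel(text)                # remove Arabic diacritics
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)  # drop non-Arabic words and special characters
    tokens = araby.tokenize(text)
    # Remove stop words and words of three letters or fewer
    tokens = [t for t in tokens if t not in ARABIC_STOPWORDS and len(t) > 3]
    return " ".join(tokens)
```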
Content credibility predictive model
As mentioned before, web page credibility assessment has not been sufficiently investigated in this domain. To investigate the effect of the features on credibility assessment, we tested the performance of several ML algorithms using different feature combinations: textual features, non-textual features, and hybrid features (i.e. a combination of both). For the textual features, we utilized the word representation models described earlier. Three machine learning algorithms were used in the experiments, Logistic Regression (LR), Support Vector Machine (SVM), and Decision Tree (DT), along with a deep learning model, Long Short-Term Memory (LSTM).
Machine learning models
Logistic Regression (LR) is a simple yet powerful classification model that estimates a linear function giving the probability that an input belongs to the positive class. If the probability is greater than or equal to a predefined threshold (often 0.5), the input is classified as positive; otherwise, it belongs to the negative class. We implemented a grid-search algorithm to choose the optimal hyperparameters from a range of hyperparameter combinations. The best hyperparameters for LR were solver = "liblinear," penalty = "l2" (ridge regularization), inverse regularization strength C = 1.0, and decision threshold = 0.5.
We also implemented an SVM model, as it often performs well on non-linearly separable problems. SVM creates a hyperplane that separates the data into two sets with the maximum margin. The SVM hyperparameters were also optimized for our problem; the best were kernel = "linear" and C = 100.
The decision tree classifier (DT) was also examined in this study to classify the web pages into one of the predefined classes. For text classification problems, a decision tree is typically constructed with internal nodes representing terms, branches representing term weights or conditions, and leaves representing the classes. The tree classifies an article by traversing the structure from the root until it reaches a leaf, which gives the predicted class of the article. We trained the model with different depth sizes and chose the optimal tree depth.
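The sketch below shows a grid search over the three classifiers, with search spaces built around the best hyperparameters reported above. The exact grids, and the feature matrix `X` and label vector `y`, are assumptions for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X, y: feature matrix and class labels produced by the feature-extraction steps (assumed to exist)
searches = {
    "LR": GridSearchCV(LogisticRegression(max_iter=1000),
                       {"solver": ["liblinear"], "penalty": ["l2"], "C": [0.1, 1.0, 10.0]}),
    "SVM": GridSearchCV(SVC(),
                        {"kernel": ["linear", "rbf"], "C": [1, 10, 100]}),
    "DT": GridSearchCV(DecisionTreeClassifier(),
                       {"max_depth": [5, 10, 20, None]}),
}

for name, search in searches.items():
    search.fit(X, y)  # grid search with internal cross-validation
    print(name, search.best_params_, round(search.best_score_, 3))
```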
Deep learning model
In addition to the machine learning models, we implemented a deep neural network, namely Long Short-Term Memory (LSTM). The architecture consists of an input (embedding) layer, one LSTM layer with 128 neurons, one dense layer, and one output layer. We use SpatialDropout1D, which performs a variational dropout suited to NLP models. Since there are three possible outputs, the output layer has three neurons with a Softmax activation function. We use categorical cross-entropy as the loss function and Adam as the optimizer.
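A minimal Keras sketch of this text-only LSTM is shown below. The vocabulary size, sequence length, dropout rate, and the size of the intermediate dense layer are assumptions; the paper specifies only the 128-unit LSTM, the three-neuron Softmax output, the loss, and the optimizer.

```python
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.models import Sequential

VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 300     # assumed embedding dimension (matching AraVec vectors)
MAX_LEN = 300       # assumed maximum sequence length

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),  # input/embedding layer
    SpatialDropout1D(0.2),                                   # variational dropout over word dimensions
    LSTM(128),                                               # LSTM layer with 128 neurons
    Dense(64, activation="relu"),                            # intermediate dense layer (size assumed)
    Dense(3, activation="softmax"),                          # one output neuron per class
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```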
Text representation
We utilized two different methods to represent the textual data: term frequency-inverse document frequency (TF-IDF) and neural word embedding based on word2vec (AraVec). For the TF-IDF representation, we used the count vectorizer in NLTK 2 to vectorize each document by collecting all of its unique words into a vocabulary, which is then used to create the word-count feature vector. Each article or textual content in our dataset was thus represented as a vector that can be used as input to the ML algorithms. We vectorized the title and author textual features in the same manner.
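The sketch below shows an equivalent TF-IDF vectorization using scikit-learn's TfidfVectorizer; the paper describes a count-based vectorizer, so treating the whole step with a single TF-IDF transformer is an assumption, as are the variable names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# articles: list of preprocessed article strings (assumed to exist)
content_vec = TfidfVectorizer()            # builds the vocabulary from all unique words
X_content = content_vec.fit_transform(articles)

# Title and author fields can be vectorized the same way with separate vectorizers.
title_vec = TfidfVectorizer()
X_title = title_vec.fit_transform(titles)  # titles: list of article titles (assumed)
```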
For the neural word embedding, we represent each word in an article by its embedding in AraVec. If a word is not found in the AraVec vocabulary, its embedding values are set to zero. A concatenation of the article's word embeddings was created to represent each article.
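A sketch of this lookup is given below. AraVec models are distributed as gensim Word2Vec files, so loading one with gensim is a reasonable assumption; the model file name and the maximum article length are illustrative.

```python
import numpy as np
from gensim.models import Word2Vec

aravec = Word2Vec.load("full_grams_cbow_300_twitter.mdl")  # assumed AraVec model file
EMBED_DIM = aravec.vector_size


def embed_article(tokens, max_len=300):
    """Stack (pad/truncate) the word vectors of an article; out-of-vocabulary words map to zero vectors."""
    vecs = np.zeros((max_len, EMBED_DIM))
    for i, token in enumerate(tokens[:max_len]):
        if token in aravec.wv:
            vecs[i] = aravec.wv[token]
    return vecs
```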
Hybrid features fusion
In the hybrid feature experiments, we combined the two types of features, textual and non-textual. To integrate them, we used different strategies for the ML and DL models. For the ML approach, we used the pipeline and feature union methods in Python: one pipeline accepts the text as input and preprocesses it, while a second pipeline accepts the non-textual data. The features from the two pipelines are joined using feature union (concatenation) to generate a single feature vector that is used as input to the classifier. For the deep learning model, we modified the LSTM model by creating two sub-models. The first sub-model accepts the textual data and consists of an input layer and an LSTM layer with 128 neurons. The second sub-model accepts the meta-information and consists of an input layer and a dense layer. The outputs of the two sub-models are concatenated and fed into another dense layer, and the final output layer has three neurons with a Softmax activation function.
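The two-branch deep model can be sketched with the Keras functional API as below. The sequence length, vocabulary size, number of non-textual features, and the sizes of the dense layers are assumptions; only the 128-unit LSTM branch, the concatenation, and the three-neuron Softmax output follow the description above.

```python
from tensorflow.keras.layers import Concatenate, Dense, Embedding, Input, LSTM, SpatialDropout1D
from tensorflow.keras.models import Model

MAX_LEN = 300      # assumed maximum sequence length
VOCAB_SIZE = 20000 # assumed vocabulary size
EMBED_DIM = 300    # assumed embedding dimension
N_META = 7         # assumed number of non-textual features

# Sub-model 1: textual branch (word index sequence -> embedding -> LSTM)
text_in = Input(shape=(MAX_LEN,), name="text")
x = Embedding(VOCAB_SIZE, EMBED_DIM)(text_in)
x = SpatialDropout1D(0.2)(x)
x = LSTM(128)(x)

# Sub-model 2: non-textual branch (metadata vector -> dense layer)
meta_in = Input(shape=(N_META,), name="meta")
m = Dense(16, activation="relu")(meta_in)

# Concatenate the two branches, add a dense layer, and classify into three classes
merged = Concatenate()([x, m])
merged = Dense(64, activation="relu")(merged)
out = Dense(3, activation="softmax")(merged)

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
```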
Data splitting and performance evaluation
We divided the dataset into training and testing samples using stratified ten-fold cross-validation to ensure that each fold has an equal proportion of the classes. We first tested the four algorithms using only the textual features of our dataset as inputs; we then tested their performance on the non-textual features specified earlier; and finally, we combined all available features and fed them to each of the four algorithms.
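The evaluation loop below sketches this protocol for one classifier and one feature set. The random seed, the use of macro-averaging for F1 over the three classes, and the names `X`, `y`, and `clf` are assumptions.

```python
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold

# X, y: features and labels; clf: any of the models above (assumed to exist)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accs, f1s = [], []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])
    accs.append(accuracy_score(y[test_idx], preds))
    f1s.append(f1_score(y[test_idx], preds, average="macro"))  # macro-F1 over the three classes (assumed averaging)

print("accuracy:", sum(accs) / len(accs), "F1:", sum(f1s) / len(f1s))
```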
The model performance was evaluated using accuracy and F-score. Accuracy is the fraction of correctly classified instances, Accuracy = (TP + TN) / (TP + TN + FP + FN). The F-score is the harmonic mean of precision and recall, where Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × (Precision × Recall) / (Precision + Recall); TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. The F1-measure is widely used as a classification metric because it is less sensitive to class imbalance.
Results
In this section, the results of the different experiments are presented. We performed three experiments. In the first experiment, we evaluated the three machine learning models on three different sets of features: textual (TF-IDF), non-textual, and hybrid. In the second experiment, the same models were evaluated using the word embedding representation, and in the third experiment, the LSTM model was evaluated on the same three feature sets. Tables 2 to 4 report the results.
Table 2. Performance (accuracy% = ACC, precision% = P, recall% = R, and F-score% = F1) of different ML algorithms using TF-IDF representation, non-textual features (Non-textual), and hybrid features (Hybrid).
Table 3. Performance (accuracy% = ACC, precision% = P, recall% = R, and F-score% = F1) of different ML algorithms using word embedding representation (Word Embedding), non-textual features (Non-textual), and hybrid features (Hybrid).
Table 4. Performance (accuracy% = ACC, precision% = P, recall% = R, and F-score% = F1) of the LSTM using different sets of features: textual features (Textual), non-textual features (Non-textual), and hybrid features (Hybrid).
Discussion
The previous experiments examined the impact of feature selection on the model's performance. The results of the three experiments in Tables 2 and 3 indicate the positive impact of using hybrid features in our model. In Table 2, using hybrid features, SVM significantly outperformed the non-textual-only and textual-only configurations, achieving an accuracy of 74%. SVM also achieved the highest performance among the machine learning algorithms, which is not surprising because SVM handles sparse data such as text well. Overall, the LSTM with word embedding stands out with the highest score among all algorithms and across all feature groups.
In the experiments, we assessed the performance of each algorithm individually using the different groups of features. Tables 2 to 4 show the average accuracy of the different algorithms over the different feature sets. In contrast to the other features, the non-textual features had minimal impact on classification. These findings indicate that textual features play a vital role in classification regardless of the representation technique. However, the performance of the model differed across text representations: as Tables 2 to 4 illustrate, accuracy benefits more from word embedding than from TF-IDF, which loses valuable contextual information.
We can also see clearly how the LSTM enhanced the overall performance of our model across the different experimental conditions, leading to an acceptable performance on our dataset.
In reviewing the literature, we found no previous research that closely matches our study in terms of methodology, type of data used, or the dataset's language, which makes direct comparison of our findings difficult. Still, the study in 30 evaluated medical pages in English using hybrid data and obtained 56.7% accuracy, so our results are comparatively better. This may be because that work used only three non-textual features and neglected other factors that can significantly increase a model's accuracy, and because our use of deep learning increases the accuracy of the model; increasing the dataset size would improve performance further. While collecting the dataset, we were surprised that the Internet lacks an adequate number of Arabic health websites, despite the many Arabic speakers worldwide. This was one of the obstacles that kept the volume of our data limited compared with the datasets available in English, and it may be one reason why many researchers have focused on English-language sites when building models to evaluate data reliability. We mentioned previously that design may indicate the reliability of a web page; during the data collection phase, we faced many difficulties scraping non-textual features from Arabic websites because many of them lack professional programming and design.
Conclusion
In this article, we introduced the first automatic credibility prediction model for Arabic health websites. The study examined the impact of text representation on credibility prediction performance and showed the significant effect of textual features on credibility evaluation. Besides being the first study to investigate Arabic health information on the Internet, this work contributes to the research area by creating the first Arabic health dataset evaluated by medical experts. Our model was able to predict the credibility of health web pages with an accuracy of 79% and an F1-score of 77%. However, considerably more work is needed to expand the size of our dataset, better understand the factors that impact health information credibility, and increase the model's performance.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
