Abstract
Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across e-shops. To improve the consumer experience, e.g., by making it easy to compare offers from different vendors, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for large amounts of supervision data, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performance. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.
Introduction
In recent years, with the advancement of the Internet and e-business services, the number of products sold through e-shops has grown rapidly. A recent study estimates that total e-commerce retail sales in the USA alone were 89.1 billion dollars in the fourth quarter of 2015 [5]. Yet, there is still one big issue in the process of product search and purchase that consumers have to deal with. The same product may be found in many different e-shops, but the information in the product offers differs greatly across e-shops. Furthermore, there are no global identifiers for products, and offers are most often not interlinked. Therefore, there is no simple way for consumers to find all the necessary information and the best prices for the products they search for. To offer a better user experience, there are many product aggregator sites, like Google Product Search,1
To support the integration process, e-shops are increasingly adopting semantic markup languages such as Microformats, RDFa, and Microdata to annotate their content, making large amounts of product description data publicly available. In this paper, we present an approach that leverages neural language models and deep learning techniques in combination with standard classification approaches for product matching and categorization from HTML annotations. We focus on data annotated with the Microdata markup format using the schema.org vocabulary. Recent works [26,27] have shown that Microdata is the most commonly used markup format, with the highest domain and entity coverage. Also, schema.org is the most frequently used vocabulary to describe products. Although these annotations are considered structured, empirical studies [26,33] have shown that the vast majority of products are annotated only with a name and a textual description, i.e., the data is unstructured rather than structured. The markup helps to identify the relevant parts of a Web page, but leads to rather shallow textual data, which has to be tackled with methods for unstructured data.
In a previous work [36], we have proposed an approach for enriching structured product ads with data extracted from HTML pages that contain semantic annotations. The approach is able to identify matching products in unstructured product descriptions using the database of structured product ads as supervision. We identified the Microdata dataset as a valuable source for enriching existing structured product ads with new attributes.
In this paper, we enhance the existing approach using neural language models and deep learning techniques. We inspect the use of different models for three tasks: (1) matching products described with structured annotations from the Web, (2) enriching an existing product database with product data from the Web, and (3) categorizing products. For those tasks, we employ Conditional Random Fields for extracting product attributes from textual descriptions, as well as Convolutional Neural Networks to produce embeddings of product images. We show that neural word embeddings outperform baseline approaches for product matching and produce results comparable to those of supervised approaches for product categorization. Furthermore, we show that image embeddings provide only a weak signal for product matching, but a strong signal for product classification.
The rest of this paper is structured as follows. In Section 2, we give an overview of related work. In Section 3 and Section 4, we introduce our methodology for product matching and categorization, respectively. In Section 5, we present the results of matching unstructured product descriptions, followed by the evaluation of the product ads enrichment with metadata extracted from HTML annotations in Section 7. In Section 8 we present the results of the product categorization approach. We conclude with a summary and an outlook on future work.
Both product matching and product categorization for the Web have been explored with various approaches and methods within the last years.
Product matching
Since there are no global identifiers for products, and links between different e-commerce Web pages are also scarce, finding out whether two offers on different Web pages are referring to the same product is a non-trivial task. Therefore, product matching deals with identifying pairs or sets of identical products.
Ghani et al. [7] were the first to enrich product databases with attribute-value pairs extracted from product descriptions on the Web, using Naive Bayes in combination with a semi-supervised co-EM algorithm to extract the pairs from text. An evaluation on apparel products shows promising results; however, the system is able to extract attribute-value pairs only if both the attribute name (e.g., “color”) and the attribute value (e.g., “black”) appear in the text.
The XploreProducts.com platform detailed in [39] integrates products from different e-shops annotated using RDFa annotations. The approach is based on several string similarity functions for product matching. The approach is extended by using a hybrid similarity method and hierarchical clustering for matching products from multiple e-shops [38].
Kannan et al. [12] use the Bing products catalog to build a dictionary-based feature extraction model. Later, the features of the products are used to train a Logistic Regression model for matching product offers to the Bing shopping data. Another machine learning approach for matching product data is proposed in [15]. First, several features are extracted from the title and the description of the products using manually written regular expressions. In contrast, feature extraction models based on named entity recognition are developed in [24] and [36]. Both approaches use a CRF model for feature extraction; however, [24] has a limited ability to extract explicit attribute-value pairs, which is improved upon in [36]. Similarly, in this paper we enhance the CRF model with continuous features, thus boosting the CRF’s ability to recognize a larger number of attribute-value pairs.
The first approach to perform products matching on Microdata annotations is presented in [33], based on the Silk rule learning framework [11]. To do so, different combinations of features (e.g. bag of words, dictionary-based, regular expressions etc.) from the product descriptions are used. The work has been extended in [34], where the authors developed a genetic algorithm for learning regular expressions for extracting attribute-value pairs from products.
The authors of [31] perform product matching on a dataset of the Bing search engine. In their approach, the authors use historical knowledge to generate the attributes and to perform schema matching. In particular, they visit the merchant’s web page to compare the values of the products in the catalog with the values on the web page, converting the problem to a standard table schema matching problem. Next, the authors use instance-based schema matching to align the attributes’ names in the catalog to the ones on the merchant’s web page.
In [8] the authors propose an unsupervised web-based enrichment approach for the purpose of product matching. They start with enriching the title of the product with tokens retrieved using a web search engine, i.e., they use the title of the product as a query to a search engine, and the top K returned tokens are used for the enrichment. To identify the relevance of each token in the title, they again use web search, i.e., the token that returns more results is more relevant for the given product. The pairwise product matching is performed based on cosine similarity, using prefix filtering techniques as a blocking approach.
A similar approach for enriching product descriptions with tokens from a web search engine is proposed in [21]. As in [8], the approach first enriches the offer’s title with tokens obtained from a web search engine. Then, it uses Community Detection for approximate matching, which computes the distance between pairs based on the strength of the constructed “social” network between the private tokens. The approach is evaluated on the Abt-Buy dataset and a small custom dataset.
While the studies discussed above have implemented diverse approaches (classifiers, genetic programming, string similarities), the feature extraction techniques used mostly depend on a supervised dictionary-based approach.
To reduce the labeling effort required by supervised approaches, in this paper we rely on neural word embeddings and CNNs. A full review of deep learning and neural networks in general is beyond the scope of this paper; please see [37]. Instead, we review the most relevant work on product feature extraction that uses neural word embeddings and deep learning.
Recently, a handful of approaches employ word embeddings for getting features from product data for the problem of matching, as well as other problems concerning product data. The approach by Grbovic et al. [9] discusses the problem of product recommendations as a part of online advertisements. To perform recommendations, the authors use word2vec [29] to create product embeddings from product titles for product-to-product predictions, as well as paragraph2vec [18] to create user embeddings for user-to-products predictions. Similarly, in [42], the authors present a product recommendation system for microblogging websites where the main idea is that users and products can be represented in the same feature space by employing word2vec to calculate the feature vectors.
In [4], the authors present a system that automatically estimates the quality of machine-translated segments of product offers. The authors again use word embeddings, specifically paragraph2vec [18], to learn feature vectors from the product title. These vectors are then used to predict post-edition effort (HTER) on products from three different categories.
To the best of our knowledge, product matching based on image features has been applied in a couple of domains, including apparel and interior design. In [14], the authors propose an approach that matches clothing in two different settings: street photos vs. online shop photos. Specifically, similarly to our approach for extracting image features, they use a CNN to learn image embeddings, which are then used as input to a binary classification task. Another approach tackling the same problem is introduced in [41]. In contrast to [14], the authors propose the use of Siamese Deep Networks (SDN) to bridge the domain gap between the user photos and the online product images. Additionally, the authors propose an alternative to the popular contrastive loss used in SDN, namely robust contrastive loss, where the penalty on positive pairs is lowered to alleviate over-fitting.
It is worth mentioning that with the rapid development of deep learning and neural nets for image processing, there are several approaches to recommend clothing based on images [22].
In [1], the authors propose a similar approach to the one in [41] in the domain of interior design. Namely, to obtain the image embeddings, they use SDN on pairs of images. Several training architectures including re-purposing object classifiers, using Siamese networks, and using multitask learning are proposed.
Since the product characteristics in the domains covered by the above-mentioned works are primarily visual, it is not surprising that image-based matching methods work considerably well. This, however, cannot be easily assumed for all domains. For example, product characteristics in domains like electronics are mainly textual (description, specification, technical fact sheets), and therefore, in this paper we cannot limit our methods to purely learning image features.
Product classification deals with assigning a set of labels from a product hierarchy to a product. Since not all web sites use a hierarchy, and those who use one are unlikely to use the same, a unified classification of products from different web sites is needed to provide the user with useful browsing and searching functionalities.
While there are several approaches concerned with product data categorization [12,31,33,39], the approach by Kozareva [16] is one of only a few that use neural text embeddings as features for the classification task. Specifically, the author learns a linear classification model and uses a combination of lexical features, LDA topics and text embeddings as the input feature vector. The approach is extended in [10], where the authors propose the use of a two-level ensemble instead of the linear classification model from [16].
Meusel et al. [27] is the most recent approach for exploiting Microdata annotations for categorization of product data. In that approach, the authors exploit the already assigned
In this paper,
Although there are a lot of approaches for products categorization based on text features, only a few are using image features for the given task. Kannan et al. [13] proposed one of the first approaches for product categorization that besides text features uses image features. The approach is based on Confusion Driven Probabilistic Fusion
In our approach for product matching, we use various feature extraction methods to derive a set of useful features for the product matching task, i.e., for identifying identical products.
An example of a structured product record

System architecture overview for the product matching task.
We define the problem of product matching similarly to the problem statement defined in Kannan et al. [12]. We have a database A of structured products and a dataset of unstructured product descriptions P extracted from the Web. Every record
Methodology
Our approach for products matching consists of three main steps: (i) feature extraction, (ii) calculating similarity feature vectors and (iii) classification. The overall design of our system is illustrated in Fig. 1. The workflow runs in two phases: training and application. The training phase starts with pre-processing both the structured and the unstructured Web product descriptions. Then, we build four feature extraction models as follows:
We build a dictionary of the product attributes and their values present in the structured product descriptions.
We build a CRF model with a set of discrete features.
In order to handle the dynamic text patterns in product descriptions we enhance the training of the preceding CRF model with text embedding features.
In addition to the textual features, we furthermore build an image embeddings model.
It is important to note that while the feature extraction part contains many time-consuming steps, such as training the CRF and image embeddings models, those steps are taken only once and can be performed in an offline pre-processing step. Thus, at run-time, products can be matched quickly, since the pre-trained CRF and embeddings models only need to be applied at that stage. The approaches are detailed in Section 3.3.
Next, we manually label a small training set of matching and non-matching unstructured pairs of product descriptions. Subsequently, we calculate the similarity feature vectors for the labeled training product pairs (Section 3.4). In the final step, the similarity feature vectors are used to train a classification model for distinguishing matching and non-matching pairs (Section 3.5). After the training phase is over, we have a trained feature extraction model and a classification model.
In the application phase, we generate a set M of all possible candidate matching pairs, which leads to a large number of candidates i.e.,
Feature extraction
We pursue different approaches for extracting features from the structured and unstructured product descriptions.
Dictionary-based approach
To implement the dictionary-based approach we were motivated by the approach described by Kannan et al. [12]. We use the database A of structured products to generate a dictionary of attributes and values. Let F represent all the attributes present in the product database A. The dictionary represents an inverted index D from A such that
Then, to extract features from a given product description
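A dictionary lookup of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the record layout, the function names `build_dictionary` and `extract_features`, the sample records and the n-gram length limit are all hypothetical.

```python
from collections import defaultdict


def build_dictionary(records):
    """Build an inverted index: lower-cased attribute value -> set of attribute names."""
    index = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            index[value.lower()].add(attribute)
    return index


def extract_features(text, index, max_ngram=3):
    """Tag every n-gram of the description that appears in the dictionary.

    Longer n-grams are tried first so that "iphone 6s" wins over "iphone".
    """
    tokens = text.lower().split()
    features = {}
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in index:
                for attribute in index[candidate]:
                    # Keep the first value found for each attribute.
                    features.setdefault(attribute, candidate)
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return features


# Hypothetical structured records standing in for the product database A.
database = [
    {"brand": "Apple", "phone type": "iPhone 6s", "memory": "64 GB", "color": "gold"},
    {"brand": "Samsung", "phone type": "Galaxy S4", "memory": "16 GB", "color": "white"},
]
index = build_dictionary(database)
features = extract_features("New Apple iPhone 6s 64 GB gold unlocked", index)
print(features)
```

Exact string lookup of this kind only finds values that already occur in the structured database, which is one reason a sequence model such as a CRF is a useful complement.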
Conditional random fields
Conditional random field (CRF) models are a commonly used approach for tagging textual descriptions in NLP. A CRF is a conditional sequence model which defines a conditional probability distribution over label sequences given a particular observation sequence. In this work we use the Stanford CRF implementation6

Example of attribute extraction from a product title.
Attributes and values normalization
While the CRF delivers rather good results, it requires a lot of labeled and diverse data. This becomes a challenge when new products emerge on the market every day and merchants change the textual patterns of their product descriptions. To address this challenge, we make use of the available unstructured data. Specifically, we use neural language modeling to extract word embeddings from the unstructured product descriptions. A similar approach has been discussed in [20] for identifying drug names in medical texts.
The goal of neural language modeling techniques is to estimate the likelihood of a specific sequence of words appearing in a corpus, explicitly modeling the assumption that closer words in the word sequence are statistically more dependent. Examples of such approaches are hierarchical log-bilinear models [30], SENNA [3], GloVe [32] and word2vec [28,29]. In our approach we use the well-established word2vec model, which is the most popular and widely used approach for word embeddings. word2vec is a particularly computationally efficient two-layer neural net model for learning word embeddings from raw text. There are two different algorithms, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. The Skip-Gram model predicts the context words within a window of size n given the current word, while CBOW predicts the current word given the context in the window.
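The difference between the two algorithms is easiest to see in the training examples each one derives from a sentence. The following sketch only generates those (input, target) examples and does not train a network. The function names and the sample title are assumptions made for illustration.

```python
def context_window(tokens, position, n):
    """Return the tokens within n positions of `position`, excluding the token itself."""
    left = tokens[max(0, position - n):position]
    right = tokens[position + 1:position + 1 + n]
    return left + right


def cbow_examples(tokens, n):
    """CBOW: the context words are the input, the current word is the target."""
    return [(context_window(tokens, i, n), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, n):
    """Skip-Gram: the current word is the input, each context word is a target."""
    return [
        (tokens[i], context)
        for i in range(len(tokens))
        for context in context_window(tokens, i, n)
    ]


title = "apple iphone 6s 64 gb gold".split()
print(cbow_examples(title, 1)[1])       # (['apple', '6s'], 'iphone')
print(skipgram_examples(title, 1)[:3])  # [('apple', 'iphone'), ('iphone', 'apple'), ('iphone', '6s')]
```

CBOW yields one example per token, whereas Skip-Gram yields one example per (token, context word) pair.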
Projecting such latent representations of words into a lower-dimensional feature space shows that semantically similar words appear closer to each other (see Section 5.3). This means that values of the same product attribute are represented close to each other in the latent feature space. Therefore, we follow the approach presented in [40] to implement a CRF model that, besides the previously described discrete features, includes the word embeddings as features. Training the CRF model using word embeddings makes the model more robust, meaning it is able to detect attributes that have not been seen during the training phase with higher precision. For example, given the same structured product description from Section 3.1, “Apple[Brand] iPhone 6s[phone type] 64 GB[Memory] 9.3 iOS[Operating System] gold[Color]”, the word embeddings can assign a priori labels to the words in a product description that the CRF model has never seen. Given the product description “New Samsung Galaxy S4 GT-19505 16 GB 5.0 inches Android White”, the word embedding model shows high semantic similarity between the words “Samsung” and “Apple”; therefore, we can a priori assign the label brand to the word “Samsung”. The same applies to the rest of the attributes.
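The idea can be sketched in a few lines. The example below is not the CRF implementation used in the paper: the three-dimensional vectors, the `KNOWN_LABELS` table and both function names are hypothetical. It shows how a token's feature set could combine discrete features with one continuous feature per embedding dimension, and how the nearest labelled word in the embedding space suggests an a priori label.

```python
import math

# Toy 3-dimensional embeddings; real vectors would come from a trained word2vec model.
EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "samsung": [0.8, 0.2, 0.1],
    "gold": [0.0, 0.9, 0.2],
    "white": [0.1, 0.8, 0.3],
}
# Labels observed during training (hypothetical).
KNOWN_LABELS = {"apple": "Brand", "gold": "Color"}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prior_label(word):
    """Label of the most similar labelled word in the embedding space, or None."""
    vector = EMBEDDINGS.get(word.lower())
    if vector is None:
        return None
    best = max(KNOWN_LABELS, key=lambda known: cosine(vector, EMBEDDINGS[known]))
    return KNOWN_LABELS[best]


def token_features(tokens, i):
    """Discrete features plus one continuous feature per embedding dimension."""
    word = tokens[i].lower()
    features = {
        "word": word,
        "is_digit": word.isdigit(),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
    }
    for d, value in enumerate(EMBEDDINGS.get(word, [])):
        features[f"emb_{d}"] = value
    return features


tokens = "New Samsung Galaxy S4 16 GB White".split()
print(prior_label("Samsung"))       # Brand
print(token_features(tokens, 1))
```

Words without an embedding simply contribute no continuous features, so the discrete features still apply to them.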
Attribute value normalization
Once all attribute-value pairs are extracted from the given dataset of offers, we continue with normalizing the values of the attributes. To do so, we use the same attribute normalization pipeline presented in [36], i.e., attribute type detection, string normalization, number and number with unit of measurement normalization.
In Fig. 2 we give an example of feature extraction from a given product title. The extracted attribute-value pairs are shown in Table 2, as well as the normalized values, and the detected attribute data type.
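The normalization steps named above (type detection, string normalization, and number with unit normalization) can be sketched as follows. The unit table, the regular expressions and the type names are assumptions for illustration and are not taken from the pipeline in [36].

```python
import re

# Hypothetical conversion table: raw unit -> (canonical unit, multiplier).
UNITS = {
    "gb": ("GB", 1.0),
    "tb": ("GB", 1024.0),
    "mb": ("GB", 1.0 / 1024.0),
    "inch": ("inch", 1.0),
    "inches": ("inch", 1.0),
    '"': ("inch", 1.0),
}
NUMBER_WITH_UNIT = re.compile(r'^(\d+(?:\.\d+)?)\s*([a-z"]+)$')
NUMBER = re.compile(r"^\d+(?:\.\d+)?$")


def normalize(value):
    """Return (detected data type, normalized value) for a raw attribute value."""
    text = value.strip().lower()
    match = NUMBER_WITH_UNIT.match(text)
    if match and match.group(2) in UNITS:
        unit, factor = UNITS[match.group(2)]
        return "number_with_unit", (float(match.group(1)) * factor, unit)
    if NUMBER.match(text):
        return "number", float(text)
    # String normalization: replace punctuation with spaces and collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return "string", " ".join(cleaned.split())


for raw in ["64 GB", "1TB", "5.0 inches", "9.3", "  Space-Gray "]:
    print(repr(raw), "->", normalize(raw))
```

Converting every storage value to a single canonical unit lets "1TB" and "1024 GB" compare as equal in the later similarity computation.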
Image feature extraction
Many e-shops use the same or similar image for identical products. Therefore, the image can be used as an indicator for identifying matching products. In this work, we use one of the most popular image processing techniques, i.e., deep Convolutional Neural Networks (CNNs) [19]. Usually, CNN models consist of several convolutional layers connected in a row, followed by fully connected layers. In each convolutional layer, several small filters are convolved on the input image. The weights of each filter are randomly initialized, thus, different filters get triggered by different features in the image, e.g., some might get triggered by a specific shape in the image, others by a specific color, etc. As this might produce a large number of features, each convolutional layer is usually connected to a pooling layer, which reduces the number of features by subsampling. The output of the convolutional layer is connected to a standard Feed-Forward Neural Net to solve a particular task.
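To make the roles of the convolutional and pooling layers concrete, the following sketch applies one filter to a tiny single-channel image, followed by a ReLU and a 2x2 max-pooling step. It is a didactic illustration only: the filter here is hand-written, whereas a CNN learns its filter weights during training, and the toy image and function names are assumptions.

```python
def convolve2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, as computed by a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Set negative responses to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written vertical-edge filter (illustrative, not learned).
kernel = [[-1, 1], [-1, 1]]
conv = convolve2d(image, kernel)        # 4x4 feature map
features = max_pool(relu(conv))         # 2x2 feature map after subsampling
print(features)
```

The filter responds only where the edge is, and pooling reduces the 4x4 response map to 2x2 while keeping the strongest response in each window.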
In this work, we adopt the architecture proposed in [17]. The architecture of the model is shown in Fig. 3. The network consists of five convolutional layers, followed by three fully connected layers. The number and size of the filters used in each convolutional layer are shown in the figure. All neurons have a Rectified Linear Unit (ReLU) activation function to accelerate the learning process. The output of the last fully connected layer is fed to an N-way softmax, where N is the number of labels in the dataset.
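The final stage, a fully connected layer followed by an N-way softmax, can be sketched as below. The layer sizes, weights and input activations are made-up toy values (N = 3) and bear no relation to the trained network.

```python
import math


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical activations of the previous layer after ReLU.
hidden = relu([1.5, -0.5])
# Hypothetical weights of the last fully connected layer for N = 3 labels.
weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probabilities = softmax(dense(hidden, weights, biases))
print(probabilities)
```

The softmax output is a probability distribution over the N labels, and the activations of the layers before it can serve as the image embedding.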

Convolutional Neural Nets architecture.
After the feature extraction is done, we can define an attribute space
Classification approaches
We treat the product matching problem as a two-class classification problem: given a pair of products, they are either matching or non-matching. Since there are far more non-matching pairs than matching pairs, this classification problem is heavily imbalanced.
Once the similarity feature vectors are calculated, we train four different classifiers that are commonly used for such a task: (i) Random Forest, (ii) Support Vector Machines (SVM), (iii) Naive Bayes and (iv) Logistic Regression. We try to find the classifier best suited to the high class imbalance in the product matching task, i.e., there are only a few matching product pairs (positive class) and many non-matching product pairs (negative class).
Product categorization approach
The approach for product categorization consists of two steps: (i) feature extraction and (ii) classification. We use supervised and unsupervised approaches for feature extraction.
Feature extraction
We use similar feature sets as we use for product matching.
Supervised text-based feature extraction
For the task of product categorization we use a dictionary-based approach for feature extraction from product descriptions [36].7
We were not able to build a sufficiently good CRF model that is able to annotate text with high precision because of the many possible attributes across all categories. A separate CRF model for each category in the structured product dataset should be trained.
Similarly to Section 3.3.3, we use neural language modeling to extract text embeddings from the unstructured product descriptions in order to overcome the challenge of the diversity of new products and their ever-changing descriptions. Since our product descriptions represent whole documents for the classification task, we construct text embeddings for the whole document. The most prominent neural language model for text embedding on a document level is paragraph2vec [18], an extension to word2vec. As with word2vec, paragraph2vec relies on two algorithms: Distributed Memory (DM), which corresponds to CBOW, and Distributed Bag-of-Words (DBOW), which corresponds to the Skip-Gram model. paragraph2vec is based on the same computationally efficient two-layer neural network architecture. However, to be able to represent document embeddings, paragraph2vec maps each document to a unique paragraph vector. In DM, this paragraph vector is averaged or summed with the word vectors, making the paragraph vector a memory of what is missing from the current context. In contrast to DM, DBOW ignores the context words in the input and instead forms a classification task given the paragraph vector and randomly selected words from a text window sample.
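The two variants differ in what is fed to the network. The sketch below constructs only the inputs and training pairs, without any training. The function names, the toy vectors and the sampling scheme are illustrative assumptions rather than the paragraph2vec reference implementation.

```python
import random


def dm_input(paragraph_vector, context_vectors):
    """DM: average the paragraph vector with the context word vectors."""
    vectors = [paragraph_vector] + context_vectors
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def dbow_examples(doc_id, tokens, window, samples, rng):
    """DBOW: pair the paragraph id with words sampled from a random text window."""
    start = rng.randrange(max(1, len(tokens) - window + 1))
    text_window = tokens[start:start + window]
    return [(doc_id, rng.choice(text_window)) for _ in range(samples)]


# Toy 2-dimensional vectors for one paragraph and two context words.
combined = dm_input([1.0, 1.0], [[0.0, 2.0], [2.0, 0.0]])
print(combined)  # [1.0, 1.0]

tokens = "samsung galaxy s4 16 gb white android smartphone".split()
pairs = dbow_examples("doc-1", tokens, window=4, samples=3, rng=random.Random(0))
print(pairs)
```

In DM the paragraph vector is trained jointly with the word vectors to predict a word from its context, while in DBOW the paragraph vector alone has to predict the sampled words.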
Unsupervised image-based feature extraction
We use the same CNN model for extracting image embeddings as described in Section 3.3.5. In this case, we use the image vectors as such, i.e., we use the complete image vectors for the task of image classification.
Classification
For each of the feature extraction approaches in the end we use the feature vectors to build a classification model, i.e., Naive Bayes (NB), Support Vector Machines (SVM), Random Forest (RF) and k-Nearest Neighbors (KNN), where
While we treated product matching as a binary classification problem, the problem of product categorization is a multiclass problem, since the number of product categories is much larger than two.
Products matching evaluation
In this section, we evaluate the product matching pipeline described in Section 3. We start by describing the datasets used (Section 5.1), i.e., the database of structured products used for supervision and the dataset of unstructured product descriptions. Next, as the text feature extraction model is the core module of our pipeline, in Section 5.2 we evaluate the performance of our CRF model built only on discrete features and of the CRF model built with discrete features and continuous features from word embeddings. In Section 5.3, we analyze the semantics of the word vector representations in order to gain deeper insight into their relevance for training the CRF model. In Section 5.4, we describe the experiment setup, followed by the final results of the entire product matching pipeline in Section 5.5. In the final section we perform an error analysis of the proposed approach.
Datasets
For the evaluation, we use Yahoo’s Gemini Product Ads (GPA) for supervision,8
Note: we could use any database of structured products for the given task.
For our experiments, we are using a sample of three product categories from the Yahoo’s Gemini Product Ads database. More precisely, we use a sample of 3,330 laptops, 3,476 TVs, and 3,372 mobile phones. There are 35 different attributes in the TVs and mobile phones categories, and 27 attributes in the laptops category. We use this dataset to build the Dictionary and the CRF feature extraction models.
Unstructured product offers – WDC microdata dataset
We use a recent extraction of WebDataCommons, which includes over 5 billion entities marked up by one of the three main HTML markup languages (i.e., Microdata, Microformats and RDFa) and has been retrieved from the CommonCrawl 2014 corpus.10
To evaluate the approach, we have built a gold standard from the WDC dataset on three categories in the Electronics domain, i.e., TVs, mobile phones and laptops. We have imposed some constraints on the entities we select: (i) the products must contain an
All the product descriptions extracted from English top-level domains are considered to be in English language, i.e., “com”, “org”, “net”, “eu” and “uk”.
The gold standard has been generated by manually identifying matching products in the whole dataset. Two entities are labeled as matching products if both entities contain enough information to be uniquely identified, and both entities point to the same product. It is important to note that the entities do not necessarily contain the same set of product features. The dataset is annotated by two annotators independently, where the conflicts have been resolved jointly. The number of entities, as well as the number of matching and non-matching pairs for each of the datasets, is shown in Table 3.
To build the text embeddings models, we select two subsets of the unstructured data: (i) the subset containing only the
Datasets used in the evaluation
CRF evaluation on GPA data
CRF evaluation on WDC data. The best results are marked with bold
Extracting correct attribute-value pairs with high coverage from the product text descriptions is essential for the task of products matching. Thus, we compare the performances of the CRF model built only on discrete features (just CRF in the following) with the CRF model built on discrete and continuous features from word embeddings (CRFemb in the following). To train the CRFemb model, we build both CBOW and Skip-Gram neural models for word embeddings. We train the models on the complete WDC and GPA datasets. We experimented with different parameters for the models, to finally select the following parameters’ values: window size = 5; number of iterations = 10; negative sampling for optimization; negative samples = 10; with average input vector for CBOW; vectors size = 200. We used the gensim implementation14
We evaluate both CRF models on the database of structured product ads in a split validation setting. For each of the three product categories we select 70% of the instances as a training set and the rest as a test set. The results for each category, as well as the number of instances used for training and testing, and the number of attributes are shown in Table 4.

Two-dimensional PCA projection of the 200-dimensional Skip-Gram vectors for different product attributes, extracted from the product titles.
The results show that there is no significant difference in the performance of the two models according to McNemar’s test [23], with a significance level
The dataset can be found online
Please note that the embeddings have been trained on the entire WDC corpus, including the 50 labeled examples. Hence, the effect may be partly explained by the experiment setup. However, since the test examples account for less than 0.0001% of the WDC corpus, that effect is likely to be negligible.
To analyze the semantics of the word vector representations, and to gain deeper insight into their relevance for training the CRFemb model, we employ Principal Component Analysis (PCA) to project the “high”-dimensional attribute vectors into a two-dimensional feature space.18
Please note that we employ PCA solely to create the plots depicted in Fig. 4. PCA has not been used anywhere in the product matching and/or categorization pipelines.
To evaluate the effectiveness of the product matching approach we use the standard performance measures, i.e., Precision (P), Recall (R) and F-score (F1). The precision is the percentage of the classifier's matching-pair predictions that are correct; the recall is the percentage of the matching product pairs that the classifier identifies; the F-score measures the trade-off between precision and recall, calculated as the harmonic mean of the two. The results are calculated using stratified 10-fold cross validation. For conducting the experiments, we used the RapidMiner machine learning platform and the RapidMiner development library.19
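For reference, the three measures can be computed from the sets of gold-standard and predicted matching pairs as in the following sketch. The function name and the example pairs are illustrative.

```python
def precision_recall_f1(gold, predicted):
    """gold and predicted are sets of product pairs labelled or predicted as matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(1, 2), (3, 4), (5, 6), (7, 8)}
predicted = {(1, 2), (3, 4), (9, 10)}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here two of the three predictions are correct (P = 2/3) and two of the four gold pairs are found (R = 1/2), giving F1 = 4/7.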
Product matching baseline results using cosine similarity on TF-IDF, paragraph2vec and Silk. The best results for each dataset are marked in bold
Product Matching Performance. The best results for each dataset are marked in bold
We compare our approach to three baselines. As a first baseline, we match the products based on a Bag-of-Words TF-IDF cosine similarity, reporting the best score over different matching thresholds, i.e., we iterate the matching threshold from 0.0 to 1.0 (with step 0.01) and assume that all pairs with a similarity above the threshold are matching pairs. We calculated the similarity based on different combinations of title and description, but the best results were obtained when using only the product title.
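The threshold sweep for this baseline can be sketched as follows. This is a minimal illustration rather than the setup used in the experiments: the whitespace tokenization, the smoothed IDF weighting, the function names and the toy titles are all assumptions, since the text does not specify these details.

```python
import math
from collections import Counter

def tfidf_vectors(titles):
    """Build sparse TF-IDF vectors (dicts) from whitespace-tokenized titles."""
    docs = [Counter(t.lower().split()) for t in titles]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF; the exact weighting is an assumption of this sketch.
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def f1_at(threshold, sims, gold):
    """F1 when every pair with similarity above the threshold is a match."""
    pred = [s > threshold for s in sims]
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(sims, gold, steps=100):
    """Sweep thresholds 0.00, 0.01, ..., 1.00; return (best F1, threshold)."""
    return max((f1_at(i / steps, sims, gold), i / steps)
               for i in range(steps + 1))

# Hypothetical toy titles and gold labels for four candidate pairs.
titles = ["apple iphone 6s 16 gb gold", "iphone 6s 16gb gold apple",
          "samsung galaxy s4 mini black", "samsung galaxy s4 mini"]
pairs = [(0, 1), (0, 2), (2, 3), (1, 3)]
gold = [True, False, True, False]
vectors = tfidf_vectors(titles)
sims = [cosine(vectors[a], vectors[b]) for a, b in pairs]
best_f1, best_t = best_threshold(sims, gold)
```

When several thresholds reach the same F1, this sketch keeps the largest one; any other tie-breaking rule would report the same best score.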
As a second baseline, we use the document vectors generated as explained in Section 4.1.2. We build both DM and DBOW models for each of the datasets, and we experiment with different vector sizes, i.e., 50, 200, 500 and 1000. We calculate the cosine similarity between each pair of vectors, and we report the best score over different matching thresholds.
As a third baseline, we use the Silk Link Discovery Framework [11], an open-source tool for discovering links between data items within different data sources. The tool uses genetic programming to learn linkage rules based on the extracted attributes. For this experiment, we first extract the features from the product title and description using our CRF model, and then represent the gold standard in RDF format. The evaluation is performed using 10-fold cross validation.
The results for the baselines are shown in Table 6. For all three approaches the best results are achieved when using only the title of the products. The best results for the paragraph2vec approach are achieved when using the DBOW method with 50 latent features. We can see that the document embeddings outperform the standard BOW TF-IDF, and the Silk framework outperforms both. However, the F1-measure is not very high in any of the cases, which limits the utility of those baselines in many real-world applications.
Next, we show the results for product matching using the CRFemb model for attribute extraction, compared to the standard CRF and the Dictionary model. The results are given in Table 7. All three approaches outperform the baseline approaches. The classifiers’ overall accuracy scores are significantly different according to McNemar’s test with
For example, two laptops might share the same brand, same CPU, and same HDD, but if the memory differs, then the laptops are not the same.

Images for Products.

Images of boxed Products.
From the last column of the table, we can conclude that using the image embeddings slightly improves the results. Furthermore, the feature relevance analysis using information gain shows a rather high relevance of the image similarity feature for the given task, comparable to the relevance of the brand attribute. Nevertheless, the image cannot be used as a strong signal for product matching, because many similar products of the same brand have the same or a similar image. For example, “iPhone 6s 16 GB gold” and “iPhone 6s 64 GB gold” have the same image across many e-shops. However, such image features make it possible to distinguish products of different brands. For example, Fig. 5(a) shows an iPhone 4 16 GB, Fig. 6(a) an iPhone 5 16 GB, and Fig. 6(b) a Samsung Galaxy S4 Mini. The cosine similarity of the image features between the first two is 0.76, while the cosine similarity between the first and the third mobile phone is 0.35.
In order to gain additional insights into the performance of our method, we performed a manual error analysis of its output. More precisely, we manually investigate the false positives and false negatives to identify the different categories of errors. We identify two main error categories:
In this category, we subsume all the errors caused by any of the modules in the approach pipeline: (i) Inability to extract relevant attribute-value pairs from the product description because of complex description structure, typos, or abbreviations. Descriptions often contain subjective information, e.g., “Brand new Lenovo laptop; Latest model...”, which makes it difficult for our CRF model to capture the contexts in which attribute-value pairs occur, even with the addition of the neural word embeddings; (ii) Wrong normalization of attributes and values: inability to detect the correct attribute data type, or inability to normalize the numerical and unit values. Data types, which in our case have to be specific, are often ambiguous; for instance, “14.7” can be any dimension (height, width, diagonal) or it can be associated with the release date of a particular model; (iii) Images of boxed products. Vendors on marketplaces sometimes prefer to include images of boxed products to show the condition of the products. In this case the image embeddings are closer to those of any kind of boxed product. An example of this is shown in Fig. 6; (iv) Inability of the classification models to detect all matching pairs. These errors arise in part because of the previous ones. In particular, when comparing the two phones “Apple iPhone 5s 32 GB” and “Apple iPhone 64 GB”, the classifier will be able to detect a non-matching pair only if our extraction method was able to extract the storage attribute.
As shown in Meusel and Paulheim [25], publishers of Microdata often do not follow the specification for publishing semantically annotated data, thus introducing errors in the schema and in the values themselves. In our dataset we encounter similar publishing errors and therefore consider them as data errors. The errors found in the Microdata are as follows: (i) Incomplete or simplified values; (ii) Misaligned product names and descriptions. An instance of this error is the so-called “click-bait”, where the name of the product refers to, for instance, “iPhone 6, 6s” while the description describes a cover accessory for the same model(s); (iii) Wrong product image. Occasionally, vendors mistakenly use an image of another product. Even though we consider the image embeddings a weak signal, this might still affect our classifier; (iv) Multiple product descriptions in a single description field. Websites often have multiple descriptions on a single web page due to recommended or advertised products. However, the websites do not follow the specification of introducing a new schema.org/Product instance per product description, and thus have multiple product entities as part of a single schema.org/Product instance.

An example for search advertising for the query “lenovo yoga”.
The limitations of this approach mainly stem from the errors discussed above. Namely, our matching approach relies heavily on successful product attribute extraction from textual descriptions. One could argue that certain product domains do not have tangible product attributes (e.g., Memory, Display Size, etc.) to be extracted. For example, the clothing domain relies more on image features than on textual descriptions. In such domains, our matching approach would not be able to perform to the best of its abilities.
Another limitation of this approach is the amount of training data needed for building the word embeddings. Specifically, studies show that larger training corpora significantly improve the learned embeddings. Consequently, the CRFemb model is unlikely to achieve the same gains if the embeddings are learned on smaller datasets.
Discovered matching products in the WDC dataset
Product ads are a popular form of search advertising21
Search advertising is a method of placing online advertisements on web pages that show results from search engine queries.
Discovered matching products in the WDC dataset for product ads in the GPA dataset
In this section, we apply the previously built models to the whole WDC and GPA product datasets, in order to identify product matches for the products in the GPA dataset and to extract additional attributes from the WDC products. We perform two experiments: (i) Identifying duplicate products within the WDC dataset for the top 10 TV brands (Section 7.1); (ii) Identifying matching products in the WDC dataset for the product ads in the GPA dataset in the TV category, i.e., enriching GPA product ads with features extracted from products in the WDC dataset (Section 7.2). For both experiments we use the best performing models evaluated in the previous section, i.e., a Random Forest model built on CRF features, and a Random Forest model built on CRFemb features. We do not include the dictionary approach, because its results in the initial experiments were not promising.
In the first experiment, we apply the models to identify matching products for the top 10 TV brands in the WDC dataset. To do so, we selected a subset of products from the WDC dataset that contain one of the TV brands in the s:name or s:description of the products. Furthermore, we apply the same constraints described in Section 5, which reduces the number of products. We use the brand name for blocking, i.e., we generate candidate matching pairs only for products that share the same brand. We use the CRF and CRFemb feature extraction approaches to extract the features separately, and we tune the Random Forest model to increase the precision at the cost of lower recall, i.e., a candidate product pair is considered to be a positive matching pair only if the classification confidence of the model is above 0.8.
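The blocking and confidence-threshold steps can be sketched as follows. This is an illustration only: the product records, the function names and the toy confidence scores are hypothetical, and the dictionary lookup merely stands in for the confidence that a trained Random Forest model would output.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(products):
    """Group products by brand and pair up only products within a group."""
    blocks = defaultdict(list)
    for product in products:
        blocks[product["brand"].lower()].append(product["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs

def confident_matches(pairs, confidence, threshold=0.8):
    """Keep only pairs whose match confidence is above the threshold."""
    return [pair for pair in pairs if confidence(pair) > threshold]

# Hypothetical product records.
products = [
    {"id": "p1", "brand": "Samsung"},
    {"id": "p2", "brand": "samsung"},
    {"id": "p3", "brand": "Samsung"},
    {"id": "p4", "brand": "LG"},
    {"id": "p5", "brand": "LG"},
]
pairs = candidate_pairs(products)

# Stand-in for the classifier: made-up confidences per candidate pair.
toy_scores = {("p1", "p2"): 0.93, ("p1", "p3"): 0.41,
              ("p2", "p3"): 0.80, ("p4", "p5"): 0.86}
matches = confident_matches(pairs, toy_scores.get)
```

In this toy example, blocking reduces the ten possible pairs among five products to four candidates, and the 0.8 threshold keeps two of them; the pair scored exactly 0.80 is rejected because the confidence must be above the threshold.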
We report the number of discovered matches for each of the TV brands in Table 8. The second column of the table shows the number of candidate product descriptions after we apply the selection constraints on each brand. We manually evaluated the correctness of the matches and report the precision for both the CRF and the CRFemb approach. The results show that we are able to find a large number of matching products with high precision. The number of discovered matches when using the CRFemb approach is slightly higher compared to the CRF approach, while the precision remains in the same range. By relaxing the selection constraints of product candidates the number of discovered matches would increase, but it might also reduce the precision.
Enriching product ads
In this experiment, we try to identify matching products in the WDC dataset for the product ads in the GPA dataset. As before, we select WDC products based on the brand name, and we apply the same filtering to reduce the subset of products for matching. To extract the features for the WDC products, we use the CRF and the CRFemb feature extraction models, and for the GPA products, we use the already existing features provided by the merchants. To identify the matches we apply the respective Random Forest models for the CRF and the CRFemb approach. The results are shown in Table 9. The second column reports the number of products of the given brand in the GPA dataset, and the third column the number in the WDC dataset.
The results show that we are able to identify a small number of matching products with high precision. Again, the number of discovered matching products is slightly higher when using the CRFemb approach compared to the CRF approach: using the CRF approach we are able to find at least one match for 269 products in the GPA dataset, and using CRFemb for 310 products. In total, there are 534 correct matches with the CRF approach and 676 correct matches with CRFemb. However, we have to note that we are not able to identify any matches for the products in the GPA dataset that were released after 2014, because they do not appear in the WDC dataset. Furthermore, we analyzed the number of new attributes we can discover for the GPA products from the matching WDC products. The distribution of matches, newly discovered attribute-value pairs, offers, ratings and reviews per GPA instance using the CRF approach is shown in Fig. 8. The results show that for each of the product ads for which we found a matching product description, at least two new attribute-value pairs were discovered, while for some products up to six new attribute-value pairs can be added. The same distributions for the CRFemb approach are shown in Fig. 9. With that approach, we are able to identify more matches, and thus more attribute-value pairs, offers, ratings, and reviews.
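Counting the newly discovered attribute-value pairs for a product ad amounts to collecting the attributes of its matched products that the ad does not already have. The sketch below illustrates this idea; the function name, the toy attribute names and the rule that the first value seen for an attribute wins are assumptions of the example and not necessarily the fusion strategy used in the experiments.

```python
def new_attributes(ad_attributes, matched_products):
    """Collect attribute-value pairs from matches that the ad does not have.

    If several matched products provide the same attribute, the first value
    encountered is kept (a simplifying assumption of this sketch).
    """
    enriched = {}
    for product in matched_products:
        for attribute, value in product.items():
            if attribute not in ad_attributes and attribute not in enriched:
                enriched[attribute] = value
    return enriched

# Hypothetical product ad and two matched product descriptions.
ad = {"brand": "samsung", "display size": "40"}
matched = [
    {"brand": "samsung", "display size": "40", "resolution": "1080p"},
    {"brand": "samsung", "refresh rate": "60 hz", "resolution": "1920x1080"},
]
added = new_attributes(ad, matched)
```

For this toy input the ad gains two new attribute-value pairs, a resolution and a refresh rate, while its existing brand and display size are left untouched.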

Distribution of newly discovered matches and attributes per product ad using the CRF approach.
In a third series of experiments, we evaluate the quality of our approach for product categorization introduced in Section 4. The objective of the evaluation is to categorize Web product descriptions into an existing target product catalog.
Dataset
For our experiments we use the GS1 Product Catalog (GPC)22
To evaluate the proposed approach we use the Microdata products gold standard developed in [27].23

Distribution of newly discovered matches and attributes per product ad using the CRFemb approach.
The evaluation is performed using 10-fold cross validation. We measure accuracy (Acc), Precision (P), Recall (R) and F-score (F1). Here, we evaluate both supervised and unsupervised feature extraction for product categorization. In particular, we compare the dictionary approach (Dict.) to the unsupervised paragraph2vec feature extraction (Par2vec) and the unsupervised image feature extraction model (ImgEmb). As for the product matching task (see Section 5.4), we build both DM and DBOW models with vector sizes of 50, 200, 500 and 1000.
Product Categorization results. The best results are marked in bold
For the image feature extraction we use a CNN model trained on the labels from the third level of the GS1 catalog (ImgEmb). Furthermore, we compare our CNN model to an existing CNN model. In particular, we used a Caffe reference model,24
bvlc_reference_caffenet from caffe.berkeleyvision.org.
We compare all of the approaches with a baseline approach based on a Bag-of-Words TF-IDF cosine similarity. As before, we used the RapidMiner machine learning platform and the RapidMiner development library for conducting the experiments. All experiments were run on a MacBook Pro with 8 GB of RAM and a 2.3 GHz Intel Core i7 CPU.
The results for each of the three levels are shown in Table 10. All the experiments that did not finish within ten days, or that ran out of memory, are marked with “\”. We show only the best performing results of all the paragraph2vec models; on each of the three levels the best results were achieved using the DBOW model with 500 dimensions, trained on both the title and the description of the products. The complete results can be found online.26
We can observe that the supervised dictionary-based approach outperforms the rest on the first level. However, on the second and the third level the combination of the document and image embedding approaches outperforms the others. Furthermore, the document embedding approach alone outperforms the baseline on the second and third level, and gives comparable results on the first level. It is interesting to note that the image embedding approach alone performs rather well on all three levels, where the model trained only on product images performs slightly worse than the model built on the ImageNet dataset. The reason is that the number of labeled product images we used is significantly lower than the size of the dataset used for training the ImageNet model. Also, we have to note that we did not apply any pre-processing or filtering on the images.27
Many images do not directly depict the product, but, e.g., a vendor logo, or are of a bad quality.
In this paper, we have proposed an approach that focuses on two tasks of the product integration pipeline, i.e., product matching and categorization. Our approach introduces the usage of unsupervised feature extraction for product data by using neural language modeling (word2vec and paragraph2vec) and deep learning models (CNN). The highlights of this paper include: (i) word embeddings help the CRF model training significantly for “dirty” web data, (ii) text embeddings improve product categorization considerably, and (iii) image embeddings can be used as a weak signal for product matching and strong signal for product categorization. Moreover, we provide a thorough product specification fusion as a part of our use case of enriching product ads with semantic structured data.
To further improve the extraction of product attributes, future research could investigate embedding semi-structured data from HTML tables and lists. As discussed by the authors in [35], semi-structured data should provide better context than free text. Moreover, product specifications provide many more product attributes than product descriptions. On the other hand, embedding newer WDC structured datasets including the relatively new JSON-LD semantically embedded data28
From 2016 WDC extract JSON-LD embedded data:
In this paper, we have focused on product data, but the proposed approach could easily be extended to other areas. As shown in [26], HTML annotations are used in many other areas; for example, the Microdata markup format and schema.org are widely used to describe persons, events, job postings, hotels, places, organizations, recipes, movies, books, music and many other types of data. All these entities contain textual descriptions and images. To build interesting applications on top of this data, a feature extraction approach again needs to be applied to the text and image descriptions of the entities. Our approach could be adapted and tuned for that purpose. Once the attribute-value pairs are extracted, various applications can be built on top, e.g., address books, travel agents, search engines, recommender systems and aggregators for a large variety of entity types.
Besides integrating products, our approach could be used for search query processing, product recommendation and information retrieval, which would improve the shopping experience for users [2].
