Abstract
Given a rectangle
Introduction
Given a rectangle
The problem has been deeply investigated in low dimensions and has been shown insolvable in higher dimensions.4,5 Finding holes in one-dimensional datasets is an easy task, but when the dimension increases, the problem becomes challenging. It immediately turns out to be NP-hard when
If the cases in a dataset are regarded as points in
Motivation of the search for holes in datasets
Holes in datasets point are missing data and the biggest holes the serious issue. We know that missing data is a problem in itself.
It is, therefore, important to check before embarking on any operation on the dataset such as data mining, regression, clustering, and so on to check whether any holes are presented. Ideally, the holes are filled before the operation is applied. Unfortunately, as will be stated clearly in what follows, detecting holes in data is NP-Hard for dimensions
Related work
Over the decades, several algorithms have been suggested for this problem. The most efficient one was provided by Aggarwal.
10
Lui et al.
9
proposed an algorithm for detecting empty areas in datasets. They ranked them relying on their sizes and characterized their kinds as interesting or uninteresting. Ku et al.
11
provided the problem of computing the set of maximal hyper-rectangle (MHR) where the sides are parallel to the axes. The MHR has no points inside and at least one point bounding each at its surface. They suggested an incremental algorithm with running
Dumitrescu and Jiang
6
gave the first important lower and upper bounds to the problem of finding the largest volume of an empty hyper-rectangle which is axis-aligned in a unit hypercube in space including
Once again they showed that the number of largest empty boxes inside
Lemley et al. 7 reported a new polynomial-time algorithm for computing the largest empty holes in the multi-dimensional dataset. They provided a Monte Carlo-based approach that identifies the MEHR in items where dimensionality and input size would make data analysis difficult. Their algorithm may not reach an optimal solution in contrast to all previous approaches that provide optimality with exponential time.
Backer and Keil
4
proposed the bichromatic rectangle problem which is related to this one. They found the largest hyper-rectangle parallel to the axes that included only blue but no red points. Their algorithm ranks all the relevant hyper-rectangles. This approach costs
The reverse problem of this one is finding the minimum volume enclosing rectangle (or ellipsoid) that contains a set of points in the space.13–17 The paper is organized as follows: related work is provided in Section ‘Related work’; Section ‘The proposed algorithm’ contains the main proposed algorithm; computational results are in Section ‘Experimental results’. Finally, the conclusion is given.
The proposed algorithm
The exact methods for MEHR are not viable in multi-dimensions and large cardinality datasets. So approximating procedures can be used. Here, an evolutionary approach is considered, precisely the genetic algorithm (GA). The intention is to apply this heuristic to the MEHR problem which varies from implementing the existing version of GA that is found in the Matlab Optimisation Toolbox. GA can be applied in different ways.18,19 They are essentially distinguished by the way chromosomes are represented. Concerning simplicity and computation time, some representations can be suitable for a problem more than others. 20 It is important to clarify that hyper-rectangles do not qualify if they are not totally laying inside the convex hull of the points.
The technique used here depends on the novel procedure that is implemented in Abo-Alsabeh and Salhi.
17
They solved the problem of minimum volume enclosing ellipsoid (MVEE) using the GA algorithm. It is stated as given a finite set of points
The concept of GAs is based on the natural selection work of biological organisms. The populations evolved according to this concept. The fittest among the individuals will have a chance to survive and reproduce into the next generation, while those who are less fit will be removed. The GA takes an initial population of individuals (solutions) and applies some suitable genetic operators in each reproduction. It demands an appropriate representation of the individuals and a suitable fitness function that recognizes the solutions as well as stopping criteria.
To solve the problem with GA, we suggest using opposite the performance idea of the algorithms that are used to solve the problems MVEE and MVEH. 17 They proposed a fitness function that can find the minimum hyper-rectangle (ellipsoid) for a set of points by competing for a set of chromosomes within a population of solutions. Each one represents the parameters (genes) of the geometric shape in hand. For the ellipsoid; they are the major and the minor axes, the center, and the angle of the tilt.
In The coordinates The width The height The angle
For
String representation of a rectangle in
String for
Starting by constructing a random initial population of rectangles (solutions), genetic operators of crossover, mutation, and reproduction are used to propagate successive populations. There are many types of crossover operators, the one used here is the heuristic. It generates offspring that randomly sets up on the line having the two parents (strings), with a minor distance from the string with finer fitness measurement, in the direction away from the string with poorer measurement. Performing the operator on two parents (rectangles) gives two new strings with possibly dissimilar lengths, corners, and angles of tilt. The type of mutation used is the Gaussian mutation, which is suggested in evolution strategies by Rechenberg. 21 Finally, some fit chromosomes are copied into the following generation without any changes by the reproduction operator.
The objective function is maximizing the areas of the empty rectangles in
Let
Randomly generated rectangles compete for the number of points outside the empty region and the large size of the rectangle, some do not contain all the points, and some are less than that. According to the GA strategy, chromosomes with a larger size and fewer points are to be passed on to the next generation. The maximum area rectangle that does not hold all points will be evolved if the algorithm is run long enough.
In
GA for MEHR
GA for MEHRs
Implementing Algorithm 1 is enough to find only one largest hole. To find more holes, it is needed to run the algorithm many times. But an obstruction arises here, the algorithm will reach the same maximum hole in each exploration. So that the procedure must be updated in such a way that a set of points is generated inside the first rectangle. When the GA searches for a new hole, it avoids the previous one. The searching process continues until the algorithm admits as many as rectangles. The search is interesting only in large holes.
In
Experimental results
First, Algorithm 2 is tested on randomly generated datasets up to

(a,b) The initial rectangles and their optimal results from Algorithm 1; (c, d) many optimal solutions from Algorithm 2 in
Figure 1 depicts the output of Algorithms 1 and 2 in
The accuracy of our method is illustrated by using two small sets of points in

(a) From Algorithm 1; (b) from Gutiérrez and Paramá. 8

(a, b) From Algorithm 2 in comparison with the original ones (c) and (d) from Dumitrescu and Jiang. 3
Also, 50 experiments were conducted with several real datasets used by some previous papers, with 10 experiments on each dataset. In each performance, Algorithm 2 tries to explore 20 different holes. The average number of hyper-rectangles to be detected is taken. Table 3 shows the results from these data, where the first column is the datasets, the second one is the dimension, and the third column gives the average. It illustrates that Algorithm 2 can find 16 to 20 hyper-rectangles for every 20 performances of Algorithm 5.
Test problem statistics.
Three datasets are selected from the well-known UCI machine learning repository. 22 In all cases, we reduced the size of the datasets out of necessity to compare the quality of our algorithm with an existing one that is provided by Lemley et al. 7
The Iris dataset includes the sizes of the petal and septal lengths and width of 156 attributes from three types of iris plants and four dimensions.
Two datasets of the combined cycle power plant include 9568 items in five dimensions, 7 which provide information on physical measurements such as humidity, exhaust vacuum, temperature, pressure, and electric output as measured by sensors around the plant. Only the first 150 items will be tested from the two datasets.
The Wilt dataset includes details about tree wilt, it is a high-resolution remote sensing. Here, both the training and testing sets will be used, the first 150 items from (4339, 498) in five dimensions.
The advantages of our procedure are numerous. First of all, it does not restrict the rectangles to having sides parallel to the axes of the dataset. Indeed, it can find the largest rectangles in any orientation, including of course those with parallel sides to axes. Furthermore, it finds not just the largest but all large rectangles. In fact, it can output all possible rectangles that can be inserted in the holes big or small. Working with rectangles has its reasons. But, our procedure can work with ellipsoids for instance without much change to the implementation. In that sense, it is far superior to anything found in the literature at the moment.
Conclusion
A new method has been provided to calculate the largest empty holes of a dataset in
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
