
Abstract

Given a rectangle $M$ containing a set of points, we consider the problem of detecting the largest rectangle that is totally contained in $M$ and does not include any of the points. In other words, we want to find the biggest hole in the dataset, measured by the largest rectangle it can contain. Existing algorithms are exact but cannot deal efficiently with high-dimensional problems and large instances. In fact, only the computation of maximum empty rectangles in point sets in $2d$ and $3d$ has been well addressed. In high dimensions, the problem has been shown to be NP-complete. We therefore suggest a new algorithm for this problem. Our approach is evolutionary in nature and is an innovative implementation of the genetic algorithm. It solves a more general form of the stated problem in that it ignores the axis-parallel condition imposed on the hyper-rectangles to be found. This paper includes computational results and their discussion.

Keywords

Evolutionary algorithm, approximating, maximum holes, hyper-rectangle

Introduction

Consider a rectangle $M$ in the Cartesian space $R^{d}$ (of size $[lb_{1}, ub_{1}] \times \dots \times [lb_{d}, ub_{d}]$) with $lb_{r} \leq ub_{r}$ for $1 \leq r \leq d$. Let $S = \{x_{i}\}$, $i = 1, 2, \dots, n$, be a set of points in the interior of $M$.¹ Each point $x_{i}$ is given by its coordinates, and $M$ is determined by its left, right, top, and bottom boundaries. The problem is to find the maximum empty hyper-rectangle (MEHR), that is, the largest rectangular region with sides parallel to the axes that contains no point of $S$.¹ This is an old problem in computational geometry. It occurs in many situations where a rectangle is to be located inside an area that includes many forbidden regions, as in cutting a rectangular piece out of a large metal sheet with defective spots. In low dimensions, it has applications in a range of problems such as facility location.² For $d \geq 3$, it has many applications in data mining, in computing large empty areas in a multi-dimensional dataset.³ Other applications include geographical information systems (GIS) and integration design. As an instance of a GIS application, suppose a park is wanted in a region that contains many facilities such as houses, schools, and other important buildings. The question is to locate the maximum empty rectangular shape (the park) in the area at minimum expense.

The problem has been deeply investigated in low dimensions and has been shown to be intractable in higher dimensions.⁴,⁵ Finding holes in one-dimensional datasets is an easy task, but the problem becomes challenging as the dimension increases. It turns out to be NP-hard when $d \geq 4$ and the dimension is included in the length of the input. There are no effective exact or approximate approaches that solve the problem efficiently.⁶ In many fields, such as business planning, scientific study, and statistics, analysing and extracting information from high-dimensional datasets is a central task. Finding maximal empty regions in datasets is an essential activity in data mining research.

If the cases in a dataset are regarded as points in $d$-dimensional space, then an empty region is an area of the space that contains no points. A continuous space has a great number of empty holes because it is impossible to fill the space with points. Even if each attribute takes only discrete or nominal values, it may still be a challenge to fill all the space. Large empty areas appear essentially because the collected data are insufficient, and also because particular combinations of values are impossible.⁷,⁸ Not all of these areas are important, but some of them carry significant information. For example, unknown or unexpected value combinations that are missing from the data could lead to important findings, if known.⁹

Motivation of the search for holes in datasets

Holes in datasets indicate missing data, and the biggest holes are the most serious issue. Missing data is a problem in itself.

It is, therefore, important to check whether any holes are present before embarking on any operation on the dataset, such as data mining, regression, or clustering. Ideally, the holes are filled before the operation is applied. Unfortunately, as will be stated clearly in what follows, detecting holes in data is NP-hard for dimensions $> 2$.⁶ Exact approaches only work in one- and two-dimensional spaces. For this reason, we suggest a heuristic of the evolutionary type which works rather well. Here, we describe how it can be implemented and tested on problems of dimensions 2, 3, 4, and 5.

Related work

Over the decades, several algorithms have been suggested for this problem. The most efficient one was provided by Aggarwal and Suri.¹⁰ Liu et al.⁹ proposed an algorithm for detecting empty areas in datasets. They ranked the areas by size and classified them as interesting or uninteresting. Ku et al.¹¹ addressed the problem of computing the set of maximal hyper-rectangles (MHRs) whose sides are parallel to the axes. An MHR has no points inside it and has at least one point bounding each of its surfaces. They suggested an incremental algorithm with $O(n^{2d-1} d^{3} (\log n)^{2})$ running time and $O(n^{2d-1})$ space. It depends on a heuristic that prevents small rectangles from being considered. This algorithm is suitable for small to mid-size sets of points.

Dumitrescu and Jiang⁶ gave the first important lower and upper bounds for the problem of finding the largest-volume empty axis-aligned hyper-rectangle in a unit hypercube containing $n$ points. Their algorithm is an approximation algorithm. They found that the minimum size of the largest hyper-rectangle volume is $Θ(1/n)$ for a fixed $d$.

In a later paper, they showed that the number of largest empty boxes amidst $n$ points in the plane is $O(n \log n \, 2^{α(n)})$, and that for dimensions $d \geq 3$ it is always $O(n^{d})$ and sometimes $Ω(n^{\lfloor d/2 \rfloor})$. They proposed an algorithm that computes a $(1 - ϵ)$-approximation of the largest empty box amidst $n$ points in $O(ϵ^{-2} n^{5/3} \log^{2} n)$ time.³ Edmonds et al.¹² reported a fast algorithm that can find the largest empty areas in large 2-dimensional datasets with $O(d n^{2d-2})$ run time and $O(d n^{d-1})$ space complexity. For the same dimension, Gutiérrez and Paramá⁸ gave a query-point algorithm to compute the maximum empty axis-aligned rectangle in 2-dimensional space. It is a valuable approach for optimization demands in datasets.

Lemley et al.⁷ reported a new polynomial-time algorithm for computing the largest empty holes in multi-dimensional datasets. They provided a Monte Carlo-based approach that identifies the MEHR in cases where dimensionality and input size would make data analysis difficult. Unlike all previous approaches, which provide optimality at the cost of exponential time, their algorithm may not reach an optimal solution.

Backer and Keil⁴ proposed the bichromatic rectangle problem, which is related to this one. They look for the largest axis-parallel hyper-rectangle that includes only blue points and no red points. Their algorithm ranks all the relevant hyper-rectangles. This approach costs $O(k \log^{d-2} n)$ time for $d \geq 3$, where $k$ is the number of relevant hyper-rectangles.

The reverse of this problem is finding the minimum-volume enclosing rectangle (or ellipsoid) that contains a set of points in space.¹³–¹⁷ The paper is organized as follows: related work is provided in Section ‘Related work’; Section ‘The proposed algorithm’ contains the main proposed algorithm; computational results are in Section ‘Experimental results’. Finally, the conclusion is given.

The proposed algorithm

The exact methods for MEHR are not viable for multi-dimensional datasets of large cardinality, so approximation procedures can be used instead. Here, an evolutionary approach is considered, specifically the genetic algorithm (GA). The intention is to apply this heuristic to the MEHR problem in a way that differs from simply running the existing version of GA found in the Matlab Optimisation Toolbox. GA can be applied in different ways.¹⁸,¹⁹ The variants are essentially distinguished by the way chromosomes are represented. With respect to simplicity and computation time, some representations suit a problem better than others.²⁰ It is important to clarify that hyper-rectangles do not qualify if they do not lie entirely inside the convex hull of the points.

The technique used here depends on the novel procedure implemented in Abo-Alsabeh and Salhi.¹⁷ They solved the minimum volume enclosing ellipsoid (MVEE) problem using the GA. It is stated as follows: given a finite set of points $S = \{x_{1}, \dots, x_{n}\}$ in $R^{d}$, find the MVEE that circumscribes all the points of $S$. It is known as the Löwner–John ellipsoid problem. Another problem is finding the minimum volume enclosing hyper-rectangle (MVEH), which is the opposite of the one in hand.

GAs are based on the concept of natural selection among biological organisms. Populations evolve according to this concept. The fittest individuals have a chance to survive and reproduce into the next generation, while those that are less fit are removed. The GA takes an initial population of individuals (solutions) and applies suitable genetic operators in each reproduction. It requires an appropriate representation of the individuals, a suitable fitness function that rates the solutions, and stopping criteria.

To solve the problem with GA, we suggest inverting the idea behind the algorithms used to solve the MVEE and MVEH problems.¹⁷ Those authors proposed a fitness function that can find the minimum hyper-rectangle (ellipsoid) for a set of points by letting a set of chromosomes compete within a population of solutions. Each chromosome represents the parameters (genes) of the geometric shape in hand. For the ellipsoid, they are the major and the minor axes, the center, and the angle of tilt.

Representation of a rectangle

In $2 d$ , a solution is a rectangle that is represented as a string described in Table 1. It includes the following genes:

The coordinates $(c_{1}, c_{2})$ of a point signifying the corner position of the rectangle.

The width $w$ of the rectangle.

The height $h$ of the rectangle.

The angle $θ$ of tilt of the width with respect to the $x$-axis, measured anticlockwise.

For $d > 2$, the length of the individual can be extended to $2d + 1$. The lengths of the hyper-rectangle are placed in the first $d$ genes and the elements of the corner position in the second $d$ genes. The final gene is for the tilt angle. The string is shown in Table 2.

Table 1.

String representation of a rectangle in $2 d$ .

Width	Height	Corner position		Angle of tilt
$w$	$h$	$c_{1}$	$c_{2}$	$θ$

Table 2.

String for $d > 2$ .

Lengths			Corner position			Angle of tilt
$l_{1}$	$\dots$	$l_{d}$	$c_{1}$	$\dots$	$c_{d}$	$θ$
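The string of Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the corner $(c_{1}, c_{2})$ is the vertex from which the width edge leaves at angle $θ$ (in radians) and the height edge leaves at $θ + 90°$; the class name `Rectangle2D` and its methods are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Rectangle2D:
    """Chromosome of Table 1 (hypothetical helper, not from the paper)."""
    w: float      # width
    h: float      # height
    c1: float     # x-coordinate of the anchoring corner
    c2: float     # y-coordinate of the anchoring corner
    theta: float  # tilt of the width edge, anticlockwise from the x-axis (radians)

    def genes(self):
        """Flat gene string in the column order of Table 1."""
        return [self.w, self.h, self.c1, self.c2, self.theta]

    def area(self):
        return self.w * self.h

    def vertices(self):
        """Four vertices, starting at the corner and going anticlockwise.

        Assumption: width runs along theta, height along theta + 90 degrees.
        """
        ux, uy = math.cos(self.theta), math.sin(self.theta)  # width direction
        vx, vy = -uy, ux                                     # height direction
        p0 = (self.c1, self.c2)
        p1 = (p0[0] + self.w * ux, p0[1] + self.w * uy)
        p2 = (p1[0] + self.h * vx, p1[1] + self.h * vy)
        p3 = (p0[0] + self.h * vx, p0[1] + self.h * vy)
        return [p0, p1, p2, p3]
```

For $d > 2$ the same idea would extend to a list of $2d + 1$ numbers in the order of Table 2, although how a single tilt angle orients a hyper-rectangle is not specified in this section.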

Genetic operators

Starting from a random initial population of rectangles (solutions), the genetic operators of crossover, mutation, and reproduction are used to propagate successive populations. There are many types of crossover operators; the one used here is the heuristic crossover. It generates offspring that are placed randomly on the line through the two parents (strings), a small distance from the parent with the better fitness, in the direction away from the parent with the worse fitness. Applying the operator to two parents (rectangles) gives two new strings with possibly different lengths, corners, and angles of tilt. The mutation used is the Gaussian mutation, which was suggested in evolution strategies by Rechenberg.²¹ Finally, some fit chromosomes are copied unchanged into the following generation by the reproduction operator.
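The two operators can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the function names are hypothetical, the child is placed at a fixed ratio along the line instead of a random one, and the ratio of 1.2 and the per-gene mutation rule are assumed values.

```python
import random


def heuristic_crossover(better, worse, ratio=1.2):
    """Child on the line through both parents, just beyond the fitter one.

    A ratio above 1 moves the child past `better`, away from `worse`.
    The default of 1.2 is an assumption, not a value from the paper.
    """
    return [wg + ratio * (bg - wg) for bg, wg in zip(better, worse)]


def gaussian_mutation(genes, rate, sigma, rng=random):
    """Add zero-mean Gaussian noise to each gene with probability `rate`.

    `sigma` is the standard deviation of the noise (assumed to be the
    same for every gene in this sketch).
    """
    return [g + rng.gauss(0.0, sigma) if rng.random() < rate else g
            for g in genes]
```

A child produced this way keeps the gene order of the parents, so widths, corner coordinates, and the tilt angle are each combined with their counterparts.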

The fitness function

The objective is to maximize the areas of the empty rectangles in $2d$ (volumes for $d > 2$). Let $w_{j} \cdot h_{j}$ be the area of the hole $j$ to be found in $2d$, where $j = 1, 2, \dots, k$, $w_{j}$ is the width, and $h_{j}$ is the height. The objective function is then:

$$\max H = h_{j} \cdot w_{j} \qquad (1)$$

Finding a hole $j$ requires a fitness function built on the difference between the number of points lying outside the rectangle and its area, with equation (1) as the objective. Maximizing this difference gives the maximal rectangle among the points. The fitness is described as follows.

Let $ψ_{H}$ denote the fitness of rectangle $H$, $|S|$ the size of the given dataset, $|S_{H}|$ the number of points outside rectangle $H$, $A_{H}$ the area of rectangle $H$, $γ$ a constant with $γ \geq |S|^{2}$, $LB$ the vector of lower bounds on all coordinates of the points of $S$, $UB$ the vector of upper bounds on all coordinates of the points of $S$, and $c$ the center of rectangle $H$. The fitness takes the difference between a number of points and an area. To make this possible, we multiply the number of points $|S_{H}|$ and the value $(|S| - |S_{H}| + γ)$ by a constant $β = 1$, which converts them into area values. The strings (rectangles) are rated by evaluating the fitness function:

$$ψ_{H} = \begin{cases} \left(β(|S| - |S_{H}| + γ) + A_{H}\right)^{2}, & \text{if } |S_{H}| < |S|, \\ β|S_{H}| - A_{H}, & \text{else if } LB_{i} < c_{i} < UB_{i} \ \forall i, \\ 1 / (β|S_{H}| - A_{H}), & \text{otherwise.} \end{cases} \qquad (2)$$
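Equation (2) can be evaluated with a short Python function. The sketch below only computes the value as printed; the function and parameter names are hypothetical, and $γ$ defaults to $|S|^{2}$, the smallest value allowed by the condition $γ \geq |S|^{2}$.

```python
def fitness(n_total, n_outside, area, center, lower, upper,
            gamma=None, beta=1.0):
    """Evaluate equation (2) for one rectangle H.

    n_total      -- |S|, number of points in the dataset
    n_outside    -- |S_H|, number of points outside H
    area         -- A_H, area (or volume) of H
    center       -- c, centre of H as a sequence of coordinates
    lower, upper -- LB and UB, per-coordinate bounds of the points of S
    gamma        -- constant, assumed to default to |S|^2
    """
    if gamma is None:
        gamma = n_total ** 2
    if n_outside < n_total:
        # First branch: H still covers at least one point.
        return (beta * (n_total - n_outside + gamma) + area) ** 2
    if all(lo < c < hi for lo, c, hi in zip(lower, center, upper)):
        # Second branch: H is empty and centred inside the bounds of S.
        return beta * n_outside - area
    # Third branch: H is empty but centred outside the bounds of S.
    # Note: this divides by zero if beta * n_outside equals the area.
    return 1.0 / (beta * n_outside - area)
```

With $|S| = 10$, an empty rectangle of area 4 centred inside the bounds scores $10 - 4 = 6$, whereas a rectangle of the same area that still covers two points scores $(10 - 8 + 100 + 4)^{2} = 11236$. If the GA minimizes its fitness, as Matlab's `ga` does, the empty rectangle is ranked far ahead; whether the authors' run minimizes is an assumption here.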

Two stopping criteria are used here: the number of stall generations (generations without improvement) and the maximum number of generations. For the selection procedure, we chose Remainder selection, which selects chromosomes for the following generation according to their fitness.
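Remainder selection is commonly described as a two-stage rule: each individual is first selected as many times as the integer part of its scaled fitness, and the remaining slots are filled by roulette selection on the fractional parts. The Python sketch below follows that description; it is an assumption that the run reported here behaves the same way, and the function name and the `expectations` input (scaled fitness, larger meaning fitter) are hypothetical.

```python
import math
import random


def remainder_selection(expectations, n_parents, rng=random):
    """Stochastic remainder selection; returns indices of chosen parents.

    expectations[i] is the expected number of copies of individual i.
    """
    parents = []
    fractions = []
    for i, e in enumerate(expectations):
        whole = int(math.floor(e))
        parents.extend([i] * whole)   # deterministic part
        fractions.append(e - whole)   # fractional part, used by the roulette
    parents = parents[:n_parents]
    while len(parents) < n_parents:
        if sum(fractions) > 0:
            i = rng.choices(range(len(fractions)), weights=fractions)[0]
        else:
            # No fractional mass left: fall back to a uniform draw.
            i = rng.randrange(len(fractions))
        parents.append(i)
    return parents
```

For example, expectations of `[2.0, 1.0, 0.0]` with three parents always give `[0, 0, 1]`, because the integer parts already fill every slot.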

Randomly generated rectangles compete on the number of points that lie outside them and on their size; some rectangles exclude all the points, and some exclude fewer. According to the GA strategy, chromosomes with a larger size and fewer enclosed points are passed on to the next generation. The maximum-area rectangle that holds none of the points will evolve if the algorithm is run long enough.

In $2d$, we check a point against the edges of the rectangle, while in dimensions $d \geq 3$ testing whether a point lies inside the hyper-rectangle defined by its vertices becomes a relatively harder task. Checking a set of points against a rectangle can be time-consuming even for a small set in low dimensions.
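One way to carry out the $2d$ check is to rotate each point into the rectangle's own frame and compare it with the width and the height. This is a swapped-in technique for illustration, not necessarily the edge test used in the experiments. The sketch assumes the corner $(c_{1}, c_{2})$ is the vertex from which the width runs at angle $θ$ (radians), and the function names are hypothetical.

```python
import math


def inside_tilted_rectangle(point, w, h, c1, c2, theta):
    """True if `point` lies strictly inside the tilted rectangle."""
    dx, dy = point[0] - c1, point[1] - c2
    # Coordinates of the point along the width and height directions.
    u = dx * math.cos(theta) + dy * math.sin(theta)
    v = -dx * math.sin(theta) + dy * math.cos(theta)
    return 0.0 < u < w and 0.0 < v < h


def count_outside(points, w, h, c1, c2, theta):
    """|S_H|: number of points that are not strictly inside the rectangle."""
    return sum(not inside_tilted_rectangle(p, w, h, c1, c2, theta)
               for p in points)
```

Each fitness evaluation then costs one pass over the dataset per rectangle, which is consistent with the remark that the check dominates the running time.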

Algorithm 1: GA for MEHR

1 Input $S$, the set of random points with some empty regions;

2 Input $ρ$, the rate of crossover, and $μ$, the rate of mutation;

3 Generate a random population of rectangles $H$;

4 Evaluate $ψ$, the fitness of all rectangles in $H$, using equation (2);

5 Rank chromosomes according to their fitness;

6 Select chromosomes from the population using Remainder selection;

7 Generate a new population by applying the following operators: crossover (Heuristic), mutation (Gaussian), and reproduction with their particular rates;

8 Compute the fitness of the chromosomes of the new population;

9 Until (The stopping criteria are satisfied) Repeat from 5;

10 Return Best solution.
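The control flow of Algorithm 1 can be sketched as a generic loop in Python. Several simplifications are assumptions of this sketch and not part of the paper: fitness is minimized, selection keeps the better half of the ranked population instead of using Remainder selection, and the operators are passed in as functions. All names are hypothetical.

```python
import random


def run_ga(fitness, init_population, crossover, mutate, n_elite=2,
           max_generations=200, stall_generations=30, rng=random):
    """Minimise `fitness` over gene lists; returns (best_genes, best_value)."""
    population = [list(ind) for ind in init_population]
    best, best_val, stall = None, float("inf"), 0
    for _ in range(max_generations):            # criterion: max generations
        ranked = sorted(population, key=fitness)  # evaluate and rank
        top_val = fitness(ranked[0])
        if top_val < best_val:
            best, best_val, stall = list(ranked[0]), top_val, 0
        else:
            stall += 1
        if stall >= stall_generations:          # criterion: stall generations
            break
        # Simplified selection (assumption): keep the better half as parents.
        parents = ranked[:max(2, len(ranked) // 2)]
        # Reproduction: copy the fittest chromosomes unchanged.
        children = [list(ind) for ind in ranked[:n_elite]]
        while len(children) < len(population):  # crossover then mutation
            a, b = rng.sample(parents, 2)
            better, worse = (a, b) if fitness(a) <= fitness(b) else (b, a)
            children.append(mutate(crossover(better, worse)))
        population = children
    return best, best_val
```

Because the fittest chromosomes are copied unchanged, the best value never gets worse from one generation to the next in this sketch.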

Algorithm 2: GA for MEHRs

1 Input $S$, the set of random points with some empty regions;

2 Perform Algorithm 1;

3 Return best empty rectangle $H_{j}$;

4 Generate random points inside the resulting best rectangle $H_{j}$;

5 Until (The stopping criteria are satisfied) Repeat from 2;

6 Return all the solutions.

Running Algorithm 1 is enough to find only one largest hole. To find more holes, the algorithm must be run many times. An obstruction arises here: the algorithm would reach the same maximum hole in each run. The procedure is therefore updated so that a set of points is generated inside each rectangle that has been found. When the GA searches for a new hole, it then avoids the previous ones. The search continues until the algorithm has found as many rectangles as required. Only large holes are of interest in the search.

In $2d$ and $3d$, the locations of the largest rectangles are clear from the figures, where they appear separated. In higher dimensions, the centers of the hyper-rectangles indicate their positions in the dataset. The number of rectangles detected depends on the initial population (rectangles) used in the algorithm: if the lengths of the rectangles are large enough, the algorithm may detect only large holes. To find smaller holes, the lengths in the population should be reduced every several runs. The algorithm cannot guarantee finding the best result in every run.
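The blocking step of Algorithm 2 can be sketched in Python for the $2d$ case. The search for a single hole is passed in as a function standing in for Algorithm 1. The names, the rectangle tuple `(w, h, c1, c2, theta)`, and the number of filler points `n_fill` are assumptions of this sketch.

```python
import math
import random


def sample_inside(rect, n, rng=random):
    """Draw n uniform points inside a tilted rectangle (w, h, c1, c2, theta)."""
    w, h, c1, c2, theta = rect
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    pts = []
    for _ in range(n):
        u, v = rng.uniform(0, w), rng.uniform(0, h)  # local coordinates
        pts.append((c1 + u * cos_t - v * sin_t,
                    c2 + u * sin_t + v * cos_t))
    return pts


def find_holes(points, find_largest_hole, n_holes, n_fill=50, rng=random):
    """Find a hole, fill it with points, and search again."""
    points = list(points)          # work on a copy of the dataset
    holes = []
    for _ in range(n_holes):
        rect = find_largest_hole(points)   # stands in for Algorithm 1
        holes.append(rect)
        # Block the hole so that the next search avoids it.
        points.extend(sample_inside(rect, n_fill, rng))
    return holes
```

The filler points are added to a working copy only, so the original dataset is left unchanged for later analysis.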

Experimental results

First, Algorithm 2 is tested on randomly generated datasets up to $5 d$ . Each set contains five holes which are made by removing some of the points and they vary in size (see Figure 1). The aim is to show the capability of the algorithm to find the largest rectangle.

Figure 1.

(a,b) The initial rectangles and their optimal results from Algorithm 1; (c, d) many optimal solutions from Algorithm 2 in $2 d$ and $3 d$ .

Figure 1 depicts the output of Algorithms 1 and 2 in $2d$ and $3d$; the results are promising. Figure 1(a), at the top left, displays the initial population of rectangles (to the left) and the optimal solution (to the right) in $2d$, while Figure 1(b) shows the initial population of hyper-rectangles and the optimum in $3d$; both figures are an output of Algorithm 1. In Figure 1(c), Algorithm 2 finds the five largest rectangles that can be embedded in the dataset. Figure 1(d) shows only three large hyper-rectangles; the other two are omitted on purpose for the clarity of the picture.

The accuracy of our method is illustrated using two small sets of points in $2d$ from Gutiérrez and Paramá⁸ and Dumitrescu and Jiang³ that are redrawn here in similar positions in the plane. Those authors found only axis-aligned rectangles, but here non-parallel rectangles have been found too. This is done by controlling the angle of tilt of the hyper-rectangles. Figure 2(b) shows the original picture by Gutiérrez and Paramá⁸ and Figure 2(a) the result from Algorithm 1. This rectangle was found among many others, but for an accurate comparison we preferred not to display them. Figure 3(c) and (d) shows two results from Dumitrescu and Jiang,³ and Figure 3(a) and (b) those from our algorithm. Figure 3(a) is similar to Figure 3(c), where the rectangle is bounded by four points. In Figure 3(b), it was enough to bound each of the rectangles with only two points to find them.

Figure 2.

(a) From Algorithm 1; (b) from Gutiérrez and Paramá.⁸

Figure 3.

(a, b) From Algorithm 2 in comparison with the original ones (c) and (d) from Dumitrescu and Jiang.³

In addition, 50 experiments were conducted on several real datasets used in previous papers, with 10 experiments on each dataset. In each run, Algorithm 2 tries to explore 20 different holes. The average number of hyper-rectangles detected is recorded. Table 3 shows the results for these data, where the first column gives the dataset, the second the dimension, and the third the average. It illustrates that Algorithm 2 can find 16 to 20 hyper-rectangles for every 20 runs of Algorithm 1.

Table 3.

Test problem statistics.

Dataset	Dim	No. rectangles
Iris	4	16
Combined cycle power plant 1	5	20
Combined cycle power plant 2	5	20
Training wilt data	5	18
Testing wilt data	5	16

Three datasets are selected from the well-known UCI machine learning repository.²² In all cases, we reduced the size of the datasets out of necessity to compare the quality of our algorithm with an existing one that is provided by Lemley et al.⁷

The Iris dataset includes the lengths and widths of the petals and sepals of 156 instances from three types of iris plants, in four dimensions.

The two combined cycle power plant datasets include 9568 items in five dimensions,⁷ which provide information on physical measurements such as humidity, exhaust vacuum, temperature, pressure, and electric output as measured by sensors around the plant. Only the first 150 items of each of the two datasets are tested.

The Wilt dataset includes details about tree wilt obtained from high-resolution remote sensing. Here, both the training and testing sets are used, taking the first 150 items from each (out of 4339 and 498 items, respectively) in five dimensions.

The advantages of our procedure are numerous. First of all, it does not restrict the rectangles to having sides parallel to the axes of the dataset. Indeed, it can find the largest rectangles in any orientation, including of course those with sides parallel to the axes. Furthermore, it finds not just the largest but all large rectangles. In fact, it can output all possible rectangles that can be inserted in the holes, big or small. Working with rectangles has its reasons, but our procedure can also work with ellipsoids, for instance, without much change to the implementation. In that sense, it is far superior to anything found in the literature at the moment.

Conclusion

A new method has been provided to calculate the largest empty holes of a dataset in $R^{d}$. The problem has been addressed using an approach based on the GA, which differs from previous methods. The performance of our algorithm is evaluated on different datasets in dimensions of up to 5. Its implementation has been explained and experimental results are provided. The results are good, particularly since the algorithm does not restrict the rectangles to have axis-parallel sides. The comparison in $2d$ of some results with other papers showed a high accuracy of Algorithm 2. We are confident that our approach can tackle this problem in higher dimensions. The lack of datasets from the real world prevented us from attempting this exercise.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Hajem Ati Daham

References

1. Chen, Dumitrescu. On Wegner’s inequality for axis-parallel rectangles. Discrete Math 2020; 343: 112091.

2. Naamad, Lee, Hsu W-L. On the maximum empty rectangle problem. Discrete Appl Math 1984; 8: 267–277.

3. Dumitrescu, Jiang. On the number of maximum empty boxes amidst n points. Discrete Comput Geom 2018; 59: 742–756.

4. Backer, Keil. The bichromatic rectangle problem in high dimensions. In: 21st Canadian Conference on Computational Geometry, Vancouver, BC, August 17–19, 2009, pp. 157–160.

5. Eckstein, Hammer, Liu, et al. The maximum box problem and its application to data analysis. Comput Optim Appl 2002; 23: 285–298.

6. Dumitrescu, Jiang. On the largest empty axis-parallel box amidst n points. Algorithmica 2013; 66: 225–248.

7. Lemley, Jagodzinski, Andonie. Big holes in big data: A Monte Carlo algorithm for detecting large hyper-rectangles in high dimensional data. In: 2016 IEEE 40th annual computer software and applications conference (COMPSAC), Atlanta, GA, USA, 2016, vol. 1, pp. 563–571. IEEE.

8. Gutiérrez, Paramá. Finding the largest empty rectangle containing only a query point in large multidimensional databases. In: Ailamaki A and Bowers S (eds), Scientific and statistical database management. Lecture Notes in Computer Science 2012, vol. 7338, pp. 316–333. Berlin: Springer.

9. Liu, Ku L-P, Hsu. Discovering interesting holes in data. In: IJCAI, Proceedings of the fifteenth international joint conference on artificial intelligence (II), Nagoya, Japan, August 23–29, 1997, vol. 2, pp. 930–935. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

10. Aggarwal, Suri. Fast algorithms for computing the largest empty rectangle. In: Proceedings of the third annual symposium on computational geometry, Waterloo, ON, Canada, 8–10 June, 1987, pp. 278–290. New York, NY, USA: Association for Computing Machinery.

11. Ku L-P, Liu, Hsu. Discovering large empty maximal hyper-rectangle in multi-dimensional space. Technical Report, Department of Information Systems and Computer Science (DCOMP), National University of Singapore, 1997.

12. Edmonds, Gryz, Liang, et al. Mining for empty spaces in large data sets. Theor Comput Sci 2003; 296: 435–452.

13. Kaplan, Roy, Sharir. Finding axis-parallel rectangles of fixed perimeter or area containing the largest number of points. Comput Geom 2019; 81: 1–11.

14. Chan, Har-Peled. Smallest k-enclosing rectangle revisited. Discrete Comput Geom 2021; 66: 769–791.

15. Lin Y-T, Liu J-S. Revisit of minimum-area enclosing rectangle of a convex polygon. In: 2018 5th international conference on control, decision and information technologies (CoDIT), 10–13 April 2018, Thessaloniki, Greece, pp. 1051–1056. IEEE.

16. Kudela. Minimum-volume covering ellipsoids: Improving the efficiency of the Wolfe-Atwood algorithm for large-scale instances by pooling and batching. MENDEL 2019; 25: 19–26.

17. Abo-Alsabeh, Salhi. An evolutionary approach to constructing the minimum volume ellipsoid containing a set of points and the maximum volume ellipsoid embedded in a set of points. J Phys Conf Ser 2020; 1530: 012087.

18. Abo-Alsabeh, Salhi. A metaheuristic approach to the c1s problem. Iraqi J Sci 2021; 62: 218–227.

19. Abo-Alsabeh, Salhi. The genetic algorithm: A study survey. Iraqi J Sci 2022; 63: 1215–1231.

20. Abo-Alsabeh, Daham, Salhi. An evolutionary approach for solving the minimum volume ellipsoid estimator problem. In: Abraham A, Sasaki H, Rios R, et al. (eds) Innovations in bio-inspired computing and applications. IBICA 2020. Advances in Intelligent Systems and Computing, 2020, vol 1372, pp. 23–31. Cham: Springer.

21. Bäck, Fogel, Michalewicz. Evolutionary computation 1: Basic algorithms and operators, vol. 1. Boca Raton, FL, USA: CRC Press, 2000.

22. Lichman M and Bache K. UCI machine learning repository. Irvine, CA, USA: University of California, School of Information and Computer Science, 2013.

On the maximum empty hyper-rectangle problem
