Present position: Temporary Researcher
|Thesis title:||Optimization Models and Algorithms for the Hyperplane Clustering Problem|
|Research area:||Applied Systems Analysis and Operations Research|
Clustering data points is a fundamental problem that is relevant in many fields like computer science, operations research, statistics, automation, and is very challenging. Classical clustering algorithms aim at minimizing some 'distance measure' between points within the same group or between each point and a prototypal point (like mean or median) which is representative of the cluster. A distinctive feature of these methods is that they tend to favour clusters with spherical shapes. Recently, there has been a growing interest in clustering problems in which data points need to be grouped with respect to affine subspaces. Clustering data points based on their collinearity is useful in many applications like piecewise linear model fitting, line detection in digital images, 3-D image reconstruction, medical diagnosis, identification and estimation of linear autoregressive models amongst others.
In this thesis, we address three variants of the general problem of clustering d-dimensional points with respect to linear subspaces of a fixed dimension (d-1), called the Hyperplane Clustering Problem (HCP). Depending on the application, the number of hyperplane clusters needed to group the points can be known a priori or has to be determined based on some constraints.
In the minimum-hyperplanes clustering problem (min-HCP), given a maximum positive tolerance on the distance between each point and the hyperplane it is assigned to, we want to determine a minimum number of hyperplanes such that each point lies within the given maximum overall tolerance distance from the corresponding hyperplane. In the bottleneck hyperplane clustering problem (k-bHCP), for a given integer number k>0 of hyperplane clusters, we wish to determine the hyperplanes that minimize the maximum point-to-hyperplane distance. In the third version, the minimum sum of squares hyperplane clustering problem (k-sqHCP), for a given fixed integer k>0 number of hyperplanes, the problem is to minimize the sum of the squares of the Euclidean distances of points to the assigned hyperplanes. All these variants are NP-hard in complexity.
For the min-HCP, we make a step towards an exact algorithm by developing a column generation approach based on a mixed integer nonlinear formulation in which the master problem is a set covering formulation and the pricing subproblem is mixed integer with a nonlinear constraint. Since the performance of a column generation algorithm can be highly affected by the initial pool of columns, we propose different ways of generating this pool and investigate its impact on the overall algorithm. As the pricing subproblem is extremely challenging due to the nonlinearity arising from the l2-norm in the distance function, we devise and investigate some alternative pricing subproblems based on different norms to approximate the original nonlinear constraint. For small-sized instances, we present a column generation approach based on two different norms which yields very good lower and upper bounds on the value of the objective function. For large-sized instances, we present some speed-up strategies for the overall column generation algorithm. We propose a strategy that calculates exact l2-norm hyperplanes and performs point-reassignment which further reduces the number of hyperplane clusters needed to cluster the points in hyperslabs of given maximum thickness. The performance of our approach is assessed on realistic randomly generated instances, for which a tight lower bound on the minimum number of hyperplanes is known, as well as on real world data coming from line detection in digital imaging and piecewise linear model fitting. We also compare with a deterministic greedy and an iterative greedy heuristic.
For the k-bHCP, we present two mixed-integer nonlinear programming formulations which lead to simpler exact-mixed integer nonlinear formulations after applying reformulation techniques. Since the problem is challenging because of the nonlinearities and a large number of binary assignment variables, we devise several linear approximations to replace the nonlinear l2-normalization. To obtain good quality solutions in a reasonable amount of time, we develop several heuristics that are based on these approximations and that exploit the problem geometry. To all the approaches developed, we apply a final conversion phase which gives exact l2-hyperplanes or l2-feasible hyperplanes and which yields a lower overall hyperslab thickness. We present extensive computational results on suitably generated realistic random instances and on piecewise linear model fitting data. We also demonstrate how the min-HCP and the k-bHCP can benefit from each other.
Finally, for the k-sqHCP, we present a preliminary column generation algorithm with a set covering master problem and a mixed integer pricing subproblem with the l2 or l1-normalization and with a cubic objective function. For this problem it was seen that having heuristically generated columns in the initial pool led to the column generation algorithm to get stuck in a local minima very fast. To avoid this, we use a dual-stabilization like method. Even though, the stabilization method improves the objective function substantially, the results show that the column generation approach which performs very well for the min-HCP, is not suited for this problem. In fact it can only handle very small instances in a reasonable amount of computing time. We also extend the regression method to the case of k-clusters, however being a simplistic approach it fails when applied to mid-sized instances. Computational results of our algorithms are compared with those obtained with a known k-plane clustering method for some instances.
All the column generation algorithms and heuristics developed for the HCP’s can be applied with slight modifications to other norms also.