Learning Gaussian mixture models (GMMs) is a classical problem in machine learning and applied statistics. It can also be interpreted as a clustering problem: given data samples generated independently from a GMM, we would like to recover the correct target clustering of the samples according to which Gaussian component generated each of them. Despite the large number of algorithms designed to find the correct target clustering, many practitioners prefer the k-means algorithm because of its simplicity. k-means seeks an optimal clustering that minimizes the sum of squared distances between each point and its cluster center. In this paper, we provide sufficient conditions under which any optimal clustering is close to the correct target clustering of samples generated independently from a GMM. Moreover, to achieve significantly faster running time and reduced memory usage, we show that under weaker conditions on the GMM, any optimal clustering of the samples with reduced dimensionality is also close to the correct target clustering. These results provide intuition for the informativeness of k-means as an algorithm for learning a GMM, further substantiating the conclusions of Kumar and Kannan [2010]. We verify the correctness of our theorems with numerical experiments and show, using datasets with reduced dimensionality, significant speed-ups in the time required to perform clustering.
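The setting described above can be sketched in a few lines of numpy: sample from a two-component GMM, run plain Lloyd's k-means both in the original space and after projecting onto the top principal directions, and measure agreement with the target clustering. All parameters and helper names below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component spherical GMM in d dimensions (hypothetical
# parameters, not from the paper): well-separated means, unit covariance.
n, d = 500, 50
means = np.stack([np.full(d, 3.0), np.full(d, -3.0)])
labels_true = rng.integers(0, 2, size=n)
X = means[labels_true] + rng.standard_normal((n, d))

def kmeans2(X, iters=50):
    """Plain Lloyd's algorithm for k = 2 with a farthest-point initialization."""
    far = ((X - X[0]) ** 2).sum(1).argmax()
    centers = np.stack([X[0], X[far]])
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance, the k-means objective).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        # Update step: each center becomes the mean of its assigned points.
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def agreement(a, b):
    """Fraction of points on which two 2-way clusterings agree, up to label swap."""
    return max(np.mean(a == b), np.mean(a != b))

# Cluster in the original d-dimensional space.
labels_full = kmeans2(X)

# Reduce dimensionality by projecting onto the top 2 principal directions
# (via SVD of the centered data), then cluster the projected samples.
Xc = X - X.mean(0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_low = Xc @ Vt[:2].T
labels_low = kmeans2(X_low)

print(agreement(labels_full, labels_true))
print(agreement(labels_low, labels_true))
```

For well-separated components like these, both clusterings agree almost perfectly with the target clustering, while the projected data is 25x smaller per point; the paper's results characterize how much separation suffices for this to hold.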

## The Informativeness of k-Means for Learning Gaussian Mixture Models

Abstract · Mar 30, 2017 15:41