TUTCRIS - Tampere University of Technology


Data Clustering Based on Community Structure in Mutual k-Nearest Neighbor Graph



Title: 2018 41st International Conference on Telecommunications and Signal Processing, TSP 2018
ISBN (print): 9781538646953
DOI - permanent links
Status: Published - 20 August 2018
OKM publication type: A4 Article in conference proceedings
Event: International Conference on Telecommunications and Signal Processing - Athens, Greece
Duration: 4 July 2018 to 6 July 2018


Conference: International Conference on Telecommunications and Signal Processing


Data clustering is a fundamental machine learning problem. Community structure is common in social and biological networks. In this article we propose a novel data clustering algorithm that exploits this phenomenon in the mutual k-nearest neighbor (MKNN) graph constructed from the input dataset. We use authentic scores, a metric that measures the strength of an edge in a social network graph, to rank all the edges in the MKNN graph. By gradually removing the edges in the order of their authentic scores, we collapse the MKNN graph into components, which form the clusters. The proposed method has two major advantages compared with other popular data clustering algorithms. First, it is robust to noise in the data. Second, it finds clusters of arbitrary shape. We evaluated our algorithm on synthetic noisy datasets, synthetic 2D datasets and real-world image datasets. Results on the noisy datasets show that the proposed algorithm clearly outperforms the competing algorithms in terms of Normalized Mutual Information (NMI) scores. The proposed algorithm is the only one that does not fail on any of the synthetic 2D datasets, which are specifically designed to expose the limitations of clustering algorithms. On the real-world image datasets, the best NMI score achieved by the proposed algorithm is higher than that of any competing algorithm. The proposed algorithm has a computational complexity of O(k^3 n + kn log(kn)) and a space complexity of O(kn), which is better than or equivalent to that of the most popular clustering algorithms.
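The pipeline described in the abstract (build the MKNN graph, rank the edges, remove them gradually, read the clusters off the connected components) can be illustrated with a minimal Python sketch. Several parts are assumptions that do not come from the record: the abstract does not define the authentic score, so the number of neighbors shared by an edge's endpoints is used here as a stand-in edge strength; the weakest-first removal order and the `n_clusters` stopping rule are likewise hypothetical; and the brute-force neighbor search does not reproduce the stated complexity. All function names are illustrative.

```python
from itertools import combinations
from math import dist


def mutual_knn_graph(points, k):
    """Edges (i, j), i < j, where i and j are each among the other's k nearest."""
    n = len(points)
    knn = []
    for i in range(n):
        # Brute-force neighbor search; fine for a small illustration only.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        knn.append(set(others[:k]))
    return {(i, j) for i, j in combinations(range(n), 2)
            if j in knn[i] and i in knn[j]}


def component_labels(n, edges):
    """Label each node with a representative of its connected component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def cluster(points, k, n_clusters):
    """Remove weak MKNN edges until at least n_clusters components exist."""
    n = len(points)
    edges = mutual_knn_graph(points, k)
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    # ASSUMPTION: shared-neighbor count stands in for the authentic score,
    # which the abstract names but does not define.
    strength = {(i, j): len(adj[i] & adj[j]) for i, j in edges}
    # ASSUMPTION: weakest edges are removed first; ties broken by edge index.
    ordered = sorted(edges, key=lambda e: (strength[e], e))
    remaining = set(edges)
    labels = component_labels(n, remaining)
    for e in ordered:
        if len(set(labels)) >= n_clusters:  # hypothetical stopping rule
            break
        remaining.discard(e)
        labels = component_labels(n, remaining)
    return labels


if __name__ == "__main__":
    # Two well-separated squares of four points each.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    print(cluster(pts, k=3, n_clusters=2))
```

Using mutual neighbors rather than plain k-nearest neighbors means an isolated noisy point tends to end up with few or no edges, which is one plausible reading of the robustness to noise claimed above. Recomputing the components after every removal keeps the sketch short but is far more expensive than an incremental approach.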