Tampere University of Technology

Data Clustering Based on Community Structure in Mutual k-Nearest Neighbor Graph

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Scientific, peer-review

Details

Original language: English
Title of host publication: 2018 41st International Conference on Telecommunications and Signal Processing, TSP 2018
Publisher: IEEE
Pages: 262-268
Number of pages: 7
ISBN (Print): 9781538646953
Publication status: Published - 20 Aug 2018
Publication type: A4 Article in a conference publication
Event: International Conference on Telecommunications and Signal Processing - Athens, Greece
Duration: 4 Jul 2018 to 6 Jul 2018

Conference

Conference: International Conference on Telecommunications and Signal Processing
Country: Greece
City: Athens
Period: 4/07/18 to 6/07/18

Abstract

Data clustering is a fundamental machine learning problem. Community structure is common in social and biological networks. In this article we propose a novel data clustering algorithm that exploits this phenomenon in the mutual k-nearest neighbor (MKNN) graph constructed from the input dataset. We use authentic scores, a metric that measures the strength of an edge in a social network graph, to rank all the edges in the MKNN graph. By removing the edges gradually in the order of their authentic scores, we collapse the MKNN graph into components to find the clusters. The proposed method has two major advantages compared to other popular data clustering algorithms. First, it is robust to noise in the data. Second, it finds clusters of arbitrary shape. We evaluated our algorithm on synthetic noisy datasets, synthetic 2D datasets and real-world image datasets. Results on the noisy datasets show that the proposed algorithm clearly outperforms the competing algorithms in terms of Normalized Mutual Information (NMI) scores. The proposed algorithm is the only one that does not fail on any of the synthetic 2D datasets, which are specifically designed to expose the limitations of clustering algorithms. On the real-world image datasets, the best NMI score achieved by the proposed algorithm is higher than that of any competing algorithm. The proposed algorithm has a computational complexity of $O(k^3 n + kn \log(kn))$ and a space complexity of $O(kn)$, which are better than or equivalent to those of the most popular clustering algorithms.
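The pipeline described in the abstract (build the MKNN graph, rank its edges, remove the weakest, read clusters off the connected components) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not define the authentic score, so the `score` argument below is a caller-supplied stand-in, and the example uses negative edge length purely as a placeholder. The function names `mutual_knn_graph`, `prune_edges` and `connected_components` are hypothetical, and the brute-force neighbor search is chosen for brevity rather than to match the stated complexity.

```python
from itertools import combinations
from math import dist


def mutual_knn_graph(points, k):
    """Return undirected edges (i, j), i < j, where i and j are each
    among the other's k nearest neighbors (Euclidean distance)."""
    n = len(points)
    knn = []
    for i in range(n):
        # Brute-force neighbor search: fine for a sketch, quadratic in n.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(points[i], points[j]))
        knn.append(set(others[:k]))
    return {(i, j) for i, j in combinations(range(n), 2)
            if j in knn[i] and i in knn[j]}


def prune_edges(edges, score, n_remove):
    """Drop the n_remove lowest-scoring edges.

    `score` is a placeholder for an edge-strength measure; the paper's
    authentic score is NOT reproduced here.
    """
    ranked = sorted(edges, key=score)
    return set(ranked[n_remove:])


def connected_components(n, edges):
    """Label nodes 0..n-1 by connected component using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = mutual_knn_graph(points, k=2)

# Placeholder score (an assumption): longer edges count as weaker.
pruned = prune_edges(edges,
                     score=lambda e: -dist(points[e[0]], points[e[1]]),
                     n_remove=2)
labels = connected_components(len(points), pruned)
```

In this toy input no mutual-neighbor edge crosses the two groups, so the components already coincide with the groups, and removing the two longest edges leaves them intact. On real data the number of edges to remove, or the stopping rule, would have to come from the paper itself.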

Keywords

  • authentic score, data clustering, graph
