Multi-label classification without the multi-label cost
Abstract
Multi-label classification, or the same example can belong to more than one class label, happens in many applications. To name a few, image and video annotation, functional genomics, social network annotation and text categorization are some typical applications. Existing methods have limited performance in both efficiency and accuracy. In this paper, we propose an extension over decision tree ensembles that can handle both challenges. We formally analyze the learning risk of Random Decision Tree (RDT) and derive that the upper bound of risk is stable and lower bound decreases as the number of trees increases. Importantly, we demonstrate that the training complexity is independent from the number of class labels, a significant overhead for many state-of-the-art multi-label methods. This is particularly important for problems with large number of multi-class labels. Based on these characteristics, we adopt and improve RDT for multi-label classification. Experiment results have demonstrated that the computation time of the proposed approaches is 1-3 orders of magnitude less than other methods when handling datasets with large number of instances and labels, as well as improvement up to more than 10% in accuracy as compared to a number of state-of-the-art methods in some datasets for multi-label learning. Considering efficiency and effectiveness together, Multi-label RDT is the top rank algorithm in this domain. Even compared with the HOMER algorithm proposed to solve the problem of large number of labels, Multi-label RDT runs 2-3 orders of magnitude faster in training process and achieves some improvement on accuracy. Software and datasets are available from the authors. Copyright © by SIAM.