Active two-phase learning for classification of large datasets with extreme class-skew
Active learning is a commonly used technique for reducing the amount of labeled data needed for supervised learning. In this paper, we focus on collecting labeled examples in a domain with a large unlabeled dataset and extreme class imbalance. This scenario presents several challenges for active learning. Traditional active learning strategies can have acute difficulty locating minority-class examples and can fail completely because of the well-known cold-start problem. Scale complicates the problem further: for large datasets, running active learning computations over the whole domain set can be expensive. In addition, active learning strategies can be impractical or inefficient for interactive use because the iterative selection cycles take a long time to compute. We propose a two-phase approach that accounts for both the high class imbalance and the scale of the input data space. Specifically, our approach employs two active learners in a tiered fashion: the first-phase active learner efficiently learns a domain classifier (a filter function) defined on the entire input space, and the second-phase learner efficiently learns the final ML classifier defined on the output of the filter. The second phase selects informative instances from a smaller pool of unlabeled examples and therefore does not need to operate on the full dataset. The proposed method allows active learning to be applied to large datasets with class skew. The two phases are interleaved rather than isolated, allowing a bidirectional flow of information. The combined two-phase learner progressively expands its knowledge of the input data space, using successive first-phase and second-phase strategies to switch between learning the decision boundary and expanding the domain boundary.
Given the labeled data at a given iteration, the second phase focuses on exploiting the decision boundary (up to a performance threshold), and the first phase then uses the available information to intelligently search and expand the domain. We demonstrate the effectiveness of our strategy for product classification on a sample of the Amazon catalog dataset. Our results show that the proposed method reaches a solution quickly, with competitive performance, in the extremely imbalanced setting.
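The interleaved two-phase loop described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper, and everything in it is an assumption made for illustration: the filter is a simple distance to the centroid of known minority examples, the final classifier is a small logistic regression, the switching rule is strict alternation rather than a performance threshold, and the seed set is assumed to contain at least one minority example so that the cold-start case does not arise. The names `two_phase_active_learning`, `oracle`, and `pool_size` are hypothetical.

```python
import numpy as np

def fit_logreg(X, y, steps=200, lr=0.5):
    """Gradient-descent logistic regression; a stand-in for the final ML classifier."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y)) / len(y)
    return w

def predict_proba(w, X):
    """Probability of the minority class under weights w (last entry is the bias)."""
    return 1.0 / (1.0 + np.exp(-(X @ w[:-1] + w[-1])))

def two_phase_active_learning(X, oracle, seed_idx, rounds=6, batch=10, pool_size=200):
    # labeled maps example index -> label returned by the (hypothetical) oracle.
    labeled = {int(i): int(oracle(int(i))) for i in seed_idx}
    for r in range(rounds):
        idx = np.fromiter(labeled.keys(), dtype=int)
        y = np.array([labeled[i] for i in idx], dtype=float)
        unlabeled = np.setdiff1d(np.arange(len(X)), idx)
        # Assumed filter: rank the whole unlabeled set by distance to the
        # centroid of the minority examples found so far (needs >= 1 positive).
        centroid = X[idx[y == 1]].mean(axis=0)
        dist = np.linalg.norm(X[unlabeled] - centroid, axis=1)
        ranked = unlabeled[np.argsort(dist)]
        if r % 2 == 0:
            # Phase 1: expand the domain by querying the top-ranked examples.
            pick = ranked[:batch]
        else:
            # Phase 2: uncertainty sampling, restricted to the filtered pool.
            cand = ranked[:pool_size]
            p = predict_proba(fit_logreg(X[idx], y), X[cand])
            pick = cand[np.argsort(np.abs(p - 0.5))[:batch]]
        for i in pick:
            labeled[int(i)] = int(oracle(int(i)))
    idx = np.fromiter(labeled.keys(), dtype=int)
    y = np.array([labeled[i] for i in idx], dtype=float)
    return labeled, fit_logreg(X[idx], y)

# Synthetic data with a 1% minority class (illustrative only, not the paper's data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (4950, 2)), rng.normal(3.0, 0.5, (50, 2))])
y_true = np.r_[np.zeros(4950, dtype=int), np.ones(50, dtype=int)]
# Seed set: 8 random majority examples plus 2 known minority examples.
seed_idx = np.r_[rng.choice(4950, 8, replace=False), [4950, 4951]]
labeled, w = two_phase_active_learning(X, lambda i: y_true[i], seed_idx)
```

In this sketch only the cheap distance ranking touches the full unlabeled set; the classifier is scored on at most `pool_size` candidates per round, which mirrors the idea that the second phase does not need to operate on the full dataset.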