A scalable algorithm for higher-order features generation using MinHash
2018
Linear models have been widely used in industry for their low computation time, small memory footprint, and interpretability. However, linear models cannot leverage non-linear feature interactions when predicting the target, which limits their performance. A classical way to overcome this limitation is to use combinations of the original features, referred to as higher-order features, to capture non-linearity. Since the number of higher-order features can be very large, selecting the informative ones that are predictive of the target is essential for scalability; however, this selection is itself computationally expensive and requires a large memory footprint. In this paper, we propose a novel scalable MinHash-based scheme to select informative higher-order features. Unlike the typical uses of MinHash for near-duplicate entity detection and association-rule mining, we use the MinHash signatures of features to approximate the mutual information between higher-order features and the target, enabling their selection. An analysis of running time and memory requirements shows that our proposal is substantially more efficient in both respects than existing alternatives. We demonstrate through experiments on multiple benchmark datasets that our approach is not only scalable but also identifies the most important feature interactions, resulting in improved model performance.
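To make the key idea concrete, the following is a minimal Python sketch, not the paper's exact algorithm, of how MinHash signatures of binary features' support sets (the row indices where a feature is 1) can be combined via inclusion-exclusion on estimated union sizes into a plug-in estimate of the mutual information between a candidate conjunction feature and a binary target. All names (`minhash_signature`, `estimate_pair_mi`, ...) and the 128-slot signature length are illustrative assumptions.

```python
"""Hypothetical sketch: rank candidate feature conjunctions by an MI estimate
computed only from per-feature MinHash signatures and support sizes."""
import math
import random

NUM_HASHES = 128
MERSENNE_PRIME = (1 << 61) - 1
random.seed(0)
# One random hash h(x) = (a*x + b) mod p per signature slot.
HASH_PARAMS = [(random.randrange(1, MERSENNE_PRIME), random.randrange(0, MERSENNE_PRIME))
               for _ in range(NUM_HASHES)]

def minhash_signature(row_ids):
    """MinHash signature of a set of row indices (a feature's support set)."""
    return [min((a * x + b) % MERSENNE_PRIME for x in row_ids) for a, b in HASH_PARAMS]

def jaccard(sig1, sig2):
    """Estimated Jaccard similarity = fraction of matching signature slots."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

def union_size(size1, size2, sig1, sig2):
    """|X ∪ Y| = (|X| + |Y|) / (1 + J(X, Y)), with J from the signatures."""
    return (size1 + size2) / (1.0 + jaccard(sig1, sig2))

def merge(sig1, sig2):
    """Signature of X ∪ Y is the element-wise minimum of the two signatures."""
    return [min(s1, s2) for s1, s2 in zip(sig1, sig2)]

def plugin_mi(n, n_c, n_t, n_ct):
    """Plug-in MI (nats) between a binary feature with n_c positives and a
    binary target with n_t positives, given n_ct estimated co-occurrences."""
    cells = {(1, 1): n_ct, (1, 0): n_c - n_ct,
             (0, 1): n_t - n_ct, (0, 0): n - n_c - n_t + n_ct}
    mi = 0.0
    for (c, t), n_xy in cells.items():
        n_xy = max(n_xy, 0.0)            # clip small negative estimates
        if n_xy == 0:
            continue
        p_xy = n_xy / n
        p_x = (n_c if c else n - n_c) / n
        p_y = (n_t if t else n - n_t) / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi

def estimate_pair_mi(n, size_a, size_b, size_t, sig_a, sig_b, sig_t):
    """Approximate MI between the conjunction (A AND B) and the target T,
    using only support sizes and MinHash signatures (inclusion-exclusion)."""
    ab_union = union_size(size_a, size_b, sig_a, sig_b)
    at_union = union_size(size_a, size_t, sig_a, sig_t)
    bt_union = union_size(size_b, size_t, sig_b, sig_t)
    abt_union = union_size(ab_union, size_t, merge(sig_a, sig_b), sig_t)
    n_ab = size_a + size_b - ab_union                           # ≈ |A ∩ B|
    n_abt = (size_a + size_b + size_t
             - ab_union - at_union - bt_union + abt_union)      # ≈ |A ∩ B ∩ T|
    return plugin_mi(n, n_ab, size_t, n_abt)

# Toy usage: supports of two features A, B and of the positive target T.
A, B, T = set(range(0, 600)), set(range(300, 900)), set(range(250, 650))
print(estimate_pair_mi(1000, len(A), len(B), len(T),
                       minhash_signature(A), minhash_signature(B), minhash_signature(T)))
```

Because only per-feature signatures and support sizes are kept, candidate conjunctions can be scored without materializing their columns, which is what makes this style of selection attractive in running time and storage; the paper's actual scheme may differ in its estimator and data structures.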