Problem: Manual data analysis to extract useful features for web log anomaly detection is costly and time-consuming. Automated techniques (e.g., those based on Auto-Encoders and CNNs), on the other hand, usually require supplementary network training for feature extraction. Systems trained on these features often suffer from high False Positive Rates (FPRs), and reducing the FPR can hurt accuracy and add training/tuning delays. Manual analysis delays, mandatory supplementary training, and inferior detection outcomes are thus the limitations of contemporary web log anomaly detection systems.
Proposal: Byte Pair Encoding (BPE) is an automated data representation scheme that requires no network training and needs only a single parsing run to tokenize the available data. Models trained on BPE-based vectors have been shown to outperform models trained on comparable representations in tasks such as neural machine translation (NMT) and natural language generation (NLG). We therefore propose to tokenize web log data with BPE, vectorize the resulting tokens with a pre-trained sequence embedding model, and use these vectors for web log anomaly detection. Our experiments on two public data sets show that ML models trained on BPE sequence vectors achieve better results than models trained on either manually or automatically extracted features. Moreover, our technique for obtaining log representations is fully automated (requiring only a single hyperparameter), needs no additional network training, and yields representations that perform consistently across different ML algorithms (a property absent from feature-based techniques). The only trade-off is a higher upper limit on system memory consumption compared to using manual features, owing to the higher dimensionality of the pre-trained embeddings; reducing it is our motivation for future work.
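To make the BPE step concrete, below is a minimal pure-Python sketch of classic (Sennrich-style) BPE, not the paper's actual implementation: learning the merge table is a single counting pass over the data (no gradient training), and the merge count `num_merges` plays the role of the single hyperparameter mentioned above. Function names and the toy log lines are illustrative only; in practice one would use a BPE library or pre-trained tokenizer.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from raw lines (e.g. preprocessed web log lines).

    num_merges is the single hyperparameter: it controls vocabulary size.
    """
    # Represent every whitespace-separated token as a tuple of characters.
    vocab = Counter()
    for line in corpus:
        for word in line.split():
            vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def tokenize(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

Usage on a toy corpus: `merges = learn_bpe(["GET /index.html 200", "GET /index.php 404"], 10)` followed by `tokenize("GET", merges)`. The resulting subword tokens are what would then be fed to a pre-trained sequence embedding model to obtain the log vectors used for anomaly detection.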
No features needed: Using BPE sequence embeddings for web log anomaly detection
2022