
Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

The anomaly detection (AD) task is to identify the true anomalies among a given set of data instances. AD algorithms score the data instances and produce a ranked list of candidate anomalies, which a human analyst then examines to discover the true anomalies. Ensembles of tree-based anomaly detectors, trained in an unsupervised manner and scored with uniform weights over the ensemble members, have been shown to work well in practice. However, manual analysis can be laborious when the number of false positives is very high. Therefore, in many real-world AD applications, including computer security and fraud prevention, the anomaly detector must be configurable by the human analyst to minimize the effort spent on false positives. One important way to configure the detector is to provide true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the ensemble weights based on label feedback allows true anomalies to be discovered quickly. This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical success of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also present empirical results on real-world data to support our insights and a theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description for describing the discovered anomalies. Third, we develop a novel active learning algorithm for the streaming data setting. We present a data drift detection algorithm that not only detects drift robustly but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning discovers significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms in the streaming-data setting are competitive with the batch setting.
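The greedy query loop mentioned in the abstract (query the top-scoring instance, then adjust the ensemble weights from the label) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `greedy_active_discovery`, the learning rate `lr`, and the perceptron-style weight update are assumptions chosen for brevity, and the ensemble members are represented only by a precomputed score matrix rather than actual trees.

```python
import numpy as np


def greedy_active_discovery(member_scores, oracle_labels, budget, lr=0.5):
    """Illustrative greedy active anomaly discovery loop (hypothetical helper).

    member_scores: array of shape (n_instances, n_members); higher means
        the ensemble member considers the instance more anomalous.
    oracle_labels: array of 0/1 values (1 = anomaly). It stands in for the
        human analyst and is read only for the instances that get queried.
    budget: number of label queries allowed.
    lr: step size of the simplified weight update (an assumption).
    """
    member_scores = np.asarray(member_scores, dtype=float)
    n, m = member_scores.shape

    # Start from uniform weights (unit norm), i.e. the unsupervised ensemble.
    w = np.full(m, 1.0 / np.sqrt(m))
    unlabeled = np.ones(n, dtype=bool)
    queried = []

    for _ in range(min(budget, n)):
        # Score every instance with the current weights.
        scores = member_scores @ w
        scores[~unlabeled] = -np.inf  # never re-query a labeled instance

        # Greedy selection: ask the analyst about the top-scoring instance.
        i = int(np.argmax(scores))
        queried.append(i)
        unlabeled[i] = False

        # Simplified feedback step (NOT the paper's optimization):
        # move the weights toward members that scored a true anomaly
        # highly, and away from members that scored a nominal highly.
        y = 1.0 if oracle_labels[i] == 1 else -1.0
        w = w + lr * y * member_scores[i]
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm

    return queried, w
```

In this sketch, an analyst's "anomaly" label increases the weight of members that ranked the queried instance highly, so later queries favor instances those members also rank highly. A "nominal" label has the opposite effect.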

AI Access Foundation

Shubhomoy Das, Md Rakibul Islam, Nitthilan Kannappan Jayakodi, Janardhan Rao Doppa

Journal of Artificial Intelligence Research

2024



