
Effectiveness of Tree-based Ensembles for Anomaly Discovery: Insights, Batch and Streaming Active Learning

The anomaly detection (AD) task is to identify the true anomalies among a given set of data instances. AD algorithms score the data instances and produce a ranked list of candidate anomalies, which a human analyst then examines to discover the true anomalies. Ensembles of tree-based anomaly detectors that are trained in an unsupervised manner and combine their members' scores with uniform weights have been shown to work well in practice. However, the manual analysis can be laborious when the number of false positives is very high. Therefore, in many real-world AD applications, including computer security and fraud prevention, the analyst must be able to configure the anomaly detector to minimize the effort spent on false positives. One important way to configure the detector is to provide true labels (nominal or anomaly) for a few instances. Recent work on active anomaly discovery has shown that greedily querying the top-scoring instance and tuning the ensemble weights based on label feedback allows true anomalies to be discovered quickly.

This paper makes four main contributions to improve the state of the art in anomaly discovery using tree-based ensembles. First, we provide an important insight that explains the practical success of unsupervised tree-based ensembles and of active learning based on a greedy query selection strategy. We also show empirical results on real-world data to support our insights and theoretical analysis to support active learning. Second, we develop a novel batch active learning algorithm that improves the diversity of discovered anomalies, based on a formalism called compact description that describes the discovered anomalies. Third, we develop a novel active learning algorithm for the streaming data setting. We present a data drift detection algorithm that not only detects drift robustly but also allows us to take corrective actions to adapt the anomaly detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and our tree-based active anomaly discovery algorithms in both batch and streaming data settings. Our results show that active learning allows us to discover significantly more anomalies than state-of-the-art unsupervised baselines, that our batch active learning algorithm discovers diverse anomalies, and that our algorithms in the streaming setting are competitive with the batch setting.
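The greedy feedback loop mentioned in the abstract (query the top-scoring instance, obtain a label, tune the ensemble weights) can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `greedy_active_discovery`, the `oracle` callback, the learning rate `lr`, and the simple additive weight update are all assumptions chosen for illustration, and the per-member scores are taken as given rather than computed from trees.

```python
from typing import Callable, List, Sequence


def greedy_active_discovery(
    member_scores: Sequence[Sequence[float]],
    oracle: Callable[[int], bool],
    budget: int,
    lr: float = 0.5,
) -> List[int]:
    """Return indices of anomalies found within `budget` label queries.

    member_scores[i][m] is the anomaly score that ensemble member m gives
    to instance i (higher means more anomalous). `oracle(i)` stands in for
    the human analyst and returns True if instance i is a true anomaly.
    """
    n = len(member_scores)
    m = len(member_scores[0]) if n else 0
    weights = [1.0 / m] * m if m else []  # start from uniform weights
    labeled = set()
    found: List[int] = []

    def combined(i: int) -> float:
        # Weighted sum of the members' scores for instance i.
        return sum(w * s for w, s in zip(weights, member_scores[i]))

    for _ in range(min(budget, n)):
        # Greedy query: the top-scoring instance that is still unlabeled.
        best = max((i for i in range(n) if i not in labeled), key=combined)
        labeled.add(best)
        is_anomaly = oracle(best)
        if is_anomaly:
            found.append(best)

        # Illustrative update (an assumption, not the paper's rule):
        # reward members that scored a confirmed anomaly highly and
        # penalize members that scored a confirmed nominal highly.
        sign = 1.0 if is_anomaly else -1.0
        weights = [
            max(0.0, w + sign * lr * s)
            for w, s in zip(weights, member_scores[best])
        ]
        total = sum(weights)
        # Renormalize; fall back to uniform if every weight hit zero.
        weights = [w / total for w in weights] if total > 0 else [1.0 / m] * m

    return found


if __name__ == "__main__":
    # Toy data: member 0 is informative, member 1 is misleading.
    scores = [
        [0.9, 0.1], [0.8, 0.2], [0.1, 0.95],
        [0.2, 0.9], [0.1, 0.1], [0.7, 0.3],
    ]
    true_anomalies = {0, 1, 5}
    print(greedy_active_discovery(scores, lambda i: i in true_anomalies, budget=4))
```

In this toy run the first query lands on a false positive that the misleading member favors; the nominal label then shifts weight toward the informative member, so the remaining queries reach the true anomalies. A real system would replace the hand-written scores with the outputs of a tree-based detector and would use the weight-tuning objective described in the paper.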
