Javascript must be enabled to continue!

Trie-based Output Space Itemset Sampling

Abstract Itemset mining methods are techniques to discover relevant patterns in transactional databases. The first approach, called constrained-based pattern mining, is based on exhaustive pattern mining techniques which consist in returning all itemsets that satisfy a given constraint. The main issues that hinder their efficiency are the pattern explosion and the difficulty for a user to set the threshold value. To solve this problem, methods that return the most interesting patterns, called top-k, are also proposed, but they tend to lack diversity, a challenging issue for interactive pattern mining. However, interactive pattern mining is based on fast methods that respond effectively to user demand. To overcome all these problems, output pattern sampling is proposed to draw quickly a set of interesting patterns while guaranteeing a good diversity. Pattern sampling techniques are probabilistic methods that aim to draw a set of interesting patterns where each pattern is drawn with a probability proportional to a given interestingness measure. Nowadays, there are several measures that a user can test when interacting with the same database. In that case, the system should last a few times to consider the new utility measures while guaranteeing an exact draw. So, the cost in time of utility change can be a real problem for output pattern sampling techniques in large databases. In addition, with the current sampling methods, it is necessary to store all data in memory and this storage is prohibitive for large data. To solve these problems, this paper deals with how to structure the data for output pattern sampling under length-based utility measures in large transactional databases. So, we revisit the trie structure initially proposed by D. Knuth to enrich it and then no longer have the need (i) to access the data to sample because the patterns will be directly taken from the enriched trie, (ii) to reprocess the entire dataset when utility changes. The computation of the value of a length-based utility measure is based on the lengths of all the patterns that are present in the database. So, we define a new structure of trie called trie of occurrences, built by our first algorithm TPSpace (Trie-based Pattern Space), which materializes all the occurrences of the patterns in the database. The data compression comes from a factorization of the information via the prefixes of the patterns. The particularly remarkable result is that, by definition, the trie of occurrences is the same for any length-based utility measure provided that the same values are kept for the minimum and maximum length constraints. We then describe TPSampling (Trie-based Pattern Sampling) which performs the sampling by drawing patterns according to a length-based utility measure from the trie of occurrences. This paper is completed by the complexity analysis in memory and in time of the method and experiments on benchmark datasets. TPSampling is competitive with the two-step approach to sample following a given interestingness measure but, as expected, it is more particularly advantageous if several utility measures are used thanks to the generic preprocessing. TPSampling is $10^5$ times faster than Two-Step for reprocessing in utility change.

Springer Science and Business Media LLC

Lamine Diop Cheikh Talibouya Diop Arnaud Giacometti Arnaud Soulet

2022

Title: Trie-based Output Space Itemset Sampling

Description:

Abstract Itemset mining methods are techniques to discover relevant patterns in transactional databases.

The first approach, called constrained-based pattern mining, is based on exhaustive pattern mining techniques which consist in returning all itemsets that satisfy a given constraint.

The main issues that hinder their efficiency are the pattern explosion and the difficulty for a user to set the threshold value.

To solve this problem, methods that return the most interesting patterns, called top-k, are also proposed, but they tend to lack diversity, a challenging issue for interactive pattern mining.

However, interactive pattern mining is based on fast methods that respond effectively to user demand.

To overcome all these problems, output pattern sampling is proposed to draw quickly a set of interesting patterns while guaranteeing a good diversity.

Pattern sampling techniques are probabilistic methods that aim to draw a set of interesting patterns where each pattern is drawn with a probability proportional to a given interestingness measure.

Nowadays, there are several measures that a user can test when interacting with the same database.

In that case, the system should last a few times to consider the new utility measures while guaranteeing an exact draw.

So, the cost in time of utility change can be a real problem for output pattern sampling techniques in large databases.

In addition, with the current sampling methods, it is necessary to store all data in memory and this storage is prohibitive for large data.

To solve these problems, this paper deals with how to structure the data for output pattern sampling under length-based utility measures in large transactional databases.

So, we revisit the trie structure initially proposed by D.

Knuth to enrich it and then no longer have the need (i) to access the data to sample because the patterns will be directly taken from the enriched trie, (ii) to reprocess the entire dataset when utility changes.

The computation of the value of a length-based utility measure is based on the lengths of all the patterns that are present in the database.

So, we define a new structure of trie called trie of occurrences, built by our first algorithm TPSpace (Trie-based Pattern Space), which materializes all the occurrences of the patterns in the database.

The data compression comes from a factorization of the information via the prefixes of the patterns.

The particularly remarkable result is that, by definition, the trie of occurrences is the same for any length-based utility measure provided that the same values are kept for the minimum and maximum length constraints.

We then describe TPSampling (Trie-based Pattern Sampling) which performs the sampling by drawing patterns according to a length-based utility measure from the trie of occurrences.

This paper is completed by the complexity analysis in memory and in time of the method and experiments on benchmark datasets.

TPSampling is competitive with the two-step approach to sample following a given interestingness measure but, as expected, it is more particularly advantageous if several utility measures are used thanks to the generic preprocessing.

TPSampling is $10^5$ times faster than Two-Step for reprocessing in utility change.

Back

Abstract Itemset mining methods are techniques to discover relevant patterns in transactional databases. The first methods, called constrained-based pattern mining, are bas...

Hiding Sensitive Itemsets Using Sibling Itemset Constraints

Data collection and processing progress made data mining a popular tool among organizations in the last decades. Sharing information between companies could make this tool more ben...

Seditious Spaces

The title ‘Seditious Spaces’ is derived from one aspect of Britain’s colonial legacy in Malaysia (formerly Malaya): the Sedition Act 1948. While colonial rule may seem like it was ...

Trie Based Subsumption and Improving the pi-Trie Algorithm

An algorithm that stores the prime implicates of a logical formula in a trie was developed in [Matusiewicz et.al. 2009]. In this paper, an improved version of that pi-trie algorith...

Trie-join

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has a...

PENDETEKSIAN KESALAHAN KETIK DENGAN DAMERAU-LEVENSHTEIN DISTANCE DAN TRIE

Typographical errors are commonly found in text. Many applications implement a spell checking feature to detect and correct typographical errors. Spell checking requires an algorit...

Space Safety through situational awareness

Space Situational Awareness (SSA) entails the detection, tracking, and comprehension of spaceborne objects and phenomena that could potentially affect Earth or space operations. It...

Design and Implementation of Automatic Selection of the Most Efficient Itemset Algorithm Based on Spark

The combination of Spark distributed platform and High-Utility Itemset Mining can solve the problem of long running time issue of High-Utility Itemset Mining. In the experiment, we...

Email:
Password:

Email:

Trie-based Output Space Itemset Sampling

Related Results