Trie-join
A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) they are inefficient for data sets with short strings (average string length no larger than 30); (2) they involve large indexes; (3) they are expensive to maintain when the data sets are updated dynamically. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We index the strings in a trie and use subtrie pruning on that structure to find similar string pairs efficiently. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic updates of the data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.
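The idea of indexing strings in a trie and pruning subtries under an edit-distance threshold can be illustrated with a minimal Python sketch. This is not the paper's trie-join algorithm, whose details the abstract does not give. It is a simpler, well-known technique: each string is searched against the trie while one row of the edit-distance table is computed per trie node, and a subtrie is skipped as soon as every entry in the row exceeds the threshold. The names `TrieNode`, `build_trie`, `search`, `similarity_self_join`, and the threshold parameter `tau` are all hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Optional, Set, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set only on nodes that end a string


def build_trie(strings: List[str]) -> TrieNode:
    """Insert every string into a trie; shared prefixes share nodes."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root


def search(root: TrieNode, query: str, tau: int) -> List[str]:
    """Return indexed strings whose edit distance to `query` is at most tau."""
    results: List[str] = []
    n = len(query)

    def visit(node: TrieNode, ch: str, prev_row: List[int]) -> None:
        # Row of the edit-distance table for the trie prefix ending at `node`.
        row = [prev_row[0] + 1]
        for j in range(1, n + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1,          # insertion
                           prev_row[j] + 1,         # deletion
                           prev_row[j - 1] + cost)) # match or substitution
        if node.word is not None and row[-1] <= tau:
            results.append(node.word)
        # Subtrie pruning: row minima never decrease along a path, so if
        # every entry exceeds tau no descendant can be within distance tau.
        if min(row) <= tau:
            for c, child in node.children.items():
                visit(child, c, row)

    if root.word is not None and n <= tau:  # the empty string was indexed
        results.append(root.word)
    first_row = list(range(n + 1))  # distances from the empty prefix
    for c, child in root.children.items():
        visit(child, c, first_row)
    return results


def similarity_self_join(strings: List[str], tau: int) -> Set[Tuple[str, str]]:
    """All unordered pairs of distinct strings within edit distance tau."""
    root = build_trie(strings)
    pairs: Set[Tuple[str, str]] = set()
    for s in set(strings):
        for t in search(root, s, tau):
            if s < t:  # report each pair once, in sorted order
                pairs.add((s, t))
    return pairs
```

For example, `similarity_self_join(["ebay", "bay", "beagy", "bag", "kobe"], 1)` returns the pairs `("bag", "bay")` and `("bay", "ebay")`. Because the trie is built by plain insertion, adding a new string only touches the nodes on its own path, which is one reason a trie index is a natural fit for data sets that change over time.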
Association for Computing Machinery (ACM)

