
Similarity Search with Data Missing

Similarity search is a fundamental research problem with broad applications in data mining, information retrieval, and machine learning. Its core idea is to find, for a given query item, the data samples in a large-scale database that score highest under a specific similarity metric. Because computation and storage costs can be prohibitive at that scale, effective and fast similarity search algorithms are needed in various scenarios. However, missing data is unavoidable in real-world scenarios; it makes individual similarity scores less accurate and, in turn, yields an inaccurate similarity matrix. Obtaining an accurate similarity matrix from incomplete observations is therefore non-trivial. To solve this problem, we propose a similarity matrix calibration method that estimates a high-quality similarity matrix and thereby improves similarity search performance. First, we propose an objective function that minimizes the difference between the initial inaccurate similarity matrix and the estimated similarity matrix, using the inherent symmetry and positive semi-definiteness (PSD) of a similarity matrix as constraints to guide the calibration process. Then, we design an efficient algorithm that produces a high-quality similarity matrix approximating the ground-truth similarity matrix. Theoretical analysis establishes an efficiency guarantee for the proposed method, and extensive experiments on real-world datasets verify its effectiveness and efficiency on both the similarity matrix calibration task and the downstream similarity search task.
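To make the setting concrete, the following is a minimal sketch of the pipeline the abstract describes, not the authors' algorithm. It assumes missing values are encoded as NaN, that similarity is cosine similarity computed over co-observed features, and that calibration is done with a standard baseline: symmetrizing the matrix and clipping negative eigenvalues, which gives the nearest symmetric PSD matrix in Frobenius norm. The function names `masked_cosine_similarity`, `project_to_psd`, and `top_k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def masked_cosine_similarity(X):
    """Pairwise cosine similarity using only features observed in both rows.

    Assumption: missing entries in X are encoded as NaN.
    The result is symmetric but is not guaranteed to be PSD.
    """
    observed = ~np.isnan(X)
    Xz = np.where(observed, X, 0.0)          # zero-fill so missing entries drop out
    mask = observed.astype(float)
    dot = Xz @ Xz.T                          # inner products over co-observed features
    # norm_sq[i, j] = squared norm of row i restricted to features observed in row j
    norm_sq = (Xz ** 2) @ mask.T
    denom = np.sqrt(norm_sq * norm_sq.T)
    # Pairs with no usable co-observed features get similarity 0.
    return np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)


def project_to_psd(S):
    """Baseline calibration: nearest symmetric PSD matrix in Frobenius norm.

    This is a generic eigenvalue-clipping projection, used here as a stand-in
    for the calibration method proposed in the paper.
    """
    S_sym = (S + S.T) / 2.0                  # enforce symmetry
    eigvals, eigvecs = np.linalg.eigh(S_sym)
    eigvals = np.clip(eigvals, 0.0, None)    # drop negative eigenvalues
    return (eigvecs * eigvals) @ eigvecs.T


def top_k(S, query_index, k):
    """Indices of the k most similar items to the query, excluding the query itself."""
    scores = S[query_index].astype(float)    # astype returns a copy
    scores[query_index] = -np.inf
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 8))
    X[rng.random(X.shape) < 0.4] = np.nan    # simulate incomplete observations
    S_raw = masked_cosine_similarity(X)
    S_cal = project_to_psd(S_raw)
    print("smallest eigenvalue before:", np.linalg.eigvalsh((S_raw + S_raw.T) / 2).min())
    print("smallest eigenvalue after: ", np.linalg.eigvalsh(S_cal).min())
    print("top-2 neighbours of item 0:", top_k(S_cal, 0, 2))
```

A full eigendecomposition costs cubic time in the number of items, so this baseline only illustrates the symmetry and PSD constraints; it makes no claim about the efficiency guarantee discussed in the abstract.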

Related Results

Evaluating the Science to Inform the Physical Activity Guidelines for Americans Midcourse Report
Abstract The Physical Activity Guidelines for Americans (Guidelines) advises older adults to be as active as possible. Yet, despite the well documented benefits of physical a...
Handling Missing Data in COVID-19 Incidence Estimation: Secondary Data Analysis
Abstract Background The COVID-19 pandemic has revealed significant challenges in disease forecasting and in developing a public health response, ...
Long-range superharmonic Josephson current and spin-triplet pairing correlations in a junction with ferromagnetic bilayers
Abstract The long-range spin-triplet supercurrent transport is an interesting phenomenon in the superconductor/ferromagnet ("Equation missing") heterostructure containing noncolline...
Uncovering the consequences of batch effect associated missing values in omics data analysis
Abstract Statistical analyses in high-dimensional omics data are often hampered by the presence of batch effects (BEs) and missing values (MVs), but the interaction between these tw...
Search engines and their search strategies: the effective use by Indian academics
Purpose – The purpose of this paper is to examine the use of various search engines and meta search engines by Indian academics for retrieving information on the we...
An Extension of Gregus Fixed Point Theorem
Abstract Let "Equation missing" be a closed convex subset of a complete metrizable topological vector space "Equation missing" and "Equation missing" a mapping that satisfies "Equat...
Analysis of a Similarity Measure for Non-Overlapped Data
A similarity measure is a measure evaluating the degree of similarity between two fuzzy data sets and has become an essential tool in many applications including data mining, patte...
