Similarity Search with Data Missing

Similarity search is a fundamental problem with broad applications in data mining, information retrieval, and machine learning. Given a query item, the goal is to find the samples in a large-scale database that score highest under a chosen similarity metric. Similarity search can incur prohibitive computation and storage costs, which motivates the design of effective and fast similarity search algorithms for various scenarios. However, missing data is unavoidable in real-world scenarios; it makes individual similarity scores less accurate and, in turn, yields an inaccurate similarity matrix. Obtaining an accurate similarity matrix from incomplete observations is therefore non-trivial. To solve this problem, we propose a similarity matrix calibration method that estimates a high-quality similarity matrix and thereby improves similarity search performance. First, we formulate an objective function that minimizes the difference between the initial inaccurate similarity matrix and the estimated similarity matrix, using the inherent symmetry and positive semi-definiteness (PSD) of similarity matrices as constraints to guide the calibration. Then, we design an efficient algorithm that produces a high-quality similarity matrix approximating the ground-truth similarity matrix. Theoretical analysis provides an efficiency guarantee for the proposed method, and extensive experiments on real-world datasets verify its effectiveness and efficiency on both the similarity matrix calibration task and the downstream similarity search task.
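The calibration idea in the abstract can be illustrated with a small sketch. The abstract does not specify the authors' objective or algorithm in detail, so the code below is not their method. It substitutes a standard technique: the Frobenius-norm projection of a symmetric matrix onto the PSD cone, obtained by clipping negative eigenvalues. The function names (`cosine_similarity_with_missing`, `project_to_psd`, `top_k`), the use of cosine similarity, the encoding of missing entries as NaN, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_with_missing(X):
    """Pairwise cosine similarity computed only over features observed in both rows.
    Missing entries are assumed to be encoded as NaN (an assumption of this sketch)."""
    observed = ~np.isnan(X)
    X0 = np.where(observed, X, 0.0)          # zero out missing entries
    M = observed.astype(float)
    dot = X0 @ X0.T                          # inner products over co-observed features
    sq = (X0 ** 2) @ M.T                     # sq[i, j]: squared norm of row i over features co-observed with row j
    denom = np.sqrt(sq * sq.T)
    S = np.divide(dot, denom, out=np.zeros_like(dot), where=denom > 0)
    np.fill_diagonal(S, 1.0)
    return S

def project_to_psd(S):
    """Nearest symmetric PSD matrix in Frobenius norm: symmetrize, then clip negative eigenvalues.
    This is a generic stand-in, not the calibration algorithm from the paper."""
    S_sym = (S + S.T) / 2.0
    w, V = np.linalg.eigh(S_sym)
    w_clipped = np.clip(w, 0.0, None)
    return (V * w_clipped) @ V.T             # V @ diag(w_clipped) @ V.T

def top_k(S, query_index, k):
    """Indices of the k most similar items to the query, excluding the query itself."""
    order = np.argsort(-S[query_index])
    return [int(j) for j in order if j != query_index][:k]

# Synthetic demo data (hypothetical): 50 samples, 20 features, about 30% of entries missing.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(50, 20))
X_norm = X_full / np.linalg.norm(X_full, axis=1, keepdims=True)
S_true = X_norm @ X_norm.T                   # ground-truth cosine similarity (PSD by construction)

X_missing = X_full.copy()
X_missing[rng.random(X_full.shape) < 0.3] = np.nan

S_init = cosine_similarity_with_missing(X_missing)
S_cal = project_to_psd(S_init)

err_before = np.linalg.norm(S_init - S_true)
err_after = np.linalg.norm(S_cal - S_true)
print("Frobenius error before calibration:", err_before)
print("Frobenius error after calibration: ", err_after)
print("Top-5 neighbours of item 0 (calibrated):", top_k(S_cal, 0, 5))
```

Because the ground-truth similarity matrix is itself symmetric and PSD, projecting onto that convex set cannot move the estimate further from the truth in Frobenius norm, which is why enforcing these constraints is a natural basis for calibration. The full eigendecomposition used here costs cubic time in the number of samples, so it is only practical for small matrices.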

Association for Computing Machinery (ACM)

Changyi Ma, Xuan Song

ACM Transactions on Intelligent Systems and Technology

2025
