
Pseudo labeling and classification of high-dimensional data using visual analytics

Machine learning (ML) works with data consisting of tens up to tens of thousands of measurements (dimensions) per sample. As the number of dimensions and/or samples grows, so does the difficulty of understanding such data and, related to that, of designing ML pipelines that effectively process such data for tasks such as classification. Visualization, and in particular Visual Analytics (VA), has emerged as one of the key approaches that help practitioners understand high-dimensional data and carry out ML engineering tasks. This thesis studies several novel approaches by which VA can help ML (and conversely), as follows. Our work focuses on a visualization technique called dimensionality reduction, or projection, which handles large amounts of high-dimensional data efficiently and effectively. On the ML side, we consider the task of training a typical classifier in the challenging context where only a small number of ground-truth labels are available.

We first propose a pseudo-labeling approach that explores the ability of projections to generate a reduced feature space with enough information to improve feature learning and classifier performance over iterations. We show that the 2D space generated by projections captures the data structure present in high dimensions well enough to support the design of high-performance feature and classifier learning models.

Secondly, we link data separation (DS), visual separation (VS), and classifier performance (CP) by means of pseudo labeling and projections. We use feature spaces with high DS as input to compute high-VS projections. We use these projections to perform pseudo labeling with high propagation accuracies. Finally, we use such labels to train classifiers with a high CP. We show that the chain of implications from high DS to high VS to high CP holds for several types of projection techniques. Hence, such projection techniques are suitable for the task of classifier engineering.

Thirdly, we exploit the aforementioned observation that high VS and high CP are correlated to propose a metric to assess the VS of labeled 2D scatterplots produced by projection techniques. Our metric computes the accuracy of label propagation in the projection space, which is simple and fast to execute. We show that high propagation accuracies match a high VS as assessed by human subjects.

Finally, we join all our contributions to incorporate the user in the ML engineering process. We propose an interactive VA tool that assists users in manually labeling samples by providing additional information in terms of classifier decision boundary maps, projection errors, and inverse projection errors. Our results show that this approach enables users to quickly generate labeled samples that lead to higher classification performance after a few labeling iterations. This contribution shows that both algorithms and humans can exploit projections to build better classifiers.
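The idea of measuring label-propagation accuracy in the projection space can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the 2D projection has already been computed (synthetic Gaussian clusters stand in for it here), and it swaps in plain nearest-labeled-neighbor propagation for whatever propagation algorithm the thesis uses. All function names (`propagate_labels`, `propagation_accuracy`, `make_toy_projection`) are hypothetical.

```python
import math
import random


def propagate_labels(points, labels):
    """Give each unlabeled 2D point (label None) the label of its nearest labeled point.

    Assumption: 1-nearest-neighbor propagation is a stand-in, not the thesis method.
    """
    seeds = [(p, l) for p, l in zip(points, labels) if l is not None]
    if not seeds:
        raise ValueError("at least one labeled sample is required")
    pseudo = []
    for p, l in zip(points, labels):
        if l is not None:
            pseudo.append(l)  # keep ground-truth labels untouched
        else:
            _, nearest_label = min(seeds, key=lambda s: math.dist(p, s[0]))
            pseudo.append(nearest_label)
    return pseudo


def propagation_accuracy(pseudo, truth, labels):
    """Fraction of initially unlabeled points whose pseudo label equals the true label."""
    unlabeled = [i for i, l in enumerate(labels) if l is None]
    if not unlabeled:
        raise ValueError("no unlabeled samples to evaluate")
    return sum(pseudo[i] == truth[i] for i in unlabeled) / len(unlabeled)


def make_toy_projection(n_per_class=100, seed=0):
    """Synthetic stand-in for a 2D projection: two well-separated Gaussian clusters."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (20.0, 0.0)]
    points, truth = [], []
    for cls, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)))
            truth.append(cls)
    return points, truth


if __name__ == "__main__":
    points, truth = make_toy_projection()
    # Keep the true label for every 20th sample only; the rest are unlabeled.
    labels = [t if i % 20 == 0 else None for i, t in enumerate(truth)]
    pseudo = propagate_labels(points, labels)
    print("propagation accuracy:", propagation_accuracy(pseudo, truth, labels))
```

In this sketch, a scatterplot whose classes are visually well separated yields a propagation accuracy close to 1, while overlapping classes lower it, which is the intuition behind using the score as a proxy for visual separation.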

Utrecht University Library

Bárbara Caroline Benato

2024



