Resource efficient distributed computing

University of Massachusetts Dartmouth

Yingjie Wang

2025

Description:

Interest in distributed computing has surged thanks to advances in cluster computing and big data technology. My research explores topics in machine learning and big data technologies related to learning under decentralized resources. One topic in distributed learning is distributing large-scale centralized computation across clustered or multi-core computers. We propose random projection forests (rpForests), a method for fast kNN search. RpForests finds nearest neighbors by combining multiple kNN-sensitive trees, each constructed recursively through a series of random projections. As a tree-based method, rpForests has very low computational complexity, and it achieves remarkable accuracy, with a fast-decaying rate of missed kNNs and a fast-decaying discrepancy in the k-th nearest neighbor distances, as demonstrated on many datasets. The ensemble nature of rpForests makes it easy to parallelize on clustered or multi-core computers; the running time is shown to be nearly inversely proportional to the number of cores or machines.
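
The following is a minimal sketch of the general idea of kNN search with an ensemble of random projection trees, not the thesis implementation. The median split rule, the leaf size, the number of trees, and all function names (`build_tree`, `build_forest`, `find_leaf`, `knn_search`) are assumptions made for illustration.

```python
import math
import random


def build_tree(points, indices, leaf_size, rng):
    """Recursively split `indices` by projecting points onto a random direction."""
    if len(indices) <= leaf_size:
        return indices  # leaf: a list of point indices
    dim = len(points[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    proj = {i: sum(d * x for d, x in zip(direction, points[i])) for i in indices}
    # Assumed split rule: cut at the median of the projected values.
    split = sorted(proj.values())[len(indices) // 2]
    left = [i for i in indices if proj[i] < split]
    right = [i for i in indices if proj[i] >= split]
    if not left:  # all projections tie, so stop splitting here
        return indices
    return (direction, split,
            build_tree(points, left, leaf_size, rng),
            build_tree(points, right, leaf_size, rng))


def build_forest(points, n_trees, leaf_size, seed=0):
    """Build independent trees; each tree could be built on a separate core."""
    rng = random.Random(seed)
    all_indices = list(range(len(points)))
    return [build_tree(points, all_indices, leaf_size, rng) for _ in range(n_trees)]


def find_leaf(tree, query):
    """Route the query down one tree and return the indices stored in its leaf."""
    node = tree
    while isinstance(node, tuple):
        direction, split, left, right = node
        p = sum(d * x for d, x in zip(direction, query))
        node = left if p < split else right
    return node


def knn_search(forest, points, query, k):
    """Pool the leaf candidates from every tree, then rank them by exact distance."""
    candidates = set()
    for tree in forest:
        candidates.update(find_leaf(tree, query))
    return sorted(candidates, key=lambda i: math.dist(points[i], query))[:k]


# Toy data: 500 random points in 5 dimensions (illustrative only).
rng = random.Random(42)
points = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(500)]
forest = build_forest(points, n_trees=10, leaf_size=20)
neighbors = knn_search(forest, points, points[0], k=3)
```

Because the trees are built and queried independently of one another, both steps can be spread over cores or machines, which is the property the ensemble design relies on. Pooling candidates from more trees lowers the chance of missing a true neighbor, at the cost of more distance computations.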

Two other topics treat the data in machine learning as a computing resource. Existing learning algorithms typically assume that all the data reside in one centralized place, yet increasingly the data are located at a number of distributed sites, and we wish to learn over data from all the sites with low communication overhead. In addition, the data of interest often share features with other datasets from multiple sources, and it is desirable to take advantage of such auxiliary datasets. We propose two approaches under this topic: fast, communication-efficient spectral clustering over distributed data, and fuzzy join of data with shared features.

For spectral clustering, we propose a novel framework that enables computation over data from all the physical nodes with minimal communication overhead while delivering a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. The approach allows local parallel computing where the data are located, and the speedup is most substantial when the data are evenly distributed across sites. Experiments show almost no loss in accuracy and a 2x speedup under various settings with two distributed sites. Because the transmitted data need not be in their original form, the framework readily addresses privacy concerns about data sharing in distributed computing.

We also propose an efficient algorithm, fuzzy join, that enhances learning from the given data by leveraging auxiliary data through shared features. Fuzzy join enables the extraction of additional information along the dimensions implied by features that are present in the auxiliary data but not in the given data. Our implementation, based on random projection forests, is efficient, with log-linear computational complexity, and is robust to noise in the data. Experiments demonstrate the practicality of our approach. Fuzzy join extends the scope of the join operation in relational databases by joining on columns that are not index keys and by allowing non-exact matches between rows from different datasets.
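
To make the idea of a non-exact join on shared features concrete, here is a small sketch under stated assumptions. It is not the thesis algorithm: it swaps in a brute-force nearest-neighbor search (quadratic cost) where the abstract describes a random projection forest implementation with log-linear complexity. The function `fuzzy_join`, the `max_dist` threshold, and the column names are all hypothetical.

```python
import math


def fuzzy_join(given, auxiliary, shared, extra, max_dist=float("inf")):
    """Attach `extra` columns from the closest auxiliary row to each given row.

    Rows are dicts. Closeness is Euclidean distance over the `shared` columns.
    Assumes `auxiliary` is non-empty and that shared features are on comparable
    scales. `max_dist` is a hypothetical cutoff: beyond it, no match is made.
    """
    joined = []
    for row in given:
        key = [row[c] for c in shared]
        # Brute-force nearest neighbor (a stand-in for a faster kNN index).
        best = min(auxiliary, key=lambda aux: math.dist(key, [aux[c] for c in shared]))
        dist = math.dist(key, [best[c] for c in shared])
        merged = dict(row)
        for c in extra:
            merged[c] = best[c] if dist <= max_dist else None
        joined.append(merged)
    return joined


# Toy tables: "age" and "income" are shared; "credit_score" exists only in the
# auxiliary data. All values are made up for illustration.
given = [
    {"id": "g1", "age": 34.0, "income": 52.0},
    {"id": "g2", "age": 58.0, "income": 91.0},
]
auxiliary = [
    {"age": 33.5, "income": 51.0, "credit_score": 640},
    {"age": 59.0, "income": 90.0, "credit_score": 780},
    {"age": 21.0, "income": 18.0, "credit_score": 590},
]
joined = fuzzy_join(given, auxiliary, shared=["age", "income"], extra=["credit_score"])
strict = fuzzy_join(given, auxiliary, shared=["age", "income"],
                    extra=["credit_score"], max_dist=1.2)
```

Unlike an exact relational join, no row in `given` has a key that equals a key in `auxiliary`; each row is matched to its closest counterpart on the shared columns, and the optional threshold leaves a row unmatched when nothing is close enough.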
