An LLM-guided Platform for Multi-Granular Collection and Management of Data Provenance
Abstract
As machine learning and AI systems become more prevalent, understanding how they reach their decisions is key to maintaining trust in them. It is widely accepted that knowing how data are altered in the pre-processing phase provides fundamental support for this understanding, and that data provenance can be used to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying the data provenance of data preparation pipelines in data science. We investigate publicly available machine learning pipelines to identify the features the tool needs most in order to cover a broad selection of pre-processing data manipulations. Building on this study, we present an approach for transparently collecting data provenance that uses an LLM to (i) automatically rewrite user-defined pipelines into a format suitable for provenance capture and (ii) store an accurate description of every activity in the input pipelines, to support the explanation of each one. We then illustrate and test implementation choices aimed at capturing provenance for data preparation pipelines efficiently and transparently for data scientists.
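To make the idea of capturing provenance at several granularities concrete, the following is a minimal sketch and not the system described in the abstract. It assumes a table held as a dictionary of rows keyed by row id, and it omits the LLM rewriting step entirely. Every name in it (`ProvenanceRecord`, `ProvenanceLog`, `drop_missing_age`, `normalise_city`) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a table is a dict mapping a stable row id to a row (column -> value).
Row = Dict[str, object]
Table = Dict[int, Row]


@dataclass
class ProvenanceRecord:
    """Hypothetical record of one pre-processing activity."""
    activity: str               # name of the step
    description: str            # human-readable explanation of the step
    rows_removed: List[int]     # row-level granularity
    columns_added: List[str]    # column-level granularity
    columns_removed: List[str]
    cells_changed: int          # cell-level granularity


def columns(table: Table) -> set:
    """Collect every column name that appears in the table."""
    cols = set()
    for row in table.values():
        cols.update(row)
    return cols


@dataclass
class ProvenanceLog:
    records: List[ProvenanceRecord] = field(default_factory=list)

    def run(self, name: str, description: str,
            step: Callable[[Table], Table], table: Table) -> Table:
        """Apply one step and record what it changed."""
        before = {rid: dict(row) for rid, row in table.items()}  # snapshot
        after = step(table)
        cols_before, cols_after = columns(before), columns(after)
        # Count changed values only for rows and columns present on both sides.
        changed = sum(
            1
            for rid in before.keys() & after.keys()
            for col in cols_before & cols_after
            if before[rid].get(col) != after[rid].get(col)
        )
        self.records.append(ProvenanceRecord(
            activity=name,
            description=description,
            rows_removed=sorted(before.keys() - after.keys()),
            columns_added=sorted(cols_after - cols_before),
            columns_removed=sorted(cols_before - cols_after),
            cells_changed=changed,
        ))
        return after


def drop_missing_age(table: Table) -> Table:
    return {rid: row for rid, row in table.items() if row["age"] is not None}


def normalise_city(table: Table) -> Table:
    return {rid: {**row, "city": str(row["city"]).strip().title()}
            for rid, row in table.items()}


data: Table = {
    0: {"age": 34, "city": " rome "},
    1: {"age": None, "city": "Milan"},
    2: {"age": 51, "city": "turin"},
}
log = ProvenanceLog()
clean = log.run("drop_missing_age", "Remove rows with no age value",
                drop_missing_age, data)
clean = log.run("normalise_city", "Trim and title-case city names",
                normalise_city, clean)
```

After the two steps, `log.records` holds one entry per activity, so a question such as "which step removed row 1?" can be answered from the log alone. Snapshotting the whole table before each step is simple but costly, and a real tool would need a cheaper capture strategy.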
Springer Science and Business Media LLC

