Layout inference and table detection in spreadsheet document

Spreadsheet applications have become tools of great importance for businesses, open data, and scientific communities. With these applications, users can perform various transformations, generate new content, and analyze and format data so that they are visually comprehensible. The same data can be presented in different ways, depending on the preferences and intentions of the user. These functionalities make spreadsheets user-friendly, but far less machine-friendly. When it comes to integrating with other sources, the free-for-all nature of spreadsheets is a disadvantage. It is rather difficult to algorithmically infer the structure of the data when they are intermingled with formatting, formulas, layout artifacts, and textual metadata. Therefore, user involvement is often required, which results in cumbersome and time-consuming tasks. Overall, the lack of automatic processing methods limits our ability to explore and reuse a great amount of rich data stored in partially structured documents such as spreadsheets.

In this thesis, we tackle this open challenge, which so far has been scarcely investigated in the literature. Specifically, we are interested in extracting tabular data from spreadsheets, since they hold concise, factual, and to a large extent structured information. Such information is easier to process and to make available to other applications. For instance, spreadsheet (tabular) data can be loaded into databases, where they become instantly available to existing or new business processes. Furthermore, we can eliminate the risk of losing valuable company knowledge by moving data to, or integrating spreadsheets with, more sophisticated information management systems.

To achieve these objectives, we develop a spreadsheet processing pipeline. The requirements for this pipeline were derived from a large-scale empirical analysis of real-world spreadsheets from business and Web settings. Specifically, we propose a series of specialized steps that build on top of each other with the goal of discovering the structure of data in spreadsheet documents. Our approach is bottom-up: it starts from the smallest unit (i.e., the cell) and ultimately arrives at the individual tables of the sheet. Additionally, this thesis makes use of sophisticated machine learning and optimization techniques, which we apply to layout analysis and table detection in spreadsheets. We target highly diverse sheet layouts, with one or multiple tables and arbitrary arrangement of contents. Moreover, we anticipate the presence of textual metadata and other non-tabular data in the sheet. Furthermore, we handle even problematic tables (e.g., those containing empty rows or columns and missing values). Finally, we bring flexibility to our approach. This allows us not only to tackle the above-mentioned challenges but also to reuse our solution for different (spreadsheet) datasets.

Spreadsheets are used extensively in many different domains and contexts, since they provide a wide range of basic and advanced data management functionality. They thereby support the collection, transformation, analysis, and visualization of data. At the same time, spreadsheets have a friendly and intuitive interface and a very low cost of adoption. Well-known spreadsheet applications such as OpenOffice, LibreOffice, Google Sheets, and Gnumeric can be used free of charge, and others, such as Microsoft Excel, are within reach of the great majority of users. As a result, spreadsheets have become very popular with novices and professionals alike, and a large volume of valuable data resides in these documents.

Of particular interest are the data presented in tabular form within spreadsheets, since they provide concrete, factual, and partially structured information. Consequently, there is interest in transferring tabular data from spreadsheets to databases. This would allow spreadsheets to become a direct source of data for business processes, and the data to be loaded into data warehouses and integrated with other sources. Going a step further, spreadsheets, together with other raw documents, can be stored in advanced centralized data repositories such as data lakes. Once in the data lake, they can be used (on demand) for various tasks and applications. All in all, the goal is to make the data stored in spreadsheets accessible.

Nevertheless, there are considerable challenges in the automatic processing and understanding of these documents. Spreadsheets are designed primarily for human consumption and therefore favor customization and visual comprehension. Data are often intertwined with formatting, formulas, layout artifacts, and textual metadata, which carry domain-specific or even user-specific information. The same sheet can contain several tables with different structures and layouts. Moreover, the format of each table is not declared a priori; that is, there is no mechanism for defining the structure of a table, as there is in databases. For this reason, spreadsheets are known as partially structured data sources, with a significant degree of implicit information.

In the literature, the automatic understanding of data stored in spreadsheets has been investigated only superficially, often assuming the same uniform table format across all spreadsheets. However, because of the many possible ways of structuring tabular data in spreadsheets, the assumption of a uniform layout either excludes a substantial number of tables from the extraction process or leads to inaccurate results. In this thesis, we address fundamental tasks that contribute to more accurate information extraction from spreadsheets. We propose intuitive and effective methods for layout analysis and table detection in spreadsheets. One of our main goals is to eliminate most of the assumptions of the current state of the art. To do so, we consider highly heterogeneous tabular structures, contained in spreadsheets with one or more tables. Additionally, we anticipate the presence of metadata and other kinds of non-tabular data in the same sheet. Finally, we use optimization and machine learning techniques to identify the structure of the tables. This brings flexibility to our approach, allowing it to work even with complex or malformed tables. This flexibility makes our methods transferable to new sets of spreadsheets with data from other domains. Therefore, we are not limited to particular domains or configurations.
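
To make the bottom-up idea in the abstract more concrete (from individual cells to table regions), the sketch below shows a deliberately naive baseline. It is not the method of the thesis, which relies on machine learning and optimization; it is a plain adjacency heuristic swapped in for illustration. The names `is_filled` and `candidate_regions`, and the choice to represent a sheet as a list of rows, are assumptions made for this example only.

```python
from collections import deque
from typing import List, Optional, Tuple

Grid = List[List[Optional[str]]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), all inclusive


def is_filled(value: Optional[str]) -> bool:
    """A cell counts as filled when it holds non-whitespace content."""
    return value is not None and str(value).strip() != ""


def candidate_regions(grid: Grid) -> List[Box]:
    """Group adjacent filled cells and return one bounding box per group.

    Hypothetical baseline: cells are linked when they touch horizontally,
    vertically, or diagonally. Boxes are returned in row-major order of
    the first cell found in each group.
    """
    seen = set()
    boxes: List[Box] = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if (r, c) in seen or not is_filled(value):
                continue
            # Breadth-first search over the 8 neighbouring cells.
            queue = deque([(r, c)])
            seen.add((r, c))
            top, left, bottom, right = r, c, r, c
            while queue:
                cr, cc = queue.popleft()
                top, bottom = min(top, cr), max(bottom, cr)
                left, right = min(left, cc), max(right, cc)
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = cr + dr, cc + dc
                        if (nr, nc) in seen:
                            continue
                        in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                        if in_bounds and is_filled(grid[nr][nc]):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
            boxes.append((top, left, bottom, right))
    return boxes
```

On a sheet with a title cell in the first row, a blank second row, and a small table whose third column is empty, this heuristic reports the title as its own region and splits the table in two at the empty column. That failure is exactly the kind of case (empty rows or columns, metadata mixed with tables) that the abstract says a uniform-layout assumption cannot handle, which is why a more flexible approach is needed.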

Universitat Politècnica de Catalunya

Elvis Koci

2024
