
A Novel Approach to Data Extraction on Hyperlinked Webpages

The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail on certain attribute values. Extracting the schema of such relational tables is challenging because there is no standard format and few published algorithms. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. For non-linked tables, a simple schema was extracted; for linked tables, a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table's attribute value (the PK), which serves as their foreign key (FK). As a result, these tables support more expressive queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and the novel algorithm provide a basis for further research in this field.
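
The parent/child relationship described above can be illustrated with a minimal sketch. It is not the authors' implementation: the table names `parent` and `child`, their columns, and the helper functions `build_linked_schema` and `join_parent_child` are all hypothetical, and SQLite is used only as a convenient relational store. The sketch assumes the parent table's attribute value (the PK) has already been attached to every row of the child table extracted from the linked page.

```python
import sqlite3


def build_linked_schema(parent_rows, child_rows):
    """Load a parent table and its hyperlinked child table into SQLite.

    parent_rows: (name, country) tuples; `name` plays the role of the PK.
    child_rows: (parent_name, attribute, value) tuples; `parent_name` is the
    PK value of the parent row whose hyperlink led to the child page (the FK).
    All table and column names here are hypothetical, chosen for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the FK constraint
    conn.execute("CREATE TABLE parent (name TEXT PRIMARY KEY, country TEXT)")
    conn.execute(
        "CREATE TABLE child ("
        " parent_name TEXT NOT NULL REFERENCES parent(name),"
        " attribute TEXT,"
        " value TEXT)"
    )
    conn.executemany("INSERT INTO parent VALUES (?, ?)", parent_rows)
    conn.executemany("INSERT INTO child VALUES (?, ?, ?)", child_rows)
    return conn


def join_parent_child(conn):
    """Join each parent row with the rows of its linked child table."""
    return conn.execute(
        "SELECT p.name, p.country, c.attribute, c.value "
        "FROM parent AS p JOIN child AS c ON c.parent_name = p.name "
        "ORDER BY p.name, c.attribute"
    ).fetchall()
```

With invented rows such as a parent entry `("Alpha University", "PK")` and child entries `("Alpha University", "founded", "1950")`, the join returns one combined row per child entry, which is the kind of query a flat, unlinked table cannot answer on its own.
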

Related Results

Utilizing Large Language Models for Geoscience Literature Information Extraction
Extracting information from unstructured and semi-structured geoscience literature is a crucial step in conducting geological research. The traditional machine learning extraction ...
Trends in web data extraction using machine learning
Web data extraction has seen significant development in the last decade since its inception in the early nineties. It has evolved from a simple manual way of extracting data from w...
Extraction of Mogroside and Limonin with Different Extraction Methods and its Modeling
The extraction yields of mogroside from Siraitia grosvenorii fruits and limonin from orange (Citrus reticulata Blanco) seeds were compared with different extraction methods, respec...
EWOD Based Liquid-Liquid Extraction and Separation
Liquid-liquid extraction techniques are one of the major tools in chemical engineering, analytical chemistry, and biology, especially in a system where two immiscible liquids have ...
Automated evaluation of accessibility issues of webpage content: tool and evaluation
In recent years, there has been a growing field of research focused on comprehending complexity in relation to web platform accessibility. It has shown that it i...
Evaluation of a Hyperlinked Consumer Health Dictionary for Reading EHR Notes
In this paper, we report on a pilot study conducted to test the usefulness and understandability of definitions in a Consumer Health Dictionary (IVS-CHD). Our two main goals for th...
Response Surface Analysis on Multiple Parameter Effects on Borehole Gas Extraction Efficiency
To explore the impact of different factors on the effectiveness of borehole gas extraction, in situ stress tests were conducted in a test mining area. A theoretical model of gas mi...
