
A Novel Approach to Data Extraction on Hyperlinked Webpages

The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that give further detail about certain attribute values. Extracting the schema of such relational tables is challenging because there is no standard format and published algorithms are lacking. We downloaded 15,000 web pages from various websites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema in the form of primary and foreign keys (PK and FK) was developed. Child tables were concatenated with the parent table’s attribute value (PK), which serves as the foreign key (FK). As a result, these tables can support more powerful queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and the novel algorithm will provide a basis for further research in this field.
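The PK/FK linking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the table contents, the column names (`team`, `team_fk`), and the helper functions `attach_foreign_key` and `join` are all hypothetical, chosen only to show how a child table tagged with the parent's attribute value can later be joined back to the parent.

```python
from typing import Dict, List

Row = Dict[str, str]


def attach_foreign_key(child_rows: List[Row], fk_name: str, pk_value: str) -> List[Row]:
    """Return copies of the child rows with the parent's PK value added as an FK column."""
    return [{fk_name: pk_value, **row} for row in child_rows]


def join(parent: List[Row], child: List[Row], pk: str, fk: str) -> List[Row]:
    """Inner join: pair each parent row with the child rows whose FK equals its PK."""
    by_fk: Dict[str, List[Row]] = {}
    for c in child:
        by_fk.setdefault(c[fk], []).append(c)
    joined: List[Row] = []
    for p in parent:
        for c in by_fk.get(p[pk], []):
            merged = dict(p)
            # Drop the FK column, since it duplicates the parent's PK value.
            merged.update({k: v for k, v in c.items() if k != fk})
            joined.append(merged)
    return joined


# Hypothetical parent table: each "team" value is hyperlinked to a detail page.
parent = [
    {"team": "Lions", "city": "Lahore"},
    {"team": "Hawks", "city": "Sydney"},
]

# Hypothetical tables found on the linked pages, keyed by the linking attribute value.
linked_pages = {
    "Lions": [{"player": "A. Khan", "role": "batter"},
              {"player": "B. Ali", "role": "bowler"}],
    "Hawks": [{"player": "C. Smith", "role": "keeper"}],
}

# Build one child table, tagging every row with the parent's PK value as its FK.
child: List[Row] = []
for pk_value, rows in linked_pages.items():
    child.extend(attach_foreign_key(rows, "team_fk", pk_value))

result = join(parent, child, pk="team", fk="team_fk")
for row in result:
    print(row)
```

Under these assumptions, a query such as "players and the city of their team" becomes a single join over the two tables, which is the kind of query the abstract says the extracted schema is meant to support.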

MDPI AG

Kamran Shaukat, Nayyer Masood, Matloob Khushi

Applied Sciences

2019



