A Novel Approach to Data Extraction on Hyperlinked Webpages
The World Wide Web holds an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail about certain attribute values. Extracting the schema of such relational tables is challenging because there is no standard format and a lack of published algorithms. We downloaded 15,000 web pages from various websites using an in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRFs) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, or non-linked tables. For non-linked tables, a simple schema was extracted; for linked tables, a relational schema in the form of primary and foreign keys (PKs and FKs) was developed. Child tables were concatenated with the parent table’s attribute value (the PK), which serves as the foreign key (FK). As a result, these tables can support richer queries using the join operation. Manual checking of the linked web table results showed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and the novel algorithm provide a basis for further research in this field.
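The parent/child linking step described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the table contents, the column names (`name`, `department`, `course`, `year`, `parent_name`), and the helper functions are all hypothetical, and the CRF row classification and NFA table-type detection are assumed to have already run. The sketch only shows how concatenating the parent's attribute value onto each child row produces a foreign key that a join can use.

```python
import sqlite3

# Hypothetical parent table scraped from one page; assume each "name" cell
# was a hyperlink to a detail page containing a child table.
parent_rows = [
    {"name": "Alice", "department": "Physics"},
    {"name": "Bob", "department": "Chemistry"},
]

# Hypothetical child tables, keyed by the parent attribute value (the PK)
# whose hyperlink led to them.
child_tables = {
    "Alice": [{"course": "Optics", "year": 2019},
              {"course": "Mechanics", "year": 2020}],
    "Bob": [{"course": "Kinetics", "year": 2020}],
}


def attach_foreign_key(tables, fk_column):
    """Prepend the parent's attribute value to every child row as an FK."""
    linked = []
    for parent_value, rows in tables.items():
        for row in rows:
            linked.append({fk_column: parent_value, **row})
    return linked


def build_database(parents, children):
    """Load both tables into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parent (name TEXT PRIMARY KEY, department TEXT)")
    conn.execute(
        "CREATE TABLE child ("
        "parent_name TEXT REFERENCES parent(name), course TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO parent VALUES (:name, :department)", parents)
    conn.executemany(
        "INSERT INTO child VALUES (:parent_name, :course, :year)", children
    )
    return conn


def courses_by_department(conn):
    """Join child rows to parent rows through the PK/FK pair."""
    query = (
        "SELECT p.department, c.course "
        "FROM parent AS p JOIN child AS c ON c.parent_name = p.name "
        "ORDER BY p.department, c.course"
    )
    return conn.execute(query).fetchall()


child_rows = attach_foreign_key(child_tables, "parent_name")
conn = build_database(parent_rows, child_rows)
result = courses_by_department(conn)
print(result)
```

The join answers a question neither page could answer alone (which courses belong to which department), which is the kind of query the extracted relational schema is meant to support.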

