
Utilizing Large Language Models for Geoscience Literature Information Extraction

Extracting information from unstructured and semi-structured geoscience literature is a crucial step in geological research. The traditional machine learning extraction paradigm requires a large amount of high-quality, manually annotated training data, which is time-consuming and labor-intensive to produce and does not transfer easily to new fields. Recently, large language models (LLMs) such as ChatGPT, GPT-4, and LLaMA have shown strong performance on a range of natural language processing (NLP) tasks, including question answering, machine translation, and text generation. A substantial body of work has demonstrated that LLMs possess strong in-context learning (ICL) and even zero-shot learning capabilities, allowing them to solve downstream tasks without task-specific supervised fine-tuning.

In this paper, we propose using LLMs for geoscience literature information extraction. Specifically, we design a hierarchical PDF parsing pipeline and an automated knowledge extraction process, which together can significantly reduce the need for manual data annotation and assist geoscientists in literature data mining.

The hierarchical PDF parsing pipeline has three steps. First, a document layout detection model fine-tuned on geoscience literature detects the layout of the document. Second, based on that layout information, an optical character content parsing model parses the content, yielding the text structure and the corresponding plain text. Finally, the text structure and plain text are combined and reconstructed into the parsed structured data.

The automated knowledge extraction process begins by segmenting the parsed long text into paragraphs so that it fits within the input length limit of LLMs. A few-shot prompting method is then used for structured knowledge extraction, which covers two tasks: attribute value extraction and triplet extraction.
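The segmentation and few-shot prompting steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `segment_paragraphs` and `build_few_shot_prompt`, the use of a character count as a stand-in for the model's token limit, and the `Text:`/`Answer:` prompt layout are all assumptions made for the example.

```python
from typing import List, Tuple


def segment_paragraphs(text: str, max_chars: int = 2000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    Assumption: character count approximates the LLM input limit.
    A single paragraph longer than max_chars is kept whole as its own
    chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_few_shot_prompt(
    task: str, examples: List[Tuple[str, str]], chunk: str
) -> str:
    """Assemble a few-shot prompt: instruction, worked examples, then the query.

    The layout is hypothetical; any consistent format would serve.
    """
    parts = [task.strip(), ""]
    for source, answer in examples:
        parts += [f"Text: {source}", f"Answer: {answer}", ""]
    parts += [f"Text: {chunk}", "Answer:"]
    return "\n".join(parts)
```

Each chunk returned by `segment_paragraphs` would be passed to `build_few_shot_prompt` together with a handful of hand-written examples, and the resulting string sent to whichever LLM is in use.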
In attribute value extraction, the LLMs automatically generate prompts from the subdomain and attribute names, which helps locate and extract the values associated with those attribute names in the text. For triplet extraction, the LLMs proceed step by step through entity extraction, entity type extraction, and relation extraction, following the knowledge graph structure pattern. Finally, the extracted structured knowledge is stored as knowledge graphs, facilitating further analysis and integration of the various types of knowledge in the literature.

The proposed approach is simple, flexible, and highly effective for geoscience literature information extraction. Demonstrations in subdomains such as radiolarian fossils and fluvial facies yielded satisfactory results: extraction efficiency improved significantly, and feedback from domain experts indicates relatively high extraction accuracy. The extracted results can be used to build a foundational knowledge graph for geoscience literature, supporting the comprehensive construction and efficient application of a geoscience knowledge graph.
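As a rough sketch of how extracted triplets might be stored as a knowledge graph, the example below keeps entity types and relations in plain in-memory dictionaries. The `Triplet` and `KnowledgeGraph` names, their fields, and the placeholder entities in the usage note are hypothetical; the abstract does not say which storage format or graph database is used.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One extracted (head, relation, tail) fact with entity types."""
    head: str
    head_type: str
    relation: str
    tail: str
    tail_type: str


class KnowledgeGraph:
    """Minimal in-memory graph: typed nodes and labeled directed edges."""

    def __init__(self) -> None:
        self.node_types: Dict[str, str] = {}
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, t: Triplet) -> None:
        # Record the type of both entities, then the edge (skipping duplicates).
        self.node_types[t.head] = t.head_type
        self.node_types[t.tail] = t.tail_type
        edge = (t.relation, t.tail)
        if edge not in self.edges[t.head]:
            self.edges[t.head].append(edge)

    def neighbors(self, entity: str) -> List[Tuple[str, str]]:
        """Return (relation, tail) pairs leaving the given entity."""
        return list(self.edges.get(entity, []))
```

For instance, adding `Triplet("Entity A", "Fossil", "found_in", "Formation B", "Stratum")` (placeholder values, not data from the paper) and then calling `neighbors("Entity A")` returns the outgoing `(relation, tail)` pairs for that entity.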

Related Results

Hubungan Perilaku Pola Makan dengan Kejadian Anak Obesitas
Primerjalna književnost na prelomu tisočletja
In a comprehensive and at times critical manner, this volume seeks to shed light on the development of events in Western (i.e., European and North American) comparative literature ...
Evaluating the Science to Inform the Physical Activity Guidelines for Americans Midcourse Report
The Physical Activity Guidelines for Americans (Guidelines) advises older adults to be as active as possible. Yet, despite the well documented benefits of physical a...
International Geoscience Information Standards, Management and Governance
International standards are important for communication of geoscience information across borders and between countries, and in particular for addressing multinational and global is...
A Wideband mm-Wave Printed Dipole Antenna for 5G Applications
In this paper, a wideband millimeter-wave (mm-Wave) printed dipole antenna is proposed to be used for fifth generation (5G) communications. The single elem...
Reimagining Geoscience Education for Sustainability
Geoscience is crucial for addressing sustainability challenges related to climate change, the energy transition, water resources management, and natural hazards. However, the capac...
A Roadmap to Strengthen Geoscience Education for Sustainable Development in Kenya
Meeting the targets of the 17 United Nations (UN) Sustainable Development Goals (SDGs) requires contributions from geoscientists. Like most countries, Kenya is faced with the tripl...
Generación de modelos de procesos y decisiones a partir de documentos de texto
(English) This thesis addresses the importance of formal models for the efficient management of business processes (BPM) and business decision management (BDM) in a constantly evol...
