
Over-Sampling Effect in Pre-Training for Bidirectional Encoder Representations from Transformers (BERT) to Localize Medical BERT and Enhance Biomedical BERT (Preprint)

BACKGROUND Pre-training large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural language processing. With the introduction of transformer-based language models, such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has significantly improved for both the general and medical domains. However, it is difficult to train domain-specific BERT models that perform well for domains in which there are few large, high-quality, publicly available databases.

OBJECTIVE We hypothesize that this problem can be addressed by over-sampling a domain-specific corpus and using it for pre-training together with a larger corpus in a balanced manner. In this study, we verify our hypothesis by developing pre-trained models using our method and evaluating their performance.

METHODS Our proposed method was based on simultaneous pre-training after over-sampling. We conducted three experiments in which we generated (1) an English biomedical BERT from a small biomedical corpus, (2) a Japanese medical BERT from a small medical corpus, and (3) an enhanced biomedical BERT pre-trained from complete PubMed abstracts in a balanced manner. We compared their performance with that of conventional models.

RESULTS We first confirmed that our English BERT, pre-trained using both a general corpus and a small medical-domain corpus, performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, with the general-domain corpus size held constant, our proposed method was more effective than conventional methods at each biomedical corpus size tested. Next, our Japanese medical BERT outperformed the other BERT models built using a conventional method on the medical document classification task, showing the same trend as the first experiment in English. Lastly, our enhanced biomedical BERT model, in which clinical notes were not used during pre-training, achieved both clinical and biomedical scores on the BLUE benchmark that were 0.3 points above those of the model trained without our proposed method.

CONCLUSIONS Well-balanced pre-training by over-sampling instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.
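The over-sampling idea described in METHODS can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how corpus size is measured or how instances are generated, so the whitespace word count, the repeat-until-matched loop, and the names `oversample` and `balanced_mix` are all assumptions made for illustration.

```python
import random


def oversample(docs, target_size, size_fn=len, seed=0):
    """Repeat documents from a small corpus until their total size
    reaches target_size (hypothetical helper, not from the paper)."""
    if not docs or sum(size_fn(d) for d in docs) == 0:
        raise ValueError("docs must contain at least one non-empty document")
    rng = random.Random(seed)
    out, total = [], 0
    while total < target_size:
        epoch = docs[:]          # one full pass over the small corpus
        rng.shuffle(epoch)       # vary the order on every pass
        for doc in epoch:
            out.append(doc)
            total += size_fn(doc)
            if total >= target_size:
                break
    return out


def balanced_mix(domain_docs, general_docs, seed=0):
    """Over-sample the domain corpus to roughly the size of the general
    corpus, then shuffle both together for simultaneous pre-training.
    Size is measured in whitespace-separated words (an assumption)."""
    def word_count(doc):
        return len(doc.split())

    general_size = sum(word_count(d) for d in general_docs)
    domain_over = oversample(domain_docs, general_size, word_count, seed)
    mixed = [("domain", d) for d in domain_over]
    mixed += [("general", d) for d in general_docs]
    random.Random(seed).shuffle(mixed)
    return mixed
```

In this sketch, each source contributes a similar number of words to the mixed stream, so the small domain corpus is not drowned out by the larger general corpus. A real pipeline would build masked-language-model instances from the mixed documents rather than shuffling raw text.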

JMIR Publications Inc.

Shoya Wada Toshihiro Takeda Katsuki Okada Shirou Manabe Shozo Konishi Jun Kamohara Yasushi Matsumura

2022



