Over-Sampling Effect in Pre-Training for Bidirectional Encoder Representations from Transformers (BERT) to Localize Medical BERT and Enhance Biomedical BERT (Preprint)
BACKGROUND
Pre-training large-scale neural language models on raw texts has made a significant contribution to improving transfer learning in natural language processing. With the introduction of transformer-based language models, such as bidirectional encoder representations from transformers (BERT), the performance of information extraction from free text has significantly improved for both the general and medical domains; however, it is difficult to train specific BERT models that perform well for domains in which there are few publicly available databases of high quality and large size.
OBJECTIVE
We hypothesize that this problem can be addressed by over-sampling a domain-specific corpus and using it for pre-training together with a larger corpus in a balanced manner. In this study, we test this hypothesis by building pre-trained models with our method and evaluating their performance.
METHODS
Our proposed method over-samples the domain-specific corpus and then pre-trains on it simultaneously with the larger corpus. We conducted three experiments in which we generated (1) an English biomedical BERT from a small biomedical corpus, (2) a Japanese medical BERT from a small medical corpus, and (3) an enhanced biomedical BERT pre-trained on complete PubMed abstracts in a balanced manner, and compared their performance with that of conventional models.
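The over-sampling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the document-level duplication, the 1:1 target ratio, and the function names `oversample` and `build_balanced_mix` are all assumptions made for illustration, since the abstract does not specify the sampling unit or ratio.

```python
import random


def oversample(small_corpus, target_size, seed=0):
    """Repeat documents from small_corpus until it holds target_size items.

    Assumption: whole documents are duplicated. Each document appears
    either floor(target/n) or floor(target/n) + 1 times.
    """
    if not small_corpus:
        raise ValueError("small_corpus must not be empty")
    rng = random.Random(seed)
    full_copies, remainder = divmod(target_size, len(small_corpus))
    # Full passes over the corpus, plus a random subset to reach the target.
    out = list(small_corpus) * full_copies
    out += rng.sample(list(small_corpus), remainder)
    rng.shuffle(out)
    return out


def build_balanced_mix(general_corpus, domain_corpus, seed=0):
    """Combine a large general corpus with an over-sampled domain corpus.

    Assumption: "balanced" is taken to mean a 1:1 document ratio.
    """
    target = len(general_corpus)
    if len(domain_corpus) < target:
        domain_part = oversample(domain_corpus, target, seed)
    else:
        domain_part = list(domain_corpus)
    mixed = list(general_corpus) + domain_part
    # Shuffle so both sources are seen together during pre-training.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy usage with placeholder documents (hypothetical data).
general = [f"general-{i}" for i in range(10)]
domain = ["domain-0", "domain-1", "domain-2"]
mix = build_balanced_mix(general, domain)
```

In this toy case the three domain documents are repeated until they match the ten general documents, so the shuffled mix contains equal numbers of documents from each source.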
RESULTS
We first confirmed that our English BERT, pre-trained using both general and small medical-domain corpora, performed sufficiently well for practical use on the biomedical language understanding evaluation (BLUE) benchmark. Moreover, our proposed method was more effective than conventional methods at every biomedical corpus size tested, with the general-domain corpus size held constant. Next, our Japanese medical BERT outperformed the other BERT models built using a conventional method on a medical document classification task, showing the same trend as the first experiment in English. Lastly, our enhanced biomedical BERT model, in which clinical notes were not used during pre-training, achieved both clinical and biomedical scores on the BLUE benchmark that were 0.3 points above those of the model trained without our proposed method.
CONCLUSIONS
Well-balanced pre-training by over-sampling instances derived from a corpus appropriate for the target task allowed us to construct a high-performance BERT model.
JMIR Publications Inc.

