
Hybrid Chinese text classification approach using general knowledge from Baidu Baike

Most previous studies have addressed the text classification (TC) task by enriching text representation. However, conventional classification approaches that apply the vector space model (VSM) to Chinese text study only the words and their relationships within a specific corpus or dataset. They ignore the basic concepts of the categories and the general knowledge behind the words, which people learn and use to recognize entities. This paper also focuses on enriching text representation and proposes a novel approach that supplements it with information from the online Chinese encyclopedia Baidu Baike for Chinese TC. The similarities between each text and both the concept of each category and its most related words from Baidu Baike are added to the feature space. The performance of the proposed approach is measured on the Fudan University TC corpus, an imbalanced Chinese dataset. In the experiments, the proposed Baidu Baike-based concept similarity approach obtains promising results compared with a previous study and the conventional method: macro-precision of 90.31%, recall of 75.45%, and F1 score of 80.32%, which are about 0.02%, 0.15%, and 0.12% higher, respectively, than the conventional method. The approach clearly improves recall for some small categories while keeping precision at a high level and improving the macro F1 score. Moreover, the proposed approach is readily extensible: many other knowledge bases could be integrated, and many other concepts could be drawn on, to improve its effectiveness. © 2016 Institute of Electrical Engineers of Japan. Published by John Wiley & Sons, Inc.
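The idea of adding text-to-concept similarities to the feature space can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the term weighting, the similarity measure, or how Baidu Baike entries are retrieved. The sketch assumes raw term counts, cosine similarity, pre-segmented tokens, and a hypothetical dictionary `concept_tokens_by_category` standing in for encyclopedia text about each category. The function names and toy English tokens are likewise illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment_features(doc_tokens, vocabulary, concept_tokens_by_category):
    """Return VSM term features followed by one similarity per category.

    Assumptions (not from the paper): raw counts as term weights, cosine
    as the similarity measure, categories ordered alphabetically.
    """
    doc_vec = Counter(doc_tokens)
    # Conventional VSM part: one dimension per vocabulary term.
    base = [float(doc_vec[term]) for term in vocabulary]
    # Added part: similarity between the text and each category's concept text.
    sims = [
        cosine(doc_vec, Counter(concept_tokens_by_category[category]))
        for category in sorted(concept_tokens_by_category)
    ]
    return base + sims


# Hypothetical stand-ins for encyclopedia text describing each category.
concept_tokens_by_category = {
    "sports": ["ball", "team", "match"],
    "economy": ["market", "price", "trade"],
}
vocabulary = ["ball", "market", "team"]
features = augment_features(["team", "ball", "ball"], vocabulary,
                            concept_tokens_by_category)
```

Here `features` holds three term dimensions followed by two similarity dimensions (for `economy` and `sports`, in that order). A classifier trained on such vectors sees category-level evidence even when a text shares few words with the training corpus, which is one plausible reading of why recall on small categories could improve.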
