The Specification of POS Tagging of the Hong Kong University Cantonese Corpus

The Hong Kong University Cantonese Corpus (HKUCC) was collected from transcribed spontaneous speech in conversations and radio programs that involved two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes using the part-of-speech (POS) scheme of Yu et al. (2002). This scheme was designed for tagging written Mandarin texts and encountered some problems when applied to spoken Cantonese. However, it is flexible: its 26 basic word classes can be extended with customized subclasses for annotating other Chinese dialects (e.g., Cantonese). Its robustness was demonstrated by the annotation of approximately 230,000 words in the HKUCC. This article describes the format of the corpus and provides a specification that guides annotators in POS tagging and resolves problems encountered in manual annotation. Guidelines for tagging some word classes are introduced, followed by a discussion of easily confused tags, illustrated with examples from the corpus. Further work will aim at automatic annotation in order to facilitate POS tagging of Cantonese and other Chinese dialects. Corpora of Hong Kong Cantonese are scarce. Past work focused either on a POS-tagged corpus of child language or on the phonetic transcription of an adult Cantonese corpus. The HKUCC fills the gap by providing a POS-tagged corpus of adult Cantonese and is believed to be of great value to data-driven linguistic analysis and natural language processing for Cantonese.

