ChineseNLP

Chinese Word Embeddings

Background

A word embedding model ingests a large corpus of text and outputs, for each word type, an n-dimensional vector of real numbers. This vector captures syntactic and semantic information about the word that can be employed to solve various NLP tasks. In Chinese, the unit of encoding may be a character or a sub-character unit (such as a radical or component), rather than a word.

Example

Input:

Large corpus of text

Output:

“查询”, vec(W) = [-0.059569, 0.126913, 0.273161, 0.225467, -0.185914, 0.018743, -0.18434, 0.083859, -0.115781, -0.216993, 0.063437, -0.005511, 0.276968,…, 0.254486]
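Vectors like the one above are typically compared with cosine similarity: semantically related words should score higher than unrelated ones. A minimal sketch with toy 4-dimensional vectors (the values and word choices are illustrative, not from a trained model):

```python
import math

# Toy 4-dimensional embeddings (illustrative values, not from a real model).
embeddings = {
    "查询": [-0.06, 0.13, 0.27, 0.23],   # "query"
    "搜索": [-0.05, 0.11, 0.30, 0.20],   # "search" (semantically close)
    "香蕉": [0.40, -0.22, 0.01, -0.15],  # "banana" (unrelated)
}

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Related words should score higher than unrelated ones.
sim_related = cosine(embeddings["查询"], embeddings["搜索"])
sim_unrelated = cosine(embeddings["查询"], embeddings["香蕉"])
print(sim_related > sim_unrelated)  # expect True
```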

Standard Metrics

Word vectors can be evaluated intrinsically (e.g., whether similar words get similar vectors) or extrinsically (e.g., to what extent word vectors can improve a sentiment analyzer).

Intrinsic evaluation measures the quality of the vectors directly, e.g., on word similarity and word analogy benchmarks.

Extrinsic evaluation measures how much the vectors improve a downstream task, e.g., sentiment analysis or name tagging.

See, e.g., Torregrossa et al., 2020 for a more detailed comparison of metrics.

Chinese word similarity lists.

Test set    | # word pairs with human similarity judgments
wordsim-240 | 240
wordsim-296 | 297
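Word similarity benchmarks like wordsim-240 are scored with Spearman's rank correlation between the model's similarity scores and the human judgments. A minimal sketch, assuming no tied scores (the four word-pair scores below are made up):

```python
# Minimal Spearman's rho, assuming no tied scores for simplicity.
def ranks(xs):
    """Map each value to its rank (0 = smallest)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Pearson correlation of the rank-transformed scores."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # ranks 0..n-1 have this mean
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # same for ry: ranks are a permutation
    return cov / var

# Hypothetical human scores and model cosine similarities for 4 word pairs.
human = [9.1, 7.5, 3.2, 1.0]
model = [0.83, 0.70, 0.35, 0.12]
print(spearman(human, model))  # rankings agree perfectly -> 1.0
```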

Metrics

Spearman rank correlation (ρ) between the model’s similarity scores and the human judgments.

Results

System                   | wordsim-240 (ρ) | wordsim-296 (ρ)
Sun et al. (2019) (VCWE) | 57.81           | 61.29
Yu et al. (2017) (JWE)   | 51.92           | 59.84

Chinese word analogy lists.

Given “France : Paris :: China : ?”, a system should come up with the answer “Beijing”.

Test set                   | # analogies
Capitals of countries      | 687
States/provinces of cities | 175
Family relationships       | 240
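A common way to answer such analogies is vector arithmetic (3CosAdd): compute vec("Paris") − vec("France") + vec("China") and return the vocabulary word closest to it by cosine similarity, excluding the three query words. A toy sketch with hand-picked 2-dimensional vectors (not from a trained model):

```python
import math

# Toy 2-d embeddings chosen so the capital-of offset is roughly constant
# (illustrative values only, not from a trained model).
emb = {
    "France": [1.0, 0.0], "Paris":   [1.0, 1.0],
    "China":  [3.0, 0.1], "Beijing": [3.0, 1.1],
    "banana": [-2.0, -1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Solve a : b :: c : ? with 3CosAdd: argmax cos(x, b - a + c)."""
    target = [bb - aa + cc for aa, bb, cc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))  # exclude query words
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("France", "Paris", "China"))  # expect "Beijing"
```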

Metrics

Accuracy: the fraction of analogies for which the system’s top-ranked answer matches the reference answer.

Results

System                   | Accuracy (capital) | Accuracy (state) | Accuracy (family) | Accuracy (total)
Yu et al. (2017) (JWE)   | 0.91               | 0.93             | 0.62              | 0.85
Yin et al. (2016) (MGE)  | 0.89               | 0.88             | 0.39              | 0.76
CBOW (baseline)          | 0.84               | 0.88             | 0.60              | 0.79

Chinese sentiment analysis.

Test set | # positive reviews | # negative reviews
Notebook | 417                | 206
Car      | 886                | 286
Camera   | 1,558              | 673
Phone    | 1,713              | 843
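A simple extrinsic setup represents each review as the average of its word vectors and trains a classifier on top. The sketch below stands in for that pipeline with toy embeddings and a nearest-centroid rule instead of a real classifier; all words and values are illustrative:

```python
# Toy word vectors (illustrative values, not from a trained model).
TOY_EMB = {
    "好": [1.0, 0.2],  "喜欢": [0.9, 0.1],   # positive-leaning: "good", "like"
    "差": [-1.0, 0.1], "失望": [-0.8, 0.2],  # negative-leaning: "bad", "disappointed"
}

def review_vector(words):
    """Average the embeddings of the review's known words."""
    vecs = [TOY_EMB[w] for w in words if w in TOY_EMB]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Class centroids from one tiny "training" review per class.
pos_centroid = review_vector(["好", "喜欢"])
neg_centroid = review_vector(["差", "失望"])

def classify(words):
    """Label a review by its nearest class centroid."""
    v = review_vector(words)
    return "positive" if sq_dist(v, pos_centroid) < sq_dist(v, neg_centroid) else "negative"

print(classify(["喜欢", "好"]))  # expect "positive"
```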

Results

System                   | Accuracy (notebook) | Accuracy (car) | Accuracy (camera) | Accuracy (phone) | Accuracy (overall)
Sun et al. (2019) (VCWE) | 80.95               | 85.59          | 83.93             | 84.38            | 88.92
Yu et al. (2017) (JWE)   | 77.78               | 78.81          | 81.70             | 81.64            | 85.13
Baseline (skip-gram)     | 69.84               | 77.12          | 80.80             | 81.25            | 86.65

Chinese name tagging.

Test set             | Size (words) | Genre
SIGHAN 2006 NER MSRA | 100,000      | Newswire, Broadcast News, Weblog

Results

System                   | F1 score
Sun et al. (2019) (VCWE) | 85.77
Yu et al. (2017) (JWE)   | 85.30

Resources

Train set            | Size (words) | Genre
SIGHAN 2006 NER MSRA | 1.3M         | Newswire, Broadcast News, Weblog

Other Resources

Various word embeddings

Name      | Additional features                                                          | Training corpus size | Source
FastText  | -                                                                            | 374M characters      | Grave et al., 2018
Mimick    | Interpolates between similar characters to improve rare words; multilingual  |                      | Pinter et al., 2017
Glyph2vec | Uses character bitmaps and Cangjie codes to address the OOV problem          | 10M characters       | Chen et al., 2020

Text corpora

Corpus         | Size (words) | Size (vocabulary) | Genre
Wikipedia dump | 153,278,000  | 66,856            | General
People’s Daily | 31,000,000   | 105,000           | News

Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com