Chinese Word Embeddings
Background
A word embedding model ingests a large corpus of text and outputs, for each word type, an n-dimensional vector of real numbers. This vector captures syntactic and semantic information about the word that can be employed in various NLP tasks. In Chinese, the unit of encoding may be a character or a sub-character unit rather than a word.
Example
Input: “查询” (query)
Output: vec(“查询”) = [-0.059569, 0.126913, 0.273161, 0.225467, -0.185914, 0.018743, -0.18434, 0.083859, -0.115781, -0.216993, 0.063437, -0.005511, 0.276968, …, 0.254486]
Standard Metrics
Word vectors can be evaluated intrinsically (e.g., whether similar words get similar vectors) or extrinsically (e.g., to what extent word vectors can improve a sentiment analyzer).
Intrinsic evaluation looks at:
- Word relatedness: Spearman correlation (ρ) between human-labeled scores and scores generated by the embeddings on the Chinese word similarity datasets wordsim-240 and wordsim-296 (translations of English resources).
- Word analogy: accuracy on the word analogy task (e.g., “男人 (man) : 女人 (woman) :: 父亲 (father) : X”, where X is chosen by cosine similarity). The analogy test sets cover three relation types: (1) capitals of countries, (2) states/provinces of cities, (3) family words.
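The word-relatedness metric above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the 2-d vectors and human scores are made-up toy values (the real datasets are wordsim-240/296), and the Spearman implementation uses the rank-difference formula, which assumes no ties.

```python
# Intrinsic evaluation sketch: Spearman correlation (rho) between human
# relatedness scores and cosine similarities from the embeddings.
# Vectors and scores are toy values, not taken from wordsim-240/296.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank_pos, idx in enumerate(order, start=1):
            r[idx] = rank_pos
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy 2-d embeddings and human scores for three word pairs.
emb = {"猫": [0.9, 0.1], "狗": [0.8, 0.2], "车": [0.1, 0.9]}  # cat, dog, car
pairs = [("猫", "狗"), ("猫", "车"), ("狗", "车")]
human = [8.5, 1.2, 1.5]  # hypothetical human relatedness judgments
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(round(spearman(human, model), 3))  # -> 1.0 (rankings agree perfectly)
```

In practice one would use `scipy.stats.spearmanr`, which also handles tied ranks.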
Extrinsic evaluation:
- Accuracy on Chinese sentiment analysis task
- F1 score on Chinese named entity recognition task
- Accuracy on part-of-speech tagging task
See, e.g., Torregrossa et al., 2020, for a more detailed comparison of metrics.
Chinese word similarity lists.
Metrics
Results
- The SoTA system (VCWE), published at NAACL 2019, combines intra-character compositionality (computed via a convolutional neural network) and inter-character compositionality (computed via a recurrent neural network with self-attention) to compute the word embeddings.
Chinese word analogy lists.
Given “France : Paris :: China : ?”, a system should come up with the answer “Beijing”.
| Test set | # analogies |
| --- | --- |
| Capitals of countries | 687 |
| States/provinces of cities | 175 |
| Family relationships | 240 |
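The analogy task can be sketched with the standard 3CosAdd rule: the answer is the vocabulary word whose vector is most cosine-similar to vec(b) − vec(a) + vec(c), with the three query words excluded as candidates. The 2-d vectors below are made-up toy values for illustration only.

```python
# Sketch of the word-analogy metric (3CosAdd): given a : b :: c : ?,
# return the word closest to vec(b) - vec(a) + vec(c) by cosine similarity,
# excluding the query words themselves. Toy 2-d vectors, not real embeddings.
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def analogy(emb, a, b, c):
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

emb = {
    "法国": [1.0, 0.0],  # France
    "巴黎": [1.0, 1.0],  # Paris
    "中国": [0.0, 0.1],  # China
    "北京": [0.0, 1.1],  # Beijing
    "男人": [0.5, 0.5],  # man (distractor)
}
print(analogy(emb, "法国", "巴黎", "中国"))  # -> 北京
```

Accuracy on the analogy test sets is simply the fraction of analogies for which this top-ranked word matches the gold answer.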
Metrics
Results
| System | Accuracy (capital) | Accuracy (state) | Accuracy (family) | Accuracy (total) |
| --- | --- | --- | --- | --- |
| Yu et al. (2017) (JWE) | 0.91 | 0.93 | 0.62 | 0.85 |
| Yin et al. (2016) (MGE) | 0.89 | 0.88 | 0.39 | 0.76 |
| CBOW (baseline) | 0.84 | 0.88 | 0.60 | 0.79 |
Chinese sentiment analysis.
- This test measures how much the sentiment analysis task benefits from different word vectors.
- There is no agreed-upon baseline (e.g., sentiment classifier code), so it is difficult to compare across papers.
- Sentiment dataset available at http://sentic.net/chinese-review-datasets.zip (Peng et al. (2018))
- Consists of Chinese reviews in 4 domains: notebook, car, camera and phone
- Binary classification task: reviews are either positive or negative
- Does not have train/dev/test split.
| Test set | # positive reviews | # negative reviews |
| --- | --- | --- |
| Notebook | 417 | 206 |
| Car | 886 | 286 |
| Camera | 1,558 | 673 |
| Phone | 1,713 | 843 |
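Since the papers do not share a common classifier, here is one minimal way such an extrinsic evaluation can be set up: represent each review by the average of its word vectors, then classify with a nearest-centroid rule. Everything below (embeddings, tokenized reviews, labels) is a toy illustration, not the setup used by any of the systems in the table.

```python
# Extrinsic sentiment-evaluation sketch: average word vectors into a review
# vector, then label by the nearest class centroid. Toy data throughout.
import math

def avg_vec(words, emb):
    vecs = [emb[w] for w in words if w in emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def centroid(doc_vecs):
    return [sum(col) / len(doc_vecs) for col in zip(*doc_vecs)]

def classify(doc_vec, pos_centroid, neg_centroid):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return "pos" if dist(doc_vec, pos_centroid) < dist(doc_vec, neg_centroid) else "neg"

# Toy embeddings: 好/棒 (good/great) vs. 差/坏 (poor/bad).
emb = {"好": [1.0, 0.0], "棒": [0.9, 0.1], "差": [0.0, 1.0], "坏": [0.1, 0.9]}
pc = centroid([avg_vec(["好", "棒"], emb)])  # positive-class centroid
nc = centroid([avg_vec(["差", "坏"], emb)])  # negative-class centroid
print(classify(avg_vec(["好"], emb), pc, nc))  # -> pos
```

Reported accuracy is then the fraction of held-out reviews whose predicted label matches the gold label; real evaluations train a stronger classifier (e.g., logistic regression or a neural model) on the review vectors.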
Results
| System | Accuracy (notebook) | Accuracy (car) | Accuracy (camera) | Accuracy (phone) | Accuracy (overall) |
| --- | --- | --- | --- | --- | --- |
| Sun et al. (2019) (VCWE) | 80.95 | 85.59 | 83.93 | 84.38 | 88.92 |
| Yu et al. (2017) (JWE) | 77.78 | 78.81 | 81.70 | 81.64 | 85.13 |
| Baseline (skip-gram) | 69.84 | 77.12 | 80.80 | 81.25 | 86.65 |
Chinese name tagging.
- This test measures how much the name tagging task benefits from different word vectors.
- There is no agreed-upon baseline (e.g., name tagging code), so it is difficult to compare across papers.
- This evaluation tests entity taggers on three entity types: Person (PER), Location (LOC), and Organization (ORG) (Levow, 2006).
| Test set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 100,000 | Newswire, Broadcast News, Weblog |
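Name tagging is conventionally scored with entity-level F1: a predicted entity counts as correct only if both its span and its type match a gold entity. A minimal sketch, with made-up gold and predicted entity sets:

```python
# Entity-level F1 sketch for name tagging: an entity is correct only if its
# (start, end, type) triple exactly matches the gold annotation.
# The gold and predicted sets below are made up for illustration.
def entity_f1(gold, pred):
    tp = len(gold & pred)                      # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (5, 7, "LOC"), (10, 13, "ORG")}  # (start, end, type)
pred = {(0, 2, "PER"), (5, 7, "ORG"), (10, 13, "ORG")}  # one type error
print(round(entity_f1(gold, pred), 3))  # 2 of 3 correct -> P = R = F1 = 0.667
```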
Results
Resources
| Train set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 1.3M | Newswire, Broadcast News, Weblog |
Other Resources
Various word embeddings
| Name | Additional features | Training Corpus Size | Source |
| --- | --- | --- | --- |
| FastText | - | 374M characters | Grave et al., 2018 |
| Mimick | Interpolates between similar characters to improve rare words; multilingual | | Pinter et al., 2017 |
| Glyph2vec | Uses character bitmaps and Cangjie codes to address the OOV problem | 10M chars | Chen et al., 2020 |
Text corpora
| Corpus | Size (words) | Size (vocabulary) | Genre |
| --- | --- | --- | --- |
| Wikipedia dump | 153,278,000 | 66,856 | General |
| People’s Daily | 31,000,000 | 105,000 | News |
Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com