Chinese Word Embeddings
Background
A word embedding model ingests a large corpus of text and outputs, for each word type, an n-dimensional vector of real numbers. This vector captures syntactic and semantic information about the word that can be employed in various NLP tasks. In Chinese, the unit of encoding may be a character or a sub-character unit rather than a word.
Example
Input: “查询” (query)
Output: vec(“查询”) = [-0.059569, 0.126913, 0.273161, 0.225467, -0.185914, 0.018743, -0.18434, 0.083859, -0.115781, -0.216993, 0.063437, -0.005511, 0.276968, …, 0.254486]
Standard Metrics
Word vectors can be evaluated intrinsically (e.g., whether similar words get similar vectors) or extrinsically (e.g., to what extent word vectors can improve a sentiment analyzer).
Intrinsic evaluation looks at:
- Word relatedness: Spearman correlation (ρ) between human-labeled scores and scores generated by the embeddings on the Chinese word similarity datasets wordsim-240 and wordsim-296 (translations of English resources).
- Word analogy: accuracy on the word analogy task (e.g., “男人 (man) : 女人 (woman) :: 父亲 (father) : X”, where X is chosen by cosine similarity). The analogy test sets cover three relation types: (1) capitals of countries, (2) states/provinces of cities, (3) family words.
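The word-relatedness metric above can be sketched as follows. This is a minimal illustration, not the official evaluation script: the 2-d vectors and human scores are made-up toy values (the real datasets are wordsim-240/296), and the Spearman implementation uses the rank-difference formula, which assumes no ties.

```python
# Intrinsic evaluation sketch: Spearman correlation (rho) between human
# relatedness scores and cosine similarities from the embeddings.
# Vectors and scores are toy values, not taken from wordsim-240/296.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rho via the rank-difference formula (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank_pos, idx in enumerate(order, start=1):
            r[idx] = rank_pos
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy 2-d embeddings and human scores for three word pairs.
emb = {"猫": [0.9, 0.1], "狗": [0.8, 0.2], "车": [0.1, 0.9]}  # cat, dog, car
pairs = [("猫", "狗"), ("猫", "车"), ("狗", "车")]
human = [8.5, 1.2, 1.5]  # hypothetical human relatedness judgments
model = [cosine(emb[a], emb[b]) for a, b in pairs]
print(round(spearman(human, model), 3))  # -> 1.0 (rankings agree perfectly)
```

In practice one would use `scipy.stats.spearmanr`, which also handles tied ranks.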
Extrinsic evaluation:
- Accuracy on Chinese sentiment analysis task
- F1 score on Chinese named entity recognition task
- Accuracy on part-of-speech tagging task
See, e.g., Torregrossa et al., 2020, for a more detailed comparison of metrics.
Chinese word similarity lists.
Metrics
Results
- The SoTA system (VCWE), published at NAACL 2019, combines intra-character compositionality (computed via a convolutional neural network) and inter-character compositionality (computed via a recurrent neural network with self-attention) to compute the word embeddings.
Chinese word analogy lists.
Given “France : Paris :: China : ?”, a system should come up with the answer “Beijing”.
| Test set | # analogies |
| --- | --- |
| Capitals of countries | 687 |
| States/provinces of cities | 175 |
| Family relationships | 240 |
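The analogy task can be sketched with the standard 3CosAdd rule: the answer is the vocabulary word whose vector is most cosine-similar to vec(b) − vec(a) + vec(c), with the three query words excluded as candidates. The 2-d vectors below are made-up toy values for illustration only.

```python
# Sketch of the word-analogy metric (3CosAdd): given a : b :: c : ?,
# return the word closest to vec(b) - vec(a) + vec(c) by cosine similarity,
# excluding the query words themselves. Toy 2-d vectors, not real embeddings.
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def analogy(emb, a, b, c):
    target = [emb[b][i] - emb[a][i] + emb[c][i] for i in range(len(emb[a]))]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

emb = {
    "法国": [1.0, 0.0],  # France
    "巴黎": [1.0, 1.0],  # Paris
    "中国": [0.0, 0.1],  # China
    "北京": [0.0, 1.1],  # Beijing
    "男人": [0.5, 0.5],  # man (distractor)
}
print(analogy(emb, "法国", "巴黎", "中国"))  # -> 北京
```

Accuracy on the analogy test sets is simply the fraction of analogies for which this top-ranked word matches the gold answer.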
Metrics
Results
| System | Accuracy (capital) | Accuracy (state) | Accuracy (family) | Accuracy (total) |
| --- | --- | --- | --- | --- |
| Yu et al. (2017) (JWE) | 0.91 | 0.93 | 0.62 | 0.85 |
| Yin et al. (2016) (MGE) | 0.89 | 0.88 | 0.39 | 0.76 |
| CBOW (baseline) | 0.84 | 0.88 | 0.60 | 0.79 |
Chinese sentiment analysis.
- This test measures how much the sentiment analysis task benefits from different word vectors.
- There is no agreed-upon baseline (e.g., sentiment classifier code), so it is difficult to compare across papers.
- Sentiment dataset available at http://sentic.net/chinese-review-datasets.zip (Peng et al. (2018))
- Consists of Chinese reviews in 4 domains: notebook, car, camera and phone
- Binary classification task: reviews are either positive or negative
- Does not have train/dev/test split.
| Test set | # positive reviews | # negative reviews |
| --- | --- | --- |
| Notebook | 417 | 206 |
| Car | 886 | 286 |
| Camera | 1,558 | 673 |
| Phone | 1,713 | 843 |
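Since the papers do not share a common classifier, here is one minimal way such an extrinsic evaluation can be set up: represent each review by the average of its word vectors, then classify with a nearest-centroid rule. Everything below (embeddings, tokenized reviews, labels) is a toy illustration, not the setup used by any of the systems in the table.

```python
# Extrinsic sentiment-evaluation sketch: average word vectors into a review
# vector, then label by the nearest class centroid. Toy data throughout.
import math

def avg_vec(words, emb):
    vecs = [emb[w] for w in words if w in emb]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def centroid(doc_vecs):
    return [sum(col) / len(doc_vecs) for col in zip(*doc_vecs)]

def classify(doc_vec, pos_centroid, neg_centroid):
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return "pos" if dist(doc_vec, pos_centroid) < dist(doc_vec, neg_centroid) else "neg"

# Toy embeddings: 好/棒 (good/great) vs. 差/坏 (poor/bad).
emb = {"好": [1.0, 0.0], "棒": [0.9, 0.1], "差": [0.0, 1.0], "坏": [0.1, 0.9]}
pc = centroid([avg_vec(["好", "棒"], emb)])  # positive-class centroid
nc = centroid([avg_vec(["差", "坏"], emb)])  # negative-class centroid
print(classify(avg_vec(["好"], emb), pc, nc))  # -> pos
```

Reported accuracy is then the fraction of held-out reviews whose predicted label matches the gold label; real evaluations train a stronger classifier (e.g., logistic regression or a neural model) on the review vectors.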
Results
| System | Accuracy (notebook) | Accuracy (car) | Accuracy (camera) | Accuracy (phone) | Accuracy (overall) |
| --- | --- | --- | --- | --- | --- |
| Sun et al. (2019) (VCWE) | 80.95 | 85.59 | 83.93 | 84.38 | 88.92 |
| Yu et al. (2017) (JWE) | 77.78 | 78.81 | 81.70 | 81.64 | 85.13 |
| Baseline (skip-gram) | 69.84 | 77.12 | 80.80 | 81.25 | 86.65 |
Chinese name tagging.
- This test measures how much the name tagging task benefits from different word vectors.
- There is no agreed-upon baseline (e.g., name tagging code), so it is difficult to compare across papers.
- This evaluation tests entity taggers on three entity types: Person (PER), Location (LOC), and Organization (ORG) (Levow, 2006).
| Test set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 100,000 | Newswire, Broadcast News, Weblog |
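Name tagging is conventionally scored with entity-level F1: a predicted entity counts as correct only if both its span and its type match a gold entity. A minimal sketch, with made-up gold and predicted entity sets:

```python
# Entity-level F1 sketch for name tagging: an entity is correct only if its
# (start, end, type) triple exactly matches the gold annotation.
# The gold and predicted sets below are made up for illustration.
def entity_f1(gold, pred):
    tp = len(gold & pred)                      # exact span+type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (5, 7, "LOC"), (10, 13, "ORG")}  # (start, end, type)
pred = {(0, 2, "PER"), (5, 7, "ORG"), (10, 13, "ORG")}  # one type error
print(round(entity_f1(gold, pred), 3))  # 2 of 3 correct -> P = R = F1 = 0.667
```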
Results
Resources
| Train set | Size (words) | Genre |
| --- | --- | --- |
| SIGHAN 2006 NER MSRA | 1.3M | Newswire, Broadcast News, Weblog |
Other Resources
Various word embeddings
| Name | Additional features | Training Corpus Size | Source |
| --- | --- | --- | --- |
| FastText | - | 374M characters | Grave et al., 2018 |
| Mimick | Interpolates between similar characters to improve rare words; multilingual | | Pinter et al., 2017 |
| Glyph2vec | Uses character bitmaps and Cangjie codes to address the OOV problem | 10M chars | Chen et al., 2020 |
Text corpora
| Corpus | Size (words) | Size (vocabulary) | Genre |
| --- | --- | --- | --- |
| Wikipedia dump | 153,278,000 | 66,856 | General |
| People’s Daily | 31,000,000 | 105,000 | News |
Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com