ChineseNLP

Chinese Machine Translation

Background

Machine translation (MT) converts a text from one language to another. Here, we focus on translation into and out of Chinese.

Example

Input:

美中两国可能很快达成一个贸易协议。

Output:

The United States and China may soon reach a trade agreement.

Standard Metrics

Direct assessment (human judgment). Amazon Mechanical Turk workers are supplied with a system translation and a human reference. They are asked “How accurately does the candidate text convey the original semantics of the reference text?”
Bleu score (Papineni et al 02).
- Bleu-n4r4: word {1,2,3,4}-gram matches, against four human reference translations
  - A brevity penalty additionally punishes translations that are shorter than the reference(s).
  - Standard Bleu scripts will tokenize translations and references before computing n-gram matches.
  - If Chinese is target language, character {1,2,3,4}-gram matches are used
  - Bleu-n4r1 is used when only one human reference is available.
- There are important variations:
  - Case-sensitive vs. case-insensitive
  - Brevity penalty may kick in when the system translation is shorter than the shortest reference, or the “closest” reference.
NIST. A variation of BLEU that gives higher weight to rare n-grams.
TER (Translation Edit Rate). Automatically calculates the number of edits required to make a translation identical to a human reference.
BLEU-SBP ((Chiang et al 08)[http://aclweb.org/anthology/D08-1064]). Addresses decomposability problems with Bleu, proposing a cross between Bleu and word error rate.
HTER. Returns the number of edits performed by a human posteditor to get an automatic translation into good shape.

ZH-EN

WMT.

The Second Conference on Machine Translation (WMT17) has a Chinese/English MT component, done in cooperation with CWMT 2017.

Website
Overview paper on WMT17 task
Chinese-English test set:

Test set	Size (sentences)	Genre
WMT17 Parallel English/Chinese test set	2001	News

Note that the Conference on Machine Translation (WMT19) is announced here.

Metrics

Direct assessment (human judgments).
WMT Bleu score script

Results

Chinese to English (WMT17)

System	Direct Assessment (Ave z)	Bleu
[Hany et al 18]		27.4
[Wang et al 17]	0.209	26.4
[Sennrich et al 17]	0.208	25.7
[Tan et al 17]	0.184	26

English - Chinese (WMT17)

System	Direct Assessment (Ave z)	Bleu
[Wang et al 17]	0.208
[Sennrich et al 17]	0.178	36.3
[Tan et al 17]	0.165	35.8

Resources

There are many parallel English/Chinese text resources to train MT systems on. These are publicly available:

Dataset	Size (words on English side)	Genre
UN	327m	Political
New Commentary v12	5m	News opinions
CWMT	154m	Web, movies, thesaurus, government, news conversation, novels, technical documents
AI_Challenger	120m	Movie subtitles, English learning, etc.
WMT 2017 Dev	54k	News

The Linguistic Data Consortium has additional resources, such as FBIS and NIST test sets.

NIST.

NIST has a long history of supporting Chinese-English translation by creating annual test sets and running annual NIST OpenMT evaluations during the 2000s. Many sites have reported results on NIST test sets.

Test sets contain Chinese sentences with four distinct (human reference) English translations each. Four references makes NIST an unusually strong evaluation set.

Variations in training and evaluation conditions can make it difficult to compare systems.

Evaluation script. Some papers evaluate with mteval-v13a, while others evaluate with the multibleu script. Case sensitivity can also be an issue, and with multiple references, there are variations in how to compute Bleu’s brevity penalty.
Training data. Papers vary in the number of training sentence pairs they restrict to.
Development data. Some papers use NIST 02 for development/tuning, while others use it as test data.

Note that this paper proposes a standard corpus and methodology around NIST sets, while reporting high Bleu scores. Github

Test set	Size (sentence pairs)	Genre
NIST 02	878	News
NIST 03	919	News
NIST 04	1788	News
NIST 05	1082	News
NIST 06	1664	Newswire, broadcast news, broadcast conversations, web newgroups
NIST 08	1357	Newswire, broadcast news, broadcast conversations, web newgroups

Metrics

Bleu

Results

System	Training sentence pairs	Eval script	NIST 02	NIST 03	NIST 04	NIST 05	NIST 06	NIST 08	Average
[Zhang et al 2019]	1.25m	mteval-v11b		48.31	49.40	48.72	48.45		48.72
[Hadiwinoto & Ng, 2018]	7.65m	mteval-v13a	46.94	47.58	49.13	47.78	49.37	41.48	47.05
[Yang te al, 2020]	1.2m	unspecified		46.56		46.04		37.53
[Meng et al 2019]	1.25m	unspecified	40.56 (dev)	39.93	41.54	38.01	37.45	29.07	37.76
[Ma et al 2018c]	1.25m	unspecified	39.77 (dev)	38.91	40.02	36.82	35.93	27.61	36.51
[Chen et al 2017]	1.6m	multibleu	36.57	35.64	36.63	34.35	30.57

Resources

The Linguistic Data Consortium provides training materials typically used for NIST OpenMT tasks.

IWSLT 2015.

Translation of TED talks
Chinese-to-English track
Shared task overview

Test sets	Size (sentences)	# of talks	Genre
tst2014	1068	12	TED talks
tst2015	1,080	12	TED talks

Metrics

Automatic metrics: Bleu, NIST, TER.
Manual metrics: HTER.

Results

Chinese to English (tst2015)

System	Bleu	NIST	TER
MITLL-AFRL	16.86	5.2565	67.31

English to Chinese (tst2015)

System	Bleu	NIST	TER
Univ. Edinburgh	25.39	6.3985	60.83
MITLL-AFRL	24.31	6.4136	59.00

Resources

Dataset	Size (sentences)	# of talks	Genre
Train	210k	1718	TED talks

TED corpus.

This site contains an up-to-date multi-way corpus of TED talks using for machine translation research. It also contains a leaderboard maintained by Kevin Duh.

Test set	Size (sentences)	Genre
Chinese/English test	1,982	TED talks
Chinese/English dev	1,958	TED talks

This site contains more languages but a different train/test split.

Results

Chinese to English

System	Bleu
Kevin Duh, 6-layer transformer (Sockeye)	16.63

English to Chinese

System	Bleu
Kevin Duh, coming	Not yet available

Resources

The Multitarget TED Talks Task (MTTT)

ZH-JA

Workshop on Asian Translation.

The Workshop on Asian Translation has run since 2014. Here, we include the 2018 Chinese/Japanese evaluations.

Metrics

BLEU.
RIBES. Method developed at NTT based on rank correlation coefficients (link).
Scoring code.
- for evaluating Chinese outputs: link
- for evaluating Japanese outputs: link

ASPEC Chinese-Japanese

Participants must get data from here

Test set	Size (sentences)	Genre
ASPEC Chinese-Japanese	2107	Scientific abstracts
ASPEC Japanese-Chinese	2107	Scientific abstracts

JPO Patent Corpus 2

Participants must get data from here

Test set	Size (sentences)	Genre
JPCN Chinese-Japanese	5,204	Patents
JPCN Japanese-Chinese	5204	Patents
JPCN1 Chinese-Japanese	2000	Patents
JPCN1 Japanese-Chinese	2000	Patents
JPCN2 Chinese-Japanese	3,000	Patents
JPCN2 Japanese-Chinese	3,000	Patents
JPCN3 Chinese-Japanese	204	Patents
JPCN3 Japanese-Chinese	204	Patents
JPSEP Chinese-Japanese	1151	Patent Expression Patterns

Results

See here
Note that citing others’ results should only be done with anonymization.

Resources

Dataset	Size (sentences)	Genre
Japanese-Chinese train	250,000	Patents
Japanese-Chinese dev	2000	Patents
Japanese-Chinese devtest	2000	Patents

IWSLT2020 ZH-JA Open Domain Translation.

The shared task is to promote research on translation between Asian languages, exploitation of noisy parallel web corpora for MT and smart processing of data and provenance.

Metrics

4-gram character BlEU.

A (secret) mixed-genre test set was intended to cover a variety of topics. The test data was selected from high-quality (human translated) parallel web content, authored between January and March 2020.

Test set	Size (sentences)	Genre
Chinese-Japanese	875	mixed-genre
Japanese-Chinese	875	mixed-genre

Results

Chinese to Japanese

System	Bleu
CASIA*	43.0
Xiaomi	34.3
TSUKUBA	33.0

Japanese to Chinese

System	Bleu
CASIA*	55.8
Samsung Research China	34.0
OPPO	32.9

* means system collected external parallel training data that inadvertently overlapped with the blind test set.

Resources

Dataset	Size (sentences)	Genre
Web crawled	18,966,595	mixed-genre
Existing parallel sources	1,963,238	mixed-genre

Others

CWMT.

CWMT 2017 and 2018 (China Workshop on Machine Translation) features six tasks:

Test set	Size (sentences)	Genre
CWMT Chinese-English news	1000	News
CWMT English-Chinese news	1000	News
Mongolian-Chinese	1001	Daily expressions
Tibetan-Chinese	729	Government documents
Uyghur-Chinese	1000	News
Japanese-Chinese	1000	Patents

In 2019, CWMT became CCMT (China Conference on Machine Translation).

Metrics

BLEU-SBP is the primary metric. Other metrics include BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.

Results

Still being compiled.

Resources

Detailed at here

Other Resources

Opus is an excellent site for open-source parallel corpora, with a nice language-pair search function.

Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com