Machine translation (MT) converts a text from one language to another. Here, we focus on translation into and out of Chinese.
Input:
美中两国可能很快达成一个贸易协议。
Output:
The United States and China may soon reach a trade agreement.
The Second Conference on Machine Translation (WMT17) has a Chinese/English MT component, done in cooperation with CWMT 2017.
Test set | Size (sentences) | Genre |
---|---|---|
WMT17 Parallel English/Chinese test set | 2001 | News |
Note that the Conference on Machine Translation (WMT19) is announced here.
Chinese to English (WMT17)
System | Direct Assessment (Ave z) | Bleu |
---|---|---|
[Hany et al 18] | | 27.4 |
[Wang et al 17] | 0.209 | 26.4 |
[Sennrich et al 17] | 0.208 | 25.7 |
[Tan et al 17] | 0.184 | 26 |
English to Chinese (WMT17)
System | Direct Assessment (Ave z) | Bleu |
---|---|---|
[Wang et al 17] | 0.208 | |
[Sennrich et al 17] | 0.178 | 36.3 |
[Tan et al 17] | 0.165 | 35.8 |
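The "Ave z" column reports Direct Assessment scores after standardization: each annotator's raw ratings are converted to z-scores using that annotator's own mean and standard deviation, and the z-scores are then averaged per system. The following is a minimal sketch of that idea, not the official WMT scoring code. The function name `average_z`, the `(annotator, system, raw_score)` input layout, and the use of the population standard deviation are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def average_z(ratings):
    """ratings: list of (annotator, system, raw_score) triples (hypothetical layout)."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)

    # Per-annotator mean and (population) standard deviation of raw scores.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    z_by_system = defaultdict(list)
    for annotator, system, score in ratings:
        mu, sigma = stats[annotator]
        if sigma == 0:
            # An annotator who gave identical scores carries no ranking signal.
            continue
        z_by_system[system].append((score - mu) / sigma)

    # Average the standardized scores for each system.
    return {system: mean(z) for system, z in z_by_system.items()}


if __name__ == "__main__":
    toy = [("a1", "A", 80), ("a1", "B", 60), ("a2", "A", 50), ("a2", "B", 30)]
    print(average_z(toy))
```

Standardizing per annotator keeps a strict rater and a lenient rater from distorting the comparison, which is why the values in the tables are small numbers around zero rather than raw 0 to 100 scores.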
There are many parallel English/Chinese text resources to train MT systems on. These are publicly available:
Dataset | Size (words on English side) | Genre |
---|---|---|
UN | 327m | Political |
News Commentary v12 | 5m | News opinions |
CWMT | 154m | Web, movies, thesaurus, government, news conversation, novels, technical documents |
AI_Challenger | 120m | Movie subtitles, English learning, etc. |
WMT 2017 Dev | 54k | News |
The Linguistic Data Consortium has additional resources, such as FBIS and NIST test sets.
NIST has a long history of supporting Chinese-English translation by creating annual test sets and running annual NIST OpenMT evaluations during the 2000s. Many sites have reported results on NIST test sets.
Test sets contain Chinese sentences with four distinct (human reference) English translations each. Having four references per sentence makes NIST an unusually strong evaluation set.
Variations in training and evaluation conditions can make it difficult to compare systems.
Note that this paper proposes a standard corpus and methodology around NIST sets, while reporting high Bleu scores. Github
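To illustrate how multiple references enter the Bleu computation, here is a minimal sketch of corpus-level Bleu with clipped n-gram counts and a closest-reference-length brevity penalty. It is not a reimplementation of mteval or multibleu; those scripts can differ in tokenization and in how the reference length is chosen, which is one reason scores in the table are hard to compare. The function names and the assumption that input is already tokenized are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """hypotheses: list of token lists; references: one list of token lists per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hypotheses, references):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties go to the shorter one).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram by its maximum count in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matched[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


if __name__ == "__main__":
    hyp = "the cat sat on the mat".split()
    ref_a = "a cat was on the mat".split()
    ref_b = "the cat sat on the mat".split()
    print(corpus_bleu([hyp], [[ref_a]]), corpus_bleu([hyp], [[ref_a, ref_b]]))
```

Because an n-gram counts as matched if any reference contains it, adding references can only keep or raise the matched counts, so multi-reference scores are not comparable with single-reference scores such as those on the WMT sets.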
Test set | Size (sentence pairs) | Genre |
---|---|---|
NIST 02 | 878 | News |
NIST 03 | 919 | News |
NIST 04 | 1788 | News |
NIST 05 | 1082 | News |
NIST 06 | 1664 | Newswire, broadcast news, broadcast conversations, web newsgroups |
NIST 08 | 1357 | Newswire, broadcast news, broadcast conversations, web newsgroups |
System | Training sentence pairs | Eval script | NIST 02 | NIST 03 | NIST 04 | NIST 05 | NIST 06 | NIST 08 | Average |
---|---|---|---|---|---|---|---|---|---|
[Zhang et al 2019] | 1.25m | mteval-v11b | | 48.31 | 49.40 | 48.72 | 48.45 | | 48.72 |
[Hadiwinoto & Ng, 2018] | 7.65m | mteval-v13a | 46.94 | 47.58 | 49.13 | 47.78 | 49.37 | 41.48 | 47.05 |
[Yang et al, 2020] | 1.2m | unspecified | 46.56 | 46.04 | 37.53 | ||||
[Meng et al 2019] | 1.25m | unspecified | 40.56 (dev) | 39.93 | 41.54 | 38.01 | 37.45 | 29.07 | 37.76 |
[Ma et al 2018c] | 1.25m | unspecified | 39.77 (dev) | 38.91 | 40.02 | 36.82 | 35.93 | 27.61 | 36.51 |
[Chen et al 2017] | 1.6m | multibleu | 36.57 | 35.64 | 36.63 | 34.35 | 30.57 |
The Linguistic Data Consortium provides training materials typically used for NIST OpenMT tasks.
Test sets | Size (sentences) | # of talks | Genre |
---|---|---|---|
tst2014 | 1,068 | 12 | TED talks |
tst2015 | 1,080 | 12 | TED talks |
Chinese to English (tst2015)
System | Bleu | NIST | TER |
---|---|---|---|
MITLL-AFRL | 16.86 | 5.2565 | 67.31 |
English to Chinese (tst2015)
System | Bleu | NIST | TER |
---|---|---|---|
Univ. Edinburgh | 25.39 | 6.3985 | 60.83 |
MITLL-AFRL | 24.31 | 6.4136 | 59.00 |
Dataset | Size (sentences) | # of talks | Genre |
---|---|---|---|
Train | 210k | 1718 | TED talks |
This site contains an up-to-date multi-way corpus of TED talks used for machine translation research. It also contains a leaderboard maintained by Kevin Duh.
Test set | Size (sentences) | Genre |
---|---|---|
Chinese/English test | 1,982 | TED talks |
Chinese/English dev | 1,958 | TED talks |
This site contains more languages but a different train/test split.
Chinese to English
System | Bleu |
---|---|
Kevin Duh, 6-layer transformer (Sockeye) | 16.63 |
English to Chinese
System | Bleu |
---|---|
Kevin Duh, coming | Not yet available |
The Multitarget TED Talks Task (MTTT)
The Workshop on Asian Translation has run since 2014. Here, we include the 2018 Chinese/Japanese evaluations.
ASPEC Chinese-Japanese
Participants must get data from here
Test set | Size (sentences) | Genre |
---|---|---|
ASPEC Chinese-Japanese | 2107 | Scientific abstracts |
ASPEC Japanese-Chinese | 2107 | Scientific abstracts |
JPO Patent Corpus 2
Participants must get data from here
Test set | Size (sentences) | Genre |
---|---|---|
JPCN Chinese-Japanese | 5,204 | Patents |
JPCN Japanese-Chinese | 5,204 | Patents |
JPCN1 Chinese-Japanese | 2,000 | Patents |
JPCN1 Japanese-Chinese | 2,000 | Patents |
JPCN2 Chinese-Japanese | 3,000 | Patents |
JPCN2 Japanese-Chinese | 3,000 | Patents |
JPCN3 Chinese-Japanese | 204 | Patents |
JPCN3 Japanese-Chinese | 204 | Patents |
JPSEP Chinese-Japanese | 1,151 | Patent Expression Patterns |
Dataset | Size (sentences) | Genre |
---|---|---|
Japanese-Chinese train | 250,000 | Patents |
Japanese-Chinese dev | 2,000 | Patents |
Japanese-Chinese devtest | 2,000 | Patents |
The shared task aims to promote research on translation between Asian languages, the exploitation of noisy parallel web corpora for MT, and smart processing of data and provenance.
A (secret) mixed-genre test set was intended to cover a variety of topics. The test data was selected from high-quality (human translated) parallel web content, authored between January and March 2020.
Test set | Size (sentences) | Genre |
---|---|---|
Chinese-Japanese | 875 | mixed-genre |
Japanese-Chinese | 875 | mixed-genre |
Chinese to Japanese
System | Bleu |
---|---|
CASIA* | 43.0 |
Xiaomi | 34.3 |
TSUKUBA | 33.0 |
Japanese to Chinese
System | Bleu |
---|---|
CASIA* | 55.8 |
Samsung Research China | 34.0 |
OPPO | 32.9 |
* indicates that the system collected external parallel training data that inadvertently overlapped with the blind test set.
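Overlap of this kind can be screened for before training. The sketch below drops any training pair whose source or target side matches a test sentence after light normalization. It is only an illustration of the idea, not a procedure used by the task organizers or participants; the function names `normalize` and `remove_test_overlap` and the choice of normalization (NFKC, lowercasing, whitespace removal) are assumptions.

```python
import unicodedata


def normalize(sentence):
    """NFKC-normalize, lowercase, and drop whitespace so trivial variants match."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    return "".join(text.split())


def remove_test_overlap(train_pairs, test_sentences):
    """Drop training pairs whose source or target side appears in the test set."""
    blocked = {normalize(s) for s in test_sentences}
    return [
        (src, tgt)
        for src, tgt in train_pairs
        if normalize(src) not in blocked and normalize(tgt) not in blocked
    ]


if __name__ == "__main__":
    train = [
        ("今日は晴れです。", "今天天气晴朗。"),
        ("特許を出願する。", "申请专利。"),
    ]
    test = ["今天 天气晴朗。"]  # same sentence as a training target, with a stray space
    print(remove_test_overlap(train, test))
```

Exact matching after normalization only catches verbatim reuse; near-duplicates from the same web page would need fuzzier matching, which this sketch does not attempt.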
Dataset | Size (sentences) | Genre |
---|---|---|
Web crawled | 18,966,595 | mixed-genre |
Existing parallel sources | 1,963,238 | mixed-genre |
CWMT 2017 and 2018 (China Workshop on Machine Translation) feature six tasks:
Test set | Size (sentences) | Genre |
---|---|---|
CWMT Chinese-English news | 1000 | News |
CWMT English-Chinese news | 1000 | News |
Mongolian-Chinese | 1001 | Daily expressions |
Tibetan-Chinese | 729 | Government documents |
Uyghur-Chinese | 1000 | News |
Japanese-Chinese | 1000 | Patents |
In 2019, CWMT became CCMT (China Conference on Machine Translation).
BLEU-SBP is the primary metric. Other metrics include BLEU-NIST, TER, METEOR, NIST, GTM, mWER, mPER, and ICT.
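Two of the secondary metrics are easy to state in code. mWER is word error rate against the closest of several references, and mPER is the position-independent variant that ignores word order. The sketch below follows one common formulation of each and is not the CWMT scoring implementation; the function names are illustrative, and input is assumed to be already tokenized.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        curr = [i]
        for j, r in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,              # delete a hypothesis word
                            curr[j - 1] + 1,          # insert a reference word
                            prev[j - 1] + (h != r)))  # substitute or match
        prev = curr
    return prev[-1]


def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate: compares bags of words, ignoring order."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - overlap) / len(ref)


def multi_reference(metric, hyp, refs):
    """Score against the closest reference (the 'm' in mWER and mPER)."""
    return min(metric(hyp, r) for r in refs)


if __name__ == "__main__":
    hyp = "b a c".split()
    refs = ["a b c".split(), "a c".split()]
    print(multi_reference(wer, hyp, refs), multi_reference(per, hyp, refs))
```

Both are error rates, so lower is better, in contrast to Bleu, METEOR, and NIST.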
Still being compiled.
Details are available here.
Opus is an excellent site for open-source parallel corpora, with a nice language-pair search function.
Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com