The goal of the Spelling Correction task is to find and correct spelling errors (typographical errors) in text. Such errors usually occur between characters that look alike or sound alike.
Input:
1986年毕业于国防科技大学计算机应用专业,获学时学位。
Output:
1986年毕业于国防科技大学计算机应用专业,获学士学位。
(时 -> 士)
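The correction in the example above can be recovered by comparing the input and output character by character. The following is a minimal sketch, assuming that errors are same-length substitutions (one wrong character replaced by one right character); the helper name `diff_chars` is hypothetical and not part of any benchmark tooling.

```python
def diff_chars(source: str, target: str) -> list[tuple[int, str, str]]:
    """Return (position, wrong_char, right_char) for every substituted character."""
    # Assumption: spelling errors are substitutions, so both strings have equal length.
    if len(source) != len(target):
        raise ValueError("this sketch only handles same-length substitution errors")
    return [(i, s, t) for i, (s, t) in enumerate(zip(source, target)) if s != t]


source = "1986年毕业于国防科技大学计算机应用专业,获学时学位。"
target = "1986年毕业于国防科技大学计算机应用专业,获学士学位。"

edits = diff_chars(source, target)
for pos, wrong, right in edits:
    print(f"position {pos}: {wrong} -> {right}")
```

The positions returned here correspond to what a detection step has to find, while the replacement characters correspond to what a correction step has to produce.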
Spelling Correction is usually evaluated with accuracy, precision, recall, and F1 score. These metrics can be computed at the character level or at the sentence level. Detection and Correction are usually evaluated separately.
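As a rough illustration of these metrics, the sketch below computes character-level precision, recall, and F1 for detection and for correction. The counting convention is an assumption, since published evaluation scripts differ in detail: a changed character counts as a detection hit if the gold standard also marks it as an error, and as a correction hit only if the new character equals the gold character; a wrongly corrected error counts as both a false positive and a false negative for correction. The function names and the toy sentences are hypothetical, and accuracy and sentence-level variants are left out for brevity.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def char_level_scores(triples):
    """triples: iterable of (source, gold, prediction) strings of equal length."""
    det_tp = det_fp = det_fn = 0
    cor_tp = cor_fp = cor_fn = 0
    for src, gold, pred in triples:
        for s, g, p in zip(src, gold, pred):
            is_error = s != g  # the gold standard marks this character as misspelled
            flagged = s != p   # the system changed this character
            # Detection: did the system flag the right positions?
            if flagged and is_error:
                det_tp += 1
            elif flagged:
                det_fp += 1
            elif is_error:
                det_fn += 1
            # Correction: did the system also produce the right character?
            if flagged and p == g:
                cor_tp += 1
            else:
                if flagged:
                    cor_fp += 1  # changed, but not to the gold character
                if is_error:
                    cor_fn += 1  # gold error left unfixed or fixed wrongly
    return {
        "detection": prf(det_tp, det_fp, det_fn),
        "correction": prf(cor_tp, cor_fp, cor_fn),
    }


# Toy data, made up for illustration only.
triples = [
    ("获学时学位", "获学士学位", "获学士学位"),  # error found and fixed
    ("我门去公园", "我们去公园", "我门去公圆"),  # error missed, plus one false alarm
    ("今天天汽好", "今天天气好", "今天天器好"),  # error found but fixed wrongly
]
scores = char_level_scores(triples)
for stage, (p, r, f1) in scores.items():
    print(f"{stage}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

On this toy data, detection precision, recall, and F1 all come out to 2/3, while the correction scores all come out to 1/3, which shows how a system can locate an error without fixing it correctly.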
Test set | # sentence pairs | # characters | # spelling errors (chars) | Character set | Genre |
---|---|---|---|---|---|
SIGHAN 2015 (Tseng et al. 2015) | 1,100 | 33,711 | 715 | Traditional | Second-language learning |
SIGHAN 2014 (Yu et al. 2014) | 1,062 | 53,114 | 792 | Traditional | Second-language learning |
SIGHAN 2013 (Wu et al. 2013) | 2,000 | 143,039 | 1,641 | Traditional | Second-language learning |

| System | False Positive Rate | Detection Accuracy | Detection Precision | Detection Recall | Detection F1 | Correction Accuracy | Correction Precision | Correction Recall | Correction F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Soft-Masked BERT (Zhang et al. 2020) | - | 80.9 | 73.7 | 73.2 | 73.5 | 77.4 | 66.7 | 66.2 | 66.4 |
| Confusion-set (Wang et al. 2019) | - | - | 66.8 | 73.1 | 69.8 | - | 71.5 | 59.5 | 64.9 |
| FASPell (Hong et al. 2019) | - | 74.2 | 67.6 | 60.0 | 63.5 | 73.7 | 66.6 | 59.1 | 62.6 |
| CAS (Zhang et al. 2015) | 11.6 | 70.1 | 80.3 | 53.3 | 64.0 | 69.2 | 79.7 | 51.5 | 62.5 |

The table above shows the results of different models on the SIGHAN 2015 test set.
Source | # sentence pairs | # chars | # spelling errors | Character set | Genre |
---|---|---|---|---|---|
SIGHAN 2015 Training data (Tseng et al. 2015) | 2,334 | 146,076 | 2,594 | Traditional | Second-language learning |
SIGHAN 2014 Training data (Yu et al. 2014) | 6,526 | 324,342 | 10,087 | Traditional | Second-language learning |
SIGHAN 2013 Training data (Wu et al. 2013) | 350 | 17,220 | 350 | Traditional | Second-language learning |
Source | # sentence pairs | # chars | # spelling errors | Character set | Genre |
---|---|---|---|---|---|
Synthetic training dataset (Wang et al. 2018) | 271,329 | 12M | 382,702 | Simplified | News |
Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com