ChineseNLP

Chinese Text Summarization

Background

Text summarization takes a long text document and produces a shorter document that is a fluent and accurate summary of the original.

Example

Input:

较早进入中国市场的星巴克,是不少小资钟情的品牌。相比在美国的平民形象,星巴克在中国就显得“高端”得多。用料并无差别的一杯中杯美式咖啡,在美国仅约合人民币12元,国内要卖21元,相当于贵了75%。第一财经日报
(English: Starbucks, an early entrant into the Chinese market, is a brand favored by many white-collar consumers. Compared with its everyday image in the US, Starbucks appears far more "high-end" in China. A tall Americano made with identical ingredients costs only about 12 RMB in the US but sells for 21 RMB in China, roughly 75% more. Source: China Business News.)

Output:

媒体称星巴克美式咖啡售价中国比美国贵75%。
(English: Media report that a Starbucks Americano sells for 75% more in China than in the US.)

Standard Metrics

ROUGE compares an automatically produced summary against human-written reference summaries. ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L is based on the longest common subsequence. For Chinese, ROUGE can be computed over either characters or words.
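
Below is a minimal, hand-rolled sketch of character-level ROUGE-1/2 and ROUGE-L (F1 variants), not the official ROUGE toolkit, which supports further options. The reference is the example summary above; the candidate string is a hypothetical system output used only for illustration.

    from collections import Counter

    def ngrams(chars, n):
        # Count overlapping character n-grams.
        return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

    def rouge_n(candidate, reference, n=1):
        # F1 over clipped n-gram overlap between candidate and reference.
        cand, ref = ngrams(list(candidate), n), ngrams(list(reference), n)
        overlap = sum((cand & ref).values())
        if overlap == 0:
            return 0.0
        precision = overlap / sum(cand.values())
        recall = overlap / sum(ref.values())
        return 2 * precision * recall / (precision + recall)

    def rouge_l(candidate, reference):
        # F1 based on the longest common subsequence of characters.
        x, y = list(candidate), list(reference)
        dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
        for i in range(1, len(x) + 1):
            for j in range(1, len(y) + 1):
                if x[i - 1] == y[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        lcs = dp[len(x)][len(y)]
        if lcs == 0:
            return 0.0
        precision, recall = lcs / len(x), lcs / len(y)
        return 2 * precision * recall / (precision + recall)

    reference = "媒体称星巴克美式咖啡售价中国比美国贵75%。"
    candidate = "星巴克美式咖啡中国售价比美国贵75%。"  # hypothetical system output
    print(rouge_n(candidate, reference, 1),
          rouge_n(candidate, reference, 2),
          rouge_l(candidate, reference))

Scoring over characters avoids any dependence on a particular Chinese word segmenter.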

Implementations

LCSTS: A Large Scale Chinese Short Text Summarization Dataset (Hu et al., 2015), constructed from short news posts on the Sina Weibo microblogging service.

Test set              # (text, summary) pairs   # pairs with human score >= 3   Genre
Part II (validation)  10,666                    8,685                           News, politics, economics, military, movies, games, etc.
Part III (test)       1,106                     725                             News, politics, economics, military, movies, games, etc.
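
Pairs in Parts II and III carry human relevance scores from 1 to 5, and evaluation is usually restricted to pairs scored 3 or higher. The sketch below shows that filtering step; it assumes the pseudo-XML record layout (<doc>, <human_label>, <summary>, <short_text>) described in the LCSTS paper, and the file name at the end is only a placeholder.

    import re

    def load_scored_pairs(path, min_score=3):
        # Return (short_text, summary) pairs whose human score is >= min_score.
        # Assumes LCSTS-style records; adjust the tag names if your copy differs.
        with open(path, encoding="utf-8") as f:
            raw = f.read()
        pairs = []
        for block in raw.split("</doc>"):
            label = re.search(r"<human_label>\s*(\d)\s*</human_label>", block)
            summary = re.search(r"<summary>\s*(.*?)\s*</summary>", block, re.S)
            text = re.search(r"<short_text>\s*(.*?)\s*</short_text>", block, re.S)
            if label and summary and text and int(label.group(1)) >= min_score:
                pairs.append((text.group(1), summary.group(1)))
        return pairs

    # pairs = load_scored_pairs("PART_III.txt")  # placeholder path to the released file

Part I carries no human label, so this filter is only meaningful for Parts II and III.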

Results

System               ROUGE-1   ROUGE-2   ROUGE-L
Duan et al. (2019)   44.35     30.65     40.58
Wang et al. (2018)   39.9      21.5      37.9
Lin et al. (2018)    39.4      26.9      36.5
Ma et al. (2018)     39.2      26.0      36.2
Wei et al. (2018)    36.2      24.3      33.8
Seq2Seq (baseline)   32.1      19.9      29.2

Resources

Train set   # (text, summary) pairs   Genre
Part I      2,400,591                 News

Other Resources


Suggestions? Changes? Please send email to chinesenlp.xyz@gmail.com