of updates from docset B, assuming that the user has read all the documents in A. The maximum length for both types of summaries is 100 words.

There were 57 participating systems in TAC 2008. We use the summaries and evaluations of these systems for the experiments reported in the paper.
3.3 Evaluation metrics
Both manual and automatic evaluations were conducted at NIST to assess the quality of summaries produced by the systems.

2 http://www.nist.gov/tac

manual score       R-1 recall   R-2 recall
Query Focused summaries
pyramid score         0.859        0.905
responsiveness        0.806        0.873
Update summaries
pyramid score         0.912        0.941
responsiveness        0.865        0.884

Table 1: Spearman correlation between manual scores and ROUGE-1 and ROUGE-2 recall. All correlations are highly significant with p-value < 0.00001.
Pyramid evaluation: The pyramid evaluation method (Nenkova and Passonneau, 2004) has been developed for reliable and diagnostic assessment of content selection quality in summarization and has been used in several large scale evaluations (Nenkova et al., 2007). It uses multiple human models from which annotators identify semantically defined Summary Content Units (SCUs). Each SCU is assigned a weight equal to the number of human model summaries that express that SCU. An ideal maximally informative summary would express a subset of the most highly weighted SCUs, with multiple maximally informative summaries being possible. The pyramid score for a system summary is equal to the ratio between the sum of weights of SCUs expressed in a summary (again identified manually) and the sum of weights of an ideal summary with the same number of SCUs.

Four human summaries provided by NIST for each input and task were used for the pyramid evaluation at TAC.
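In symbols, the score just described can be sketched as follows; the notation here is ours, not the paper's, with $\mathcal{U}$ the set of all SCUs in the pyramid and $w(u)$ the number of model summaries expressing SCU $u$:

\[
\mathrm{Pyr}(S) \;=\; \frac{\sum_{u \in \mathrm{SCU}(S)} w(u)}{\mathrm{Max}\bigl(|\mathrm{SCU}(S)|\bigr)},
\qquad
\mathrm{Max}(n) \;=\; \max_{U \subseteq \mathcal{U},\; |U| = n} \; \sum_{u \in U} w(u)
\]

so a summary scores 1 only when its SCUs are among the most highly weighted ones available in the pyramid.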
Responsiveness evaluation: Responsiveness of a summary is a measure of overall quality combining both content selection and linguistic quality: summaries must present useful content in a structured fashion in order to better satisfy the user's need. Assessors directly assigned scores on a scale of 1 (poor summary) to 5 (very good summary) to each summary. These assessments are done without reference to any model summaries. The (Spearman) correlation between the pyramid and responsiveness metrics is high but not perfect: 0.88 and 0.92 respectively for query focused and update summarization.
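For illustration, this kind of rank correlation can be computed with SciPy; the score lists below are made-up placeholders standing in for the per-system averages of the 57 TAC 2008 systems:

from scipy.stats import spearmanr

# Hypothetical per-system average scores (placeholders, not TAC data):
# one pyramid score and one responsiveness score per system.
pyramid = [0.35, 0.28, 0.41, 0.22, 0.30]
responsiveness = [2.8, 2.4, 3.1, 2.0, 2.6]

# Spearman's rho compares the two rankings of the systems.
rho, p_value = spearmanr(pyramid, responsiveness)
print(rho, p_value)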
ROUGE evaluation: NIST also evaluated the summaries automatically using ROUGE (Lin, 2004; Lin and Hovy, 2003). Comparison between a summary and the set of four model summaries is computed using unigram (R1) and bigram (R2) overlaps3. The correlations between ROUGE and manual evaluations are shown in Table 1 and vary between 0.80 and 0.94.
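As a rough sketch of what these scores measure, ROUGE-N recall against several model summaries can be approximated as below; the official ROUGE toolkit used at NIST adds stemming, stopword handling and jackknifing options that are omitted here:

from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams (as tuples) in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(peer_tokens, model_token_lists, n):
    # Clipped n-gram matches pooled over all models, divided by the
    # total number of n-grams in the models (simplified ROUGE-N recall).
    peer = ngrams(peer_tokens, n)
    matched = total = 0
    for model_tokens in model_token_lists:
        model = ngrams(model_tokens, n)
        matched += sum(min(count, peer[gram]) for gram, count in model.items())
        total += sum(model.values())
    return matched / total if total else 0.0

# Toy usage: R-1 and R-2 recall of a peer summary against two models.
peer = "the senate passed the climate bill on friday".split()
models = ["the senate approved the climate bill friday".split(),
          "a climate bill passed the senate".split()]
print(rouge_n_recall(peer, models, 1), rouge_n_recall(peer, models, 2))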
Linguistic quality evaluation: Assessors scored summaries on a scale from 1 (very poor) to 5 (very good) for five factors of linguistic quality: grammaticality, non-redundancy, referential clarity, focus, and structure and coherence.

We do not make use of any of the linguistic quality evaluations. Our work focuses on fully automatic evaluation of content selection, so manual pyramid and responsiveness scores are used for comparison with our automatic method. The pyramid metric measures content selection exclusively, while responsiveness incorporates at least some aspects of linguistic quality.
4 Features for content evaluation
We describe three classes of features to compare input and summary content: distributional similarity, summary likelihood and use of topic signatures. Both input and summary words were stopword filtered and stemmed before computing the features.
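A minimal version of this preprocessing step might look as follows; the paper does not name a tokenizer, stopword list or stemmer, so NLTK's English stopword list and Porter stemmer are assumptions:

# Requires the NLTK 'punkt' and 'stopwords' data packages.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text):
    # Lowercase, tokenize, drop punctuation and stopwords, then stem.
    tokens = word_tokenize(text.lower())
    return [STEMMER.stem(t) for t in tokens if t.isalpha() and t not in STOPWORDS]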
4.1 Distributional Similarity
Measures of similarity between two probability distributions are a natural choice for the task at hand. One would expect good summaries to be characterized by low divergence between probability distributions of words in the input and summary, and by high similarity with the input.
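A minimal sketch of such input-summary comparisons over smoothed unigram distributions is given below; the measures themselves (KL and Jensen-Shannon divergence, cosine similarity) are the ones discussed next, while the add-gamma smoothing constant is our own assumption:

import math
from collections import Counter

def unigram_dist(tokens, vocab, gamma=0.5):
    # Add-gamma smoothed unigram distribution over a shared vocabulary
    # (the smoothing scheme is an assumption, not the paper's).
    counts = Counter(tokens)
    denom = len(tokens) + gamma * len(vocab)
    return {w: (counts[w] + gamma) / denom for w in vocab}

def kl(p, q):
    # KL divergence D(p || q); smoothing guarantees q has no zeros.
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p if p[w] > 0)

def js(p, q):
    # Jensen-Shannon divergence: average KL to the mean distribution.
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(p, q):
    # Cosine similarity between the distributions viewed as vectors.
    dot = sum(p[w] * q[w] for w in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Usage: compare preprocessed input and summary token lists.
input_tokens = ["economi", "growth", "slow", "quarter", "growth"]
summary_tokens = ["growth", "slow"]
vocab = set(input_tokens) | set(summary_tokens)
p_inp, p_sum = unigram_dist(input_tokens, vocab), unigram_dist(summary_tokens, vocab)
print(js(p_inp, p_sum), cosine(p_inp, p_sum))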
We experimented with three common measures: KL and Jensen-Shannon divergence, and cosine similarity. These three metrics have already been applied for summary evaluation, albeit in different contexts. In Lin et al. (2006), KL and JS divergences between human and machine summary distributions were used to evaluate content selection. The study found that JS divergence always outperformed KL divergence. Moreover, the performance of JS divergence was better than standard ROUGE scores for multi-document summarization
 