Four human summaries were used for the annotation.
The pyramid score for a system summary is equal to the
ratio between the sum of weights of SCUs expressed in
a summary (again identified manually) and the sum of
weights of an ideally informative summary with the same
number of SCUs.³

³ In addition, pyramids using all combinations of three models were constructed from the four-model pyramid. A human summary was scored against a pyramid comprising the SCUs from the other three model summaries. Jackknifing was implemented for system summaries by comparing them to each of the four 3-model pyramids, obtaining a pyramid score from each comparison. The average of these scores is also reported together with the score from the four-model pyramid. The correlation between the two pyramid scores (using three and four models) is very high: 0.9997 for query-focused and 0.9993 for update tasks, so we will not discuss the correlations of our features with the three-model pyramid scores.
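To make the scoring concrete, below is a minimal sketch of the pyramid score described above, assuming the SCU weights have already been obtained from manual annotation; the function names and weights are illustrative and not taken from the actual evaluation data.

    # Minimal sketch of the pyramid score: the total weight of SCUs expressed in
    # the summary divided by the total weight of an ideally informative summary
    # with the same number of SCUs. All weights here are illustrative.
    def pyramid_score(summary_scu_weights, pyramid_scu_weights):
        # summary_scu_weights: weights of SCUs found (manually) in the summary
        # pyramid_scu_weights: weights of all SCUs in the pyramid
        n = len(summary_scu_weights)
        observed = sum(summary_scu_weights)
        # An ideally informative summary with n SCUs would express the n
        # highest-weight SCUs available in the pyramid.
        ideal = sum(sorted(pyramid_scu_weights, reverse=True)[:n])
        return observed / ideal if ideal > 0 else 0.0

    # Toy pyramid built from four models, so SCU weights range from 1 to 4.
    print(pyramid_score([4, 3, 1], [4, 4, 3, 3, 2, 2, 1, 1]))  # 8 / 11 ≈ 0.73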
Responsiveness evaluation
The responsiveness of a summary is defined as a measure of overall quality combining content selection and linguistic quality: summaries must present useful content in a structured fashion in order to better satisfy the user's need. Assessors directly assigned responsiveness scores on a scale of 1 (poor summary) to 5 (very good summary) to each summary. The assessments were done without reference to any model summaries.
ROUGE
NIST also evaluated the summaries automatically using ROUGE (Lin, 2004; Lin and Hovy, 2003). Comparison between a summary and a set of model summaries is computed using unigram (R1) and bigram (R2) overlaps. The scores were computed after stemming, but stop words were retained in the summaries. Table 1 shows that ROUGE obtains very good correlations with the manual scores for content selection. The correlation with pyramid scores is 0.90 and 0.94 for query-focused and update summaries respectively, and 0.87 and 0.88 with responsiveness. Given these observations, ROUGE is a high-performing automatic evaluation metric when human summaries are available, and it sets a high standard for the comparison of other automatic evaluation methods.
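As an illustration of the overlap that R1 and R2 measure, the following is a deliberately simplified sketch of recall-oriented n-gram overlap against a single model summary; the official ROUGE toolkit additionally performs stemming, handles multiple models and jackknifing, and would be used for the actual scoring.

    # Simplified sketch of ROUGE-style n-gram recall against one model summary.
    # Tokenization is naive and no stemming is applied, unlike the real toolkit.
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_recall(summary_tokens, model_tokens, n):
        summ, model = ngrams(summary_tokens, n), ngrams(model_tokens, n)
        overlap = sum(min(count, summ[gram]) for gram, count in model.items())
        total = sum(model.values())
        return overlap / total if total else 0.0

    summary = "the cat sat on the mat".split()
    model = "the cat was sitting on the mat".split()
    print(rouge_n_recall(summary, model, 1))  # unigram overlap, R1-style
    print(rouge_n_recall(summary, model, 2))  # bigram overlap, R2-style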
Linguistic quality questions were used to assess the readability and well-formedness of the produced summaries. Assessors scored the well-formedness of a summary on a scale from 1 (very poor) to 5 (very good). Grammaticality, non-redundancy, referential clarity, focus, and structure and coherence were the factors considered during the evaluation.
But since our features are designed to capture content selection quality, the manual pyramid and responsiveness scores will be used for comparison with our automatic method. The correlation between these two evaluations is overall high: 0.885 and 0.923 for query-focused and update summarization respectively. Despite this, we use both measures as references for comparison with our fully automatic evaluation method because, albeit high, the correlation between them is not perfect (just as the correlation between the two alternative pyramid scores was not).
3 Features
We describe three classes of features used to compare input
and summary content: distributional similarity, summary
likelihood and use of topic signatures. Words in
both input and summary were stemmed before feature
computations.
3.1 Distributional Similarity
Measures of similarity between two probability distributions
are a natural choice for the task at hand. We choose
to experiment with three such common measures: KL divergence, Jensen-Shannon (JS) divergence, and cosine similarity. We expect
that good summaries are characterized by low divergence
between the probability distributions of words in
the input and the summary, and by high similarity with
the input.
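The sketch below shows how the three measures can be computed over the word distributions of an input and a summary; the add-epsilon smoothing and the toy texts are assumptions made only for illustration and are not the exact estimates used for our features.

    # KL divergence, JS divergence and cosine similarity between the word
    # distributions of an input and a summary. Smoothing is illustrative only.
    import math
    from collections import Counter

    def word_dist(tokens, vocab, eps=1e-9):
        counts = Counter(tokens)
        total = sum(counts.values()) + eps * len(vocab)
        return {w: (counts[w] + eps) / total for w in vocab}

    def kl(p, q):
        return sum(p[w] * math.log(p[w] / q[w], 2) for w in p)

    def js(p, q):
        m = {w: 0.5 * (p[w] + q[w]) for w in p}
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def cosine(p, q):
        dot = sum(p[w] * q[w] for w in p)
        norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    input_words = "the storm caused severe flooding across the region".split()
    summary_words = "severe flooding hit the region".split()
    vocab = set(input_words) | set(summary_words)
    P, Q = word_dist(input_words, vocab), word_dist(summary_words, vocab)
    # A good summary should give low KL/JS divergence and high cosine similarity.
    print(kl(P, Q), js(P, Q), cosine(P, Q))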
Moreover, these three metrics have already been used
for summary evaluation, albeit in different contexts. Lin et al. (2006) compared the performance of ROUGE with KL and JS divergence for the evaluation of summaries using human models. The divergence between human and machine summary distributions was used as an estimate of the summary score. The study found that JS divergence always outperformed KL divergence and that, with multiple human references, JS divergence performed better than standard ROUGE scores for multi-document summarization. JS divergence has also been found useful
in other NLP tasks as a good predictor of unseen events
(Dagan et al., 1994; Lapata, 2000).
The use of cosine similarity in Donaway et al. (2000) is more directly related to our work. In this study, it was
 