of words—performs worse than the best divergence and
topic signature features. The modified overlap score of
input topic signatures and summary words also fails to
obtain very high correlations. The rankings based on unigram
and multinomial summary probabilities do not correlate
with manual scores. Almost all systems use frequency
in some form to inform content selection and this
could be a reason why likelihood fails to distinguish between
the system summaries.
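As a concrete illustration of the divergence features discussed above, the following is a minimal sketch of computing Jensen-Shannon divergence between the unigram distributions of an input and a summary. It is not the authors' implementation: tokenization is trivial and no smoothing is applied, both of which are simplifying assumptions.

```python
import math
from collections import Counter

def unigram_dist(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    vocab = set(p) | set(q)
    js = 0.0
    for w in vocab:
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # the mixture distribution
        if pw > 0:
            js += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            js += 0.5 * qw * math.log2(qw / mw)
    return js

# Toy example: lower divergence means the summary's word distribution
# is closer to the input's, so summaries are ranked by increasing JS.
input_tokens = "rebel forces attacked two villages near the border".split()
summary_tokens = "rebels attacked villages near the border".split()
print(js_divergence(unigram_dist(input_tokens), unigram_dist(summary_tokens)))
```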
4.2 Performance at the micro level
As a more stringent assessment of the automatic evaluation,
let us consider the rankings obtained on a per-input
basis. These results are summarized in Table 3. For each feature
we report the number of inputs for which the correlation was
significant, the minimum and maximum correlation values, and the
number of inputs with correlations above 0.5. The results are less
spectacular at the level of individual inputs: JS divergence rankings
obtain significant correlations with pyramid scores for 73% of
the inputs, and only 40% of inputs yield correlations
above 0.5. The results are worse for the other features
and for comparisons with responsiveness scores.
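To make the micro-level procedure concrete, here is a hedged sketch of the per-input analysis. The data layout (a dict mapping each input to parallel per-system score lists) and the function name micro_level_stats are illustrative assumptions, not the authors' code.

```python
from scipy.stats import spearmanr

def micro_level_stats(auto_scores, manual_scores, alpha=0.05):
    """auto_scores, manual_scores: {input_id: [one score per system]}.

    Returns the counts and correlation range reported in Table 3:
    how many inputs show a significant per-input correlation, how many
    exceed 0.5, and the min/max correlation across inputs.
    """
    n_significant = n_above_half = 0
    rhos = []
    for input_id, auto in auto_scores.items():
        rho, pval = spearmanr(auto, manual_scores[input_id])
        rhos.append(rho)
        if pval < alpha:   # correlation significant for this input
            n_significant += 1
        if rho > 0.5:      # correlation above 0.5 for this input
            n_above_half += 1
    return n_significant, n_above_half, min(rhos), max(rhos)
```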
Overall, the micro-level results suggest that the fully
automatic measures we examined will not be useful for
assessing summary quality on a single input. When averaged
over many test sets, however, the fully automatic
evaluation measures give more reliable and useful
results, highly correlated with the rankings produced by
manual evaluations.
4.3 Effects of stemming
So far, the analysis was based on feature values computed
after stemming the input and summary words. We also
computed the values of the same features without stemming
and found that divergence metrics benefit greatly
when stemmed vocabularies are used. The biggest improvements
in correlations are for JS and KL divergences
with respect to responsiveness. For JS divergence, the
correlation increases from 0.571 to 0.736 and for KL divergence
(summary-input), from 0.528 to 0.694. Before
stemming, the topic signature and bag-of-words overlap features
are the best predictors of responsiveness (correlations
of 0.630 and 0.642 respectively), but they change little
after stemming (topic overlap: 0.629; bag of words:
0.647); the divergences emerge as the better metrics only after
stemming. Stemming also proves beneficial for the likelihood
features: before stemming their correlations point
in the wrong direction, and after stemming they become
either positive or closer to zero. These probabilities
nevertheless remain unable to produce human-like rankings.
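The stemming step itself is standard. A minimal sketch using NLTK's Porter stemmer (an assumed choice; the excerpt does not specify which stemmer was used) shows how inflected variants collapse to shared stems before the distributions are built:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Lowercase and stem each token before building word distributions."""
    return [stemmer.stem(t.lower()) for t in tokens]

# "attacked" and "attack" map to the same stem, so after stemming the
# input and summary vocabularies align more closely, which tightens the
# distribution match that the divergence features measure.
summary = ["The", "rebels", "attacked", "two", "villages"]
input_text = ["Rebel", "forces", "attack", "villages", "near", "the", "border"]
print(stem_tokens(summary))
print(stem_tokens(input_text))
```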
4.4 Difference in correlations: pyramid and
responsiveness scores
Overall, we find that correlations with pyramid scores are
higher than correlations with responsiveness. This is expected,
since our features are designed to compare input and summary content
only. Higher-order ROUGE n-gram scores, on the other hand,
can be expected to capture some aspects of fluency
in addition to estimating content quality. Since
responsiveness judgements were based on both the content
and the linguistic quality of summaries, it is not surprising
that these rankings are harder to replicate using our content-based
features.
5 Comparison with ROUGE
Although JS divergence outperforms ROUGE-1 recall
in correlations with pyramid scores at the average level,
ROUGE-2 recall is still better, and ROUGE obtains
the best correlations with responsiveness judgements. At
the per-input micro level, ROUGE clearly gives the most
human-like rankings: ROUGE-1 recall obtains significant
correlations for over 95% of inputs and correlations
above 0.5 for at least 50% of inputs. The ROUGE results
are shown in the last two rows of Table 3.
However, when making these comparisons we must keep in
mind that ROUGE evaluates system summaries against four
manual model summaries for each input, whereas the
evaluations using our features are fully automatic,
requiring no human summaries at all.
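For reference, multi-model ROUGE-1 recall can be approximated as clipped unigram overlap summed over the human models. This is a simplified sketch under that assumption; real ROUGE adds options such as stemming, stopword removal, and jackknifing over the model set.

```python
from collections import Counter

def rouge1_recall(system_tokens, model_tokens_list):
    """Approximate ROUGE-1 recall of a system summary against
    several human model summaries (e.g., the four models per input)."""
    sys_counts = Counter(system_tokens)
    matched = total = 0
    for model in model_tokens_list:
        model_counts = Counter(model)
        # Clipped unigram overlap: a system word can match at most as
        # many times as it occurs in the system summary.
        matched += sum(min(c, sys_counts[w]) for w, c in model_counts.items())
        total += sum(model_counts.values())
    return matched / total
```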
For manual pyramid scores, the best correlation obtained
with fully automatic evaluation is 0.88 (JS divergence),
while the best correlation with ROUGE is 0.90
(R2). For content-based evaluation, the difference is
negligibly small.
For manual responsiveness scores, which combine aspects
of linguistic quality with the evaluation of content selection,
the best correlations are 0.73 (JS divergence)
and 0.87 (R2). For this measure, the difference between
ROUGE and the fully automatic comparisons is significant,
indicating that our intuition that the proposed metrics
 