of inputs for which correlations were significant
are also reported.
Now, JS divergence obtains significant correlations
with pyramid scores for 73% of the inputs,
but for particular inputs the correlation can be
as weak as 0.27 in absolute value. The results are worse for other
features and for comparison with responsiveness
scores.
At the micro level, combining features with regression
gives the best result overall, in contrast to
the findings for the macro level setting. This result
has implications for system development; no
single feature can reliably predict good content for
a particular input. Even a regression combination
of all features is a significant predictor of content
selection quality in only 77% of the cases.
We should note, however, that our features are
based only on the distribution of terms in the input
and are therefore less likely to identify good content
for all input types. For example, a set of
documents each describing a different opinion on a
given issue will likely have less repetition at both
the lexical and content unit levels. The predictiveness
of features like ours will be limited for such inputs (see footnote 4).
However, model summaries written for the
specific input would give a better indication of what
information in the input was important and interesting.
This is indeed the case, as we shall see in
Section 6.
Overall, the micro level results suggest that the
fully automatic measures we examined will not be
useful for providing information about summary
quality for an individual input. For averages over
many test sets, the fully automatic evaluations give
more reliable and useful results, highly correlated
with rankings produced by manual evaluations.
Footnote 4: In fact, it would be surprising to find an automatically
computable feature or feature combination which would be
able to consistently predict good content for all individual inputs.
If such features existed, an ideal summarization system
would already exist.
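
The contrast between per-input (micro) and averaged (macro) correlations can be sketched as follows. This is only an illustration under stated assumptions: it uses Spearman rank correlation, which this section does not name; it omits significance testing; and the input ids, system ids, scores, and function names are all hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if a list is constant."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def micro_correlations(feature, manual):
    """One correlation per input, computed across the systems for that input."""
    result = {}
    for input_id in feature:
        systems = sorted(feature[input_id])
        result[input_id] = spearman(
            [feature[input_id][s] for s in systems],
            [manual[input_id][s] for s in systems],
        )
    return result

def macro_correlation(feature, manual):
    """A single correlation between per-system averages over all inputs."""
    systems = sorted(next(iter(feature.values())))
    feature_avg = [mean(feature[i][s] for i in feature) for s in systems]
    manual_avg = [mean(manual[i][s] for i in manual) for s in systems]
    return spearman(feature_avg, manual_avg)

# Hypothetical toy data: a divergence feature (lower is better) and a manual score.
feature = {
    "d1": {"A": 0.2, "B": 0.4, "C": 0.6},
    "d2": {"A": 0.3, "B": 0.5, "C": 0.7},
}
manual = {
    "d1": {"A": 0.9, "B": 0.5, "C": 0.6},
    "d2": {"A": 0.5, "B": 0.8, "C": 0.2},
}

micro = micro_correlations(feature, manual)
macro = macro_correlation(feature, manual)
print(micro, macro)
```

In this toy data the per-input rankings are noisy, while the averaged system scores rank the systems consistently, so the macro correlation is stronger than either micro correlation.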
5.3 Effects of stemming
The analysis presented so far is based on features computed
after stemming the input and summary
words. We also computed the values of the same
features without stemming and found that divergence
metrics benefit greatly from
stemming. The biggest improvements in correlations
are for JS and KL divergences with respect to responsiveness.
For JS divergence, the correlation
increases from 0.57 to 0.73 and for KL divergence
(summary-input), from 0.52 to 0.69.
Before stemming, the topic signature and bag
of words overlap features are the best predictors
of responsiveness (correlations are 0.63 and 0.64
respectively), and they change little after stemming
(topic overlap: 0.62, bag of words: 0.64).
Divergences emerge as better metrics only after
stemming.
Stemming also proves beneficial for the likelihood
features. Before stemming, their correlations
have the wrong sign, but after stemming they
improve, becoming either positive or
closer to zero. However, even after stemming,
summary probabilities are not good predictors of
content quality.
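
To make the effect of stemming on a divergence feature concrete, the sketch below computes an unsmoothed JS divergence between input and summary word distributions, with and without stemming. It is a minimal illustration, not the setup used for the reported numbers: the suffix-stripping `toy_stem` function and the example sentences are hypothetical, and a real experiment would use an established stemmer.

```python
import math
from collections import Counter

# Hypothetical suffix list, for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def js_divergence(input_tokens, summary_tokens):
    """Unsmoothed Jensen-Shannon divergence (base 2) of two word distributions.

    Both token lists must be non-empty. The result lies between 0 and 1.
    """
    p_counts, q_counts = Counter(input_tokens), Counter(summary_tokens)
    p_total, q_total = len(input_tokens), len(summary_tokens)
    divergence = 0.0
    for word in set(p_counts) | set(q_counts):
        p = p_counts[word] / p_total
        q = q_counts[word] / q_total
        m = (p + q) / 2  # mixture distribution; positive whenever p or q is
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return divergence

# Hypothetical input and summary that share content but differ in word forms.
input_tokens = "the storm flooded roads and floods closed schools".split()
summary_tokens = "flooding closed the road and school".split()

raw = js_divergence(input_tokens, summary_tokens)
stemmed = js_divergence(
    [toy_stem(w) for w in input_tokens],
    [toy_stem(w) for w in summary_tokens],
)
print(f"without stemming: {raw:.3f}, with stemming: {stemmed:.3f}")
```

Without stemming, forms such as "flooded", "floods" and "flooding" count as unrelated words, which inflates the divergence between a summary and its input. Conflating them lets the divergence reflect shared content rather than surface variation.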
5.4 Difference in correlations: pyramid and
responsiveness scores
Overall, we find that correlations with pyramid
scores are higher than correlations with responsiveness.
Our features are designed to
compare input and summary content only. Since responsiveness
judgements were based on both content
and linguistic quality of summaries, it is not
surprising that these rankings are harder to replicate
using our content-based features. Nevertheless,
responsiveness scores are dominated by content
quality and the correlation between responsiveness
and JS divergence is high, 0.73.
Clearly, metrics of linguistic quality should be
integrated with content evaluations to allow for
better predictions of responsiveness. To date, few
attempts have been made to automatically evaluate
linguistic quality in summarization. Lapata
and Barzilay (2005) proposed a method for coherence
evaluation which holds promise but has
not been validated so far on large datasets such
as those used in TAC and DUC. In a simpler approach,
Conroy and Dang (2008) use higher order
ROUGE scores to approximate both content and
linguistic quality.
                            pyramid                               responsiveness
features                    max     min  no. significant (%)      max     min  no. significant (%)
JS div                   -0.714  -0.271  35 (72.9)             -0.654  -0.262  35 (72.9)
JS div smoothed          -0.712  -0.269  35 (72.9)             -0.649  -0.279  33 (68.8)
KL div summ-inp          -0.736  -0.276  35 (72.9)             -0.628  -0.261  35 (72.9)
% of input topic words    0.701   0.286  31 (64.6)              0.693   0.279  29 (60.4)
　