combination, the training set consisted only of examples that included neither the same input nor the same system. Hence, during training, no examples of either the test input or the test system were seen.
5 Correlations with manual evaluations
In this section, we report the correlations between
system ranking using our automatic features and
the manual evaluations. We studied the predictive
power of features in two scenarios.
MACRO LEVEL; PER SYSTEM: The values of features
were computed for each summary submitted
for evaluation. For each system, the feature values
were averaged across all inputs. All participating
systems were ranked based on the average value.
Similarly, the average manual score, pyramid or
responsiveness, was also computed for each system.
The correlations between the two rankings
are shown in Tables 2 and 4.
MICRO LEVEL; PER INPUT: The systems were
ranked for each input separately, and correlations
between the summary rankings for each input
were computed (Table 3).
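The macro-level procedure can be sketched as follows: average each feature over all inputs per system, then rank-correlate the averages with the averaged manual scores. The numbers below are invented toy data and the helper names are ours; Spearman is computed without tie correction for simplicity. The micro level would instead correlate the system rankings within each single input.

```python
def spearman(x, y):
    """Spearman rank correlation (no tie correction, for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Toy data: one feature value per (system, input) pair, and one
# averaged manual (e.g. pyramid) score per system.
feature = {"sysA": [0.2, 0.3, 0.25], "sysB": [0.5, 0.4, 0.45],
           "sysC": [0.7, 0.8, 0.75]}
manual = {"sysA": 0.30, "sysB": 0.55, "sysC": 0.80}

systems = sorted(feature)
avg_feature = [sum(feature[s]) / len(feature[s]) for s in systems]
avg_manual = [manual[s] for s in systems]
print(spearman(avg_feature, avg_manual))  # 1.0 for this toy data
```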
The two levels of analysis address different
questions: Can we automatically identify system
performance across all test inputs (macro
level) and can we identify which summaries for a
given input were good and which were bad (micro
level)? For the first task, the answer is a definite
“yes” while for the second task the results are
mixed.
In addition, we compare our results to model-based evaluations using ROUGE and analyze the effects of stemming the input and summary vocabularies.
To allow for in-depth discussion, we analyze our findings only for query-focused summaries. Similar results were obtained
for the evaluation of update summaries and are described
in Section 7.
5.1 Performance at macro level
Table 2 shows the Spearman correlation between manual and automatic scores averaged across the 48 inputs.

    Features                   pyramid   respons.
    JS div                     -0.880    -0.736
    JS div smoothed            -0.874    -0.737
    % of input topic words      0.795     0.627
    KL div summ-inp            -0.763    -0.694
    cosine overlap              0.712     0.647
    % of summ = topic wd        0.712     0.602
    topic overlap               0.699     0.629
    KL div inp-summ            -0.688    -0.585
    mult. summary prob.         0.222     0.235
    unigram summary prob.      -0.188    -0.101
    regression                  0.867     0.705
    ROUGE-1 recall              0.859     0.806
    ROUGE-2 recall              0.905     0.873

Table 2: Spearman correlation on the macro level for the query-focused task. All results are highly significant with p-values < 0.000001, except the unigram and multinomial summary probabilities, which are not significant even at the 0.05 level.

We find that both distributional similarity
and the topic signature features produce system
rankings very similar to those produced by humans.
Summary probabilities, on the other hand,
turn out to be unpredictive of content selection
performance. The linear regression combination
of features obtains high correlations with manual
scores but does not lead to better results than the
single best feature: JS divergence.
JS divergence outperforms other features including
the regression metric and obtains the best
correlations with both types of manual scores, 0.88
with pyramid score and 0.74 with responsiveness.
The regression metric performs comparably with
correlations of 0.86 and 0.70. The correlations obtained
by both JS divergence and the regression
metric with pyramid evaluations are in fact better than those obtained by ROUGE-1 recall (0.85).
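JS divergence can be computed from the unigram distributions of the input and the summary. The sketch below is a minimal, unsmoothed version with base-2 logarithms and our own function names; the paper's exact preprocessing (tokenization, smoothing, stemming) may differ. Note that lower divergence indicates a summary closer to the input, which is why its correlations in Table 2 are negative.

```python
import math
from collections import Counter

def unigram_dist(tokens, vocab):
    """Maximum-likelihood unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def kl(p, q):
    """KL divergence D(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(input_tokens, summary_tokens):
    """Jensen-Shannon divergence between input and summary distributions:
    JS(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    Unlike plain KL, it is defined even for disjoint vocabularies."""
    vocab = sorted(set(input_tokens) | set(summary_tokens))
    p = unigram_dist(input_tokens, vocab)
    q = unigram_dist(summary_tokens, vocab)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

inp = "the storm caused floods and the storm closed roads".split()
summ = "the storm caused floods".split()
print(js_divergence(inp, summ))  # small value: summary close to input
```

With base-2 logs the value lies in [0, 1], with 0 for identical distributions and 1 for fully disjoint vocabularies.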
The best topic signature based feature—
percentage of input’s topic signatures that are
present in the summary—is second only to JS divergence and regression. The correlation between
this feature and pyramid and responsiveness evaluations
is 0.79 and 0.62 respectively. The proportion
of summary content composed of topic words
performs worse as an evaluation metric, with correlations of 0.71 and 0.60. This result indicates that
summaries that cover more topics from the input
are judged to have better content than those in
which fewer topics are mentioned.
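The two topic-word features just described can be sketched as below. The extraction of the topic-word set itself (typically a log-likelihood ratio test against a background corpus) is assumed to have been done already; the function and variable names here are our own.

```python
def topic_word_features(input_topic_words, summary_tokens):
    """Two features: the fraction of the input's topic words that
    appear in the summary, and the fraction of summary tokens that
    are topic words."""
    topic = set(input_topic_words)
    summ = set(summary_tokens)
    pct_input_topics_covered = len(topic & summ) / len(topic)
    pct_summary_is_topic = (sum(1 for t in summary_tokens if t in topic)
                            / len(summary_tokens))
    return pct_input_topics_covered, pct_summary_is_topic

topics = ["storm", "floods", "roads", "damage"]
summary = "the storm caused floods across the region".split()
# 0.5 of the topic words are covered; ~0.29 of summary tokens are topic words
print(topic_word_features(topics, summary))
```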
Cosine overlap and the KL divergences obtain good correlations, but lower ones than JS divergence or the percentage of input topic words. Further, rankings based on unigram and multinomial summary probabilities do not correlate significantly
with manual scores.
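For completeness, cosine overlap can be sketched as the cosine similarity between term-frequency vectors of the input and the summary. This is one plausible reading of the feature; the exact term weighting used in the paper may differ, and the function name is ours.

```python
import math
from collections import Counter

def cosine_overlap(input_tokens, summary_tokens):
    """Cosine similarity between raw term-frequency vectors
    of the input and the summary."""
    a, b = Counter(input_tokens), Counter(summary_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

inp = "the storm caused floods and closed roads".split()
summ = "the storm caused floods".split()
print(cosine_overlap(inp, summ))  # close to 1 when vocabularies overlap heavily
```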
5.2 Performance on micro level
On a per-input basis, the proposed metrics are less effective at distinguishing which summaries have better content. The minimum and maximum
correlations with manual evaluations across the 48
inputs are given in Table 3. The number and percentage
 