guided simply by the presence of any topic word, while the second measures the diversity of topic words used in the summary.
3.4 Feature combination using linear regression
We also evaluated the performance of a combined feature that merges all of the above features into a single measure using linear regression. The value of this feature for each summary was obtained with a leave-one-out approach: for a particular input and system-summary combination, a linear regression model was trained to predict the manual evaluation scores from the automatic features. The training set consisted only of examples that included neither the same input nor the same system, so during training no examples of either the test input or the test system were seen.
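A minimal sketch of this leave-one-out combination is given below; the record layout (input id, system id, feature vector, manual score) and all names are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the leave-one-out regression feature. Each record is assumed to
# hold an input id, a system id, a vector of automatic feature values, and a
# manual score; these names are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

def loo_regression_feature(records):
    """For each summary, train on examples that share neither its input nor
    its system, and predict a combined score from the automatic features."""
    combined = {}
    for test in records:
        train = [r for r in records
                 if r["input_id"] != test["input_id"]
                 and r["system_id"] != test["system_id"]]
        X = np.array([r["features"] for r in train])
        y = np.array([r["manual_score"] for r in train])
        model = LinearRegression().fit(X, y)
        combined[(test["input_id"], test["system_id"])] = float(
            model.predict(np.array([test["features"]]))[0])
    return combined
```

Training a separate model for each test example keeps the evaluation honest: the predicted score for a summary never benefits from manual judgments of the same input or the same system.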
4 Comparison to manual evaluations
In this section, we report the correlations between the system rankings produced by our automatic features and those produced by the manual evaluations. More precisely, the value of each feature was computed for every summary submitted for evaluation. We studied the predictive power of the features in two scenarios. At the macro level (per system), the average feature value across all inputs was used to rank the systems; the average manual score (pyramid or responsiveness) was also computed for each system, and the correlation between the two rankings was analyzed. At the micro level (per input), the systems were ranked separately for each input, and the correlations between the summary rankings for each input were computed.
The two levels of analysis address different questions: can we automatically identify system performance across all test inputs (macro level), and can we identify which summaries for a given input were good and which were bad (micro level)?

Features                      pyramid score   responsiveness
JSD div                           -0.880          -0.736
JSD div smoothed                  -0.874          -0.737
% of input topics in summ          0.795           0.627
KL div summ-inp                   -0.763          -0.694
cosine inp-summ                    0.712           0.647
% of summ = topic wd               0.712           0.602
topic overlap                      0.699           0.629
KL div inp-summ                   -0.688          -0.585
mult. summ prob                    0.222           0.235
unigram summ prob                 -0.188          -0.101
regression                         0.867           0.705

Table 2: Spearman correlation between fully automatically computed features and manually assigned system scores (averaged over all test inputs) for the query-focused summarization subtask in TAC 2008. All results are highly significant with p-values < 0.000001, except the unigram and multinomial summary probabilities, which are not significant.
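How these macro- and micro-level correlations can be computed is sketched below; the per-summary data layout, keyed by (input, system), and all names are assumptions for illustration, not the authors' code.

```python
# Sketch of the two evaluation scenarios, assuming dicts keyed by
# (input_id, system_id) that hold one automatic feature value and one manual
# score per summary; the data layout and names are assumptions.
from collections import defaultdict
from scipy.stats import spearmanr

def macro_correlation(feature_vals, manual_scores):
    """Macro level: rank systems by their average feature value and by their
    average manual score over all inputs, then correlate the two rankings."""
    feat_by_sys, man_by_sys = defaultdict(list), defaultdict(list)
    for (inp, sys_id), value in feature_vals.items():
        feat_by_sys[sys_id].append(value)
        man_by_sys[sys_id].append(manual_scores[(inp, sys_id)])
    systems = sorted(feat_by_sys)
    avg_feat = [sum(feat_by_sys[s]) / len(feat_by_sys[s]) for s in systems]
    avg_man = [sum(man_by_sys[s]) / len(man_by_sys[s]) for s in systems]
    return spearmanr(avg_feat, avg_man)

def micro_correlations(feature_vals, manual_scores):
    """Micro level: for each input, correlate the system ranking induced by
    the feature with the ranking induced by the manual score."""
    by_input = defaultdict(list)
    for (inp, sys_id), value in feature_vals.items():
        by_input[inp].append((value, manual_scores[(inp, sys_id)]))
    results = {}
    for inp, pairs in by_input.items():
        feats, mans = zip(*pairs)
        results[inp] = spearmanr(feats, mans)
    return results
```

The macro level yields a single correlation over systems, while the micro level yields one correlation per input, which can then be summarized, for example by its average or by the number of inputs with significant correlations.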
In addition, we compare our results to model-based
evaluations using ROUGE and analyze the effects of
stemming the input and summary vocabularies.
4.1 Performance at macro level
Table 2 shows the Spearman correlation between the
manual and automatic scores averaged across the 48 inputs.
We find that both the distributional similarity and the topic signature features obtain rankings very similar to those produced by humans, while the summary probabilities turn out to be unsuitable for the evaluation task.
Notably, the linear regression combination of features does not lead to better results than the single best feature, JS divergence, which outperforms all other features, including the regression metric, and obtains the best correlations with both types of manual scores: 0.88 with the pyramid score and 0.74 with responsiveness. The correlation with the pyramid score is in fact better than that obtained by ROUGE-1 recall (0.86). Similar results establishing that
JS divergence is the most suitable measure for automatic
evaluation were reported in a study of model-based evaluation
metrics (Lin et al., 2006). In their study of generic
multi-document summarization, JS divergence between
system and model summaries obtained better correlations
with manual rankings than ROUGE overlap scores. Our
results provide further evidence that this divergence metric
is indeed best suited for content comparison of two
texts.
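For concreteness, the JS divergence feature discussed above compares the unigram distributions of the input and the summary; a minimal sketch follows, using plain maximum-likelihood estimates (it does not reproduce the smoothed variant listed in Table 2).

```python
# Minimal sketch of JS divergence between input and summary unigram
# distributions; lower values indicate more similar content, consistent with
# the negative correlations of the JSD features in Table 2.
import math
from collections import Counter

def js_divergence(input_words, summary_words):
    """Jensen-Shannon divergence (base-2 logs, so bounded by [0, 1]) between
    the word distributions of the input and the summary."""
    p, q = Counter(input_words), Counter(summary_words)
    n_p, n_q = sum(p.values()), sum(q.values())
    vocab = set(p) | set(q)

    def kl_to_mixture(a, n_a, b, n_b):
        # KL(A || M), where M is the equal mixture of the two distributions.
        total = 0.0
        for w in vocab:
            pa = a[w] / n_a
            if pa == 0.0:
                continue
            pm = 0.5 * (a[w] / n_a + b[w] / n_b)
            total += pa * math.log2(pa / pm)
        return total

    return 0.5 * kl_to_mixture(p, n_p, q, n_q) + 0.5 * kl_to_mixture(q, n_q, p, n_p)
```

Because the mixture distribution assigns nonzero probability to every word seen in either text, the divergence is always finite, unlike a one-sided KL divergence, which is undefined without smoothing when a word appears in only one of the two texts.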
The best topic-signature-based feature, the percentage of the input's topic signatures that are present in the summary, ranks next only to JS divergence and regression. Given this result, systems optimizing for topic signatures would score well with respect to content, as was observed in previous large-scale evaluations conducted by NIST. We also find that the feature simply reflecting the proportion of topic signatures in the summary performs worse as an evaluation metric. This observation leads us to the conclusion
that a summary that contains many different topic
signatures from the input seems to carry better content
than one that contains topic signatures of fewer types.
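A rough sketch of the two topic-signature features being contrasted here follows; topic_words stands for the set of topic signatures extracted from the input (the extraction step itself is not shown), and all names are illustrative assumptions.

```python
# Sketch contrasting the two topic-signature features discussed above.
# topic_words: set of topic-signature words extracted from the input;
# summary_tokens: list of tokens in the summary. Names are illustrative.

def pct_input_topics_in_summary(summary_tokens, topic_words):
    """Share of distinct input topic words that appear in the summary;
    rewards covering many different topic words (diversity)."""
    if not topic_words:
        return 0.0
    return len(topic_words & set(summary_tokens)) / len(topic_words)

def pct_summary_tokens_that_are_topic_words(summary_tokens, topic_words):
    """Share of summary tokens that are topic words; rewards mere presence
    of topic words, regardless of how many distinct ones are used."""
    if not summary_tokens:
        return 0.0
    return sum(tok in topic_words for tok in summary_tokens) / len(summary_tokens)
```

A summary that repeats a handful of topic words scores high on the second measure but low on the first, which matches the observation above that diversity of topic words is the better indicator of content quality.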
The most simple comparison metric—cosine overlap