will not be suitable for linguistic quality evaluation
was correct. Other metrics for linguistic quality need to
be explored for this task (Lapata and Barzilay, 2005).
                              pyramid                           responsiveness
features            max     min    sig  sig%  a0.5 a0.5%    max     min    sig  sig%  a0.5 a0.5%
JSD                -0.714  -0.271  35   72.9   19  39.6    -0.654  -0.262  35   72.9   17  35.4
JSD smoothed       -0.712  -0.269  35   72.9   18  37.5    -0.649  -0.279  33   68.8   17  35.4
KL summ-inp        -0.736  -0.276  35   72.9   17  35.4    -0.628  -0.261  35   72.9   13  27.1
% of input sign     0.701   0.286  31   64.6   16  33.3     0.693   0.279  29   60.4    9  18.8
cosine overlap      0.622   0.276  31   64.6    6  12.5     0.618   0.265  28   58.3    4   8.3
KL inp-summ        -0.628  -0.262  28   58.3    8  16.7    -0.577  -0.267  22   45.8    6  12.5
topic overlap       0.597   0.265  30   62.5    5  10.4     0.689   0.277  26   54.2    3   6.3
% summ sign         0.607   0.269  23   47.9    7  14.6     0.534   0.272  23   47.9    1   2.1
mult. summ prob     0.434   0.268   8   16.7    0   0       0.459   0.272  10   20.8    0   0
uni. summ prob      0.292   0.261   2    4.2    0   0       0.466   0.287   2    4.2    0   0
regression          0.736   0.281  37   77.1   14  29.2     0.642   0.262  32   66.7    6  12.5
Rouge-1 recall      0.833   0.264  47   97.9   32  66.7     0.754   0.266  46   95.8   25  52.1
Rouge-2 recall      0.875   0.316  48  100     33  68.8     0.742   0.299  44   91.7   22  45.8

Table 3: Spearman correlations between feature values and manual system scores on a per-input basis (TAC 2008 Query-Focused Summarization). Only the minimum and maximum values of the significant correlations are reported, together with the number of significant correlations (sig, sig%) and the number of inputs for which correlations above 0.5 were obtained (a0.5, a0.5%).
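The strongest input-based feature in Table 3, JSD, is the Jensen-Shannon divergence between the word distributions of the input and the summary. A minimal sketch of how such a feature can be computed, assuming whitespace tokenization and add-one smoothing over the joint vocabulary (the paper's exact tokenization and smoothing may differ):

```python
from collections import Counter
from math import log2

def word_dist(text, vocab):
    """Unigram distribution over a shared vocabulary, with add-one smoothing."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl(p, q):
    """KL divergence D(p || q) in bits; assumes smoothed distributions (no zeros)."""
    return sum(p[w] * log2(p[w] / q[w]) for w in p)

def jsd(input_text, summary_text):
    """Jensen-Shannon divergence between input and summary word distributions."""
    vocab = set(input_text.lower().split()) | set(summary_text.lower().split())
    p = word_dist(input_text, vocab)
    q = word_dist(summary_text, vocab)
    m = {w: 0.5 * (p[w] + q[w]) for w in vocab}  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms, JSD is symmetric and bounded in [0, 1]; identical input and summary distributions yield a divergence of 0, which is why the correlations with manual scores in Table 3 are negative.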
6 Evaluation of systems for the TAC 2008 Update Summarization task
In the paper, we discussed only the results from evaluations of the query-focused summaries produced at TAC 2008. The results for the update task are very similar, and all conclusions hold for these data as well. For completeness, we give the correlations between fully automatic and manual evaluations in Table 4.
7 Summarization as optimization: is JSD enough?
We have demonstrated that comparing input and summary content is predictive of summary quality. Further, our experiments show that a single best feature can approximate the comparison. A natural question arises: if a summarizer were built that globally optimizes JS divergence during summary creation, would this input-based evaluation method not be voided?
It must be remembered that the goal of summarization is not the selection of good content alone. Summarizers must also reduce redundancy, improve coherence, and adhere to a length restriction. These goals often come into conflict during summary creation, and satisfying them simultaneously is a difficult problem.
Studies of global inference algorithms for multi-document summarization (McDonald, 2007; Yih et al., 2007; Hassel and Sjöbergh, 2006) found that optimizing for content is NP-hard and equivalent to the knapsack problem. McDonald (2007) further showed that the intractability of a relevance-maximization framework increases with the addition of redundancy constraints. Although exact solutions may be found using ILP formulations, these can be used only on small document sets; their huge runtimes make them prohibitively expensive for summarizing large collections of documents. Hence only approximate solutions to the problem are feasible in real-world situations.
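To make the knapsack analogy concrete, one common family of approximate strategies is a greedy heuristic: repeatedly take the sentence with the best score-per-word ratio that still fits the length budget, skipping candidates that overlap too heavily with sentences already selected. The sketch below is purely illustrative — the scoring function, overlap threshold, and budget are placeholders, not the inference procedure of any of the cited systems:

```python
def greedy_select(sentences, scores, budget, max_overlap=0.5):
    """Greedy knapsack-style selection: take sentences in order of
    score-per-word ratio, subject to a total word budget and a simple
    word-overlap redundancy constraint."""
    order = sorted(range(len(sentences)),
                   key=lambda i: scores[i] / max(len(sentences[i].split()), 1),
                   reverse=True)
    chosen, used = [], 0
    for i in order:
        words = sentences[i].split()
        if used + len(words) > budget:
            continue  # sentence does not fit in the remaining budget
        # redundancy constraint: skip sentences with heavy word overlap
        if any(len(set(words) & set(s.split())) / max(len(set(words)), 1) > max_overlap
               for s in chosen):
            continue
        chosen.append(sentences[i])
        used += len(words)
    return chosen
```

Greedy selection runs in O(n log n) time in the number of sentences, in contrast to the exponential worst case of exact ILP inference, but it carries no optimality guarantee.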
Although some approximate solutions obtain very good results in McDonald (2007), we must note that coherence is not included in that framework, and that coherence is in fact a multi-faceted constraint requiring consideration of anaphora, discourse relations, cohesion, and ordering. With coherence constraints added, the inference could only become harder. Hence our evaluation method may still be suitable for content evaluation of summaries, provided the overall summarizer scores also include judgements of linguistic quality and redundancy.
8 Improving input-summary comparisons
Our experiments are clearly a starting point in understanding the role of inputs in summary evaluation. We demonstrated that a simple comparison of summary and input content using suitable features can capture perceptions of summary quality. These features can nevertheless be extended with other capabilities.
In fact, our test data comes from a query-focused summarization task in which a topic statement is also available for relevance assessment. We can expect better results by incorporating the query statement during evaluation. For example, we can select the portions of the input that are relevant to the query and use only these for comparison with the summary content.
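One simple way to realize this idea, sketched here under the assumption that query relevance can be approximated by lexical similarity (the threshold value is an illustrative placeholder): keep only the input sentences with non-trivial cosine overlap with the query, and use the filtered text as the input side of the input-summary comparison.

```python
from collections import Counter
from math import sqrt

def cosine_overlap(a, b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def query_filtered_input(input_sentences, query, threshold=0.1):
    """Keep only input sentences lexically similar to the query; the filtered
    text then replaces the full input in the input-summary comparison."""
    return [s for s in input_sentences if cosine_overlap(s, query) >= threshold]
```

The threshold trades off precision against coverage: set too high, relevant context is discarded; set too low, the filter has no effect and the comparison reduces to the original whole-input setting.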
 