We find that content selection performance on
standard test collections can be approximated
well by the proposed fully automatic method.
This result strongly underscores the need to require
linguistic quality evaluations alongside
content selection ones in future evaluations
and research.
2 Model-free methods for evaluation
Proposals for developing fully automatic methods
for summary evaluation have been put forward
in the past. Their attractiveness is obvious for
large-scale evaluations, or for evaluation on non-standard
test sets for which human models are not
available.
For example, in Radev et al. (2003), a large-scale,
fully automatic evaluation of eight summarization
systems on 18,000 documents was performed
without any human effort. A search engine
was used to rank documents according to their relevance
to a given query. The summaries for each
document were also ranked for relevance with respect
to the same query. For good summarization
systems, the relevance ranking of summaries
is expected to be similar to that of the full documents.
Based on this intuition, the correlation between
relevance rankings of summaries and original
documents was used to compare the different
systems. The approach was motivated by the assumption
that the distribution of terms in a good
summary is similar to the distribution of terms in
the original document.
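The ranking comparison at the heart of this approach can be sketched as follows. The sketch is only illustrative: Spearman's rank correlation is used here as an assumed stand-in for the specific relevance-correlation measure of Radev et al. (2003), and the document identifiers are hypothetical.

```python
def spearman_correlation(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same items
    (lists of item ids ordered from most to least relevant, no ties)."""
    n = len(ranking_a)
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    d_squared = sum((pos_a[item] - pos_b[item]) ** 2 for item in ranking_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# A search engine ranks the full documents and, separately, their summaries
# by relevance to the same query; a good summarizer should produce a summary
# ranking close to the document ranking.
document_ranking = ["d3", "d1", "d2", "d4"]
summary_ranking = ["d3", "d2", "d1", "d4"]
print(spearman_correlation(document_ranking, summary_ranking))  # 0.8
```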
Even earlier, Donaway et al. (2000) suggested
that there are considerable benefits to be had in
adopting model-free methods of evaluation involving
direct comparisons between the original document
and its summary. The motivation for their
work was the considerable variation in content selection
choices in model summaries (Rath et al.,
1961). The identity of the model writer significantly
affects summary evaluations (also noted by
McKeown et al. (2001) and Jing et al. (1998)), and
evaluations of the same systems can differ considerably
when different models are used. In their
experiments, Donaway et al. (2000) demonstrated
that the correlations between manual evaluation
using a model summary and
a) manual evaluation using a different model
summary
b) automatic evaluation by directly comparing
input and summary (they used cosine similarity for this comparison),
are the same. Their conclusion was that such automatic
methods should be seriously considered as
an alternative to model-based evaluation.
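Their input-summary comparison was based on cosine similarity; a minimal sketch of such a comparison is given below, assuming simple whitespace tokenization and raw term-frequency vectors (details of their exact preprocessing are not reproduced here).

```python
import math
from collections import Counter

def cosine_similarity(input_text, summary_text):
    """Cosine similarity between term-frequency vectors of an input
    document and its summary (higher means more similar content)."""
    p = Counter(input_text.lower().split())
    q = Counter(summary_text.lower().split())
    dot = sum(p[w] * q[w] for w in set(p) & set(q))
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0
```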
In this paper, we present a comprehensive study
of fully automatic summary evaluation without
any human models. A summary’s content is
judged for quality by directly estimating its closeness
to the input. We compare several probabilistic
and information-theoretic approaches for characterizing
the similarity and differences between input
and summary content. A simple information-theoretic
measure, Jensen-Shannon divergence between
input and summary, emerges as the best feature.
System rankings produced using this measure
lead to correlations as high as 0.88 with human
judgements.
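A minimal sketch of this measure is shown below, assuming unsmoothed unigram distributions over the union vocabulary of input and summary; the paper's exact tokenization and smoothing choices are not reproduced here.

```python
import math
from collections import Counter

def unigram_distribution(tokens, vocab):
    """Maximum-likelihood unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: counts[w] / total for w in vocab}

def jensen_shannon_divergence(input_tokens, summary_tokens):
    """JS divergence between the word distributions of an input and its
    summary; lower values indicate content closer to the input."""
    vocab = set(input_tokens) | set(summary_tokens)
    p = unigram_distribution(input_tokens, vocab)
    q = unigram_distribution(summary_tokens, vocab)
    js = 0.0
    for w in vocab:
        m = 0.5 * (p[w] + q[w])
        if p[w] > 0:
            js += 0.5 * p[w] * math.log2(p[w] / m)
        if q[w] > 0:
            js += 0.5 * q[w] * math.log2(q[w] / m)
    return js
```

Systems can then be ranked, for instance, by the average divergence of their summaries from the corresponding inputs, and that ranking compared against rankings derived from human judgements.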
3 TAC summarization track
3.1 Query-focused and Update Summaries
Two types of summaries, query-focused and update
summaries, were evaluated in the summarization
track of the 2008 Text Analysis Conference
(TAC)2. Query-focused summaries were produced
from input documents in response to a stated user
information need. The update summaries require
more sophistication: two sets of articles on the
same topic are provided. The first set of articles
represents the background of a story and users are
assumed to be already familiar with the information
contained in them. The update task is to produce
a multi-document summary from the second
set of articles that can serve as an update to the
user. This task is reminiscent of the novelty detection
task explored at TREC (Soboroff and Harman,
2005).
3.2 Data
The test set for the TAC 2008 summarization task
contains 48 inputs. Each input consists of two sets
of 10 documents each, called docsets A and B.
Both A and B are on the same general topic but
B contains documents published later than those
in A. In addition, the user’s information need associated
with each input is given by a query statement
consisting of a title and narrative. An example
query statement is shown below.
Title: Airbus A380
Narrative: Describe developments in the production
and launch of the Airbus A380.
A system must produce two summaries: (1) a
query-focused summary of docset A, and (2) an
update summary of docset B.
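For concreteness, one way to represent a test input and the two required outputs is sketched below; the field names are hypothetical and not part of the TAC specification.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TacInput:
    """One TAC 2008 test input (field names are hypothetical)."""
    topic_id: str
    title: str            # e.g. "Airbus A380"
    narrative: str        # the user's stated information need
    docset_a: List[str]   # 10 earlier documents (background)
    docset_b: List[str]   # 10 later documents on the same topic

# For each input, a system produces two summaries: a query-focused
# summary of docset_a and an update summary of docset_b.
```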