
in information retrieval (Mani et al., 2002; Jing et al., 1998; Brandow et al., 1995) and other tasks (McKeown et al., 2005; Sakai and Sparck-Jones, 2001; Mani et al., 2002; Tombros and Sanderson, 1998; Roussinov and Chen, 2001). However, organizing and carrying out such evaluations is difficult, and in practice intrinsic evaluations are the standard in summarization. In intrinsic evaluations, system summaries are either compared with reference summaries produced by humans (model summaries), or directly assessed by human judges on a scale (most commonly 1 to 5), without reference to a model summary. The refinement and usability analysis of these evaluation techniques have been the focus of large-scale evaluation efforts such as the Document Understanding Conferences (DUC) (Baldwin et al., 2000; Harman and Over, 2004; Over et al., 2007) and the TIPSTER SUMMAC reports (Mani et al., 2002).

1 This work was presented at the poster session in TAC on Nov. 18, 2008. We would like to thank Hoa Trang Dang and the TAC advisory board for giving us the opportunity to work on this project.
Still, in recent years by far the most popular evaluation method used during system development and for reporting results in publications has been the automatic evaluation tool ROUGE (Lin, 2004; Lin and Hovy, 2003). ROUGE compares system summaries against one or more model summaries automatically, by computing n-gram word overlaps between the two. The wide adoption of such automatic measures is understandable, as they are convenient and have greatly reduced the complexity of evaluations. They have also been shown to correlate well with manual evaluations of content based on comparison with a single model summary, as used in the early editions of DUC.
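To make the mechanics concrete, the following is a minimal sketch of the kind of n-gram overlap ROUGE computes: a ROUGE-N-style recall of a system summary against a single model summary. The whitespace tokenization and function names are our own simplifications, not the official ROUGE toolkit.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of word n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_summary, model_summary, n=1):
    """ROUGE-N-style recall: fraction of the model summary's n-grams
    that also appear in the system summary (with clipped counts)."""
    sys_counts = ngrams(system_summary.lower().split(), n)
    model_counts = ngrams(model_summary.lower().split(), n)
    if not model_counts:
        return 0.0
    overlap = sum(min(count, sys_counts[gram]) for gram, count in model_counts.items())
    return overlap / sum(model_counts.values())

# Toy example: higher values indicate more content overlap with the model summary.
print(rouge_n_recall("the cat sat on the mat", "a cat sat on a mat", n=1))
```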
However, the creation of gold-standard summaries for comparison is still time-consuming and expensive. In our work, we explore the feasibility of developing a fully automatic evaluation method that does not make use of human model summaries at all. Proposals for developing such fully automatic methods have been put forward in the past, but no substantial progress has been made so far in this research direction.
For example, in (Radev et al., 2003) a large-scale, fully automatic evaluation of eight summarization systems on 18,000 documents was performed without any human effort, using an information retrieval scenario. A search engine was used to rank documents according to their relevance to a posed query. The summaries for each document were also ranked for relevance with respect to the same query. For good summarization systems, the relevance ranking of the summaries is expected to be similar to that of the full documents. Based on this intuition, the correlation between the query relevance ranking of a system's summaries and the ranking of the original documents was used to compare the different systems. Effectively, this approach is motivated by the assumption that the distribution of terms in a good summary is similar to the distribution of terms in the original input text.
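As a rough illustration of this idea, the sketch below scores a system by the rank correlation between the query-relevance ranking of its summaries and that of the original documents; Kendall's tau is one reasonable choice of correlation, the search-engine scoring itself is assumed to happen elsewhere, and the input lists are hypothetical.

```python
from scipy.stats import kendalltau

def system_score(doc_relevance_ranks, summary_relevance_ranks):
    """Correlation between the relevance ranking of the full documents and the
    relevance ranking of their summaries for the same query. A higher correlation
    suggests the summaries preserve the documents' query-relevant content.
    Both arguments give, for each document i, its rank (or relevance score)
    as assigned by the search engine."""
    tau, _pvalue = kendalltau(doc_relevance_ranks, summary_relevance_ranks)
    return tau

# Hypothetical example: ranks of five documents vs. ranks of their summaries.
print(system_score([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```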
Even earlier, Donaway et al. (2000) suggested that there are considerable benefits to be had in adopting model-free methods of evaluation involving direct comparisons between input and summary. Their work was motivated by the well-documented fact that there are multiple good summaries of the same text and that there is considerable variation in content selection choices in human summarization (Rath et al., 1961; Radev and Tam, 2003; van Halteren and Teufel, 2003; Nenkova and Passonneau, 2004). As a result, the identity of the model writer significantly affects summary evaluations (McKeown et al., 2001), and evaluations of the same systems can be rather different when two different models are used. In their experiments, Donaway et al. demonstrated that manual evaluation based on comparison with one model summary correlates equally well with (a) manual evaluation based on comparison with a different model summary and (b) evaluation by direct comparison of input and summary. They used cosine similarity with singular value decomposition to perform the input-summary comparison. Their conclusion was that automatic methods for comparison between input and a summary should be seriously considered as an alternative for evaluation.
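A minimal sketch of such an input-summary comparison, assuming scikit-learn: the input and the summary are projected into a low-rank space obtained with truncated SVD over tf-idf vectors and then compared by cosine similarity. The background collection, vectorizer, and dimensionality below are illustrative assumptions rather than the exact setup of Donaway et al.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def svd_cosine(input_text, summary_text, background_docs, dims=50):
    """Cosine similarity between input and summary in a latent space obtained by
    truncated SVD over a tf-idf matrix. A higher value is taken as evidence that
    the summary covers the content of the input."""
    texts = list(background_docs) + [input_text, summary_text]
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Keep the number of components valid for the matrix size.
    n_components = min(dims, tfidf.shape[0] - 1, tfidf.shape[1] - 1)
    reduced = TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    input_vec = reduced[-2].reshape(1, -1)    # second-to-last row: the input
    summary_vec = reduced[-1].reshape(1, -1)  # last row: the summary
    return cosine_similarity(input_vec, summary_vec)[0, 0]
```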
In this paper, we present a comprehensive study of fully automatic summary evaluation, without human models. A summary's content is judged for quality by directly estimating its closeness to the input. We compare several probabilistic (information-theoretic) approaches
 