for characterizing the similarity and differences between
input and summary content. The utility of the various approaches
for comparison varies widely, but a number of them lead to
rankings of systems that correlate well with the manual
evaluations performed at the recent Text Analysis Conference
run by NIST. A simple information-theoretic measure, the
Jensen-Shannon divergence between input and summary, emerges
as the best feature: system rankings produced using this
measure correlate with human rankings at levels as high as 0.9.
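To make the measure concrete, below is a minimal sketch of computing
Jensen-Shannon divergence between the unigram distributions of an input
and a summary. The whitespace tokenization and the absence of smoothing
are simplifying assumptions for illustration, not the exact setup used
in the experiments.

    import math
    from collections import Counter

    def unigram_dist(tokens, vocab):
        """Maximum-likelihood unigram distribution over a shared vocabulary."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return [counts[w] / total for w in vocab]

    def js_divergence(input_text, summary_text):
        """Jensen-Shannon divergence (in bits) between two texts' word
        distributions; 0 means identical, 1 means maximally different."""
        a = input_text.lower().split()
        b = summary_text.lower().split()
        vocab = sorted(set(a) | set(b))
        p = unigram_dist(a, vocab)
        q = unigram_dist(b, vocab)
        m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

        def kl(x, y):
            # Kullback-Leibler divergence; zero-probability terms contribute 0.
            return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

A lower divergence indicates that the summary's word distribution stays
close to the input's, which is what makes the measure a plausible proxy
for content selection quality.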
2 Data: TAC 2008
2.1 Topic focused and Update Summarization
Two types of summaries, query focused summaries and update
summaries, were evaluated in the main task of the summarization
track of the 2008 Text Analysis Conference (TAC) run by NIST
(http://www.nist.gov/tac). Query focused summaries are produced
from the input documents in response to a stated user information
need (query). Update summaries require more sophistication: two
sets of articles on the same topic are provided. The first set of
articles represents the background of a story, and users are
assumed to be already familiar with the information contained in
them. The task is to produce a multi-document summary from the
second set of articles that can serve as an update to the user.
This task is reminiscent of the novelty detection task explored
at TREC (Soboroff and Harman, 2005).
2.2 Test set
The test set for the TAC 2008 update task contains 48
inputs. Each input consists of two sets of documents, A
and B, of 10 documents each. A and B are on the same
general topic, and B contains documents published later
than those in A. In addition, for each input, the user’s
need is represented by a topic statement which consists
of a title and narrative. An example topic statement is
given below.
Title: Airbus A380
Narrative: Describe developments in the production and
launch of the Airbus A380.
The task for participating systems is to produce two
summaries, a query focused summary of document set A
and an update summary of document set B. The first summary
is expected to summarize the documents in A using
the topic statement to focus content selection. The second
summary is expected to be a compilation of updates
from document set B, assuming that the user has read all
the documents in A. The maximum word limit for both
types of summaries is 100 words.
In order to allow for in-depth discussion, we will analyze
our findings only for query focused summaries. Similar
results were obtained for the evaluation of update
summaries and are reported in separate tables in Section
6.
2.3 Summarizers
There were 57 participating systems in TAC 2008. The baseline
summary in both cases, the query focused and the update task, was
created by choosing the first sentences of the most recent
document in document sets A and B respectively. In addition, four
human summaries were produced for each summary type to serve as
model summaries for evaluation. Only the 57 system summaries were
used in our evaluation experiments.

manual score              R-1 recall   R-2 recall
Query focused summaries
  pyramid score              0.859        0.905
  responsiveness             0.806        0.873
Update summaries
  pyramid score              0.912        0.941
  responsiveness             0.865        0.884

Table 1: Spearman correlation between system scores assigned by
manual methods and those assigned by ROUGE-1 and ROUGE-2 recall
(TAC 2008, 57 systems). All correlations are highly significant
with p-value < 0.00001.
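For illustration, correlations like those in Table 1 can be computed
with a few lines of scipy once per-system scores are available; the
score lists below are invented stand-ins, not TAC data.

    from scipy.stats import spearmanr

    # Hypothetical per-system scores (one entry per summarizer); the real
    # computation pairs each of the 57 systems' pyramid (or responsiveness)
    # scores with its ROUGE-1 or ROUGE-2 recall.
    pyramid_scores = [0.61, 0.45, 0.52, 0.38, 0.70]
    rouge2_recall = [0.10, 0.08, 0.11, 0.06, 0.12]

    rho, p_value = spearmanr(pyramid_scores, rouge2_recall)
    print(f"Spearman rho = {rho:.3f}, p = {p_value:.5f}")

Spearman correlation compares only the rankings the two score sets
induce, so it is insensitive to differences in scale between manual
scores and ROUGE recall values.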
2.4 Evaluations
Both manual and automatic evaluations were conducted at NIST to
assess the quality of summaries produced by the systems.
Summarizer performance is defined by two key aspects of summary
quality: content selection (identification of important content
in the input) and linguistic quality (structure and presentation
of selected content).
Three methods of manual evaluation were used to assign
scores to summaries.
Pyramid eval. The pyramid evaluation method
(Nenkova and Passonneau, 2004) has been developed
for reliable and diagnostic assessment of content selection
quality in summarization and has been used in several
large scale evaluations (Nenkova et al., 2007). It
uses multiple human models from which annotators identify
semantically defined Summary Content Units (SCUs).
Each SCU is assigned a weight equal to the number of
human model summaries that express that SCU. An ideal
maximally informative summary would express a subset
of the most highly weighted SCUs, with multiple maximally
informative summaries being possible.
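The scoring itself reduces to a ratio of SCU weights. Below is a small
sketch under the simplifying assumption that the score is the summary's
total SCU weight divided by the weight of an ideal summary expressing
the same number of SCUs, as described above; the weights shown are
hypothetical.

    def pyramid_score(summary_scus, pyramid_weights):
        """Simplified pyramid score: the summary's total SCU weight divided
        by the weight of an ideal summary with the same number of SCUs.

        summary_scus:    weights of the SCUs the evaluated summary expresses
        pyramid_weights: weights of every SCU in the pyramid
        """
        observed = sum(summary_scus)
        n = len(summary_scus)
        # An ideal summary with n SCUs expresses the n most highly weighted.
        ideal = sum(sorted(pyramid_weights, reverse=True)[:n])
        return observed / ideal if ideal else 0.0

    # Hypothetical pyramid built from 4 model summaries (weights 1..4).
    pyramid = [4, 4, 3, 2, 2, 1, 1, 1]
    summary = [4, 3, 1]                     # SCUs expressed by the summary
    print(pyramid_score(summary, pyramid))  # 8 / 11 = 0.727...

With four model summaries, SCU weights range from 1 to 4, and the ideal
summary of a given size simply takes the most highly weighted units.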
 