Figure 3: Percentage of input queries of various lengths (in number of
words), computed over all queries (including duplicates) and over
unique queries
contain 2, 3, 4 and 5 words respectively. This corresponds
to 82.5% of unique queries having 5 words or fewer; 1.6% contain
more than 10 words. The query length peaks at 3 words for
unique queries, as compared to 2 words when considering all
queries.
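The two distributions compared in Figure 3 can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses a hypothetical toy query log; the function name `length_distribution` and the sample queries are illustrative only and are not taken from the experiments.

```python
from collections import Counter

def length_distribution(queries):
    """Return {length_in_words: percentage_of_queries} for the given queries."""
    # Assumption: a query's length is its number of whitespace-separated words.
    counts = Counter(len(q.split()) for q in queries)
    total = sum(counts.values())
    return {n: 100.0 * c / total for n, c in sorted(counts.items())}

# Hypothetical toy log; the actual input is a collection of about 50 million queries.
queries = [
    "paris",
    "paris hotels",
    "paris hotels",
    "paris hotels",
    "population of paris",
    "capital of france",
    "side effects of xanax",
]

all_dist = length_distribution(queries)          # duplicates counted
unique_dist = length_distribution(set(queries))  # each distinct query counted once

peak_all = max(all_dist, key=all_dist.get)
peak_unique = max(unique_dist, key=unique_dist.get)
```

In this toy log the repeated 2-word query moves the peak for all queries to 2 words, while the peak for unique queries is at 3 words, which mirrors the qualitative difference reported above.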
Despite the differences in the distributions of unique vs. all
queries, Figure 3 confirms that most search queries, which
constitute the input data to the experiments, are relatively
short. Therefore, the amount of input data that is actually
usable by the extraction method is only a fraction of the
available 50 million queries, since an attribute cannot be
extracted for a given class unless it occurs together with a
class instance in an input query, which is a condition that is
less likely to be satisfied in the case of short queries.
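The co-occurrence condition can be sketched as a simple filter. This is a simplification for illustration and not the extraction method itself: it assumes whitespace tokenization and case-insensitive matching, and it only checks that a query mentions a class instance and has at least one word left over that could belong to an attribute. The function name `usable_queries` is hypothetical.

```python
def usable_queries(queries, instances):
    """Keep queries that mention a class instance and contain other words too."""
    instance_tokens = [tuple(inst.lower().split()) for inst in instances]
    kept = []
    for query in queries:
        words = tuple(query.lower().split())
        for inst in instance_tokens:
            n = len(inst)
            # Look for the instance as a contiguous run of words in the query.
            found = any(words[i:i + n] == inst for i in range(len(words) - n + 1))
            # Require at least one extra word that could be part of an attribute.
            if found and len(words) > n:
                kept.append(query)
                break
    return kept

# Hypothetical queries; the instances are taken from the AircraftModel class.
sample = [
    "boeing 747",
    "wingspan of boeing 747",
    "cheap flights",
    "airbus a380 seating capacity",
]
kept = usable_queries(sample, ["Boeing 747", "Airbus A380"])
```

The query consisting only of the instance is dropped, which illustrates why short queries contribute less usable input.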
Target Classes: The target classes selected for experiments
are each specified as an (incomplete) set of representative
instances, details on which are given in Table 1.
The number of instances varies from 25 (for SearchEngine)
to 1500 (for Actor), with a median of 172 instances per class.
The classes also differ with respect to the domain of interest
(e.g., Health for Drug vs. Entertainment for Movie),
instance capitalization (e.g., instances in BasicFood usually
occur in text in lower rather than upper case), and concep-
tual type (e.g., abstraction for Religion vs. group for Soc-
cerClub vs. activity for VideoGame). Therefore, we choose
what we feel to be a large enough number of classes (40) to
properly ensure varied experimentation on several dimen-
sions, while taking into account the time intensive nature of
manual accuracy judgments often required in the evaluation
of information extraction systems [2].
Seed Attributes: Besides the set of its representative in-
stances, each target class is accompanied by 5 seed attributes.
For an objective evaluation, the seed attributes are chosen
independently of whether it is possible to extract them from
the collection of search queries with the proposed method,
and without checking whether they actually occur within the
search queries. Examples of complete seed attribute sets are
{quality, speed, number of users, market share, reliability}
for SearchEngine; {symbol, atomic number, discovery date,
mass, classification} for ChemicalElem; and {dean, number
of students, research areas, alumni, mascot} for University.
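One way to picture the input for a single target class is a small record holding its instances and its seed attributes. The sketch below is only an assumed representation; the `TargetClass` type and its field names are hypothetical, while the ChemicalElem values are those listed in Table 1 and in the seed sets above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TargetClass:
    """Hypothetical container for one target class and its seed attributes."""
    name: str
    instances: frozenset       # (incomplete) set of representative instances
    seed_attributes: tuple     # the seed attributes given for the class

    def __post_init__(self):
        # Each target class is accompanied by 5 seed attributes.
        if len(self.seed_attributes) != 5:
            raise ValueError("expected exactly 5 seed attributes")

chemical_elem = TargetClass(
    name="ChemicalElem",
    # Only the example instances from Table 1; the full class has 118 instances.
    instances=frozenset({"lead", "silver", "iron", "carbon", "mercury",
                         "copper", "oxygen", "aluminum", "sodium", "calcium"}),
    seed_attributes=("symbol", "atomic number", "discovery date",
                     "mass", "classification"),
)
```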
Similarity Functions: As described earlier, the relevance
of a candidate attribute for a class is computed in Step
18 of Figure 1 as a similarity score between the search-
signature vector associated to the candidate phrase (i.e.,
attribute), on one hand, and the reference search-signature
vector for the class, on the other hand. Rather than arbitrar-
ily choosing the similarity function driving the computation,
we prefer to compare several similarity functions used the
WWW 2007 / Track: Data Mining Session: Mining Textual Data
Class (Size)           Examples of Instances
Actor (1500)           Mel Gibson, Sharon Stone, Tom Cruise, Sophia Loren, Will Smith, Johnny Depp, Kate Hudson
AircraftModel (217)    Boeing 747, Boeing 737, Airbus A380, Embraer 170, ATR 42, Boeing 777, Douglas DC 9, Dornier 228
Award (200)            Nobel Prize, Pulitzer Prize, Webby Award, National Book Award, Prix Ars Electronica, Fields Medal
BasicFood (155)        fish, turkey, rice, milk, chicken, cheese, eggs, corn, beans, ice cream
CarModel (368)         Honda Accord, Audi A4, Subaru Impreza, Mini Cooper, Ford Mustang, Porsche 911, Chrysler Crossfire
CartoonChar (50)       Mickey Mouse, Road Runner, Winnie the Pooh, Scooby-Doo, Homer Simpson, Bugs Bunny, Popeye
CellPhoneModel (204)   Motorola Q, Nokia 6600, LG Chocolate, Motorola RAZR V3, Siemens SX1, Sony Ericsson P900
ChemicalElem (118)     lead, silver, iron, carbon, mercury, copper, oxygen, aluminum, sodium, calcium
City (589)             San Francisco, London, Boston, Ottawa, Dubai, Chicago, Amsterdam, Buenos Aires, Paris, Atlanta
Company (738)          Adobe Systems, Macromedia, Apple Computer, Gateway, Target, Netscape, Intel, New York Times
Country (197)          Canada, France, China, Germany, Australia, Lichtenstein, Spain, South Korea, Austria, Taiwan
Currency (55)          Euro, Won, Lire, Pounds, Rand, US Dollars, Yen, Pesos, Kroner, Kuna
DigitalCamera (534)    Nikon D70, Canon EOS 20D, Fujifilm FinePix S3 Pro, Sony Cybershot, Nikon D200, Pentax Optio 430
Disease (209)          asthma, arthritis, hypertension, influenza, acne, malaria, leukemia, plague, tuberculosis, autism
Drug (345)             Viagra, Phentermine, Vicodin, Lithium, Hydrocodone, Xanax, Vioxx, Tramadol, Allegra, Levitra