From a quantitative standpoint, the amount of text within
query logs is at a clear disadvantage against the much larger
textual content available within document collections such
as the Web. When compared to a query, which contains
only two words on average, a textual document is orders of
magnitude larger. The size difference is exacerbated in very
large document collections, with the leading Web search en-
gines currently providing access to billions of documents. At
least in theory, this has implications for the potential quality
of the extracted information, since more data usually means
better results. Indeed, given enough documents, inexpensive
algorithms produce results sometimes rivaling those output
by more complex methods [5].
A large percentage of Web queries suffer from ambiguity
as a result of underspecified information needs, limited or
no grammatical structure due to the use of keywords rather
than natural language, and frequent typos and misspellings
due to the hurried pace at which Web users typically enter
queries. Even though ambiguity is a notorious issue in sen-
tence processing, when compared to search queries the con-
tent found within a document is relatively clear. As opposed
to the casual typing of search queries, authors of documents
tend to pay more attention to both form and content, to
pass their message across to readers through coherent sen-
tences in natural language. While this is particularly true
for genres such as news or scientific articles, it also applies
to other less formal texts such as Web documents.
Since major Web search engines do not currently support
truly interactive search sessions, each query is a
self-contained request for information. Although the size
restrictions of the medium make queries more ambiguous than
documents, most search queries are in fact typed
with this issue in mind. Many Web users have learned to
use the most meaningful and least ambiguous terms avail-
able when searching for information.
Perhaps the most intriguing aspect of queries is, however,
their ability to indirectly capture human knowledge, pre-
cisely as they inquire about what is already known. In-
deed, users formulate their queries based on the common-
sense knowledge that they already possess at the time of
the search. Search queries play two roles simultaneously: in
addition to requesting new information, they also indirectly
convey knowledge in the process. If knowledge is gener-
ally prominent or relevant, people will eventually ask about
it, especially as the number of users and the quantity and
breadth of the available knowledge increase, as is the case
with the Web as a whole. Query logs convey knowledge
through requests that may be answered by the knowledge
asserted in expository text of document collections.
2.2 Weakly Supervised Extraction Framework
Figure 1 describes the proposed algorithm for informa-
tion extraction from anonymized search queries. The tar-
get class C (e.g., VideoGame), for which a certain type of
phrases (e.g., class attributes) need to be extracted from
query logs, is available in the form of a set of representative
instances I (e.g., Grand Theft Auto, Street Fighter II and
Age of Empires). As opposed to highly supervised meth-
ods that rely on handcrafted patterns to extract information
from text [13], the knowledge guiding our extraction frame-
work is limited to a small set of seed phrases K (e.g., price,
creator and genre) that are known to be part of the desired
output (e.g., attributes) for the class C (e.g., VideoGame).
Since a class (concept) is traditionally a placeholder for a
set of instances (objects) that share similar properties [13],
the desired phrases (of interest) P for a given class C can be
derived by extracting and merging candidate phrases for in-
dividual instances I of the class. Step 1 in Figure 1 exploits
the instances I to collect a large pool of noisy (high recall,
low precision) candidate phrases P that are associated with
various instances, and therefore are associated with the class.
Any method may be used to collect the pool of candidate
phrases P, as long as most of the seed phrases K are likely to
be in that pool. As a brute-force example, the set of n-grams
that occur together with one of the instances in any of the
search queries is a catch-all (although extremely noisy) pool
of candidate phrases.
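The brute-force variant of Step 1 described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name candidate_phrases, the max_n parameter, the lowercasing, and the whitespace tokenization are all hypothetical choices made for the example, and the sample queries are invented.

```python
from collections import Counter


def candidate_phrases(queries, instances, max_n=3):
    """Collect n-grams that co-occur with a class instance in a query.

    Hypothetical helper: for every query containing an instance, count each
    n-gram (1..max_n words) found in the text on either side of the instance.
    The result is a noisy, high-recall pool of candidate phrases P.
    """
    pool = Counter()
    # Assumption: lowercase whitespace tokenization is good enough here.
    instance_tokens = [tuple(inst.lower().split()) for inst in instances]
    for query in queries:
        tokens = query.lower().split()
        for inst in instance_tokens:
            k = len(inst)
            for start in range(len(tokens) - k + 1):
                if tuple(tokens[start:start + k]) != inst:
                    continue
                # Text before and after the matched instance.
                for segment in (tokens[:start], tokens[start + k:]):
                    for n in range(1, max_n + 1):
                        for i in range(len(segment) - n + 1):
                            pool[" ".join(segment[i:i + n])] += 1
    return pool


# Instances I of the class VideoGame, taken from the text above.
instances = ["Grand Theft Auto", "Street Fighter II", "Age of Empires"]
# Invented sample queries, for illustration only.
queries = [
    "grand theft auto price",
    "creator of street fighter ii",
    "age of empires genre and price",
    "weather in paris",
]
pool = candidate_phrases(queries, instances)
seeds = {"price", "creator", "genre"}  # seed phrases K from the text
print(seeds <= set(pool))  # checks that every seed is in the pool
```

The pool also contains noise such as "of" and "and price", which is consistent with the high-recall, low-precision character of the candidate set; the only requirement stated in the text is that most seed phrases K end up in it.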
Steps 2 through 9 in Figure 1 find queries that contain
WWW 2007 / Track: Data Mining Session: Mining Textual Data
Input: target class C, available as a set of instances {I}