Download Data mining - NYU Computer Science

Data mining
Modern organizations are overwhelmed with
data, producing mountains of textual
documents, spreadsheets, and web pages.
Every activity of the organization yields more
documents with each passing day.
But a surprisingly large volume of data remains
relatively inaccessible. Any organization is
likely to have data that remain tantalizingly
out of reach, either through a loss of
institutional memory (“Fred ran that project,
but he took the early retirement package”) or
through needle-in-a-haystack search complexity.
The idea that information resources should be
counted as an asset, and managed to yield
maximum advantage, emerges in cycles every few
years. The concept has most recently manifested
itself as part of the “knowledge management”
approach. However the idea is packaged, there is a
growing acknowledgement that information
resources are important, and that they should be
consciously cultivated.
The topic of information extraction involves more
than databases and indexing documents. Often,
the greatest institutional memory of any
organization is that of its own people. However,
people are not so amenable to the usual automated
computer-based search and retrieval techniques.
The challenge is how to effectively submit an
inquiry to the target group of people and get
constructive responses in a timely manner.
The “people” solution involves a multi-pronged
approach. Provide network connectivity to
the field offices. Enable access to the
internal web system, with policies, protocols,
and old documents available electronically.
Provide access to existing database systems.
Index existing document archives to enable
search-engine-like query and response.
Then, through a combination of official and
self-interested participation, individual
Knowledge Networks benefit from the
participation of the most knowledgeable
experts within the system, and thus much of
the undocumented history of an organization
and its prior activities is made accessible to
the participants.
The recognition of the importance of people in
building computer systems is also an
inherent characteristic in the construction of
“expert systems”, wherein the knowledge and
experience of an expert in a particular field is
somehow captured and stored as a set of
rules, such that queries submitted yield
answers comparable to those of an expert.
Indexing is a form of information extraction;
so, too, is the process of crawling to support
more comprehensive indexing. It is, in a
way, a form of “mining” for content that
might otherwise remain relatively unknown,
but the term “data mining” usually implies
something more subtle.
The concept of “data mining” is based upon the
premise that information about information is
itself useful information. Given vast amounts of
data, it is possible to detect patterns that might
provide insights that could prove useful in making
strategic decisions on issues ranging from the
content and structure of a web site to the selection
of promotional items and the generation of
marketing strategies.
“A class of database applications that look for
hidden patterns in a group of data that can
be used to predict future behavior. For
example, data mining software can help
retail companies find customers with
common interests. The term is commonly
misused to describe software that presents
data in new ways.
True data mining software doesn't just change the
presentation, but actually discovers previously
unknown relationships among the data. Data
mining is popular in the science and mathematical
fields but also is utilized increasingly by marketers
trying to distill useful consumer data from Web
sites.” [1]
[1] http://www.webopedia.com/TERM/d/data_mining.html
Note that this definition addresses database
applications, but the term is now used much
more expansively to cover almost any
application that seeks to extract information
or patterns from volumes of raw data.
One of the earliest applications of data mining
was the problem of reliably extracting data
from semi-structured text sources, and
transforming it into the more structured
format of fields in a relational database
system.
Imagine, for example, that a system was designed to
scan through the Wall Street Journal on a daily
basis, and extract from the newspaper every
reference made to the topic of resignations and
appointments of CEOs in US corporations[1].
[1] This idea was, in fact, used as an example in Roman
Yangarber’s PhD dissertation at New York University
“Scenario Customization for Information Extraction”,
2000.
The process involves a form of indexing,
recognition of the key words and context,
and the transformation of the raw text into a
structured form. It is consistent with the
idea of “mining” because it creates a
substantial information resource that did not
otherwise exist, effectively extracting content
from the data source.
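As a rough illustration of turning raw text into structured fields, the following minimal sketch pulls appointment and resignation events out of sentences using two hand-written patterns. The patterns, the SuccessionEvent record, and the sample sentences are assumptions made up for this example; they are not taken from the dissertation cited above, and real newspaper text would need far richer rules.

```python
import re
from typing import NamedTuple


class SuccessionEvent(NamedTuple):
    """One structured record, comparable to a row in a relational table."""
    person: str
    action: str   # "appointed" or "resigned"
    company: str


# Hypothetical, deliberately simple patterns for capitalized names.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"
COMPANY = r"[A-Z][A-Za-z&]*(?: [A-Z][A-Za-z&]*)*"

PATTERNS = [
    (re.compile(rf"(?P<company>{COMPANY}) (?:named|appointed) "
                rf"(?P<person>{NAME}) (?:as )?(?:chief executive|CEO)"),
     "appointed"),
    (re.compile(rf"(?P<person>{NAME}) (?:resigned|stepped down) as "
                rf"(?:chief executive|CEO) of (?P<company>{COMPANY})"),
     "resigned"),
]


def extract_events(text: str) -> list[SuccessionEvent]:
    """Scan the text with each pattern and emit one record per match."""
    events = []
    for pattern, action in PATTERNS:
        for m in pattern.finditer(text):
            events.append(SuccessionEvent(m["person"], action, m["company"]))
    return events


# Invented sample text, not a real news item.
sample = ("Acme Widgets named Jane Smith as chief executive on Monday. "
          "John Doe resigned as CEO of Globex Corp last week.")

for event in extract_events(sample):
    print(event)
```

Each record could then be inserted into a database table with person, action, and company columns.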
Data mining is generally broken into two
main applications: mining for content, and
mining for usage patterns and structure.
Data mining represents another concept that predates
the emergence of the web, but the fact that
seemingly all digital data is available through the
web, and the web is readily accessible, means that
most research and development of mining
techniques is applied to the web setting. The web
provides the scale and range of data necessary to
test and challenge the capacity of any mining
application.
There once was an Internet startup called EZWays.com, with a dream of gathering all of the
transport sector schedule data into one integrated
travel routing engine. They collected routes,
schedules, and pricing data for all public
transportation, such as buses, ferries, trains, and
subways in a specific region of the country. The
user could then evaluate all the alternatives to air
travel for a given journey, and compare them based
upon trip cost and trip duration.
The relevant routing information for public
transport agencies is readily available,
published on the web sites of the respective
transit authorities. It is therefore an
attractive candidate for a targeted data
mining operation. A web crawler was
designed to visit the transit sites, and
retrieve pages likely to contain schedule
data.
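The source does not say how the crawler decided which pages were likely to contain schedule data. One plausible approach, sketched here under that assumption, is to collect the links on a page and keep those whose URL or anchor text contains a schedule-related keyword. The keyword list, the transit.example address, and the sample page are all hypothetical, and no network access is performed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed keyword list; chosen for illustration only.
SCHEDULE_HINTS = ("schedule", "timetable", "route", "fares")


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # URL of the link being read, or None
        self._text = []     # text fragments inside the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_links(base_url, html):
    """Return the links whose URL or anchor text hints at schedule data."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return [url for url, text in collector.links
            if any(hint in url.lower() or hint in text.lower()
                   for hint in SCHEDULE_HINTS)]


# Invented page content for a made-up transit site.
page = ('<a href="/about.html">About us</a> '
        '<a href="/bus/r12.html">Route 12 timetable</a> '
        '<a href="schedules/ferry.pdf">Ferry</a>')

for url in candidate_links("http://transit.example/index.html", page):
    print(url)
```

A real crawler would fetch each candidate, repeat the process on the new pages, and respect the site's crawling policies.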
The pages were then parsed and analyzed to
determine what information they contained.
The presentation of schedule data generally
falls into a relatively small number of
formats. The system was able to recognize
such data and apply extraction techniques
unique to each format.
For example, some of the transport data was
stored in a PDF format. The system
converted tables in a PDF file into HTML
counterparts, and used the same strategy for
information extraction in PDF settings as it
did for web pages.
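Once a timetable is available as an HTML table, whether it came from a web page or from a converted PDF, the rows can be read with the same code. The sketch below covers only the HTML side and uses Python's standard html.parser module; the PDF conversion step is not shown, and the sample table is invented for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table(html: str) -> list[list[str]]:
    """Return the table contents as a list of rows of cell text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented timetable fragment.
page_with_table = """
<table>
  <tr><th>Stop</th><th>Departs</th></tr>
  <tr><td>Main St</td><td>06:15</td></tr>
  <tr><td>Harbor</td><td> 06:40 </td></tr>
</table>
"""

for row in extract_table(page_with_table):
    print(row)
```

This simple version assumes one table per page and no nested tables; a format-specific extractor would add those cases.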
The system was even able to parse schedules given
partly in text phrases (such as “and then every
twenty minutes until 4:00 am”), and in that limited
context determine the likely information content.
The system included consistency checks to
determine that the schedules as presented were in
fact plausible; it was soon apparent that some
published schedules were internally inconsistent!
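To make the idea concrete, here is a minimal sketch of how a phrase such as the one quoted above might be expanded into explicit departure times, followed by one very simple plausibility check. The function names, the small number-word table, and the rule that an end time earlier than the start rolls over to the next day are assumptions for this example, not a description of how the EZ-Ways system worked.

```python
import re
from datetime import datetime, timedelta

# Assumed vocabulary; only a few spelled-out intervals are recognized.
NUMBER_WORDS = {"five": 5, "ten": 10, "fifteen": 15,
                "twenty": 20, "thirty": 30, "sixty": 60}

PHRASE = re.compile(
    r"every (?P<n>\d+|[a-z]+) minutes until "
    r"(?P<h>\d{1,2}):(?P<m>\d{2}) ?(?P<ampm>am|pm)",
    re.IGNORECASE,
)


def expand_phrase(last_departure: datetime, phrase: str) -> list[datetime]:
    """List the departures implied by 'every N minutes until H:MM am/pm'."""
    m = PHRASE.search(phrase)
    if m is None:
        raise ValueError(f"unrecognized schedule phrase: {phrase!r}")
    n = m["n"].lower()
    interval = int(n) if n.isdigit() else NUMBER_WORDS.get(n)
    if not interval:
        raise ValueError(f"unrecognized interval: {n!r}")
    # Convert the 12-hour clock time to a 24-hour value.
    hour = int(m["h"]) % 12 + (12 if m["ampm"].lower() == "pm" else 0)
    end = last_departure.replace(hour=hour, minute=int(m["m"]),
                                 second=0, microsecond=0)
    if end <= last_departure:
        # "until 4:00 am" after a late-night departure means the next day.
        end += timedelta(days=1)
    times = []
    t = last_departure + timedelta(minutes=interval)
    while t <= end:
        times.append(t)
        t += timedelta(minutes=interval)
    return times


def is_plausible(departures: list[datetime]) -> bool:
    """Consistency check: departure times must be strictly increasing."""
    return all(a < b for a, b in zip(departures, departures[1:]))


start = datetime(2000, 1, 1, 23, 0)   # last explicitly listed departure
times = expand_phrase(start, "and then every twenty minutes until 4:00 am")
print(len(times), times[0], times[-1])
print(is_plausible([start] + times))
```

A fuller checker might also compare travel times between stops or flag arrivals that precede departures.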
The EZ-Ways system is an example of “mining
for content”. It established a narrow range
of target sites, and extracted fairly well
defined data sets that met a relatively small
range of patterns. While some structural
evaluation took place in the process, the
emphasis was on content extraction.
We have seen several examples of web
mining for content, but as previously noted,
there is much to be learned from the
structure of the pages (as in the page rank
example) and the data associated with the
way that users browse a web site.
Information gathered from web server logs,
client-side agents, and proxy server caching
data can be used to detect usage patterns
that might otherwise go unnoticed. The
concept of web usage mining is becoming
more popular as a means to determine just
how visitors use the web, and which parts of
a web site seem to be most effective.
The usage mining results are then applied to
reformulate web sites so as to increase
effectiveness, and formulate strategies for
target marketing and personalization.
Web usage mining is typically organized into
three main stages: preprocessing, pattern
discovery, and pattern analysis.
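The three stages can be sketched on a handful of web server log lines. This is a minimal illustration, assuming logs in the Common Log Format, a 30-minute inactivity timeout to separate sessions, and page-to-page transition counts as the pattern of interest; the log entries are invented, and real systems use more elaborate visitor identification and discovery techniques.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
SESSION_GAP = timedelta(minutes=30)   # assumed inactivity timeout


def preprocess(lines):
    """Stage 1: parse log lines and group successful page views into sessions."""
    views = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m["status"] == "200":
            # %b expects English month abbreviations (the default C locale).
            when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
            views[m["host"]].append((when, m["path"]))
    sessions = []
    for hits in views.values():
        hits.sort()
        current = [hits[0][1]]
        for (prev_time, _), (when, path) in zip(hits, hits[1:]):
            if when - prev_time > SESSION_GAP:
                sessions.append(current)   # long pause: start a new session
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def discover_patterns(sessions):
    """Stage 2: count how often one page is followed directly by another."""
    transitions = Counter()
    for pages in sessions:
        transitions.update(zip(pages, pages[1:]))
    return transitions


def analyze(transitions, top=3):
    """Stage 3: keep the most frequent transitions for a person to review."""
    return transitions.most_common(top)


# Invented log entries from two visitors.
log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:56:10 -0700] "GET /products.html HTTP/1.0" 200 1520',
    '10.0.0.2 - - [10/Oct/2000:14:01:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:14:02:30 -0700] "GET /products.html HTTP/1.0" 200 1520',
    '10.0.0.2 - - [10/Oct/2000:14:03:05 -0700] "GET /cart.html HTTP/1.0" 200 980',
    '10.0.0.1 - - [10/Oct/2000:16:00:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
]

sessions = preprocess(log)
for transition, count in analyze(discover_patterns(sessions)):
    print(transition, count)
```

Identifying visitors by host address alone is a simplification; proxies and shared addresses are one reason the preprocessing stage is usually the most laborious.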
The main benefits associated with web usage
analysis are related to the assessment of the
structure of the site, and how this might be
improved, and the potential to provide web
personalization at an individual or group
level to enhance the experience of the users.
For many eCommerce merchants, web usage analysis
is an extremely informative tool that helps identify
the demographics of their customer base, improve
the presentation and delivery of information on
the site, enhance the management of the customer
relationship, and potentially increase “yield” of
sales per customer through target marketing and
cross-promotion.
A potential problem associated with web
usage analysis is that while the data is
extremely useful for a merchant involved in
eCommerce, the amount of data collected
and the information it yields may constitute
a potential breach of user privacy.
If a user visits the site with the belief that the
browsing is relatively anonymous, but in
fact the site is collecting session data on
every page view and duration, the potential
for an adverse reaction may outweigh the
benefits from collecting the data.
A comprehensive introduction to data
mining is available in the “Tutorial on High
Performance Data Mining”, at http://www-users.cs.umn.edu/~mjoshi/hpdmtut/.
A good introduction to the general field of
data mining as it applies to the web can be
found in the paper “Web Mining Research:
A Survey”, by Kosala and Blockeel,
published in the ACM Special Interest
Group on Knowledge Discovery and Data
Mining (SIGKDD), July 2000. It is available
through the Citeseer service at
http://citeseer.nj.nec.com/kosala00web.html.
An introduction to the application of data
mining techniques to ascertain patterns in
web use can be found in “Web Usage
Mining: Discovery and Applications of
Usage Patterns from Web Data”, by
Srivastava, Cooley, Deshpande, and Tan,
SIGKDD, January 2000.
http://citeseer.nj.nec.com/srivastava00web.html.