Web Mining - CS 331 Research Project
Report by: Faten Al Zahrani & Abeer Al Nasser

1-Introduction
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize
automated tools in find the desired information
resources, and to track and analyze their usage
patterns. These factors give rise to the necessity of
creating server side and client side intelligent
systems that can effectively mine for knowledge.
Web mining can be broadly defined as the
discovery and analysis of useful information from
the World Wide Web. This describes the automatic
search of information resources available online, i.e.
Web content mining, and the discovery of user
access patterns from Web servers, i.e., Web usage
mining.
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, and resource discovery based on concept indexing or agent-based technology, may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.
1.1 Characteristics of web data.
Web data has several important characteristics:

The data on the Web is huge in amount. It is now hard to estimate the exact volume of data available on the Internet because Web data grows exponentially every day. For instance, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 Web pages and Web-accessible documents in 1994, and by November 1997 the top search engines claimed to index from 2 million to 100 million Web documents. This large volume of data makes it difficult to handle Web data with traditional database techniques.

The data on the Web is distributed and heterogeneous. Because the Web is essentially an interconnection of nodes over the Internet, Web data is usually distributed across a wide range of computers and servers located in different places around the world. At the same time, the Web includes not only textual content but also multimedia content such as images, audio files and video. Web data processing techniques therefore need the ability to deal with the heterogeneity of multimedia data.

The data on the Web is unstructured. There are, so far, no rigid and uniform data structures or schemas that Web pages must strictly follow. Instead, Web designers organize related information on the Web in their own ways, for example in HTML format. Even Web pages in well-defined HTML format contain only preliminary structure, e.g. tags or anchors.

The data on the Web is dynamic. The implicit and explicit structure of Web data is updated frequently. In particular, because of the different applications of Web-based data management systems, a variety of presentations of Web documents will be generated as the contents residing in databases are updated.

As a result, there is an increasing need to deal better with the unstructured nature of Web documents and to extract the mutual relationships hidden in Web data, so that users can more easily locate the Web information or services they need.
1.2 Web community.
A web community or Virtual community is a social
network of individuals who interact through specific
media, potentially crossing geographical and political
boundaries in order to pursue mutual interests or goals.
One of the most pervasive types of virtual community
includes social networking services, which consist of
various online communities.
The term web community or Virtual Community is
attributed to the book of the same title by Howard
Rheingold, published in 1993. The book, which can be read as a piece of social enquiry rooted in the social sciences, discusses his experiences on The WELL and a range of other computer-mediated communication and social groups, broadening the discussion toward information science. The technologies included
Usenet, MUDs (Multi-User Dungeon) and their
derivatives MUSHes and MOOs, Internet Relay Chat
(IRC), chat rooms and electronic mailing lists; the
World Wide Web as we know it today was not yet
used by many people. Rheingold pointed out the
potential benefits for personal psychological wellbeing, as well as for society at large, of belonging to
such a group.
These virtual communities all encourage interaction, sometimes focused around a particular interest and sometimes simply for communication. Quality
virtual communities do both. They allow users to
interact over a shared passion, whether it is through
message boards, chat rooms, social networking sites,
or virtual worlds.
A web community is a web site (or group of web sites)
that is a virtual community. A web community may
take the form of a social network service, an Internet
forum, a group of blogs, or another kind of social
software web application.
2- What is Web Mining?
Web data mining is a technique used to crawl through various web resources to collect required information, enabling an individual or a company to promote business, understand marketing dynamics, track new promotions floating on the Internet, and so on. There is a growing trend among companies, organizations and individuals alike to gather information through web data mining and to use that information in their best interest.
Data mining is done with various types of data mining software, ranging from simple tools to highly specialized programs for detailed and extensive tasks that sift through large amounts of information to pick out finer details. For example, if a company is looking for information on doctors, including their emails, fax and telephone numbers and locations, this information can be mined with one of these data mining programs. Such information collection through data mining has allowed companies to earn thousands and thousands of dollars in revenue by using the internet to gain the business intelligence that helps companies make vital business decisions.
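As a rough illustration of the kind of extraction such tools perform, the following minimal Python sketch fetches one page and pulls out e-mail addresses and phone-like strings with regular expressions. The URL and the patterns are hypothetical placeholders, not part of the original report; a real contact-mining tool would be far more robust and would respect crawling policies.

import re
import urllib.request

# Hypothetical example page; replace with a page you are allowed to crawl.
URL = "https://example.com/directory.html"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # deliberately loose phone pattern

def extract_contacts(url):
    """Download one page and return the e-mail addresses and phone-like
    strings found in its raw HTML (a crude form of web content mining)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="ignore")
    return sorted(set(EMAIL_RE.findall(html))), sorted(set(PHONE_RE.findall(html)))

if __name__ == "__main__":
    emails, phones = extract_contacts(URL)
    print("e-mails:", emails)
    print("phone-like strings:", phones)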
Before this data mining software came into being, businesses collected information from recorded data sources manually. But the bulk of this information is far too large, daunting and time consuming to gather by going through all the records, so the approach of computer-based data mining came into being and has gained huge popularity, to the point of becoming a necessity for the survival of most businesses.
This collected information is used to gain more
knowledge and based on the findings and analysis of
the information make predictions as to what would be
the best choice and the right approach to move toward
on a particular issue. Web data mining is not only
focused to gain business information but is also used
by various organizational departments to make the
right predictions and decisions for things like business
development, work flow, production processes and
more by going through the business models derived
from the data mining.
A strategic analysis department can mine its client archives with data mining software to determine which offers to send to which clients for maximum conversion rates. For example, a company
is thinking about launching cotton shirts as their new
product. Through their client database, they can clearly
determine as to how many clients have placed orders
for cotton shirts over the last year and how much
revenue such orders have brought to the company.
With such analysis in hand, the company can decide which offers to send both to the clients who placed orders for cotton shirts and to those who did not. This keeps the organization heading in the right direction in its marketing, rather than going through a trial-and-error phase and spending money needlessly to learn the hard facts. These analytical facts also shed light on the percentage of customers who may move from your company to a competitor.
Data mining also empowers companies to keep a record of fraudulent payments, which can be researched and studied to help develop more advanced protective methods that prevent such events from happening. Buying trends revealed through web data mining can also help you forecast your inventories. This is a direct analysis that lets the organization stock appropriately for each month, depending on the predictions laid out through this analysis of buying trends.
Data mining technology is going through a huge evolution, and new and better techniques are made available all the time to gather whatever information is required. Web data mining technology is not only opening avenues for gathering data; it is also raising many concerns about data security. There is a great deal of personal information available on the internet, and web data mining has helped keep the need to secure that information at the forefront.
3. Data Mining vs. Web mining.
Data mining refers to extracting informative knowledge from a large amount of data, which can come in different data types, such as transaction data in e-commerce applications or gene expression data. Whatever the type of data, the main purpose of data mining is to discover hidden knowledge, normally in the form of patterns, from an available data repository.
What is the difference between data mining and web mining? One of the significant factors is the structure of the data being mined. Common data mining applications discover patterns in structured data, such as databases (DBMS). Web mining, by contrast, discovers patterns in less structured data such as the Web (WWW). In other words, we can say that Web mining is data mining techniques applied to the WWW.
4-Types of web mining
Basically, web mining is of three types:
1. Web Usage mining
In the web usage mining process, data mining techniques are applied to discover the trends and patterns in the browsing behavior of the visitors of a website. Navigation patterns are extracted, since browsing patterns can be traced and the structure of the website can be designed accordingly. For example, if a particular feature of the website is used frequently by visitors, it should be enhanced and made more prominent so as to increase its usage and appeal to more users. This kind of mining makes use of web accesses and logs. Simply by understanding the movement of visitors and their surfing behavior, you can meet their preferences and needs better and popularize your website among the masses in the internet arena.
2. Web Content Mining
This kind of mining process attempts to discover all the links and hyperlinks in a document in order to generate a structural report on a web page, together with information about its different facets: for instance, whether users are able to find the information they need, whether the structure of the website is too shallow or too deep, whether the elements of the web page are correctly placed, which areas of the website are the least and the most visited, and whether these have anything to do with the page design. Such things are analyzed and evaluated for deeper study.
3. Web Linkage/Structure mining
This involves the use of graph theory to analyze the connection and node structure of a website. According to the type and nature of the web structure data, it is further divided into two kinds:
- Extraction of patterns from the hyperlinks on the net: a hyperlink is a structural form of web address connecting a web page to some other location.
- Mining of the structure of the document: a tree-like structure is used to analyze and describe the XHTML or HTML tags in the web page.
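A minimal sketch of both kinds of structure mining, using only Python's standard html.parser module, is shown below. The sample HTML string is a made-up placeholder; a real system would crawl live pages instead.

from html.parser import HTMLParser

class LinkAndTreeParser(HTMLParser):
    """Collect hyperlinks and a simple outline of the HTML tag tree."""
    def __init__(self):
        super().__init__()
        self.links = []        # hyperlinks found in <a href="..."> attributes
        self.depth = 0         # current nesting depth in the tag tree
        self.tag_outline = []  # (depth, tag) pairs describing the document tree

    def handle_starttag(self, tag, attrs):
        self.tag_outline.append((self.depth, tag))
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

sample_html = """<html><body>
<h1>Demo</h1>
<p>See <a href="https://example.com/page1">page 1</a> and
<a href="/page2">page 2</a>.</p>
</body></html>"""

parser = LinkAndTreeParser()
parser.feed(sample_html)
print("hyperlinks:", parser.links)
for depth, tag in parser.tag_outline:
    print("  " * depth + tag)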
4.1-Web content mining
Web content mining, also known as text mining, is
generally the second step in Web data mining. Content
mining is the scanning and mining of text, pictures and
graphs of a Web page to determine the relevance of the
content to the search query. This scanning is completed
after the clustering of web pages through structure
mining and provides the results based upon the level of
relevance to the suggested query. With the massive
amount of information that is available on the World
Wide Web, content mining provides the results lists to
search engines in order of highest relevance to the
keywords in the query.
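To make the idea of relevance ranking concrete, here is a small, self-contained sketch that scores a few toy documents against a query using TF-IDF weighting and cosine similarity. The documents and query are invented for illustration; production search engines use far richer signals than this.

import math
from collections import Counter

documents = {
    "doc1": "web mining extracts useful patterns from web data",
    "doc2": "content mining scans text and images on a web page",
    "doc3": "cotton shirts are on sale this week",
}
query = "web content mining"

def tf_idf_vectors(texts):
    """Build a TF-IDF vector (term -> weight) for every text."""
    tokenized = {name: text.lower().split() for name, text in texts.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Treat the query as one more "document" so it shares the same vocabulary.
vectors = tf_idf_vectors({**documents, "_query": query})
query_vec = vectors.pop("_query")
for name in sorted(documents, key=lambda d: cosine(query_vec, vectors[d]), reverse=True):
    print(name, round(cosine(query_vec, vectors[name]), 3))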
Text mining is directed toward specific information
provided by the customer search information in search
engines. This allows for the scanning of the entire Web
to retrieve the cluster content triggering the scanning
of specific Web pages within those clusters. The
results are pages relayed to the search engines through
the highest level of relevance to the lowest. Though the search engines can provide thousands of links to Web pages related to the search content, this type of web mining reduces the amount of irrelevant information.
Web text mining is very effective when used in relation to a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. This specific content database makes it possible to pull only the information within those subjects, providing the most specific results for search queries in search engines. Supplying only the most relevant information gives a higher quality of results, and this increase in productivity is due directly to the content mining of text and visuals.
The main uses of this type of data mining are to gather, categorize, organize and provide the best possible information available on the WWW to the user requesting it. This tool is essential for scanning the many HTML documents, images and text provided on Web pages; the resulting information is supplied to the search engines in order of relevance, giving more productive results for each search.
Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the most relevant information for his query. Text mining reduces those thousands of results, which eliminates frustration and improves the navigation of information on the Web.
Business uses of content mining allow for the
information provided on their sites to be structured in a
relevance-order site map. This allows for a customer of
the Web site to access specific information without
having to search the entire site. With this type of mining, data remains available in order of relevance to the query, thus providing productive marketing. Used as a marketing tool, this brings additional traffic to the Web pages of a company's site based on the keyword relevance the pages offer to general searches.
As the second branch of Web data mining, text mining helps improve the productive uses of mining for businesses, Web designers and search engine operations. Organizing, categorizing and gathering the information provided by the WWW becomes easier and produces more productive results through this type of mining.
In short, Web content mining allows search engine results to maximize the flow of customer clicks to a Web site, or to particular Web pages of the site, which are then accessed numerous times for relevant search queries. The clustering and
organization of Web content in a content database
enables effective navigation of the pages by the
customer and search engines. Images, content, formats
and Web structure are examined to produce a higher
quality of information to the user based upon the
requests made. Businesses can maximize the use of
this text mining to improve marketing of their sites as
well as the products they offer.
4.2- web linkage mining
Web Linkage or Web Structure Mining is the
organization of the content via HTML and XML tags.
Web structure mining, one of three categories of web
mining for data, is a tool used to identify the
relationship between Web pages linked by information
or direct link connection. This structure data is
discoverable by the provision of web structure schema
through database techniques for Web pages. This
connection allows a search engine to pull data relating
to a search query directly to the linking Web page from
the Web site the content rests upon. This process takes place through the use of spiders scanning the Web sites, retrieving the home page, and then linking the information through reference links to bring forth the specific page containing the desired information.
The use of structure mining minimizes two main problems of the World Wide Web caused by its vast amount of information. The first of these problems is irrelevant search results: the relevance of search results becomes misconstrued because search engines often only allow low-precision criteria. The second is the inability to index the vast amount of information provided on the Web, which causes low recall with content mining. This minimization comes in part from discovering the model underlying the Web hyperlink structure, which Web structure mining provides.
The main purpose for structure mining is to extract
previously unknown relationships between Web pages.
This structure data mining provides use for a business
to link the information of its own Web site to enable
navigation and cluster information into site maps. This
allows its users the ability to access the desired
information through keyword association and content
mining. A hyperlink hierarchy is also determined, to map the related information within the sites to the relationships with competitor links and connections through search engines and third-party co-links. This enables clustering of connected Web pages to establish the relationships among these pages.
On the WWW, the use of structure mining enables the
determination of similar structure of Web pages by
clustering through the identification of underlying
structure. This information can be used to project the
similarities of web content. The known similarities then provide the ability to maintain or improve the information of a site so that web spiders can access it at a higher rate; the more Web crawlers that visit, the more beneficial it is to the site, because its content is related to more searches.
In the business world, structure mining can be quite
useful in determining the connection between two or
more business Web sites. The determined connection
brings forth a useful tool for mapping competing
companies through third party links such as resellers
and customers. This cluster map allows the content of the business pages to place well in search engine results through the connection of keywords and co-links across the relationships of the Web pages. The information determined in this way provides the proper path, through structure mining, to improve the navigation of these pages via their relationships and the link hierarchy of the Web sites.
With improved navigation of Web pages on business
Web sites, connecting the requested information to a
search engine becomes more effective. This stronger
connection allows generating traffic to a business site
to provide results that are more productive. The more links provided within the relationships of the web pages, the more easily navigation can follow the link hierarchy. This improved navigation attracts the spiders to the correct locations, providing the requested information and proving more beneficial in clicks to a particular site.
Therefore, Web mining and the use of structure mining can provide strategic results for marketing a Web site and driving sales. More traffic directed to the Web pages of a particular site increases the level of return visits to the site and recall by search engines for the information or products provided by the company. This also enables marketing strategies that produce better results through navigation of the pages linking to the homepage of the site itself. To truly utilize a website as a business tool, web structure mining is a must.
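The clustering and hub-like relationships mentioned above can be computed directly from a link graph. The sketch below runs a few iterations of HITS-style hub and authority scoring on a tiny, made-up graph; the page names and links are purely illustrative and this is only one of several link-analysis techniques.

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["products", "blog"],
    "blog": ["products", "partner"],
    "partner": ["home"],
    "products": [],
}

def hits(graph, iterations=20):
    """Return (hub, authority) scores computed by simple HITS iteration."""
    hubs = {page: 1.0 for page in graph}
    auths = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority of a page = sum of hub scores of pages linking to it.
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
        # Hub score of a page = sum of authority scores of the pages it links to.
        hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
        # Normalize so the scores do not grow without bound.
        for scores in (hubs, auths):
            total = sum(scores.values()) or 1.0
            for p in scores:
                scores[p] /= total
    return hubs, auths

hubs, auths = hits(links)
print("hubs:       ", {p: round(s, 3) for p, s in hubs.items()})
print("authorities:", {p: round(s, 3) for p, s in auths.items()})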
4.3- web usage mining.
Web usage mining is the type of Web mining activity
that involves the automatic discovery of user access
patterns from one or more Web servers. As more
organizations rely on the Internet and the World Wide
Web to conduct business, the traditional strategies and
techniques for market analysis need to be revisited in
this context. Organizations often generate and collect
large volumes of data in their daily operations. Most of
this information is usually generated automatically by
Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring page for each page reference, and user registration or survey data gathered via tools such as CGI scripts.
Analyzing such data can help these organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things.
Analysis of server access logs and user registration
data can also provide valuable information on how to
better structure a Web site in order to create a more
effective presence for the organization. In
organizations using intranet technologies, such
analysis can shed light on more effective management
of workgroup communication and organizational
infrastructure. Finally, for organizations that sell
advertising on the World Wide Web, analyzing user
access patterns helps in targeting ads to specific groups
of users.
Most of the existing Web analysis tools
[Inc96,eSI95,net96] provide mechanisms for reporting
user activity in the servers and various forms of data
filtering. Using such tools, for example, it is possible
to determine the number of accesses to the server and
the individual files within the organization's Web
space, the times or time intervals of visits, and domain
names and the URLs of users of the Web server.
However, in general, these tools are designed to handle low-to-moderate-traffic servers and usually provide little or no analysis of data relationships among the accessed files and directories within the Web space.
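For illustration, the following minimal Python sketch reads an Apache/NCSA-style combined access log and reports per-page hit counts together with the referrers seen for each page, the kind of raw statistic such tools start from. The log file name and the exact field layout are assumptions about a typical combined log format, not something specified in this report.

import re
from collections import Counter, defaultdict

# Typical "combined" log line:
# host ident user [time] "METHOD path HTTP/x" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "([^"]*)"'
)

def summarize(path="access.log"):
    hits = Counter()
    referrers = defaultdict(Counter)
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            match = LOG_RE.match(line)
            if not match:
                continue
            host, page, status, referer = match.groups()
            if status.startswith("2"):      # count only successful requests
                hits[page] += 1
                if referer and referer != "-":
                    referrers[page][referer] += 1
    return hits, referrers

if __name__ == "__main__":
    hits, referrers = summarize()
    for page, count in hits.most_common(10):
        print(count, page, dict(referrers[page].most_common(3)))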
5- The future of web mining.
As Web services continue to flourish [K2002], this trend is likely to continue. As the Web and its usage grow, they will continue to generate ever more content, structure, and usage data, and the value of Web mining will keep increasing. Outlined here are some research directions that must be pursued to ensure that Web mining technologies continue to develop and that this value can be realized.
5.1 Web metrics and measurements
From an experimental human behaviorist's viewpoint, the Web is the perfect experimental apparatus. Not only does it provide the ability to measure human behavior at a micro level, it (i) eliminates the bias of subjects knowing that they are participating in an experiment, and (ii) allows the number of participants to be many orders of magnitude larger. However, we have not even begun to appreciate the true impact of this revolutionary experimental apparatus. Amazon's WebLab [AMZNa] is one of the early efforts in this direction. It is regularly used to measure the user impact of various proposed changes - on operational metrics such as site visits and visit/buy ratios, as well as on financial metrics such as revenue and profit - before a deployment decision is made. For example, during Spring 2000 a 48-hour experiment involving over one million user sessions was carried out on the live site before the decision to change Amazon's logo was made. Research needs to be done to develop the right set of Web metrics, and their measurement procedures, so that various Web phenomena can be studied.
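As a toy illustration of measuring the impact of a proposed change, the sketch below compares the visit-to-buy conversion rate of a control group and a treatment group and reports a two-proportion z-score. The session counts are invented; real experimentation platforms such as Amazon's WebLab are of course far more sophisticated.

import math

# Hypothetical experiment data: (sessions, sessions that ended in a purchase).
control = (50_000, 1_900)    # current site
treatment = (50_000, 2_150)  # site with the proposed change

def conversion_rate(sessions, buys):
    return buys / sessions

def two_proportion_z(a, b):
    """z-score for the difference between two conversion rates."""
    (n1, x1), (n2, x2) = a, b
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

print("control conversion  :", conversion_rate(*control))
print("treatment conversion:", conversion_rate(*treatment))
print("z-score             :", round(two_proportion_z(control, treatment), 2))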
5.2 Process mining
Mining of 'market basket' data, collected at the point-of-sale in any store, has been one of the visible successes of data mining. However, this data captures only the end result of the process, and only for decisions that ended in a purchase. Click-stream data provides the opportunity for a detailed look at the decision-making process itself, and knowledge extracted from it can be used for optimizing the process, influencing the process, and so on [ONL2002]. Underhill [U2000] has conclusively demonstrated the value of process information in understanding users' behavior in traditional shops. Research needs to be carried out on (i) extracting process models from usage data, (ii) understanding how different parts of the process model impact various Web metrics of interest, and (iii) how the process models change in response to various changes that are made, i.e. changing stimuli to the user.
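A first step toward such a process model is often a page-to-page transition matrix estimated from click-streams. The sketch below counts transitions in a few made-up sessions and prints the estimated probability of each next page; the session data is purely illustrative.

from collections import Counter, defaultdict

# Hypothetical sessions: ordered lists of pages visited by each user.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "home", "search", "product"],
    ["search", "product", "cart"],
]

def transition_model(sessions):
    """Estimate P(next page | current page) from click-stream sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for page, nexts in counts.items()
    }

for page, nexts in transition_model(sessions).items():
    print(page, "->", {n: round(p, 2) for n, p in nexts.items()})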
5.3 Temporal evolution of the Web
Society's interaction with the Web is changing the Web as well as the way society interacts. While storing the history of all of this interaction in one place is clearly too staggering a task, at least the changes to the Web are being recorded by the pioneering Internet Archive project [IA]. Research needs to be carried out in extracting temporal models of how Web content, Web structures, Web communities, authorities, hubs, etc. are evolving. Large organizations generally archive (at least portions of) usage data from their Web sites. With these sources of data available, there is large scope for research to develop techniques for analyzing how the Web evolves over time.
This temporal behavior applies to all three kinds of Web data: Web content, Web structure and Web usage. The methodology suggested for hyperlink analysis in [DSKT2002] can be extended here, and the research can be classified based on knowledge models, metrics, analysis scope and algorithms. For example, the analysis scope of the temporal behavior could be restricted to the behavior of a single document, multiple documents or the whole Web graph. The other factor that has to be studied is the effect of Web content, Web structure and Web usage on each other over time.
5.4 Web services optimization
As services over the Web continue to grow [K2002],
there will be a need to make them robust, scalable,
efficient, etc. Web mining can be applied to better
understand the behavior of these services, and the
knowledge extracted can be useful for various kinds of
optimizations. The successful application of Web mining for predictive pre-fetching of pages by a browser has been demonstrated in [PSS2001].
[PSS2001]. Research is needed in developing Web
mining techniques to improve various other aspects of
Web services.
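Building on the transition model sketched under process mining above, predictive pre-fetching can be as simple as requesting the most probable next page in the background. The sketch below illustrates that idea on invented data; it is only a toy, not the technique of [PSS2001].

from collections import Counter, defaultdict

def most_likely_next(sessions, current_page):
    """Return the page most often visited right after current_page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    nexts = counts.get(current_page)
    return nexts.most_common(1)[0][0] if nexts else None

sessions = [["home", "search", "product"], ["home", "product", "cart"]]
print("prefetch candidate after 'home':", most_likely_next(sessions, "home"))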
5.5 Fraud and threat analysis
The anonymity provided by the Web has led to a
significant increase in attempted fraud,
from unauthorized use of individual credit cards to
hacking into credit card databases
for blackmail purposes [S2000].
5.6 Web mining and privacy
While there are many benefits to be gained from Web mining, a clear drawback is the potential for severe violations of privacy. Public attitude towards privacy seems to be almost schizophrenic - i.e. people say one thing and do quite the opposite. For example, famous cases like [DG2000] and [DCLKa] seem to indicate that people value their privacy, while experience at major e-commerce portals shows that over 97% of users will share personal information if some benefit can be provided based on it.
Spiekermann et al. [SGB2001] have demonstrated that people were willing to provide fairly personal information about themselves, even when it was completely irrelevant to the task at hand, if given the right stimulus to do so. Furthermore, explicitly drawing attention to information privacy policies had practically no effect.
One explanation of this seemingly contradictory attitude towards privacy may be that we have a bi-modal view of privacy, namely: "I'd be willing to share information about myself as long as I get some (tangible or intangible) benefit from it, and as long as there is an implicit guarantee that the information will not be abused." The research issue this attitude generates is the need to develop approaches, methodologies and tools that can be used to verify and validate that a Web service is indeed using an end-user's information in a manner consistent with its stated policies.
6- The techniques and applications of web mining.
An outcome of the excitement about the Web in the past few years has been that Web applications have been developed at a much faster rate in industry than research in Web-related technologies. Many of these applications were based on the use of Web mining concepts, even though the organizations that developed them and invented the corresponding technologies did not think of them in those terms.
6.1 Personalized Customer Experience in B2C E-commerce - Amazon.com
Early on in the life of Amazon.com, its visionary CEO
Jeff Bezos observed,
’In a traditional (brick-and-mortar) store, the main
effort is in getting a customer to the store. Once a
customer is in the store they are likely to make a
purchase - since the cost of going to another store is
high – and thus the marketing budget (focused on
getting the customer to the store)
is in general much higher than the in-store customer
experience budget
(which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store.'
This fundamental observation has been the driving
force behind Amazon’s comprehensive approach to
personalized customer experience, based on the mantra
’a personalized store for every customer’ [M2001]. A
host of Web mining techniques, e.g. associations
between pages visited, click-path analysis, etc., are
used to improve the customer’s experience during a
’store visit’. Knowledge gained from Web mining is
the key intelligence behind Amazon’s features such as
’instant recommendations’, ’purchase circles’, ’wishlists’, etc. [AMZNa].
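As a rough illustration of "associations between pages visited", the sketch below counts how often pairs of product pages co-occur in the same session and uses the counts to suggest "customers who viewed X also viewed Y". The sessions are invented, and this is only a toy stand-in for Amazon's actual recommendation technology.

from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical browsing sessions: the set of product pages seen in each visit.
sessions = [
    {"book_a", "book_b", "lamp"},
    {"book_a", "book_b"},
    {"book_b", "lamp", "mug"},
    {"book_a", "mug"},
]

def co_occurrence(sessions):
    """Count, for every page, which other pages appear with it in a session."""
    together = defaultdict(Counter)
    for pages in sessions:
        for a, b in combinations(sorted(pages), 2):
            together[a][b] += 1
            together[b][a] += 1
    return together

def recommend(page, sessions, k=2):
    return [p for p, _ in co_occurrence(sessions)[page].most_common(k)]

print("viewed with 'book_a':", recommend("book_a", sessions))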
6.2 Web Search - Google
Google [GOOGa] is one of the most popular and widely used search engines. It provides users access to information from the almost 2.5 billion web pages it has indexed on its servers. The simplicity and speed of its search facility make it a highly successful search engine. Earlier search engines concentrated on Web content to return pages relevant to a query; Google was the first to exploit the importance of the link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products. The PageRank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.
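For concreteness, here is a minimal power-iteration PageRank sketch over a tiny, made-up link graph. It illustrates the general idea of ranking pages by link structure; it is not Google's actual implementation, and the damping factor and graph below are illustrative choices.

# Hypothetical link graph: page -> list of pages it links to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration. Dangling pages (no outlinks)
    get no special handling because this toy graph has none."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            share = damping * ranks[page] / len(outlinks)
            for target in outlinks:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))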
Google has successfully used the data available from the Web content (the actual text and the hyper-text) and the Web graph to enhance its search capabilities and provide the best results to its users. Google has expanded its search technology to provide site-specific search, enabling users to search for information within a specific website.
The 'Google Toolbar' is another service provided by Google that seeks to make search easier and more informative by providing additional features such as highlighting the query words on the returned web pages. The full version of the toolbar, if installed, also sends the user's click-stream information to Google; the usage statistics thus obtained can be used by Google to enhance the quality of its results. Google also provides advanced search capabilities to search images and to look for pages that have been updated within a specific date range. Built on top of Netscape's Open Directory project, Google's web directory provides a fast and easy way to search within a certain topic or related topics. The Advertising Programs introduced by Google target users by providing advertisements that are relevant to the search query. This avoids bothering users with irrelevant ads and has increased clicks for the advertising companies by four or five times. According to BtoB, a leading national marketing publication, Google was named a top 10 advertising property in the Media Power 50, which recognizes the most powerful and targeted business-to-business advertising outlets [GOOGb].
One of the latest services offered by Google is 'Google News' [GOOGc]. It integrates news from the online versions of all newspapers and organizes it categorically to make it easier for users to read "the most relevant news". It seeks to provide the latest information by constantly retrieving pages that are updated on a regular basis. The key feature of this news page, like any other Google service, is that it integrates information from various Web news sources through purely algorithmic means, and thus does not introduce any human bias or effort. However, the publishing industry is not yet convinced about a fully automated approach to news distillation.
6.3 Web-wide tracking - DoubleClick
’Web-wide tracking’, i.e. tracking an individual across
all sites (s) he visits is one of the most intriguing and
controversial technologies. It can provide an
understanding of an individual’s lifestyle and habits to
a level that is unprecedented - clearly of tremendous
interest to marketers. A successful example of this is
DoubleClick Inc.’s DART ad management technology
[DCLKa]. DoubleClick serves advertisements, which can be targeted on demographic or behavioral attributes, to the end-user on behalf of its clients, i.e. the Web sites using DoubleClick's service. Sites that use DoubleClick's service are part of 'The DoubleClick Network', and the browsing behavior of a user can be tracked across all sites in the network using a cookie. This allows DoubleClick's ad targeting to be based on very sophisticated criteria. Alexa Research [?] has
recruited a panel of more than 500,000 users, who’ve
voluntarily agreed to have their every click tracked, in
return for some freebies. This is achieved through
having a browser bar that can be downloaded by the
panelist from Alexa’s website, which gets attached to
the browser and sends Alexa a complete click-stream
of the panelist’s Web usage. Alexa was purchased by
Amazon for its tracking technology.
Clearly Web-wide tracking is a very powerful idea.
However, the invasion of privacy it causes has not
gone unnoticed, and both Alexa/Amazon and
DoubleClick have faced very visible lawsuits
[DG2000, DCLKb]. The value of this technology in applications such as cyber-threat analysis and homeland defense is quite clear, and it may be only a matter of time before these organizations are asked to provide this information.
6.4 Understanding Web communities - AOL
One of the biggest successes of America Online (AOL)
has been its sizeable and loyal customer base [AOLa].
A large portion of this customer base participates in various 'AOL communities', which are collections of
users with similar interests. In addition to providing a
forum for each such community to interact amongst
themselves, AOL provides useful information, etc. as
well. Over time, these communities have grown to be
well-visited ’waterholes’ for AOL users with shared
interests. Applying Web mining to the data collected
from community interactions provides AOL with a
very good understanding of its communities, which it
has used for targeted marketing through ads and e-mail
solicitations. Recently, it has started the concept of
’community sponsorship’, whereby an organization
like Nike may sponsor a community called 'Young Athletic Twenty-Somethings'. In return, consumer
survey and new product development experts of the
sponsoring organization get to participate in the
community - usually without the knowledge of the
other participants. The idea is to treat the community
as a highly specialized focus group, understand its
needs and opinions on new and existing products; and
also test strategies for influencing opinions.
6.5 Understanding auction behavior - eBay
As individuals in a society where we have many more
things than we need, the allure of exchanging our
’useless stuff’ for some cash - no matter how small - is
quite powerful.
This is evident from the success of flea markets, garage sales and estate sales. The genius of eBay's founders was to create an infrastructure that gave this urge a global reach, with the convenience of doing it from one's home PC [EBAYa]. In addition, it popularized auctions as a product selling/buying mechanism, which provides the thrill of gambling without the trouble of having to go to Las Vegas. All of this has made eBay one of the most successful businesses of the Internet era. Unfortunately, the
anonymity of the Web has also created a significant
problem for eBay auctions, as it is impossible to
distinguish real bids from fake ones. eBay is now
using Web mining techniques to analyze bidding
behavior to determine if a bid is fraudulent [C2002].
Recent efforts are towards understanding participants’
bidding behaviors/patterns to create a more efficient
auction market.
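eBay's actual fraud models are not described in this report. As a purely hypothetical illustration of mining bidding behavior, the sketch below flags bidder/seller pairs where one bidder places a large share of the bids in a seller's auctions but almost never wins, a classic shill-bidding warning sign; the bid records and thresholds are invented.

from collections import defaultdict

# Hypothetical bid records: (auction_id, seller, bidder, won_auction).
bids = [
    ("a1", "sellerX", "bob", False), ("a1", "sellerX", "eve", False),
    ("a1", "sellerX", "bob", True),
    ("a2", "sellerX", "eve", False), ("a2", "sellerX", "carol", True),
    ("a3", "sellerX", "eve", False), ("a3", "sellerX", "bob", True),
]

def shill_suspects(bids, min_share=0.3, max_win_rate=0.1):
    """Flag bidders who place many bids with one seller but almost never win."""
    per_pair = defaultdict(lambda: {"bids": 0, "wins": 0})
    per_seller = defaultdict(int)
    for _, seller, bidder, won in bids:
        per_pair[(seller, bidder)]["bids"] += 1
        per_pair[(seller, bidder)]["wins"] += int(won)
        per_seller[seller] += 1
    suspects = []
    for (seller, bidder), stats in per_pair.items():
        share = stats["bids"] / per_seller[seller]
        win_rate = stats["wins"] / stats["bids"]
        if share >= min_share and win_rate <= max_win_rate:
            suspects.append((seller, bidder, round(share, 2)))
    return suspects

print(shill_suspects(bids))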
6.6 Personalized Portal for the Web - MyYahoo
Yahoo [YHOOa] was the first to introduce the concept of a 'personalized portal', i.e. a Web site designed to have its look-and-feel as well as its content personalized to the needs of an individual end-user. This has been an extremely popular concept and has led to the creation of other personalized portals, e.g. Yodlee [YODLa] for private information. Mining MyYahoo usage logs provides Yahoo with valuable insight into an individual's Web usage habits, enabling Yahoo to provide compelling personalized content, which in turn has contributed to the tremendous popularity of the Yahoo Web site.
7-Conclusion
As the Web and its usage continues to grow, so grows
the opportunity to analyze Web data and extract all
manner of useful knowledge from it. The past five
years have seen the emergence of Web mining as a
rapidly growing area, due to the efforts of the research
community as well as various organizations that are
practicing it. In this paper we have briefly described
the key computer science contributions made by the
field, the prominent successful applications, and
outlined some promising areas of future research. Our
hope is that this overview provides a starting point for
fruitful discussion.
References:
Books:
Web Mining and Social Networking, Yanchun Zhang, Victoria University, Australia.
Web Information Systems and Mining, Wenyin Liu, Xiangfeng Luo, Fu Lee Wang, Jingsheng Lei.
Web Mining Applications in E-Commerce and E-Services, Ting, I-Hsien; Wu, Hui-Ju (Eds.).
Web:
http://www.galeas.de/webmining.html
http://www.ieee.org.ar/downloads/Srivastava-tutpaper.pdf
http://people.ischool.berkeley.edu/~hearst/talks/datamining-panel/sld008.htm
http://en.wikipedia.org/wiki/Web_mining
http://www.slideshare.net/dataminingtools/webminingoverview-2649166
http://www.expertstown.com/web-mining
http://searchcrm.techtarget.com/definition/Web-mining
http://www.cs.uic.edu/~liub/WebContentMining.html
http://www.cyberartsweb.org/cpace/ht/lanman/wum1.htm
http://www.facebook.com/pages/Web-usagemining/110446175642994