Web Mining - CS 331 Research Project
Report by: Faten Al Zahrani & Abeer Al Nasser

1-Introduction
With the explosive growth of information sources
available on the World Wide Web, it has become
increasingly necessary for users to utilize
automated tools in find the desired information
resources, and to track and analyze their usage
patterns. These factors give rise to the necessity of
creating server side and client side intelligent
systems that can effectively mine for knowledge.
Web mining can be broadly defined as the
discovery and analysis of useful information from
the World Wide Web. This describes the automatic
search of information resources available online, i.e.
Web content mining, and the discovery of user
access patterns from Web servers, i.e., Web usage
mining.
Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, and resource discovery based on concept indexing or agent-based technology, may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.
1.1 Characteristics of web data.
Web data has several important characteristics:

The data on the Web is huge in amount. It is now hard to estimate the exact volume of data available on the Internet because Web data grows exponentially every day. For instance, one of the first Web search engines, the World Wide Web Worm (WWWW), had an index of 110,000 Web pages and Web-accessible documents in 1994, and by November 1997 the top search engines claimed to index from 2 million to 100 million Web documents. This large volume of data makes it difficult to handle Web data with traditional database techniques.

The data on the Web is distributed and heterogeneous. Because the Web is essentially an interconnection of nodes over the Internet, Web data is usually distributed across a wide range of computers and servers located in different places around the world. At the same time, the Web includes not only textual content but also multimedia content such as images, audio files and video. Web data processing techniques therefore need the ability to deal with the heterogeneity of multimedia data.

The data on the Web is unstructured. There are, so far, no rigid and uniform data structures or schemas that Web pages must strictly follow. Instead, Web designers organize related information on the Web in their own ways, for example in HTML format. Even Web pages in well-defined HTML format contain only preliminary structure, e.g. tags or anchors.

The data on the Web is dynamic. The implicit and explicit structure of Web data is updated frequently. In particular, because of the different applications of Web-based data management systems, a variety of presentations of Web documents will be generated as the contents residing in databases are updated.

As a result, there is an increasing need to deal better with the unstructured nature of Web documents and to extract the mutual relationships hidden in Web data, so that users can more easily locate the Web information or services they need.
1.2 Web community.
A web community or Virtual community is a social
network of individuals who interact through specific
media, potentially crossing geographical and political
boundaries in order to pursue mutual interests or goals.
One of the most pervasive types of virtual community
includes social networking services, which consist of
various online communities.
The term web community or Virtual Community is
attributed to the book of the same title by Howard
Rheingold, published in 1993. The book, which can be read as a piece of social enquiry rooted in the social sciences, discusses his experiences on The WELL and a range of other computer-mediated communication and social groups, broadening the discussion toward information science. The technologies included
Usenet, MUDs (Multi-User Dungeon) and their
derivatives MUSHes and MOOs, Internet Relay Chat
(IRC), chat rooms and electronic mailing lists; the
World Wide Web as we know it today was not yet
used by many people. Rheingold pointed out the
potential benefits for personal psychological wellbeing, as well as for society at large, of belonging to
such a group.
These virtual communities all encourage interaction, sometimes focused around a particular interest and sometimes simply for communication. Quality
virtual communities do both. They allow users to
interact over a shared passion, whether it is through
message boards, chat rooms, social networking sites,
or virtual worlds.
A web community is a web site (or group of web sites)
that is a virtual community. A web community may
take the form of a social network service, an Internet
forum, a group of blogs, or another kind of social
software web application.
2- What is Web Mining?
Web data mining is a technique used to crawl through various web resources to collect required information, enabling an individual or a company to promote business, understand marketing dynamics, track new promotions floating on the Internet, and so on. There is a growing trend among companies, organizations and individuals alike to gather information through web data mining and to use that information in their best interest.
Data mining is done with various types of data mining software, ranging from simple tools to highly specialized programs for detailed and extensive tasks that sift through large amounts of information to pick out finer details. For example, if a company is looking for information on doctors, including their emails, fax and telephone numbers and locations, this information can be mined with one of these data mining programs. Such information collection through data mining has allowed companies to earn thousands and thousands of dollars in revenue by using the internet to gain the business intelligence that helps companies make vital business decisions.
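As a rough illustration of the kind of extraction such tools perform, the following minimal Python sketch fetches one page and pulls out e-mail addresses and phone-like strings with regular expressions. The URL and the patterns are hypothetical placeholders, not part of the original report; a real contact-mining tool would be far more robust and would respect crawling policies.

import re
import urllib.request

# Hypothetical example page; replace with a page you are allowed to crawl.
URL = "https://example.com/directory.html"

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # deliberately loose phone pattern

def extract_contacts(url):
    """Download one page and return the e-mail addresses and phone-like
    strings found in its raw HTML (a crude form of web content mining)."""
    with urllib.request.urlopen(url) as response:
        html = response.read().decode("utf-8", errors="ignore")
    return sorted(set(EMAIL_RE.findall(html))), sorted(set(PHONE_RE.findall(html)))

if __name__ == "__main__":
    emails, phones = extract_contacts(URL)
    print("e-mails:", emails)
    print("phone-like strings:", phones)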
Before this data mining software came into being, businesses collected information from recorded data sources manually. But the bulk of this information is far too large, daunting and time consuming to gather by going through all the records, so the approach of computer-based data mining came into being and has gained huge popularity, to the point of becoming a necessity for the survival of most businesses.
This collected information is used to gain more
knowledge and based on the findings and analysis of
the information make predictions as to what would be
the best choice and the right approach to move toward
on a particular issue. Web data mining is not only
focused to gain business information but is also used
by various organizational departments to make the
right predictions and decisions for things like business
development, work flow, production processes and
more by going through the business models derived
from the data mining.
A strategic analysis department can mine its client archives with data mining software to determine which offers to send to which clients for maximum conversion rates. For example, a company
is thinking about launching cotton shirts as their new
product. Through their client database, they can clearly
determine as to how many clients have placed orders
for cotton shirts over the last year and how much
revenue such orders have brought to the company.
With such analysis in hand, the company can decide which offers to send both to the clients who placed orders for cotton shirts and to those who did not. This keeps the organization heading in the right direction in its marketing, rather than going through a trial-and-error phase and spending money needlessly to learn the hard facts. These analytical facts also shed light on the percentage of customers who may move from your company to a competitor.
Data mining also empowers companies to keep a record of fraudulent payments, which can be researched and studied to help develop more advanced protective methods that prevent such events from happening. Buying trends revealed through web data mining can also help you forecast your inventories. This is a direct analysis that lets the organization stock appropriately for each month, depending on the predictions laid out through this analysis of buying trends.
Data mining technology is going through a huge evolution, and new and better techniques are made available all the time to gather whatever information is required. Web data mining technology is not only opening avenues for gathering data; it is also raising many concerns about data security. There is a great deal of personal information available on the internet, and web data mining has helped keep the need to secure that information at the forefront.
3. Data Mining vs. Web mining.
Data mining refers to extracting informative knowledge from a large amount of data, which can come in different data types, such as transaction data in e-commerce applications or gene expression data. Whatever the type of data, the main purpose of data mining is to discover hidden knowledge, normally in the form of patterns, from an available data repository.
What is the difference between data mining and web mining? One of the significant factors is the structure of the data being mined. Common data mining applications discover patterns in structured data, such as databases (DBMS). Web mining, by contrast, discovers patterns in less structured data such as the Web (WWW). In other words, we can say that Web mining is data mining techniques applied to the WWW.
4-Types of web mining
Basically, web mining is of three types:
1. Web Usage mining
In the web usage mining process, data mining techniques are applied to discover the trends and patterns in the browsing behavior of the visitors of a website. Navigation patterns are extracted, since browsing patterns can be traced and the structure of the website can be designed accordingly. For example, if a particular feature of the website is used frequently by visitors, it should be enhanced and made more prominent so as to increase its usage and appeal to more users. This kind of mining makes use of web accesses and logs. Simply by understanding the movement of visitors and their surfing behavior, you can meet their preferences and needs better and popularize your website among the masses in the internet arena.
2. Web Content Mining
This kind of mining process attempts to discover all the links and hyperlinks in a document in order to generate a structural report on a web page, together with information about its different facets: for instance, whether users are able to find the information they need, whether the structure of the website is too shallow or too deep, whether the elements of the web page are correctly placed, which areas of the website are the least and the most visited, and whether these have anything to do with the page design. Such things are analyzed and evaluated for deeper study.
3. Web Linkage/Structure mining
This involves the use of graph theory to analyze the connection and node structure of a website. According to the type and nature of the web structure data, it is further divided into two kinds:
- Extraction of patterns from the hyperlinks on the net: a hyperlink is a structural form of web address connecting a web page to some other location.
- Mining of the structure of the document: a tree-like structure is used to analyze and describe the XHTML or HTML tags in the web page.
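A minimal sketch of both kinds of structure mining, using only Python's standard html.parser module, is shown below. The sample HTML string is a made-up placeholder; a real system would crawl live pages instead.

from html.parser import HTMLParser

class LinkAndTreeParser(HTMLParser):
    """Collect hyperlinks and a simple outline of the HTML tag tree."""
    def __init__(self):
        super().__init__()
        self.links = []        # hyperlinks found in <a href="..."> attributes
        self.depth = 0         # current nesting depth in the tag tree
        self.tag_outline = []  # (depth, tag) pairs describing the document tree

    def handle_starttag(self, tag, attrs):
        self.tag_outline.append((self.depth, tag))
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth = max(0, self.depth - 1)

sample_html = """<html><body>
<h1>Demo</h1>
<p>See <a href="https://example.com/page1">page 1</a> and
<a href="/page2">page 2</a>.</p>
</body></html>"""

parser = LinkAndTreeParser()
parser.feed(sample_html)
print("hyperlinks:", parser.links)
for depth, tag in parser.tag_outline:
    print("  " * depth + tag)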
4.1-Web content mining
Web content mining, also known as text mining, is
generally the second step in Web data mining. Content
mining is the scanning and mining of text, pictures and
graphs of a Web page to determine the relevance of the
content to the search query. This scanning is completed
after the clustering of web pages through structure
mining and provides the results based upon the level of
relevance to the suggested query. With the massive
amount of information that is available on the World
Wide Web, content mining provides the results lists to
search engines in order of highest relevance to the
keywords in the query.
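To make the idea of relevance ranking concrete, here is a small, self-contained sketch that scores a few toy documents against a query using TF-IDF weighting and cosine similarity. The documents and query are invented for illustration; production search engines use far richer signals than this.

import math
from collections import Counter

documents = {
    "doc1": "web mining extracts useful patterns from web data",
    "doc2": "content mining scans text and images on a web page",
    "doc3": "cotton shirts are on sale this week",
}
query = "web content mining"

def tf_idf_vectors(texts):
    """Build a TF-IDF vector (term -> weight) for every text."""
    tokenized = {name: text.lower().split() for name, text in texts.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for name, tokens in tokenized.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values())) or 1.0
    return dot / (norm(u) * norm(v))

# Treat the query as one more "document" so it shares the same vocabulary.
vectors = tf_idf_vectors({**documents, "_query": query})
query_vec = vectors.pop("_query")
for name in sorted(documents, key=lambda d: cosine(query_vec, vectors[d]), reverse=True):
    print(name, round(cosine(query_vec, vectors[name]), 3))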
Text mining is directed toward specific information
provided by the customer search information in search
engines. This allows for the scanning of the entire Web
to retrieve the cluster content triggering the scanning
of specific Web pages within those clusters. The
results are pages relayed to the search engines through
the highest level of relevance to the lowest. Though the search engines can provide thousands of links to Web pages related to the search content, this type of web mining reduces the amount of irrelevant information.
Web text mining is very effective when used in relation to a content database dealing with specific topics. For example, online universities use a library system to recall articles related to their general areas of study. This specific content database makes it possible to pull only the information within those subjects, providing the most specific results for search queries in search engines. Supplying only the most relevant information gives a higher quality of results, and this increase in productivity is due directly to the content mining of text and visuals.
The main uses of this type of data mining are to gather, categorize, organize and provide the best possible information available on the WWW to the user requesting it. This tool is essential for scanning the many HTML documents, images and text provided on Web pages; the resulting information is supplied to the search engines in order of relevance, giving more productive results for each search.
Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through thousands of results to find the most relevant information for his query. Text mining reduces those thousands of results, which eliminates frustration and improves the navigation of information on the Web.
Business uses of content mining allow for the
information provided on their sites to be structured in a
relevance-order site map. This allows for a customer of
the Web site to access specific information without
having to search the entire site. With this type of mining, data remains available in order of relevance to the query, thus providing productive marketing. Used as a marketing tool, this brings additional traffic to the Web pages of a company's site based on the keyword relevance the pages offer to general searches.
As the second branch of Web data mining, text mining helps improve the productive uses of mining for businesses, Web designers and search engine operations. Organizing, categorizing and gathering the information provided by the WWW becomes easier and produces more productive results through this type of mining.
In short, Web content mining allows search engine results to maximize the flow of customer clicks to a Web site, or to particular Web pages of the site, which are then accessed numerous times for relevant search queries. The clustering and
organization of Web content in a content database
enables effective navigation of the pages by the
customer and search engines. Images, content, formats
and Web structure are examined to produce a higher
quality of information to the user based upon the
requests made. Businesses can maximize the use of
this text mining to improve marketing of their sites as
well as the products they offer.
4.2- web linkage mining
Web Linkage or Web Structure Mining is the
organization of the content via HTML and XML tags.
Web structure mining, one of three categories of web
mining for data, is a tool used to identify the
relationship between Web pages linked by information
or direct link connection. This structure data is
discoverable by the provision of web structure schema
through database techniques for Web pages. This
connection allows a search engine to pull data relating
to a search query directly to the linking Web page from
the Web site the content rests upon. This process takes place through the use of spiders scanning the Web sites, retrieving the home page, and then linking the information through reference links to bring forth the specific page containing the desired information.
The use of structure mining minimizes two main problems of the World Wide Web caused by its vast amount of information. The first of these problems is irrelevant search results: the relevance of search results becomes misconstrued because search engines often only allow low-precision criteria. The second is the inability to index the vast amount of information provided on the Web, which causes low recall with content mining. This minimization comes in part from discovering the model underlying the Web hyperlink structure, which Web structure mining provides.
The main purpose for structure mining is to extract
previously unknown relationships between Web pages.
This structure data mining provides use for a business
to link the information of its own Web site to enable
navigation and cluster information into site maps. This
allows its users the ability to access the desired
information through keyword association and content
mining. A hyperlink hierarchy is also determined, to map the related information within the sites to the relationships with competitor links and connections through search engines and third-party co-links. This enables clustering of connected Web pages to establish the relationships among these pages.
On the WWW, the use of structure mining enables the
determination of similar structure of Web pages by
clustering through the identification of underlying
structure. This information can be used to project the
similarities of web content. The known similarities then provide the ability to maintain or improve the information of a site so that web spiders can access it at a higher rate; the more Web crawlers that visit, the more beneficial it is to the site, because its content is related to more searches.
In the business world, structure mining can be quite
useful in determining the connection between two or
more business Web sites. The determined connection
brings forth a useful tool for mapping competing
companies through third party links such as resellers
and customers. This cluster map allows the content of the business pages to place well in search engine results through the connection of keywords and co-links across the relationships of the Web pages. The information determined in this way provides the proper path, through structure mining, to improve the navigation of these pages via their relationships and the link hierarchy of the Web sites.
With improved navigation of Web pages on business
Web sites, connecting the requested information to a
search engine becomes more effective. This stronger
connection allows generating traffic to a business site
to provide results that are more productive. The more links provided within the relationships of the web pages, the more easily navigation can follow the link hierarchy. This improved navigation attracts the spiders to the correct locations, providing the requested information and proving more beneficial in clicks to a particular site.
Therefore, Web mining and the use of structure mining can provide strategic results for marketing a Web site and driving sales. More traffic directed to the Web pages of a particular site increases the level of return visits to the site and recall by search engines for the information or products provided by the company. This also enables marketing strategies that produce better results through navigation of the pages linking to the homepage of the site itself. To truly utilize a website as a business tool, web structure mining is a must.
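The clustering and hub-like relationships mentioned above can be computed directly from a link graph. The sketch below runs a few iterations of HITS-style hub and authority scoring on a tiny, made-up graph; the page names and links are purely illustrative and this is only one of several link-analysis techniques.

# Hypothetical link graph: page -> pages it links to.
links = {
    "home": ["products", "blog"],
    "blog": ["products", "partner"],
    "partner": ["home"],
    "products": [],
}

def hits(graph, iterations=20):
    """Return (hub, authority) scores computed by simple HITS iteration."""
    hubs = {page: 1.0 for page in graph}
    auths = {page: 1.0 for page in graph}
    for _ in range(iterations):
        # Authority of a page = sum of hub scores of pages linking to it.
        auths = {p: sum(hubs[q] for q in graph if p in graph[q]) for p in graph}
        # Hub score of a page = sum of authority scores of the pages it links to.
        hubs = {p: sum(auths[q] for q in graph[p]) for p in graph}
        # Normalize so the scores do not grow without bound.
        for scores in (hubs, auths):
            total = sum(scores.values()) or 1.0
            for p in scores:
                scores[p] /= total
    return hubs, auths

hubs, auths = hits(links)
print("hubs:       ", {p: round(s, 3) for p, s in hubs.items()})
print("authorities:", {p: round(s, 3) for p, s in auths.items()})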
4.3- web usage mining.
Web usage mining is the type of Web mining activity
that involves the automatic discovery of user access
patterns from one or more Web servers. As more
organizations rely on the Internet and the World Wide
Web to conduct business, the traditional strategies and
techniques for market analysis need to be revisited in
this context. Organizations often generate and collect
large volumes of data in their daily operations. Most of
this information is usually generated automatically by
Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring page for each page reference, and user registration or survey data gathered via tools such as CGI scripts.
Analyzing such data can help these organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things.
Analysis of server access logs and user registration
data can also provide valuable information on how to
better structure a Web site in order to create a more
effective presence for the organization. In
organizations using intranet technologies, such
analysis can shed light on more effective management
of workgroup communication and organizational
infrastructure. Finally, for organizations that sell
advertising on the World Wide Web, analyzing user
access patterns helps in targeting ads to specific groups
of users.
Most of the existing Web analysis tools
[Inc96,eSI95,net96] provide mechanisms for reporting
user activity in the servers and various forms of data
filtering. Using such tools, for example, it is possible
to determine the number of accesses to the server and
the individual files within the organization's Web
space, the times or time intervals of visits, and domain
names and the URLs of users of the Web server.
However, in general, these tools are designed to handle low-to-moderate-traffic servers and usually provide little or no analysis of data relationships among the accessed files and directories within the Web space.
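For illustration, the following minimal Python sketch reads an Apache/NCSA-style combined access log and reports per-page hit counts together with the referrers seen for each page, the kind of raw statistic such tools start from. The log file name and the exact field layout are assumptions about a typical combined log format, not something specified in this report.

import re
from collections import Counter, defaultdict

# Typical "combined" log line:
# host ident user [time] "METHOD path HTTP/x" status size "referer" "agent"
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) \S+ "([^"]*)"'
)

def summarize(path="access.log"):
    hits = Counter()
    referrers = defaultdict(Counter)
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            match = LOG_RE.match(line)
            if not match:
                continue
            host, page, status, referer = match.groups()
            if status.startswith("2"):      # count only successful requests
                hits[page] += 1
                if referer and referer != "-":
                    referrers[page][referer] += 1
    return hits, referrers

if __name__ == "__main__":
    hits, referrers = summarize()
    for page, count in hits.most_common(10):
        print(count, page, dict(referrers[page].most_common(3)))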
5- The future of web mining.
As Web services continue to flourish [K2002], this trend is likely to continue. As the Web and its usage grow, they will continue to generate ever more content, structure, and usage data, and the value of Web mining will keep increasing. Outlined here are some research directions that must be pursued to ensure that Web mining technologies continue to develop and that this value can be realized.
5.1 Web metrics and measurements
From an experimental human behaviorist's viewpoint, the Web is the perfect experimental apparatus. Not only does it provide the ability to measure human behavior at a micro level, it (i) eliminates the bias of subjects knowing that they are participating in an experiment, and (ii) allows the number of participants to be many orders of magnitude larger. However, we have not even begun to appreciate the true impact of this revolutionary experimental apparatus. Amazon's WebLab [AMZNa] is one of the early efforts in this direction. It is regularly used to measure the user impact of various proposed changes - on operational metrics such as site visits and visit/buy ratios, as well as on financial metrics such as revenue and profit - before a deployment decision is made. For example, during Spring 2000 a 48-hour experiment involving over one million user sessions was carried out on the live site before the decision to change Amazon's logo was made. Research needs to be done to develop the right set of Web metrics, and their measurement procedures, so that various Web phenomena can be studied.
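As a toy illustration of measuring the impact of a proposed change, the sketch below compares the visit-to-buy conversion rate of a control group and a treatment group and reports a two-proportion z-score. The session counts are invented; real experimentation platforms such as Amazon's WebLab are of course far more sophisticated.

import math

# Hypothetical experiment data: (sessions, sessions that ended in a purchase).
control = (50_000, 1_900)    # current site
treatment = (50_000, 2_150)  # site with the proposed change

def conversion_rate(sessions, buys):
    return buys / sessions

def two_proportion_z(a, b):
    """z-score for the difference between two conversion rates."""
    (n1, x1), (n2, x2) = a, b
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

print("control conversion  :", conversion_rate(*control))
print("treatment conversion:", conversion_rate(*treatment))
print("z-score             :", round(two_proportion_z(control, treatment), 2))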
5.2 Process mining
Mining of 'market basket' data, collected at the point-of-sale in any store, has been one of the visible successes of data mining. However, this data captures only the end result of the process, and only for decisions that ended in a purchase. Click-stream data provides the opportunity for a detailed look at the decision-making process itself, and knowledge extracted from it can be used for optimizing the process, influencing the process, and so on [ONL2002]. Underhill [U2000] has conclusively demonstrated the value of process information in understanding users' behavior in traditional shops. Research needs to be carried out on (i) extracting process models from usage data, (ii) understanding how different parts of the process model impact various Web metrics of interest, and (iii) how the process models change in response to various changes that are made, i.e. changing stimuli to the user.
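A first step toward such a process model is often a page-to-page transition matrix estimated from click-streams. The sketch below counts transitions in a few made-up sessions and prints the estimated probability of each next page; the session data is purely illustrative.

from collections import Counter, defaultdict

# Hypothetical sessions: ordered lists of pages visited by each user.
sessions = [
    ["home", "search", "product", "cart", "checkout"],
    ["home", "product", "home", "search", "product"],
    ["search", "product", "cart"],
]

def transition_model(sessions):
    """Estimate P(next page | current page) from click-stream sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for page, nexts in counts.items()
    }

for page, nexts in transition_model(sessions).items():
    print(page, "->", {n: round(p, 2) for n, p in nexts.items()})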
5.3 Temporal evolution of the Web
Society's interaction with the Web is changing the Web as well as the way society interacts. While storing the history of all of this interaction in one place is clearly too staggering a task, at least the changes to the Web are being recorded by the pioneering Internet Archive project [IA]. Research needs to be carried out in extracting temporal models of how Web content, Web structures, Web communities, authorities, hubs, etc. are evolving. Large organizations generally archive (at least portions of) usage data from their Web sites. With these sources of data available, there is large scope for research to develop techniques for analyzing how the Web evolves over time.
This temporal behavior applies to all three kinds of Web data: Web content, Web structure and Web usage. The methodology suggested for hyperlink analysis in [DSKT2002] can be extended here, and the research can be classified based on knowledge models, metrics, analysis scope and algorithms. For example, the analysis scope of the temporal behavior could be restricted to the behavior of a single document, multiple documents or the whole Web graph. The other factor that has to be studied is the effect of Web content, Web structure and Web usage on each other over time.
5.4 Web services optimization
As services over the Web continue to grow [K2002],
there will be a need to make them robust, scalable,
efficient, etc. Web mining can be applied to better
understand the behavior of these services, and the
knowledge extracted can be useful for various kinds of
optimizations. The successful application of Web mining for predictive pre-fetching of pages by a browser has been demonstrated in [PSS2001].
[PSS2001]. Research is needed in developing Web
mining techniques to improve various other aspects of
Web services.
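Building on the transition model sketched under process mining above, predictive pre-fetching can be as simple as requesting the most probable next page in the background. The sketch below illustrates that idea on invented data; it is only a toy, not the technique of [PSS2001].

from collections import Counter, defaultdict

def most_likely_next(sessions, current_page):
    """Return the page most often visited right after current_page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    nexts = counts.get(current_page)
    return nexts.most_common(1)[0][0] if nexts else None

sessions = [["home", "search", "product"], ["home", "product", "cart"]]
print("prefetch candidate after 'home':", most_likely_next(sessions, "home"))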
5.5 Fraud and threat analysis
The anonymity provided by the Web has led to a
significant increase in attempted fraud,
from unauthorized use of individual credit cards to
hacking into credit card databases
for blackmail purposes [S2000].
5.6 Web mining and privacy
While there are many benefits to be gained from Web mining, a clear drawback is the potential for severe violations of privacy. Public attitude towards privacy seems to be almost schizophrenic - i.e. people say one thing and do quite the opposite. For example, famous cases like [DG2000] and [DCLKa] seem to indicate that people value their privacy, while experience at major e-commerce portals shows that over 97% of users will share personal information if some benefit can be provided based on it.
Spiekermann et al. [SGB2001] have demonstrated that people were willing to provide fairly personal information about themselves, even when it was completely irrelevant to the task at hand, if given the right stimulus to do so. Furthermore, explicitly drawing attention to information privacy policies had practically no effect.
One explanation of this seemingly contradictory attitude towards privacy may be that we have a bi-modal view of privacy, namely: "I'd be willing to share information about myself as long as I get some (tangible or intangible) benefit from it, and as long as there is an implicit guarantee that the information will not be abused." The research issue this attitude generates is the need to develop approaches, methodologies and tools that can be used to verify and validate that a Web service is indeed using an end-user's information in a manner consistent with its stated policies.
6- The techniques and applications of web mining.
An outcome of the excitement about the Web in the past few years has been that Web applications have been developed at a much faster rate in industry than research in Web-related technologies. Many of these applications were based on the use of Web mining concepts, even though the organizations that developed them and invented the corresponding technologies did not think of them in those terms.
6.1 Personalized Customer Experience in B2C E-commerce - Amazon.com
Early on in the life of Amazon.com, its visionary CEO
Jeff Bezos observed,
’In a traditional (brick-and-mortar) store, the main
effort is in getting a customer to the store. Once a
customer is in the store they are likely to make a
purchase - since the cost of going to another store is
high – and thus the marketing budget (focused on
getting the customer to the store)
is in general much higher than the in-store customer
experience budget
(which keeps the customer in the store). In the case of an on-line store, getting in or out requires exactly one click, and thus the main focus must be on customer experience in the store.'
This fundamental observation has been the driving
force behind Amazon’s comprehensive approach to
personalized customer experience, based on the mantra
’a personalized store for every customer’ [M2001]. A
host of Web mining techniques, e.g. associations
between pages visited, click-path analysis, etc., are
used to improve the customer’s experience during a
’store visit’. Knowledge gained from Web mining is
the key intelligence behind Amazon’s features such as
’instant recommendations’, ’purchase circles’, ’wishlists’, etc. [AMZNa].
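As a rough illustration of "associations between pages visited", the sketch below counts how often pairs of product pages co-occur in the same session and uses the counts to suggest "customers who viewed X also viewed Y". The sessions are invented, and this is only a toy stand-in for Amazon's actual recommendation technology.

from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical browsing sessions: the set of product pages seen in each visit.
sessions = [
    {"book_a", "book_b", "lamp"},
    {"book_a", "book_b"},
    {"book_b", "lamp", "mug"},
    {"book_a", "mug"},
]

def co_occurrence(sessions):
    """Count, for every page, which other pages appear with it in a session."""
    together = defaultdict(Counter)
    for pages in sessions:
        for a, b in combinations(sorted(pages), 2):
            together[a][b] += 1
            together[b][a] += 1
    return together

def recommend(page, sessions, k=2):
    return [p for p, _ in co_occurrence(sessions)[page].most_common(k)]

print("viewed with 'book_a':", recommend("book_a", sessions))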
6.2 Web Search - Google
Google [GOOGa] is one of the most popular and widely used search engines. It provides users access to information from the almost 2.5 billion web pages it has indexed on its servers. The simplicity and speed of its search facility make it a highly successful search engine. Earlier search engines concentrated on Web content to return pages relevant to a query; Google was the first to exploit the importance of the link structure in mining information from the web. PageRank, which measures the importance of a page, is the underlying technology in all Google search products. The PageRank technology, which makes use of the structural information of the Web graph, is the key to returning quality results relevant to a query.
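For concreteness, here is a minimal power-iteration PageRank sketch over a tiny, made-up link graph. It illustrates the general idea of ranking pages by link structure; it is not Google's actual implementation, and the damping factor and graph below are illustrative choices.

# Hypothetical link graph: page -> list of pages it links to.
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration. Dangling pages (no outlinks)
    get no special handling because this toy graph has none."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            share = damping * ranks[page] / len(outlinks)
            for target in outlinks:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))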
Google has successfully used the data available from the Web content (the actual text and the hyper-text) and the Web graph to enhance its search capabilities and provide the best results to its users. Google has expanded its search technology to provide site-specific search, enabling users to search for information within a specific website.
The 'Google Toolbar' is another service provided by Google that seeks to make search easier and more informative by providing additional features such as highlighting the query words on the returned web pages. The full version of the toolbar, if installed, also sends the user's click-stream information to Google; the usage statistics thus obtained can be used by Google to enhance the quality of its results. Google also provides advanced search capabilities to search images and to look for pages that have been updated within a specific date range. Built on top of Netscape's Open Directory project, Google's web directory provides a fast and easy way to search within a certain topic or related topics. The Advertising Programs introduced by Google target users by providing advertisements that are relevant to the search query. This avoids bothering users with irrelevant ads and has increased clicks for the advertising companies by four or five times. According to BtoB, a leading national marketing publication, Google was named a top 10 advertising property in the Media Power 50, which recognizes the most powerful and targeted business-to-business advertising outlets [GOOGb].
One of the latest services offered by Google is 'Google News' [GOOGc]. It integrates news from the online versions of all newspapers and organizes it categorically to make it easier for users to read "the most relevant news". It seeks to provide the latest information by constantly retrieving pages that are updated on a regular basis. The key feature of this news page, like any other Google service, is that it integrates information from various Web news sources through purely algorithmic means, and thus does not introduce any human bias or effort. However, the publishing industry is not yet convinced about a fully automated approach to news distillation.
6.3 Web-wide tracking - DoubleClick
’Web-wide tracking’, i.e. tracking an individual across
all sites (s) he visits is one of the most intriguing and
controversial technologies. It can provide an
understanding of an individual’s lifestyle and habits to
a level that is unprecedented - clearly of tremendous
interest to marketers. A successful example of this is
DoubleClick Inc.’s DART ad management technology
[DCLKa]. DoubleClick serves advertisements, which can be targeted on demographic or behavioral attributes, to the end-user on behalf of its clients, i.e. the Web sites using DoubleClick's service. Sites that use DoubleClick's service are part of 'The DoubleClick Network', and the browsing behavior of a user can be tracked across all sites in the network using a cookie. This allows DoubleClick's ad targeting to be based on very sophisticated criteria. Alexa Research [?] has
recruited a panel of more than 500,000 users, who’ve
voluntarily agreed to have their every click tracked, in
return for some freebies. This is achieved through
having a browser bar that can be downloaded by the
panelist from Alexa’s website, which gets attached to
the browser and sends Alexa a complete click-stream
of the panelist’s Web usage. Alexa was purchased by
Amazon for its tracking technology.
Clearly Web-wide tracking is a very powerful idea.
However, the invasion of privacy it causes has not
gone unnoticed, and both Alexa/Amazon and
DoubleClick have faced very visible lawsuits
[DG2000, DCLKb]. The value of this technology in applications such as cyber-threat analysis and homeland defense is quite clear, and it may be only a matter of time before these organizations are asked to provide this information.
6.4 Understanding Web communities - AOL
One of the biggest successes of America Online (AOL)
has been its sizeable and loyal customer base [AOLa].
A large portion of this customer base participates in various 'AOL communities', which are collections of
users with similar interests. In addition to providing a
forum for each such community to interact amongst
themselves, AOL provides useful information, etc. as
well. Over time, these communities have grown to be
well-visited ’waterholes’ for AOL users with shared
interests. Applying Web mining to the data collected
from community interactions provides AOL with a
very good understanding of its communities, which it
has used for targeted marketing through ads and e-mail
solicitations. Recently, it has started the concept of
’community sponsorship’, whereby an organization
like Nike may sponsor a community called 'Young Athletic Twenty-Somethings'. In return, consumer
survey and new product development experts of the
sponsoring organization get to participate in the
community - usually without the knowledge of the
other participants. The idea is to treat the community
as a highly specialized focus group, understand its
needs and opinions on new and existing products; and
also test strategies for influencing opinions.
6.5 Understanding auction behavior - eBay
As individuals in a society where we have many more
things than we need, the allure of exchanging our
’useless stuff’ for some cash - no matter how small - is
quite powerful.
This is evident from the success of flea markets, garage sales and estate sales. The genius of eBay's founders was to create an infrastructure that gave this urge a global reach, with the convenience of doing it from one's home PC [EBAYa]. In addition, it popularized auctions as a product selling/buying mechanism, which provides the thrill of gambling without the trouble of having to go to Las Vegas. All of this has made eBay one of the most successful businesses of the Internet era. Unfortunately, the
anonymity of the Web has also created a significant
problem for eBay auctions, as it is impossible to
distinguish real bids from fake ones. eBay is now
using Web mining techniques to analyze bidding
behavior to determine if a bid is fraudulent [C2002].
Recent efforts are towards understanding participants’
bidding behaviors/patterns to create a more efficient
auction market.
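eBay's actual fraud models are not described in this report. As a purely hypothetical illustration of mining bidding behavior, the sketch below flags bidder/seller pairs where one bidder places a large share of the bids in a seller's auctions but almost never wins, a classic shill-bidding warning sign; the bid records and thresholds are invented.

from collections import defaultdict

# Hypothetical bid records: (auction_id, seller, bidder, won_auction).
bids = [
    ("a1", "sellerX", "bob", False), ("a1", "sellerX", "eve", False),
    ("a1", "sellerX", "bob", True),
    ("a2", "sellerX", "eve", False), ("a2", "sellerX", "carol", True),
    ("a3", "sellerX", "eve", False), ("a3", "sellerX", "bob", True),
]

def shill_suspects(bids, min_share=0.3, max_win_rate=0.1):
    """Flag bidders who place many bids with one seller but almost never win."""
    per_pair = defaultdict(lambda: {"bids": 0, "wins": 0})
    per_seller = defaultdict(int)
    for _, seller, bidder, won in bids:
        per_pair[(seller, bidder)]["bids"] += 1
        per_pair[(seller, bidder)]["wins"] += int(won)
        per_seller[seller] += 1
    suspects = []
    for (seller, bidder), stats in per_pair.items():
        share = stats["bids"] / per_seller[seller]
        win_rate = stats["wins"] / stats["bids"]
        if share >= min_share and win_rate <= max_win_rate:
            suspects.append((seller, bidder, round(share, 2)))
    return suspects

print(shill_suspects(bids))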
6.6 Personalized Portal for the Web - MyYahoo
Yahoo [YHOOa] was the first to introduce the concept of a 'personalized portal', i.e. a Web site designed to have its look-and-feel as well as its content personalized to the needs of an individual end-user. This has been an extremely popular concept and has led to the creation of other personalized portals, e.g. Yodlee [YODLa] for private information. Mining MyYahoo usage logs provides Yahoo with valuable insight into an individual's Web usage habits, enabling Yahoo to provide compelling personalized content, which in turn has contributed to the tremendous popularity of the Yahoo Web site.
7-Conclusion
As the Web and its usage continues to grow, so grows
the opportunity to analyze Web data and extract all
manner of useful knowledge from it. The past five
years have seen the emergence of Web mining as a
rapidly growing area, due to the efforts of the research
community as well as various organizations that are
practicing it. In this paper we have briefly described
the key computer science contributions made by the
field, the prominent successful applications, and
outlined some promising areas of future research. Our
hope is that this overview provides a starting point for
fruitful discussion.
References:
Books:
Web Mining and Social Networking, Yanchun Zhang, Victoria University, Australia.
Web Information Systems and Mining, Wenyin Liu, Xiangfeng Luo, Fu Lee Wang, Jingsheng Lei.
Web Mining Applications in E-Commerce and E-Services, Ting, I-Hsien; Wu, Hui-Ju (Eds.).
Web:
http://www.galeas.de/webmining.html
http://www.ieee.org.ar/downloads/Srivastava-tutpaper.pdf
http://people.ischool.berkeley.edu/~hearst/talks/datamining-panel/sld008.htm
http://en.wikipedia.org/wiki/Web_mining
http://www.slideshare.net/dataminingtools/webminingoverview-2649166
http://www.expertstown.com/web-mining
http://searchcrm.techtarget.com/definition/Web-mining
http://www.cs.uic.edu/~liub/WebContentMining.html
http://www.cyberartsweb.org/cpace/ht/lanman/wum1.htm
http://www.facebook.com/pages/Web-usagemining/110446175642994