Data Mining on Symbolic Knowledge Extracted from the Web
Changho Choi, University at Buffalo
3/6/2001
Source: http://www.cs.cmu.edu/~dunja/WshKDD2000.html (Carnegie Mellon University, J. Stefan Institute)

Abstract
- This paper gives a case study of combining information
  - Unstructured information: an errorful source of large amounts of potentially useful information
  - Structured information: less up-to-date, but reliable as facts
- Using information from two kinds of sources improves the reliability of data-mined rules

Introduction (#1/2)
- Challenge: not only gather and represent knowledge existing on the Web, but also use that knowledge for planning, acting, and creating new knowledge

Introduction (#2/2)
- First stage: integrating three types of information gathering
  - Extracting propositional knowledge from highly structured, automatically generated web pages
  - Extracting propositional knowledge from free-form, unstructured data sources
  - Extracting relational knowledge existing on the Web through a combination of web pages and their hyperlink structure
- Aim: identify patterns of knowledge that were not explicitly represented as facts on the Web

Data sources and features
- Extracted features: come directly from crawling the company Web sites
  - e.g. performs-activity, links-to, officers, sector, location, ...
- Wrapper features: come from secondary sources and rely on a mostly regular format
  - e.g. hoovers-sector, hoovers-industry, hoovers-type, address, ...
- Abstracted features: describe relationships between companies and discretize our continuous features
  - e.g. same-state, same-city, share-officers, mentions-same, ...

Process of acquiring potentially interesting information about companies from the Web
[Figure: process diagram. Labels: The Web (4312 web sites, 50 pages on each site, www.3com.com); extracting from corporate Web sites; wrapping from corporate information (company information from www.hoovers.com); abstracting features; KB; data mining; new knowledge.]

Extracted Features
Feature | Values | Description | Extracting method
performs-activity | 8 | The types of activity this company engages in. | Looking for keywords associated with each type of activity.
links-to | | Companies whose web sites are pointed to by this company. | Simple text search on all the web pages.
mentions | | Companies whose name occurs on this company's Web site. | Same as above.
officers | | Officers of this company. | Searching the pages containing "officer" or "director".
sector | 200 | Naïve Bayes predicted economic sector of company. | Text classification by a Naïve Bayes model.
coarse-sector | 12 | Naïve Bayes predicted coarse-grained economic sector. | Same as above.
locations | | Derived from a Naïve Bayes classifier on small regions of text surrounding country names, and autoslog-based rules. | Advanced information extraction technique.
url-country | 39 | Inferred from the URL domain name where applicable. | Country domain of the URL.

Wrapped Features
Feature | Values | Description
hoovers-sector | 28 | Sector listed on the company's Hoovers page.
hoovers-industry | 298 | Industry listed on the company's Hoovers page.
hoovers-type | 18 | Public, private, school, etc.
address | | Address as listed on Hoovers.
city, state | | Extracted from address.
competitor | | Companies that compete with this company.
subsidiary | | Companies listed as subsidiaries of this company.
products | 4648 | Product categories extracted from the products page.
officers | | Officers listed on the Hoovers page.
auditors | 266 | Company auditors.
revenue | | Revenue data for up to the last 10 years.
net-income | | Net income data for up to the last 10 years.
net-profit | | Net profit data for up to the last 10 years.
employees | | Number of employees each year for up to the last 10 years.

Abstracted Features
Feature | Values | Description
same-state | | Companies in the same state as this company.
same-city | | Companies in the same city as this company.
share-officers | | Companies that have officers in common with this company.
mentions-same | | Companies that mention some company also mentioned by this company.
links-to-same | | Companies that link to some company also linked to by this company.
reciprocally-mentions | | Companies mentioned by this company, who link to this company.
reciprocally-links | | Companies linked to by this company, who link to this company.
reciprocally-competes | | Companies listed as a competitor of this company, who list this company as a competitor.
revenue-binned | 10 | Revenues for each of up to 10 years, binned into 10 equal-sized bins.
net-profit-binned | 10 | Net profits similarly binned.
net-income-binned | 10 | Net income similarly binned.
employees | 10 | Employees similarly binned.

Data mining algorithms
- Discovering associations
  - by applying the Apriori algorithm
- Learning propositional rules
  - by using the C5.0 algorithm, which generates a decision tree for the given dataset
- Learning relational rules
  - by using Quinlan's FOIL system, which can use patterns in the relationships between companies

Experimental results
- Apriori experiments: discover associations in the data using association rules
- Decision trees: generate propositional rules using decision trees
- FOIL experiments: generate first-order rules using the first-order rule learning system

Result: Apriori Experiments (#1/2)
- Thresholds: minimal support 10%, minimal confidence 80%
- Some examples: the highest-confidence rules, which can be understood intuitively
  - performs-activity = sell :- locations = united-states, links-to = adobe-systems-incorporated (10.8%, 93.0%)
  - performs-activity = sell :- performs-activity = technical-assistance, links-to = adobe-systems-incorporated (11.8%, 91.1%)

Result: Apriori Experiments (#2/2)
- Some examples: normal rules
  - performs-activity = sell :- locations = japan (14.5%, 90.8%)
  - performs-activity = research :- locations = japan (14.5%, 90.8%)
- Some examples: rules with lower support or confidence (meaningful?)
  - performs-activity = research :- locations = united-states (26.9%, 72.5%)
  - hoovers-sector = food-beverage-&-tobacco :- competitor = conagra-inc (1.0%, 89.8%)
  - hoovers-sector = retail :- competitor = kmart-corporation (1.0%, 75.0%)
  - hoovers-sector = energy :- competitor = bp-amoco-p.l.c. (1.1%, 73.0%)

Result: Decision Trees
- Example: predict the economic sector
- For different cities, the tree uses different features
- Callout on the slide: based on Naïve Bayes classification (the coarse-sector feature is Naïve Bayes predicted)

  city atlanta
    revenue1996 <= 0.1 => Diversified Services (28, 0.179)
    revenue1996 > 0.1 => Computer Software & Services (20, 0.2)
  city houston
    coarse-sector [basic-materials, capital-goods, transportation] => Manufacturing (10, 0.3)
    coarse-sector [financial, healthcare, technology] => Computer Software & Services (21, 0.238)
    coarse-sector [conglomerates, consumer-cyclical, consumer-non-cyclical, energy, services, utilities] => Energy (49, 0.49)
  city dallas
    net_income1999 <= 19 => Health Products & Services (25, 0.2)
    net_income1999 > 19 => Leisure (25, 0.2)
  ...

Result: FOIL Experiments (First Order Inductive Learner)
- Example

  computer-software-&-services(A) :- hq-city(A,B), B<>fremont, competitor(A,C), hq-city(C, Islandia), not(employees_binned(A,?,?)).

- It means that companies headquartered somewhere other than Fremont and competing with "Computer Associates International" are in the computer software & services sector. ("Computer Associates International" is the only company in our knowledge base headquartered in Islandia.) The final literal, not(employees_binned(A,?,?)), additionally requires that no binned employee count is recorded for the company.
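
The FOIL rule above can be read as a test applied to each company in the knowledge base. The following is a minimal sketch of that reading, not the FOIL system itself: the Company class, the rule_fires function, and every company except "computer-associates-international" are hypothetical names invented for illustration, and the knowledge base is reduced to the three relations the rule mentions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Company:
    # Hypothetical record type; only the relations used by the rule are modeled.
    name: str
    hq_city: Optional[str] = None
    competitors: set = field(default_factory=set)
    has_employees_binned: bool = False


def rule_fires(company, kb):
    """Check the body of
    computer-software-&-services(A) :- hq-city(A,B), B<>fremont,
        competitor(A,C), hq-city(C,Islandia), not(employees_binned(A,?,?)).
    """
    # hq-city(A,B), B<>fremont: A needs a known headquarters other than Fremont.
    if company.hq_city is None or company.hq_city == "fremont":
        return False
    # not(employees_binned(A,?,?)): no binned employee fact may exist for A.
    if company.has_employees_binned:
        return False
    # competitor(A,C), hq-city(C,Islandia): some competitor is based in Islandia.
    return any(
        c in kb and kb[c].hq_city == "islandia" for c in company.competitors
    )


ca = "computer-associates-international"
kb = {
    ca: Company(ca, "islandia"),
    # The companies below are invented toy data.
    "acme-soft": Company("acme-soft", "san-jose", {ca}),
    "fremont-soft": Company("fremont-soft", "fremont", {ca}),
    "big-soft": Company("big-soft", "boston", {ca}, has_employees_binned=True),
    "lone-co": Company("lone-co", "boston"),
}

matches = sorted(name for name, c in kb.items() if rule_fires(c, kb))
print(matches)  # ['acme-soft']
```

Only the toy company that competes with the Islandia-based firm, is not in Fremont, and has no binned employee fact satisfies the rule body; each of the others fails exactly one literal.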
Discussion
- Difficulties
  - errorful nature of our facts
  - data cleaning
  - feature selection
- Pleasing result
  - the interaction between the symbolic features and the statistically derived (Naïve Bayes) features

Further Work
- This paper suggests a number of research directions, impacting each of information extraction, machine learning, and data mining from text
- Further work
  - Extracting information from wrapped web sites as a source of training data
  - Automatic data cleaning of extracted features
  - Extending the information extraction

Reference (#1/2)
- FOIL: Three companions for first order data mining
  http://www.cs.kuleuven.ac.be/~ml/Doc/Tutorial_Summer/tutorial_summer.html

Reference (#2/2)
Feature | Sample URL
hoovers-sector | http://www.hoovers.com/sector/
hoovers-industry | http://www.hoovers.com/industry/list/
hoovers-type | http://www.hoovers.com/company/dir/0,2116,15694,00.html
address | http://www.hoovers.com/co/capsule/5/0,2163,12475,00.html
city, state | same as above
competitor | same as above
subsidiary | http://www.hoovers.com/premium/profile/5/0,2147,12475,00.html
products | same as above
officers | same as above
auditors | same as above
revenue | http://www.hoovers.com/hoov/join/sample_historical.html
net-income | same as above
net-profit | same as above
employees | same as above
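
Supplementary sketch for the Apriori results: each association rule in this deck is followed by a (support, confidence) pair. The code below shows one common way to compute those two numbers for a rule of the form "head :- body". It assumes that support is the fraction of all companies matching both body and head, and that confidence is that count divided by the number of companies matching the body; the slides do not state which definition was used. The function names and the five records are invented for illustration.

```python
def holds(record, items):
    """True if every (feature, value) pair in items is present in the record."""
    return all(value in record.get(feature, set()) for feature, value in items)


def support_confidence(records, body, head):
    """Return (support, confidence) for the rule head :- body.

    records: list of dicts mapping a feature name to a set of values.
    body, head: lists of (feature, value) pairs.
    """
    n_body = sum(1 for r in records if holds(r, body))
    n_both = sum(1 for r in records if holds(r, body) and holds(r, head))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


def passes_thresholds(support, confidence, min_support=0.10, min_confidence=0.80):
    # Default thresholds mirror the 10% / 80% values quoted on the Apriori slide.
    return support >= min_support and confidence >= min_confidence


# Invented toy records, not data from the study.
records = [
    {"performs-activity": {"sell", "research"}, "locations": {"japan"}},
    {"performs-activity": {"sell"}, "locations": {"japan", "united-states"}},
    {"performs-activity": {"research"}, "locations": {"japan"}},
    {"performs-activity": {"sell"}, "locations": {"united-states"}},
    {"performs-activity": {"research"}, "locations": {"germany"}},
]

# Rule: performs-activity = sell :- locations = japan
support, confidence = support_confidence(
    records,
    body=[("locations", "japan")],
    head=[("performs-activity", "sell")],
)
print(f"support={support:.1%}, confidence={confidence:.1%}")
```

On this toy data the rule matches two of five records and two of the three records located in Japan, so it clears the support threshold but not the 80% confidence threshold.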