Construction of Web-Based, Service-Oriented Information Networks: A Data Mining Perspective (Abstract)

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, U.S.A.
[email protected]
https://www.cs.uiuc.edu/homes/hanj

Abstract. Mining directly on the existing networks formed by explicit webpage links on the World-Wide Web may not be very fruitful, owing to the diversity and semantic heterogeneity of such web-links. However, constructing service-oriented, semi-structured information networks from the Web and mining on such networks may lead to many exciting discoveries of useful information on the Web. This talk will discuss this direction and its associated research opportunities.

The World-Wide Web can be viewed as a gigantic information network, where webpages are the nodes and the links connecting those pages form an intertwined, gigantic network. However, due to the unstructured nature of such a network and the semantic heterogeneity of web-links, it is difficult to mine interesting knowledge from it beyond finding authoritative pages and hubs.

Alternatively, one can view the Web as a gigantic repository of multiple information sources, such as universities, governments, companies, news, services, sales of commodities, and so on. An interesting problem is whether this view provides any new functions for web-based information services and, if it does, whether one can construct such semi-structured information networks automatically or semi-automatically from the Web, and whether one can use these new networks to derive interesting new information and expand web services.
In this talk, we take this alternative view and examine the following issues: (1) the potential benefits of constructing service-oriented, semi-structured information networks from the World-Wide Web and performing data mining on them, (2) whether it is possible to construct such service-oriented, semi-structured information networks from the World-Wide Web automatically or semi-automatically, and (3) the research problems in constructing and mining Web-based, service-oriented, semi-structured information networks. This view is motivated by our recent work on (1) mining semi-structured heterogeneous information networks, and (2) discovery of entity Web pages and their corresponding semantic structures from parallel path structures.

H. Gao et al. (Eds.): WAIM 2012, LNCS 7418, pp. 17–19, 2012. © Springer-Verlag Berlin Heidelberg 2012

First, real-world physical and abstract data objects are interconnected, forming gigantic, interconnected networks. By structuring these data objects into multiple types, such networks become semi-structured heterogeneous information networks. Most real-world applications that handle big data, including interconnected social media and social networks, scientific, engineering, or medical information systems, online e-commerce systems, and most database systems, can be structured into heterogeneous information networks. For example, in a medical care network, objects of multiple types, such as patients, doctors, diseases, and medications, and links such as visits, diagnoses, and treatments are intertwined, providing rich information and forming a heterogeneous information network. Effective analysis of large-scale heterogeneous information networks poses an interesting but critical challenge.
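To make the notion of typed nodes and links concrete, the medical care example above can be sketched as a small typed graph. This is only an illustrative sketch: the class name HeteroNetwork, its methods, and all object and relation names (alice, dr_lee, visits, and so on) are hypothetical and do not come from any system described in this talk.

```python
from collections import defaultdict


class HeteroNetwork:
    """A tiny typed graph: every node has a type, every link a relation name."""

    def __init__(self):
        self.node_type = {}            # node id -> type name
        self.links = defaultdict(set)  # (source, relation) -> set of targets

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_link(self, source, relation, target):
        # Store the inverse relation too, so a link can be followed both ways.
        self.links[(source, relation)].add(target)
        self.links[(target, relation + "^-1")].add(source)

    def neighbors(self, node, relation):
        return self.links.get((node, relation), set())

    def schema(self):
        """Return the (source type, relation, target type) triples in use."""
        return {
            (self.node_type[s], rel, self.node_type[t])
            for (s, rel), targets in self.links.items()
            if not rel.endswith("^-1")
            for t in targets
        }


# Hypothetical toy data mirroring the medical care example.
net = HeteroNetwork()
for name, kind in [("alice", "patient"), ("bob", "patient"),
                   ("dr_lee", "doctor"), ("flu", "disease"),
                   ("oseltamivir", "medication")]:
    net.add_node(name, kind)

net.add_link("alice", "visits", "dr_lee")
net.add_link("bob", "visits", "dr_lee")
net.add_link("alice", "diagnosed_with", "flu")
net.add_link("flu", "treated_by", "oseltamivir")

# Which patients visited dr_lee? Follow the inverse of "visits".
patients_of_lee = net.neighbors("dr_lee", "visits^-1")
```

The schema() method recovers the type-level structure (for instance, patient visits doctor) from the instance-level links, which is the kind of semi-structure that distinguishes such a network from an untyped web graph.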
Our recent studies show that the semi-structured heterogeneous information network model leverages the rich semantics of typed nodes and links in a network and can uncover surprisingly rich knowledge from interconnected data. This heterogeneous network modeling will lead to the discovery of a set of new principles and methodologies for mining interconnected data. The examples to be used in this discussion include (1) meta path-based similarity search, (2) rank-based clustering, (3) rank-based classification, (4) meta path-based link/relationship prediction, and (5) relation strength-aware mining, as well as a few other recent developments.

Second, it is not easy to automatically or semi-automatically construct service-oriented, semi-structured, heterogeneous information networks from the WWW. However, given the enormous size and diversity of the WWW, it is impossible to construct such information networks manually. Recently, there has been progress on finding entity-pages and mining web structural information using the structural and relational information on the Web. Specifically, given a Web site and an entity-page (e.g., a department Web site and a faculty member's homepage), it is possible to find all or almost all of the entity-pages of the same type (e.g., all faculty members in the department) by growing parallel paths through the web graph and DOM trees. By further developing such methodologies, it may become possible to construct service-oriented, semi-structured, heterogeneous information networks from the WWW for many critical services. By integrating methodologies for the construction and mining of such web-based information networks, the quality of both construction and mining can be progressively and mutually enhanced.
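As an illustration of the first mining example listed above, meta path-based similarity search, the following sketch computes a PathSim-style score along a symmetric meta path (here author-venue-author). It assumes the usual form of the measure, namely twice the number of path instances between x and y divided by the sum of the path instances from x to itself and from y to itself. The function names and the toy publication counts are hypothetical and are not taken from any of the cited papers.

```python
def commuting_counts(counts):
    """Count path instances between every pair of objects along a symmetric
    meta path A-V-A, where counts[a][v] is the number of links between
    object a and intermediate object v."""
    names = list(counts)
    paths = {}
    for x in names:
        for y in names:
            shared = set(counts[x]) & set(counts[y])
            paths[(x, y)] = sum(counts[x][v] * counts[y][v] for v in shared)
    return paths


def pathsim(paths, x, y):
    """PathSim-style score: 2 * |x~>y| / (|x~>x| + |y~>y|)."""
    denom = paths[(x, x)] + paths[(y, y)]
    return 2 * paths[(x, y)] / denom if denom else 0.0


def top_k(paths, query, names, k):
    """Return the k objects most similar to `query`, excluding itself."""
    others = [n for n in names if n != query]
    return sorted(others, key=lambda n: pathsim(paths, query, n), reverse=True)[:k]


# Hypothetical publication counts per author and venue.
counts = {
    "ann": {"KDD": 2, "VLDB": 1},
    "bob": {"KDD": 50, "VLDB": 20},
    "cat": {"KDD": 2, "VLDB": 1},
}
paths = commuting_counts(counts)
ranking = top_k(paths, "ann", list(counts), k=2)
```

In this toy data, cat has exactly the same publication profile as ann and so scores 1.0, while bob, who publishes in the same venues but at a very different volume, scores much lower. The normalization by self-path counts is what favors peers of comparable visibility over merely highly connected objects.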
Finally, we point out some open research problems and promising research directions, and hope that the construction and mining of Web-based, service-oriented, semi-structured heterogeneous information networks will become an interesting frontier in research on Web-age information management systems.

References

1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proc. 7th Int. World Wide Web Conf. (WWW 1998), Brisbane, Australia, pp. 107–117 (April 1998)
2. Ji, M., Han, J., Danilevsky, M.: Ranking-based classification of heterogeneous information networks. In: Proc. 2011 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2011), San Diego, CA (August 2011)
3. Ji, M., Sun, Y., Danilevsky, M., Han, J., Gao, J.: Graph regularized transductive classification on heterogeneous information networks. In: Proc. 2010 European Conf. Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2010), Barcelona, Spain (September 2010)
4. Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM 46, 604–632 (1999)
5. Sun, Y., Aggarwal, C.C., Han, J.: Relation strength-aware clustering of heterogeneous information networks with incomplete attributes. PVLDB 5, 394–405 (2012)
6. Sun, Y., Barber, R., Gupta, M., Aggarwal, C., Han, J.: Co-author relationship prediction in heterogeneous bibliographic networks. In: Proc. 2011 Int. Conf. Advances in Social Network Analysis and Mining (ASONAM 2011), Kaohsiung, Taiwan (July 2011)
7. Sun, Y., Han, J.: Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool Publishers (2012)
8. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: PathSim: Meta path-based top-k similarity search in heterogeneous information networks. In: Proc. 2011 Int. Conf. Very Large Data Bases (VLDB 2011), Seattle, WA (August 2011)
9. Sun, Y., Han, J., Zhao, P., Yin, Z., Cheng, H., Wu, T.: RankClus: Integrating clustering with ranking for heterogeneous information network analysis. In: Proc. 2009 Int. Conf. Extending Data Base Technology (EDBT 2009), Saint-Petersburg, Russia (March 2009)
10. Sun, Y., Yu, Y., Han, J.: Ranking-based clustering of heterogeneous information networks with star network schema. In: Proc. 2009 ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD 2009), Paris, France (June 2009)
11. Wang, C., Han, J., Jia, Y., Tang, J., Zhang, D., Yu, Y., Guo, J.: Mining advisor-advisee relationships from research publication networks. In: Proc. 2010 ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD 2010), Washington D.C. (July 2010)
12. Weninger, T., Danilevsky, M., Fumarola, F., Hailpern, J., Han, J., Ji, M., Johnston, T.J., Kallumadi, S., Kim, H., Li, Z., McCloskey, D., Sun, Y., TeGrotenhuis, N.E., Wang, C., Yu, X.: Winacs: Construction and analysis of web-based computer science information networks. In: Proc. 2011 ACM SIGMOD Int. Conf. Management of Data (SIGMOD 2011) (system demo), Athens, Greece (June 2011)
13. Weninger, T., Fumarola, F., Lin, C.X., Barber, R., Han, J., Malerba, D.: Growing parallel paths for entity-page discovery. In: Proc. 2011 Int. World Wide Web Conf. (WWW 2011), Hyderabad, India (March 2011)