Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data mining: theory and applications Heikki Mannila Data Mining: Theory and Applications • Data analysis becoming more important in other sciences and in industry – New measurement methods – Ability to store data • High-dimensional large data sets • Non-traditional forms (e.g., strings, trees, graphs) • Data analysis lags behind Data mining • Has emerged as a major research area in the interface of computer science and statistics – Machine learning, databases, algorithms • Data analysis questions are increasingly visible in database and algorithms research • Theory and practice interact Goals • • • • Develop novel data analysis techniques for the use of other sciences and industry How? – Look at data analysis problems arising in practice – Abstract new computational concepts from them – Analyse the concepts and develops new computational methods – Take the results into practice Theoretical work in algorithms and foundations of data analysis can have fast impact in the application areas The applications feed interesting novel questions to theoretical research Major themes in methods • Pattern discovery • Methods for sequence decomposition • Interplay of combinatorial and continuous methods in data mining • Techniques sets. for the decomposition of large 0-1 data Application areas • Genome structure • Gene expression data analysis • Palaeontology • Linguistic applications • Ubiquitous computing The people • Heikki Mannila, Hannu Toivonen, Jaakko Hollmen, Aristides Gionis, Floris Geerts, Bart Goethals • 6 Ph.D. students • A visible position in the international community Highlights • • • • Finding recurrent sources in sequences – global structure in genomic sequences – recognizing recurrent contexts in mobile device usage – (k,h)-segmentation Finding orderings of attributes from unordered binary data – Fragements of order – Spectral ordering techniques Pattern discovery and mixture modelling techniques for onomastic data sets Methods for finding topics in 0-1 datasets on the basis of co-occurrence information Finding recurrent sources in sequences • Sequences – DNA – Telecommunications – Etc. • How to find some global structure from a sequence? • Try to find homogenous segments from the sequence Finding homogenous segments T = S1 S2 S3 S4 S5 S6 • Sequence T, integer k • Measure of homogeneity H for segments of T – E.g., H(S) = |S| Var(S) k • Find the division T = S1,S2,…, Sk minimizing H ( Si ) • Dynamic programming i 1 • (k,k)-segmentation: k-segments with no relationship to each other; independent sources (k,h)-segmentation • We want to limit the number of different types of segments • Only h<k different types are allowed • Find the best segmentation of T into k segments by using only h different types of segments (6,3)-segmentation Source 1 Source 2 Source 3 Data k = 3 and h = 3 k = 3 and h = 2 (k,h)-segmentation problem • Given sequence T • Find h sources w1,w2,…, wh • A decomposition of sequence T into k segments T = S1 S2 … Sk • Minimizing the sum of distances from each point t to the source wa(t) of the segment to which t belongs to n (t i 1 i wa (ti ) ) 2 Results • (k,h)-segmentation problem is NP-hard for dimension d>1, for L1 and L2 metrics • Dimension d=1: complexity open • Simple approximation algorithms • d=1: 3-approximation for L1 • d=1: 5 -approximation for L2 • d>1: 3+e –approximation for L1 for any e>0 • d>1: A+2 –approximation for L2, where A is the best approximation factor for k-means clustering • Very good performance in practice • The algorithms work for any generative model (not just reals with Lp metrics) 5 Example: onomastic data • Names of lakes in Finland • About 150,000 lakes • What are the main trends? • High-dimensional marked point process • Collaboration with Research Center for the Languages of Finland (Kotus) • Similar data analysis problems arise also in environmental sciences Similarity with the names of lakes in Kangasala Clustering on the basis of the names of lakes Example: paleontological data • Given a matrix of occurrences of species in fossil sites • Ages of the fossil sites are not available • How to order the sites according to their age? • Background information: species arrive and vanish • Try to find ordering that minimizes Lazarus events time A 0 1 1 0 species B C 0 1 1 0 0 1 1 0 Lazarus events Methods • Spectral ordering: form a Laplacian of the cooccurrence matrix, look at eigenvectors • Fragments of order: find short segments of orders which are not violated by observations • Other applications: text analysis, telecommunications Fortelius, Jernvall, Gionis, Mannila, in preparation Future research directions • • • • • • • Theory and practice The combination of continuous and combinatorial methods Concepts and algorithms for describing structure of sequences Methods for pattern discovery in and modelling of spatiotemporal data Theoretical models for data mining (such as inductive databases) Foundational issues in pattern discovery (e.g., logical form of patterns and the difficulty in discovering them) Publications, collaborations, software releases Applications in the future • Genome structure and its relation to function • Linguistic applications: spatial and temporal variation in language • Ubiquitous computing and telecommunications applications • Paleontological and ecological applications Mobile Computing Research at HIIT Kimmo Raatikainen Research Director Helsinki Institute for Information Technology [email protected] Fuego Mission • To address the research challenges arising in mobile computing systems and applications of tomorrow. • Mobile computing will fulfil the vision of ubiquitous - invisible computing providing access and services anytime, anywhere, and anyhow. • The key research challenges are related to – context-awareness, – reconfigurability, – adaptability, – understanding user needs and experience, and – personalization. ”Any technology distinguishable from magic is insufficiently advanced,” Gregory Benford Present State • Some 20 researchers organised in two closely co-operating research groups – Mobile Computing Group (Prof. Kimmo Raatikainen) – User Experience Research Group (Prof. Martti Mäntylä) • Other senior researchers and post-docs: – Dr. Ken Rimey (software technologies, distributed computing) – Dr. Pekka Nikander, permanent visitor from Ericsson Research (security and privacy in Mobile Internet) – Dr. Timo Saari (user experience research, media science) – Dr. Jan Lindström (distributed data management, mobile data) • Other postdocs likely to be hired 2004 Current Research Topics • Middleware for Mobile Wireless Internet – Fuego Core project – Mobile distributed event system – Mobile (XML-based) file system with intelligent synchronization – SOAP messaging over wireless (W3C: XML Binary Infoset) – Mobile Presence – Host Identity Protocol • Personal Distributed Information Storage – PDIS project – Synchronization-based peer-to-peer infrastructure for storage of structured XML data: PIM data, metadata for digital media • Context Recognition by User Situation Data Analysis – CONTEXT project – Bridge between User Experience Group at ARU and Adaptive Computing Systems Group at BRU • Software Architectures for Configurable Ubiquitous Systems – Sarcous project by SoberIT at HUT – Managing the large variety of software products Targets to 2005-2010 – 1/3 • to enlarge and strengthen international co-operation – current: WWRF, UCB, Fraunhofer FOKUS – new: Japan, KCL/Mobile VCE, an European NoE, CMU, … – but not forgetting co-operation in Finland: • HUT, UHE, Tampere Univ Tech, Univ Oulu, UIAH, … • to contribute to software architecture for Wireless World • to address challenges due to personal networking – minimal differences between solution stacks for ad-hoc communities and networked infrastructure – peer-to-peer, device-to-device solutions Not in primary focus (and other smart places) Not in primary focus but perhaps latter Targets to 2005-2010 – 2/3 • to put more focus on infrastructure for contextawareness and dynamic (end-user) systems – context modelling: presentation, maintenance, sharing, protection, reasoning, and queries – decision rules for reconfiguration • reflective (self-aware) middleware for personal networking • Fault tolerance in Wireless World – traditional exception will be the usual case – compensations, delayed/delegated actions, … • Trust and privacy in Wireless World Targets to 2005-2010 – 3/3 • user needs and novel application concepts – human factors of the Wireless World • basic psychosocial mechanisms • what makes a service use experience engaging and sustaining? – user-centric concept design (UCPCD) • process, methods, tools – novel application concepts based on contextawareness, other novel technologies – experience prototypes