Download Introduction to Large Databases and Data Mining

Introduc)on to Large Databases & Data Mining Tips for Assembling Your Data Analysis Toolbox for the 22nd Century 10/05/12 Jim Heasley, Ins)turte for Astronomy 1 Outline ‐ I •  Rela)onal Databases & BIG DATA –  Big data volumes require a new data handling paradigm –  Advantages of a rela)onal database •  Organiza)on of data •  Data integrity •  SQL ‐‐ Structured (and almost standard) query language for queries –  What a database is not. 10/05/12 Jim Heasley, Ins)turte for Astronomy 2 Outline ‐ II •  Data mining –  What is it? –  Common data mining tasks –  (FREE) Tools available to you to perform many of these tasks. 10/05/12 Jim Heasley, Ins)turte for Astronomy 3 Outline ‐ III •  Examples – Imagined & Real –  If we only had )me travel… –  Things one might start to do with PAN‐STARRS data (right now). 10/05/12 Jim Heasley, Ins)turte for Astronomy 4 RELATIONAL DATABASES 10/05/12 Jim Heasley, Ins)turte for Astronomy 5 Basic Defini8ons •  Database: •  Data: •  Database Management System (DBMS): •  –  A collec)on of related data organized to provide informa)on. –  Known facts that can be recorded and have an implicit meaning. –  Oben integrated from several sources. –  Stored in a standard format for use by mul)ple applica)ons. –  A sobware package/ system to facilitate the crea)on and maintenance of a computerized database. Database System: 10/05/12 –  The DBMS sobware together with the data itself and the hardware upon which it runs. Some)mes, the applica)ons are also included. Jim Heasley, Ins)turte for Astronomy 6 Two approaches –  Generally, there are two approaches to extract informa)on from data: •  file processing approach –  file based sobware programs •  database approach –  DBMS 10/05/12 Jim Heasley, Ins)turte for Astronomy 7 File processing approach Application program 1 Data Instructions Application program n . . . Data Instructions •  Each application program has a specific purpose •  Each program uses its own data –  Issues: •  data redundancy •  redundant processes/interfaces •  data integrity –  ease of maintenance –  consistency •  Security –  preserva)on – valuable company asset –  access control 10/05/12 Jim Heasley, Ins)turte for Astronomy 8 Mo8va8on for databases –  Data is a very important asset of an organiza)on –  Mo)va)ons for databases •  to maintain data independent from applica)on programs •  to avoid: –  redundant data –  redundant processes/interfaces •  to enable: –  ease of maintenance –  sharing of data –  data access control 10/05/12 Jim Heasley, Ins)turte for Astronomy 9 Database approach Application program 1 DBMS Instructions Data . . . Metadata Application program n Instructions –  DBMS ‐ a general purpose sobware •  is self‐describing •  contains –  data –  metadata (i.e. data about data) 10/05/12 Jim Heasley, Ins)turte for Astronomy 10 Main Characteris8cs of the Database Approach •  Self‐describing nature of a database system: –  A DBMS catalog stores the descrip)on of a par)cular database (e.g. data structures, types, and constraints) •  •  Insula8on between programs and data: –  Called program‐data independence. Data Abstrac8on: –  A data model is used to hide storage details and present the users with a conceptual view of the database. •  Support of mul8ple views of the data: –  Each user may see a different view of the database, which describes only the data of interest to that user. •  Concurrent Execu8ons 10/05/12 Jim Heasley, Ins)turte for Astronomy 11 Characteris8cs of DBMS –  Data is: •  integrated, shared, persistent •  self‐describing –  Abstrac)on •  program and data independence –  Mul)ple views of the data •  different users need different kinds of informa)on 10/05/12 Jim Heasley, Ins)turte for Astronomy 12 Advantages of Using the Database Approach •  Controlling redundancy –  Sharing of data among mul)ple users. •  •  •  •  •  •  •  •  Restric)ng unauthorized access to data. Providing persistent storage for program Objects Providing Storage Structures (e.g. indexes) for efficient Query Processing backup and recovery services. mul)ple interfaces to different classes of users. complex rela)onships among data. integrity constraints. Drawing inferences and ac)ons from the stored data using deduc)ve and ac)ve rules 10/05/12 Jim Heasley, Ins)turte for Astronomy 13 Addi8onal advantages of the database approach –  Re‐use of data across mul)ple applica)ons –  Data structure and access can be changed without changing applica)ons –  Enforcement of standards and computa)on of sta)s)cs –  Improved responsiveness, produc)vity 10/05/12 Jim Heasley, Ins)turte for Astronomy 14 Addi8onal Implica8ons of Using the Database Approach Poten)al for enforcing standards Reduced applica)on development )me Flexibility to change data structures Availability of current informa)on –  Extremely important for on‐line transac)on systems such as airline, hotel, car reserva)ons. •  Economies of scale •  •  •  •  10/05/12 Jim Heasley, Ins)turte for Astronomy 15 Disadvantages of the database approach –  –  –  –  –  10/05/12 Complexity Size (of sobware and applica)on) Cost Performance Risk of (spectacular!) failures Jim Heasley, Ins)turte for Astronomy 16 When not to use a DBMS •  Main inhibitors (costs) of using a DBMS: –  High ini)al investment and possible need for addi)onal hardware. –  Overhead for providing generality, security, concurrency control, recovery, and integrity func)ons. •  When a DBMS may be unnecessary: –  If the database and applica)ons are simple, well defined, and not expected to change. –  If access to data by mul)ple users is not required. •  When no DBMS may suffice: –  If the database system is not able to handle the complexity of data because of modeling limita)ons –  If the database users need special opera)ons not supported by the DBMS. 10/05/12 Jim Heasley, Ins)turte for Astronomy 17 Database Logic •  Opera)ons within the database are governed by standard set theory and logic. New types of databases that are built upon fuzzy sets, fuzzy logic, and fuzzy measure are currently the subject of ac)ve research, but are not (as yet) widely available. •  The two key set opera)ons of interest in databases are INTERSECTION (the JOIN) and UNION (called the same in the DB world). 10/05/12 Jim Heasley, Ins)turte for Astronomy 18 Structured Query Language •  The user usually interacts with the database by expressing what she/he wants to accomplish by expressing the request in SQL. Note SQL tells the database what you want to do, but not how to do it. •  There are many helpful tutorials about SQL available on the web. An excellent introduc)on is available at www2.aao.gov.au/2dfgrs/Public/Release/Database/sql_intro.pdf •  This introduc)on is sufficiently vanilla it will get you started despite the minor varia)ons between different flavors of SQL 10/05/12 Jim Heasley, Ins)turte for Astronomy 19 The Schema •  The logical schema defines how aoributes are assigned to various tables and the defini)on of keys (indexes) that help to )e tables together. A user must have understanding of the logical schema. •  The physical schema defines how the data tables are stored on the physical storage media (e.g., disks). Generally, users do not need to know the physical schema although the system developers must leverage this to maximize the performance of their system. 10/05/12 Jim Heasley, Ins)turte for Astronomy 20 User Queries •  Users develop queries to the database in a procedural language, usually some form of SQL, that builds requests for informa)on stored in the databases tables, oben making use of internal rela)onships inherent in the data (e.g., intersec)ons between different tables). 10/05/12 Jim Heasley, Ins)turte for Astronomy 21 The SQL Select Command •  The most frequently used SQL command (by the typical users) is the SELECT command. This is used to get (i.e. select) data from the database tables. •  The basic syntax of the SELECT command is SELECT (list of aoributes you want) FROM (list of tables containing them) WHERE (list of limi)ng/restric)ng condi)ons) 10/05/12 Jim Heasley, Ins)turte for Astronomy 22 What a Database isn’t! While the column arrangement of aoributes in database tables might remind the user of a spreadsheet program like Excel, a database is not a compu)ng engine. Further, because of the nature of SQL, the user’s query simply defines what data is wanted, not how to get it. That also includes how the database may choose to execute numerical opera)ons the user embeds in the query. 10/05/12 Jim Heasley, Ins)turte for Astronomy 23 Database Technology Machine Learning Statistics Data Mining Information Science Visualization Other Disciplines DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES 10/05/12 Jim Heasley, Ins)turte for Astronomy 24 The purpose of compu)ng is insight, not numbers. Richard Hamming, in the preface to his 1962 text on numerical methods. 10/05/12 Jim Heasley, Ins)turte for Astronomy 25 What is Data Mining? •  Finding (meaningful) paoerns in data –  –  –  –  –  Classifica)on Associa)on Rules Cluster Analysis Anomaly Detec)on Regression •  Data mining tools have been used extensively in –  –  –  –  –  –  –  10/05/12 Biology, gene)cs, medical research (Bioinforma)cs) Business and Economics Ecology and resource management Engineering Literature Music Voice and facial recogni)on Jim Heasley, Ins)turte for Astronomy 26 Don’t Re‐invent the Wheel! 10/05/12 Jim Heasley, Ins)turte for Astronomy 27 Rela8onship between Databases & Data Mining •  Databases are oben a key component in data mining. One oben finds data warehouses providing the informa)on needed by the mining tools. •  However, one usually finds that the actual data mining opera)ons are executed outside the database itself. Databases are excellent informa)on severs but are not good compute engines! 10/05/12 Jim Heasley, Ins)turte for Astronomy 28 Classifica8on: Defini8on •  Given a collec)on of records (training set ) –  Each record contains a set of a<ributes, one of the aoributes is the class. •  Find a model for class aoribute as a func)on of the values of other aoributes. •  Goal: previously unseen records should be assigned a class as accurately as possible. –  A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. 10/05/12 Jim Heasley, Ins)turte for Astronomy 29 Associa8on Rule Mining •  Given a set of transac)ons, find rules that will predict the occurrence of an item based on the occurrences of other items in the transac)on Market‐Basket transac)ons Example of Associa)on Rules {Diaper} → {Beer}, {Milk, Bread} → {Eggs,Coke}, {Beer, Bread} → {Milk}, Implica)on means co‐occurrence, not causality! 10/05/12 Jim Heasley, Ins)turte for Astronomy 30 What is Cluster Analysis? •  Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized 10/05/12 Jim Heasley, Ins)turte for Astronomy 31 Anomaly/Outlier Detec8on •  What are anomalies/outliers? –  The set of data points that are considerably different than the remainder of the data •  Variants of Anomaly/Outlier Detec)on Problems –  Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t –  Given a database D, find all the data points x ∈ D having the top‐n largest anomaly scores f(x) –  Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D •  Applica)ons: –  Credit card fraud detec)on, telecommunica)on fraud detec)on, network intrusion detec)on, fault detec)on 10/05/12 Jim Heasley, Ins)turte for Astronomy 32 Regression (Predic8on) Regression is the process of finding a func)on that describes data classes for the purpose of being able to predict discrete numerical data values. Numerous approaches for developing the desired func)on exist, including classifica)on (IF‐THEN) rules, decision trees, mathema)cal formulae, or neural networks. Predic)on also encompasses the iden)fica)on of distribu)on trends based on the available data. Both classifica)on and predic)on may need to be preceded by relevance analysis, which aoempts to iden)fy those aoributes or features that do not contribute to the classifica)on or predic)on process. These aoributes can then be excluded from the analysis. A common relevance analysis technique is principal component analysis. 10/05/12 Jim Heasley, Ins)turte for Astronomy 33 Machine Learning 10/05/12 Jim Heasley, Ins)turte for Astronomy 34 Data Mining Environments There are a large number of data mining sobware packages available, both commercial and open source. A search of the internet can quickly iden)fy these. A comprehensive review of these packages is far beyond the scope of what we can deal with in this talk, so I will restrict my comments here to several well‐known packages used for data analysis and mining: the R sta)s)cal analysis package, Matlab (and the open source work‐alike Octave), and data mining packages Weka and Scikits.Learn. 10/05/12 Jim Heasley, Ins)turte for Astronomy 35 •  The R Project for Sta8s8cal Compu8ng www.r‐project.org/ •  R, also called GNU S, is a strongly func)onal language and environment to sta)s)cally explore data sets, make many graphical displays of data. Very strong sta)sical tools. •  The basic system has been greatly expanded by the addi)on of packages developed by its user community 10/05/12 Jim Heasley, Ins)turte for Astronomy 36 Matlab (Octave) •  MATLAB, a commercial product from MathWorks, is a high‐level technical compu)ng language and interac)ve environment for algorithm development, data visualiza)on, data analysis, and numerical modeling. hop://www.mathworks.com/products/matlab/ •  GNU Octave is a high‐level interpreted language, primarily intended for numerical computa)ons. It is ian open source work‐alike version of MATLAB. hop://www.gnu.org/sobware/octave/ 10/05/12 Jim Heasley, Ins)turte for Astronomy 37 Weka (Waikato Environment for Knowledge Analysis) is a well‐known suite of machine learning sobware that supports several typical data mining tasks, par)cularly data preprocessing, clustering, classifica)on, regression, visualiza)on, and feature selec)on. Its techniques are based on the hypothesis that the data is available as a single flat file or rela)on, where each data point is labeled by a fixed number of aoributes. Weka provides access to SQL databases u)lizing Java Database Connec)vity and can process the result returned by a database query. Its main user interface is the Explorer, but the same func)onality can be accessed from the command line or through the component‐based Knowledge Flow interface. hop://www.cs.waikato.ac.nz/~ml/weka/ 10/05/12 Jim Heasley, Ins)turte for Astronomy 38 scikit‐learn is a Python module integra)ng classic machine learning algorithms in the )ghtly‐knit scien)fic Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solu)ons to learning problems, accessible to everybody and reusable in various contexts: machine‐learning as a versa)le tool for science and engineering. Tools are available for supervised & unsupervised learning, model selec)on, datasets, feature extrac)on. hop://scikit‐learn.org/stable/ 10/05/12 Jim Heasley, Ins)turte for Astronomy 39 Pluses, Minuses, Observa8ons The R and Weka sobware both have a large community which contributes to extending their func)onality through the development of new add‐on packages. Further R and Weka can be interfaced via the RWeka package. There are many excellent on‐line tutorials for these packages, and Weka itself is well described in the text Data Mining – PracBcal Machine Learning Tools and Techniques by Wioen, Frank, & Hall. This text provides both a good underpinning of the methods and prac)cal tutorial informa)on. (The text is available as an e‐book.) Scikits.learn, while s)ll fairly new (current release is version 0.7), has a very impressive collec)on of tools and an extensive user guide. The sobware is wrioen in Python. My main reserva)on about this sobware is that while the user guide presents many examples, there is an implicit assump)on that the user knows a great deal about the field of data mining. This may leave the new user somewhat in over their head in trying to determine exactly which tool best serves their need. 10/05/12 Jim Heasley, Ins)turte for Astronomy 40 EXAMPLES – IMAGINARY & REAL 10/05/12 Jim Heasley, Ins)turte for Astronomy 41 How could we have helped this lady? 10/05/12 Jim Heasley, Ins)turte for Astronomy 42 10/05/12 Jim Heasley, Ins)turte for Astronomy 43 10/05/12 Jim Heasley, Ins)turte for Astronomy 44 Or these gentlemen? 10/05/12 Jim Heasley, Ins)turte for Astronomy 45 10/05/12 Jim Heasley, Ins)turte for Astronomy 46 Or him? 10/05/12 Jim Heasley, Ins)turte for Astronomy 47 Pan‐STARRS Opportuni8es •  The PS1 Small Area Survey (SAS), covering an area of 81 deg2, overlaps with the SDSS Stripe 82. In addi)on to the deep Stripe 82 database, the images from this region have been examined by the Ci)zen Science team known as the Galaxy Zoo. This interes)ng overlap of resources provides data for some exci)ng data mining experiments. •  Star‐Galaxy classifica)on (or more precisely, Star‐Galaxy‐QSO classifica)on) is an on‐going challenge for the PS1 science teams. While this work has been reasonably successful, the efforts thus far seem to have aoempted to get by with the simplest possible classifica)on approach. What might happen if we performed a classifica)on exercise wherein we use a wide range of IPP measurements (e.g., psf, Kron, Petrosian magnitude, Petrosian radii, various moments measured in individual frames and stack) with SDSS and Galaxy Zoo data providing classifica)on “truth?” •  A similar analysis, using visual inspec)on of the images to iden)fy ar)facts in the PS1 images and/or stacks, might provide a robust garbage rejec)on process. Not necessarily glamorous but definitely important. 10/05/12 Jim Heasley, Ins)turte for Astronomy 48 Empirical Photo‐Z Methods •  •  •  •  •  •  •  •  •  Ar)ficial Neural Networks Support Vector Machines Self‐Organizing Maps Gaussian Process Regression Kernel Regression Linear/Nonlinear polynomial fixng Instance Based Learning & Nearest Neighbors Boosted Decision Trees Regression Trees And these are just the ones I’ve found so far! 10/05/12 Jim Heasley, Ins)turte for Astronomy 49 Galaxy Clusters? •  We all know the best way to iden)fy clusters of galaxies is from their x‐ray emission. Unfortunately, current x‐ray surveys don’t provide sufficient sky & depth coverage to do this. •  Op)cal surveys have sufficient depth but suffer from background issues, overlapping foreground & background clusters, etc. •  It has long been hoped that in large scale op)cal surveys such as Pan‐STARRS and LSST, we will be able to use Photo‐ Z values to sort out real clusters from accidental clustering of galaxies, and overlapping clusters at different distances. (Some of the PS1 partners in Taiwan are working on this problem.) 10/05/12 Jim Heasley, Ins)turte for Astronomy 50 Galaxy Clusters – Can Data Mining Help? •  While there is a plethora of data mining techniques for finding clusters within data, most are probably not well suited for finding galaxy clusters. Many methods start off by assuming that in a given region that one knows how many clusters are present. Clearly this is not the case with our problem. Further, we need to deal with the fact that in the 3‐D representa)on, we have much larger uncertainty along the line of sight due to the accuracy of the Photo‐Z measures. •  Some interes)ng work in this area has made use of a friend‐of‐friends approach. I think this could be generalized to include beoer background discrimina)on including the Photo‐Z distribu)on. 10/05/12 Jim Heasley, Ins)turte for Astronomy 51 PAU 10/05/12 Jim Heasley, Ins)turte for Astronomy 52 

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Introduction to Large Databases and Data Mining