Fayé A. Briggs, PhD, Adjunct Professor and Intel Fellow (Retired), Rice University
Material mostly derived from "Mining of Massive Datasets" by Jure Leskovec, Anand Rajaraman, and Jeff Ullman, Stanford University, http://www.mmds.org

How big is the data? In 2012, one petabyte was:
• ~50x the content in the Library of Congress
• ~13 years to view as continuous HD video
• ~11 seconds to generate worldwide
A transatlantic flight in a Boeing 777 produces about 30 terabytes of telemetry data.

Big Data: "A new generation of technologies and architectures designed to economically extract value from very large VOLUMES of a wide VARIETY of data by enabling high-VELOCITY capture, discovery, and/or analysis," while ensuring VERACITY.
Sources: IDC's Digital Universe Study, sponsored by EMC, December 2012; http://blogs.loc.gov/digitalpreservation/2011/07/transferring-libraries-of-congress-of-data/
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

The Internet of Things (IoT):
• By the year 2020, more than 200 billion devices will be connected to the cloud and to each other, commonly called the IoT.
• IDC predicts that a third of those billions of devices will be intelligent devices.
• A large amount of legacy equipment is not connected, managed, or secured.
• The interoperability of legacy systems must be addressed, with interfaces that can be securely connected to public or private clouds, to avoid the large cost of replacing them.

The four V's map to technology requirements:
• "Economically": productivity, efficiency, and better decisions
• "Extract value & analysis": parallel computation and algorithms for better decisions
• "Very large volumes, capture": memory and storage
• "Wide variety": sensors, software, and security
• "High velocity": networking and I/O
• "Veracity": correlation to reality (quality and correctness)

Use case, Telecom: calling patterns, signal processing, forecasting. Analyze switch/router data for call quality, call frequency, regional loads, etc. Act before problems happen; act before customer calls arrive.
Use case, Financial: trading behavior. Analyze real-time data to understand market behavior and the role of individual institutions/investors. Detect fraud, detect the impact of an event, detect the players.

Use case, Search engines: process the data collected by web bots in multiple dimensions to enhance the relevance of search. Big Data impacts e-connected businesses through efficient capture, processing, and storage of huge amounts of data.

Marketing analytics:
• Click-stream analysis: analysis of online user behavior. Develops comprehensive insight (business intelligence) to run effective strategies in real time.
• Graph analysis: a term for discovering the online influencers in various walks of life. Enables a business to understand key players and devise effective strategies.
• Lifecycle marketing: strategies to move away from spam/mass mail. Enables a business to spend money only on highly probable customers.
• Revenue attribution: a term for analyzing data to accurately attribute revenue back to various marketing investments. A business can identify the effectiveness of a campaign to control expenses.
The Big Data phenomenon allows businesses to know, predict, and influence customer behavior!

Healthcare and the aging world:
(Figure: world map of the percentage of population over age 60 projected for 2050, binned 0-9%, 10-19%, 20-24%, 25-29%, and 30+%. Source: United Nations, "Population Ageing 2002")
• Healthcare costs are rising and consume a significant percentage of GDP.
• Global aging: the worldwide average share of people aged 60+ grows from 10% to 21% by 2050.
• U.S. healthcare Big Data value: about $300 billion in value per year, roughly 0.7% annual productivity growth.

Healthcare effectiveness analysis draws on medical histories, clinical information, imaging results, laboratory test results, physician interactions, preferred prescriptions, and patient accountability for taking those medications. Providers use graph analytics to assess many similar medical histories managed within a graph model that links patients to physicians, medications, presumed diagnoses, and providers.
Providers can rapidly scan through the graph to discover the therapies used with other patients with similar characteristics (such as age, diagnostics, clinical history, and associated risk factors) that have the most positive outcomes.
Sources: McKinsey Global Institute Analysis; ESG Research Report 2011, North American Health Care Provider Market Size and Forecast

Where is the data coming from?
1. Pharma/life sciences
2. Clinical decision support and trends (includes diagnostic imaging)
3. Claims, utilization, and fraud
4. Patient behavior/social networking

How do we create value? (Examples)
1. Personalized medicine
2. Clinical decision support
3. Enhanced fraud detection
4. Analytics for lifestyle- and behavior-induced diseases

Penn State broke ground on a planned $54M data center dedicated to making use of big data to enhance medical research and patient care. It will allow the university to better gather and analyze large volumes of rich health data for effective prediction and modeling of diseases and disease behaviors. (Digital rendering of the 46,000 sq ft data center, set to open in April 2016. Source: McKinsey Global Institute Analysis)

Healthcare Big Data applications and stack:
• New healthcare applications: health info services, primary care, personal health management, aging society, clinical decision support, personalized medicine, cancer genomics
• Analytics and visualization: SQL-like query, machine learning, medical imaging analytics
• Data: medical records, genome data, medical images
• Data processing/management: storage optimization, security and privacy, imaging acceleration, distributed platform

Goals: advance the state-of-the-art core technologies required to collect, store, preserve, manage, and analyze massive amounts of data, and to visualize the results for business intelligence.
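The patient-similarity scan described above can be sketched as a toy query over patient records. Everything below (the record fields, the patients, and the outcome scores) is hypothetical and purely illustrative; real provider graphs link patients to physicians, medications, diagnoses, and providers at a vastly larger scale:

```python
# Toy sketch of "find therapies that worked for similar patients".
# All data and field names here are hypothetical, for illustration only.
patients = {
    "p1": {"age": 67, "diagnosis": "type2_diabetes", "therapy": "metformin", "outcome": 0.85},
    "p2": {"age": 71, "diagnosis": "type2_diabetes", "therapy": "insulin",   "outcome": 0.60},
    "p3": {"age": 64, "diagnosis": "type2_diabetes", "therapy": "metformin", "outcome": 0.90},
    "p4": {"age": 35, "diagnosis": "asthma",         "therapy": "inhaler",   "outcome": 0.95},
}

def best_therapies(query_age, query_diagnosis, max_age_gap=10):
    """Scan for patients similar to the query (same diagnosis, close age)
    and rank their therapies by average observed outcome, best first."""
    scores = {}
    for attrs in patients.values():
        if (attrs["diagnosis"] == query_diagnosis
                and abs(attrs["age"] - query_age) <= max_age_gap):
            scores.setdefault(attrs["therapy"], []).append(attrs["outcome"])
    # average outcome per therapy, sorted best first
    return sorted(((sum(v) / len(v), t) for t, v in scores.items()), reverse=True)

print(best_therapies(68, "type2_diabetes"))
```

A production system would traverse an actual graph store rather than a flat dictionary, but the core operation (filter by similarity, aggregate outcomes per therapy) is the same.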
Challenge: the large data sets coming from, for example, the proliferation of sensors, patient records, and experiential medical data are overwhelming data analysts, who lack the tools to efficiently store, process, analyze, and visualize the metadata derived from these vast amounts of data. There is a lack of IoT interface and data-delivery standards, and fractured providers cause data-format wars (and data loss from consolidations). Systems architectures that can store this big data, extract metadata on the fly, and provide the computing capability to analyze the data in real time pose major challenges.

Data contains value and knowledge. But to extract the knowledge, the data needs to be stored, managed, and ANALYZED in order to:
• Predict actionable insight from the data
• Create data products that have business impact
• Communicate relevant visuals to influence business
• Build confidence in data value to drive business decisions

Carlos Somohano, Data Science London: What does a Data Scientist do?

The big data pipeline (generate, transport, store, compute within a SW framework, and analytics leading to insight, with context & location awareness and protection throughout) requires a variety of well-optimized technologies working together.
What is data mining? Given lots of data, discover patterns and models that are:
• Valid: hold on new data with some certainty
• Useful: it should be possible to act on the item
• Unexpected: non-obvious to the system
• Understandable: humans should be able to interpret the pattern

Descriptive methods find human-interpretable patterns that describe the data (example: clustering). Predictive methods use some variables to predict unknown or future values of other variables (example: recommender systems).

A risk with data mining is that an analyst can "discover" patterns that are meaningless. Statisticians call it Bonferroni's principle: roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Example: we want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.
• 10^9 people being tracked over 1,000 days
• Each person stays in a hotel 1% of the time (1 day out of 100)
• Hotels hold 100 people (so 10^5 hotels)
If everyone behaves randomly (i.e., there are no terrorists), will the data mining detect anything suspicious? The expected number of "suspicious" pairs of people is about 250,000: far too many combinations to check. We need some additional evidence to find "suspicious" pairs of people in a more efficient way.

Challenges in data mining: usage, quality, context, streaming, scalability.
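The expected-pairs figure can be checked with a short back-of-envelope computation under the slide's assumptions (the exact count comes out to about 249,750, which the slide rounds to 250,000):

```python
# Bonferroni hotel example: 10^9 people, 1,000 days,
# each person in a hotel 1% of days, 10^5 hotels of 100 people.
from math import comb

people, days, hotels = 10**9, 1000, 10**5
p_hotel = 0.01  # probability a given person is in some hotel on a given day

# P(two specific people are in the SAME hotel on one specific day)
p_same_day = p_hotel * p_hotel / hotels      # 1e-9

# P(the same two people collide on two specific days)
p_two_days = p_same_day ** 2                 # 1e-18

# expected "suspicious" pairs = (pairs of people) * (pairs of days) * p_two_days
expected = comb(people, 2) * comb(days, 2) * p_two_days
print(round(expected))  # ~249,750, i.e. the slide's ~250,000
```

The lesson: with purely random behavior there are still hundreds of thousands of "suspicious" pairs, so finding a pattern this weak proves nothing.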
Data mining overlaps with:
• Databases: large-scale data, simple queries
• Machine learning: small data, complex models
• CS theory: (randomized) algorithms
Different cultures: to a DB person, data mining is an extreme form of analytic processing (queries that examine large amounts of data; the result is the query answer). To an ML person, data mining is the inference of models (the result is the parameters of the model). In this class we will review both!

This class overlaps with machine learning, statistics, artificial intelligence, databases, and systems architecture, but it stresses:
• Scalability (big data)
• Algorithms
• Computing architectures
• Handling large data
• Visualization

Learn to mine different types of data:
• Data that is high dimensional
• Data that is a graph
• Data that is infinite/never-ending
• Data that is labeled

Learn to use different models of computation:
• MapReduce
• Streams and online algorithms
• A single machine, in-memory

Review real-world problems:
• Recommender systems
• Market-basket analysis
• Spam detection
• Duplicate document detection

Review various "tools":
• Linear algebra (SVD, recommender systems, communities)
• Optimization (stochastic gradient descent)
• Dynamic programming (frequent itemsets)
• Hashing (LSH, Bloom filters)
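As a preview of the MapReduce model listed above, here is a minimal single-process word-count sketch. The function names are illustrative, not any framework's API; real systems such as Hadoop distribute the same map, shuffle, and reduce phases across many machines:

```python
# Minimal single-process sketch of the MapReduce dataflow (word count).
from collections import defaultdict

def map_phase(docs):
    # map: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big insight", "data mining"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'insight': 1, 'mining': 1}
```

The point of the model is that map and reduce are pure, per-key operations, so the framework can parallelize them freely and handle machine failures by re-running tasks.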
Course topics at a glance:
• High-dimensional data: locality-sensitive hashing, clustering, dimensionality reduction
• Graph data: PageRank, SimRank, community detection, spam detection
• Infinite data: filtering data streams, web advertising, queries on streams
• Machine learning: SVM, decision trees, perceptron, kNN
• Apps: recommender systems, association rules, duplicate document detection

How do you want that data?

Logistics:
• Instructor (wish I had TAs!); office hours: TBD
• Course website: TBD; lecture slides to be posted to a Rice University website (TBD)
• Readings: the book "Mining of Massive Datasets" by J. Leskovec, A. Rajaraman, and J. Ullman, free online at http://www.mmds.org

Homework:
• (1+)4 longer homeworks: 40% of the grade, with theoretical and programming questions
• HW0 (Hadoop tutorial) has just been posted
• Assignments take lots of time. Start early!!
• How to submit write-ups: Stanford students, in class or in the Gates submission box; SCPD students, via SCPD, attaching the HW cover sheet (and SCPD routing form)
• Upload code: put the code for one question into one file and submit it at http://snap.stanford.edu/submit/