Big Data and Business Intelligence
Ryan J. Baxter, Ph.D.
Boise State University
EMBA Session - November 18, 2016

Session Goals
• Explore the history, current use, and trajectory of Big Data, including the underlying technologies and their role in enabling Big Data.
• Consider critically the challenges and opposition to Big Data.
• Review and analyze industry-specific examples and insights of Big Data and Analytics.

Agenda
12:30-1:15 (45 min) Overview and Exploration
1:15-1:45 (30 min) Team Breakout: Analyzing Industries and Looking for Patterns
1:45-2:00 (15 min) Break
2:00-2:50 (50 min) Reconvene and Share Insights
2:50-3:00 (10 min) Break
3:00-4:00 (60 min) Mark Bastian – Clearwater Analytics: Systematically Converting Unstructured Data into Value

Changing data alone won’t solve problems
• A concrete example…

Key Takeaways
• Changing the reference system affects:
  • Use of tools
  • Culture, habits, customs
• A more data-intensive reference system allows for:
  • Tighter coordination
  • Orchestration of complex routines

What is Big Data?
http://www.ibmbigdatahub.com/infographic/four-vs-big-data

How many Vs does it take to define Big Data?
• Volume
• Variety
• Velocity
• Veracity
• Variability
• Visualization
• Value
3 Vs of Big Data: Volume, Variety, Velocity
4 Vs of Big Data by IBM: Volume, Variety, Velocity, Veracity

Where is the data coming from?
Decreasing cost of data storage
Average Cost Per Gigabyte (USD)
Year    Cost per GB
1980    $437,500
1985    $105,000
1990    $11,200
1995    $1,120
2000    $11.00
2005    $1.24
2010    $0.090
2013    $0.050
2014    $0.030
2015    $0.022
2016    $0.019
Recreated from source: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/

Miniaturization and Mobility of Computing Technology and Sensors
http://www.computerhistory.org/atchm/the-worlds-smallest-computer/
By Kopiersperre (talk) - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36391402
By Author of Carna Botnet "Internet Census 2012", https://commons.wikimedia.org/w/index.php?curid=26114329

Automotive, Appliances, Computers, Consumer Electronics, Healthcare, Industrial, Military
https://www.ncta.com/platform/broadband-internet/behind-the-numbers-growth-in-the-internet-of-things-2/

Customer Interaction Evolution
• Small Merchants:
  • Tight relationship with customer
  • Rich, organic, credible narrative data
• Early Large Merchant:
  • Loose relationship with customer
  • Little personal data, but lots of general data
• Maturing National Merchant:
  • Tightening relationship with customer
  • Increasing personal data + lots of aggregate data
• Current National Merchant:
  • Multi-faceted relationship with customer
  • Huge amount of personal and aggregate data
• Future Global and SME Merchants:
  • Intimate relationship with customer
  • Lots of personal and aggregate data
https://www.flickr.com/photos/gleonhard/8979555482/
https://www.flickr.com/photos/davedugdale/5102910864/in/photostream/

What other trends or advances are contributing to data growth?

Analytics, Big Data, Business Intelligence, Decision Support Systems, Data Mining… How do these fit together?

How do we deal with this data? Volume, Variety, Velocity

The relational database
• Good
  • Avoid redundant data (save space!)
  • Transaction friendly
  • Consistency during update
• Bad
  • Scaling
  • High volume availability
  • Sensitive to small changes

Relational Databases are Sensitive to Change
• “This notion of thinking about data in a structured, relational database is dead.” [1]
• Each year, billions of dollars are spent on data modeling and ETL* processes to create and recreate more “perfect” data models that will never change. BUT THEY ALWAYS DO. [2]
[1] 2009, Vivek Kundra, Former CIO of the U.S. Federal Government (cited in [2]).
[2] 2016, Matt Allen, “Relational Databases Are Not Designed To Handle Change”
* ETL = Extract, Transform, and Load

Necessity is the mother of …

Big Data Technologies Leverage
• Controlling clusters of commodity hardware
• Non-relational databases
• Open source
• Rapidly evolving

NoSQL: “Not only SQL” – Non Relational
• Characteristics:
  • Non-relational
  • Schema-less (on input)
  • Open source
  • Cluster-friendly
  • Real-time (fast read/write)
• Why?
  • Large dataset – scale horizontally
  • Ease of programming
  • Schema-less
  • Data variety
  • Faster capture
  • Redundant
Additional resource: https://www.youtube.com/watch?v=qI_g07C_Q5I (Introduction to NoSQL by Martin Fowler)

Normalization vs. Aggregation
Source: https://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/

Apache Hadoop
• Open source
• Large scale, distributed storage and processing
• Clusters of commodity hardware (high failure tolerance)
• Immutability of Data
• Batch oriented
Resource: https://developer.yahoo.com/hadoop/tutorial/module1.html

Immutability of Data
• All data appended
• No rewriting/updating
• Learn from “streams of change”

Criticisms of Big Data

Privacy – Asymmetry of Power
“… these capabilities, most of which are not visible or available to the average consumer, also create an asymmetry of power between those who hold the data and those who intentionally or inadvertently supply it.” [1]
1.
Source: BIG DATA: SEIZING OPPORTUNITIES, PRESERVING VALUES (Executive Office of the President, May 2014) http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
2. By Toby Hudson (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Data will Help us Manifesto… (http://datawillhelp.us/)
• “…we’re abandoning timeless decision-making tools like wisdom, morality, and personal experience for a new kind of logic which simply says: ‘show me the data’.”

“Big data has arrived, but big insights have not.”
Big Data Articles of Faith:
1. It’s accurate
2. All data captured (no need for sampling)
3. Causation is unimportant
4. “…the numbers speak for themselves”
Theory-free analysis is fragile. “If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down.”
Source: http://www.ft.com/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz3MByvnOn8

Mirai Bot – IoT – What’s Going on?
• DDoS attack launched from IoT devices left on default credentials
• 61 default username/password pairs
• No industry minimum or standard
• Future regulation?
Partial List: http://www.csoonline.com/article/3126924/security/here-are-the-61-passwords-that-powered-the-mirai-iot-botnet.html

Medical Devices are vulnerable
“In our recent assessment of medical devices used in clinics and hospital around the country, weak encryption, lack of key management, poor authentication and authorization protocols, and insecure communications were all common findings.” -Chandu Ketkar, Technical Manager at Cigital
https://www.bitsighttech.com/press-releases/news/industry-analysis-reveals-healthcare-and-pharmaceutical-industry-lags-in-security-effectiveness

Case Study: St.
Jude Medical Devices Vulnerabilities
• Watch Video at: http://www.bloomberg.com/news/articles/2016-08-25/in-an-unorthodox-move-hacking-firm-teams-up-with-short-sellers

• “A number of associations in the model were really problematic,”
• “It’s scary enough to think that private companies are gathering endless amounts of data on us. It’d be even worse if the conclusions they reach from that data aren’t even right.” (Lazar)

Crime Prediction and Prevention
• Police leverage real-time analytics to provide actionable intelligence that can be used to understand criminal behavior, identify crime/incident patterns, and uncover location-based threats.
• That reminds me of a movie I once watched…
https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-government

Prediction?
Source: http://paperathensupm59.files.wordpress.com/2010/11/schermafbeelding-2010-11-29-om-19-34-10.png

Gaining or Losing from lost Privacy?
• “When we lose privacy, we gain so much more. For example, if we open all our medical data for everybody to have, we can have insights.” (Kira Radinsky – CTO and co-founder of SalesPredict)
• Crowdsourcing Health Data
  • 23andme genetic research
  • Ouraring and WeAreCurious

Hold on! Are you leveraging existing data opportunities?

Little Data?
• Management and work practices alignment
• Data quality
• Data synchrony
• Scorecard – Evidence based management
• Coaching
• Business rules management (aligning operational decisions with strategy)

Best Practices for New Initiatives
• Well-defined use cases
• Hypotheses
• Build Infrastructure
• Measure
• Adapt
• Iterate…
• Leverage increasing infrastructure to explore
Cycle:
1. Use Case
2. Hypotheses
3. Build Infrastructure
4. Measure
5. Adapt
6. New Use Case
7. Hypotheses
8. Increase/Refine Infrastructure
9. Measure
10. Adapt
11. Iterate

Next wave… Data Driven Automation of Business Decisions
• Operational Analytics by Bill Franks
• Focus on breadth (good enough vs.
perfect)
• Design connections from data to decisions
• Prototype, Test, Refine
See an overview at: http://www.theanalyticsrevolutionbook.com/

Discussion After Group Breakout: What are the keys to evolving to a data-driven/centric organization?

Additional Resources and Issues

Data Mining
• Techniques for learning patterns in data by applying statistical techniques.
• Training
• Classifying, Clustering, Associations
• Predictive
• Resource: https://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/

Public Data Sets Listings – e.g.
• https://github.com/caesar0301/awesome-public-datasets
• https://aws.amazon.com/datasets/
• https://www.google.com/publicdata/directory
• https://www.reddit.com/r/datasets/
Facebook Data Set Example: https://docs.google.com/spreadsheets/d/1mLO7SFqHmUaZEpp87cwkM0luJutSwmwKMx7kaM9348U/edit#gid=1042851424

http://www.wsj.com/articles/whats-all-that-data-worth-1413157156
• “A lot of what is going on at the companies is not being reflected in public disclosures or the accounting,” (Glen Kernick, a managing director at investment-banking and valuation advisory firm Duff & Phelps Corp.)
• “the accounting profession has completely failed modern business in not being able to catch up to new forms of property” (Alex Poltorak, CEO of General Patent Corp)

Designing Data Repositories
• Data Warehouse – Structured – schema is applied when data is written
• Data Lake – Raw – structure is applied when data is read
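
The warehouse/lake contrast above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular product: the schema `WAREHOUSE_SCHEMA`, the helper names, and the sample events are all hypothetical, and plain Python lists stand in for real storage.

```python
import json

# Hypothetical warehouse schema (an assumption for illustration):
# field name -> type every stored row must have.
WAREHOUSE_SCHEMA = {"customer_id": int, "amount": float}


def write_to_warehouse(table, record):
    """Schema on write: validate and coerce BEFORE storing; reject bad rows."""
    row = {}
    for field, field_type in WAREHOUSE_SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = field_type(record[field])
    table.append(row)


def write_to_lake(lake, raw_line):
    """Data lake: store the raw text untouched; no structure is imposed yet."""
    lake.append(raw_line)


def read_from_lake(lake, fields):
    """Schema on read: parse and pick fields only when a question is asked."""
    for raw_line in lake:
        parsed = json.loads(raw_line)
        yield {field: parsed.get(field) for field in fields}


# Made-up sample events; the third does not fit the warehouse schema.
events = [
    '{"customer_id": 1, "amount": "19.99"}',
    '{"customer_id": 2, "amount": 5, "note": "gift"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]

warehouse, lake = [], []
rejected = 0
for line in events:
    write_to_lake(lake, line)  # the lake accepts everything as-is
    try:
        write_to_warehouse(warehouse, json.loads(line))
    except ValueError:
        rejected += 1  # the warehouse refuses rows that do not fit
```

In this sketch the warehouse ends up with only the rows that match its schema (and silently drops the extra `note` field), while the lake keeps every event, including the sensor reading that nobody planned for. A later call such as `list(read_from_lake(lake, ["device", "temp_c"]))` imposes structure at read time, which is the trade-off the slide points at: the warehouse pays the modeling cost up front, the lake defers it to each query.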