Big Data Analytics - Dr. Elan Sasson

Big Data vs. big Data?
Business Analytics in Customer-Centric Organizations
Dr. Elan Sasson
[email protected]
www.datascience.co.il
30/3/2015
Objectives
What is Big Data Analytics?
What is Data Science?
DDD and its importance in customer- and service-oriented organizations
A variety of data sources
Basic concepts in creating Data Products
Recommendations and approaches for building analytical capabilities
Data scientists, and why this might interest... you?
Fundamental concepts in data mining
Managing business analytics projects (CRISP-DM)
Data Privacy and why it matters
Data/Text mining code examples in R
Trend of Google Searches of “Big Data”
and “Data science” over time showing
the popularity of the terms
Data Science – the connective tissue
between big data processing
technologies and data-driven
decision making (DDD) (Provost &
Fawcett, 2013)
Terminology
Data-Driven Decision-Making (DDD) – refers to
the practice of basing decisions on the analysis
of data, rather than purely on intuition.
(Provost & Fawcett, 2013)
Data Science – a set of fundamental
principles that support the extraction of
information and knowledge from data.
It involves principles, processes, and
techniques for understanding phenomena via
the (automated) analysis of data.
Big Data Technologies are used to process
and handle big data, and include preprocessing prior to implementing data mining
techniques.
The new approach to Business Analytics
Why do we really care?
• DDD affects firm performance → the more data-driven a firm is, the more
productive it is (a 4%-6% increase), and DDD is highly correlated with higher ROI,
ROE, asset utilization and market value. (Brynjolfsson et al., Strength in numbers:
How does data-driven decision making affect firm performance?, 2013, MIT).
• Utilization of BD technologies correlates with significant additional
productivity growth → 3% higher productivity than the average firm.
(Tambe P., Big data know-how and business value, 2012, NYU).
Competitive Advantage
What can I now do that I couldn’t do before,
or do better than I could do before?
3 Principles of the new era of computing
• Data will be the basis of competitive intelligence for any organization – companies,
government entities, cities and individuals
• Data in this new era is not a limited resource
• Changing how we make decisions - Decisions will be based not on
intuition or past experience, but on predictive analytics.
• Changing how we create value - Organizations, private and public, will become social enterprises.
• Changing how we deliver value - Success will depend upon the ability
to create products and services for individuals, not market
segments.
http://asmarterplanet.com/blog/2013/03/ibm-ceo-ginni-rometty-gaining-competitive-advantage-in-the-newera-of-computing.html
Big Data Everywhere!
• Lots of data is being collected and warehoused
– Transactional data
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
– Multimedia content
– Scientific data
– Network sensors
– Mobile phones
– User-generated content
– Internet of Things
Data is becoming the new currency, a vital natural resource
Datafication - taking all aspects of life
and turning them into data (The rise of
big data, 2013. Foreign Affairs)
What to do with these data?
Aggregation and Statistics:
• Data warehouse
• OLAP
Indexing, Searching, and Querying:
• Keyword based search
• Pattern matching (RDF/XML)
Knowledge discovery:
• Data mining
• Text mining
• Graph mining
• Statistical modeling
Big Data … Big Assumptions
• Collecting and using a lot of data rather
than small samples (“N= All”)
• Accepting messiness in your data
• Giving up on knowing the causes
Big Data Use Cases
Big Data can play a significant economic role to
• Private commerce
• Public sector
• National economies
big Data – The enterprise perspective
Enterprise data is big… but it is not Google-big
[Diagram: Classic BI boundary (IT-oriented): OLTP, ETL, OLAP, dashboards ("Dash-bored…"). Big Data Warehouse (business-oriented): OLTP / dark data / logs / social / web, ETL', Augmented DWH + Extreme-Scale Analytics]
The existing solution - DWH
Which types of banking products sell the most?
What is the distribution of expenses by headquarters unit?
What is the distribution of revenue by banking product?
Which product types show a seasonal trend?
What is the profitability by product across the time dimension and the branch dimension?
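The DWH questions above are aggregation queries over known dimensions (product, branch, time). A minimal Python sketch of such a roll-up, assuming a hypothetical list of transaction records; the field names `product`, `branch`, `month`, `revenue` and all values are invented for illustration and do not come from the slides:

```python
from collections import defaultdict

# Hypothetical fact rows; names and values are illustrative assumptions only.
transactions = [
    {"product": "mortgage", "branch": "north", "month": "2015-01", "revenue": 120.0},
    {"product": "mortgage", "branch": "south", "month": "2015-01", "revenue": 80.0},
    {"product": "credit card", "branch": "north", "month": "2015-01", "revenue": 30.0},
    {"product": "credit card", "branch": "north", "month": "2015-02", "revenue": 45.0},
]


def roll_up(rows, dimensions, measure):
    """Sum `measure` for every combination of values of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Revenue distribution by product, and by product over the time dimension.
revenue_by_product = roll_up(transactions, ["product"], "revenue")
revenue_by_product_month = roll_up(transactions, ["product", "month"], "revenue")
```

The same function answers each question on the slide by changing the list of dimensions, which is the sense in which OLAP-style analysis verifies questions the analyst already knows how to ask.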
The problem space?
Who are the most likely prospects for a loan above NIS 300,000?
What are the characteristics of a churning customer?
How can the process of granting credit to a new customer be shortened?
What are the characteristics of a profitable customer?
Which new products should be offered to existing customers?
How can processes in the organization be made more efficient?
The solution space: from Business Intelligence to Business Analytics
To: Large Scale Data/Text Mining (Discovery Based Analysis)
~ A discovery-based analytical process
~ Tools/algorithms operating on the data space expose hidden patterns
~ Clustering, prediction and association processes
~ Unsupervised Machine Learning
~ Processes that require a large historical database
From: DWH/OLAP (Verification Based Analysis)
~ A verification-based analytical process
~ The user posits some hypothesis
~ Confirmation/refutation techniques are applied
~ User-driven processes: the ability to make correct assumptions, choose the tools, and interpret the results
Complementary processes
Big Data Architecture & Pipeline
[Diagram: data sources (internal/external): streams, network/sensor, Internet of Things, video/audio. Information ingestion: data integration, stream processing, master data, entity analytics, Unified Information Access (UIA). Landing area zone & archive: raw data, structured data, unstructured data. Exploration, analytics and discovery: descriptive, predictive, prescriptive, operative; real-time analytics, text analytics, data mining, machine learning, complex event processing. Outputs: intelligence analysis, decision management, BI & predictive analytics, reporting & discovery, business processes]
Data-Analytic Thinking
One of the most critical aspects of data science is the support of
data-analytic thinking throughout the organization →
a data-oriented business environment
• A basic understanding of the fundamental principles is needed →
– In order to assess and envision opportunities accurately (data-analytics projects)
– Professional advantage in being able to interact competently (data-analytics team)
– Business units must interact with the data science team (domain knowledge)
– Data science projects require close interaction with the business people
responsible for decision making
Conveying the message….
• Data mining is moving from the research arena into the pragmatic
world of business
• There is a continuous effort to refine algorithms and come up with
new ones
• With new developments in algorithms and architecture, small-scale
development teams can now build large-scale projects
• Practical data mining weighs the most advanced and accurate model
against its cost and complexity in a real-world business environment
• New analytics tools and platforms make data mining much easier
and more powerful for people at all levels of expertise
• The Hadoop-based computing ecosystem is evolving rapidly, making projects
with very large-scale datasets much more affordable
The Ladder Approach
• Build a foundation
– Learn to think analytically (data mining models, visualization, statistics etc.)
– Develop a strategy and road map based on business needs (pick a theme)
– C-level management engagement (presentation)
– Adopt a step-by-step process (problem definition → results: CRISP-DM)
– Pick and learn a tool (R, Python etc.)
– Practice on small datasets
• Build a portfolio
– Deliverable POCs and pilot projects (3-5) – Quick-wins
– Practice on small datasets
– Write-up findings (storytelling)
• Deliver solutions
– Adopt technology infrastructure (HDFS, MapReduce, NoSQL, Spark SQL, etc.)
– Ongoing revisions of models (data products)
– Continue to apply advanced analytics
Business Scope & Deliverables
Rethinking the Business & IT Model
Data Management &
Business Analytics are Core
Business Competencies
o The Business Owns the Data
o Recognize Analytics as a Business Driven and Owned Process
o Technology is an Enabler
Shift to Business
Configurable and Controlled
o Acknowledge the Difference between Software Development
and Business Analytics
o Redefine the IT Support Model to Enable The Business to
Acquire, Assess, Analyze, Test, and Deploy Analytical
Outcomes
Change the IT funding &
Financial Model
o Current Infrastructure Model is Geared towards Legacy &
Transactional Platform
o Recognize Analytics as a Business Driven and Owned Process
o Technology is an Enabler
Slide source: MetLife presentation, IBM Big Data conference, October 2013
Big Data Adoption
• Outlining a work plan
• Examining a scenario (one or more) for implementation
• A 10-session Big Data, Data Science & Data Mining course; an 8-session Data-Analytic Thinking course for business people
• Establishing a «360°» group:
– R&D Team
– Infrastructure & Operations
– Business Unit Analysts
– Business IT Support Team
The Data Journey
Estimate: 80% of the information in the organization is unstructured
and unmodeled, and is therefore unavailable for analysis
with the existing, traditional tools
As a first step, we will not boil the ocean...
Big Data doesn’t have to be big – it can be managed
and built incrementally.
Big Data may or may not include social media
(eventually it will).
Big Data may or may not include external data
(eventually it will).
Sometimes information is good enough.
Data Management before Business Analytics
What is done in the organization today:
1. OLAP: dimensional reports on product, marketing, pricing
2. Data mining models ...?
Internal Data
Existing operational information in the data warehouse
New Internal Data (Dark Data)
Existing information that is not defined in the data warehouse:
structured information, emails, textual information (agents, appraisers...)
New External Data
Information from external sources: the Internet, competitors,
social networks, location-based cellular data, telematics,
sensors and more
Data Products
• Motivation: turning data assets → products and services
• A data product is an algorithm, software, application,
presentation or reproducible report based on data analytics
• A data product is the production output of a statistical
analysis, data mining, text mining, AI etc.
• “A data product is a product that facilitates an end goal through the use of data.” (DJ Patil)
• Initially online companies:
– search algorithms (Google)
– similar offerings (Amazon)
– recommendations for “people you may know” (Facebook)
• Developing and launching data products, particularly if you
are an offline business → it won’t be second nature...
Data-as-a-Service (DaaS) - a cloud strategy used to facilitate the accessibility of business-critical
data in a well-timed, protected and affordable manner → B2B "renting" data service
The Model Assembly Line
Do you have the data? → Do you own the data? → Data quality? → Business model → Type of analysis → Competitive adv.
Do you have the data?
Do you own the data? (legal issues, consider
anonymized personal data)
Is it high-quality and useful data?
Do you have a business model? (bundling,
selling, free)
What types of analysis are you offering?
(descriptive analytics vs. predictive analytics)
Do you have differentiation or competitive
advantage? (proprietary vs. commodity data)
The Model Assembly Line: A case study of DaaS
Cellular companies
Do you have the data? → Do you own the data? → Data quality? → Business model → Type of analysis → Competitive adv.
Location-based information – city centers
• Social application developers
• Geographic segmentation - city center, shopping areas, entertainment areas, business centers
• Data update frequency - daily/weekly/monthly
• Data accessibility - Online/Batch
• Customer classification – business, private
• Communication type - Voice, SMS
Location-based information – main traffic arteries
• Municipalities and government planning institutions
• Geographic segmentation - city, suburb
• Road type - intercity expressway, urban, motorway
• Data update frequency - daily/weekly/monthly
• Data accessibility - Online/Batch
Pricing Models
• Volume based model: quantity-based pricing (amount), pay-per-call (PPCall)
• Data type based model: based on the type or attribute of data
• Subscription based model: an unlimited amount of data
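The three pricing models can be compared side by side in a short Python sketch. This is an illustration under stated assumptions, not a pricing scheme from the slides: the function names, the price list and every rate below are hypothetical.

```python
def volume_based_price(num_calls, price_per_call):
    """Volume based model: pay per call (PPCall), so cost grows with usage."""
    return num_calls * price_per_call


def data_type_based_price(requested_attributes, price_by_attribute):
    """Data type based model: cost depends on which data attributes are requested."""
    return sum(price_by_attribute[name] for name in requested_attributes)


def subscription_price(months, monthly_fee):
    """Subscription based model: flat fee for an unlimited amount of data."""
    return months * monthly_fee


# Hypothetical price list (assumption for illustration only).
PRICE_BY_ATTRIBUTE = {"location": 5.0, "customer_type": 2.0, "communication_type": 1.0}

volume_cost = volume_based_price(num_calls=1000, price_per_call=0.5)
type_cost = data_type_based_price(["location", "customer_type"], PRICE_BY_ATTRIBUTE)
subscription_cost = subscription_price(months=12, monthly_fee=100.0)
```

Under these made-up rates a heavy user would prefer the subscription, while an occasional user would prefer pay-per-call; the break-even point depends entirely on the chosen rates.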
Implementation Approaches
• The Full Service Approach: Relying on a 3rd party to develop
and maintain the model
• The Full Control Approach: In house model development and
deployment
• The Consultant Approach: Hybrid methodology
Implementation Approaches: The Full Service Approach
o Pros:
o the ideal solution for companies who are resource constrained
o the ideal solution for companies lacking technical and analytics staff
o the model development can rely on expertise provided by the vendor
o the quickest path to implementation
o Cons:
o reliance on the vendor to provide a solution without any independent review
o not being able to make changes to the model directly
o internal staff is not trained to ensure attainment of desired results
Implementation Approaches: The Full Control Approach
o Pros:
o the ideal solution for companies with analytics and IT resources
o helps to protect IP in case of a novel idea or product
o this approach offers the most flexibility in making revisions or customizations to
the model
o Cons:
o the firm can’t take advantage of any data or expertise accumulated by vendors
and consultants
o if a fundamental modeling error has been made, it may never be discovered
o historically the slowest path to deployment, with successful implementations
measured in years(?)
Implementation Approaches: The Consultant Approach
Build your own core competencies coupled
with high-end data science consultancy
o Pros:
o the ideal solution for companies lacking depth in their analytics department, but
who have available resources in systems and IT
o there is a built-in “independent review” phase in this approach
o companies are able to make changes directly to the model as needed
o Cons:
o if companies lack internal technical or analytical resources, they may be at the
mercy of the vendor in the future should a model update or revision be needed
o some companies attempt to update vendor models, but lack in-depth
knowledge of the modeling techniques used. As a result, they may inadvertently
make fundamental modeling errors
o requires continuous management attention
Roles in Data Science
• Data Scientist
– Applied statistician X computer scientist
– “Data Scientist (noun): better at statistics than any software engineer and
better at software engineering than any other statistician” (Josh Wills)
– Skills: computer science, math, statistics, machine learning, domain expertise,
communication and presentation skills, data visualization
– No one person can be the perfect data scientist
→ A team ….?
“…shortage of 140,000 to 190,000
people with deep analytical skills
as well as 1.5 million managers and
analysts to analyze big data…”
(McKinsey, 2011)
Data Scientist
Skills required to exploit big data
• Skills to work with business stakeholders to understand the
business issue and context
• Analytical and decision modeling skills for discovering relationships
within data and proposing patterns
• Data management skills are required to build the relevant dataset
used for the analysis.
• Broad combination of soft and technical skills
Sample of Program Offerings
DB - Databases
BI – Business Intelligence, Data Warehousing
ST – Advanced-Level Statistics
BA – Business Analytics, Web Analytics
DM – Data Mining, Machine Learning, Text Mining, Natural-Language Processing
BD – Big Data Technologies, Visualization
KM – Knowledge Management, Social-Web Analysis
Cosmologists of the digital universe
http://online.wsj.com/article/SB10001424127887323478304578332850293360468.html?mod=itp
Building Models – Introduction
A model captures the knowledge exhibited by the data and encodes
it in some language…no model can perfectly represent the real
world
Automatic or semi-automatic extraction of
• Interesting
• Non-trivial
• Implicit
• Previously unknown
• Potentially useful
Forecasting what may happen in the future
Classifying items into groups by recognizing patterns
Clustering items into groups based on their attributes
Associating what events are likely to occur together
Sequencing what events are likely to lead to later events
Building Models – Introduction
Models fall into two categories of data mining: descriptive and
predictive
Predictive Tasks
Use some variables to predict unknown or future values of other
variables
Descriptive Tasks
Find human-interpretable patterns that describe the data
Supervised learning
Unsupervised learning
Meta learning (ensemble learners)
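The difference between a predictive (supervised) task and a descriptive (unsupervised) task can be sketched in a few lines of Python. The sketch below uses two deliberately simple techniques chosen for illustration, a 1-nearest-neighbor classifier and k-means with k=2 on one-dimensional data; the "monthly spend" values and churn labels are hypothetical.

```python
def nearest_neighbor_predict(train, query):
    """Supervised: return the label of the labeled example closest to `query`."""
    closest = min(train, key=lambda example: abs(example[0] - query))
    return closest[1]


def two_means(values, iterations=10):
    """Unsupervised: split 1-D values into two groups (k-means with k=2)."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


# Hypothetical data: monthly spend, with a known outcome only in the supervised case.
labeled = [(10, "churn"), (12, "churn"), (95, "stay"), (100, "stay")]
unlabeled = [10, 12, 11, 95, 100, 98]

predicted = nearest_neighbor_predict(labeled, 90)   # uses known target values
low_group, high_group = two_means(unlabeled)        # finds structure without targets
```

The supervised function needs a target value for every training example, while the unsupervised one only looks for natural groups in the data.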
Types of Data Mining Tasks
Many business problems have as an important component one of these DM
tasks:
Unsupervised
• Affinity grouping (a.k.a. “associations”, “market-basket analysis”)
– What items are commonly purchased together?
• Similarity Matching
– What other companies are like our best small business
customers?
• Description/Profiling
– What does “normal behavior” look like?
• Clustering
– Do my customers form natural groups?
Supervised
• Predictive Modeling (including causal modeling & link prediction)
– Will customer X churn next month/default on her loan?
– How much would prospect X spend?
– Who might be good “friends” on our social networking site?
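Affinity grouping ("What items are commonly purchased together?") is usually measured with support and confidence. A minimal Python sketch of both measures, assuming a hypothetical set of four shopping baskets invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]


def pair_support(baskets):
    """Support: fraction of baskets that contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Confidence: share of baskets with `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(baskets)
bread_to_milk = confidence(baskets, "bread", "milk")
```

Pairs are stored as alphabetically sorted tuples, so the pair of bread and milk is looked up as `support[("bread", "milk")]`.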
Data Mining vs. Deployment
Merging Traditional & Big Data
approaches
Merging Traditional & Agile approaches
• Time to market – slow process
• Disconnect between the business people (consumers)
and IT people (producers)
• The overall cost is high
Breaking down the walls
• Discovery process and not a traditional SW development project
• Business owns the data…
Codification of The Process
Extracting useful knowledge from data to solve business problems
can be treated systematically by following a process with reasonably
well-defined stages
CRISP-DM - The Cross-Industry Standard Process for Data Mining (www.crisp-dm.org) (Shearer, 2000)
Structured process with critical points:
• Human intuition
• High-powered analytical tools
A well-understood process that places a structure on a problem, which still
involves art…
science + craft + creativity + common sense
CRISP-DM
This process diagram makes explicit the fact that
iteration is the rule rather than the exception… not a
linear process
Diagram callouts:
• The point of actually using your results
• Both mathematical and logical
• Preparatory activity: what data? where is the data? accuracy and reliability of the data
• The most substantial component (65%): time-consuming and labor-intensive
CRISP-DM
Business Understanding
A creative problem formulation - what is the problem ?
Think carefully about the use scenario and the actual business need
• What exactly do we want to do?
• How exactly would we do it?
• What parts of this use scenario constitute possible data mining models?
Data Understanding
It is important to understand the strengths and limitations of the data.
Historical data often are collected for purposes unrelated to the current business
problem.
Estimating the costs and benefits of each data source
Data having varying degrees of reliability
• Cost of acquiring the data
• Data manipulation
• Data quality
CRISP-DM
Data Preparation
Pre-processing tasks
• Data conversions
• Data transformations (e.g., normalization, scaling etc.)
• Missing values, Outliers
• Redundant or non-informative features (i.e., feature selection, between-predictors
correlations)
• Dimensionality reduction techniques (e.g., PCA, SVD)
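Three of the pre-processing tasks listed above (missing values, normalization, outliers) can be sketched with the Python standard library. The strategies shown, mean imputation, min-max scaling and z-score outlier flagging, are common choices picked here for illustration; the sample values and the threshold are assumptions.

```python
from statistics import mean, pstdev


def impute_missing(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def flag_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds `threshold`."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical raw feature columns (illustrative values only).
ages = [20, None, 40, 60]
spend = [10, 11, 12, 10, 11, 12, 10, 11, 12, 100]

ages_clean = impute_missing(ages)
ages_scaled = min_max_scale(ages_clean)
spend_outliers = flag_outliers(spend, threshold=2.0)
```

Whether a flagged value should be removed, capped or kept is a business decision; the code only surfaces the candidates.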
Modeling
The primary place where data mining techniques are applied to the data
It is important to have some understanding of the fundamental ideas of data
mining, including the sorts of techniques, algorithms and tuning parameters.
Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain
confidence that they are valid and reliable before moving on.
Measuring model performance and generalization
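Model performance is typically measured on held-out data that was not used to build the model, which is what gives an estimate of generalization. A minimal Python sketch of three common measures (accuracy, precision, recall), assuming hypothetical held-out labels and predictions for a churn model:

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def evaluate(actual, predicted, positive):
    """Return accuracy, precision and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical held-out labels and model predictions (illustrative only).
actual = ["churn", "stay", "stay", "churn", "stay", "stay", "churn", "stay"]
predicted = ["churn", "stay", "churn", "stay", "stay", "stay", "churn", "stay"]

metrics = evaluate(actual, predicted, positive="churn")
```

Accuracy alone can mislead when one class is rare, which is why precision and recall for the class of interest are reported alongside it.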
Basic Principles - Privacy
• Collection limitation - Data should be obtained lawfully and fairly,
while some very sensitive data should not be held at all.
• Data quality - Data should be relevant to the stated purposes,
accurate, complete, and up-to-date; proper precautions should be
taken to ensure this accuracy.
• Purpose specification - The purposes for which data will be used
should be identified, and the data should be destroyed if it no longer
serves the given purpose.
• Use limitation - Use of data for purposes other than specified is
forbidden.
Source: Organisation for Economic Co-operation and Development (OECD), 1980.
Data Science Course
• Applications and uses of Big Data
• A range of data mining models for Predictive and Descriptive Analytics and Exploratory Data Analysis, including among others:
– Cluster Analysis
– Association Analysis
– Decision Trees & Random Forest
– Support Vector Machine
– Neural Networks
– Anomaly Detection
• Social Network Analysis, Graph mining and basic concepts such as:
– Degree & Degree Distribution
– Centrality, Betweenness, Closeness
– Centralization and more
• NLP-based methods for mining textual data for Text Categorization and Information Extraction; basic concepts of Information Retrieval and Bag-Of-Words representations of textual data
• Using the R environment for statistical exploration, mining and presentation of data
• Visualization and graphics approaches for applications based on textual data analysis (co-occurrences network, neighborhood graph and more)
• Advanced technologies for data management, and storage and processing architectures
• The CRISP-DM model for managing business analytics projects
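Two of the graph concepts named in the course outline, degree distribution and closeness centrality, can be sketched in plain Python. The five-node friendship network below is hypothetical, and the closeness formula used is the common (n - 1) / (sum of shortest-path distances) form, which assumes a connected graph.

```python
from collections import Counter, deque

# Hypothetical undirected network as an adjacency list (illustrative only).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}


def degree(graph):
    """Degree: number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def degree_distribution(graph):
    """Degree distribution: how many nodes have each degree."""
    return dict(Counter(degree(graph).values()))


def closeness(graph, source):
    """Closeness: (reachable nodes - 1) / sum of shortest-path distances (BFS)."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


degrees = degree(graph)
distribution = degree_distribution(graph)
closeness_a = closeness(graph, "A")
closeness_e = closeness(graph, "E")
```

In this toy network node A is more central than node E because it reaches every other node in fewer steps.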
Why R?
• R is a free and open source language and environment for statistical
computing and graphics.
• R is among the most popular of the leading software packages for
statistical analysis.
• Key features:
– It’s mature and widely used (NYT)
– Excellent graphics capabilities: http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html
– Highly extensible, with over 4300 user-contributed packages
– It’s easy to use and has excellent online help and associated documentation
– http://cran.r-project.org/other-docs.html - manuals, tutorials, etc. provided by
users of R
Big data represents a process with evolutionary trends:
complexity, diversity and specialization
Thank you for listening
[email protected]