MINING OF FINANCIAL DATABASES
INTRODUCTION
Jerzy KORCZAK
email: [email protected]
http://www.korczak-leliwa.pl
http://kti.ue.wroc.pl
MFD – Schedule and grades
Lectures: Thursday, 17:20-19:35, CKU 110, 15 hours
Computer Labs: 15 hours
gr.1 Wednesday, 15:00-17:15, 002Z1
gr.2 Thursday, 15:00-17:15, 003Z1
Final grade = AVG (Ass1+Ass2+Ass3) ± D activity
Outline
• Introduction – concept of data-driven knowledge discovery
• History of data mining
• Statistics vs. data mining
• Overview of data sets, databases
• CRISP - Data mining methodology
• Business Requirements
• Research progress
Data Mining
• Data mining (knowledge discovery in databases, KDD)
– Extraction of interesting, non-trivial, implicit, previously unknown and potentially useful information (knowledge) or patterns from data in large databases or other information repositories
• Scientific point of view: data abstraction and KDD
• Commercial point of view: competitive pressure
• Necessity is the mother of invention
– Data is everywhere, so data mining should be everywhere, too!
– Understand and use data: an imminent task!
Origins of Data Mining
• Draws ideas from AI/machine learning, pattern recognition, statistics, and database systems
• Traditional Techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Diagram: Data Mining at the intersection of Artificial Intelligence, Statistics, Machine Learning, Pattern Recognition, Database systems, and Visualisation]
Statistics vs Data Mining
• Statistics: a discipline dedicated to data analysis
• What are the differences?
– Huge amount of data, in gigabytes to terabytes
– Fast computer: quick response, interactive analysis
– Multi-dimensional, powerful, thorough analysis
– High-level, “declarative”: user’s ease and control
– Automated or semi-automated: mining functions hidden or built-in in many systems
Data Sets, Database, Images
• Relational database: a commodity of every enterprise
• Huge data warehouses are under construction
• POS (Point of Sales): Transactional DBs in terabytes
• Object-relational databases, distributed, heterogeneous, and legacy databases
• Spatial databases (GIS), remote sensing database (EOS), and scientific/engineering databases
• Time-series data (e.g., stock trading) and temporal data
• Text (documents, emails) and multimedia databases
• WWW: A huge, hyper-linked, dynamic, global information system
Figure source: Healy J., Why what happens in an internet minute really matters, M2M
Types of Decision-Support Systems (DSS)
Model-driven DSS:
• Primarily stand-alone systems
• Use a strong theory or model to perform “what-if” analyses
Data-driven DSS:
• Integrated with large pools of data in major enterprise systems and Web sites
• Support decision making by enabling user to extract useful information
• Data mining: can obtain types of information such as associations, sequences, classifications, clusters, and forecasts
A Multi-Dimensional View of Data Mining
• Databases to be mined
Relational, transactional, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
• Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, genetics, etc.
• Applications adapted
Decision Support Systems, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, image interpretation, Web mining, etc.
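Association is one of the kinds of knowledge listed above. As a minimal sketch of what mining an association means in practice, the Python snippet below computes the support and confidence of a rule over a handful of market-basket transactions. The transactions, item names, and function names are hypothetical and chosen only for illustration; they do not come from the course material.

```python
# Hypothetical market-basket transactions (illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


if __name__ == "__main__":
    # Rule {milk, diaper} -> {beer}
    print(support(TRANSACTIONS, {"milk", "diaper"}))
    print(confidence(TRANSACTIONS, {"milk", "diaper"}, {"beer"}))
```

An algorithm such as Apriori searches for all rules above chosen support and confidence thresholds; the sketch only scores one rule that is given in advance.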
What is Data Mining?
Many definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks...
Data
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
11 | No | Married | 60K | No
12 | Yes | Divorced | 220K | No
13 | No | Single | 85K | Yes
14 | No | Married | 75K | No
15 | No | Single | 90K | Yes
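To make "discovering meaningful patterns" concrete, the following Python sketch tabulates the share of Cheat = Yes records for each value of an attribute in the example table. It is an illustration only, not a method prescribed by the slides. The function name is hypothetical, and the income of record 3 is read as 70K, the value that appears displaced in the extracted table.

```python
from collections import defaultdict

# (Tid, Refund, Marital Status, Taxable Income in thousands, Cheat)
# Assumption: record 3 has income 70K (the value is displaced in the extraction).
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
    (11, "No", "Married", 60, "No"),
    (12, "Yes", "Divorced", 220, "No"),
    (13, "No", "Single", 85, "Yes"),
    (14, "No", "Married", 75, "No"),
    (15, "No", "Single", 90, "Yes"),
]

REFUND, MARITAL, INCOME, CHEAT = 1, 2, 3, 4  # column positions


def cheat_rate_by(column, records=RECORDS):
    """Share of Cheat == 'Yes' records for each distinct value of `column`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [cheaters, total]
    for row in records:
        counts[row[column]][1] += 1
        if row[CHEAT] == "Yes":
            counts[row[column]][0] += 1
    return {value: yes / total for value, (yes, total) in counts.items()}


if __name__ == "__main__":
    print(cheat_rate_by(REFUND))
    print(cheat_rate_by(MARITAL))
```

In this table no record with Refund = Yes and no Married record is labelled Cheat = Yes, which is the kind of regularity a classification technique would pick up automatically.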
A Brief History of Data Mining
• Most scientific discoveries involve “data mining”
– Kepler’s Law, Newton’s Laws, periodic table of chemical elements, …, from “big bang” to DNA
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995- now International Conferences on Knowledge Discovery in Databases and Data Mining (KDD)
• 1998 ACM SIGKDD, SIGKDD’1999-2005 conferences, and SIGKDD Explorations
• More conferences on data mining
– Journal of Data Mining and Knowledge Discovery (1997)
– PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.
Mining of Big Data
• Concepts of Big Data:
– “Big Data” is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value
• Requires new data architectures, analytic sandboxes
• New tools
• New analytical methods
• Integrating multiple skills into new role of data scientist
– Organizations are deriving business benefit from analyzing ever larger and more complex data sets that increasingly require real-time or near-real time capabilities
Source: McKinsey May 2011 article Big Data: The next frontier for innovation, competition, and productivity
Types of Data Structures:
• Structured Data
• Semi-Structured Data
Example shown: View Source (of a web page)
• Quasi-Structured Data
Example shown: http://www.google.com/#hl=en&sugexp=kjrmc&cp=8&gs_id=2m&xhr=t&q=data+scientist&pq=big+data&pf=p&sclient=psyb&source=hp&pbx=1&oq=data+sci&aq=0&aqi=g4&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.,cf.osb&fp=d566e0fbd09c8604&biw=1382&bih=651
• Unstructured Data
Example shown: The Red Wheelbarrow, by William Carlos Williams
Data Structures
(ordered from more structured to less structured)
Structured
• Data containing a defined data type, format, structure
• Example: Transaction data and OLAP
Semi-Structured
• Textual data files with a discernible pattern, enabling parsing
• Example: XML data files that are self describing and defined by an xml schema
“Quasi” Structured
• Textual data with erratic data formats, can be formatted with effort, tools, and time
• Example: Web clickstream data that may contain some inconsistencies in data values and formats
Unstructured
• Data that has no inherent structure and is usually stored as different types of files.
• Example: Text documents, PDFs, images and video
Business Requirements
Current Business Problems Provide Opportunities for Organizations to Become More Analytical & Data Driven
Driver | Examples
1 Desire to optimize business operations | Sales, pricing, profitability, efficiency
2 Desire to identify business risk | Customer churn, fraud, default
3 Predict new business opportunities | Upsell, cross-sell, best new customer prospects
4 Comply with laws or regulatory requirements | Anti-Money Laundering, Fair Lending, Basel II
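Quasi-structured data such as web clickstream URLs "can be formatted with effort, tools, and time". The Python sketch below shows one way this might look, using the standard-library URL parser to turn a search URL into named fields. The URL is a shortened, illustrative variant of the kind of search URL used as a clickstream example, and the function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def clickstream_fields(url):
    """Turn the key=value pairs of a URL into a dictionary of fields.

    Some sites put their parameters after '#' (the fragment) rather than
    after '?' (the query), so the fragment is tried first.
    """
    parts = urlsplit(url)
    raw = parts.fragment or parts.query
    # keep_blank_values=True keeps parameters such as "gs_sm=" that carry
    # no value, which is one of the inconsistencies typical of such data.
    fields = dict(parse_qsl(raw, keep_blank_values=True))
    fields["host"] = parts.netloc
    return fields


if __name__ == "__main__":
    # Shortened, illustrative URL (assumption, not a complete original).
    url = "http://www.google.com/#hl=en&q=data+scientist&pq=big+data&gs_sm=&biw=1382&bih=651"
    for key, value in clickstream_fields(url).items():
        print(key, "=", repr(value))
```

All values come back as strings; deciding which ones are numbers (for example `biw`) or free text (`q`) is part of the formatting effort the slide refers to.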
Data Analytics – Development Methodology
Cross Industry Standard Process for Data Mining, known by its acronym CRISP-DM [ESPRIT, 1996].
• Data Analytics Lifecycle
(six phases; the question under a phase is the one to answer before moving on)
1 Discovery
Do I have enough information to draft an analytic plan and share for peer review?
2 Data Prep
Do I have enough good quality data to start building the model?
3 Model Planning
Do I have a good idea about the type of model to try? Can I refine the analytic plan?
4 Model Building
Is the model robust enough? Have we failed for sure?
5 Communicate Results
6 Operationalize
Data Analytics – Development (cont.)
• Phase 1: Discovery
• Formulate Initial Hypotheses
– IH, H1, H2, H3, … Hn
– Gather and assess hypotheses from stakeholders and domain experts
– Preliminary data exploration to inform discussions with stakeholders during the hypothesis forming stage
• Identify Data Sources – Begin Learning the Data
– Aggregate sources for previewing the data and provide high-level understanding
– Review the raw data
– Determine the structures and tools needed
– Scope the kind of data needed for this kind of problem
Data Analytics – Development (cont.)
• Phase 2: Data Preparation
• Prepare Analytic Sandbox
– Work space for the analytic team
– 10x+ vs. EDW
• Perform ELT
– Determine needed transformations
– Assess data quality and structuring
– Derive statistically useful measures
– Extract data and determine data connections for raw data, OLTP transactions, OLAP cubes or data feeds
– Big ELT and Big ETL
• Useful Tools for this phase:
For Data Transformation & Cleansing: SQL, Hadoop, MapReduce, Alpine Miner
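For the Phase 2 step "Assess data quality and structuring", a first pass often consists of counting missing and distinct values per column. The Python sketch below is one minimal way to do that. The sample rows, the set of markers treated as missing, and the function names are assumptions made for illustration.

```python
# Markers treated as "missing" (assumption; adjust to the data source).
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def is_missing(value):
    """True for None and for the usual textual placeholders."""
    return value is None or str(value).strip() in MISSING_MARKERS


def profile(rows):
    """Per column: number of missing values, distinct values, completeness."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if not is_missing(v)]
        report[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "completeness": len(present) / len(rows),
        }
    return report


if __name__ == "__main__":
    # Hypothetical extract from a transaction feed.
    sample = [
        {"id": 1, "amount": "125.0", "currency": "PLN"},
        {"id": 2, "amount": "", "currency": "PLN"},
        {"id": 3, "amount": "70.5", "currency": None},
        {"id": 4, "amount": "N/A", "currency": "EUR"},
    ]
    for column, stats in profile(sample).items():
        print(column, stats)
```

A report like this helps answer the phase question of whether there is enough good quality data to start building the model.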
Data Analytics – Development (cont.)
• Phase 3: Model Planning
• Determine Methods
– Select methods based on hypotheses, data structure and volume
– Ensure techniques and approach will meet business objectives
• Techniques & Workflow
– Candidate tests and sequence
– Identify and document modeling assumptions
• Useful Tools for this phase: R/PostgreSQL, SQL Analytics, Alpine Miner, SAS/ACCESS, SPSS/ODBC
Data Analytics – Development (cont.)
• Phase 3 - Model Planning
• How do people generally solve this problem with the kind of data and resources I have?
• Does that work well enough? Or do I have to come up with something new?
• What are related or analogous problems? How are they solved? Can I do that?
Data Analytics – Development (cont.)
• Phase 3 - Model Planning
• Data Exploration
• Variable Selection
– Inputs from stakeholders and domain experts
– Capture essence of the predictors, leverage a technique for dimensionality reduction
– Iterative testing to confirm the most significant variables
• Model Selection
– Conversion to SQL or database language for best performance
– Choose technique based on the end goal
Data Analytics – Development
The Problem to Solve | The Category of Techniques | Algorithms
I want to group items by similarity. I want to find structure (commonalities) in the data | Clustering | K-means clustering
I want to discover relationships between actions or items | Association Rules | Apriori
I want to determine the relationship between the outcome and the input variables | Regression | Linear Regression, Logistic Regression
I want to assign (known) labels to objects | Classification | Naïve Bayes, Decision Trees
I want to find the structure in a temporal process. I want to forecast the behavior of a temporal process | Time Series Analysis | ACF, PACF, ARIMA
I want to analyze my text data | Text Analysis | Regular expressions, Document representation (Bag of Words), TFIDF
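As an illustration of the first row of the technique table (grouping items by similarity with K-means clustering), here is a minimal Python sketch of the standard assign-and-update iteration. It is not a production implementation: the initial centroids are passed in explicitly so that the run is deterministic, whereas real implementations usually choose them at random, and the sample points are hypothetical.

```python
import math


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain k-means: alternate between assigning points to the nearest
    centroid and moving each centroid to the mean of its points."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: centroid = coordinate-wise mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    # Hypothetical 2-D points forming two visible groups.
    data = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]
    centres, groups = kmeans(data, initial_centroids=[(1, 1), (8, 8)])
    print(centres)
    print(groups)
```

The number of clusters is fixed by the number of initial centroids; choosing it is itself part of model planning.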
Data Analytics – Development (cont.)
• Phase 4: Model Building
• Develop data sets for testing, training, and production purposes
– Need to ensure that the model data is sufficiently robust for the model and analytical techniques
– Smaller test sets for validating approach, training dataset for initial experiments
• Get the best environment you can for building models and workflows…fast hardware, parallel processing
Useful Tools for this phase: R, PL/R, SQL, Alpine Miner, SAS Enterprise Miner
Data Analytics – Development (cont.)
• Phase 5: Communicate Results
Did we succeed? Did we fail?
• Interpret the results
• Compare to IH’s from Phase 1
• Identify key findings
• Quantify business value
• Summarizing findings, depending on audience
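The Phase 4 step of developing separate data sets for training and testing can be sketched as a reproducible random split. The Python snippet below is one minimal way to do it; the 60/20/20 proportions, the validation subset, the seed, and the function name are assumptions for illustration and are not prescribed by the slides.

```python
import random


def split_dataset(records, train_fraction=0.6, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and cut it into train/validation/test.

    The fixed seed makes the split repeatable, so experiments can be rerun
    on exactly the same partitions.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


if __name__ == "__main__":
    train, validation, test = split_dataset(range(100))
    print(len(train), len(validation), len(test))
```

For time-ordered financial data a chronological cut is often preferred over a random shuffle, so that the model is never tested on observations older than its training data.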
Data Analytics – Development (cont.)
• Phase 6: Operationalize
• Run a pilot
• Assess the benefits
• Deliver final deliverables
• Model execution in production environment
• Define process to update and retrain the model, as needed
Data Analytics – Core deliverables
Presentation for Project Sponsors
1. “Big picture” takeaways for executive level stakeholders
2. Determine key messages to aid their decision-making process
3. Focus on clean, easy visuals for the presenter to explain and for the viewer to grasp
Presentation for Analysts
1. Business process changes
2. Reporting changes
3. Fellow Data Scientists will want the details and are comfortable with technical graphs (such as ROC curves, density plots, histograms)
Code for technical people
Technical specs of implementing the code
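The analyst presentation mentions ROC curves. As a hedged illustration of the number usually reported alongside such a curve, the Python sketch below computes the area under the ROC curve (AUC) directly from its probabilistic meaning: the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name and the sample scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive, 0 for negative examples.
    scores: model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical labels and classifier scores.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise loop is quadratic in the number of examples, which is fine for a presentation-sized sample; larger evaluations normally use a sort-based computation.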
Data Analytics – Key roles
Role | Description
Business User | Someone who benefits from the end results and can consult and advise project team on value of end results and how these will be operationalized
Project Sponsor | Person responsible for the genesis of the project, providing the impetus for the project and core business problem, generally provides the funding and will gauge the degree of value from the final outputs of the working team
Project Manager | Ensure key milestones and objectives are met on time and at expected quality.
Business Intelligence Analyst | Business domain expertise with deep understanding of the data, KPIs, key metrics and business intelligence from a reporting perspective
Data Engineer | Deep technical skills to assist with tuning SQL queries for data management, extraction and support data ingest to analytic sandbox
Database Administrator (DBA) | Database Administrator who provisions and configures database environment to support the analytical needs of the working team
Data Scientist | Provide subject matter expertise for analytical techniques, data modeling, applying valid analytical techniques to given business problems and ensuring overall analytical objectives are met
Business Requirements
• Objectives of the problem decomposition:
– Focus your time
– Ensure rigor and completeness
– Enable better transition to members of the cross-functional analytic teams
• Repeatable
• Scale to additional analysts
• Support validity of findings
Business Requirements
• How do you currently approach your analytics problems?
• Do you follow a methodology or some kind of framework?
• How do you plan for an analytic project?
Business Requirements
• Analytical Approaches for Meeting Business Drivers
[Chart: BUSINESS VALUE (Low to High) against TIME (Past to Future); Business Intelligence sits at the low/past end, Data Science at the high/future end]
Predictive Analytics & Data Mining (Data Science)
Typical Techniques & Data Types
• Optimization, predictive modeling, forecasting, statistical analysis
• Structured/unstructured data, many types of sources, very large data sets
Common Questions
• What if…..?
• What’s the optimal scenario for our business?
• What will happen next? What if these trends continue? Why is this happening?
Business Intelligence
Typical Techniques & Data Types
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable data sets
Common Questions
• What happened last quarter?
• How many did we sell?
• Where is the problem? In which situations?
Business Requirements
• A typical analytical architecture
[Diagram, stages numbered 1 to 4. Labels: Data Sources; Departmental Warehouse; “Spread Marts”; Enterprise Applications; Reporting; Siloed Analytics. Annotations: Static schemas accrete over time; Prioritized Operational Processes; Non-Prioritized Data Provisioning; Errant data & marts; Non-Agile Models]
Business Requirements
• A typical analytical architecture (cont.)
[Same diagram as on the previous slide]
Business Requirements
Implications of Typical Architecture for Data Science
– High-value data is hard to reach and leverage
– Predictive analytics & data mining activities are last in line for data
• Queued after prioritized operational processes
– Data is moving in batches from EDW to local analytical tools
• In-memory analytics (such as R, SAS, SPSS, Excel)
• Sampling can skew model accuracy
– Isolated, ad hoc analytic projects, rather than centrally-managed harnessing of analytics
• Non-standardized initiatives
• Frequently, not aligned with corporate business goals
Result: slow “time-to-insight” & reduced business impact
Business Requirements
• Opportunities for a new approach to analytics
[Diagram: 1 Data Devices; 2 Data Collectors; 3 Data Aggregators; 4 Data Users/Buyers. Labels: Individual, Analytic Services, Medical, Information Brokers, Advertising, Marketers, Employers, Law Enforcement, Internet, Government, Websites, Catalog Co-Ops, Phone/TV, Media, Media Archives, Retail, Credit Bureaus, Financial, Banks, List Brokers, Delivery Service, Private Investigators/Lawyers]
Data Mining
Research Progress
• TREND: AFTER C-C, Availability of BD IN-MEMORY Technology, LOWER cost, REAL TIME, IoT
• According to McKinsey, a retailer using big data to the full could increase its operating margin by more than 60%
• Bad data or poor data quality costs US businesses $600 billion annually
• According to Gartner, Big Data: $232 billion in spending through 2016.
• By 2016, 5 million IT jobs globally were created to support big data, generating 2 million IT jobs in the US.
Conclusions and Perspectives
• Multi-dimensional data analysis: Data Warehouse and OLAP
• Association, correlation, and causality analysis
• Classification: scalability and new approaches
• Clustering and outlier analysis
• Sequential patterns and time-series analysis
• Similarity analysis: curves, trends, images, texts, etc.
• Text mining, Web mining and Weblog analysis
• Social networks, link analysis
• Spatial, multimedia, scientific data analysis
• Smart sensors: IoT
• Image classification and interpretation
• Data preprocessing and database compression
• Data visualization and visual data mining
• Many others, e.g., collaborative filtering