Data Mining - Zhangxi Lin

ISQS 6339: Data Management & Business Intelligence
Spring, 2017
Instructor: Zhangxi Lin
Office: BA E311
Phone: (806) 834-1926
E-mail: [email protected]
Homepage: http://zlin.ba.ttu.edu
Class meetings: TR 2-4:50p, BA287
Office hours: TR 10:30a-12:30p, or by appointment
About me
 PhD, IS, UT Austin, 1999 (at TTU since then)
 MS, Economics, UT Austin, 1996
 MEng, CS, Tsinghua University, Beijing, 1982
 EE, Tongji University, Shanghai, 1978-1979
 Hometown: Fuzhou, China
What is Business Intelligence?
 A simple definition: the applications and technologies that transform business data into action
 Business intelligence (BI) is a business management term
 It refers to the applications and technologies used to gather, provide access to, and analyze data and information about a company's operations.
 Business intelligence systems help companies gain more comprehensive knowledge of the factors affecting their business and make better business decisions.
 YouTube:
 What is BI? – B, 2’
 Global warming 0’31”
 World Economy & Population 2’45”
 Microsoft Business Intelligence Surface Demo 6’34”
ISQS 6339, Data Mgmt & BI
Data, information, and knowledge
 Data – a collection of raw value elements or facts used for calculating, reasoning, or measuring.
 Information – the result of collecting and organizing data in a way that establishes relationships between data items, which thereby provides context and meaning.
 Knowledge – the understanding of information based on recognized patterns, in a way that provides insight into the information.
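The three levels above can be illustrated with a small Python sketch. The purchase records, customer names, and the "three or more visits" threshold are made up for illustration and are not from the course materials.

```python
from collections import defaultdict

# Data: raw facts (hypothetical purchase records: customer, amount).
raw_data = [
    ("alice", 120.0),
    ("bob", 15.0),
    ("alice", 80.0),
    ("carol", 40.0),
    ("bob", 25.0),
    ("alice", 100.0),
]

def to_information(records):
    """Organize raw records by customer: visit count and total spend."""
    summary = defaultdict(lambda: {"visits": 0, "total": 0.0})
    for customer, amount in records:
        summary[customer]["visits"] += 1
        summary[customer]["total"] += amount
    return dict(summary)

def to_knowledge(information, min_visits=3):
    """Recognize a pattern: repeat buyers (threshold is an assumption)."""
    return sorted(c for c, s in information.items() if s["visits"] >= min_visits)

information = to_information(raw_data)
knowledge = to_knowledge(information)
print(information["alice"])
print(knowledge)
```

Grouping the records gives each value context (information); spotting which customers return repeatedly is the kind of pattern that supports a decision (knowledge).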
BI Problems
 Structured
 Detecting Credit card fraud
 Setting Loan parameters
 Market segmentation/Mass customization
 Deciding Marketing mix
 Customer Churn
 Reducing employee turnover
 Improving Quality/Efficiency
 …
 Unstructured
 Data exploration
 Utilization of resources (stored knowledge) to maximum effectiveness
…
BI Applications
 Customer Analytics
 Customer profiling
 Targeted marketing
 Personalization
 Collaborative filtering
 Customer satisfaction
 Customer lifetime value
 Customer loyalty
 Sales Channel Analytics
 Marketing
 Sales performance and pipeline
BI Applications (2)
 Supply Chain Analytics
 Supplier and vendor management
 Shipping
 Inventory control
 Distribution analysis
 Behavior Analysis
 Purchasing trends
 Web activity
 Fraud and abuse detection
 Customer attrition
 Social network analysis
Business Intelligence Evolution
[Figure: layers of BI evolution, most recent first]
 Stream Analytics*
 Real-time, continuous, sequential analysis (ranging from basic to advanced analytics)
 3rd-Generation BI
 Advanced Analytics/Optimization
 Rules
 Predictive Analytics
 Real-time and traditional Data Mining
 “New Traditional” Analytics
 “2.5-Gen” Analytics (In-Memory OLAP, Search-Based)
 Traditional Analytics
 1st Generation Analytics (Query & Reporting)
 2nd Generation Analytics (OLAP, Data Warehousing)
 Legacy BI
Source: Bill O’Connell, IBM, Aug 2007
Driving Force - Big Data
 A collection of data sets so large and complex that it
becomes awkward to work with using on-hand database
management tools.
 Difficulties include capture, storage, search, sharing,
analysis, and visualization.
 Videos
 What is big data 1’33”
 Big Data Analytics 3’05”
 Artificial intelligence & big data 1’54”
Data Scale
Big Data Companies
 IBM
 Oracle
 Facebook
 LinkedIn
 Cloudera (Hortonworks)
 Yahoo
 Amazon
 Google
 Airbnb
 An online marketplace and hospitality service, enabling people to lease or rent short-term lodging including vacation rentals, apartment rentals, homestays, hostel beds, or hotel rooms.
 Uber
 A transportation network company headquartered in San Francisco, California, operating in 528 cities worldwide.
 Palantir
 A private American software and services company headquartered in Palo Alto, California, which specializes in big data analysis.
 In January 2015, the company was valued at US$15 billion. This valuation rose to US$20.33 billion in late 2015 as the company closed an $880 million round of funding.
Cloud Computing
 Cloud computing is the use of computing resources (hardware and software) that are delivered as a service over a network (typically the Internet). The name comes from the use of a cloud-shaped symbol as an abstraction for the complex infrastructure it contains in system diagrams. Cloud computing entrusts remote services with a user's data, software and computation.
 Buzzwords: SaaS/IaaS/PaaS
Cloud versus cloud
 Amazon Elastic Compute Cloud
 Google App Engine
 Microsoft Azure
 GoGrid
 AppNexus
Case Study: Alibaba
 A privately owned e-commerce company, started in 1999, covering B2B, B2C (Tmall), C2C (Taobao), ePayment (Alipay, 49% market share), financing (AliFinance), and data-centric cloud computing services.
 Facts:
 One of the 20 most-visited websites globally.
 Accounts for over 60% of the parcels delivered in China.
 In 2012, handled 1.1 trillion yuan ($170 billion) in sales, more than competitors eBay and Amazon combined.
 Recent events
 IPO on the NYSE with a market cap of $200 billion
 Became the second largest e-commerce company in the world.
Z. Lin, ISQS Colloquium, 2014-02-28
Apache Hadoop
 An open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.
 The Apache Hadoop framework is composed of the following modules:
 Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
 Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines.
 Hadoop YARN - a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
 Hadoop MapReduce - a programming model for large-scale data processing.
 Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS) papers, respectively.
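The MapReduce programming model can be sketched in plain Python with a word count. This is only an in-memory illustration of the map, shuffle, and reduce steps; it does not use any Hadoop API, and the function names and sample documents are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    """Run the three phases in sequence on a list of text records."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input records.
docs = ["big data big ideas", "data mining finds ideas"]
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; here everything runs in one process to keep the idea visible.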
Hadoop 2: Big data's big leap forward
 The new Hadoop is the Apache Software Foundation's attempt to create a whole new general framework for the way big data can be stored, mined, and processed.
 The biggest constraint on scale has been Hadoop's job handling. In Hadoop 1, all jobs are run as batch processes through a single daemon called JobTracker, which creates a scalability and processing-speed bottleneck.
 Hadoop 2 uses an entirely new job-processing framework built using two daemons: ResourceManager, which governs all jobs in the system, and NodeManager, which runs on each Hadoop node and keeps the ResourceManager informed about what's happening on that node.
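The division of labor between the two daemons can be pictured with a toy Python model. This is not Hadoop code: the classes only borrow the daemon names, the "free slots" bookkeeping is a simplification (real YARN allocates containers with memory and CPU), and the node names and placement rule are assumptions.

```python
class NodeManager:
    """Toy per-node agent: reports its free capacity."""
    def __init__(self, node_id, free_slots):
        self.node_id = node_id
        self.free_slots = free_slots

    def heartbeat(self):
        # Report this node's status to the cluster-wide manager.
        return {"node": self.node_id, "free_slots": self.free_slots}

class ResourceManager:
    """Toy cluster-wide manager: tracks node reports and places jobs."""
    def __init__(self):
        self.cluster_state = {}   # node id -> free slots last reported
        self.assignments = {}     # job name -> node id (or None)

    def receive(self, report):
        self.cluster_state[report["node"]] = report["free_slots"]

    def schedule(self, job_name):
        # Assumed rule: pick the node with the most free slots.
        candidates = {n: s for n, s in self.cluster_state.items() if s > 0}
        node = max(candidates, key=candidates.get) if candidates else None
        if node is not None:
            self.cluster_state[node] -= 1
        self.assignments[job_name] = node
        return node

rm = ResourceManager()
for nm in (NodeManager("node-1", 1), NodeManager("node-2", 2)):
    rm.receive(nm.heartbeat())

placements = [rm.schedule(f"job-{i}") for i in range(4)]
print(placements)
```

The fourth job gets no node because the cluster is full, which is the kind of decision a single cluster-wide scheduler can make only if every node keeps it informed.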
MapReduce 2.0 – YARN (Yet Another Resource Negotiator)
Data Center
Big Data Analytics
 To understand the nature of a complex system from a huge amount of data, which consists of observations of the system.
 What is the core of big data technology?
 Data abstraction for knowledge creation
 Data mining – knowledge discovery by computer
 Visualization – human-computer interactive knowledge discovery
 To do the above we need to
 Collect data
 Manage data – database and data warehousing
Z. Lin, TTU/SWUFE, 01/06/2015
What is Data Mining?
 Many Definitions
 Non-trivial extraction of implicit, previously unknown and potentially useful information from data
 Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. (Berry and Linoff, 1997, 2000)
 Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)
 Data analytics 2’37”
 Visual data mining 4’32”
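As a small illustration of "discovering meaningful patterns" by automatic means, the Python sketch below counts which pairs of items are bought together often. The baskets and the support threshold are invented for the example, and pair counting is just one simple technique, not the method any of the cited definitions prescribes.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical shopping baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
patterns = frequent_pairs(baskets, min_support=3)
print(patterns)
```

Nothing in the input says that beer and diapers belong together; the pairing emerges from the counts, which is what makes the result implicit and previously unknown.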
ISQS 6347, Data & Text Mining
Origins of Data Mining
 Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
 Traditional techniques may be unsuitable due to
 Enormity of data
 High dimensionality of data
 Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Why Mine Data? Commercial Viewpoint
 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Computers have become cheaper and more powerful
 Competitive pressure is strong
 Provide better, customized services for an edge (e.g. in Customer Relationship Management)
ISQS 6339 Course Description
 Data management comprises all the disciplines related to managing data as a valuable resource.
 Business intelligence (BI) refers to the applications and technologies used to gather, provide access to, and analyze data and information about a company's operations.
 Three main topics
 Data warehousing
 Introductory data mining
 Big data and its trends
Syllabus
 Textbook and references
 Deliverables: projects, exercises
 Exams
 Grading policy
 Schedule
Application Tools
Microsoft SQL Server 2016
SAS Enterprise Guide
SAS Enterprise Miner
Hadoop – Hortonworks or CDH
Big Data Tools
 Pentaho: A company that offers Pentaho Business Analytics, a suite of open source Business Intelligence (BI) products. Founded in 2004 by five founders, headquartered in Orlando, FL, USA, and acquired by Hitachi in 2015 (https://en.wikipedia.org/wiki/Pentaho)
 Pentaho Data Integration (PDI)
 Pentaho for Big Data
 Pentaho Data Mining
 Tableau: A software company founded in Mountain View, California in January 2003 by Chris Stolte, Christian Chabot and Pat Hanrahan, and headquartered in Seattle, Washington. It produces a family of interactive data visualization products focused on business intelligence.
Your checklist
 Website
 Class home page, Schedule, online Notes
 Shared network drive
 \\TechShare\coba\d\ISQS3358\
 \\TechShare\coba\d\ISQS6347\
 Downloadable materials
 E-Textbooks
 Datasets
 Homework assignments
 Slides
 Exercises
 Demonstrative Videos
CAABI
The Center for Advanced Analytics and Business Intelligence was started in 2004 by Dr. Peter Westfall, ISQS, Rawls College of Business.
Looking to offer support to companies in developing BI capabilities.
Lots of technical expertise.
Your opportunities to contact the BI industry
SAS Analytics 2017 Conference, September 18-20, Washington DC. Check https://www.sas.com/en_us/events/analytics-conference.html
SAS Global Forum 2017
April 2 - 5, Orlando, FL
https://www.sas.com/en_us/events/sasglobal-forum/sas-global-forum-2017.html
Posters in SAS M2008