Bringing together the data mining, data science and analytics community.
ACM: the Association for Computing Machinery is the world’s largest educational and scientific computing society, and a highly respected professional organization.
SIGKDD: Special Interest Group on Knowledge Discovery and Data Mining.
Previous Courses
•  Introduction to Spark
•  Advanced Hadoop Based Machine Learning
•  Hadoop Based Machine Learning
•  MapReduce Design Patterns
Austin SIGKDD Chapter officers:
Chance Coble
[email protected]
Robert Chong
[email protected]
Omar Odibat
[email protected]
Machine Learning with Python
Date    Topic                                                  Instructor
22-Apr  1: Getting Started with Python Machine Learning        Chance Coble
29-Apr  2: Learning How to Classify with Real-World Examples   Roger Huang
6-May   3: Clustering – Finding Related Posts                  Misty Nodine
13-May  4: Topic Modeling                                      Christine Doig
20-May  5: Classification I                                    Mark Landry
27-May  6: Classification II – Sentiment Analysis              Jennifer Davis
3-Jun   7: Regression – Recommendations                        Jessica Williamson
10-Jun  8: Regression – Recommendations Improved               Omar Odibat
17-Jun  9: Computer Vision – Pattern Recognition               Ed Solis
24-Jun  10: Dimensionality Reduction I                         Jessica Williamson
1-Jul   11: Dimensionality Reduction II                        Robert Chong
8-Jul   12: Big(ger) Data                                      Chance Coble

An official ACM certificate will be awarded to those who complete the course (attending at least 8 sessions).
Introduction to Machine Learning with Python
Getting Started with Machine Learning and Python
Outline
•  Introduction to ATX ACM SIGKDD, the Course, and Instructors
•  Introduction to Python (Interactive)
•  Getting Started with Machine Learning (Lecture)
•  Q&A
•  Chapter Walk-through
Part I: Getting Started with Python
Python Introduction
•  Python is a high-level, general-purpose programming language
•  Compiled to byte code, then interpreted
•  Other flavors are Jython and IronPython
•  Strongly and dynamically typed
•  Object-oriented
•  Functional features
•  And here it goes: print("Hello World")

Python Continued
•  Comments:
   # This is commented
   ''' More
   than one
   line '''
•  Variables
   a = 0        # ← the interpreter determines this is an integer
   b = "Hello"  # ← the interpreter determines this is a string
•  Casting: int(x), float(x), str(x)
•  type(x) handily returns the type of variable x

Python: Operators
print(3 + 4)
print(3 - 4)
print(3 * 4)
print(3 / 4)
print(3 % 2)
print(3 ** 4)  # 3 to the 4th
print(3 // 4)  # floor division

Python: Guards
a = 20
if a >= 22:
    print("if")  # note the spaces before 'print' – important!
elif a >= 21:
    print("elif")
else:
    print("else")

Python Functions
def someFunction():
    print("boo")  # ← again with the space

someFunction()

def someOtherFunction(a, b):
    print(a + b)

someOtherFunction(12, 451)

Python: Iteration
for a in range(1, 3):
    print(a)

a = 1
while a < 10:
    print(a)
    a += 1

Python: Strings
str = "a string"
str.count('x')
str.find('x')
str.lower()
str.upper()
str.replace('a', 'b')
str.strip()
print(str[1:3])
print(str[:-1])

Python: Lists
sampleList = [1, 2, 3, 4, 5, 6, 7, 8]
for a in sampleList:
    print(a)
sampleList.append(9)
sampleList.count(2)
sampleList.index(5)
sampleList.pop()
sampleList.remove(7)
sampleList.reverse()
sampleList.sort()

Python: Tuples and Dictionaries
•  Tuples are immutable (unlike lists)
   myTuple = (1, 2, 3)
   a, b, c = myTuple
•  Dictionaries store key-value pairs
   dictExample = {'someItem': 20, 'other': 100}
   dictExample['newItem'] = 400
   for a in dictExample:
       print(a)
   print(dictExample['someItem'])

Classes
class Calculator(object):
    # define a class to simulate a simple calculator
    def __init__(self):
        # start with zero
        self.current = 0
    def add(self, amount):
        # add a number to the current value
        self.current += amount
    def getCurrent(self):
        return self.current

myCalc = Calculator()
myCalc.add(2)
print(myCalc.getCurrent())

Part II: Getting Started with Machine Learning
Problem Setup
•  Programs for well-understood processes that map input to output
   •  Create invoices from an accounting system
   •  Compute basic statistics on a set of data records
   •  Compile programs into machine-level code
•  Programs for scenarios with only the input and output
   •  Find the baby in the picture
   •  Drive a car
   •  Rank a set of documents by importance
   •  Transcribe speech
•  Some of these the human brain does well
Problem Setup
•  Entity Analytics Records
   •  For each entity on which we want to perform a mapping, create a column of values associated with that entity
   •  We will call these values features
   •  For each entity you could have many features
•  We need to construct a program Φ so that an architecture m(Φ, x) yields the best answer, given that
   •  x is a set of features which have an ideal target value
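The m(Φ, x) notation can be made concrete with a small sketch. Here m is taken, purely for illustration, to be a linear model; Φ is its weight vector plus a bias, and x is the feature vector for one entity. All names and numbers are invented:

```python
# A minimal sketch of the m(phi, x) notation: m is the architecture
# (here, a linear model), phi its learnable parameters, and x the
# feature vector for one entity. Values are illustrative only.

def m(phi, x):
    """Linear architecture: weighted sum of features plus a bias."""
    weights, bias = phi
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# One entity with three features, and one candidate parameter set
x = [1.0, 2.0, 3.0]
phi = ([0.5, -1.0, 2.0], 0.25)

print(m(phi, x))  # 0.5*1 - 1.0*2 + 2.0*3 + 0.25 = 4.75
```

Learning is then the search for the Φ that makes m(Φ, x) closest to the ideal target values across all entities.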
Learning is Representation, Optimization & Evaluation
•  m in this case is our representation
   •  That is the language in which our model will be built
   •  Line: y = m * x + b
   •  Rules (if x then y)
   •  Decision tree
•  Optimization allows you to improve your parameters
•  Evaluation (Ch 2) determines if one model is better than another
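The three pieces can be shown together in a hedged sketch: the representation is a line y = m*x + b, optimization is the closed-form least-squares solution for m and b, and evaluation is mean squared error on points held out from fitting. The data is a toy, chosen to lie exactly on a line:

```python
# Representation: a line y = m*x + b.
# Optimization: closed-form least squares chooses m and b.
# Evaluation: mean squared error on points not used for fitting.
# (Toy data, purely illustrative.)

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    m = num / den
    b = mean_y - m * mean_x
    return m, b

def mse(m, b, xs, ys):
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]   # exactly y = 2x + 1
m, b = fit_line(train_x, train_y)
print(m, b)                        # 2.0 1.0
print(mse(m, b, [4, 5], [9, 11]))  # 0.0 on held-out points
```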
Achieving Generalization
•  It is easy to get 0 error in training – a novice mistake!
•  Example: your optimization creates a dictionary of inputs and their outputs, and then for a new point outputs the result for the closest input
   •  This is an extreme example of overfitting
•  Overfitting is constantly your enemy in machine learning
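The dictionary-memorizer described above can be sketched directly. This hypothetical "learner" achieves 0 training error by construction, yet simply snaps any new point to the nearest memorized input:

```python
# A memorizing "learner": store every training input with its output,
# and answer a query with the output of the closest stored input.
# It gets 0 training error by construction – the extreme overfit.

def train(xs, ys):
    return dict(zip(xs, ys))           # the "model" is a lookup table

def predict(table, x):
    closest = min(table, key=lambda k: abs(k - x))
    return table[closest]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 9.0, 16.0]             # underlying truth: y = x**2
table = train(xs, ys)

print(predict(table, 3.0))  # 9.0 – a training point, answered exactly
print(predict(table, 3.4))  # 9.0 – snaps to x=3.0 (true value is 11.56)
```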
Generalization
•  See low error in training? Be skeptical!
•  Have records with 100 boolean features?
   •  Say you have 1 million records to learn from
   •  Assuming distinct records, you have covered only 10^6 of the 2^100 possible inputs, leaving 2^100 − 10^6 uncovered!
•  Generalizing requires domain knowledge in your representation
•  The goal of optimization is not to get zero error; it is to get the right kind of error
•  Nothing does better than everything else on all data
Generalization
•  Watch out for these claims:
   •  Asymptopia: “when carried out to infinity, our optimization approach is guaranteed to find the minimum”
   •  “Any model is representable in our model”
      •  Representable is not learnable!
•  More data trumps more complex learners
Training and Testing
•  Back to the Entity Analytics Record
•  You will have a set of data for your entities that should have an output from m
•  Training set: optimize Φ based on the data you provide
•  Testing set: evaluate m(Φ, x) for all x’s in a representative set
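A minimal sketch of making that split: shuffle the records, hold out a representative subset for evaluation, and never let optimization touch it. The 75/25 ratio and the record values are arbitrary choices for illustration:

```python
import random

# Hold out a test set: optimization sees only the training portion;
# evaluation happens only on the held-out portion.
# (Illustrative data; the 75/25 ratio is an arbitrary assumption.)

def train_test_split(records, test_fraction=0.25, seed=0):
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(20))              # stand-ins for entity records
train, test = train_test_split(records)

print(len(train), len(test))            # 15 5
print(sorted(train + test) == records)  # True – nothing lost or duplicated
```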
Dimensionality isn’t always intuitive
•  Back to our example
   •  A set of 100 boolean features is a space of 2^100 points, and our data set of 1,000,000 leaves 2^100 − 10^6 of them uncovered
•  The size of the space grows much faster than our data can cover it
•  Even stranger still – most of the volume of a high-dimensional orange would be in the skin, not the pulp
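The orange analogy can be checked with a one-line calculation: in d dimensions, the fraction of a ball’s volume lying in the outer shell covering the last 10% of the radius is 1 − 0.9^d (the 10% skin thickness here is an arbitrary choice for illustration):

```python
# Fraction of a d-dimensional ball's volume lying in its outer "skin",
# i.e. the shell covering the last 10% of the radius: 1 - 0.9**d.
# The 10% skin thickness is an arbitrary choice for illustration.

def skin_fraction(d, skin=0.1):
    return 1 - (1 - skin) ** d

for d in (3, 10, 100):
    print(d, round(skin_fraction(d), 4))
# d=3  -> about 27% of the volume is skin
# d=10 -> about 65%
# d=100 -> essentially all of it
```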
Feature Design
•  Often features are “engineered”
   •  Flags of other properties
   •  Calculations of other features
   •  Reductions of other features
•  As a practitioner, most of your time will be spent engineering features
•  High hopes exist to automate this process one day (i.e., it’s holy-grail stuff)
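The three kinds of engineered features named above (flags, calculations, reductions) can be sketched on a made-up entity record; every field name here is hypothetical:

```python
# Hypothetical raw entity record; all field names are invented.
record = {"purchases": [30.0, 45.0, 120.0], "age": 17, "visits": 12}

def engineer(rec):
    total = sum(rec["purchases"])
    return {
        # Flag of another property
        "is_minor": rec["age"] < 18,
        # Calculation from other features
        "spend_per_visit": total / rec["visits"],
        # Reduction of other features
        "max_purchase": max(rec["purchases"]),
    }

features = engineer(record)
print(features)
# {'is_minor': True, 'spend_per_visit': 16.25, 'max_purchase': 120.0}
```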
Ensembles
•  Many models are better than one
   •  Bagging
   •  Boosting
   •  Stacking
•  The trend is toward larger ensembles
•  The Netflix Prize winner (and runner-up) were both ensemble approaches
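The simplest version of the "many models are better than one" idea is majority voting, the building block of bagging; a hedged sketch with three invented weak classifiers:

```python
# Majority vote over several weak classifiers. Each votes 0 or 1;
# the ensemble returns whichever label gets the majority.
# The three one-rule classifiers below are invented toys.

def vote(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return 1 if sum(votes) > len(votes) / 2 else 0

clfs = [
    lambda x: 1 if x > 4 else 0,        # rule on magnitude
    lambda x: 1 if x > 6 else 0,        # stricter magnitude rule
    lambda x: 1 if x % 2 == 0 else 0,   # rule on parity
]

print(vote(clfs, 8))   # all three fire -> 1
print(vote(clfs, 5))   # only one fires -> 0
```

Bagging trains each member on a different bootstrap sample of the data; boosting and stacking combine members in more elaborate ways, but the voting (or averaging) step looks much like this.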
References
•  Pedro Domingos: A Few Useful Things to Know about Machine Learning
   http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
•  Building Machine Learning Applications
•  Elements of Statistical Learning
•  Pattern Classification
•  Handbook of Statistical Analysis and Data Mining
•  Python tutorial:
   http://www.afterhoursprogramming.com/tutorial/Python/Classes/