Bringing together the data mining, data science and analytics community.

ACM: The Association for Computing Machinery is the world’s largest educational and scientific computing society, with the highest reputation as a professional organization.
SIGKDD: Special Interest Group on Knowledge Discovery and Data Mining.

Previous Courses
• Introduction to Spark
• Advanced Hadoop Based Machine Learn
• Hadoop Based Machine Learning
• MapReduce Design Patterns

Austin SIGKDD Chapter Officers
• Chance Coble
• Robert Chong
• Omar Odibat

Machine Learning with Python

Date    Topic                                                      Instructor
22-Apr  1: Getting Started with Python Machine Learning            Chance Coble
29-Apr  2: Learning How to Classify with Real-World Examples       Roger Huang
6-May   3: Clustering – Finding Related Posts                      Misty Nodine
13-May  4: Topic Modeling                                          Christine Doig
20-May  5: Classification I                                        Mark Landry
27-May  6: Classification II – Sentiment Analysis                  Jennifer Davis
3-Jun   7: Regression – Recommendations                            Jessica Williamson
10-Jun  8: Regression – Recommendations Improved                   Omar Odibat
17-Jun  9: Computer Vision – Pattern Recognition                   Ed Solis
24-Jun  10: Dimensionality Reduction I                             Jessica Williamson
1-Jul   11: Dimensionality Reduction II                            Robert Chong
8-Jul   12: Big(ger) Data                                          Chance Coble

An official ACM certificate will be awarded to those who complete the course (attending at least 8 sessions).

Introduction to Machine Learning with Python
Getting Started with Machine Learning and Python

Outline
• Introduction to ATX ACM SIGKDD, the course and the instructors
• Introduction to Python (Interactive)
• Getting Started with Machine Learning (Lecture)
• Q&A
• Chapter Walk-through

Part I: Getting Started with Python

Python Introduction
• Python is a high-level, general-purpose programming language
• Compiled to byte-code, then interpreted
• Other flavors are Jython and IronPython
• Strongly and dynamically typed
• Object-oriented
• Functional features
• And here it goes:
    print("Hello World")

Python Continued
Comments:
    # This is commented
    '''
    More than one line
    '''
Variables:
    a = 0        # interpreter determines this is an integer
    b = "Hello"  # interpreter determines this is a string
Casting: int(x), float(x), str(x)
type(x): handily returns the type of variable x

Python: Operators
    print(3 + 4)
    print(3 - 4)
    print(3 * 4)
    print(3 / 4)   # division (0.75 in Python 3)
    print(3 % 2)   # remainder
    print(3 ** 4)  # 3 to the 4th
    print(3 // 4)  # floor division

Python: Guards
    a = 20
    if a >= 22:
        print("if")    # note the indentation before 'print': it is important!
    elif a >= 21:
        print("elif")
    else:
        print("else")

Python Functions
    def someFunction():
        print("boo")   # again with the indentation
    someFunction()

    def someOtherFunction(a, b):
        print(a + b)
    someOtherFunction(12, 451)

Python: Iteration
    for a in range(1, 3):
        print(a)

    a = 1
    while a < 10:
        print(a)
        a += 1

Python: Strings
    text = "a string"   # named 'text' so the built-in str() is not shadowed
    text.count('x')
    text.find('x')
    text.lower()
    text.upper()
    text.replace('a', 'b')
    text.strip()
    print(text[1:3])
    print(text[:-1])

Python: Lists
    sampleList = [1, 2, 3, 4, 5, 6, 7, 8]
    for a in sampleList:
        print(a)
    sampleList.append(9)
    sampleList.count(2)
    sampleList.index(5)
    sampleList.pop()
    sampleList.remove(7)
    sampleList.reverse()
    sampleList.sort()

Python: Tuples and Dictionaries
• Tuples are immutable (unlike lists)
    myTuple = (1, 2, 3)
    a, b, c = myTuple
• Dictionaries store key-value pairs
    dictExample = {'someItem': 20, 'other': 100}
    dictExample['newItem'] = 400
    for a in dictExample:
        print(a)
    print(dictExample['someItem'])

Classes
    class Calculator(object):
        # define class to simulate simple calculator
        def __init__(self):
            # start with zero
            self.current = 0

        def add(self, amount):
            # add number to current
            self.current += amount

        def getCurrent(self):
            return self.current

    myCalc = Calculator()
    myCalc.add(2)
    print(myCalc.getCurrent())

Part II: Getting Started with Machine Learning

Problem Setup
• Programs for well-defined processes that map input to output:
  • Create invoices from an accounting system
  • Compute basic statistics on a set of data records
  • Compile programs into machine-level code
• Programs for scenarios with only the input and output:
  • Find the baby in the picture
  • Drive a car
  • Rank a set of documents by importance
  • Transcribe speech
• Some of these the human brain does well

Problem Setup
• Entity Analytics Records
• For each entity for which we want to perform a mapping, create a column of values associated with that entity
• We will call these: features
• For each entity you could have many features
• We need to construct a program Φ so that an architecture m(Φ, x) yields the best answer given x
• x: a set of features which have an ideal target value

Learning is Representation, Optimization & Evaluation
• m in this case is our representation
• That is the language in which our model will be built:
  • Line: y = m * x + b
  • Rules (if x then y)
  • Decision Tree
• Optimization allows you to improve your parameters
• Evaluation (Ch 2): determines if one model is better than another

Achieving Generalization
• It is easy to get 0 error in training: a novice mistake!
• Example: your optimization creates a dictionary of inputs and their outputs, and then for a new point outputs the result for the closest input
• This is an extreme example of overfitting
• Overfitting is constantly your enemy in machine learning
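The dictionary-style learner from the Achieving Generalization slide can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the data set, the noisy labels, and the names memorizer, threshold_rule and error_rate are hypothetical and do not come from the course material. The memorizer stores every training record and answers with the label of the closest stored input, so it scores zero training error while repeating the noise on new points. The simpler threshold rule gets some training records wrong but does better on the held-out records.

```python
# Hypothetical toy data: the underlying rule is label = 1 when x >= 5,
# but two training labels (x = 3 and x = 7) are noisy.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]
test = [(1.5, 0), (2.9, 0), (3.2, 0), (6.8, 1), (7.1, 1), (8.5, 1)]


def memorizer(x):
    """Return the stored label of the closest training input (1-nearest neighbor)."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def threshold_rule(x):
    """A much simpler model: a single threshold that cannot fit the noisy labels."""
    return 1 if x >= 5 else 0


def error_rate(model, data):
    """Fraction of records in data that the model labels incorrectly."""
    wrong = sum(1 for x, y in data if model(x) != y)
    return wrong / len(data)


for name, model in [("memorizer", memorizer), ("threshold", threshold_rule)]:
    print(name,
          "train error:", error_rate(model, train),
          "test error:", error_rate(model, test))
```

Comparing the two printed lines shows why a low training error alone should make you skeptical: only the error on records the model has not seen says anything about generalization.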
Generalization
• See low error in training? Be skeptical!
• Have records with 100 boolean features?
  • Let’s say you have 1 million records to learn from
  • Assuming distinct records, you have covered only 10^6 of the 2^100 possible inputs, leaving 2^100 - 10^6 uncovered!
• Generalizing requires domain knowledge in your representation
• The goal of optimization is not to get zero error, it is to get the right kind of error
• Nothing does better than everything else on all data

Generalization
• Watch out for these claims:
  • Asymptopia: when carried out to infinity, our optimization approach is guaranteed to find the minimum
  • Any model is representable in our model
• Representable is not learnable!
• More data trumps more complex learners

Training and Testing
• Back to the Entity Analytics Record
• You will have a set of data for your entities that should have an output from m
• Training set: optimize Φ based on the data you provide
• Testing set: evaluate m(Φ, x) for all x’s in a representative set

Dimensionality isn’t always intuitive
• Back to our example: a set of 100 boolean features is a space of 2^100 points, and our data set of 1,000,000 records leaves 2^100 - 10^6 of them uncovered
• The size of the space grows much faster than our data can cover
• Even stranger still: most of the volume of a high-dimensional orange would be in the skin, not the pulp

Feature Design
• Often features are “engineered”:
  • Flags of other properties
  • Calculations of other features
  • Reductions of other features
• As a practitioner, most of your time will be spent engineering features
• High hopes exist to automate this process one day (i.e.
it’s holy grail stuff)

Ensembles
• Many models are better than one
• Bagging
• Boosting
• Stacking
• Trend is toward larger ensembles
• The Netflix Prize winner (and runner-up) were both ensemble approaches

References
• Pedro Domingos: A Few Useful Things to Know about Machine Learning, http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
• Building Machine Learning Applications
• Elements of Statistical Learning
• Pattern Classification
• Handbook of Statistical Analysis and Data Mining
• Python Tutorial: http://www.afterhoursprogramming.com/tutorial/Python/Classes/
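As a supplement to the Ensembles slide, bagging (bootstrap aggregating) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the method used by any particular library or by the Netflix Prize teams: the toy data and the names fit_stump, bagged_stumps and predict are hypothetical. Each model is a one-threshold "stump" fitted to a bootstrap sample (drawn with replacement), and the ensemble answers by majority vote. Boosting and stacking are not shown.

```python
import random

# Hypothetical toy data: label is 1 when x >= 5, with two noisy labels.
data = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 0), (8, 1), (9, 1)]


def fit_stump(sample):
    """Pick the threshold t (among the sampled x values) with the fewest
    errors for the rule 'predict 1 when x >= t'."""
    def errors(t):
        return sum(1 for x, y in sample if (1 if x >= t else 0) != y)
    return min(sorted({x for x, _ in sample}), key=errors)


def bagged_stumps(records, n_models=25, seed=0):
    """Bagging: fit one stump per bootstrap sample of the records."""
    rng = random.Random(seed)
    thresholds = []
    for _ in range(n_models):
        # A bootstrap sample has the same size as the data, drawn with replacement.
        sample = [rng.choice(records) for _ in records]
        thresholds.append(fit_stump(sample))
    return thresholds


def predict(thresholds, x):
    """Majority vote over all stumps in the ensemble."""
    votes = sum(1 for t in thresholds if x >= t)
    return 1 if votes * 2 > len(thresholds) else 0


thresholds = bagged_stumps(data)
print([predict(thresholds, x) for x in (0, 2.5, 5, 7.5, 10)])
```

An odd n_models avoids tied votes. Individual stumps may chase a noisy label in their own bootstrap sample, and the vote is what averages those differences out.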