Bringing together the data mining, data science and analytics community.
ACM: the Association for Computing Machinery is the world’s largest educational and scientific computing society, and a highly respected professional organization.
SIGKDD: Special Interest Group on Knowledge Discovery and Data Mining.
Previous Courses
•  Introduction to Spark
•  Advanced Hadoop Based Machine Learning
•  Hadoop Based Machine Learning
•  MapReduce Design Patterns
Austin SIGKDD Chapter officers:
Chance Coble
[email protected]
Robert Chong
[email protected]
Omar Odibat
[email protected]
Machine Learning with Python
Date    Topic                                                  Instructor
22-Apr  1: Getting Started with Python Machine Learning        Chance Coble
29-Apr  2: Learning How to Classify with Real-World Examples   Roger Huang
6-May   3: Clustering – Finding Related Posts                  Misty Nodine
13-May  4: Topic Modeling                                      Christine Doig
20-May  5: Classification I                                    Mark Landry
27-May  6: Classification II – Sentiment Analysis              Jennifer Davis
3-Jun   7: Regression – Recommendations                        Jessica Williamson
10-Jun  8: Regression – Recommendations Improved               Omar Odibat
17-Jun  9: Computer Vision – Pattern Recognition               Ed Solis
24-Jun  10: Dimensionality Reduction I                         Jessica Williamson
1-Jul   11: Dimensionality Reduction II                        Robert Chong
8-Jul   12: Big(ger) Data                                      Chance Coble

An official ACM certificate will be awarded to those who complete the course (attending at least 8 sessions).
Introduction to Machine Learning with Python
Getting Started with Machine Learning and Python
Outline
•  Introduction to ATX ACM SIGKDD, the Course, and Instructors
•  Introduction to Python (Interactive)
•  Getting Started with Machine Learning (Lecture)
•  Q&A
•  Chapter Walk-through
Part I: Getting Started with Python
Python Introduction
•  Python is a high-level, general-purpose programming language
•  Compiled to byte code, then interpreted
•  Other flavors are Jython and IronPython
•  Strongly and dynamically typed
•  Object-oriented
•  Functional features
•  And here it goes: print("Hello World")

Python Continued
•  Comments:
   # This is commented
   ''' More
   than one
   line '''
•  Variables
   a = 0        # ← the interpreter determines this is an integer
   b = "Hello"  # ← the interpreter determines this is a string
•  Casting: int(x), float(x), str(x)
•  type(x) handily returns the type of variable x

Python: Operators
print(3 + 4)
print(3 - 4)
print(3 * 4)
print(3 / 4)
print(3 % 2)
print(3 ** 4)  # 3 to the 4th
print(3 // 4)  # floor division

Python: Guards
a = 20
if a >= 22:
    print("if")  # note the spaces before 'print' – important!
elif a >= 21:
    print("elif")
else:
    print("else")

Python Functions
def someFunction():
    print("boo")  # ← again with the space

someFunction()

def someOtherFunction(a, b):
    print(a + b)

someOtherFunction(12, 451)

Python: Iteration
for a in range(1, 3):
    print(a)

a = 1
while a < 10:
    print(a)
    a += 1

Python: Strings
str = "a string"
str.count('x')
str.find('x')
str.lower()
str.upper()
str.replace('a', 'b')
str.strip()
print(str[1:3])
print(str[:-1])

Python: Lists
sampleList = [1, 2, 3, 4, 5, 6, 7, 8]
for a in sampleList:
    print(a)
sampleList.append(9)
sampleList.count(2)
sampleList.index(5)
sampleList.pop()
sampleList.remove(7)
sampleList.reverse()
sampleList.sort()

Python: Tuples and Dictionaries
•  Tuples are immutable (unlike lists)
   myTuple = (1, 2, 3)
   a, b, c = myTuple
•  Dictionaries store key-value pairs
   dictExample = {'someItem': 20, 'other': 100}
   dictExample['newItem'] = 400
   for a in dictExample:
       print(a)
   print(dictExample['someItem'])

Classes
class Calculator(object):
    # define a class to simulate a simple calculator
    def __init__(self):
        # start with zero
        self.current = 0
    def add(self, amount):
        # add a number to the current value
        self.current += amount
    def getCurrent(self):
        return self.current

myCalc = Calculator()
myCalc.add(2)
print(myCalc.getCurrent())

Part II: Getting Started with Machine Learning
Problem Setup
•  Programs for well-understood processes that map input to output
   •  Create invoices from an accounting system
   •  Compute basic statistics on a set of data records
   •  Compile programs into machine-level code
•  Programs for scenarios with only the input and output
   •  Find the baby in the picture
   •  Drive a car
   •  Rank a set of documents by importance
   •  Transcribe speech
•  Some of these the human brain does well
Problem Setup
•  Entity Analytics Records
   •  For each entity on which we want to perform a mapping, create a column of values associated with that entity
   •  We will call these values features
   •  For each entity you could have many features
•  We need to construct a program Φ so that an architecture m(Φ, x) yields the best answer, given that
   •  x is a set of features which have an ideal target value
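The m(Φ, x) notation can be made concrete with a small sketch. Here m is taken, purely for illustration, to be a linear model; Φ is its weight vector plus a bias, and x is the feature vector for one entity. All names and numbers are invented:

```python
# A minimal sketch of the m(phi, x) notation: m is the architecture
# (here, a linear model), phi its learnable parameters, and x the
# feature vector for one entity. Values are illustrative only.

def m(phi, x):
    """Linear architecture: weighted sum of features plus a bias."""
    weights, bias = phi
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# One entity with three features, and one candidate parameter set
x = [1.0, 2.0, 3.0]
phi = ([0.5, -1.0, 2.0], 0.25)

print(m(phi, x))  # 0.5*1 - 1.0*2 + 2.0*3 + 0.25 = 4.75
```

Learning is then the search for the Φ that makes m(Φ, x) closest to the ideal target values across all entities.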
Learning is Representation, Optimization & Evaluation
•  m in this case is our representation
   •  That is the language in which our model will be built
   •  Line: y = m * x + b
   •  Rules (if x then y)
   •  Decision tree
•  Optimization allows you to improve your parameters
•  Evaluation (Ch 2) determines if one model is better than another
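The three pieces can be shown together in a hedged sketch: the representation is a line y = m*x + b, optimization is the closed-form least-squares solution for m and b, and evaluation is mean squared error on points held out from fitting. The data is a toy, chosen to lie exactly on a line:

```python
# Representation: a line y = m*x + b.
# Optimization: closed-form least squares chooses m and b.
# Evaluation: mean squared error on points not used for fitting.
# (Toy data, purely illustrative.)

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope and intercept
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    m = num / den
    b = mean_y - m * mean_x
    return m, b

def mse(m, b, xs, ys):
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]   # exactly y = 2x + 1
m, b = fit_line(train_x, train_y)
print(m, b)                        # 2.0 1.0
print(mse(m, b, [4, 5], [9, 11]))  # 0.0 on held-out points
```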
Achieving Generalization
•  It is easy to get 0 error in training – a novice mistake!
•  Example: your optimization creates a dictionary of inputs and their outputs, and then for a new point outputs the result for the closest input
   •  This is an extreme example of overfitting
•  Overfitting is constantly your enemy in machine learning
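The dictionary-memorizer described above can be sketched directly. This hypothetical "learner" achieves 0 training error by construction, yet simply snaps any new point to the nearest memorized input:

```python
# A memorizing "learner": store every training input with its output,
# and answer a query with the output of the closest stored input.
# It gets 0 training error by construction – the extreme overfit.

def train(xs, ys):
    return dict(zip(xs, ys))           # the "model" is a lookup table

def predict(table, x):
    closest = min(table, key=lambda k: abs(k - x))
    return table[closest]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 4.0, 9.0, 16.0]             # underlying truth: y = x**2
table = train(xs, ys)

print(predict(table, 3.0))  # 9.0 – a training point, answered exactly
print(predict(table, 3.4))  # 9.0 – snaps to x=3.0 (true value is 11.56)
```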
Generalization
•  See low error in training? Be skeptical!
•  Have records with 100 boolean features?
   •  Say you have 1 million records to learn from
   •  Assuming distinct records, you have covered only 10^6 of the 2^100 possible inputs, leaving 2^100 − 10^6 uncovered!
•  Generalizing requires domain knowledge in your representation
•  The goal of optimization is not to get zero error; it is to get the right kind of error
•  Nothing does better than everything else on all data
Generalization
•  Watch out for these claims:
   •  Asymptopia: “when carried out to infinity, our optimization approach is guaranteed to find the minimum”
   •  “Any model is representable in our model”
      •  Representable is not learnable!
•  More data trumps more complex learners
Training and Testing
•  Back to the Entity Analytics Record
•  You will have a set of data for your entities that should have an output from m
•  Training set: optimize Φ based on the data you provide
•  Testing set: evaluate m(Φ, x) for all x’s in a representative set
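A minimal sketch of making that split: shuffle the records, hold out a representative subset for evaluation, and never let optimization touch it. The 75/25 ratio and the record values are arbitrary choices for illustration:

```python
import random

# Hold out a test set: optimization sees only the training portion;
# evaluation happens only on the held-out portion.
# (Illustrative data; the 75/25 ratio is an arbitrary assumption.)

def train_test_split(records, test_fraction=0.25, seed=0):
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(20))              # stand-ins for entity records
train, test = train_test_split(records)

print(len(train), len(test))            # 15 5
print(sorted(train + test) == records)  # True – nothing lost or duplicated
```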
Dimensionality isn’t always intuitive
•  Back to our example
   •  A set of 100 boolean features is a space of 2^100 points, and our data set of 1,000,000 leaves 2^100 − 10^6 of them uncovered
•  The size of the space grows much faster than our data can cover it
•  Even stranger still – most of the volume of a high-dimensional orange would be in the skin, not the pulp
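The orange analogy can be checked with a one-line calculation: in d dimensions, the fraction of a ball’s volume lying in the outer shell covering the last 10% of the radius is 1 − 0.9^d (the 10% skin thickness here is an arbitrary choice for illustration):

```python
# Fraction of a d-dimensional ball's volume lying in its outer "skin",
# i.e. the shell covering the last 10% of the radius: 1 - 0.9**d.
# The 10% skin thickness is an arbitrary choice for illustration.

def skin_fraction(d, skin=0.1):
    return 1 - (1 - skin) ** d

for d in (3, 10, 100):
    print(d, round(skin_fraction(d), 4))
# d=3  -> about 27% of the volume is skin
# d=10 -> about 65%
# d=100 -> essentially all of it
```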
Feature Design
•  Often features are “engineered”
   •  Flags of other properties
   •  Calculations of other features
   •  Reductions of other features
•  As a practitioner, most of your time will be spent engineering features
•  High hopes exist to automate this process one day (i.e., it’s holy-grail stuff)
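The three kinds of engineered features named above (flags, calculations, reductions) can be sketched on a made-up entity record; every field name here is hypothetical:

```python
# Hypothetical raw entity record; all field names are invented.
record = {"purchases": [30.0, 45.0, 120.0], "age": 17, "visits": 12}

def engineer(rec):
    total = sum(rec["purchases"])
    return {
        # Flag of another property
        "is_minor": rec["age"] < 18,
        # Calculation from other features
        "spend_per_visit": total / rec["visits"],
        # Reduction of other features
        "max_purchase": max(rec["purchases"]),
    }

features = engineer(record)
print(features)
# {'is_minor': True, 'spend_per_visit': 16.25, 'max_purchase': 120.0}
```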
Ensembles
•  Many models are better than one
   •  Bagging
   •  Boosting
   •  Stacking
•  The trend is toward larger ensembles
•  The Netflix Prize winner (and runner-up) were both ensemble approaches
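The simplest version of the "many models are better than one" idea is majority voting, the building block of bagging; a hedged sketch with three invented weak classifiers:

```python
# Majority vote over several weak classifiers. Each votes 0 or 1;
# the ensemble returns whichever label gets the majority.
# The three one-rule classifiers below are invented toys.

def vote(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return 1 if sum(votes) > len(votes) / 2 else 0

clfs = [
    lambda x: 1 if x > 4 else 0,        # rule on magnitude
    lambda x: 1 if x > 6 else 0,        # stricter magnitude rule
    lambda x: 1 if x % 2 == 0 else 0,   # rule on parity
]

print(vote(clfs, 8))   # all three fire -> 1
print(vote(clfs, 5))   # only one fires -> 0
```

Bagging trains each member on a different bootstrap sample of the data; boosting and stacking combine members in more elaborate ways, but the voting (or averaging) step looks much like this.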
References
•  Pedro Domingos: A Few Useful Things to Know about Machine Learning
   http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
•  Building Machine Learning Applications
•  Elements of Statistical Learning
•  Pattern Classification
•  Handbook of Statistical Analysis and Data Mining
•  Python tutorial:
   http://www.afterhoursprogramming.com/tutorial/Python/Classes/