Machine Learning
Márk Horváth
Morgan Stanley FID Institutional Securities

Content
• AI Paradigm
• Data Mining
• Weka
• Application Areas
• Introduce many fields and the whole paradigm
  – No time for details

AI Paradigm
• “The area of computer science which deals with problems that we were not able to cope with before.”
  – Computer science is a branch of mathematics, btw.
• “Algorithms solving problems mainly through interaction with the problem. The programmer does not have to understand the solution to the problem itself, but only the details of the learning algorithm.”

AI Paradigm
• Why AI?
  – new, fast expanding science, applicable in most other sciences
    • it also deals with explaining evidence
  – interdisciplinary
    • math
    • computer science
    • applied math
    • philosophy of science
    • biology (many naturally inspired algorithms, thinking machine)
• Why Machine Learning / Data Mining?
  – it can be applied to any data (financial, medical, demographic, …)

AI Paradigm
• 1965 John McCarthy => 42 years
• Hilbert, theorem proving machine
• Occam (14th century)
• Many distinct fields
• Many algorithms in each field
  • => 1 hour is nothing…
• Empirical and theoretical science
• Intuition needed to use and hybridize
• Few proofs
• Area too big to grasp everything in detail, but concepts are important
  – => BIG PICTURE, no formulas!

AI Taxonomy
[Diagram; node labels in the order extracted: AI · Logic / Expert Sys · Machine Learning / Optimization · Control · Clustering · Model / AGI · PCA, ICA · Data Mining / … · Function Approximation · Kernel Based / · Decision Tree / · Linear Regression / · Naive Bayes · 0R, 1R · Nearest Neighbor · Covering · Gradient Methods · … (max likelihood)]

Data Mining vs. Statistics
• Statistics
  – ~ hypothesis testing
• DM
  – search through hypotheses
• Empirical side
  – Many methods work in practice although they are proven not to converge
  – Some methods fail in practice although they should work (limited computing power, slow convergence)

Relation, Attribute, Class
(Ω, A, P)
X = MYCT × MMIN × MMAX × CACH × CHMIN × CHMAX (Attribute, Feature)
Y = class (Class, Target)
Ω = X × Y
ρ( Y | X ) = ?

@relation 'cpu'
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real % performance
@data
125,256,6000,256,16,128,199
29,8000,32000,32,8,32,253
29,8000,16000,32,8,16,132
26,8000,32000,64,8,32,290
23,16000,32000,64,16,32,381
…

General View of Data Mining
• Language
• Build model / search over the Language

Simple Cases
• 0R
• 1R (nominal class)
• Max likelihood
• Linear Regression

Data Mining Taxonomy
• Regression vs. Classification (exchangeable)
• Deterministic vs. Stochastic (~exchangeable: Chebyshev)
• Batch driven vs. Updateable (~exchangeable, but with cost)
• Symbolic vs. Subsymbolic

Methodology
• Clean data
• Try many methods
• Optimize good methods
• Hybridize good methods, make meta algorithms

Evaluation Measures
• Mean Absolute Error / Root Mean Squared Error
• Correlation Coefficient
• Information gain
• Custom (e.g. weighted)
• Significance analysis (Bernoulli process)

Overfitting, Learning Noise
• Philosophical question
  – When do we accept or reject a model?
  – No chance to prove, only to reject
• Train / (Validation) / Test
• Cross-validation, leave one out
• Minimum Description Length principle
  – Occam
  – Kolmogorov complexity

Nearest Neighbor / Kernel
• Instance based
• Statistical (k neighbors)
• Distance: Euclidean, Manhattan / Evolved
• Missing Attribute: maximal distance
• KD-tree (log(n)), ball tree, metric tree

Decision Trees / Covering
• Divide and Conquer
• Split by the best feature
• User Classifier / REP Tree

Naive Bayes
• Independent Attributes
• P(Y | X) = P(X | Y) * P(Y) / P(X) = Π P(Xi | Y) * P(Y) / P(X)
• Discrete Class

Artificial Neural Networks
• Structure (Weka)
  – Theoretical limitations (Minsky, AI winter)
• Recurrent networks for time series

Feedforward Learning Rules
• Learning rules
  – Perceptron / Winnow (very simple rules for special cases)
  – Various gradient descent methods
    • Slower than perceptron
    • Faster than differentiating the whole expression
    • Local search
  – Evolution
    • Global search
    • A bit slower, but easy to hybridize with local search
    • Can evolve:
      – Weights
      – Structure
      – Transfer functions
      – Recurrent networks

Perceptron / Winnow
• Perceptron
  – Add the misclassified instance to the weight vector
  – Converges if the classes are linearly separable
• Winnow
  – Binary
  – Increase or decrease non-zero attribute weights

Feature extraction
• Discretization
• PCA/ICA
• Various state space transitions
• Evolving features
• Clustering

Meta / Hybrid Methods
• LEGO ;)
• Vote (many ways)
• Use meta algorithm to predict based on base methods
• Embed
  – Apply regression in the leaves of decision trees
  – Embed decision tree, or training samples in ANN
• Unify
  – Choose a general purpose language
  – Use conventional training methods to build models
  – Hybridize training methods, evolve
• Easy to write articles, countless new ideas

Practical Uses
• New paradigm
• Countless applications
• In all sciences
  – finance, psychology, sociology, biology, medicine, chemistry, …
  – actually discovering and explaining evidence is science itself
• Business
  – predictive enterprise

Applications in AI
• Optimal Control (model building)
• Use in other AI methods
  – Speech recognition
  – OCR
  – Speech synthesis
  – Vision, recognition
  – AGI (logic, DM, evolution, clustering, reinforcement learning, …)

TDK, Article
• Any topic you’ve found interesting…
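
Sketch: perceptron rule
The rule on the Perceptron / Winnow slide (add the misclassified instance to the weight, repeat until nothing is misclassified) can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk or from Weka: the names train_perceptron and predict, the +1/-1 label encoding, the appended constant input that acts as a bias, and the toy AND data are all assumptions made for the example.

```python
def train_perceptron(instances, labels, max_epochs=100):
    """Perceptron rule: add each misclassified instance (times its label) to the weights.

    Labels are assumed to be +1 or -1. A constant 1.0 is appended to every
    instance so that the last weight acts as a bias term.
    """
    augmented = [list(x) + [1.0] for x in instances]
    weights = [0.0] * len(augmented[0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(augmented, labels):
            activation = sum(w * xi for w, xi in zip(weights, x))
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:  # one full pass without errors: converged
            break
    # If the data is not linearly separable, the loop stops at max_epochs
    # and the returned weights are only a best effort.
    return weights


def predict(weights, x):
    """Classify one instance with the learned weights (same bias convention)."""
    activation = sum(w * xi for w, xi in zip(weights, list(x) + [1.0]))
    return 1 if activation > 0 else -1


# Hypothetical, linearly separable toy data: logical AND with -1/+1 labels.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w = train_perceptron(X, y)
predictions = [predict(w, x) for x in X]
```

On separable data such as this, the loop ends once a full pass produces no mistakes, which is the convergence property the slide refers to. The max_epochs cap is there only so the sketch terminates on non-separable data.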