Unit VI. Image Interpretation
MSc in Computational Sciences
Dr. Felipe Orihuela-Espina
© 2015-16 Dr. Felipe Orihuela-Espina

Outline
• Interpreting statistics
• Causality
• Data mining
• Pattern recognition, machine learning
• Representation learning and manifold embedding
• Deep learning
• Knowledge representation and discovery
• Interpretation guidelines

Typical fMRI processing
[Figure: typical fMRI processing pipeline. Source: Wellcome Trust; Tutorial on SPM]

Typical fNIRS processing
[Figure: raw signal → detrending → low-pass filtering (decimation) → averaging; decimated and detrended signal]

The three levels of analysis
Data analysis often comprises three steps:
• Processing: the output domain matches the input domain. Preparation of data: validation, cleaning, normalization, etc.
• Analysis: re-expresses the data in a more convenient domain. Summarization of data: feature extraction, computation of metrics, statistics, etc.
• Understanding: abstraction to achieve knowledge generation. Interpretation of data: concept validation, re-expression in natural language, etc.

The three levels of analysis
• Processing: f: X→X' such that X (domain) and X' (codomain) share the same space (even the semantics of the space). E.g., apply a filter to a signal or image and you get another signal or image.
• Analysis: f: X→Y such that X and Y do not share the same space (the dimensionality might be the same, but the semantics may change). E.g., apply a mask to a signal or image and you get the discontinuities, edges, or a segmentation.
• Interpretation (a.k.a. understanding): f: X→H such that H is a (natural) language encoding domain knowledge. E.g., apply a model to a signal or image and you get knowledge useful to a human expert.

INTERPRETING STATISTICS
Inferential Statistics
"If your experiment needs statistics, you ought to have done a better experiment." — Ernest Rutherford, 1st Baron Rutherford of Nelson (New Zealand/British, 1871-1937; father of nuclear physics, discoverer of the proton, Nobel Prize in Chemistry 1908)

Quotations about statistical significance
[BlandM1996] "Acceptance of statistics, though gratifying to the medical statistician, may even have gone too far. More than once I have told a colleague that he did not need me to prove that his difference existed, as anyone could see it, only to be told in turn that without the magic p-value he could not have his paper published."
[Nicholls in KatzR2001] "In general, however, null hypothesis significance testing tells us little of what we need to know and is inherently misleading. We should be less enthusiastic about insisting on its use."
[Falk in KatzR2001] "Significance tests do not provide the information that scientists need, neither do they solve the crucial questions that they are characteristically believed to answer. The one answer that they do give is not a question that we have asked."
[DuPrelJB2009] "Unfortunately, statistical significance is often thought to be equivalent to clinical relevance. Many research workers, readers, and journals ignore findings which are potentially clinically useful only because they are not statistically significant. At this point, we can criticize the practice of some scientific journals of preferably publishing significant results [...] ('publication bias')."
[GardnerMJ1986, co-authored by Altman] "...the use of statistics in medical journals has increased tremendously. One unfortunate consequence has been a shift in emphasis away from the basic results towards an undue concentration on hypothesis testing.
In this approach data are examined in relation to a statistical 'null' hypothesis, and the practice has led to the mistaken belief that studies should aim at obtaining 'statistical significance'. [...] The excessive use of hypothesis testing at the expense of other ways of assessing results has reached such a degree that levels of significance are often quoted alone in the main text and abstracts of papers, with no mention of actual concentrations, proportions, etc, or their differences. The implication of hypothesis testing - that there can always be a simple 'yes' or 'no' answer as the fundamental result from a medical study - is clearly false, and used in this way hypothesis testing is of limited value."

Modelling
• Deterministic model: maps values of the independent and controlled variables to values of the dependent variables.
• Stochastic model: maps values of the independent and controlled variables to the expectation of the dependent variables.

Stochastic analysis
In stochastic dependencies, two closely related major analyses can be carried out:
• Regression analysis: defines the type of relation (linear, exponential, logarithmic, hyperbolic, etc.) between the variables. It produces an equation, a.k.a. a model, describing the relation.
• Correlation analysis: defines the degree and consistency of the in/dependence, or the degree of association, between the variables. It produces a single value summarizing the strength of the assumed relation.

Regression analysis
Regression analysis comprises a number of statistical approaches for estimating relations between variables. It is widely used for:
A. Inference of relations between variables (modelling)
B. Prediction of new outcomes and observations (simulation)
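The two analyses above can be sketched side by side. A minimal illustration (toy data, not from the lecture): regression produces an equation y = b0 + b1·x, while correlation produces a single value r summarizing the strength of the relation.

```python
# Least-squares simple linear regression and Pearson correlation,
# computed from first principles on toy data.

def linear_regression(x, y):
    """Least-squares estimates of intercept b0 and slope b1 in y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    return b0, b1

def pearson_r(x, y):
    """Correlation: the covariance normalized by the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x
b0, b1 = linear_regression(x, y)
print(b0, b1, pearson_r(x, y))
```

Note how the same sums of squares feed both computations: regression and correlation are two readings of the same covariance structure.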
Linear univariate regression (deterministic)
y = β0 + β1·x, where y is the dependent variable, x the independent variable, β1 the slope, and β0 the intercept with the ordinate axis. In a more general notation, the βi are the parameters.

Linear univariate regression (stochastic)
Adding uncertainty to the deterministic model yields the stochastic model: Yi = β0 + β1·Xi + εi. The uncertainty (error) is expressed explicitly for each observation. The error εi is the difference between the i-th observation and its expectation; in other words, the difference between the measurement and the real value: εi = Yi − E[Yi].

For j independent and controlled variables: Yi = β0 + β1·Xi1 + β2·Xi2 + … + βj·Xij + εi. This is known as the additive linear model; it relates one dependent variable with j independent variables. Note that the unknowns are the βi coefficients (a.k.a. parameters). Modelling consists of estimating these coefficients according to a certain criterion.

In general, for n cases, a full system of equations is generated, one equation per case.

General linear model
We can conveniently express the previous model using matrices: Y = Xβ + ε, where Y is n×1, X is n×(j+1), β is (j+1)×1, and ε is n×1. The first column of X is all 1s; these are necessary for the intercept β0 with the ordinate axis. Sometimes the model is presented without a constant term, and then this column disappears.

Covariance
Covariance expresses the trend or tendency in the (linear) relation between the variables:
• If sXY > 0, then as X increases, Y increases.
• If sXY < 0, then as X increases, Y decreases.
Figure from: [http://biplot.usal.es/ALUMNOS/BIOLOGIA/5BIOLOGIA/Regresionsimple.pdf]
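The matrix form Y = Xβ + ε above is commonly estimated by ordinary least squares via the normal equations, β̂ = (XᵀX)⁻¹XᵀY. A minimal sketch for one regressor plus intercept (toy data, explicit 2×2 inverse; not the lecture's own code):

```python
# Ordinary least squares for the general linear model Y = X*beta + eps,
# with X = [1, x] (a column of ones for the intercept plus one regressor).
# beta_hat = (X^T X)^{-1} X^T Y, solved with an explicit 2x2 inverse.

def ols_two_params(x, y):
    n = len(x)
    # Entries of X^T X for X = [[1, x_i]]:
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    # Entries of X^T Y:
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx           # determinant of X^T X
    b0 = (sxx * sy - sx * sxy) / det  # intercept
    b1 = (n * sxy - sx * sy) / det    # slope
    return b0, b1

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]              # exactly y = 1 + 2x
print(ols_two_params(x, y))
```

With more regressors the same normal equations apply, just with a larger XᵀX to invert (or, better, to solve by factorization).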
Correlation coefficient
The Pearson correlation coefficient is an index expressing the magnitude of the linear association between two quantitative random variables*, and corresponds to the normalization of the covariance by the standard deviations: r = sXY/(sX·sY).
*For a formal definition of random variable, please check my slides for the course Introduction to Statistics.

Figure from: [en.wikipedia.org]

Beware! This table is out of date: some of the cells marked as "not developed" are already available. I did not have the time to update the table. Table from: [http://pendientedemigracion.ucm.es/info/mide/docs/Otrocorrel.pdf]

Adjustment
Coefficient of determination R²: a key output of regression analysis; it represents the proportion of the variance in the dependent variable that is predictable from the independent variable. The coefficient of determination is NOT the linear correlation coefficient r (that is Pearson's) but, as you can imagine, it is closely related. Yes, you guessed it: one is the square of the other (for simple linear regression, R² = r²).

Figure from: [Wolfram MathWorld]

Hypothesis testing
Considered the father of inferential statistics and the creator of ANOVA, among other models. He worked at Cambridge and UCL, and was a member of the Royal Society; he actually succeeded Pearson in his chair at UCL. Being the genius he was, he also worked in, and achieved recognition for his contributions to, many other fields: mathematics, evolutionary biology, genetics, etc. In fact, he is also the father of population genetics, describing evolutionary phenomena as a function of the variation and distribution of allelic frequencies. He further found Latin squares useful for improving experimental designs in agriculture.
Sir Ronald Aylmer Fisher (1890-1962), British. A biography and some links: http://www-history.mcs.st-andrews.ac.uk/Biographies/Fisher.html

Null and Alternative Hypothesis
Statistical testing is used to accept/reject hypotheses.
• Null hypothesis (H0): there is no difference or relation, and any observed difference is due to chance. H0: μ1 = μ2
• Alternative hypothesis (Ha): there is a difference or relation unlikely to be attributable to chance. Ha: μ1 ≠ μ2
Example. Research question: Are men taller than women? Null hypothesis: there is no height difference between genders. Alternative hypothesis: gender makes a difference in height.

Hypothesis Type / Directionality: One-tail vs Two-tail
• One-tailed: used for directional hypothesis testing. Alternative hypothesis: there is a difference and we anticipate the direction of that difference. Ha: μ1 < μ2 or Ha: μ1 > μ2
• Two-tailed: used for non-directional hypothesis testing. Alternative hypothesis: there is a difference but we do not anticipate the direction of that difference. Ha: μ1 ≠ μ2
Example. Research question: Are men taller than women? Null hypothesis: there is no height difference between genders. Alternative hypothesis, one tail: men are taller than women; two tail: one gender is taller than the other.
[Figures from: http://www.mathsrevision.net/alevel/pages.php?page=64]

Significance Level (α) and Test Power (1−β)
The probability of making each decision:

Decision \ Reality          | H0 true / Ha false    | H0 false / Ha true
Accept H0; Reject Ha        | Ok (p = 1−α)          | Type II Error (β)
Reject H0; Accept Ha        | Type I Error (p = α)  | Ok (1−β)

Type I errors can be decreased by altering the level of significance (α). Unfortunately, this in turn increases the risk of Type II errors, and vice versa. The decision on the significance level should be made (not arbitrarily but) based on the type of error we want to reduce.
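The directionality distinction above has a simple computational face: for the same test statistic, the two-tailed p-value is twice the one-tailed one. A small illustration (not from the lecture), using the standard normal CDF built from math.erf:

```python
# One-tailed vs two-tailed p-values for a z statistic.
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_one_tailed(z):
    """P(Z >= z): probability in the anticipated (upper) direction only."""
    return 1.0 - normal_cdf(z)

def p_two_tailed(z):
    """P(|Z| >= |z|): a difference in either direction counts."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

z = 1.8
print(p_one_tailed(z), p_two_tailed(z))
```

With α = 0.05, z = 1.8 would be significant one-tailed (p ≈ 0.036) but not two-tailed (p ≈ 0.072), which is exactly why choosing the one-tailed test only to attain significance is inappropriate.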
Hypothesis Type / Directionality: One-tail vs Two-tail
Hypothesis directionality affects statistical power: one-tailed tests provide more statistical power to detect an effect.
• Choosing a one-tailed test for the sole purpose of attaining significance is not appropriate. You may lose the difference in the other direction!
• Choosing a one-tailed test after running a two-tailed test that failed to reject the null hypothesis is not appropriate.
[Figures comparing one-tail and two-tail tests from: http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/reference/reference_manual_02.html. Source: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/tail_tests.htm]

Independence of observations: Paired vs Unpaired
• Paired: there is a one-to-one (bijective) correspondence between the samples of the groups. If samples in one group are reorganised, then so should the samples in the other. Examples: randomized block experiments with two units per block; studies with individually matched controls; repeated measurements on the same individual.
• Unpaired: there is no correspondence between the samples of the groups. Samples in one group can be reorganised independently of the other.
Pairing is a strategy of design, not analysis (pairing occurs before data collection!). Pairing is used to reduce bias and increase precision.

Example of paired data: N sets of twins, to know whether the 1st born is more aggressive than the 2nd born (example adapted from [DinovI2005]):

Twin pair | Aggressiveness score, 1st born | 2nd born
1         | 86                             | 88
2         | 71                             | 77
3         | 77                             | 76
…         | …                              | …
N         | 87                             | 72
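A paired design like the twins example above is analysed through the within-pair differences, which removes the between-pair variability. A minimal sketch (toy scores, not the actual twin dataset) of the paired t statistic:

```python
# Paired comparison: test the mean within-pair difference against zero.
import math

def paired_t_statistic(first, second):
    """t statistic of the mean of the per-pair differences."""
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

first_born  = [86, 71, 77, 90, 87]   # toy data
second_born = [88, 77, 76, 85, 72]
print(paired_t_statistic(first_born, second_born))
```

Reorganising one list without the other would destroy the pairing and invalidate the analysis, which is the operational meaning of the bijective correspondence above.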
[DinovI2005]

Parametric vs Non-parametric
• Parametric testing: assumes a certain distribution of the variable in the population to which we plan to generalize our data.
• Non-parametric testing: no assumption regarding the distribution of the variable in the population. That is distribution-free, NOT ASSUMPTION-FREE! Non-parametric tests look at the rank order of the values.
Parametric tests are more powerful than non-parametric ones and so should be used if possible [GreenhalghT 1997 BMJ 315:364].

[Source: 2.ppt (author unknown)]

One-way, two-way, … N-way analysis
Experimental designs may be one-factorial, two-factorial, … N-factorial, i.e. one research question at a time, two research questions at a time, … N research questions at a time. The more ways, the more difficult the interpretation of the analysis. One-way analysis measures the significance of the effects of one factor only; two-way analysis measures the significance of the effects of two factors simultaneously; etc.

Steps to apply a significance test [GurevychI2011]
1. Define a hypothesis
2. Collect data
3. Determine the test to apply
4. Calculate the test value (t, F, χ²) and re-express it as a probability p
5. Accept/reject the null hypothesis based on the degrees of freedom and the significance threshold

Which test to apply? [GurevychI2011]
Selecting the right test depends on several aspects of the data:
• Sample count (low < 30; high > 30)
• Independence of observations (paired, unpaired)
• Number of groups or datasets to be compared
• Data types (numerical, categorical, etc.)
• Assumed distributions
• Hypothesis type (one-tail, two-tail)

Which test to apply?
Independent variable                      | Dependent variable   | Test                                   | Statistic
1 population (N/A categories)             | 1, continuous normal | One-sample t-test                      | Mean
2 independent populations (2 categories)  | 1, normal            | Two-sample t-test                      | Mean
2 independent populations (2 categories)  | 1, non-normal        | Mann-Whitney, Wilcoxon rank-sum test   | Median
2 independent populations (2 categories)  | 1, categorical       | Chi-square test, Fisher's exact test   | Proportion
3 or more populations (categorical)       | 1, normal            | One-way ANOVA                          | Means
…                                         | …                    | …                                      | …

More complete tables can be found at:
• http://www.ats.ucla.edu/stat/mult_pkg/whatstat/choosestat.html
• http://bama.ua.edu/~jleeper/627/choosestat.html
• http://www.bmj.com/content/315/7104/364/T1.expansion.html

CAUSALITY

Cogito ergo sum
Cause: cogito. Effect: sum.

Causation defies (first-order) logic…
Input: "If the floor is wet, then it rained." "If we break this bottle, the floor will get wet."
Logical output: "If we break this bottle, then it rained." (Example taken from [PearlJ1999])

Why is causality so problematic? (A very silly example: which process causes which?)
• It cannot be computed from the data alone
• Systematic temporal precedence is not sufficient
• Co-occurrence is not sufficient
• It is not always a direct relation (indirect relations, transitivity/mediation, etc. may be present), let alone linear…
• It may occur across frequency bands
• …you name it
Causality is so difficult that "it would be very healthy if more researchers abandoned thinking of and using terms such as cause and effect" [Muthen1987 in PearlJ2011]

Causality requires time/order!
"…there is little use in the practice of attempting to discuss causality without introducing time" [Granger, 1969] …whether philosophical, statistical, econometric, topological, etc. Actually, "time" is NOT necessarily to be strictly understood in the chronological sense (although most times it is).
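The claim above that co-occurrence is not sufficient can be made concrete with a synthetic illustration (fabricated data, not from the lecture): two series driven by a hidden common cause correlate strongly although neither causes the other.

```python
# Two series sharing a hidden common cause z: x and y co-occur (correlate)
# without any causal link between them.
import random

random.seed(0)
z = [random.gauss(0, 1) for _ in range(1000)]   # hidden common cause
x = [zi + random.gauss(0, 0.3) for zi in z]     # x driven by z
y = [zi + random.gauss(0, 0.3) for zi in z]     # y driven by z

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / (va * vb) ** 0.5

print(pearson_r(x, y))  # strong correlation, yet neither x nor y causes the other
```

No analysis of x and y alone can distinguish this situation from a genuine causal link, which is the sense in which causality cannot be computed from the data alone.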
"Time" here means a mathematical relation of order in a set. Note that this use of order in a set is close to Lamport's causality [Lamport L (1978) Comm. ACM, 21(7):558-565]. Also, in topological causality this chronological causality is often referred to as "timelike", to indicate that it lies along the negatively signed dimension of the Minkowski space.

Causality requires directionality/context!
Algebraic equations, e.g. regression, "do not properly express causal relationships […] because algebraic equations are symmetrical objects […] To express the directionality of the underlying process, Wright augmented the equation with a diagram, later called path diagram, in which arrows are drawn from causes to effects" [PearlJ2009]. Feedback and instantaneous causality in any case are a double causation. In topological causality this is referred to as "nonspacelike" causality.

A real example [OrihuelaEspinaF2010]: an ECG. [KaturaT2006] only claim that there are interrelations (quantified using MI).

Statistical dependence
Statistical dependence is a type of relation between any two variables [WermuthN1998]: if we find one, we can expect to find the other. The spectrum runs from statistical independence, through association (symmetric or asymmetric), to deterministic dependence.
The limits of statistical dependence:
• Statistical independence: the distribution of one variable is the same no matter at which level changes occur in the other variable. X and Y are independent iff P(X∩Y) = P(X)P(Y).
• Deterministic dependence: levels of one variable occur in an exactly determined way with changing levels of the other.
• Association: intermediate forms of statistical dependency; symmetric, or asymmetric (a.k.a. response, or directed association).

Associational Inference ≡ Descriptive Statistics!!!
The most detailed information linking two variables is given by their joint distribution: P(X=x, Y=y). The conditional distribution describes how the values of X change as Y varies: P(X=x|Y=y) = P(X=x, Y=y)/P(Y=y). Associational statistics is simply descriptive (estimates, regressions, posterior distributions, etc.) [HollandPW1986]. Example: the regression of X on Y is the conditional expectation E(X|Y=y).

Statistical dependence vs Causality
Statistical dependence provides associational relations and can be expressed in terms of a joint distribution alone. Causal relations CANNOT be expressed in terms of statistical association alone [PearlJ2009]. Associational inference ≠ causal inference [HollandPW1986, PearlJ2009]; ergo, statistical dependence ≠ causal inference. In associational inference, time is merely operational.

Regression and correlation: two common forms of associational inference
Regression analysis: "the study of the dependence of one or more response variables on explanatory variables" [CoxDR2004]. Correlation is a relation over mean values: two variables correlate as they move over/under their means together (correlation is a "normalization" of the covariance).

• Correlation ≠ statistical dependence. If X and Y are statistically independent then r = 0 (absence of correlation), but the converse is not true [MarrelecG2005].
• Correlation ≠ causation [YuleU1900 in CoxDR2004, WrightS1921]. Yet causal conclusions from a carefully designed (often a synonym of randomized) experiment are often (not always) valid [HollandPW1986, FisherRA1926 in CoxDR2004].
• Strong regression ≠ causality [Box1966].
• Prediction systems ≠ causal systems [CoxDR2004].
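The joint and conditional definitions above, and the independence criterion P(X∩Y) = P(X)P(Y), can be checked numerically on a small joint table (toy distribution, not from the lecture):

```python
# Joint, marginal, and conditional distributions on a 2x2 joint table,
# plus a cell-by-cell independence check P(x, y) = P(x)P(y).

joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.2}  # P(X=x, Y=y)

def marginal_x(x):
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    """P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)."""
    return joint[(x, y)] / marginal_y(y)

independent = all(abs(p - marginal_x(x) * marginal_y(y)) < 1e-12
                  for (x, y), p in joint.items())
print(independent, conditional_x_given_y(0, 1))
```

For this particular table every cell factorizes, so X and Y are independent and P(X=0|Y=y) equals the marginal P(X=0) for both values of y; perturbing any cell breaks the factorization.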
Coherence: yet another common form of associational inference
Coherence is often understood as "correlation in the frequency domain": Cxy = |Gxy|²/(Gxx·Gyy), where Gxy is the cross-spectral density; i.e., coherence plays the role of a (squared) correlation coefficient at each frequency component. Coherence measures the degree to which two series are related. Coherence alone does not imply causality! The temporal lag of the phase difference between the signals must also be considered.

From association to causation
Barriers between classical statistics and causal analysis [PearlJ2009]:
1. Coping with untested assumptions and changing conditions
2. Inappropriate mathematical notation

Causality (levels, from stronger to weaker; inspired from [CoxDR2004]):
• Zero-level causality: a statistical association, i.e. a non-independence which cannot be removed by conditioning on allowable alternative features. E.g. Granger's, topological.
• First-level causality: use of one treatment over another causes a change in outcome. E.g. Rubin's, Pearl's.
• Second-level causality: explanation via a generating process, provisional and hardly lending itself to formal characterization, either merely hypothesized or solidly based on evidence. E.g. Suppes', Wright's path analysis; e.g. "smoking causes lung cancer".
It is debatable whether second-level causality is indeed causality.

Variable types and their joint probability distribution
Variable types:
• Background variables (B): specify what is fixed
• Potential causal variables (C)
• Intermediate variables (I): surrogates, monitoring, pathways, etc.
• Response variables (R): observed effects
Joint probability distribution of the variables: P(RICB) = P(R|ICB) P(I|CB) P(C|B) P(B). It is possible to integrate over I (marginalize): P(RCB) = P(R|CB) P(C|B) P(B). In [CoxDR2004].
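The factorization and marginalization above can be sketched numerically. A toy example with made-up conditional tables (all variables binary; not from [CoxDR2004]):

```python
# The factorization P(R,I,C,B) = P(R|I,C,B) P(I|C,B) P(C|B) P(B) and the
# marginalization over the intermediate variable I.
import itertools

p_b = {0: 0.5, 1: 0.5}                              # P(B), toy numbers
def p_c(c, b): return 0.6 if c == b else 0.4        # P(C=c | B=b), toy
def p_i(i, c, b): return 0.8 if i == c else 0.2     # P(I=i | C=c, B=b), toy
def p_r(r, i, c, b): return 0.9 if r == i else 0.1  # P(R=r | I,C,B), toy

def joint(r, i, c, b):
    """Full joint via the factorization."""
    return p_r(r, i, c, b) * p_i(i, c, b) * p_c(c, b) * p_b[b]

def marginal_rcb(r, c, b):
    """P(R,C,B): integrate (sum) the joint over I."""
    return sum(joint(r, i, c, b) for i in (0, 1))

total = sum(joint(r, i, c, b)
            for r, i, c, b in itertools.product((0, 1), repeat=4))
print(total)  # a proper joint distribution sums to 1
```

Marginalizing over I discards the pathway information but leaves a valid joint P(RCB), which is exactly the move made in the slide above.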
Granger's Causality
Granger's causality: Y is causing X (Y→X) if we are better able to predict X using all available information Z than using all the information apart from Y. The groundbreaking paper: Granger, "Investigating causal relations by econometric models and cross-spectral methods", Econometrica 37(3): 424-438. Granger's causality is only a statement about one thing happening before another! It rejects instantaneous causality, considered as slowness in the recording of information. Sir Clive William John Granger (1934-2009), University of Nottingham, Nobel Prize winner.

"The future cannot cause the past" [Granger 1969]; "the direction of the flow of time [is] a central feature". Feedback is a double causation: X→Y and Y→X, denoted X⇔Y. "Causality…is based entirely on the predictability of some series…" [Granger 1969]. Causal relationships may be investigated in terms of coherence and phase diagrams.

Topological causality
"A causal manifold is one with an assignment to each of its points of a convex cone in the tangent space, representing physically the future directions at the point. The usual causality in M O extends to a causal structure in M'." [SegalIE1981]
Causality is seen as embedded in the geometry/topology of manifolds; causality is a curve function defined over the manifold. The groundbreaking book: Segal IE, "Mathematical Cosmology and Extragalactic Astronomy" (1976). The father of causal manifolds is likely Lorentz; nevertheless, Segal's contribution to the field of causal manifolds is simply overwhelming. Irving Ezra Segal (1918-1998), Professor of Mathematics at MIT.

Causal (homogeneous Lorentzian) manifolds: the topological view of causality
The cone of causality [SegalIE1981, RainerM1999, MosleySN1990, KrymVR2002]: future; instant present; past.
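Granger's predictability criterion above can be illustrated with a minimal synthetic sketch (order-1 autoregressions, fabricated data; real Granger tests use an F statistic on nested models): y Granger-causes x if adding lagged y reduces the residual error of predicting x beyond what lagged x alone achieves.

```python
# Granger-style comparison: residual sum of squares of a restricted model
# (past of x only) vs a full model (past of x and past of y).
import random

random.seed(1)
n = 2000
y = [random.gauss(0, 1) for _ in range(n)]
x = [0.0]
for t in range(1, n):
    x.append(0.5 * x[t - 1] + 0.8 * y[t - 1] + random.gauss(0, 0.5))  # y drives x

def fit_and_rss(cols, target):
    """Least squares (no intercept, up to 2 regressors) via normal equations."""
    if len(cols) == 1:
        u = cols[0]
        a = sum(ui * ti for ui, ti in zip(u, target)) / sum(ui * ui for ui in u)
        resid = [ti - a * ui for ui, ti in zip(u, target)]
    else:
        u, v = cols
        suu = sum(ui * ui for ui in u)
        svv = sum(vi * vi for vi in v)
        suv = sum(ui * vi for ui, vi in zip(u, v))
        sut = sum(ui * ti for ui, ti in zip(u, target))
        svt = sum(vi * ti for vi, ti in zip(v, target))
        det = suu * svv - suv * suv
        a = (svv * sut - suv * svt) / det
        b = (suu * svt - suv * sut) / det
        resid = [ti - a * ui - b * vi for ui, vi, ti in zip(u, v, target)]
    return sum(r * r for r in resid)

target, lag_x, lag_y = x[1:], x[:-1], y[:-1]
rss_restricted = fit_and_rss([lag_x], target)       # past of x only
rss_full = fit_and_rss([lag_x, lag_y], target)      # plus past of y
print(rss_full < rss_restricted)                    # lagged y improves prediction
```

Note the direction matters: repeating the exercise predicting y from lagged x would show essentially no improvement, since in this simulation x does not drive y.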
Causal (homogeneous Lorentzian) manifolds: the topological view of causality
A relation of causality between points of a pseudo-Riemannian manifold may be [Kronheimer and Penrose, 1967, Proc. Camb. Phil. Soc. 63:481-501]:
• Horismos: y lies on the causal cone
• Chronological or timelike: y lies inside the causal cone
• Non-spacelike (sometimes referred to simply as causal): y lies not outside the causal cone

Rubin Causal Model
Rubin causal model: "Intuitively, the causal effect of one treatment relative to another for a particular experimental unit is the difference between the result if the unit had been exposed to the first treatment and the result if, instead, the unit had been exposed to the second treatment." The groundbreaking paper: Rubin, "Bayesian inference for causal effects: the role of randomization", The Annals of Statistics 6(1): 34-58. The term Rubin causal model was coined by his student Paul Holland. Donald B. Rubin (1943-), John L. Loeb Professor of Statistics at Harvard.

Causality is an algebraic difference: the treatment causes the effect Ytreatment(u) − Ycontrol(u). In other words, the effect of a cause is always relative to another cause [HollandPW1986]. The Rubin causal model establishes the conditions under which associational (e.g. Bayesian) inference may infer causality (it makes the assumptions needed for causality explicit).

Fundamental Problem of Causal Inference
Only Ytreatment(u) or Ycontrol(u) can be observed on a phenomenon, but not both. Causal inference is impossible without making untested assumptions. Yet causal inference is still possible under uncertainty [HollandPW1986] (two otherwise identical populations u must be prepared, and all appropriate background variables must be considered in B).
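The fundamental problem above can be made concrete with a fabricated potential-outcomes table (toy numbers, not from the lecture): an omniscient observer could average the per-unit differences Ytreatment(u) − Ycontrol(u), while a real analyst only sees one outcome per unit.

```python
# Potential outcomes: (y_treatment, y_control, received_treatment) per unit.
units = [
    (7.0, 5.0, True),
    (6.0, 6.5, False),
    (8.0, 5.5, True),
    (5.0, 4.0, False),
]

# What only an omniscient observer could compute (never available in practice):
true_ate = sum(yt - yc for yt, yc, _ in units) / len(units)

# What we actually observe: one potential outcome per unit, not both.
observed_treated = [yt for yt, _, treated in units if treated]
observed_control = [yc for _, yc, treated in units if not treated]
naive_estimate = (sum(observed_treated) / len(observed_treated)
                  - sum(observed_control) / len(observed_control))

print(true_ate, naive_estimate)  # the naive group difference need not match
```

Randomizing the treatment assignment is precisely the untestable-assumption-reducing design that makes the naive group difference an unbiased estimate of the average effect; here the assignment is arbitrary, so the two quantities differ.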
Again: causal questions cannot be computed from the data alone, nor from the distributions that govern the data [PearlJ2009].

Relation between Granger, Rubin and Suppes causalities
(Modified from [HollandPW1986])

                                 | Granger | Rubin's model
Cause (treatment)                | Y       | t
Effect                           | X       | Ytreatment(u)
All other available information  | Z       | Z (pre-exposure variables)

Granger's noncausality: X is not a Granger cause of Y (relative to the information in Z) iff X and Y are conditionally independent, i.e. P(Y|X,Z) = P(Y|Z). Granger's noncausality is equal to Suppes' spurious case.

Pearl's statistical causality (a.k.a. structural theory)
"Causation is encoding behaviour under intervention […] Causality tells us which mechanisms [stable functional relationships] are to be modified [i.e. broken] by a given action" [PearlJ1999_IJCAI]. Judea Pearl (1936-), professor of computer science and statistics at UCLA. Causality, intervention and mechanisms can be encapsulated in a causal model. The groundbreaking book: Pearl J, "Causality: Models, Reasoning and Inference" (2000)* (*with permission of his 1995 Biometrika paper masterpiece). Sewall Green Wright (1889-1988), father of path analysis (graphical rules). Pearl's results do establish conditions under which first-level causal conclusions are possible [CoxDR2004] [PearlJ2000, Lauritzen2000, DawidAP2002].

Statistical causality: conditioning vs intervening [PearlJ2000]
• Conditioning: P(R|C) = Σ_B P(R|CB) P(B|C). Useful, but inappropriate for causality, as changes in the past (B) occur before the intervention (C).
• Intervention: P(R‖C) = Σ_B P(R|CB) P(B). Pearl's definition of causality.
Underlying assumption: the distribution of R (and I) remains unaffected by the intervention. Watch out! This is not trivial: serious interventions may distort all relations [CoxDR2004].
If βCB = 0 (structural coefficient), then C ╨ B (conditional independence) and P(R|C) = P(R‖C); i.e.
if C and B are independent, there is no difference between conditioning and intervention.

DATA MINING

Initial definitions
• In a conditional probability P(x|y), the probabilities P(y) are called the priors.
• The likelihood function is the probability of the evidence given the parameters, i.e. the model: p(x|θ).
• The posterior probability is the probability of the parameters, i.e. the model, given the evidence: p(θ|x).
• Factors of variation: aspects of the data that can vary separately, i.e. the intrinsic dimensionality of the manifold.
• Computational element or unit: a mathematical function or block that can be reused to express more complex mathematical functions. Examples: basic logic gates (AND, OR, NOT), artificial neurons, decision trees, etc.
• Fan-in: the maximum number of inputs of a particular element.
• System or computational model: a set of interconnected computational elements, at times represented by a graph.
• Size of a system: the number of elements in the system. Important to justify deep learning is the observation that reorganizing the way in which computational units are composed or connected can have a drastic effect on the efficiency of representation size [BengioY2009, pg 19].

Types or classes of models:
• Generative models: models for randomly generating observable data, P(X,Y). These include HMMs, GMMs, restricted Boltzmann machines, etc.
• Discriminative or conditional models: models for capturing the dependence of an unobserved variable Y on an observed variable X, P(Y|X). These include linear discriminant analysis, SVMs, linear regressors, ANNs, …
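Pearl's conditioning-vs-intervening distinction above is easy to see numerically. A toy example (binary background B and cause C, made-up probabilities): conditioning weighs by P(B|C), intervening weighs by P(B), and the two differ exactly when C depends on B.

```python
# P(R=1 | C=1) vs P(R=1 || C=1) on a toy three-variable model.
p_b = {0: 0.5, 1: 0.5}                       # P(B)
p_c_given_b = {(1, 0): 0.9, (1, 1): 0.1}     # P(C=1 | B=b): C depends on B
p_r_given_cb = {(0, 0): 0.2, (0, 1): 0.4,    # P(R=1 | C=c, B=b), keyed (c, b)
                (1, 0): 0.6, (1, 1): 0.8}

def p_b_given_c1(b):
    """Bayes: P(B=b | C=1)."""
    num = p_c_given_b[(1, b)] * p_b[b]
    den = sum(p_c_given_b[(1, bb)] * p_b[bb] for bb in (0, 1))
    return num / den

# Conditioning: sum_b P(R=1|C=1,B=b) P(B=b|C=1)
conditioned = sum(p_r_given_cb[(1, b)] * p_b_given_c1(b) for b in (0, 1))

# Intervening: sum_b P(R=1|C=1,B=b) P(B=b)  -- B keeps its pre-action distribution
intervened = sum(p_r_given_cb[(1, b)] * p_b[b] for b in (0, 1))

print(conditioned, intervened)  # they differ because C depends on B
```

Setting P(C=1|B=0) = P(C=1|B=1) (C independent of B) makes the two quantities coincide, which is the βCB = 0 case stated above.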
Posterior probability
Using Bayes' rule, p(θ|x) = p(x|θ)p(θ)/p(x), which can be "re-expressed" for easy remembering as the directly proportional (∝) relation: posterior probability ∝ likelihood × prior probability. In other words, since the joint distribution p(x,θ) = p(x|θ)p(θ): posterior probability ∝ joint distribution.

From the above, two basic approximations for estimating posterior probabilities follow [ResnikP2010]:
• Maximum likelihood estimation (MLE), which amounts to counting and then normalizing so that the probabilities sum to 1. MLE produces the choice most likely to have generated the observed data.
• Maximum a posteriori (MAP) estimation. The MAP estimate is the choice that is most likely given the observed data.

Both MLE and MAP give us the best estimate according to their respective definitions of "best". In contrast to MLE, MAP estimation applies Bayes' rule, so that our estimate can take into account prior knowledge about what we expect θ to be, in the form of a prior probability distribution P(θ). Neither MLE nor MAP gives a whole distribution P(θ|x).

Patterns
Patterns are regularities in data [Wikipedia:Pattern_recognition]. Patterns refer to models (regression or classification) or components of models (e.g. a linear term in a regression) [FayyadU1996, pg 51] [Fayyad et al (1996) AI Magazine Fall:37-54, >6500 citations!]

Data mining
Data mining is:
• "the application of specific algorithms for extracting patterns from data" [FayyadU1996]
• "the computational process of discovering patterns in large data sets" [Wikipedia:Data_mining]
• the analysis step of the "Knowledge Discovery in Databases" (KDD) process [FayyadU1996, Wikipedia:Data_mining]
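The MLE-vs-MAP contrast above is classically illustrated with a coin-flip example (standard textbook material, not from the lecture): for the bias θ of a coin with a Beta(a, b) prior, MLE is "counting and normalizing", while MAP pulls the estimate toward the prior.

```python
# MLE vs MAP for the bias theta of a coin, Beta(a, b) prior on theta.

def mle(heads, tails):
    """Counting and normalizing: theta_MLE = heads / n."""
    return heads / (heads + tails)

def map_estimate(heads, tails, a, b):
    """Mode of the Beta posterior: (heads + a - 1) / (n + a + b - 2)."""
    return (heads + a - 1) / (heads + tails + a + b - 2)

heads, tails = 8, 2
print(mle(heads, tails))                 # 0.8
print(map_estimate(heads, tails, 1, 1))  # uniform prior Beta(1,1): same as MLE
print(map_estimate(heads, tails, 5, 5))  # informative prior pulls toward 0.5
```

Both return a single point estimate of θ; neither gives the whole posterior distribution P(θ|x), exactly as noted above (for that one would keep the full Beta posterior).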
Felipe Orihuela-Espina 74 Data mining • Discovering patterns in large data sets [Wikipedia:Data_mining] Data mining Pattern recognition • Recognition of regularities (patterns) in data [Wikipedia:Pattern_recognition] • Data-driven classification [JainAK2000] • Nearly synonymous with machine learning [Wikipedia:Pattern_recognition] Different names for the same thing? Machine learning Knowledge discovery • Data-driven discovery of knowledge • It adds processing (cleaning, selection) steps to data mining [FayyadU1996] • Construction and study of algorithms that can learn (the act of acquiring new knowledge) from data • Often overlaps with computational statistics [Wikipedia:Machine_learning] © 2015-16 Dr. Felipe Orihuela-Espina 75 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 76 Data mining … © 2015-16 Dr. Felipe Orihuela-Espina 77 [Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!] Data mining Classification is strongly related to regression [FayyadU1996]: Regression is learning a function that maps a data item to a real-valued prediction variable. Classification is learning a function that maps (classifies) a data item into one of several predefined classes. © 2015-16 Dr. Felipe Orihuela-Espina 78 Learning Goal The objective of learning in AI is giving computers the ability to understand our world in terms of inferring semantic concepts and relationships among these concepts. Scope: Single task: Observations come from a single task Multi-task: Observations come from several tasks at once © 2015-16 Dr. Felipe Orihuela-Espina 79 Types of learning Supervised: Relies on known (labelled) examples, a.k.a. the training set, to find a discrete regressor Unsupervised: Finds regularities and structures (i.e.
fits probability distributions) to observations Reinforced: Updates the currently learned model based on rewards assessing its outputs Semi-supervised: From an initially learned supervised model, it evolves unsupervisedly by generating synthetic "rewards" proportional to the likelihood of the new observations. Active: A particular case of semi-supervised learning in which the new observations are chosen or selected from all arriving new observations according to a certain criterion. Transfer: A particular case of semi-supervised learning in which new observations come from a new domain or task. © 2015-16 Dr. Felipe Orihuela-Espina 80 Basic problems in learning Modelling: It refers to encoding dependencies between variables under a given chosen form. In fact, modelling per se just refers to choosing this form, and in its most minimalistic case it does not require the model to be representative of the phenomenon, explanatory, nor predictive! It may be just nuts, a silly model! Learning: It refers to optimizing the parameters of the model by minimizing the loss functional, i.e. a particular criterion, e.g. least squares error. Inference or reconstruction: It refers to estimating posterior probabilities of hidden variables given observed ones, P(h|x) or h=f(x) © 2015-16 Dr. Felipe Orihuela-Espina 81 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 82 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 83 Data mining Feature selection Feature extraction © 2015-16 Dr. Felipe Orihuela-Espina 84 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 85 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 86 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] 87 © 2015-16 Dr.
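The "learning" problem above — optimizing the parameters of a chosen model form by minimizing a loss functional such as least squares error — can be sketched in a few lines (a minimal illustration; the data, learning rate and step count are arbitrary choices):

```python
# Learning as loss minimization: fit the chosen model form y = w*x + b
# by gradient descent on the mean squared error L(w, b).

def fit_linear(xs, ys, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]   # roughly y = 2x + 1
w, b = fit_linear(xs, ys)
```

Note the division of labour the slide describes: choosing the form y = w*x + b is *modelling*; running the descent is *learning*; and evaluating w*x + b on a new x would be *inference*.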
Felipe Orihuela-Espina Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 88 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 89 Data mining Clustering [Fayyad et al (1996) AI magazine Fall:37-54, >6500 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 90 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 91 Data mining [Jain et al (2000) IEEE TPAMI 22(1):4-37, >5000 citations!] © 2015-16 Dr. Felipe Orihuela-Espina 92 Optimizing model selection [EscalanteHJ2009] Assumes LTI. Xi,pre -> Combination of preprocessing methods; Y1…Npre -> Hyperparameters for preprocessing; Xi,fs -> Feature selection method; Y1…Nfs -> Hyperparameters for feature selection; Xi,class -> Classifier method; Y1…Nclass -> Hyperparameters for classification © 2015-16 Dr. Felipe Orihuela-Espina 93 Data mining “Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.” [FayyadU1996] © 2015-16 Dr. Felipe Orihuela-Espina 94 REPRESENTATION LEARNING AND MANIFOLD EMBEDDING © 2015-16 Dr. Felipe Orihuela-Espina 95 Representation learning and manifold embedding A manifold is a topological space* that is locally Euclidean. The concept of a manifold is the generalisation of the traditional Euclidean (linear) space to adapt to non-Euclidean topologies. Note that “locally Euclidean” does not mean that it is constrained to a Euclidean metric globally, but only that it is locally homeomorphic to a Euclidean space.
In other words, a manifold is a k-dimensional object placed in an n-dimensional ambient space A k-dimensional manifold is a submanifold with k degrees of freedom, i.e. it can be described with only k coordinates * Remember: a topological space is a set with a topology (structure). A topology is a set of subsets of the original space that satisfies the following axioms: (i) the empty set and the set itself are in the topology, (ii) the union of an arbitrary collection of sets in the topology is also in the topology, and (iii) the intersection of a finite collection of sets in the topology is also in the topology. © 2015-16 Dr. Felipe Orihuela-Espina 96 Representation learning and manifold embedding If the manifold is infinitely differentiable then it is called a smooth manifold. A smooth manifold with a metric imposed to induce the topology is called a Riemannian manifold. A submanifold is a subset of a manifold which is itself a manifold. © 2015-16 Dr. Felipe Orihuela-Espina [Wolfram, World of Maths] [Carreira-Perpiñán,1997] 97 Representation learning and manifold embedding A homeomorphism is a continuous bijective transformation between topological spaces X and Y, f:X→Y The fact that it is continuous means that points which are close in X are also close in Y, and points which are far in X are also far in Y. The fact that it is bijective (or 1 to 1) means that it is injective and surjective, and also implies that there exists the inverse f-1:Y→X If the homeomorphism is differentiable, i.e. if the derivative and its inverse exist, then it is called a diffeomorphism. © 2015-16 Dr. Felipe Orihuela-Espina 98 Representation learning and manifold embedding An embedding is a map f:X→Y such that f is a diffeomorphism from X to f(X), and f(X) is a smooth submanifold of Y. An embedding is the representation of a topological object (e.g. a manifold, graph, lattice, etc) in a certain (sub-)space so that its topology is preserved.
In particular, for manifolds, it preserves the open sets in the underlying topology T. [Roweis, 2000] [Maaten, 2007] [Bonatti,2006] © 2015-16 Dr. Felipe Orihuela-Espina 99 Representation learning and manifold embedding Summarizing… A manifold is any object which is locally linear (flat). An embedding is a function from one space to another such that the topology (shape) is preserved through deformations (twisting and stretching) Ergo… Manifold embedding refers to the transformation of your data whilst ensuring you do not alter the intrinsic relations among the observations. © 2015-16 Dr. Felipe Orihuela-Espina 100 Manifold Embedding: Nomenclature Manifold embedding is also called Manifold learning [Souvernir 2005] Multivariate data projection [[Mao,1995] in Demartines, 1997] or simply projection [Venna2007] Data embedding [Yang2004] Representation learning [BengioY2010] The origin space is sometimes called: High dimensional (input) space [Tenenbaum, 2000][Demartines,1997][Venna2007] Vector space [Roweis, 2000][Sammon,1969][Brand,2003] Data space [Souvernir 2005] Observation space [Silva 2002] Domain space [Yang 2004, 2005] Feature space (usually in the context of pattern recognition and analysis) The destination space is usually more consistently called Low-dimensional space But other names include output space [Demartines, 1997][Venna2007] and I personally like… embedding space [Leff, 2007] 101 Manifold Embedding Dimensionality reduction is a particular case of manifold embedding, in which the dimension of the destination space is lower than that of the original data space Domain specific data are often distributed on (lie on, or close to) a low dimensional manifold in a high dimensional space [Yang, 2004] Topology or structure is retained/preserved if the pairwise distances in the low dimensional space approximate the corresponding pairwise distances in the feature space.
[Sammon,1969] 102 Manifold Embedding Variants Multiple manifold embedding Data lie in more than 1 manifold Multi-class manifold embedding Data lie in a single manifold, but sampling contains large gaps, perhaps even fragmenting connected components 103 Manifold embedding The intrinsic dimensionality (ID) of a manifold has been defined as “the number of independent variables that explains satisfactorily” that manifold. Determination of the ID eliminates the possibility of over- or under-fitting. Since it is always possible to find a manifold of any dimension which passes through all points in a data set given enough parameters, the problem of estimating the ID of a dataset is ill-posed in the Hadamard sense. Note that this is the case of interpolation, which finds a 1-D curve to fit a dataset! [CarreiraPerpiñán,1997] Figure modified from [CarreiraPerpiñán,1997] 104 Manifold embedding The topological dimension is the “local” dimensionality at every point, i.e. the dimension of the tangent space The topological dimension is a lower bound of the ID Example: Sphere: ID: 3 Topological dimension: 2 (at every point the sphere can be approximated by a surface) [Camastra, 2003] 105 Representation learning and manifold embedding In manifold embedding, there are methods for: Estimating the intrinsic dimensionality of the data, without actually projecting the data. Generating a (meaningful) configuration by means of a projection (data projection methods), a.k.a. Representation learning If the configuration is low dimensional then it is often referred to as dimensionality reduction.
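A crude intrinsic-dimensionality estimate in the spirit of the eigenvalue-based estimators listed next can be sketched with plain PCA on the global covariance (a minimal illustration, not any specific published algorithm; the synthetic data and the 1% variance threshold are assumptions):

```python
import numpy as np

# Estimate intrinsic dimensionality by counting significant eigenvalues
# of the data covariance. Synthetic data: a 2-D plane embedded in R^5,
# i.e. intrinsic dimensionality 2 despite extrinsic dimensionality 5.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))   # 2 intrinsic coordinates
A = rng.normal(size=(2, 5))          # fixed linear embedding into R^5
X = latent @ A                       # 500 points on a 2-D subspace of R^5

def estimate_id(X, threshold=0.01):
    """Count eigenvalues carrying more than `threshold` of the total variance."""
    cov = np.cov(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)
    return int(np.sum(eigvals / eigvals.sum() > threshold))

print(estimate_id(X))  # 2: only two directions carry variance
```

Being global and linear, this sketch would overestimate the ID of a curved manifold (e.g. a swiss roll); the local and fractal estimators listed below exist precisely to handle that case.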
106 Manifold embedding Examples of methods for estimating the intrinsic dimensionality of data (without projection) Bennet’s algorithm [Bennet, 1969] Local eigenvalue estimator [Verveer et al, 1995] Fukunaga and Olsen algorithm [Fukunaga et al, 1971] Bruske and Sommer’s work based on topology preserving maps [Bruske et al 1998] Trunk’s statistical approach (near neighbour techniques) [Trunk, 1968] [[Trunk, 1976] in [Camastra, 2003]] Pettis’ algorithm – Adds the assumption of uniformly distributed sampling to derive a simple expression. Near neighbour estimator [Verveer et al, 1995] Fractal based methods [Review by Camastra, 2003] Broomhead’s topological dimension of a time series [Broomhead, 1987] 107 Representation learning and manifold embedding Examples of linear data projection methods PCA (Principal Component Analysis) [Refs - LOTS!!] MDS (Multidimensional Scaling, a.k.a. Principal coordinate analysis) (Refs - LOTS!! – [Kruskal, 1974][Cox,1994]) ICA (Independent Component Analysis) [Comon, 1994] CCA (Canonical Correlation Analysis) [Friman, 2002] PP (Projection pursuit) [Carreira-Perpiñán, 1997] 108 Representation learning and manifold embedding Examples of non-linear data projection methods Sammon’s non-linear mapping (NLM) [Sammon, 1969] GeoNLM [Yang, 2004b] Kohonen’s self-organising maps (SOM) [Kohonen, 1997] a.k.a.
topologically continuous maps, and Kohonen maps Temporal Kohonen maps [Chappell,1993] Laplacian eigenmaps [Belkin, 2002, 2003] Laplacian eigenmaps with fast N-body methods [Wang, 2006] PCA based: Non-linear PCA [Fodor, 2002], Kernel PCA [Scholkopf, 1998], Principal Curves [Carreira-Perpiñán, 1997], Space partition and locally applied PCA [Olsen and Fukunaga, 1973] 109 Representation learning and manifold embedding Examples of non-linear data projection methods Isomap [Tenenbaum, 2000] FR-Isomap [Lekadir, 2006], S-Isomap [Geng, 2005], ST-Isomap [Jenkins, 2004], L-Isomap [Silva, 2002], C-Isomap [Silva, 2002] Locally linear embedding (LLE) [Roweis, 2000] Hessian Eigenmaps, a.k.a. Hessian Locally Linear Embedding [Donoho, 2003] Curvilinear Component Analysis [Demartines, 1997] Curvilinear Distance Analysis (CDA) [Lee, 2002, 2004] 110 Representation learning and manifold embedding Examples of non-linear data projection methods Kernel ICA [Bach, 2003] Manifold charting [Brand, 2003] Stochastic neighbour embedding [Hinton, 2002] Triangulation method [Lee, 1977] Tetrahedral methods: Distance preserving projection [Yang, 2004] 111 Representation learning and manifold embedding Examples of non-linear data projection methods Semidefinite embedding (SDE) Minimum Volume Embedding [Shaw, 2007] Conformal Eigenmaps [Maaten, 2007] Maximally angle preserving Maximum Variance Unfolding (MVU) [Maaten, 2007] Variant of LLE Diffusion Maps (DM) Based on a Markov random walk on the high dimensional graph to get a measure of proximity between data. 112 Representation learning and manifold embedding Data representation refers simply to the chosen feature space, i.e. the feature vector [BengioY2013]. The construction or learning of this feature space goes under the name of feature engineering and includes more rudimentary subproblems such as feature selection and extraction, e.g. processing and transformations. © 2015-16 Dr.
Felipe Orihuela-Espina 113 Representation learning and manifold embedding A good representation is one that disentangles the underlying factors of variation [BengioY2013]. As soon as there is a notion of representation, one can think of a manifold [BengioY2013]. © 2015-16 Dr. Felipe Orihuela-Espina 114 Local vs non-local generalization Local generalization It refers to an underlying assumption made by many learning algorithms: the output f(x1) is similar to f(x2) iff x1 is similar to (i.e. close to/in the neighbourhood of) x2. Non-local generalization Learning a function that behaves differently in different regions of the data-space requires different parameters for each of these regions. © 2015-16 Dr. Felipe Orihuela-Espina 115 Local generalization Local generalization is closely related to manifold learning; Since a manifold is locally Euclidean, it can be approximated locally by linear patches tangent to the manifold surface. If it is smooth, then these patches (i.e. the computational units) will be reasonably large and the number of patches needed (i.e. the size of the computational model) will be small. However, if the manifold is highly curved (i.e. a complex, highly varying function) then the patches will have to be small, increasing the number of patches needed to characterise the manifold. Figure reproduced from [BengioY2009, pg 25] © 2015-16 Dr. Felipe Orihuela-Espina 116 Local generalization Local generalization is related to the curse of dimensionality. However, what matters for generalization is not the [extrinsic] dimensionality, but the number of variations of the function [i.e. the intrinsic dimensionality] that we want to learn. Generalization is mostly achieved by a form of local interpolation between neighbouring training examples. © 2015-16 Dr.
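The curse of dimensionality mentioned above can be glimpsed numerically: pairwise distances concentrate as the ambient dimension grows, so the "neighbours" on which local interpolation relies become nearly indistinguishable from everything else (a small illustrative experiment; the sample sizes, dimensions and seed are arbitrary choices):

```python
import numpy as np

# Distance concentration: in high dimensions the nearest and farthest
# points become relatively indistinguishable, undermining purely local
# generalization based on neighbourhoods.
rng = np.random.default_rng(42)

def relative_contrast(dim, n=200):
    """(max - min) / min over distances-to-origin of n uniform random points."""
    pts = rng.uniform(size=(n, dim))
    d = np.linalg.norm(pts, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)      # plenty of contrast in 2-D
high = relative_contrast(1000)  # distances nearly equal in 1000-D
```

This is why the slide stresses intrinsic over extrinsic dimensionality: if the data actually lie on a low-dimensional manifold, neighbourhoods along the manifold remain meaningful even when the ambient space is huge.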
Felipe Orihuela-Espina 117 Representation learning and manifold embedding Types of representations Expressive representations Distributed representations Overcomplete representations Invariant representations © 2015-16 Dr. Felipe Orihuela-Espina 118 Representation learning and manifold embedding Expressive representations: It refers to the ability of capturing a huge number of input configurations with a reasonably sized representation. In other words, having few features suffices to cover most of the data space. That's good old content validity meets computational spatial efficiency (Felipe's dixit) Traditional algorithms require O(N) parameters (and/or O(N) training examples) to distinguish O(N) input regions. Linear features, e.g. those learnt by PCA, cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation. However, it is still possible to use the linear features in deep learning; e.g. inserting a non-linearity between learned single-layer linear projections. © 2015-16 Dr. Felipe Orihuela-Espina 119 Representation learning and manifold embedding Distributed representations: It refers to having more than one computational unit charting a certain region of the data space at the same time. Distributed representations are often (always?) expressive. Example: Imagine one binary classifier over a certain space. It partitions the space into 2 subregions. But having 3 classifiers over that same space can partition the space into exponentially more regions. Distributed representations can alleviate the curse of dimensionality and the limitations of local generalization. Figure reproduced from [BengioY2009, pg 27] © 2015-16 Dr. Felipe Orihuela-Espina 120 Representation learning and manifold embedding Overcomplete representations: It refers to having more (hidden) computational units, i.e. degrees of freedom, than training examples. Often leads to overfitting, endangering generalization.
May still be useful for ad-hoc predictive value or denoising However; “importantly, DBMs, (in the case of MNIST despite having millions of parameters and only 60k training samples), do not appear to suffer much from overfitting” [SalakhutdinovR2009, pg453] ...hmmm, not sure about this; Salakhutdinov says so, but he does not provide any evidence that this is the case. © 2015-16 Dr. Felipe Orihuela-Espina 121 Representation learning and manifold embedding Invariant representations: It refers to having computational units which, by having learnt abstract concepts, achieve outputs which are invariant to local changes of the input. This often needs highly non-linear transfer functions. Invariance and abstraction go hand in hand. Having invariant features is a long-standing goal in pattern recognition. Achieving invariance, i.e. reducing sensitivity along a certain direction of the data, does not guarantee having disentangled a certain factor of variance in the data. Although invariance is often good, the ultimate goal is not to achieve invariance, but to disentangle explanatory factors [BengioY2013], … that's manifold embedding! Therefore, the goal of building invariant features should be removing sensitivity to directions of variance that are uninformative to the task. Building invariant representations often involves two steps; Low level features are selected to account for the data Higher level features are extracted from low level features © 2015-16 Dr. Felipe Orihuela-Espina 122 DEEP LEARNING © 2015-16 Dr. Felipe Orihuela-Espina 123 Deep learning Much of the actual effort in deploying machine learning algorithms goes into feature engineering. Representation learning, closely related to deep learning, is about learning a representation of the data, i.e. a feature space, that makes it easier to extract useful information when building predictors (e.g. classifiers, regressors, etc). Deep learning is a particular case of representation learning.
…it is just that right now (2015-16) it is on the crest of the wave © 2015-16 Dr. Felipe Orihuela-Espina 124 Deep learning Deep architectures are model architectures composed of multiple levels of non-linear operations or computational elements. The number of levels, i.e. the longest path from an input node to an output node, is referred to as the depth of the architecture. © 2015-16 Dr. Felipe Orihuela-Espina 125 Deep learning An architecture may be: Shallow architecture: often up to 3 levels of depth Deep architecture: more than 3 levels Example: Brain anatomy; 5-10 levels in the visual system [SerreT2007] Funnily enough, examples and systems used in scientific papers devoted to deep learning hardly go beyond 3 levels, e.g. [SalakhutdinovR2013_TPAMI]. So not that deep! © 2015-16 Dr. Felipe Orihuela-Espina 126 Deep learning Pros and cons in a nutshell Pros • Relaxes the need for feature engineering • Modelling becomes truly data-driven • Bigger compartmentalization of the search space achieved (with a fixed number of hidden variables) Cons • Higher complexity of the model • Larger number of parameters • "Direct" training becomes intractable © 2015-16 Dr. Felipe Orihuela-Espina 127 Deep learning Deep Boltzmann Machines (DBM): A variant of Boltzmann machines that, instead of having one single layer of hidden variables (in contrast to the RBM), has multiple layers of hidden variables, with units in odd-numbered layers being conditionally independent given even-numbered layers and vice versa. Figure: Deep Boltzmann Machine with 3 layers. Figure reproduced from [SalakhutdinovR2013_TPAMI] © 2015-16 Dr. Felipe Orihuela-Espina 128 Questions that I'm unable to answer at the moment Overfitting. Clearly deep models are prone to overfitting considering they use overcomplete representations. …it’s not me, but Bengio who warns about this! From his particular example with MNIST images, [SalakhutdinovR2009, pg453] claims this does not seem to be the case.
However, he says so but fails to provide any evidence that this is the case. © 2015-16 Dr. Felipe Orihuela-Espina 129 Deep learning To know more: [BengioY2009] Bengio, Y. (2009) "Learning deep architectures for AI" Foundations and Trends in Machine Learning, 2(1):1-127 [BengioY2013] Bengio, Y.; Courville, A.; Vincent, P. (2013) "Representation learning: a review and new perspectives" IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828 [DavisRA2001] Davis, R.A. (2001) Gaussian Processes, Encyclopedia of Environmetrics, Section on Stochastic Modeling and Environmental Change, (D. Brillinger, Editor), Wiley, New York [HintonGE2006] Hinton, Geoffrey E.; Osindero, Simon; Teh, Yee-Whye (2006) "A Fast Learning Algorithm for Deep Belief Nets" Neural Computation 18:1527–1554 [LeCunY2006] LeCun, Yann; Chopra, Sumit; Hadsell, Raia; Ranzato, Marc Aurelio; Huang, Fu Jie (2006) "A tutorial on energy-based learning" in Bakir, G.; Hofman, T.; Schölkopf, B.; Smola, A. and Taskar, B. (Eds), Predicting Structured Data, MIT Press [ResnikP2010] Resnik, Philip and Hardisty, Erick (2010) "Gibbs sampling for the uninitiated" Technical Report CS-TR4956, Institute for Advanced Computer Studies, University of Maryland, 23 pp. [SalakhutdinovR2008_ICML] Salakhutdinov, Ruslan and Murray, Iain (2008) "On the Quantitative Analysis of Deep Belief Networks" 25th International Conference on Machine Learning (ICML), Helsinki, Finland [SalakhutdinovR2009_AISTATS] Salakhutdinov, Ruslan and Hinton, Geoffrey (2009) "Deep Boltzmann Machines" 12th International Conference on Artificial Intelligence and Statistics (AISTATS), Clearwater Beach, Florida, USA, pgs. 448-455 [SalakhutdinovR2013_TPAMI] Salakhutdinov, Ruslan; Tenenbaum, Joshua B. and Torralba, Antonio (2013) "Learning with hierarchical-deep models" IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1958-1971 [SerreT2007] Serre, T.; Kreiman, G.; Kouh, M.; Cadieu, C.; Knoblich, U.; Poggio, T.
(2007) "A quantitative theory of immediate visual recognition" Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165:33-56 [TehYW2010] Y.W. Teh. (2010) Dirichlet process. Encyclopedia of Machine Learning. Springer © 2015-16 Dr. Felipe Orihuela-Espina 130 KNOWLEDGE REPRESENTATION AND DISCOVERY © 2015-16 Dr. Felipe Orihuela-Espina 131 Knowledge representation “Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the knowledge discovery system.” [FayyadU1996] © 2015-16 Dr. Felipe Orihuela-Espina 132 Knowledge generation To arrive to knowledge from experimentation 3 steps are taken: Data harvesting: Involving all observational and interventional experimentation tasks to acquire data Data acquisition: experimental design, evaluation metrics, capturing raw data Data reconstruction: Translates raw data into domain data. Inverts the data formation process. E.g.: If you captured your data with a certain sensor and the sensor throws electric voltages as output, then reconstruction involves converting those voltages into a meaningful domain variable. E.g.: Image reconstruction Data analysis: From domain data to domain knowledge When big data is involved, it is often referred to as Knowledge discovery © 2015-16 Dr. Felipe Orihuela-Espina 133 Knowledge discovery Figure from [Fayyad et al, 1996] © 2015-16 Dr. 
Felipe Orihuela-Espina 134 Data interpretation Research findings generated depend on the philosophical approach used [LopezKA2004] Assumptions drive methodological decisions Different (philosophical) approaches for data interpretation [PriestH2001, part 1, LopezKA2004; but basically philosophy in general] Interpretive (or hermeneutic) phenomenology: Systematic reflection/exploration on the phenomena as a means to grasp the absolute, logical, ontological and metaphysical spirit behind the phenomena Affected by the researcher’s bias Kind of your classical hypothesis driven interpretation approach [Felipe’s dixit] Descriptive (or eidetic) phenomenology Favours data driven over hypothesis driven research [Felipe’s dixit based upon the following] “the researcher must actively strip his or her consciousness of all prior expert knowledge as well as personal biases (Natanson, 1973). To this end, some researchers advocate that the descriptive phenomenologist not conduct a detailed literature review prior to initiating the study and not have specific research questions other than the desire to describe the lived experience of the participants in relation to the topic of study” [LopezKA2004] Important note: I do NOT understand these very well © 2015-16 Dr. Felipe Orihuela-Espina 135 Data interpretation Different (philosophical) approaches for data interpretation [PriestH2001, part 1, LopezKA2004; but basically philosophy in general] (Cont.)
Grounded theory analysis Generates theory through inductive examination of data Systematization to break down data, conceptualise it and re-arrange it in new ways Content analysis Facilitates the production of core constructs formulated from the contextual settings from which data were derived Emphasizes reproducibility (enabling others to establish similar results) Interpretation (analysis) becomes continual checking and questioning Narrative analysis Qualitative results (often from interviews) are revisited iteratively, removing words or phrases until core points are extracted. Important note: I do NOT understand these very well © 2015-16 Dr. Felipe Orihuela-Espina 136 Data analysis: more than just thinking about your statistical test… Figure source: [OrihuelaEspinaF2012, Workshop on Foundations of Biomedical Knowledge Representation] •Past: Make sense of bygone situations or explain an occurring phenomenon, establishing associational or causal relations •Present: Decision making •Future: Infer outcomes, reasoning, prediction, planning, optimization •Hypothesis driven vs data driven •Quantitative vs Qualitative •Causality (Zero level, One level, Two level) •Incorporation of domain knowledge (priors) •Algorithm: complexity (order), strategy (e.g. greedy), serial/parallel, exact real number computation •Problem complexity (NP-complete, P-hard…) •Problem size (Information representation, Regularization) •Data relations and behaviour •Validation theory: Type (Construct, Face, Convergent, Ecological, External, Internal, etc), Technique (Leave one out, cross-fold, gold standard, ground truth) •Dimensionality (Intrinsic vs Explicit) •Learning (supervised, unsupervised, reinforcement) •Comparison (metric and performance definition) •Data quality and SNR Processing Analysis Understanding •Direct (Intervention) vs Indirect (Sensing) •Sampling •Interviewing, Behavioural simulation, observational •Synthetic, Experimental, Database •Positive vs Negative/Complement © 2015-16 Dr.
Felipe Orihuela-Espina •Type: Discrete, Continuous, Categorical / Nominal, Ordinal / Ranked •Digital vs Analogue •Nature of data: Time vs Space •Deterministic vs Stochastic •Observable vs Non-observable •One way, Two way, N-way •Fundamental vs Derived 137 Brain map of data analysis NOT INTENDED TO BE COMPREHENSIVE! © 2015-16 Dr. Felipe Orihuela-Espina 139 Why KR models for biomedical engineering? GOAL: Formalizing concepts and relations common in biomedical imaging Affording more time for interpretation Advantages: favours automated data processing, automated knowledge and data integration, and semantic integration [HoehndorfR2012] The formalization of experimental knowledge is expected to make such knowledge more easily reusable to answer other scientific questions [KingRD2009] Ensures reproducibility and quality of results [OrihuelaEspinaF2010] Leaves interpretation to humans! © 2015-16 Dr. Felipe Orihuela-Espina 140 Knowledge generation can be streamlined: e.g. Automated identification of natural laws A computer program extrapolated the laws of motion from a pendulum’s swings in just over a day. This took physicists centuries to complete! Based on Symbolic Regression Symbolic regression automatically searches for both the parameters to fit an equation and the equation form simultaneously © 2015-16 Dr. Felipe Orihuela-Espina 141 Automating Science? “Computers with intelligence can design and run experiments, but learning from the results to generate subsequent experiments requires even more intelligence.” [WaltzD2009] Goals of automation in science [WaltzD2009]: increase productivity by increasing efficiency (e.g., with rapid throughput) improve quality (e.g., by reducing error) cope with scale © 2015-16 Dr. Felipe Orihuela-Espina 142 Knowledge generation can be streamlined: e.g. Robot scientist Robot scientist ADAM and researcher Prof.
King LABORS (Laboratory Ontology for Robot Scientists) ontology [KingRD2011] Formalizes Adam’s functional genomics experiments Based on EXPO (Ontology of scientific experiments) Closing the loop: ADAM can decide on what experiment to do next [WaltzD2009] Limited to hypothesis-led discovery [KingRD2009] © 2015-16 Dr. Felipe Orihuela-Espina 143 Knowledge generation can be streamlined: EXPO EXPO: Ontology of scientific experiments Defines over 200 concepts for creating semantic markup about scientific experiments OWL language EXPO formalises generic knowledge about scientific experimental design, methodology, and results representation. [SoldatovaLN2006] EXPO is available at http://expo.sourceforge.net/ © 2015-16 Dr. Felipe Orihuela-Espina 144 An overview of EXPO [KingRD2006 presentation on EXPO] © 2015-16 Dr. Felipe Orihuela-Espina 145 AN EXAMPLE OF KR WITH FNIRS © 2015-16 Dr. Felipe Orihuela-Espina 146 Challenges in KR in fNIRS experimentation How to choose? The region to interrogate? The best (most fair) analysis? [OrihuelaEspina2010_OHBM] Incl. processing, parameterization, and analysis flow How to avoid: Physiological noise / systemic effect? Artefacts (e.g. optode movement, ambient light)? How to ensure: Physiological plausibility? Integrity / validity? [OrihuelaEspina2010_PMB] Reuse of formalized experiment information? [KingRD2009] © 2015-16 Dr. Felipe Orihuela-Espina 147 Challenges in KR in fNIRS experimentation: Parameterization [OrihuelaEspinaF2010_OHBM] © 2015-16 Dr. Felipe Orihuela-Espina 148 Challenges in KR in fNIRS experimentation: Modelling light–tissue interaction: Light model → Chromophore concentration → Neurovascular coupling → Physiological model → Physiological information [Inspired from Banaji, fNIRS Conference, 2012] © 2015-16 Dr. Felipe Orihuela-Espina 149 Challenges in KR in fNIRS experimentation: Modelling Are the data validated? Do we really need a physiological model?
A model is useful only if it fulfils very high standards of predictive capability and reliability.
We learn about the phenomenon while building the model (a vicious circle).
Purposes of models:
•Explain data / highlight gaps in understanding
•Raise open questions
•Predict hard-to-measure quantities
•Develop understanding and intuition
•Prepare us for experimental data
•Challenge dogmas (may force us to ignore priors!)
[Banaji, fNIRS Conference, 2012; Banaji, JTB, 2006; Banaji, PLoS CB, 2008]

Challenges in KR in fNIRS experimentation: Modelling
What principles should we follow to build our model? How is the model going to interact with the data?
•Example of interaction 1: simulated data drive the model; the modelled data are compared against data observed from a subject/cohort
•Example of interaction 2: the model produces modelled data directly; these are compared against data observed from a subject/cohort
[Banaji, fNIRS Conference, 2012]

Challenges in KR in fNIRS experimentation
Closing the loop: from experiment design and data collection to hypothesis formation and revision, and from there to new experiments [WaltzD2009]
•Complex experiments with heterogeneous sources: different NIRS devices (HITACHI, SHIMADZU, fNIRX), but also other sources such as eye-tracking, EEG, etc.
•Accommodating different optical modalities
•Lack of a standard "final" representation format: the medical standard DICOM is not as standard as claimed, and each vendor has its own file format. SNIRF: Shared Near Infrared File Format Specification

Challenges in KR in fNIRS experimentation
Problem size:
•Information representation (relational, object oriented)
•Sample size (extrapolation, generalization, regularization, ill-posed problems, i.e.
number of observations vs number of covariates)
•Data mining and KD strategy [FayyadU1996]
Model identification and parameterization:
•Underparameterization: low flexibility to explain complex data
•Overparameterization: a spurious model can explain any data; difficulties in parameter identification
•Level of detail: model boundaries, parameters, variables, purpose

Concept map: experimentation (light model, physiological model) [OrihuelaEspinaF2010, PMB]

Taxonomy of factors in fNIRS experimentation

Experimental factors limit interpretation

INTERPRETATION GUIDELINES

Interpretation; generating knowledge
Every analysis must translate the physiological, biological, experimental, etc. concepts into a correct mathematical abstraction. Every interpretation must translate the "maths" back into real-world domain concepts.
A common mistake in many papers is to forget about understanding and only state the patterns/findings found during the analysis.

Interpretation; generating knowledge
Understanding is by far the hardest part of data analysis… and alas it is also the part where maths/stats/computing are (so far) least helpful.

Interpretation guidelines
Interpretation of results must be strictly confined to the limits imposed by the assumptions made during image formation, acquisition, reconstruction, processing and analysis.
Rule of thumb: data analysis takes at least 3 to 5 times the data collection time. If it has taken less, then your analysis is likely to be weak, coarse or careless. Example: one month collecting data – five months' worth of analysis.

Interpretation guidelines
Look at your data! Know them by heart.
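Knowing your data by heart starts with simple numeric summaries: range, quartiles and median already expose outliers and skew before any modelling. A minimal sketch, assuming a hypothetical list of channel amplitudes (values invented for illustration):

```python
import statistics

def five_number_summary(data):
    """Minimum, lower quartile, median, upper quartile and maximum of a dataset."""
    q1, median, q3 = statistics.quantiles(data, n=4)
    return {"min": min(data), "q1": q1, "median": median, "q3": q3, "max": max(data)}

# hypothetical peak amplitudes from one fNIRS channel, in arbitrary units
channel = [0.21, 0.48, 0.13, 0.92, 0.40, 0.33, 0.74, 0.58]
summary = five_number_summary(channel)
print(summary)
```

Repeating such summaries per channel, per subject and per condition is a cheap first "look at your data" before trusting any downstream statistic.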
Visualize them in as many ways as you can imagine, and then a few more.
Have a huge background. Read everything out there closely and loosely related to your topic.

Interpretation guidelines
Always try more than one analysis (convergent validity).
Quantitative analysis is often desirable, but never underestimate the power of good qualitative analysis.
All scales are necessary and complementary:
•Structural, functional, effective
•Inter-subject, intra-subject
•Neuron-level, region-level

Interpretation guidelines
The laws of physics are what they are… but research/experimentation results are not immutable. They strongly depend on the decisions made during data harvesting, data reconstruction and the three stages of the analysis process. It is the duty of the researcher to make the best decisions to arrive at the most robust outcome.
Interpretation, interpretation, interpretation… LOOK at your data!

THANKS, QUESTIONS?