DATA ANALYSIS I
Data: Probabilistic View

Sources
• Leskovec, J., Rajaraman, A., Ullman, J. D. (2014). Mining of Massive Datasets. Cambridge University Press. [5-7]
• Zaki, M. J., Meira Jr., W. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press. [13-25]

Assumption: The Data Is Random
• Suppose you have a certain amount of data, and you look for events of a certain type within that data.
• You can expect events of this type to occur even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows.
• This essentially means that an algorithm or method you think is useful for finding the events will also return false positives: occurrences that look like the events you want but arose purely by chance.

Example
• Suppose there are believed to be some "evil-doers" out there, and we want to detect them.
• We have reason to believe that periodically, evil-doers gather at a hotel to plot their evil.
– There are one billion people who might be evil-doers.
– Everyone goes to a hotel one day in 100.
– A hotel holds 100 people, so there are 100,000 hotels.
– We shall examine hotel records for 1000 days.
• There are about 250,000 pairs of people who look like evil-doers (two people at the same hotel on two different days), even though they are not: the chance that two given people are at the same hotel on a given day is 10^-9, and with about 5 × 10^17 pairs of people and 5 × 10^5 pairs of days, the expected number of chance coincidences is 5 × 10^17 · 5 × 10^5 · 10^-18 = 250,000.

Bonferroni's Principle
• Calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random.
• If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.
• Bonferroni's principle says that we may only detect the expected events by looking for events that are so rare that they are unlikely to occur in random data.

Random Variable
• The probabilistic view of the data assumes that each numeric attribute X is a random variable, defined as a function that assigns a real number to each outcome of an experiment.
• Formally, X is a function X: O → R, where O, the domain of X, is the set of all possible outcomes of the experiment, and R, the range of X, is the set of real numbers.
• If the outcomes are numeric and represent the observed values of the random variable, then X: O → O is simply the identity function: X(v) = v for all v ∈ O.

Discrete vs. Continuous
• Discrete random variable: the variable can take on only a finite or countably infinite number of values in its range.
• Continuous random variable: the variable can take on any value in its range.

Iris Dataset: Sepal Length

Random Variable: Sepal Length
• All n = 150 values of this attribute lie in the range [4.3, 7.9], with centimeters as the unit of measurement.
• Let us assume that these constitute the set of all possible outcomes O.
• We can consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v.

Probability Mass Function
• If X is discrete, the probability mass function of X is defined as
f(x) = P(X = x) for all x ∈ R.
• The function f gives the probability P(X = x) that the random variable X has the exact value x.

Bernoulli Distribution
• Short and long sepal lengths? We can define a discrete random variable A as follows:
A(v) = 0 if v < 7, and A(v) = 1 if v ≥ 7.
• Probabilities:
– f(1) = P(A = 1) = 13/150 ≈ 0.087 = p
– f(0) = P(A = 0) = 137/150 ≈ 0.913 = 1 − p

Binomial Distribution
• A Bernoulli trial is a random experiment with exactly two possible outcomes, "success" and "failure", in which the probability of success is the same every time the experiment is conducted.
• Let us consider another discrete random variable B, denoting the number of Irises with long sepal length in m independent Bernoulli trials with probability of success p. B has the binomial distribution
f(k) = P(B = k) = C(m, k) · p^k · (1 − p)^(m−k).
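The Bernoulli and binomial setup above can be sketched in plain Python (a sketch, not from the sources; binomial_pmf is a hypothetical helper, and p = 13/150 is the success probability from the Bernoulli slide):

```python
from math import comb

# Bernoulli parameter: fraction of the 150 Iris flowers with sepal length >= 7 cm
p = 13 / 150  # ~ 0.087

def binomial_pmf(k: int, m: int, p: float) -> float:
    """P(B = k): probability of exactly k successes in m Bernoulli trials."""
    return comb(m, k) * p**k * (1 - p)**(m - k)

# Probabilities of observing 0, 1, 2 long-sepal Irises in m = 10 trials
probs = [binomial_pmf(k, 10, p) for k in range(3)]
```

Because f is a probability mass function, summing binomial_pmf(k, 10, p) over all k from 0 to 10 gives 1.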
Example
• The probability of observing exactly k = 2 Irises with long sepal length in m = 10 trials is given as
P(B = 2) = C(10, 2) · (0.087)^2 · (0.913)^8 ≈ 0.164.

Probability Density Function
• If X is continuous, the probability density function f specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R:
P(X ∈ [a, b]) = ∫_a^b f(x) dx.
• Probability mass is spread so thinly over the range of values that it can be measured only over intervals [a, b] ⊂ R, not at individual points.
• The density function f must satisfy the basic laws of probability:
f(x) ≥ 0 for all x ∈ R, and ∫_{−∞}^{∞} f(x) dx = 1.

Sepal Length: Distribution?

Normal Distribution
• A random variable X has a normal distribution, with the parameters mean μ and variance σ^2, if the probability density function of X is given as follows:
f(x) = 1 / √(2πσ^2) · exp(−(x − μ)^2 / (2σ^2)).

Sepal Length [μ = 5.84, σ^2 = 0.681]

Cumulative Distribution Function
• For any random variable X, whether discrete or continuous, we can define the cumulative distribution function F: R → [0, 1], which gives the probability of observing a value at most some given value x:
F(x) = P(X ≤ x) for all −∞ < x < ∞.

Discrete vs. Continuous
• When X is discrete: F(x) = Σ_{u ≤ x} f(u)
• When X is continuous: F(x) = ∫_{−∞}^{x} f(u) du

Cumulative Distribution Function [Normal Distribution]

Bivariate Random Variables
• Instead of considering each attribute as a random variable, we can also perform pair-wise analysis by considering a pair of attributes, X1 and X2, as a bivariate random variable X = (X1, X2)^T.
• Joint probability mass function (discrete random variables):
f(x1, x2) = P(X1 = x1, X2 = x2).

Bivariate Distributions
• Consider the sepal length and sepal width attributes in the Iris dataset. Let A denote the random variable corresponding to long sepal length (at least 7 cm) and B the random variable corresponding to long sepal width (at least 3.5 cm).
• Let X = (A, B)^T be a discrete bivariate random variable.
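The joint PMF of X = (A, B) can be estimated empirically by counting outcomes; a minimal sketch, assuming a few hypothetical (sepal length, sepal width) pairs in place of the real Iris data:

```python
from collections import Counter

# Hypothetical (sepal_length, sepal_width) pairs standing in for the Iris data
samples = [(7.1, 3.6), (6.4, 3.1), (5.0, 3.4), (7.7, 3.8), (6.9, 3.1), (5.8, 4.0)]

def joint_pmf(samples):
    """Empirical joint PMF of X = (A, B): A = [length >= 7], B = [width >= 3.5]."""
    n = len(samples)
    counts = Counter((int(l >= 7.0), int(w >= 3.5)) for l, w in samples)
    return {x: c / n for x, c in counts.items()}

pmf = joint_pmf(samples)  # e.g. pmf[(1, 1)] = fraction of long-and-wide flowers
```

The same counting applied to all 150 Iris rows would give the joint PMF table on the next slide; as with any PMF, the estimated probabilities sum to 1.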
Probability Mass Function
• Joint probability mass function of X1 (long sepal length) and X2 (long sepal width).

Density Function
• Joint probability density function (continuous random variables): f(x1, x2), with
P(X ∈ W) = ∫∫_W f(x1, x2) dx1 dx2 for any region W ⊂ R^2.

Bivariate Normal Density

Multivariate Random Variables
• A d-dimensional multivariate random variable (vector random variable) X = (X1, X2, . . . , Xd)^T is defined as a function that assigns a vector of real numbers to each outcome in the sample space.
• Joint probability mass function (discrete random variables):
f(x) = P(X1 = x1, X2 = x2, . . . , Xd = xd).
• Joint probability density function (continuous random variables): f(x1, . . . , xd), with
P(X ∈ W) = ∫ · · · ∫_W f(x1, . . . , xd) dx1 · · · dxd for any region W ⊂ R^d.
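As a small illustration of the vector view, the rows of a numeric dataset can be treated as outcomes of a d-dimensional random vector and summarized component-wise; a sketch with hypothetical Iris-style rows (not the actual dataset):

```python
# Hypothetical Iris-style rows: (sepal length, sepal width, petal length, petal width);
# each row is one outcome of the random vector X = (X1, X2, X3, X4)^T.
rows = [
    (5.1, 3.5, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.3, 3.3, 6.0, 2.5),
]

def mean_vector(rows):
    """Component-wise sample mean of a d-dimensional random vector."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

mu_hat = mean_vector(rows)  # estimate of the mean vector (mu_1, ..., mu_d)
```

On the full Iris data the first component of mu_hat would be the sample mean 5.84 quoted for sepal length earlier in these slides.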