Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CS1512 Foundations of Computing Science 2 (Theoretical part) Kees van Deemter Probability and statistics Propositional Logic Elementary set theory 1 © J R W Hunter, 2006; C J van Deemter 2007 What this is going to be about 1. Suppose you know that the statement p is true and that the statement q is true. What can you say about the statement that p and q ? 2 What this is going to be about 1. Suppose you know that the statement p is true and that the statement q is true. What can you say about the statement that p and q ? In this case, you know that p and q is also true. 2. Suppose you know that the statement p has a probability of .5 and the statement q has a probability of .5. What can you say about the statement that p and q ? 3 What this is going to be about 1. Suppose you know that the statement p is true and the statement q is true. What can you say about the statement that p and q ? In this case, p and q is also true. 2. Suppose you know that the statement p has a probability of .5 and the statement q has a probability of .5. What can you say about the statement that p and q ? It depends! If p and q are independent of each other then you know that p and q has a probability of .25 But suppose (1) p = It will snow (some time) tomorrow and q = It will be below zero (some time) tomorrow Then p and q has a probability >.25 4 What this is going to be about 1. Suppose you know that the statement p is true and the statement q is true. What can you say about the statement that p and q ? In this case, p and q is also true. 2. Suppose you know that the statement p has a probability of .5 and the statement q has a probability of .5. What can you say about the statement that p and q ? It depends! If p and q are independent of each other then you know that p and q has a probability of .25 But suppose (2) p = It will snow (sometime) tomorrow and q = It will not snow (any time) tomorrow Then p and q has a probability 0 5 Before we get there ... Some basic concepts in statistics • different kinds of data • ways of representing data • ways of summarising data This will be useful later in your CS career. For example to • assess whether a computer simulation is accurate • assess whether one user interface is more user friendly than another • estimate the expected run time of a computer program (on typical data) 6 Lecture slides on statistics and probability are based on originals by Professor Jim Hunter. 7 CS1512 Foundations of Computing Science 2 Lecture 1 Probability and statistics (1) 8 © J R W Hunter, 2006; C J van Deemter 2007 Sources Text book (parts of chapters 1-6): Essential Statistics (Fourth Edition) D.G.Rees Chapman and Hall 2001 (Blackwells, ~£28) Courses: ST1505 Mathematical Sciences http://maths.abdn.ac.uk/~ap/st1505/ CS1012 Sets 9 Some definitions Sample space (population) • Set of entities of interest, also called elements • this set may be infinite • entities can be physical objects, events, etc. ... Sample • subset of the sample space 10 More definitions Variable • an attribute of an element which has a value (e.g., its height, weight, etc.) Observation • the value of a variable as recorded for a particular element • an element will have variables with values but they are not observations until we record it Sample data • set of observations derived from a sample 11 Descriptive and Inferential Statistics Descriptive statistics: • Summarising the sample data (as a number, graphic ...) Inferential statistics: • Using data from a sample to infer properties of the sample space • Chose a ‘representative sample’ (properties of sample match those of sample space – difficult) • In practice, use a ‘random sample’ (each element has the same likelihood of being chosen) 12 Variable types Qualitative: • Nominal/Categorical (no ordering in values) e.g. sex, occupation • Ordinal (ranked) e.g. class of degree (1, 2.1, 2.2,...) Quantitative: • Discrete (countable) – [integer] e.g. number of people in a room • Continuous – [double] e.g. height 13 Examples 1. 2. 3. 4. A person’s marital status The length of a CD The size of a litter of piglets The temperature in degrees centigrade 14 Examples 1. A person’s marital status Nominal/categorical 2. The length of a CD Quantitative; continuous or discrete? This depends on how you model length (minutes or bits) 3. The size of a litter of piglets Quantitative, discrete (if we mean the number of pigs) 4. The temperature in degrees centigrade Quantitative, continuous (Even though it does not make sense to say that 200 is twice as warm as 100) Footnote: We us the term `Continuous` a bit loosely: For us a variable is continuous/dense (as opposed to discrete) if between any values x and y, there lies a third value z. 15 Summarising data Categorical (one variable): • X is a categorical variable with values: a1, a2, a3, ... ak, ... aK (k = 1, 2, 3, ... K) • fk = number of times that ak appears in the sample fk is the frequency of ak • if we have n observations then: relative frequency = frequency / n • percentage relative frequency = relative frequency 100 16 Frequency sample of 572 patients (n = 572) Blood Type A AB B O Totals Frequency 210 35 93 234 572 Relative Frequency 0.37 0.06 0.16 0.41 1.00 Percentage RF 37% 6% 16% 41% 100% sum of frequencies = n 17 Bar Chart 18 Summarising data Categorical (two variables): • contingency table • number of patients with blood type A who are female is 108 Blood Type Sex Totals male female A AB B O 102 12 46 120 108 23 47 114 210 35 93 234 Totals 280 292 572 19 Summarising data Categorical (two variables): • contingency table • number of patients with blood type A who are female is 108 Blood Type Sex Totals male female A AB B O 102 12 46 120 108 23 47 114 210 35 93 234 Totals 280 292 572 % Blood Type by sex male female 49% 34% 50% 51% 51% 66% 50% 49% 20 Bar Chart 21 Ordinal data • X is an ordinal variable with values: a1, a2, a3, ... ak, ... aK • ‘ordinal’ means that: a1 ≤ a2 ≤ a3 ≤ ... ≤ ak ≤ ... ≤ aK • cumulative frequency at level k: ck = sum of frequencies of values less than or equal to ak ck = f1 + f2 + f3 + ... + fk = (f1 + f2 + f3 + ... + fk-1 ) + fk = ck-1 + fk • Can be applied to quantitative data as well ... 22 Cumulative frequencies Number of piglets in a litter: (discrete data) c1=f1=1, c2=f1+f2=1, c3=f1+f2+f3=3, c4=f1+f2+f3+f4=6, etc. cK = n Litter size Frequency=f 5 6 7 8 9 10 11 12 13 14 Total 1 0 2 3 3 9 8 5 3 2 Cum. Freq =c 1 1 3 6 9 18 26 31 34 36 36 23 Plotting frequency cumulative frequency 24 Continuous data • A way to obtain discrete numbers from continuous data: Divide range of observations into non-overlapping intervals (bins) • Count number of observations in each bin • Enzyme concentration data in 30 observations: 121 95 85 119 62 Range: 25 to 201 25 81 145 57 104 83 123 100 64 139 110 67 70 151 201 60 113 93 48 68 101 78 118 92 95 For example, you can use 10 bins of width 20: 25 Enzyme concentrations Concentration 19.5 ≤ c < 39.5 39.5 ≤ c < 59.5 59.5 ≤ c < 79.5 79.5 ≤ c < 99.5 99.5 ≤ c < 119.5 119.5 ≤ c < 139.5 139.5 ≤ c < 159.5 159.5 ≤ c < 179.5 179.5 ≤ c < 199.5 199.5 ≤ c < 219.5 Freq. 1 2 7 7 7 3 2 0 0 1 Totals 30 Rel.Freq. 0.033 0.067 0.233 0.233 0.233 0.100 0.067 0.000 0.000 0.033 % Cum. Rel. Freq. 3.3% 10.0% 33.3% 56.6% 79.9% 89.9% 96.6% 96.6% 96.6% 100.0% 1.000 26 Cumulative histogram 27