Fundamentals of Data Analysis
Lecture 3
Basics of statistics

Program for today
Basic terms and definitions
Discrete distributions
Continuous distributions
Normal distribution
Other distributions

Topics for discussion
What are the applications of statistics in modern physics?
How important is drawing conclusions based on statistical analysis?

What is statistics?
Definition of statistics:
1. A collection of quantitative data pertaining to a subject or group. Examples are blood pressure statistics, etc.
2. The science that deals with the collection, tabulation, analysis, interpretation, and presentation of quantitative data.

Two phases of statistics:
Descriptive statistics:
o Describes the characteristics of a product or process using information collected on it.
Inferential (inductive) statistics:
o Draws conclusions on unknown process parameters based on information contained in a sample.
o Uses probability.

Probability
When we cannot rely on the assumption that all sample points are equally likely, we have to determine the probability of an event experimentally. We perform a large number of experiments N and count how often each of the sample points is obtained. The ratio of the number of occurrences of a certain sample point to the total number of experiments is called the relative frequency.

The probability is then assigned the relative frequency of the occurrence of a sample point in this long series of repetitions of the experiment. This is based on the "law of large numbers", which says that the relative frequency approaches the true (theoretical) probability of the outcome if the experiment is repeated over and over again.

$$P(E) = \frac{n(E)}{N}$$

where n(E) is the number of times the event E took place out of a total of N experiments. From this definition we can see that the probability is a number between 0 and 1.
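The relative-frequency definition can be tried out with a short simulation. The sketch below is only an illustration: the fair six-sided die, the event "a six is rolled", and the helper names `relative_frequency`, `roll_die`, and `is_six` are assumptions made for this example and are not part of the lecture.

```python
import random


def relative_frequency(event, trial, n_experiments, seed=0):
    """Return n(E)/N: the fraction of n_experiments trials in which event occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_experiments) if event(trial(rng)))
    return hits / n_experiments


def roll_die(rng):
    """One experiment: roll a fair six-sided die (assumed example)."""
    return rng.randint(1, 6)


def is_six(outcome):
    """The event E: the die shows a six."""
    return outcome == 6


if __name__ == "__main__":
    # The relative frequency should settle near 1/6 as N grows.
    for n in (100, 10_000, 100_000):
        print(n, relative_frequency(is_six, roll_die, n))
```

For small N the relative frequency can be far from 1/6; for large N it should stay close to it, which is the behaviour the law of large numbers describes.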
When the probability is 1, we know that a particular outcome is certain.

For a discrete random variable, the definition of probability is intuitive:

$$P(x) = \frac{n(x)}{N}$$

where n(x) is the number of occurrences of the desired value of the random variable x (successes) in N samples ($N \to \infty$).

For a continuous random variable, this definition requires the identification of a small range of variation Δx ($\Delta x \to 0$), for which the probability is determined:

$$P(x_0 \le x \le x_0 + \Delta x) = \frac{n(x_0 \le x \le x_0 + \Delta x)}{N}$$

For a continuous random variable it is preferable to use the probability density function f(x):

$$P(x_0 \le x \le x_0 + \Delta x) = f(x_0)\,\Delta x$$

Histogram
The histogram is the most important graphical tool for exploring the shape of data distributions, and a good way to visualize trends in population data. The more often a particular value occurs, the taller the corresponding bar on the histogram.

Constructing a histogram
Step 1: Find the range of the distribution (largest value minus smallest value).
Step 2: Choose the number of classes, 5 to 20.
Step 3: Determine the width of the classes, to one decimal place more than the data: class width = range / number of classes.
Step 4: Determine the class boundaries.
Step 5: Draw the frequency histogram.

Number of groups or cells
Fewer than 100 observations: 5 to 9 cells
100 to 500 observations: 8 to 17 cells
More than 500 observations: 15 to 20 cells

Analysis of histogram
Calculating the average for ungrouped data:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

and for grouped data:

$$\bar{X} = \frac{\sum_{i=1}^{h} f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_h X_h}{f_1 + f_2 + \dots + f_h}$$

Boundaries   Midpoint   Frequency   Computation
23.6-26.5    25.0         4           100
26.6-29.5    28.0        36          1008
29.6-32.5    31.0        51          1581
32.6-35.5    34.0        63          2142
35.6-38.5    37.0        58          2146
38.6-41.5    40.0        52          2080
41.6-44.5    43.0        34          1462
44.6-47.5    46.0        16           736
47.6-50.5    49.0         6           294
Total                   320         11549

The Computation column is the product of midpoint and frequency, $f_i X_i$. For this table, $\bar{X} = 11549 / 320 \approx 36.1$.

Measures of dispersion
Range
Standard deviation
Variance

The range is the simplest and easiest to calculate of the measures of dispersion.
$$R = X_{\max} - X_{\min}$$

Standard deviation of a sample:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

For a discrete random variable, the variance is defined as:

$$V(x) = \sum_{i} \left[x_i - E(x)\right]^2 P(x_i)$$

and for a continuous random variable:

$$V(x) = \int_a^b \left[x - E(x)\right]^2 f(x)\,dx$$

Parameters of a distribution
A parameter is a characteristic of a population; in other words, it describes a population.
A statistic is a characteristic of a sample. It is used to make inferences about the population parameters, which are typically unknown, and in that role it is called an estimator.

Population: the set of all items that possess a characteristic of interest.
Sample: a subset of a population.

Expected value (EV) for a discrete random variable:

$$E(x) = \sum_{i=1}^{k} x_i P(x_i)$$

and for a continuous random variable:

$$E(x) = \int_a^b x\, f(x)\,dx$$

Random numbers
   1     2     3     4     5     6     7     8     9    10
1534  7106  2836  7873  5574  7545  7590  5574  1202  7712
6128  8993  4102  2551  0330  2358  6427  7067  9325  2454
6047  8566  8644  9343  9297  6751  3500  8754  2913  1258
0806  5201  5705  7355  1448  9562  7514  9205  0402  2427
9915  8274  4525  5695  5752  9630  7172  6988  0227  4264
2882  7158  4341  3463  1178  5789  1173  0670  0820  5067
9213  1223  4388  9760  6691  6861  8214  8813  0611  3131
8410  9836  3899  3883  1253  1683  6988  9978  8026  6751
9974  2362  2103  4326  3825  9079  6187  2721  1489  4216
3402  8162  8226  0782  3364  7871  4500  5598  9424  3816
8188  6569  1492  2139  8823  6878  0613  7161  0241  3834
3825  7020  1124  7483  9155  4919  3209  5959  2364  2555
9801  8788  6338  5899  3309  0807  0968  0539  4205  8257

Normal distribution
Characteristics of the normal curve:
It is symmetrical: half the cases are on one side of the center, the other half on the other side.
The distribution is single peaked, not bimodal or multimodal.
It is also known as the Gaussian distribution.

Probability density function of N(μ, σ):

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

N(0, 1), the standard normal distribution, is a normal distribution with a mean of 0 and a standard deviation of 1.

Cumulative distribution function:

$$F(x) = P(-\infty < X \le x)$$

Example
The height of fifteen-year-old boys has a N(170, 5) distribution and the height of fifteen-year-old girls has a N(166, 4) distribution. We independently draw two samples, of 8 boys and 10 girls. What is the probability that the average height of the girls is greater than that of the boys?
X: random variable describing the average height in the sample of boys
Y: random variable describing the average height in the sample of girls

The difference X - Y of the two sample averages has a normal distribution with mean m = 170 - 166 = 4 and standard deviation

$$\sigma = \sqrt{\frac{25}{8} + \frac{16}{10}} = \sqrt{4.725} \approx 2.17$$

Thus X - Y has a N(4, 2.17) distribution, and the probability P(X - Y < 0) must be calculated:

$$P(X - Y < 0) = P\left(\frac{X - Y - m}{\sigma} < \frac{0 - 4}{2.17}\right) = P(U < -1.84)$$

where U has the standard normal distribution N(0, 1). From the table of the standard normal distribution, P(X - Y < 0) = P(U < -1.84) = 0.03288.

Exercise 1
The weight of canned food (in grams) has a N(50, 3) distribution. Cans were randomly packed in multipacks containing 9 pieces. Determine the distribution of the average weight of a can in a multipack. What is the probability that the average weight of the cans in a multipack will be:
a) not greater than 52 g
b) greater than 49 g
c) within the range 48.75 g to 51.25 g

Exponential distribution
Probability density function for the exponential distribution
Cumulative distribution function

t distribution
[Figure: Student's t distribution, f(t) versus t, for 1, 3, and 30 degrees of freedom, compared with the normal distribution]

chi-square distribution
[Figure: chi-square distribution, f(χ²) versus χ², for 1, 3, and 10 degrees of freedom, compared with the normal distribution]

F distribution

Thank you for your attention!
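The height example from the normal distribution section can be checked numerically instead of with a table. The sketch below is one possible way to do it, assuming only the Python standard library; the helper names `normal_cdf` and `prob_girls_taller` are made up for this illustration and do not come from the lecture.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of N(mean, sd), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prob_girls_taller(mu_b, sd_b, n_b, mu_g, sd_g, n_g):
    """Return (P(X - Y < 0), m, sigma) for the difference of two sample averages.

    X is the average of n_b values from N(mu_b, sd_b),
    Y is the average of n_g values from N(mu_g, sd_g), samples independent.
    """
    mean_diff = mu_b - mu_g
    sd_diff = math.sqrt(sd_b ** 2 / n_b + sd_g ** 2 / n_g)
    return normal_cdf(0.0, mean_diff, sd_diff), mean_diff, sd_diff


if __name__ == "__main__":
    # Values from the example: boys N(170, 5), n = 8; girls N(166, 4), n = 10.
    p, m, s = prob_girls_taller(170, 5, 8, 166, 4, 10)
    print(f"m = {m}, sigma = {s:.3f}, P(X - Y < 0) = {p:.4f}")
```

The result should agree with the table value to about three decimal places. Small differences come from rounding the standardized value to -1.84 before the table lookup.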