CSE 6392 – Data Exploration and Analysis in Relational Databases
January 31, 2006

Example Problem
• Suppose you had the following tables:
  – Employee (Gender, Salary)
  – Employee-Sample (Gender, Salary)

Possible Queries
• Some possible queries to get the average salary of all females in the company:
  1. Select avg(salary) from Employee where gender = 'F'
  2. Select avg(salary) from Employee-Sample where gender = 'F'
  3. Select count(*) as C, sum(salary) as S, S/C from Employee-Sample where gender = 'F'
• Is there a difference between 2 and 3 in terms of results? No: avg(salary) is sum(salary)/count(*) over the same rows (assuming no NULL salaries).

Estimator
• What is an estimator?
  – Ex. (count in the sample) * (population size / sample size)
  – On the previous slide, 2 and 3 are estimators for 1.
• What is an unbiased estimator?
  – Basically, an estimator that is not tilted towards the lower or higher side of the quantity being estimated
• Formally:
  – $\hat{x}$ is the estimator for some quantity $x$
  – $\hat{x}$ is an unbiased estimator if $E[\hat{x}] = x$

Unbiased Estimators
• Example
  – select count(*) as FC from Employee where gender = 'F'
  – select count(*) * (N/n) as EFC from Employee-Sample where gender = 'F'
  – Here N is the number of rows in Employee and n is the number of rows in Employee-Sample
• EFC is an unbiased estimator
• (N/n) is called the ‘ratio scale’

Unbiased Estimators (1)
• Example
  – select sum(salary) as TFS from Employee where gender = 'F'
  – select sum(salary) * (N/n) as ETFS from Employee-Sample where gender = 'F'
• ETFS is an unbiased estimator
• Note: This is important to statisticians, but secondary for our purposes; we are more concerned about the error

Unbiased Estimators (2)
• Example
  – Select avg(salary) as AFS from Employee where gender = 'F'
  – Select count(*) as C, sum(salary) as S, S/C as EAFS from Employee-Sample where gender = 'F'
• Is EAFS unbiased? Not necessarily. The ratio of two unbiased estimators is not, in general, unbiased (ratio estimation).

Probability
• Example: roll a die. How many times will you get 1, 2, 3, 4, 5 or 6?
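The die question can be explored empirically. The following is a minimal sketch, assuming a fair six-sided die simulated with Python's standard `random` module; the function name `roll_frequencies` and the fixed seed are illustrative choices, not part of the lecture.

```python
import random
from collections import Counter


def roll_frequencies(num_rolls, seed=0):
    """Roll a simulated fair die num_rolls times.

    Returns a dict mapping each face (1..6) to its relative frequency.
    The seed is fixed only so that repeated runs give the same numbers.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}


if __name__ == "__main__":
    # Each face should come up in roughly 1/6 of the rolls.
    for face, freq in roll_frequencies(60_000).items():
        print(f"face {face}: observed {freq:.3f}, expected {1 / 6:.3f}")
```

With more rolls, the observed frequencies should move closer to 1/6 for every face.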
[Figure: uniform distribution for a die; each number 1 to 6 has probability 1/6. x-axis: number on die.]

Probability Density
• What is the probability that a random number generator will generate exactly .43 (of numbers between 0 and 1)?
  – Answer: 0% (1/infinity)
• What about a number between .43 and .53?
  – Answer: 10% (1/10)
• The probability of an interval is the area under the density curve over that interval; the total area (the integral over all values) is 1.
• Any single number has a 0% probability, but an interval can have a nonzero probability.

Probability Density Function
• Proper distribution if integral = 1

Probability Example
• How many female employees (out of 50K employees)?

[Figure: normal distribution of the number produced by the sampling process. x-axis: 0 to 50K, with 9K, 10K and 11K marked around the actual value. y-axis: probability that the sampling process will give this number.]

Probability Sample
• If we sampled another company where the actual number of females is 5K, the variance would decrease.

Relative Error
• In Approximate Query Processing, people use absolute error statistically, but relative error practically.
• $\text{relative error}^2 = \dfrac{(ETFC - TFC)^2}{TFC^2}$, where ETFC is the estimate of the true value TFC

Central Limit Theorem
• The main point of this theorem is that it does not matter how the data was originally distributed: for a large enough sample, the distribution of the sample mean (or sum) will be approximately normal.
• Normal distribution: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
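The sampling behaviour described above can be sketched in a small simulation. This is an illustration under stated assumptions, not the lecture's own code: the population of 50,000 employees with 10,000 females, the sample size of 1,000, the number of trials, and all function names are hypothetical. Each trial draws a uniform random sample, scales the sample count by N/n, and the spread of the resulting estimates is then summarized.

```python
import random
import statistics


def estimate_female_count(population, n, rng):
    """Estimate the number of females as (sample count) * (N / n)."""
    sample = rng.sample(population, n)  # uniform sample without replacement
    sample_count = sum(1 for gender in sample if gender == "F")
    return sample_count * (len(population) / n)


def relative_error(estimate, truth):
    """Relative error |estimate - truth| / truth (square root of the squared form)."""
    return abs(estimate - truth) / truth


def run_trials(num_trials=2000, N=50_000, true_females=10_000, n=1_000, seed=0):
    """Repeat the sampling process and return one estimate per trial.

    All default values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    population = ["F"] * true_females + ["M"] * (N - true_females)
    return [estimate_female_count(population, n, rng) for _ in range(num_trials)]


if __name__ == "__main__":
    truth = 10_000
    estimates = run_trials(true_females=truth)
    mean_est = statistics.mean(estimates)
    std_est = statistics.pstdev(estimates)
    within_one_std = sum(abs(e - mean_est) <= std_est for e in estimates) / len(estimates)
    mean_rel_err = statistics.mean(relative_error(e, truth) for e in estimates)

    # Unbiasedness: the average estimate should be close to the true count.
    print(f"mean estimate: {mean_est:.0f} (true value {truth})")
    # Normal shape: roughly 68% of estimates should fall within one standard deviation.
    print(f"std of estimates: {std_est:.0f}, fraction within one std: {within_one_std:.2f}")
    print(f"mean relative error: {mean_rel_err:.3f}")
```

Rerunning with `true_females=5_000` gives a way to compare the spread of the estimates against the 10K case.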