Download [Powerpoints] - StatisticalQueries

High Performance Statistical Queries

Sponsors

Agenda
• Introduction
• Descriptive Statistics
• Linear dependencies
  • Continuous variables
  • Discrete variables
  • Discrete and continuous variables
• Definite Integration
• Moving Averages

Why Statistical Queries?
• SQL Server has only limited built-in support for statistics
• Statistics are useful in many cases
  • Ad-hoc analysis
  • Advanced reporting
  • Data overview
• Show the best of the T-SQL improvements in SQL Server 2012

Big Role: Data Scientist
• "Data Scientist: The Sexiest Job of the 21st Century" (Harvard Business Review)
• A high-ranking professional with the training and curiosity to make discoveries in the world of structured and big data
• A solid foundation, typically in computer science and applications, modeling, statistics, analytics and math
• A data scientist explores and examines data from multiple disparate sources
• Exploring, asking questions, doing what-if analysis, questioning existing assumptions and processes
• In short: programming and statistics!

Performance
• Main issue: complex calculations that need other statistics
  • E.g., StDev uses Avg in its formula
• Goal: calculate everything with a minimal number of passes through the data
• Additionally improve performance with:
  • (Covering) nonclustered indexes
  • Columnstore index
  • Sampling!

Minimizing Passes through the Data
• How to achieve the minimal number of passes through the data?
  • With the new SQL 2012 window functions
  • Rearrange formulas: use mathematical knowledge
  • Use creativity

Frequencies
• Frequency tables and graphs are the basic representation of discrete variables
• Value, absolute frequency, absolute percentage, cumulative frequency, cumulative percent and histogram

Cars  AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
0     4238     4238     23       23       ***********************
1     4883     9121     26       49       **************************
2     6457     15578    35       84       ***********************************
3     1645     17223    9        93       *********
4     1261     18484    7        100      *******

Solution
• Pre-SQL 2012: calculate absolute numbers and then use a non-equi self-join for cumulative (running) numbers
• SQL 2012: calculate absolute numbers and then use aggregate functions with framing and order
• SQL 2012 with creativity: use window analytic functions
  • PERCENT_RANK calculates the relative rank of a row within a group of rows, in percent
  • CUME_DIST calculates the cumulative distribution of a value
  • CUME_DIST minus PERCENT_RANK for the last value in a group equals the absolute percent of the value

Centers
• Center of a distribution
• The mode is the most common value in the distribution
• The median is the value that splits the distribution into two halves
• The arithmetic mean, or the average, is the most common measure for the center of the distribution
• Comparing mode, median and mean gives information about the skewness

Solution
• For mode and mean, use standard aggregate functions
  • For mode, also use TOP 1 WITH TIES
• Many solutions for median
  • SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  • Note: faster solutions exist

Spread
• Range = maximal value - minimal value
• Inter-Quartile Range (IQR) = upper quartile - lower quartile
• Degrees of freedom: only (n - 1) pieces of information help us calculate the spread
• Variance (Var) = (1 / (n - 1)) * SUM((Xi - Mean(X))^2)
• If the sample (number of cases n) is big, we can use n instead of n - 1 (variance for the population, VarP)
• Standard Deviation (StDev) = SQRT(Var)
• Relative Standard Deviation, or the Coefficient of Variation (CV) = StDev / Mean

Solution
• For range, variance and standard deviation use standard aggregate functions
• Many solutions for IQR
  • SQL 2012: use the PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  • Note: faster solutions exist

Skewness and Kurtosis
• Skewness describes asymmetry in a random variable's probability distribution

$$\mathrm{Skew} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3$$

• Kurtosis characterizes the relative peakedness or flatness of a distribution

$$\mathrm{Kurt} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

Solution
• Creativity: expand the subtraction of the mean from the current value to the 3rd and 4th power:

$$(x - \mu)^3 = x^3 - 3x^2\mu + 3x\mu^2 - \mu^3$$
$$(x - \mu)^4 = x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4$$

• Mathematics: multiplication distributes over addition, so constant factors can be moved outside the sum:

$$3x_1\mu^2 + 3x_2\mu^2 = 3\mu^2 (x_1 + x_2)$$
$$\sum_{i=1}^{n} (3x_i\mu^2) = 3\mu^2 \sum_{i=1}^{n} x_i$$

• CLR aggregate functions can use the same algorithm

Linear Dependencies: Continuous Variables
• The deviation of the actual from the expected probabilities is the Covariance: CoVar(X,Y) = SUM((Xi - Mean(X)) * (Yi - Mean(Y)) * P(Xi,Yi))
• Divide the covariance by the product of the standard deviations of both variables to get the Correlation Coefficient: Correl = CoVar(X,Y) / (StDev(X) * StDev(Y))
• The squared correlation coefficient is the Coefficient of Determination: CD = SQUARE(Correl)

Solution
• SQL 2012: use window aggregate functions

Contingency Tables
• Contingency tables do not rely on numeric values
• The Null Hypothesis: there is no relationship between row and column frequencies
• So there should be no difference between observed (O) and expected (E) frequencies

Observed
Gender  Married  Single  Total
F       4745     4388    9133
M       5266     4085    9351
Total   10011    8473    18484

Expected
Gender  Married  Single  Total
F       4946     4187    9133
M       5065     4286    9351
Total   10011    8473    18484

Linear Dependencies: Discrete Variables
• Chi-Squared formula:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

• For the Chi-Squared distribution there are already prepared tables with critical points for different degrees of freedom and for a specific confidence level
• Degrees of freedom = the product of the degrees of freedom for columns and rows:

$$DF = (C - 1) \cdot (R - 1)$$

Chi-Squared Critical Points
χ² value by degrees of freedom (DF, rows) and probability (columns). On the slide, the left-hand columns are labelled "Not significant" and the right-hand columns "Significant".

DF   0.95   0.90  0.80  0.70  0.50  0.30   0.20   0.10   0.05   0.01   0.001
1    0.004  0.02  0.06  0.15  0.46  1.07   1.64   2.71   3.84   6.64   10.83
2    0.10   0.21  0.45  0.71  1.39  2.41   3.22   4.60   5.99   9.21   13.82
3    0.35   0.58  1.01  1.42  2.37  3.66   4.64   6.25   7.82   11.34  16.27
4    0.71   1.06  1.65  2.20  3.36  4.88   5.99   7.78   9.49   13.28  18.47
5    1.14   1.61  2.34  3.00  4.35  6.06   7.29   9.24   11.07  15.09  20.52
6    1.63   2.20  3.07  3.83  5.35  7.23   8.56   10.64  12.59  16.81  22.46
7    2.17   2.83  3.82  4.67  6.35  8.38   9.80   12.02  14.07  18.48  24.32
8    2.73   3.49  4.59  5.53  7.34  9.52   11.03  13.36  15.51  20.09  26.12
9    3.32   4.17  5.38  6.39  8.34  10.66  12.24  14.68  16.92  21.67  27.88
10   3.94   4.86  6.18  7.27  9.34  11.78  13.44  15.99  18.31  23.21  29.59

Solution
• Problem: calculate expected frequencies
• SQL 2012 and creativity: use window aggregate functions
• Read the statistical significance from a pre-prepared table

Calculating Statistical Significance
• Why read from a pre-prepared table? Use your own table!
• Calculate the values for a distribution with a definite integral over the distribution function
• E.g., the Gaussian distribution function
• The standard normal distribution has mean 0 and StDev 1

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}$$
$$Z = \frac{X - \mu}{\sigma}$$

Solution
• Mathematics: trapezoidal formula for approximate definite integration

$$\int_a^b f(x)\,dx \approx \frac{b - a}{2} \, (f(a) + f(b))$$

• For multiple points, with h the constant distance between neighbouring points:

$$\int_a^b f(x)\,dx \approx \frac{h}{2} \left[ f(x_1) + f(x_n) + 2 \, (f(x_2) + f(x_3) + \dots + f(x_{n-1})) \right]$$

Linear Dependencies: Continuous and Discrete Variables
• ANOVA tests the variance in means between groups
• Null Hypothesis: the only variance comes from variance within and not between samples
• Mean squared deviation between a groups, with $\bar{X}_i$ denoting the group mean and $\bar{X}$ denoting the total mean

$$MS_A = \frac{SS_A}{DF_A}, \quad SS_A = \sum_{i=1}^{a} n_i (\bar{X}_i - \bar{X})^2, \quad DF_A = a - 1$$

One-Way ANOVA and F-Test
• Mean squared deviation within a groups, with $n_i$ cases in each group

$$MS_E = \frac{SS_E}{DF_E}, \quad SS_E = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2, \quad DF_E = \sum_{i=1}^{a} (n_i - 1)$$

• F ratio
  • The bigger the F ratio, the more confidently you can reject the Null Hypothesis
  • Use F tables for critical points

$$F = \frac{MS_A}{MS_E}$$

Solution (1)
• Mathematics: understand the ANOVA formula
• SQL 2012: use aggregate and ranking window functions
• Creativity:

$$DF_A = a - 1 = \mathrm{MAX}(\mathrm{DENSE\_RANK}(groups) - 1)$$
$$DF_E = \sum_{i=1}^{a} (n_i - 1) = \mathrm{COUNT}(*) - \mathrm{MAX}(\mathrm{DENSE\_RANK}(groups))$$

Solution (2)
• How to get the statistical significance of the F value?
• Not from a table, not with definite integration
• .NET function Chart.DataManipulator.Statistics.Fdistribution
• Unfortunately it lives in the System.Windows.Forms.DataVisualization.Charting namespace
• Not supported by SQL CLR
• Creativity: use a console application + SQLCMD mode

Moving Averages
• Simple moving averages (SMA)

$$S_{SMA,i} = \frac{1}{3} \sum_{j=i-1}^{i+1} v_j$$

• Weighted moving averages (WMA), with $w_j$ the weight of value $v_j$

$$S_{WMA,i} = \frac{1}{3} \sum_{j=i-2}^{i} v_j \cdot w_j$$

• Exponential moving averages (EMA)

$$S_{EMA,i} = (v_i \cdot \alpha) + (S_{EMA,i-1} \cdot \beta), \quad \alpha + \beta = 1$$

Solution (1)
• SMA: SQL 2012 aggregate window functions
• WMA: SQL 2012 aggregate and analytic window functions

Solution (2)
• EMA: SQL 2012 aggregate and analytic window functions
• EMA: creativity and mathematics: transform the EMA formula to include values only, not the previous EMA

$$S_{EMA,i} = \sum_{j=2}^{i} (\alpha \cdot \beta^{\,i-j} \cdot v_j) + \beta^{\,i-1} \cdot v_1$$

Solution (3)
• Would be much simpler with access to the interim calculations when SQL Server does window calculations
  • Not exposed yet; future versions?
• Other possibilities
  • Recursive CTE
  • Good old cursor!

Q&A
• Thank you!
• Please give us feedback: http://speakerscore.com/sqlsaturday376
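
The rearrangement on the Skewness and Kurtosis solution slide means that both statistics can be derived from the row count and the sums of x, x^2, x^3 and x^4, all of which can be gathered in one pass. The following is a minimal Python sketch of that arithmetic only, not the deck's T-SQL. The function name `skew_kurt_one_pass` is hypothetical, and the sketch assumes that the standard deviation in the slide formulas is the sample standard deviation (divisor n - 1).

```python
def skew_kurt_one_pass(values):
    """Skewness and kurtosis from power sums collected in a single pass.

    Assumption: sigma in the slide formulas is the sample standard
    deviation (divisor n - 1). The function name is illustrative only.
    """
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in values:          # the only pass over the data
        n += 1
        s1 += x
        s2 += x * x
        s3 += x ** 3
        s4 += x ** 4
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")

    mean = s1 / n
    # Sums of powers of (x - mean), obtained from the binomial expansion
    # with the constant factors moved outside the sums.
    c2 = s2 - n * mean ** 2
    c3 = s3 - 3 * mean * s2 + 3 * mean ** 2 * s1 - n * mean ** 3
    c4 = (s4 - 4 * mean * s3 + 6 * mean ** 2 * s2
          - 4 * mean ** 3 * s1 + n * mean ** 4)
    if c2 == 0:
        raise ValueError("all values are equal; spread is zero")

    var = c2 / (n - 1)        # sample variance
    skew = n / ((n - 1) * (n - 2)) * c3 / var ** 1.5
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * c4 / var ** 2
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt


# Symmetric data: skewness is 0 and kurtosis is approximately -1.2.
skew, kurt = skew_kurt_one_pass([1, 2, 3, 4, 5])
```

One caveat worth keeping in mind: power sums of large values can lose floating-point precision when they are subtracted from each other, so this approach trades some numerical robustness for the single pass.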

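The expected frequencies on the Contingency Tables slide follow from the marginal totals: each expected cell is its row total times its column total, divided by the grand total. As a rough illustration, here is a small Python sketch that applies this to the observed counts from the slide and then evaluates the Chi-Squared formula. It shows the arithmetic rather than the window-function query the deck refers to, and the function name `chi_squared` is hypothetical.

```python
# Observed counts copied from the Contingency Tables slide.
observed = {
    ("F", "Married"): 4745, ("F", "Single"): 4388,
    ("M", "Married"): 5266, ("M", "Single"): 4085,
}


def chi_squared(observed):
    """Return (expected frequencies, chi-squared value, degrees of freedom).

    Illustrative helper, not part of the original deck.
    """
    row_totals, col_totals, total = {}, {}, 0
    for (row, col), count in observed.items():
        row_totals[row] = row_totals.get(row, 0) + count
        col_totals[col] = col_totals.get(col, 0) + count
        total += count

    # Expected cell = row total * column total / grand total.
    expected = {
        (row, col): row_totals[row] * col_totals[col] / total
        for (row, col) in observed
    }
    chi2 = sum(
        (count - expected[cell]) ** 2 / expected[cell]
        for cell, count in observed.items()
    )
    df = (len(col_totals) - 1) * (len(row_totals) - 1)
    return expected, chi2, df


expected, chi2, df = chi_squared(observed)
```

Rounded to whole numbers, the expected values reproduce the slide's Expected table. With DF = 1 the statistic comes out at about 35.4, well above the 10.83 critical point listed for probability 0.001, so the null hypothesis of no relationship would be rejected at that level.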