High Performance Statistical Queries
Sponsors
Agenda
• Introduction
• Descriptive Statistics
• Linear dependencies
  – Continuous variables
  – Discrete variables
  – Discrete and continuous variables
• Definite Integration
• Moving Averages
Why Statistical Queries?
• SQL Server has only limited built-in support for statistics
• Statistics are useful in many cases
  – Ad-hoc analysis
  – Advanced reporting
  – Data overview
• Show the best of the T-SQL improvements in SQL Server 2012
Big Role: Data Scientist
• Data Scientist: The Sexiest Job of the 21st Century (Harvard Business Review)
• A high-ranking professional with the training and curiosity to make discoveries in the world of structured and big data
• A solid foundation, typically in computer science and applications, modeling, statistics, analytics and math
• A data scientist explores and examines data from multiple disparate sources
• Exploring, asking questions, doing what-if analysis, questioning existing assumptions and processes
• In short: programming and statistics!
Performance
• Main issue: complex calculations that need other statistics
  – E.g., StDev uses Avg in its formula
• Goal: calculate everything with a minimal number of passes through the data
• Additionally improve performance with:
  – (Covering) nonclustered indexes
  – Columnstore index
  – Sampling!
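To make the "minimal number of passes" goal concrete, here is a minimal sketch in Python (used only to illustrate the arithmetic, not the T-SQL from the session). It assumes the sample variance formula with n - 1 in the denominator and rearranges SUM((x - mean)^2) into SUM(x^2) - n * mean^2, so that the mean and the standard deviation come out of one scan. The function name `one_pass_stats` is hypothetical.

```python
import math

def one_pass_stats(values):
    """Return (mean, sample variance, sample standard deviation) from one scan.

    Assumes at least two values. Only three running totals are kept,
    so the data is read exactly once.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    mean = total / n
    # SUM((x - mean)^2) rearranged to SUM(x^2) - n * mean^2
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance, math.sqrt(variance)

mean, var, stdev = one_pass_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(mean, var, stdev)
```

The rearranged form can lose precision when the mean is very large compared to the spread, so treat it as an illustration of the idea rather than a numerically robust routine.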
Minimizing Passes through the Data
• How to achieve the minimal number of passes through the data?
  – With the new SQL Server 2012 window functions
  – Rearrange formulas: use mathematical knowledge
  – Use creativity
Frequencies
• Frequency tables and graphs are the basic representation of discrete variables
• Value, absolute frequency, absolute percentage, cumulative frequency, cumulative percent and histogram
Cars  AbsFreq  CumFreq  AbsPerc  CumPerc  Histogram
0     4238     4238     23       23       ***********************
1     4883     9121     26       49       **************************
2     6457     15578    35       84       ***********************************
3     1645     17223    9        93       *********
4     1261     18484    7        100      *******
Solution
• Pre-SQL 2012: calculate absolute numbers and then use a non-equi self-join for cumulative (running) numbers
• SQL 2012: calculate absolute numbers and then use aggregate functions with framing and order
• SQL 2012 with creativity: use window analytic functions
  – PERCENT_RANK calculates the relative rank of a row within a group of rows in percent
  – CUME_DIST calculates the cumulative distribution of a value
  – CUME_DIST – PERCENT_RANK for the last value in a group equals (approximately, for large row counts) the absolute percent of the value
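The frequency table can be reproduced from the absolute frequencies alone. The sketch below uses Python only to check the arithmetic; it is not the session's T-SQL. It assumes the usual definitions CUME_DIST = (rows with value <= v) / n and PERCENT_RANK = (rank - 1) / (n - 1), and the name `frequency_table` is hypothetical. The counts are the ones from the Cars frequency table.

```python
from itertools import accumulate

# Absolute frequencies per number of cars, taken from the frequency table.
abs_freq = {0: 4238, 1: 4883, 2: 6457, 3: 1645, 4: 1261}

def frequency_table(freq):
    """Build running totals and percentages from absolute frequencies."""
    values = sorted(freq)
    n = sum(freq.values())
    running = list(accumulate(freq[v] for v in values))
    rows = []
    for v, cum in zip(values, running):
        below = cum - freq[v]            # rows with a smaller value
        cume_dist = cum / n              # share of rows <= v
        percent_rank = below / (n - 1)   # (rank - 1) / (n - 1)
        rows.append({
            "value": v,
            "abs_freq": freq[v],
            "cum_freq": cum,
            "abs_perc": round(100 * freq[v] / n),
            "cum_perc": round(100 * cum / n),
            "cume_minus_prank": cume_dist - percent_rank,
        })
    return rows

for row in frequency_table(abs_freq):
    print(row)
```

Because PERCENT_RANK divides by n - 1 rather than n, the difference CUME_DIST – PERCENT_RANK is very close to, but not exactly, the absolute share of each value.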
Centers
• Center of a distribution
  – The mode is the most common value in the distribution
  – The median is the value that splits the distribution into two halves
  – The arithmetic mean or the average is the most common measure for the center of the distribution
• Comparing mode, median and mean gives info about the skewness
Solution
• For mode and mean, use standard aggregate functions
  – For mode, also use TOP 1 WITH TIES
• Many solutions for median
  – SQL 2012: use PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  – Note: faster solutions exist
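As a rough illustration of what the two percentile functions return, the Python sketch below mimics their semantics as I understand them: the continuous variant interpolates linearly between neighbouring rows, and the discrete variant returns the smallest actual value whose cumulative distribution reaches the requested fraction. The helper names `percentile_cont`, `percentile_disc` and `modes` are hypothetical and are not calls into SQL Server.

```python
import math
from collections import Counter

def percentile_cont(values, p):
    """Interpolated percentile: may return a value not present in the data."""
    data = sorted(values)
    pos = p * (len(data) - 1)            # zero-based fractional row position
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)

def percentile_disc(values, p):
    """Discrete percentile: smallest value whose cumulative share is >= p."""
    data = sorted(values)
    k = max(math.ceil(p * len(data)), 1)  # one-based row number
    return data[k - 1]

def modes(values):
    """All most frequent values, similar in spirit to TOP 1 WITH TIES."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

sample = [1, 2, 3, 4]
print(percentile_cont(sample, 0.5), percentile_disc(sample, 0.5))
print(modes([1, 2, 2, 3, 3]))
```

The same `percentile_cont` helper gives an inter-quartile range as the difference between the 0.75 and 0.25 results.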
Spread
• Range = maximal – minimal value
• Inter-Quartile Range (IQR) = upper quartile – lower quartile
• Degrees of freedom: only (n-1) pieces of information help us calculate the spread
• Variance (Var) = (1 / (n - 1)) * SUM((Xi – Mean(X))^2)
  – If the sample (n of cases) is big, then we can use n instead of n-1 (variance for the population – VarP)
• Standard Deviation (StDev) = SQRT(Var)
• Relative Standard Deviation or the Coefficient of Variation (CV) = StDev / Mean
Solution
• For range, variance and standard deviation use standard aggregate functions
• Many solutions for IQR
  – SQL 2012: use PERCENTILE_CONT or PERCENTILE_DISC window analytic functions with the DISTINCT operator
  – Note: faster solutions exist
Skewness and Kurtosis
• Skewness describes asymmetry in a random variable's probability distribution
$$\mathrm{Skew} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^3$$
• Kurtosis characterizes the relative peakedness or flatness of a distribution
$$\mathrm{Kurt} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$
Solution
• Creativity: expand the subtraction of the mean from the current value to the 3rd and 4th degree:
$$(x - \mu)^3 = x^3 - 3x^2\mu + 3x\mu^2 - \mu^3$$
$$(x - \mu)^4 = x^4 - 4x^3\mu + 6x^2\mu^2 - 4x\mu^3 + \mu^4$$
• Mathematics: multiplication is distributive over addition
$$3x_1\mu^2 + 3x_2\mu^2 = 3\mu^2 (x_1 + x_2)$$
$$\sum_{i=1}^{n} (3 x_i \mu^2) = 3\mu^2 \sum_{i=1}^{n} x_i$$
• CLR aggregate functions can use the same algorithm
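The expansion means that skewness and kurtosis need only the running sums of x, x^2, x^3 and x^4. The Python sketch below illustrates that arithmetic; it is not the session's T-SQL or CLR code. It assumes the sample standard deviation (n - 1 in the denominator) and at least four values with non-zero spread. The function name `skew_kurt_one_pass` is hypothetical.

```python
import math

def skew_kurt_one_pass(values):
    """Return (skewness, kurtosis) using power sums from a single scan."""
    n = 0
    s1 = s2 = s3 = s4 = 0.0
    for x in values:
        n += 1
        s1 += x
        s2 += x * x
        s3 += x ** 3
        s4 += x ** 4
    mu = s1 / n
    # Central sums obtained from the expanded powers of (x - mu)
    m2 = s2 - n * mu ** 2
    m3 = s3 - 3 * mu * s2 + 3 * mu ** 2 * s1 - n * mu ** 3
    m4 = s4 - 4 * mu * s3 + 6 * mu ** 2 * s2 - 4 * mu ** 3 * s1 + n * mu ** 4
    var = m2 / (n - 1)
    sigma = math.sqrt(var)
    skew = n / ((n - 1) * (n - 2)) * m3 / (var * sigma)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * m4 / (var * var)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt

print(skew_kurt_one_pass([2, 4, 4, 4, 5, 5, 7, 9]))
```

As with any formula built on raw power sums, precision can suffer for values with a large mean and a small spread.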
Linear Dependencies
Continuous Variables
• The deviation of the actual from the expected probabilities is the Covariance:
CoVar(X,Y) = SUM((Xi – Mean(X)) * (Yi – Mean(Y)) * P(Xi,Yi))
• Divide the covariance by the product of the standard deviations of both variables and we get the Correlation Coefficient:
Correl = CoVar(X,Y) / (StDev(X) * StDev(Y))
• The squared correlation coefficient is the Coefficient of Determination:
CD = SQUARE(Correl)
Solution
• SQL 2012: use window aggregate functions
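The covariance and correlation formulas can also be evaluated from running sums in one scan. The Python sketch below illustrates this under two assumptions that are mine, not the session's: every pair has the same probability P(Xi,Yi) = 1/n, and the population standard deviation is used so that numerator and denominator match. The function name `one_pass_correlation` is hypothetical.

```python
import math

def one_pass_correlation(pairs):
    """Return (covariance, correlation, coefficient of determination)."""
    n = 0
    sx = sy = sxy = sxx = syy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
        syy += y * y
    mean_x = sx / n
    mean_y = sy / n
    # SUM((x - mean_x) * (y - mean_y)) / n rearranged to SUM(x*y)/n - mean_x*mean_y
    covar = sxy / n - mean_x * mean_y
    stdevp_x = math.sqrt(sxx / n - mean_x ** 2)
    stdevp_y = math.sqrt(syy / n - mean_y ** 2)
    correl = covar / (stdevp_x * stdevp_y)
    return covar, correl, correl ** 2

print(one_pass_correlation([(1, 2), (2, 4), (3, 6), (4, 8)]))
```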
Contingency Tables
• Contingency tables do not rely on numeric values
• The Null Hypothesis: there is no relationship between row and column frequencies
• So there should be no difference between observed (O) and expected (E) frequencies

Observed
Gender  Married  Single  Total
F       4745     4388    9133
M       5266     4085    9351
Total   10011    8473    18484

Expected
Gender  Married  Single  Total
F       4946     4187    9133
M       5065     4286    9351
Total   10011    8473    18484
Linear Dependencies
Discrete Variables
• Chi-Squared formula:
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
• For the Chi-Squared distribution there are already prepared tables with critical points for different degrees of freedom and for a specific confidence level
• Degrees of freedom = the product of the degrees of freedom for columns and rows
$$DF = (C - 1) \cdot (R - 1)$$
Chi-Squared Critical Points
χ² values by degrees of freedom (DF, rows) and probability (columns)

DF   0.95   0.90   0.80   0.70   0.50   0.30   0.20   0.10   0.05   0.01   0.001
1    0.004  0.02   0.06   0.15   0.46   1.07   1.64   2.71   3.84   6.64   10.83
2    0.10   0.21   0.45   0.71   1.39   2.41   3.22   4.60   5.99   9.21   13.82
3    0.35   0.58   1.01   1.42   2.37   3.66   4.64   6.25   7.82   11.34  16.27
4    0.71   1.06   1.65   2.20   3.36   4.88   5.99   7.78   9.49   13.28  18.47
5    1.14   1.61   2.34   3.00   4.35   6.06   7.29   9.24   11.07  15.09  20.52
6    1.63   2.20   3.07   3.83   5.35   7.23   8.56   10.64  12.59  16.81  22.46
7    2.17   2.83   3.82   4.67   6.35   8.38   9.80   12.02  14.07  18.48  24.32
8    2.73   3.49   4.59   5.53   7.34   9.52   11.03  13.36  15.51  20.09  26.12
9    3.32   4.17   5.38   6.39   8.34   10.66  12.24  14.68  16.92  21.67  27.88
10   3.94   4.86   6.18   7.27   9.34   11.78  13.44  15.99  18.31  23.21  29.59

Probabilities 0.95 to 0.10: not significant
Probabilities 0.05 to 0.001: significant
Solution
• Problem: calculate expected frequencies
• SQL 2012 and creativity: use window aggregate functions
• Read the statistical significance from a pre-prepared table
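The expected frequencies follow from the row and column totals: each expected cell is row total * column total / grand total. The Python sketch below (an illustration of the arithmetic only, with the hypothetical function name `chi_squared`) applies this to the observed Gender by marital status counts and then evaluates the Chi-Squared sum and the degrees of freedom.

```python
def chi_squared(observed):
    """observed: list of rows, each a list of cell counts.

    Returns (expected frequencies, chi-squared value, degrees of freedom).
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected cell = row total * column total / grand total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    df = (len(col_totals) - 1) * (len(row_totals) - 1)
    return expected, chi2, df

# Rows: F, M; columns: Married, Single (observed counts from the slide).
observed = [[4745, 4388], [5266, 4085]]
expected, chi2, df = chi_squared(observed)
print([[round(e) for e in row] for row in expected], chi2, df)
```

With one degree of freedom, the resulting value can be compared against the DF = 1 row of the critical points table.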
Calculating Statistical Significance
• Why read from a pre-prepared table? Use your own table!
• Calculate the values for a distribution with a definite integral over the distribution function
  – E.g., Gaussian distribution function
  – Standard normal distribution has mean 0 and StDev 1
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}, \qquad Z = \frac{X - \mu}{\sigma}$$
Solution
• Mathematics: trapezoidal formula for approximate definite integration
$$\int_a^b f(x)\,dx \approx \frac{b - a}{2}\,\bigl(f(a) + f(b)\bigr)$$
• For multiple points with equal spacing h:
$$\int_a^b f(x)\,dx \approx \frac{h}{2}\,\bigl[f(x_1) + f(x_n) + 2\,\bigl(f(x_2) + f(x_3) + \dots + f(x_{n-1})\bigr)\bigr]$$
Linear Dependencies
Continuous and Discrete Variables
• ANOVA tests the variance in means between groups
• Null Hypothesis: the only variance comes from variance within and not between samples
• Mean squared deviation between a groups, with $\bar{X}_i$ denoting the mean of group i and $\bar{X}$ denoting the total mean
$$MSA = \frac{SSA}{DFA}, \quad SSA = \sum_{i=1}^{a} n_i\,(\bar{X}_i - \bar{X})^2, \quad DFA = (a - 1)$$
One-Way ANOVA and F-Test
• Mean squared deviation within a groups, with $n_i$ cases in each group
$$MSE = \frac{SSE}{DFE}, \quad SSE = \sum_{i=1}^{a} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2, \quad DFE = \sum_{i=1}^{a} (n_i - 1)$$
• F ratio
$$F = \frac{MSA}{MSE}$$
  – The bigger the F ratio, the more sure you can be in rejecting the Null Hypothesis
  – Use F tables for critical points
Solution (1)
• Mathematics: understand the ANOVA formula
• SQL 2012: use aggregate and ranking window functions
• Creativity:
$$DFA = (a - 1) = \mathrm{MAX}(\mathrm{DENSE\_RANK}(groups) - 1)$$
$$DFE = \sum_{i=1}^{a} (n_i - 1) = \mathrm{COUNT}(*) - \mathrm{MAX}(\mathrm{DENSE\_RANK}(groups))$$
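The ANOVA pieces fit together as in the Python sketch below, which is meant only to illustrate the formulas and not the window-function solution from the session. It assumes each group is given as a list of values; the function name `one_way_anova` and the sample data are hypothetical. DFE is computed as the total row count minus the number of groups, mirroring COUNT(*) minus the highest dense rank.

```python
def one_way_anova(groups):
    """groups: list of lists of numeric values, one inner list per group."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ssa = 0.0   # between-group sum of squares
    sse = 0.0   # within-group sum of squares
    for g in groups:
        group_mean = sum(g) / len(g)
        ssa += len(g) * (group_mean - grand_mean) ** 2
        sse += sum((x - group_mean) ** 2 for x in g)
    dfa = len(groups) - 1
    dfe = n_total - len(groups)   # same as the sum of (n_i - 1)
    msa = ssa / dfa
    mse = sse / dfe
    return {"SSA": ssa, "SSE": sse, "DFA": dfa, "DFE": dfe, "F": msa / mse}

print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```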
Solution (2)
• How to get the statistical significance for the F value?
  – Not from a table, not with definite integration
• .NET function Chart.DataManipulator.Statistics.Fdistribution
  – Unfortunately in the System.Windows.Forms.DataVisualization.Charting namespace
  – Not supported by SQL CLR
• Creativity: use a console application + SQLCMD mode
Moving Averages
• Simple moving averages (SMA)
$$SSMA_i = \Bigl(\sum_{k=i-1}^{i+1} v_k\Bigr) / 3$$
• Weighted moving averages (WMA)
$$SWMA_i = \Bigl(\sum_{k=i-2}^{i} v_k \cdot w_k\Bigr) / 3, \qquad \sum_{i} w_i = 1$$
• Exponential moving averages (EMA)
$$SEMA_i = (v_i \cdot \alpha) + (SEMA_{i-1} \cdot \beta), \qquad \alpha + \beta = 1$$
Solution (1)
• SMA: SQL 2012 aggregate window functions
• WMA: SQL 2012 aggregate and analytic window functions
Solution (2)
• EMA: SQL 2012 aggregate and analytic window functions
• EMA: creativity and mathematics – transform the EMA formula to include values only, not the previous EMA
$$SEMA_i = \sum_{j=2}^{i} \bigl(\alpha \cdot \beta^{\,i-j} \cdot v_j\bigr) + \beta^{\,i-1} \cdot v_1$$
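The transformation can be checked numerically. The Python sketch below compares the recursive definition with the values-only form; it illustrates the algebra and is not the session's T-SQL. It assumes a smoothing weight `alpha` for the current value, `beta = 1 - alpha` for the previous EMA, and the first EMA seeded with the first value. All names and the sample series are hypothetical.

```python
def ema_recursive(values, alpha):
    """EMA where each result depends on the previous EMA."""
    beta = 1.0 - alpha
    out = [values[0]]                  # assumption: first EMA equals first value
    for v in values[1:]:
        out.append(alpha * v + beta * out[-1])
    return out

def ema_values_only(values, alpha):
    """EMA written in terms of the original values only."""
    beta = 1.0 - alpha
    out = []
    for i in range(1, len(values) + 1):        # i is 1-based, as in the formula
        weighted = sum(alpha * beta ** (i - j) * values[j - 1]
                       for j in range(2, i + 1))
        out.append(weighted + beta ** (i - 1) * values[0])
    return out

series = [10.0, 12.0, 11.0, 15.0, 14.0]
print(ema_recursive(series, 0.7))
print(ema_values_only(series, 0.7))
```

The values-only form recomputes a sum for every row, so it trades the dependency on the previous result for more arithmetic per row.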
Solution (3)
• Would be much simpler with access to the interim calculations when SQL Server does window calculations
  – Not exposed yet – future versions?
• Other possibilities
  – Recursive CTE
  – Good old cursor!
Q&A
• Thank you!
• Please give feedback to us: http://speakerscore.com/sqlsaturday376