Working with Data
Part 3
QMM 384
Exploring Data One Variable at a Time
We start by studying the variables one at a time for the
Equity.jmp file to get familiar with our data and to uncover
any data quality issues.
Statistics of interest for Continuous variables:
– Mean (average)
– Standard deviation (dispersion around the mean)
– Minimum and maximum values
– Number of missing observations
– Shape of distribution (symmetrical or skewed)
– Outliers
– Errors
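The statistics listed above can also be computed by hand, which helps clarify what the JMP summary reports. The following is a minimal sketch in Python using only the standard library. The `loan` values are hypothetical and are not taken from Equity.jmp; `None` is an assumed marker for a missing observation, and `summarize` is an illustrative helper name.

```python
import statistics

# Hypothetical LOAN-like values (not from Equity.jmp); None marks a missing observation.
loan = [1100, 1300, 1500, None, 1700, 1800, None, 2000, 89900]

def summarize(values):
    """Return basic summary statistics for one continuous variable."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "n_missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "max": max(present),
    }

summary = summarize(loan)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note how the single large value pulls the mean well above most of the observations, which is the kind of pattern the minimum, maximum, and shape checks are meant to reveal.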
• Statistics of interest for Nominal and Ordinal variables:
– Number of categories
– Number of observations in each category
– Number of missing observations
• Use the Columns Viewer and Distribution to provide
the statistics discussed above.
• Go to Cols > Columns Viewer and select all of the
variables from the Select Column list. Click Show
Summary.
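For nominal and ordinal variables, the same three quantities (number of categories, observations per category, and missing observations) can be sketched outside JMP as follows. The `job` values are hypothetical stand-ins for a column such as JOB, and `None` is an assumed marker for a missing value.

```python
from collections import Counter

# Hypothetical JOB-like values (not from Equity.jmp); None marks a missing observation.
job = ["Office", "ProfExe", "Other", None, "ProfExe", "Mgr", "Other", None, "Other"]

present = [v for v in job if v is not None]
counts = Counter(present)

n = len(present)              # non-missing observations (N)
n_missing = len(job) - n      # missing observations (N Missing)
n_categories = len(counts)    # distinct categories

print(f"N = {n}, N Missing = {n_missing}, categories = {n_categories}")
for category, count in counts.most_common():
    print(f"{category}: {count} ({count / n:.1%})")
```

The percentages here are computed over non-missing observations only, which is one reasonable convention; it is worth checking which denominator a given tool uses.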
• For the nominal variables, you can see the
number of observations (N), the number of
missing values (N Missing), and the
number of categories.
• How many missing cases are there for the
nominal variable JOB?
• How many categories are there for the
nominal variable REASON?
• For the continuous variables, we see N and N Missing
in addition to summary statistics.
• Which two continuous variables have the most missing
cases? Is this a concern?
• Min and Max give the range of the data for each
variable.
• The mean is the average for each variable, and the
standard deviation is a measure of dispersion around
the mean.
• Next we will explore the information provided by
Distribution.
• You can get to Distribution through the Analyze menu
in the Data Table.
– Go to Analyze > Distribution and then select all
variables, click Y, Columns, and OK.
• You can also access this information by clicking the
Distribution button above the statistics in the window
created by the Columns Viewer.
– Highlight all of the variables and click Distribution.
• Categorical variables show bar charts and frequency
distributions.
• Continuous variables display histograms, box plots,
and summary statistics.
• The default view is vertical to allow more information
to be shown.
• To change to a horizontal view, click on the red triangle
next to Distribution and select Stack.
• To change this permanently, in the main JMP window
select File > Preferences > Platforms > Distribution
(from the list) > Stack > Apply > OK.
• Returning to our Equity file, what percentage of the
customers have JOB as ProfExe?
• The graphs also show that LOAN, MORTDUE, and
VALUE all have extreme values far greater than their
means. This means that each distribution is
skewed to the right.
• Another useful feature in JMP: click on the bar for
bad risk customers in the BAD graph (BAD=1).
• This highlights the bar and also shows us how bad
risk customers are distributed across all of the other
variables.
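Clicking the BAD=1 bar amounts to selecting the bad risk rows and then looking at how that subset falls across another variable. A rough sketch of the same idea in Python is shown here. The records are hypothetical, and the REASON labels `DebtCon` and `HomeImp` are assumed category names for debt consolidation and home improvement, not values confirmed from Equity.jmp.

```python
from collections import Counter

# Hypothetical (BAD, REASON) records; the real values live in Equity.jmp.
records = [
    (1, "DebtCon"), (0, "HomeImp"), (1, "DebtCon"), (0, "DebtCon"),
    (1, "HomeImp"), (0, "DebtCon"), (0, "HomeImp"), (1, "DebtCon"),
]

# Select the bad risk rows (BAD == 1) and count them by REASON.
bad_reasons = Counter(reason for bad, reason in records if bad == 1)
# Count all rows by REASON for comparison.
all_reasons = Counter(reason for _, reason in records)

for reason in sorted(all_reasons):
    print(f"{reason}: {bad_reasons[reason]} of {all_reasons[reason]} are bad risk")
```

Comparing the selected count against the total in each category is what the highlighted portion of each bar conveys visually.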
• Do you see that more bad risk customers are consolidating debt
than taking out a loan for home improvement?
• Also observe that two customers have loan ages near 1200 days
(see the box plot to the right of the graph).
• In summary, here are the data quality issues:
– Missing values for many variables (DEBTINC and DEROG).
– Skewed distributions and a long tail for some variables (LOAN,
MORTDUE, VALUE, CLAGE).
– Messy data – continuous variables with many zeros (DEROG and
DELINQ).
– Outliers in CLAGE.
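Two of the issues above, right skew and outliers, can be checked numerically. The sketch below compares the mean with the median (a mean well above the median suggests right skew) and flags points beyond the 1.5 × IQR fences, a common box plot convention. The `clage` values are hypothetical, and the quartile method used by Python's `statistics.quantiles` may differ from the one JMP uses, so the fences are illustrative only.

```python
import statistics

# Hypothetical CLAGE-like values with two extreme points (not from Equity.jmp).
clage = [95, 110, 120, 135, 150, 170, 180, 200, 230, 1150, 1170]

mean = statistics.mean(clage)
median = statistics.median(clage)

# Quartiles using the default "exclusive" method of statistics.quantiles.
q1, _, q3 = statistics.quantiles(clage, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = [v for v in clage if v < lower_fence or v > upper_fence]

print(f"mean = {mean:.1f}, median = {median}")
print(f"fences = ({lower_fence:.1f}, {upper_fence:.1f})")
print(f"outliers = {outliers}")
```

Flagged values are candidates for review rather than automatic deletions; whether they are errors or genuine extremes depends on the data.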
• Later in this chapter, we will see how to address these
issues.
• Saving your work is important. The easiest way to do
this is to save the script (written in JSL, the JMP Scripting
Language) from any JMP output window by clicking
on the top red triangle and selecting Script > Save
Script to Data Table.
• The saved script will appear in the table panel of the
data table (top left corner).
• The output can be recreated by clicking on the red
triangle and selecting Run Script.