Chapter 1 Introduction
Outline and Definitions
Statistics: art and science of collecting, analyzing and interpreting data
 Key is not only generating relevant measures but interpreting them
Used in a wide variety of academic disciplines and business environments, such as:
i) Accounting – statistical sampling procedures when conducting audits
ii) Economics – economic forecasting
iii) Marketing – industry analysis and surveys
iv) Production – quality control techniques
v) Finance – investment prospects and maintenance
vi) Insurance – actuarial science
Data: facts and figures collected and summarized for a given topic, research or area of concern
 may be the most time-consuming part of a project (may also be very costly)
 obtained through internal systems (company databases), experiments, external sources
(Hoovers), government agencies (Bureau of Labor Statistics, Bureau of the Census) and
general business websites
 when existing data is not available → conduct a statistical study (experiment / survey)
Data Set: complete set of data (i.e. data collected in a study)
Elements: entities for which data is collected
Eg) For data on Y-town businesses, individual businesses are elements
Variables: unique characteristic/category of interest for each element
Eg) For each Y-town business, you collect info on type of industry, # of employees,
years in business, annual revenue, etc.
Observation: set of measurements for each element
Eg) Type of industry, # of employees, years in business, etc. for Youngstown
Propane (element)
 The total number of data values in a data set is the number of elements multiplied by the
number of variables.
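The counting rule above can be checked on a tiny data set. This is a minimal sketch: the business names, variable names and values are made up for illustration and are not taken from the chapter.

```python
# Hypothetical data set: each row is one element (a business),
# each key other than "name" is a variable. All values are invented.
data_set = [
    {"name": "Business A", "industry": "Energy", "employees": 12, "years_in_business": 30},
    {"name": "Business B", "industry": "Retail", "employees": 45, "years_in_business": 8},
    {"name": "Business C", "industry": "Food service", "employees": 20, "years_in_business": 15},
]
variables = ["industry", "employees", "years_in_business"]

n_elements = len(data_set)
n_variables = len(variables)

# An observation is the set of measurements for a single element.
observation = {v: data_set[0][v] for v in variables}

# Count every recorded data value directly, then compare with the rule:
# total data values = number of elements x number of variables.
n_values = sum(1 for row in data_set for v in variables if v in row)
print(n_elements, n_variables, n_values)  # 3 3 9
```

With 3 elements and 3 variables the data set holds 9 data values, and each element contributes one observation of 3 measurements.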
Example
Categorical vs. Quantitative Data
-- data for a variable can be described as one of 2 types
A) Categorical Data: Data that is grouped or is identified by specific categories (use of
labels/names). Also known in some texts as Qualitative Data
 Limited statistical summaries (i.e., counts within categories and % of observations
within categories)
 Can be represented by numeric labels (coding) or non-numeric labels
 Uses either the nominal or ordinal scales of measurement (see below)
Categorical Variable: Variable whose data is represented by categorical data
Example: Ratings, Sex, Religion, Nationality, College Major
B) Quantitative Data: data that is numeric and contains values that indicate how many
or how much
 Always numeric
 Allows for a wide variety of statistical summaries
 Uses either the interval or ratio scales of measurement (see below)
 Can be either continuous or discrete
Quantitative Variable: Variable represented by quantitative data
Example: Revenue, Prices, Income
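The difference in available summaries can be sketched in a few lines. The observations below are invented for illustration; College Major and Income are borrowed from the examples above as one categorical and one quantitative variable.

```python
from collections import Counter
from statistics import mean

# Made-up observations for illustration only.
college_major = ["Finance", "Marketing", "Finance", "Accounting", "Finance"]  # categorical
income = [42000, 55000, 61000, 38000, 54000]                                  # quantitative

# Categorical data: count within categories and % of observations per category.
counts = Counter(college_major)
percents = {label: 100 * n / len(college_major) for label, n in counts.items()}

# Quantitative data: arithmetic summaries such as the mean are meaningful.
average_income = mean(income)

print(counts["Finance"], percents["Finance"], average_income)
```

Averaging the majors would make no sense, while counting and averaging are both available for income.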
Scales of Measurement
 For each element, data collected for a specific variable is categorized as having one of 4
scales of measurement
 Scale of measurement is an assignment describing the type of data contained within a
variable → information about the data
 Dictates the data summarization and statistical analysis that are most appropriate
1) Nominal Scale (categorical data where order is not important)
 Variable is described as having nominal scale when the data contains labels or
names
 Labels can be translated into numeric codes (the assignment of numbers is arbitrary)
 Categories are mutually exclusive
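One way to see why the numeric codes are arbitrary is to code the same labels two different ways. This is only a sketch: the labels "A", "B", "C", both codings and the helper functions are hypothetical.

```python
from collections import Counter

# Made-up responses for a nominal variable; the labels are placeholders.
responses = ["A", "B", "A", "C", "B", "A"]

# Two equally valid, arbitrary numeric codings of the same labels.
coding_1 = {"A": 1, "B": 2, "C": 3}
coding_2 = {"A": 3, "B": 1, "C": 2}

def frequencies(labels, coding):
    """Count observations per category via the codes, reported by label."""
    decode = {code: label for label, code in coding.items()}
    coded = [coding[label] for label in labels]
    return {decode[code]: n for code, n in Counter(coded).items()}

def mean_of_codes(labels, coding):
    """Average of the numeric codes (computed only to show it is not meaningful)."""
    return sum(coding[label] for label in labels) / len(labels)

# Category counts do not depend on which coding was chosen...
same_counts = frequencies(responses, coding_1) == frequencies(responses, coding_2)

# ...but the "average code" does, so it carries no information about the data.
print(same_counts, mean_of_codes(responses, coding_1), mean_of_codes(responses, coding_2))
```

Counts survive any relabeling, which is why they are the appropriate summary for nominal data.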
Example:
2) Ordinal Scale (categorical data where order is important)
 Data has the properties of nominal data and the order or rank of the data is
meaningful or important
 Labels can be translated into numeric codes (numeric coding is normally a logical
process that follows order/rank)
 For numeric codes, differences between values are not meaningful
Example:
Note: Order is important but interval between each value may not be the same or not
well defined
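A short sketch of rank-based coding for an ordinal variable follows. The rating labels and observations are hypothetical; Ratings is used because it appears among the categorical examples above.

```python
# Hypothetical ordinal variable: a rating with ordered labels.
rating_order = ["Poor", "Fair", "Good", "Excellent"]

# Codes follow the rank of the labels (a logical, not arbitrary, assignment).
rating_code = {label: rank for rank, label in enumerate(rating_order, start=1)}

# Made-up observations.
ratings = ["Good", "Poor", "Excellent", "Fair", "Good"]

# Sorting and comparing by rank is meaningful for ordinal data.
sorted_ratings = sorted(ratings, key=rating_code.get)
best = max(ratings, key=rating_code.get)

# Both code differences equal 1, but nothing guarantees that the step from
# "Poor" to "Fair" reflects the same change as from "Good" to "Excellent".
step_low = rating_code["Fair"] - rating_code["Poor"]
step_high = rating_code["Excellent"] - rating_code["Good"]

print(sorted_ratings, best, step_low, step_high)
```

The codes preserve order, so ranking works, but equal code differences should not be read as equal differences in the underlying characteristic.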
3) Interval Scale (Quantitative data where differences between the data are meaningful and
measurable)
 Data has the properties of ordinal data, and the interval (distance) between
observations is expressed in terms of a fixed unit of measure (standardized units) →
distance between values is measured in equal units of constant size across all
levels of the scale
 With fixed or standardized units, differences between data values become
meaningful regardless of position on the scale
 Always numeric
 Point 0 is just another point on the scale
 Does not indicate that nothing exists for that variable at that level
 You must ask the question: Does a 0 value indicate that nothing exists at
that value? If answer is no, then numeric data is interval scale
Example:
4) Ratio Scale (Quantitative data where the ratio of 2 values is meaningful)
 Data for a particular variable is ratio scale if it has all the properties of interval
data and the ratio of two values is meaningful
 Scale must contain a true 0 value indicating that nothing exists for the variable at
the 0 point (absence of characteristic) → does it satisfy the absence logic?
 You must ask the question: Does a 0 value indicate that nothing exists at
that value? If answer is yes, then numeric data is ratio scale
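The 0-value question can also be checked numerically: on a ratio scale the ratio of two values does not depend on the unit, while on an interval scale it does. This is a minimal sketch using temperature (interval) and weight (ratio); the values and helper names are made up for illustration.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def kg_to_lb(kg):
    return kg * 2.20462  # approximate conversion factor

# Interval scale: 0 degrees does not mean "no temperature".
c_low, c_high = 10.0, 20.0
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)

# The ratio changes with the unit, so "twice as hot" is not meaningful.
ratio_c = c_high / c_low   # 20 / 10
ratio_f = f_high / f_low   # 68 / 50

# Ratio scale: 0 kg means no weight, so the ratio survives a change of unit.
kg_light, kg_heavy = 40.0, 80.0
ratio_kg = kg_heavy / kg_light
ratio_lb = kg_to_lb(kg_heavy) / kg_to_lb(kg_light)

print(ratio_c, round(ratio_f, 2), ratio_kg, round(ratio_lb, 2))
```

The temperature ratio is 2 in Celsius but about 1.36 in Fahrenheit, whereas the heavier weight is twice the lighter one in both kilograms and pounds.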
Example:
Cross Sectional and Time Series Data
Cross Sectional Data: data collected at the same or approximately the same point in time
Example:
Time Series Data: data collected over an extended period of time
 Purpose is to show a comparison over time
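The two kinds of data differ mainly in what varies: the element or the time period. A small sketch with invented revenue figures (business names, years and amounts are all hypothetical):

```python
# All figures below are invented for illustration.

# Cross-sectional data: several elements observed at about the same time.
cross_section_2020 = {
    "Business A": 1.2,  # annual revenue in $ millions, year 2020
    "Business B": 3.4,
    "Business C": 0.8,
}

# Time series data: one element observed over successive periods.
business_a_revenue = {2018: 0.9, 2019: 1.0, 2020: 1.2, 2021: 1.5}

# A time series supports comparison over time, e.g. year-over-year change.
years = sorted(business_a_revenue)
changes = {
    later: round(business_a_revenue[later] - business_a_revenue[earlier], 2)
    for earlier, later in zip(years, years[1:])
}
print(changes)
```

The cross section compares businesses within one year; the time series compares one business with itself across years.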
Descriptive Statistics: Method for summarizing and describing a given set of data
Purpose: to make sense of a large volume of info
Form: tables, graphs and numeric summaries (most common is the average or mean)
Statistical Inference: Using sample data from a population to draw conclusions and make
predictions about the characteristics of a whole population
Population: entire set of elements in a given study
Sample: portion or subset of the population
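The relationship between a population, a sample and an inference can be sketched with simulated data. Everything here is an assumption for illustration: the population is randomly generated, and the sample size of 50 is arbitrary.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: annual revenue ($ thousands) for 1,000 businesses.
population = [random.randint(100, 900) for _ in range(1000)]

# Sample: a subset of the population, drawn without replacement.
sample = random.sample(population, k=50)

# Statistical inference: use the sample mean to estimate the population mean.
population_mean = mean(population)
sample_mean = mean(sample)
print(f"population mean = {population_mean:.1f}, sample estimate = {sample_mean:.1f}")
```

In practice the population mean is usually unknown; it is computed here only so the sample estimate can be compared against it.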
Example:
Data Mining: Methods for developing useful decision-making information using data from large
databases
 Use of statistics, mathematics and computer programming to convert raw data into
useful reports/summaries for forecasting, prediction and daily decision making
 Data mining begins with data warehousing (process of capturing, storing and
maintaining data)
Data Warehousing → Data Mining → Reports/Summaries
Appendix: Matrix for Scales of Measurement
Nominal Scale
Data categories are
1. mutually exclusive
Examples: gender, ethnicity, religious affiliation
Ordinal Scale
Data categories are
1. mutually exclusive
2. have a logical order (scaled according to the amount of the particular characteristic they possess)
Examples: class, grade (A, B, C, D, & F), ranks
Interval Scale
Data categories are
1. mutually exclusive
2. have a logical order
3. Equal distance (differences in the characteristics are represented by equal differences in the numbers)
Difference or distance is standardized
* The point 0 is just another point on the scale.
Examples: temperature measured in Celsius and Fahrenheit
Ratio Scale
Data categories are
1. mutually exclusive
2. have a logical order
3. Equal distance
4. true zero point (the point 0 reflects an absence of the characteristic)
* Can do all the mathematical operations usually associated with numbers, including ratios.
Examples: age, time, height, weight, # of chairs in a room