3.6 Measures of Similarity and Dissimilarity, Data Mining Applications

Data mining focuses on (1) the detection and correction of data quality problems and (2) the use of algorithms that can tolerate poor data quality.

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. Apart from these definitions, as data volume increases, the question of internal consistency within the data becomes paramount, regardless of fitness for use for any external purpose; for example, a person's age and birth date may conflict within different parts of a database. The first two views can often be in disagreement, even about the same set of data used for the same purpose.

Definitions are:
• Data quality: the processes and technologies involved in ensuring the conformance of data values to business requirements and acceptance criteria.
• The degree of excellence exhibited by the data in relation to the portrayal of the actual scenario.
• The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.

Data quality aspects:
• Data size, complexity, sources, types and formats
• Data processing issues, techniques and measures

"We are drowning in data, but starving for knowledge" (Jiawei Han).

Dirty data

What does dirty data mean?
• Incomplete data (missing attributes, missing attribute values, only aggregated data, etc.)
• Inconsistent data (different coding schemes and formats, impossible or out-of-range values)
• Noisy data (errors and typographical variations, outliers, inaccurate values)

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
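The kinds of dirty data listed above can often be flagged with simple rule-based checks. The following is a minimal Python sketch, not a complete validation scheme. The record fields (`name`, `age`, `birth_date`), the sample values, the fixed reference date and the `max_age` limit are all assumptions made for illustration. It reports missing values, out-of-range ages, and the age versus birth date conflict mentioned earlier.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"name": "A", "age": 30, "birth_date": date(1994, 5, 1)},
    {"name": "B", "age": None, "birth_date": date(1980, 1, 15)},  # incomplete
    {"name": "C", "age": 25, "birth_date": date(1970, 3, 9)},     # inconsistent
    {"name": "D", "age": 412, "birth_date": None},                # out of range
]

def age_on(birth_date, today):
    """Age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    return today.year - birth_date.year - (0 if had_birthday else 1)

def find_problems(record, today, max_age=130):
    """Return a list of data quality problems found in one record."""
    problems = []
    # Incomplete data: any field without a value.
    missing = [field for field, value in record.items() if value is None]
    if missing:
        problems.append("incomplete: missing " + ", ".join(missing))
    age, born = record["age"], record["birth_date"]
    # Impossible or out-of-range value (max_age is an assumed limit).
    if age is not None and not 0 <= age <= max_age:
        problems.append("out of range: age")
    # Internal inconsistency: stored age disagrees with the birth date.
    if age is not None and born is not None and age != age_on(born, today):
        problems.append("inconsistent: age vs birth_date")
    return problems

today = date(2024, 6, 1)  # fixed reference date so the result is reproducible
report = {r["name"]: find_problems(r, today) for r in records}
```

Here `report` maps each record name to the list of problems found, with an empty list for a clean record.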
Aspects of data quality include:
• Accuracy
• Completeness
• Update status
• Relevance
• Consistency across data sources
• Reliability
• Appropriate presentation
• Accessibility

3.7.1 Measurement and data collection issues

Consider the statement "a person has a height of 2 meters but weighs only 2 kg". These values are inconsistent with each other. It is unrealistic to expect that data will be perfect.

Measurement error refers to any problem resulting from the measurement process. For a numeric attribute, the numerical difference between the measured value and the actual value is called the error. Data collection error refers to problems such as omitting data objects or attribute values, or inappropriately including a data object. Both types of error can be random or systematic.

Noise and artifacts

Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. Data mining therefore relies on robust algorithms that produce acceptable results even when noise is present. Data errors may also result from a more deterministic phenomenon; such deterministic distortions are called artifacts.

Precision, Bias, and Accuracy

The quality of the measurement process and of the resulting data is measured by precision and bias. Precision is the closeness of repeated measurements of the same quantity to one another, and bias is a systematic variation of the measurements from the quantity being measured. Accuracy refers to the degree of measurement error in the data.

Outliers

Outliers are data objects whose characteristics differ from most other objects in the data set, or attribute values that are unusual with respect to the typical values of that attribute.

Missing Values

It is not unusual for an object to be missing one or more attribute values. In some cases the information was not collected, for example when fields on application forms or web page forms are left blank. Strategies for dealing with missing data are as follows:
• Eliminate data objects or attributes with missing values.
• Estimate missing values.
• Ignore the missing values during analysis.

Inconsistent values

Suppose a locality such as Kengeri has the zip code 560060. If a user enters some other zip code for this locality, the record contains an inconsistent value.

Duplicate data

A data set sometimes contains the same object more than once; such objects are called duplicate data.
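To detect and eliminate duplicate data, two main issues must be addressed. First, if two objects actually represent a single object, the values of corresponding attributes may differ, and these inconsistent values must be resolved. Second, care is needed to avoid accidentally combining objects that are similar but not duplicates, such as two distinct people with identical names.

Issues related to applications are the timeliness of the data, knowledge about the data and the relevance of the data.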
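For the duplicate data discussed above, the following minimal Python sketch merges records that represent the same object while keeping apart records that only look alike. It rests on assumptions that do not come from the text: records are dictionaries with hypothetical `name`, `email` and `zipcode` fields, a normalized email address identifies a person, and differing attribute values are resolved by keeping the first non-missing value.

```python
# Hypothetical records: rows 0 and 1 describe the same person (same email
# after normalization); row 2 is a different person with the same name.
records = [
    {"name": "R. Kumar", "email": "rk@example.com", "zipcode": None},
    {"name": "R. Kumar", "email": "RK@example.com ", "zipcode": "560060"},
    {"name": "R. Kumar", "email": "ravi@example.org", "zipcode": "560001"},
]

def identity_key(record):
    # Assumption: a normalized email identifies a person. Matching on the
    # name alone would wrongly combine two distinct people with one name.
    return record["email"].strip().lower()

def merge(first, second):
    # Assumption: resolve differing values by keeping the first non-missing
    # value for each field; a real application may need another rule.
    return {
        field: first[field] if first[field] is not None else second[field]
        for field in first
    }

def deduplicate(rows):
    merged = {}
    for row in rows:
        key = identity_key(row)
        merged[key] = merge(merged[key], row) if key in merged else dict(row)
    return list(merged.values())

unique = deduplicate(records)
```

With this sample data, `unique` holds two records: the first two rows are merged into one, with the missing zip code filled in from the second row, and the third row is kept separate even though the name is identical.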

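The three strategies for missing values listed in section 3.7.1 can be compared on a small example. This is a Python sketch under stated assumptions: the rows and their `height` and `weight` fields are made-up values, `None` marks a missing value, and the mean of the observed values is only one of many possible estimates.

```python
from statistics import mean

# Made-up data; None marks a missing attribute value.
rows = [
    {"height": 170.0, "weight": 65.0},
    {"height": None, "weight": 80.0},
    {"height": 180.0, "weight": None},
    {"height": 160.0, "weight": 50.0},
]
fields = ["height", "weight"]

def observed(data, field):
    """Values of one attribute that are actually present."""
    return [r[field] for r in data if r[field] is not None]

# Strategy 1: eliminate data objects that have any missing value.
complete_rows = [r for r in rows if all(r[f] is not None for f in fields)]

# Strategy 2: estimate missing values, here with the mean of the observed
# values of the same attribute (an assumed, simple estimate).
means = {f: mean(observed(rows, f)) for f in fields}
filled_rows = [
    {f: means[f] if r[f] is None else r[f] for f in fields} for r in rows
]

# Strategy 3: keep every object and ignore missing values during analysis,
# so each statistic uses all values present for that attribute.
weight_mean_ignoring = mean(observed(rows, "weight"))

# For comparison: the same statistic after eliminating incomplete objects.
weight_mean_complete = mean(observed(complete_rows, "weight"))
```

Eliminating objects discards the observed weight of 80.0 together with its row, so the mean weight drops from 65.0 to 57.5. Ignoring missing values during analysis keeps that information.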