
Data Mining Techniques
White Paper, 2016

Prepared by Mehmet BEYAZ
TTG International, L.T.D.
www.ttgint.com
30/06/2016

Words of Wisdom

You will see it as you like to see.
- Mevlana Jalaluddin Rumi

Introduction

Everyone knows that the Internet and smartphones have changed how businesses operate, governments function, and society lives and communicates. Recently, a new technological trend has proved just as transformative: “big data.” Big data starts with the fact that there is far more information in circulation than ever before, and that it is being put to extraordinary new uses. Big data is about more than just communication. Since we live in the world of big data, the idea is that we can learn things from a large body of information that we could not comprehend when we used only smaller amounts.

DATA MINING

We live in a world that generates a vast amount of digital data, known as big data. The volume keeps growing as the world becomes more and more connected through the Internet of Things (IoT), which has been a major influence on the big data landscape. These data are collected deliberately from many different sources, at intervals ranging from five minutes to hourly and daily. Analysing such data raises business competition to the next level of innovation and productivity, so extracting and interpreting the hidden patterns in data sets is of great importance. Data mining is a modern tool that aims to discover meaningful knowledge in large data sets and to predict trends. It offers not only a retrospective view of a business process but also helps people develop a successful market strategy.

Origins

Data mining originated in the 1980s, when it was introduced and used within the research community. It is also known as KDD (Knowledge Discovery in Databases) and is sometimes referred to as data analytics.
Strictly speaking, data mining is defined as one component of the KDD process, the one that examines the inner patterns in databases, whereas KDD as a whole is also concerned with evaluating and interpreting the discovered patterns. Although the exact meanings of the two terms differ, they are often used interchangeably. In this paper I use KDD and data mining as synonyms unless otherwise specified.

Data mining is the analysis of large observational data sets to find previously unknown relationships within a variety of data sets and to summarize the data in novel ways that are both understandable and useful to the data owner. Its computational methods lie at the intersection of classical statistics, artificial intelligence, and machine learning. As a complete knowledge discovery process, data mining also involves many other disciplines, such as databases, data cleaning, visualization, exploratory data analysis, and performance and KPI evaluation.

Methods

Data mining techniques are categorized into supervised, semi-supervised, and unsupervised methods.

In a supervised method you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output:

Y = f(X)

The goal is to approximate the mapping function so well that, when you have new input data (X), you can predict the output variable (Y) for that data. It is called supervised learning because the process of an algorithm learning from the training data set can be thought of as a teacher supervising the learning process.

Unlike the supervised approach, the unsupervised technique models the underlying structure or distribution of the data in order to learn more about the data. It is called unsupervised learning because, unlike supervised learning, there are no correct answers and there is no teacher.
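The supervised setting described above, Y = f(X), can be illustrated with a minimal sketch. It assumes that f is well approximated by a straight line and fits it by ordinary least squares; the linear model, the function names `fit_line` and `predict`, and the training values are all illustrative assumptions, not a method taken from this paper.

```python
# Minimal sketch of supervised learning: approximate Y = f(X) from labelled pairs.
# The linear model and the data values below are illustrative assumptions.

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept on the training set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the learned mapping to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


# Training data: inputs X with known outputs Y (the "teacher's" answers).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

model = fit_line(train_x, train_y)
prediction = predict(model, 5.0)  # new input that was not in the training set
print(prediction)  # 11.0
```

The training pairs play the role of the teacher: the algorithm only ever sees inputs together with their correct outputs, and is judged on how well it generalizes to an input it has not seen.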
In unsupervised learning, algorithms are left to their own devices to discover and present the interesting structure in the data.

Data scientists also identify semi-supervised learning, which resembles the supervised kind. Problems where you have a large amount of input data (X) and only some of it is labelled (Y) are called semi-supervised learning problems. They sit between supervised and unsupervised learning. A good example is a photo archive in which only some of the images are labelled (e.g. dog, cat, cow, person) and the majority are unlabelled. Many real-world machine learning problems fall into this area, because labelling data can be expensive or time-consuming and may require access to domain experts, whereas unlabelled data is cheap and easy to collect and store. You can use unsupervised learning techniques to discover and learn the structure in the input variables.

In Summary

In this paper you have so far learned the difference between supervised, unsupervised, and semi-supervised learning. You now know that:

- Supervised: All data is labelled and the algorithms learn to predict the output from the input data.
- Unsupervised: All data is unlabelled and the algorithms learn the inherent structure from the input data.
- Semi-supervised: Some data is labelled but most of it is unlabelled, and a mixture of supervised and unsupervised techniques can be used.

The aims of data mining can be divided into different categories of process. Discovery focuses on searching a database for hidden patterns, without a predefined hypothesis about the nature of the pattern, and on deriving a model of the causal generator of the data. Data mining requires data in which to find patterns, and it usually falls into two main categories, predictive and descriptive, each of which is divided further into specific tasks. See Figure 1 below.

Predictive:

- Classification aims to categorize unseen input data records into known classes. The assignment model, or classifier, learns from the training data set, where the relationship between records and classes is provided.
- Regression aims to predict numerical values for input data records. The mapping function learns from the training data set, where the relationship between records and their values is known.
- Time series forecasting predicts the future value of a target function based on previously observed measurements.
- Anomaly detection extracts points, or outliers, that are considerably different from the rest of the data points.

Descriptive:

- Clustering identifies groups of points, called clusters, with similar properties or behaviours.
- Association analysis discovers relationships between records within the same data set.

Figure 1: Data mining techniques.

Knowledge Discovery in Databases Process

KDD is the automatic, exploratory analysis and modelling of large data sources. It is the organized process of identifying valid, novel, useful, and human-understandable patterns in large and complex data sets. Data mining is the core of the KDD process: it involves applying algorithms that explore the data, develop the model, and discover previously unknown patterns. The KDD process is iterative and interactive, and consists of nine steps.

Figure 2: The KDD process.

The unifying goal of the KDD process is to extract useful information from data in the context of large databases. Data mining refers to the set of computational methods that extract valuable patterns from the original data. In addition, the KDD process is concerned with handling massive data, scaling algorithms for better performance, properly interpreting the retrieved information, and human interaction with the overall process.
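Two of the tasks listed earlier in this paper, classification and anomaly detection, can be illustrated with a minimal sketch. The nearest-neighbour rule, the z-score threshold of 2.0, the function names, and the sample values are all assumptions chosen for illustration; the paper does not prescribe a particular algorithm for either task.

```python
import math
from statistics import mean, pstdev


def nearest_neighbour_classify(train, record):
    """Classification: return the label of the training record closest to `record`.

    `train` is a list of (features, label) pairs, i.e. records whose
    class is already known.
    """
    closest = min(train, key=lambda pair: math.dist(pair[0], record))
    return closest[1]


def zscore_outliers(values, threshold=2.0):
    """Anomaly detection: values more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical training set: two features per record, with known classes.
training_set = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]
label = nearest_neighbour_classify(training_set, (7.0, 9.0))
print(label)  # high

# Hypothetical measurements with one value far from the rest.
measurements = [10, 11, 9, 10, 12, 10, 11, 50]
outliers = zscore_outliers(measurements)
print(outliers)  # [50]
```

The classifier needs labelled records to learn from, whereas the outlier check works on the measurements alone. A z-score test is only one simple choice; it assumes the bulk of the data is roughly symmetric around its mean.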
The KDD process is a sequential analysis that includes the following steps (see Figure 2):

- selection,
- pre-processing,
- transformation,
- data mining, and
- interpretation of the information.

However, this sequential approach to knowledge extraction may involve iterations, because at any point the data analyst can change settings and repeat previous steps. The process starts with determining the KDD goals and ends with the implementation of the discovered knowledge. The basic KDD sequence may therefore include closed loops: the effects of the implemented knowledge are measured on the new data repositories, and the KDD process is launched again.

The knowledge exploration process starts with developing the necessary theoretical and practical background in the application domain. Understanding the relevant knowledge is important for achieving the customer’s goals. The following is a brief description of the nine-step KDD process.

Selection

This step selects the target data set based on the goals. It determines what data will be used for knowledge discovery: finding out what data is available, obtaining any additional data that is needed, and integrating all of it into one data set. This step is very important because data mining learns and discovers from the available data.

Pre-processing

The quality of the selected data is often inadequate for further analysis, for several reasons. Outliers, missing values, or a high level of measurement noise require special data-handling strategies. Data reliability is enhanced in this stage.

Transformation

This step can be crucial for the success of the entire KDD project, and it is usually very project-specific. Transformation projects the original data into a low-dimensional embedded space (dimension reduction), using linear and nonlinear methods. The reduced set of embedded features allows visual inspection and facilitates the further mining of knowledge.
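The pre-processing and transformation steps above can be sketched as follows. This is only an illustration under stated assumptions: missing values are filled with the column mean, and the dimension reduction is a linear one (principal component analysis restricted to two input features), neither of which is named in this paper. The function names and data values are hypothetical.

```python
import math


def impute_missing(column):
    """Pre-processing: replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]


def project_to_first_component(points):
    """Transformation: linear dimension reduction of 2-D points to 1-D (PCA)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centred = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centred) / n
    b = sum(x * y for x, y in centred) / n
    c = sum(y * y for _, y in centred) / n
    # Angle of the direction of largest variance: tan(2 * theta) = 2b / (a - c).
    theta = 0.5 * math.atan2(2 * b, a - c)
    ux, uy = math.cos(theta), math.sin(theta)
    # Each point is replaced by its coordinate along that direction.
    return [x * ux + y * uy for x, y in centred]


# Hypothetical measurements: the second x value was not recorded.
xs = impute_missing([0.0, None, 2.0])  # the gap is filled with the mean, 1.0
ys = [0.0, 1.0, 2.0]

embedded = project_to_first_component(list(zip(xs, ys)))
print(embedded)  # three 1-D coordinates along the line y = x
```

Because the three points lie on a straight line, a single embedded feature preserves all of their structure. On real data some information is lost in the projection, which is why the choice of transformation is so project-specific.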
Data mining

The core element of the KDD process is the data mining phase, which includes several steps. There are two major goals in data mining: prediction and description. Depending on the customer’s goal, a specific data mining task is chosen: classification, anomaly detection, regression, or clustering. The chosen data mining algorithm is then executed to search for underlying patterns and valuable knowledge.

Interpretation/Evaluation

The final step of the KDD process is the interpretation and evaluation of the retrieved information with respect to the goals defined in the first step. This step involves techniques for visual analysis and a number of performance metrics. Correct interpretation of the results is important because it allows assumptions to be checked and the parameters of previous KDD components to be tuned. Finally, the discovered knowledge and the designed KDD algorithm may be incorporated into an existing business model. Possible usage scenarios include reporting and prediction, and the optimization and automation of business processes.
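The KDD steps described in this paper (selection, pre-processing, transformation, data mining, and interpretation/evaluation) can be chained together in a minimal sketch. Everything specific here is a hypothetical illustration: the record fields (`requests`, `errors`, `faulty`), the derived error-rate feature, the threshold classifier, and the accuracy metric are assumptions, not taken from the paper.

```python
from statistics import mean


def select(records, fields):
    """Selection: keep only the fields relevant to the goal."""
    return [{f: r.get(f) for f in fields} for r in records]


def preprocess(rows):
    """Pre-processing: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def transform(rows):
    """Transformation: reduce two features to one derived feature (error rate)."""
    return [(r["errors"] / r["requests"], r["faulty"]) for r in rows]


def mine(samples):
    """Data mining: learn a threshold halfway between the two class means."""
    faulty = mean(x for x, y in samples if y)
    healthy = mean(x for x, y in samples if not y)
    return (faulty + healthy) / 2


def evaluate(threshold, samples):
    """Interpretation/evaluation: fraction of samples classified correctly."""
    hits = sum((x > threshold) == y for x, y in samples)
    return hits / len(samples)


# Hypothetical raw records; record 5 has a missing measurement.
raw = [
    {"id": 1, "requests": 100, "errors": 2, "faulty": False},
    {"id": 2, "requests": 200, "errors": 6, "faulty": False},
    {"id": 3, "requests": 100, "errors": 30, "faulty": True},
    {"id": 4, "requests": 50, "errors": 20, "faulty": True},
    {"id": 5, "requests": 80, "errors": None, "faulty": False},
]

rows = preprocess(select(raw, ["requests", "errors", "faulty"]))
samples = transform(rows)
threshold = mine(samples)
# Scored on the training samples themselves, so this figure is optimistic.
accuracy = evaluate(threshold, samples)
print(len(rows), accuracy)  # 4 1.0
```

In practice the evaluation would use data held back from the mining step, and a poor score would send the analyst back to an earlier step, which is the iteration the process description refers to.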
