International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 6, June 2014)

Data Mining Techniques and Research Challenges and Issues

Girish Kumar Sorot
Arya Institute of Engineering and Technology, Kukas Industrial Area (RIICO), Delhi Road, Jaipur, Rajasthan (India)

The development of information technology has paved the way for generating large databases and huge volumes of data in various areas. Research in databases and information technology has given rise to approaches for storing and manipulating this precious data for further decision making [1]. Data mining is the process of extracting implicit information and knowledge, not known in advance but potentially useful to various fields, from massive, incomplete, noisy, fuzzy and random data [2].

Topics of interest include, but are not limited to, practical areas that span a variety of aspects of data integration and mining, including large-scale data integration and mining; metadata integration and management; data security and privacy; social media data analysis and computing; Web-scale data mining and semantic discovery; network data integration and delivery; data filtering and cleaning; data integration environments and applications; data models and schemas; database integration systems; and data management and analysis in specific application domains.

Data mining algorithms are widely used today for the analysis of large corporate and scientific datasets stored in databases and data archives. Industry, science, and commerce often need to analyze very large datasets maintained over geographically distributed sites by using the computational power of distributed and parallel systems.

Abstract-- Data mining deals with huge amounts of data kept in databases in order to locate required information and facts. Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. It has also been defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data, and as the exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns. In this paper, we discuss data mining techniques and functionalities together with their applications. We also discuss the research challenges in science and engineering from the data mining perspective, with a focus on data mining issues.

Keyword-- data mining, data mining techniques and functionalities, research challenges, data mining issues.

I. INTRODUCTION

With the enormous amount of data stored in files, databases, and other repositories, it is increasingly important, if not necessary, to develop powerful means for the analysis and perhaps interpretation of such data and for the extraction of interesting knowledge that could help in decision-making. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases.

II. DATA MINING TECHNIQUES AND FUNCTIONALITIES

The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

A. Characterization:

Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules. The data relevant to a user-specified class are normally retrieved by a database query and run through a summarization module to extract the essence of the data at different levels of abstraction. For example, one may want to characterize the OurVideoStore customers who regularly rent more than 30 movies a year.
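The two steps just described, a query that retrieves the target class followed by a summarization step, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the field names, and the age grouping are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical OurVideoStore records; every field name and value is made up for illustration.
CUSTOMERS = [
    {"id": 1, "age": 17, "rentals_per_year": 42, "favourite_genre": "action"},
    {"id": 2, "age": 35, "rentals_per_year": 31, "favourite_genre": "comedy"},
    {"id": 3, "age": 16, "rentals_per_year": 55, "favourite_genre": "action"},
    {"id": 4, "age": 50, "rentals_per_year": 4, "favourite_genre": "drama"},
    {"id": 5, "age": 28, "rentals_per_year": 12, "favourite_genre": "comedy"},
]


def age_group(age):
    """Generalize a raw age into a coarser group (an assumed grouping)."""
    if age < 20:
        return "under 20"
    if age < 40:
        return "20-39"
    return "40 and over"


def characterize(records, min_rentals=30):
    """Summarize the target class: customers renting more than min_rentals movies a year."""
    # Step 1: the "database query" that retrieves the user-specified class.
    target = [r for r in records if r["rentals_per_year"] > min_rentals]
    if not target:
        return {}
    # Step 2: the "summarization module": count each generalized attribute value.
    counts = {
        "age_group": Counter(age_group(r["age"]) for r in target),
        "favourite_genre": Counter(r["favourite_genre"] for r in target),
    }
    total = len(target)
    # Report shares rather than raw counts so classes of different sizes are comparable.
    return {attr: {value: n / total for value, n in c.items()} for attr, c in counts.items()}


if __name__ == "__main__":
    for attribute, shares in characterize(CUSTOMERS).items():
        print(attribute, shares)
```

Each entry of the returned dictionary can be read as a simple characteristic rule, for instance the share of frequent renters that falls into each age group.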
With concept hierarchies on the attributes describing the target class, the attribute-oriented induction method can be used, for example, to carry out data summarization. Note that with a data cube containing a summarization of the data, simple OLAP operations fit the purpose of data characterization.

Figure 1: Data mining (knowledge discovery in databases)

While data mining and knowledge discovery in databases (KDD) are frequently treated as synonyms, data mining is actually part of the knowledge discovery process. Figure 1 shows data mining as a step in an iterative knowledge discovery process.

B. Discrimination:

Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class. For example, one may want to compare the general characteristics of the customers who rented more than 30 movies in the last year with those whose rental count is lower than 5. The techniques used for data discrimination are very similar to the techniques used for data characterization, with the exception that data discrimination results include comparative measures.

C. Association analysis:

Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent item sets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules. Association analysis is commonly used for market basket analysis. For example, it could be useful for the OurVideoStore manager to know what movies are often rented together or if there is a relationship between renting a certain type of movie and buying popcorn or pop. The discovered association rules are of the form P -> Q [s, c], where P and Q are conjunctions of attribute-value pairs, s (for support) is the probability that P and Q appear together in a transaction, and c (for confidence) is the conditional probability that Q appears in a transaction when P is present. For example, the hypothetical association rule

RentType(X, "game") ∧ Age(X, "13-19") -> Buys(X, "pop") [s=2%, c=55%]

would indicate that 2% of the transactions considered are of customers aged between 13 and 19 who are renting a game and buying a pop, and that there is a certainty of 55% that teenage customers who rent a game also buy pop.

1) Association Rule Discovery: Application

1.1 Supermarket shelf management
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So, don't be surprised if you find six-packs stacked next to diapers!

1.2 Inventory Management
Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households.
Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

D. Classification:

Classification analysis is the organization of data in given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects. For example, after starting a credit policy, the OurVideoStore managers could analyze the customers' behaviours vis-à-vis their credit, and label the customers who received credit accordingly with three possible labels: "safe", "risky" and "very risky". The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.

1) Classification: Application

1.1 Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers: type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.

1.2 Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach: Use credit card transactions and the information on the account holder as attributes: when does a customer buy, what does he buy, how often does he pay on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.

E. Prediction:

Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of predictions: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification.
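The link between classification and predicting a class label can be sketched as follows: a model is learned from a labelled training set and then used to label a new object. The sketch assumes a nearest-centroid classifier, which is a choice made here for brevity and is not named in the paper; the two features (number of late payments, fraction of the credit limit used) and all values are hypothetical.

```python
from math import dist  # available in Python 3.8+

# Hypothetical training set: ((late_payments, credit_used_fraction), label).
TRAINING_SET = [
    ((0, 0.1), "safe"),
    ((1, 0.2), "safe"),
    ((3, 0.5), "risky"),
    ((4, 0.6), "risky"),
    ((8, 0.9), "very risky"),
    ((9, 1.0), "very risky"),
]


def train(training_set):
    """Build the model: one centroid (mean feature vector) per class label."""
    sums, counts = {}, {}
    for features, label in training_set:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def predict(model, features):
    """Predict the class label whose centroid is closest to the new object."""
    return min(model, key=lambda label: dist(model[label], features))


if __name__ == "__main__":
    model = train(TRAINING_SET)
    print(predict(model, (7, 0.8)))
```

In practice the features would need to be scaled to comparable ranges before distances are meaningful; that step is omitted here to keep the sketch short.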
Once a classification model is built based on a training set, the class label of an object can be foreseen based on the attribute values of the object and the attribute values of the classes. However, prediction more often refers to the forecast of missing numerical values, or of increase/decrease trends in time-related data. The major idea is to use a large number of past values to estimate probable future values.

F. Clustering:

Similar to classification, clustering is the organization of data in classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is also called unsupervised classification, because the classification is not dictated by given class labels. There are many clustering approaches, all based on the principle of maximizing the similarity between objects in the same class (intra-class similarity) and minimizing the similarity between objects of different classes (inter-class similarity).

1) Clustering: Application

1.1. Market Segmentation
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach: Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing the buying patterns of customers in the same cluster vs. those from different clusters.

1.2. Document Clustering
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

G. Outlier analysis:

Outliers are data elements that cannot be grouped in a given class or cluster. Also known as exceptions or surprises, they are often very important to identify. While outliers can be considered noise and discarded in some applications, they can reveal important knowledge in other domains, and thus can be very significant and their analysis valuable.

H. Evolution and deviation analysis:

Evolution and deviation analysis pertain to the study of time-related data that changes over time. Evolution analysis models evolutionary trends in data, which allows characterizing, comparing, classifying or clustering time-related data. Deviation analysis, on the other hand, considers differences between measured values and expected values, and attempts to find the cause of the deviations from the anticipated values.

It is common that users do not have a clear idea of the kind of patterns they can discover or need to discover from the data at hand. It is therefore important to have a versatile and inclusive data mining system that allows the discovery of different kinds of knowledge at different levels of abstraction. This also makes interactivity an important attribute of a data mining system.

III. ISSUES IN DATA MINING

Data mining algorithms embody techniques that have sometimes existed for many years, but have only lately been applied as reliable and scalable tools that time and again outperform older classical statistical methods. While data mining is still in its infancy, it is becoming a trend and ubiquitous. Before data mining develops into a conventional, mature and trusted discipline, many still-pending issues have to be addressed. Some of these issues are discussed below. Note that these issues are not exclusive and are not ordered in any way.

A. Security and social issues:

Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies are gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential for illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of the discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.

B. User interface issues:

The knowledge discovered by data mining tools is useful as long as it is interesting and, above all, understandable by the user. Good data visualization eases the interpretation of data mining results and helps users better understand their needs. Many exploratory data analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective graphical presentation of data. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined knowledge. The major issues related to user interfaces and visualization are "screen real-estate", information rendering, and interaction. Interactivity with the data and the data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.

C. Mining methodology issues:

These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve the user's needs differently.

Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information.

More than the size of the data, the size of the search space is decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.

D. Performance issues:

Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and the choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later. Incremental updating is important for merging results from parallel mining, or for updating data mining results when new data becomes available without having to re-analyze the complete dataset.

F. Data source issues:

There are many issues related to the data sources; some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We certainly have an excess of data since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data in the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant.

Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool for all sorts of data may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.

IV. MAJOR RESEARCH CHALLENGES

In this section, we will examine several major challenges raised in science and engineering from the data mining perspective, and point out some promising research directions.

A. Information network analysis

With the development of Google and other effective web search engines, information network analysis has become an important research frontier, with broad applications such as social network analysis, web community discovery, terrorist network mining, computer network analysis, and network intrusion detection. However, information network research should go beyond explicitly formed, homogeneous networks (e.g., web page links, computer networks, and terrorist e-connection networks) and delve deeply into implicitly formed, heterogeneous, and multidimensional information networks. Science and engineering provide us with rich opportunities for the exploration of networks in this direction.

There are many massive natural, technical, social, and information networks in science and engineering applications, such as gene, protein, and microarray networks in biology; highway transportation networks in civil engineering; topic- or theme-author-publication-citation networks in library science; and wireless telecommunication networks among commanders, soldiers and supply lines on a battlefield. In such information networks, each node or link in a network contains valuable, multidimensional information, such as textual contents, geographic information, traffic flow, and other properties. Moreover, such networks could be highly dynamic, evolving, and inter-dependent.

Many domains of interest today are best described as a network of interrelated heterogeneous objects. As future work, link mining may focus on the integration of link mining algorithms for a spectrum of knowledge discovery tasks. Furthermore, in many applications, the facts to be analyzed are dynamic and it is important to develop incremental link mining algorithms. Besides mining knowledge from links, objects and networks, we may wish to construct an information network based on both ontological and unstructured information.

B. Discovery, understanding, and usage of patterns and knowledge

Scientific and engineering applications often handle massive data of high dimensionality. The goal of pattern mining is to find item sets, subsequences, or substructures that appear in a data set with a frequency no less than a user-specified threshold. Pattern analysis can be a valuable tool for finding correlations, clusters, classification models, sequential and structural patterns, and outliers.

Frequent pattern mining has been a focused theme in data mining research for over a decade [HCXY07]. Abundant literature has been dedicated to this research, and tremendous progress has been made, ranging from efficient and scalable algorithms for frequent item set mining in transaction databases to numerous research frontiers, such as sequential pattern mining, structural pattern mining, correlation mining, associative classification, and frequent-pattern-based clustering, as well as their broad applications.

The promotion of effective application of pattern analysis methods in scientific and engineering applications is an important task in data mining. Moreover, it is important to further develop efficient methods for mining long, approximate, compressed, and sophisticated patterns for advanced applications, such as mining biological sequences and networks and mining patterns related to scientific and engineering processes. Furthermore, the exploration of mined patterns for classification, clustering, correlation analysis, and pattern understanding will still be interesting topics in research.

C. Stream data mining

Stream data refers to data that flows into and out of the system like streams. Stream data is usually vast in volume, changing dynamically, possibly infinite, and containing multi-dimensional features. Typical examples of such data include audio and video recordings of scientific and engineering processes, computer network information flow, web click streams, and satellite data flow. Such data cannot be handled by traditional database systems, and moreover, most systems may only be able to read a data stream once in sequential order. This poses great challenges for the effective mining of stream data [BBD+02, Agg06].

First, the techniques to summarize the whole or part of the data streams are studied, which is the basis for stream data mining. Such techniques include sampling [DH01], load shedding [TcZ+03] and sketching techniques [Mut03], synopsis data structures [GKMS01], stream cubing [CDH+02], and clustering [AHWY03]. Progress has been made on efficient methods for mining frequent patterns in data streams [MM02], multidimensional analysis of stream data (such as construction of stream cubes) [CDH+02], stream data classification [AHWY04], stream clustering [AHWY03], stream outlier analysis, rare event detection [GFHY07], and so on. The general philosophy is to develop single-scan algorithms to collect information about stream data in tilted time windows, exploring micro-clustering, limited aggregation, and approximation.

The focus of stream pattern analysis is to approximate the frequency counts for infinite stream data. Algorithms have been developed to count frequency using tilted windows [GHPY02], based on the fact that users are more interested in the most recent transactions; to approximate frequency counting based on previous historical data in order to calculate the frequent patterns incrementally [MM02]; and to track the most frequent k items in the continuously arriving data [CM03].

Stream data is often encountered in science and engineering applications. It is important to explore stream data mining in such applications and develop application-specific methods, e.g., real-time anomaly detection in computer network analysis, in electric power grid supervision, in weather modeling, in engineering and security surveillance, and in other stream data applications.

D. Mining moving object data, RFID data, and data from sensor networks

With the popularity of sensor networks, GPS, cellular phones, other mobile devices, and RFID technology, a tremendous amount of moving object data has been collected, calling for effective analysis. This is especially true in many scientific, engineering, business and homeland security applications.

Sensor networks are finding an increasing number of applications in many domains, including battlefields, smart buildings, and even the human body. Most sensor networks consist of a collection of light-weight (possibly mobile) sensors connected via wireless links to each other or to a more powerful gateway node that is in turn connected with an external network through either wired or wireless connections. Sensor nodes usually communicate in a peer-to-peer architecture over an asynchronous network. In many applications, sensors are deployed in hostile and difficult-to-access locations with constraints on weight, power supply, and cost. Moreover, sensors must process a continuous (possibly fast) stream of data. Data mining in wireless sensor networks (WSNs) is a challenging area, as algorithms need to work in the extremely demanding and constrained environment of sensor networks (such as limited energy, storage, computational power, and bandwidth). WSNs also require highly decentralized algorithms.

The development of algorithms that take into consideration the characteristics of sensor networks, such as energy and computation constraints, network dynamics, and faults, constitutes an area of current research. Some work has been done in developing localized, collaborative, distributed and self-configuration mechanisms in sensor networks.

In designing algorithms for sensor networks, it is imperative to keep in mind that power consumption has to be minimized. Because even gathering the distributed sensor data at a single site could be expensive in terms of battery power consumed, some attempts have been made towards making the data collection task energy efficient and balancing the energy-quality trade-offs.

Clustering the nodes of a sensor network is an important optimization problem. Nodes that are clustered together can easily communicate with each other, which can be applied to energy optimization and to developing optimal algorithms for clustering sensor nodes. Other works in this field include the identification of rare events or anomalies, finding frequent item sets, and data preprocessing in sensor networks.

Recent years have witnessed an enormous increase in moving object data from RFID records in supply chain operations, toll and road sensor readings from vehicles on road networks, and even cell phone usage from different geographic regions. These movement data, including RFID data, object trajectories, and anonymous aggregate data such as that generated by many road sensors, contain rich information. Effective management of such data is a major challenge facing society today, with important implications for business optimization, city planning, privacy, and national security. Interesting research has been conducted on warehousing RFID data sets [GHLK06], which could handle moving object data sets by significantly compressing such data and by proposing a new aggregation mechanism that preserves their path structures.

Mining moving objects is a challenging problem due to the massive size of the data and its spatiotemporal characteristics. The methods developed along this line include FlowGraph [GHL06b], which is a probabilistic model that captures the main trends and exceptions in moving object data; FlowCube [GHL06a], which is a multi-dimensional extension of the FlowGraph; and an adaptive fastest path algorithm [GHL+07] that computes routes based on driving patterns present in the data. RFID systems are known to generate noisy data, so data cleaning is an essential task for the correct interpretation and analysis of moving object data, especially when it is collected from RFID applications, and this calls for cost-effective cleaning methods (such as [GHS07]).

One important application with moving objects is the automated identification of suspicious movements. A framework for detecting anomalies [LHKG07] has been proposed that expresses object trajectories using discrete pattern fragments, extracts features to form a hierarchical feature space, and learns effective classification rules at multiple levels of granularity. Another line of work on outlier detection in trajectories focuses on detecting outlying sub-trajectories [LHL08] based on a partition-and-detect framework, which partitions a trajectory into a set of line segments and then detects outlying line segments for trajectory outliers. The problem of clustering trajectory data [LHW07] has also been studied, where common sub-trajectories are discovered using the minimum description length (MDL) principle.

Overall, this is still a young field with many research issues to be explored on mining moving object data, RFID data, and data from sensor networks.
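The partition-and-detect idea for trajectory outliers mentioned above can be illustrated with a deliberately simplified sketch. It is not the algorithm of [LHL08]: here a trajectory is simply split at every recorded point, segments are compared by the Euclidean distance between their midpoints, and the names `radius` and `min_neighbors`, as well as the sample trajectories, are assumptions made only for this illustration.

```python
from math import hypot


def segments(trajectory):
    """Partition step: split a trajectory (list of (x, y) points) into line segments."""
    return list(zip(trajectory, trajectory[1:]))


def midpoint(segment):
    (x1, y1), (x2, y2) = segment
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def outlying_segments(trajectories, radius, min_neighbors):
    """Detect step: flag segments that have fewer than min_neighbors OTHER
    trajectories passing within radius (measured between segment midpoints).

    Returns {trajectory index: [indices of its outlying segments]}.
    """
    parts = [segments(t) for t in trajectories]
    flagged = {}
    for i, own in enumerate(parts):
        outliers = []
        for j, seg in enumerate(own):
            mx, my = midpoint(seg)
            close_trajectories = 0
            for k, other in enumerate(parts):
                if k == i:
                    continue
                # One close segment is enough to count trajectory k as a neighbor.
                if any(hypot(mx - ox, my - oy) <= radius
                       for ox, oy in map(midpoint, other)):
                    close_trajectories += 1
            if close_trajectories < min_neighbors:
                outliers.append(j)
        if outliers:
            flagged[i] = outliers
    return flagged


# Hypothetical data: two similar trajectories and one that veers away near its end.
TRAJECTORIES = [
    [(0, 0), (1, 0), (2, 0), (3, 0)],
    [(0, 1), (1, 1), (2, 1), (3, 1)],
    [(0, 0.5), (1, 0.5), (2, 0.5), (3, 10), (4, 10)],
]

if __name__ == "__main__":
    print(outlying_segments(TRAJECTORIES, radius=1.5, min_neighbors=1))
```

Working at the segment level means that only the part of the third trajectory that leaves the common route is flagged, rather than the trajectory as a whole.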
Open issues include how to exploit correlation and regularity to clean noisy sensor network and RFID data, how to integrate such data and construct data warehouses for it, how to perform scalable mining on petabyte-scale RFID data, how to find strange moving objects, and how to classify multidimensional trajectory data. Because moving object data carry time, location, moving direction, speed, and other multidimensional semantics, multidimensional data mining will likely play an essential role in this study.

F. Spatial, temporal, spatiotemporal, and multimedia data mining

Scientific and engineering data is usually related to space and time and often comes in multimedia modes (e.g., containing color, image, audio, and video). With the popularity of digital photos, audio DVDs, videos, YouTube, web-based map services, weather services, satellite images, digital earth, and many other forms of multimedia, spatial, and spatiotemporal data, mining spatial, temporal, spatiotemporal, and multimedia data will become increasingly popular, with far-reaching implications [MH01, SC03]. For example, mining satellite images may help detect forest fires, find unusual phenomena on earth, predict hurricane landing sites, discover weather patterns, and outline global warming trends.

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from large spatial data sets [SZHV04]. Extracting interesting and useful patterns from spatial data sets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. Interesting research topics in this field include predicting events at particular geographic locations, detecting spatial outliers whose non-spatial attributes are extreme relative to those of their neighbors, finding co-location patterns whose instances are often located in close geographic proximity, and grouping a set of spatial objects into clusters. Future research is needed to compare the differences and similarities between classical data mining and spatial data mining techniques, to model semantically rich spatial properties other than neighborhood relationships, to design effective statistical methods for interpreting the mined spatial patterns, to investigate proper measures for location prediction that improve spatial accuracy, and to facilitate visualization of spatial relationships by representing both spatial and non-spatial features.

The problems of incorporating domain knowledge into mining when data is scarce and of integrating data collection with mining are worth studying in spatial data mining. Both theoretical analyses aimed at general studies of spatial phenomena and empirical model designs targeted at specific applications represent the trends for future research.

Research in this domain needs the confluence of multiple disciplines, including image processing, pattern recognition, geographic information systems, parallel processing, and statistical data analysis. Automatic categorization of images and videos, classification of spatiotemporal data, finding frequent/sequential patterns and outliers, spatial co-location analysis, and many other tasks have been widely studied. As such data accumulate, the development of scalable analysis methods and new data mining functions will be an important research frontier for years to come.

G. Mining text, Web, and other unstructured data

The Web is the common place for scientists and engineers to publish their data, share their observations and experiences, and exchange their ideas. There is a tremendous amount of scientific and engineering data on the Web. For example, in biology and bioinformatics research, there are GenBank, ProteinBank, GO, PubMed, and many other biological or biomedical information repositories available on the Web. The Web has therefore become the ultimate information access and processing platform, housing not only billions of link-accessed "pages" on the surface Web, containing textual data, multimedia data, and linkages, but also query-accessible "databases" on the deep Web. With the advent of Web 2.0, an increasing amount of dynamic "workflow" is emerging. As it penetrates deeply into our daily life and evolves into unlimited dynamic applications, the Web is central to our information infrastructure. Its virtually unlimited scope and scale offer immense opportunities for data mining.

H. Data cube-oriented multidimensional online analytical mining

Scientific and engineering datasets are usually high-dimensional in nature. Viewing and mining data in multidimensional space will substantially increase the power and flexibility of data analysis. Data cube computation and OLAP (online analytical processing) technologies developed for data warehouses have substantially increased the power of multidimensional analysis of large datasets.

Some researchers have begun to investigate how to conduct traditional data mining and statistical analysis efficiently in a multidimensional manner. For example, the regression cube [CDH+06] is designed to support efficient computation of statistical models. In this framework, each cell can be compressed into an auxiliary matrix whose size is independent of the number of tuples, and the statistical measures for any data cell can then be computed from the compressed data of the lower-level cells without accessing the raw data. In a prediction cube [CCLR05], each cell contains a value that summarizes a predictive model trained on the data corresponding to that cell and characterizes its decision behavior or predictiveness. The authors further show that such cubes can be computed efficiently by exploiting the idea of model decomposition. In [LH07], the issues of anomaly detection in multi-dimensional time-series data are examined. A time-series data cube is proposed to capture the multi-dimensional space formed by the attribute structure and to facilitate the detection of anomalies based on expected values derived from higher-level, more general time series. Moreover, an efficient search algorithm is proposed to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. A recent study on sampling cubes [LHY+08] discusses the desirability of OLAP over sample data, which may not represent the full data in the population. The proposed sampling cube framework can efficiently calculate confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase the sample size when needed. Further, to handle high-dimensional data, a Sampling Cube Shell method is proposed to reduce the storage requirement effectively while still preserving query result quality.

Such multi-dimensional, and especially high-dimensional, analysis tools will ensure that data can be analyzed in hierarchical, multidimensional structures efficiently and flexibly at the user's fingertips. This leads to the integration of online analytical processing with data mining, i.e., OLAP mining. Some efforts have been devoted to this direction, but grand challenges still exist when one needs to explore the large space of choices to find interesting patterns and trends [RC07]. We believe that OLAP mining will substantially enhance the power and flexibility of data analysis and lead to the construction of easy-to-use tools for the analysis of massive data with hierarchical structures in multidimensional space. It is a promising research field for developing effective tools and scalable methods for exploratory scientific and engineering data mining.

I. Visual data mining

A picture is worth a thousand words. There have been numerous data visualization tools for visualizing various kinds of massive and multidimensional data sets [Tuf01]. Besides the popular bar charts, pie charts, curves, histograms, quantile plots, quantile-quantile plots, boxplots, and scatter plots, there are also many visualization tools using geometric (e.g., dimension stacking, parallel coordinates), hierarchical (e.g., treemap), and icon-based (e.g., Chernoff faces and stick figures) techniques. Moreover, there are methods for visualizing sequences, time-series data, phylogenetic trees, graphs, networks, and the Web, as well as various kinds of patterns and knowledge (e.g., decision trees, association rules, clusters and outliers) [FGW01]. There are also visual data mining tools that may facilitate interactive mining based on the user's judgement of intermediate data mining results [AEEK99]. Recently, we have developed a DataScope system that maps relational data into 2-D maps so that multidimensional relational data can be browsed in the manner of Google Maps [WLX+07].

We believe that visual data mining is appealing to scientists and engineers because they often have a good understanding of their data, can use their knowledge to interpret their data and patterns with the help of visualization tools, and can interact with the system for deeper and more effective mining. Tools should be developed for mapping data and knowledge into appealing and easy-to-understand visual forms, and for interactive browsing, drilling, scrolling, and zooming of data and patterns to facilitate user exploration. Finally, for visualization of large amounts of data, parallel processing and high-performance visualization tools should be investigated to ensure high performance and fast response.

J. Domain-specific data mining: data mining by integration of sophisticated scientific and engineering domain knowledge

Besides general data mining methods and tools for science and engineering, each scientific or engineering discipline has its own data sets and special mining requirements, some of which could be rather different from the general ones. Therefore, in-depth investigation of each problem domain and the development of dedicated analysis tools are essential to the success of data mining in this domain. Here we examine two problem domains: biology and software engineering.

1) Biological data mining

The fast progress of biomedical and bioinformatics research has led to the accumulation and publication (on the Web) of vast amounts of biological and bioinformatics data. However, the analysis of such data poses far greater challenges than traditional data analysis methods can handle [BHLY04]. For example, genes and proteins are gigantic in size (e.g., a DNA sequence could contain billions of base pairs) and very sophisticated in function, and the patterns of their interactions are largely unknown. Thus it is a fertile field in which to develop sophisticated data mining methods for in-depth bioinformatics research.

We believe substantial research is badly needed to produce powerful mining tools in many biological and bioinformatics subfields, including comparative genomics, evolution and phylogeny, biological data cleaning and integration, biological sequence analysis, biological network analysis, biological image analysis, biological literature analysis (e.g., PubMed), and systems biology. From this point of view, data mining is still very young with respect to biology and bioinformatics applications. Substantial research should be conducted to cover the vast spectrum of data analysis tasks.

2) Data mining for software engineering

Software program executions potentially generate huge amounts of data (e.g., when program execution traces are turned on). However, such data sets are rather different from datasets generated by nature or collected from video cameras, since they represent the executions of program logic coded by human programmers. It is important to mine such data to monitor program execution status, improve system performance, isolate software bugs, detect software plagiarism, analyze programming system faults, and recognize system malfunctions.

Data mining for software engineering can be partitioned into static analysis and dynamic/stream analysis, based on whether the system can collect traces beforehand for post-analysis or must react in real time to handle online data. Different methods have been developed in this domain by integrating and extending methods developed in machine learning, data mining, pattern recognition, and statistics. For example, a statistical analysis approach (such as hypothesis testing) [LFY+06] can be applied to program execution traces to isolate the locations of the bugs that distinguish successful program runs from failing runs. Despite its limited success, this is still a rich domain for data miners to research and to develop further sophisticated, scalable, and real-time data mining methods.

V. CONCLUSION

In this paper, we have examined a few important research challenges and issues in science and engineering data mining. We have also examined data mining techniques for data security and privacy (such as fraud detection and direct marketing), social media data analysis and computing, Web-scale data mining and semantic discovery, and large-scale data integration and mining.

REFERENCES

[1] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan, "DEMON: Mining and Monitoring Evolving Data," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, January/February 2001.
[2] Philip K. Chan, Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, "Distributed Data Mining in Credit Card Fraud Detection," November/December 1999, 1094-7167/99, © 1999 IEEE.
[3] Rachna Somkunwar, "A Study on Various Data Mining Approaches of Association Rules," IJARCSSE, Volume 2, Issue 9, September 2012, ISSN 2277-128X.
[4] Hongjun Lu, Rudy Setiono, and Huan Liu, "Effective Data Mining Using Neural Networks," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, December 1996.
[5] Daniel A. Keim, "Information Visualization and Visual Data Mining," IEEE Transactions on Visualization and Computer Graphics, vol. 7, no. 1, January-March 2002.
[6] Mario Cannataro, Antonio Congiusta, Andrea Pugliese, Domenico Talia, and Paolo Trunfio, "Distributed Data Mining on Grids: Services, Tools, and Applications," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 34, no. 6, December 2004.
[7] Michael Goebel and Le Gruenwald, "A Survey of Data Mining and Knowledge Discovery Software Tools," SIGKDD Explorations, Volume 1, Issue 1, June 1999, page 21.
[8] S. Hameetha Begum, "Data Mining Tools and Trends - An Overview," International Journal of Emerging Research in Management & Technology, ISSN 2278-9359.
[9] Tipawan Silwattananusarn and Kulthida Tuamsuk, "Data Mining and Its Applications for Knowledge Management: A Literature Review from 2007 to 2012," International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 2, No. 5, September 2012.

BIBLIOGRAPHY

Girish Kumar Sorot received his B.Tech. degree in computer science & engineering from Rajasthan Technical University, Kota, and is currently pursuing an M.Tech. degree in computer science & engineering at Rajasthan Technical University, Kota (Rajasthan). His current research interests are in the areas of network security, cloud computing, and real-time systems.
