International Journal of Private Cloud Computing Environment and Management Vol. 1, No. 1 (2014), pp. 1-10
http://dx.doi.org/10.21742/ijpccem.2014.1.1.01

Mining Big Data: Current Status, and Forecast to the Future for Telecom Data

A. Venkata Krishna Prasad1 and CH. M. H. Saibaba2
1 Department of Computer Science and Engineering, K L University, Vijayawada, [email protected]
2 KL University, Vijayawada, AP, India, [email protected]

Abstract

Big Data is a term used for datasets that, because of their size and complexity, cannot be managed or analysed with current methodologies and data mining software tools. Big Data mining is the process of extracting useful information from these huge datasets or continuous streams of data, something that was not feasible earlier because of the volume, variability and velocity of the data. The Big Data challenge is becoming one of the most exciting opportunities for the coming years. This paper gives a broad overview of the topic, its current status and controversies, and a forecast of its future, with a focus on telecom data. Telecommunication companies generate a tremendous amount of data, including call detail records, network data that describes the state of the hardware and software components in the network, and customer data that describes the telecommunication customers. All of this data needs to be captured and stored in a data warehouse or in the cloud with systematic extraction procedures, so that it can answer ad hoc queries and support visualization and mining. Applying Big Data mining to warehoused or cloud-hosted telecom data can help identify telecommunication fraud, improve marketing effectiveness, identify network faults, and forecast future behaviour.

1. Introduction

Consider the current scale of Internet data. According to Google, the number of Web pages it indexed was around one million in 1998, grew quickly to one billion in 2000, and had exceeded one trillion by 2008. This rapid growth of the Internet, together with the wide acceptance of social media applications such as Facebook, Twitter and Weibo that let users create content freely, amplifies the already huge Web volume. For today's businesses there are a few things to keep in mind when beginning Big Data and analytics innovation projects.

How much can companies in the telecommunications industry benefit from Big Data? That is a critical question. With stagnant growth in the telecom industry, every operator is searching for new ways to increase revenues and profits in a short time, but few telecom companies have yet demonstrated the capabilities needed to exploit the new technology. Operators seeking to make initial inroads with Big Data are therefore advised to avoid the usual top-down approach, in which a business problem is defined first and the data to analyse it is sought afterwards. This is particularly true in the telecom sector: most operators already run analytics programs that use internal data to boost the efficiency of their networks, segment their customers and drive profitability, with some success. But the strength of Big Data poses a different challenge: how to combine much larger amounts of information to increase revenues and profits across the entire telecom value chain, from network operations to product development to marketing, sales and customer service, and even to monetize the data itself.
2. Big Data

A traditional Big Data project starts by defining a business problem to be solved and then identifies what data might help to solve it and how. Projects run this way tend to support the existing business and frequently achieve only incremental benefits year after year. A bottom-up approach instead starts from the internal and external data that is already available and lets out-of-the-box opportunities emerge. Big Data pilots demand speed of data access, agility, and constant iteration.

Doug Laney [19] was the first to talk about the three V's of Big Data management:
• Volume: there is more data than ever before and its size keeps increasing, but not the fraction of it that our tools can handle.
• Variety: data comes in many different types, such as text, sensor data, audio, video, graphs, and more.
• Velocity: data arrives continuously as streams, and we want to extract useful information from it in real time.

Nowadays two more V's are added:
• Variability: the structure of the data, and the way users want to interpret it, is not consistent.
• Value: the business value that gives an organization a compelling advantage, through the ability to make decisions based on answering questions that were previously considered out of reach.

Gartner [15] summarized this in its 2012 definition of Big Data: high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

There are many applications of Big Data, for example:
• Business: customer personalization, churn detection.
• Technology: reducing processing time from hours to seconds.
• Health: mining the DNA of each person to discover, monitor and improve health.
• Smart cities: cities focused on sustainable economic development and a high quality of life, with wise management of natural resources.

3. Big Data for Development

To illustrate the usage of Big Data mining, we describe the work of Global Pulse [33], which uses Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovation lab working on mining Big Data for developing countries. It pursues a strategy that consists of 1) researching innovative methods and techniques for analysing real-time digital data to detect early emerging vulnerabilities; 2) assembling a free and open-source technology toolkit for analysing real-time data and sharing hypotheses; and 3) establishing an integrated, global network of labs to pilot the approach at country level.
Global Pulse describes the main opportunities of Big Data for development in "Big Data for Development: Opportunities & Challenges" [22]:
• Early warning: detecting anomalies in the usage of real-time digital media to enable a faster response in times of crisis.
• Real-time awareness: designing programs and policies with a more fine-grained, real-time representation of reality.
• Real-time feedback: checking frequently which policies and programs are failing and monitoring them in real time; feedback reports make it possible to apply the needed changes.

The Big Data mining revolution is not restricted to the industrialized world, as mobile phones are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% of them are located in developing countries.

4. Contributed Articles

We selected four contributions that together define a significant part of current research in Big Data mining and give a broad overview of the field and its likely future. Other significant work in Big Data mining can be found in the main conferences, such as KDD, ICDM and ECML-PKDD, or in journals such as "Data Mining and Knowledge Discovery" and "Machine Learning".

Scaling Big Data Mining Infrastructure: This paper presents Big Data mining infrastructures and the experience of doing analytics on real-time data in industry. It argues that, with current data mining tools, performing analytics is not easy: most of the time is consumed in preparing the data, applying data mining methods, and turning preliminary models into robust solutions.

Mining Heterogeneous Information Networks: This paper shows that mining heterogeneous information networks is a new and challenging research frontier in Big Data mining. It treats interconnected, multi-typed data, including typical relational database data, as heterogeneous information networks. These semi-structured network models leverage the links between the different types of nodes and can uncover surprisingly rich knowledge from interconnected data.

Big Graph Mining: This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool [18] and showing some findings on the Web graph and the Twitter social network. The paper also gives inspirational directions for future research in big graph mining.

Mining Large Streams of User Data for Personalized Recommendations: This paper presents some lessons learned from the Netflix Prize and discusses the recommender and personalization techniques used at Netflix. It reviews important recent problems and future research directions, and contains an interesting discussion of whether we need more data or better models to improve learning.

5. The Promise of Big Data for Telecom

Big Data promises to promote growth and to increase efficiency and profitability across the entire telecom value chain, well beyond the opportunities available through traditional data warehousing technologies. The benefits include:
• Optimizing routing between networks and maintaining quality of service by analysing network traffic in real time.
• Analysing call detail records in real time to determine call durations and tariffs and to identify fraudulent activity on the network (see the sketch below).
• Allowing call center agents to offer customized plans to subscribers immediately and to explain the benefits of a tariff, leading to higher retention and better profitability.
• Tailoring marketing campaigns for services to individual customers using location-based and social networking data in real time.
• Using CRM data to understand not only business trends but also customer behaviour patterns, which enables marketers to design new products and services.
• Opening up new sources of revenue, such as selling insights about customers to third parties; feedback reports help the organization make the changes needed to grow revenues.
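To make the real-time call-record screening above concrete, here is a minimal illustrative Python sketch (not taken from the paper): it keeps running per-subscriber statistics of call duration and flags calls that deviate strongly from a subscriber's own history. The field names (caller, duration_sec), the threshold, and the synthetic data are all hypothetical assumptions; in practice such logic would run inside a stream-processing platform over real CDRs.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt

@dataclass
class RunningStats:
    """Welford-style running mean/variance for one subscriber."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        return sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def screen_cdrs(cdr_stream, z_threshold=4.0, min_history=30):
    """Yield call records whose duration is far from the caller's own history."""
    history = defaultdict(RunningStats)
    for cdr in cdr_stream:                      # cdr: {"caller": ..., "duration_sec": ...}
        stats = history[cdr["caller"]]
        if stats.n >= min_history and stats.std() > 0:
            z = abs(cdr["duration_sec"] - stats.mean) / stats.std()
            if z > z_threshold:
                yield cdr                       # candidate for fraud review
        stats.update(cdr["duration_sec"])

if __name__ == "__main__":
    import random
    random.seed(1)
    # Synthetic stream: 200 ordinary calls, then one unusually long call.
    calls = [{"caller": "A", "duration_sec": random.gauss(120, 20)} for _ in range(200)]
    calls.append({"caller": "A", "duration_sec": 3600})
    for suspicious in screen_cdrs(calls):
        print("flagged:", suspicious)
```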
6. Tools: Open Source Revolution

Big Data owes much to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter and LinkedIn both benefit from and contribute to open source projects. Big Data infrastructure centres on Hadoop and related software:

1) Apache Hadoop [3]: Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data (multi-terabyte datasets) in parallel on large clusters. A MapReduce job splits the input dataset into independent chunks that are processed by map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then fed as input to the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. A minimal sketch of this map, shuffle and reduce pattern is given below.

2) Apache Hadoop is an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks of data (64 MB or 128 MB by default) and distributes the blocks among the nodes of the cluster. To process the data, Hadoop MapReduce ships code (specifically JAR files) to the nodes that hold the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality [3], in contrast to conventional HPC architectures, which usually rely on a parallel file system where compute and data are separated but connected with high-speed networking.
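To illustrate the map, shuffle and reduce flow described in item 1, the following sketch simulates the pattern in a single Python process for a word count. It is not the Hadoop Java API; it only mirrors the split-map-group-reduce logic that Hadoop distributes across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit (word, 1) for every word in one input record (a line of text)."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(mapped_pairs):
    """Shuffle/sort: group all intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into the final output."""
    return key, sum(values)

def mapreduce(records):
    mapped = chain.from_iterable(map_phase(r) for r in records)   # map over input splits
    grouped = shuffle(mapped)                                      # group by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())   # reduce per key

if __name__ == "__main__":
    lines = ["big data mining", "mining big graphs", "big data"]
    print(mapreduce(lines))   # {'big': 3, 'data': 2, 'mining': 2, 'graphs': 1}
```

In Hadoop itself, the map and reduce functions become Mapper and Reducer classes, and the shuffle step is performed by the framework across the cluster network.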
3) Apache Mahout [4]: Mahout is scalable to large datasets. Its core algorithms for clustering, classification and collaborative filtering are implemented on top of scalable, distributed systems, although contributions that run on a single machine are welcome as well. Currently Mahout supports mainly three use cases: recommendation mining, which takes users' behaviour and tries to find items those users might like (a small illustrative sketch follows at the end of this section); clustering, which takes, for example, text documents and groups them into sets of topically related documents; and classification, which learns from existing categorized documents what documents of a specific category look like and assigns unlabelled documents to the (hopefully) correct category.

4) R [29]: R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contribution of packages. There are some important differences from S, but much code written for S runs unaltered. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.

5) MOA [5]: MOA is the most popular open source framework for data stream mining, with a very active and growing community. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection, and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems; a toy drift-monitoring sketch is given at the end of this section.

6) The Big Graph Mining (BGM) workshop brings together researchers and practitioners to address various aspects of graph mining in this new era of big data, such as new graph mining platforms, theories that drive new graph mining techniques, scalable algorithms and visual analytics tools that spot patterns and anomalies, applications that touch our daily lives, and more. Together, participants explore and discuss how these important facets of graph mining are advancing in this age of big graphs.
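As a toy illustration of the recommendation-mining use case mentioned for Mahout, the sketch below performs user-based collaborative filtering over a tiny, made-up ratings dictionary. It is plain Python, not Mahout's API, and only shows the underlying idea: score unseen items by the behaviour of similar users.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data; a real system would read this at scale.
ratings = {
    "u1": {"planA": 5, "planB": 3, "roaming": 4},
    "u2": {"planA": 4, "planB": 2, "data_pack": 5},
    "u3": {"planB": 5, "roaming": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def recommend(user, k=2):
    """Score unseen items by similarity-weighted ratings of the k most similar users."""
    sims = sorted(((cosine(ratings[user], ratings[u]), u)
                   for u in ratings if u != user), reverse=True)[:k]
    scores = {}
    for sim, other in sims:
        for item, r in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(recommend("u1"))   # suggests "data_pack", rated highly by the most similar user
```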
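For the data-stream setting that MOA targets, one simple way to notice concept drift is to compare a model's recent error rate against its long-run error rate. The sketch below does exactly that over synthetic data; it is a simplified illustration, not MOA's drift detectors such as DDM or ADWIN.

```python
import random
from collections import deque

class WindowedDriftMonitor:
    """Flag drift when the recent error rate exceeds the long-run rate by a margin."""

    def __init__(self, window=100, margin=0.15):
        self.window = deque(maxlen=window)    # most recent prediction outcomes
        self.errors_total = 0
        self.seen_total = 0
        self.margin = margin

    def add(self, was_error: bool) -> bool:
        self.window.append(1 if was_error else 0)
        self.errors_total += int(was_error)
        self.seen_total += 1
        if len(self.window) < self.window.maxlen:
            return False                       # not enough recent history yet
        recent = sum(self.window) / len(self.window)
        overall = self.errors_total / self.seen_total
        return recent > overall + self.margin  # error rate jumped: likely drift

# Synthetic stream: the error rate rises from 10% to 50% halfway through.
random.seed(0)
monitor = WindowedDriftMonitor()
for t in range(2000):
    p_error = 0.1 if t < 1000 else 0.5
    if monitor.add(random.random() < p_error):
        print("possible concept drift around example", t)
        break
```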
7. Forecast

There are many important challenges in Big Data management and analytics that arise from the nature of the data: it is large, diverse, and continuously evolving [27; 16]. These lead to big challenges for researchers and practitioners in the coming years:

• Architecture: It is not clear how to design a system that has to deal with historic data and with real-time data at the same time. The Lambda architecture of Nathan Marz [25] addresses the problem of computing arbitrary functions on arbitrary data in real time by decomposing the system into three layers: the batch layer, the serving layer, and the speed layer, for example using Hadoop for the batch layer and Storm for the speed layer. The desired properties of such a system are that it is robust, scalable, general, extensible, fault tolerant and debuggable, allows ad hoc queries, and needs minimal maintenance.

• Statistical significance: It is important that results are statistically sound. As Efron explains in his book on large-scale inference [10], it is easy to go wrong with large datasets and thousands of queries.

• Distributed mining: Many data mining techniques are not trivial to distribute. Distributed versions of some methods still require a lot of research, with both practical and theoretical analysis, before they can become new standards for real-time data.

• Evolving data: Data may be evolving over time, so Big Data mining techniques should be able to adapt and to detect change. The data stream mining field, for example, has very powerful techniques for this task [13].

• Storage: With Big Data the quantity of data is very large, and storage space is a main concern. There are two main approaches: compression, where we compress the data before storing it and lose nothing, at the possible cost of extra processing time; and sampling, where we lose some information but may gain space by orders of magnitude (a sampling sketch is given after this list). For example, Feldman et al. [12] use coresets to reduce the complexity of Big Data problems: coresets are small sets that provably approximate the original data for a given problem, and with a merge-reduce strategy the small sets can be combined so that hard machine learning problems can be solved in parallel.

• Visualization: A main task of Big Data analysis is how to visualize the results. Because the data is so large, it is very difficult to present it in a user-friendly way; new techniques and frameworks are needed, such as the photographs, infographics and essays in the book "The Human Face of Big Data" [30].

• Hidden Big Data: Large quantities of useful data are being lost because new data is largely untagged, file-based and unstructured. The 2012 IDC study on Big Data [14] estimates that in 2012, 23% of the digital universe (643 exabytes) would be useful for Big Data if it were tagged and analysed; however, currently only about 3% of the potentially useful data is tagged, and even less is analysed.
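As a concrete instance of the sampling option under "Storage", the sketch below implements classic reservoir sampling (Algorithm R): it keeps a fixed-size uniform random sample of a stream whose length is unknown in advance, so storage stays bounded. This is only an illustration of bounded-memory sampling, not the coreset construction of Feldman et al. [12].

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)             # fill the reservoir first
        else:
            j = rng.randint(0, i)              # keep the new item with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: keep a 5-element sample of a million-record "stream".
print(reservoir_sample(range(1_000_000), k=5, seed=42))
```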
8. Conclusions

Big Data gives telecom operators a real-time opportunity to gain a complete picture of their operations and their customers, and to support their innovation process. The industry as a whole spends comparatively little on R&D; it needs R&D on real-time data, which can raise sales, improve revenue performance, and change the way operators work so that they succeed in today's world. Big Data demands of every industry a very different and unconventional approach to business development. The operators that incorporate new agile strategies into their organizational DNA fastest will gain a real competitive advantage over their slower rivals.

References

[1] SAMOA, http://samoa-project.net, 2013.
[2] C. C. Aggarwal, "Managing and Mining Sensor Data", Advances in Database Systems, Springer, 2013.
[3] Apache Hadoop, http://hadoop.apache.org.
[4] Apache Mahout, http://mahout.apache.org.
[5] A. Bifet, G. Holmes, R. Kirkby and B. Pfahringer, "MOA: Massive Online Analysis", Journal of Machine Learning Research (JMLR), 2010, http://moa.cms.waikato.ac.nz/.
[6] C. Bockermann and H. Blom, "The Streams Framework", Technical Report 5, TU Dortmund University, 2012.
[7] D. Boyd and K. Crawford, "Critical Questions for Big Data", Information, Communication and Society, vol. 15, no. 5, pp. 662-679, 2012.
[8] F. Diebold, "'Big Data' Dynamic Factor Models for Macroeconomic Measurement and Forecasting", discussion read to the Eighth World Congress of the Econometric Society, 2000.
[9] F. Diebold, "On the Origin(s) and Development of the Term 'Big Data'", PIER Working Paper Archive, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, 2012.
[10] B. Efron, "Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction", Institute of Mathematical Statistics Monographs, Cambridge University Press, 2010.
[11] U. Fayyad, "Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling", http://big-data-mining.org/keynotes/#fayyad, 2012.
[12] D. Feldman, M. Schmidt and C. Sohler, "Turning Big Data into Tiny Data: Constant-size Coresets for k-means, PCA and Projective Clustering", in SODA, 2013.
[13] J. Gama, "Knowledge Discovery from Data Streams", Chapman & Hall/CRC Data Mining and Knowledge Discovery, Taylor & Francis Group, 2010.
[14] J. Gantz and D. Reinsel, "IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East", December 2012.
[15] Gartner, http://www.gartner.com/it-glossary/bigdata.
[16] V. Gopalkrishnan, D. Steier, H. Lewis and J. Guszcza, "Big Data, Big Business: Bridging the Gap", in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 7-11, New York, NY, USA, ACM, 2012.
[17] Intel, "Big Thinkers on Big Data", http://www.intel.com/content/www/us/en/bigdata/big-thinkerson-big-data.html, 2012.
[18] U. Kang, D. H. Chau and C. Faloutsos, "PEGASUS: Mining Billion-Scale Graphs in the Cloud", 2012.
[19] D. Laney, "3-D Data Management: Controlling Data Volume, Velocity and Variety", META Group Research Note, February 6, 2001.
[20] J. Langford, "Vowpal Wabbit", http://hunch.net/~vw/, 2011.
[21] D. J. Leinweber, "Stupid Data Miner Tricks: Overfitting the S&P 500", The Journal of Investing, vol. 16, pp. 15-22, 2007.
[22] E. Letouzé, "Big Data for Development: Opportunities & Challenges", May 2011.
[23] J. Lin, "MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!", CoRR, abs/1209.2191, 2012.
[24] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin and J. M. Hellerstein, "GraphLab: A New Parallel Framework for Machine Learning", in Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[25] N. Marz and J. Warren, "Big Data: Principles and Best Practices of Scalable Realtime Data Systems", Manning Publications, 2013.
[26] L. Neumeyer, B. Robbins, A. Nair and A. Kesari, "S4: Distributed Stream Computing Platform", in ICDM Workshops, pp. 170-177, 2010.
[27] C. Parker, "Unexpected Challenges in Large Scale Machine Learning", in Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications (BigMine '12), pp. 1-6, New York, NY, USA, ACM, 2012.
[28] A. Pentland, "Reinventing Society in the Wake of Big Data", Edge.org, http://www.edge.org/conversation/reinventing-societyin-the-wake-of-big-data, 2012.
[29] R Core Team, "R: A Language and Environment for Statistical Computing", R Foundation for Statistical Computing, Vienna, Austria, 2012, ISBN 3-900051-07-0.
[30] R. Smolan and J. Erwitt, "The Human Face of Big Data", Sterling Publishing Company, 2012.
[31] Storm, http://storm-project.net.
[32] N. Taleb, "Antifragile: How to Live in a World We Don't Understand", Penguin Books, 2012.
[33] UN Global Pulse, http://www.unglobalpulse.org.
[34] S. M. Weiss and N. Indurkhya, "Predictive Data Mining: A Practical Guide", Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998.
[35] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch and G. Lapis, "IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data", McGraw-Hill, 2011.