International Journal of Private Cloud Computing Environment and Management
Vol. 1, No. 1 (2014), pp.1-10
http://dx.doi.org/10.21742/ijpccem.2014.1.1.01
Mining Big Data: Current Status, and Forecast to the Future for
Telecom Data
A. Venkata Krishna Prasad1 and CH.M.H.Saibaba2
1Department of Computer Science and Engineering, K.L. University, Vijayawada
[email protected]
2KL University, Vijayawada, AP, India
[email protected]
Abstract
Big Data is a new term used to identify datasets that, due to their large size and complexity, we cannot manage and organize with our current methodologies or data mining software tools. Big Data mining is the process of extracting useful information from these huge datasets or continuous flows of data, which was not possible earlier because of the data's huge volume, variability, and velocity. The Big Data challenge is becoming one of the most exciting opportunities for the future. We present a broad overview of the topic, its current status, its controversies, and a forecast of its future. Telecommunication companies generate a tremendous amount of data. These data include call detail data, which describes the calls that traverse the network; network data, which describes the state of the hardware and software components in the network; and customer data, which describes the telecommunication customers. All of this data needs to be captured and stored in a data warehouse or in the cloud with systematic extraction procedures, so that it can answer ad hoc queries and support visualization and mining. Applying Big Data mining to warehouse or cloud telecom data can help identify telecommunication fraud, improve marketing effectiveness, identify network faults, assess current status, and forecast the future of telecom data.
1. Introduction
Let us look at the current scale of Internet data. According to Google, the number of pages it indexed was around one million in 1998, quickly grew to about one billion in 2000, and exceeded one trillion unique URLs in 2008. Alongside this rapid growth of the Web, the widespread adoption of social media applications such as Facebook, Twitter, and Weibo, which allow users to create content freely, amplifies the already huge Web volume. For today's businesses, there are a few things to keep in mind when beginning big data and analytics innovation projects. How much can companies in the telecommunications industry benefit from "big data"? That is a critical question. Every telecom operator is searching for new methods to increase revenues and profits quickly despite stagnant growth in the telecom industry, but few telecom companies have demonstrated the capabilities needed to use the new technology. That is why telecom operators seeking to make initial inroads with big data are advised to avoid the usual top-down approach, which identifies a business problem to be solved and then seeks out the data for analyzing it. This is particularly true in the telecom sector. Most telecom operators already conduct analytics programs that enable them to use their internal data to boost the efficiency of their networks, segment their customers, and drive profitability, with some success.
ISSN: 2205-8478 IJPCCEM
Copyright ⓒ 2014 GV School Publication
But the strength of Big Data poses a different challenge: how to combine such large amounts of information to increase revenues and profits across the entire telecom value chain, from network operations to product development to marketing, sales, and customer service, and even to monetize the data itself.
2. Big Data
A top-down big data project starts by defining a business problem to be solved and then tries to identify what data might help solve it and how. Such projects are used to run the traditional business, frequently achieving only incremental benefits year after year. The bottom-up approach instead starts from the existing internal and external data and allows out-of-the-box opportunities to emerge. Big data pilots demand speed of data access, agility with the data, and constant iteration. Doug Laney [19] was the first to talk about the 3 V's of Big Data management:
• Volume: there is more data than ever before, and its size continues to increase, but not the percentage of data that our tools can process.
• Variety: there are many different types of data, such as text, sensor data, audio, video, graphs, and more.
• Velocity: data arrives continuously as streams, and we are interested in obtaining useful information from it in real time.
Nowadays, there are two more V's:
• Variability: there is no consistency in the structure of the data or in how users want to interpret that data.
• Value: the business value that gives an organization a compelling advantage, through the ability to make decisions based on answering questions that were previously considered beyond reach.
Gartner [15] summarizes this in its 2012 definition of Big Data: high volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. There are many applications of Big Data, for example:
• Business: customer personalization, churn detection.
• Technology: reducing processing time from hours to seconds.
• Health: mining the DNA of each person to discover, monitor, and improve health aspects of everyone.
• Smart cities: cities focused on sustainable economic development and a high quality of life, with wise management of natural resources.
3. Big Data for Development
To illustrate the use of Big Data mining, we describe in detail the work of Global Pulse [33], which uses Big Data to improve life in developing countries. Global Pulse is a United Nations initiative, launched in 2009, that functions as an innovation lab working on mining Big Data for developing countries. It pursues a strategy that consists of 1) researching innovative methods and techniques for analyzing real-time digital data to identify early emerging vulnerabilities; 2) assembling a free and open-source technology toolkit for analyzing real-time data and sharing hypotheses; and 3) establishing an integrated, global network to pilot the approach on real-time digital data at the country level. Global Pulse describes the main opportunities of Big Data in "Big Data for Development: Challenges & Opportunities" [22]:
• Early warning: improving rapid response in times of crisis by detecting anomalies in the usage of real-time digital media.
• Real-time awareness: designing programs and policies with a more fine-grained representation of reality, drawn from real-time data.
• Real-time feedback: checking frequently which policies and programs are failing, monitoring them in real situations, and using feedback reports to make the needed changes.
The Big Data mining revolution is not restricted to the industrialized world, as mobile phones are spreading in developing countries as well. It is estimated that there are over five billion mobile phones, and that 80% of them are located in developing countries.
4. Contributed Articles
We selected four contributions that together represent the significant state of research in Big Data mining and give a broad overview of the field and a forecast of its future. Other significant work in Big Data mining can be found in the main conferences, such as KDD, ICDM, and ECML-PKDD, or in journals such as "Data Mining and Knowledge Discovery" and "Machine Learning".
Scaling Big Data Mining Infrastructure: This paper presents and explains Big Data mining infrastructures and the experience of doing analytics on real-time data in the telecom industry. It explains that, with the current generation of data mining tools, it is not easy to perform analytics: most of the time is consumed in preparing real-time data for the application of data mining methods and in turning initial models into robust solutions.
Mining Heterogeneous Information Networks: This paper shows that mining heterogeneous information networks is a new and challenging research frontier in Big Data mining. It treats interconnected, multi-typed data, including typical relational database data, as heterogeneous information networks. These semi-structured network models leverage the links between the different types of nodes in a network and can uncover surprisingly rich knowledge from interconnected data.
Big Graph Mining: This paper presents an overview of mining big graphs, focusing on the use of the Pegasus tool [18] and showing some findings on the Web graph and the Twitter social network. The paper proposes inspirational future research directions for big graph mining.
Mining Large Streams of User Data for Personalized Recommendations: This paper presents some lessons learned with the Netflix Prize and discusses the recommender and personalization techniques used at Netflix. It discusses recent important problems and future research directions, and contains an interesting discussion about whether we need more data or better models to improve our learning methodology.
5. The Promise of Big Data for Telecom
Big data promises to promote growth and increase efficiency and profitability across the entire telecom value chain. Exhibit 2 shows the benefits of big data over the opportunities available through traditional data warehousing technologies. They include:
• Optimizing routing between networks and improving quality of service by analyzing network traffic in real time.
• Analyzing call data records in real time to identify call durations and tariffs and to detect fraudulent activity on the network.
• Allowing call center agents to offer customized plans to subscribers immediately and explain the benefits of a tariff, leading to higher retention and better profitability.
• Tailoring marketing campaigns for services to individual customers using location-based and social networking technologies on real-time data.
• Using customer relationship management (CRM) data to understand not only business trends but also customer behavior patterns, which enables marketers to design new products and services.
Big data can even open up new sources of revenue, such as selling insights about customers to third parties. Feedback reports help make the needed changes in the organization for more revenue.
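As a concrete sketch of the call-record analysis above, the following Python fragment flags days on which a subscriber's call volume deviates sharply from that subscriber's own history, a common first filter for fraud such as SIM cloning. The per-day aggregation and the deviation threshold are illustrative assumptions, not any operator's actual pipeline.

```python
from statistics import mean, stdev

def flag_anomalous_days(daily_call_counts, threshold=3.0):
    """Flag days whose call count deviates more than `threshold`
    standard deviations from the subscriber's historical mean."""
    if len(daily_call_counts) < 2:
        return []
    mu = mean(daily_call_counts)
    sigma = stdev(daily_call_counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_call_counts)
            if abs(c - mu) > threshold * sigma]

# A subscriber with a stable pattern and one burst of activity,
# the kind of spike associated with SIM-box or cloning fraud.
history = [12, 10, 11, 13, 12, 11, 250]
suspicious = flag_anomalous_days(history, threshold=2.0)
```

In a real deployment this per-subscriber statistic would be computed continuously over the CDR stream; the point here is only that each record can be scored against a compact per-subscriber profile rather than the full history.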
6. Tools: Open Source Revolution
The Big Data phenomenon is intrinsically tied to the open source software revolution. Large companies such as Facebook, Yahoo!, Twitter, and LinkedIn benefit from and contribute to open source projects. Big Data infrastructure centers on Hadoop and related software, such as:
1) Apache Hadoop MapReduce [3]: Hadoop MapReduce is a software framework for easily writing applications that process large amounts of data (multi-terabyte datasets) in parallel on large clusters. A MapReduce job divides the input data set into independent splits, which the map tasks process in a completely parallel manner. The framework sorts the outputs of the maps, which are then passed as input to the reduce tasks. Typically both the input and the output of a job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
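The map, sort/shuffle, and reduce steps just described can be imitated on a single machine in a few lines of Python. This is only a sketch of the programming model, using word count, MapReduce's canonical example; Hadoop itself runs the same three steps distributed across a cluster.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input split."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: sort pairs by key, group them, sum each group."""
    shuffled = sorted(pairs, key=itemgetter(0))   # the framework's sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(shuffled, key=itemgetter(0))}

lines = ["big data big mining", "mining telecom data"]
counts = reduce_phase(map_phase(lines))
# counts == {'big': 2, 'data': 2, 'mining': 2, 'telecom': 1}
```

In Hadoop the map and reduce functions keep exactly this shape, while the splitting, shuffling, and fault tolerance are handled by the framework across many nodes.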
2) Apache Hadoop [3]: an open-source software framework for distributed storage and distributed processing of Big Data on clusters of commodity hardware. Its Hadoop Distributed File System (HDFS) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce ships code (specifically JAR files) to the nodes that hold the required data, and the nodes then process the data in parallel. This approach takes advantage of data locality [3], in contrast to conventional HPC architectures, which usually rely on a parallel file system (compute and data separated, but connected with high-speed networking) [4].
3) Apache Mahout [4]: a machine learning library scalable to large data sets. Its core algorithms for clustering, classification, and collaborative filtering are implemented on top of scalable, distributed systems; however, contributions that run on a single machine are welcome as well. Currently Mahout supports mainly three use cases: recommendation mining, which takes users' behavior and tries to find items users might like; clustering, which takes e.g. text documents and groups them into sets of topically related documents; and classification, which learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category.
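The recommendation-mining use case can be illustrated, in plain Python rather than Mahout's own API, with a simple item co-occurrence scheme: items that other users frequently hold together with the target user's items are suggested first. The baskets and the scoring rule below are invented for illustration.

```python
from collections import Counter

def recommend(user_items, all_baskets, top_n=2):
    """Score candidate items by how often they co-occur with the
    user's items in other users' baskets; return the best ones."""
    scores = Counter()
    for basket in all_baskets:
        if user_items & basket:               # basket shares an item with the user
            for item in basket - user_items:  # count the items the user lacks
                scores[item] += 1
    return [item for item, _ in scores.most_common(top_n)]

baskets = [
    {"router", "modem", "cable"},
    {"router", "modem"},
    {"modem", "phone"},
]
# A user who owns only a router most often co-occurs with a modem.
suggestions = recommend({"router"}, baskets)
```

Mahout's distributed recommenders use the same co-occurrence idea at scale, materializing the item-item matrix with MapReduce jobs instead of a single in-memory loop.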
4) R [29]: R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and packages, and the R community is noted for its active contributions of packages. R can be considered an implementation of the S language; there are some important differences, but much code written for S runs unaltered. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made.
5) MOA [5]: MOA is the most popular open source framework for data stream mining, with a very active and growing community. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection, and recommender systems) and tools for evaluation. Related to the WEKA project, MOA is also written in Java, while scaling to more demanding problems.
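A distinctive feature of stream mining frameworks such as MOA is prequential ("test-then-train") evaluation: each arriving example is first used to test the current model and then to train it, so no held-out set is needed. The sketch below shows only the evaluation loop, with a deliberately trivial majority-class learner standing in for a real stream classifier.

```python
from collections import Counter

def prequential_accuracy(stream):
    """Test-then-train: predict each label before learning from it."""
    counts = Counter()
    correct = 0
    for label in stream:
        prediction = counts.most_common(1)[0][0] if counts else None
        if prediction == label:
            correct += 1
        counts[label] += 1        # train on the example after testing
    return correct / len(stream)

# A stream dominated by one class: the majority learner recovers quickly.
stream = ["a", "a", "b", "a", "a", "a", "b", "a"]
acc = prequential_accuracy(stream)
```

The same loop structure applies unchanged when the learner is a real incremental classifier; only the prediction and update steps become more elaborate.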
6) The Big Graph Mining (BGM) workshop brings together researchers and practitioners to address various aspects of graph mining in this new era of big data, such as new graph mining platforms, theories that drive new graph mining techniques, scalable algorithms and visual analytics tools that spot patterns and anomalies, applications that touch our daily lives, and more. Together, participants explore and discuss how these important facets of graph mining are advancing in this age of big graphs.
7. Forecast
There are many important challenges in Big Data management and analytics of real-time data, arising from its nature: data that is large, diverse, and continuously evolving [27; 16]. These will pose big challenges to researchers and practitioners in the coming years:
• Architecture. It is not clear how to design the architecture of a system that has to deal with historic data and with real-time data at the same time. The Lambda architecture of Nathan Marz [25] addresses the problem of computing arbitrary functions on arbitrary data in real time by decomposing it into three layers: the batch layer, the serving layer, and the speed layer. In one proposed realization, Hadoop implements the batch layer and Storm the speed layer. The properties of such a system are: robust, scalable, general, extensible, allows ad hoc queries, requires minimal maintenance, debuggable, and fault tolerant.
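A toy realization of the three layers can make the decomposition concrete: a batch view precomputed over historic data, a speed view holding only increments that arrived since the last batch run, and a serving-layer query that merges the two. The layer names follow Marz [25]; the per-key event count and the class design are illustrative assumptions.

```python
from collections import Counter

class LambdaCounter:
    """Counts events per key with a batch view plus a real-time delta."""
    def __init__(self):
        self.batch_view = Counter()   # batch layer: recomputed periodically
        self.speed_view = Counter()   # speed layer: only recent increments

    def ingest(self, key):
        self.speed_view[key] += 1     # new data hits the speed layer first

    def run_batch(self):
        # The batch layer would recompute from the master dataset; here we
        # simply absorb the speed view and reset it, as the merge would.
        self.batch_view += self.speed_view
        self.speed_view.clear()

    def query(self, key):
        # Serving layer: merge batch and real-time views at query time.
        return self.batch_view[key] + self.speed_view[key]

lc = LambdaCounter()
for k in ["x", "y", "x"]:
    lc.ingest(k)
lc.run_batch()          # historic data is now in the batch view
lc.ingest("x")          # one more event arrives in real time
total = lc.query("x")   # merged answer: batch (2) + speed (1)
```

The design choice being illustrated is that the speed layer stays small and disposable: once the batch layer catches up, the real-time state can simply be discarded.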
• Statistical significance. It is important to obtain statistically significant results and not be fooled by randomness; as Efron explains in his book on large-scale inference [10], it is easy to go wrong with large data sets and thousands of simultaneous questions to answer.
• Distributed mining. Many data mining techniques are not trivial to parallelize for real-time data. Distributed versions of some methods are required, and a lot of research with practical and theoretical analysis is needed to provide new methods and standards.
• Evolving data. Data may be evolving over time, so it is important that Big Data mining techniques be able to adapt to, and in some cases detect, change in real-time data. For example, the data stream mining field has very powerful techniques for this task [13].
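In the spirit of the stream mining techniques cited above [13], a minimal change detector can compare the mean of the most recent window of values with the mean of the window before it and report when the difference exceeds a threshold. The window size and threshold below are arbitrary illustrative choices, far simpler than the detectors in real stream mining libraries.

```python
def detect_drift(stream, window=5, threshold=2.0):
    """Return the first index where the recent window's mean differs
    from the preceding window's mean by more than `threshold`."""
    for i in range(2 * window, len(stream) + 1):
        reference = stream[i - 2 * window : i - window]
        recent = stream[i - window : i]
        ref_mean = sum(reference) / window
        new_mean = sum(recent) / window
        if abs(new_mean - ref_mean) > threshold:
            return i - window      # start of the window where drift shows
    return None

# A signal whose level jumps from ~1 to ~5 partway through.
signal = [1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 5, 5, 6, 5, 5, 5]
drift_at = detect_drift(signal)
```

An adaptive learner would react to such a signal by discarding or down-weighting the model state built before the detected change point.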
• Storage. When working with Big Data, the quantity of data is large, and storage space is a main concern. There are two main approaches. The first is compression, where we compress the data before storing it; in this method we do not lose anything, and we may spend more time in exchange for less space. The second is sampling, where we do lose information but may gain space by orders of magnitude. For example, Feldman et al. [12] use coresets to reduce the complexity of Big Data problems; coresets are small sets that provably approximate the original data for a given problem, and using a merge-reduce approach these small sets can be used to solve hard machine learning problems in parallel.
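The sampling option described above can be implemented in one pass over a stream of unknown length with reservoir sampling (Algorithm R): keep the first k items, then replace a random slot with decreasing probability, so that every item ends up in the sample with probability k/n. This standard sketch is independent of any particular Big Data system.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """One-pass uniform sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, n)       # keep item with probability k/(n+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Ten items drawn uniformly from a million, using O(k) memory.
sample = reservoir_sample(range(1_000_000), k=10, rng=random.Random(42))
```

Because the reservoir never exceeds k items, the memory cost is fixed regardless of how much data streams past, which is exactly the trade of information for space that sampling makes.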
• Visualization. A main task of Big Data analysis is how to visualize the results in real time. Because the data is so large, it is very difficult to present it in a user-friendly way. New techniques and frameworks to tell the stories hidden in the data are needed, as for example the photographs, infographics, and essays in the beautiful book "The Human Face of Big Data" [30].
• Hidden Big Data. Large quantities of useful data are getting lost, since much new data is untagged, file-based, and unstructured. The 2012 IDC study on Big Data [14] explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
8. Conclusions
Big data offers telecom operators a real-time opportunity to gain knowledge of the complete picture of their operations and their customers, and to help them further in their innovation process. The industry as a whole spends comparatively little on R&D; R&D on real-time data can help increase sales, improve revenue performance, and support an operator's efforts to change its ways to be more successful in today's world. Big data demands of every industry a very different and unconventional approach to business development. The operators that can incorporate new agile strategies into their organizational DNA fastest will gain a real competitive advantage over their slower rivals.
References
[1] SAMOA, http://samoa-project.net, (2013).
[2] C. C. Aggarwal, "Managing and Mining Sensor Data", Advances in Database Systems, Springer, (2013).
[3] Apache Hadoop, http://hadoop.apache.org.
[4] Apache Mahout, http://mahout.apache.org.
[5] A. Bifet, G. Holmes, R. Kirkby and B. Pfahringer, "MOA: Massive Online Analysis", Journal of Machine Learning Research (JMLR), http://moa.cms.waikato.ac.nz/, (2010).
[6] C. Bockermann and H. Blom, "The Streams Framework", Technical Report 5, TU Dortmund University, vol. 1, (2012).
[7] D. Boyd and K. Crawford, "Critical Questions for Big Data", Information, Communication and Society, vol. 15, no. 5, pp. 662-679, (2012).
[8] F. Diebold, "'Big Data' Dynamic Factor Models for Macroeconomic Measurement and Forecasting", discussion read to the Eighth World Congress of the Econometric Society, (2000).
[9] F. Diebold, "On the Origin(s) and Development of the Term 'Big Data'", PIER Working Paper Archive, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, (2012).
[10] B. Efron, "Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction", Institute of Mathematical Statistics Monographs, Cambridge University Press, (2010).
[11] U. Fayyad, "Big Data Analytics: Applications and Opportunities in On-line Predictive Modeling", http://big-data-mining.org/keynotes/#fayyad, (2012).
[12] D. Feldman, M. Schmidt and C. Sohler, "Turning Big Data into Tiny Data: Constant-size Coresets for k-means, PCA and Projective Clustering", In SODA, (2013).
[13] J. Gama, "Knowledge Discovery from Data Streams", Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, Taylor & Francis Group, (2010).
[14] J. Gantz and D. Reinsel, "IDC: The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East", December (2012).
[15] Gartner, http://www.gartner.com/it-glossary/bigdata.
[16] V. Gopalkrishnan, D. Steier, H. Lewis and J. Guszcza, "Big Data, Big Business: Bridging the Gap", In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine '12, pp. 7-11, New York, NY, USA, ACM, (2012).
[17] Intel, "Big Thinkers on Big Data", http://www.intel.com/content/www/us/en/big-data/big-thinkers-on-big-data.html, (2012).
[18] U. Kang, D. H. Chau and C. Faloutsos, "PEGASUS: Mining Billion-Scale Graphs in the Cloud", (2012).
[19] D. Laney, "3-D Data Management: Controlling Data Volume, Velocity and Variety", META Group Research Note, February 6, (2001).
[20] J. Langford, "Vowpal Wabbit", http://hunch.net/~vw/, (2011).
[21] D. J. Leinweber, "Stupid Data Miner Tricks: Overfitting the S&P 500", The Journal of Investing, vol. 16, pp. 15-22, (2007).
[22] E. Letouzé, "Big Data for Development: Opportunities & Challenges", May (2011).
[23] J. Lin, "MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!", CoRR, abs/1209.2191, (2012).
[24] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin and J. M. Hellerstein, "GraphLab: A New Parallel Framework for Machine Learning", In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July (2010).
[25] N. Marz and J. Warren, "Big Data: Principles and Best Practices of Scalable Realtime Data Systems", Manning Publications, (2013).
[26] L. Neumeyer, B. Robbins, A. Nair and A. Kesari, "S4: Distributed Stream Computing Platform", In ICDM Workshops, pp. 170-177, (2010).
[27] C. Parker, "Unexpected Challenges in Large Scale Machine Learning", In Proceedings of the 1st International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine '12, pp. 1-6, New York, NY, USA, ACM, (2012).
[28] A. Pentland, "Reinventing Society in the Wake of Big Data", Edge.org, http://www.edge.org/conversation/reinventing-society-in-the-wake-of-big-data, (2012).
[29] R Core Team, "R: A Language and Environment for Statistical Computing", R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0, (2012).
[30] R. Smolan and J. Erwitt, "The Human Face of Big Data", Sterling Publishing Company Incorporated, (2012).
[31] Storm, http://storm-project.net.
[32] N. Taleb, "Antifragile: How to Live in a World We Don't Understand", Penguin Books Limited, (2012).
[33] UN Global Pulse, http://www.unglobalpulse.org.
[34] S. M. Weiss and N. Indurkhya, "Predictive Data Mining: A Practical Guide", Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, (1998).
[35] P. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch and G. Lapis, "IBM Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data", McGraw-Hill Companies, Incorporated, (2011).