2010 Workshop on Massive Data Analytics on the Cloud (MDAC 2010)
April 26, 2010
Raleigh, NC, USA
In association with the 19th Annual World Wide Web Conference (WWW2010)

Making Sense of Mountains of Data
[Figure: data sources feeding analytics and applications; labels recovered from the diagram are listed below]
• Search
• Online Transaction Processing System
• Feedback/Action
• Embedded Analytics
• Clickstream, CRM
• Claim data (text, picture, video)
• Call data records
• Location Tracking (GPS), iPhone, Vehicle Use Data
• $ transaction tracking (across borders & IP providers)
• Dashboards
• Continuous arrival of high volume information (evolving, highly variant; structured, semi-structured, unstructured)
• Financial Planning
• Scorecards
• Auto/Cross Correlation Analytics, Predictive Analytics
• Billions of mobile devices
• Feeds: Census Bureau Data, Market Data, Weather Data
• Sensors data
• Mash ups
• Web Data (for search)
• Web Buzz data (for reputation analysis)
• PetaBytes -> Exabytes
• Deep & Wide Analytics
• Fine grained – individual product and customer at a time and place

Massive Data Analytic Platforms
• Google: Original MapReduce implementation
• Microsoft: Dryad
• Yahoo!, Facebook, and many others: Hadoop
• Ecosystems: Hive, Pig, Jaql, Zookeeper
• Alternatives to Map/Reduce, e.g. Pregel
[Figure: MapReduce data flow; nodes labeled M, C, and R, with Partition and Sort stages]
• 1000's processors
• Petabytes of data
• …and growing
• "Easy" parallelism
• Scalability
• Fault-Tolerance
• Elastic
• Flexibility
• Cost / Performance

Chairpeople Perspective
• Other parallel systems technology and customers
– Parallel Database – enterprise data warehousing
– Parallel ETL (extraction, transformation, load)
– Search and text analytics
• Hadoop and related technologies
– Finance, Telco, Healthcare, Retail, Government, …

Questions Posed in Call For Papers
• What kinds of problems are people trying to solve?
• How are existing massive-scaleout platforms used, and what extensions would be helpful?
• Other kinds of platforms for different problems?
• How to integrate with existing environments such as data warehouses?
• Challenges in managing massive datasets?
• Legal/moral challenges associated with mining these data sets?

Agenda (morning)
9:00 - 10:30: Session 1
• Introduction and Welcome
• Invited Talk: "Hadoop: An Industry Perspective"
– Dr. Amr Awadallah, CTO, VP-Engineering, Cloudera
10:30 - 11:00: Coffee Break*
11:00 - 12:30: Session 2
• Distributed Indexing of Web Scale Datasets for the Cloud
– Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos, Nectarios Koziris; National Technical University of Athens
• Beyond Online Aggregation: Parallel and Incremental Data Mining with Online Map-Reduce
– Joos-Hendrik Böse (Intl. Comp. Sci. Institute), Artur Andrzejak and Mikael Högqvist (Zuse Institute Berlin, ZIB)
• Efficient Updates for a Shared Nothing Analytics Platform
– Katerina Doka and Nectarios Koziris (National Technical University of Athens, Greece), Dimitrios Tsoumakos (University of Cyprus)
12:30 - 1:30: Lunch*

Agenda (afternoon)
1:30 - 3:30: Session 3
• Invited Talk: "Large Scale Applications on Hadoop in Yahoo"
– Dr. Vijay Narayanan, Yahoo! Labs Silicon Valley
• Extracting User Profiles from Large Scale Data
– Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass, David Konopnicki; IBM Research, Haifa
• A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids
– Sudha Sadasivam, G. Baktavatchalam; PSG College of Technology
3:30 - 4:00: Coffee Break*
4:00 - 5:30: Session 4
• Towards Scalable RDF Graph Analytics on MapReduce
– Padmashree Ravindra, Vikas Deshpande, Kemafor Anyanwu; North Carolina State University
• SPARQL Basic Graph Pattern Processing with Iterative MapReduce
– Jaeseok Myung, Jongheum Yeon, Sang-goo Lee; Seoul National University
• Parallelizing Random Walk with Restart for Large-Scale Query Recommendation
– Meng-Fen Chiang, Tsung-Wei Wang, Wen-Chih Peng; National Chiao Tung University, Hsinchu, Taiwan

Acknowledgements
Workshop Chairs
• Ullas Nambiar, IBM India Research Lab, New Delhi, India
• John McPherson, IBM Almaden Research Center, USA
• David Konopnicki, IBM Haifa Research Lab, Israel
Steering Committee
• Rakesh Agrawal, Microsoft Search Labs, Mountain View, CA, USA
• Alon Halevy, Google Inc., Mountain View, CA, USA
Invited Speakers
• Amr Awadallah, CTO, VP-Engineering, Cloudera, "Hadoop: An Industry Perspective"
• Vijay Narayanan, Yahoo! Labs Silicon Valley, "Large Scale User Modeling on Hadoop"
Program Committee
• Amr Awadallah, Cloudera, USA
• Andrew McCallum, University of Massachusetts Amherst, USA
• Assaf Schuster, Technion - Israel Institute of Technology
• Gautam Das, University of Texas, Arlington, USA
• Jimeng Sun, IBM Watson Research Center, USA
• John Shafer, Microsoft Search Labs, USA
• Kevin Chang, University of Illinois at Urbana-Champaign, USA
• Kun Liu, Yahoo! Labs, USA
• Louiqa Raschid, University of Maryland, College Park, USA
• Michal Shmueli-Scheuer, IBM Haifa Research Lab, Israel
• Michael Sheng, University of Adelaide, Australia
• Mong Li Lee, National University of Singapore, Singapore
• Rajeev Gupta, IBM India Research Lab, India
• Vanja Josifovski, Yahoo Research, USA
• Yannis Sismanis, IBM Almaden Research Center, USA
• Yi Chen, Arizona State University, USA
• Wen-syan Li, SAP, China