Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Hadoop Technical Workshop Academic Hadoop Usage Overview • University of Washington Curriculum – Teaching Methods – Reflections – Student Background – Course Staff Requirements UW: Course Summary • Course title: “Problem Solving on Large Scale Clusters” • Primary purpose: developing large-scale problem solving skills • Format: 6 weeks of lectures + labs, 4 week project UW: Course Goals • Think creatively about large-scale problems in a parallel fashion; design parallel solutions • Manage large data sets under memory, bandwidth limitations • Develop a foundation in parallel algorithms for large-scale data • Identify and understand engineering tradeoffs in real systems Lectures • • • • 2 hours, once per week Half formal lecture, half discussion Mostly covered systems & background Included group activities for reinforcement Classroom Activities • Worksheets included pseudo-code programming, working through examples – Performed in groups of 2—3 • Small-group discussions about engineering and systems design – Groups of ~10 – Course staff facilitated, but mostly openended Readings • No textbook • One academic paper per week – E.g., “Simplified Data Processing on Large Clusters” – Short homework covered comprehension • Formed basis for discussion Lecture Schedule • • • • • • Introduction to Distributed Computing MapReduce: Theory and Implementation Networks and Distributed Reliability Real-World Distributed Systems Distributed File Systems Other Distributed Systems Intro to Distributed Computing • • • • What is distributed computing? Flynn’s Taxonomy Brief history of distributed computing Some background on synchronization and memory sharing MapReduce • Brief refresher on functional programming • MapReduce slides – More detailed version of module I • Discussion on MapReduce Networking and Reliability • Crash course in networking • Distributed systems reliability – What is reliability? – How do distributed systems fail? – ACID, other metrics • Discussion: Does MapReduce provide reliability? Real Systems • Design and implementation of Nutch • Tech talk from Googler on Google Maps Distributed File Systems • Introduced GFS • Discussed implementation of NFS and AndrewFS for comparison Other Distributed Systems • BOINC: Another platform • Broader definition of distributed systems – DNS – One Laptop per Child project Labs • Also 2 hours, once per week • Focused on applications of distributed systems • Four lab projects over six weeks Lab Schedule • Introduction to Hadoop, Eclipse Setup, Word Count • Inverted Index • PageRank on Wikipedia • Clustering on Netflix Prize Data Design Projects • Final four weeks of quarter • Teams of 1—3 students • Students proposed topic, gathered data, developed software, and presented solution Example: Geozette Image © Julia Schwartz Example: Galaxy Simulation Image © Slava Chernyak, Mike Hoak Other Projects • Bayesian Wikipedia spam filter • Unsupervised synonym extraction • Video collage rendering Ongoing research: traceroutes • Analyze time-stamped traceroute data to model changes in Internet router topology – 4.5 GB of data/day * 1.5 years – 12 billion traces from 200 PlanetLab sites • Calculates prevalence and persistence of routes between hosts Ongoing research: dynamic program traces • Dynamic memory trace data from simulators can reach hundreds of GB • Existing work focuses on sampling • New capability: record all accesses and post-process with Hadoop Common Features • Hadoop! • Used publicly-available web APIs for data • Many involved reading papers for algorithms and translating into MapReduce framework Background Topics • Programming Languages • Systems: – Operating Systems – File Systems – Networking • Databases Programming Languages • MapReduce is based on functional programming map and fold • FP is taught in one quarter, but not reinforced – “Crash course” necessary – Worksheets to pose short problems in terms of map and fold – Immutable data a key concept Multithreaded programming • Taught in OS course at Washington – Not a prerequisite! • Students need to understand multiple copies of same method running in parallel File Systems • Necessary to understand GFS • Comparison to NFS, other distributed file systems relevant Networking • TCP/IP • Concepts of “connection,” network splits, other failure modes • Bandwidth issues Other Systems Topics • Process Scheduling • Synchronization • Memory coherency Databases • Concept of shared consistency model • Consensus • ACID characteristics – Journaling – Multi-phase commit processes Course Staff • Instructor (me!) • Two undergrad teaching assistants – Helped facilitate discussions, directed labs • One student sys admin – Worked only about three hours/week Preparation • Teaching assistants had taken previous iteration of course in winter • Lectures retooled based on feedback from that quarter – Added reasonably large amount of background material • Ran & solved all labs in advance The Course: What Worked • Discussions – Often covered broad range of subjects • Hands-on lab projects • “Active learning” in classroom • Independent design projects Things to Improve: Coverage • Algorithms were not reinforced during lecture – Students requested much more time be spent on “how to parallelize an iterative algorithm” • Background material was very fast-paced Things to Improve: Projects • Labs could have used a moderated/scripted discussion component – Just “jumping in” to the code proved difficult – No time was devoted to Hadoop itself in lecture – Clustering lab should be split in two • Design projects could have used more time Future Course Ideas Overview • • • • • Systems course Web application design Integration in other applications courses Misc. content ideas Making your own data sets Systems Course • Focused on parallel & distributed systems • Hadoop included in comparison to other cluster techniques • Emphasis on performance, profiling, and management Topic Map MapReduce Distributed FS Condor IPC mechanisms n-phase commit Communication networks and topologies consensus sockets Definitions of failure and reliability MPI Introductory Material • Networking basics • Multithreading Distributed Reliability • Reliability metrics • Methods of failure • Techniques to combat failure – Journaling, n-phase commit • Techniques to achieve consensus – Leader election, voting Parallel Processing • How to parallelize algorithms • Parallelization in one machine vs. across several machines – Techniques applicable to one vs. other – Cache coherency – Memory distribution Parallelization Frameworks • Multithreading on one machine • RPC, MPI, PVM • Higher-level scheduling – Condor vs. Hadoop • Tradeoffs in design Algorithm Design Comparison • Matrix multiplication, sorting, searching, PageRank, etc – … For a standard distributed system – … For Hadoop Distributed Storage • NFS, AFS, GFS • Database clustering techniques – Distributed SQL databases – HBase – Distributed memory caches, object stores Lab Focus • Implementing parallel and distributed algorithms • Experiment with different frameworks • Perform measurements – Bandwidth consumption – Latency & performance • Code analysis Final Thoughts • Lots of low-level programming involved • Appropriate mostly for last-year students • Hadoop community would find scholarly benchmarks useful – wiki.apache.org/hadoop/ProjectSuggestions – “JIRA” bug/feature request database Web Application Design Basic Web Development Topics 3-tier apps, e-commerce requirements LAMP Stack / Ruby on Rails HTML XML HTTP Large-Scale Web Server Technology Amazon Web Services Nutch/Lucene, Hadoop RPC Systems SOAP/XMLHttpRequest Distributed System Basics Next Steps • RPC – Internal RPC; message queues and distributed back-ends – Thrift, ProtocolBuffers – SOAP and XMLHttpRequest Scaling Really Big • Nutch/Lucene • Hadoop • Amazon Web Services Data Aggregation and Analysis • • • • How to crawl and parse web pages Generate link graphs Perform analyses (e.g., PageRank) Semantic analysis Web Site Tuning • Web page layout optimization – Speed – Accessibility – Ease-of-use • Server log analysis – User-targeted site features • Service replication – Consistency, latency issues Security and the Web • • • • Data sanitization SQL injection attacks DOS attacks Data collection methods & ethics – User data privacy Projects • Code labs in Python, PHP, Ruby • Simple database design • Building a small search engine with Nutch/Lucene • Design scalable architecture and run on Amazon EC2 • Web site design project – Security/penetration analysis of other teams’ sites Final Course Thoughts • Web-based services are increasingly relevant – Exciting new opportunity for students – Example course in action: www.cs.washington.edu/education/courses/ cse454/07au/ Using Hadoop in Other Courses • Hadoop is a natural component for many existing courses – Artificial intelligence – Web search – Data mining / information retrieval – Databases (HBase) – Networking – Computational biology? Graphics? Low Level Module • “MapReduce in a week:” code.google.com/edu/content/parallel.html • 3-lecture series on distributed processing and Hadoop; enough to get students started • … more discussion of online resources next AI/Data Mining Ideas • Use Nutch to perform a web crawl and classify pages using Bayesian analysis • Hadoop makes processing easy – Data sanitization – Classifier engine (Use WEKA right in Hadoop) – HDFS for document storage/retrieval/search AI/Data Mining Ideas • Extract semantically valuable data from web pages – E.g., match names to phone numbers, – News articles to locations • Hadoop allows students to explore a much broader scale than previously possible Graphics Examples • Re-encode a render pipeline as a set of MapReduce tasks • Use feature detection + clustering on a corpus of images to find images with similar shapes/features Student-Generated Ideas • Data processing with Yahoo Pig • Distributed SQL databases • Distributed systems “ground-up” projects: – Sockets, then RPC, then Hadoop • Other concepts: Bittorrent, DHTs, P2P • Other frameworks: e.g., BOINC projects Making Datasets • Your department is full of data! – Graphics data – Sensor data from RFID, Ubicomp, robotics… – Measurements from networking lab – Ask around: Someone has a few dozen gigs of log files to donate – (What happens if you leave Ethereal in promiscuous mode for a week straight?) Making Datasets • Other departments are full of data! – Biology – Chemistry – Physics (campus particle accelerator?) Making Datasets • The web is full of data! – Use Nutch to crawl web sites – Wikis are especially good (hmm..) Conclusions • Hadoop isn’t a full course in itself – But it combines well with a lot of other ideas • Can be used for at least a half a course • … Or as little as a week or two • Look around you – Hadoop can be applied to more areas than you might think Open Source Tools for Teaching Overview • • • • • Slides Lab Materials Readings Video Lectures Datasets http://code.google.com/edu Slides • Multiple short course outlines available: • “MapReduce in a week” • “Introduction to Problem Solving on Large Scale Clusters” • “MapReduce Mini Lecture Series” Labs • Lab designs from UW course available – “Introduction to MapReduce” – “A Simple Inverted Index” – “PageRank on the Wikipedia Corpus” – “Clustering the Netflix Movie Data” Readings • Google has several papers available – “Introduction to Distributed Systems” – “MapReduce: Simplified Data Processing on Large Scale Clusters” – “The Google File System” – “BigTable: A Distributed Storage System for Structured Data” http://research.google.com/pubs/papers.html Lecture Videos MapReduce Mini-series Datasets: Wikipedia • Wikipedia supports free “bulk download” of data – Current site snapshot (big) – Entire revision history (massive) • Eliminates need for Nutch crawls • Good for indexing, search labs http://download.wikimedia.org Datasets: Netflix • Netflix’s web site provides recommendations • Theory: Other people watched movie X, then Y. You watched X, you might like Y. • Open question: Can you provide more useful recommendations than their current system? Datasets: Netflix • The Netflix Prize: $1,000,000 if you can find a better algorithm, based on their criteria • They provide you with a large dataset of existing rental associations to work with www.netflixprize.com Conclusions • Lots of starter materials available on the web – Good for reference – Get teaching assistants up to speed • Readings, sample worksheets and other resources are open content & ready to use Aaron Kimball [email protected]