Python In The Cloud
PyHou MeetUp, Dec 17th 2013
Chris McCafferty, SunGard Consulting Services
Overview
• What is the Cloud?
• What is Big Data?
• Big Data Sources
• Python and Amazon Web Services
• Python and Hadoop
• Other Pythonic Cloud providers
• Wrap-up
What Is The Cloud
• I want 40 servers and I want them NOW
• I want to store 100 TB of data cheaply and reliably
• We can do this with Cloud technologies
What is Big Data
• “Three Vs”
– Volume
– Variety
– Velocity
• Genome: sequencing machines throw off
several TB per day. Each.
• Hard drive performance is often the killer
bottleneck, both reading and writing
What is NOT Big Data
• Anything where the whole data set can be
held in memory on a single standard instance
• Data that can be held straightforwardly in a
traditional relational database
• Problems where most of the data can be
trivially excluded
• There are many challenging problems in the
world – but not all need Cloud or Big Data
tools to solve them
To The Cloud!
• Amazon Web Services is the 800lb gorilla in this
space
– Start here if in doubt
• Other options are RackSpace, Microsoft Azure,
(PiCloud/Multyvac?)
• You can also spin up some big iron very cheaply
– Current AWS big memory spec is cr1.8xlarge
– 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network
– $3.50 per hour
Geo Big Data Sources
• NASA SRTM data is on the large side
• NASA recently released a huge set of data directly into
the cloud: NEX
– Earth Sciences data sets
• Made available on Amazon Web Services public
datasets
• Available on S3 at:
– s3://nasanex/NEX-DCP30
– s3://nasanex/MODIS
– s3://nasanex/Landsat
• There are many, many geo data sets available now
(NOAA Lidar, etc)
Time for some code
• Example - Use S3 browser to look at new
NASA NEX data
• Let’s download some with boto package
• Quickest to do this from an Amazon data
centre
• See DemoDownloadNasaNEX.py
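DemoDownloadNasaNEX.py itself isn't reproduced in this transcript; the snippet below is a minimal sketch of that kind of download using boto 2. The NEX-DCP30 prefix comes from the earlier slide, while the anonymous connection, the local filename and grabbing only the first object are illustrative choices.

```python
# Hypothetical stand-in for DemoDownloadNasaNEX.py: list the public nasanex
# bucket anonymously with boto 2 and download the first object under a prefix.
import boto

conn = boto.connect_s3(anon=True)                          # public bucket, no credentials needed
bucket = conn.get_bucket('nasanex', validate=False)

for key in bucket.list(prefix='NEX-DCP30/'):
    print('%s (%d bytes)' % (key.name, key.size))
    key.get_contents_to_filename(key.name.split('/')[-1])  # save alongside the script
    break                                                  # just the first file for the demo
```

Running this from an EC2 instance in the same region keeps the transfer inside Amazon's network, which is why the slide suggests doing it from an Amazon data centre.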
Weather & Big Data Sources
• Good public weather and energy data
• It's hard to move data around for free: just try!
• Power grids shed many GB of public data a day
– Historical data sets run to many terabytes
• Weather data available from NOAA
– QCLCD: hourly, daily, and monthly summaries for approximately 1,600 U.S. locations.
– ASOS data contains sensor data at one-minute
intervals. 5 min intervals available too.
• 900 stations, 3-4MB per day, 12 years of data = 11-15TB data
set.
Why go to the cloud
• Cheap - see AWS pricing here
– spot pricing of m1.medium normally ~1c/hr
• The cloud is increasingly where the (public) data
will reside
• Pay as you go, less bureaucracy
• Support for Big Data technologies out of the box
– Amazon Elastic MapReduce (EMR) gives you a Hadoop cluster on EC2 with minimal setup
• Host a big web server farm or video streaming
cluster
Python on AWS EC2
• AWS = Amazon Web Services. The Big Cloud
• EC2 = Elastic Compute Cloud
• Let’s run up an instance and see what we have
available
• See this script as one way to upgrade to Python
2.7
• Note absence of high-level packages like NumPy,
matplotlib and Pandas
• It would be very useful to have a very high-level
Python environment…
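The upgrade script isn't included in the transcript, but "running up an instance" from Python can be sketched with boto 2 as below. The AMI id, key pair and security group are placeholders rather than values from the talk.

```python
# A rough sketch (not the presenter's script) of launching an EC2 instance
# with boto 2. The AMI id, key pair and security group names are placeholders.
import time

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')    # credentials from ~/.boto or env vars

reservation = conn.run_instances(
    'ami-xxxxxxxx',                               # placeholder Amazon Linux AMI id
    instance_type='m1.medium',
    key_name='my-keypair',
    security_groups=['default'])

instance = reservation.instances[0]
while instance.state != 'running':                # poll until the box is up
    time.sleep(5)
    instance.update()

print('Instance %s running at %s' % (instance.id, instance.public_dns_name))
```

Once you SSH in, a quick `python -V` shows how bare a stock image is, which is where StarCluster's pre-built AMIs come in.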
StarCluster
• Cluster management in AWS, written by a group at MIT
• Convenient package to spin up clusters (Hadoop or other)
and copy across files
• Machine images (AMIs) for high-level Python environments
(NumPy, matplotlib, Pandas, etc)
• Not every high-level library is there
– No sklearn (SciKit-Learn, machine learning)
– But easier to pip-install with most pre-requisites already there
• Sun Grid Engine: Job Management
• Hadoop
• Boto plugin
• dumbo… and much more
Python's Support for AWS
• boto - interface to AWS (Amazon Web Services)
• Hadoop Streaming - use Python in MapReduce
tasks
• mrjob - Framework that wraps Hadoop Streaming and uses boto (minimal example after this list)
• pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce
• Write Python in User-Defined Functions in Pig,
Hive
– Essentially wraps MapReduce and Hadoop Streaming
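None of these frameworks is shown in the slides, so here is a minimal mrjob job as an illustration of the model: mrjob wraps Hadoop Streaming, so the mapper and reducer are ordinary Python methods. Word count is just a stand-in task.

```python
# Minimal mrjob example (word count). mrjob turns this class into a
# Hadoop Streaming job; each method runs as a separate streaming task.
from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, _, line):
        # each input line arrives via stdin from Hadoop Streaming
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()
```

The same file runs locally (`python wordcount.py input.txt`), on your own cluster (`-r hadoop`) or on Amazon Elastic MapReduce (`-r emr`) without code changes.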
Boto - Python Interface to AWS
• Support for HDFS
• Upload/download from Amazon S3 and Glacier
• Start/stop EC2 instances
• Manage users through IAM
• Virtually every API available from AWS is supported
• django-storages uses boto to present an S3
storage option
• See http://docs.pythonboto.org/en/latest/
• Make sure you keep your AWS key-pair secure
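As a flavour of the API, a couple of the operations listed above look roughly like this in boto 2 (assuming credentials are already configured in ~/.boto or the environment):

```python
# Illustrative boto 2 calls: inspect (and optionally stop) EC2 instances,
# then list IAM users. Not from the talk; just sample usage of the API.
import boto
import boto.ec2

ec2 = boto.ec2.connect_to_region('us-east-1')
for reservation in ec2.get_all_instances():        # instances come back grouped by reservation
    for instance in reservation.instances:
        print('%s %s' % (instance.id, instance.state))
        # ec2.stop_instances(instance_ids=[instance.id])   # uncomment to stop it

iam = boto.connect_iam()
print(iam.get_all_users())                         # raw response listing IAM users
```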
Another Code Example – upload
• Example where we merge many files together
and upload to S3
• Merge files to avoid the Small Files Problem
• Note use of retry decorator (exponential
backoff)
• See CopyToCloud.py and
MergeAndUploadTxOutages.py
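Those two scripts aren't reproduced here; the sketch below only illustrates the pattern they describe: concatenate many small files into one object and upload it to S3 behind an exponential-backoff retry decorator. The file paths, bucket name and retry settings are made up for illustration.

```python
# Merge many small text files into one and upload to S3 with retries.
# Paths and bucket name are placeholders; the retry decorator backs off
# exponentially between attempts, as mentioned on the slide.
import glob
import time
from functools import wraps

import boto


def retry(tries=5, delay=1, backoff=2):
    """Retry the wrapped call, doubling the wait after each failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator


@retry()
def upload(bucket, key_name, filename):
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(filename)


# merge many small daily files to avoid the Small Files Problem
with open('merged.txt', 'w') as out:
    for path in sorted(glob.glob('outages/*.txt')):
        with open(path) as f:
            out.write(f.read())

conn = boto.connect_s3()
bucket = conn.get_bucket('my-outage-data')          # placeholder bucket name
upload(bucket, 'merged/tx_outages.txt', 'merged.txt')
```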
What is Hadoop?
• A scalable data and job manager suitable for
MapReduce jobs
• Core technologies date from early 2000s at
Google
• Retries failed tasks, redundant data, good for
commodity hardware
• Rich ecosystem of tools including NoSQL
databases, good Python support
• Example: let's spin up a cluster of 30 machines with StarCluster
Hadoop Scales Massively
Hadoop Streaming
• Hadoop passes incoming data in rows on stdin
• Any program (including Python) can process
the rows and emit to stdout
• Logging and errors go to stderr
Hadoop Streaming - Echo
• Useful example that can be used for
debugging
• Tells you what Hadoop is actually passing your
task
• See echo.py
• Similar example firstten.py peeks at the first
ten lines then stops
• Useful for debugging
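Neither script is reproduced in the transcript, but a plausible minimal echo.py is just a stdin-to-stdout pass-through with diagnostics on stderr; firstten.py would be the same loop breaking after ten lines.

```python
# A plausible minimal echo.py: emit every row Hadoop hands us unchanged,
# keeping diagnostics on stderr so they don't pollute the task output.
import sys

count = 0
for line in sys.stdin:
    sys.stdout.write(line)    # pass each row straight through
    count += 1

sys.stderr.write('echoed %d lines\n' % count)
```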
Hadoop Parsing Example
• Python's regex support makes it very good for parsing
unstructured data
• One of the keys in working with Hadoop and Big Data is
getting it into a clean row-based format
• Apply 'schema on read'
• Transmission Data from PJM is updated here every 5
mins: https://edart.pjm.com/reports/linesout.txt
• Needs cleaning up before we can use it for detailed analysis
- note multi-line format
• Script split_transmission.py
• Watch out for Hadoop splitting the input into blocks in the middle of a multi-line record
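split_transmission.py isn't included and the exact linesout.txt layout isn't shown, so the sketch below only illustrates the general "schema on read" idea: detect the start of each multi-line record with a regex (the date-at-start-of-line pattern is an assumption) and emit one clean tab-separated row per record.

```python
# Illustrative multi-line record splitter in the spirit of split_transmission.py.
# The record-start pattern is an assumption, not the real PJM format.
import re
import sys

RECORD_START = re.compile(r'^\d{2}/\d{2}/\d{4}')    # assume each record begins with a date

record = []
for raw in sys.stdin:
    line = raw.strip()
    if RECORD_START.match(line) and record:
        print('\t'.join(record))                    # flush the previous record as one row
        record = []
    if line:
        record.append(line)

if record:                                          # don't forget the final record
    print('\t'.join(record))
```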
Alternatives to AWS
• PiCloud offers open source software enabling you to run large computational clusters
– Just acquired by Dropbox
– Pay for what you use: 1 core and 300 MB of RAM costs $0.05/hr
– Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud
• Disco is MapReduce implemented in Python (sketch after this list)
– Started life at Nokia
– Has its own Distributed Filesystem (like HDFS)
• Or roll your own cluster in-house with pp (parallel python)
• StarCluster and Sun Grid Engine on another vendor or in-house
• Google App Engine…?
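As a taste of Disco (referenced above), here is a job close to the project's canonical word-count example; the input URL is just a sample text file and a running Disco cluster is assumed.

```python
# Disco word count, adapted from the project's standard example.
# Both the job definition and the map/reduce functions are plain Python.
from disco.core import Job, result_iterator


def fun_map(line, params):
    for word in line.split():
        yield word, 1


def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)


if __name__ == '__main__':
    job = Job().run(input=['http://discoproject.org/media/text/chekhov.txt'],
                    map=fun_map,
                    reduce=fun_reduce)
    for word, count in result_iterator(job.wait(show=True)):
        print('%s %d' % (word, count))
```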
PiCloud
• Acquired by Dropbox in Nov 2013
• Dropbox will probably come out with its own cloud compute offering in 2014
• As of Dec 2013, no new sign-ups
• Existing customers encouraged to migrate
to Multyvac
• PiCloud will be switched off on Feb 25th, 2014
• The underlying PiCloud software is still open
source
Conclusions
• For cheap compute power and cheap storage, look to
the cloud
• Python is well-supported in this space
• Consider being close to your data: in the same cloud
– Moving data is expensive and slow
• Leverage AWS with tools like boto, StarCluster
• Beware setting up complex environments: installing
packages takes time and effort
• Ideally, think Pythonically – use the best tools to get the job done
Links
• Good rundown on the Python ecosystem around
Hadoop from Jan 2013:
– http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
• Early vision for PiCloud (YouTube Mar 2012)
– http://www.youtube.com/watch?v=47NSfuuuMfs
• Disco MapReduce Framework from PyData
– http://www.youtube.com/watch?v=YuLBsdvCDo8
– PuTTY tool for Windows
• Some AWS & Python war stories:
– http://nz.pycon.org/schedule/presentation/12
Thank you
• Chris McCafferty
• http://christophermccafferty.com/blog
• Slides will be at:
• http://christophermccafferty.com/slides
• Contact me at:
• [email protected]