Big Data Programming with Hadoop and Spark

Introduction to Hadoop
Programming
Bryon Gill, Pittsburgh Supercomputing Center
Hadoop Overview
• Framework for Big Data
• Map/Reduce
• Platform for Big Data Applications
Map/Reduce
• Apply a Function to all the Data
• Harvest, Sort, and Process the Output
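The map/sort/reduce flow above can be sketched in plain Python. This is a toy simulation of the idea (a wordcount), not Hadoop code; the function names are mine:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map: emit a (key, value) pair for each word in the record
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce: combine all values collected for one key
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]

# Map phase: apply the map function to every input record
mapped = [pair for line in lines for pair in mapper(line)]

# Shuffle/sort: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))
result = dict(reducer(k, (v for _, v in g))
              for k, g in groupby(mapped, key=itemgetter(0)))

print(result["the"])  # -> 2
```

Hadoop does the same three steps, but with the splits and the sort distributed across the cluster.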
Map/Reduce
[Diagram: Big Data is divided into Split 1 … Split n; the Map function F(x) turns each split into Output 1 … Output n; the Reduce function F(x) combines the outputs into a single Result.]
© 2014 Pittsburgh Supercomputing Center
HDFS
• Distributed filesystem layer
• WORM (write-once, read-many) filesystem
– Optimized for streaming throughput
• Exports
• Replication
• Process data in place
HDFS Invocations: Getting Data In and Out
• hadoop dfs -ls
• hadoop dfs -put
• hadoop dfs -get
• hadoop dfs -rm
• hadoop dfs -mkdir
• hadoop dfs -rmdir
Writing Hadoop Programs
• Wordcount Example: Wordcount.java
– Map Class
– Reduce Class
Compiling
• javac -cp $HADOOP_HOME/hadoop-core*.jar \
-d WordCount/ WordCount.java
Packaging
• jar -cvf WordCount.jar -C WordCount/ .
Submitting your Job
• hadoop \
jar WordCount.jar \
org.myorg.WordCount \
/datasets/compleat.txt \
$MYOUTPUT \
-D mapred.reduce.tasks=2
Configuring your Job Submission
• Mappers and Reducers
• Java options
• Other parameters
Monitoring
• Important Ports:
– Hearth-00.psc.edu:50030 – JobTracker (MapReduce jobs)
– Hearth-00.psc.edu:50070 – NameNode (HDFS)
– Hearth-03.psc.edu:50060 – TaskTracker (worker node)
– Hearth-03.psc.edu:50075 – DataNode
Hadoop Streaming
• Write Map/Reduce Jobs in any language
• Excellent for Fast Prototyping
Hadoop Streaming: Bash Example
• Bash wc and cat
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
-input /datasets/plays/ \
-output mynewoutputdir \
-mapper '/bin/cat' \
-reducer '/usr/bin/wc -l'
Hadoop Streaming Python Example
• Wordcount in Python
• hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming*.jar \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer.py \
-input /datasets/plays/ \
-output pyout
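The slide names mapper.py and reducer.py but doesn't show their contents. Below is one common way to write them (a sketch; the function names are mine). Streaming feeds each script lines on stdin, and Hadoop sorts the mapper's output by key before the reducer sees it, so equal words arrive adjacent:

```python
import sys

def map_stream(lines, out=sys.stdout):
    # mapper.py: emit one "word<TAB>1" line per word
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def reduce_stream(lines, out=sys.stdout):
    # reducer.py: input is sorted by key, so we can sum a run of
    # identical words and emit the total when the key changes
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word == current:
            total += int(count)
        else:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")
```

In practice each function's body goes in its own executable file (with a `#!/usr/bin/env python` line) and is called on `sys.stdin`, since Streaming runs the mapper and reducer as separate processes.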
Applications in the Hadoop Ecosystem
• HBase (NoSQL database)
• Hive (data warehouse with SQL-like language)
• Pig (SQL-style MapReduce)
• Mahout (machine learning via MapReduce)
• Spark (caching computation framework)
Spark
• Alternate programming framework using HDFS
• Optimized for in-memory computation
• Well supported in Java, Python, Scala
Spark Resilient Distributed Dataset (RDD)
• RDD for short
• Persistence-enabled data collections
• Transformations
• Actions
• Flexible implementation: memory vs. hybrid vs. disk
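The transformation/action split can be illustrated with a rough pure-Python analogy: generators are lazy the way RDD transformations are, and consuming one plays the role of an action. This is only an analogy, not Spark's API:

```python
data = range(1, 6)

# "Transformations" describe the computation but run nothing yet
squared = (x * x for x in data)             # like rdd.map(lambda x: x * x)
evens = (x for x in squared if x % 2 == 0)  # like .filter(...)

# An "action" forces the whole pipeline to execute and returns a value
total = sum(evens)  # like rdd.reduce(...)
print(total)        # -> 20 (4 + 16)
```

Spark adds what generators lack: the lineage of transformations is distributed across the cluster, and persistence lets an intermediate dataset be cached in memory (or spilled to disk) for reuse.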
Spark example
• lettercount.py
Spark Machine Learning Library
• Clustering (K-Means)
• Many others, list at
http://spark.apache.org/docs/1.0.1/mllib-guide.html
K-Means Clustering
• Randomly seed the cluster centroids
• Assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points
• If the centroids changed, do it again
• If the centroids stay the same, they've converged and we're done.
• Awesome visualization:
http://www.naftaliharris.com/blog/visualizing-k-means-clustering/
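The loop above can be sketched in plain Python for one-dimensional points. This is a toy serial version of what MLlib parallelizes over an RDD; the function and variable names are mine, not MLlib's:

```python
import random

def kmeans_1d(points, k, seed=0):
    rng = random.Random(seed)
    # 1. Randomly seed the cluster centroids
    centroids = rng.sample(points, k)
    while True:
        # 2. Assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # 3. Recompute each centroid as the mean of its cluster
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # 4. Converged when the centroids stop moving
        if new == centroids:
            return sorted(new)
        centroids = new

print(kmeans_1d([1.0, 1.1, 0.9, 9.0, 9.1, 8.9], k=2))
```

MLlib's version does the assignment and mean steps as distributed map/reduce passes over the dataset, which is why k-means is a natural fit for Spark's cached in-memory iteration.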
K-Means Examples
• spark-submit \
$SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
hdfs://hearth-00.psc.edu:/datasets/kmeans_data.txt 3
• spark-submit \
$SPARK_HOME/examples/src/main/python/mllib/kmeans.py \
hdfs://hearth-00.psc.edu:/datasets/archiver.txt 2
Questions?
• Thanks!
References and Useful Links
• HDFS shell commands:
http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html
• Writing and running your first program:
http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197
https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program
• Hadoop Streaming:
http://hadoop.apache.org/docs/stable1/streaming.html
https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program/hadoop-streaming
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
• Hadoop Stable API:
http://hadoop.apache.org/docs/r1.2.1/api/
• Hadoop Official Releases:
https://hadoop.apache.org/releases.html
• Spark Documentation:
http://spark.apache.org/docs/latest/