Big Data Management Repeat Assignment

Description

In this assignment your task is to perform some analytics on a dataset and prepare the batch layer, the serving layer and the speed layer of the lambda architecture. You will be using the CRAN package download logs (http://cran-logs.rstudio.com). These log files contain all hits to the http://cran.rstudio.com mirror related to downloads of R packages. The raw log files have been parsed into CSV and anonymised. Since these logs contain a massive amount of data (from 2012 to date), you will only be using the logs for the first week of June, 2022 (1st June - 5th June). These log files are available at:

http://cran-logs.rstudio.com/2022/2022-06-01.csv.gz
http://cran-logs.rstudio.com/2022/2022-06-02.csv.gz
http://cran-logs.rstudio.com/2022/2022-06-03.csv.gz
http://cran-logs.rstudio.com/2022/2022-06-04.csv.gz
http://cran-logs.rstudio.com/2022/2022-06-05.csv.gz

The package download log contains data about the following variables:

date: Download date
time: Download time (in UTC)
size: Package size (in bytes)
r_version: Version of R used to download the package
r_arch: Processor architecture (i386 = 32 bit, x86_64 = 64 bit)
r_os: Operating system (darwin9.8.0 = mac, mingw32 = windows)
package: Name of the package downloaded
country: Two-letter ISO country code
ip_id: A daily unique id assigned to each IP address

Questions

You will answer the same questions as in assignment 1, but this time you are required to use the SQL API of Apache Spark. Prepare Cassandra structures and Spark code that saves the precomputed data into these structures:

1. Show the number of downloads for the package ggplot2.
2. Highest number of downloads by a country. Show its name.
3. Top 10 smallest sized packages.
4. What are the top 10 least popular packages?
5. Highest number of downloads by an operating system.
6. What is the most popular package in Ireland?
7. What is the highest number of downloads by a single machine?
8. What OS is the least popular among R programmers?
9. How many users use Mac OS?
10. List the total number of incomplete records - lines which have missing values.

Tasks

Task 1 - 20%
For your first task, you are required to use Apache Spark RDD transformations and actions to answer the above questions about the dataset.

Task 2 - 20%
In this task you are required to use Apache Spark's SQL API to answer the above questions about the dataset. Store the results for each question in Apache Cassandra.

Task 3 - 20%
In the last task, you are required to use Apache Spark's Streaming API to compute the real-time views for the questions. Store these views in Apache Cassandra. To emulate a live stream of the download logs, you are required to write a separate Python script that reads 1000 lines every 5 seconds from each log file and stores them as separate files (log1, log2, log3, etc.) in the streaming directory on which your application is listening.

Submission

• Submit your solution on Moodle by the specified deadline.
• Acceptable file format: Python notebook - name it student name.ipynb. The notebook should be exported as an iPython Notebook with the *.ipynb extension. If the code in your notebook does not run, it will result in a 20% penalty.
• Take two screenshots of your solution to each question (the code, plus its output in the Cassandra table where applicable), insert them in a Word document, and generate a PDF of this document.
• Zip both files together and submit your solution on Moodle by the deadline.
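
The live-stream emulation described in Task 3 could be sketched roughly as follows. This is only one possible reading of the brief, not a reference solution: the function name `emit_chunks` and its parameters are hypothetical, the log files are assumed to be processed one after another, and repeating the CSV header in every chunk is an assumption made so that each emitted file can be parsed on its own. Each chunk is written under a hidden temporary name and then renamed, on the assumption that the listening application should only ever see complete files.

```python
import gzip
import itertools
import time
from pathlib import Path


def emit_chunks(log_paths, stream_dir, lines_per_chunk=1000,
                interval_s=5.0, repeat_header=True, sleep=time.sleep):
    """Split each log into files log1, log2, ... inside stream_dir.

    Hypothetical helper: names and defaults are illustrative only.
    Returns the list of paths written, in order.
    """
    stream_dir = Path(stream_dir)
    stream_dir.mkdir(parents=True, exist_ok=True)
    written = []
    counter = 0
    for log_path in log_paths:
        # Handle both the raw .csv.gz downloads and already-unzipped CSVs.
        opener = gzip.open if str(log_path).endswith(".gz") else open
        with opener(log_path, "rt", encoding="utf-8") as src:
            header = src.readline()  # first line holds the column names
            while True:
                # Take the next batch of data lines (fewer at end of file).
                chunk = list(itertools.islice(src, lines_per_chunk))
                if not chunk:
                    break
                counter += 1
                target = stream_dir / f"log{counter}"
                tmp = stream_dir / f".log{counter}.tmp"
                body = (header if repeat_header else "") + "".join(chunk)
                tmp.write_text(body, encoding="utf-8")
                # Rename so the file appears in the directory in one step.
                tmp.replace(target)
                written.append(target)
                sleep(interval_s)  # pause before emitting the next chunk
    return written
```

Calling `emit_chunks(["2022-06-01.csv.gz"], "stream_in")` would, under these assumptions, write one file of up to 1000 data lines into the hypothetical `stream_in` directory every 5 seconds. The `sleep` parameter exists only so that the delay can be replaced when trying the function out on a small sample.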