Big Data Management
Repeat Assignment
In this assignment your task is to perform some analytics on a dataset and prepare the
batch layer, serving layer and speed layer of the lambda architecture. You will be
using the CRAN package download logs. These log files contain all hits to the mirror
related to downloads of R packages. The raw log files have been parsed into CSV and
anonymised. Since these logs contain a massive amount of data (from 2012 to date), you
will only be using the logs for the first week of June 2022 (1st June - 5th June). These
log files are available at:
The package download log contains data about the following variables:
date: Download date
time: Download time (in UTC)
size: Package size (in bytes)
r_version: Version of R used to download package
r_arch: Processor architecture (i386 = 32 bit, x86_64 = 64 bit)
r_os: Operating System (darwin9.8.0 = mac, mingw32 = windows)
package: Name of the package downloaded
country: Two letter ISO country code
ip_id: A daily unique id assigned to each IP address
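As a minimal sketch of how such rows might be turned into records, the following assumes that each CSV file starts with a header row naming the variables above and that missing values are written as the string NA. Both are assumptions about the files, not something stated in this brief. The function name parse_log and the sample rows are made up for illustration.

```python
import csv
import io

# Assumption: how missing values appear in the CSV files.
MISSING = ("", "NA")


def parse_log(text):
    """Parse CSV text with a header row into a list of dicts.

    Missing values become None and size becomes an int where present.
    """
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        record = {k: (None if v in MISSING else v) for k, v in row.items()}
        if record.get("size") is not None:
            record["size"] = int(record["size"])
        records.append(record)
    return records


# Made-up sample rows, for illustration only.
SAMPLE = (
    '"date","time","size","r_version","r_arch","r_os","package","country","ip_id"\n'
    '"2022-06-01","00:00:05",2945432,"4.2.0","x86_64","mingw32","ggplot2","IE",1\n'
    '"2022-06-01","00:00:09",81234,NA,NA,NA,"rlang","US",2\n'
)

for rec in parse_log(SAMPLE):
    print(rec["package"], rec["size"], rec["r_os"])
```

Reading the column names from the header, rather than hard-coding their order, keeps the sketch independent of the exact column layout of the files.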
You will answer the same questions as in Assignment 1, but this time you are required to
use the SQL API of Apache Spark. Prepare Cassandra structures and Spark code that saves
the precomputed data into these structures:
1. Show number of downloads for package ggplot2.
2. Highest number of downloads by a country. Show its name.
3. Top 10 smallest sized packages.
4. What are the top 10 least popular packages?
5. Highest number of downloads by an Operating System.
6. What is the most popular package in Ireland?
7. What is the highest number of downloads by a single machine?
8. What OS is the least popular among the R programmers?
9. How many users use macOS?
10. List total number of incomplete records - lines which have missing values.
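Some of the questions leave room for interpretation. The plain-Python sketch below makes one possible reading explicit, so that results from Spark can be cross-checked on a small sample. It assumes that a "machine" is a (date, ip_id) pair, because ip_id is only unique within a day, that a macOS user is a machine whose r_os starts with "darwin", and that a record is incomplete when any field is missing. These readings, the helper names and the sample rows are assumptions, not part of the brief.

```python
from collections import Counter

# Made-up records for illustration; None marks a missing value.
ROWS = [
    {"date": "2022-06-01", "ip_id": "1", "r_os": "darwin17.0", "package": "ggplot2"},
    {"date": "2022-06-01", "ip_id": "1", "r_os": "darwin17.0", "package": "dplyr"},
    {"date": "2022-06-02", "ip_id": "1", "r_os": "mingw32", "package": "ggplot2"},
    {"date": "2022-06-02", "ip_id": "2", "r_os": None, "package": "rlang"},
]


def machine(row):
    # ip_id is reassigned daily, so a machine is identified per day.
    return (row["date"], row["ip_id"])


def max_downloads_by_machine(rows):
    """Question 7: the largest download count of any (date, ip_id) pair."""
    counts = Counter(machine(r) for r in rows)
    return max(counts.values()) if counts else 0


def macos_users(rows):
    """Question 9: distinct machines whose r_os looks like macOS (darwin*)."""
    return len({machine(r) for r in rows if (r["r_os"] or "").startswith("darwin")})


def incomplete_records(rows):
    """Question 10: rows in which at least one field is missing."""
    return sum(1 for r in rows if any(v is None for v in r.values()))


print(max_downloads_by_machine(ROWS), macos_users(ROWS), incomplete_records(ROWS))
```

Under this reading the same person downloading on two different days counts as two machines, since the logs give no way to link ip_id values across days.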
Task 1 - 20%
For your first task, you are required to use Apache Spark RDD transformations and
actions to answer the above questions about the dataset.
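One way the RDD approach could look for questions 1, 2 and 4 is sketched below. It is not a model answer. It assumes an existing SparkContext is passed in as sc, that the files carry a header row, and that header is the list of column names taken from that row. The function names are hypothetical.

```python
import csv
from operator import add


def parse_line(line):
    """Split one CSV line into a list of fields (handles quoted values)."""
    return next(csv.reader([line]))


def rdd_answers(sc, path, header):
    """Answer questions 1, 2 and 4 with RDD transformations and actions.

    sc is an existing SparkContext; header is the list of column names.
    """
    pkg, country = header.index("package"), header.index("country")
    rows = (
        sc.textFile(path)
        .map(parse_line)
        # Drop header rows and lines with an unexpected number of fields.
        .filter(lambda f: f != header and len(f) == len(header))
    )
    rows.cache()  # reused by the three actions below

    # Question 1: filter, then count.
    q1 = rows.filter(lambda f: f[pkg] == "ggplot2").count()
    # Question 2: count per country, then take the pair with the largest count.
    q2 = (
        rows.map(lambda f: (f[country], 1))
        .reduceByKey(add)
        .max(key=lambda kv: kv[1])
    )
    # Question 4: count per package, then take the 10 smallest counts.
    q4 = (
        rows.map(lambda f: (f[pkg], 1))
        .reduceByKey(add)
        .takeOrdered(10, key=lambda kv: kv[1])
    )
    return q1, q2, q4
```

The header could be obtained with parse_line(sc.textFile(path).first()). Rows with a missing country are not filtered out here, so a placeholder value may need to be excluded before answering question 2.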
Task 2 - 20%
In this task you are required to use Apache Spark’s SQL API to answer the above questions
about the dataset. Store the results for each question in Apache Cassandra.
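The sketch below shows one possible shape for this task: a Spark SQL statement per question, a matching Cassandra table per question, and a loop that writes each result. The view name logs, the keyspace bdm, the table names and the function save_views are all hypothetical. The write path assumes the Spark Cassandra Connector is available to the Spark session, and the read assumes a header row with NA marking missing values.

```python
# Spark SQL text for three of the questions, keyed by target table name.
QUERIES = {
    "q1_ggplot2": (
        "SELECT 'ggplot2' AS package, COUNT(*) AS downloads "
        "FROM logs WHERE package = 'ggplot2'"
    ),
    "q2_top_country": (
        "SELECT country, COUNT(*) AS downloads FROM logs "
        "GROUP BY country ORDER BY downloads DESC LIMIT 1"
    ),
    "q4_least_popular": (
        "SELECT package, COUNT(*) AS downloads FROM logs "
        "GROUP BY package ORDER BY downloads ASC LIMIT 10"
    ),
}

# Matching CQL definitions; column names mirror the query output columns.
KEYSPACE_CQL = (
    "CREATE KEYSPACE IF NOT EXISTS bdm WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 1}"
)
TABLES = {
    "q1_ggplot2": "CREATE TABLE IF NOT EXISTS bdm.q1_ggplot2 "
                  "(package text PRIMARY KEY, downloads bigint)",
    "q2_top_country": "CREATE TABLE IF NOT EXISTS bdm.q2_top_country "
                      "(country text PRIMARY KEY, downloads bigint)",
    "q4_least_popular": "CREATE TABLE IF NOT EXISTS bdm.q4_least_popular "
                        "(package text PRIMARY KEY, downloads bigint)",
}


def save_views(spark, path, keyspace="bdm"):
    """Run each query on the logs and append the result to its Cassandra table.

    spark is an existing SparkSession; the tables above must already exist.
    """
    df = spark.read.csv(path, header=True, inferSchema=True, nullValue="NA")
    df.createOrReplaceTempView("logs")
    for table, sql in QUERIES.items():
        (
            spark.sql(sql)
            .write.format("org.apache.spark.sql.cassandra")
            .options(table=table, keyspace=keyspace)
            .mode("append")
            .save()
        )
```

Keeping one small table per question means each precomputed view can be read back with a single query, which fits the serving-layer role described in the brief. The CQL statements would be run once, for example in cqlsh, before calling save_views.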
Task 3 - 20%
In the last task, you are required to use Apache Spark’s Streaming API to compute the
real-time views for the questions. To store these views you need to use Apache
Cassandra. To emulate a live stream of the download logs, you are required to write a
separate Python script that reads 1000 lines every 5 seconds from each log file and stores
them as separate files (log1, log2, log3, etc.) in the streaming directory on which
your application is listening.
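The emulation script described above could be sketched as follows, using only the standard library. This is one reading of the requirement: the log files are processed one after another and chunk numbering continues across files. The function name emulate_stream and its parameters are hypothetical, and dropping the header row assumes the files have one.

```python
import itertools
import os
import time


def emulate_stream(log_paths, stream_dir, chunk_lines=1000, interval=5.0,
                   skip_header=True):
    """Copy log files into stream_dir as log1, log2, ... in fixed-size chunks.

    Returns the number of chunk files written.
    """
    os.makedirs(stream_dir, exist_ok=True)
    written = 0
    for path in log_paths:
        with open(path, encoding="utf-8") as src:
            if skip_header:
                next(src, None)  # drop the header row (assumed present)
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                written += 1
                final = os.path.join(stream_dir, f"log{written}")
                # Write under a dot-prefixed temporary name first. This relies
                # on the listener skipping hidden files, which is an assumption.
                tmp = os.path.join(stream_dir, f".log{written}.tmp")
                with open(tmp, "w", encoding="utf-8") as out:
                    out.writelines(chunk)
                os.replace(tmp, final)  # rename so the file appears complete
                time.sleep(interval)
    return written
```

Writing each chunk to a temporary name and renaming it afterwards is a design choice: a listener watching the directory should then only ever see complete files. Because the header is dropped, the streaming application would need to supply the column names itself.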
• Submit your solution on Moodle by the specified deadline.
• Acceptable file format: Python notebook - name it student name.ipynb. The
notebook should be exported as an IPython Notebook with the *.ipynb extension. If the
code in your notebook does not run, it will result in a 20% penalty.
• Take two screenshots of your solution to each question (code + its output into the
Cassandra table where applicable), insert them in a Word document, and generate a
PDF of this document.
• Zip both files together and submit your solution on Moodle by the deadline.