240A Winter 2016 HW2 (Web Traffic Analysis with Hadoop/MapReduce)

Due on March 1, 2016.

Updated on Feb 23, 2016: clarify the submission requirement.

This exercise is to analyze an Apache web log using Hadoop MapReduce; it is optional for non-computer-science students.

Sample code and Hadoop usage on Comet

What to do

Use Java MapReduce to parallelize your log processing (or at least most of it); you may use other languages for postprocessing. For Java, modify the sample code to analyze the Apache log dataset under /home/tyang/cs240sample/log/trafficdata. There are 3 Apache log files recording page views over one week: apache1.splunk.com, apache2.splunk.com, and apache3.splunk.com. Since the raw dataset is too small to show any performance impact, duplicate each data file 16 times under different names to artificially increase the traffic; the dataset to process thus contains 48 files. A sketch of such a job appears below.
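To make the structure concrete, here is a minimal sketch of a MapReduce job that counts page views per requested URL in common-log-format files. This is an illustration only, not the provided sample code: the class name LogViewCount and the parsing regular expression are assumptions, and you should adapt the parsing to whatever fields your analysis actually needs.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical class name; the real starting point is the sample
    // code under /home/tyang/cs240sample/log.
    public class LogViewCount {

      public static class LogMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Matches the request field of a common-log-format line,
        // e.g. "GET /index.html HTTP/1.0".
        private static final Pattern REQUEST =
            Pattern.compile("\"[A-Z]+ (\\S+) HTTP");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          Matcher m = REQUEST.matcher(value.toString());
          if (m.find()) {              // skip malformed lines
            url.set(m.group(1));
            context.write(url, ONE);   // one page view for this URL
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));  // total views per URL
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log view count");
        job.setJarByClass(LogViewCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Build this into a jar and run it over the 48-file input directory, for example with "hadoop jar logviewcount.jar LogViewCount <input dir> <output dir>" (the jar name here is hypothetical).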

Record the overall execution time of the above tasks as the number of allocated machine nodes varies from 1 to 2 to 4. You can use the Unix utility "time", as illustrated in the sample script.
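If you also want the elapsed time reported from inside the driver rather than by the shell, you can wrap the job submission yourself. A minimal sketch, assuming the main method of the LogViewCount example above:

    // Replace the last line of main() with:
    long start = System.currentTimeMillis();
    boolean ok = job.waitForCompletion(true);  // blocks until the job finishes
    double seconds = (System.currentTimeMillis() - start) / 1000.0;
    System.out.printf("Job finished in %.1f s%n", seconds);
    System.exit(ok ? 0 : 1);

Note this measures only the Hadoop job itself, while "time" on the launch script also captures startup and any postprocessing.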

The default Hadoop system parameters are listed here. You may tune performance by adjusting some of these parameters (e.g., the number of map tasks or the input split size). A discussion of how to tune them is here.
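As one example of such tuning, the input split size can be capped from the driver so the 48 input files are cut into more, smaller splits and therefore more map tasks. A sketch, assuming the LogViewCount driver above; the property key is the standard Hadoop 2.x name, so check it against the Hadoop version installed on Comet:

    Configuration conf = new Configuration();
    // Cap splits at 32 MB: smaller splits -> more map tasks,
    // at the cost of extra per-task startup overhead.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                 32L * 1024 * 1024);
    Job job = Job.getInstance(conf, "log view count (tuned)");
    job.setNumReduceTasks(4);  // the reducer count is also worth varying

Whether more map tasks actually help depends on the node count, so measure each configuration with the same timing method as above.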

What to submit

Additional References