240A Winter 2016 HW2 (Web Traffic Analysis with Hadoop/MapReduce)

Due on March 1, 2016.

Updated on Feb 23, 2016: clarify the submission requirement.

This exercise is to analyze an Apache web log using Hadoop MapReduce; it is optional for non-computer-science students.

Sample code and Hadoop usage on Comet

What to do

Use Java MapReduce to parallelize your log processing (or at least most of it); you may use other languages for postprocessing. For Java, modify the sample code to analyze the Apache log dataset under /home/tyang/cs240sample/log/trafficdata. There are 3 Apache log files recording page views over one week: apache1.splunk.com, apache2.splunk.com, and apache3.splunk.com. Since the raw dataset is too small to show any performance impact, duplicate each data file 16 times under different names to artificially increase the traffic; the dataset to process thus contains 48 files. A sketch of such a job appears below.
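To make the structure concrete, here is a minimal sketch of a MapReduce job that counts page views per requested URL in common-log-format files. This is an illustration only, not the provided sample code: the class name LogViewCount and the parsing regular expression are assumptions, and you should adapt the parsing to whatever fields your analysis actually needs.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical class name; the real starting point is the sample
    // code under /home/tyang/cs240sample/log.
    public class LogViewCount {

      public static class LogMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Matches the request field of a common-log-format line,
        // e.g. "GET /index.html HTTP/1.0".
        private static final Pattern REQUEST =
            Pattern.compile("\"[A-Z]+ (\\S+) HTTP");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          Matcher m = REQUEST.matcher(value.toString());
          if (m.find()) {              // skip malformed lines
            url.set(m.group(1));
            context.write(url, ONE);   // one page view for this URL
          }
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));  // total views per URL
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log view count");
        job.setJarByClass(LogViewCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Build this into a jar and run it over the 48-file input directory, for example with "hadoop jar logviewcount.jar LogViewCount <input dir> <output dir>" (the jar name here is hypothetical).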

Record the overall execution time of the above tasks as the number of allocated machine nodes varies from 1 to 2 to 4. You can use the Unix utility "time", as illustrated in the sample script.
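If you also want the elapsed time reported from inside the driver rather than by the shell, you can wrap the job submission yourself. A minimal sketch, assuming the main method of the LogViewCount example above:

    // Replace the last line of main() with:
    long start = System.currentTimeMillis();
    boolean ok = job.waitForCompletion(true);  // blocks until the job finishes
    double seconds = (System.currentTimeMillis() - start) / 1000.0;
    System.out.printf("Job finished in %.1f s%n", seconds);
    System.exit(ok ? 0 : 1);

Note this measures only the Hadoop job itself, while "time" on the launch script also captures startup and any postprocessing.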

The default Hadoop system parameters are listed here. You may tune performance by adjusting some of these parameters (e.g., the number of map tasks or the input split size). A discussion of how to tune them is here.
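As one example of such tuning, the input split size can be capped from the driver so the 48 input files are cut into more, smaller splits and therefore more map tasks. A sketch, assuming the LogViewCount driver above; the property key is the standard Hadoop 2.x name, so check it against the Hadoop version installed on Comet:

    Configuration conf = new Configuration();
    // Cap splits at 32 MB: smaller splits -> more map tasks,
    // at the cost of extra per-task startup overhead.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                 32L * 1024 * 1024);
    Job job = Job.getInstance(conf, "log view count (tuned)");
    job.setNumReduceTasks(4);  // the reducer count is also worth varying

Whether more map tasks actually help depends on the node count, so measure each configuration with the same timing method as above.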

What to submit

Additional References