Updated on Feb 23, 2016: clarified the submission requirement.
This exercise is to analyze an Apache log using Hadoop MapReduce. It is optional for non-computer-science students.
Sample code and Hadoop use at Comet
To compile, allocate a dedicated node and run "make" in the log directory. To run, use the script "sumit-log-comet.sh" in the log directory.
What to do
Use Java MapReduce to parallelize your processing (or most of it); you may use other languages for postprocessing. For Java, modify the sample code to analyze the Apache log dataset under /home/tyang/cs240sample/log/trafficdata. There are 3 Apache log files recording page views over a week: apache1.splunk.com, apache2.splunk.com, and apache3.splunk.com. Since the raw dataset is too small to show a performance impact, duplicate each data file 16 times under different names to artificially increase the traffic. The dataset to process then contains 48 files.
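Since the sample code itself is not reproduced in this handout, the following is only a minimal sketch of a job in that style, assuming Hadoop 2.x: a mapper that extracts the requested URL from each Apache log line and a reducer that sums page views per URL. The class names (LogPageViews, ParseMapper, SumReducer) and the regular expression are illustrative assumptions, not the provided sample code.

    // Illustrative sketch only, not the course sample code: count page
    // views per requested URL in Apache common/combined-format logs.
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogPageViews {

      // Mapper: extract the requested URL from each log line, emit (url, 1).
      public static class ParseMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        // Matches the request field of a log line, e.g. "GET /index.html HTTP/1.1"
        private static final Pattern REQUEST =
            Pattern.compile("\"[A-Z]+ (\\S+) [^\"]*\"");
        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          Matcher m = REQUEST.matcher(value.toString());
          if (m.find()) {               // silently skip malformed lines
            url.set(m.group(1));
            context.write(url, ONE);
          }
        }
      }

      // Reducer: sum the per-URL counts.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log page views");
        job.setJarByClass(LogPageViews.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation cuts shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Using the reducer as a combiner is safe here because summation is associative and commutative; it reduces the data shuffled between map and reduce tasks, which matters once the 48 duplicated files are in play.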
The default Hadoop system parameters are listed here. You may tune performance by adjusting some of these parameters (e.g., the number of map tasks or the split size). A discussion of how to tune them is here.
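As a hedged illustration of that tuning, assuming the standard Hadoop 2.x property names, the split size and map-task count can be set programmatically before job submission. The values below are arbitrary placeholders, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TuningExample {
      public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        // Cap the input split size at 32 MB so each input file yields more,
        // smaller map tasks (32 MB is an arbitrary example value).
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
            32L * 1024 * 1024);
        // Hint at the desired number of map tasks; the framework treats this
        // as advisory and may override it based on the actual splits.
        conf.setInt("mapreduce.job.maps", 48);
        return Job.getInstance(conf, "tuned log job");
      }
    }

Shrinking the maximum split size increases the number of map tasks, which can improve load balance on a small dataset like this one, at the cost of extra per-task startup overhead.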
What to submit