Lab 8: Word Frequency Counter

Goals for this lab

By the time you have completed this lab, you should:

  • know what a hash table is
  • learn about the C++ standard template library (STL)
  • have more experience working with strings in C++

Lab preparation

After logging in, create the directory for this lab:

mkdir -p cs24/lab8

Change into the lab2 directory and copy the files you will need for this lab:

cd cs24/lab8
cp ~bboe/public_html/cs24_m13/code/lab8/* .

Open up README.txt with your favorite text editor and be sure to add your and your partner's name so that you both receive credit.

Hash Table

Definition

Hash table are one of the basic data structures. It is a mapping between a unique key and its value, that is usually stored in the form of a table (or array). To get the value by knowing the key we will use a hash function. A hash function tells us where in the array a value for the given key is stored. It calculates the (in theory unique) index in the array from the key.

For example, we have a small hash table to identify if the number is odd or even. It will have two records in its array:

0 Even
1 Odd

Now we provide hash function as mod 2 and a number (key) num, that we want to check. We calculate index i = num % 2 and look for an answer (value) at array[i]. If num = 10 then i = 10 % 2 = 0 and array[0] = "Even". If num = 11 then i = 11 % 2 = 1 and array[1] = "Odd".

We can also use hash tables to count the number of odds and evens in some given sequence.

Now our table array has two recores:

0 Number of evens
1 Number of odds

If we are encountering an even or an odd at some point in the code, we can simply increase corresponding number in the table via array[num % 2]++;

In this case our key, that is some number, maps to a value that is a number of the occurences of the same type numbers.

Tomorrow in class we will discuss implementation details of hash tables. However, for today's lab you will just need to use an existing hash table. Hash tables in general can store many more than 2 items, in fact they can store an unbounded number of items with the right implementation.

Standard Template Library

In the previous labs we have talked about C++ templates. Turns out there is a library with a bunch of very useful templates that are ready to use.

vector

One template is the vector template that allows you to create a dynamic array of an arbitrary type and provides methods for dynamic memory allocation. Here is an example of code that creates a string array and adds

#include <vector>
#include <iostream>
using namespace std;

int main(){
    vector<int> nums(3, 5);  // create array of 3 ints, initialize to 5s
                             // nums = [5, 5, 5]
    nums.push_back(6);       // add 6 in the end
                             // nums = [5, 5, 5, 6]
    nums.pop_back();         // remove the last element
                             // nums = [5, 5, 5]
    for(vector<int>::iterator i = nums.begin(); i != nums.end(); ++i)
        cout << *i << endl;  // prints out the content of the nums
    return 0;
}

As a template is a container built to use with an arbitrary type, it makes sence to implement a special type of pointers, that would work with containers. These are called iterators. In the example above, to iterate through a vector of integers we declare (right in the middle of the code) vector<int>::iterator i. We say that we start from the first element i = vector.begin() and we print numbers until i is pointing to the "end" element, which is one after the last in the vector i != nums.end(). Iterators have very similar arithmetics to pointers. i++ will move iterator to one position forward, nums.begin() + 4 will be pointing to the element at position nums[4] and *i will return a value that is stored at a current position.

string

Another example are strings. Strings are containers specificaly to store strings and provide a number of popular methods.

#include <iostream>
#include <string>
using namespace std;

int main() {
    // Assume the standard input stream contains "abc qwe\n345\n"
    string s1, s2, s3;
    cin >> s1;  // "abc"
    cin >> s2;  // "qwe"
    cin >> s3;  // "345"
    s1 = s1 + s2 + s3;
    cout << s1 << endl;  // abcqwe345
    return 0;
}

unordered_map (hash table)

One more example is unordered_map, that is an implementation of a hash table mapping key to value.

#include <iostream>
#include <unordered_map>
using namespace std;

int main() {
    int value;
    string key("abc");
    unsorted_map<string, int> hash_table;  // initialize the hash table

    // test if a key is contained in the hash table
    cout << (hash_table.find(key) != hash_table.end()) << endl;  // 0

    // Accessing an item that was not previously contained will add it
    // using the default constructor (initialize to 0 for type int)
    cout << hash_table[key] << endl;  // 0

    // test if a key is contained in the hash table
    cout << (hash_table.find(key) != hash_table.end()) << endl;  // 1

    hash_table[key]++;  // Increment the key
    cout << hash_table[key] << endl;  // 1

    return 0;
}

If we want to iterate trough the table, then we have to create appropriate iterator unordered_map<string, int>::iterator it = hash_table.begin();

Each element returned by the hash table iterator is of type pair that contains two attributes: first, the key; and second, the value. To refer separately to the key try it->first, for the value try it->second.

NOTE: The unordered_map class does not exist in older versions of C++. Thus in order to compile this program (and your lab8 program) you will need to add -std=c++11 to the clang compile line.

sort

Finally, there is a numper of popular algorithms, implemented in STL. For example, in this lab we will be using sort. Sort will sort elements in ascending order, unless provided a sorting function that specifies how we compare two elements.

For example, if we wanted to output the elements of our hash table in reverse (descending) order based on the key we would do the following:

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

bool sort_reverse(const pair<string, int> &a, const pair<string, int> &b) {
  return a.first > b.first;
}

int main(){
  // Initialize the hash table with some fun values
  unordered_map<string, int> ht = {{"Bananas", 10}, {"Apples", 16},
                                   {"Pears", 8}, {"Zebras", 1}};

  // unorderd_map iterators cannot be sorted so we must copy its contents
  // into a vector
  vector<pair<string, int> > values(ht.begin(), ht.end());
  // sort all of the values in the vector using the function `reverse` for
  // comparisons
  sort(values.begin(), values.end(), sort_reverse);

  // Output the items
  // The `auto` keyword in C++11 will automatically determine its type.
  // In this case its type is: vector<pair<string, int> >::iterator
  for(auto it = values.begin(); it != values.end(); ++it)
    cout << it->first << " " << it->second << endl;

  return 0;
}

Writing frequency.cpp

In this lab you will be writing a program that calculates number of ocurrences of words in a standard input. It should print out statistics on the encountered words. The order of the output should be such that the most frequently used words are listed first, and words of the same frequency are sorted alphanumerically.

Words can include punctuation and are separated by any form of whitespace. Using cin >> term where term is a string will handle the whitespace issue for you. All comparisons should be case insensitive, meaning that IS, Is, iS, and is should all be counted as one word, is. You will need to write a function that converts a string to its lowercase form. You probably want to use the tolower function to help you with that.

For example echo "ASD aSd asd sdf dfg." | ./a.out should produce the output:

3 asd
1 dfg.
1 sdf

Submitting the project

Only one person in a group need submit the project. If both members of a group submit we will only score the last submission made between the two group members. Please review the submission instructions as needed. On the submission site you will find the command you need to use to submit the project. Note that you may resubmit this project as many times as necessary up until the deadline.

 
 

Template design by Andreas Viklund