Project 1 Classifier Agent (CS 165A Spring 2022, UC Santa Barbara)

Instructor: Yu-Xiang Wang

Logistics and timeline

Introduction

In this project, you will design an agent to read movie reviews and decide whether the review is positive or negative.

I love Eddie Izzard. I think this is awesome, and the other television specials should be looked at as well. He has a good book "Dress To Kill" out to buy as well, which I think people should read. I loved that this program won an Emmy, and anyone who likes history will probably get a laugh from Eddie. Enjoy :)
Though it had the misfortune to hit the festival circuit here in Austin (SXSW Film) just as we were getting tired of things like Shakespeare in Love, and Elizabeth, this movie deserves an audience. An inside look at the staging of "The Scottish Play" as actors call "Macbeth" when producing it to avoid the curse, this is a crisp, efficient and stylish treatment of the treachery which befalls the troupe. With a wonderfully evocative score, and looking and sounding far better than its small budget would suggest, this is a quiet gem, not world-class, but totally satisfying.
I just watched Congo on DVD.In most cases I love these kind of movies but this one is different. It made me write my first comment for a movie on IMDb. I was amazed how such a team of experienced filmmakers could come up with this movie as a result. You can see there was a lot of money for this production but you can't make a good movie if you don't have a good script. And as a producer Frank Marshall gave us plenty of great movies to watch; he never should have tried to become another Spielberg. This one shows how hard it is to make a good movie, maybe you've got all the ingredients but if you can't cook stay out of the kitchen. If Can make a suggestion don't spend your money on this one. If you want to see it watch it on television first and make up your own mind.
This is one of Crichton's best books. The characters of Karen Ross, Peter Elliot, Munro, and Amy are beautifully developed and their interactions are exciting, complex, and fast-paced throughout this impressive novel.<br /><br />And about 99.8 percent of that got lost in the film. Seriously, the screenplay AND the directing were horrendous and clearly done by people who could not fathom what was good about the novel. I can't fault the actors because frankly, they never had a chance to make this turkey live up to Crichton's original work. I know good novels, especially those with a science fiction edge, are hard to bring to the screen in a way that lives up to the original. But this may be the absolute worst disparity in quality between novel and screen adaptation ever. The book is really, really good. The movie is just dreadful.

What do you need to do?

1. Basic coding requirements

The basic part of the project requires you to complete the implementation of two Python classes: (a) a "feature_extractor" class and (b) a "classifier_agent" class.

The "feature_extractor" class will be used to process a paragraph of text like the above into a Bag of Words feature vector.
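As a rough illustration of the idea (not the template's actual interface), a Bag of Words vector simply counts how many times each vocabulary word appears in the text. The function name `bag_of_words`, the lowercase whitespace tokenizer, and the toy `vocab` dictionary below are assumptions made for this sketch.

```python
import numpy as np

def bag_of_words(text, vocab):
    """Count how often each vocabulary word occurs in `text`.

    `vocab` maps a word to its index in the feature vector (hypothetical layout).
    """
    x = np.zeros(len(vocab))
    for token in text.lower().split():
        idx = vocab.get(token)
        if idx is not None:  # words outside the vocabulary are ignored
            x[idx] += 1
    return x

vocab = {"good": 0, "bad": 1, "movie": 2}
x = bag_of_words("Good movie , really good", vocab)
print(x)  # counts for "good", "bad", "movie"
```

Word order is discarded entirely, which is what makes the representation a "bag".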

The "classifier_agent" class implements the main functions of a binary linear classifier agent: making predictions, learning from labeled training data, and evaluating the agent's performance. For the learning part, you will implement the gradient descent and stochastic gradient descent algorithms that we covered in the lectures.

Through the exercise, you will understand the elements of an agent program and the Modeling / Inference / Learning paradigm of an AI agent design.

The implementation of the learning algorithm will require you to use a loss function that is slightly different from the version of the logistic loss that we learned in the lecture with $\mathcal{Y} = \{-1,1\}$. We will instead use what we call the Cross-Entropy loss, which works with $\mathcal{Y} = \{0,1\}$. The cross-entropy loss for the linear classifier is defined as $$ \ell(w, (x,y)) = -\bigl(y \log \hat{p}_w(x) + (1-y)\log(1-\hat{p}_w(x))\bigr), $$ where $$\hat{p}_w(x) = \frac{\exp(w^Tx)}{1 + \exp(w^Tx)}$$ is the probabilistic prediction of the classifier. Here $x\in\mathbb{R}^d$ is the feature vector and $w \in \mathbb{R}^d$ is the weight vector.
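The following is a minimal sketch of how the definitions above might be computed for a single example. The function names `predict_proba` and `cross_entropy` are hypothetical and are not the ones in the template. Writing $s = w^Tx$, the loss simplifies algebraically to $\log(1+\exp(s)) - ys$, which the sketch evaluates with `np.logaddexp` to avoid overflow for large $|s|$.

```python
import numpy as np

def predict_proba(w, x):
    """p_hat_w(x) = exp(w^T x) / (1 + exp(w^T x))."""
    s = np.dot(w, x)
    # exp(s - log(1 + exp(s))) equals the formula above but does not overflow
    return np.exp(s - np.logaddexp(0.0, s))

def cross_entropy(w, x, y):
    """Cross-entropy loss for one example with label y in {0, 1}."""
    s = np.dot(w, x)
    # -(y log p + (1 - y) log(1 - p)) simplifies to log(1 + exp(s)) - y * s
    return np.logaddexp(0.0, s) - y * s

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])         # w^T x = 0 for this toy example
p = predict_proba(w, x)          # 0.5
loss = cross_entropy(w, x, 1)    # log(2)
```

A score of zero corresponds to a prediction of 0.5, so the loss equals $\log 2$ regardless of the label.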

The training loss function will be the average of the cross-entropy loss over the training data, i.e., $$ L(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, (x_i,y_i)). $$
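Differentiating the training loss gives $\nabla L(w) = \frac{1}{n}\sum_{i=1}^n (\hat{p}_w(x_i) - y_i)\,x_i$; you should re-derive this yourself rather than take it on trust. Under that formula, gradient descent and stochastic gradient descent can be sketched as below. The function names, step size, iteration counts, and toy data are all assumptions chosen for illustration, not the template's interface or recommended settings.

```python
import numpy as np

def sigmoid(s):
    # Overflow-safe form of exp(s) / (1 + exp(s))
    return np.exp(s - np.logaddexp(0.0, s))

def avg_loss(w, X, y):
    s = X @ w
    return np.mean(np.logaddexp(0.0, s) - y * s)

def avg_gradient(w, X, y):
    # (1/n) * sum_i (p_hat(x_i) - y_i) * x_i
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def gd(X, y, lr=0.5, n_iter=200):
    """Gradient descent: every step uses the full training set."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w -= lr * avg_gradient(w, X, y)
    return w

def sgd(X, y, lr=0.5, n_iter=2000, seed=0):
    """Stochastic gradient descent: every step uses one random example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        i = rng.integers(len(y))
        w -= lr * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w

# Toy data: the label is 1 when the first feature exceeds the second
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > X[:, 1]).astype(float)
w_gd = gd(X, y)
w_sgd = sgd(X, y)
print(avg_loss(w_gd, X, y), avg_loss(w_sgd, X, y))
```

Each SGD step is much cheaper than a GD step, but its updates are noisy, so the two methods usually need different step sizes and iteration counts.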

Make sure you check that your gradient calculations are correct before you implement them.
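One common way to check a gradient is to compare it with a finite-difference approximation. The sketch below does this for the average cross-entropy loss on random toy data. The helper `numerical_gradient` and all variable names are hypothetical, and the analytic gradient used here is the one obtained by differentiating the loss defined above.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        g[j] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Random toy problem used only for the check
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# Average cross-entropy loss, written as log(1 + exp(s)) - y * s
loss = lambda v: np.mean(np.logaddexp(0.0, X @ v) - y * (X @ v))

# Analytic gradient: (1/n) * sum_i (p_hat(x_i) - y_i) * x_i
p = 1.0 / (1.0 + np.exp(-(X @ w)))
analytic = X.T @ (p - y) / len(y)

max_diff = np.max(np.abs(analytic - numerical_gradient(loss, w)))
print(max_diff)  # should be tiny if the analytic gradient is right
```

If the two gradients disagree by more than a small tolerance, the derivation or the code most likely contains a sign or indexing error.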

You should be able to get roughly 85\% test accuracy with this baseline classifier using the vanilla Bag of Words features.
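Accuracy here means the fraction of reviews whose predicted label matches the true label. A minimal sketch, assuming the rule "predict 1 when $w^Tx > 0$" (equivalent to $\hat{p}_w(x) > 0.5$); the function names and toy numbers are made up for illustration:

```python
import numpy as np

def predict(w, X):
    # p_hat > 0.5 exactly when the score w^T x is positive
    return (X @ w > 0).astype(int)

def error_rate(w, X, y):
    # Fraction of examples whose predicted label differs from the true label
    return np.mean(predict(w, X) != y)

X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]])
y = np.array([1, 0, 1, 1])
w = np.array([1.0, -1.0])

err = error_rate(w, X, y)   # one of the four toy examples is misclassified
accuracy = 1.0 - err
```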

2. (Optional) advanced coding requirements

The advanced version of the project (bonus question) requires you to come up with better features than Bag-of-Words so as to improve the agent's classification accuracy.

A few suggestions are:

I would encourage you to try at least the tf-idf feature, which by itself already improves accuracy noticeably; you will also learn how to inherit from a Python class by following the template we provide.
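There are several variants of tf-idf. The sketch below assumes one common choice: raw word counts as the term frequency and $\mathrm{idf}_j = \log(N / \mathrm{df}_j)$, where $N$ is the number of documents and $\mathrm{df}_j$ is the number of documents containing word $j$. The function names and the toy count matrix are hypothetical, and the template may expect a different variant.

```python
import numpy as np

def compute_idf(count_matrix):
    """idf_j = log(N / df_j) for an (n_docs, n_words) matrix of word counts."""
    n_docs = count_matrix.shape[0]
    df = np.count_nonzero(count_matrix, axis=0)
    # Words that occur in no document get df = 1 to avoid dividing by zero
    # (an arbitrary choice made for this sketch)
    return np.log(n_docs / np.maximum(df, 1))

def tfidf(counts, idf):
    # Scale each word count by how rare that word is across documents
    return counts * idf

counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [3, 0, 1],
                   [1, 1, 0]], dtype=float)
idf = compute_idf(counts)       # [log(4/4), log(4/2), log(4/1)]
features = tfidf(counts, idf)
```

A word that appears in every document (the first column here) gets weight zero, which is how tf-idf suppresses uninformative words such as "the".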

Otherwise, the task is completely open-ended. You can use anything you like to improve this classifier's accuracy, such as BERT features.

We will set up a leaderboard so those who have the best accuracy will earn extra bonus points.

At some point, it might become computationally challenging to run your code on Gradescope, in which case you should contact the TA, who will manually look into your situation.

3. Report

You need to write a short report (as a Jupyter notebook). We provide a template (Project1_Report_Template.ipynb) with a few required questions to get you started.

Some parts of the report might also give you useful tips on how to interpret the predictions of your agent.

If you have hand-written parts, e.g., for the gradient derivation, you may scan them and include them in the Jupyter notebook. The report (both an *.ipynb file, and a pdf file you print out) should be submitted to Gradescope.

Code that we provide

There are three python files that we provide in the StartupKit.

classifier.py is the file that you will need to edit. You will find that we have provided most of the code that solves the boring part of the problem, e.g., data loading, preprocessing, and so on. The intention is to let you focus on the more creative part of the agent design job and to standardize the pipeline.

All functions you will need to complete will contain clear instructions on the input and required output, and sometimes hints. Implementations are typically short (no more than a few lines of code).

main.py is the file that provides a demo of how you should use the classifier class.

You may type

python main.py

to run it. Notice that it won't run right away.

Another illustration of how you may use functions in classifier.py is in the template report.

Setting up the Python platform

The list of functions you need to complete

Your main task is to design and train this classifier agent by implementing two different types of feature extractors and two different types of learning algorithms.

  1. To complete Basic Coding Requirements, you need to implement the following functions in classifier.py
    • bag_of_word_feature
    • score_function
    • predict error
    • loss_function
    • gradient
    • train_gd
    • train_sgd
  2. To complete Advanced Coding Requirements, you need to implement the following functions in classifier.py

    • tfidf_extractor
    • compute_word_idf

      For other feature extractors you should complete an implementation of

    • custom_feature_extractor

What to submit to gradescope

You need to submit two things:

  1. The completed Python module classifier.py with each function implemented. Please make sure there are no syntax errors; otherwise the autograder won't run. You will be graded for each function you complete.

  2. The parameters of your best trained model in a file best_model.npy, which you may generate by calling python train_custom_model.py. You are responsible for ensuring that classifier.custom_feature_extractor works and that it is compatible with best_model.npy. Note that we will not run training for you. Your custom_feature_extractor class and the parameter vector from best_model.npy will be used to instantiate a classifier that we will evaluate on new examples.
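Before submitting, it may help to confirm that a saved parameter vector loads back correctly and matches the dimension of your features. The sketch below assumes that best_model.npy holds a single 1-D NumPy array; the exact contents written by train_custom_model.py are not described here, so treat that layout, the stand-in weights, and the `n_features` value as hypothetical. It writes to a temporary directory so it cannot overwrite a real model file.

```python
import os
import tempfile
import numpy as np

w = np.array([0.1, -0.2, 0.3])   # stand-in for trained parameters

# Save to a temporary location and load the array back
path = os.path.join(tempfile.mkdtemp(), "best_model.npy")
np.save(path, w)
w_loaded = np.load(path)

# The loaded vector must have one weight per feature your extractor produces
n_features = 3   # hypothetical output dimension of the feature extractor
compatible = w_loaded.shape == (n_features,)
print(compatible)
```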

Grading rules