- Due at 11:59 pm on April 27, 2023 (Thursday)
- Be sure to read the "Policy on Academic Integrity" in the course syllabus
- Any updates or corrections will be posted on Piazza, so check there occasionally
- You may discuss the project with your peers at a high level, but **each student must write his/her own code and report**. You need to declare your collaborators. We will use software to automatically detect plagiarism.
- TAs in charge of the project: Esha, Vihaan and Zihan

In this project, you will design an agent to read movie reviews and decide whether the review is positive or negative.

- Two examples of positive reviews

```
I love Eddie Izzard. I think this is awesome, and the other television specials should be looked at as well. He has a good book "Dress To Kill" out to buy as well, which I think people should read. I loved that this program won an Emmy, and anyone who likes history will probably get a laugh from Eddie. Enjoy :)
```

```
Though it had the misfortune to hit the festival circuit here in Austin (SXSW Film) just as we were getting tired of things like Shakespeare in Love, and Elizabeth, this movie deserves an audience. An inside look at the staging of "The Scottish Play" as actors call "Macbeth" when producing it to avoid the curse, this is a crisp, efficient and stylish treatment of the treachery which befalls the troupe. With a wonderfully evocative score, and looking and sounding far better than its small budget would suggest, this is a quiet gem, not world-class, but totally satisfying.
```

- Two examples of negative reviews

```
I just watched Congo on DVD.In most cases I love these kind of movies but this one is different. It made me write my first comment for a movie on IMDb. I was amazed how such a team of experienced filmmakers could come up with this movie as a result. You can see there was a lot of money for this production but you can't make a good movie if you don't have a good script. And as a producer Frank Marshall gave us plenty of great movies to watch; he never should have tried to become another Spielberg. This one shows how hard it is to make a good movie, maybe you've got all the ingredients but if you can't cook stay out of the kitchen. If Can make a suggestion don't spend your money on this one. If you want to see it watch it on television first and make up your own mind.
```

```
This is one of Crichton's best books. The characters of Karen Ross, Peter Elliot, Munro, and Amy are beautifully developed and their interactions are exciting, complex, and fast-paced throughout this impressive novel.<br /><br />And about 99.8 percent of that got lost in the film. Seriously, the screenplay AND the directing were horrendous and clearly done by people who could not fathom what was good about the novel. I can't fault the actors because frankly, they never had a chance to make this turkey live up to Crichton's original work. I know good novels, especially those with a science fiction edge, are hard to bring to the screen in a way that lives up to the original. But this may be the absolute worst disparity in quality between novel and screen adaptation ever. The book is really, really good. The movie is just dreadful.
```

The basic part of the project requires you to complete the implementation of two Python classes: (a) a "feature_extractor" class, (b) a "classifier_agent" class.

The "feature_extractor" class will be used to process a paragraph of text like the above into a **Bag of Words** feature vector.

The "classifier_agent" class will involve multiple functionalities of a **binary linear classifier** agent. These include functions for making predictions, learning from labeled training data and evaluating agent performance. For the learning part of it, you will implement the **gradient descent** and **stochastic gradient descent** algorithms that we learned from the lectures.
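As a rough illustration of how the two learning algorithms differ, the sketch below runs gradient descent and stochastic gradient descent on a toy least-squares problem. It deliberately does not use the project's loss function. The function names (`gd`, `sgd`), step sizes, and iteration counts are illustrative assumptions and are not part of the start-up kit.

```python
import numpy as np

def gd(X, y, lr=0.1, n_iters=200):
    """Full-batch gradient descent on the loss 1/(2n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / n   # gradient averaged over all n examples
        w -= lr * grad
    return w

def sgd(X, y, lr=0.05, n_epochs=50, seed=0):
    """Stochastic gradient descent: one example per update, reshuffled each epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of a single example's loss
            w -= lr * grad_i
    return w

# Noise-free toy data generated from a known weight vector
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_gd = gd(X, y)
w_sgd = sgd(X, y)
```

Each GD step touches all `n` examples once, while each SGD step touches a single example, so one SGD epoch performs `n` cheap updates for roughly the cost of one GD step.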

Through the exercise, you will understand the elements of an agent program and the Modeling / Inference / Learning paradigm of an AI agent design.

The implementation of the learning algorithm will require you to use a *slightly different* loss function from the version of the logistic loss that we learned in the lecture with $\mathcal{Y} = \{-1,1\}$. We will be using what we called the cross-entropy loss, which works with $\mathcal{Y} = \{0,1\}$ instead.
The cross-entropy loss for the linear classifier is defined as
$$
\ell(w, (x,y)) = -(\log \hat{p}_w(x) y + \log(1-\hat{p}_w(x))(1-y)),
$$
where
$$\hat{p}_w(x) = \frac{\exp(w^Tx)}{1 + \exp(w^Tx)}$$
is the probabilistic prediction of the classifier. Here $x\in\mathbb{R}^d$ is the feature vector and the weights $w \in \mathbb{R}^d$.

The training loss function will be the average of the cross-entropy loss over the training data, i.e., $$ L(w) = \frac{1}{n}\sum_{i=1}^n \ell(w, (x_i,y_i)). $$
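Evaluating the formula literally can overflow in `exp` or take the log of zero when $w^Tx$ is large in magnitude. Writing $z = w^Tx$, the per-example loss simplifies algebraically to $\log(1 + \exp(z)) - yz$, which `numpy.logaddexp` evaluates stably. The sketch below compares the two forms on a few scores. The function names are hypothetical and this is only one possible way to organize the computation, not the start-up kit's required implementation.

```python
import numpy as np

def cross_entropy_naive(z, y):
    """Direct transcription of the definition; fragile for large |z|."""
    p = np.exp(z) / (1.0 + np.exp(z))
    return -(np.log(p) * y + np.log(1.0 - p) * (1.0 - y))

def cross_entropy_stable(z, y):
    """Equivalent form log(1 + exp(z)) - y * z, computed with logaddexp."""
    return np.logaddexp(0.0, z) - y * z

z = np.array([-2.0, 0.0, 3.0])   # scores w^T x for three examples
y = np.array([0.0, 1.0, 1.0])    # labels in {0, 1}

naive = cross_entropy_naive(z, y)
stable = cross_entropy_stable(z, y)

# The stable form stays finite even for an extreme score
big = cross_entropy_stable(np.array([1000.0]), np.array([0.0]))
```

The training loss $L(w)$ is then the mean of these per-example values.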

Make sure you check that your gradient calculations are correct before you implement them.
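One common way to check a hand-derived gradient is to compare it against a finite-difference approximation. The sketch below is a generic checker, demonstrated on a stand-in function $f(w) = \frac{1}{2}\|w\|^2$ rather than on the project's loss. The helper names `numerical_gradient` and `check_gradient` and the tolerance are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at w."""
    grad = np.zeros_like(w, dtype=float)
    for j in range(w.size):
        step = np.zeros_like(w, dtype=float)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

def check_gradient(f, grad_f, w, tol=1e-5):
    """Return True if the analytic gradient agrees with the numerical one."""
    num = numerical_gradient(f, w)
    ana = grad_f(w)
    rel_err = np.linalg.norm(num - ana) / max(1.0, np.linalg.norm(num))
    return rel_err < tol

# Stand-in example: f(w) = 0.5 * ||w||^2 has gradient w
f = lambda w: 0.5 * np.dot(w, w)
good_grad = lambda w: w
bad_grad = lambda w: 2.0 * w      # deliberately wrong, should be rejected

w0 = np.array([0.3, -1.2, 2.0])
ok = check_gradient(f, good_grad, w0)
not_ok = check_gradient(f, bad_grad, w0)
```

To use it on your own derivation, pass your loss as `f` and your gradient as `grad_f`, evaluated on a small dense toy problem so the loop over coordinates stays cheap.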

You should be able to get roughly 85% test accuracy with this baseline classifier using the vanilla Bag of Words features.

Please make sure you have the most up-to-date versions of NumPy and SciPy installed. Feel free to reach out to us if you are facing any package errors caused by previous versions of these packages.

The main idea behind the BoW model is to treat each document as an unordered "bag" of words, ignoring the grammar and word order, but keeping track of the frequency of each word. The model follows these steps:

Tokenization: Break the text into individual words (tokens) using techniques such as whitespace splitting, punctuation removal, and stemming or lemmatization to normalize word forms.

Vocabulary building: Create a dictionary of unique words (vocabulary) found across all documents. Each word in the vocabulary serves as a feature for the feature vectors.

Vectorization: Convert each document into a feature vector where each element corresponds to a word in the vocabulary. The value of each element represents the frequency or presence (binary) of the corresponding word in the document.

##### Example:

Suppose we have the following two text documents:

Document 1: "The cat sat on the mat."

Document 2: "The dog sat on the rug."

After tokenization and vocabulary building, we have:

Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "rug"]

Using the BoW model, we can represent the documents as feature vectors:

Document 1: [2, 1, 1, 1, 1, 0, 0]

Document 2: [2, 0, 1, 1, 0, 1, 1]
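The three steps can be sketched in a few lines. The code below reproduces the two count vectors from this example and builds the result directly in sparse form. The regex tokenizer is a stand-in for the tokenizer provided in the start-up kit, and both the function names and the layout (one column per document) are assumptions made for illustration.

```python
import re
from collections import Counter
from scipy.sparse import csc_matrix

def simple_tokenize(text):
    """Lowercase and keep runs of letters (a stand-in for the provided tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs, vocab):
    """Return a sparse (vocab_size x n_docs) count matrix, one column per document."""
    index = {word: i for i, word in enumerate(vocab)}
    rows, cols, vals = [], [], []
    for j, doc in enumerate(docs):
        # Count only tokens that are in the vocabulary
        counts = Counter(t for t in simple_tokenize(doc) if t in index)
        for word, c in counts.items():
            rows.append(index[word])
            cols.append(j)
            vals.append(c)
    # Build from (value, (row, col)) triplets; no dense array is ever created
    return csc_matrix((vals, (rows, cols)), shape=(len(vocab), len(docs)))

vocab = ["the", "cat", "sat", "on", "mat", "dog", "rug"]
docs = ["The cat sat on the mat.", "The dog sat on the rug."]
X = bag_of_words(docs, vocab)
```

Only the non-zero counts are stored, which matters once the vocabulary has thousands of words and each review uses a small fraction of them.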

The Compressed Sparse Column (CSC) matrix is a format for storing sparse matrices in memory-efficient data structures, which only store the non-zero elements. It is particularly useful for performing matrix operations, such as matrix multiplication, without converting the sparse matrix back to a dense matrix, which could lead to memory issues.

The BoW representations tend to be sparse, meaning that most of the elements in the matrix are zeros. Sparse matrices only store non-zero elements, leading to a significant reduction in memory usage and much more efficient computation compared to dense matrices. The memory and computational efficiency of sparse matrices make it possible to scale BoW representations to large text corpora and vocabularies, which is why we need to use sparse matrices in our project.

Creating a CSC matrix

In our project, DO NOT construct a dense numpy.array then convert to sparse.csc_array. That will defeat its purpose.

###### Method 1: Create a CSC matrix from coordinate (COO) format data

You can create a CSC matrix by providing the row indices, column indices, and values of the non-zero elements in coordinate (COO) format.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Example data in COO format
row_indices = np.array([0, 1, 2])
col_indices = np.array([0, 1, 2])
values = np.array([1, 2, 3])

# Create a CSC matrix
sparse_matrix = csc_matrix((values, (row_indices, col_indices)), shape=(3, 3))
```

###### Method 2: Create a CSC matrix from a Dictionary of Keys (DOK) format

You can create a CSC matrix from a dictionary where keys represent the (row, column) index pairs, and values are the non-zero elements.

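```python
import numpy as np
from scipy.sparse import dok_matrix

# Example data in DOK format
dok_data = {(0, 0): 1, (1, 1): 2, (2, 2): 3}

# Fill a DOK matrix first; assigning entries one by one directly into a
# CSC matrix also works but triggers a SparseEfficiencyWarning
dok = dok_matrix((3, 3), dtype=np.int64)
for (row, col), value in dok_data.items():
    dok[row, col] = value

# Convert to CSC once all entries are set
sparse_matrix = dok.tocsc()
```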

Remember to provide the shape argument when creating the CSC matrix to specify the dimensions of the matrix. This is particularly important when there are empty rows or columns in your sparse matrix.

Performing matrix operations

When working with CSC matrices, always use the functions and methods provided by the scipy.sparse module to perform matrix operations. These methods are designed to work efficiently with sparse matrices and avoid converting them back to dense matrices.

Here's an example of matrix multiplication using CSC matrices:

```python
from scipy.sparse import csc_matrix

# Create two CSC matrices.
A = csc_matrix([[1, 0, 0], [0, 2, 0], [0, 0, 3]])
B = csc_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# Multiply the CSC matrices (sparse).
C = A.dot(B)
```

Tips to avoid converting a CSC matrix back to a dense matrix

Use the scipy.sparse module's functions and methods for matrix operations, as they are designed to work efficiently with sparse matrices.

When using NumPy functions that accept arrays as inputs, make sure to use their scipy.sparse counterparts, if available. For example, use scipy.sparse.vstack instead of numpy.vstack.

Avoid using the toarray() or todense() methods on a CSC matrix, as they will convert the sparse matrix back to a dense matrix. If you need to access individual elements, use the sparse_matrix[i, j] indexing syntax.
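A short sketch of these tips in practice: stacking, scoring, and slicing without ever densifying the feature matrix. The shapes, values, and the layout (features as rows, documents as columns) are assumptions chosen for illustration. `scipy.sparse.hstack` is used here because documents are columns in this layout; `scipy.sparse.vstack` plays the same role when documents are rows.

```python
import numpy as np
from scipy import sparse

# Two sparse feature matrices: 4 features (rows) x 2 documents (columns) each
X1 = sparse.csc_matrix(([1, 2, 1], ([0, 2, 3], [0, 0, 1])), shape=(4, 2))
X2 = sparse.csc_matrix(([3, 1], ([1, 3], [0, 1])), shape=(4, 2))

# Combine the documents with the sparse counterpart of numpy.hstack
X = sparse.hstack([X1, X2], format="csc")

# Sparse matrix times a dense weight vector gives a small dense vector of scores
w = np.array([0.5, -1.0, 0.0, 2.0])
scores = X.T.dot(w)        # one score per document

# Selecting a mini-batch of columns keeps the result sparse
batch = X[:, [0, 3]]
```

Only `scores`, whose length equals the number of documents, is dense; the feature matrix itself stays in sparse form throughout.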

In a good implementation, one epoch (a full data pass) of SGD should take no more than 3-4 seconds on a laptop. If your code is slow when using bag-of-words features, please check whether you accidentally converted the sparse matrix into a dense matrix.

The advanced version of the project (bonus question) requires you to come up with better features than Bag-of-Words so as to improve the agent's classification accuracy.

A few suggestions are:

- a. tf-idf feature (see, e.g., https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- b. n-gram (see, e.g., https://en.wikipedia.org/wiki/N-gram)
- c. remove stopwords (Basically words to discard: https://en.wikipedia.org/wiki/Stop_word)
- d. Other word embedding / sentence embedding, e.g., Word2Vec, Bert, etc.
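To make suggestions (b) and (c) concrete, here is a minimal sketch of n-gram extraction and stopword filtering on a token list. The helper name `ngrams` and the two-word stopword set are purely illustrative assumptions; a real stopword list would be much longer.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

STOPWORDS = {"the", "on"}   # tiny illustrative set, not a recommended list

tokens = "the cat sat on the mat".split()

# Bigram counts can be used as features in the same way as single-word counts
bigram_counts = Counter(ngrams(tokens, 2))

# Stopword removal simply drops tokens before counting
filtered = [t for t in tokens if t not in STOPWORDS]
```

Bigrams capture short phrases such as "not good" that single-word counts cannot distinguish from "good", at the cost of a much larger feature space.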

I would encourage you to try at least tf-idf feature and n-gram, which will bump up your accuracy by quite a bit already; and you will get to learn how to inherit a python class by following the template we provided.
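For tf-idf, one simple variant multiplies each raw count by $\log(N / \mathrm{df})$, where $N$ is the number of documents and $\mathrm{df}$ is the number of documents containing the term. The sketch below applies that variant to a sparse count matrix without densifying it. There are several tf-idf weightings in common use, so treat this choice, the function name `tfidf_from_counts`, and the (features x documents) layout as assumptions, not as the definition used by the TA baseline.

```python
import numpy as np
from scipy import sparse

def tfidf_from_counts(counts):
    """counts: sparse (n_features x n_docs) matrix of raw term counts."""
    counts = sparse.csc_matrix(counts, dtype=float)
    n_docs = counts.shape[1]
    # Document frequency: number of documents with a stored count for each term
    df = counts.getnnz(axis=1)
    # idf = log(N / df); terms that never appear get idf 0 to avoid dividing by zero
    idf = np.zeros(df.shape, dtype=float)
    seen = df > 0
    idf[seen] = np.log(n_docs / df[seen])
    # Scale each row (term) by its idf using a sparse diagonal matrix
    return sparse.diags(idf).dot(counts).tocsc()

# Counts for "The cat sat on the mat." and "The dog sat on the rug."
# with vocabulary ["the", "cat", "sat", "on", "mat", "dog", "rug"]
rows = [0, 0, 1, 2, 2, 3, 3, 4, 5, 6]
cols = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]
vals = [2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
counts = sparse.csc_matrix((vals, (rows, cols)), shape=(7, 2))

tfidf = tfidf_from_counts(counts)
```

Words that occur in every document ("the", "sat", "on") receive weight zero under this variant, while words that distinguish the documents ("cat", "dog") keep a positive weight.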

While you can implement tf-idf, n-gram and other feature extractors yourself (and it is a useful learning experience), we allow any external library that you can find, e.g., sklearn, as long as you wrap it into a `feature_extractor` class. We provide an example in the start-up kit of how to do this with an sklearn-based two-gram implementation.

To say it differently, the task is completely open-ended. You can use any feature extractor you like to improve the accuracy of this classifier's performance, and you can use any algorithm to train your linear classifier.

All we need is for you to provide the extracted features and the learned classifier weights (see the detailed instructions about what to submit to Gradescope below).

A leaderboard will be set up so those who have the best accuracy will earn extra bonus points.

You need to write a short report (with Jupyter notebook). We provide a template (`Project1_Report_Template.ipynb`) with a few required questions to get you started.

Some parts of the report might also give you useful tips on how to interpret the predictions of your agent.

If you have hand-written parts, e.g., for the gradient derivation, you may scan them and include them in the Jupyter notebook. The report (both an *.ipynb file and a pdf file you print out) should be submitted to Gradescope.

There are three Python files that we provide in the StartupKit.

`classifier.py` is the file that you will need to edit. You will find that we have provided most of the code that solves the boring part of the problem, e.g., data loading / preprocessing and so on. The intention is to let you focus on the more creative part of the agent design job and to standardize the pipeline.

- We provide a `tokenizer` that is compatible with the provided `vocab.txt`.
- We provide a template of the `feature_extractor` class. You need to complete its implementation.
- We provide a template of the `classifier_agent` class with its functionality designed.
- We also provide a template of the `tfidf_extractor` class, which inherits from `feature_extractor`. It is an example showing how you may implement other feature extraction methods, so that you can pass them into the `classifier_agent` and all the code you implemented will still work.

All functions you will need to complete will contain clear instructions on the input and required output, and sometimes hints. Implementations are typically short (no more than a few lines of code).

`main.py` is the file that provides a demo of how you should use the classifier class.

You may type

```
python main.py
```

to run it. Notice that it won't run right away.

Another illustration of how you may use the functions in `classifier.py` is in the template report.

Any platform is fine. Please use Python 3.7 or above and make sure that the corresponding `numpy` and `scipy` packages are up to date. You also need to install `matplotlib` for plotting your figures and `pandas` for running some parts of the code that we provide in the report template. To install these standard Python packages, I would suggest using a package manager such as `pip` or `conda`. For debugging, I suggest using either Jupyter Notebook or a Python IDE such as PyCharm.

There are many different platforms and it is hard for us to test exhaustively. Please try the code early and make sure that things work. If not, seek help on Piazza early. Do not wait until the last minute, because it would be devastating to be unable to work on the project due to technical issues.

Further questions regarding setting up the Python environment, Jupyter Notebook and so on should be asked on Piazza, and peer help is encouraged.

The main task for you is to design and train this classifier agent by implementing two different types of feature extractors and two different types of learning algorithms.

To complete *Basic Coding Requirements*, you need to implement the following functions in `classifier.py`:

- `bag_of_word_feature`
- `score_function`
- `predict`
- `error`
- `loss_function`
- `gradient`
- `train_gd`
- `train_sgd`

For an example of how the start-up kit works, go to `main.py`.

To complete *Advanced Coding Requirements*, you need to participate in the leaderboard by coming up with your own feature extractor.

- Complete an implementation of `custom_feature_extractor` or `custom_feature_extractor2` in `classifier.py` with any idea you have.
- Train a linear classifier using your custom feature extractor and save its weights.
- Provide your extracted features for all public test data that you have.

You may use `train_custom_model.py` to create the files you need to submit.

For the **Basic coding requirement**, you need to submit:

- The completed Python module `classifier.py` with each function implemented. Please make sure there are no syntax errors; otherwise the autograder won't run. You will be graded on each function you complete.

For the **Advanced coding requirement**, you need to submit:

- The parameters of your best trained model in a file `best_model.npy` and the corresponding pre-processed feature file `custom_feat_test.npy`, which you may generate by calling `python train_custom_model.py` with your own feature extractor. Do not shuffle your "public" test data.
- The `process_data_and_save_as_file` function will process the sentence input and return the pre-processed features (Xtrain, Xtest) and labels (Ytrain, Ytest); it will also save the pre-processed features under the file name you give. So you need to use this function to process the test data and save Xtest under the file name `custom_feat_test.npy`. The `load_data_from_file` function will load saved feature files such as `custom_feat_test.npy`.

Note that we will not run training for you. Your provided features in `custom_feat_test.npy` and the `parameter_vector` from `best_model.npy` will be used to instantiate a linear classifier. The number you got locally will be reflected on the leaderboard.

- You should also submit **a zip file** that contains `classifier.py` and `train_custom_model.py` and any other instructions so the TAs are able to run your feature extractor. We won't run it right away, but submitting the code is required.

Notice that it is easy for you to get 100% accuracy if you train using the public test data. Please do NOT do that. Ultimately, if you are among the first few on the leaderboard, we will work with you to run your own feature extraction code on new test data, so as to determine its final performance. So overfitting to the public test data won't help you very much, since we will evaluate on new examples.

For the **Project report**, you need to submit:

- A pdf version of the report and its source code in the provided notebook template.

- Basic coding requirements: 70%
- Participation in leaderboard and beating the TA's tf-idf-based baseline: 10% bonus
- Winning the leaderboard (on hidden test data): Top 1: 30% bonus, Top 3: 15% bonus, Top 10: 10% bonus.
- Report: 30% + 5% bonus for the clarity, effort and novelty in your design of the custom feature extractor.