293S Projects
This page will be revised with more project ideas.
Slides based on this page are here
Requirements
- Develop a system prototype or an algorithm implementation for document searching, using one or more datasets.
Your solution can leverage one or more open-source packages.
- Demonstrate the challenge of the problem addressed, and leverage state-of-the-art techniques from top-rated conferences.
- Apply evaluation metrics to assess success (for example, MRR@10 or recall; a minimal sketch appears after this list).
You may study algorithmic solutions or system-performance issues for searching.
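As a concrete example of one such metric, the sketch below computes MRR@10 from per-query rankings and relevance judgments. The dictionary formats are hypothetical stand-ins for whatever output your system produces.

```python
# Minimal sketch: compute MRR@10 given per-query rankings and relevance judgments.
# The dictionaries below are hypothetical stand-ins for your system's output format.

def mrr_at_10(run, qrels):
    """run: {qid: [docid, ...]} ranked best-first; qrels: {qid: set of relevant docids}."""
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

run = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"]}
qrels = {"q1": {"d7"}, "q2": {"d4"}}
print(mrr_at_10(run, qrels))  # q1 contributes 1/2, q2 contributes 0 -> 0.25
```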
Project Timelines
In choosing your own project, you can follow existing work, using recent technical paper(s) published
in top-rated information retrieval/search or related conferences (SIGIR, WWW, WSDM, ACL, EMNLP).
Tips for the class presentation:
- 1. Please include the conference name/year and the authors' affiliations in the slides. Discuss
the motivation, the problem definition, the key techniques/method(s), the dataset/experiment results, and what you plan to do.
- 2. Please provide some example(s) as early as possible in the presentation
to guide the audience, so we can understand the key ideas of the methods.
Please explain the examples patiently.
- 3. Do not write only a few words of text on each slide. Use figures/tables/illustrations together with the text, taking advantage of all the space available in each slide.
- Your presentation will receive points for clarity and for providing sufficient detail on the above points.
Computing resources:
- You can use any computing resource available to you, including your own laptop.
- Google Colab: write and execute Python in a browser with a free GPU.
- CSIL machines with no GPU. You may request extra disk space for a large dataset.
- Expanse cluster. I have some CPU/GPU hours allocated for this course.
Past project reports
Some project ideas written for the 2021 CS293 course
- Document retrieval based on neural impact scores using an inverted index.
Learning Passage Impacts for Inverted Indexes,
by Mallia et al., SIGIR 2021.
You can replicate this work, called DeepImpact (check
https://github.com/castorini/pyserini/blob/master/docs/experiments-deepimpact.md).
Or you can evaluate the search time with BMW and its variants. We can provide the BMW-based source code, or you can use the code in the next item. (A toy sketch of impact-based scoring appears after this list.)
- This GitHub repository contains an implementation of a variant of BMW:
Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, Rossano Venturini, Faster BlockMax WAND with Variable-sized Blocks, ACM SIGIR 2017.
How does it perform when we use DeepImpact-based neural scoring from the SIGIR 2021 paper above instead of BM25?
- The above DeepImpact approach is one way to improve first-stage ranking. Another approach follows this paper:
Jimmy Lin and Xueguang Ma.
A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.
arXiv:2106.14807.
Check https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md.
- Develop a simple multi-threaded key-value store using Linux files in C++/C to serve multiple contextual document embedding requests.
The goal is low-latency access to the embedding representation of each document while providing reasonable concurrency. (A sketch of one possible storage layout appears after this list.)
This system can be used for document re-ranking:
Composite Re-Ranking for Efficient Document Search with BERT,
WSDM 2022.
- Efficient C++ implementation without GPU for online re-ranking with
ColBERT. (A sketch of ColBERT's late-interaction scoring appears after this list.)
BERT C++ open-source code using Intel MKL is available here (the Makefile may need a small change).
- Document re-ranking based on Pyserini.
With Pyserini, you can derive a list of top document IDs for each tested query.
You can gather the related documents for these queries
and build a set of new features, such as neural text features or knowledge-entity features,
for each document, then re-rank the top documents for each query.
You do not have to build an inverted index (which takes time to program):
you can leverage Pyserini code to simply retrieve the text of the top documents and, for each query,
build the necessary features for the top documents saved in memory. (A retrieval sketch appears after this list.)
Instructions on how to run Pyserini on
Expanse
can be found
here.
- The inverted index is classified as sparse retrieval. Dense retrieval is another approach, which has received attention recently with the advancement of BERT. (A nearest-neighbor search sketch appears after this list.)
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval,
ICLR 2021.
GitHub: https://github.com/microsoft/ANCE
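For the DeepImpact and BMW items above, the toy sketch below shows the core idea of impact-based retrieval (it applies equally to uniCOIL-style indexes): the inverted index stores a precomputed learned impact per term-document pair, and a query's score is just the sum of impacts, with no BM25 statistics at query time. The tiny index and its impact values are made up for illustration; a real system would use quantized impacts produced by the model.

```python
from collections import defaultdict

# Toy impact-scored inverted index: term -> list of (docid, impact).
# The impacts below are made-up stand-ins for learned DeepImpact weights.
index = {
    "neural": [("d1", 7), ("d3", 2)],
    "search": [("d1", 3), ("d2", 5), ("d3", 4)],
    "index":  [("d2", 6)],
}

def impact_search(query_terms, index, k=10):
    """Score = sum of stored impacts over matching terms (no BM25 math at query time)."""
    scores = defaultdict(int)
    for term in query_terms:
        for docid, impact in index.get(term, []):
            scores[docid] += impact
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(impact_search(["neural", "search"], index))
# -> [('d1', 10), ('d3', 6), ('d2', 5)]
```

Because the per-posting score is a precomputed integer rather than a BM25 formula, BMW-style pruning only needs per-block maxima over these impacts, which is what the timing comparison in that project would measure.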
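For the embedding key-value store item, here is a minimal Python sketch of one possible storage layout (the project itself calls for C++/C, but the design carries over): fixed-size float32 records in one binary file, a docid-to-offset mapping, and mmap so concurrent reader threads need no locking. The dimension, file name, and docid-equals-record-index layout are all assumptions for illustration.

```python
import mmap
import struct
import numpy as np

DIM = 128                 # assumed embedding dimension
REC_SIZE = DIM * 4        # one float32 vector per document

def build_store(path, embeddings):
    """Write fixed-size float32 records; record i holds docid i's vector (assumed layout)."""
    with open(path, "wb") as f:
        for vec in embeddings:
            f.write(struct.pack(f"{DIM}f", *vec))

class EmbeddingStore:
    """Read-only store; mmap lets many threads read concurrently without locks."""
    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)

    def get(self, docid):
        off = docid * REC_SIZE
        return np.frombuffer(self._mm[off:off + REC_SIZE], dtype=np.float32)

build_store("embeddings.bin", np.random.rand(1000, DIM).astype(np.float32))
store = EmbeddingStore("embeddings.bin")
print(store.get(42)[:4])  # first few components of document 42's embedding
```

A C++ version would replace the mmap module with mmap(2) on the Linux file and serve lookups from a thread pool; the fixed-size-record layout keeps every lookup to one offset computation and one memory read.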
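For the ColBERT re-ranking item, the sketch below shows ColBERT's late-interaction (MaxSim) scoring in NumPy: each query token embedding takes the maximum dot product over all document token embeddings, and the per-token maxima are summed. The random matrices stand in for real BERT token embeddings; a C++ implementation would express the same matrix product and row-wise max with MKL calls.

```python
import numpy as np

def colbert_score(Q, D):
    """ColBERT MaxSim. Q: (num_query_tokens, dim), D: (num_doc_tokens, dim).
    score = sum over query tokens of the max dot product over doc tokens."""
    sim = Q @ D.T                  # (num_query_tokens, num_doc_tokens) similarities
    return sim.max(axis=1).sum()   # best doc token per query token, then sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128)).astype(np.float32)    # stand-in query token embeddings
docs = [rng.standard_normal((180, 128)).astype(np.float32) for _ in range(5)]

# Re-rank candidate documents by late-interaction score.
ranked = sorted(range(len(docs)), key=lambda i: -colbert_score(Q, docs[i]))
print(ranked)
```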
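For the Pyserini re-ranking item, a minimal retrieval sketch: it assumes pyserini is installed (with a Java runtime available) and uses a prebuilt MS MARCO passage index; the index name, query, and k are assumptions, so check the Pyserini docs for the current prebuilt index names. The feature-building step is a hypothetical stub, since it is the project-specific part.

```python
# Sketch assuming pyserini is installed (pip install pyserini) and Java is available.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("what is a lobster roll", k=100)

candidates = []
for hit in hits:
    raw = searcher.doc(hit.docid).raw()   # raw stored text of the passage
    candidates.append((hit.docid, hit.score, raw))

# Hypothetical stub: build your own features (neural text features,
# knowledge-entity features, ...) for each candidate kept in memory,
# then re-rank by your own model's score.
def my_features(text):
    return {"length": len(text)}

features = {docid: my_features(raw) for docid, _, raw in candidates}
print(len(candidates), "candidates retrieved for re-ranking")
```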
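For the dense-retrieval item, nearest-neighbor search over dense embeddings can be prototyped with FAISS as in the sketch below. The random vectors are stand-ins for ANCE query/passage encodings, and a flat inner-product index is used for simplicity; at scale, an approximate index would replace it.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 768                                                  # typical BERT embedding size
passages = np.random.rand(10000, dim).astype(np.float32)   # stand-in passage encodings
queries = np.random.rand(5, dim).astype(np.float32)        # stand-in query encodings

index = faiss.IndexFlatIP(dim)           # exact inner-product (dot product) search
index.add(passages)

scores, ids = index.search(queries, 10)  # top-10 passage ids and scores per query
print(ids[0], scores[0])
```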