293S Projects
This page will be revised with more project ideas.
Slides based on this page are here
Requirements
- Develop a system prototype or an algorithm implementation for document searching, using one or more datasets.
Your solution can leverage one or more open-source packages.
- Demonstrate the challenge of the problem addressed, and leverage state-of-the-art techniques from top-rated conferences.
- Apply evaluation metrics to assess success (for example, MRR@10 or recall; a minimal sketch appears after this list).
You may study algorithmic solutions or system-performance issues for searching.
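As a concrete example of one such metric, the sketch below computes MRR@10 from per-query rankings and relevance judgments. The dictionary formats are hypothetical stand-ins for whatever output your system produces.

```python
# Minimal sketch: compute MRR@10 given per-query rankings and relevance judgments.
# The dictionaries below are hypothetical stand-ins for your system's output format.

def mrr_at_10(run, qrels):
    """run: {qid: [docid, ...]} ranked best-first; qrels: {qid: set of relevant docids}."""
    total = 0.0
    for qid, ranking in run.items():
        relevant = qrels.get(qid, set())
        for rank, docid in enumerate(ranking[:10], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(run)

run = {"q1": ["d3", "d7", "d1"], "q2": ["d2", "d9"]}
qrels = {"q1": {"d7"}, "q2": {"d4"}}
print(mrr_at_10(run, qrels))  # q1 contributes 1/2, q2 contributes 0 -> 0.25
```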
Project Timelines
In choosing your own project, you can follow existing work, using recent technical paper(s) published
in top-rated information retrieval/search or related conferences (SIGIR, WWW, WSDM, ACL, EMNLP).
Tips for the class presentation:
- 1. Please include the conference name/year and the authors' affiliations in the slides. Discuss
the motivation, the problem definition, the key techniques/method(s), the dataset/experiment results, and what you plan to do.
- 2. Please provide some example(s) as early as possible in the presentation
to guide the audience, so we can understand the key ideas of the methods.
Please explain the examples patiently.
- 3. Do not write only a few words of text on each slide. Use figures/tables/illustrations together with the text, taking advantage of all the space available in each slide.
- Your presentation will receive points for clarity and for providing sufficient detail on the above points.
Computing resources:
- You can use any computing resource available to you, including your own laptop.
- Google Colab: write and execute Python in a browser with a free GPU.
- CSIL machines with no GPU. You may request extra disk space for a large dataset.
- Expanse cluster. I have some CPU/GPU hours allocated for this course.
Past project reports
Some project ideas written for the 2021 CS293 course
- Document retrieval based on neural impact scores using an inverted index.
Learning Passage Impacts for Inverted Indexes,
by Mallia et al., SIGIR 2021.
You can replicate this work, called DeepImpact (check
https://github.com/castorini/pyserini/blob/master/docs/experiments-deepimpact.md).
Or you can evaluate the search time with BMW and its variants. We can provide the BMW-based source code, or you can use the code in the next item. (A toy sketch of impact-based scoring appears after this list.)
- This GitHub repository contains an implementation of a variant of BMW:
Antonio Mallia, Giuseppe Ottaviano, Elia Porciani, Nicola Tonellotto, Rossano Venturini, Faster BlockMax WAND with Variable-sized Blocks, ACM SIGIR 2017.
How does it perform when we use DeepImpact-based neural scoring from the SIGIR 2021 paper above instead of BM25?
- The above DeepImpact approach is one way to improve first-stage ranking. Another approach follows this paper:
Jimmy Lin and Xueguang Ma.
A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques.
arXiv:2106.14807.
Check https://github.com/castorini/pyserini/blob/master/docs/experiments-unicoil.md.
- Develop a simple multi-threaded key-value store using Linux files in C++/C to serve multiple contextual document embedding requests.
The goal is low-latency access to the embedding representation of each document while providing reasonable concurrency. (A sketch of one possible storage layout appears after this list.)
This system can be used for document re-ranking:
Composite Re-Ranking for Efficient Document Search with BERT,
WSDM 2022.
- Efficient C++ implementation without GPU for online re-ranking with
ColBERT. (A sketch of ColBERT's late-interaction scoring appears after this list.)
BERT C++ open-source code using Intel MKL is available here (the Makefile may need a small change).
- Document re-ranking based on Pyserini.
With Pyserini, you can derive a list of top document IDs for each tested query.
You can gather the related documents for these queries
and build a set of new features, such as neural text features or knowledge-entity features,
for each document, then re-rank the top documents for each query.
You do not have to build an inverted index (which takes time to program):
you can leverage Pyserini code to simply retrieve the text of the top documents and, for each query,
build the necessary features for the top documents saved in memory. (A retrieval sketch appears after this list.)
Instructions on how to run Pyserini on
Expanse
can be found
here.
- The inverted index is classified as sparse retrieval. Dense retrieval is another approach, which has received attention recently with the advancement of BERT. (A nearest-neighbor search sketch appears after this list.)
Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval,
ICLR 2021.
GitHub: https://github.com/microsoft/ANCE
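For the DeepImpact and BMW items above, the toy sketch below shows the core idea of impact-based retrieval (it applies equally to uniCOIL-style indexes): the inverted index stores a precomputed learned impact per term-document pair, and a query's score is just the sum of impacts, with no BM25 statistics at query time. The tiny index and its impact values are made up for illustration; a real system would use quantized impacts produced by the model.

```python
from collections import defaultdict

# Toy impact-scored inverted index: term -> list of (docid, impact).
# The impacts below are made-up stand-ins for learned DeepImpact weights.
index = {
    "neural": [("d1", 7), ("d3", 2)],
    "search": [("d1", 3), ("d2", 5), ("d3", 4)],
    "index":  [("d2", 6)],
}

def impact_search(query_terms, index, k=10):
    """Score = sum of stored impacts over matching terms (no BM25 math at query time)."""
    scores = defaultdict(int)
    for term in query_terms:
        for docid, impact in index.get(term, []):
            scores[docid] += impact
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(impact_search(["neural", "search"], index))
# -> [('d1', 10), ('d3', 6), ('d2', 5)]
```

Because the per-posting score is a precomputed integer rather than a BM25 formula, BMW-style pruning only needs per-block maxima over these impacts, which is what the timing comparison in that project would measure.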
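For the embedding key-value store item, here is a minimal Python sketch of one possible storage layout (the project itself calls for C++/C, but the design carries over): fixed-size float32 records in one binary file, a docid-to-offset mapping, and mmap so concurrent reader threads need no locking. The dimension, file name, and docid-equals-record-index layout are all assumptions for illustration.

```python
import mmap
import struct
import numpy as np

DIM = 128                 # assumed embedding dimension
REC_SIZE = DIM * 4        # one float32 vector per document

def build_store(path, embeddings):
    """Write fixed-size float32 records; record i holds docid i's vector (assumed layout)."""
    with open(path, "wb") as f:
        for vec in embeddings:
            f.write(struct.pack(f"{DIM}f", *vec))

class EmbeddingStore:
    """Read-only store; mmap lets many threads read concurrently without locks."""
    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)

    def get(self, docid):
        off = docid * REC_SIZE
        return np.frombuffer(self._mm[off:off + REC_SIZE], dtype=np.float32)

build_store("embeddings.bin", np.random.rand(1000, DIM).astype(np.float32))
store = EmbeddingStore("embeddings.bin")
print(store.get(42)[:4])  # first few components of document 42's embedding
```

A C++ version would replace the mmap module with mmap(2) on the Linux file and serve lookups from a thread pool; the fixed-size-record layout keeps every lookup to one offset computation and one memory read.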
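For the ColBERT re-ranking item, the sketch below shows ColBERT's late-interaction (MaxSim) scoring in NumPy: each query token embedding takes the maximum dot product over all document token embeddings, and the per-token maxima are summed. The random matrices stand in for real BERT token embeddings; a C++ implementation would express the same matrix product and row-wise max with MKL calls.

```python
import numpy as np

def colbert_score(Q, D):
    """ColBERT MaxSim. Q: (num_query_tokens, dim), D: (num_doc_tokens, dim).
    score = sum over query tokens of the max dot product over doc tokens."""
    sim = Q @ D.T                  # (num_query_tokens, num_doc_tokens) similarities
    return sim.max(axis=1).sum()   # best doc token per query token, then sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((32, 128)).astype(np.float32)    # stand-in query token embeddings
docs = [rng.standard_normal((180, 128)).astype(np.float32) for _ in range(5)]

# Re-rank candidate documents by late-interaction score.
ranked = sorted(range(len(docs)), key=lambda i: -colbert_score(Q, docs[i]))
print(ranked)
```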
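For the Pyserini re-ranking item, a minimal retrieval sketch: it assumes pyserini is installed (with a Java runtime available) and uses a prebuilt MS MARCO passage index; the index name, query, and k are assumptions, so check the Pyserini docs for the current prebuilt index names. The feature-building step is a hypothetical stub, since it is the project-specific part.

```python
# Sketch assuming pyserini is installed (pip install pyserini) and Java is available.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("what is a lobster roll", k=100)

candidates = []
for hit in hits:
    raw = searcher.doc(hit.docid).raw()   # raw stored text of the passage
    candidates.append((hit.docid, hit.score, raw))

# Hypothetical stub: build your own features (neural text features,
# knowledge-entity features, ...) for each candidate kept in memory,
# then re-rank by your own model's score.
def my_features(text):
    return {"length": len(text)}

features = {docid: my_features(raw) for docid, _, raw in candidates}
print(len(candidates), "candidates retrieved for re-ranking")
```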
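For the dense-retrieval item, nearest-neighbor search over dense embeddings can be prototyped with FAISS as in the sketch below. The random vectors are stand-ins for ANCE query/passage encodings, and a flat inner-product index is used for simplicity; at scale, an approximate index would replace it.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 768                                                  # typical BERT embedding size
passages = np.random.rand(10000, dim).astype(np.float32)   # stand-in passage encodings
queries = np.random.rand(5, dim).astype(np.float32)        # stand-in query encodings

index = faiss.IndexFlatIP(dim)           # exact inner-product (dot product) search
index.add(passages)

scores, ids = index.search(queries, 10)  # top-10 passage ids and scores per query
print(ids[0], scores[0])
```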