This page will be revised with more project ideas. Slides based on this page are here
Requirements
Resource: You can use the CSIL machines to build your project. Every student in this class has, or will have, a sandbox directory with ample disk space.
Timelines
Example Projects
From HW1, you can derive a list of the top 500 document IDs for each of the 150 tested queries.
You can gather the related documents for these queries from trec45-processed.html and build a set of new features, such as neural text features or knowledge entity features, for each document. Then rerank the top documents for each query.
You do not have to build an inverted index (which takes time to program). You can leverage the HW1 Java code to retrieve the text of the top documents and, for each query, build the necessary features for the top documents held in memory.
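The reranking step described above can be sketched as follows. This is a minimal illustration in Python, not the HW1 code: the function names, the linear blend, the `weight` parameter, and the toy `term_overlap` feature are all assumptions made for the example, and the baseline scores are assumed to be normalized to roughly [0, 1]. A real project would replace `term_overlap` with a neural or entity-based feature.

```python
from typing import Callable, Dict, List, Tuple

Ranking = Dict[str, List[Tuple[str, float]]]  # query ID -> [(doc ID, score), ...]


def term_overlap(query: str, text: str) -> float:
    """Toy stand-in feature: fraction of query terms that occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    top_docs: Ranking,
    queries: Dict[str, str],
    doc_text: Dict[str, str],
    new_feature: Callable[[str, str], float] = term_overlap,
    weight: float = 0.5,
) -> Ranking:
    """Rerank each query's top documents using a blend of old score and new feature.

    Everything is held in memory; no inverted index is needed because only
    the already-retrieved top documents are rescored.
    """
    reranked: Ranking = {}
    for qid, candidates in top_docs.items():
        rescored = []
        for doc_id, base_score in candidates:
            feat = new_feature(queries[qid], doc_text[doc_id])
            # Hypothetical linear blend; a learned ranker could replace this line.
            rescored.append((doc_id, (1 - weight) * base_score + weight * feat))
        rescored.sort(key=lambda pair: pair[1], reverse=True)
        reranked[qid] = rescored
    return reranked


if __name__ == "__main__":
    top_docs = {"q1": [("d1", 0.9), ("d2", 0.8)]}
    queries = {"q1": "neural ranking"}
    doc_text = {
        "d1": "inverted index construction",
        "d2": "neural ranking with signals",
    }
    print(rerank(top_docs, queries, doc_text))
```

The same loop ports directly to Java on top of the HW1 code: keep a map from query ID to its top-500 list, look up each document's text, compute the feature, and sort.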
These features can be found in the directory /cs/sandbox/faculty/tyang/290N/jinjin_data which has a README file.
A description of these text features is in J. Shao, S. Ji, and T. Yang. Privacy-aware Document Ranking with Neural Signals. Proc. of 2019 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2019). Slides
An earlier paper: QuickScorer: A Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees, SIGIR 2015. Source code is available.
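For orientation, an additive ensemble of regression trees scores a document by summing the leaf values that its feature vector reaches in each tree. The sketch below shows only the plain root-to-leaf traversal that fast scoring algorithms aim to speed up; it is not the QuickScorer algorithm, and the `Node` layout and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Internal node (feature, threshold, children) or leaf (value only)."""
    feature: int = -1
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0  # output used when the node is a leaf


def score_tree(root: Node, x: Sequence[float]) -> float:
    """Walk from the root to a leaf, going left when x[feature] <= threshold."""
    node = root
    while node.left is not None and node.right is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def score_ensemble(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Document score is the sum of the per-tree leaf outputs."""
    return sum(score_tree(tree, x) for tree in trees)


if __name__ == "__main__":
    tree_a = Node(feature=0, threshold=0.5, left=Node(value=0.1), right=Node(value=0.7))
    tree_b = Node(feature=1, threshold=2.0, left=Node(value=-0.2), right=Node(value=0.3))
    print(score_ensemble([tree_a, tree_b], [0.8, 1.0]))
```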
You can also choose your own project, for example by following recent technical papers published in top-rated information retrieval and data mining conferences (SIGIR, WWW, WSDM, KDD, or SIGMOD/VLDB). Feel free to discuss your idea with me.
Links to some open source implementations and datasets