293S Projects

This page will be revised with more project ideas. Slides based on this page are here

Requirements

Develop a system prototype or algorithm implementation for document searching/mining using a dataset or multiple datasets. Your project solution can leverage one or multiple packages of open source code.
Demonstrate the challenge of the problem addressed, and leverage state-of-the-art techniques from top-rated conferences.
Apply evaluation metrics to assess how well your solution works.
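
As an illustration of the evaluation requirement above, here is a minimal sketch of two ranking metrics commonly used in retrieval experiments, NDCG@k and average precision. The choice of metrics and the function names are assumptions for illustration; this page does not prescribe specific metrics. The NDCG here uses linear gain and normalizes by the ideal ordering of the labels passed in, which is a simplification when not all judged documents are in the list.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG divided by the DCG of the same labels sorted in the ideal order."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0

if __name__ == "__main__":
    # Relevance labels of one ranked list, top result first (toy values).
    print(ndcg_at_k([2, 0, 3, 1], k=4))
    print(average_precision(["d1", "d2", "d3"], {"d1", "d3"}))
```

Averaging `average_precision` over all test queries gives MAP; reporting a metric before and after your change is usually the clearest way to show its effect.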

You may study algorithmic solutions for ranking and mining, or system performance related issues.

Resource: You can use the CSIL machines to build your project. Every student in this class has, or will have, a sandbox directory with a large amount of space.

Timelines

Form a 2-person team, develop a project plan, and find papers to study.
Meet with me in the first week of Nov to discuss the project and paper selection. Through this discussion, we may select or assign a paper on a related topic that is suitable to present to other students.
Present a selected paper and your project in late Nov and the first week of Dec.
Demonstrate your project to me and submit the following project material by the end of the quarter.
- Slides for the paper(s) presented.
- A project report of about 4 pages (at most 6 pages). The report needs to include 1) Objectives and challenges. 2) State-of-the-art techniques you have leveraged (each citation should include author names, title, publication venue, and year). 3) Key algorithms and techniques used, with examples. 4) Datasets, metrics, and evaluation results/your findings. 5) Your efforts in the project and how the project relates to the class material learned.
- Code and data sets used
- Instructions for building and executing your demo so that your results are reproducible

Example Projects

Document Reranking based on HW1
From HW1, you can derive a list of top-500 document IDs for each tested query and there are 150 queries.
You can gather the related documents for these queries from trec45-processed.html, and build a set of new features, such as neural text features or knowledge entity features, for each document. Then rerank the top documents for each query.
You do not have to build an inverted index (which takes time to program). You can leverage the HW1 Java code to retrieve the text of the top documents, keep them in memory, and build the necessary features for each query.
- Additional text features for ranking for top documents of these TREC queries:
  These features can be found in the directory /cs/sandbox/faculty/tyang/290N/jinjin_data which has a README file.
  A description of these text features is in J. Shao, S. Ji, and T. Yang. Privacy-aware Document Ranking with Neural Signals. Proc. of 2019 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2019). Slides
- Additional papers that use neural features for re-ranking
  - Z. Dai, C. Xiong, J. Callan, and Z. Liu. Convolutional neural networks for soft-matching n-grams in ad-hoc search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 126--134. ACM, 2018.
  - C. Xiong, Z. Liu, J. Callan, and T.-Y. Liu. Towards better text understanding and retrieval through kernel entity salience modeling. In The 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '18, pages 575--584. ACM, 2018.
  - A copy of the C++ source code for DRMM neural ranking is in the directory /cs/sandbox/faculty/tyang/290N/DRMM-ranking, accessible from a CSIL machine. You can run this code on CSIL using the commands in README.2020.txt. The DRMM reference is: J. Guo, Y. Fan, Q. Ai, and W. B. Croft. A Deep Relevance Matching Model for Ad-hoc Retrieval. CIKM 2016. Slides.
  - Source code for a few neural ranking algorithms is available with this paper: OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline by S. MacAvaney, WSDM 2020. It works with Python 3.6 but may NOT work with Python 3.7 (e.g., on CSIL machines).
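
The reranking project above can be sketched roughly as follows: keep the top candidates for a query in memory, compute an extra feature per document, and sort by a combined score. This is only an illustrative sketch. The names `rerank` and `term_overlap_feature`, the linear interpolation, and the toy overlap feature (a stand-in for a neural or entity feature) are assumptions, not part of HW1 or any of the cited papers.

```python
def term_overlap_feature(query_terms):
    """Toy stand-in for a neural/entity feature: fraction of query terms in the text."""
    wanted = set(query_terms)

    def feature(text):
        tokens = set(text.lower().split())
        return len(wanted & tokens) / len(wanted) if wanted else 0.0

    return feature

def rerank(candidates, feature_fn, weight=0.5):
    """Rerank one query's candidates.

    candidates: list of (doc_id, first_stage_score, text) tuples.
    weight: share of the final score given to the new feature (hypothetical knob).
    """
    scored = []
    for doc_id, base_score, text in candidates:
        combined = (1 - weight) * base_score + weight * feature_fn(text)
        scored.append((combined, doc_id))
    # Highest combined score first; ties broken by doc_id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

if __name__ == "__main__":
    top_docs = [
        ("d1", 0.9, "stock market news"),
        ("d2", 0.8, "neural ranking models for search"),
        ("d3", 0.1, "cooking recipes"),
    ]
    print(rerank(top_docs, term_overlap_feature(["neural", "ranking"])))
```

In practice the first-stage scores and the new feature should be scaled to comparable ranges before mixing, and a learned ranker would normally replace the fixed interpolation weight.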
Sparse neural inverted index representation for documents (CIKM 2018)
Use neural features to build an inverted index and conduct TFIDF or BM25 ranking. The datasets used here also include TREC45.
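
For reference, BM25 ranking over an inverted index can be sketched as below. This is a minimal in-memory illustration over plain word tokens, not the sparse neural representation from the CIKM 2018 paper; in that setting the index keys would be latent dimensions rather than words. The function names, the IDF variant, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of tokens. Returns (postings, doc_len, avg_len)."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_len = {}
    for doc_id, tokens in docs.items():
        doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    avg_len = sum(doc_len.values()) / len(doc_len) if doc_len else 0.0
    return postings, doc_len, avg_len

def bm25(query, postings, doc_len, avg_len, k1=1.2, b=0.75):
    """Score documents containing at least one query term; best first."""
    n_docs = len(doc_len)
    scores = defaultdict(float)
    for term in query:
        plist = postings.get(term, {})
        if not plist:
            continue
        df = len(plist)
        # One common smoothed IDF variant that stays positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist.items():
            norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    toy_docs = {
        "d1": ["neural", "ranking", "neural"],
        "d2": ["ranking", "index"],
        "d3": ["cooking", "recipes"],
    }
    index = build_index(toy_docs)
    print(bm25(["neural", "ranking"], *index))
```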
Multi-stage ranking with BERT by Nogueira et al.
The dataset and some code on how to replicate this work are available from here, using the MS MARCO Passage Ranking package.
Optimization of search system performance.
- Fast intersection of multiple posting lists: Wang et al., Evaluating List Intersection on SSDs for Parallel I/O Skipping, VLDB 2020.
- Fast tree ensembles
  RapidScorer: Fast Tree Ensemble Evaluation by Maximizing Compactness in Data Level Parallelization, KDD 2018.
  An earlier paper: QuickScorer: a fast algorithm to rank documents with additive ensembles of regression trees, SIGIR 2015. Source code is available.
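
To make the posting-list intersection topic above concrete, here is a minimal in-memory sketch that intersects sorted document-ID lists, starting from the shortest list and skipping through the longer ones with galloping (exponential) search. This is a textbook baseline, not the SSD-oriented method of Wang et al.; the function names are hypothetical, and the lists are assumed to be strictly ascending.

```python
from bisect import bisect_left

def gallop_to(plist, start, target):
    """Return the smallest index i >= start with plist[i] >= target, or len(plist)."""
    step = 1
    lo = hi = start
    # Double the stride until we overshoot the target or run off the list.
    while hi < len(plist) and plist[hi] < target:
        lo = hi + 1
        hi += step
        step *= 2
    # Binary search inside the last window that may contain the target.
    return bisect_left(plist, target, lo, min(hi + 1, len(plist)))

def intersect(lists):
    """Intersect sorted, duplicate-free doc-ID lists; shortest list drives the probes."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = list(ordered[0])
    for plist in ordered[1:]:
        merged, pos = [], 0
        for doc_id in result:
            pos = gallop_to(plist, pos, doc_id)
            if pos == len(plist):
                break  # no larger IDs remain in this list
            if plist[pos] == doc_id:
                merged.append(doc_id)
        result = merged
    return result

if __name__ == "__main__":
    print(intersect([[1, 3, 5, 7, 9, 11], [3, 4, 5, 11, 20], [0, 3, 11, 12]]))
```

A performance project could time a baseline like this against a linear merge or a skip-pointer layout on real posting lists and report where each one wins.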
Privacy-preserving document search
S. Ji, et al. Privacy-aware Ranking with Tree Ensembles on the Cloud. SIGIR 2018. Slides
J. Shao, et al. Privacy-aware Document Ranking with Neural Signals. SIGIR 2019. Slides
Similar document clustering with word embeddings. From Word Embeddings To Document Distances by Kusner et al., ICML 2015. The code and dataset are available. Some recent work on speeding this up: Speeding up Word Mover's Distance and its variants via properties of distances between embeddings.
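
As a starting point for the document-distance project above, the sketch below computes the relaxed Word Mover's Distance, the lower bound described by Kusner et al. in which each word simply moves all of its weight to the nearest word of the other document, taking the larger of the two directions. The exact WMD needs an optimal-transport solver and is not shown. The toy two-dimensional embeddings and the function names are made up for illustration; a real project would load pretrained word vectors.

```python
import math
from collections import Counter

def nbow(tokens, embeddings):
    """Normalized bag-of-words weights over the tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided(src, dst, embeddings):
    """Each source word sends all of its weight to its nearest destination word."""
    return sum(
        weight * min(euclidean(embeddings[word], embeddings[other]) for other in dst)
        for word, weight in src.items()
    )

def relaxed_wmd(tokens_a, tokens_b, embeddings):
    """Lower bound on WMD: the tighter (larger) of the two one-sided relaxations."""
    a, b = nbow(tokens_a, embeddings), nbow(tokens_b, embeddings)
    if not a or not b:
        raise ValueError("each document needs at least one word with an embedding")
    return max(one_sided(a, b, embeddings), one_sided(b, a, embeddings))

if __name__ == "__main__":
    # Toy vectors, invented for this example only.
    toy = {
        "obama": (1.0, 0.0), "president": (1.0, 0.1),
        "speaks": (0.0, 1.0), "greets": (0.0, 1.1),
        "banana": (5.0, 5.0),
    }
    print(relaxed_wmd(["obama", "speaks"], ["president", "greets"], toy))
    print(relaxed_wmd(["obama", "speaks"], ["banana"], toy))
```

Pairwise distances from a function like this can feed a nearest-neighbor or clustering step; the relaxed bound is also commonly used to prune candidates before running the exact distance.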
Create your own project
You can choose your own project. You can follow other work, using the recent technical paper(s) published in top-rated information retrieval and mining conferences (SIGIR, WWW, WSDM, KDD, or SIGMOD/VLDB). Feel free to discuss with me.

Links to some open source implementations and datasets

Wikipedia data sets 121K documents, 715MB. More from here.
Many text collections in Machine learning data sets at UCI
Google word embeddings
A Java-based package called RankLib for different learning-to-rank algorithms. Past 293 homework that uses such a package is here.
C++ tree-based ranking package
Some datasets on reviews and topic modeling
Code and data for SIGIR 2019 paper on recommendation