Deduplication and Search for Versioned Datasets

Project Overview

Organizations and companies often archive high volumes of versioned digital datasets. Integrated archival and search support for data preservation, electronic discovery, and regulatory compliance presents both research challenges and opportunities. Because versioned datasets contain highly repetitive content, deduplication can reduce storage demand by an order of magnitude or more; however, this optimization is resource-intensive. After deduplication, the structure of the inverted index for versioned data becomes complex, and searching for relevant results is expensive. This project will study low-cost solutions for compact archiving and indexing, and develop efficient algorithms and system techniques for searching versioned datasets. It will also consider the case where archived data is stored in an untrusted cloud environment, and investigate the tradeoffs between efficiency and privacy preservation in search.
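To make the storage argument concrete, the following is a minimal sketch of fingerprint-based deduplication across versions. It is an illustration only and not this project's implementation: the fixed-size chunking, the SHA-256 fingerprints, and the names `DedupStore`, `add_version`, and `restore` are all assumptions chosen for brevity. Production systems commonly use content-defined chunking so that an insertion does not shift every later chunk boundary.

```python
import hashlib


def chunk(data: bytes, size: int):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical store that keeps each unique chunk once."""

    def __init__(self, chunk_size: int = 8):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.recipes = {}  # version name -> ordered list of fingerprints

    def add_version(self, name: str, data: bytes) -> None:
        recipe = []
        for piece in chunk(data, self.chunk_size):
            fp = hashlib.sha256(piece).hexdigest()
            # Only the first occurrence of a chunk consumes storage.
            self.chunks.setdefault(fp, piece)
            recipe.append(fp)
        self.recipes[name] = recipe

    def restore(self, name: str) -> bytes:
        """Rebuild a version by concatenating its chunks in order."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


# Two versions that differ only in their last 8 bytes.
v1 = b"AAAAAAAABBBBBBBBCCCCCCCC"
v2 = b"AAAAAAAABBBBBBBBDDDDDDDD"

store = DedupStore(chunk_size=8)
store.add_version("v1", v1)
store.add_version("v2", v2)

raw_bytes = len(v1) + len(v2)
print(raw_bytes, store.stored_bytes())
```

In this toy input the two versions share two of their three chunks, so the store holds four unique chunks instead of six. The fingerprint computation and lookup performed for every chunk is the kind of per-chunk work that makes deduplication resource-intensive at scale.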

This project focuses on the key challenges and cost-sensitive technical aspects of integrated archival and search support for managing large versioned datasets. The main tasks include an efficient software architecture and optimizations for detecting duplicated content on a cloud cluster, fast multi-phase search with a hybrid index structure that exploits content similarity and query characteristics, and an efficient privacy-preserving framework with top-result ranking.

Publication



  1. J. Shao, S. Ji, A. O. Glova, Y. Qiao, T. Yang, and T. Sherwood. Index Obfuscation for Oblivious Document Retrieval in a Trusted Execution Environment. Proc. of 29th ACM International Conference on Information and Knowledge Management (CIKM 2020). Pages 1345–1354. Slides

  2. J. Shao, S. Ji, and T. Yang. Privacy-aware Document Ranking with Neural Signals. Proc. of 2019 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2019). Slides

  3. S. Ji, J. Shao, and T. Yang. Efficient Interaction-based Neural Ranking with Locality Sensitive Hashing. Proc. of International World Wide Web Conference (WWW 2019). Code demo

  4. S. Ji, J. Shao, D. Agun, and T. Yang. Privacy-aware Ranking with Tree Ensembles on the Cloud. Proc. of 2018 ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR 2018). Slides

  5. D. Agun, J. Shao, S. Ji, S. Tessaro, and T. Yang. Privacy and Efficiency Tradeoffs for Multiword Top K Search with Linear Additive Rank Scoring. Proc. of International World Wide Web Conference (WWW 2018). Slides

  6. X. Tang, M. Alabduljalil, X. Jin, and T. Yang. Partition-based Similarity Search with Cache-Conscious Data Traversal. ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 11 Issue 3, April 2017.

  7. X. Jin, D. Agun, T. Yang, Q. Wu, Y. Shen, S. Zhao. Hybrid Indexing for Versioned Document Search with Cluster-based Retrieval. 25th ACM International Conference on Information and Knowledge Management (CIKM 2016).

  8. D. Agun, T. Yang, and W. Zhang. Low-Profile Source-side Deduplication for Virtual Machine Backup. Proc. of USENIX HotCloud '16.

  9. X. Jin, T. Yang, and X. Tang. A Comparison of Cache Blocking Methods for Fast Execution of Ensemble-based Score Computation. Proc. of 2016 ACM SIGIR conference on Research and Development in Information Retrieval. Slides

  10. W. Zhang, D. Agun, T. Yang, R. Wolski, and H. Tang. VM-Centric Snapshot Deduplication for Cloud Data Backup. Proc. of the 31st International Conference on Massive Storage Systems and Technologies. 2015.

Software Prototype: RTP search for versioned data. Extra Linux test data.

Supported in part by the National Science Foundation under Grant No. 1528041 (PIs: T. Yang and S. Tessaro), and by a Google Faculty Research Award (PI: T. Yang). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.