Divyakant Agrawal
Professor of Computer Science
University of California at Santa Barbara
Current Research Focus
Advanced Query Processing
In the traditional paradigm of stored data, the data itself is fairly well-defined (e.g., business data), and the questions or queries posed over it are constrained and well-defined (e.g., selections and aggregations). As computers are increasingly used to store and process all types of information, interacting with and retrieving such data becomes a significant challenge. First, the semantics of these new types of retrieval have to be formalized; the research challenge is then to develop efficient implementations of the resulting querying paradigms. One prominent retrieval method that has emerged is the top-k query, in which information is retrieved from large datasets based on a user-specified ranking or preference criterion. The top-k paradigm is useful because it makes information digestible for human consumption, especially in decision-making environments. Typically, such datasets involve a large number of attributes, any of which can be used to order the results of a user query. Dr. Agrawal and his colleagues have developed a technique for partitioning large datasets based on the notion of dominance ranking, which significantly reduces the runtime overhead of processing top-k queries. Similarly, Drs. Agrawal and El Abbadi's research group has developed efficient methods for processing top-k queries over multiple datasets distributed across a network. An extension of top-k queries arises when the result set needs to be ordered by an aggregated value of an attribute instead of individual values (e.g., total sales in a region); Dr. Agrawal and his colleagues have developed an efficient method for progressive ranking of range-aggregated information.
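To fix ideas, here is a minimal sketch of the top-k query paradigm itself: rank multi-attribute records by a user-specified preference function and return the k best. The record layout and scoring function are illustrative assumptions, not taken from the papers described above.

```python
# Minimal top-k sketch: rank records by a user-specified preference
# function and return the k highest-scoring ones. A heap keeps the
# selection at O(n log k) rather than fully sorting the dataset.
import heapq

def top_k(records, k, score):
    """Return the k records with the highest preference score."""
    return heapq.nlargest(k, records, key=score)

cars = [
    {"model": "A", "price": 21000, "mileage": 30000},
    {"model": "B", "price": 18000, "mileage": 60000},
    {"model": "C", "price": 25000, "mileage": 10000},
]
# Illustrative preference: cheaper, lower-mileage cars rank higher.
best = top_k(cars, 2, lambda r: -(r["price"] + 0.1 * r["mileage"]))
print([r["model"] for r in best])
```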
Dr. Agrawal and his colleagues have also developed the notion of dynamic multi-dimensional data cubes for interactive analysis of massive datasets. In the context of multi-dimensional data, retrieval of exact information becomes infeasible due to the space-explosion problem. Therefore, retrieval is often based on similarity, which transforms the task into searching for data in a constrained hyperspace. Instead of exhaustively searching the constrained space (which is inefficient), the search is recast as finding the "nearest actual data points in the space," referred to as nearest-neighbor searching. Dr. Agrawal and his colleagues have proposed a data structure based on scalar quantization and have developed an approximate but highly efficient algorithm for nearest-neighbor search in high-dimensional datasets. In addition, Drs. Agrawal and El Abbadi's research group has developed efficient and effective methods for discovering similar sequences in biological data.
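For context, the brute-force baseline below states the nearest-neighbor problem that such indexes attack; the quantization-based structure described above replaces this exhaustive scan with an approximate but far cheaper search. The data points here are made up for illustration.

```python
# Brute-force nearest-neighbor baseline: scan every point and keep the
# one with the smallest Euclidean distance to the query. Quantization-
# based indexes avoid this O(n * d) scan at the cost of approximation.
import math

def nearest_neighbor(points, query):
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, query)))
    return min(points, key=dist)

data = [(0.1, 0.9, 0.3), (0.7, 0.2, 0.8), (0.4, 0.4, 0.5)]
print(nearest_neighbor(data, (0.5, 0.5, 0.5)))
```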
When preference queries are posed over multi-dimensional data, they are referred to as skyline queries. This is akin to a multi-variable optimization problem that arises widely in commercial datasets: "find all automobiles with minimal odometer reading and minimal sales price." Processing skyline queries over large datasets with interactive response times is a significant challenge. Dr. Agrawal and his colleagues have developed an innovative scheme for parallelizing skyline queries over datasets that are massive both in sheer size and in dimensionality. Another strategy is to pre-compute the results of skyline queries in advance; however, these results become invalid if the underlying dataset changes (through insertions and deletions of data elements). Dr. Agrawal and his colleagues have therefore developed techniques for the optimal maintenance of skyline queries over dynamic datasets.
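The dominance semantics behind skyline queries can be made concrete with the automobile example above. The sketch below is the naive quadratic algorithm, shown only to pin down the semantics; it is not the parallel scheme from the papers.

```python
# Naive skyline computation under minimization: a point survives if no
# other point is at least as good on every attribute and strictly
# better on at least one.
def dominates(a, b):
    """True if a <= b everywhere and a < b somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (mileage, price) pairs; the last car is dominated by the first.
cars = [(30000, 21000), (60000, 18000), (10000, 25000), (50000, 26000)]
print(skyline(cars))
```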
In addition to dealing with different types of data, data management professionals are confronted with the streaming nature of data in some applications. One of the best examples is the click-stream data generated by user selections on Web search results. The challenge is to answer queries in a streaming context without relying on stored information (which may not even be available). Data stream processing has emerged as a major research problem in computer science. Drs. Agrawal and El Abbadi pioneered an integrated, efficient solution for computing both frequent-element and top-k queries over data streams. Theirs was the first such integrated solution, and it is useful in a number of applications ranging from network security to click-fraud detection. Dr. Agrawal and his colleagues have also developed algorithms for efficiently correlating data from multiple data streams based on the notion of punctuation; in the absence of punctuation, the memory required to join two correlated data streams becomes unbounded.
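A counter-based sketch in the spirit of this line of work is shown below: a fixed number of counters bounds memory, and when a new element arrives with all counters occupied, the element with the smallest count is evicted and the newcomer inherits (and overcounts by one) its count. Whether this matches the published algorithm in every detail is not claimed here.

```python
# Counter-based frequent-element tracking over a stream with bounded
# memory. Heavy elements survive evictions, so their counts are
# approximated within a provable error bound in algorithms of this family.
def frequent_elements(stream, num_counters):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            count = counters.pop(victim)
            counters[item] = count + 1  # inherit the evicted count
    return counters

clicks = ["a", "b", "a", "c", "a", "b", "d", "a"]
print(frequent_elements(clicks, 3))  # approximate counts of heavy elements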
Extended Data Models
One consequence of the Internet becoming a massive medium for information is the realization that most data does not adhere to a standardized format or structure. Traditional data models, such as the relational model, were intended for structured data. To deal with loosely structured (i.e., semi-structured) data, the Extensible Markup Language (XML) has emerged as the data model of choice because it encompasses both the content and the structure of the data within the data object itself. Although XML is richer for querying purposes, it gives rise to complex research challenges in processing queries over XML data.
XML queries specify predicates on both the content and the structure of the elements of tree-structured XML documents. Hence, discovering occurrences of twig (tree-structured) query patterns is a core operation over XML documents. Drs. Agrawal and El Abbadi's research group has developed a novel technique for retrieving XML twigs that employs a distributed binary labeling and tree-traversal algorithm, resulting in substantial reductions in search space and response time. In another paper, Dr. Agrawal and his co-authors developed bottom-up query processing for generalized tree pattern (GTP) queries over XML documents; GTP queries provide a more comprehensive query framework than XML twig queries.
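The toy sketch below only illustrates what twig matching asks for (here simplified to required direct children); the document encoding and tag names are invented. Real twig-join algorithms, including the labeling scheme above, use positional labels precisely to avoid this kind of full-document traversal.

```python
# Brute-force twig matching: find elements whose tag matches the twig
# root and that contain all required child tags. Documents are encoded
# as (tag, children) tuples for simplicity.
def find_twigs(node, root_tag, child_tags):
    tag, children = node
    matches = []
    if tag == root_tag and child_tags <= {c[0] for c in children}:
        matches.append(node)
    for child in children:
        matches.extend(find_twigs(child, root_tag, child_tags))
    return matches

doc = ("bib", [
    ("book", [("title", []), ("author", [])]),
    ("book", [("title", [])]),
])
# Twig query: //book[title][author] -- only the first book qualifies.
print(len(find_twigs(doc, "book", {"title", "author"})))
```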
XML is increasingly used as a de facto standard for data and information exchange over the Web. For example, Web publishers use XML to publish and distribute their content. Thus, in a time-sensitive Web-based publish-subscribe system, it becomes essential to develop efficient filtering techniques for distributing published XML content to its intended recipients. Dr. Agrawal and his co-authors have developed an adaptable XML filtering technique that employs both prefix-caching and suffix-clustering and is therefore highly scalable compared with state-of-the-art XML filtering systems. Finally, Drs. Agrawal and El Abbadi's research group has also developed algorithms for both content and structure matching of XML range queries, using a dynamic summarization approach that exploits Bloom filters (a space-efficient probabilistic summary structure).
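Since the summarization approach above relies on Bloom filters, a minimal sketch of the structure may help: membership tests have no false negatives and a tunable false-positive rate. The hash construction and sizes here are illustrative choices, not those of the published system.

```python
# Minimal Bloom filter: k hash positions per item over an m-bit array.
# add() sets bits; membership checks all k bits, so lookups can yield
# false positives but never false negatives.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

paths = BloomFilter(1024, 3)
paths.add("/bib/book/title")
# First lookup is guaranteed True; second is False with high probability.
print("/bib/book/title" in paths, "/bib/book/price" in paths)
```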
A common paradigm for addressing performance bottlenecks in interactive computer systems is to pre-fetch or cache the results of potential user queries. In the context of database systems, this is referred to as a materialized view of the data. Dr. Agrawal has a long history of developing efficient view-materialization techniques for relational databases. Recently, he and his research group have developed approaches for maintaining XML views efficiently and incrementally, especially when the underlying XML data source is updated. In Web-based information systems, additional complexity arises from the hybrid and loosely coupled nature of the different components: typically, the back-end data source is based on the relational model, whereas message exchange and the front end are based on the XML data model. In a recent article, Dr. Agrawal and his co-authors propose a technique for maintaining XML views efficiently and incrementally in such a loosely coupled environment. Furthermore, in a companion article, they describe a detailed implementation of a scalable middleware that enables XML caching for Web services.
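The core principle of incremental view maintenance can be shown with a toy relational aggregate: apply the delta of each update to the materialized result rather than recomputing it from scratch. The XML-view techniques above follow the same principle over tree-structured data; the class and schema below are illustrative only.

```python
# Toy incrementally maintained view: a per-key running sum that is
# updated by deltas on insert/delete instead of being recomputed.
class MaterializedSum:
    def __init__(self, rows, key, value):
        self.key, self.value = key, value
        self.view = {}
        for row in rows:
            self.insert(row)

    def insert(self, row):
        k = row[self.key]
        self.view[k] = self.view.get(k, 0) + row[self.value]

    def delete(self, row):
        self.view[row[self.key]] -= row[self.value]

sales = [{"region": "west", "amount": 10}, {"region": "east", "amount": 5}]
view = MaterializedSum(sales, "region", "amount")
view.insert({"region": "west", "amount": 7})  # apply delta, no recomputation
print(view.view)  # {'west': 17, 'east': 5}
```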
Click-Fraud Detection in Advertising Networks
In the context of Internet commerce, Web-based advertising has become an important source of revenue for many well-known industry leaders such as Google, Yahoo, and MSN. Typically, the network model for Web-based advertising places an intermediary (the advertising commissioner) between publishers and advertisers. The revenue model is based on pay-per-click advertising: a commission is paid to the publisher and the commissioner when an Internet user clicks on a sponsored advertisement link on the publisher's site. Given the large scale of the publisher and advertiser populations, advertising commissioners such as Google must confront spurious and fraudulent clicks that dubious publishers may generate to inflate their revenues. Drs. Agrawal and El Abbadi have developed a comprehensive advertising-network model for deploying run-time fraud-detection techniques that discover fraud in real time. They have developed a complete classification framework for detecting click-fraud and have reduced the fraud-detection problem to the theoretical problem of computing complex aggregates over data streams. Detecting Internet fraud becomes more complex when multiple entities collude to commit a fraud; the research community identified collusion among multiple publishers as an open problem in 1998. Drs. Agrawal and El Abbadi applied advanced data-mining techniques to click-stream data to detect this type of fraud. More recently, their research group has developed a sophisticated data-driven technique for detecting coalition attacks in advertising networks. Unlike the scenarios addressed by prior approaches, this problem is complex enough that its solution requires offline analysis of historical click-stream data; an online solution to coalition attacks remains an open problem. It must be noted that Drs. Agrawal and El Abbadi are pioneers in this field, and their earlier work on click-fraud detection based on Bloom filters is widely used by industry leaders.
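As a toy illustration of one simple click-fraud signal, the sketch below flags repeated (user, ad) clicks within a sliding window over the click stream. The signal, window size, and data layout are illustrative assumptions rather than the published classification framework; at production stream rates, the exact counts would be replaced by Bloom-filter-style summaries so the state fits in memory.

```python
# Flag duplicate (user_id, ad_id) clicks seen within the last
# `window_size` stream events, using exact counts for clarity.
from collections import Counter, deque

def flag_duplicates(clicks, window_size):
    window, counts = deque(), Counter()
    for click in clicks:
        if counts[click]:
            yield click                    # repeated click within the window
        window.append(click)
        counts[click] += 1
        if len(window) > window_size:
            counts[window.popleft()] -= 1  # expire the oldest event

stream = [("u1", "ad9"), ("u2", "ad9"), ("u1", "ad9"), ("u1", "ad9")]
print(list(flag_duplicates(stream, window_size=3)))
```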
Peer-to-Peer Data Management
Drs. Agrawal and El Abbadi were among the first researchers to promote the paradigm of peer-to-peer (P2P) computing for large-scale data management problems. Their CIDR 2003 paper, "Approximate Range Selection Queries in Peer-to-Peer Systems," is widely cited in the peer-to-peer research literature. Dr. Agrawal has continued his investigations in peer-to-peer data management and has published several research articles on the topic. His research group developed a scheme based on reference vectors for indexing multi-dimensional data in peer-to-peer networks; this scheme effectively supports a richer set of query operations, such as nearest-neighbor queries and content-based similarity search over unstructured data in P2P networks. Web services enable dynamic integration and interaction of heterogeneous software artifacts and thereby facilitate fast and efficient cooperation among multiple entities in Web-based environments. A critical problem in this domain is discovering Web services efficiently. By extending the P2P paradigm of resource discovery for media objects, Dr. Agrawal and his research group have developed a P2P-based Web service discovery system called SPiDer. SPiDer organizes service providers into a structured P2P overlay and allows them to advertise and look up services in a completely decentralized and dynamic manner. In a companion paper, Dr. Agrawal and his colleagues developed P2P overlay schemes to support resource discovery in very large computational grids. Finally, Drs. Agrawal and El Abbadi have also developed advanced techniques for efficient routing and load balancing in P2P networks.
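Structured P2P overlays of the kind SPiDer builds on typically rest on consistent hashing: each peer owns an arc of a hash ring, so joins and departures move only a small fraction of the keys. The sketch below shows that underlying idea with invented peer and key names; it is not SPiDer's actual protocol.

```python
# Consistent-hashing ring: a key is owned by the first peer whose hash
# point lies clockwise of the key's hash point.
import bisect
import hashlib

def ring_hash(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, peers):
        self.points = sorted((ring_hash(p), p) for p in peers)

    def lookup(self, key):
        """Return the peer responsible for `key` (clockwise successor)."""
        i = bisect.bisect(self.points, (ring_hash(key), ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["peer-a", "peer-b", "peer-c"])
print(ring.lookup("weather-service"))  # deterministic owner of this key
```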
Most P2P systems currently used in practice provide only best-effort guarantees: if a lookup fails, the user has no assurance that the resource is actually non-existent. Best-effort guarantees are acceptable for applications that are not mission-critical (e.g., sharing music files). When P2P systems are used to manage distributed data over the Internet, however, a stricter notion of correctness becomes necessary, especially since the data may be dynamic and undergoing continual change. Traditional database solutions to this problem rely on synchronization mechanisms based on read and write locks. Drs. Agrawal and El Abbadi were among the first to propose lock-free range queries over P2P data that nonetheless guarantee a stricter notion of correctness. They have also collaborated with Dr. Michel Raynal's research group at IRISA, France, to formalize peer-to-peer systems as dynamic distributed systems, in contrast to traditional distributed computing, which assumes a static set of participants. A loose analogy can be drawn from queuing models: the closed queuing model corresponds to the traditional static distributed model, while the open queuing model corresponds to dynamic (P2P) systems.
Secure and Privacy-Preserving Data Processing
Data integration across multiple autonomous data sources has emerged as an important practical problem. The key requirement for such integration is that the owners of the data must cooperate within a competitive landscape. The research challenge in developing a query processing solution is that answers to queries must be provided while preserving the privacy of the data sources. For example, one medical office may be interested in finding the patients it has in common with another medical office, but neither is willing to share information about clients who are not on the other's list. In general, allowing unrestricted read access to the whole dataset creates potential vulnerabilities and may have legal implications. Therefore, there is a need for privacy-preserving database operations for querying data residing at different parties. Traditionally, security and privacy of information have been based primarily on cryptographic techniques; although these provide provable guarantees, they are not widely adopted in real applications because of their computation and communication overhead. In the past, when data and information resources were confined within an enterprise, security and privacy were not major concerns; in the contemporary world of open and distributed access to data and systems, however, they have emerged as a major challenge. Drs. Agrawal and El Abbadi have developed a query processing framework that exploits distribution (instead of cryptographic techniques) for secure and privacy-preserving operations over distributed data. By eliminating the overhead of data encryption and decryption, query processing over autonomous data sources becomes more efficient and scalable.
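The common-patients example can be framed with a deliberately naive sketch: the two parties exchange salted hashes of identifiers instead of raw values and intersect them. This only states the problem; hashing a small identifier space is not a strong privacy guarantee, which is exactly why the distributed framework described above (or full cryptographic protocols) is needed in practice. All names and the shared salt are illustrative.

```python
# Naive "blinded" set intersection: each party publishes salted hashes
# of its identifiers; intersecting the hash sets reveals only the
# common elements (under the caveats noted in the lead-in).
import hashlib

def blind(ids, salt):
    return {hashlib.sha256((salt + i).encode()).hexdigest() for i in ids}

SALT = "shared-secret"          # agreed upon out of band (assumption)
office_a = blind({"p1", "p2", "p3"}, SALT)
office_b = blind({"p2", "p3", "p4"}, SALT)
common = office_a & office_b    # neither side sees the other's non-matches
print(len(common))              # 2 common patients
```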
Novel Hardware-based Solutions for Data Management
During the past few years, Drs. Agrawal and El Abbadi have embarked on a new area of research: exploiting new hardware systems in the context of database management systems. Dr. Agrawal started by investigating storage systems built using micro-electro-mechanical systems (MEMS) technology. His group observed that MEMS-based storage devices inherently have a two-dimensional structure, which can be exploited for the placement and retrieval of tabular data in relational DBMSs. In particular, Dr. Agrawal's research group has developed a storage scheme that enables relational data to be retrieved both row-wise (useful for OLTP access) and column-wise (useful for OLAP access). The group then extended this work to the placement of two-dimensional data over MEMS-based storage. More recently, Drs. Agrawal and El Abbadi have been exploring the use of content-addressable memories (CAMs) for accelerating database join operations. They have built a hardware prototype that integrates CAMs (designed for network processors) with general-purpose processors, and a graduate student in the group has demonstrated that join processing time can be significantly reduced using CAMs. Their research group built another prototype that integrates CAMs with advanced network processor units (NPUs); NPUs with CAMs are widely deployed in network routers for very fast routing lookups. Drs. Agrawal and El Abbadi have instead used this setup to develop very high-throughput algorithms for data stream operations. These algorithms are highly applicable to analyzing network streams for the statistical violations that arise from network intrusions; in the networking literature, these are referred to as the heavy-hitter and heavy-distinct-hitter problems. Dr. Agrawal's research group has developed CAM-conscious hardware versions of solutions to both problems. These results have significant potential for wide adoption in industrial settings.
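For context on what the CAM hardware accelerates, the sketch below is the conventional software baseline: an equi-join implemented as a hash join. A CAM answers the "which stored keys match this probe?" question in a single hardware lookup, replacing the per-tuple hash-table probe in the probe phase. Table contents and the join key are invented for illustration.

```python
# Software hash join baseline: build a hash index on the smaller input,
# then probe it once per tuple of the other input.
from collections import defaultdict

def hash_join(left, right, key):
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)        # build phase
    return [
        {**l, **r}
        for r in right
        for l in index.get(r[key], [])     # probe phase (CAM-accelerable)
    ]

orders = [{"cust": 1, "item": "disk"}, {"cust": 2, "item": "cpu"}]
custs = [{"cust": 1, "name": "Acme"}]
print(hash_join(orders, custs, "cust"))
```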
Information Processing over Sensor Networks
In collaboration with Professor Subhash Suri at UC Santa Barbara, Dr. Agrawal is investigating the problem of information processing over sensor networks, which poses significant challenges. This collaborative research has resulted in a wide range of techniques, including power-aware routing over sensor networks, distributed navigation through a sensor field, and approximation algorithms for routing over sensor networks.