XIFENG YAN
Research Guidelines
(1) Identify fundamental concepts and principles of data mining, (2) Design algorithms to model, manage, and mine large-scale graphs and networks in bioinformatics, social networks, and computer systems, (3) Develop systems to facilitate complex knowledge discovery, involving data mining, data management, and machine learning.
Research Areas
Data Mining Foundations
Mining and Managing Large-Scale Graphs
Biological Network Analysis
Social Network and Business Analytics
Data Mining for Software and Systems
My research is funded by NSF, ARMY, Alzheimer's Association, UCSB, and CNSF.
Frequent pattern mining
has been studied extensively in the data mining community. Unfortunately, the
exponentially large pattern sets generated by many mining processes have
undermined their general utility. We are drowning in data; ironically, we are
also drowning in the patterns we discover. To overcome this pattern redundancy
problem, we proposed statistical and combinatorial models to summarize
discovered patterns. We continue to work on the foundations of the following
mining problems: (1) colossal pattern mining, (2) direct mining of
significant/discriminative/interesting patterns, (3) approximate pattern
mining, and (4) pattern-based classification and clustering.
[SDM'03, VLDB'05, KDD'05a, KDD'06, ICDE'07a, ICDE'07b, ICDM'07b, ICDE'08,
SIGMOD'10]
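As a minimal illustration of the pattern redundancy problem described above (not the summarization models from the cited papers), the following Python sketch enumerates every frequent itemset in a tiny hypothetical transaction set by brute force, then keeps only the closed ones, that is, itemsets with no proper superset of equal support. The data, the `min_support` threshold, and the function names are all assumptions made for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration: map each frequent itemset to its support count."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            # Support = number of transactions that contain the candidate.
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                result[cand] = support
    return result

def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in frequent.items()
        if not any(s < other and sup == frequent[other] for other in frequent)
    }

# Hypothetical toy data: three transactions over five items.
transactions = [frozenset("abcde"), frozenset("abcde"), frozenset("abc")]
frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
print(len(frequent), len(closed))  # prints: 31 2
```

In this toy case 31 frequent itemsets collapse to 2 closed ones ({a, b, c} with support 3 and {a, b, c, d, e} with support 2), and the support of every frequent itemset can still be recovered from them. Brute-force enumeration is used only to keep the sketch short; it does not scale beyond a handful of items.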
Managing and Mining Large-Scale Graphs and Networks (Foundation of Graph Information Systems)
Graph data has grown
steadily in a wide range of scientific and commercial domains, such as
bioinformatics, computer security, and social networks. However, for lack of
management and mining tools, it is extremely hard, if not impossible, for users
to search and analyze any reasonably large collection of graphs. For instance,
browsing and cross-checking biological network databases displayed
simultaneously in multiple windows is by no means an inspiring experience for
scientists. My research focuses on two fundamental problems in large-scale
graph data mining and graph data management: (1) For a given graph data set (a
single large network or multiple graphs), what are the hidden structural
patterns, and how can we find them? (2) Given a graph query, how can we index
large-scale graph datasets and search them quickly? We have made enough
progress on these two problems that the design of a general graph information
system is now within reach. [ICDM'02,
KDD'03, KDD'05b, SIGMOD'04, SIGMOD'05, ICDE'06, PAKDD'07, ICDM'07a, VLDB'07a,
TODS'05, TODS'06, KDD'08b, SIGMOD'08, SDM'09, VLDB'09, ICDE'10, SIGMOD'10,
SIGKDD'10]
Biological Network Analysis
Biological networks,
including metabolic, protein-protein interaction, signaling, and
transcriptional regulatory networks, are not randomly assembled. For cells to
function properly, both the activities of individual molecules in these
networks and their interactions are integrated and coordinated in a timely and
robust manner in response to intrinsic and external signals. By searching and
mining these interaction networks, it is becoming possible to identify the
pathways and modules that control specific biological processes. The objective
of this research is to identify composite network motifs in cellular systems
and transcriptional regulatory modules from multiple networks. [KDD'03,
KDD'05b, ISMB'04, Bioinformatics'06, ISMB'07a, ISMB'07b]
Social Network and Business Analytics
Social networks are another rich data source for network research. We are
working on computing problems arising from networks extracted from help-desk
tickets and emails. By mining ticket processing data, we can quantitatively
analyze individual behaviors and social relationships in an organization, which
could shed light on management optimization. Social networks derived from email
communications offer another view into an enterprise's organization. We are
investigating email networks to infer and categorize the expertise of
employees, which could provide a new approach to labor resource management.
[PKDD'05, VLDB'07b, KDD'08a, VLDB'08 demo, ICDE'09 demo, ASONAM'10, BPM'10]
Data Mining for Software Analysis and Computer Security
The current strain on
software development and services demands systems that are less reliant on human
intervention. The amount of data generated by various systems, such as system
logs, program traces, and intrusion alerts, is ever increasing. Mining this
system data has the potential to make computing more intelligent, reliable, and
maintainable. We are seeking data mining techniques to enhance the performance
and reliability of computer systems, such as improving the effectiveness of
storage caching, identifying intrusion sources, and isolating software bugs by
mining source code and runtime data. [FCRC'03, FAST'04, SDM'05, FSE'05, SDM'06,
TSE'06, ISSTA'09, Oakland'10]