This tool performs spectral clustering using either sparse similarity matrix (nearest neighbors) or the Nystrom method. It is also used in the comparison experiments in the following paper:
Parallel Spectral Clustering in Distributed Systems
Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang
IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 33, No. 3, pp. 568-586, March 2011
[PDF (2.5MB)]
If you find this tool useful, please cite the above work. The bibtex format is
@article{Chen11, author = {Wen-Yen Chen and Yangqiu Song and Hongjie Bai and Chih-Jen Lin and Edward Y. Chang}, title = {Parallel Spectral Clustering in Distributed Systems}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, volume = {33}, number = {3}, pages = {568-586}, year = {2011} }
This tool is implemented in MATLAB, and data is stored in .mat format. The code has been tested under 64-bit Linux environment using MATLAB 7.4.0.287 (R2007a). You will be able to regenerate experiment results in the paper. However, results may be slightly different due to the randomness, the CPU speed, and the load of your computer.
Name | Version Number | Release Date |
---|---|---|
MATLAB code and Corel data files | 1.1 (.zip [16MB] ) (.tar.gz [16MB]) | 02/03/2010 |
RCV1 feature file | .mat [98MB] | 08/15/2008 |
RCV1 label file | .mat [146KB] | 08/15/2008 |
README | .txt [5.4KB] | 02/03/2010 |
You can also download our tool at MLOSS and SourceForge.
Please feel free to send your feedback to Wen-Yen Chen and Chih-Jen Lin.
Q1: Does it work on a single 32bit machine?
A1: Yes, this tool is designed for single-machine users to process large-scale data sets. It is also used for comparions experiments (nearest neighbors
vs. Nystrom) in the above paper.
Q2: Can I download the C++ parallel version?
A2: Yes, you are free to download here.
Q3: What is block_size? What is the common value for block_size?
A3: The block_size determines the size of a block when partitioning our original data. Assume our original data has size 100 and we set block_size
to be 5, then we will have 20 blocks. For block_size value, it depends on your physical memory and data sparsity. If you have less memory, use
smaller value. The suggested value is 10 to 100 (or larger if you have sufficient memory).
Q4: The k_means function that you are using keeps diverging (oscilating between two certain values of qold and qnew). What's up with it?
and why didnt you use matlab's kmeans implementation?
A4: The qold/qnew are two variables for storing intra-cluster variances before/after the new clustering assignment. With these two values, we know
when to stop k_means by calculating the ratio, qnew-qold/qold. If the ratio is less than the predefined threshold, we stop k_means. We didn't use
matlab's kmeans implementation because it is not efficient, its running time is much longer than our implementation. We've done a side-by-side
comparisons for verifying our implementation, which is shown to be more efficient.
Q5: I tried using it to cluster some matrices but i keep getting different results everytime i run exactly the same inputs.
Am i doing something wrong?
A5: That is because each time the initialization of k-means is different. Different initializations yields different results.
Q6: What sigma values would you recommend to run on?
A6: That really depends on the data, you can try sigma value from 0.1 to 10, and check the clustering results. In package we provide two options
for setting sigma values. One is manually selection, the other is self-tuning.
Self-tuning technique is based on NIPS'04 paper,
which avoids the need of manual selection and shows better performance.
Q7: I would like to cite this tool. Which paper should I cite?
A7: Please cite the following document:
@article{Chen11, author = {Wen-Yen Chen and Yangqiu Song and Hongjie Bai and Chih-Jen Lin and Edward Y. Chang}, title = {Parallel Spectral Clustering in Distributed Systems}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence}, volume = {33}, number = {3}, pages = {568-586}, year = {2011} }