Empirical Results on the Generalization Capabilities and Convergence Properties of the Bayes Point Machine
Perceptual Computing Section, The MIT Media Laboratory
We explore some of the convergence and generalization properties of the Bayes Point Machine (BPM) developed by Herbrich [3], an alternative to the Support Vector Machine (SVM) for classification. In the separable case, there are infinitely many hyperplanes in version space that perfectly separate the data. Instead of choosing a solution by maximizing the margin (as with the SVM), the BPM seeks an approximation to the Bayes Point, the point in version space whose behavior is closest to the Bayes integral over all hyperplanes. Herbrich approximates this point using a stochastic algorithm for an arbitrary kernel (the billiard algorithm), which bounces a ball around version space in order to estimate the Bayes Point. Because many kernels imply infinite-dimensional feature spaces, it is interesting to investigate how long such an algorithm must run before it converges. In a series of experiments on separable data (i.e., separable in the feature space), we thus test the BPM algorithm with a polynomial kernel (low-dimensional) and an RBF kernel (infinite-dimensional). We compare generalization results with the SVM, investigate convergence rates, and examine the difference between the BPM and SVM solutions under several data conditions. We find that the BPM converges rapidly and tightly even for the infinite-dimensional kernel, and that it has significantly better generalization performance only when the number of support vectors is moderate relative to the number of training points (i.e., more often with low-dimensional kernels). In addition, we augment Herbrich's discussion with some comments on the bias term, corrections to his pseudocode, and a MATLAB implementation of his algorithm to be made publicly available.
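To make the billiard idea concrete, the following is a minimal NumPy sketch for the linear-kernel case, under simplifying assumptions: the ball starts at any weight vector that already separates the training set, travels in a straight line until it reaches one of the bounding hyperplanes y_i <w, x_i> = 0, reflects off it, and the segment midpoints, weighted by segment length, accumulate into an estimate of the center of mass of version space, the usual approximation to the Bayes Point. This is an illustrative sketch only, not Herbrich's kernelized pseudocode and not the MATLAB implementation referred to above; the function name and parameters (billiard_bayes_point, w0, n_bounces) are hypothetical.

    import numpy as np

    def billiard_bayes_point(X, y, w0, n_bounces=1000, seed=0):
        # Simplified linear-kernel billiard estimate of the Bayes Point
        # (center of mass of version space).  X: (m, d) inputs, y: (m,)
        # labels in {-1, +1}, w0: any weight vector separating the data
        # (e.g. a perceptron or SVM solution).  Illustrative sketch only.
        rng = np.random.default_rng(seed)
        N = y[:, None] * X                          # plane normals n_i = y_i x_i
        N = N / np.linalg.norm(N, axis=1, keepdims=True)
        b = w0 / np.linalg.norm(w0)                 # ball position in version space
        v = rng.standard_normal(X.shape[1])
        v /= np.linalg.norm(v)                      # ball direction
        w_cm = np.zeros(X.shape[1])                 # running center-of-mass estimate
        for _ in range(n_bounces):
            num, den = N @ b, N @ v
            tau = np.full(len(num), np.inf)
            hit = den < -1e-12                      # planes the ball is flying towards
            tau[hit] = -num[hit] / den[hit]         # flight time to each such plane
            if not np.isfinite(tau).any():          # degenerate direction: resample
                v = rng.standard_normal(X.shape[1])
                v /= np.linalg.norm(v)
                continue
            j = int(np.argmin(tau))                 # first boundary plane reached
            b_new = b + tau[j] * v
            b_new /= np.linalg.norm(b_new)          # stay on the unit sphere
            mid = (b + b_new) / 2.0                 # midpoint of this segment
            w_cm += np.linalg.norm(b_new - b) * mid # weight midpoint by segment length
            v -= 2.0 * (v @ N[j]) * N[j]            # reflect direction off plane j
            b = b_new
        return w_cm / np.linalg.norm(w_cm)

In the kernelized setting the same updates are maintained on dual expansion coefficients, with every inner product replaced by a kernel evaluation; this is what allows the ball to be bounced in an infinite-dimensional feature space such as the one induced by the RBF kernel.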