Friday, September 7, 2012

Updated Website and Hotel Review Scraping

 Also figured I'd mention that I updated my webpage to look more professional. I'll be making more changes over the next few weeks. In particular, I will be updating the software for the different projects I've worked on, and also putting out a general framework for correctable learning (check out our CIKM paper for more details). I believe this is a very interesting problem with significant benefits to be had in real-world applications, and hence I want to release a general framework that others can use to extend this work (I'll put out more details when I'm done: ETA end of this month). I've also put up the code I wrote for scraping hotel reviews; check it out in case you're interested.

Bag of Little Bootstraps

  It's been a while since I last posted. However, I expect to be a lot more active now, aiming to post at least once a week about different papers I've read in the aftermath of KDD and ICML 2012.

  Speaking of which, I was fortunate to be in attendance for Prof. Michael Jordan's keynote at KDD 2012. The theme of his talk was using "divide-and-conquer" for machine learning problems, which is of particular interest given the emphasis on Big Data currently prevalent in our research community. While I was highly impressed by all three applications of this simple idea, I was particularly blown away by the Bag of Little Bootstraps idea he presented. More specifically, when Prof. Jordan mentioned towards the end of the talk that the theoretical guarantees for asymptotic convergence also extend to cross-validation and jackknifing (more on this at the end), that piqued my interest. Ever since, I have been wanting to read their ICML paper as well as the longer version on arXiv.

   Being an admin for the Cornell Machine Learning Discussion Group (MLDG), I proposed reading this paper for our first meeting of the fall semester, as I felt it was a nice idea that incoming PhD students would also be interested in. Volunteering to lead the discussion also allowed me to read the paper and gain further insight into this idea.

  The setting for this paper is simple: in many machine learning problems, we are interested in estimating the quality of our learned model/parameters. The standard way of doing this has been the age-old Monte Carlo technique of bootstrapping. Given a dataset of size N, bootstrapping is simply the process of repeatedly drawing new samples of size N WITH replacement. Computing the desired quantity/estimator (model accuracy, a parameter, a correlation, etc.) on each of these samples and then looking at the distribution as you create more samples (i.e., run more bootstrap iterations) lets you estimate statistics related to the quality of your estimator, such as its confidence interval or variance. Along with being extremely simple, one of the nicest things about the bootstrap is that it comes with strong theoretical guarantees; in particular, you can get convergence at the rate of O(1/N) as opposed to the O(1/sqrt(N)) you get from the Central Limit Theorem. However, the bootstrap tends to be computationally intensive since the estimators are computed on sets of size N, and the overhead of data transfer when running it on a distributed platform (as is common for most big-data applications today) is prohibitive.
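  To make that concrete, here is a minimal sketch of the plain bootstrap in Python (the names `bootstrap` and `estimator` are my own, not from the paper): resample N points with replacement, recompute the estimator each time, and read statistics off the resulting distribution.

```python
import numpy as np

def bootstrap(data, estimator, n_boot=500, seed=0):
    """Approximate the sampling distribution of `estimator` on `data`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Resample N points WITH replacement and recompute the estimator.
        resample = data[rng.integers(0, n, size=n)]
        stats[b] = estimator(resample)
    return stats

# Example: a 95% confidence interval for the median.
data = np.random.default_rng(42).exponential(size=10_000)
dist = bootstrap(data, np.median)
print(np.percentile(dist, [2.5, 97.5]))
```

  Note that every iteration touches all N points, which is exactly the cost that becomes painful at scale.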

  Thus people have been using alternatives to bootstrapping, most notably subsampling and its relatives. In these methods, samples of size M are drawn from the dataset WITHOUT replacement, and the estimators are computed on these samples. These values are then used to approximate the estimator distribution for a set of size N via a projection/correction step. These methods have the advantage of being computationally light. However, they tend to be sensitive to the choice of M (for which there is no easy way to choose), and their statistical guarantees hold only asymptotically, in the regime where M/N -> 0 as N goes to infinity.
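  For contrast, here is a hedged sketch of the subsampling idea, assuming a root-N-consistent estimator so that deviations shrink like 1/sqrt(size); the sqrt(M/N) rescaling at the end is the correction step mentioned above, and it is also where the sensitivity to M creeps in. Function names are illustrative, not from the paper.

```python
import numpy as np

def subsample_ci_width(data, estimator, M, n_subs=100, seed=0):
    """Estimate a 95% CI width at size N from subsamples of size M."""
    rng = np.random.default_rng(seed)
    N = len(data)
    full_est = estimator(data)
    devs = np.empty(n_subs)
    for i in range(n_subs):
        # Draw M points WITHOUT replacement and recompute the estimator.
        idx = rng.choice(N, size=M, replace=False)
        devs[i] = estimator(data[idx]) - full_est
    lo, hi = np.percentile(devs, [2.5, 97.5])
    # Correction step: rescale deviations from scale M back to scale N.
    return (hi - lo) * np.sqrt(M / N)
```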

  This paper introduces a simple but neat idea that merges these two methods while getting the best of both worlds. At a high level, the idea is the following. As in subsampling, uniformly sample (without replacement) a set of size M from the dataset (of size N). For each of these subsamples, repeatedly sample uniformly, WITH replacement, a set of size N. Now compute the required estimators on these size-N sets, and average the resulting quality assessments over the different subsamples to get the required estimate and its distribution. By resampling to size N, we no longer need any correction step. Furthermore, since each size-N set contains at most M distinct points (each with an associated count), it is computationally cheap to work with. For an SVM, for example, the complexity is as if the data were of size M, since the counts simply weight the different slack variables in the optimization problem. The same is true for other common methods, meaning that we get an estimator for a dataset of size N while requiring only the computational resources for a dataset of size M.
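  Here is a rough Python sketch of how I understand the procedure (my own code, not the authors'): the key trick is that a "size-N" resample is just a multinomial vector of counts over the M subsampled points, which a weighted estimator can consume directly.

```python
import numpy as np

def blb(data, weighted_estimator, M, n_subsamples=10, n_resamples=50, seed=0):
    """Bag of Little Bootstraps: average a per-subsample quality measure."""
    rng = np.random.default_rng(seed)
    N = len(data)
    per_subsample = []
    for _ in range(n_subsamples):
        idx = rng.choice(N, size=M, replace=False)   # the "little" sample
        subset = data[idx]
        stats = np.empty(n_resamples)
        for r in range(n_resamples):
            # A size-N resample, represented as counts over the M points.
            counts = rng.multinomial(N, np.full(M, 1.0 / M))
            stats[r] = weighted_estimator(subset, counts)
        # Quality assessment for this subsample (here: the std. dev.).
        per_subsample.append(stats.std())
    return np.mean(per_subsample)

# Example: variability of the mean, with M = N^0.6 as a typical choice.
data = np.random.default_rng(7).normal(size=100_000)
print(blb(data, lambda x, w: np.average(x, weights=w), M=int(len(data) ** 0.6)))
```

  The estimator only ever sees M distinct points, which is why the SVM example above reduces to a problem of size M with weighted slack variables.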

  This simple trick makes the method highly suitable for the large datasets we have today. Furthermore, the authors are able to prove a lot of nice theoretical properties for this method, similar to those for the bootstrap, under assumptions most of which are highly reasonable. The only questionable one is the requirement that the number of subsamples s be of the order O(N/M), which is not realistic (as seen in their experiments, where they set it to a low value, i.e., < 20).

 As mentioned at the start, one of the key reasons I was interested in this was its implications for how we perform cross-validation and jackknifing. Consider learning the regularization parameter of an SVM using k-fold cross-validation. Ideally we would like k to be as high as possible, since that allows us to estimate the parameter on a dataset of size N(1 - 1/k). However, that also tends to be computationally much harder. For low values of k, we end up estimating a parameter for a dataset size which is very different from the one we will eventually train on. This is particularly problematic for parameters which strongly depend on dataset size (such as the regularization parameter in SVM-perf). Now instead consider using the idea from this paper for this problem. Not only am I assured of learning parameters at the same dataset size as the one I will train on, but I can also scale the computation to the resources available. To me, this seems like a far more interesting avenue for applying this technique.
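  As a purely speculative sketch of that idea (mine, not the paper's), one could tune the SVM's C at an effective training size of N while fitting only on M distinct points, by passing the multinomial counts as sample weights. Here I assume scikit-learn's LinearSVC, which accepts per-example weights in fit, and I score each candidate C on held-out points the subsample never touched.

```python
import numpy as np
from sklearn.svm import LinearSVC

def blb_select_C(X, y, Cs, M, n_resamples=5, seed=0):
    """Pick the regularization parameter C at an effective data size of N."""
    rng = np.random.default_rng(seed)
    N = len(y)
    sub = rng.choice(N, size=M, replace=False)      # the M distinct points
    held = np.setdiff1d(np.arange(N), sub)          # points never subsampled
    scores = {C: [] for C in Cs}
    for _ in range(n_resamples):
        # Counts of a size-N resample, used as per-example weights.
        counts = rng.multinomial(N, np.full(M, 1.0 / M)).astype(float)
        for C in Cs:
            clf = LinearSVC(C=C).fit(X[sub], y[sub], sample_weight=counts)
            scores[C].append(clf.score(X[held], y[held]))
    return max(Cs, key=lambda C: np.mean(scores[C]))
```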

 One of the key questions this opened up is whether we can do a better job of choosing the samples of size M. Given that we draw, say, only 20 subsamples, then if N > 20M we will not have seen all the data points. In such a scenario, does it make sense to sample the original M points uniformly? Furthermore, does it make sense to sample the bags of size N (generated from the M points) uniformly? It appears to me that, given that you cannot see all the points, you may want to weight points from denser regions more highly in both sampling stages.
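  To make the first question concrete, here is a toy sketch (entirely my own speculation, not something from the paper) of what density-weighted subsampling could look like: estimate each point's local density from its k-nearest-neighbour distance and bias the first-stage sample towards denser regions. Whether this preserves any of the paper's guarantees is exactly the open question.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_weighted_subsample(X, M, k=10, seed=0):
    """Draw M points with probability proportional to a crude density estimate."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    # A larger k-NN radius means lower density; invert it to up-weight dense regions.
    density = 1.0 / (dists[:, -1] + 1e-12)
    probs = density / density.sum()
    return rng.choice(len(X), size=M, replace=False, p=probs)
```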

 Another question which arises is whether the samples drawn in the two stages should be independent of what was drawn earlier. In particular, I wonder whether some sort of adaptive sampling could be used, based on the estimator values seen in previous rounds. Ideas such as the exponential weighting in boosting or the adaptive sampling techniques in active learning are interesting alternatives to the sampling used here, since you can still show nice theoretical properties in both of those cases.

  Finally, the one minor critique I had was that they did not show what impact their method has on the estimator value itself. In other words, does not seeing all the data make my learned estimator worse? It would have been nice to see this for experiments beyond the simulations they ran.

Overall, a top-notch paper which raises a lot of interesting questions and could have serious impact.