Tuesday, August 27, 2013

Post-hoc usage of competition datasets in IR/NLP/ML

I am in MPI (Tübingen) for the next two weeks attending the Machine Learning Summer School. I have one more batch of papers from SIGIR that I wanted to discuss, but will have to do so later this week. This time around I wanted to write about:
"Post-hoc usage of competition datasets and the alarmingly increasing trend of performing manual validation/hill-climbing while performing research with these datasets".

Today, machine-learning/IR/NLP research is highly data-driven, and getting data for the problem of interest is fairly non-trivial. In this regard, NIST (in the US) and other similar organizations across the globe have helped facilitate research by collecting datasets and organizing tasks and competitions. For example, the Session track, one of the more recent and arguably successful tracks at TREC, has greatly aided research on user search interactions, sessions and complex search tasks. Similarly, the DUC and MUC conferences organized by NIST were key in promoting research on the problem of summarization. This is not limited to the US, as conferences like CLEF and FIRE annually hold competitions for tasks in different languages as well. In fact, the success of TREC and the prestigious annual KDD-Cup competition have led industry to make its data available as part of competitions. The wildly successful Netflix challenge has done wonders for the field of collaborative filtering, as it encouraged truly principled research (as I shall dissect a little later in this post). The launch of Kaggle and its increased popularity has only furthered the amount and diversity of data available.

While these competitions clearly provide valuable data, I increasingly find disturbing trends in the usage of such competition data while reviewing and reading papers. I hope to highlight some of these fallacies in the rest of this post, not only to (hopefully) convince you that such usage is not principled, but also to suggest alternative, more principled ways to use such data. Let me note that these are just my opinions; I encourage you to ponder these (mal)practices and feel free to share your opinion via the comments.

Below is a list of three common fallacies in performing research using these datasets, along with fixes:

a) Failure to validate parameters using a validation set: This I regard as the original sin of machine learning. I am surprised at the number of papers that fail to validate parameters before testing. It is not OK to introduce additional parameters in your proposed method and then report the best performance on the test set (and compare this against other methods)! This is fundamentally incorrect and does not demonstrate the generalizability of the proposed method. Though this seems obvious, consider how many papers you've read recently that do not perform parameter validation; such papers keep appearing even in top-notch conferences like SIGIR, ACL, CIKM, etc. Performing parameter sweeps by manually fixing parameters based on test-set performance is equally unprincipled and should also be avoided.

FIX: Use a validation set/cross-validation to select parameters and only report test-set performance for these validated parameters.
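To make this fix concrete, here is a minimal sketch in Python (using scikit-learn; the dataset, model and parameter grid are purely illustrative stand-ins, not taken from any paper discussed here) of sweeping a parameter with cross-validation on the training data and touching the test set exactly once:

```python
# Minimal sketch: tune the parameter with cross-validation on the training
# data only, then report a single score on the untouched test set.
# Dataset, model and grid are illustrative stand-ins.
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Cross-validated sweep over the regularization parameter C,
# using only the training portion of the data.
search = GridSearchCV(LinearSVC(dual=False),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=5)
search.fit(X_train, y_train)

# The test set is consulted exactly once, with the validated parameter.
print("validated C:", search.best_params_["C"])
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```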

b) Unfair/incorrect comparison: Beyond the previous example of comparing a method with additional parameters tuned on the test set, this fallacy manifests itself in other forms. One of the most common is comparison against systems that were used in the original competition, especially when those systems are no longer state-of-the-art. This is particularly common in summarization papers, which often use, as their main baseline, decade-old systems from the original DUC/MUC/TAC competitions. This is inherently wrong for multiple reasons. First and foremost, the lack of comparison against the state-of-the-art can present a false picture of the method's performance. Secondly, it is important to realize that the original systems had less data to work with and did not have the test set available to them. Even if proper validation is performed, there is likely some amount of manual hill-climbing (explained below), which can inflate performance.

FIX: Compare with state-of-the-art methods whenever possible. Including comparisons to the systems used in the competition is good, but do not use them as your main baseline; they are not an acceptable substitute for the state-of-the-art.


c) Manual hill-climbing/manual validation: This is the most subtle and thus least-recognized fallacy. Take the example of a researcher working with a competition dataset whose test set has been released. In this situation it is very common to adapt and tweak methods to maximize test-set performance. However, doing so amounts to a form of manual validation or hill-climbing and presents a false picture of the generalizability of the proposed method, since test-set performance is used to influence the model selected. This is particularly problematic when the datasets are small, as the TREC and summarization datasets tend to be. Given the small number of test queries in these datasets, it is easy to overfit to them through such manual validation (via feature engineering or model/parameter tweaks) and present an unfair picture of the method's performance. These collections are already unstable to begin with, as was shown this year at SIGIR. Thus this form of evaluation is inherently biased and far from convincing.
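To see concretely why small test sets make this dangerous, here is a toy simulation of my own (the sizes are arbitrary and it is not modeled on any particular track): every "tweak" below is a method that is genuinely no better than chance, yet picking the variant that scores best on a 50-query test set yields a number well above its true performance.

```python
# Toy illustration of manual hill-climbing on a small test collection:
# selecting among many chance-level "tweaks" by their test-set score
# inflates the reported number. Sizes below are arbitrary.
import random

random.seed(0)
n_queries = 50    # a small, TREC-sized set of test queries
n_tweaks = 100    # number of manual model/feature tweaks tried

def score_of_chance_level_variant():
    # Each variant gets each query right with probability 0.5,
    # so its true accuracy is 0.5 regardless of the tweak.
    return sum(random.random() < 0.5 for _ in range(n_queries)) / n_queries

scores = [score_of_chance_level_variant() for _ in range(n_tweaks)]
print("true accuracy of every variant: 0.50")
print("best test-set score after hill-climbing: %.2f" % max(scores))
# The best score typically lands well above 0.5: pure overfitting to the test set.
```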

FIX: Addressing this problem requires being rigorous about the evaluation methodology (i.e., refraining from making changes based on measured test-set performance). Ultimately, though, I believe the best solution is to change the way these competitions are organized. As was done in the Netflix challenge, part of the data should be held out and never revealed publicly; the only access to this data would be the ability to evaluate a method on it (a limited number of times). While implementing this would be hard, I believe it is the most principled way to fix the problem.
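For illustration, here is a rough and entirely hypothetical sketch of what such a protocol could look like on the organizers' side (the class and the limits below are made up): the hidden labels never leave the evaluation service, and each team gets a fixed budget of scoring calls.

```python
# Hypothetical sketch of a leaderboard-style evaluation service:
# hidden test labels stay with the organizers, and each team may
# request a score only a limited number of times.
class HiddenTestSetEvaluator:
    def __init__(self, hidden_labels, max_submissions=5):
        self._labels = hidden_labels   # never exposed to participants
        self._remaining = {}           # submissions left, per team
        self._max = max_submissions

    def evaluate(self, team_id, predictions):
        left = self._remaining.get(team_id, self._max)
        if left <= 0:
            raise RuntimeError("submission budget exhausted for %s" % team_id)
        self._remaining[team_id] = left - 1
        correct = sum(p == t for p, t in zip(predictions, self._labels))
        return correct / len(self._labels)

# A team can only probe the hidden set a handful of times, which makes
# iterative hill-climbing against it impractical.
evaluator = HiddenTestSetEvaluator(hidden_labels=[1, 0, 1, 1, 0])
print(evaluator.evaluate("team_a", [1, 0, 0, 1, 0]))   # prints 0.8
```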

To reiterate, these are my opinions; they are not directed at any specific work but are simply based on past experiences and conversations. I hope they help influence how some of you use these datasets.
