Sorry for the hiatus. I've been busy the past few months with other activities, which has left me far less time to organize my thoughts for this blog than I would like. I hope to be far more engaged from this point on.
Today let me talk about
interpretability in machine learning. I was motivated to write about this topic due to a fascinating discussion in
MLDG this week, which was led by Chenhao. The discussion broached a lot of interesting questions and really got me thinking, which in turn reinvigorated my desire to better express my thoughts in the form of a blog post.
"So what is interpretability?"
Interpretability is a term commonly used in the field to motivate different approaches. It is typically used to signify the simplicity or understandability of a machine learning model. For a discipline that was heavily influenced by
Occam's Razor, it seems natural to want to be able to understand a model. This is a particularly common desideratum in applications of machine learning to expert-driven fields like biology, economics or linguistics, since the experts in these fields want to understand and validate the meaning of a model before even considering deploying it. One could make an argument that this extends beyond just these fields, as a large number of ML papers try to make their methods more interpretable, whether via the use of exemplars, feature studies or other such tools.
"So is this a solved problem?"
Far from it. Digging deeper into the literature one observes that work on this topic is fairly limited and tends to be disconnected. Some of the most notable works in this field come from
Stefan Ruping (whose
thesis is probably the best starting point) as well as
Prof. Cynthia Rudin. Even across the works of these two notable researchers, I find that there is no clear-cut, agreed-upon definition of interpretability, which leads to the question:
"How do you define interpretability?"
I believe this is an open question. During MLDG, Chenhao presented a compelling argument that the
Merriam-Webster definition of "interpret" is currently the broadest and closest thing we have to a concise definition of the term in this field. I believe this is one of the key questions the field needs to address as we move into an age where computer scientists actively collaborate with experts from different domains to tackle new challenges in today's information-rich world.
One possibility is defining interpretability in terms of natural language complexity
i.e., how simply the model can be expressed in natural language. This is clearly a step in the right direction, as it gives us a way of quantitatively measuring or evaluating interpretability. However, I am inclined to believe that this definition may be limited. Although language is a powerful tool, one which many believe to be the key aspect that differentiates us from our primate cousins, I am of the opinion that not all understanding happens via language. Instead, I believe a more well-rounded definition of interpretability would rely on cognitive complexity. This, however, is far more open-ended, and it is far less clear how it can be evaluated. Hence I believe coming up with an appropriate definition of interpretability will require ML experts to actively collaborate with our colleagues in the psychological sciences to find ways to express and measure this property. I'll get back to this towards the end.
"Why do we need a definition? Can't we just measure it?"
A valid question indeed. However, without a clear definition of interpretability, it becomes hard to measure. To me this starts to resemble a chicken-and-egg problem. Can you define interpretability based on a measure? Can you measure it given the definition? I do not have a good answer to this question. However, what I do know is that the current approaches to "measuring" interpretability are far from satisfying.
In NLP problems it is commonplace to use an SVM for the classification task being studied (such as sentiment analysis, deception detection, political debate voting
etc.) and then sort the features and present the top features as some proof of the "correctness" of the model. This is also commonly done in the topic modeling literature, for example. Other attempts in this domain use ad hoc formulae to sort and rank a small list of features to help readers "interpret" these models and compare different models. Other work has looked at measuring the human ability to distinguish the top terms of a topic model from an "intruder" term.
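To make this concrete, here is a minimal sketch of what that common practice typically looks like; the toy corpus, the TF-IDF features and the choice of scikit-learn's LinearSVC are all stand-ins, not an example drawn from any specific paper:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for a real sentiment/deception/debate dataset.
docs = ["great movie, loved it", "terrible plot, awful acting",
        "loved the acting", "awful movie, hated it"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LinearSVC(C=1.0).fit(X, labels)

# The usual "interpretation": sort the learned weights and show the extremes.
weights = clf.coef_.ravel()
terms = np.array(vec.get_feature_names_out())
order = np.argsort(weights)
print("most negative terms:", terms[order[:3]])
print("most positive terms:", terms[order[-3:]])
```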
While some of these methods lead to interesting findings, to me they are FAR from being a principled solution to the problem, as it is entirely unclear what these methods are actually measuring. In his thesis, Ruping argues that there are three axes/goals of a model:
"Understandable, accurate and efficient"
Accuracy and efficiency are far more well-studied goals, for which we have clear definitions and measures across different tasks. However, separating the effects of the three axes from each other is hard. That does not mean, however, that we cannot try! I believe that a good first step in trying to gauge the interpretability of a model is to leverage these other two axes, in particular the quantitatively most well-studied one:
Accuracy!
For example: In the aforementioned NLP feature studies, instead of sorting and presenting the different features to compare different models, why not do the following (a rough code sketch follows the list):
1. Choose the top 'k' positive and negative features of the different SVM models being considered.
2. Set all other feature weights to 0.
3. Report accuracy measures for these "2k" sparse models.
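Here is one minimal way these three steps could be implemented, continuing the kind of fitted linear model from the earlier snippet; the helper name sparse_top_k_accuracy, the choice of k and the 0/1 label encoding are my own placeholders:

```python
import numpy as np

def sparse_top_k_accuracy(clf, X_test, y_test, k):
    """Zero all but the top-k positive and top-k negative weights of a
    fitted binary linear model and report held-out accuracy."""
    w = clf.coef_.ravel().copy()
    order = np.argsort(w)
    keep = np.concatenate([order[:k], order[-k:]])  # k most negative + k most positive
    w_sparse = np.zeros_like(w)
    w_sparse[keep] = w[keep]

    # Score the resulting "2k"-sparse model with the original bias term.
    # Assumes 0/1 labels; adapt the thresholding for other encodings.
    scores = X_test @ w_sparse + clf.intercept_[0]
    preds = (scores > 0).astype(int)
    return np.mean(preds == y_test)

# e.g. compare several fitted linear models on the same held-out data:
# for name, clf in models.items():
#     print(name, sparse_top_k_accuracy(clf, X_test, y_test, k=10))
```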
You may ask: "What is this achieving? Why is this more principled?"
Well if we assume that the cognitive load of all "2k" sparse models is the same (which is definitely a big if -- however it is a start), then via these steps we have roughly equalized the understandability and efficiency of the different methods and hence we can compare them on the accuracy axis.
Granted, this scheme makes some assumptions and is far from the final answer to the problem, but I feel such an approach would be a step in the right direction, as it leverages our expertise in evaluating the accuracy of a method while only requiring us to get the interpretability (and efficiency) of the different methods to be "almost the same".
(As an aside, consider this thought experiment:
What is the VC-dimension of 'k' sparse linear models in an n-dimensional space?)
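For what it's worth, here is a rough back-of-the-envelope upper bound via the usual union-of-classes counting argument; this is a sketch, not a careful proof. View the class as a union of $\binom{n}{k}$ families of linear classifiers, one per support set, each of VC dimension $k+1$. If a set of $s$ points is shattered by the union, Sauer's lemma gives

$$2^{s} \;\le\; \binom{n}{k}\left(\frac{es}{k+1}\right)^{k+1} \quad\Longrightarrow\quad s = O\!\left(k \log \frac{n}{k}\right) \text{ for } n \gg k,$$

so, unless I have slipped up somewhere, the capacity of these sparse models grows only logarithmically in the ambient dimension, which is part of why comparing them on accuracy feels reasonable.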
"Ok. We can't yet measure it, so what can we do now?"
For starters we can improve interpretability. This can be done by either exploiting the internals of the learning method used (
white-box optimization of interpretability) or without it (
black-box optimization).
Post-pruning of decision trees or approximating non-linear surfaces by locally-linear boundaries are examples of such methods.
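As one concrete illustration of the first of these (my own sketch, not an example drawn from the works cited here), here is cost-complexity post-pruning of a decision tree in scikit-learn; the dataset and the ccp_alpha value are arbitrary placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Cost-complexity pruning: a larger ccp_alpha trades a little accuracy
# for a much smaller (and hence more understandable) tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

print("leaves:", full.get_n_leaves(), "->", pruned.get_n_leaves())
print("test accuracy:", full.score(X_te, y_te), "->", pruned.score(X_te, y_te))
```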
Two common types of methods to improve interpretability are:
1.
Feature selection: This is more commonly motivated as improving accuracy (by reducing overfitting); however, this also has the side effect of improving interpretability. LASSO and related methods are an example of this kind of technique (both this and instance selection are sketched in code after this list).
2.
Instance selection: Selecting informative instances for learning or to explain the model. SVMs implicitly employ this technique, as the
support vectors selected serve as discriminating instances for the learned model. (This can also be done explicitly by removing SVs.)
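A minimal sketch of both techniques on a generic tabular dataset follows; L1-regularized logistic regression stands in here for LASSO-style feature selection, and the SVM's support vector indices stand in for selected instances (the dataset and regularization strengths are placeholders):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# 1. Feature selection: an L1 penalty drives most weights to exactly zero,
#    leaving a short list of features to inspect.
lasso_like = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(Xs, y)
selected = np.flatnonzero(lasso_like.coef_.ravel())
print("features kept:", selected)

# 2. Instance selection: the support vectors of an SVM are the training
#    instances that actually pin down the decision boundary.
svm = SVC(kernel="linear", C=1.0).fit(Xs, y)
print("support vector indices:", svm.support_[:10], "...")
print("number of support vectors:", len(svm.support_))
```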
Ruping in his thesis focuses on the latter (and in particular SVMs). This also tends to be the preferred approach in the NLP community where SVMs are more common.
(
Aside: I find this particularly surprising given the aforementioned feature comparison experiments performed. Why not just use LASSO-like methods to directly extract meaningful features?)
"Instance selection vs. Feature selection?"
I find the problem of choosing between the two to be interesting. While selecting text features may be the preferred option for some NLP tasks, in vision problems it may be easier to present instances which are used to differentiate between classes (as seen in different deep learning papers). I think there is further work to be done here, as I'm inclined to think that there should exist methods which can get the best of both techniques or smoothly interpolate between the two.
"Does a measure for interpretability exist?"
Let me conclude by leaving you to ponder this question. While we would like to compare the interpretability of different methods quantitatively, the existence of such a measure is unclear.
If you made it all the way here, pat yourself on the back :).
Apologies for the length of the post. As I said, I had a lot of questions on my mind.
My bags from my trip to Dublin were just returned to me a couple of days ago. (Thanks a lot for taking 15 days to find a simple bag, AirFrance!!)
Now that I have my notes again, over the weekend I'll try to recap some of the highlights from
SIGIR and go over some of the papers I found intriguing.