Friday, August 30, 2013

Who do you want for an opponent?

While at dinner today someone told me about the Swedish PhD Defense procedure. At the time I found it really odd and hard to believe as it was completely different to the system used in most other countries. A little further research into the topic made for some very interesting reading.

The really interesting aspect of it is the role of the opponent. In Sweden, an opponent is someone who presents your PhD thesis at your dissertation! Yes, you read that right.

The opponent is someone external (usually) selected by the faculty to oppose a candidate's thesis. They are supposed to read the thesis so thoroughly that they are the ones who present the thesis at the defense and not the candidate! They are also supposed to cross-examine (i.e. critically question) the candidate in detail (hence the term opponent). At the end, the opponent in conjunction with the committee determine if the candidate passes or fails.

I found this to be a very interesting system. Clearly you need to be clear and concise in your writing if you expect someone else to understand and present your work. You also need to be rigorous and have a clear idea of what you're doing since a majority of the defense in spent in the cross-examination phase. However you are subject to the whims and fancies of your opponent. You definitely want an opponent who doesn't hate you while at the same time is well-versed in your research and understands it well.

So who do you think should be your opponent?

Proving GCD is a kernel

At MLSS yesterday Bernard Scholkopf presented an interesting question:

"Is gcd(m,n) a valid kernel"

To me this does not seem difficult. Below is a rather simple proof to show it is a kernel.

Step 1: To prove this, we first realize that it is sufficient to prove log(gcd(m,n)) is a kernel since we know that the exponent of a kernel is a kernel.

Step 2: I will prove that it is a kernel by explicitly constructing a vector space under which log(gcd(m,n) is a dot product.

CONSTRUCTION:

FEATURES: There is a single feature for each prime p (e.g. 2,3,5,7 ..) as well as every higher power of p as well. So for prime number p, there is a feature for p, p2, p3 .... (e.g. 5,25,125,....)

FEATURE VALUE: If the feature for prime p or any of its powers (p2, p3 etc..) is "on" (i.e. non-zero) then the value of the feature is always √log(p).

Lastly we have to describe the feature vector for an arbitrary number n

FEATURE VECTOR: For a number n with prime factorization n = p1q1 * p2q2 * ... pkqk. Then the feature vector for this number will only have these features turned on/non-zero: (∀ i) pi1,pi2,...,piqi

---

It is not hard to see that the dot product of two feature vectors in this space is the log of the gcd of the two numbers.

To see this consider a common prime factor p which has multiplicity q1 in the first number and multiplicity q2 (>= q1 w.l.o.g.) in the second number. Now note the two feature vectors will have p1,p2,...pq1 as common non-zero elements. Hence this contributes a factor of q1*log(p)=log(pq1) towards the dot product, which is exactly the contribution of this prime factor p towards the log(gcd).

---

There are other alternate proofs I can think of (including one which proves that the log gcd is a positive definite matrix). However this seems like the simplest and easiest to understand.

If you have an alternate proof that is even simpler please let me know.


Tuesday, August 27, 2013

Post-hoc usage of competition datasets in IR/NLP/ML

I am in MPI (Tubingen) for the next two weeks attending the Machine Learning Summer School. I have one more batch of papers from SIGIR that I wanted to discuss but will have do so later this week. This time around I wanted to write about :
"Post-hoc usage of competition datasets and the alarmingly increasing trend of performing manual validation/hill-climbing while performing research with these datasets".

Today Machine-learning/IR/NLP research is highly data-driven. Getting data for the problem of interest is a fairly non-trivial. In this regard NIST (in the US) and other similar organizations across the globe have helped facilitate research by collecting datasets and organize tasks and competitions. For example the Session track, one of the more recent and arguably successful tracks at TREC, has greatly helped facilitate research on user search interactions, sessions and complex search tasks. Similarly the DUC and MUC conferences organized by NIST were key in promoting research on the problem of summarization. This is not just limited to the US as other conferences like CLEF and FIRE annually hold competitions for tasks in different languages as well. In fact the success of TREC and the prestigious annual KDD-Cup competition have led the industry to make their data available as part of a competition. The wildly successful Netflix challenge has done wonders for the field of collaborative filtering as it encouraged truly principled research (as I shall dissect a little later in this post). The launch of Kaggle and its increased popularity has only furthered the amount and diversity of data available.

While these competitions clearly provide valuable data, I find increasingly disturbing trends in the usage of such competition data while reviewing/reading papers. I hope to highlight some of these fallacies in the rest of the post not only to (hopefully) convince you that such usage is not principled but also provide alternate more principled techniques to use such data. Let me note that these are just my opinions and I encourage you to ponder these (mal)practices and feel free to share your opinion via the competitions.

 Below are a list of three common fallacies in performing research using these datasets along with fixes:

a) Failure to perform validate parameters using validation set: Which I regard as the original sin of machine learning. I am surprised at the number of papers that fail to validate parameters before testing. It is not ok to introduce additional parameters in your proposed method and then report the best performance using the test set (and compare this against other methods)! This is fundamentally incorrect and does not demonstrate the generalizability of the proposed method. Though this seems obvious, consider how many papers you've read recently that do not perform parameter validation. I find it surprising to find such papers appearing in top-notch conferences like SIGIR, ACL, CIKM etc.. Performing parameter sweeps by manually fixing parameters based on test set performance is also not principled and hence should be avoided.

FIX: Use a validation set/cross-validation to select parameters and only report test-set performance for these validated parameters.

b) Unfair/Incorrect comparison: Like the previous example of comparing a method with additional parameters tuned on the test set, there are other forms which this fallacy manifests itself in. One of the most common cases of this error is comparison against systems that were used in the original competition especially if those systems are no longer state-of-the-art. This is particularly common in summarization papers which commonly use, as their main baseline, decade-old systems that were used in the original DUC/MUC/TAC competitions. This is inherently wrong for multiple reasons: First and foremost lack of comparisons against the state-of-the-art can present a false picture of the method's performance. Secondly, it is important to realize that the original systems had less data to work with and did not have the test set available to them. Even if proper validation is performed, there is likely some amount of manual hill-climbing (explained below) performed which can inflate performance.

FIX: Compare with state-of-the-art methods whenever possible. Having comparisons to systems used in the competition is good. However do not use this as your main baseline as this is not an acceptable substitute for the state-of-the-art.


c) Manual hill-climbing/parameter validation: This is the most subtle and thus least-recognized fallacy. Take the example of a researcher working with some competition dataset whose test set has been released. In this situation it is highly common to adapt and tweak methods to maximize test-set performance. However doing so is equivalent to some form of manual validation or hill climbing and presents a false picture of the generalizability of the proposed method as the test-set performance is used to influence the model selected. This is particularly problematic when the datasets are small as the TREC and summarization datasets tend to be. Due to the small number of test queries in these datasets it is easy to overfit to these datasets by performing such manual validation (either by feature engineering or model/parameter tweaks) and present an unfair picture of the method's performance. These collections are already unstable to begin with as was shown this year at SIGIR. Thus this form of evaluation is inherently biased and far from convincing.

FIX: Coming up with a solution to this problem requires being rigorous about the evaluation methodology (by refraining from making changes on measuring test set performance). I believe the best solution to this problem is to change the way these competitions are organized. As was done in the Netflix challenge I think the best solution is to hold out part of the data and not reveal it publicly. The only access to this dataset would be the ability to evaluate a method on this data (a limited number of times). While implementing this would be hard I believe this is the most principled way to fix this problem.

To reiterate these are my opinions and not directed at any specific work but simply based on past experiences and conversations. I hope these help influence how some of you use these datasets.

Sunday, August 18, 2013

Recapping SIGIR 2013 - Part 2 (Session/Task Search)

Last time around I wrote about my take on trends at this year's SIGIR. This time around I wanted to go over some of the papers that caught my eye including the Best Paper Award winner as well as a bunch of long/short papers on session and task retrieval.

(To those wondering why I'm suddenly so active, it is because writing helps me organize my thoughts about a paper and helps me better analyze it.)


``Beliefs and Biases in Web Search'' by Ryen White : This paper thoroughly deserved to win the best paper award, as it provided excellent insights into biases displayed by searchers and search engines. While psychological studies have shown that beliefs and (sub-conscious) biases influence human behavior and decision-making, this is the first paper that addresses the issue in the context of search. Unlike other work on IR biases (which relate to the presentation by the search engine) this focuses on biases due to skews in the result list as well as seemingly-irrational search behavior. Focusing on yes-no questions from the medical domain the paper studies biases in searchers' click behavior as well as skews in search engine result lists compared to the ground truth labels (obtained from 2 medical experts). The paper shows that both parties (the searcher and search engine) are biased towards positive information (i.e., a Yes response). Some of the findings in this paper that demonstrate this include:
- Users highly favor yes responses when they were uncertain to begin with,
- A survey showed that users demonstrate strong confirmation biases, as they were unlikely to change strong beliefs.
- Search engine tend to disproportionately rank Yes-results higher.
- Users tend to skip No-results to click on Yes-results.

Putting everything together, Ryen showed that this can lead to highly sub-optimal searching as the users wind up with the incorrect answer a large fraction of the time (potentially between 45-55%). These findings have strong implications on search engine ranking and demonstrate the need for (unbiased) ranking functions. While the focus over the past few years has been on satisfying users (and hence training using click data and other indicators of satisfaction), if user satisfaction is driven by confirmation bias this can lead to users finding inaccurate answers. Hence there is a need to either incorporate these biases (say via the snippets shown) while using user satisfaction signals or to explicitly de-bias these rankings or both. Our work on stable coactive learning could be useful for this purpose. Alternately ranking lists could be diversified to show more opinions/perspective, say via a conjunction of ideas from work on user intent diversification and diversification due to opinion biases.



Session Search and Complex Search Tasks:

As I noted earlier, there is a noticeable spike in the work on this topic. The TREC Session track rightfully deserves some accolades as it has helped facilitate much of the work on this topic.

There were quite a few papers at SIGIR which dealt with different aspects of session or task retrieval including our work on Intrinsically Diverse Tasks (which won us the Best Student Paper Award).

Here are some other papers that caught my eye:

"Task-Aware Query Recommendation" by Feild and Allan: The first talk of the conference. This paper studied the effect of task-based context on query recommendation performance. Their method of incorporating multiple previous queries as context is similar to that used in session search, which is based on 2 factors: a) Relevance of context to current query/task, b) How far back was the context.  They showed that while on-task queries as context can help improve recommendations, even a single off-task query as context can be harmful. Their experiments also indicate that state-of-the-art same-task classification methods can be used to remove such off-task queries and improve overall performance.


"Characterizing Stages of a Multi-session Complex Search Task through Direct and Indirect Query Modiļ¬cations" by He et. al.: While previous work have shown that user interaction behavior changes across different stages of a search task, this short paper from the CWI group tries to account for differences in interface functionalities. Based on a user study performed in a course involving 25 media studies students, they studied behavioral changes across task stages from two sources: 1) Those seen explicitly via (manual) query reformulations : the universal feature available across all search interfaces; 2) Changes based on richer search interface features/functionalities (beyond just the search box), such as search filters. Their findings indicate that the behavioral patterns observed are similar across these two sources, which indicates that search engines could use signals from various components beyond the search box to better identify the user search context.

"Aggregated Search Interface Preferences in Multi-Session Search Tasks" by Bron et. al. : This paper, from the ILPS group, is on a related topic i.e., aggregate user interface preferences during multi-session search tasks. In particular this paper studies user preferences between three aggregate interfaces/displays: 1. Tabbed 2. Blended and 3. Similarity: Blended with Find-Similar functionality. Using the same 25 student media-study class data as the above work, they found significant differences in usage of the different interfaces as the search task proceeds. While the tabbed display was found to be the most popular (esp. at the start and end of the search tasks), the similarity display was the least. While the tabbed display was favored at the start of the task for more general exploratory searches, blended displays usage increased during the middle as users looked to find more specific content. A second lab study found that the tabbed display was regarded as the easiest to use with blended displays being harder. This was especially pronounced when the information need of the searcher was the same for the different displays, indicating that interface switching is more likely when there is a change in information need.


``Query Change as Relevance Feedback in Session Search'' by Zhang et. al. : This short paper from Grace Yang's group tackles the problem of improving session search using both the documents previously seen as well as the queries previously issued within a relevance feedback (RF) framework. Their method, which comprises a set of heuristics that alter retrieval for the subsequent query based on differences to the current query, is demonstrated on the TREC 2012 Session track data. Incorporating more than one previously issued queries is one avenue for improving on this work.

"Utilizing Query Change for Session Search" by Guan et. al. : This is essentially a longer version of the above paper. The paper again uses term-weight modification heuristics from the previous paper with a Markov Decision Process  viewpoint of modeling query change.


You may also be interested in :

- Recapping SIGIR 2013 : Overview


EDIT: Corrected an error pointed out to me by Prof. Arjen de Vries about a paper by Bron et. al. that I incorrectly attributed to the CWI group instead of the ILPS group. Apologies for my oversight especially to the authors of the paper.

Recapping SIGIR 2013 : Overview

Last time around I wrote about interpretability in machine learning, something that I have dabbled with in the past. Today I wanted to recap some of the highlights of SIGIR 2013.

To begin with I should mention I found the conference to very well-organized and found some of the talks to be very interesting and stimulating. Compared to the last SIGIR I attended in 2010 (Geneva), I found that there has been a noticeable change in the papers at the conference and the overall program content:

1. User modeling and coming up with models more suited for modeling search user's satisfaction is a very popular topic. The MUBE workshop at SIGIR this year, which actively focused on this topic, was well-attended and lively. KDD this year too had quite a few papers on user modeling.

Coming up with a good surrogate to user satisfaction is excellent for post-hoc analysis of different methods. However I am slightly concerned as to how some of these proposed measures can be employed to train better search models and improve the overall search experience i.e., it is unclear to me if current learning methods can optimize for these new measures instead of tradition measures like NDCG. Thus I feel like there is more learning-oriented work to be done on this topic.

2. In fact it was not just the above topic that had reduced machine learning content but the conference in general. In my opinion, it seems like work on learning-related topics like Learning to Rank (LTR) is now better suited to appear in KDD or ICML than SIGIR. I guess I should not be shocked given rumors of current search engine ranking learners hitting a plateau. Seeing how tuned these rankers are for different criterion and verticals it is understandable why people would worry about the global changes a new learned model could bring about and hence why they tend to not directly use learning. While fine for the short-term, I'm not sure how viable this strategy is for the long term. It seems that the the time is right for someone to come up with the next big thing in LTR.

3. People are a lot more interested in complex search tasks and session search as seen by the substantial number of papers on the topic. I'll detail some of these papers in my next post.


As mentioned earlier, there were a lot of great papers this year at SIGIR which has left me with a long reading list. Over the next few days I'll try to go over some of these papers. I'll try to go over one of the highlight papers (in my opinion) along with papers on a specific topic.

You may also be interested in :

- Recapping SIGIR 2013 - Part 2 (Session/Task Search)

Thursday, August 15, 2013

Interpretability in Machine Learning

Sorry for the hiatus. Been busy the past few months on other activities which has led me to spend far less time organizing my thoughts for this blog than I would like. I hope to be far more engaged from this point on.

Today let me talk about interpretability in machine learning. I was motivated to write about this topic due to a fascinating discussion in MLDG this week, which was led by Chenhao. The discussion broached a lot of fascinating questions and really got me thinking which in turn reinvigorated my desire to better express my thoughts in the form of a blog post.

"So what is interpretability?"

Interpretability is a term commonly used in the field to motivate different approaches. It is typically used to signify the simplicity or understandability of a machine learning model. For a discipline that was heavily influenced by Occam's Razor it seems natural to desire being able to understand a model. This is a particularly common desiderata in applications of machine learning pertaining to expert-driven fields like biology, economics or linguistics since the experts in the fields want to understand and validate the meaning of a model before even considering deploying it. One could make an argument that this beyond just these fields as a large number of ML papers are trying to make their methods more interpretable either via the use of exemplars, feature studies or other such tools.

"So is this a solved problem?"

Far from it. Digging deeper into the literature one observes that work on this topic is fairly limited and tends to be disconnected. Some of the most notable works in this field come from Stefan Ruping (whose thesis is probably the best starting point) as well as Prof. Cynthia Rudin. Even across the works of these two notable researchers, I find that there is no clear-cut, agreed-upon definition of interpretability, which leads to the question:

"How do you define interpretability?"

I believe this is an open question. During MLDG, Chenhao presented a compelling argument for the Merriam-Webster definition of interpret is currently the broadest and closest thing we have to a concise definition of the term in this field. I believe this is one of the key questions that the field needs to be addressed as we move into an age where computer scientists actively collaborate with experts from different domains to tackle new challenges in today's information rich world.

One possibility is defining interpretability in terms of natural language complexity i.e., how simple is it to express in terms of natural language. This is clearly a step in the right direction as it presents us a way of quantitatively measuring or evaluating the interpretability. However I am inclined to believe that this definition may be limited. Although a language is a powerful tool, which many believe to be the key aspect that differentiates us from our primate cousins, I am of the opinion that not all understanding is via language. Instead I believe a more well-rounded definition of interpretability would rely on the cognitive complexity. This however is far more open-ended and it is far less clear how this can be evaluated. Hence I believe coming up with an appropriate definition of interpretability will require ML experts actively collaborating with our colleagues in the psychological sciences to come up with ways to express and measure this property. I'll get back to this towards the end.

"Why do we need a definition? Can't we just measure it?"

A valid question indeed. However without a clear definition of interpretability it becomes hard to measure it. To me this starts resembling chicken-and-egg problem. Can you define interpretability based on a measure? Can you measure it given the definition? I do not have a good answer to this question. However what I do know is that the current approaches to "measuring" interpretability are far from satisfying.

In NLP problems it is commonplace to use an SVM for the classification task being studied (such as sentiment analysis, deception detection, political debate voting etc..) and then sort the features and present the top features as some proof of the "correctness" of the model. This is also commonly done in the topic modeling literature for example. Other attempts in this domain use adhoc formulae to sort and rank a small list of features to help readers "interpret" these models and compare different models. Other work has looked at measuring the human capability of distinguishing the top terms of a topic model from an "intruder" term.

While some of these methods lead to interesting findings, to me these are FAR from being a principled solution to the problem as it is entirely unclear what these methods are actually measuring. In his thesis, Ruping argues that there are three axes/goals of a models:

"Understandable, accurate and efficient"

Accuracy and efficiency are far more well-studied goals for which we have clear definitions and measures for different tasks. However separating the effects of the three axes from each other is hard. That does not mean however that we cannot try! I believe that a good first step in trying to gage the interpretability of a model is via leveraging these other two axes in particular the quantitatively most well-studied one: Accuracy!

For example: In the aforementioned NLP feature studies, instead of sorting and presenting the different features to compare different models, why not do the following:
1. Choose the top 'k' positive and negative features of the different SVM models being considered.
2. Set all other feature weights to 0.
3. Report accuracy measures for these "2k" sparse models.

You may ask: "What is this achieving? Why is this more principled?"
Well if we assume that the cognitive load of all "2k" sparse models is the same (which is definitely a big if -- however it is a start), then via these steps we have roughly equalized the understandability and efficiency of the different methods and hence we can compare them on the accuracy axis.

Granted that this schema has some assumptions and is far from the final answer to the problem, I feel such an approach would be a step in the right direction as it leverages our expertise in evaluating accuracy of a method while only requiring us to get the interpretability (and efficiency) of different methods to be "almost the same".

(As an aside, consider this thought experiment: What is the VC-dimension of 'k' sparse linear models in an n-dimensional space?)

"Ok. We can't yet measure it, so what can we do now?"

For starters we can improve interpretability. This can be done by either exploiting the internals of the learning method used (white-box optimization of interpretability) or without it (black-box optimization).
Post-pruning of decision trees or approximating non-linear surfaces by locally-linear boundaries are examples of such methods.

Two common types of methods to improve interpretability are:
1. Feature selection: This is more commonly motivated  as improving accuracy (by reducing overfitting); however this also has the side effect of improving interpretability. LASSO and related methods are an example of this kind of technique.
2. Instance selection:  Selecting informative instances for learning or to explain the model. SVMs are an example of one such application implicitly employing this technique as the support vectors selected form discriminating instances for the model learned. (This can also be done explicitly by removing SVs.)

Ruping in his thesis focuses on the latter (and in particular SVMs). This also tends to be the preferred approach in the NLP community where SVMs are more common.

(Aside: I  find this particularly surprising given the aforementioned feature comparison experiments performed. Why not just use LASSO-like methods to directly extract meaningful features?)

"Instance selection vs. Feature selection?"

I find the problem of choosing one out of the two to be interesting. While selecting text features may be a preferred option for some NLP tasks, in vision problems it may be easier to present instances which are used for differentiating different classes (as seen in different deep learning papers). I think there is further work to be done here as I'm inclined to think that there should exist methods which can get the best of both techniques or smoothly interplay between the two.

"Does a measure for interpretability exist?"

Let me conclude by leaving you to ponder this question. While we would like to compare the interpretability of different methods quantitatively, the existence of such a measure is unclear.

If you made all the way here, pat yourself on the back :).
Apologies for the length of the post. As I said, I had a lot of questions on my mind.

My bags from my trip to Dublin were just returned to me a couple of days ago. (Thanks a lot for taking 15 days to find a simple bag AirFrance !! )
Now that I have my notes again, over the weekend I'll try to recap some of the highlights from SIGIR and go over some of the papers I found intriguing,