904Labs was at SIGIR 2016 in Pisa. To stay in touch with the latest developments in search engine technology, we love to go to scientific conferences. This time, Manos visited SIGIR, the top-tier international conference on search engine technology. Many interesting talks came by in three days, with two of the core topics of discussion being (1) the understanding of natural language and (2) the rise of voice search. Voice queries are on average longer than text queries (7 terms for voice queries vs. 4 for text queries), and they look much more like natural language (for example, they contain fewer nouns). Voice queries also lead to fewer clicks on search results, and context plays a big role: it is important for the system to keep track of previous queries and the answers it gave.
Keynote presentations by Google and Amazon gave us great insights into the challenges that big search and online retail companies are facing right now. Google is putting a lot of effort into accommodating voice queries, possibly hinting at a future "customized personal assistant". For Google, future search will be built around three pillars: a) answer, b) converse, and c) anticipate. Answer is about returning the right results, converse is about keeping the context of previous searches (e.g., "How old is Michael Jackson", followed by the query "What is his height?"), and anticipate is about being proactive and returning results to users based on their physical context (e.g., weather when they wake up, traffic conditions when they enter their car).
Closer to us, Amazon gave great insights into how they deal with search for e-commerce. Learning to rank is core to their search, and they train it using user behavior; this is very similar to how 904Labs self-learning search works. Features for their learning to rank algorithm include similarity scores to parts of an item, but also behavioral features, and many query reformulation features for increasing recall. Most interesting was an example from the fashion category where someone was looking for a [diamond wedding ring] and the first result was a cheap, fake diamond ring. This was due to training on the behavior of all users. However, in fashion, a store doesn't want to look cheap, so what Amazon did was to bias the training data towards the more fashion-conscious users, the fashionistas, who like to spend a bit more. The result was that a proper diamond ring was shown at the top, and a replica was shown a few ranks down. Very good use of training data and machine learning indeed!
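To make the idea of biasing training data concrete, here is a minimal sketch of how click logs could be turned into relevance labels with a higher weight for one user segment. This is our own illustration, not Amazon's actual method: the function name, the segment names, the item names, and the weight of 5.0 are all hypothetical.

```python
from collections import defaultdict


def weighted_click_labels(clicks, segment_weights, default_weight=1.0):
    """Sum click weights per (query, item) pair.

    clicks: iterable of (query, item, user_segment) tuples.
    segment_weights: hypothetical mapping from user segment to weight;
    segments that are not listed get default_weight.
    """
    labels = defaultdict(float)
    for query, item, segment in clicks:
        labels[(query, item)] += segment_weights.get(segment, default_weight)
    return dict(labels)


# Toy click log (made-up data): three clicks from general users on the
# replica, one click from a fashion-conscious user on the real ring.
clicks = [
    ("diamond wedding ring", "replica_ring", "general"),
    ("diamond wedding ring", "replica_ring", "general"),
    ("diamond wedding ring", "replica_ring", "general"),
    ("diamond wedding ring", "real_diamond_ring", "fashion_forward"),
]

# Training on all users equally: the replica gets the strongest label.
unweighted = weighted_click_labels(clicks, {})

# Biasing towards the fashion-conscious segment (weight is an assumption).
biased = weighted_click_labels(clicks, {"fashion_forward": 5.0})
```

With equal weights the replica collects the highest label for the query; with the segment weight the real ring does. A learning to rank model trained on the biased labels would then be pushed to rank the real ring higher.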
Last but not least, it was good to see work from big search companies aiming at reusing their massive historical datasets for testing new learning to rank systems. The task is a very difficult one, and an active topic of research. The tutorial on "Counterfactual evaluation of search engines" by Adith Swaminathan and Thorsten Joachims from Cornell University was about this topic. During the tutorial, they presented the latest ideas on how to train and evaluate new systems on historical data. As a side note, at 904Labs we're actually using algorithms introduced by Joachims and his collaborators, which Manos explained during his presentation on how 904Labs uses the latest technology from academia in its search solutions.
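For readers who wonder what testing a new system on old data can look like, one standard estimator in this area is inverse propensity scoring (IPS). The sketch below is our own simplified illustration and not taken from the tutorial: it assumes the old system logged the probability with which it showed each item, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoggedInteraction:
    query: str
    shown_item: str    # item the old (logging) system displayed
    propensity: float  # probability with which the old system showed it
    reward: float      # e.g. 1.0 for a click, 0.0 otherwise


def ips_estimate(log, new_policy_prob):
    """Estimate the average reward of a new policy from old logs.

    new_policy_prob(query, item) returns the probability that the new
    policy would show `item` for `query`. Each logged reward is
    reweighted by how much more (or less) likely the new policy is to
    make the same choice as the old one.
    """
    total = 0.0
    for entry in log:
        weight = new_policy_prob(entry.query, entry.shown_item) / entry.propensity
        total += weight * entry.reward
    return total / len(log)


# Toy log (made-up data): the old system showed "a" or "b" with equal
# probability; only "a" was clicked.
log = [
    LoggedInteraction("q", "a", 0.5, 1.0),
    LoggedInteraction("q", "b", 0.5, 0.0),
]


def always_a(query, item):
    """Hypothetical new policy that always shows item "a"."""
    return 1.0 if item == "a" else 0.0


estimate = ips_estimate(log, always_a)
```

On this toy log the estimate for the always-"a" policy is 1.0, which matches intuition: every logged impression of "a" was clicked. In practice the estimate is only reliable when the old system gave a non-zero probability to everything the new policy might show, and it can have high variance when propensities are small.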
In general, SIGIR 2016 was good fun! It was great to be there to get a shot of the latest trends in IR, and we were happy to find that, from a technological point of view, we are at least on par with what the big search companies use to power their search. From neural networks to methods for reusing historical data to train new versions of a search engine, we're on top of it.
Contact us if you want to know more about the latest developments in the field of search engines!