Behind the Search: The Human Touch in Your Search Results

By Expect Labs @ExpectLabs

How are search engines able to sift through enormous databases to deliver exactly the right results? Watch our own Suvda Myagmar explain how there is a human touch involved in making your search results more relevant.

Learn more about search quality metrics and measurement methods with the previous installment in our “Behind the Search” series.

TRANSCRIPT:

Hi, I’m Suvda Myagmar from Expect Labs and I’m continuing my talk on evaluating search quality. Previously, I discussed search metrics and how to collect data to compute these metrics. One of the data collection methods is manual rating of search results by human judges.

This method involves running a user experiment in which workers, or judges, rate the relevance of search results. The experiment can be designed so that each task displays a single query paired with one result link, a query paired with the entire result set, or a query with results from two different ranking algorithms or search engines presented side by side. The rating scale can also vary: a simple boolean rating (relevant or not relevant), a multi-point scale with 4-5 points, or a sliding scale from 0 to 100 labeled from ‘least relevant’ to ‘most relevant’.
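
To make those task designs and rating scales concrete, here is a minimal sketch of how such rating tasks might be modeled in code. The names and structure are purely illustrative assumptions, not Expect Labs' actual schema.

```python
# Illustrative sketch of the rating-task designs and scales described above.
# All names here are hypothetical, not any particular company's schema.

from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class TaskDesign(Enum):
    QUERY_AND_SINGLE_RESULT = "query_and_single_result"  # one query, one result link
    QUERY_AND_RESULT_SET = "query_and_result_set"        # one query, full result set
    SIDE_BY_SIDE = "side_by_side"                        # results from two rankers/engines


class RatingScale(Enum):
    BOOLEAN = "boolean"            # relevant / not relevant
    MULTI_POINT = "multi_point"    # e.g. a 4- or 5-point scale
    SLIDER_0_100 = "slider_0_100"  # 0 = least relevant ... 100 = most relevant


@dataclass
class RatingTask:
    query: str
    design: TaskDesign
    scale: RatingScale
    results: List[str]             # URLs shown to the judge


@dataclass
class Judgment:
    task: RatingTask
    judge_id: str
    rating: Union[bool, int]       # type depends on the scale


# Example: a side-by-side task rated on a 0-100 slider.
task = RatingTask(
    query="best hiking boots",
    design=TaskDesign.SIDE_BY_SIDE,
    scale=RatingScale.SLIDER_0_100,
    results=["https://engine-a.example/results", "https://engine-b.example/results"],
)
judgment = Judgment(task=task, judge_id="judge_042", rating=75)
```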

All big search companies use hundreds of in-house judges, which is slow and costly. For example, an average human judge can complete about 30 rating tasks per hour at a labor rate of $20/hour, and you need 5-10 submissions from different judges per task to reduce noise and reach agreement. Under these assumptions, you can collect complete data for only about 250 data samples per day and end up spending a minimum of roughly $800 per day.
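
A quick back-of-the-envelope check of those numbers, assuming 5 judgments per sample (the low end of the 5-10 range):

```python
# Verifying the in-house judging cost estimate above.
tasks_per_judge_hour = 30      # rating tasks one judge completes per hour
hourly_rate = 20               # USD per judge-hour
judgments_per_sample = 5       # redundant judgments needed per data sample (low end)
samples_per_day = 250

total_tasks = samples_per_day * judgments_per_sample   # 1250 rating tasks
judge_hours = total_tasks / tasks_per_judge_hour       # ~41.7 judge-hours
daily_cost = judge_hours * hourly_rate                 # ~$833

print(f"{total_tasks} tasks -> {judge_hours:.1f} judge-hours -> ${daily_cost:.0f}/day")
```

With 10 judgments per sample the cost roughly doubles, which is why $800 per day is a minimum.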

A more affordable and efficient way to collect ratings from human judges is Amazon Mechanical Turk, a crowdsourcing platform where you can hire workers to perform HITs, or Human Intelligence Tasks, which are easy for humans but hard for computers: for example, labeling images, categorizing items, and judging search quality. Based on my experience, Mechanical Turk experiments provide reasonable speed because you can hire many more workers to work in parallel, and the quality is acceptable if you use master workers. The cost is reasonable too. However, accuracy and agreement suffer a bit, so you need to collect more redundant data and use gold-standard data for quality control.
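
As a rough sketch of how gold-standard quality control can work, the snippet below drops workers whose accuracy on known-answer ("gold") tasks falls below a threshold, then aggregates the remaining redundant judgments by majority vote. The worker IDs, threshold, and aggregation rule are illustrative assumptions, not a description of any particular MTurk setup.

```python
from collections import Counter, defaultdict

# judgments: (worker_id, task_id, rating) with boolean relevance ratings
judgments = [
    ("w1", "gold_1", True), ("w1", "gold_2", False), ("w1", "t1", True),
    ("w2", "gold_1", False), ("w2", "gold_2", False), ("w2", "t1", False),
    ("w3", "gold_1", True), ("w3", "gold_2", False), ("w3", "t1", True),
]
gold_answers = {"gold_1": True, "gold_2": False}   # tasks with known correct ratings
MIN_GOLD_ACCURACY = 0.8                            # illustrative threshold

# 1. Score each worker on the gold tasks.
hits, totals = defaultdict(int), defaultdict(int)
for worker, task, rating in judgments:
    if task in gold_answers:
        totals[worker] += 1
        hits[worker] += int(rating == gold_answers[task])
trusted = {w for w in totals if hits[w] / totals[w] >= MIN_GOLD_ACCURACY}

# 2. Aggregate non-gold judgments from trusted workers by majority vote.
votes = defaultdict(list)
for worker, task, rating in judgments:
    if worker in trusted and task not in gold_answers:
        votes[task].append(rating)
labels = {task: Counter(rs).most_common(1)[0][0] for task, rs in votes.items()}

print(trusted)   # {'w1', 'w3'}  (w2 missed gold_1)
print(labels)    # {'t1': True}
```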

In closing, I would like to point out that no matter which method you choose for collecting search quality data, it is very useful, while tuning your ranking algorithm and before you even get to that point, to have an internal search visualization tool that lets you play with the various parameters of your ranking algorithm and see the results immediately. At Expect Labs we use such a ranking tool.
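
The core idea behind such a tool can be sketched in a few lines: expose the ranker's parameters as adjustable weights and re-rank immediately whenever they change. The features and weights below are hypothetical and not Expect Labs' actual ranker.

```python
# Minimal sketch of parameter tuning with immediate re-ranking.
def score(doc, weights):
    # Linear combination of per-document feature values.
    return sum(weights[f] * doc["features"].get(f, 0.0) for f in weights)

def rerank(docs, weights):
    return sorted(docs, key=lambda d: score(d, weights), reverse=True)

docs = [
    {"url": "a.example", "features": {"text_match": 0.9, "freshness": 0.1}},
    {"url": "b.example", "features": {"text_match": 0.6, "freshness": 0.9}},
]

# Tweak a parameter and immediately see how the ordering changes.
for weights in ({"text_match": 1.0, "freshness": 0.2},
                {"text_match": 1.0, "freshness": 0.8}):
    print(weights, [d["url"] for d in rerank(docs, weights)])
```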
