Want greater control over how your search results are customized? Watch software engineer Juan Rodriguez map out the mechanics of custom field mapping and show you exactly how you can begin using this useful tool.
TRANSCRIPT:
Hello. In this video I’m going to show you a feature of the MindMeld API called custom field mapping. For that, I’m going to be using an example. Let’s say we want to build an application that shows news to our users, and we want to build this application on top of a TechCrunch document collection. Well, that’s pretty simple: we can just open up the Crawl Manager and add the TechCrunch domain. I’m just going to copy the URL, and I’m just going to click Add. So, after a few seconds we should see some documents appearing on the right-hand side. There we go, we already have a couple of documents.
Let’s take a closer look at these documents. For that, I’m going to use the API Explorer. I’m going to go to the GET/Documents endpoint, and here you have a typical document. You can see that we have pretty good information about it. We have the title of the article, we have the description, and we have some text. Also, we have an image and some other information about the article. Let’s say that for our particular use case, we want to have one more field with the name of the author of this article. So let me go to this article on the TechCrunch website. I’m just going to follow the URL, and here you can see that we actually have the name of the author under the title. It says posted yesterday by this person. So the problem is that the crawler was not able to identify this as an important field, and we need to tell the crawler to do it. We need to tell the crawler to get the name of the author and put it in a separate field in our document. So how do we do that? Well, this is where custom field mapping comes in really handy. So, I’m going to go to the MindMeld documentation, and here we have a special section about custom field mapping. You can read this section to understand how to use custom field mapping, but I’m going to show you.
The first thing that you need to know is: what is a fieldmap? A fieldmap is a JSON dictionary that contains the information about how to crawl a specific website. It has two main objects. The first one is an array called fields, and every element in this array is a field that you want to extract. So you have to specify the name of the field and then a snippet of Python code that will be used to extract that particular field. You can have as many fields as you want. Also, there’s another array that’s called “skip if missing.” Here, you can specify the names of the fields that you want to be required. So if any one of the fields that you specify in this array is missing, then the document will be skipped; it will not be indexed in your collection. It’s pretty simple: we just need to build a fieldmap for our particular document collection. For that purpose, we can use this tool, which is called the fieldmap testbed.
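The structure described above can be sketched as a Python dictionary. This is only an illustration: the key names (`fields`, `name`, `code`, `skip-if-missing`) and the helper `should_skip` are assumptions made for this example, and the MindMeld documentation defines the actual schema and the crawler's real behavior.

```python
import json

# Hypothetical fieldmap. The key names are assumptions for illustration,
# not the documented MindMeld schema.
fieldmap = {
    "fields": [
        {
            "name": "author",
            # The Python extraction snippet is stored as a string.
            "code": "def extract_field(d):\n    return d('a[rel=\"author\"]').text()",
        }
    ],
    # Fields listed here are treated as required.
    "skip-if-missing": ["author"],
}


def should_skip(extracted, fieldmap):
    """Return True if any required field is absent or empty in `extracted`."""
    required = fieldmap.get("skip-if-missing", [])
    return any(not extracted.get(name) for name in required)


print(json.dumps(fieldmap, indent=2))
print(should_skip({"author": "Jane Doe"}, fieldmap))  # False
print(should_skip({"title": "No byline"}, fieldmap))  # True
```

The `should_skip` helper only mimics the "skip if missing" rule locally; in practice the crawler applies that rule itself.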
Now, the fieldmap testbed has some documentation of its own: there is a short tutorial that you can follow to learn how to use it, there are some pieces of sample code that you can use as templates, and there are also links to more documentation. In the fieldmap testbed, you need to specify a URL, which is a link to a document that you want to run your tests on, so I’m just going to use the document that I already opened. You just copy the URL, and now I need to start writing the Python code that will be used to extract the name of the author. The first thing I need to know is where to look for this name inside the HTML. Well, I’m just going to open up the page source code and I’m going to search for this name. As you can see, we can find the name in multiple places in the source code, but this one is probably the clearest one. You can see that the name is inside this <a> tag whose rel attribute is set to “author.” This is pretty nice; for any TechCrunch document, we can just look for this tag, and then whatever text is inside is going to be the name of the author. Let’s do that. We just go back to the fieldmap, and then we write the code to get that particular tag. Now as you can see, we’re defining this function which is called extract field, and the function is giving us this d object. Now, d is a PyQuery object. PyQuery is a library that we use for extracting information from any web page. There’s a link here to the PyQuery documentation, so in your spare time you can read more about it to know how to properly use it.
So I’m just going to start writing the code that I need. So I’m going to create a variable called “author,” and I’m going to use d to search for an <a> tag whose rel equals “author,” and from this tag I want the text. And then I just need to return my variable. So let’s try it. I’m just going to click extract, and there you go, we have the name of the author right here. Now one thing I want to point out is that I found the name of the author in a kind of brute force way. I had to look through the source code and then find the place where it is located. Some browsers have really cool tools that will help simplify the task, so for example in Chrome you can just right-click on whatever piece of information and click on Inspect Element, and Chrome is going to tell you where you can find that particular element inside the HTML.
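The code written in the testbed is not reproduced in the transcript, so the following is a minimal sketch of what it plausibly looks like. The function name `extract_field` is an assumption based on the spoken "extract field," and `d` is assumed to be the PyQuery object the testbed passes in.

```python
def extract_field(d):
    # `d` is assumed to be a PyQuery object wrapping the crawled page.
    # Select <a> tags whose rel attribute equals "author" and take their text.
    author = d('a[rel="author"]').text()
    return author


# Usage sketch outside the testbed (needs the third-party pyquery package):
#   from pyquery import PyQuery
#   d = PyQuery('<div><a rel="author" href="#">Jane Doe</a></div>')
#   extract_field(d)
```

The selector `a[rel="author"]` is ordinary CSS attribute-selector syntax, which is the same form Chrome's Inspect Element helps you work out.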
Okay, so back to the fieldmap. We already have the code that we need, so now it’s just a matter of copying this code into our fieldmap, but as you know, in Python indentation is very important. So when you copy this code, you have to make sure that you copy it in the right way, and we have this copy button that is going to help you with that. If you click on it, this dialog appears, and you can see that at the top it shows you the code already formatted, with the characters already escaped and everything ready, so you can just copy and paste it into your fieldmap. At the bottom, there is also a simple fieldmap that you can use for your purpose. So I’m just going to copy this one, I’m going to put it in any text editor, and it’s pretty much ready to use. The only thing you need to change is the name of your field. So I’m going to call my field “TechCrunch author,” and I’m going to save my fieldmap, and I’m going to call it tcfieldmap.txt, and there we go. We finally have our fieldmap.
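To see why the escaping matters, here is a small sketch using Python's standard `json` module as a stand-in for what the copy dialog does. The fieldmap key names and the field name `techcrunch_author` are assumptions for illustration; only the escaping behavior of `json.dumps` is certain.

```python
import json

# The extraction snippet as ordinary Python source, indentation included.
code = (
    "def extract_field(d):\n"
    "    author = d('a[rel=\"author\"]').text()\n"
    "    return author"
)

# json.dumps turns newlines into \n and double quotes into \" so that the
# whole snippet fits inside a single JSON string.
escaped = json.dumps(code)
print(escaped)

# Hypothetical fieldmap using the snippet; key names are assumptions.
fieldmap = {
    "fields": [{"name": "techcrunch_author", "code": code}],
    "skip-if-missing": [],
}

# This text is what would be saved to a file such as tcfieldmap.txt.
fieldmap_text = json.dumps(fieldmap, indent=2)
print(fieldmap_text)
```

Decoding the JSON string gives back the snippet with its indentation intact, which is why copying the escaped form is safer than retyping it by hand.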
The next step: I’m going to go to the Crawl Manager and stop my previous crawl. I’m using the free tier, so every time I want to add a new domain, I need to remove my previous one. I can only have one at a time. I’m just going to wait for it to stop so I can delete my domain, and there we go. So I’m just going to delete it, and now I’m going to create my new domain. Again it’s going to be TechCrunch. So I just copy the URL, and now I’m going to add my fieldmap. So now that I have my fieldmap, I just click on Add, and again after a few seconds, I see some documents coming up on the right-hand side. Okay, so there we go, we already have a couple of documents. I’m going to go back to the API Explorer and I’m just going to refresh my document list right here. Okay, so now you can see that we still have the title, the description, the text, and the image, but now we also have the name of the author, in its own field just the way we wanted. This is pretty cool. Using custom field mapping, you can take any piece of information from a website. It can be a string, an image, a link, or a number, and we can take that piece of information and put it in our documents. This is a way in which we can have a pretty custom document collection according to our needs. I hope this video was useful. Thanks for watching!