From semantic markup tags to sitemap files, pick up a few handy tips on how to make your MindMeld API experience the best it can possibly be.
TRANSCRIPT:
Hi, this is Tim Tuttle. Today I’m going to do a video blog post about how you can improve the quality of the results you get from the MindMeld API by optimizing the documents that you index in your documents collection. So as you know, as part of the MindMeld platform we provide a convenient way for you to index any collection of documents, either from a website or a database. You can use the developer tools to crawl an existing website, or you can use our API to populate your documents collection with documents from your database. Now there are a few tips that I’d like to share about ways that you can get the best possible results by taking advantage of some of the features in the documents collection.
Probably the most important thing you can do to make sure you get the best results in your MindMeld applications is to use the semantic markup tags that we support for every document. For every document that we either crawl from a website or that you submit, if you include markup tags in the header of the document, such as Open Graph tags, that allows us to know specifically what type of text information you’d like to associate with the document. Right now, we support Open Graph tags, so if there’s a specific title or description or image that you want to make sure we index correctly, it’s a good idea to use these markup tags, and if you have a website, you include them in the header of your page. That’s the first step.
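As a rough illustration of the Open Graph tags mentioned above, here is a minimal Python sketch that renders the title, description, and image tags for a page header. The function name and the sample values are made up for the example; only the `og:title`, `og:description`, and `og:image` property names come from the Open Graph protocol.

```python
from html import escape


def open_graph_tags(title, description, image_url):
    """Return Open Graph <meta> tags to place in a page's <head>."""
    properties = {
        "og:title": title,
        "og:description": description,
        "og:image": image_url,
    }
    # Escape each value so quotes or ampersands cannot break the attribute.
    return "\n".join(
        f'<meta property="{name}" content="{escape(value, quote=True)}" />'
        for name, value in properties.items()
    )


# Hypothetical page values, for illustration only.
tags = open_graph_tags(
    "Optimizing your documents",
    "Tips for indexing & ranking documents",
    "https://example.com/thumbnail.png",
)
print(tags)
```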
Now, the second thing you can do to improve the quality of the results you get is to take advantage of the white list and black list functionality that we currently have available in the Crawl Manager application. You can find the Crawl Manager in the developer console of your MindMeld account. What the white list and black list allow you to do is to only include, or selectively exclude, pages from your website that have a URL that matches a specific pattern. So let’s say, for example, you have a website that has lots of videos on it, but you also have lots of other pages that might have video reviews, or directory listings, things like that, and all you want to do is index the pages that contain videos. As long as those pages have a specific URL pattern, you can just include that pattern in the white list of the Crawl Manager and make sure that only those specific video pages are indexed.
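To make the white list and black list idea concrete, here is a small Python sketch of URL filtering by pattern. It assumes the patterns are regular expressions and that the black list takes precedence. Both are assumptions made for the example, and the pattern syntax the Crawl Manager actually accepts may differ.

```python
import re


def should_index(url, whitelist=(), blacklist=()):
    """Decide whether a URL would be indexed under the given patterns."""
    # Assumption: any black list match excludes the page outright.
    if any(re.search(pattern, url) for pattern in blacklist):
        return False
    # Assumption: a non-empty white list means only matching pages are kept.
    if whitelist:
        return any(re.search(pattern, url) for pattern in whitelist)
    return True


# Hypothetical site layout: only pages under /videos/ should be indexed.
whitelist = [r"/videos/"]
for url in ("https://example.com/videos/123", "https://example.com/reviews/9"):
    print(url, should_index(url, whitelist=whitelist))
```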
The third thing you can do to improve the quality of the results and the quality of your documents collection is to use sitemap files. So a sitemap is a standard XML document that you can make available on your website that gives Internet crawlers a better picture of what pages they should index on your site. And our crawler, like most crawlers, takes advantage of that. So if you know which pages you want to make sure are in your documents collection, sitemaps usually help. Without sitemaps, sometimes we can’t get 100% coverage because our crawler needs to follow the link structure of your website, and for some websites it can be complex and cumbersome to find all the pages that way.
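For reference, a sitemap in the standard sitemaps.org format can be produced with a few lines of Python. This is only a sketch: the URLs are placeholders, and a real sitemap may also carry optional elements such as `lastmod`.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML string listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        # Each page gets a <url> entry with its address in <loc>.
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for the pages you want in the documents collection.
print(build_sitemap([
    "https://example.com/videos/1",
    "https://example.com/videos/2",
]))
```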
Now the fourth thing you can do to improve the quality of results is to take advantage of the custom rank fields that we support in the document schema. So, in the definition of the document object, we support specifically three custom rank fields, called custom rank one, custom rank two, and custom rank three. Those fields are very useful if there’s some property of your content, of your documents, that could be very important in ranking. As an example, let’s say you have a review website where reviewers give ratings to restaurants or products. The number of reviews could be a very important factor in determining how important that document is among all the restaurants, and so you might want to include the review count, or the actual star rating, in one of those custom rank fields. That gives you greater flexibility in the ranking formula that you can configure to ensure that your users get the highest ranked and highest reviewed documents you have in your collection first.
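The effect of a custom rank field can be pictured with a toy example. The Python sketch below blends a text relevance score with a review count stored in a custom rank field. The field name `customrank1`, the formula, and the weight are all illustrative assumptions, not the actual MindMeld schema or ranking formula.

```python
import math


def blended_score(text_relevance, review_count, weight=0.5):
    """Combine text relevance with a dampened review count (toy formula)."""
    # log1p keeps very large review counts from swamping text relevance.
    return text_relevance + weight * math.log1p(review_count)


# Hypothetical documents; "customrank1" is an assumed field name.
documents = [
    {"title": "Cafe A", "relevance": 1.0, "customrank1": 12},
    {"title": "Cafe B", "relevance": 0.9, "customrank1": 480},
]

ranked = sorted(
    documents,
    key=lambda doc: blended_score(doc["relevance"], doc["customrank1"]),
    reverse=True,
)
print([doc["title"] for doc in ranked])
```

In this toy data the heavily reviewed restaurant ends up ahead of the one with the slightly better text match, which is the kind of behavior the custom rank fields are meant to make possible.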
Now the last thing you can do to optimize the documents collection you have and get the best results is to take advantage of the documents endpoint that we support in the API. So the Crawl Manager and the crawling functionality we have is a great automated way to ingest web pages on the Internet and turn them into specific documents. But if you really want to have fine-grained control, it’s probably best to use the documents endpoint. Just by writing a simple script and sending an HTTP POST with a JSON object, you can tell MindMeld in great detail specifically what each document should look like. That gives you as much control as you want to ensure that your documents collection has only the best metadata and has all the right fields, and that you get the best results.
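A simple script of the kind described above might look like the following Python sketch. It only builds the HTTP POST request and does not send it. The endpoint URL and the document field names are placeholders invented for the example, and authentication is left out, so check the API documentation for the real endpoint, schema, and credentials.

```python
import json
import urllib.request


def build_document_request(endpoint_url, document):
    """Build an HTTP POST request carrying one document as a JSON object."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint and hypothetical document fields.
request = build_document_request(
    "https://api.example.com/documents",
    {
        "title": "Best tacos in town",
        "description": "A review of a local taco stand.",
        "originurl": "https://example.com/reviews/tacos",
    },
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```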
Those are a few tips on how to get the best results out of the MindMeld API. I’ll see you soon on another webcast.