How to Start With Hadoop: Shopping, Design, Distribution, and Cloud

Now that you have a good collection of requirements and open questions in hand, it’s time to shop. In terms of hosted servers, I have used both Supermicro and Dell. I have also evaluated HP and Silicon Graphics (aka Rackable). To be fair, the Dell servers were a more recent model than the Supermicros, but given my experience I would definitely select Dell again. I should also disclaim that I wasn’t building out Yahoo or Google sized clusters. My clusters ranged from dozens of nodes to fewer than 500. The size of your cluster matters significantly and will affect many of your decisions.

I found Dell to be the best mix of price, reliability, configurability, and maintainability. The Supermicro had 4 nodes in a 2U chassis. Each node had dual quad core processors and three 3.5” 1 TB drives. The Dell had dual hex core processors and twelve 2 TB SATA drives. The Supermicro had lots of issues; the most annoying was failed drives, at a rate much higher than normal. After a lot of finger pointing we concluded the problem was related to the RAID card. HP’s pricing was high any way I sliced it, and was hard to justify over Dell. SGI was a very interesting solution that was highly engineered (maybe overly so for small scale solutions of fewer than 200 nodes). I had concerns about being locked into a more proprietary solution, especially during these early wild west days of Hadoop. The high price tag was only justifiable when you factored in data center real estate, power, and cooling costs over a horizon a year longer than the cluster lifetime I felt comfortable projecting. I am sure there are other vendors to explore if you host your own. As always, my recommendation is to try before you buy, and to model out the costs (including data center) based on the configuration options and vendor pricing.

From experience, I recommend erring on the side of more spindles. The Supermicro had 8 cores to 3 spindles; the ratio of spindles to cores was just too low for most of our applications. Granted, most of those applications could have used a healthy dose of optimization; Pig is called that for a reason. The ratio I chose on the Dell was 12:12, so IO is much less of a problem there. I have also found that if you have space, someone will use it. The extra cost for drives and power was well worth the trade of not having to be the disk space police and decide who was on the chopping block each week.
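To make the comparison concrete, here is a minimal sketch that computes spindles per core and raw capacity per node for the two configurations described above. The function names and the `NODE_SPECS` table are illustrative; only the core, drive, and capacity counts come from the hardware described in this post.

```python
def spindles_per_core(cores, spindles):
    """Disks available per CPU core; higher means less IO contention."""
    return spindles / cores


def raw_capacity_tb(spindles, tb_per_drive):
    """Raw (pre-replication) disk capacity of one node, in TB."""
    return spindles * tb_per_drive


# Per-node figures taken from the two servers described above.
NODE_SPECS = {
    "Supermicro": {"cores": 8, "spindles": 3, "tb_per_drive": 1},
    "Dell": {"cores": 12, "spindles": 12, "tb_per_drive": 2},
}

for name, spec in NODE_SPECS.items():
    ratio = spindles_per_core(spec["cores"], spec["spindles"])
    raw_tb = raw_capacity_tb(spec["spindles"], spec["tb_per_drive"])
    print(f"{name}: {ratio:.2f} spindles per core, {raw_tb} TB raw per node")
```

The same two numbers are worth computing for any configuration a vendor quotes you, since they are easy to lose sight of when comparing on price alone.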

Network design

Network design is also going to be critical at a certain size. Once your node count gets high enough and you have lots of jobs running, be on the lookout for network saturation. If you have a simple gig-e network, don’t be surprised if you easily swamp it. Network engineers love to spend your money (sorry guys), and Cisco loves to take it. Here I recommend starting cheap and then upgrading when you have to, unless you know going in that a gig-e network will not cut it. Replacing a core switch is no easy task and the swap will give you a headache, but it may be worth waiting if you don’t know how big you need to get. Likewise, think about multiple NICs and bonding. With basic Hadoop I bonded two gig-e NICs per box. MapR is aware of multiple NICs on the box, but you could still bond if you want to.

Mixed use environments

Once up, most clusters turn into mixed use environments. At the top of the food chain are production jobs that have an SLA attached to them. You will inevitably have engineers and analysts fighting for time as well (unless you have so overbuilt the cluster that this is a non-issue, in which case congratulations, you have bested your CFO!). Task scheduling is obviously your friend here. Reserving the High priority flag for real SLA production jobs is a key policy. Having an Ops team that can play traffic cop is also essential. Sooner or later someone will tank the cluster and block your critical jobs. Your Ops team needs to have rules about how to handle that situation and, more importantly, needs to know when it is happening. Using monitoring tools like Nagios to tell you when certain outputs are overdue is one good alarm mechanism. The team should have the authority to kill jobs that have become an issue. The scheduler that comes with Hadoop isn’t especially smart, though it has some additional options you may want to explore.
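As one way to wire up the overdue-output alarm, here is a minimal sketch of a Nagios-style check. It follows the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN). Everything else is an assumption for illustration: the script name, the thresholds, and the idea that the job's output (or a marker file it writes on success) is visible at a local path, for example through an NFS mount.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_output_age(path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how long ago `path` was last modified."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        # A missing output is treated as overdue, not as an unknown state.
        return CRITICAL, f"CRITICAL - cannot read {path}: {exc}"
    if age >= crit_seconds:
        return CRITICAL, f"CRITICAL - {path} is {age:.0f}s old"
    if age >= warn_seconds:
        return WARNING, f"WARNING - {path} is {age:.0f}s old"
    return OK, f"OK - {path} is {age:.0f}s old"


def main(argv):
    """Hypothetical command line: PATH WARN_SECONDS CRIT_SECONDS."""
    if len(argv) != 3:
        print("UNKNOWN - usage: check_output_age.py PATH WARN_SECONDS CRIT_SECONDS")
        return UNKNOWN
    try:
        warn, crit = float(argv[1]), float(argv[2])
    except ValueError:
        print("UNKNOWN - thresholds must be numbers")
        return UNKNOWN
    code, message = check_output_age(argv[0], warn, crit)
    print(message)
    return code
```

To run it as a plugin, save it as a script and end the file with `sys.exit(main(sys.argv[1:]))` (after `import sys`), so the monitoring system sees the exit code. Set the warning threshold comfortably inside the SLA window, so Ops has time to play traffic cop before the job is actually late.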

Hadoop distribution

As part of this exercise you will also need to decide which distribution of Hadoop to use, with the three obvious options being Apache Hadoop (supported by Hortonworks), Cloudera, and MapR. MapR has made some major modifications to the file system and provides some additional functionality, such as an NFS mount and a nice GUI, in the free version; the paid version has more enterprise-like features. I have never run into any compatibility issues. I currently use MapR and can attest it has been excellent in terms of stability, bug fixing, and general support. You may want to consider paying for support (Cloudera and MapR both offer this. Hortonworks most likely does as well, but I don’t have experience working with them). Look over the features MapR offers in its paid option; these may be important to you. In either case, make sure you are sitting down first: neither option is cheap, and either will dramatically impact your budget if you choose to go that route. You can certainly get by on the free versions.

Now that you have a lot of data (and yes, it may look more like a Christmas list than a set of requirements), your next step is to model out a few scenarios. Depending on the potential size of the cluster (is it bigger than a bread box?), this can be a back of the envelope exercise or a full blown Excel project. I recommend having a couple of options here to present to the various teams and executive management. Include various configurations, such as a high IO option and a low cost option. This isn’t just a budgetary step; it is where the rubber is going to meet the road, so being able to talk through the pros and cons of the various configurations is important.
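If a spreadsheet feels like overkill, the back of the envelope version can be a few lines of code. The sketch below is one possible shape for such a model, not a recommendation. Every price in it is a made-up placeholder, the two configurations are hypothetical, and it assumes a replication factor of 3 (the HDFS default) while ignoring temp space and other overhead.

```python
from dataclasses import dataclass


@dataclass
class NodeConfig:
    name: str
    drives: int
    tb_per_drive: float
    price_per_node: float          # placeholder purchase price
    hosting_per_node_month: float  # placeholder: power, cooling, rack space


def summarize(config, nodes, months, replication=3):
    """Rough capacity and total cost for one scenario."""
    raw_tb = nodes * config.drives * config.tb_per_drive
    usable_tb = raw_tb / replication  # each block is stored `replication` times
    total_cost = nodes * (
        config.price_per_node + config.hosting_per_node_month * months
    )
    return {
        "scenario": config.name,
        "raw_tb": raw_tb,
        "usable_tb": usable_tb,
        "total_cost": total_cost,
        "cost_per_usable_tb": total_cost / usable_tb,
    }


# Two hypothetical options to put side by side; all numbers are invented.
high_io = NodeConfig("high IO", drives=12, tb_per_drive=2,
                     price_per_node=5000, hosting_per_node_month=100)
low_cost = NodeConfig("low cost", drives=4, tb_per_drive=2,
                      price_per_node=3000, hosting_per_node_month=80)

for option in (high_io, low_cost):
    print(summarize(option, nodes=10, months=36))
```

Swap in real vendor quotes and your own data center costs, then vary the node count and time horizon to see which assumptions the answer is most sensitive to.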

Cloud solutions

Up until now I have completely ignored cloud solutions in this blog post. Your obvious options here are Amazon, Microsoft, and Rackspace. I have used both Amazon and Rackspace. Amazon certainly had the jump on everyone and far and away has the most robust set of options. When you search for case studies, all the big guys you recognize are going to be on Amazon. Rackspace lacks many of the features AWS has and hasn’t optimized its solution for Hadoop or big data. Microsoft recently made some good news about its Hadoop compliance, so if you are a Microsoft centric (as opposed to Linux) company, certainly take a look. Everyone always brings up security when it comes to cloud. Generally speaking, it is no less secure than most privately managed data centers I have seen. If anything, it is possibly more secure.

The real kicker when it comes to Hadoop with any of these guys is the price to performance ratio. For the run of the mill cluster that is on 24/7, always chugging away on jobs that tend to be pretty similar in nature, the cloud route is almost certainly more expensive (to be fair, make sure you include all the costs like network engineering, power, cooling, etc.). If the vendor hasn’t picked hardware that is conducive to big data, you may be paying for more nodes or hours than you need, and at a certain point the flexibility of the cloud is not going to be enough to justify the added cost. On the flip side, if your workload changes wildly (today I need 300 nodes, and then for the next 28 days only 5), definitely take a look at the cloud. Utility pricing will be your friend in this case. If you are just getting started and don’t want to commit, using the cloud for the first year may be a really smart investment even if your budgetary model shows that it is more expensive. Remember that commitment can be expensive. You may also want to use it as a dev or test cluster where you can direct your engineers to run experimental code, or where IT can test upgrades. It then becomes a secondary cluster and hence an additive cost. My advice here is to include it in your model, because someone is bound to ask and you may need the option.
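The bursty-versus-steady argument comes down to node-hours. As a rough sketch, the code below compares what you would pay for on utility pricing (only the node-hours you use) against an owned cluster sized for the peak (every node, every hour). The schedule for the bursty case is the 300-nodes-then-5 example above; the hourly rates and the steady schedule are invented placeholders, and the model ignores everything except a flat per-node-hour cost.

```python
def node_hours(schedule):
    """Node-hours actually used; schedule is a list of (nodes, days) pairs."""
    return sum(nodes * days * 24 for nodes, days in schedule)


def peak_capacity_hours(schedule):
    """Node-hours you pay for if you own enough nodes to cover the peak."""
    peak_nodes = max(nodes for nodes, _ in schedule)
    total_days = sum(days for _, days in schedule)
    return peak_nodes * total_days * 24


def cloud_is_cheaper(schedule, cloud_rate, owned_rate):
    """Compare flat per-node-hour rates (both rates are placeholders)."""
    cloud_cost = cloud_rate * node_hours(schedule)
    owned_cost = owned_rate * peak_capacity_hours(schedule)
    return cloud_cost < owned_cost


bursty = [(300, 1), (5, 28)]  # 300 nodes for a day, then 5 nodes for 28 days
steady = [(100, 29)]          # hypothetical always-on cluster

for label, schedule in (("bursty", bursty), ("steady", steady)):
    used = node_hours(schedule)
    owned = peak_capacity_hours(schedule)
    print(f"{label}: {used} node-hours used of {owned} owned "
          f"({used / owned:.0%} utilization), "
          f"cloud cheaper: {cloud_is_cheaper(schedule, 0.50, 0.10)}")
```

With a steady workload, utilization of owned hardware is high and the cloud premium is hard to recover; with a bursty one, most owned node-hours sit idle, which is where utility pricing earns its keep.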

I hope reading this doesn’t discourage anyone from using Hadoop. At the end of the day, Hadoop is very flexible and no matter what you choose, adding it to your bag of tricks can be hugely successful. Ask people in the industry for their advice and opinions – it can go a really long way in helping you pick a solution. Happy data mining!
