Apr 6th, 2007
Hadoop - Open source DFS and Cluster technology
I remember studying advanced computer architecture and parallel computers in college. The sheer size and complexity of the book gave me nightmares. But lately, I have become interested in scalable programming, application clustering and parallelization. One of the reasons I lost out on qualifying for the Google Code Jam was that my algorithm wouldn't scale, though the logic was pretty much correct. And now, almost two years of building a web app and watching it crash before our very eyes has given me an understanding of why it's so important to build good, scalable applications.
Now scalability can happen at many layers: the code, the design / architecture and the hardware, the last one being the easiest (just add more power). But the first and, more importantly, the second account for all the pain. Developing scalable code has been discussed a million times, with lots of tricks on efficient memory usage, controlling memory leaks, serialization and the critical section problem, but those scale only up to a certain point.
At BarCamp, the Yahoo team presented Hadoop: a distributed file system plus a parallelization framework that can be used to build very computation-intensive applications. The framework provides APIs that you can use to parallelize your tasks, store data in a distributed way and more. There has been an upsurge in people developing processor- and memory-intensive applications, mainly due to the growth of concepts like crawling, the semantic web etc. The 2.0 moguls who are trying to build new-age crawlers and indexers with a niche like social media, photos or videos will be the ones actually interested in these frameworks.
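To give a feel for what "parallelizing your tasks" means here: Hadoop's framework follows the map/reduce model, where you write one function that processes a chunk of data and another that combines the results. The snippet below is only a rough single-machine sketch of that idea in Python, not Hadoop's actual API. The names `map_words`, `shuffle`, `reduce_counts` and `word_count` are made up for illustration, and threads stand in for the machines of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk stands in for a block of a file that would live on a
    # different machine; here worker threads play the role of those machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, chunks))
    groups = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in groups.items())


if __name__ == "__main__":
    chunks = ["the crawler fetched the page", "the indexer read the page"]
    print(word_count(chunks))
```

The point of the split is that the map step never needs to see more than its own chunk, so in principle it can run wherever that chunk is stored, and only the grouped results have to travel over the network.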
So how easy is it? Take a bunch of machines with 2 GHz processors (considered outdated), which should not cost you more than Rs 6-8k apiece, and add to each box 2 x 40 GB hard disks, which you can pick up for almost Rs 1k apiece. Install a stable flavor of Linux, put them on a network and you're done! So a 7-processor server grid with almost 1/2 TB of storage will cost you in the order of Rs 60,000, the price of a good laptop. Also, for those who have startups, the hardware is already there; you just need the right software. You can use your existing resources to run a massive computation engine, typically at night when no work is being done.
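If you want to check the back-of-the-envelope numbers above, here is a small Python sketch. The figures come straight from the paragraph. The one assumption is that "1k a piece" is the price of a single disk rather than of a pair, and the function name `grid_estimate` is just for illustration.

```python
# Figures from the paragraph above; reading "1k a piece" as the price
# of one disk (not one pair) is an assumption.
MACHINES = 7
BOX_COST_RS = (6_000, 8_000)   # low and high estimate per machine
DISKS_PER_BOX = 2
DISK_SIZE_GB = 40
DISK_COST_RS = 1_000           # assumed price of a single disk


def grid_estimate(machines=MACHINES):
    """Return (storage in GB, lowest total cost, highest total cost)."""
    storage_gb = machines * DISKS_PER_BOX * DISK_SIZE_GB
    disk_total = machines * DISKS_PER_BOX * DISK_COST_RS
    low = machines * BOX_COST_RS[0] + disk_total
    high = machines * BOX_COST_RS[1] + disk_total
    return storage_gb, low, high


storage_gb, low, high = grid_estimate()
print(f"{storage_gb} GB of storage for Rs {low:,} to Rs {high:,}")
```

Under that assumption the grid comes to 560 GB for somewhere between Rs 56,000 and Rs 70,000, which is where the "almost 1/2 TB" and "order of Rs 60,000" figures come from.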
There is respite for folks who use Windows: MS has made it extremely easy to achieve grid computing with their Compute Cluster product. What's more, it's MS, so it will be easy to install, with a lot of documentation and Visual Studio plugins for execution, profiling and debugging. It comes at a price, though, but the results it can achieve are tremendous.
It's almost like we have come full circle: we moved on from these core concepts in the pre-web era, and now we are back where we started with them. For more resources on cluster computing go here
