Distributed Data Processing: The next opportunity Open Source needs to conquer?
Just ran across a couple of articles on distributed data processing. InfoQ is discussing the need for a new paradigm for scaling processing at the same time that development on an open source solution is progressing, as Yahoo! seems to be gearing up development on Hadoop (Tim O’Reilly’s take). When you get to a certain size of data, you have to think about solving problems outside of the traditional application-server-talking-to-database-server model. If you want to run complex processing against millions to billions of entities, data access on a traditional database system is way too slow no matter how many boxes you throw at it or how fast your processing logic is. More and more organizations will start to face this problem, and there is a definite need for an open-source distributed processing framework that can become a standard and allow developers to start building tools on top of it. Maybe it will be the Hadoop library, maybe it will be something else.
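To make the shift in paradigm concrete, here is a minimal, single-process sketch of the MapReduce idea using the canonical word-count example. This is not Hadoop's actual API; the function names and the in-memory "shuffle" are just illustrative assumptions to show the shape of the programming model.

```python
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) pairs for every word in one chunk of input.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Combine all the counts emitted for a single word.
    return (word, sum(counts))

def run_map_reduce(documents):
    # Stand-in for the framework's real job: shuffle the intermediate
    # pairs so every key ends up at the same reducer. A real system
    # does this across many machines, with fault tolerance.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    print(run_map_reduce(docs))  # {'the': 3, 'fox': 2, ...}
```

The appeal of a framework like Hadoop is that the developer only writes the map and reduce functions; the framework handles splitting the input, distributing the work across boxes, shuffling intermediate results, and recovering from failures.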
At my previous employer, we (and by we, I mean a bunch of other smart people on my team) built out a framework to manage this sort of distributed programming. I advocated strongly that the organization should open source the project, for many of the reasons discussed earlier. Now that I have left, I hope someone can pick up the torch and get the MapReduce project released.