Hadoop Challenges
Hadoop. The very word is starting to generate gut reactions, which may be positive or negative, and polarized viewpoints are becoming common. The Hadoop family of components includes its distributed file system, which spans storage across a range of networked servers, replicates each block of data in 3 strategic places across the file system and tracks the distribution centrally. It also includes the MapReduce framework for processing distributed data as locally as possible, that is, on or near the servers that already hold it. Hadoop scales horizontally across dozens, hundreds or thousands of servers.
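To make those two ideas concrete, here is a minimal Python sketch. It is not Hadoop's code or API: every name in it (place_blocks, map_phase, shuffle, reduce_phase, run_job) is hypothetical, and the round-robin placement is an assumption made for brevity, since real placement also takes rack layout into account. It only illustrates keeping three copies of each block, tracking them in one central table, and running each map step on a server that already holds the block.

```python
from collections import defaultdict

REPLICATION = 3  # the three copies of each block described above


def place_blocks(blocks, nodes, replication=REPLICATION):
    """Toy central table: block index -> nodes holding a copy.

    Assumption: simple round-robin placement; needs len(nodes) >= replication
    for the copies to land on distinct nodes.
    """
    return {
        i: [nodes[(i + k) % len(nodes)] for k in range(replication)]
        for i in range(len(blocks))
    }


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(blocks, nodes):
    """Word count over blocks, scheduling each map on a node with a local copy."""
    placement = place_blocks(blocks, nodes)
    schedule = {}
    pairs = []
    for i, block in enumerate(blocks):
        schedule[i] = placement[i][0]  # pick a node that already stores block i
        pairs.extend(map_phase(block))
    return placement, schedule, reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    placement, schedule, counts = run_job(
        ["big data big", "data lake"], ["n1", "n2", "n3", "n4"]
    )
    print(placement)
    print(schedule)
    print(counts)
```

In a real cluster the map steps run in parallel on separate servers and the shuffle moves data over the network; the sketch runs everything in one process purely to show the flow.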
So why not welcome this new, less expensive way to store and process the data it’s designed for: petabytes of unstructured data, handled in a highly distributed manner? The relational purists would first say – and I’d agree – that Hadoop is often tried outside of this narrow context. But further, even when it comes to these workloads, here is some of the common, well-informed feedback I get when I mention the polarizing word Hadoop inside an organization.
Which Hadoop?
There are several variations of the Hadoop footprint, with more undoubtedly to come. Prominently, there are EMC/Facebook, Cloudera/Dell and the Oracle Big Data Appliance, which come with training and support. And there’s open source Apache/Yahoo, for which support is also available through Hortonworks, the Yahoo spin-off. So while Hadoop footprints are supported, one must choose a distribution for a technology with which one most likely has no live experience.
Hadoop is Complex
Then there’s implementation, and that is complex. It’s a programming orientation as opposed to a tool orientation. Programmers rule this world and have often isolated themselves from the internal “system integrators” who deploy tools. These programming groups often do more than build; they also analyze, which is different from information management in the database-only years.
The server requirements are immense, mostly because we are dealing with large amounts of data that require many servers. However, for many organizations, these commodity servers are harder to implement than enterprise-class servers. Many organizations are used to vertical scaling and not to supporting “server farms.”
The divide between the Hadoop implementation team and management is more pronounced as well. It is interesting that at the same time that hands-off software-as-a-service is reaching new heights in organizations, this labor-intensive, brute-force approach, the antithesis of SaaS, is making sense in those same organizations.
Hadoop is for the Internet Companies
While case studies abound for the dot-com pioneers of Hadoop (many of whom contributed to its development) and web marketing companies, other companies are slower to adopt. Since Hadoop is best and most cost-effective for larger data, non-web-based companies whose current data under management does not reach the realm of “big data” need a case made for taking web click, sensor and social data under management. A new wave of scientific and intelligence community use still does little to help healthcare, banking, and retail companies see the need.
Hadoop can be a larger undertaking than a platform selection for the expansion or relocation of data that is currently under management. With few case studies to point to and a return on investment still to be justified, adopting it carries the “pioneer” label, which is too daunting for many organizational cultures.
Increasing Capabilities within Relational Systems
Large vendors are responding to the Hadoop challenge in two ways, and usually both ways at once. One way is the “join them” approach, where the vendor announces Hadoop distributions in addition to continuing full support of its current wares for “big data”, many of which are now extending their capabilities to unprecedented scale. The other is to incorporate Hadoop-like capabilities as a hedge against Hadoop and a reinforcement of the vendor’s roadmap, much of which began prior to Hadoop being available.
Hadoop Limitations
We have become accustomed to real-time interactivity with data, but use cases for Hadoop must fall into batch processing. Hadoop also does not support indexing or an SQL interface – not yet anyway. And it’s not strictly ACID compliant so you would not manage transactions there.
Are these knock-out factors for Hadoop? I don’t believe so. In my work, petabytes of unstructured data seem to have a cost-effective home only in Hadoop. At most, any Hadoop will co-exist within the enterprise: its server farms, less expensive per server, will process large amounts of unstructured data and pass some of it to relational systems with broader capabilities, while those relational systems continue to do the bulk of an enterprise’s processing, especially of structured data.
The advancement of the Hadoop toolset will obviate some of the Hadoop limitations. The finance and healthcare industries will see many of their Hadoop trials move to production in the next year. Hadoop should begin to be matched up against the real challenges of the enterprise now.