My former friends at Greenplum no longer talk to me, so in particular I wasn’t briefed on Pivotal HD and Greenplum HAWQ. Pivotal HD seems to be yet another Hadoop distribution, with the idea that you use Greenplum’s management tools. Greenplum HAWQ seems to be Greenplum tied to HDFS.
The basic idea seems to be much like what I mentioned a few days ago — the low-level file store for Greenplum can now be something else one has heard of before, namely HDFS (Hadoop Distributed File System, which is also an option for, say, NuoDB). Beyond that, two interesting quotes in a Greenplum blog post are:
When a query starts up, the data is loaded out of HDFS and into the HAWQ execution engine.
In addition, it has native support for HBase, supporting HBase predicate pushdown, hive[sic] connectivity, and offering a ton of intelligent features to retrieve HBase data.
The first sounds like the invisible loading that Daniel Abadi wrote about last September on Hadapt’s blog. (Edit: Actually, see Daniel’s comment below.) The second sounds like a good idea that, again, would also be a natural direction for vendors such as Hadapt.
WibiData is essentially on the trajectory:
- Started with platform-ish technology.
- Selling analytic application subsystems, focused for now on personalization.
- Hopeful of selling complete analytic applications in the future.
The same, it turns out, is true of Causata.* Talking with them both the same day led me to write this post. Read more
|Categories: Hadapt, HBase, Market share and customer counts, PivotLink, Predictive modeling and advanced analytics, WibiData||4 Comments|
What I wrote before about Cloudera Impala was quite incomplete. After a followup call, I now feel I have a better handle on the whole thing.
First, some basics:
- Impala is open source code, developed to date entirely by Cloudera people, which adds analytic DBMS capabilities to Hadoop as an alternative to Hive.
- Impala is in public beta, and is targeted for general availability Q1 2013 or so.
- Cloudera plans to get paid for Impala by providing support, and by offering Impala management through its proprietary Cloudera Manager.
- Impala has been under development for about 2 years. A team of 7 or so developers has been mainly in place for a over a year. Furthermore, …
- … notwithstanding that it’s best viewed as a Hive alternative, Impala actually reuses a lot of Hive.
The general technical idea of Impala is:
- It’s an additional daemon that runs on each of your Hadoop nodes.
- Thus, Impala is not subject to Hadoop MapReduce’s latency in starting up Java processes or in storing intermediate result sets to disk.
- Impala operates as a distributed parallel analytic DBMS.*
- Impala works with a variety of Hadoop storage options, each with its own implications for latency or performance.
|Categories: Cloudera, Data models and architecture, Data warehousing, Hadoop, HBase, MapReduce, Open source, Predictive modeling and advanced analytics, SQL/Hadoop integration||12 Comments|
Time for another catch-all post. First and saddest — one of the earliest great commenters on this blog, and a beloved figure in the Boston-area database community, was Dan Weinreb, whom I had known since some Symbolics briefings in the early 1980s. He passed away recently, much much much too young. Looking back for a couple of examples — even if you’ve never heard of him before, I see that Dan ‘s 2009 comment on Tokutek is still interesting today, and so is a post on his own blog disagreeing with some of my choices in terminology.
Otherwise, in no particular order:
1. Chris Bird is learning MongoDB. As is common for Chris, his comments are both amusing and enlightening.
2. When I relayed Cloudera’s comments on Hadoop adoption, I left out a couple of categories. One Cloudera called “mobile”; when I probed, that was about HBase, with an example being messaging apps.
The other was “phone home” — i.e., the ingest of machine-generated data from a lot of different devices. This is something that’s obviously been coming for several years — but I’m increasingly getting the sense that it’s actually arrived.
|Categories: Cloudera, Data integration and middleware, Hadoop, HBase, Informatica, Metamarkets and Druid, MongoDB and 10gen, NoSQL, Open source, Telecommunications||2 Comments|
I chatted with Todd Papaioannou about his new company Continuuity. Todd is as handy at combining buzzwords as he is at concatenating vowels, and so Continuuity — with two “U”s — is making a big data fabric platform as a service with REST APIs that runs over Hadoop and HBase in the private or public clouds. I found the whole thing confusing, in that:
- I recoil against buzzwords. In particular …
- … I pay as little attention to distinctions among PaaS/IaaS/WaaS — Platform/Infrastructure/Whatever as a Service — as I can.
- The Continuuity story sounds Heroku-like, but Todd doesn’t want Continuuity compared to Heroku.
- Todd does want Continuuity discussed in terms of the application server category, but:
- It is hard to discuss app servers without segueing quickly amongst development, deployment, and data connectivity, and Continuuity is no exception to that rule.
- There is doubt as to whether using app servers makes any sense.
But all confusion aside, there are some interesting aspects to Continuuity. Read more
|Categories: Application servers, Cloud computing, Hadoop, HBase, MapReduce, Parallelization, Predictive modeling and advanced analytics, Software as a Service (SaaS)||6 Comments|
With Strata/Hadoop World being next week, there is much Hadoop discussion. One theme of the season is BI over Hadoop. I have at least 5 clients claiming they’re uniquely positioned to support that (most of whom partner with a 6th client, Tableau); the first 2 whose offerings I’ve actually written about are Teradata Aster and Hadapt. More generally, I’m hearing “Using Hadoop is hard; we’re here to make it easier for you.”
If enterprises aren’t yet happily running business intelligence against Hadoop, what are they doing with it instead? I took the opportunity to ask Cloudera, whose answers didn’t contradict anything I’m hearing elsewhere. As Cloudera tells it (approximately — this part of the conversation* was rushed): Read more
|Categories: Business intelligence, Cloudera, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Health care, Investment research and trading, MapR, Market share and customer counts, Telecommunications, Web analytics||4 Comments|
This post started as a minor paragraph in another one I’m drafting. But it grew. Please also see the comment thread below.
Increasingly many data management systems store data in a cluster, putting several copies of data — i.e. “replicas” — onto different nodes, for safety and reliable accessibility. (The number of copies is called the “replication factor”.) But how do they know that the different copies of the data really have the same values? It seems there are three main approaches to immediate consistency, which may be called:
- Two-phase commit (2PC)
- Read-your-writes (RYW) consistency
- Prudent optimism
I shall explain.
Two-phase commit has been around for decades. Its core idea is:
- One node commands other nodes (and perhaps itself) to write data.
- The other nodes all reply “Aye, aye; we are ready and able to do that.”
- The first node broadcasts “Make it so!”
Unless a piece of the system malfunctions at exactly the wrong time, you’ll get your consistent write. And if there indeed is an unfortunate glitch — well, that’s what recovery is for.
But 2PC has a flaw: If a node is inaccessible or down, then the write is blocked, even if other parts of the system were able to accept the data safely. So the NoSQL world sometimes chooses RYW consistency, which in essence is a loose form of 2PC: Read more
|Categories: Aster Data, Clustering, Hadoop, HBase, IBM and DB2, Netezza, NoSQL, Teradata, Vertica Systems||9 Comments|
To a first approximation, HCatalog is the thing that will make Hadoop play nicely with all kinds of ETLT (Extract/Transform/Load/Transform). However, HCatalog is both less and more than that:
- Less, because some of the basic features people expect in HCatalog may well wind up in other projects instead.
- More, in that HCatalog could (also) be regarded as a manifestation of efforts to make Hadoop/Hive more like a DBMS.
The base use case for HCatalog is:
- You have an HDFS file, and know its record format.
- You write DDL (Data Description Language) to create a virtual table whose columns look a lot like the HDFS file’s record fields. The language for this is HiveQL, Hive’s SQL-like language.
- You interact with that table in any of variety of ways (batch operations only), including but not limited to:
Major variants on that include: Read more
In my recent series of Hadoop posts, there were several cases where I had to choose between recommending that enterprises:
- Go with the most advanced features any vendor was credibly advocating.
- Be more cautious, and only adopt features that have been solidly proven in the field.
I favored the more advanced features each time. Here’s why.
To a first approximation, I divide Hadoop use cases into two major buckets, only one of which I was addressing with my comments:
1. Analytic data management.* Here I favored features over reliability because they are more important, for Hadoop as for analytic RDBMS before it. When somebody complains about an analytic data store not being ready for prime time, never really working, or causing them to tear their hair out, what they usually mean is that:
- It couldn’t do the work that needed doing …
- … with reasonable performance and turnaround time …
- … without undue effort in administration and/or programming.
Those complaints are much, much, more frequent than “It crashed”. So it was for Netezza, DATAllegro, Greenplum, Aster Data, Vertica, Infobright, et al. So it also is for Hadoop. And how does one address those complaints? By performance and feature enhancements, of the kind that the Hadoop community is introducing at high speed. Read more
|Categories: Buying processes, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, HBase, Hortonworks, Open source||Leave a Comment|
This is part of a four-post series, covering:
- Annoying Hadoop marketing themes that should be ignored.
- Hadoop versions and distributions, and their readiness or lack thereof for production.
- In general, how “enterprise-ready” is Hadoop?
- HBase 0.92 (this post)
As part of my recent round of Hadoop research, I talked with Cloudera’s Todd Lipcon. Naturally, one of the subjects was HBase, and specifically HBase 0.92. I gather that the major themes to HBase 0.92 are:
- Performance, scalability, and so on.
- “Coprocessors”, which are like triggers or stored procedures.
- Security, as the first major application of co-processors.
HBase coprocessors are Java code that links straight into HBase. As with other DBMS extensions of the “links straight into the DBMS code” kind,* HBase coprocessors seem best suited for very sophisticated users and third parties.** Evidently, coprocessors have already been used to make HBase security more granular — role-based, per-column-family/per-table, etc. Further, Todd thinks coprocessors could serve as a good basis for future HBase enhancements in areas such as aggregation or secondary indexing. Read more
|Categories: Benchmarks and POCs, Cloudera, Hadoop, HBase, MapReduce, NoSQL, Open source, Storage, Theory and architecture||2 Comments|