Edit: There is now a follow-up post on Cloudera Impala with substantially more detail.
In my world it’s possible to have a hasty 2-hour conversation, and that’s exactly what I had with Cloudera last week. We touched on hardware and general adoption, but much of the conversation was about Cloudera Impala, announced today. Like Hive, Impala turns Hadoop into a basic analytic RDBMS, with similar SQL/Hadoop integration benefits to those of Hadapt. In particular:
- Impala is Hive-compatible in query language (HQL, which is a whole lot like SQL), metadata, JDBC/ODBC drivers, etc.
- Unlike Hive, Impala does not work through Hadoop MapReduce.
- Unlike Hadoop MapReduce and hence Hive, Impala does not persist intermediate results to disk. This is good for performance, but on extremely long-running queries it increases the risk you’ll have a node failure and have to restart the query from scratch.
- Impala in its first version is missing some Hive syntax, notably in support for UDFs (User-Defined Functions).
Beyond that: Read more
|Categories: Cloudera, Columnar database management, Database compression, Hadapt, Hadoop, MapReduce, Open source, SQL/Hadoop integration||6 Comments|
Two of the more interesting approaches for integrating Hadoop and MapReduce with relational DBMS come from my clients at Teradata Aster (via SQL/MR and SQL-H) and Hadapt. In both cases, the story starts:
- You can dump any kind of data you want into Hadoop’s file system.
- You can have data in a scale-out RDBMS to get good performance on analytic SQL.
- You can access all the data (not just the relationally stored part) via SQL.
- You can do MapReduce on all the data (not just the Hadoop-stored part).
- To varying degrees, Hadapt and Aster each offer three kinds of advantage over Hadoop-with-Hive:
- SQL performance is (much) better.
- SQL functionality is better.
- At least some of your employees — the “business analysts” — can invoke MapReduce processes through SQL, if somebody else (e.g. your techies or the vendor’s) coded them up in the first place.
Of course, there are plenty of differences. Those start: Read more
My clients at Teradata are introducing a mix-em/match-em Aster/Hadoop box, officially called the Teradata Aster Big Analytics Appliance. Basics include:
- You can fill a rack with nodes either for the Aster DBMS or for Hadoop (Hortonworks flavor), or you can combine them in the same box.
- If you combine them, they share management software (adapted from mainstream Teradata’s) and Infiniband.
- An Aster node has 16 2.6-gigahertz cores and 24 900GB disk drives.
- A Hadoop node has 12 2.0-gigahertz cores and 12 3TB drives.
- A central part of Teradata’s strategy is that Aster and Hadoop nodes can work together via SQL-H.
- The Teradata Aster Big Analytics Appliance is based on a family of Dell servers that fit more compactly into racks than do Teradata’s traditional products.
- The Teradata Aster Big Analytics Appliance replaces a previous interim Teradata Aster appliance that used similar hardware to that in other Teradata systems.
My views on the Teradata Aster Big Analytics Appliance start: Read more
My clients at Hadapt are coming out with a Version 2 to be available in Q1 2013, and perhaps slipstreaming some of the features before then. At that point, it will be reasonable to regard Hadapt as offering:
- A very tight integration between an RDBMS-based analytic platform and Hadoop …
- … that is decidedly immature as an analytic RDBMS …
- … but which strongly improves the SQL capabilities of Hadoop (vs., say, the alternative of using Hive).
Solr is in the mix as well.
Hadapt+Hadoop is positioned much more as “better than Hadoop” than “a better scale-out RDBMS”– and rightly so, due to its limitations when viewed strictly from an analytic RDBMS standpoint. I.e., Hadapt is meant for enterprises that want to do several of:
- Dump multi-structured data into Hadoop.
- Refine or just move some of it into an RDBMS.
- Bring in data from other RDBMS.
- Process of all the above via Hadoop MapReduce.
- Process of all the above via SQL.
- Use full-text indexes on the data.
Hadapt has 6 or so production customers, a dozen or so more coming online soon, 35 or so employees (mainly in Cambridge or Poland), reasonable amounts of venture capital, and the involvement of a variety of industry luminaries. Hadapt’s biggest installation seems to have 10s of terabytes of relational data and 100s of TBs of multi-structured; Hadapt is very confident in its ability to scale an order of magnitude beyond that with the Version 2 product, and reasonably confident it could go even further.
At the highest level, Hadapt works like this: Read more
|Categories: Analytic technologies, Cloudera, Columnar database management, Data models and architecture, Data warehousing, Hadapt, Hadoop, MapR, MapReduce, Market share and customer counts, SQL/Hadoop integration, Text||4 Comments|
- This is a list of Monash Advantage members.
- All our vendor clients are Monash Advantage members, unless …
- … we work with them primarily in their capacity as technology users. (A large fraction of our user clients happen to be SaaS vendors.)
- We do not usually disclose our user clients.
- We do not usually disclose our venture capital clients, nor those who invest in publicly-traded securities.
- Excluded from this round of disclosure is one vendor I have never written about.
- Included in this round of disclosure is one client paying for services partly in stock. All our other clients are cash-only.
For reasons explained below, I’ll group the clients geographically. Obviously, companies often have multiple locations, but this is approximately how it works from the standpoint of their interactions with me. Read more
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved the company to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
|Categories: Hadapt, Hadoop, MapReduce, PostgreSQL, SQL/Hadoop integration, Theory and architecture, Workload management||6 Comments|
Ingres, the company, is:
- Changing its name to Actian.
- Deemphasizing Ingres, the product.
- Emphasizing a set of products that don’t exist yet (or at least aren’t shipping), namely lightweight mobile apps that are business-intelligence-plus-an-action, and technology for building them. These are called “Action Apps”, and are discussed on the Actian company blog.
- Positioning all this as something to do with “big data” (what a shock).
It turns out that Actian was the name of an ancient athletic competition commemorating Augustus’ defeat of Anthony at Actium, a battle that was more recently memorialized in the movie Cleopatra. Frankly, I think Cleopatra Software might have been a more interesting company name, although that could mean execs would have to arrive at sales calls rolled up in a carpet.
|Categories: Actian and Ingres, Business intelligence, Hadapt, Market share and customer counts, VectorWise||10 Comments|
Hadoop is immature technology. As such, it naturally offers much room for improvement in both industrial-strengthness and performance. And since Hadoop is booming, multiple efforts are underway to fill those gaps. For example:
- Cloudera’s proprietary code is focused on management, set-up, etc.
- The “Phase 1″ plans Hortonworks shared with me for Apache Hadoop are focused on industrial-strengthness, as are significant parts of “Phase 2″.*
- MapR tells a performance story versus generic Apache Hadoop HDFS and MapReduce. (One aspect of same is just C++ vs. Java.)
- So does Hadapt, but mainly vs. Hive.
- Cloudera also tells me there’s a potential 4-5X performance improvement in Hive coming down the pike from what amounts to an optimizer rewrite.
(Zettaset belongs in the discussion too, but made an unfortunate choice of embargo date.)
|Categories: Cloudera, Greenplum, Hadapt, Hadoop, HBase, MapR, MapReduce, Parallelization, Zettaset||20 Comments|
I met with the Hadapt guys today. I think I can be a bit crisper than before in positioning Hadapt and its use cases, namely:
- Hadapt is additional software on a cluster that also runs fully functional Hadoop/HDFS. (Cloudera Hadoop more than straight-from-Apache Hadoop to date, but that’s not a requirement.)
- The cluster also runs a DBMS on every node, such as PostgreSQL or one of Infobright/Vectorwise.
- Hadapt’s software manages parallel SQL queries by distributing them to the DBMS living on each node. Hadapt says that the resulting query performance far outshines Hive’s.
- Hadapt further says that, by exploiting the partner DBMS, its SQL functionality outpaces Hive’s as well.
- Target Hadapt use cases are centered around keeping machine-generated or other poly-structured data in Hadoop, and extracting, enhancing, or otherwise deriving some of it to live in the relational store.
- In particular, Hadapt seems like an interesting choice when you want to use that relational data as you work on other data that’s still in HDFS, or if you want to keep using the relational data in other kinds of MapReduce jobs.
- That all fits well with my thoughts about the importance of derived data.
Other evolution from what I wrote about Hadapt a few months ago includes:
- Hadapt is in beta now.
- Hadapt has added adult supervision in the form of Philip Wickline, late of Endeca.
In other news, Hadapt is our newest client.
|Categories: Analytic technologies, Cloudera, Data models and architecture, Data warehousing, Hadapt, Hadoop, Infobright, MapReduce, Open source, PostgreSQL, SQL/Hadoop integration, VectorWise||Leave a Comment|
There’s been a flurry of announcements recently in the Hadoop world. Much of it has been concentrated on Hadoop data storage and management. This is understandable, since HDFS (Hadoop Distributed File System) is quite a young (i.e. immature) system, with much strengthening and Bottleneck Whack-A-Mole remaining in its future.
Known HDFS and Hadoop data storage and management issues include but are not limited to:
- Hadoop is run by a master node, and specifically a namenode, that’s a single point of failure.
- HDFS compression could be better.
- HDFS likes to store three copies of everything, whereas many DBMS and file systems are satisfied with two.
- Hive (the canonical way to do SQL joins and so on in Hadoop) is slow.
Different entities have different ideas about how such deficiencies should be addressed. Read more
|Categories: Aster Data, Cassandra, Cloudera, Data warehouse appliances, DataStax, EMC, Greenplum, Hadapt, Hadoop, IBM and DB2, MapReduce, MongoDB and 10gen, Netezza, Parallelization||22 Comments|