I talked with Cloudera shortly ahead of today’s announcement of Cloudera 5.5. Much of what we talked about had something or other to do with SQL data management. Highlights include:
- Impala and Kudu are being donated to Apache. This actually was already announced Tuesday. (Due to Apache’s rules, if I had any discussion with Cloudera speculating on the likelihood of Apache accepting the donations, I would not be free to relay it.)
- Cloudera is introducing SQL extensions so that Impala can query nested data structures. More on that below.
- The basic idea for the nested datatype support is that there are SQL extensions with a “dot” notation to let you get at the specific columns you need.
- From a feature standpoint, we’re definitely still in the early days.
- When I asked about indexes on these quasi-columns, I gathered that they’re not present in beta but are hoped for by the time of general availability.
- Basic data skipping, also absent in beta, seems to be more confidently expected in GA.
- This is for Parquet first, Avro next, and presumably eventually native JSON as well.
- This is said to be Dremel-like, at least in the case of Parquet. I must confess that I’m not familiar enough with Apache Drill to compare the two efforts.
- Cloudera is increasing its coverage of Spark in several ways.
- Cloudera is adding support for MLlib.
- Cloudera is adding support for SparkSQL. More on that below.
- Cloudera is adding support for Spark going against S3. The short answer to “How is this different from the Databricks service?” is:
- More “platform” stuff from the Hadoop stack (e.g. for data ingest).
- Less in the way of specific Spark usability stuff.
- Cloudera is putting into beta what it got in the Xplain.io acquisition, which it unfortunately is naming Cloudera Navigator Optimizer. More on that in a separate post.
- Impala and Hive are getting column-level security via Apache Sentry.
- There are other security enhancements.
- Some policy-based information lifecycle management is being added as well.
While I had Cloudera on the phone, I asked a few questions about Impala adoption, specifically focused on concurrency. There was mention of: Read more
|Categories: Benchmarks and POCs, Cloudera, Data warehousing, Databricks, Spark and BDAS, Market share and customer counts, Petabyte-scale data management, Predictive modeling and advanced analytics, SQL/Hadoop integration||4 Comments|
Parts of the business intelligence differentiation story resemble the one I just posted for data management. After all:
- Both kinds of products query and aggregate data.
- Both are offered by big “enterprise standard” behemoth companies and also by younger, nimbler specialists.
- You really, really, really don’t want your customer data to leak via a security breach in either kind of product.
That said, insofar as BI’s competitive issues resemble those of DBMS, they are those of DBMS-lite. For example:
- BI is less mission-critical than some other database uses.
- BI has done a lot less than DBMS to deal with multi-structured data.
- Scalability demands on BI are less than those on DBMS — indeed, they’re the ones that are left over after the DBMS has done its data crunching first.
And full-stack analytic systems — perhaps delivered via SaaS (Software as a Service) — can moot the BI/data management distinction anyway.
Of course, there are major differences between how DBMS and BI are differentiated. The biggest are in user experience. I’d say: Read more
|Categories: Business intelligence, Buying processes, ClearStory Data, Data mart outsourcing, Pricing, QlikTech and QlikView, Rocana, Tableau Software||Leave a Comment|
In the previous post I broke product differentiation into 6-8 overlapping categories, which may be abbreviated as:
- (Other) trustworthiness
- User experience
and sometimes also issues in adoption and administration.
Now let’s use this framework to examine two market categories I cover — data management and, in separate post, business intelligence.
Applying this taxonomy to data management:
|Categories: Buying processes, Clustering, Data warehousing, Database diversity, Microsoft and SQL*Server, Predictive modeling and advanced analytics, Pricing||2 Comments|
Obviously, a large fraction of what I write about involves technical differentiation. So let’s try for a framework where differentiation claims can be placed in context. This post will get through the generalities. The sequels will apply them to specific cases.
Many buying and design considerations for IT fall into six interrelated areas: Read more
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:
- Rack- or data-center-scale systems.
- The real or imagined demise of Moore’s Law.
I also got updated as to typical Hadoop hardware.
If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. My most interesting of several mentions of that point was when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)
One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost than one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.
|Categories: Amazon and its cloud, Buying processes, Cloudera, Facebook, Google, Intel, Memory-centric data management, Pricing, Solid-state memory||19 Comments|
In 1981, Gerry Chichester and Vaughan Merlyn did a user-survey-based report about transaction-oriented fourth-generation languages, the leading application development technology of their day. The report included top-ten lists of important features during the buying cycle and after implementation. The items on each list were very similar — but the order of the items was completely different. And so the report highlighted what I regard as an eternal truth of the enterprise software industry:
What users value in the product-buying process is quite different from what they value once a product is (being) put into use.
Here are some thoughts about how that comes into play today.
Wants outrunning needs
1. For decades, BI tools have been sold in large part via demos of snazzy features the CEO would like to have on his desk. First it was pretty colors; then it was maps; now sometimes it’s “real-time” changing displays. Other BI features, however, are likely to be more important in practice.
2. In general, the need for “real-time” BI data freshness is often exaggerated. If you’re a human being doing a job that’s also often automated at high speed — for example network monitoring or stock trading — there’s a good chance you need fully human real-time BI. Otherwise, how much does a 5-15 minute delay hurt? Even if you’re monitoring website sell-through — are your business volumes really high enough that 5 minutes matters much? eBay answered “yes” to that question many years ago, but few of us work for businesses anywhere near eBay’s scale.
Even so, the want for speed keeps growing stronger.
3. Similarly, some desires for elastic scale-out are excessive. Your website selling koi pond accessories should always run well on a single server. If you diversify your business to the point that that’s not true, you’ll probably rewrite your app by then as well.
4. Some developers want to play with cool new tools. That doesn’t mean those tools are the best choice for the job. In particular, boring old SQL has merits — such as joins! — that shiny NoSQL hasn’t yet replicated.
5. Some developers, on the other hand, want to keep using their old tools, on which they are their employers’ greatest experts. That doesn’t mean those tools are the best choice for the job either.
6. More generally, some enterprises insist on brand labels that add little value but lots of expense. Yes, there are many benefits to vendor consolidation, and you may avoid many headaches if you stick with not-so-cutting-edge technology. But “enterprise-grade” hardware failure rates may not differ enough from “consumer-grade” ones to be worth paying for.
|Categories: Benchmarks and POCs, Business intelligence, Cloud computing, Clustering, Data models and architecture, Data warehousing, NoSQL, Software as a Service (SaaS), Vertica Systems||3 Comments|
1. Censorship worries me, a lot. A classic example is Vietnam, which basically has outlawed online political discussion.
And such laws can have teeth. It’s hard to conceal your internet usage from an inquisitive government.
2. Software and software related patents are back in the news. Google, which said it was paying $5.5 billion or so for a bunch of Motorola patents, turns out to really have paid $7 billion or more. Twitter and IBM did a patent deal as well. Big numbers, and good for certain shareholders. But this all benefits the wider world — how?
The purpose of legal intellectual property protections, simply put, is to help make it a good decision to create something. …
Why does “securing … exclusive Right[s]” to the creators of things that are patented, copyrighted, or trademarked help make it a good decision for them to create stuff? Because it averts competition from copiers, thus making the creator a monopolist in what s/he has created, allowing her to at least somewhat value-price her creation.
I.e., the core point of intellectual property rights is to prevent copying-based competition. By way of contrast, any other kind of intellectual property “right” should be viewed with great suspicion.
That Constitutionally-based principle makes as much sense to me now as it did then. By way of contrast, “Let’s give more intellectual property rights to big corporations to protect middle-managers’ jobs” is — well, it’s an argument I view with great suspicion.
But I find it extremely hard to think of a technology industry example in which development was stimulated by the possibility of patent protection. Yes, the situation may be different in pharmaceuticals, or for gadgeteering home inventors, but I can think of no case in which technology has been better, or faster to come to market, because of the possibility of a patent-law monopoly. So if software and business-method patents were abolished entirely – even the ones that I think could be realistically adjudicated — I’d be pleased.
3. In November, 2008 I offered IT policy suggestions for the incoming Obama Administration, especially: Read more
|Categories: Buying processes, Google, IBM and DB2, Public policy, Surveillance and privacy||1 Comment|
Generalizing about SaaS (Software as a Service) is hard. To prune some of the confusion, let’s start by noting:
- SaaS has been around for over half a century, and at times has been the dominant mode of application delivery.
- The term multi-tenancy is being used in several different ways.
- Multi-tenancy, in the purest sense, is inessential to SaaS. It’s simply an implementation choice that has certain benefits for the SaaS provider. And by the way, …
- … salesforce.com, the chief proponent of the theory that true multi-tenancy is the hallmark of true SaaS, abandoned that position this week.
- Internet-based services are commonly, if you squint a little, SaaS. Examples include but are hardly limited to Google, Twitter, Dropbox, Intuit, Amazon Web Services, and the company that hosts this blog (KnownHost).
- Some of the core arguments for SaaS’ rise, namely the various efficiencies of data center outsourcing and scale, apply equally to the public cloud, to SaaS, and to AEaaS (Anything Else as a Service).
- These benefits are particularly strong for inherently networked use cases. For example, you really don’t want to be hosting your website yourself. And salesforce.com got its start supporting salespeople who worked out of remote offices.
- In theory and occasionally in practice, certain SaaS benefits, namely the outsourcing of software maintenance and updates, could be enjoyed on-premises as well. Whether I think that could be a bigger deal going forward will be explored in future posts.
For smaller enterprises, the core outsourcing argument is compelling. How small? Well:
- What’s the minimum level of IT operations headcount needed for mission-critical systems? Let’s just say “several”.
- What does that cost? Fully burdened, somewhere in the six figures.
- What fraction of the IT budget should such headcount be? As low a double digit percentage as possible.
- What fraction of revenues should be spent on IT? Some single-digit percentage.
So except for special cases, an enterprise with less than $100 million or so in revenue may have trouble affording on-site data processing, at least at a mission-critical level of robustness. It may well be better to use NetSuite or something like that, assuming needed features are available in SaaS form.*
|Categories: Amazon and its cloud, Buying processes, Cloud computing, Data mart outsourcing, Data warehouse appliances, Data warehousing, Infobright, Netezza, Pricing, salesforce.com, Software as a Service (SaaS), Workday||5 Comments|
Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)