Analysis of cloud computing, especially as applied to database management and analytics. Related subjects include:
During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:
- Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
- A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
- There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
- SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.
The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:
- Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
- Databricks has been working on security, and even on the associated certifications.
Two of the technical initiatives Reynold told me about seemed particularly cool. Read more
|Categories: Benchmarks and POCs, Cloud computing, Databricks, Spark and BDAS, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP)||2 Comments|
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
- Spark and Databricks are both prospering, and of course enhancing their technology as well.
- Ditto DataStax.
- Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.
Subjects I’d like to add to that list include:
- MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.
Making Cloudera run in the cloud has three major aspects:
- Cloudera’s usual software, ported to run on the cloud platform(s).
- Cloudera Director, which for example launches cloud instances.
- Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.
Features new in this week’s release of Cloudera Director include:
- An API for job submission.
- Support for spot and preemptable instances.
- High availability.
- Some cluster repair.
- Some cluster cloning.
I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.
As for porting, let me start by noting: Read more
When I find myself making the same observation fairly frequently, that’s a good impetus to write a post based on it. And so this post is based on the thought that there are many analogies between:
- Oracle and the Oracle DBMS.
- IBM and the IBM mainframe.
And when you look at things that way, Oracle seems to be swimming against the tide.
Drilling down, there are basically three things that can seriously threaten Oracle’s market position:
- Growth in apps of the sort for which Oracle’s RDBMS is not well-suited. Much of “Big Data” fits that description.
- Outright, widespread replacement of Oracle’s application suites. This is the least of Oracle’s concerns at the moment, but could of course be a disaster in the long term.
- Transition to “the cloud”. This trend amplifies the other two.
Oracle’s decline, if any, will be slow — but I think it has begun.
There’s a clear market lead in the core product category. IBM was dominant in mainframe computing. While not as dominant, Oracle is definitely a strong leader in high-end OTLP/mixed-use (OnLine Transaction Processing) RDBMS.
That market lead is even greater than it looks, because some of the strongest competitors deserve asterisks. Many of IBM’s mainframe competitors were “national champions” — Fujitsu and Hitachi in Japan, Bull in France and so on. Those were probably stronger competitors to IBM than the classic BUNCH companies (Burroughs, Univac, NCR, Control Data, Honeywell).
Similarly, Oracle’s strongest direct competitors are IBM DB2 and Microsoft SQL Server, each of which is sold primarily to customers loyal to the respective vendors’ full stacks. SAP is now trying to play a similar game.
The core product is stable, secure, richly featured, and generally very mature. Duh.
The core product is complicated to administer — which provides great job security for administrators. IBM had JCL (Job Control Language). Oracle has a whole lot of manual work overseeing indexes. In each case, there are many further examples of the point. Edit: A Twitter discussion suggests the specific issue with indexes has been long fixed.
Niche products can actually be more reliable than the big, super-complicated leader. Tandem Nonstop computers were super-reliable. Simple, “embeddable” RDBMS — e.g. Progress or SQL Anywhere — in many cases just work. Still, if you want one system to run most of your workload 24×7, it’s natural to choose the category leader. Read more
|Categories: Cloud computing, Database diversity, Exadata, IBM and DB2, Market share and customer counts, Microsoft and SQL*Server, NoSQL, Oracle, Software as a Service (SaaS)||25 Comments|
There’s a lot of talk these days about transitioning to the cloud, by IT customers and vendors alike. Of course, I have thoughts on the subject, some of which are below.
1. The economies of scale of not running your own data centers are real. That’s the kind of non-core activity almost all enterprises should outsource. Of course, those considerations taken alone argue equally for true cloud, co-location or SaaS (Software as a Service).
2. When the (Amazon) cloud was newer, I used to hear that certain kinds of workloads didn’t map well to the architecture Amazon had chosen. In particular, shared-nothing analytic query processing was necessarily inefficient. But I’m not hearing nearly as much about that any more.
3. Notwithstanding the foregoing, not everybody loves Amazon pricing.
4. Infrastructure vendors such as Oracle would like to also offer their infrastructure to you in the cloud. As per the above, that could work. However:
- Is all your computing on Oracle’s infrastructure? Probably not.
- Do you want to move the Oracle part and the non-Oracle part to different clouds? Ideally, no.
- Do you like the idea of being even more locked in to Oracle than you are now? [Insert BDSM joke here.]
- Will Oracle do so much better of a job hosting its own infrastructure that you use its cloud anyway? Well, that’s an interesting question.
Actually, if we replace “Oracle” by “Microsoft”, the whole idea sounds better. While Microsoft doesn’t have a proprietary server hardware story like Oracle’s, many folks are content in the Microsoft walled garden. IBM has fiercely loyal customers as well, and so may a couple of Japanese computer manufacturers.
5. Even when running stuff in the cloud is otherwise a bad idea, there’s still: Read more
|Categories: Amazon and its cloud, Cloud computing, Emulation, transparency, portability, IBM and DB2, Microsoft and SQL*Server, Oracle, Pricing||6 Comments|
It is extremely difficult to succeed with SaaS (Software as a Service) and packaged software in the same company. There were a few vendors who seemed to pull it off in the 1970s and 1980s, generally industry-specific application suite vendors. But it’s hard to think of more recent examples — unless you have more confidence than I do in what behemoth software vendors say about their SaaS/”cloud” businesses.
Despite the cautionary evidence, I’m going to argue that SaaS and software can and often should be combined. The “should” part is pretty obvious, with reasons that start:
- Some customers are clearly better off with SaaS. (E.g., for simplicity.)
- Some customers are clearly better off with on-premises software. (E.g., to protect data privacy.)
- On-premises customers want to know they have a path to the cloud.
- Off-premises customers want the possibility of leaving their SaaS vendor’s servers.
- SaaS can be great for testing, learning or otherwise adopting software that will eventually be operated in-house.
- Marketing and sales efforts for SaaS and packaged versions can be synergistic.
- The basic value proposition, competitive differentiation, etc. should be the same, irrespective of delivery details.
- In some cases, SaaS can be the lower cost/lower commitment option, while packaged product can be the high end or upsell.
- An ideal sales force has both inside/low-end and bag-carrying/high-end components.
But the “how” of combining SaaS and traditional software is harder. Let’s review why. Read more
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
- Currently running on Amazon only.
- Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as: Read more
|Categories: Amazon and its cloud, Cloud computing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Parallelization, Petabyte-scale data management, Predictive modeling and advanced analytics, Software as a Service (SaaS)||7 Comments|
7-10 years ago, I repeatedly argued the viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
|Categories: Cloud computing, Databricks, Spark and BDAS, Hadoop, MapReduce, MemSQL, Software as a Service (SaaS)||15 Comments|
It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:
- Amazon Redshift appears to be prospering.
- So are some SaaS (Software as a Service) business intelligence vendors.
- Amazon Elastic MapReduce is still around.
- Snowflake Computing launched with a cloud strategy.
- Cazena, with vague intentions for cloud data warehousing, destealthed.*
- Cloudera made various cloud-related announcements.
- Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
- The general argument for cloud-or-at-least-colocation has compelling aspects.
- Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.