Cloud computing

Analysis of cloud computing, especially as applied to database management and analytics. Related subjects include:

December 12, 2017

Notes on artificial intelligence, December 2017

Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.

1. As I wrote back then in a post about the connection between machine learning and the rest of AI,

It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.

2. Accordingly, it can be reasonable to equate machine learning and AI.

AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
“AI” can be the sexier marketing or fund-raising term.

3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:

Understanding or translation of language (written or spoken as the case may be).
Machine vision or autonomous vehicles.
Facial recognition.
Disease diagnosis via radiology interpretation.

4. The importance of AI and of recent AI advances differs greatly according to application or data category. Read more

Categories: Cloud computing, Predictive modeling and advanced analytics, Public policy, Surveillance and privacy

4 Comments

August 17, 2017

More notes on the transition to the cloud

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span multiple CIO’s tenure in their positions.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. And so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

Categories: Amazon and its cloud, Cloud computing, Databricks, Spark and BDAS, Google, Microsoft and SQL*Server, Storage

1 Comment

June 30, 2017

Analytics on the edge?

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. Read more

Categories: Business intelligence, Cloud computing, Data warehousing, Database diversity, Databricks, Spark and BDAS, Log analysis, NoSQL, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP)

1 Comment

June 14, 2017

Light-touch managed services

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: Read more

Categories: Cloud computing, Cloudera, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Software as a Service (SaaS), Surveillance and privacy

3 Comments

June 14, 2017

Cloudera Altus

I talked with Cloudera before the recent release of Altus. In simplest terms, Cloudera’s cloud strategy aspires to:

Provide all the important advantages of on-premises Cloudera.
Provide all the important advantages of native cloud offerings such as Amazon EMR (Elastic MapReduce, or at least come sufficiently close to that goal.
Benefit from customers’ desire to have on-premises and cloud deployments that work:
- Alike in any case.
- Together, to the extent that that makes use-case sense.

In other words, Cloudera is porting its software to an important new platform.* And this port isn’t complete yet, in that Altus is geared only for certain workloads. Specifically, Altus is focused on “data pipelines”, aka data transformation, aka “data processing”, aka new-age ETL (Extract/Transform/Load). (Other kinds of workload are on the roadmap, including several different styles of Impala use.) So what about that is particularly interesting? Well, let’s drill down.

*Or, if you prefer, improving on early versions of the port.

Categories: Amazon and its cloud, Cloud computing, Cloudera, Databricks, Spark and BDAS, Hadoop, Log analysis, MapReduce, Software as a Service (SaaS)

2 Comments

October 3, 2016

Notes on the transition to the cloud

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

Categories: Amazon and its cloud, Cloud computing, Data mart outsourcing, Data warehousing, Database diversity, Google, Hadoop, Health care, Investment research and trading, Log analysis, Microsoft and SQL*Server, NoSQL, Software as a Service (SaaS), Solid-state memory, Splunk, Surveillance and privacy, Text, Web analytics

5 Comments

August 28, 2016

Are analytic RDBMS and data warehouse appliances obsolete?

I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:

Many of the independent vendors were swooped up by acquisition.
- IBM bought Netezza.
- Microsoft bought DATAllegro.
- HP bought Vertica.
- Greenplum went to EMC/VMware/Pivotal.
- Teradata bought Aster.
- Actian bought both ParAccel and Vectorwise.
None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
Impala and other Hadoop-based alternatives are technology options.
Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big. 🙂

Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.

My answer, in a nutshell, is:

Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud — are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.

Categories: Actian and Ingres, Amazon and its cloud, Aster Data, Business intelligence, Calpont, Cloud computing, Data warehouse appliances, Data warehousing, Databricks, Spark and BDAS, DATAllegro, EAI, EII, ETL, ELT, ETLT, EMC, Exasol, Greenplum, Hadapt, Hadoop, HP and Neoview, IBM and DB2, In-memory DBMS, Infobright, Kickfire, Kognitio, MemSQL, Microsoft and SQL*Server, Netezza, Open source, Oracle, ParAccel, Predictive modeling and advanced analytics, SAP AG, SQL/Hadoop integration, Sybase, Teradata, VectorWise, Vertica Systems

29 Comments

July 31, 2016

Notes on Spark and Databricks — technology

During my recent visit to Databricks, I of course talked a lot about technology — largely with Reynold Xin, but a bit with Ion Stoica as well. Spark 2.0 is just coming out now, and of course has a lot of enhancements. At a high level:

Using the new terminology, Spark originally assumed users had data engineering skills, but Spark 2.0 is designed to be friendly to data scientists.
A lot of this is via a focus on simplified APIs, based on
- Unlike similarly named APIs in R and Python, Spark DataFrames work with nested data.
- Machine learning and Spark Streaming both work with Spark DataFrames.
There are lots of performance improvements as well, some substantial. Spark is still young enough that Bottleneck Whack-A-Mole yields huge benefits, especially in the SparkSQL area.
SQL coverage is of course improved. For example, SparkSQL can now perform all TPC-S queries.

The majority of Databricks’ development efforts, however, are specific to its cloud service, rather than being donated to Apache for the Spark project. Some of the details are NDA, but it seems fair to mention at least:

Databricks’ notebooks feature for organizing and launching machine learning processes and so on is a biggie. Jupyter is an open source analog.
Databricks has been working on security, and even on the associated certifications.

Two of the technical initiatives Reynold told me about seemed particularly cool. Read more

Categories: Benchmarks and POCs, Cloud computing, Databricks, Spark and BDAS, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP)

3 Comments

July 19, 2016

Notes from a long trip, July 19, 2016

For starters:

I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.

A running list of recent posts is:

As a companion to this post, I’m publishing a very long one on vendor lock-in.
Spark and Databricks are both prospering, and of course enhancing their technology as well.
Ditto DataStax.
Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.

Subjects I’d like to add to that list include:

MemSQL, Zoomdata, and Neo Technology (also prospering).
Cloudera (multiple topics, as usual).
Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
Microsoft’s reinvention (it feels real).
Metadata (it’s ever more of a thing).
Machine learning (it’s going to be a big portion of my research going forward).
Transitions to the cloud — this subject affects almost everything else.

Categories: About this blog, Business intelligence, Cassandra, ClearStory Data, Cloud computing, Cloudera, Data warehousing, Databricks, Spark and BDAS, DataStax, Facebook, HBase, Market share and customer counts, MemSQL, Microsoft and SQL*Server, NoSQL, Open source, Platfora, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP), Workday, Zoomdata

7 Comments

January 22, 2016

Cloudera in the cloud(s)

Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.

Making Cloudera run in the cloud has three major aspects:

Cloudera’s usual software, ported to run on the cloud platform(s).
Cloudera Director, which for example launches cloud instances.
Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.

Features new in this week’s release of Cloudera Director include:

An API for job submission.
Support for spot and preemptable instances.
High availability.
Kerberos.
Some cluster repair.
Some cluster cloning.

I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.

As for porting, let me start by noting: Read more

Categories: Amazon and its cloud, Business intelligence, Cloud computing, Cloudera, Data models and architecture, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Google, Hadoop, Hortonworks, Market share and customer counts, Microsoft and SQL*Server, Predictive modeling and advanced analytics

8 Comments

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in