data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example: Read more
|Categories: Cloudera, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Hortonworks, Intel, Market share and customer counts, Open source, Streaming and complex event processing (CEP)||2 Comments|
I visited Databricks in early July to chat with Ion Stoica and Reynold Xin. Spark also comes up in a large fraction of the conversations I have. So let’s do some catch-up on Databricks and Spark. In a nutshell:
- Spark is indeed the replacement for Hadoop MapReduce.
- Spark is becoming the default platform for machine learning.
- SparkSQL (nee’ Shark) is puttering along predictably.
- Databricks reports good success in its core business of cloud-based machine learning support.
- Spark Streaming has strong adoption, but its position is at risk.
- Databricks, the original authority on Spark, is not keeping a tight grip on that role.
I shall explain below. I also am posting separately about Spark evolution, especially Spark 2.0. I’ll also talk a bit in that post about Databricks’ proprietary/closed-source technology.
Spark is the replacement for Hadoop MapReduce.
This point is so obvious that I don’t know what to say in its support. The trend is happening, as originally decreed by Cloudera (and me), among others. People are rightly fed up with the limitations of MapReduce, and — niches perhaps aside — there are no serious alternatives other than Spark.
The greatest use for Spark seems to be the same as the canonical first use for MapReduce: data transformation. Also in line with the Spark/MapReduce analogy: Read more
|Categories: Cloudera, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, MapReduce, Market share and customer counts, Predictive modeling and advanced analytics||5 Comments|
Whenever somebody asks for my help on application technology strategy, I start by trying to ascertain three things. The absolute first is actually a prerequisite to almost any kind of useful conversation, which is to ascertain in general terms what the hell it is that we are talking about.
My second goal is to ascertain technology constraints. Three common types are:
- Compatible with legacy systems and/or enterprise standards.
- Cheap, free and/or open source.
- Proven, vetted by sufficiently many references, and/or generally having an “enterprise-y” reputation.
That’s often a short and straightforward discussion, except in those awkward situations when all three of my bullet points above are applicable at once.
The third item is usually more interesting. I try to figure out what is to be accomplished. That’s usually not a simple matter, because the initial list of goals and requirements is almost never accurate. It’s actually more common that I have to tell somebody to be more ambitious than that I need to rein them in.
Commonly overlooked needs include:
- If you want to sell something and have happy users, you need a good UI.
- You will also soon need tools and a UI for administration.
- Customers demand low-latency/fresh data. Your explanation of why they don’t really need it doesn’t contradict the fact that they want it.
- Providing data access and saying “You can hook up any BI tool you want and build charts” is not generally regarded as offering a good UI.
- When “adding analytics” to something previously focused on short-request processing, it is common to underestimate the variety of things users will soon want to do. (One common reason for this under-estimate is that after years of being told it can’t be done, they’ve learned not to ask.)
And if you take one thing away from this post, then take this:
- If you “know” exactly which features are or aren’t helpful to users, …
- .. and if you supply only what you “know” they should use, …
- … then you will discover that what you “knew” wasn’t really accurate.
I guarantee it.
|Categories: Business intelligence, Buying processes, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics||2 Comments|
In a companion introduction to Kafka post, I observed that Kafka at its core is remarkably simple. Confluent offers a marchitecture diagram that illustrates what else is on offer, about which I’ll note:
- The red boxes — “Ops Dashboard” and “Data Flow Audit” — are the initial closed-source part. No surprise that they sound like management tools; that’s the traditional place for closed source add-ons to start.
- “Schema Management”
- Is used to define fields and so on.
- Is not equivalent to what is ordinarily meant by schema validation, in that …
- … it allows schemas to change, but puts constraints on which changes are allowed.
- Is done in plug-ins that live with the producer or consumer of data.
- Is based on the Hadoop-oriented file format Avro.
Kafka offers little in the way of analytic data transformation and the like. Hence, it’s commonly used with companion products. Read more
|Categories: Data integration and middleware, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Kafka and Confluent, Market share and customer counts, Streaming and complex event processing (CEP)||3 Comments|
- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low to mid teens in paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple:
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
So it seems fair to say:
- Kafka offers the benefits of hub vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
|Categories: Data integration and middleware, Humor, Kafka and Confluent, Market share and customer counts, Microsoft and SQL*Server, Open source, Specific users, Streaming and complex event processing (CEP)||10 Comments|
Cloudera released Version 2 of Cloudera Director, which is a companion product to Cloudera Manager focused specifically on the cloud. This led to a discussion about — you guessed it! — Cloudera and the cloud.
Making Cloudera run in the cloud has three major aspects:
- Cloudera’s usual software, ported to run on the cloud platform(s).
- Cloudera Director, which for example launches cloud instances.
- Points of integration, e.g. taking information about security-oriented roles from the platform and feeding then to the role-based security that is specific to Cloudera Enterprise.
Features new in this week’s release of Cloudera Director include:
- An API for job submission.
- Support for spot and preemptable instances.
- High availability.
- Some cluster repair.
- Some cluster cloning.
I.e., we’re talking about some pretty basic/checklist kinds of things. Cloudera Director is evidently working for Amazon AWS and Google GCP, and planned for Windows Azure, VMware and OpenStack.
As for porting, let me start by noting: Read more
I’m on two overlapping posting kicks, namely “lessons from the past” and “stuff I keep saying so might as well also write down”. My recent piece on Oracle as the new IBM is an example of both themes. In this post, another example, I’d like to memorialize some points I keep making about business intelligence and other analytics. In particular:
- BI relies on strong data access capabilities. This is always true. Duh.
- Therefore, BI and other analytics vendors commonly reinvent the data management wheel. This trend ebbs and flows with technology cycles.
Similarly, BI has often been tied to data integration/ETL (Extract/Transform/Load) functionality.* But I won’t address that subject further at this time.
*In the Hadoop/Spark era, that’s even truer of other analytics than it is of BI.
My top historical examples include:
- The 1970s analytic fourth-generation languages (RAMIS, NOMAD, FOCUS, et al.) commonly combined reporting and data management.
- The best BI visualization technology of the 1980s, Executive Information Systems (EIS), was generally unsuccessful. The core reason was a lack of what we’d now call drilldown. Not coincidentally, EIS vendors — notably leader Comshare — didn’t do well at DBMS-like technology.
- Business Objects, one of the pioneers of the modern BI product category, rose in large part on the strength of its “semantic layer” technology. (If you don’t know what that is, you can imagine it as a kind of virtual data warehouse modest enough in its ambitions to actually be workable.)
- Cognos, the other pioneer of modern BI, depending on capabilities for which it needed a bundled MOLAP (Multidimensional OnLine Analytic Processing) engine.
- But Cognos later stopped needing that engine, which underscores my point about technology ebbing and flowing.
|Categories: Business intelligence, Business Objects, Cognos, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Information Builders, MicroStrategy, Software as a Service (SaaS), Teradata||5 Comments|
I only have mixed success at getting my clients to reach out to me for messaging advice when they’re introducing something new. Cloudera Navigator Optimizer, which is being announced along with Cloudera 5.5, is one of my failures in that respect; I heard about it for the first time Tuesday afternoon. I hate the name. I hate some of the slides I saw. But I do like one part of the messaging, namely the statement that this is about “refactoring” queries.
All messaging quibbles aside, I think the Cloudera Navigator Optimizer story is actually pretty interesting, and perhaps not just to users of SQL-on-Hadoop technologies such as Hive (which I guess I’d put in that category for simplicity) or Impala. As I understand Cloudera Navigator Optimizer:
- It’s all about analytic SQL queries.
- Specifically, it’s about reducing duplicated work.
- It is not an “optimizer” in the ordinary RDBMS sense of the word.
- It’s delivered via SaaS (Software as a Service).
- Conceptually, it’s not really tied to SQL-on-Hadoop. However, …
- … in practice it likely will be used by customers who want to optimize performance of Cloudera’s preferred styles of SQL-on-Hadoop, either because they’re already using SQL-on-Hadoop or in connection with an initial migration.
|Categories: Business intelligence, Cloudera, Data pipelining, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, SQL/Hadoop integration||4 Comments|
One pleasure in talking with my clients at MongoDB is that few things are NDA. So let’s start with some numbers:
- >2,000 named customers, the vast majority of which are unique organizations who do business with MongoDB directly.
- ~75,000 users of MongoDB Cloud Manager.
- Estimated ~1/4 million production users of MongoDB total.
Also >530 staff, and I think that number is a little out of date.
MongoDB lacks many capabilities RDBMS users take for granted. MongoDB 3.2, which I gather is slated for early November, narrows that gap, but only by a little. Features include:
- Some JOIN capabilities.
- Specifically, these are left outer joins, so they’re for lookup but not for filtering.
- JOINs are not restricted to specific shards of data …
- … but do benefit from data co-location when it occurs.
- A BI connector. Think of this as a MongoDB-to- SQL translator. Using this does require somebody to go in and map JSON schemas and relational tables to each other. Once that’s done, the flow is:
- Basic SQL comes in.
- Filters and GroupBys are pushed down to MongoDB. A result set … well, it results.
- The result set is formatted into a table and returned to the system — for example a business intelligence tool — that sent the SQL.
- Database-side document validation, in the form of field-specific rules that combine into a single expression against which to check a document.
- This is fairly simple stuff — no dependencies among fields in the same document, let alone foreign key relationships.
- MongoDB argues, persuasively, that this simplicity makes it unlikely to recreate the spaghetti code maintenance nightmare that was 1990s stored procedures.
- MongoDB concedes that, for performance, it will ordinarily be a good idea to still do your validation on the client side.
- MongoDB points out that enforcement can be either strict (throw errors) or relaxed (just note invalid documents to a log). The latter option is what makes it possible to install this feature without breaking your running system.
There’s also a closed-source database introspection tool coming, currently codenamed MongoDB Scout. Read more
|Categories: Business intelligence, EAI, EII, ETL, ELT, ETLT, Market share and customer counts, MongoDB, NoSQL, Open source, Structured documents, Text||6 Comments|
A lot of what I hear and talk about boils down to “data is a mess”. Below is a very partial list of examples.
To a first approximation, one would expect operational data to be rather clean. After all, it drives and/or records business transactions. So if something goes awry, the result can be lost money, disappointed customers, or worse, and those are outcomes to be strenuously avoided. Up to a point, that’s indeed true, at least at businesses large enough to be properly automated. (Unlike, for example — — mine.)
Even so, operational data has some canonical problems. First, it could be inaccurate; somebody can just misspell or otherwise botch an entry. Further, there are multiple ways data can be unreachable, typically because it’s:
- Inconsistent, in which case humans might not know how to look it up and database JOINs might fail.
- Unintegrated, in which case one application might not be able to use data that another happily maintains. (This is the classic data silo problem.)
Inconsistency can take multiple forms, including: Read more