March 15, 2015

BI for NoSQL — some very early comments

Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. 🙂 Here’s the small amount that I have actually figured out.

As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.

- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable (i.e. derived from the table definition) sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores (i.e. JSON or XML), except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
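
To make the <name, value> view concrete, here is a minimal Python sketch. The column names, documents, and the `name_value_pairs` helper are all invented for illustration and are not tied to any particular DBMS.

```python
# Hypothetical example data; all names and values are made up for illustration.

# A relational record: the sequence of names comes from the table definition,
# so every row has the same names in the same order.
TABLE_COLUMNS = ("customer_id", "name", "city")
row = (42, "Ada", "London")
row_pairs = list(zip(TABLE_COLUMNS, row))

# Two structured documents: the names vary from one document to the next,
# and names can nest hierarchically.
doc_a = {"customer_id": 42, "name": "Ada", "address": {"city": "London"}}
doc_b = {"customer_id": 43, "emails": ["g@example.com"], "name": "Grace"}


def name_value_pairs(doc, prefix=""):
    """Flatten a nested document into (dotted_name, value) pairs."""
    pairs = []
    for name, value in doc.items():
        full_name = f"{prefix}{name}"
        if isinstance(value, dict):
            # Nested names keep their hierarchy as a dotted path.
            pairs.extend(name_value_pairs(value, prefix=f"{full_name}."))
        else:
            pairs.append((full_name, value))
    return pairs


pairs_a = name_value_pairs(doc_a)
pairs_b = name_value_pairs(doc_b)
```

Both shapes reduce to lists of pairs, but only the relational one guarantees the same names in the same order for every record.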

Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:

- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
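
A rough Python sketch of those two caveats follows. The order documents and both helper functions are hypothetical, chosen only to show where nulls and repeated values come from and how an extra table removes the repetition.

```python
# Hypothetical documents; field names are invented for illustration.
docs = [
    {"order_id": 1, "customer": "Ada", "items": ["pen", "ink"], "coupon": "X1"},
    {"order_id": 2, "customer": "Grace", "items": ["pad"]},
]


def naive_rows(documents):
    """One wide table: one row per list element, parent fields repeated."""
    columns = ["order_id", "customer", "coupon", "item"]
    rows = []
    for doc in documents:
        for item in doc.get("items") or [None]:
            # A field missing from the document becomes None (a null).
            rows.append((doc.get("order_id"), doc.get("customer"),
                         doc.get("coupon"), item))
    return columns, rows


def split_tables(documents):
    """Less naive: a parent table plus a child table keyed by order_id."""
    orders = [(d.get("order_id"), d.get("customer"), d.get("coupon"))
              for d in documents]
    order_items = [(d.get("order_id"), item)
                   for d in documents for item in d.get("items") or []]
    return orders, order_items


columns, wide = naive_rows(docs)
orders, order_items = split_tables(docs)
```

In the wide table the first order's customer and coupon appear twice and the second order's coupon is null; the split version keeps the null but drops the repetition, at the cost of a join.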

That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.

Another set of issues arises on the performance side. Many NoSQL systems have indexes, and thus some kind of filtering capability. Some, e.g. MongoDB, have aggregation frameworks as well. So if you’re getting at the data with some combination of a BI tool, ETL tool or ODBC/JDBC drivers, are you leveraging the capabilities in place? Or are you doing the simplest and slowest thing, which is to suck data out en masse and operate on it somewhere else? Getting good answers to those questions is a work-in-progress at best.
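
The difference between the two access paths can be sketched with a toy in-memory store. `ToyStore`, its methods, and the sample documents are hypothetical stand-ins invented for this sketch, not any real product's API; the point is only to count how much data leaves the store in each case.

```python
from collections import defaultdict


# A toy stand-in for a NoSQL store. ToyStore and its methods are hypothetical,
# invented for this sketch; they are not any real product's API.
class ToyStore:
    def __init__(self, docs):
        self._docs = list(docs)
        self.docs_transferred = 0  # how many records crossed the "wire"

    def scan(self):
        """Return every document: the en masse extraction path."""
        self.docs_transferred += len(self._docs)
        return list(self._docs)

    def aggregate_sum(self, match_field, match_value, group_field, sum_field):
        """Filter and aggregate inside the store; ship only the group totals."""
        totals = defaultdict(int)
        for doc in self._docs:
            if doc.get(match_field) == match_value:
                totals[doc.get(group_field)] += doc.get(sum_field, 0)
        self.docs_transferred += len(totals)
        return dict(totals)


docs = [
    {"region": "EU", "status": "paid", "amount": 10},
    {"region": "EU", "status": "open", "amount": 5},
    {"region": "US", "status": "paid", "amount": 7},
    {"region": "EU", "status": "paid", "amount": 3},
]

# Path 1: pull everything out and aggregate in the client.
pull_store = ToyStore(docs)
client_totals = defaultdict(int)
for doc in pull_store.scan():
    if doc["status"] == "paid":
        client_totals[doc["region"]] += doc["amount"]

# Path 2: push the filter and the aggregation down to the store.
push_store = ToyStore(docs)
pushed_totals = push_store.aggregate_sum("status", "paid", "region", "amount")
```

Both paths produce the same totals, but the first moves every document while the second moves one record per group. Whether a given BI tool, ETL tool or driver actually takes the second path is exactly the open question.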

Having established that NoSQL data structures cause problems for BI, let’s turn that around. Is there any way that they actually help? I want to say “NoSQL data often comes in hierarchies, and hierarchies are good for roll-up/drill-down.” But the hierarchies that describe NoSQL data aren’t necessarily the same kinds of hierarchies that are useful for BI aggregation, and I’m indeed skeptical as to how often those two categories overlap.

Hierarchies aside, I do think there are use cases for fundamentally non-tabular BI. For example, consider the following scenario, typically implemented with the help of NoSQL today:

- You have more data (presumably machine-generated) than you can afford to keep.
- So you keep time-sliced aggregates.
- You also keep selective details, namely ones that you identified when they streamed in as being interesting in some way.
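
As a minimal sketch of that scenario, assume events arrive as (timestamp, latency) pairs, slices are one minute wide, and "interesting" means latency above a fixed threshold. All three assumptions, and every name below, are invented for illustration.

```python
from collections import defaultdict

SLICE_SECONDS = 60          # assumed slice width
INTERESTING_LATENCY = 500   # assumed threshold (ms) for "interesting"


def summarize(events):
    """Consume (timestamp_seconds, latency_ms) events once, keeping only
    per-slice aggregates plus the details of events flagged as interesting."""
    aggregates = defaultdict(lambda: {"count": 0, "total_latency": 0})
    kept_details = []
    for ts, latency in events:
        slice_start = ts - ts % SLICE_SECONDS
        agg = aggregates[slice_start]
        agg["count"] += 1
        agg["total_latency"] += latency
        if latency >= INTERESTING_LATENCY:
            # Only flagged events survive in full detail.
            kept_details.append({"ts": ts, "latency_ms": latency})
    return dict(aggregates), kept_details


events = [(0, 20), (15, 900), (59, 40), (60, 30), (125, 700)]
aggregates, kept_details = summarize(events)
```

What survives is two differently shaped data sets: a dense series keyed by time slice and a sparse list of raw events. Showing them together is where a single flat table starts to fit awkwardly.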

Visualizing that properly would be very hard in a traditional tabularly-oriented BI tool. So it could end up with NoSQL-oriented BI tools running over NoSQL data stores. Event series BI done right also seems to be quite non-tabular. That said, I don’t know for sure about the actual data structures used under the best event series BI today.

And at that inconclusive point, I’ll stop for now. If you have something to add, please share it in the comments below, or hit me up as per my Contact link above.

Categories: Business intelligence, Cassandra, EAI, EII, ETL, ELT, ETLT, HBase, MongoDB, NoSQL, Structured documents

Comments

6 Responses to “BI for NoSQL — some very early comments”

Richard Tibbetts on March 15th, 2015 8:08 pm

I think an important distinction is which kinds of NoSQL are you trying to BI over, and why? About half the NoSQL I see (Mongo, Cassandra, Redis, etc) are in service of operational workloads where the database has been carefully tailored to the key user facing functionality of the app or site. This may result in key/value or hierarchy organization, but as you note it may well not be the organization a business decision maker cares about.

On the other hand, Hadoop deployments, “data lakes” and big data in general, are usually there to archive data for future analysis, and that analysis is often in support of making business decisions. ETL/ELTing the data until you can get it into a traditional warehouse-looking schema, whether you store the result in Vertica or in some SQL-on-Hadoop (Hive/Impala/Parquet/SparkSQL/etc), then looks like you are supporting traditional BI workloads with a cheaper infrastructure. Nothing to sneeze at, but also not revolutionary.

The empowering part of a data lake is being able to drop anything you want into it, often pulling from the aforementioned specialized application stores, and put it in the lake in a relatively raw form, without worrying about who is going to use it for what. Then consumers, often data scientists, can do whatever they want with it. This pattern improves agility, and avoids expensive enterprise data warehouse architecture committees and building ETL workflows on spec before you know if people are really going to use the data. However, it replaces that with expensive data scientist time used to clean data, and the potential for instability when source data changes, structurally or semantically. It works pretty well when developer, data scientist, and product manager all sit in the same open plan loft office. But there are impediments in mainstream business.

The real question I have for “NoSQL BI” is how are you going to make better business decisions by doing a kind of query that is possible on the new shared data lake infrastructure, that wouldn’t have been possible in a traditional SQL warehouse/BI architecture. This might mean streaming, it might mean graph, it might mean unstructured RDF data, or text analytics, or machine learning.

I think we need tools which guide business users to asking the right questions, questions that match their infrastructure and their business. My first question for any big data BI vendor is what new questions can you answer?

Curt Monash on March 16th, 2015 11:58 pm

Richard,

Not to be difficult here, but I don’t see a non-tabular “data lake” as “NoSQL”. I was focused on what you seem to be calling the first half of the use cases.

If the data is swimming in a data lake, with no particular issues of latency, then some pre-BI massaging would in most use cases seem unobjectionable. And issues of tying into the performance advantages and disadvantages of the short-request-oriented NoSQL store, if any, would seem completely obviated.

David Gruzman on March 17th, 2015 5:56 am

I think that part of NoSQL analytics should take the form of a “relational view”.
Let’s consider yet another social network implemented over MongoDB. We will have huge documents for each user with all the friends, comments, posts, etc. At the same time, during analysis I would prefer “User”, “Comment”, “Post” relations, and to be able to ask questions like “what is the average number of comments per post by hour of the day”. It is a perfectly relational question.
This relational view can be “materialized” in another DBMS, or it can be supported by the NoSQL store itself.
If the “real time” aspect is important, then the in-place option is much more attractive.
If we want to analyze these relations together with external relations, then putting them in the same data lake sounds better.
I also think that there is a third case, deep analytics, when we want to apply or train complex models. In this case, I believe we will need whole objects as they are stored in the NoSQL store, maybe enriched with some “external” information.

Jeff Carr on March 17th, 2015 5:25 pm

The entire notion of BI on NoSQL is complex, and what makes it worse is the overlap between operational NoSQL and Hadoop, data lakes etc., which are metaphors for DW. With that said I’d offer the following. NoSQL data is fundamentally different from relational data in 2 ways. It is almost always highly non-uniform in nature and has more dimensions. These two problems make it incompatible with any relational BI tools. These tools are built on relational algebra, which expects flat (single dimension) uniform data. Any meaningful discussion of NoSQL BI starts as a mathematical discussion. We have built a mathematical formalism and semantics that allows for BI (aggregate, slice, dice, rollups etc.) natively across non-uniform data. The mathematical foundation (MRA) allows dimensional operators for lifting set level operations to arbitrary dimensions. I’m not trying to promote my project, just trying to point out that modern data structures are changing and trying to force the relational algebra model on them via manipulation is probably not the right approach. There is much more work to be done, but the starting place is recognizing what’s changed and how to innovate to meet the need.

Which analytic technology problems are important to solve for whom? | DBMS 2 : DataBase Management System Services on April 12th, 2015 11:50 pm

[…] BI for inherently non-tabular data is definitely an unsolved problem. […]

Are analytic RDBMS and data warehouse appliances obsolete? | DBMS 2 : DataBase Management System Services on August 28th, 2016 9:29 pm

[…] suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the […]
