March 15, 2015

BI for NoSQL — some very early comments

Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. :) Here’s the small amount that I have actually figured out.

As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.

Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:

That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
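To make the "write scripts" option concrete, here is a minimal sketch of such a script. It assumes a hypothetical collection of order documents with nested and sometimes missing fields; every field name and value is invented for illustration, and the data is an in-memory list rather than a real NoSQL store.

```python
from collections import defaultdict

# Hypothetical order documents, as they might come out of a document store.
# All field names and values are invented for illustration.
orders = [
    {"id": 1, "customer": {"name": "Ann", "region": "EU"},
     "items": [{"sku": "A", "qty": 2, "price": 5.0},
               {"sku": "B", "qty": 1, "price": 10.0}]},
    {"id": 2, "customer": {"name": "Bob"},  # no region recorded
     "items": [{"sku": "A", "qty": 1, "price": 5.0}]},
    {"id": 3, "customer": {"name": "Cy", "region": "EU"},
     "items": []},  # an order with no line items
]


def flatten(docs):
    """Yield one flat row per line item, filling missing fields with None."""
    for doc in docs:
        customer = doc.get("customer", {})
        for item in doc.get("items", []):
            yield {
                "order_id": doc["id"],
                "region": customer.get("region"),
                "sku": item["sku"],
                "revenue": item["qty"] * item["price"],
            }


def revenue_by_region(rows):
    """Aggregate the flat rows: total revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["revenue"]
    return dict(totals)


rows = list(flatten(orders))
totals = revenue_by_region(rows)
print(totals)
```

Even this toy script makes decisions a point-and-click tool would have to expose somehow: the order with no line items disappears from the flat table, and the missing region becomes its own group.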

Another set of issues arises on the performance side. Many NoSQL systems have indexes, and thus some kind of filtering capability. Some (e.g. MongoDB) have aggregation frameworks as well. So if you’re getting at the data with some combination of a BI tool, ETL tool or ODBC/JDBC drivers, are you leveraging the capabilities in place? Or are you doing the simplest and slowest thing, which is to suck data out en masse and operate on it somewhere else? Getting good answers to those questions is a work-in-progress at best.
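The two alternatives can be sketched side by side. The first is the shape of a MongoDB aggregation pipeline, using the `$match` and `$group` stages, written here as a plain Python literal and not executed; with a driver such as pymongo it would be handed to `collection.aggregate(pipeline)`. The second is the pull-everything-out approach, written in plain Python over an in-memory list. The event documents and field names are hypothetical.

```python
# Hypothetical event documents; field names are invented for illustration.
events = [
    {"type": "click", "page": "home", "ms": 120},
    {"type": "click", "page": "home", "ms": 80},
    {"type": "view", "page": "home", "ms": 300},
    {"type": "click", "page": "cart", "ms": 50},
]

# Option 1: push the work down to the store. This literal is only the
# pipeline definition; nothing here talks to a database. The store would
# filter with $match, aggregate with $group, and return grouped rows only.
pipeline = [
    {"$match": {"type": "click"}},
    {"$group": {"_id": "$page",
                "clicks": {"$sum": 1},
                "total_ms": {"$sum": "$ms"}}},
]


# Option 2: pull every document out and aggregate on the client.
def aggregate_client_side(docs):
    out = {}
    for doc in docs:  # every document is transferred and scanned
        if doc["type"] != "click":
            continue
        row = out.setdefault(doc["page"], {"clicks": 0, "total_ms": 0})
        row["clicks"] += 1
        row["total_ms"] += doc["ms"]
    return out


result = aggregate_client_side(events)
print(result)
```

Both compute the same per-page totals. The difference is where the scan happens and how much data moves, which is exactly what a BI or ETL tool may or may not get right on your behalf.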

Having established that NoSQL data structures cause problems for BI, let’s turn that around. Is there any way that they actually help? I want to say “NoSQL data often comes in hierarchies, and hierarchies are good for roll-up/drill-down.” But the hierarchies that describe NoSQL data aren’t necessarily the same kinds of hierarchies that are useful for BI aggregation, and I’m indeed skeptical as to how often those two categories overlap.

Hierarchies aside, I do think there are use cases for fundamentally non-tabular BI. For example, consider the following scenario, typically implemented with the help of NoSQL today:

Visualizing that properly would be very hard in a traditional tabularly-oriented BI tool. So it could end up with NoSQL-oriented BI tools running over NoSQL data stores. Event series BI done right also seems to be quite non-tabular. That said, I don’t know for sure about the actual data structures used under the best event series BI today.

And at that inconclusive point, I’ll stop for now. If you have something to add, please share it in the comments below, or hit me up as per my Contact link above.

Comments

6 Responses to “BI for NoSQL — some very early comments”

  1. Richard Tibbetts on March 15th, 2015 8:08 pm

    I think an important distinction is which kinds of NoSQL are you trying to BI over, and why? About half the NoSQL I see (Mongo, Cassandra, Redis, etc) are in service of operational workloads where the database has been carefully tailored to the key user facing functionality of the app or site. This may result in key/value or hierarchy organization, but as you note it may well not be the organization a business decision maker cares about.

    On the other hand, Hadoop deployments, “data lakes” and big data in general, are usually there to archive data for future analysis, and that analysis is often in support of making business decisions. ETL/ELTing the data until you can get it into a traditional warehouse-looking schema, whether you store the result in Vertica or in some SQL-on-Hadoop (Hive/Impala/Parquet/SparkSQL/etc), then looks like you are supporting traditional BI workloads with a cheaper infrastructure. Nothing to sneeze at, but also not revolutionary.

    The empowering part of a data lake is being able to drop anything you want into it, often pulling from the aforementioned specialized application stores, and put it in the lake in a relatively raw form, without worrying about who is going to use it for what. Then consumers, often data scientists, can do whatever they want with it. This pattern improves agility, and avoids expensive enterprise data warehouse architecture committees and building ETL workflows on spec before you know if people are really going to use the data. However, it replaces that with expensive data scientist time used to clean data, and the potential for instability when source data changes, structurally or semantically. It works pretty well when developer, data scientist, and product manager all sit in the same open plan loft office. But there are impediments in mainstream business.

    The real question I have for “NoSQL BI” is how are you going to make better business decisions by doing a kind of query that is possible on the new shared data lake infrastructure, that wouldn’t have been possible in a traditional SQL warehouse/BI architecture. This might mean streaming, it might mean graph, it might mean unstructured RDF data, or text analytics, or machine learning.

    I think we need tools which guide business users to asking the right questions, questions that match their infrastructure and their business. My first question for any big data BI vendor is what new questions can you answer?

  2. Curt Monash on March 16th, 2015 11:58 pm

    Richard,

    Not to be difficult here, but I don’t see a non-tabular “data lake” as “NoSQL”. I was focused on what you seem to be calling the first half of the use cases.

    If the data is swimming in a data lake, with no particular issues of latency, then some pre-BI massaging would in most use cases seem unobjectionable. And issues of tying into the performance advantages and disadvantages of the short-request-oriented NoSQL store, if any, would seem completely obviated.

  3. David Gruzman on March 17th, 2015 5:56 am

    I think that part of NoSQL analytics should take the form of a “relational view”.
    Let’s consider yet another social network implemented over MongoDB. We will have huge documents for each user, with all the friends, comments, posts, etc. At the same time, during analysis I would prefer “User”, “Comment”, and “Post” relations, and to be able to ask questions like “What is the average number of comments per post by hour of the day?” That is a perfectly relational question.
    This relational view can be “materialized” in another DBMS, or it can be supported by the NoSQL system itself.
    If the “real time” aspect is important, then the in-place option is much more attractive.
    If we want to analyze these relations together with external relations, then putting them in the same data lake sounds better.
    I also think that there is a third case, deep analytics, when we want to apply or train complex models. In this case, I believe we will need whole objects as they are stored in the NoSQL system, maybe enriched with some “external” information.
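
    As an illustration of the relational-view idea, here is a minimal sketch that projects nested per-user documents onto a flat Post relation and answers the comments-per-post-by-hour question. The document layout, the field names, and the reading of "hour of the day" as the hour the post was created are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-user documents with posts and comments nested inside.
# The layout and field names are invented for illustration.
users = [
    {"user": "ann", "posts": [
        {"post_id": "p1", "created": "2015-03-15T09:10:00",
         "comments": [{"by": "bob", "text": "hi"}, {"by": "cy", "text": "yo"}]},
        {"post_id": "p2", "created": "2015-03-15T09:45:00", "comments": []},
    ]},
    {"user": "bob", "posts": [
        {"post_id": "p3", "created": "2015-03-16T21:05:00",
         "comments": [{"by": "ann", "text": "nice"}]},
    ]},
]


def post_relation(docs):
    """Project nested documents onto Post(post_id, user, hour, n_comments)."""
    for doc in docs:
        for post in doc.get("posts", []):
            hour = datetime.fromisoformat(post["created"]).hour
            yield (post["post_id"], doc["user"], hour,
                   len(post.get("comments", [])))


def avg_comments_by_hour(posts):
    """Average number of comments per post, grouped by creation hour."""
    sums = defaultdict(lambda: [0, 0])  # hour -> [comment total, post count]
    for _post_id, _user, hour, n_comments in posts:
        sums[hour][0] += n_comments
        sums[hour][1] += 1
    return {hour: total / count for hour, (total, count) in sums.items()}


averages = avg_comments_by_hour(post_relation(users))
print(averages)
```

    Here the view is computed on the fly from the documents; materializing it in another DBMS would amount to storing the output of `post_relation` as a table.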

  4. Jeff Carr on March 17th, 2015 5:25 pm

    The entire notion of BI on NoSQL is complex, and what makes it worse is the overlap between operational NoSQL and Hadoop, data lakes, etc., which are metaphors for DW. With that said, I’d offer the following. NoSQL data is fundamentally different from relational data in two ways: it is almost always highly non-uniform in nature, and it has more dimensions. These two properties make it incompatible with any relational BI tools. Those tools are built on relational algebra, which expects flat (single-dimension) uniform data. Any meaningful discussion of NoSQL BI starts as a mathematical discussion. We have built a mathematical formalism and semantics that allows for BI (aggregate, slice, dice, rollups, etc.) natively across non-uniform data. The mathematical foundation (MRA) allows dimensional operators for lifting set-level operations to arbitrary dimensions. I’m not trying to promote my project, just trying to point out that modern data structures are changing, and trying to force the relational algebra model on them via manipulation is probably not the right approach. There is much more work to be done, but the starting place is recognizing what’s changed and how to innovate to meet the need.

  5. Which analytic technology problems are important to solve for whom? | DBMS 2 : DataBase Management System Services on April 12th, 2015 11:50 pm

    […] BI for inherently non-tabular data is definitely an unsolved problem. […]

  6. Are analytic RDBMS and data warehouse appliances obsolete? | DBMS 2 : DataBase Management System Services on August 28th, 2016 9:29 pm

    […] suppose you just want to do business intelligence, which is still almost always done over relational data structures? Analytic RDBMS offer the […]
