When you are considering technology selection or strategy, a great many factors can each have a bearing on the final decision. Below is a very partial list.
In almost any IT decision, there are a number of environmental constraints that need to be acknowledged. Organizations may have standard vendors, favored vendors, or simply vendors who give them particularly deep discounts. Legacy systems are in place, application and system alike, and may or may not be open to replacement. Enterprises may have on-premise or off-premise preferences; SaaS (Software as a Service) vendors probably have multitenancy concerns. Your organization can determine which aspects of your system you’d ideally like to see be tightly integrated with each other, and which you’d prefer to keep only loosely coupled. You may have biases for or against open-source software. You may be pro- or anti-appliance. Some applications have a substantial need for elastic scaling. And some kinds of issues cut across multiple areas, such as budget, timeframe, security, or trained personnel.
Multitenancy is particularly interesting, because it has numerous implications. If you’re a SaaS vendor supporting multiple customers, you must keep each customer’s data inaccessible to other customers*, even if you offer high levels of flexibility or customization. You probably also want to keep data logically partitioned by customer, in a way that the DBMS recognizes; you may also want each such partition to be cached and evicted as a unit, especially if no one customer occupies a large part of your database. Administratively, you need a way to measure customer-specific metrics of the sort that might go into SLAs (Service-Level Agreements).
*Of course, there are exceptions. One of my clients is a SaaS vendor facilitating commerce; the whole point of their app is to let two different customers see and update the same records.
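To make the partitioning and metering points a bit more concrete, here is a minimal sketch in Python, using the standard-library `sqlite3` module purely for illustration. The `orders` table, the `tenant_id` column, and all helper names are hypothetical, and a shared table with a tenant column is only one of several possible multitenant designs.

```python
import sqlite3

def make_db():
    """Build a tiny in-memory database shared by two hypothetical tenants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (tenant_id TEXT NOT NULL, order_id INTEGER, amount REAL)"
    )
    # An index that leads with tenant_id keeps per-tenant lookups cheap.
    conn.execute("CREATE INDEX idx_orders_tenant ON orders (tenant_id, order_id)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [("acme", 1, 10.0), ("acme", 2, 25.0), ("globex", 1, 7.5)],
    )
    return conn

def orders_for(conn, tenant_id):
    """Every read is filtered by tenant; the tenant ID is a bound parameter."""
    rows = conn.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    )
    return rows.fetchall()

def usage_by_tenant(conn):
    """Per-tenant row counts and totals, the raw material for SLA-style metrics."""
    rows = conn.execute(
        "SELECT tenant_id, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY tenant_id ORDER BY tenant_id"
    )
    return {tenant: {"rows": n, "total": total} for tenant, n, total in rows}
```

Calling `orders_for(make_db(), "acme")` returns only that tenant's rows. In a production DBMS you would more likely lean on features the engine itself enforces, such as table partitioning or row-level security, rather than on application-side filtering alone.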
Getting more specific now, I’m usually called upon to advise users in two categories — those that already know they want to upgrade analytic functionality, and those that quickly realize they do once I remind them of it. Even so, many organizations struggle with the question “What do you want to do analytically?” It’s tough to blame them, for the question is distressingly circular; a big part of analytics is figuring out which kinds of analytics are worth doing. Also, SaaS vendors often struggle with the same question for a different reason, responding “Well, we know we’ve only been giving them basic stuff to date. What else do you think they would like?”
There’s no perfect solution to those difficulties, but a good way to start the evaluation is by assessing:
- The nature and value of your decisions that analytics could reasonably affect.
- Your realistic scope for automation of analytic decisions.
- The number and training of your “full-time analysts” — statisticians, SQL jocks who can program, SQL jocks who can’t really program, full-time users of BI tools, whatever.
- The number and training of your “part-time analysts” — normal business users who can get something out of a dashboard, and perhaps even drill down into it.
That should at least tell you which broad categories of analytics you want to engage in, and roughly how advanced in those areas you should try to be.
Basic business intelligence/dashboarding? Surely. Visualization-centric BI? If nothing else, it demos well. Basic predictive modeling? Hmm, are you sure nobody will want that? Advanced predictive modeling? Um, are you sure your users can handle that, or that the results will be worth the investment?
When I talk with users, there’s usually a data management problem in the mix too. In such cases, I quickly ask about data-related metrics, starting with database size, ingest volumes (batch, if relevant, but especially continuous), and simultaneous query load/concurrent user count. Similarly important are requirements for various kinds of latency, the big two being query response time and how long it takes for data to first be available for query. Less numeric questions in a similar vein boil down to “What kinds of requests will you make against the database, in what volume?”
And this loops back to the analytic-user inventory. Suppose you had a near-real-time dashboard — would anybody actually look at it minute to minute?
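If you already log timings, the two big latency figures mentioned above can be summarized with a few lines of Python. This is only a sketch under assumptions: the function names are hypothetical, the inputs are assumed to be measurements you have collected yourself, and the nearest-rank 95th percentile is just one reasonable choice of summary statistic.

```python
import math
from statistics import median

def summarize(samples):
    """Median, nearest-rank 95th percentile, and max of a non-empty list."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method, 1-based
    return {"median": median(ordered), "p95": ordered[rank - 1], "max": ordered[-1]}

def freshness(arrivals):
    """Delay between a record arriving and first being queryable.

    `arrivals` is a list of (arrived_at, queryable_at) pairs in the same unit.
    """
    return [queryable - arrived for arrived, queryable in arrivals]

# Hypothetical measurements: query response times in milliseconds,
# and (arrived, queryable) timestamps in seconds.
query_ms = [200, 300, 250, 4000, 220]
load_times = [(0, 30), (10, 45), (20, 140)]

query_summary = summarize(query_ms)
freshness_summary = summarize(freshness(load_times))
```

Looking at the tail (p95 and max) as well as the median matters here, since one slow query or one delayed load can be what users actually remember.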
Specialized metrics I request when considering analytic DBMS include “How many columns are there in your widest table?” and “How many joins (or lines of SQL) are there in your most complex query?”, both of which are tools for assessing “Is your use case naturally columnar?” Another, more general “natural structure of data” consideration is what form the data takes before it reaches the database being discussed; candidates include relational batch, XML stream, log file, and many more.
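The “widest table” question is usually easy to answer from the database catalog. Below is a small sketch for SQLite, using Python's standard `sqlite3` module and the `PRAGMA table_info` command; the helper names are my own, and other DBMS expose the same information differently (commonly through `information_schema.columns`).

```python
import sqlite3

def column_counts(conn):
    """Map each user table name to its number of columns."""
    names = [
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    ]
    counts = {}
    for name in names:
        if name.startswith("sqlite_"):
            continue  # skip SQLite's internal tables
        # PRAGMA table_info returns one row per column of the named table.
        info = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        counts[name] = len(info)
    return counts

def widest_table(conn):
    """Return (table_name, column_count) for the table with the most columns."""
    counts = column_counts(conn)
    return max(counts.items(), key=lambda item: item[1])
```

A very wide table whose queries touch only a handful of columns is one hint that a columnar store may be a natural fit, though the column count alone does not settle the question.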
Also crucial are requirements for consistency, availability, and data integrity. Those tell you what you need in the way of high availability and disaster recovery, and perhaps even how picky you have to be about your brands of hardware, software, or cloud/hosting provider. They also indicate how much you should care about relational or ACID properties, and where you should come down on CAP theorem trade-offs.
I could go on even longer, but those seem like a pretty good set of initial questions with which to start discussions of data management, data integration, and analytic tools and architectures. What do you think I left out? And what do you think I could make substantially clearer by just adding a few more words? Any comments will be much appreciated.