Discussion of choices and variety in database management system architecture.
I’d like to survey a few related ideas:
- Enterprises should each have a variety of different analytic data stores.
- Vendors — especially but not only IBM and Teradata — are acknowledging and marketing around the point that enterprises should each have a number of different analytic data stores.
- In addition to having multiple analytic data management technology stacks, it is also desirable to have an agile way to spin out multiple virtual or physical relational data marts using a single RDBMS. Vendors are addressing that need.
- Some observers think that the real essence of analytic data management will be in data integration, not the actual data management.
Here goes. Read more
Categories: Data warehousing, Database diversity, EAI, EII, ETL, ELT, ETLT, Exadata, Greenplum, Hadoop, Hortonworks, IBM and DB2, Informatica, Netezza, Oracle, Sybase, Teradata, Workload management | 13 Comments
As a follow-up to the latest Stonebraker kerfuffle, Derrick Harris asked me a bunch of smart follow-up questions. My responses and afterthoughts include:
- Facebook et al. are in effect Software as a Service (SaaS) vendors, not enterprise technology users. In particular:
  - They have the technical chops to rewrite their code as needed.
  - Unlike packaged software vendors, they’re not answerable to anybody for keeping legacy code alive after a rewrite. That makes migration a lot easier.
  - If they want to write different parts of their system on different technical underpinnings, nobody can stop them. For example …
    - … Facebook created Cassandra, and is now heavily committed to HBase.
- It makes little sense to talk of Facebook’s use of “MySQL.” Better to talk of Facebook’s use of “MySQL + memcached + non-transparent sharding.” That said:
  - It’s hard to see why somebody today would use MySQL + memcached + non-transparent sharding for a new project. At least one of Couchbase or transparently-sharded MySQL is very likely a superior alternative. Other alternatives might be better yet.
  - As noted above in the example of Facebook, the many major web businesses that are using MySQL + memcached + non-transparent sharding for existing projects can be presumed able to migrate away from that stack as the need arises.
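To make “MySQL + memcached + non-transparent sharding” concrete, here is a minimal sketch of what that stack asks of application code. It is an illustration under stated assumptions, not a description of Facebook’s actual code: plain Python dicts stand in for the MySQL instances and for memcached, and the class and method names (`ShardedUserStore`, `_shard_for`, `put`, `get`) are hypothetical.

```python
import hashlib


class ShardedUserStore:
    """Application-level ("non-transparent") sharding plus a cache-aside layer.

    Assumption: dicts stand in for the MySQL shards and for memcached, so the
    sketch runs without any servers. All names here are hypothetical.
    """

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]  # stand-ins for MySQL instances
        self.cache = {}                                  # stand-in for memcached
        self.db_reads = 0                                # counts reads that miss the cache

    def _shard_for(self, user_id):
        # "Non-transparent": the application, not the DBMS, decides which
        # shard holds a given key.
        digest = hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, user_id, profile):
        # Write to the owning shard, then invalidate any cached copy.
        self._shard_for(user_id)[user_id] = profile
        self.cache.pop(user_id, None)

    def get(self, user_id):
        # Cache-aside read: try the cache first, fall back to the shard,
        # and populate the cache on a hit in the shard.
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        profile = self._shard_for(user_id).get(user_id)
        if profile is not None:
            self.cache[user_id] = profile
        return profile


store = ShardedUserStore(shard_count=4)
store.put(42, {"name": "Ada"})
first = store.get(42)   # misses the cache, reads the shard
second = store.get(42)  # served from the cache
```

The routing and cache-invalidation logic both live in application code here, which is the burden a transparently sharded MySQL variant, or a store that speaks the memcached API natively, would take over.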
Continuing with that discussion of DBMS alternatives:
- If you just want to write to the memcached API anyway, why not go with Couchbase?
- If you want to go relational, why not go with MySQL? There are many alternatives for scaling or accelerating MySQL — dbShards, Schooner, Akiban, Tokutek, ScaleBase, ScaleDB, Clustrix, and Xeround come to mind quickly, so there’s a great chance that one or more will fit your use case. (And if you don’t get the choice of MySQL flavor right the first time, porting to another one shouldn’t be all THAT awful.)
- If you really, really want to go in-memory, and don’t mind writing Java stored procedures, and don’t need to do the kinds of joins it isn’t good at, but do need to do the kinds of joins it is, VoltDB could indeed be a good alternative.
And while we’re at it — going schema-free often makes a whole lot of sense. I need to write much more about the point, but for now let’s just say that I look favorably on the Big Four schema-free/NoSQL options of MongoDB, Couchbase, HBase, and Cassandra.
In Part 1 of this two-part series, I outlined four variants on the traditional enterprise data warehouse/data mart dichotomy, and suggested what kinds of DBMS products you might use for each. In Part 2 I’ll cover four more kinds of analytic database — even newer, for the most part, with a use case/product short list match that is even less clear. Read more
Analytic data management technology has blossomed, leading to many questions along the lines of “So which products should I use for which category of problem?” The old EDW/data mart dichotomy is hopelessly outdated for that purpose, and adding a third category for “big data” is little help.
Let’s try eight categories instead. While no categorization is ever perfect, these each have at least some degree of technical homogeneity. Figuring out which types of analytic database you have or need — and in most cases you’ll need several — is a great early step in your analytic technology planning. Read more
There are plenty of viable alternatives to relational database management systems. For short-request processing, both document stores and fully object-oriented DBMS can make sense. Text search engines have an important role to play. E. F. “Ted” Codd himself once suggested that relational DBMS weren’t best for analytics.* Analysis of machine-generated log data doesn’t always have a naturally relational aspect. And I could go on with more examples yet.
*Actually, he didn’t admit that what he was advocating was a different kind of DBMS, namely a MOLAP one — but he was. And he was wrong anyway about the necessity for MOLAP. But let’s overlook those details.
Nonetheless, relational DBMS dominate the market. As I see it, the reasons for relational dominance cluster into four areas (which of course overlap):
- Data re-use. Ted Codd’s famed original paper referred to shared data banks for a reason.
- The benefits of normalization, which include:
  - You only have to do the programming work of writing something once …
  - … and you don’t have to do the programming work of keeping multiple versions of the information consistent.
  - You only have to do the processing work of writing something once.
  - You only have to buy storage to hold each fact once.
- Separation of concerns.
  - Different people can worry about programming and “database stuff.”
  - Indeed, even performance optimization can sometimes be separated from programming (i.e., when all you have to do to get speed is implement the correct indexes).
- Maturity and momentum, as reflected in the availability of:
  - A broad variety of mature relational DBMS.
  - Vast amounts of packaged software that “talks” SQL.
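Two of the points above, normalization and the separation of performance tuning from programming, can be illustrated with a small sketch. It uses Python’s standard `sqlite3` module with an in-memory database; the `customers` and `orders` tables, their columns, and the index name are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        city TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme', 'Boston')")
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (1, 40.0)],
)

# Normalization: the customer's city is stored once, so one UPDATE is the
# only write needed, and there are no copies to keep consistent.
conn.execute("UPDATE customers SET city = 'Chicago' WHERE id = 1")
rows = conn.execute(
    "SELECT o.id, c.city FROM orders AS o "
    "JOIN customers AS c ON c.id = o.customer_id"
).fetchall()
cities = {city for _, city in rows}  # every order sees the same, updated city

# Separation of concerns: the application's query text never changes;
# only the physical design (an index) does.
LOOKUP = "SELECT amount FROM orders WHERE customer_id = ?"


def query_plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[3] for row in plan_rows)


plan_before = query_plan(LOOKUP, (1,))
conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
plan_after = query_plan(LOOKUP, (1,))
amounts = [amount for (amount,) in conn.execute(LOOKUP, (1,))]
```

Comparing `plan_before` with `plan_after` shows the planner switching to the new index while `LOOKUP` stays untouched, which is the sense in which someone other than the programmer can do the tuning.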
Generally speaking, I find the reasons for sticking with relational technology compelling in cases such as: Read more
Categories: Analytic technologies, Data models and architecture, Database diversity, MOLAP, NoSQL, Object, Theory and architecture | 20 Comments
My NoSQL article is finally posted; I hope it lives up to all the foreshadowing. It is being run online at Intelligent Enterprise/Information Week, as per the link above, where Doug Henschen edited it with an admirably light touch.
Below please find three excerpts* that convey the essence of my thinking on NoSQL. For much more detail, please see the article itself.
*Notwithstanding my admiration for Doug’s editing, the excerpts are taken from my final pre-editing submission, not from the published article itself.
My quasi-definition of “NoSQL” wound up being: Read more
After I criticized the marketing of the Aster/Cloudera partnership, my clients at Aster Data and Cloudera ganged up on me and tried to persuade me I was wrong. Be that as it may, that conversation and others were helpful to me in understanding the core thesis: Read more
Categories: Analytic technologies, Aster Data, Cloudera, Data warehousing, Database diversity, Hadoop, MapReduce, Parallelization, Petabyte-scale data management | 11 Comments
An enterprise data warehouse should:
- Manage data to high standards of accuracy, consistency, cleanliness, clarity, and security.
- Manage all the data in your organization.
Pick ONE. Read more
Categories: Data models and architecture, Data warehousing, Database diversity, Teradata, Theory and architecture | 8 Comments
Let’s start from some reasonable premises. Read more
Categories: Data models and architecture, Database diversity, Hadoop, MapReduce, MarkLogic, NoSQL, OLTP, Theory and architecture | 37 Comments
People often try to draw a distinction between:
- Traditional data of the sort that’s stored in relational databases, aka “structured.”
- Everything else, aka “unstructured” or “semi-structured” or “complex.”
There are plenty of problems with these formulations, not the least of which is that the supposedly “unstructured” data is the kind that actually tends to have interesting internal structures. But of the many reasons why these distinctions don’t tend to work very well, I think the most important one is that:
Databases shouldn’t be divided into just two categories. Even as a rough-cut approximation, they should be divided into three, namely:
- Human/Tabular data, i.e., human-generated data that fits well into relational tables or arrays
- Human/Nontabular data, i.e., all other data generated by humans
- Machine-Generated data
Even that trichotomy is grossly oversimplified, for reasons such as:
- These categories overlap.
- There are kinds of data that get into fuzzy border zones.
- Not all data in each category has all the same properties.
But at least as a starting point, I think this basic categorization has some value. Read more
Categories: Database diversity, Investment research and trading, Log analysis, Telecommunications, Web analytics | 19 Comments