Memory-centric data management
Analysis of technologies that manage data entirely or primarily in random-access memory (RAM). Related subjects include:
7-10 years ago, I repeatedly argued the viewpoints:
- Relational DBMS were the right choice in most cases.
- Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
- There were a variety of specialized use cases in which non-relational data models were best.
Since then, however:
- Hadoop has flourished.
- NoSQL has flourished.
- Graph DBMS have matured somewhat.
- Much of the action has shifted to machine-generated data, of which there are many kinds.
So it’s probably best to revisit all that in a somewhat organized way.
- Question: Why do policemen work in pairs?
- Answer: One to read and one to write.
A lot has happened in MongoDB technology over the past year. For starters:
- The big news in MongoDB 3.0* is the WiredTiger storage engine. The top-level claims for that are that one should “typically” expect (individual cases can of course vary greatly):
- 7-10X improvement in write performance.
- No change in read performance (which however was boosted in MongoDB 2.6).
- ~70% reduction in data size due to compression (disk only).
- ~50% reduction in index size due to compression (disk and memory both).
- MongoDB has been adding administration modules.
- A remote/cloud version came out with, if I understand correctly, MongoDB 2.6.
- An on-premise version came out with 3.0.
- They have similar features, but are expected to grow apart from each other over time. They have different names.
*Newly-released MongoDB 3.0 is what was previously going to be MongoDB 2.8. My clients at MongoDB finally decided to give a “bigger” release a new first-digit version number.
To forestall confusion, let me quickly add: Read more
|Categories: Database compression, Hadoop, Humor, In-memory DBMS, MongoDB, NoSQL, Open source, Structured documents, Sybase||9 Comments|
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Graph analytics and data management are still confused.
Most IT innovation these days is focused on machine-generated data (sometimes just called “machine data”), rather than human-generated. So as I find myself in the mood for another survey post, I can’t think of any better idea for a unifying theme.
1. There are many kinds of machine-generated data. Important categories include:
- Web, network and other IT logs.
- Game and mobile app event data.
- CDRs (telecom Call Detail Records).
- “Phone-home” data from large numbers of identical electronic products (for example set-top boxes).
- Sensor network output (for example from a pipeline or other utility network).
- Vehicle telemetry.
- Health care data, in hospitals.
- Digital health data from consumer devices.
- Images from public-safety camera networks.
- Stock tickers (if you regard them as being machine-generated, which I do).
That’s far from a complete list, but if you think about those categories you’ll probably capture most of the issues surrounding other kinds of machine-generated data as well.
2. Technology for better information and analysis is also technology for privacy intrusion. Public awareness of privacy issues is focused in a few areas, mainly: Read more
The genesis of this post is that:
- Hortonworks is trying to revitalize the Apache Storm project, after Storm lost momentum; indeed, Hortonworks is referring to Storm as a component of Hadoop.
- Cloudera is talking up what I would call its human real-time strategy, which includes but is not limited to Flume, Kafka, and Spark Streaming. Cloudera also sees a few use cases for Storm.
- This all fits with my view that the Current Hot Subject is human real-time data freshness — for analytics, of course, since we’ve always had low latencies in short-request processing.
- This also all fits with the importance I place on log analysis.
- Cloudera reached out to talk to me about all this.
Of course, we should hardly assume that what the Hadoop distro vendors favor will be the be-all and end-all of streaming. But they are likely to at least be influential players in the area.
In the parts of the problem that Cloudera emphasizes, the main tasks that need to be addressed are: Read more
|Categories: Cloudera, Complex event processing (CEP), Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Health care, Hortonworks, Kafka and Confluent, Log analysis, Specific users, Splunk, Web analytics||9 Comments|
I’ve talked with many companies recently that believe they are:
- Focused on building a great data management and analytic stack for log management …
- … unlike all the other companies that might be saying the same thing …
- … and certainly unlike expensive, poorly-scalable Splunk …
- … and also unlike less-focused vendors of analytic RDBMS (which are also expensive) and/or Hadoop distributions.
At best, I think such competitive claims are overwrought. Still, it’s a genuinely important subject and opportunity, so let’s consider what a great log management and analysis system might look like.
Much of this discussion could apply to machine-generated data in general. But right now I think more players are doing product management with an explicit conception either of log management or event-series analytics, so for this post I’ll share that focus too.
A short answer might be “Splunk, but with more analytic functionality and more scalable performance, at lower cost, plus numerous coupons for free pizza.” A more constructive and bottoms-up approach might start with: Read more
As part of my series on the keys to and likelihood of success, I outlined some examples from the DBMS industry. The list turned out too long for a single post, so I split it up by millennia. The part on 20th Century DBMS success and failure went up Friday; in this one I’ll cover more recent events, organized in line with the original overview post. Categories addressed will include analytic RDBMS (including data warehouse appliances), NoSQL/non-SQL short-request DBMS, MySQL, PostgreSQL, NewSQL and Hadoop.
DBMS rarely have trouble with the criterion “Is there an identifiable buying process?” If an enterprise is doing application development projects, a DBMS is generally chosen for each one. And so the organization will generally have a process in place for buying DBMS, or accepting them for free. Central IT, departments, and — at least in the case of free open source stuff — developers all commonly have the capacity for DBMS acquisition.
In particular, at many enterprises either departments have the ability to buy their own analytic technology, or else IT will willingly buy and administer things for a single department. This dynamic fueled much of the early rise of analytic RDBMS.
Buyer inertia is a greater concern.
- A significant minority of enterprises are highly committed to their enterprise DBMS standards.
- Another significant minority aren’t quite as committed, but set pretty high bars for new DBMS products to cross nonetheless.
- FUD (Fear, Uncertainty and Doubt) about new DBMS is often justifiable, about stability and consistent performance alike.
A particularly complex version of this dynamic has played out in the market for analytic RDBMS/appliances.
- First the newer products (from Netezza onwards) were sold to organizations who knew they wanted great performance or price/performance.
- Then it became more about selling “business value” to organizations who needed more convincing about the benefits of great price/performance.
- Then the behemoth vendors became more competitive, as Teradata introduced lower-price models, Oracle introduced Exadata, Sybase got more aggressive with Sybase IQ, IBM bought Netezza, EMC bought Greenplum, HP bought Vertica and so on. It is now hard for a non-behemoth analytic RDBMS vendor to make headway at large enterprise accounts.
- Meanwhile, Hadoop has emerged as serious competitor for at least some analytic data management, especially but not only at internet companies.
Otherwise I’d say: Read more
I’m commonly asked to assess vendor claims of the kind:
- “Our system lets you do multiple kinds of processing against one database.”
- “Otherwise you’d need two or more data managers to get the job done, which would be a catastrophe of unthinkable proportion.”
So I thought it might be useful to quickly review some of the many ways organizations put multiple data stores to work. As usual, my bottom line is:
- The most extreme vendor marketing claims are false.
- There are many different choices that make sense in at least some use cases each.
Horses for courses
It’s now widely accepted that different data managers are better for different use cases, based on distinctions such as:
- Short-request vs. analytic.
- SQL vs. non-SQL (NoSQL or otherwise).
- Expensive/heavy-duty vs. cheap/easy-to-support.
Vendors are part of this consensus; already in 2005 I observed
For all practical purposes, there are no DBMS vendors left advocating single-server strategies.
Vendor agreement has become even stronger in the interim, as evidenced by Oracle/MySQL, IBM/Netezza, Oracle’s NoSQL dabblings, and various companies’ Hadoop offerings.
Multiple data stores for a single application
We commonly think of one data manager managing one or more databases, each in support of one or more applications. But the other way around works too; it’s normal for a single application to invoke multiple data stores. Indeed, all but the strictest relational bigots would likely agree: Read more
Writing data management or analysis software is hard. This post and its sequel are about some of the reasons why.
When systems work as intended, writing and reading data is easy. Much of what’s hard about data management is dealing with the possibility — really the inevitability — of failure. So it might be interesting to survey some of the many ways that considerations of failure come into play. Some have been major parts of IT for decades; others, if not new, are at least newly popular in this cluster-oriented, RAM-crazy era. In this post I’ll focus on topics that apply to single-node systems; in the sequel I’ll emphasize topics that are clustering-specific.
Major areas of failure-aware design — and these overlap greatly — include:
- Backup and restore. In its simplest form, this is very basic stuff. That said — any decent database management system should let backups be made without blocking ongoing database operation, with the least performance impact possible.
- Logging, rollback and replay. Logs are essential to DBMS. And since they’re both ubiquitous and high-performance, logs are being used in ever more ways.
- Locking, latching, transactions and consistency. Database consistency used to be enforced in stern and pessimistic ways. That’s changing, big-time, in large part because of the requirements of …
- … distributed database operations. Increasingly, modern distributed database systems are taking the approach of getting work done first, then cleaning up messes when they occur.
- Redundancy and replication. Parallel computing creates both a need and an opportunity to maintain multiple replicas of data at once, in very different ways than the redundancy and replication of the past.
- Fault-tolerant execution. When one node is inoperative, inaccessible, overloaded or just slow, you may not want a whole long multi-node job to start over. A variety of techniques address this need.
In a single-server, disk-based configuration, techniques for database fault-tolerance start: Read more
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more