Storage – DBMS 2 : DataBase Management System Services

More notes on the transition to the cloud

Curt Monash — Thu, 17 Aug 2017 09:11:01 +0000

Last year I posted observations about the transition to the cloud. Here are some further thoughts.

0. In case any doubt remained, the big questions about transitioning to the cloud are “When?” and “How?”. “Whether”, by way of contrast, is pretty much settled.

1. The answer to “When?” is generally “Over many years”. In particular, at most enterprises the cloud transition will span the tenures of multiple CIOs.

Few enterprises will ever execute on simple, consistent, unchanging “cloud strategies”.

2. The SaaS (Software as a Service) vs. on-premises tradeoffs are being reargued, except that proponents now spell SaaS C-L-O-U-D. (Ali Ghodsi of Databricks made a particularly energetic version of that case in a recent meeting.)

3. In most countries (at least in the US and the rest of the West), the cloud vendors deemed to matter are Amazon, followed by Microsoft, followed by Google. And so, when it comes to the public cloud, Microsoft is much, much more enterprise-savvy than its key competitors.

4. Another non-technical competitive factor: Wal-Mart isn’t the only huge company that is hostile to the Amazon cloud because of competition with other Amazon businesses.

5. It was once thought that in many small countries around the world, there would be OpenStack-based “national champion” cloud winners, perhaps as subsidiaries of the leading telecom vendors. This doesn’t seem to be happening.

Even so, some of the larger managed-economy and/or generally authoritarian countries will have one or more “national champion” cloud winners each — surely China, presumably Russia, obviously Iran, and probably some others as well.

6. While OpenStack in general seems to have fizzled, S3 compatibility has momentum.

7. Finally, let’s return to our opening points: The cloud transition will happen, but it will take considerable time. A principal reason for slowness is that, as a general rule, apps aren’t migrated to platforms directly; rather, they get replaced by new apps on new platforms when the time is right for them to be phased out anyway.

However, there’s a codicil to those generalities — in some cases it’s easier to migrate to the new platform than in others. The hardest migration was probably when the rise of RDBMS, the shift from mainframes to UNIX and the switch to client/server all happened at once; just about nothing got ported from the old platforms to the new. Easier migrations included:

The switch from Unix to Linux. They were very similar.
The adoption of virtualization. A major purpose of the technology was to make migration easy.
The initial adoption of DBMS. Then-legacy apps relied on flat file systems, which DBMS often found easy to emulate.

The cloud transition is somewhere in the middle between those extremes. On the “easy” side:

Popular database management technologies and so on are available in the cloud just as they are on-premises.
Major app vendors are doing the hard work of cloud ports themselves.

Nonetheless, the public cloud is in many ways a whole new computing environment — and so for the most part, customer-built apps will prove too difficult to migrate. Hence my belief that overall migration to the cloud will be very incremental.

Generally available Kudu

Curt Monash — Fri, 16 Jun 2017 15:52:45 +0000

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliance (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts:

Kudu went to general availability on January 31. I gather this spawned an uptick in trial activity.
A subsequent release with some basic security features spawned another uptick.
I don’t think Cloudera will mind my saying that there are many hundreds of active Kudu clusters.
But Cloudera believes that, this soon after GA, very few Kudu users are in actual production.

Early Kudu interest is focused on 2-3 kinds of use case. The biggest is the kind of “data warehousing” highlighted above. Cloudera characterizes the others by the kinds of data stored, specifically the overlapping categories of time series — including financial trading — and machine-generated data. A lot of early Kudu use is with Spark, even ahead of (or in conjunction with) Impala. A small amount has no relational front-end at all.

Other notes on Kudu include:

Solid-state storage is recommended, with a few terabytes per node.
You can also use spinning disk. If you do, your write-ahead logs can still go to flash.
Cloudera said Kudu compression ratios can be as low as 2-5X, or as high as 10-20X. With that broad a range, I didn’t drill down into specifics of what they meant.
There seem to be a number of Kudu clusters with 50+ nodes each. By way of contrast, a “typical” Cloudera customer has 100s of nodes overall.
As you might imagine from their newness, Kudu security features — Kerberos-based — are at the database level rather than anything more granular.

And finally, the Cloudera folks woke me up to some issues around streaming data ingest. If you stream data in, there will be retries resulting in duplicate delivery. So your system needs to deal with those one way or another. Kudu’s way is:

Primary keys will be unique. (Note: This is not obvious in a system that isn’t an entire RDBMS in itself.)
You can configure the uniqueness to be guaranteed either through an upsert mechanism or just by simply rejecting duplicates.
Alternatively, you can write code to handle duplication errors, e.g. via Spark.
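
The two duplicate-handling policies above can be illustrated with a minimal Python sketch. This is not Kudu's API: `KeyedTable`, `insert`, `upsert` and `ingest` are hypothetical names for an in-memory stand-in, used only to show why a unique primary key makes redelivered rows harmless.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert reuses an existing primary key."""


class KeyedTable:
    """Toy in-memory table with a unique primary key (hypothetical, not Kudu's API)."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, row):
        # Strict policy: a redelivered row is rejected.
        if key in self.rows:
            raise DuplicateKeyError(key)
        self.rows[key] = row

    def upsert(self, key, row):
        # Idempotent policy: a redelivered row simply overwrites itself.
        self.rows[key] = row


def ingest(table, stream, use_upsert):
    """Apply (key, row) pairs to the table; return how many duplicates were skipped."""
    skipped = 0
    for key, row in stream:
        if use_upsert:
            table.upsert(key, row)
            continue
        try:
            table.insert(key, row)
        except DuplicateKeyError:
            skipped += 1  # application-level handling of the duplication error
    return skipped


# A retry delivers the row with key 2 twice.
stream = [(1, "click"), (2, "view"), (2, "view"), (3, "buy")]
upserted, strict = KeyedTable(), KeyedTable()
ingest(upserted, stream, use_upsert=True)
duplicates = ingest(strict, stream, use_upsert=False)
```

Either way the table ends up with one row per key. The difference is only whether the duplicate is absorbed silently or surfaced as an error for the caller to count or log.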

Interana

Curt Monash — Mon, 17 Apr 2017 10:10:41 +0000

Interana has an interesting story, in technology and business model alike. For starters:

Interana does ad-hoc event series analytics, which they call “interactive behavioral analytics solutions”.
Interana has a full-stack analytic offering, including:
- Its own columnar DBMS …
- … which has a non-SQL DML (Data Manipulation Language) meant to handle event series a lot more fluently than SQL does, but which the user is never expected to learn because …
- … there also are BI-like visual analytics tools that support plenty of drilldown.
Interana sells all this to “product” departments rather than marketing, because marketing doesn’t sufficiently value Interana’s ad-hoc query flexibility.
Interana boasts >40 customers, with annual subscription fees ranging from high 5 figures to low 7 figures.

And to be clear — if we leave aside any questions of marketing-name sizzle, this really is business intelligence. The closest Interana comes to helping with predictive modeling is giving its ad-hoc users inspiration as to where they should focus their modeling attention.

Interana also has an interesting twist in its business model, which I hope can be used successfully by other enterprise software startups as well.

For now, at no extra charge, Interana will operate its software for you as a managed service. (A majority of Interana’s clients run the software on Amazon or Azure, where that kind of offering makes sense.)
However, presumably in connection with greater confidence in its software’s ease of administration, Interana will move this year toward unbundling the service as an extra-charge offering on top of the software itself.

The key to understanding Interana is its DML. Notes on that include:

Interana’s DML is focused on path analytics …
- … but Interana doesn’t like to use that phrase because it sounds too math-y and difficult.
- Interana may be the first company that’s ever told me it’s focused on providing a better nPath.
Primitives in Interana’s language — notwithstanding the company’s claim that it never ever intended to sell to marketing departments — include familiar web analytics concepts such as “session”, “funnel” and so on. (However, these are being renamed to more neutral terms such as “flow” in an upcoming version of the product.)
As typical example questions or analytic subjects, Interana offered:
- “Which are the most common products in shopping carts where time-to-checkout was greater than 30 minutes?”
- Exactly which steps in the onboarding process result in the greatest user frustration?
The Interana folks and I agree that Splunk is the most recent example of a new DML kicking off a significant company.
The most recent example I can think of in which a vendor hung its hat on a new DML that was a “visual programming language” is StreamBase, with EventFlow. That didn’t go all that well.
To use Founder/CTO Bobby Johnson’s summary term, the real goal of the Interana language is to describe a state machine, specifically one that produces (sets of) sequences of events (and the elapsed time between them).
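
To make the "state machine over event sequences" idea concrete, here is a small Python sketch for the time-to-checkout question above. It is not Interana's DML, whose syntax isn't shown here; the event names, the tuple layout and the function `checkout_delays` are all assumptions for illustration.

```python
def checkout_delays(events):
    """Yield (user, elapsed_seconds) for each add_to_cart -> checkout path.

    events: iterable of (user, timestamp_seconds, event_name), sorted by time.
    """
    cart_opened = {}  # per-user state: time of the first add_to_cart
    for user, ts, name in events:
        if name == "add_to_cart":
            # Transition IDLE -> IN_CART; later adds keep the original start time.
            cart_opened.setdefault(user, ts)
        elif name == "checkout" and user in cart_opened:
            # Transition IN_CART -> IDLE and emit the elapsed time.
            yield user, ts - cart_opened.pop(user)


events = [
    ("ann", 0, "add_to_cart"),
    ("bob", 60, "add_to_cart"),
    ("bob", 300, "checkout"),
    ("ann", 2400, "checkout"),
]
# Users whose time-to-checkout exceeded 30 minutes.
slow = [user for user, secs in checkout_delays(events) if secs > 30 * 60]
```

Expressing the same per-user, order-dependent logic in SQL typically takes self-joins or window functions, which is presumably the fluency gap a path-oriented language aims to close.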

Notes on Interana speeds & feeds include:

Interana only promises data freshness up to micro-batch latencies, i.e., a few minutes. (Obviously, this shuts them out of most network monitoring and devops use cases.)
Interana thinks it’s very important for query response time to max out at a low number of seconds. If necessary, the software will return approximate results rather than exact ones so as to meet this standard.
Interana installations and workloads to date have gotten as large as:
- 1-200 nodes.
- Trillions of rows, equating to 100s of TBs of data after compression/ >1 PB uncompressed.
- Billions of rows/events received per day.
- 100s of 1000s of (very sparse) columns.
- 1000s of named users.
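
The trade of exactness for bounded response time can be sketched as follows. This is a generic uniform-sampling estimator, not Interana's actual method (which isn't described here); `approx_count` and its parameters are hypothetical.

```python
import random


def approx_count(rows, predicate, sample_fraction, seed=0):
    """Estimate how many rows satisfy predicate by scanning a uniform random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample_size = max(1, int(len(rows) * sample_fraction))
    sample = rng.sample(rows, sample_size)
    hits = sum(1 for row in sample if predicate(row))
    # Scale the sample count back up to the full population.
    return hits * len(rows) / sample_size


rows = list(range(100_000))
exact = sum(1 for r in rows if r % 10 == 0)
estimate = approx_count(rows, lambda r: r % 10 == 0, sample_fraction=0.05)
```

Scanning 5% of the rows costs roughly 5% of the work, at the price of a sampling error that shrinks as the sample grows.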

Although Interana’s original design point was spinning disk, most customers store their Interana data on flash.

Interana architecture choices include:

They’re serious about micro-batching.
- If the user’s data is naturally micro-batched — e.g. a new S3 bucket every few minutes — Interana works with that.
- Even if the customer’s data is streamed — e.g. via Kafka — Interana insists on micro-batching it.
They’re casual about schemas.
- Interana assumes data arrives with some kind of recognizable structure, via JSON, CSV or whatever.
  - Interana observes, correctly, that log data often is decently structured.
    - For example, if you’re receiving “phone home” pings from products you originally manufactured, you know what data structures to expect.
    - Interana calls this “logging with intent”.
  - Interana is fine with a certain amount of JSON (for example) schema change over time.
  - If your arriving data truly is a mess, then you need to calm it down via a pass through Splunk or whatever before sending it to Interana.
- JSON hierarchies turn into multi-part column names in the usual way.
- Interana supports one level of true nesting, and one level only; column values can be “lists”, but list values can’t be lists themselves.
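
As a rough illustration of the last two points, the following Python sketch flattens a JSON-like record into multi-part column names and allows a single level of lists. The dot separator and the function name `flatten` are assumptions; Interana's actual naming rules aren't given here.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; allow one level of lists."""
    columns = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A nested object becomes several multi-part column names.
            columns.update(flatten(value, name))
        elif isinstance(value, list):
            # Lists are allowed as column values, but not lists of lists.
            if any(isinstance(item, list) for item in value):
                raise ValueError(f"column {name!r}: list values can't be lists")
            columns[name] = value
        else:
            columns[name] = value
    return columns


event = {"user": {"id": 7, "geo": {"country": "US"}}, "tags": ["a", "b"]}
flat = flatten(event)
```

Here `flat` has the columns `user.id`, `user.geo.country` and `tags`, while a record such as `{"a": [[1]]}` is rejected.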

Finally, other Interana tech notes include:

Compression is a central design consideration …
- … especially but not only compression algorithms designed to deal with great sparseness, such as run-length encoding (RLE).
- Dictionary compression, in a strategy that is rarer than I once expected it to be, uses a global rather than shard-by-shard dictionary. The data Interana expects is of low-enough cardinality for this to be the better choice.
- Column data is sorted. A big part of the reason is of course to aid compression.
- Compression strategies are chosen automatically for each segment. Wholly automatically, I gather; you can’t tune the choice manually.
As you would think, Interana technically includes multiple data stores.
- Data first hits a write-optimized store. Unlike the case of Vertica, this WOS never is involved in answering queries.
- Asynchronously, the data is broken into columns, and banged to “disk”.
- Asynchronously again, the data is sorted.
- Queries run against sorted data, sorting recent blocks on-the-fly if necessary.
Interana lets you shard different replicas of the data according to different shard keys.
Interana is proud of the random sampling it does when serving approximate query results.
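
Why sorting helps compression is easy to see with run-length encoding. The sketch below is a generic RLE in Python, not Interana's implementation; the column values are made up.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


column = ["US", "DE", "US", "US", "DE", "FR", "US", "DE"]
unsorted_runs = rle_encode(column)        # many short runs
sorted_runs = rle_encode(sorted(column))  # one run per distinct value
```

The unsorted column needs seven runs for eight values, while the sorted one needs only three. The same effect is much stronger on sparse, low-cardinality columns.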

Notes on the transition to the cloud

Curt Monash — Tue, 04 Oct 2016 02:22:21 +0000

1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:

The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
Software as a service, aka SaaS.
Co-location in off-premises data centers, aka colo.
On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.

Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.

This is a good example of Monash’s Laws of Commercial Semantics.

2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.

This fact now seems to be widely understood.

3. The public cloud is a natural fit for those use cases in which elasticity truly matters. Many websites and other consumer internet backends have that characteristic. Such systems are often also a good fit for cloud technologies in general.

This is frequently a good reason for new — i.e. “greenfield” — apps to run in the cloud.

4. Security and privacy can be concerns in moving to the cloud. But I’m hearing that more and more industries are overcoming those concerns.

In connection with that point, it might be interesting to note:

In the 1960s and 1970s, one of the biggest industries for remote computing services — i.e. SaaS — was commercial banking.
Other big users were hospitals and stockbrokers.
The US intelligence agencies are building out their own shared, dedicated cloud.

5. Obviously, Amazon is the gorilla in the cloud business. Microsoft Azure gets favorable mentions as well. I don’t hear much about other public cloud providers, however, except that there are a lot of plans to support Google’s cloud just in case.

In particular, I hear less than I expected to about public clouds run by national-champion telecom companies around the world.

6. It’s inconvenient for an application vendor to offer both traditional and SaaS versions of a product. Release cycles and platform support are different in the two cases. But there’s no reason a large traditional application vendor couldn’t pull it off, and the largest are already more or less claiming to. Soon, this will feel like a market necessity across the board.

7. The converse is less universally true. However, some SaaS vendors do lose out from their lack of on-premises options. Key considerations include:

Does your application need to run close to your customers’ largest databases?
Do your customers still avoid the public cloud?

If both those things are true, and you don’t have an on-premises option, certain enterprises are excluded from your addressable market.

8. Line-of-business departments are commonly more cloud-friendly than central IT is. Reasons include:

Departments don’t necessarily see central IT as any “closer” to them than the cloud is.
Departments don’t necessarily care about issues that give central IT pause.
Departments sometimes buy things that only are available via remote delivery, e.g. narrowly focused SaaS applications or market data.

I discussed some of this in my recent post on vendor lock-in.

9. When the public cloud was younger, it had various technological limitations. You couldn’t easily get fast storage like flash. You couldn’t control data movement well enough for good MPP (Massively Parallel Processing) in use cases like analytic SQL.

Those concerns seem to have been largely alleviated.

10. It takes a long time for legacy platforms to be decommissioned. At some enterprises, however, that work has indeed been going on for a long time, via virtualization.

11. If you think about system requirements:

There is a lot of computing power in devices that may be regarded as IoT nodes — phones, TV boxes, thermostats, cars, industrial equipment, sensors, etc. Client-side computing is getting ever more diverse.
Server-side computing, however, is more homogeneous. Enterprises can, should and likely will meet the vast majority of their server requirements on a relatively small number of clusters each.

I argued the latter point in my 2013 post on appliances, clusters, and clouds, using terminology and reasoning that are now only slightly obsolete.

So what will those clusters be? Some will be determined by app choices. Most obviously, if you use SaaS, the SaaS vendor decides which cloud(s) your data is in. And if you’re re-hosting legacy systems via virtualization, that’s another cluster.

Otherwise, clusters will probably be organized by database, in the most expansive sense of the term. For example, there could be separate clusters for:

Operational data managed by your general-purpose RDBMS (Oracle, SQL Server, DB2, whatever).
Relational data warehousing, whether in an analytic RDBMS or otherwise.
Log files, perhaps managed in Hadoop or Splunk.
Your website and other internet back-ends, perhaps running over NoSQL data stores.
Text documents managed by some kind of search engine.
Media block or object storage, if the organization’s audio/video/whatever would overwhelm a text search engine. (Text search or document management systems can often also handle low volumes of non-text media.)

Indeed, since computing is rarely as consolidated as CIOs dream of it being, a large enterprise might have several clusters for any of those categories — each running different software for data and storage management — with different deployment choices among colo, true on-prem, and true cloud.

Cloudera Kudu deep dive

Curt Monash — Mon, 28 Sep 2015 07:52:13 +0000

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Part 1 is an overview of Kudu technology.
Part 2 (this post) is a lengthy dive into how Kudu writes and reads data.
Part 3 is a brief speculation as to Kudu’s eventual market significance.

Let’s talk in more detail about how Kudu stores data.

As previously noted, inserts land in an in-memory row store, which is periodically flushed to the column store on disk. Queries are federated between these two stores. Vertica taught us to call these the WOS (Write-Optimized Store) and ROS (Read-Optimized Store) respectively, and I’ll use that terminology here.
Part of the ROS is actually another in-memory store, aka the DeltaMemStore, where updates and deletes land before being applied to the DiskRowSets. These stores are managed separately for each DiskRowSet. DeltaMemStores are checked at query time to confirm whether what’s in the persistent store is actually up to date.
A major design goal for Kudu is that compaction should never block — nor greatly slow — other work. In support of that:
- Compaction is done, server-by-server, via a low-priority but otherwise always-on background process.
- There is a configurable maximum to how big a compaction process can be — more precisely, the limit is to how much data the process can work on at once. The current default figure = 128 MB, which is 4X the size of a DiskRowSet.
- When done, Kudu runs a little optimization to figure out which 128 MB to compact next.
Every tablet has its own write-ahead log.
- This creates a practical limitation on the number of tablets …
- … because each tablet is causing its own stream of writes to “disk” …
- … but it’s only a limitation if your “disk” really is all spinning disk …
- … because multiple simultaneous streams work great with solid-state memory.
Log retention is configurable, typically the greater of 5 minutes or 128 MB.
Metadata is cached in RAM. Therefore:
- ALTER TABLE kinds of operations that can be done by metadata changes only — i.e. adding/dropping/renaming columns — can be instantaneous.
- To keep from being screwed up by this, the WOS maintains a column that labels rows by which schema version they were created under. I immediately called this MSCC — Multi-Schema Concurrency Control — and Todd Lipcon agreed.
Durability, as usual, boils down to “Wait until a quorum has done the writes”, with a configurable option as to what constitutes a “write”.
- Servers write to their respective write-ahead logs, then acknowledge having done so.
- If it isn’t too much of a potential bottleneck — e.g. if persistence is on flash — the acknowledgements may wait until the log has been fsynced to persistent storage.
There’s a “thick” client library which, among other things, knows enough about the partitioning scheme to go straight to the correct node(s) on a cluster.

Leaving aside the ever-popular possibilities of:

Cluster-wide (or larger) equipment outages
Bugs

the main failure scenario for Kudu is:

The leader version of a tablet (within its replica set) goes down.
A new leader is elected.
The workload is such that the client didn’t notice and adapt to the error on its own.

Todd says that Kudu’s MTTR (Mean Time To Recovery) for write availability tests internally at 1-2 seconds in such cases, and shouldn’t really depend upon cluster size.

Beyond that, I had some difficulties understanding details of the Kudu write path(s). An email exchange ensued, and Todd kindly permitted me to post some of his own words (edited by me for clarification and format).

Every tablet has its own in-memory store for inserts (MemRowSet). From a read/write path perspective, every tablet is an entirely independent entity, with its own MemRowSet, rowsets, etc. Basically the flow is:

The client wants to make a write (i.e. an insert/update/delete), which has a primary key.

The client applies the partitioning algorithm to determine which tablet that key belongs in.

The information about which tablets cover which key ranges (or hash buckets) is held in the master. (But since it is cached by the clients, this is usually a local operation.)

It sends the operation to the “leader” replica of the correct tablet (batched along with any other writes that are targeted to the same tablet).

Once the write reaches the tablet leader:

The leader enqueues the write to its own WAL (Write-Ahead Log) and also enqueues it to be sent to the “follower” replicas.

Once it has reached a majority of the WALs (i.e. 2/3 when the replication factor = 3), the write is considered “replicated”. That is to say, it’s durable and would always be rolled forward, even if the leader crashed at this point.

Only now do we enter the “storage” part of the system, where we start worrying about MemRowSets vs DeltaMemStores, etc.
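
The flow Todd describes can be sketched in a few lines of Python. This is a toy model under stated assumptions: `Replica`, `tablet_for` and `replicate` are hypothetical names, CRC32 stands in for whatever hash Kudu really uses, and there is no leader election, retry logic or rollback of partially replicated writes.

```python
import zlib


class Replica:
    """One copy of a tablet, reduced to its write-ahead log."""

    def __init__(self, up=True):
        self.up = up
        self.wal = []

    def append(self, op):
        if self.up:
            self.wal.append(op)
        return self.up  # acknowledgement (False if the replica is down)


def tablet_for(key, num_tablets):
    """Hash partitioning: map a primary key to a tablet index."""
    return zlib.crc32(str(key).encode()) % num_tablets


def replicate(op, replicas):
    """Treat replicas[0] as leader; the write counts as replicated
    once a majority of the WALs have acknowledged it."""
    acks = sum(1 for replica in replicas if replica.append(op))
    return acks > len(replicas) // 2


# Four tablets, replication factor 3, one follower down in each.
tablets = [[Replica(), Replica(), Replica(up=False)] for _ in range(4)]
op = ("insert", 42, {"v": 1})
target = tablet_for(op[1], len(tablets))
ok = replicate(op, tablets[target])
```

With two of three WALs acknowledging, the write is considered replicated. With only one of three it would not be, which is the 2/3 rule from the description above.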

Put another way, there is a fairly clean architectural separation into three main subsystems:

Metadata and partitioning (map from a primary key to a tablet, figure out which servers host that tablet).

Consensus replication (given a write operation, ensure that it is durably logged and replicated to a majority of nodes, so that even if we crash, everyone will agree whether it should be applied or not).

Tablet storage (now that we’ve decided a write is agreed upon across replicas, actually apply it to the database storage).

These three areas of the code are separated as much as possible. For example, once we’re in the “tablet storage” code, it has no idea that there might be other tablets. Similarly, the replication and partitioning code don’t know much of anything about MemRowSets, etc.; that’s entirely within the tablet layer.

As for reading — the challenge isn’t in the actual retrieval of the data so much as in figuring out where to retrieve it from. What I mean by that is:

Data will always be either in memory or in a persistent column store. So I/O speed will rarely be a problem.
Rather, the challenge to Kudu’s data retrieval architecture is finding the relevant record(s) in the first place, which is slightly more complicated than in some other systems. For upon being told the requested primary key, Kudu still has to:
- Find the correct tablet(s).
- Find the record(s) on the (rather large) tablet(s).
- Check various in-memory stores as well.

The “check in multiple places” problem doesn’t seem to be of much concern, because:

All that needs to be checked is the primary key column.
The on-disk data is front-ended by Bloom filters.
The cases in which a Bloom filter returns a false positive are generally the same busy ones where the key column is likely to be cached in RAM.
Cloudera just assumes that checking a few different stores in RAM isn’t going to be a major performance issue.
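
For readers who haven't met Bloom filters, here is a minimal Python sketch of how one can front-end on-disk data. It is a textbook construction, not Kudu's code; the sizes, the SHA-256-based hashing and the `lookup` helper are assumptions.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def lookup(key, rowsets):
    """Probe only the rowsets whose filter admits the key."""
    probed = 0
    for bloom, rows in rowsets:
        if not bloom.might_contain(key):
            continue  # definite miss: skip the expensive read
        probed += 1
        if key in rows:
            return rows[key], probed
    return None, probed


rowsets = []
for start in (0, 100, 200):
    rows = {k: f"row-{k}" for k in range(start, start + 100)}
    bloom = BloomFilter()
    for k in rows:
        bloom.add(k)
    rowsets.append((bloom, rows))

value, probed = lookup(150, rowsets)
```

A filter never produces a false negative, so a "no" lets the reader skip a rowset outright. The occasional false positive costs one extra probe.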

When it comes to searching the tablets themselves:

Kudu tablets feature data skipping among DiskRowSets, based on value ranges for the primary key.
The whole point of compaction is to make the data skipping effective.
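
A toy Python example shows why compaction matters for data skipping. The rowset names and key ranges below are invented; only the min/max-range test itself reflects the description above.

```python
def rowsets_to_scan(key, rowsets):
    """Return the rowsets whose [min_key, max_key] range could hold the key."""
    return [name for name, lo, hi in rowsets if lo <= key <= hi]


# Before compaction: key ranges overlap, so range checks prune very little.
before = [("rs1", 0, 900), ("rs2", 100, 950), ("rs3", 50, 999)]
# After compaction: the same key space rewritten as disjoint ranges.
after = [("rs1", 0, 333), ("rs2", 334, 666), ("rs3", 667, 999)]

scanned_before = rowsets_to_scan(500, before)
scanned_after = rowsets_to_scan(500, after)
```

A lookup for key 500 has to consider all three rowsets before compaction, but only one afterwards.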

Finally, Kudu pays a write-time (or compaction-time) cost to boost retrieval speeds from inside a particular DiskRowSet, by creating something that Todd called an “ordinal index” but agreed with me would be better called something like “ordinal offset” or “offset index”. Whatever it’s called, it’s an index that tells you the number of rows you would need to scan before getting the one you want, thus allowing you to retrieve (except for the cost of an index probe) at array speeds.
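
The offset-index idea can be sketched like this in Python. A binary search over a sorted key list stands in for the real index structure, which isn't described here; the class name `DiskRowSetSketch` and the column names are hypothetical.

```python
from bisect import bisect_left


class DiskRowSetSketch:
    """Columns stored as parallel arrays, in primary-key order."""

    def __init__(self, rows):
        rows = sorted(rows)  # tuples sort by their first field, the key
        self.keys = [r[0] for r in rows]
        self.columns = {
            "name": [r[1] for r in rows],
            "score": [r[2] for r in rows],
        }

    def ordinal(self, key):
        """Index probe: how many rows precede the row with this key?"""
        pos = bisect_left(self.keys, key)
        if pos == len(self.keys) or self.keys[pos] != key:
            raise KeyError(key)
        return pos

    def get(self, key, column):
        # Once the ordinal is known, each column is a plain array lookup.
        return self.columns[column][self.ordinal(key)]


rs = DiskRowSetSketch([(30, "c", 3.0), (10, "a", 1.0), (20, "b", 2.0)])
name = rs.get(20, "name")
```

The cost is paid once, when the rows are written in key order; after that, one probe yields an offset that works for every column file.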

Introduction to Cloudera Kudu

Curt Monash — Mon, 28 Sep 2015 07:50:02 +0000

This is part of a three-post series on Kudu, a new data storage system from Cloudera.

Part 1 (this post) is an overview of Kudu technology.
Part 2 is a lengthy dive into how Kudu writes and reads data.
Part 3 is a brief speculation as to Kudu’s eventual market significance.

Cloudera is introducing a new open source project, Kudu,* which from Cloudera’s standpoint is meant to eventually become the single best underpinning for analytics on the Hadoop stack. I’ve spent multiple hours discussing Kudu with Cloudera, mainly with Todd Lipcon. Any errors are of course entirely mine.

*Like the impala, the kudu is a kind of antelope. I knew that, because I enjoy word games. What I didn’t know — and which is germane to the naming choice — is that the kudu has stripes.

For starters:

Kudu is an alternative to HDFS (Hadoop Distributed File System), or to HBase.
Kudu is meant to be the underpinning for Impala, Spark and other analytic frameworks or engines.
Kudu is not meant for OLTP (OnLine Transaction Processing), at least in any foreseeable release. For example:
- Kudu doesn’t support multi-row transactions.
- There are no active efforts to front-end Kudu with an engine that is fast at single-row queries.
- Kudu is rather columnar, except for transitory in-memory stores.
Kudu’s core design points are that it should:
- Accept data very quickly.
- Immediately make that data available for analytics.
More specifically, Kudu is meant to accept, along with slower forms of input:
- Lots of fast random writes, e.g. of web interactions.
- Streams, viewed as a succession of inserts.
- Updates and inserts alike.
The core “real-time” use cases for which Kudu is designed are, unsurprisingly:
- Low-latency business intelligence.
- Predictive model scoring.
Kudu is designed to work fine with spinning disk, and indeed has been tested to date mainly on disk-only nodes. Even so, Kudu’s architecture is optimized for the assumption that there will be at least some flash on the node.
Kudu is designed primarily to support relational/SQL processing. However, Kudu also has a nested-data roadmap, which of course starts with supporting the analogous capabilities in Impala.

Also, it might help clarify Kudu’s status and positioning if I add:

Kudu is in its early days — heading out to open source and beta now, with maturity still quite a way off. Many obviously important features haven’t been added yet.
Kudu is expected to be run with a replication factor (tunable, usually =3). Replication is via the Raft protocol.
Kudu and HDFS can run on the same nodes. If they do, they are almost entirely separate from each other, with the main exception being some primitive workload management to help them share resources.
Permanent advantages of older alternatives over Kudu are expected to include:
- Legacy. Older, tuned systems may work better over some HDFS formats than over Kudu.
- Pure batch updates. Preparing data for immediate access has overhead.
- Ultra-high update volumes. Kudu doesn’t have a roadmap to completely catch up in write speeds with NoSQL or in-memory SQL DBMS.

Kudu’s data organization story starts:

Storage is right on the server (this is of course also the usual case for HDFS).
On any one server, Kudu data is broken up into a number of “tablets”, typically 10-100 tablets per node.
Inserts arrive into something called a MemRowSet and are soon flushed to something called a DiskRowSet. Much as in Vertica:
- MemRowSets are managed by an in-memory row store.
- DiskRowSets are managed by a persistent column store.*
- In essence, queries are internally federated between the in-memory and persistent stores.
Each DiskRowSet contains a separate file for each column in the table.
DiskRowSets are tunable in size. 32 MB currently seems like the optimal figure.
Page size default is 256K, but can be dropped as low as 4K.
DiskRowSets feature columnar compression, with a variety of standard techniques.
- All compression choices are specific to a particular DiskRowSet.
- The same goes for the dictionary, in the case of dictionary/token compression.
- Thus, data is decompressed before being operated on by a query processor.
- Also, selected columns or an entire DiskRowSet can be block-compressed.
Tables and DiskRowSets do not expose any kind of RowID. Rather, tables have primary keys in the usual RDBMS way.
Kudu can partition data in the three usual ways: randomly, by range or by hash.
Kudu does not (yet) have a slick and well-tested way to broadcast-replicate a small table across all nodes.
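
The MemRowSet/DiskRowSet split and the internal federation of queries can be sketched as below. This is a toy Python model, not Kudu's design in any detail: `TabletSketch`, the row-count flush threshold and the `column_sum` query are all assumptions.

```python
class TabletSketch:
    def __init__(self, flush_threshold=3):
        self.mem_rows = {}      # in-memory row store: key -> row dict
        self.disk_rowsets = []  # each rowset: column name -> list of values
        self.flush_threshold = flush_threshold

    def insert(self, key, row):
        self.mem_rows[key] = row
        if len(self.mem_rows) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Pivot the buffered rows into one list per column."""
        if not self.mem_rows:
            return
        keys = sorted(self.mem_rows)
        rowset = {"key": keys}
        for col in self.mem_rows[keys[0]]:
            rowset[col] = [self.mem_rows[k][col] for k in keys]
        self.disk_rowsets.append(rowset)
        self.mem_rows = {}

    def column_sum(self, col):
        """Federated query: combine the row store and the column store."""
        total = sum(row[col] for row in self.mem_rows.values())
        total += sum(sum(rowset[col]) for rowset in self.disk_rowsets)
        return total


tablet = TabletSketch()
for k in range(1, 5):
    tablet.insert(k, {"amount": k * 10})
total = tablet.column_sum("amount")
```

After four inserts, three rows have been flushed to a columnar rowset and one is still in memory, yet the query sees all four.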

*I presume there are a few ways in which Kudu’s efficiency or overhead seem more row-store-like than columnar. Still, Kudu seems to meet the basic requirements to be called a columnar system.

Where the innovation is

Curt Monash — Mon, 19 Jan 2015 08:27:57 +0000

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

Product maturity is a huge issue for all the above, and will remain one for years.
Hadoop and Spark show that application execution engines:
- Have a lot of innovation ahead of them.
- Are tightly entwined with data management, and with data movement as well.
Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
There are many issues in storage that can affect data technologies as well, including but not limited to:
- Solid-state (flash or post-flash) vs. spinning disk.
- Networked vs. direct-attached.
- Virtualized vs. identifiable-physical.
- Object/file/block.
Graph analytics and data management are still confused.

3. As I suggested last year, data transformation is an important area for innovation.

MapReduce was invented for data transformation, which is still a large part of what goes on in Hadoop.
The smart data preparation crowd is deservedly getting attention.
The more different data models — NoSQL and so on — that are used, the greater are the demands on data transformation.

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Mathematically (more) complex models that are at once more accurate and more easily arrived at than (nearly) linear ones.
Similarly, more complex clustering.
Predictive experimentation.
The use of business intelligence and predictive modeling to inform each other.

Beyond that:

Event-series analytics is another exciting area. (At least on the BI side, I frankly expected it to sweep through the relevant vertical markets more quickly than it has.)
I’ve long been disappointed in the progress in text analytics. But sentiment analysis is doing fairly well, many more languages are analyzed than before, and I occasionally hear rumblings of text analytic sophistication inching back towards that already available in the previous decade.
While I don’t write about it much, modern BI navigation is an impressive and wonderful thing.

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

“Fencing” processes and resources away from each other improves system quality, in that it defends against both deliberate attacks and inadvertent error.
Fencing is costly, both in terms of context-switching and general non-optimization. Nonetheless, I suspect that …
… the cost of such process isolation may need to be borne.
Object-oriented programming and its associated contracts are good things in this context. But it’s obvious they’re not getting the job done on their own.

More specifically,

It is cheap to give single-purpose intelligent devices more computing power than they know what to do with. There is really no excuse for allowing them to be insecure.
It is rare for a modern PC to go much above 25% CPU usage, simply because most PC programs are still single-core. This illustrates that — assuming some offsetting improvements in multi-core parallelism — desktop software could take a security performance hit without much pain to users’ wallets.
On servers, we may in many cases be talking about lightweight virtual machines.
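
The 25% figure above is what one would expect from a single-threaded program on a four-core machine. The core count is my assumption here, not something stated above, and the function name is hypothetical. A quick sketch of the arithmetic:

```python
def max_cpu_utilization(busy_threads: int, cores: int) -> float:
    """Upper bound on whole-machine CPU usage when each thread
    can saturate at most one core."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return min(busy_threads, cores) / cores


# Assumed example: one busy thread on a four-core PC.
single_threaded = max_cpu_utilization(busy_threads=1, cores=4)
# Whatever is left over is headroom that isolation overhead could consume.
headroom = 1.0 - single_threaded
```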

And to be clear:

What I’m talking about would do little to help the authentication/authorization aspects of security, but …
… those will never be perfect in any case (because they depend upon fallible humans) …
… which is exactly why other forms of security will always be needed.

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

Application development technology, languages, frameworks, etc.
The integration of analytics into old-style operational apps.
The never-ending attempts to make large-enterprise-class application functionality available to outfits with small-enterprise sophistication and budgets.

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.

Related links

In many cases, I think that innovations will prove more valuable — or at least much easier to monetize — when presented to particular vertical markets.
Edit: I followed up on the last point with a post about soft robots.

Notes from a visit to Teradata

Curt Monash — Sun, 31 Aug 2014 09:17:29 +0000

I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.

First, let’s catch up with some personnel gossip. So far as I can tell:

Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
… Darryl McDonald runs the apps part (Aprimo and so on), and no longer is head of marketing.
Oliver Ratzesberger runs Teradata’s software development.
Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.

The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.

In some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew before:

The Teradata 1xxx series is focused on cost-per-bit.
The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
The Teradata 6xxx series is supposed to be able to do “everything”.
The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”

Also:

1xxx and 2xxx systems are meant to be I/O-constrained. 6xxx systems are meant to be constrained mainly by CPU, but every system will be I/O-constrained at some point.
There is at least one example of a Very Well Known organization buying Teradata’s Hadoop-only appliance despite not otherwise being a Hadoop customer. Teradata concedes, however, that this is not a common occurrence.
Customers are increasingly using co-location rather than their own data centers. Many colo organizations charge more or less strictly by floor space. Hence, there’s a push for maximum processing density per rack, power density and weight be damned.

Speaking of not being CPU-constrained — I heard 7-10% as an estimate for typical Hadoop utilization, and also 10-15%. While I didn’t ask, I presume these figures assume traditional MapReduce types of Hadoop workloads. I’m not sure why these figures are yet lower than eBay’s long-ago estimates of Hadoop “parallel efficiency”.

Like Carson used to do, Jeff shared a variety of hardware and networking tidbits with me. In particular:

Jeff is confident in Moore’s Law continuing for at least 5 more years. (I think that’s a near-consensus; the 2020s, however, are another matter.)
Teradata still uses SAS rather than SATA for all disk (spinning or solid-state) controllers. They’re now seeing 600-700 MB/sec/device on SSDs (Solid-State Disks), up from 300-400.
SSD prices are down 60% over the past 6 months, vs. much slower declines previously.
Formerly a SanDisk/Pliant partisan, Teradata now thinks there are multiple vendors of good SSDs. (I’m not sure whether they’d be happy if I said which one they currently like best.)
Jeff foresees InfiniBand and Ethernet more or less merging. Right now Teradata is using a lot of 56 Gb/sec InfiniBand.

Since Oliver is now a Teradata mucky-muck, I asked about virtual data marts, an idea that he pretty much invented or at least popularized back in his eBay days. Comments included:

Teradata now calls them Data Labs.
Adoption is very high.
One major feature is “time boxing” — they expire after a period of time unless you renew them.
Analysis of virtual data mart usage is a good guide as to what you might want to add to your permanent data warehouse.
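
The “time boxing” idea can be sketched in a few lines of Python. This is only an illustration of expire-unless-renewed semantics; the class, field and function names are hypothetical, and nothing here reflects how Teradata actually implements Data Labs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimeBoxedMart:
    """Hypothetical record for a virtual data mart that expires unless renewed."""
    name: str
    expires_at: datetime

    def is_expired(self, now: datetime) -> bool:
        return now >= self.expires_at

    def renew(self, now: datetime, extension: timedelta) -> None:
        # Assumption: renewal counts from the renewal time, not the old deadline.
        self.expires_at = now + extension


def sweep(marts: list[TimeBoxedMart], now: datetime) -> list[TimeBoxedMart]:
    """Return the marts that are still alive; the rest would be dropped."""
    return [m for m in marts if not m.is_expired(now)]
```

A mart that nobody bothers to renew disappears at the next sweep, and the ones that keep getting renewed are natural candidates for promotion into the permanent warehouse.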

And I’ll stop here, although I hope that a couple more-focused posts will also eventually flow from the visit.

Notes and comments, May 6, 2014

Curt Monash — Tue, 06 May 2014 13:46:54 +0000

After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.

My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
My Cloudera/SQL/Impala/Hive post apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
My post on the Intel/Cloudera deal garnered a comment reminding us that Dell had pushed the Intel distro.
My CitusDB post picked up a few clarifying comments.

Here is a catch-all post to complete the set.

1. The recently-announced Cloudera/MongoDB relationship* is still at the Barney stage. That said, I’m optimistic that their stated intention to add substance to the relationship will eventually come to fruition. If nothing else, the two companies have high regard for each other, at least at the Mike Olson/Max Schireson level.

*That’s one of numerous deals with my fingerprints on it, but in this case only lightly. It was probably on track to happen even without my nudges.

2. Most of what I talked about when I visited MongoDB is confidential; the public stuff was mainly in my recent MongoDB technology post. But in one exception, I asked Max for an update as to MongoDB enterprise use cases. He reported a cluster in data combination, especially but not only in use cases which have both a high-volume part and dynamic-schema aspects. Specific examples Max cited included:

Tracking financial holdings from a variety of asset classes — especially if derivatives are involved, because they have a dynamic-schema aspect.
Product catalogs, including for use on web sites.
Customer information.
Patient information.

3. I didn’t ask everybody I saw in California about business trends, and much of what we did discuss was confidential. That said:

MapR was proud of its numbers.
So was DataStax.
ClearStory has a bunch of Very Big Enterprises as customers, mainly but not only in consumer sectors (e.g. retail, packaged goods).

4. Platfora is focusing a bit, starting with clickstream and security — i.e., event series stuff. And by the way, they report that the term “event series” is working well for them.

5. I gather from a variety of comments and conversations that Amazon Redshift has achieved considerable traction.

6. Something I can’t find evidence of having posted before: I think multiple businesses monitor online sales or similar business successes as a guide to network problems. eBay did this via a custom in-memory MOLAP (Multidimensional OnLine Analytic Processing) system years ago. Best evidence that this is hardly restricted to eBay: all the “me-too” responses I get from telling that story.

7. Citus Data tells me that as of PostgreSQL 9.4, Postgres will be able to return just the part of a JSON column needed for a query. This is as opposed to storing the whole thing as text and only retrieving it in its entirety.

8. In the comments to my “Spark on fire” post, Patrick McFadin pointed out that Mahout is transitioning from MapReduce to Spark. (All new work will be on Spark, although old MapReduce-based routines will continue to be supported.) It turns out that Derrick Harris wrote about that over a month ago, and I just missed the news.

9. Also in predictive analytics — there are rumblings that R could eventually be supplanted by Julia, although R’s massive libraries of algorithms still give it the advantage now.

10. Multiple vendors, fed up with the intermittent slowdowns from garbage collection, are moving some processing off the Java heap. Unfortunately, I neglected to ask any of them what the remaining differences then were between Java and C++ programming.

11. And to finish on a light note: BDAS — the project of which Spark is only a part — is pronounced “bad-ass”, something I first heard from Dave Patterson.

Hardware and storage notes

Curt Monash — Thu, 01 May 2014 02:05:16 +0000

My California trip last week focused mainly on software — duh! — but I had some interesting hardware/storage/architecture discussions as well, especially in the areas of:

Rack- or data-center-scale systems.
The real or imagined demise of Moore’s Law.
Flash.

I also got updated as to typical Hadoop hardware.

If systems are designed at the whole-rack level or higher, then there can be much more flexibility and efficiency in terms of mixing and connecting CPU, RAM and storage. The Google/Facebook/Amazon cool kids are widely understood to be following this approach, so others are naturally considering it as well. The most interesting of several mentions of that point came when I got the chance to talk with Berkeley computer architecture guru Dave Patterson, who’s working on plans for 100-petabyte/terabit-networking kinds of systems, for usage after 2020 or so. (If you’re interested, you might want to contact him; I’m sure he’d love more commercial sponsorship.)

One of Dave’s design assumptions is that Moore’s Law really will end soon (or at least greatly slow down), if by Moore’s Law you mean that every 18 months or so one can get twice as many transistors onto a chip of the same area and cost as one could before. However, while he thinks that applies to CPU and RAM, Dave thinks flash is an exception. I gathered that he thinks the power/heat reasons for Moore’s Law to end will be much harder to defeat than the other ones; note that flash, because of what it’s used for, has vastly less power running through it than CPU or RAM do.
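
For a sense of what is at stake under that definition, doubling every 18 months compounds to roughly a tenfold gain over five years. A small Python sketch of the arithmetic follows; the function name is hypothetical and the 18-month default simply mirrors the definition above.

```python
def transistor_multiple(years: float, doubling_months: float = 18.0) -> float:
    """Factor by which transistor count grows at constant area and cost,
    assuming one doubling every `doubling_months` months."""
    return 2.0 ** (years * 12.0 / doubling_months)


# Growth factors over a few horizons, under the 18-month assumption.
growth_by_horizon = {years: transistor_multiple(years) for years in (1.5, 3, 5, 6)}
```

If the doubling period stretches, the same function shows how quickly the gains shrink: at 36 months per doubling, six years yields only a fourfold increase rather than sixteenfold.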

Otherwise, I didn’t gain much new insight into actual flash uptake. Everybody thinks flash is or soon will be very important; but in many segments, folks are trading off disk vs. RAM without worrying much about the intermediate flash alternative.

I visited two Hadoop distribution vendors this trip, namely the ones who are my clients – Cloudera and MapR. I remembered to ask one of them, Cloudera, about typical Hadoop hardware, and got answers that sounded consistent with hardware trends Hortonworks told me about last August. The story is, more or less:

The default assumption remains $20-30K/node, 2 sockets, 12 disks. (Edit: See lively price discussion in the comments below.)
Most hardware vendors have standard/default Hadoop boxes by now, and in many cases customers just buy what’s on offer.
The aforementioned disks sometimes get up to 4 terabytes now.
128GB is now the norm for RAM. 256GB is common. Higher amounts are seen, up to – in rare cases – 2-4 TB.
Flash is of interest, but isn’t being demanded much yet. This could change when flash’s storage density matches disk’s.
Flash interest is highest for Impala.

Cloudera suggested that the larger amounts of RAM tend to be used when customers frame the need as putting certain analytic datasets entirely in RAM. This rings true to me; there’s lots of evidence that users think that way, and not just in analytic cases. This is probably one of the reasons that they often jump straight from disk to RAM without fully exploring the opportunities of flash.

One last thing — the big cloud vendors are at least considering the use of their own non-Intel chip designs, which might be part of the reason for Intel’s large Hadoop investment.