What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.
Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.
The top integration and integration-like challenges for me, from a practical standpoint, are:
- Integrating silos — a decades-old problem still with us in a big way.
- Dynamic schemas with joins.
- Low-latency business intelligence.
- Human real-time personalization.
Other concerns that get mentioned include:
- Geographical distribution due to privacy laws, which for some users is a major requirement for compliance.
- Logical data warehouse, a term that doesn’t actually mean anything real.
- In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.
Let’s skip those latter issues for now, focusing instead on the first four.
Integrating silos
While the software industry has been working on application integration for decades, there’s clearly a long way yet to go. Let me illustrate by way of a personal story.
I needed a new laptop computer on short notice, and decided to go with an HP Folio.* Driving to a local Wal-Mart seemed more practical than ordering online, as a couple of stores near my house were listed by Walmart.com as being in stock. I called just to check; both were out of stock. The Wal-Mart folks on the phone told me such errors are routine.
*It was pretty much the cheapest all-solid-state credible alternative I could find, is said to have a good keyboard, and has an Ethernet port for all those client visits when guest Wi-Fi doesn’t work.
You may recall my outraged tweets about a similar silos-of-non-integration story in Dell customer support, a couple of years back. Yet Dell is one of the larger computer companies in the world, while Wal-Mart is one of the most accomplished computer users. If Wal-Mart and Dell can’t get basic system functionality right, just imagine how screwed up everybody else is.
Dynamic schemas with joins
There are multiple reasons to use dynamic schemas over fixed ones. This is especially true when recording web interaction data, because every page can have very different information to log. But there are also multiple reasons to want to use joins, especially when your application combines two or more of:
- User-specific reference, demographic, and/or psychographic data.
- System-wide reference data driving user-specific personalization.
- Orders and inventory.
- Verbose log data.
That doesn’t mean that a fully general join syntax is needed in every DBMS. But it does mean that the workarounds to joinlessness I wrote about a couple of years ago often don’t suffice.
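To make the combination concrete, here is a minimal sketch of a join between dynamically-schemaed interaction records and fixed-schema reference data. Everything in it is an assumption made for illustration: the field names, the sample data, and the `join_events_to_users` helper are hypothetical, and plain Python dictionaries stand in for whatever data stores are actually involved.

```python
# Hypothetical interaction log: every record can carry different fields,
# which is the "dynamic schema" part.
events = [
    {"user_id": 1, "page": "/home", "referrer": "search"},
    {"user_id": 2, "page": "/cart", "sku": "A-100", "quantity": 2},
    {"user_id": 1, "page": "/product", "sku": "B-200"},
    {"user_id": 3, "page": "/home"},
]

# Hypothetical user reference data, keyed by user_id.
users = {
    1: {"segment": "new"},
    2: {"segment": "returning"},
}


def join_events_to_users(events, users):
    """Left-join each event to its user's reference record on user_id.

    Events keep whatever fields they arrived with; reference fields are
    added with a "user_" prefix. Events with no matching user are kept
    unchanged.
    """
    joined = []
    for event in events:
        profile = users.get(event["user_id"], {})
        row = dict(event)  # copy, so the original event is not mutated
        row.update({f"user_{key}": value for key, value in profile.items()})
        joined.append(row)
    return joined


joined = join_events_to_users(events, users)
```

Doing this in application code works at toy scale. The point of the section is that it gets painful once the log is large and the reference data changes, which is where join support inside the DBMS starts to matter.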
Fortunately, much better stuff is being developed. The best that I know of still awaits launch — but I’ve begun to connect users with vendors who can address that problem head-on.
Low-latency business intelligence
If you have data pounding into a short-request system, there are several levels of BI you could try to do on it in human real time.
- System monitoring. There are lots of tools for that.
- Simple business aggregations. Top-end system monitoring stacks can help with that too (notably Splunk). Alternatively, you can maintain a few aggregates in even a NoSQL database.*
- More serious BI, drawing on the various information in your data warehouse. That one’s tougher.
*Counters are the canonical example.
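As a rough illustration of the counter idea, the sketch below keeps a handful of running aggregates as each write arrives. The `AggregateCounters` class, its key layout, and the event fields are all hypothetical, and an in-memory `Counter` stands in for the key-value store. In a real deployment the increments would presumably go through whatever atomic increment operation the store offers.

```python
from collections import Counter


class AggregateCounters:
    """Maintain a few simple business aggregates alongside incoming writes."""

    def __init__(self):
        # Stand-in for a key-value store; keys are (metric, dimension) pairs.
        self._counts = Counter()

    def record(self, event):
        """Update the running aggregates for one incoming event."""
        self._counts[("page_views", event["page"])] += 1
        if "order_cents" in event:
            self._counts[("orders", "all")] += 1
            self._counts[("revenue_cents", "all")] += event["order_cents"]

    def get(self, metric, dimension="all"):
        """Read one aggregate; metrics never recorded read as zero."""
        return self._counts[(metric, dimension)]


counters = AggregateCounters()
for event in [
    {"page": "/home"},
    {"page": "/checkout", "order_cents": 2599},
    {"page": "/home"},
    {"page": "/checkout", "order_cents": 1000},
]:
    counters.record(event)
```

Reading `counters.get("revenue_cents")` is then a single key lookup rather than a scan, which is why this level of BI is feasible even without a query engine.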
Single-server RDBMS have, for years, combined OLTP (OnLine Transaction Processing) and a reasonable amount of reporting or BI. As needs get more intense, Oracle and SAP are throwing hardware at the problem, via Exadata, Exalytics, HANA, and so on. But suppose you prefer a short-request system that scales out, runs on cheap commodity hardware, and fits well into the cloud. What do you do then?
One approach, which in some form I’ve recommended to multiple clients, is to stream the data to some kind of analytic data store, and serve your analytics from there. That technology is getting better all the time, even though many vendors haven’t yet recognized the magnitude of the need and opportunity.
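One simple shape that streaming can take is micro-batching: the short-request side hands each write to a forwarder, which ships small batches to the analytic store. The sketch below is only that, a sketch. `MicroBatchForwarder` and its `sink` callable are hypothetical names, the analytic store is represented by a plain list, and real concerns such as retries, ordering, and time-based flushing are left out.

```python
class MicroBatchForwarder:
    """Buffer writes and forward them to an analytic store in small batches."""

    def __init__(self, sink, batch_size=100):
        self.sink = sink              # callable that accepts a list of records
        self.batch_size = batch_size
        self._buffer = []

    def on_write(self, record):
        """Call this for every record written to the short-request system."""
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send any buffered records to the sink, then clear the buffer."""
        if self._buffer:
            self.sink(list(self._buffer))
            self._buffer.clear()


# A list of batches stands in for the analytic data store.
analytic_store = []
forwarder = MicroBatchForwarder(analytic_store.append, batch_size=2)

for n in range(3):
    forwarder.on_write({"n": n})
# At this point one full batch has been shipped; one record is still buffered.
forwarder.flush()
```

The batch size is the latency knob here: smaller batches mean fresher analytics at the cost of more load on the analytic side.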
More responsive personalization
Another kind of human-real-time analytics is even more important — automated response, such as ad personalization. Ideally, you want your response to be well-informed by everything the user has been doing over the past few minutes and even seconds. But two difficulties loom.
First, if we combine this point and the previous two, we might ideally want to stream data from a NoSQL store to an analytic one and back to a short-request SQL DBMS. That would be — complicated. Fortunately, there are a variety of not-crazy approaches, with varying degrees of cost, pain, or risk, with more coming soon as different kinds of data stores somewhat re-converge.
The second problem is more conceptual. What are the models and algorithms that tell us how to personalize based on up-to-the-second information? Since only the most simple-minded approaches seem practical to implement, only the most simple-minded answers have ever been worked out. A lot of data science lies ahead — and for once I don’t think that term is overwrought.
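For a sense of what "simple-minded" looks like in practice, here is a sketch that personalizes on nothing more than the most frequent category in a user's last few minutes of activity. The `RecentActivity` class, the category labels, and the five-minute window are all assumptions chosen for illustration, and timestamps are passed in explicitly (as seconds) to keep the example deterministic.

```python
from collections import Counter, deque


class RecentActivity:
    """Track one user's recent actions and pick the dominant category."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, category) pairs, oldest first

    def record(self, timestamp, category):
        """Log one action; timestamps are assumed to arrive in order."""
        self._events.append((timestamp, category))

    def top_category(self, now):
        """Return the most frequent category in the window, or None."""
        cutoff = now - self.window_seconds
        # Drop everything older than the window.
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()
        if not self._events:
            return None
        counts = Counter(category for _, category in self._events)
        return counts.most_common(1)[0][0]


activity = RecentActivity(window_seconds=300)
activity.record(0, "laptops")
activity.record(10, "laptops")
activity.record(400, "cameras")
choice = activity.top_category(now=420)  # the two "laptops" events have expired
```

This ignores everything interesting: sequence, dwell time, longer-term history, and how quickly interest decays. That gap between what is easy to compute and what would actually be well-informed is the conceptual problem described above.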
And with that I’m shutting down for 2 1/2 weeks for vacation. Depending on how things go with my new HP Folio :), as well as Wi-Fi in Istanbul, I hope to be fairly responsive to blog comments and email, and indeed will work on setting up a long October California trip. But I also hope that, for once, there isn’t any vacation-busting news; I’ve had some bad luck in that regard before, professionally and personally alike.