January 19, 2015

Where the innovation is

I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. :) But if we abandon any hope that this post could be comprehensive, I can at least say:

1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.

2. Even so, there’s much room for innovation around data movement and management. I’d start with:

3. As I suggested last year, data transformation is an important area for innovation. 

4. There’s a lot going on in investigative analytics. Besides the “platform” technologies already mentioned, in areas such as fast-query, data preparation, and general execution engines, there’s also great innovation higher in the stack. Most recently I’ve written about multiple examples in predictive modeling, such as:

Beyond that:

5. Back in 2013, in what was perhaps my previous most comprehensive post on innovation, I drew a link between innovation and refactoring, where what was being refactored was “everything”. Even so, I’ve been ignoring a biggie. Security is a mess, and I don’t see how it can ever be solved unless systems are much more modular from the ground up. By that I mean:

More specifically,

And to be clear:

6. You’ve probably noticed the fuss around an open letter about artificial intelligence, with some press coverage suggesting that AI is a Terminator-level threat to humanity. Underlying all that is a fairly interesting paper summarizing some needs for future research and innovation in AI. In particular, reading the paper reminded me of the previous point about security.

7. Three areas of software innovation that, even though they’re pretty much in my wheelhouse, I have little to say about right now are:

8. There is, of course, tremendous innovation in robots and other kinds of device. But this post is already long enough, so I’ll address those areas some other time.


Comments

11 Responses to “Where the innovation is”

  1. J. Andrew Rogers on January 19th, 2015 2:10 pm

    A problem with the Vs is that they are almost always connected by “or” in real platforms. Unfortunately, many applications require e.g. Volume *and* Velocity simultaneously. This is not a technical limitation per se; rather, it reflects architectural limitations of platforms due to narrower original use cases. This has made, for example, popular open source (and most closed source) platforms unsuitable for emerging sensor and machine-generated data applications, which often require extremely high continuous ingest of live data concurrent with storage and operational queries (not summarizations) on a rolling window that spans days or weeks. This is the canonical Internet of Things workload in a nutshell, and in my experience the typical working set size is 100 TB, give or take an order of magnitude. In-memory is too small, and most on-disk storage behaviors are too archival-like.
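    The rolling-window workload described here can be reduced to a toy sketch (hypothetical code, not any vendor's design): continuous ingest with eviction of data older than the window, concurrent with operational range queries over raw events rather than summarizations.

```python
from collections import deque


class RollingWindowStore:
    """Toy sketch of the canonical IoT workload: continuous ingest into a
    time-ordered store, eviction outside the rolling window, and
    operational queries over raw events (not pre-aggregated summaries)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()   # (timestamp, payload), ordered by arrival

    def ingest(self, ts, payload):
        self.events.append((ts, payload))
        # Evict everything that has fallen out of the rolling window.
        cutoff = ts - self.window
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()

    def query(self, t0, t1):
        # Operational query over the raw events still inside the window.
        return [e for e in self.events if t0 <= e[0] <= t1]
```

    A real system must, of course, do this at wire speed with the window spilling far beyond memory, which is exactly where the comment says most platforms fall down.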

    High-end SQL environments have sophisticated I/O scheduling that connects real-time execution and ingest processing with storage, but this has been absent in virtually all “big data” platforms. The challenge for the Hadoop ecosystem is that addressing Volume *and* Velocity *and* Variety simultaneously requires a much tighter coupling of components and I/O management than the model they are used to. It is not just Hadoop/Spark either; I’ve heard many similar stories from people using platforms like MongoDB, Cassandra, and MemSQL, which while monolithic still rely on primitive and relatively decoupled internal models that make it difficult to seamlessly connect all of the execution paths under load.

    Internet of Things workloads are exposing weaknesses in many of the existing Big Data architectures by requiring them to be more general than they have been. It is not something that can be trivially added to an architecture after the fact so this will drive more evolution in the platform market.

  2. Curt Monash on January 19th, 2015 2:28 pm

    Hi, and thanks for your comment!

    I may be a little more optimistic about the consequences of current technology efforts than you are, or perhaps we’re just emphasizing slightly different things. Solving Volume and Velocity in the same system, while far from trivial, is straightforward — you have a low-latency store to first receive the data (presumably in-memory), a fat and efficient set of (parallel) pipes to persistent storage, and good federation between the two at the time of query.
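    The tiering pattern described in that paragraph can be illustrated in a few lines (hypothetical toy code; a real system would use parallel writes and indexed persistent storage rather than Python lists):

```python
class HybridStore:
    """Toy sketch of the tiering pattern: a low-latency in-memory store
    receives data first, a bulk flush plays the role of the fat parallel
    pipes to persistent storage, and queries federate across both tiers."""

    def __init__(self, flush_threshold=4):
        self.memory = []   # hot tier: recent (timestamp, payload) pairs
        self.disk = []     # stand-in for the persistent tier
        self.flush_threshold = flush_threshold

    def ingest(self, ts, payload):
        self.memory.append((ts, payload))
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # In a real system this would be a parallel, append-optimized write.
        self.disk.extend(self.memory)
        self.disk.sort()
        self.memory.clear()

    def query(self, t0, t1):
        # Federation at query time: results combine both tiers; a real
        # system would use the disk tier's index instead of a scan.
        return sorted(e for e in self.memory + self.disk if t0 <= e[0] <= t1)
```

    The hard engineering is in making the flush and the federated query fast and seamless under load, not in the shape of the design itself.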

    Meanwhile, Variety and Variability are almost orthogonal to those issues, with the big exception being that decades of superb engineering in analytic performance for tabular (relational or MOLAP) data stores may be marginalized when addressing new kinds of data challenges.

  3. clive boulton on January 19th, 2015 5:23 pm

    Under the covers the application innovators are combining multiple databases into web-scale consumer internet sites, or into web-scale platforms for vertical enterprise app builders (case in point: J. Andrew Rogers above).

    The engineering skill is in combining a collection of the V’s into a polystore; the sales skill is in quickly getting traction on a sliver of application functionality. Meanwhile, complete ERP business solutions to run modern manufacturing are still stuck in pre-Y2K database architectures, even in SaaS.

    Missing innovations seem to be in horizontal licensing to seamlessly connect systems and in providing a consistent UI across a collection of sites for rich web clients. http://component.kitchen/components

    Or perhaps on-premises makes a comeback and connects to the “V’s” via a SaaS-appliance hybrid model?

    Coda: Something seems missing to kick off an enterprise app boom like the one consumer sites have had over the last 20 years.

  4. J. Andrew Rogers on January 20th, 2015 1:42 pm

    Curt, the recurring challenge customers are having with high-velocity data is that they need the on-disk data to be fully indexed for fast queries and online as well at this scale, not just stored. Big data platforms tend to be designed with the assumption that if it is on disk, it is for offline processing only. Query performance falls off a cliff the minute you touch disk, but the use cases for this data tend to be operational. It is a showstopper for many IoT analytic applications.

    Variety is solved but with one qualification. Data platforms, big or small, have a difficult time scaling data models built around interval data types; you can prove these can’t be scaled out with either hash or range partitioning, particularly for online data models. This notoriously includes geospatial and constraint data models, hence the conspicuous absence of geospatial big data platforms even though these are the largest data sets that exist.
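    The partitioning point can be illustrated with a toy experiment (hypothetical code, not SpaceCurve's actual design): hash-partition a set of intervals by their start point, then count how many partitions a single overlap query must touch. For point data, hashing routes a lookup to exactly one partition; for interval overlap, the matching intervals' start points hash arbitrarily, so the query cannot be pruned.

```python
import random


def hash_partition(intervals, n_parts):
    """Assign each (lo, hi) interval to a partition by hashing its start."""
    parts = [[] for _ in range(n_parts)]
    for lo, hi in intervals:
        parts[hash(lo) % n_parts].append((lo, hi))
    return parts


def partitions_touched(parts, q_lo, q_hi):
    """Count partitions holding at least one interval that overlaps the query.
    (lo, hi) overlaps [q_lo, q_hi] iff lo <= q_hi and hi >= q_lo."""
    return sum(
        any(lo <= q_hi and hi >= q_lo for lo, hi in p) for p in parts
    )


random.seed(1)
# 1000 long, overlapping spans with random start points.
intervals = [(lo, lo + 50) for lo in (random.randrange(1000) for _ in range(1000))]
parts = hash_partition(intervals, 8)

# Even a narrow query window overlaps intervals whose start points are
# scattered across partitions, so nearly every partition participates.
touched = partitions_touched(parts, 500, 510)
```

    Range partitioning fares no better once intervals straddle partition boundaries, which is the crux of the scaling claim above.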

    These touch on two things that make SpaceCurve’s architecture unique. It is the first database platform built around a pure interval computational model, based on some computer science research I did back when I was designing back-ends for Google Earth; even traditional SQL elements are represented as hyper-rectangles under the hood. At least as important, the storage and execution engine are a novel design, borrowing some ideas from my supercomputing days, that allow us to take 10 GbE through processing, indexing, and storage per node on cheap Linux clusters concurrent with low-latency parallel queries at very large scales.

    When we first went into the market, we thought our obvious differentiation would be based on our effortless scaling of interval data models and analytics. Ironically, customers value the platform at least as much for the smooth wire speed performance when disk is involved, which I would expect more platforms to do well. It seems trivial but the rough transition from memory to disk is turning out to be a killer for applications involving high velocity data.

  5. Venkat Krishnamurthy on January 24th, 2015 1:37 am

    Curt

    On graph analytics and data management being confused…

    We have had some experiential learning in this space over the last couple of years with Urika (starting with changing the vexing capitalization of the name from uRiKa). Essentially we learned that ‘graph for analytics’ use cases are distinct from ‘graphs for data management’ ones. I think the big reason for the confusion is that graphs are an abstract data structure that could be stored and analyzed in any number of different ways, and lumping them into ‘Graphs’ as a category is somewhat meaningless.

    Basically you don’t need graph databases to do graph analytics. If you’re building graph databases, then the primary use case is implicitly or explicitly related to traversal or pattern matching/isomorphism. While these are more common than people think (eg. every 3rd normal form relational data model is easily viewable as a composite of multiple graphs), people don’t think that way because there’s no major value-add beyond the relational model in doing so. Graph db’s are further confused because of the RDF/semantic graph vs. property-graph data model ‘fork’, and the attendant rise of multiple query languages (declarative SPARQL, imperative Gremlin, Cypher etc).

    OTOH, if you’re doing analytics (loosely defined as characterizing the graph via computed quantities like clustering coefficient or centrality measures), you may not even need a ‘proper’ graph. Several efficient techniques exploit the duality between graphs and matrices so you can treat these as linear algebra problems. As a simple example, you can download stock price data from yahoo, compute pairwise correlations of returns (a half-matrix) and you’d have an edge-weighted graph. In such cases, the graph is a ‘lazily’ materialized analytic data structure – you only create it when you need it.
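    The stock-correlation example can be sketched directly in linear algebra (synthetic random returns stand in for downloaded price data; the tickers and the threshold are hypothetical), with no graph database in sight:

```python
import numpy as np

# Synthetic stand-in for downloaded price data: 250 days x 4 tickers.
rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(250, 4))
tickers = ["A", "B", "C", "D"]

# The pairwise correlation half-matrix is an edge-weighted graph in disguise.
corr = np.corrcoef(returns, rowvar=False)

# Lazily "materialize" the graph only when needed: keep edges above a cutoff.
edges = [
    (tickers[i], tickers[j], corr[i, j])
    for i in range(len(tickers))
    for j in range(i + 1, len(tickers))
    if abs(corr[i, j]) > 0.02
]

# A simple graph analytic computed straight from the matrix: weighted degree
# (sum of |edge weights| per node, excluding the diagonal self-loop).
weighted_degree = np.abs(corr).sum(axis=1) - 1.0
```

    Here the graph exists only as long as the analysis needs it, which is exactly the ‘lazily materialized analytic data structure’ point.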

    The graph analytics area has momentum because it provides a very distinct and powerful analytic toolbox from the usual set. GraphLab/Dato is a great example. The graph database side has a tougher road since it requires changing how people think about data, and has to additionally get past category overload (‘NoSQL’) and inertia (relational data model). Finally in either case, the attendant algorithmic problems don’t go away, making for lots of lurking performance monsters.

    What we also know(being Cray after all) is that the underlying system engineering has a huge impact here – eg. latency tolerant multithreaded hardware or software runtimes, shared memory, partitioning, graph-parallel processing models being prime examples.

  6. Soft robots, Part 1 — overview | DBMS 2 : DataBase Management System Services on January 27th, 2015 7:29 am

    […] This series partially fulfils an IOU left in my recent post on IT innovation. […]

  7. clive boulton on February 7th, 2015 3:58 pm

    Re: Venkat’s comment: GraphLab/Dato is a great example. GraphLab’s realization that folks don’t think in graphs, they think in tabular data, is what Dato formed around. Graphs are fine for Donald Knuth; mere mortals go into overload. Platforms that ingest tabular data, optimize for velocity, output intelligence, and (I’d add) provide means to enter ERP orders / process transactions will strike new gold, IMHO.

  8. Which analytic technology problems are important to solve for whom? | DBMS 2 : DataBase Management System Services on April 12th, 2015 11:54 pm

    […] a need and/or desire for more sophisticated analytic tools, in predictive modeling and […]

  9. Information technology for personal safety | DBMS 2 : DataBase Management System Services on May 25th, 2015 8:02 pm

    […] overview of innovation opportunities ended by saying there was great opportunity in devices. It also offered notes on predictive […]

  10. Patric P. on November 4th, 2015 6:40 am

    In my opinion, database modeling in the cloud is a significant innovation of the last few years. Until very recently, the market was completely dominated by desktop database modelers: you had to buy a license, and your modeling application was tied to your desktop computer. There were no alternatives.

    A few years ago the first online modelers were released. They were very simple and couldn’t be used in professional app development, but the idea itself was a breakthrough: no desktop licenses, no installation, no upgrades. The only requirement is a suitable web browser and internet access. Just open a browser, log in, and you can get to work. All your models are stored in the cloud, so you can access them anywhere and anytime. Moreover, a web-based tool gives you brand-new possibilities for collaborating within your team and for working remotely.

    It must also be pointed out that the new commercial generation of these tools provides features and capabilities similar to those of their leading desktop counterparts, plus superior collaboration features that let you build and manage your team and work together on your db models. I tested the two most significant online tools for database modeling, GenMyModel (www.genmymodel.com) and Vertabelo (www.vertabelo.com), and I must say that they are the future of database design. As these applications run in a browser, their interfaces need to be light, which makes them far more intuitive and user-friendly than the heavy, overloaded UIs of desktop modelers.
    What are your opinions? Do you think that the future of database modeling tools is on the internet?

  11. Curt Monash on November 4th, 2015 9:13 pm

    This is for modeling of what kinds of databases, that run where?
