March 18, 2013

Dataset management

I coined a new term, dataset management, for my clients at Revelytix, which they indeed adopted to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both:

Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. 🙂 Second, “metadata” could apply either to data about the file or to data about the data structures in the file or perhaps to data about data in the file, making “metadata” an even more confusing term in this context than in others.

My idea for the term dataset is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:

As for the specific products, both of which you might want to check out:


6 Responses to “Dataset management”

  1. aaron on March 18th, 2013 2:33 pm

    It is interesting to look at the two (oddly segregated) streams:
    – imposing structure on Hadoop data (HCatalog, HBase catalog and catalog tracker, etc. and many more)
    – looking at the Hadoop ontology such as the above
    The convergence would be at the app level if the data doesn’t have a single, strict, desired semantic interpretation.

    It is also fascinating how useless most of these tools seem where the data is semi-structured or has multiple interpreters attempting to understand it in distinct ways.

  2. Curt Monash on March 18th, 2013 8:12 pm

    It’s a trade-off.

    In a traditional relational world, the schemas for different apps are integrated into one hairball, but there are clean interfaces between the database and the app.

    In the dynamic schema world, the database schema is tightly integrated into the app, but is much more independent of other apps’ schemas.

  3. Hari on March 29th, 2013 7:56 am

    Hi Curt,

    How’s a ‘dataset’ different from a ‘data mart’ from the old data-warehousing days?

    Why are you introducing a new term?

  4. Curt Monash on March 29th, 2013 10:41 am

    In (overly) simple terms:

    A dataset would most typically be a single file in some format or other, managed by Hadoop.

    A data mart would most typically be a set of relational tables, managed by an RDBMS.

  5. Teradata bought Hadapt and Revelytix | DBMS 2 : DataBase Management System Services on July 23rd, 2014 4:29 am

    […] out a data integration suite to cover a limited universe of data stores. And Revelytix’ dataset management technology is a nice piece toward an integrated data […]

  6. Cloudera in the cloud(s) | DBMS 2 : DataBase Management System Services on January 22nd, 2016 6:57 am

    […] Cloudera Navigator for object storage is a roadmap item. […]
