I coined a new term, “dataset management”, for my clients at Revelytix, and they have indeed adopted it to describe what they do. It would also apply to the recently released Cloudera Navigator. To a first approximation, you may think of dataset management as either or both of:
- Metadata management in a structured-file context.
- Lineage/provenance, auditing, and similar stuff.
Why not just say “metadata management”? First, the Revelytix guys have long been in variants of that business, and they’re tired of the responses they get when they use the term. Second, “metadata” could apply to data about the file, to data about the data structures in the file, or perhaps to data about the data in the file. That makes “metadata” an even more confusing term in this context than in others.
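To make those three readings of “metadata” concrete, here is a minimal Python sketch for a single CSV file. Everything in it is an assumption made for illustration: the function names `describe_dataset` and `write_sample`, the sample file `orders.csv`, and the choice of row and distinct-value counts as “data about the data”. It does not reflect how Revelytix Loom or Cloudera Navigator actually work.

```python
import csv
import os
import tempfile


def describe_dataset(path):
    """Collect three kinds of "metadata" for one CSV file (illustrative only)."""
    # 1. Data about the file itself.
    file_info = {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
    }

    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = reader.fieldnames or []

    # 2. Data about the data structures in the file.
    structure_info = {"columns": columns}

    # 3. Data about the data in the file (a hypothetical, very small profile).
    content_info = {
        "row_count": len(rows),
        "distinct_values": {c: len({r[c] for r in rows}) for c in columns},
    }

    return {"file": file_info, "structure": structure_info, "content": content_info}


def write_sample(directory):
    """Write a tiny hypothetical CSV file and return its path."""
    path = os.path.join(directory, "orders.csv")
    with open(path, "w", newline="") as f:
        f.write("order_id,region\n1,east\n2,west\n3,east\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(describe_dataset(write_sample(tmp)))
```

All three dictionaries could fairly be called “metadata”, which is exactly the ambiguity at issue.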
My idea for the term “dataset” is to connote more grandeur than would be implied by the term “table”, but less than one might assume for a whole “database”. I.e.:
- A dataset contains all the information about something. This makes it a bigger deal than a mere table, which could be meaningless outside the context of a database.
- But the totality of information in a “dataset” could be less comprehensive than what we’d expect in a whole “database”.
As for the specific products, both of which you might want to check out:
- Cloudera Navigator:
  - Is one product from a leading Hadoop company.
  - Assumes you use Cloudera’s flavor of Hadoop.
  - Is generally available.
  - Starts with auditing (lineage coming soon).
- Revelytix Loom:
  - Is the main product of a small metadata management company.
  - Is distro-agnostic.
  - Is in beta.
  - Already does lineage.