Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Any subcategory
- Database diversity
1. A couple years ago I wrote skeptically about integrating predictive modeling and business intelligence. I’m less skeptical now.
- The predictive experimentation I wrote about over Thanksgiving calls naturally for some BI/dashboarding to monitor how it’s going.
- If you think about Nutonian’s pitch, it can be approximated as “Root-cause analysis so easy a business analyst can do it.” That could be interesting to jump to after BI has turned up anomalies. And it should be pretty easy to whip up a UI for choosing a data set and objective function to model on, since those are both things that the BI tool would know how to get to anyway.
I’ve also heard a couple of ideas about how predictive modeling can support BI. One is via my client Omer Trajman, whose startup ScalingData is still semi-stealthy, but says they’re “working at the intersection of big data and IT operations”. The idea goes something like this:
- Suppose we have lots of logs about lots of things.* Machine learning can help:
- Notice what’s an anomaly.
- Group* together things that seem to be experiencing similar anomalies.
- That can inform a BI-plus interface for a human to figure out what is happening.
Makes sense to me.
* The word “cluster” could have been used here in a couple of different ways, so I decided to avoid it altogether.
Finally, I’m hearing a variety of “smart ETL/data preparation” and “we recommend what columns you should join” stories. I don’t know how much machine learning there’s been in those to date, but it’s usually at least on the roadmap to make the systems (yet) smarter in the future. The end benefit is usually to facilitate BI.
2. Discussion of graph DBMS can get confusing. For example: Read more
|Categories: Business intelligence, Greenplum, Hadoop, Hortonworks, Log analysis, Neo Technology and Neo4j, Nutonian, Predictive modeling and advanced analytics, RDF and graphs, WibiData||1 Comment|
I talked with the Snowflake Computing guys Friday. For starters:
- Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
- The Snowflake DBMS is built from scratch (as opposed, to for example, being based on PostgreSQL or Hadoop).
- The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
- Snowflake claims excellent SQL coverage for a 1.0 product.
- Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.
Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*
*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.
In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:
- Ingest formats are JSON, XML or AVRO for now.
- I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.
I don’t know enough details to judge whether I’d call that an example of schema-on-need.
A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather: Read more
|Categories: Amazon and its cloud, Cloud computing, Data mart outsourcing, Data models and architecture, Data warehousing, Market share and customer counts, Parallelization, Pricing, Software as a Service (SaaS), Structured documents||1 Comment|
We all tend to assume that data is a great and glorious asset. How solid is this assumption?
- Yes, data is one of the most proprietary assets an enterprise can have. Any of the Goldman Sachs big three* — people, capital, and reputation — are easier to lose or imitate than data.
- In many cases, however, data’s value diminishes quickly.
- Determining the value derived from owning, analyzing and using data is often tricky — but not always. Examples where data’s value is pretty clear start with:
- Industries which long have had large data-gathering research budgets, in areas such as clinical trials or seismology.
- Industries that can calculate the return on mass marketing programs, such as internet advertising or its snail-mail predecessors.
*”Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.” I love that motto, even if Goldman Sachs itself eventually stopped living up to it. If nothing else, my own business depends primarily on my reputation and information.
This all raises the idea – if you think data is so valuable, maybe you should get more of it. Areas in which enterprises have made significant and/or successful investments in data acquisition include: Read more
|Categories: Data mart outsourcing, eBay, Health care, Investment research and trading, Log analysis, Scientific research, Text, Web analytics||5 Comments|
After visiting California recently, I made a flurry of posts, several of which generated considerable discussion.
- My claim that Spark will replace Hadoop MapReduce got much Twitter attention — including some high-profile endorsements — and also some responses here.
- My MemSQL post led to a vigorous comparison of MemSQL vs. VoltDB.
- My post on hardware and storage spawned a lively discussion of Hadoop hardware pricing; even Cloudera wound up disagreeing with what I reported Cloudera as having said. Sadly, there was less response to the part about the partial (!) end of Moore’s Law.
- My Cloudera/SQL/Impala/Hive apparently was well-balanced, in that it got attacked from multiple sides via Twitter & email. Apparently, I was too hard on Impala, I was too hard on Hive, and I was too hard on boxes full of cardboard file cards as well.
- My post on the Intel/Cloudera deal garnered a comment reminding us Dell had pushed the Intel distro.
- My CitusDB post picked up a few clarifying comments.
Here is a catch-all post to complete the set. Read more
I caught up with my clients at MongoDB to discuss the recent MongoDB 2.6, along with some new statements of direction. The biggest takeaway is that the MongoDB product, along with the associated MMS (MongoDB Management Service), is growing up. Aspects include:
- An actual automation and management user interface, as opposed to the current management style, which is almost entirely via scripts (except for the monitoring UI).
- That’s scheduled for public beta in May, and general availability later this year.
- It will include some kind of integrated provisioning with VMware, OpenStack, et al.
- One goal is to let you apply database changes, software upgrades, etc. without taking the cluster down.
- A reasonable backup strategy.
- A snapshot copy is made of the database.
- A copy of the log is streamed somewhere.
- Periodically — the default seems to be 6 hours — the log is applied to create a new current snapshot.
- For point-in-time recovery, you take the last snapshot prior to the point, and roll forward to the desired point.
- A reasonable locking strategy!
- Document-level locking is all-but-promised for MongoDB 2.8.
- That means what it sounds like. (I mention this because sometimes an XML database winds up being one big document, which leads to confusing conversations about what’s going on.)
- Security. My eyes glaze over at the details, but several major buzzwords have been checked off.
- A general code rewrite to allow for (more) rapid addition of future features.
Some technical background about Splunk
In an October, 2009 technical introduction to Splunk, I wrote (emphasis added):
Splunk software both reads logs and indexes them. The same code runs both on the nodes that do the indexing and on machines that simply emit logs.
It turns out that the bolded part was changed several years ago. However, I don’t have further details, so let’s move on to Splunk’s DBMS-like aspects.
I also wrote:
The fundamental thing that Splunk looks at is an increment to a log – i.e., whatever has been added to the log since Splunk last looked at it.
That remains true. Confusingly, Splunk refers to these log increments as “rows”, even though they’re really structured and queried more like documents.
I further wrote:
Splunk has a simple ILM (Information Lifecycle management) story based on time. I didn’t probe for details.
Splunk’s ILM story turns out to be simple indeed.
- As data streams in, Splunk adds it to the most recent — “hot” — bucket. Once a bucket is full, it becomes immutable — “warm” — and a new hot bucket is opened to receive data.
- Splunk executes queries against whichever of these time-slice buckets make sense, then unions results together as needed.
Finally, I wrote:
I get the impression that most Splunk entity extraction is done at search time, not at indexing time. Splunk says that, if a <name, value> pair is clearly marked, its software does a good job of recognizing same. Beyond that, fields seem to be specified by users when they define searches.
I have trouble understanding how Splunk could provide flexible and robust reporting unless it tokenized and indexed specific fields more aggressively than I think it now does.
The point of what I in October, 2013 called
a high(er)-performance data store into which you can selectively copy columns of data
and which Splunk enthusiastically calls its “High Performance Analytic Store” is to meet that latter need.
Inverted list technology is confusing for several reasons, which start: Read more
|Categories: Data models and architecture, NoSQL, SAP AG, Splunk, Structured documents, Text||1 Comment|
A couple of points that arise frequently in conversation, but that I don’t seem to have made clearly online.
“Metadata” is generally defined as “data about data”. That’s basically correct, but it’s easy to forget how many different kinds of metadata there are. My list of metadata kinds starts with:
- Data about data structure. This is the classical sense of the term. But please note:
- In a relational database, structural metadata is rather separate from the data itself.
- In a document database, each document might carry structure information with it.
- Other inputs to core data management functions. Two major examples are:
- Column statistics that inform RDBMS optimizers.
- Value ranges that inform partition pruning or, more generally, data skipping.
- Inputs to ancillary data management functions — for example, security privileges.
- Support for human decisions about data — for example, information about authorship or lineage.
What’s worse, the past year’s most famous example of “metadata”, telephone call metadata, is misnamed. This so-called metadata, much loved by the NSA (National Security Agency), is just data, e.g. in the format of a CDR (Call Detail Record). Calling it metadata implies that it describes other data — the actual contents of the phone calls — that the NSA strenuously asserts don’t actually exist.
And finally, the first bullet point above has a counter-intuitive consequence — all common terminology notwithstanding, relational data is less structured than document data. Reasons include:
- Relational databases usually just hold strings — or maybe numbers — with structural information being held elsewhere.
- Some document databases store structural metadata right with the document data itself.
- Some document databases store data in the form of (name, value) pairs. In some cases additional structure is imposed by naming conventions.
- Actual text documents carry the structure imposed by grammar and syntax.
- A lengthy survey of metadata kinds, biased to Hadoop (August, 2012)
- Metadata as derived data (May, 2011)
- Dataset management (May, 2013)
- Structured/unstructured … multi-structured/poly-structured (May, 2011)
|Categories: Data models and architecture, Hadoop, Structured documents, Surveillance and privacy, Telecommunications||5 Comments|
I’ve heard a lot of buzz recently around Spark. So I caught up with Ion Stoica and Mike Franklin for a call. Let me start by acknowledging some sources of confusion.
- Spark is very new. All Spark adoption is recent.
- Databricks was founded to commercialize Spark. It is very much in stealth mode …
- … except insofar as Databricks folks are going out and trying to drum up Spark adoption.
- Ion Stoica is running Databricks, but you couldn’t tell that from his UC Berkeley bio page. Edit: After I posted this, Ion’s bio was quickly updated.
- Spark creator and Databricks CTO Matei Zaharia is an MIT professor, but actually went on leave there before he ever showed up.
- Cloudera is perhaps Spark’s most visible supporter. But Cloudera’s views of Spark’s role in the world is different from the Spark team’s.
The “What is Spark?” question may soon be just as difficult as the ever-popular “What is Hadoop?” That said — and referring back to my original technical post about Spark and also to a discussion of prominent Spark user ClearStory — my try at “What is Spark?” goes something like this:
- Spark is a distributed execution engine for analytic processes …
- … which works well with Hadoop.
- Spark is distinguished by a flexible in-memory data model …
- … and farms out persistence to HDFS (Hadoop Distributed File System) or other existing data stores.
- Intended analytic use cases for Spark include:
- SQL data manipulation.
- ETL-like data manipulation.
- Streaming-like data manipulation.
- Machine learning.
- Graph analytics.
IBM excels at game technology, most famously in Deep Blue (chess) and Watson (Jeopardy!). But except at the chip level — PowerPC — IBM hasn’t accomplished much at game/real world crossover. And so I suspect the Watson hype is far overblown.
I believe that for two main reasons. First, whenever IBM talks about big initiatives like Watson, it winds up bundling a bunch of dissimilar things together and claiming they’re a seamless whole. Second, some core Watson claims are eerily similar to artificial intelligence (AI) over-hype three or more decades past. For example, the leukemia treatment advisor that is being hopefully built in Watson now sounds a lot like MYCIN from the early 1970s, and the idea of collecting a lot of tidbits of information sounds a lot like the Cyc project. And by the way:
- MYCIN led to E-MYCIN, which led to the company Teknowledge, which raised a lot of money* but now has almost faded from memory.
- Cyc is connected to the computer science community’s standard unit of bogosity.
The 2013 Gartner Magic Quadrant for Operational Database Management Systems is out. “Operational” seems to be Gartner’s term for what I call short-request, in each case the point being that OLTP (OnLine Transaction Processing) is a dubious term when systems omit strict consistency, and when even strictly consistent systems may lack full transactional semantics. As is usually the case with Gartner Magic Quadrants:
- I admire the raw research.
- The opinions contained are generally reasonable (especially since Merv Adrian joined the Gartner team).
- Some of the details are questionable.
- There’s generally an excessive focus on Gartner’s perception of vendors’ business skills, and on vendors’ willingness to parrot all the buzzphrases Gartner wants to hear.
- The trends Gartner highlights are similar to those I see, although our emphasis may be different, and they may leave some important ones out. (Big omission — support for lightweight analytics integrated into operational applications, one of the more genuine forms of real-time analytics.)
Anyhow: Read more