Introduction to Metamarkets and Druid
I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted effort. Even so, I’m trying again, in a three-post series:
- Introduction to Metamarkets and Druid (this post)
- Druid overview
- Metamarkets’ back-end technology
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS business intelligence, on large data sets, with low latency in both senses (new data is available for query soon after it arrives, and the queries themselves run at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
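To put those figures in perspective, here is a back-of-envelope calculation (a rough sketch only, taking the approximate numbers above at face value):

```python
# Back-of-envelope scale estimates from the approximate figures above.
events_per_month = 80e9          # ~80 billion events/month, pre-summarization
bytes_per_event = 100            # ~100 bytes per event
queries_per_day = 100_000

raw_tb_per_month = events_per_month * bytes_per_event / 1e12
events_per_second = events_per_month / (30 * 24 * 3600)
queries_per_second = queries_per_day / (24 * 3600)

print(f"~{raw_tb_per_month:.0f} TB/month of raw event data")   # ~8 TB
print(f"~{events_per_second:,.0f} events/second on average")   # ~31,000
print(f"~{queries_per_second:.1f} queries/second on average")  # ~1.2
```

In other words, the raw feed is on the order of single-digit terabytes per month, and query load averages only about one query per second, presumably with much sharper peaks.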
To understand how Metamarkets’ technology works, it probably helps to start by realizing: Read more
Workday update
In August 2010, I wrote about Workday’s interesting technical architecture, highlights of which included:
- Lots of small Java objects in memory.
- A very simple MySQL backing store (append-only, <10 tables).
- Some modernistic approaches to application navigation.
- A faceted approach to BI.
I caught up with Workday recently, and things have naturally evolved. Most of what we talked about (by my choice) dealt with data management, business intelligence, and the overlap between the two.
It is now reasonable to say that Workday’s servers fall into at least seven tiers, although we talked mainly about five that work together as a kind of giant app/database server amalgamation. The three that do noteworthy data management can be described as follows (with a small hypothetical sketch after the list):
- In-memory objects and transactions. This is similar to what Workday had before.
- Persistent MySQL. Part of this is similar to what Workday had before. In addition, Workday is now storing certain data in tables in the ordinary relational way.
- In-memory caching and indexing. This has three aspects:
- Indexes for the ordinary relational tables, organized in interesting ways.
- Indexes for Workday’s search-box navigation (as per my original Workday technical post, you can search across objects, task-names, etc.).
- Compressed copies of the Java objects, used to instantiate other servers as needed. The most obvious uses of this are:
- Recovery for the object/transaction tier.
- Launch for the elastic compute tier. (Described below.)
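To make the interplay of the object tier and the append-only backing store a bit more concrete, here is a minimal hypothetical sketch of that general pattern. It is purely illustrative: SQLite stands in for MySQL, all names are invented, and it is not Workday’s actual code or schema.

```python
import json
import sqlite3

# Append-only backing store: every change is a new (object_id, version, payload) row.
conn = sqlite3.connect(":memory:")  # SQLite stands in here for Workday's MySQL store
conn.execute("""CREATE TABLE object_log (
    object_id TEXT, version INTEGER, payload TEXT,
    PRIMARY KEY (object_id, version))""")

def write_object(obj_id, version, obj):
    # Never UPDATE or DELETE; just append a newer version of the object.
    conn.execute("INSERT INTO object_log VALUES (?, ?, ?)",
                 (obj_id, version, json.dumps(obj)))

def load_latest_objects():
    # Rebuild the in-memory object tier (e.g. on recovery) by replaying the log.
    cache = {}
    rows = conn.execute(
        "SELECT object_id, version, payload FROM object_log ORDER BY version")
    for obj_id, version, payload in rows:
        cache[obj_id] = json.loads(payload)  # later versions overwrite earlier ones
    return cache

write_object("emp-17", 1, {"name": "Pat", "dept": "Finance"})
write_object("emp-17", 2, {"name": "Pat", "dept": "Payroll"})
print(load_latest_objects()["emp-17"])  # {'name': 'Pat', 'dept': 'Payroll'}
```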
Two other Workday server tiers may be described as: Read more
QlikTech bought Expressor
QlikTech has bought Expressor. Notes on that include:
- Expressor wanted to offer data integration/ETL (Extract/Transform/Load) that was all things to all people — great parallel performance, great UI, great price, etc.
- In practice, Expressor seemed to focus on cheap/easy ETL in the Microsoft Windows (i.e., Windows Server) market.
- Expressor never got much traction. This seems confirmed by the “more than 20” figure for headcount mentioned in the acquisition press release.
- Both the press release and some tweets by QlikTech’s Donald Farmer seem to confirm that Expressor is being taken off the market for “boil the ocean” ETL. It will be companion technology to/integrated technology in QlikView.
- Unsurprisingly, Donald indicated that Expressor technology would expand past its Microsoft focus. (Edit: “If needed”)
Introduction to Cloudant
Cloudant is one of the few NoSQL companies with >100 paying subscription customers. For starters:
- Cloudant’s core software is a fork of CouchDB. (A sketch of CouchDB-style document access appears after this list.)
- Cloudant only sells you software as a service.
- More precisely, whether Cloudant offers DBaaS (DataBase as a Service) or PaaS (Platform as a Service) or a “data layer” (Cloudant’s preferred terminology) depends on your taste in buzzwords.
- I gather that Cloudant (the company) wants to handle pretty much all your data management needs. But Cloudant (the product) isn’t there yet, especially on the analytic side.
- Before CouchOne and Membase merged to form Couchbase, Cloudant was positioned as the big(ger) data version of CouchDB.
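Because Cloudant is a CouchDB fork, access is through a CouchDB-style HTTP/JSON document API. Here is a minimal hypothetical sketch; the account name, credentials, and database are invented, and details may differ from what Cloudant actually provisions:

```python
import requests

base = "https://example-account.cloudant.com"   # hypothetical account endpoint
auth = ("example-account", "example-password")  # hypothetical credentials

requests.put(f"{base}/orders", auth=auth)       # create a database, CouchDB-style

doc = {"_id": "order-1001", "customer": "acme", "total": 42.5}
requests.put(f"{base}/orders/{doc['_id']}", json=doc, auth=auth)  # store a document

resp = requests.get(f"{base}/orders/order-1001", auth=auth)       # read it back
print(resp.json()["total"])
```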
Company demographics include:
- Cloudant is based in Boston.
- Cloudant started out as a Y Combinator company in 2008, and “got serious” in 2009.
- Cloudant now has ~20 employees.
- Management hires include a couple of former Vertica guys.
The Cloudant guys gave me some customer counts in May that weren’t much higher than the ones they had given me in February, and they seem to have forgotten to resolve the discrepancy. Oh well. Those (probably understated) figures included ~160 paying customers, of which:
- ~100 were multitenant.
- ~60 were single tenant.
- 1 was on-premise (but still managed by Cloudant) because of privacy concerns.
The largest Cloudant deployments seem to be in the 10s of terabytes, across a very low double digit number of servers.
Quick-turnaround predictive modeling
Last November, I wrote two posts on agile predictive analytics. It’s time to return to the subject. I’m used to KXEN talking about the ability to do predictive modeling, very quickly, perhaps without professional statisticians; that’s the core of what KXEN does. But I was surprised when Revolution Analytics told me a similar story, based on a different approach, because ordinarily that’s not how R is used at all.
Ultimately, there seem to be three reasons why you’d want quick turnaround on your predictive modeling: Read more
Kognitio’s story today
I had dinner tonight with the Kognitio folks. So far as I can tell:
- Branding has been mercifully simplified. Everything is now called “Kognitio” (as opposed to, for example, “WX2”).
- Notwithstanding its long history of selling disk-based DBMS and denigrating memory-only configurations, Kognitio now says that in fact it’s always been an in-memory DBMS vendor.
- Notwithstanding its long history of selling (or attempting to sell) analytic DBMS, Kognitio wants to be viewed as an accelerator to your existing DBMS. This is apparently inspired in part by SAP HANA, notwithstanding that HANA’s direction is to evolve into a hybrid OLTP/analytic general-purpose DBMS.
- Notwithstanding its lack of analytic platform features, Kognitio wants to be viewed as selling an analytic platform.
- Notwithstanding its memory-centric focus, Kognitio doesn’t want to compress data. Kognitio’s opinion — which to my knowledge is shared by few people outside Kognitio — seems to be that the CPU cost of compression/decompression isn’t justified by the RAM savings from compression.
- Kognitio still is pushing a cloud/SaaS (Software as a Service) story. Even if you want to use Kognitio (the product) on-premises, Kognitio (the company) calls that “private cloud” and offers to let you pay annually.
Kognitio believes that this story is appealing, especially to smaller venture-capital-backed companies, and backs that up with some frieNDA pipeline figures.
Between that success claim and SAP’s HANA figures, it seems that the idea of using an in-memory DBMS to accelerate analytics has legs. This makes sense, as the BI vendors (QlikTech excepted) don’t seem to be accomplishing much with their proprietary in-memory alternatives. But I’m not sure that Kognitio would be my first choice to fill that role. Rather, if I wanted to buy an unsuccessful analytic RDBMS to use as an in-memory accelerator, I might consider ParAccel, which is columnar, has an associated compression story, has always had a hybrid memory-centric flavor much as Kognitio has, and is well ahead of Kognitio in the analytic platform derby. That said, I’ll confess to not having talked with or heard much about ParAccel for a while, so I don’t know whether they’ve been able to maintain technical momentum any better than Kognitio has.
Cool analytic stories
There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are all that bloody important, then people have probably already been making do, getting the job done as best they can, even if in an inferior way.
Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.
Even so, I have two questions in my inbox that boil down to “What are the coolest or most significant analytics stories out there?” So let’s round up some of what I know. Read more
Thoughts on “data science”
Teradata is paying me to join a panel on “data science” in downtown Boston, Tuesday May 22, at 3:00 pm. A planning phone call led me to jot down a few notes on the subject, which I’m herewith adapting into a blog post.
For starters, I have some concerns about the concepts of data science and data scientist. Too often, the term “data scientist” is used to suggest that one person needs to have strong skills both in analytics and in data management. But in reality, splitting those roles makes perfect sense. Further:
- It may or may not make sense to say that a computer scientist is doing “science”; the term “data scientist” inherits that ambiguity.
- It may or may not make sense to say that a corporate scientist is doing “science”; for example, a petroleum geologist might do very valuable work without making any scientific discoveries. The term “data scientist” inherits that ambiguity too.
- Too often, people use the term big data as if it were something radically new, rather than a continuation of what has been done in large-scale analytic data management for decades. “Data science” has a similar problem.
- The term “data science” sounds as if you need specialized academic training to do it, which isn’t really true.
The leader in raising these issues is probably Neil Raden.
But there’s one respect in which I think the term “data science” is highly appropriate. In conventional science, gathering data is just as much of an accomplishment as analyzing it. Indeed, most Nobel Prizes are given for experimental results. Similarly, if you’re doing data science, you should be thinking hard about how to corral ever more useful data. Techniques include but are not limited to:
- Keeping data you used to throw away. This has driven a lot of growth in relational data warehouses and big bit buckets alike.
- Bribing customers and prospects. Loyalty cards are the paradigmatic example.
- Split testing. The more internet-based users you have, the more tests you can do.
- Storing derived data (a small sketch follows this list). That can be as simple as pre-computing the scores from your predictive analytics model, or it can be as complex as running a 50-step sequence of Hadoop jobs.
- Getting data from third parties, for example:
- Supply chain partners (right now this rarely amounts to more than simple BI, but that could change in the future).
- Data vendors of various kinds (e.g. credit bureaus).
- Social media/the internet in general, which also usually involves some kind of service provider.
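As a tiny illustration of the “storing derived data” point, here is a hypothetical sketch of pre-computing scores from an already-trained model and writing them out so that downstream queries never have to re-run the model. The weights and field names are invented:

```python
import csv
import math

# Invented weights from a hypothetical, already-trained churn model.
weights = {"logins_per_week": -0.35, "support_tickets": 0.6, "months_as_customer": -0.05}
intercept = 0.2

def churn_score(customer):
    z = intercept + sum(w * customer[field] for field, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))          # logistic score in [0, 1]

customers = [
    {"id": "c-001", "logins_per_week": 5, "support_tickets": 0, "months_as_customer": 24},
    {"id": "c-002", "logins_per_week": 1, "support_tickets": 3, "months_as_customer": 2},
]

# Store the derived data so BI queries read a table of scores instead of re-scoring.
with open("churn_scores.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["customer_id", "churn_score"])
    for c in customers:
        writer.writerow([c["id"], round(churn_score(c), 3)])
```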
Notes on the analysis of large graphs
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition
- Relationship analytics applications
- Analysis of large graphs (this post)
My series on graph data management and analytics got knocked off-stride by our website difficulties. Still, I want to return to one interesting set of issues — analyzing large graphs, specifically ones that don’t fit comfortably into RAM on a single server. By no means do I have the subject figured out. But here are a few notes on the matter.
How big can a graph be? That of course depends on:
- The number of nodes. If the nodes of a graph are people, there’s an obvious upper bound on the node count. Even if you include their houses, cars, and so on, you’re probably capped in the range of 10 billion.
- The number of edges. (Even more important than the number of nodes.) If every phone call, email, or text message in the world is an edge, that’s a lot of edges.
- The typical size of a (node, edge, node) triple. I don’t know why you’d have to go much over 100 bytes post-compression*, but maybe I’m overlooking something.
*Even if your graph has 10 billion nodes, those can be tokenized in 34 bits, so the main concern is edges. Edges can include weights, timestamps, and so on, but how many specifics do you really need? At some point you can surely rely on a pointer to full detail stored elsewhere.
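A quick check of that footnote’s arithmetic (a sketch, nothing more):

```python
import math

nodes = 10e9                                 # ~10 billion nodes
bits_per_id = math.ceil(math.log2(nodes))
print(bits_per_id)                           # 34, since 2**33 < 10 billion <= 2**34
```

So node identifiers are indeed cheap; the space question is all about the edges.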
The biggest graph-size estimates I’ve gotten are from my clients at Yarcdata, a division of Cray. (“Yarc” is “Cray” spelled backwards.) To my surprise, they suggested that graphs about people could have 1000s of edges per node, whether in:
- An intelligence scenario, perhaps with billions of nodes and hence trillions of edges.
- A telecom user-analysis case, with perhaps 100 million nodes and hence 100s of billions of edges.
Yarcdata further suggested that bioinformatics use cases could have node counts higher yet, characterizing Bio2RDF as one of the “smaller” ones at 22 billion nodes. In these cases, the average number of edges per node seems lower than in people-analysis graphs, but we’re still talking about 100s of billions of edges.
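Taking those estimates at face value, the implied storage is far beyond single-server RAM. A back-of-envelope sketch, reusing the ~100-bytes-per-edge figure from above and picking illustrative midpoints for the vaguer numbers:

```python
# Illustrative edge counts implied by the estimates above (midpoints are invented).
scenarios = {
    "intelligence":   5e9 * 1000,   # billions of nodes x 1000s of edges per node
    "telecom":        100e6 * 1000, # ~100 million nodes x 1000s of edges per node
    "bioinformatics": 300e9,        # "100s of billions" of edges
}

bytes_per_edge = 100                # the rough post-compression estimate above
for name, edges in scenarios.items():
    tb = edges * bytes_per_edge / 1e12
    print(f"{name}: ~{edges:.0e} edges, ~{tb:,.0f} TB")
# intelligence: ~5e+12 edges, ~500 TB
# telecom: ~1e+11 edges, ~10 TB
# bioinformatics: ~3e+11 edges, ~30 TB
```

Even the smallest of those comfortably exceeds the RAM of a typical single server, which is the point of this post.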
Recalling that relationship analytics boils down to finding paths and subgraphs, the naive relational approach to such tasks would be: Read more
Terminology: Relationship analytics
This post is part of a series on managing and analyzing graph data. Posts to date include:
- Graph data model basics
- Relationship analytics definition (this post)
- Relationship analytics applications
- Analysis of large graphs
In late 2005, I encountered a company called Cogito that was using a graph-based data manager to analyze relationships. They called this “relational analytics”, which I thought was a terrible name for something that they were trying to claim should NOT be done in a relational DBMS. On the spot, I coined relationship analytics as an alternative. A business relationship ensued, which included a short white paper. Cogito didn’t do so well, however, and for a while the term “relationship analytics” faltered too. But recently it’s made a bit of a comeback, having been adopted by Objectivity, QlikTech, Yarcdata, and others.
“Relationship analytics” is not a perfect name, both because it’s longish and because it might over-connote a social-network focus. But then, no other term would be perfect either. So we might as well stick with it.
In that case, “relationship analytics” could use an actual definition, preferably one a little heftier than just:
Analytics on graphs.