Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:
- The use of analytic technologies to study web and network event data
Cassandra’s reputation in many quarters is:
- World-leading in the geo-distribution feature.
- Impressively scalable.
- Hard to use.
This has led competitors to use, and get away with, sales claims along the lines of “Well, if you really need geo-distribution and can’t wait for us to catch up — which we soon will! — you should use Cassandra. But otherwise, there are better choices.”
My friends at DataStax, naturally, don’t think that’s quite fair. And so I invited them — specifically Billy Bosworth and Patrick McFadin — to educate me. Here are some highlights of that exercise.
DataStax and Cassandra have some very impressive accounts, which don’t necessarily revolve around geo-distribution. Netflix, probably the flagship Cassandra user — since Cassandra inventor Facebook adopted HBase instead — actually hasn’t been using the geo-distribution feature. Confidential accounts include:
- A petabyte or so of data at a very prominent company, geo-distributed, with 800+ nodes, in a kind of block storage use case.
- A messaging application at a very prominent company, anticipated to grow to multiple data centers and a petabyte of so of data, across 1000s of nodes.
- A 300 terabyte single-data-center telecom account (which I can’t find on DataStax’s extensive customer list).
- A huge health records deal.
- A Fortune 10 company.
DataStax and Cassandra won’t necessarily win customer-brag wars versus MongoDB, Couchbase, or even HBase, but at least they’re strongly in the competition.
DataStax claims that simplicity is now a strength. There are two main parts to that surprising assertion. Read more
|Categories: Cassandra, Clustering, Couchbase, Data models and architecture, DataStax, Facebook, HBase, Health care, Log analysis, Market share and customer counts, MongoDB and 10gen, NoSQL, Petabyte-scale data management, Specific users||9 Comments|
Glassbeam checked in recently, and they turn out to exemplify quite a few of the themes I’ve been writing about. For starters:
- Glassbeam has an analytic technology stack focused on poly-structured machine-generated data.
- Glassbeam partially organizes that data into event series …
- … in a schema that is modified as needed.
Glassbeam basics include:
- Founded in 2009.
- Based in Santa Clara. Back-end engineering in Bangalore.
- $6 million in angel money; no other VC.
- High single-digit customer count, …
- … plus another high single-digit number of end customers for an OEM offering a limited version of their product.
All Glassbeam customers except one are SaaS/cloud (Software as a Service), and even that one was only offered a subscription (as oppose to perpetual license) price.
So what does Glassbeam’s technology do? Glassbeam says it is focused on “machine data analytics,” specifically for the “Internet of Things”, which it distinguishes from IT logs.* Specifically, Glassbeam sells to manufacturers of complex devices — IT (most of its sales so far ), medical, automotive (aspirational to date), etc. — and helps them analyze “phone home” data, for both support/customer service and marketing kinds of use cases. As of a recent release, the Glassbeam stack can: Read more
Hortonworks did a business-oriented round of outreach, talking with at least Derrick Harris and me. Notes from my call — for which Rob Bearden* didn’t bother showing up — include, in no particular order:
- Hortonworks denies advanced acquisition discussions with either Microsoft and Intel. Of course, that doesn’t exactly contradict the widespread story of Intel having made an acquisition offer.
- As vendors usually do, Hortonworks denies the extreme forms of Cloudera’s suggestion that Hortonworks competitive wins relate to price slashing. But Hortonworks does believe that its license fees often wind up being lower than Cloudera’s, due especially to Hortonworks offering few extra-charge items than Cloudera.
- Hortonworks used a figure of ~75 subscription customers. This does not include OEM sales through, for example, Teradata, Microsoft Azure, or Rackspace. However, that does include …
- … a small number of installations hosted in the cloud — e.g. ~2 on Amazon Web Services — or otherwise remotely. Also, testing in the cloud seems to be fairly frequent, and the cloud can also be a source of data ingested into Hadoop.
- Since Hortonworks a couple of times made it seem that Rackspace was an important partner, behind only Teradata and Microsoft, I finally asked why. Answers boiled down to a Rackspace Hadoop-as-a-service offering, plus joint work to improve Hadoop-on-OpenStack.
- Other Hortonworks reseller partners seem more important in terms of helping customers consumer HDP (Hortonworks Data Platform), rather than for actually doing Hortonworks’ selling for it. (This is unsurprising — channel sales rarely are a path to success for a product that is also appropriately sold by a direct force.)
- Hortonworks listed its major industry sectors as:
- Web and retailing, which it identifies as one thing.
- Health care (various subsectors).
- Financial services, which it called “competitive” in the kind of tone that usually signifies “we lose a lot more than we win, and would love to change that”.
*Speaking of CEO Bearden, an interesting note from Derrick’s piece is that Bearden is quoted as saying “I started this company from day one …”, notwithstanding that the now-departed Eric Baldeschwieler was founding CEO.
In Hortonworks’ view, Hadoop adopters typically start with a specific use case around a new type of data, such as clickstream, sensor, server log, geolocation, or social. Read more
Perhaps we should remind ourselves of the many ways data models can be caused to churn. Here are some examples that are top-of-mind for me. They do overlap a lot — and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I’ve written about dynamic schemas for the past several years..
Just to confuse things further — some of these examples show the importance of RDBMS, while others highlight the relational model’s limitations.
The old standbys
Product and service changes. Simple changes to your product line many not require any changes to the databases recording their production and sale. More complex product changes, however, probably will.
A big help in MCI’s rise in the 1980s was its new Friends and Family service offering. AT&T couldn’t respond quickly, because it couldn’t get the programming done, where by “programming” I mainly mean database integration and design. If all that was before your time, this link seems like a fairly contemporaneous case study.
Organizational changes. A common source of hassle, especially around databases that support business intelligence or planning/budgeting, is organizational change. Kalido’s whole business was based on accommodating that, last I checked, as were a lot of BI consultants’. Read more
|Categories: Data warehousing, Derived data, Kalido, Log analysis, Software as a Service (SaaS), Specific users, Text, Web analytics||2 Comments|
I talk with a lot of companies, and repeatedly hear some of the same application themes. This post is my attempt to collect some of those ideas in one place.
1. So far, the buzzword of the year is “real-time analytics”, generally with “operational” or “big data” included as well. I hear variants of that positioning from NewSQL vendors (e.g. MemSQL), NoSQL vendors (e.g. AeroSpike), BI stack vendors (e.g. Platfora), application-stack vendors (e.g. WibiData), log analysis vendors (led by Splunk), data management vendors (e.g. Cloudera), and of course the CEP industry.
Yeah, yeah, I know — not all the named companies are in exactly the right market category. But that’s hard to avoid.
Why this gold rush? On the demand side, there’s a real or imagined need for speed. On the supply side, I’d say:
- There are vast numbers of companies offering data-management-related technology. They need ways to differentiate.
- Doing analytics at short-request speeds is an obvious data-management-related challenge, and not yet comprehensively addressed.
2. More generally, most of the applications I hear about are analytic, or have a strong analytic aspect. The three biggest areas — and these overlap — are:
- Customer interaction
- Network and sensor monitoring
- Game and mobile application back-ends
Also arising fairly frequently are:
- Algorithmic trading
- Risk measurement
- Law enforcement/national security
- Stakeholder-facing analytics
I’m hearing less about quality, defect tracking, and equipment maintenance than I used to, but those application areas have anyway been ebbing and flowing for decades.
The third of my three MySQL-oriented clients I alluded to yesterday is MemSQL. When I wrote about MemSQL last June, the product was an in-memory single-server MySQL workalike. Now scale-out has been added, with general availability today.
MemSQL’s flagship reference is Zynga, across 100s of servers. Beyond that, the company claims (to quote a late draft of the press release):
Enterprises are already using distributed MemSQL in production for operational analytics, network security, real-time recommendations, and risk management.
All four of those use cases fit MemSQL’s positioning in “real-time analytics”. Besides Zynga, MemSQL cites penetration into traditional low-latency markets — financial services (various subsectors) and ad-tech.
Highlights of MemSQL’s new distributed architecture start: Read more
|Categories: Clustering, Database compression, Emulation, transparency, portability, Games and virtual worlds, Investment research and trading, Log analysis, MemSQL, MySQL, NewSQL, Transparent sharding, Zynga||6 Comments|
In typical debates, the extremists on both sides are wrong. “SQL vs. NoSQL” is an example of that rule. For many traditional categories of database or application, it is reasonable to say:
- Relational databases are usually still a good default assumption …
- … but increasingly often, the default should be overridden with a more useful alternative.
Reasons to abandon SQL in any given area usually start:
- Creating a traditional relational schema is possible …
- … but it’s tedious or difficult …
- … especially since schema design is supposed to be done before you start coding.
Some would further say that NoSQL is cheaper, scales better, is cooler or whatever, but given the range of NewSQL alternatives, those claims are often overstated.
Sectors where these reasons kick in include but are not limited to: Read more
|Categories: Health care, Investment research and trading, Log analysis, NewSQL, NoSQL, Web analytics||8 Comments|
Business intelligence dashboards are frequently bashed. I slammed them back in 2006 and 2007. Mark Smith dropped the hammer last August. EIS, the most dashboard-like pre-1990s analytic technology, was also the most reviled. There are reasons for this disdain, but even so dashboards shouldn’t be dismissed entirely.
In essence, I’d say:
- Dashboards are overrated and oversold.
- They are useful even so.
- Their usefulness is ebbing as technology advances.
In particular: Read more
I recently proposed a 2×2 matrix of BI use cases:
- Is there an operational business process involved?
- Is there a focus on root cause analysis?
Let me now introduce another 2×2 matrix of analytic scenarios:
- Is there a compelling need for super-fresh data?
- Who’s consuming the results — humans or machines?
My point is that there are at least three different cool things people might think about when they want their analytics to be very fast:
- Fast investigative analytics — e.g., business intelligence with great query response.
- Computations on very fresh data, presented to humans — e.g. “heartbeat” graphics monitoring a network.
- Computations on very fresh data, presented back to a machine — e.g., a recommendation engine that includes makes good use of data about a user’s last few seconds of actions.
There’s also one slightly boring one that however drives a lot of important applications: Read more
|Categories: Business intelligence, Complex event processing (CEP), Games and virtual worlds, Log analysis, Predictive modeling and advanced analytics, Splunk, WibiData||4 Comments|
Informatica, Splunk, and IBM are all public companies, and correspondingly reticent to talk about product futures. Hence, anything I might suggest about product futures from any of them won’t be terribly detailed, and even the vague generalities are “the Good Lord willin’ an’ the creek don’ rise”.
Never let a rising creek overflow your safe harbor.
1. Hadoop can be an awesome ETL (Extract/Transform/Load) execution engine; it can handle huge jobs and perform a great variety of transformations. (Indeed, MapReduce was invented to run giant ETL jobs.) Thus, if one offers a development-plus-execution stack for ETL processes, it might seem appealing to make Hadoop an ETL execution option. And so:
- I’ve already posted that BI-plus-light-ETL vendors Pentaho and Datameer are using Hadoop in that way.
- Informatica will be using Hadoop as an execution option too.
Informatica told me about other interesting Hadoop-related plans as well, but I’m not sure my frieNDA allows me to mention them at all.
IBM, however, is standing aside. Specifically, IBM told me that it doesn’t see the point of doing the same thing, as its ETL engine — presumably derived from the old Ascential product line — is already parallel and performant enough.
2. Last year, I suggested that Splunk and Hadoop are competitors in managing machine-generated data. That’s still true, but Splunk is also preparing a Hadoop co-opetition strategy. To a first approximation, it’s just Hadoop import/export. However, suppose you view Splunk as offering a three-layer stack: Read more
|Categories: EAI, EII, ETL, ELT, ETLT, Hadoop, IBM and DB2, Informatica, Log analysis, MapReduce, Splunk||9 Comments|