Discussion of how data warehousing and analytic technologies are applied to logfile analysis. Related subjects include:
- The use of analytic technologies to study web and network event data
In a short October, 2011 post about Datameer, I wrote:
Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/Transform/Load, without much in the way of T).
That’s all still mainly true, although with the recent Datameer 2.0:
- You can run Datameer and the underlying Hadoop on a desktop or workgroup group.
- There are some infographics pretty-picture-drawing capabilities, which will surely delight those who like vector-based HTML 5 pictures of coffee cups, saucers and macaroons.
- No doubt Datameer has been generally enhanced on multiple fronts.
In essence, Datameer has two positionings.
- One is “OK, you’ve got Hadoop — now wouldn’t you like to do something useful with it?” That can include both business intelligence and ETL.
- Beyond that, Datameer founder/CEO Stefan Groschupf’s core argument is that schema-on-read is really, really useful, even at the cost of absorbing a potentially large performance hit. In other words, he’s making a case for a form of non-relational BI.
|Categories: Business intelligence, Data models and architecture, Datameer, EAI, EII, ETL, ELT, ETLT, Hadoop, Log analysis, Market share and customer counts, Web analytics||7 Comments|
I previously dropped a few hints about my clients at Metamarkets, mentioning that they:
- Have built vertical-market analytic platform technology.
- Use a lot of Hadoop.
- Throw good parties. (That’s where the background photo on my Twitter page comes from.)
But while they’re a joy to talk with, writing about Metamarkets has been frustrating, with many hours and pages of wasted of effort. Even so, I’m trying again, in a three-post series:
Much like Workday, Inc., Metamarkets is a SaaS (Software as a Service) company, with numerous tiers of servers and an affinity for doing things in RAM. That’s where most of the similarities end, however, as Metamarkets is a much smaller company than Workday, doing very different things.
Metamarkets’ business is SaaS (Software as a Service) business intelligence, on large data sets, with low latency in both senses (fresh data can be queried on, and the queries happen at RAM speed). As you might imagine, Metamarkets is used by digital marketers and other kinds of internet companies, whose data typically wants to be in the cloud anyway. Approximate metrics for Metamarkets (and it may well have exceeded these by now) include 10 customers, 100,000 queries/day, 80 billion 100-byte events/month (before summarization), 20 employees, 1 popular CEO, and a metric ton of venture capital.
To understand how Metamarkets’ technology works, it probably helps to start by realizing: Read more
Edit: Multiple errors in the post below have been corrected in a follow-on post about DataStax Enterprise and Cassandra.
My client DataStax is announcing DataStax Enterprise 2.0. The big point of the release is that there’s a bunch of stuff integrated together, including at least:
- Cassandra — the NoSQL DBMS, which DataStax sometimes calls “DataStax Server”. Edit: That’s not really a fair criticism of DataStax’s messaging.
- Hadoop MapReduce, which DataStax sometimes calls “Hadoop”. Edit: That is indeed fair.
- Sqoop — the general way to connect relational DBMS to Hadoop, which DataStax sometimes calls “RDBMS integration”.
- Solr — the search-centric Apache project, or big parts of it, which DataStax generally calls either “Solr” or “Solr compatibility”.
- log4j – an Apache project that has something or other to do with logging, or parts of it, which DataStax sometimes calls “log file integration”.
- DataStax OpsCenter — some management tools and so on around Cassandra and the rest of the product line.
DataStax stresses that all this runs on the same cluster, with the same administrative tools and so on. For example, on a single cluster:
- You can manage the interactive data for a web site.
- You can store the logs for that website.
- You can analyze all of the above in Hadoop.
|Categories: Cassandra, Clustering, DataStax, EAI, EII, ETL, ELT, ETLT, Games and virtual worlds, Hadoop, Log analysis, Market share and customer counts, NoSQL, Parallelization, Text, Web analytics||5 Comments|
The most straightforward approach to the applications business is:
- Take general-purpose technology and think through how to apply it to a specific application domain.
- Produce packaged application software accordingly.
However, this strategy is not as successful in analytics as in the transactional world, for two main reasons:
- Analytic applications of that kind are rarely complete.
- Incomplete applications rarely sell well.
I first realized all this about a decade ago, after Henry Morris coined the term analytic applications and business intelligence companies thought it was their future. In particular, when Dave Kellogg ran marketing for Business Objects, he rattled off an argument to the effect that Business Objects had generated more analytic app revenue over the lifetime of the company than Cognos had. I retorted, with only mild hyperbole, that the lifetime numbers he was citing amounted to “a bad week for SAP”. Somewhat hoist by his own petard, Dave quickly conceded that he agreed with my skepticism, and we changed the subject accordingly.
Reasons that analytic applications are commonly less complete than the transactional kind include: Read more
|Categories: Business intelligence, Business Objects, Data mart outsourcing, Investment research and trading, Log analysis, Metamarkets and Druid, Oracle, SAP AG, SAS Institute, Web analytics, WibiData||16 Comments|
I talked with the Sumo Logic folks for an hour Thursday. Highlights included:
- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events.
- There’s some kind of streaming engine in the mix, to update counters and drive alerts.
- Sumo Logic has around 30 “customers,” free (mainly) or paying (around 5) as the case may be.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
- When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.
- Sumo Logic recently raised a bunch of venture capital.
- Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
- Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.
What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more
|Categories: Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Software as a Service (SaaS), Text||6 Comments|
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users — especially the non-IT ones — really want to get Splunk information on the tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.
|Categories: Business intelligence, Data models and architecture, Data warehousing, Log analysis, Specific users, Splunk, Structured documents, Web analytics||3 Comments|
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
|Categories: Cassandra, Data models and architecture, Data warehousing, Exadata, Facebook, Google, Hadoop, HBase, Log analysis, Market share and customer counts, MarkLogic, NewSQL, NoSQL, Oracle, Splunk, Yahoo||9 Comments|
As I’ve noted before, the very big web companies have an issue with nested data structures. The subject came up in XLDB talks yesterday too, so my big goal for lunch was to finally understand what was being talked about. Sitting at a table full of eBay and LinkedIn folks turned out to be a good tactic.
The explanation was led by Oliver Ratzesberger, late of eBay* and progenitor of eBay’s Singularity project. In simplest terms, one event can spawn a lot of event attribute information, perhaps in the form of name-value pairs, which it then makes sense to store together in some way. The example Oliver dwelled on was that, on any given web page, there can be 100+ pieces of information to record, including:
- All 50 search results you were shown, and their positions in the search rankings.
- Every ad, image, or graphical element.
- An ID as to which test you were participating in (every page you see on eBay has some element being tested).
*Oliver is leaving eBay for a still-secret large company. I would conjecture that Michael McIntire is on the move too, either to replace Oliver or to go with him, but Oliver did a very good job of not commenting on the matter.
There are several reasons why one might wish to store this information in ways that grieve relational purists. First, reconstructing all this information via joins would be brutally expensive. What’s more, reconstructing all this information via joins could be impractical. Some comes from third party ad servers, which might not reproduce the same ads upon demand. Other is in the form of rankings, which can’t always be reliably reproduced from one query to the next. (That’s just one of several reasons text search and relational DBMS are an awkward fit.)
Also, there’s a strong dynamic schema flavor to these databases. The list of attributes for one web click might be very different in kind from the list for the next page. Forcing that kind of variability into a fixed relational schema, while theoretically possible, doesn’t necessarily make a lot of sense.
|Categories: Data models and architecture, Data warehousing, eBay, Log analysis, Web analytics||5 Comments|
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
|Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text||2 Comments|
I visited California recently, and chatted with numerous companies involved in Hadoop — Cloudera, Hortonworks, MapR, DataStax, Datameer, and more. I’ll defer further Hadoop technical discussions for now — my target to restart them is later this month — but that still leaves some other issues to discuss, namely adoption and partnering.
The total number of enterprises in the world paying subscription and license fees that they would regard as being for “Hadoop or something Hadoop-related” probably is not much over 100 right now, but I’d expect to see pretty rapid growth. Beyond that, let’s divide customers into three groups:
- Internet businesses.
- Traditional enterprises ‘ internet operations.
- Traditional enterprises’ other operations.
Hadoop vendors, in different mixes, claim to be doing well in all three segments. Even so, almost all use cases involve some kind of machine-generated data, with one exception being a credit card vendor crunching a large database of transaction details. Multiple kinds of machine-generated data come into play — web/network/mobile device logs, financial trade data, scientific/experimental data, and more. In particular, pharmaceutical research got some mentions, which makes sense, in that it’s one area of scientific research that actually enjoys fat for-profit research budgets.
|Categories: Cloudera, Hadoop, Health care, Hortonworks, Investment research and trading, Log analysis, MapR, MapReduce, Market share and customer counts, Scientific research, Web analytics||5 Comments|