Analytic technologies
Discussion of technologies related to information query and analysis. Related subjects include:
- Business intelligence
- Data warehousing
- (in Text Technologies) Text mining
- (in The Monash Report) Data mining
- (in The Monash Report) General issues in analytic technology
Sumo Logic and UIs for text-oriented data
I talked with the Sumo Logic folks for an hour Thursday. Highlights included:
- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events.
- There’s some kind of streaming engine in the mix, to update counters and drive alerts.
- Sumo Logic has around 30 “customers,” most of them free and around 5 of them paying.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
- When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well. (A quick illustration follows this list.)
- Sumo Logic recently raised a bunch of venture capital.
- Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
- Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.
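To illustrate the compression point (the illustration is mine, not Sumo Logic's, and the log lines are made up): repetitive log text squeezes down dramatically under even a generic compressor. A minimal Python sketch:

```python
import gzip
import random

# Generate synthetic, repetitive web-server-style log lines
# (made-up data, purely to show how well log text compresses).
lines = []
for i in range(100_000):
    lines.append(
        f'10.0.{random.randint(0, 255)}.{random.randint(0, 255)} - - '
        f'[12/Feb/2012:10:{i % 60:02d}:{i % 60:02d} +0000] '
        f'"GET /index.html HTTP/1.1" 200 {random.randint(200, 5000)}\n'
    )
raw = "".join(lines).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw:     {len(raw) / 1e6:.1f} MB")
print(f"gzipped: {len(compressed) / 1e6:.1f} MB")
print(f"ratio:   {len(raw) / len(compressed):.0f}x")
```

On text this repetitive, double-digit compression ratios are routine, which presumably takes much of the sting out of shipping logs across the WAN.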
What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more
| Categories: Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Software as a Service (SaaS), Text | 7 Comments |
Comments on the 2012 Forrester Wave: Enterprise Hadoop Solutions
Forrester has released its Q1 2012 Forrester Wave: Enterprise Hadoop Solutions. (Googling turns up a direct link, but in case that doesn’t prove stable, here also is a registration-required link from IBM’s Conor O’Mahony.) My comments include:
- The Forrester Wave’s relative vendor rankings are meaningless, in that the document compares apples, peaches, almonds, and peanuts. Apparently, it covers any vendor that incorporates a distribution of Apache Hadoop MapReduce into something it offers, and that supplied at least two (not necessarily full production) references for same.
- The Forrester Wave for “enterprise Hadoop” contradicts itself on the subject of Hortonworks.
- The Forrester Wave for “enterprise Hadoop” is correct when it says “Hortonworks … has Hadoop training and professional services offerings that are still embryonic.”
- Peculiarly, the Forrester Wave for “enterprise Hadoop” also says “Hortonworks offers an impressive Hadoop professional services portfolio”. Hortonworks will likely win one or more nice partnership deals with vendors in adjacent fields, but even so its professional services capabilities are … well, a good word might be “embryonic”.
- Forrester Waves always seem to have weird implicit definitions of “data warehousing”. This one is no exception.
- Forrester gave top marks in “Functionality” to 11 of 13 “enterprise Hadoop” vendors. This seems odd.
- I don’t know why MapR, which doesn’t like HDFS (Hadoop Distributed File System), got top marks in “Subproject integration”.
- Forrester gave top marks in “Storage” to Datameer. It also gave higher marks to MapR than to EMC Greenplum, even though EMC Greenplum’s technology is a superset of MapR’s. Very strange. (Edit: Actually, as per a comment below, there is some uncertainty about the EMC/MapR relationship.)
- Forrester gave higher marks in “Acceleration and optimization” to Hortonworks than to Cloudera and IBM, and higher marks yet to Pentaho. Very odd.
- I’m not sure what Forrester is calling a “Distributed EDW file store connector”, but it sounds like something that Cloudera has provided via partnership to a number of analytic DBMS vendors.
- Forrester’s “Strategy” rankings seem to correlate to a metric of “We’re a large enough vendor to go in N directions at once”, for various values of N.
- Forrester is correct to rank Cloudera’s “Adoption” as being stronger than EMC/Greenplum’s or MapR’s. But Hortonworks’ strong mark for “Adoption” baffles me.
| Categories: Cloudera, Data warehousing, EMC, Greenplum, Hadoop, Hortonworks, MapR, MapReduce, Pentaho | 11 Comments |
Departmental analytics — best practices
I believe IT departments should support and encourage departmental analytics efforts, where “support” and “encourage” are not synonyms for “control”, “dominate”, “overwhelm”, or even “tame”. A big part of that is:
Let, and indeed help, departments have the data they want, when they want it, served with blazing performance.
Three things that absolutely should NOT be obstacles to these ends are:
- Corporate DBMS standards.
- Corporate data governance processes.
- The difficulties of ETL.
| Categories: Business intelligence, Data mart outsourcing, Data warehousing, EAI, EII, ETL, ELT, ETLT, Predictive modeling and advanced analytics | 4 Comments |
Microsoft SQL Server 2012 and enterprise database choices in general
Microsoft is launching SQL Server 2012 on March 7. An IM chat with a reporter resulted, and went something like this.
Reporter: [Care to comment]?
CAM: SQL Server is an adequate product if you don’t mind being locked into the Microsoft stack. For example, the ColumnStore feature is very partial, given that it can’t be updated; but Oracle doesn’t have columnar storage at all.
Reporter: Is the lock-in overall worse than IBM DB2, Oracle?
CAM: Microsoft locks you into an operating system, so yes.
Reporter: Is this release something larger Oracle or IBM shops could consider as a lower-cost alternative or in a co-habitation scenario, in the event they’re mulling whether to buy more Oracle or IBM licenses?
CAM: If they have a strong Microsoft-stack investment already, sure. Otherwise, why?
Reporter: [How about] just cost?
CAM: DB2 works just as well to keep Oracle honest as SQL Server does, and without a major operating system commitment. For analytic databases you want an analytic DBMS or appliance anyway.
Best is to have one major vendor of OLTP/general-purpose DBMS, a web DBMS, a DBMS for disposable projects (that may be the same as one of the first two), plus however many different analytic data stores you need to get the job done.
By “web DBMS” I mean MySQL, NewSQL, or NoSQL. Actually, you might need more than one product in that area.
| Categories: Data warehousing, IBM and DB2, Microsoft and SQL*Server, Mid-range, MySQL, NoSQL, Oracle | 9 Comments |
Departmental analytics — general observations
Department-level adoption of analytic technology isn’t the exception; it’s the norm. Reasons include:
- Many analytic challenges are inherently departmental.
- In many cases, central IT control of analytics isn’t needed.
- Departments move ahead without central approval or involvement because they can.
That said, arguments for centralizing analytic technology include:
- A lot of data is used by more than one department, for example:
- Financial transactions (one or more affected departments and also the central accounting group).
- Web logs (marketing and IT/web operations).
- Departments may not have the requisite technical expertise (and it may be redundant/cost-ineffective for them to acquire it).
What’s more, there are IT best practices to support department-level analytics. Some of the key ones boil down to:
- Be flexible in your analytic DBMS support.
- Be responsive to requests for ETL.
My conclusion is that central IT should encourage (and aid) departmental analytics. Let’s look at some details.
| Categories: Data warehousing | Leave a Comment |
KXEN clarifies its story
I frequently badger my clients to tell their story in the form of a company blog, where they can say what needs saying without being restricted by the rules of other formats. KXEN actually listened, and put up a pair of CTO posts that make the company story a lot clearer.
Excerpts from the first post include (with minor edits for formatting, including added emphasis):
Back in 1995, Vladimir Vapnik … changed the machine learning game with his new ‘Statistical Learning Theory’: he provided the machine learning guys with a mathematical framework that allowed them finally to understand, at the core, why some techniques were working and some others were not. All of a sudden, a new realm of algorithms could be written that would use mathematical equations instead of engineering data science tricks (don’t get me wrong here: I am an engineer at heart and I know the value of “tricks,” but tricks cannot overcome the drawbacks of a bad mathematical framework). Here was a foundation for automated data mining techniques that would perform as well as the best data scientists deploying these tricks. Luck is not enough though; it was because we knew a lot about statistics and machine learning that we were able to decipher the nuggets of gold in Vladimir’s theory.
| Categories: KXEN, Predictive modeling and advanced analytics | 1 Comment |
Splunk update
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days (a minimal sketch of the idea appears at the end of this post). More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users — especially the non-IT ones — really want to get Splunk information on their tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.
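For readers who haven't run into them: a Bloom filter is a compact bit-array summary of a set that can answer “definitely not here” without touching the underlying data, at the cost of occasional false positives. That makes it a natural way to skip buckets or files during a search. Here's a toy Python sketch of the idea (mine, not a description of Splunk's implementation):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: compact set-membership testing with some
    false positives but no false negatives."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive several bit positions from independent-ish hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
for term in ("error", "timeout"):
    bf.add(term)

print(bf.might_contain("error"))     # True
print(bf.might_contain("segfault"))  # Almost certainly False
```

The payoff for a search-oriented system is that a query can skip any bucket whose filter rules the term out, rather than scanning its index.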
| Categories: Business intelligence, Data models and architecture, Data warehousing, Log analysis, Specific users, Splunk, Structured documents, Web analytics | 3 Comments |
Big data terminology and positioning
Recently, I observed that Big Data terminology is seriously broken. It is reasonable to reduce the subject to two quasi-dimensions:
- Bigness — Volume, Velocity, size
- Structure — Variety, Variability, Complexity
given that
- High-velocity “big data” problems are usually high-volume as well.*
- Variety, variability, and complexity all relate to the simply-structured/poly-structured distinction.
But the conflation should stop there.
*Low-volume/high-velocity problems are commonly referred to as “event processing” and/or “streaming”.
When people claim that bigness and structure are the same issue, they oversimplify into mush. So I think we need four pieces of terminology, reflective of a 2×2 matrix of possibilities. For want of better alternatives, my suggestions are:
- Relational big data is data of high volume that fits well into a relational DBMS.
- Multi-structured big data is data of high volume that doesn’t fit well into a relational DBMS. Alternative: Poly-structured big data.
- Conventional relational data is data of not-so-high volume that fits well into a relational DBMS. Alternatives: Ordinary/normal/smaller relational data.
- Smaller poly-structured data is data for which dynamic schema capabilities are important, but which doesn’t rise to “big data” volume.
Some issues in business intelligence
In November I wrote two parts of a planned multi-post series on issues in analytic technology. Then I got caught up in year-end things and didn’t blog for a month. Well … Happy New Year! I’m back. Let’s survey a few BI-related topics.
Mobile business intelligence — real business value or just a snazzy demo?
I discussed some mobile BI use cases in July 2010, but I’m still not convinced the whole area is a legitimate big deal. BI has a long history of snazzy, senior-exec-pleasing demos that have little to do with substantive business value. For now, I think mobile BI is another of those; few people will gain deep analytic insights staring into their iPhones. I don’t see anything coming that’s going to change the situation soon.
BI-centric collaboration — real business value or just a snazzy demo?
I’m more optimistic about collaborative business intelligence. QlikView’s direct sharing of dashboards will, I think, be a feature competitors must and will imitate. Social media BI collaboration is still in the “mainly a demo” phase, but I think it meets a broader and deeper need than does mobile BI. Over the next few years, I expect numerous enterprises to establish strong cultures of analytic chatter (and then give frequent talks about same at industry conferences). Read more
| Categories: Business intelligence, Business Objects, Gooddata, PivotLink, Software as a Service (SaaS) | 10 Comments |
Agile predictive analytics – the heart of the matter
I’ve already suggested that several apparent issues in predictive analytic agility can be dismissed by straightforwardly applying best-of-breed technology, for example in analytic data management. At first blush, the same could be said about the actual analysis, which comprises:
- Data preparation, which is tedious unless you do a good job of automating it.
- Running the actual algorithms. (A minimal sketch of both steps follows this list.)
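For concreteness, here is a minimal Python sketch of those two steps, using pandas and scikit-learn. The file name, column names, and modeling choice are all hypothetical, and real-world data preparation is of course far messier than this:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# --- Step 1: data preparation (the tedious part) ---
# "customers.csv", "churned", and the other column names are hypothetical.
df = pd.read_csv("customers.csv")
df = df.dropna(subset=["churned"])                   # drop rows missing the target
df["tenure_months"] = df["tenure_months"].fillna(0)  # crude imputation
df = pd.get_dummies(df, columns=["plan_type"])       # encode a categorical feature

features = df.drop(columns=["churned", "customer_id"])
target = df["churned"]

# --- Step 2: running the actual algorithm ---
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```

As the next paragraph notes, plenty of products handle the second block; the stronger vendor claims tend to be about automating the first.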
Numerous statistical software vendors (or open source projects) help you with the second part; some make strong claims in the first area as well (e.g., my clients at KXEN). Even so, large enterprises typically have statistical silos, commonly featuring expensive annual SAS licenses and seemingly slow-moving SAS programmers.
As I see it, the predictive analytics workflow goes something like this: Read more
