Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data. Related subjects include:
In a general pontification on positioning, I wrote:
every product in a category is positioned along the same set of attributes,
and went on to suggest that summary attributes were more important than picky detailed ones. So how does that play out for investigative analytics?
First, summary attributes that matter for almost any kind of enterprise software include:
- Performance and scalability. I write about analytic performance and scalability a lot. Usually that’s in the context of analytic DBMS, but it also arises in analytic stacks such as Platfora, Metamarkets or even QlikView, and also in the challenges of making predictive modeling scale.
- Reliability, availability and security.* This is more crucial for short-request applications than analytic ones, but even your analytic systems shouldn’t leak data or crash.
- Goodness of fit with legacy systems. I hate that one, because enterprises often sacrifice way too much in favor of that benefit.
- Price. Duh.
*I picked up that phrase when — abbreviated as RAS — it was used to characterize the emphasis for Oracle 8. I like it better than a general and ambiguous concept of “enterprise-ready”.
The reason I’m writing this post, however, is to call out two summary attributes of special importance in investigative analytics — which regrettably which often conflict with each other — namely:
- Agility. People don’t want to submit requests for reports or statistical analyses; they want to get answers as soon as the questions come to mind.
- Completeness of feature set — for a particular use case, that is. There’s no such thing as an investigative analytics offering with a feature set that’s close to complete for all purposes; even SAS, IBM and other behemoths fall short.
Much of what I work on boils down to those two subjects. For example: Read more
|Categories: Aster Data, Business intelligence, Data warehousing, KXEN, Predictive modeling and advanced analytics, SAS Institute, Teradata||9 Comments|
I’ve suggested in the past, approximately, that the platform technology side of business intelligence is more significant than the user interface. That formulation, however, doesn’t exactly capture what I believe. To be more precise, let’s differentiate between a couple aspects of business intelligence UI.
It might seem that a lot of the action in business intelligence revolves around ever-better visualization. After all, Tableau is clearly identified as a visualization-centric technology; who’s hotter than Tableau? And numerous other vendors talk of “visualizations” too. But I don’t think that’s exactly right — rather, I see navigation as being a much bigger deal. And unlike most pure visualization, navigation usually depends strongly on underlying platform capabilities.
Examples of what I mean by innovative navigation — all of which have been developed or have gained prominence over the past decade or so — include:
- QlikView’s core behavior — all that associative navigation.
- QlikView’s collaboration, and every other BI collaboration capability I know of.
- ClearStory, although you won’t get to see what I mean until the launch next month.
- BI search or faceted-search UIs. (E.g. Endeca.)
- BI that is launched from operational applications.
Oracle announced its in-memory columnar option Sunday. As usual, I wasn’t briefed; still, I have some observations. For starters:
- Oracle, IBM (Edit: See the rebuttal comment below), and Microsoft are all doing something similar …
- … because it makes sense.
- The basic idea is to take the technology that manages indexes — which are basically columns+pointers — and massage it into an actual column store. However …
- … the devil is in the details. See, for example, my May post on IBM’s version, called BLU, outlining all the engineering IBM did around that feature.
- Notwithstanding certain merits of this approach, I don’t believe in complete alternatives to analytic RDBMS. The rise of analytic DBMS oriented toward multi-structured data just strengthens that point.
I’d also add that Larry Ellison’s pitch “build columns to avoid all that index messiness” sounds like 80% bunk. The physical overhead should be at least as bad, and the main saving in administrative overhead should be that, in effect, you’re indexing ALL columns rather than picking and choosing.
Anyhow, this technology should be viewed as applying to traditional business transaction data, much more than to — for example — web interaction logs, or other machine-generated data. My thoughts around that distinction start:
- I argued back in 2011 that traditional databases will wind up in RAM, basically because …
- … Moore’s Law will make it ever cheaper to store them there.
- Still, cheaper != cheap, so this is a technology only to use with you most valuable data — i.e., that transactional stuff.
- These are very tabular technologies, without much in the way of multi-structured data support.
|Categories: Columnar database management, Data warehousing, IBM and DB2, Memory-centric data management, Microsoft and SQL*Server, OLTP, Oracle, SAP AG, Workday||5 Comments|
Two years ago I wrote about how Zynga managed analytic data:
Data is divided into two parts. One part has a pretty ordinary schema; the other is just stored as a huge list of name-value pairs. (This is much like eBay‘s approach with its Teradata-based Singularity, except that eBay puts the name-value pairs into long character strings.) … Zynga adds data into the real schema when it’s clear it will be needed for a while.
What was then the province of a few huge web companies is now poised to be a broader trend. Specifically:
- Relational DBMS are adding or enhancing their support for complex datatypes, to accommodate various kinds of machine-generated data.
- MongoDB-compatible JSON is the flavor of the day on the short-request side, but alternatives include other JSON, XML, other key-value, or text strings.
- It is often possible to index on individual attributes inside the complex datatype.
- The individual attributes inside the complex datatypes amount to virtual columns, which can play similar roles in SQL statements as physical columns do.
- Over time, the DBA may choose to materialize virtual columns as additional physical columns, to boost query performance.
That migration from virtual to physical columns is what I’m calling “schema-on-need”. Thus, schema-on-need is what you invoke when schema-on-read no longer gets the job done.
|Categories: Data models and architecture, Data warehousing, MongoDB and 10gen, PostgreSQL, Schema on need, Structured documents||9 Comments|
- Stores CDRs (Call Detail Records), many or all of which are collected via …
- … some kind of back door into the AT&T switches that many carriers use. (See Slide 2.)
- Has also included “subscriber information” for AT&T phones since July, 2012.
- Contains “long distance and international” CDRs back to 1987.
- Currently adds 4 billion CDRs per day.
- Is administered by a Federal drug-related law enforcement agency but …
- … is used to combat many non-drug-related crimes as well. (See Slides 21-26.)
Other notes include:
- The agencies specifically mentioned on Slide 16 as making numerous Hemisphere requests are the DEA (Drug Enforcement Agency) and DHS (Department of Homeland Security).
- “Roaming” data giving city/state is mentioned in the deck, but more precise geo-targeting is not.
I’ve never gotten a single consistent figure, but typical CDR size seems to be in the 100s of bytes range. So I conjecture that Project Hemisphere spawned one of the first petabyte-scale databases ever.
Hemisphere Project unknowns start: Read more
|Categories: Data warehousing, GIS and geospatial, Petabyte-scale data management, Specific users, Surveillance and privacy, Telecommunications||Leave a Comment|
As is the case for most important categories of technology, discussions of BI can get confused. I’ve remarked in the past that there are numerous kinds of BI, and that the very origin of the term “business intelligence” can’t even be pinned down to the nearest century. But the most fundamental confusion of all is that business intelligence technology really is two different things, which in simplest terms may be categorized as user interface (UI) and platform* technology. And so:
- The UI aspect is why BI tends to be sold to business departments; the platform aspect is why it also makes sense to sell BI to IT shops attempting to establish enterprise standards.
- The UI aspect is why it makes sense to sell and market BI much as one would applications; the platform aspect is why it makes sense to sell and market BI much as one would database technology.
- The UI aspect is why vendors want to integrate BI with transaction-processing applications; the platform aspect is, I suppose, why they have so much trouble making the integration work.
- The UI aspect is why BI is judged on … well, on snazzy UIs and demos. The platform aspect is a big reason why the snazziest UI doesn’t always win.
*I wanted to say “server” or “server-side” instead of “platform”, as I dislike the latter word. But it’s too inaccurate, for example in the case of the original Cognos PowerPlay, and also in various thin-client scenarios.
Key aspects of BI platform technology can include:
- Query and data management. That’s the area I most commonly write about, for example in the cases of Platfora, QlikView, or Metamarkets. It goes back to the 1990s — notably the Business Objects semantic layer and Cognos PowerPlay MOLAP (MultiDimensional OnLine Analytic Processing) engine — and indeed before that to the report writers and fourth-generation languages of the 1970s. This overlaps somewhat with …
- … data integration and metadata management. Business Objects, Qlik, and other BI vendors have bought data integration vendors. Arguably, there was a period when Information Builders’ main business was data connectivity and integration. And sometimes the main value proposition for a BI deal is “We need some way to get at all that data and bring it together.”
- Security and access control – authentication, authorization, and all the additional As.
- Scheduling and delivery. When 10s of 1000s of desktops are being served, these aren’t entirely trivial. Ditto when dealing with occasionally-connected mobile devices.
|Categories: Business intelligence, Business Objects, ClearStory Data, Cognos, Data warehousing, Endeca, Information Builders, Metamarkets and Druid, MOLAP, Platfora, Predictive modeling and advanced analytics, QlikTech and QlikView||11 Comments|
Some subjects just keep coming up. And so I keep saying things like:
Most generalizations about “Big Data” are false. “Big Data” is a horrific catch-all term, with many different meanings.
Most generalizations about Hadoop are false. Reasons include:
- Hadoop is a collection of disparate things, most particularly data storage and application execution systems.
- The transition from Hadoop 1 to Hadoop 2 will be drastic.
- For key aspects of Hadoop — especially file format and execution engine — there are or will be widely varied options.
Hadoop won’t soon replace relational data warehouses, if indeed it ever does. SQL-on-Hadoop is still very immature. And you can’t replace data warehouses unless you have the power of SQL.
Note: SQL isn’t the only way to provide “the power of SQL”, but alternative approaches are just as immature.
Most generalizations about NoSQL are false. Different NoSQL products are … different. It’s not even accurate to say that all NoSQL systems lack SQL interfaces. (For example, SQL-on-Hadoop often includes SQL-on-HBase.)
I made a remarkably rumpled video appearance yesterday with SiliconAngle honchos John Furrier and Dave Vellante. (Excuses include <3 hours sleep, and then a scrambling reaction to a schedule change.) Topics covered included, with approximate timechecks:
- 0:00 Introductory pabulum, and some technical difficulties
- 2:00 More introduction
- 3:00 Dynamic schemas and data model churn
- 6:00 Surveillance and privacy
- 13:00 Hadoop, especially the distro wars
- 22:00 BI innovation
- 23:30 More on dynamic schemas and data model churn
Edit: Some of my remarks were transcribed.
- I posted on dynamic schemas data model churn a few days ago.
- I capped off a series on privacy and surveillance a few days ago.
- I commented on various Hadoop distributions in June.
|Categories: Business intelligence, ClearStory Data, Data warehousing, Hadoop, MapR, MapReduce, Surveillance and privacy||Leave a Comment|
I chatted yesterday with the Hortonworks gang. The main subject was Hortonworks’ approach to SQL-on-Hadoop — commonly called Stinger — but at my request we cycled through a bunch of other topics as well. Company-specific notes include:
- Hortonworks founder J. Eric “Eric14″ Baldeschwieler is no longer at Hortonworks, although I imagine he stays closely in touch. What he’s doing next is unspecified, except by the general phrase “his own thing”. (Derrick Harris has more on Eric’s departure.)
- John Kreisa still is at Hortonworks, just not as marketing VP. Think instead of partnerships and projects.
- ~250 employees.
- ~70-75 subscription customers.
Our deployment and use case discussions were a little confused, because a key part of Hortonworks’ strategy is to support and encourage the idea of combining use cases and workloads on a single cluster. But I did hear:
- 10ish nodes for a typical starting cluster.
- 100ish nodes for a typical “data lake” committed adoption.
- Teradata UDA (Unified Data Architecture)* customers sometimes (typically?) jumping straight to a data lake scenario.
- A few users in the 10s of 1000s of nodes. (Obviously Yahoo is one.)
- HBase used in >50% of installations.
- Hive probably even more than that.
- Hortonworks is seeing a fair amount of interest in Windows Hadoop deployments.
*By the way — Teradata seems serious about pushing the UDA as a core message.
Ecosystem notes, in Hortonworks’ perception, included:
- Cloudera is obviously Hortonworks’ biggest distro competitor. Next is IBM, presumably in its blue-forever installed base. MapR is barely on the radar screen; Pivotal’s likely rise hasn’t yet hit sales reports.
- Hortonworks evidently sees a lot of MicroStrategy and Tableau, and some Platfora and Datameer, the latter two at around the same level of interest.
- Accumulo is a big deal in the Federal government, and has gotten a few health care wins as well. Its success is all about security. (Note: That’s all consistent with what I hear elsewhere.)
I also asked specifically about OpenStack. Hortonworks is a member of the OpenStack project, contributes nontrivially to Swift and other subprojects, and sees Rackspace as an important partner. But despite all that, I think strong Hadoop/OpenStack integration is something for the indefinite future.
Hortonworks’ views about Hadoop 2.0 start from the premise that its goal is to support running a multitude of workloads on a single cluster. (See, for example, what I previously posted about Tez and YARN.) Timing notes for Hadoop 2.0 include:
- It’s been in preview/release candidate/commercial beta mode for weeks.
- Q3 is the goal; H2 is the emphatic goal.
- Yahoo’s been in production with YARN >8 months, and has no MapReduce 1 clusters left. (Yahoo has >35,000 Hadoop nodes.)
- The last months of delays have been mainly about sprucing up various APIs and protocols, which may need to serve for a similar multi-year period as Hadoop 1’s have. But there also was some YARN stabilization into May.
Frankly, I think Cloudera’s earlier and necessarily incremental Hadoop 2 rollout was a better choice than Hortonworks’ later big bang, even though the core-mission aspect of Hadoop 2.0 is what was least ready. HDFS (Hadoop Distributed File System) performance, NameNode failover and so on were well worth having, and it’s more than a year between Cloudera starting supporting them and when Hortonworks is offering Hadoop 2.0.
Hortonworks’ approach to doing SQL-on-Hadoop can be summarized simply as “Make Hive into as good an analytic RDBMS as possible, all in open source”. Key elements include: Read more
Perhaps we should remind ourselves of the many ways data models can be caused to churn. Here are some examples that are top-of-mind for me. They do overlap a lot — and the whole discussion overlaps with my post about schema complexity last January, and more generally with what I’ve written about dynamic schemas for the past several years..
Just to confuse things further — some of these examples show the importance of RDBMS, while others highlight the relational model’s limitations.
The old standbys
Product and service changes. Simple changes to your product line many not require any changes to the databases recording their production and sale. More complex product changes, however, probably will.
A big help in MCI’s rise in the 1980s was its new Friends and Family service offering. AT&T couldn’t respond quickly, because it couldn’t get the programming done, where by “programming” I mainly mean database integration and design. If all that was before your time, this link seems like a fairly contemporaneous case study.
Organizational changes. A common source of hassle, especially around databases that support business intelligence or planning/budgeting, is organizational change. Kalido’s whole business was based on accommodating that, last I checked, as were a lot of BI consultants’. Read more
|Categories: Data warehousing, Derived data, Kalido, Log analysis, Software as a Service (SaaS), Specific users, Text, Web analytics||2 Comments|