Analytic technologies

Discussion of technologies related to information query and analysis. Related subjects include:

Business intelligence
Data warehousing
(in Text Technologies) Text mining
(in The Monash Report) Data mining
(in The Monash Report) General issues in analytic technology

June 20, 2018

Brittleness, Murphy’s Law, and single-impetus failures

In my initial post on brittleness I suggested that a typical process is:

Build something brittle.
Strengthen it over time.

In many engineering scenarios, a fuller description could be:

Design something that works in the base cases.
Anticipate edge cases and sources of error, and design for them too.
Implement the design.
Discover which edge cases and error sources you failed to consider.
Improve your product to handle them too.
Repeat as needed.

So it’s necessary to understand what is or isn’t likely to go wrong. Unfortunately, that need isn’t always met. Read more

Categories: Analytic technologies, Text

5 Comments

June 20, 2018

Brittleness and incremental improvement

Every system, computer or otherwise, needs to deal with possibilities of damage or error. If it does this well, it may be regarded as “robust”, “mature(d)”, “strengthened”, or simply “improved”.* Otherwise, it can reasonably be called “brittle”.

*It’s also common to use the word “harden(ed)”. But I think that’s a poor choice, as brittle things are often also hard.

0. As a general rule in IT:

New technologies and products are brittle.
They are strengthened incrementally over time.

There are many categories of IT strengthening. Two of the broadest are:

Bug-fixing.
Bottleneck Whack-A-Mole.

1. One of my more popular posts stated:

Developing a good DBMS requires 5-7 years and tens of millions of dollars.

The reasons I gave all spoke to brittleness/strengthening, most obviously in:

Those minor edge cases in which your Version 1 product works poorly aren’t minor after all.

Similar things are true for other kinds of “platform software” or distributed systems.

2. The UI brittleness/improvement story starts similarly: Read more

Categories: Analytic technologies, Buying processes, Theory and architecture

1 Comment

May 20, 2018

Some stuff that’s always on my mind

I have a LOT of partially-written blog posts, but am struggling to get any of them finished (obviously). Much of the problem is that they have so many dependencies on each other. Clearly, then, I should consider refactoring my writing plans. 🙂

So let’s start with this. Here, in no particular order, is a list of some things that I’ve said in the past, and which I still think are or should be of interest today. It’s meant to be background for numerous posts I plan to write in the near future, and indeed a few hooks for such posts are included below.

1. Data(base) management technology is progressing pretty much as I expected.

Vendors generally recognize that maturing a data store is an important, many-years-long process.
Multiple kinds of data model are viable …
… but it’s usually helpful to be able to do some kind of JOIN.
To deal with the variety of hardware/network/storage arrangements out there, layering/tiering is on the rise. (An amazing number of vendors each seem to think they invented the idea.)

2. Rightly or wrongly, enterprises are often quite sloppy about analytic accuracy.

My two central examples have long been inaccurate metrics and false-positive alerts.
In predictive analytics, it’s straightforward to quantify how much additional value you’re leaving on the table with your imperfect accuracy.
Enterprise search and other text technologies are still often terrible.
After years of “real-time” overhype, organizations have seemingly swung to under-valuing real-time analytics.
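To make the point about quantifying foregone value concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than data from a real deployment: the names `Outcomes` and `value_left_on_table`, the payoff per correct prediction, the cost per false alarm, and the counts themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Counts from scoring a model against known results (hypothetical numbers)."""
    true_pos: int
    false_pos: int
    false_neg: int
    true_neg: int


def realized_value(o: Outcomes, gain_per_hit: float, cost_per_false_alarm: float) -> float:
    """Value actually captured: gains on correct positives minus effort wasted on false ones."""
    return o.true_pos * gain_per_hit - o.false_pos * cost_per_false_alarm


def perfect_value(o: Outcomes, gain_per_hit: float) -> float:
    """Value a perfectly accurate model would capture: every real positive, no false alarms."""
    return (o.true_pos + o.false_neg) * gain_per_hit


def value_left_on_table(o: Outcomes, gain_per_hit: float, cost_per_false_alarm: float) -> float:
    """Gap between the perfect-accuracy payoff and what the imperfect model delivers."""
    return perfect_value(o, gain_per_hit) - realized_value(o, gain_per_hit, cost_per_false_alarm)


if __name__ == "__main__":
    # Assumed example: 100 real positives, of which the model catches 80,
    # plus 30 false alarms; each hit is worth 100 and each false alarm costs 10.
    scored = Outcomes(true_pos=80, false_pos=30, false_neg=20, true_neg=870)
    print(value_left_on_table(scored, gain_per_hit=100.0, cost_per_false_alarm=10.0))  # 2300.0
```

The arithmetic is deliberately simple: missed positives and false alarms each get a price, and the sum is the cost of imperfect accuracy. Real payoffs are rarely this linear, so treat the numbers as a starting point for the conversation rather than an answer.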

Categories: Data models and architecture, Database diversity, Predictive modeling and advanced analytics, Public policy, Theory and architecture

5 Comments

December 12, 2017

Notes on artificial intelligence, December 2017

Most of my comments about artificial intelligence in December, 2015 still hold true. But there are a few points I’d like to add, reiterate or amplify.

1. As I wrote back then in a post about the connection between machine learning and the rest of AI,

It is my opinion that most things called “intelligence” — natural and artificial alike — have a great deal to do with pattern recognition and response.

2. Accordingly, it can be reasonable to equate machine learning and AI.

AI based on machine learning frequently works, on more than a toy level. (Examples: Various projects by Google)
AI based on knowledge representation usually doesn’t. (Examples: IBM Watson, 1980s expert systems)
“AI” can be the sexier marketing or fund-raising term.

3. Similarly, it can be reasonable to equate AI and pattern recognition. Glitzy applications of AI include:

Understanding or translation of language (written or spoken as the case may be).
Machine vision or autonomous vehicles.
Facial recognition.
Disease diagnosis via radiology interpretation.

4. The importance of AI and of recent AI advances differs greatly according to application or data category. Read more

Categories: Cloud computing, Predictive modeling and advanced analytics, Public policy, Surveillance and privacy

4 Comments

August 22, 2017

Imanis Data

I talked recently with the folks at Imanis Data. For starters:

The point of Imanis is to make copies of your databases, for purposes such as backup/restore, test/analysis, or compliance-driven archiving. (That’s in declining order of current customer activity.) Another use is migration via restoring to a different cluster than the one that created the data in the first place.
The data can come from NoSQL database managers, from Hadoop, or from Vertica. (Again, that’s in declining order.)
As you might imagine, Imanis makes incremental backups; the only full backup is the first one you do for that database.
“Imanis” is a new name; the previous name was “Talena”.
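As a rough illustration of the incremental idea, the sketch below takes one full backup and then copies only what changed. It is not a description of how Imanis works internally; the function names, the in-memory dictionary standing in for a database, and the use of SHA-256 fingerprints to detect change are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional, Set, Tuple


def fingerprint(value: bytes) -> str:
    """Content hash used to tell whether a record changed since the last backup."""
    return hashlib.sha256(value).hexdigest()


def take_backup(
    current: Dict[str, bytes],
    last_index: Optional[Dict[str, str]],
) -> Tuple[Dict[str, bytes], Set[str], Dict[str, str]]:
    """Return (records to copy, keys deleted since last backup, new fingerprint index).

    With no previous index this is the one full backup; afterwards only new or
    changed records are copied.
    """
    new_index = {key: fingerprint(value) for key, value in current.items()}
    if last_index is None:
        return dict(current), set(), new_index

    changed = {
        key: value
        for key, value in current.items()
        if last_index.get(key) != new_index[key]
    }
    deleted = set(last_index) - set(current)
    return changed, deleted, new_index


if __name__ == "__main__":
    database = {"a": b"1", "b": b"2"}
    full, _, index = take_backup(database, None)          # first backup copies everything
    database["b"] = b"3"
    database["c"] = b"4"
    delta, removed, index = take_backup(database, index)  # later backups copy only the delta
    print(sorted(delta), sorted(removed))
```

Restoring to a different cluster, as in the migration use case, would mean replaying the full backup and then each increment in order.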

Categories: Cassandra, Hadoop, Market share and customer counts, NoSQL, Predictive modeling and advanced analytics, Vertica Systems

1 Comment

August 10, 2017

Notes on data security

1. In June I wrote about burgeoning interest in data security. I’d now like to add:

Even more than I previously thought, demand seems to be driven largely by issues of regulatory compliance.
In an exception to that general rule, many enterprises have vague mandates for data encryption.
In awkward contradiction to that general rule, there’s a general sense that it’s just security’s “turn” to be a differentiating feature, since various other “enterprise” needs are already being well-addressed.

We can reconcile these anecdata pretty well if we postulate that:

Enterprises generally agree that data security is an important need.
Exactly how they meet this need depends upon what regulators choose to require.

2. My current impressions of the legal privacy vs. surveillance tradeoffs are basically: Read more

Categories: Data warehousing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Surveillance and privacy

Analytics on the edge?

There’s a theory going around to the effect that:

Compute power is and will be everywhere, for example in cars, robots, medical devices or microwave ovens. Let’s refer to these platforms collectively as “real-world appliances”.
Much more data will be created on these platforms than can reasonably be sent back to centralized/cloudy servers.
Therefore, cloud-centric architectures will soon be obsolete, perhaps before they’re ever dominant in the first place.

There’s enough truth to all that to make it worth discussing. But the strong forms of the claims seem overblown.

1. This story doesn’t even make sense except for certain new classes of application. Traditional business applications run all over the world, in dedicated or SaaSy modes as the case may be. E-commerce is huge. So is content delivery. Architectures for all those things will continue to evolve, but what we have now basically works.

2. When it comes to real-world appliances, this story is partially accurate. An automobile is a rolling network of custom Linux systems, each running hand-crafted real-time apps, a few of which also have minor requirements for remote connectivity. That’s OK as far as it goes, but there could be better support for real-time operational analytics. If something as flexible as Spark were capable of unattended operation, I think many engineers of real-world appliances would find great ways to use it.

3. There’s a case to be made for something better yet. I think the argument is premature, but it’s worth at least a little consideration. Read more

Categories: Business intelligence, Cloud computing, Data warehousing, Database diversity, Databricks, Spark and BDAS, Log analysis, NoSQL, Predictive modeling and advanced analytics, Streaming and complex event processing (CEP)

2 Comments

June 16, 2017

Generally available Kudu

I talked with Cloudera about Kudu in early May. Besides giving me a lot of information about Kudu, Cloudera also helped confirm some trends I’m seeing elsewhere, including:

Security is an ever bigger deal.
There’s a lot of interest in data warehouses (perhaps really data marts) that are updated in human real-time.
- Prospects for that respond well to the actual term “data warehouse”, at least when preceded by some modifier to suggest that it’s modern/low-latency/non-batch or whatever.
- Flash is often — but not yet always — preferred over disk for that kind of use.
- Sometimes these data stores are greenfield. When they’re migrations, they come more commonly from analytic RDBMS or data warehouse appliances (the most commonly mentioned ones are Teradata, Netezza and Vertica, but that’s perhaps just due to those product lines’ market share), rather than from general-purpose DBMS such as Oracle or SQL Server.
Intel is making it ever easier to vectorize CPU operations, and analytic data managers are increasingly taking advantage of this possibility.

Now let’s talk about Kudu itself. As I discussed at length in September 2015, Kudu is:

A data storage system introduced by Cloudera (and subsequently open-sourced).
Columnar.
Updatable in human real-time.
Meant to serve as the data storage tier for Impala and Spark.

Kudu’s adoption and roll-out story starts: Read more

Categories: Cloudera, Columnar database management, Data warehouse appliances, Data warehousing, Database compression, Databricks, Spark and BDAS, Hadoop, Market share and customer counts, Netezza, NoSQL, Open source, Solid-state memory, SQL/Hadoop integration, Teradata, Vertica Systems

1 Comment

June 14, 2017

The data security mess

A large fraction of my briefings this year have included a focus on data security. This is the first year in the past 35 that that’s been true.* I believe that reasons for this trend include:

Security is an important aspect of being “enterprise-grade”. Other important checkboxes have been largely filled in. Now it’s security’s turn.
A major platform shift, namely to the cloud, is underway or at least being planned for. Security is an important thing to think about as that happens.
The cloud even aside, technology trends have created new ways to lose data, which security technology needs to address.
Traditionally paranoid industries are still paranoid.
Other industries are newly (and rightfully) terrified of exposing customer data.
My clients at Cloudera thought they had a chance to get significant messaging leverage from emphasizing security. So far, it seems that they were correct.

*Not really an exception: I did once make it a project to learn about classic network security, including firewall appliances and so on.

Certain security requirements, desires or features keep coming up. These include (and as in many of my lists, these overlap):

Easy, comprehensive access control. More on this below.
Encryption. If other forms of security were perfect, encryption would never be needed. But they’re not.
Auditing. Ideally, auditing can alert you to trouble before (much) damage is done. If not, then it can at least help you do proactive damage control in the face of a breach.
Whatever regulators mandate.
Whatever is generally regarded as best practices. Security “best practices” generally keep enterprises out of legal and regulatory trouble, or at least minimize same. They also keep employees out of legal and career trouble, or minimize same. Hopefully, they even keep data safe.
Whatever the government is known to use. This is a common proxy for “best practices”.

More specific or extreme requirements include: Read more

Categories: Business intelligence, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, QlikTech and QlikView, Tableau Software

4 Comments

June 14, 2017

Light-touch managed services

Cloudera recently introduced Cloudera Altus, a Hadoop-in-the-cloud offering with an interesting processing model:

Altus manages jobs for you.
But you actually run them on your own cluster, and so you never have to put your data under Altus’ control.

Thus, you avoid a potential security risk (shipping your data to Cloudera’s service). I’ve tentatively named this strategy light-touch managed services, and am interested in exploring how broadly applicable it might or might not be.

For light-touch to be a good approach, there should be (sufficiently) little downside in performance, reliability and so on from having your service not actually control the data. That assumption is trivially satisfied in the case of Cloudera Altus, because it’s not an ordinary kind of app; rather, its whole function is to improve the job-running part of your stack. Most kinds of apps, however, want to operate on your data directly. For those, it is more challenging to meet acceptable SLAs (Service-Level Agreements) on a light-touch basis.
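The split described above can be sketched in a few lines of Python. This is a toy model of the light-touch idea, not a description of how Cloudera Altus is implemented; the class names `ManagedService` and `CustomerCluster`, the `JobSpec`/`JobStatus` records, and the two sample operations are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobSpec:
    job_id: str
    operation: str  # name of an operation the customer's cluster knows how to run


@dataclass(frozen=True)
class JobStatus:
    job_id: str
    state: str           # "succeeded" or "failed"
    rows_processed: int  # metadata only; no data values


class CustomerCluster:
    """Runs jobs next to the data. Data and results never leave this object."""

    def __init__(self, data: List[int]) -> None:
        self._data = data
        self._operations: Dict[str, Callable[[List[int]], int]] = {
            "sum": sum,
            "count": len,
        }
        self.local_results: Dict[str, int] = {}

    def run(self, spec: JobSpec) -> JobStatus:
        operation = self._operations.get(spec.operation)
        if operation is None:
            return JobStatus(spec.job_id, "failed", 0)
        # The result is stored locally; only status metadata is reported back.
        self.local_results[spec.job_id] = operation(self._data)
        return JobStatus(spec.job_id, "succeeded", len(self._data))


class ManagedService:
    """Schedules and tracks jobs, but only ever sees specs and statuses."""

    def __init__(self) -> None:
        self.history: List[JobStatus] = []

    def submit(self, cluster: CustomerCluster, spec: JobSpec) -> JobStatus:
        status = cluster.run(spec)
        self.history.append(status)
        return status


if __name__ == "__main__":
    cluster = CustomerCluster([1, 2, 3])
    service = ManagedService()
    print(service.submit(cluster, JobSpec("j1", "sum")).state)
```

The design choice to notice is what crosses the boundary: job specifications go in and status metadata comes out, while the data and the job results stay with the customer. An app that needs to read the data itself cannot be modeled this way without widening that boundary, which is where the SLA difficulty comes from.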

Let’s back up and consider what “light-touch” for data-interacting apps (i.e., almost all apps) would actually mean. The basics are: Read more

Categories: Cloud computing, Cloudera, Data warehousing, EAI, EII, ETL, ELT, ETLT, Hadoop, Software as a Service (SaaS), Surveillance and privacy

3 Comments

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

