Data warehousing

Analysis of issues in data warehousing, with extensive coverage of database management systems and data warehouse appliances that are optimized to query large volumes of data.

June 8, 2015

Teradata will support Presto

At the highest level:

Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
Teradata is announcing support for Presto with a classic open source pricing model.
Presto will also become, roughly speaking, Teradata’s replacement for Hive.
Teradata’s Presto efforts are being conducted by the former Hadapt.

Now let’s make that all a little more precise.

Regarding Presto (and I got most of this from Teradata):

To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
… Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
Facebook created Hive and, later, Presto.
Facebook started the Presto project in 2012 and now has 10 engineers on it.
Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
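
To make the "federates query results" point concrete, here is a minimal sketch of the general idea. It is not Presto code and does not use any Presto API. The two in-memory SQLite databases are stand-ins I made up for separate data stores, and the table names, column names and the function federated_clicks_by_name are all hypothetical.

```python
import sqlite3


def make_store(ddl, insert_sql, rows):
    """Create an independent in-memory database (a stand-in for one data store)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical store 1: click logs (imagine these living in HDFS).
clicks_store = make_store(
    "CREATE TABLE clicks (user_id INTEGER, url TEXT)",
    "INSERT INTO clicks VALUES (?, ?)",
    [(1, "/home"), (2, "/pricing"), (1, "/docs")],
)

# Hypothetical store 2: user records (imagine these living in an RDBMS).
users_store = make_store(
    "CREATE TABLE users (user_id INTEGER, name TEXT)",
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob")],
)


def federated_clicks_by_name(clicks_conn, users_conn):
    """Query each store separately, then combine the partial results."""
    # Push the aggregation down to the store that holds the clicks.
    per_user = clicks_conn.execute(
        "SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id"
    )
    # Fetch the lookup side from the other store.
    names = dict(users_conn.execute("SELECT user_id, name FROM users"))
    # Join the two result sets in the federating layer.
    return {names.get(uid, f"user {uid}"): n for uid, n in per_user}


result = federated_clicks_by_name(clicks_store, users_store)
print(result)
```

A real federating engine also has to plan which work to push down to each store and handle parallelism and failures. The sketch only shows the shape of the problem.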

Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project: Read more

Categories: Cloudera, Columnar database management, Data integration and middleware, Data pipelining, Data warehousing, Facebook, Hadapt, Hadoop, MapReduce, Market share and customer counts, Open source, Petabyte-scale data management, SQL/Hadoop integration, Teradata, Workload management

19 Comments

May 13, 2015

Notes on analytic technology, May 13, 2015

1. There are multiple ways in which analytics is inherently modular. For example:

Business intelligence tools can reasonably be viewed as application development tools. But the “applications” may be developed one report at a time.
The point of a predictive modeling exercise may be to develop a single scoring function that is then integrated into a pre-existing operational application.
Conversely, a recommendation-driven website may be developed a few pages — and hence also a few recommendations — at a time.

Also, analytics is inherently iterative.

Everything I just called “modular” can reasonably be called “iterative” as well.
So can any work process of the nature “OK, we got an insight. Let’s pursue it and get more accuracy.”

If I’m right that analytics is or at least should be modular and iterative, it’s easy to see why people hate multi-year data warehouse creation projects. Perhaps it’s also easy to see why I like the idea of schema-on-need.

2. In 2011, I wrote, in the context of agile predictive analytics, that

… the “business analyst” role should be expanded beyond BI and planning to include lightweight predictive analytics as well.

I gather that a similar point is at the heart of Gartner’s new term citizen data scientist. I am told that the term resonates with at least some enterprises. Read more

Categories: Business intelligence, Data warehousing, Datameer, Hadoop, Log analysis, Oracle, Platfora, Predictive modeling and advanced analytics, SAS Institute, Software as a Service (SaaS), Tableau Software, Web analytics

2 Comments

April 16, 2015

Notes on indexes and index-like structures

Indexes are central to database management.

My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
… and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
Recently, I’ve had numerous conversations in which indexing strategies played a central role.

Perhaps it’s time for a round-up post on indexing. 🙂

1. First, let’s review some basics. Classically:

An index is a DBMS data structure that you probe to discover where to find the data you really want.
Indexes make data retrieval much more selective and hence faster.
While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
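
The basics above can be sketched in a few lines. This is a toy under my own assumptions, not how any particular DBMS implements indexes. The class IndexedTable and its counters are hypothetical names introduced purely for illustration.

```python
from collections import defaultdict


class IndexedTable:
    """Toy row store with a single index (hypothetical, for illustration only)."""

    def __init__(self, indexed_column):
        self.rows = []                       # the "table": a list of dict rows
        self.indexed_column = indexed_column
        self.index = defaultdict(list)       # column value -> row positions
        self.index_writes = 0                # extra write work caused by the index

    def insert(self, row):
        position = len(self.rows)
        self.rows.append(row)                # the write itself
        # Every write also has to update the index.
        self.index[row[self.indexed_column]].append(position)
        self.index_writes += 1

    def scan(self, value):
        """Full scan: examine every row to find the matches."""
        examined = 0
        hits = []
        for row in self.rows:
            examined += 1
            if row[self.indexed_column] == value:
                hits.append(row)
        return hits, examined

    def probe(self, value):
        """Index probe: learn where the rows are, then fetch only those."""
        positions = self.index.get(value, [])
        return [self.rows[p] for p in positions], len(positions)


table = IndexedTable("city")
for i in range(1000):
    table.insert({"id": i, "city": "Boston" if i % 100 == 0 else "Other"})

scan_hits, scan_examined = table.scan("Boston")
probe_hits, probe_examined = table.probe("Boston")
print(scan_examined, probe_examined, table.index_writes)
```

Both paths return the same rows, but the probe touches only the matching rows while the scan touches all of them. The index_writes counter is the other side of the trade: one extra index update per insert, plus the memory the index itself occupies.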

2. Further: Read more

Categories: Data warehousing, Database compression, GIS and geospatial, Google, MapReduce, McObject, MemSQL, MySQL, ScaleDB, solidDB, Sybase, Text, Tokutek and TokuDB

18 Comments

April 9, 2015

Which analytic technology problems are important to solve for whom?

I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.

In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more

Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software

1 Comment

March 23, 2015

A new logical data layer?

I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.

Here are some thoughts as to why, and also as to challenges that need to be overcome.

There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:

When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, except perhaps if you’re content with the durability provided by an in-memory data grid.
Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.

Categories: Business intelligence, Business Objects, Data models and architecture, Data warehousing, EAI, EII, ETL, ELT, ETLT, Emulation, transparency, portability, Hadoop, Memory-centric data management, MOLAP, Streaming and complex event processing (CEP), WibiData

3 Comments

February 22, 2015

Data models

7-10 years ago, I repeatedly argued the viewpoints:

Relational DBMS were the right choice in most cases.
Multiple kinds of relational DBMS were needed, optimized for different kinds of use case.
There were a variety of specialized use cases in which non-relational data models were best.

Since then, however:

Hadoop has flourished.
NoSQL has flourished.
Graph DBMS have matured somewhat.
Much of the action has shifted to machine-generated data, of which there are many kinds.

So it’s probably best to revisit all that in a somewhat organized way.

Categories: Cassandra, Cloud computing, Data models and architecture, Data warehouse appliances, Data warehousing, Database diversity, Hadoop, In-memory DBMS, Log analysis, Mid-range, MongoDB, NoSQL, OLTP, RDF and graphs, Splunk, Structured documents

8 Comments

November 30, 2014

Thoughts and notes, Thanksgiving weekend 2014

I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:

1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:

Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.

The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.

What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.

2. Three years ago I posted about agile (predictive) analytics. One of the points was:

… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.

Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
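
For readers who haven't met bandit testing, here is a minimal sketch of one common variant, epsilon-greedy. I am not claiming this is what WibiData or its clients use. The function name, the simulated response rates and the parameter defaults are all assumptions made for illustration.

```python
import random


def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy selection among offers.

    true_rates are hypothetical response rates, unknown to the algorithm;
    they are only used to simulate whether a shown offer gets a response.
    """
    rng = random.Random(seed)
    shows = [0] * len(true_rates)   # how often each offer was shown
    wins = [0] * len(true_rates)    # how often each offer got a response

    def estimate(arm):
        # Untried offers look maximally attractive, so each gets tried once.
        return wins[arm] / shows[arm] if shows[arm] else float("inf")

    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: show a random offer to keep learning.
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: show the offer with the best observed rate so far.
            arm = max(range(len(true_rates)), key=estimate)
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shows, wins


shows, wins = epsilon_greedy([0.02, 0.05, 0.03])
print(shows, wins)
```

The point of the design is that the experiment never stops. A fixed share of traffic keeps testing alternatives, so when you change an offer or a price, the new information starts flowing into the estimates immediately.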

3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with: Read more

Categories: About this blog, Citus Data, Cloudera, Data models and architecture, Data warehousing, Databricks, Spark and BDAS, EMC, Hadoop, IBM and DB2, KXEN, MapReduce, Market share and customer counts, MemSQL, Microsoft and SQL*Server, MongoDB, MySQL, NewSQL, NoSQL, OLTP, Oracle, PostgreSQL, Predictive modeling and advanced analytics, SAP AG, Specific users, Sybase, Tokutek and TokuDB, WibiData

12 Comments

November 15, 2014

Technical differentiation

I commonly write about real or apparent technical differentiation, in a broad variety of domains. But actually, computers only do a couple of kinds of things:

Accept instructions.
Execute them.

And hence almost all IT product differentiation fits into two buckets:

Easier instruction-giving, whether that’s in the form of a user interface, a language, or an API.
Better execution, where “better” usually boils down to “faster”, “more reliable” or “more reliably fast”.

As examples of this reductionism, please consider:

Application development is of course a matter of giving instructions to a computer.
Database management systems accept and execute data manipulation instructions.
Data integration tools accept and execute data integration instructions.
System management software accepts and executes system management instructions.
Business intelligence tools accept and execute instructions for data retrieval, navigation, aggregation and display.

Similar stories are true about application software, or about anything that has an API (Application Programming Interface) or SDK (Software Development Kit).

Yes, all my examples are in software. That’s what I focus on. If I wanted to be more balanced in including hardware or data centers, I might phrase the discussion a little differently — but the core points would still remain true.

What I’ve said so far should make more sense if we combine it with the observation that differentiation is usually restricted to particular domains. Read more

Categories: Business intelligence, Data warehousing, Hadoop, Teradata

4 Comments

October 22, 2014

Is analytic data management finally headed for the cloud?

It seems reasonable to wonder whether analytic data management is headed for the cloud. In no particular order:

Amazon Redshift appears to be prospering.
So are some SaaS (Software as a Service) business intelligence vendors.
Amazon Elastic MapReduce is still around.
Snowflake Computing launched with a cloud strategy.
Cazena, with vague intentions for cloud data warehousing, destealthed.*
Cloudera made various cloud-related announcements.
Data is increasingly machine-generated, and machine-generated data commonly originates off-premises.
The general argument for cloud-or-at-least-colocation has compelling aspects.
Analytic workloads can be “bursty”, and so could benefit from true cloud elasticity.

Categories: Amazon and its cloud, Cloud computing, Data warehousing, Netezza

3 Comments

October 22, 2014

Snowflake Computing

I talked with the Snowflake Computing guys Friday. For starters:

Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
Snowflake claims excellent SQL coverage for a 1.0 product.
Snowflake, the company, has:
- 50 people.
- A similar number of current or past users.
- 5 referenceable customers.
- 2 techie founders out of Oracle, plus Marcin Zukowski.
- Bob Muglia as CEO.

Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*

*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.

In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:

Ingest formats are JSON, XML or Avro for now.
I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.
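
I don't know how Snowflake actually makes that decision, so what follows is only a guess at the general shape. It is a minimal sketch that promotes a field to its own column when it appears in at least some fraction of the documents, and keeps the raw document alongside. The function split_columns, the min_fraction threshold and the _document column name are all hypothetical.

```python
import json
from collections import Counter


def split_columns(raw_docs, min_fraction=0.5):
    """Promote frequently occurring top-level fields to columns.

    Assumes every document is a JSON object. The frequency rule is an
    illustrative assumption, not a description of any vendor's logic.
    """
    docs = [json.loads(d) for d in raw_docs]
    # Count in how many documents each top-level field appears.
    counts = Counter(key for doc in docs for key in doc)
    promoted = sorted(
        k for k, n in counts.items() if n / len(docs) >= min_fraction
    )
    table = []
    for raw, doc in zip(raw_docs, docs):
        row = {k: doc.get(k) for k in promoted}   # None where a field is absent
        row["_document"] = raw                    # the whole document is kept too
        table.append(row)
    return promoted, table


raw_docs = [
    '{"user": "a", "ts": 1, "referrer": "x"}',
    '{"user": "b", "ts": 2}',
    '{"user": "c", "ts": 3, "coupon": "SAVE"}',
    '{"user": "d"}',
]
promoted, table = split_columns(raw_docs)
print(promoted)
print(table[3])
```

Rarely seen fields such as referrer and coupon stay inside the document column, where they can still be reached, while the common fields get ordinary columnar treatment.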

I don’t know enough details to judge whether I’d call that an example of schema-on-need.

A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather: Read more

Categories: Amazon and its cloud, Cloud computing, Data mart outsourcing, Data models and architecture, Data warehousing, Market share and customer counts, Parallelization, Pricing, Software as a Service (SaaS), Structured documents

5 Comments
