June 4, 2011

Dirty data, stored dirt cheap

A major driver of Hadoop adoption is the “big bit bucket” use case. Users take a whole lot of data, often machine-generated data in logs of different kinds, and dump it into one place, managed by Hadoop, at open-source pricing. Hadoop hardware doesn’t need to be that costly either. And once you get that data into Hadoop, there are a whole lot of things you can do with it.

Of course, there are various outfits who’d like to sell you not-so-cheap bit buckets. Contending technologies include Hadoop appliances (which I don’t believe in), Splunk (which in many use cases I do), and MarkLogic (ditto, but often the cases are different from Splunk’s). Cloudera and IBM, among other vendors, would also like to sell you some proprietary software to go with your standard Apache Hadoop code.

So the question arises — why would you want to spend serious money to look after your low-value data? The answer, of course, is that maybe your log data isn’t so low-value. Read more

Categories: Hadoop, Investment research and trading, Log analysis, Splunk

9 Comments

June 4, 2011

Hardware for Hadoop

After suggesting that there’s little point to Hadoop appliances, it occurred to me to look into what kinds of hardware actually are used with Hadoop. So far as I can tell:

Hadoop nodes today tend to run on fairly standard boxes.
Hadoop nodes in the past have tended to run on boxes that were light with respect to RAM.
The number of spindles per core on Hadoop node boxes is going up even as disks get bigger.

Categories: Cloudera, Hadoop, Pricing, Storage, Yahoo

11 Comments

June 2, 2011

Why you would want an appliance — and when you wouldn’t

Data warehouse appliances are booming. But Hadoop appliances are a non-starter.

Data warehouse and other data management appliances are on the upswing. Oracle is pushing Exadata. Teradata* is going strong, and also recently bought Aster Data. IBM bought Netezza. Greenplum and Vertica were bought by EMC and HP respectively. All those moves are favorable for appliances.

*As far as I’m concerned, all Teradata hardware-included systems are appliances.

In essence, there are two kinds of reasons to prefer appliances over software-only offerings: Read more

Categories: Data warehouse appliances, Hadoop, Open source

10 Comments

May 30, 2011

Another category of derived data

Six months ago, I argued the importance of derived analytic data, saying

… there’s no escaping the importance of derived/augmented/enhanced/cooked/adjusted data for analytic data processing. The five areas I have in mind are, loosely speaking:

Aggregates, when they are maintained, generally for reasons of performance or response time.

Calculated scores, commonly based on data mining/predictive analytics.

Text analytics.

The kinds of ETL (Extract/Transform/Load) Hadoop and other forms of MapReduce are commonly used for.

Adjusted data, especially in scientific contexts.

Probably there are yet more examples that I am at the moment overlooking.

Well, I did overlook at least one category. 🙂

A surprisingly important kind of derived data is metadata, especially for large, poly-structured data sets. For example, CERN has vastly quantities of experiment sensor data, stored as files; just the metadata alone fills over 10 terabytes in an Oracle database. MarkLogic is big on storing derived metadata, both on the publishing/media and intelligence sides of the business.

Categories: Data models and architecture, Derived data, Hadoop, MarkLogic

2 Comments

May 26, 2011

Slashdot venting thread about Oracle/Sun hardware

Slashdot has what amounts to a venting thread about Oracle/Sun hardware. The one consistent favorable theme is that Sun hardware is good stuff if you want to run Oracle. Otherwise, comments repeatedly say:

Product discounts are down, effectively creating a price increase.
Service prices are way up for some customers, because cheaper service options have been eliminated.
Service quality is unsatisfactory.
Oracle is difficult to do business with.

So far, I haven’t seen any comments to the effect “I don’t know what you guys are talking about; we’re perfectly happy with Sun”, but surely those will come too.

Categories: Oracle, Pricing

4 Comments

May 24, 2011

Quick thoughts on Oracle-on-Amazon

Amazon has a page up for what it calls Amazon RDS for Oracle Database. You can rent Amazon instances suitable for running Oracle, and bring your own license (BYOL), or you can rent a “License Included” instance that includes Oracle Standard Edition One (a cheap version of Oracle that is limited to two sockets).

My quick thoughts start:

Mainly, this isn’t for production usage. But exceptions might arise when:
- An application, from creation to abandonment, is only expected to have a short lifespan, in support of a specific project.
- There is an extreme internal-politics bias to operating versus capital expenses, or something like that, forcing a user department to cloud production deployment even when it doesn’t make much rational sense.
- An application is small enough, or the situation is sufficiently desperate, that any inefficiencies are outweighed by convenience.
There is non-production appeal. In particular:
- Spinning up a quick cloud instance can make a lot of sense for a developer.
- The same goes if you want to sell an Oracle-based application and need to offer demo/test capabilities.
- The same might go for off-site replication/disaster recovery.

Of course, those are all standard observations every time something that’s basically on-premises software is offered in the cloud. They’re only reinforced by the fact that the only Oracle software Amazon can actually license you is a particularly low-end edition.

And Oracle is indeed on-premises software. In particular, Oracle is hard enough to manage when it’s on your premises, with a known hardware configuration; who would want to try to manage a production instance of Oracle in the cloud?

Categories: Amazon and its cloud, Cloud computing, Oracle

7 Comments

May 23, 2011

Traditional databases will eventually wind up in RAM

In January, 2010, I posited that it might be helpful to view data as being divided into three categories:

Human/Tabular data –i.e., human-generated data that fits well into relational tables or arrays.
Human/Nontabular data — i.e., all other data generated by humans.
Machine-Generated data.

I won’t now stand by every nuance in that post, which may differ slightly from those in my more recent posts about machine-generated data and poly-structured databases. But one general idea is hard to dispute:

Traditional database data — records of human transactional activity, referred to as “Human/Tabular data above” — will not grow as fast as Moore’s Law makes computer chips cheaper.

And that point has a straightforward corollary, namely:

It will become ever more affordable to put traditional database data entirely into RAM. Read more

Categories: Analytic technologies, Cache, In-memory DBMS, memcached, Memory-centric data management, OLTP, Oracle, Oracle TimesTen, SAP AG, solidDB, Storage, Theory and architecture, VoltDB and H-Store

28 Comments

May 21, 2011

Object-oriented database management systems (OODBMS)

There seems to be a fair amount of confusion about object-oriented database management systems (OODBMS). Let’s start with a working definition:

An object-oriented database management system (OODBMS, but sometimes just called “object database”) is a DBMS that stores data in a logical model that is closely aligned with an application program’s object model. Of course, an OODBMS will have a physical data model optimized for the kinds of logical data model it expects.

If you’re guessing from that definition that there can be difficulties drawing boundaries between the application, the application programming language, the data manipulation language, and/or the DBMS — you’re right. Those difficulties have been a big factor in relegating OODBMS to being a relatively niche technology to date.

Examples of what I would call OODBMS include: Read more

Categories: Cache, In-memory DBMS, Intersystems and Cache', Memory-centric data management, Objectivity and Infinite Graph, OLTP, Software as a Service (SaaS), Starcounter

21 Comments

May 18, 2011

Starcounter high-speed memory-centric object-oriented DBMS, coming soon

Since posting recently about Starcounter, I’ve had the chance to actually talk with the company (twice). Hence I know more than before. 🙂 Starcounter:

Has been around as a company since 2006.
Has developed memory-centric object-oriented DBMS technology that has been OEMed by a few application software companies (especially in bricks-and-mortar retailing and in online advertising).
Is planning to actually launch an OODBMS product sometime this summer.
Has 14 employees (most or all of whom are in Sweden, which is also where I think Starcounter’s current customers are centered).
Is planning to shift emphasis soon to the US market.

Starcounter’s value propositions are programming ease (no object/relational impedance mismatch) and performance. Starcounter believes its DBMS has 100X the performance of conventional DBMS at short-request transaction processing, and 10X the performance of other memory-centric and/or object-oriented DBMS (e.g. Oracle TimesTen, or Versant). That said, Starcounter has not yet tested VoltDB. Starcounter does not claim performance much beyond that of disk-based DBMS on analytic tasks such as aggregations.

The key technical aspect to Starcounter is integration between the DBMS and the virtual machine, so that the same copy of the data is accessed by both the DBMS and the application program, without any movement or transformation being needed. (Starcounter isn’t aware of any other object-oriented DBMS that work this way.) Transient and persistent data are handled in the same way, seamlessly.

Other Starcounter technical highlights include: Read more

Categories: Data models and architecture, In-memory DBMS, Memory-centric data management, Object, OLTP, Starcounter, Theory and architecture

3 Comments

May 15, 2011

What to do about “unstructured data”

We hear much these days about unstructured or semi-structured (as opposed to) structured data. Those are misnomers, however, for at least two reasons. First, it’s not really the data that people think is un-, semi-, or fully structured; it’s databases.* Relational databases are highly structured, but the data within them is unstructured — just lists of numbers or character strings, whose only significance derives from the structure that the database imposes.

*Here I’m using the term “database” literally, rather than as a concise synonym for “database management system”. But see below.

Second, a more accurate distinction is not whether a database has one structure or none — it’s whether a database has one structure or many. The easiest way to see this is for databases that have clearly-defined schemas. A relational database has one schema (even if it is just the union of various unrelated sub-schemas); an XML database, however, can have as many schemas as it contains documents.

One small terminological problem is easily handled, namely that people don’t talk about true databases very often, at least when they’re discussing generalities; rather, they talk about data and DBMS.* So let’s talk of DBMS being “structured” singly or multiply or whatever, just as the databases they’re designed to manage are.

*And they refer to the DBMS as “databases,” because they don’t have much other use for the word.

All that said — I think that single vs. multiple database structures isn’t a bright-line binary distinction; rather, it’s a spectrum. For example: Read more

Categories: Cassandra, Couchbase, Data models and architecture, HBase, IBM and DB2, MarkLogic, MongoDB, NoSQL, Splunk, Theory and architecture

19 Comments

← Previous Page — Next Page →

Search our blogs and white papers

Monash Research blogs

DBMS 2 covers database management, analytics, and related technologies.
Text Technologies covers text mining, search, and social software.
Strategic Messaging analyzes marketing and messaging strategy.
The Monash Report examines technology and public policy issues.
Software Memories recounts the history of the software industry.

User consulting

Building a short list? Refining your strategic plan? We can help.

Vendor advisory

We tell vendors what's happening -- and, more important, what they should do about it.

Monash Research highlights

Learn about white papers, webcasts, and blog highlights, by RSS or email.

Links
- Monash Research
- White Papers
Admin
- Log in