I’m going to be out-of-sorts this week, due to a colonoscopy. (Between the prep, the procedure, and the recovery, that’s a multi-day disablement.) In the interim, here’s a collection of links, quick comments and the like.
1. Are you an engineer considering a start-up? This post is for you. It’s based on my long experience in and around such scenarios, and includes a section on “Deadly yet common mistakes”.
2. There seems to be a lot of confusion regarding the business model at my clients Databricks. Indeed, my own understanding of Databricks’ on-premises business has changed recently. There are no changes in my beliefs that:
- Databricks does not directly license or support on-premises Spark users. Rather …
- … it helps partner companies to do so, where:
- Examples of partner companies include usual-suspect Hadoop distribution vendors, and DataStax.
- “Help” commonly includes higher-level support.
However, I now get the impression that revenue from such relationships is a bigger deal to Databricks than I previously thought.
Databricks, by the way, has grown to >50 people.
3. DJ Patil and Ruslan Belkin apparently had a great session on lessons learned, covering a lot of ground. Many of the points are worth reading, but one in particular echoed something I’m hearing lots of places — “Data is super messy, and data cleanup will always be literally 80% of the work.” Actually, I’d replace the “always” by something like “very often”, and even that mainly for newish warehouses, data marts or datasets. But directionally the comment makes a whole lot of sense.
|Categories: Data integration and middleware, Databricks, Spark and BDAS, DataStax, Hadoop, Health care, Investment research and trading, Text||Leave a Comment|
Indexes are central to database management.
- My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …
- … and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.
- Recently, I’ve had numerous conversations in which indexing strategies played a central role.
Perhaps it’s time for a round-up post on indexing.
1. First, let’s review some basics. Classically:
- An index is a DBMS data structure that you probe to discover where to find the data you really want.
- Indexes make data retrieval much more selective and hence faster.
- While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.
- Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
2. Further: Read more
|Categories: Data warehousing, Database compression, GIS and geospatial, Google, MapReduce, McObject, MemSQL, MySQL, ScaleDB, solidDB, Sybase, Text, Tokutek and TokuDB||18 Comments|
I chatted with the MariaDB folks on Tuesday. Let me start by noting:
- MariaDB, the product, is a MySQL fork.
- MariaDB, product and company alike, are essentially a reaction to Oracle’s acquisition of MySQL. A lot of the key players are previously from MySQL.
- MariaDB, the company, is the former SkySQL …
- … which acquired or is the surviving entity of a merger with The Monty Program, which originated MariaDB. According to Wikipedia, something called the MariaDB Foundation is also in the mix.
- I get the impression SkySQL mainly provided services around MySQL, especially remote DBA.
- It appears that a lot of MariaDB’s technical differentiation going forward is planned to be in a companion product called MaxScale, which was released into Version 1.0 general availability earlier this year.
The numbers around MariaDB are a little vague. I was given the figure that there were ~500 customers total, but I couldn’t figure out what they were customers for. Remote DBA services? MariaDB support subscriptions? Something else? I presume there are some customers in each category, but I don’t know the mix. Other notes on MariaDB the company are:
- ~80 people in ~15 countries.
- 20-25 engineers, which hopefully doesn’t count a few field support people.
- “Tiny” headquarters in Helsinki.
- Business leadership growing in the US and especially the SF area.
MariaDB, the company, also has an OEM business. Part of their pitch is licensing for connectors — specifically LGPL — that hopefully gets around some of the legal headaches for MySQL engine suppliers.
MaxScale is a proxy, which starts out by intercepting and parsing MariaDB queries. Read more
|Categories: Database compression, Hadoop, IBM and DB2, Market share and customer counts, Mid-range, MySQL, Open source, Tokutek and TokuDB, Transparent sharding||1 Comment|
I hear much discussion of shortfalls in analytic technology, especially from companies that want to fill in the gaps. But how much do these gaps actually matter? In many cases, that depends on what the analytic technology is being used for. So let’s think about some different kinds of analytic task, and where they each might most stress today’s available technology.
In separating out the task areas, I’ll focus first on the spectrum “To what extent is this supposed to produce novel insights?” and second on the dimension “To what extent is this supposed to be integrated into a production/operational system?” Issues of latency, algorithmic novelty, etc. can follow after those. In particular, let’s consider the tasks: Read more
|Categories: Business intelligence, Data warehousing, Databricks, Spark and BDAS, Hadoop, Netezza, NoSQL, Predictive modeling and advanced analytics, Tableau Software||1 Comment|
My favorite educational video growing up, by far, was a 1960 film embedded below. I love it because it pranks its viewers, starting right in the opening scene. (Start at the 0:50 mark to see what I mean.)
If you’re ever in the position of helping a kid or young adult understand physics, this video could be a great help. Frankly, it could help in political discussions as well …
I’m skeptical of data federation. I’m skeptical of all-things-to-all-people claims about logical data layers, and in particular of Gartner’s years-premature “Logical Data Warehouse” buzzphrase. Still, a reasonable number of my clients are stealthily trying to do some kind of data layer middleware, as are other vendors more openly, and I don’t think they’re all crazy.
Here are some thoughts as to why, and also as to challenges that need to be overcome.
There are many things a logical data layer might be trying to facilitate — writing, querying, batch data integration, real-time data integration and more. That said:
- When you’re writing data, you want it to be banged into a sufficiently-durable-to-acknowledge condition fast. If acknowledgements are slow, performance nightmares can ensue. So writing is the last place you want an extra layer, perhaps unless you’re content with the durability provided by an in-memory data grid.
- Queries are important. Also, they formally are present in other tasks, such as data transformation and movement. That’s why data manipulation packages (originally Pig, now Hive and fuller SQL) are so central to Hadoop.
1. Continuing from last week’s HBase post, the Cloudera folks were fairly proud of HBase’s features for performance and scalability. Indeed, they suggested that use cases which were a good technical match for HBase were those that required fast random reads and writes with high concurrency and strict consistency. Some of the HBase architecture for query performance seems to be:
- Everything is stored in sorted files. (I didn’t probe as to what exactly the files were sorted on.)
- Files have indexes and optional Bloom filters.
- Files are marked with min/max field values and time stamp ranges, which helps with data skipping.
Notwithstanding that a couple of those features sound like they might help with analytic queries, the base expectation is that you’ll periodically massage your HBase data into a more analytically-oriented form. For example — I was talking with Cloudera after all — you could put it into Parquet.
2. The discussion of which kinds of data are originally put into HBase was a bit confusing.
- HBase is commonly used to receive machine-generated data. Everybody knows that.
- Cloudera drew a distinction between:
- Straightforward time series, which should probably just go into HDFS (Hadoop Distributed File System) rather than HBase.
- Data that is bucketed by entity, which likely should go into HBase. Examples of entities are specific users or devices.
- Cloudera also reminded me that OpenTSDB, a popular time series data store, runs over HBase.
OpenTSDB, by the way, likes to store detailed data and aggregates side-by-side, which resembles a pattern I discussed in my recent BI for NoSQL post.
3. HBase supports caching, tiered storage, and so on. Cloudera is pretty sure that it is publicly known (I presume from blog posts or conference talks) that: Read more
|Categories: Cloudera, eBay, Facebook, Hadoop, HBase, Market share and customer counts, NoSQL, Open source, Petabyte-scale data management, Specific users, Yahoo||4 Comments|
Over the past couple years, there have been various quick comments and vague press releases about “BI for NoSQL”. I’ve had trouble, however, imagining what it could amount to that was particularly interesting, with my confusion boiling down to “Just what are you aggregating over what?” Recently I raised the subject with a few leading NoSQL companies. The result is that my confusion was expanded. Here’s the small amount that I have actually figured out.
As I noted in a recent post about data models, many databases — in particular SQL and NoSQL ones — can be viewed as collections of <name, value> pairs.
- In a relational database, a record is a collection of <name, value> pairs with a particular and predictable — i.e. derived from the table definition — sequence of names. Further, a record usually has an identifying key (commonly one of the first values).
- Something similar can be said about structured-document stores — i.e. JSON or XML — except that the sequence of names may not be consistent from one document to the next. Further, there’s commonly a hierarchical relationship among the names.
- For these purposes, a “wide-column” NoSQL store like Cassandra or HBase can be viewed much as a structured-document store, albeit with different performance optimizations and characteristics and a different flavor of DML (Data Manipulation Language).
Consequently, a NoSQL database can often be viewed as a table or a collection of tables, except that:
- The NoSQL database is likely to have more null values.
- The NoSQL database, in a naive translation toward relational, may have repeated values. So a less naive translation might require extra tables.
That’s all straightforward to deal with if you’re willing to write scripts to extract the NoSQL data and transform or aggregate it as needed. But things get tricky when you try to insist on some kind of point-and-click. And by the way, that last comment pertains to BI and ETL (Extract/Transform/Load) alike. Indeed, multiple people I talked with on this subject conflated BI and ETL, and they were probably right to do so.
|Categories: Business intelligence, Cassandra, EAI, EII, ETL, ELT, ETLT, HBase, MongoDB, NoSQL, Structured documents||5 Comments|
I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, ala MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
|Categories: Cloudera, Clustering, Data models and architecture, Database diversity, Hadoop, HBase, Hortonworks, HP and Neoview, Intel, Market share and customer counts, NoSQL, Open source||4 Comments|
I found yesterday’s news quite unpleasant.
- A guy I knew and had a brief rivalry with in high school died of colon cancer, a disease that I’m at high risk for myself.
- GigaOm, in my opinion the best tech publication — at least for my interests — shut down.
- The sex discrimination trial around Kleiner Perkins is undermining some people I thought well of.
So I want to unclutter my mind a bit. Here goes.
1. There are a couple of stories involving Sam Simon and me that are too juvenile to tell on myself, even now. But I’ll say that I ran for senior class president, in a high school where the main way to campaign was via a single large poster, against a guy with enough cartoon-drawing talent to be one of the creators of the Simpsons. Oops.
2. If one suffers from ulcerative colitis as my mother did, one is at high risk of getting colon cancer, as she also did. Mine isn’t as bad as hers was, due to better tolerance for medication controlling the disease. Still, I’ve already had a double-digit number of colonoscopies in my life. They’re not fun. I need another one soon; in fact, I canceled one due to the blizzards.
Pro-tip — never, ever have a colonoscopy without some kind of anesthesia or sedation. Besides the unpleasantness, the lack of meds increases the risk that the colonoscopy will tear you open and make things worse. I learned that the hard way in New York in the early 1980s.