Analysis of issues in parallel computing, especially parallelized database management.
I talked with a couple of Cloudera folks about HBase last week. Let me frame things by saying:
- The closest thing to an HBase company, à la MongoDB/MongoDB or DataStax/Cassandra, is Cloudera.
- Cloudera still uses a figure of 20% of its customers being HBase-centric.
- HBaseCon and so on notwithstanding, that figure isn’t really reflected in Cloudera’s marketing efforts. Cloudera’s marketing commitment to HBase has never risen to nearly the level of MongoDB’s or DataStax’s push behind their respective core products.
- With Cloudera’s move to “zero/one/many” pricing, Cloudera salespeople have little incentive to push HBase hard to accounts other than HBase-first buyers.
- Cloudera no longer dominates HBase development, if it ever did.
- Cloudera is the single biggest contributor to HBase, by its count, but doesn’t make a majority of the contributions on its own.
- Cloudera sees Hortonworks as having become a strong HBase contributor.
- Intel is also a strong contributor, as are end user organizations such as Chinese telcos. Not coincidentally, Intel was a major Hadoop provider in China before the Intel/Cloudera deal.
- As far as Cloudera is concerned, HBase is just one data storage technology of several, focused on high-volume, high-concurrency, low-latency short-request processing. Cloudera thinks this is OK because of HBase’s strong integration with the rest of the Hadoop stack.
- Others who may be inclined to disagree are in several cases doing projects on top of HBase to extend its reach. (In particular, please see the discussion below about Apache Phoenix and Trafodion, both of which want to offer relational-like functionality.)
Categories: Cloudera, Clustering, Data models and architecture, Database diversity, Hadoop, HBase, Hortonworks, HP and Neoview, Intel, Market share and customer counts, NoSQL, Open source
I’m on record as believing that:
- Hadoop needs a memory-centric storage grid.
- Tachyon is a strong candidate to fill the role.
- It’s an open secret that there will be a Tachyon company. However, …
- … no details have been publicized. Indeed, the open secret itself is still officially secret.
- Tachyon technology, which just hit 0.6 a couple of days ago, still lacks many features I regard as essential.
- As a practical matter, most Tachyon interest to date has been associated with Spark. This makes perfect sense given Tachyon’s origin and initial technical focus.
- Tachyon was in 50 or more sites last year. Most of these sites were probably just experimenting with it. However …
- … there are production Tachyon clusters with >100 nodes.
As a reminder of Tachyon basics:
Categories: Clustering, Databricks, Spark and BDAS, Hadoop, Memory-centric data management
I chatted last night with Ion Stoica, CEO of my client Databricks, for an update both on his company and Spark. Databricks’ actual business is Databricks Cloud, about which I can say:
- Databricks Cloud is:
  - Currently running on Amazon only.
  - Not dependent on Hadoop.
- Databricks Cloud, despite having a 1.0 version number, is not actually in general availability.
- Even so, there are a non-trivial number of paying customers for Databricks Cloud. (Ion gave me an approximate number, but is keeping it NDA until Spark Summit East.)
- Databricks Cloud gets at data from S3 (most commonly), Redshift, Elastic MapReduce, and perhaps other sources I’m forgetting.
- Databricks Cloud was initially focused on ad-hoc use. A few days ago the capability was added to schedule jobs and so on.
- Unsurprisingly, therefore, Databricks Cloud has been used to date mainly for data exploration/visualization and ETL (Extract/Transform/Load). Visualizations tend to be scripted/programmatic, but there’s also an ODBC driver used for Tableau access and so on.
- Databricks Cloud customers are concentrated (but not unanimously so) in the usual-suspect internet-centric business sectors.
- The low end of the amount of data Databricks Cloud customers are working with is 100s of gigabytes. This isn’t surprising.
- The high end of the amount of data Databricks Cloud customers are working with is petabytes. That did surprise me, and in retrospect I should have pressed for details.
I do not expect all of the above to remain true as Databricks Cloud matures.
Ion also said that Databricks is over 50 people, and has moved its office from Berkeley to San Francisco. He also offered some Spark numbers, such as:
Categories: Amazon and its cloud, Cloud computing, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Parallelization, Petabyte-scale data management, Predictive modeling and advanced analytics, Software as a Service (SaaS)
I hoped to write a reasonable overview of current- to medium-term future IT innovation. Yeah, right. But if we abandon any hope that this post could be comprehensive, I can at least say:
1. Back in 2011, I ranted against the term Big Data, but expressed more fondness for the V words — Volume, Velocity, Variety and Variability. That said, when it comes to data management and movement, solutions to the V problems have generally been sketched out.
- Volume has been solved. There are Hadoop installations with 100s of petabytes of data, analytic RDBMS with 10s of petabytes, general-purpose Exadata sites with petabytes, and 10s/100s of petabytes of analytic Accumulo at the NSA. Further examples abound.
- Velocity is being solved. My recent post on Hadoop-based streaming suggests how. In other use cases, velocity is addressed via memory-centric RDBMS.
- Variety and Variability have been solved. MongoDB, Cassandra and perhaps others are strong NoSQL choices. Schema-on-need is in earlier days, but may help too.
2. Even so, there’s much room for innovation around data movement and management. I’d start with:
- Product maturity is a huge issue for all the above, and will remain one for years.
- Hadoop and Spark show that application execution engines:
  - Have a lot of innovation ahead of them.
  - Are tightly entwined with data management, and with data movement as well.
- Hadoop is due for another refactoring, focused on both in-memory and persistent storage.
- There are many issues in storage that can affect data technologies as well, including but not limited to:
  - Solid-state (flash or post-flash) vs. spinning disk.
  - Networked vs. direct-attached.
  - Virtualized vs. identifiable-physical.
- Graph analytics and data management are still confused.
I believe in all of the following trends:
- Hadoop is a Big Deal, and here to stay.
- Spark, for most practical purposes, is becoming a big part of Hadoop.
- Most servers will be operated away from user premises, whether via SaaS (Software as a Service), co-location, or “true” cloud computing.
Trickier is the meme that Hadoop is “the new OS”. My thoughts on that start:
- People would like this to be true, although in most cases only as one of several cluster computing platforms.
- Hadoop, when viewed as an operating system, is extremely primitive.
- Even so, the greatest awkwardness I’m seeing when different software shares a Hadoop cluster isn’t actually in scheduling, but rather in data interchange.
There is also a minor issue that if you distribute your Hadoop work among extra nodes you might have to pay a bit more to your Hadoop distro support vendor. Fortunately, the software industry routinely solves more difficult pricing problems than that.
Categories: Cloud computing, Databricks, Spark and BDAS, Hadoop, MapReduce, MemSQL, Software as a Service (SaaS)
I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
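To make the two meanings of "geo-distribution" concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: the region names, the `Region` class, and the `place_partitioned`, `place_replicated` and `can_accept_write` helpers are assumptions of mine, not any vendor's API. It contrasts storing each record only in its home geography with copying every record everywhere, and then shows the multi-master vs. single-master subdivision of the replicated case.

```python
from dataclasses import dataclass, field

REGIONS = ["us", "eu", "apac"]  # hypothetical region names


@dataclass
class Region:
    name: str
    rows: dict = field(default_factory=dict)


def place_partitioned(records, regions):
    """Geo-partitioning: each record is stored only in its home region."""
    for key, rec in records.items():
        regions[rec["home"]].rows[key] = rec


def place_replicated(records, regions):
    """Full replication: every region holds a copy of every record."""
    for key, rec in records.items():
        for region in regions.values():
            region.rows[key] = rec


def can_accept_write(region_name, key, masters=None):
    """Decide whether a region may take the first write for a key.

    masters=None models multi-master (any copy accepts writes);
    otherwise masters maps each key to its single master region.
    """
    return masters is None or masters[key] == region_name


records = {"alice": {"home": "eu"}, "bob": {"home": "us"}}

partitioned = {name: Region(name) for name in REGIONS}
replicated = {name: Region(name) for name in REGIONS}
place_partitioned(records, partitioned)
place_replicated(records, replicated)

for name in REGIONS:
    print(name, sorted(partitioned[name].rows), sorted(replicated[name].rows))
```

In the partitioned layout a read for "alice" from outside "eu" has to cross regions, which is the price of keeping her data in one jurisdiction; in the replicated layout every region can serve the read locally, and the remaining question is only which copies may accept writes.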
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
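For readers who haven't seen bandit testing, the following is a minimal sketch of one common variant, epsilon-greedy, in Python. It is not WibiData's method or anybody else's product; the class name, the offer names and the conversion rates are all made up for illustration. The idea is that most traffic goes to whichever offer currently looks best, while a small fraction keeps exploring the alternatives, so the model keeps learning as results come in.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy bandit over a fixed set of arms (e.g. offers)."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in self.arms}
        self.reward = {arm: 0.0 for arm in self.arms}

    def mean(self, arm):
        """Observed average reward for an arm (0.0 if never tried)."""
        n = self.pulls[arm]
        return self.reward[arm] / n if n else 0.0

    def choose(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.pulls[arm] == 0:
                return arm
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=self.mean)

    def update(self, arm, reward):
        self.pulls[arm] += 1
        self.reward[arm] += reward


def simulate(true_rates, rounds=5000, seed=0):
    """Run the bandit against hypothetical, fixed conversion rates."""
    bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.choose()
        converted = env.random() < true_rates[arm]
        bandit.update(arm, 1.0 if converted else 0.0)
    return bandit


bandit = simulate({"offer_a": 0.05, "offer_b": 0.5})
print(bandit.pulls)
```

Unlike a fixed-split A/B test, the allocation shifts toward the better-performing offer while the experiment is still running, which is why this style of testing fits the "you immediately gain new information" point quoted above.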
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:
I talked with the Snowflake Computing guys Friday. For starters:
- Snowflake is offering an analytic DBMS on a SaaS (Software as a Service) basis.
- The Snowflake DBMS is built from scratch (as opposed to, for example, being based on PostgreSQL or Hadoop).
- The Snowflake DBMS is columnar and append-only, as has become common for analytic RDBMS.
- Snowflake claims excellent SQL coverage for a 1.0 product.
- Snowflake, the company, has:
  - 50 people.
  - A similar number of current or past users.
  - 5 referenceable customers.
  - 2 techie founders out of Oracle, plus Marcin Zukowski.
  - Bob Muglia as CEO.
Much of the Snowflake story can be summarized as cloud/elastic/simple/cheap.*
*Excuse me — inexpensive. Companies rarely like their products to be labeled as “cheap”.
In addition to its purely relational functionality, Snowflake accepts poly-structured data. Notes on that start:
- Ingest formats are JSON, XML or Avro for now.
- I gather that the system automagically decides which fields/attributes are sufficiently repeated to be broken out as separate columns; also, there’s a column for the documents themselves.
I don’t know enough details to judge whether I’d call that an example of schema-on-need.
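As a rough illustration of the general idea, and emphatically not a description of how Snowflake does it, here is a minimal Python sketch that promotes frequently occurring top-level JSON fields to their own columns while keeping the full document alongside. The `shred` function, the `min_fraction` threshold and the `_document` column name are all my own assumptions, and the sketch assumes every document is a JSON object.

```python
import json
from collections import Counter


def shred(raw_docs, min_fraction=0.5):
    """Break out commonly repeated top-level fields as columns.

    A field becomes a column if it appears in at least min_fraction
    of the documents. Every row also keeps the original document.
    """
    docs = [json.loads(raw) for raw in raw_docs]
    counts = Counter(key for doc in docs for key in doc)
    columns = sorted(
        key for key, n in counts.items() if n / len(docs) >= min_fraction
    )
    rows = []
    for raw, doc in zip(raw_docs, docs):
        row = {col: doc.get(col) for col in columns}  # missing field -> None
        row["_document"] = raw  # hypothetical name for the raw-document column
        rows.append(row)
    return columns, rows


raw_docs = [
    '{"user": "a", "ts": 1, "note": "x"}',
    '{"user": "b", "ts": 2}',
    '{"user": "c", "ts": 3, "tag": "y"}',
]
columns, rows = shred(raw_docs, min_fraction=0.5)
print(columns)
for row in rows:
    print(row)
```

Here `user` and `ts` appear in every document and get columns, while `note` and `tag` each appear in only one of three and stay reachable only through the stored document. A real system would presumably also have to deal with nesting, type conflicts and changing frequencies over time, none of which this sketch attempts.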
A key element of Snowflake’s poly-structured data story seems to be lateral views. I’m not too clear on that concept, but I gather:
Categories: Amazon and its cloud, Cloud computing, Data mart outsourcing, Data models and architecture, Data warehousing, Market share and customer counts, Parallelization, Pricing, Software as a Service (SaaS), Structured documents
1. I wish I had some good, practical ideas about how to make a political difference around privacy and surveillance. Nothing else we discuss here is remotely as important. I presumably can contribute an opinion piece to, more or less, the technology publication(s) of my choice; that can have a small bit of impact. But I’d love to do better than that. Ideas, anybody?
2. A few thoughts on cloud, colocation, etc.:
- The economies of scale of colocation-or-cloud over operating your own data center are compelling. Most of the reasons you outsource hardware manufacture to Asia also apply to outsourcing data center operation within the United States. (The one exception I can think of is supply chain.)
- The arguments for cloud specifically over colocation are less persuasive. Colo providers can even match cloud deployments in rapid provisioning and elastic pricing, if they so choose.
- Surely not coincidentally, I am told that Rackspace is deemphasizing cloud, reemphasizing colocation, and making a big deal out of Open Compute. In connection with that, Rackspace has pulled back from its leadership role in OpenStack.
- I’m hearing much more mention of Amazon Redshift than I used to. It seems to have a lot of traction as a simple and low-cost option.
- I’m hearing less about Elastic MapReduce than I used to, although I imagine usage is still large and growing.
- In general, I get the impression that progress is being made in overcoming the inherent difficulties in cloud (and even colo) parallel analytic processing. But it all still seems pretty vague, except for the specific claims being made for traction of Redshift, EMR, and so on.
- Teradata recently told me that in colocation pricing, it is common for floor space to be everything, with power not separately metered. But I don’t think that trend is a big deal, as it is not necessarily permanent.
- Cloud hype is of course still with us.
- Other than the above, I stand by my previous thoughts on appliances, clusters and clouds.
3. As for the analytic DBMS industry:
I spent a day with Teradata in Rancho Bernardo last week. Most of what we discussed is confidential, but I think the non-confidential parts and my general impressions add up to enough for a post.
First, let’s catch up with some personnel gossip. So far as I can tell:
- Scott Gnau runs most of Teradata’s development, product management, and product marketing, the big exception being that …
- … Darryl McDonald runs the apps part (Aprimo and so on), and no longer is head of marketing.
- Oliver Ratzesberger runs Teradata’s software development.
- Jeff Carter has returned to his roots and runs the hardware part, in place of Carson Schmidt.
- Aster founders Mayank Bawa and Tasso Argyros have left Teradata (perhaps some earn-out period ended).
- Carson is temporarily running Aster development (in place of Mayank), and has some sort of evangelism role waiting after that.
- With the acquisition of Hadapt, Teradata gets some attention from Dan Abadi. Also, they’re retaining Justin Borgman.
The biggest change in my general impressions about Teradata is that they’re having smart thoughts about the cloud. At least, Oliver is. All details are confidential, and I wouldn’t necessarily expect them to become clear even in October (which once again is the month for Teradata’s user conference). My main concern about all that is whether Teradata’s engineering team can successfully execute on Oliver’s directives. I’m optimistic, but I don’t have a lot of detail to support my good feelings.
Some quick-and-dirty positioning and sales qualification notes, which crystallize what we already knew:
- The Teradata 1xxx series is focused on cost-per-bit.
- The Teradata 2xxx series is focused on cost-per-query. It is commonly Teradata’s “lead” product, at least for new customers.
- The Teradata 6xxx series is supposed to be able to do “everything”.
- The Teradata Aster “Discovery Analytics” platform is sold mainly to customers who have a specific high-value problem to solve. (Randy Lea gave me a nice round dollar number, but I won’t share it.) I like that approach, as it obviates much of the concern about “Wait — is this strategic for us long-term, given that we also have both Teradata database and Hadoop clusters?”
Categories: Aster Data, Data warehouse appliances, Data warehousing, Hadapt, Hadoop, MapReduce, Solid-state memory, Teradata
I have a small blacklist of companies I won’t talk with because of their particularly unethical past behavior. Actian is one such; they evidently made stuff up about me that Josh Berkus gullibly posted for them, and I don’t want to have conversations that could be dishonestly used against me.
That said, Peter Boncz isn’t exactly an Actian employee. Rather, he’s the professor who supervised Marcin Zukowski’s PhD thesis that became Vectorwise, and I chatted with Peter by Skype while he was at home in Amsterdam. I believe his assurances that no Actian personnel sat in on the call.
In other news, Peter is currently working on and optimistic about HyPer. But we literally spent less than a minute talking about that.
Before I get to the substance, there’s been a lot of renaming at Actian. To quote Andrew Brust,
… the ParAccel, Pervasive and Vectorwise technologies are being unified under the Actian Analytics Platform brand. Specifically, the ParAccel technology … is being re-branded Actian Matrix; Pervasive’s technologies are rechristened Actian DataFlow and Actian DataConnect; and Vectorwise becomes Actian Vector.
Actian … is now “one company, with one voice and one platform” according to its John Santaferraro
The bolded part of the latter quote is untrue — at least in the ordinary sense of the word “one” — but the rest can presumably be taken as company gospel.
All this is by way of preamble to saying that Peter reached out to me about Actian’s new Vector Hadoop Edition when he blogged about it last June, and we finally talked this week. Highlights include:
Categories: Actian and Ingres, Clustering, Database compression, Hadoop, ParAccel, Pervasive Software, SQL/Hadoop integration, VectorWise, Workload management