Discussion of relational database management systems that are offered through some version of open source licensing. Related subjects include:
For starters, let me say:
- SequoiaDB, the company, is my client.
- SequoiaDB, the product, is the main product of SequoiaDB, the company.
- SequoiaDB, the company, has another product line SequoiaCM, which subsumes SequoiaDB in content management use cases.
- SequoiaDB, the product, is fundamentally a JSON data store. But it has a relational front end …
- … and is usually sold for RDBMS-like use cases …
- … except when it is sold as part of SequoiaCM, which adds in a large object/block store and a content-management-oriented library.
- SequoiaDB’s products are open source.
- SequoiaDB’s largest installation seems to be 2 PB across 100 nodes; that includes block storage.
- Figures for DBMS-only database sizes aren’t as clear, but the sweet spot of the cluster-size range for such use cases seems to be 6-30 nodes.
- SequoiaDB, the company, was founded in Toronto, by former IBM DB2 folks.
- Even so, it’s fairly accurate to view SequoiaDB as a Chinese company. Specifically:
- SequoiaDB’s founders were Chinese nationals.
- Most of them went back to China.
- Other employees to date have been entirely Chinese.
- Sales to date have been entirely in China, but SequoiaDB has international aspirations
- SequoiaDB has >100 employees, a large majority of which are split fairly evenly between “engineering” and “implementation and technical support”.
- SequoiaDB’s marketing (as opposed to sales) department is astonishingly tiny.
- SequoiaDB cites >100 subscription customers, including 10 in the global Fortune 500, a large fraction of which are in the banking sector. (Other sectors mentioned repeatedly are government and telecom.)
Unfortunately, SequoiaDB has not captured a lot of detailed information about unpaid open source production usage.
|Categories: Application areas, Business intelligence, Data models and architecture, Data warehousing, Databricks, Spark and BDAS, Market share and customer counts, NoSQL, OLTP, Open source, PostgreSQL, SequoiaDB, Structured documents||4 Comments|
Crate.io and CrateDB basics include:
- Crate.io makes CrateDB.
- CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
- CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
- Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
- Crate.io says it has 22 employees and 5 paying customers.
- Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.
In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.
CrateDB’s not-just-relational story starts:
- A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
- … where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
- … except when they’re just BLOBs (Binary Large OBjects).
- There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
- There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.
|Categories: Columnar database management, Data models and architecture, Databricks, Spark and BDAS, GIS and geospatial, MemSQL, NoSQL, Open source, Structured documents||3 Comments|
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further: Read more
|Categories: Database diversity, Emulation, transparency, portability, MongoDB, MySQL, NoSQL, Open source, RDF and graphs, Structured documents, Text||3 Comments|
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were swooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.
Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud – are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.
data Artisans and Flink basics start:
- Flink is an Apache project sponsored by the Berlin-based company data Artisans.
- Flink has been viewed in a few different ways, all of which are similar to how Spark is seen. In particular, per co-founder Kostas Tzoumas:
- Flink’s original goal was “Hadoop done right”.
- Now Flink is focused on streaming analytics, as an alternative to Spark Streaming, Samza, et al.
- Kostas seems to see Flink as a batch-plus-streaming engine that’s streaming-first.
Like many open source projects, Flink seems to have been partly inspired by a Google paper.
To this point, data Artisans and Flink have less maturity and traction than Databricks and Spark. For example: Read more
|Categories: Cloudera, Databricks, Spark and BDAS, EAI, EII, ETL, ELT, ETLT, Hadoop, Hortonworks, Intel, Market share and customer counts, Open source, Streaming and complex event processing (CEP)||3 Comments|
Databricks CEO Ali Ghodsi checked in because he disagreed with part of my recent post about Databricks. Ali’s take on Databricks’ position in the Spark world includes:
- What I called Databricks’ “secondary business” of “licensing stuff to Spark distributors” was really about second/third tier support. Fair enough. But distributors of stacks including Spark, for whatever combination of on-premise and cloud as the case may be, may in many cases be viewed as competitors to Databricks cloud-only service. So why should Databricks help them?
- Databricks’ investment in Spark Summit and similar evangelism is larger than I realized.
- Ali suggests that the fraction of Databricks’ engineering devoted to open source Spark is greater than I understood during my recent visit.
Ali also walked me through customer use cases and adoption in wonderful detail. In general:
- A large majority of Databricks customers have machine learning use cases.
- Predicting and preventing user/customer churn is a huge issue across multiple market sectors.
The story on those sectors, per Ali, is: Read more
Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully aware either. But when I set out to write this post, I found that it just kept growing longer.
1. The most basic form of lock-in is:
- You do application development for a target set of platform technologies.
- Your applications can’t run without those platforms underneath.
- Hence, you’re locked into those platforms.
2. Enterprise vendor standardization is closely associated with lock-in. The core idea is that you have a mandate or strong bias toward having different apps run over the same platforms, because:
- That simplifies your environment, requiring less integration and interoperability.
- That simplifies your staffing; the same skill sets apply to multiple needs and projects.
- That simplifies your vendor support relationships; there’s “one throat to choke”.
- That simplifies your price negotiation.
3. That last point is double-edged; you have more power over suppliers to whom you give more business, but they also have more power over you. The upshot is often an ELA (Enterprise License Agreement), which commonly works:
- For a fixed period of time, the enterprise may use as much of a given product set as they want, with costs fixed in advance.
- A few years later, the price is renegotiated, based on then-current levels of usage.
|Categories: Amazon and its cloud, Buying processes, Cassandra, Exadata, Facebook, IBM and DB2, Microsoft and SQL*Server, MongoDB, Neo Technology and Neo4j, Open source, Oracle, SAP AG||11 Comments|
- I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth.
- The timing was awkward for most companies I wanted to see. No blame accrues to those who didn’t make themselves available.
- I came back with a nasty cough. Follow-up phone calls aren’t an option until next week.
- I’m impatient to start writing. Hence tonight’s posts. But it’s difficult for a man and his cough to be productive at the same time.
A running list of recent posts is:
- As a companion to this post, I’m publishing a very long one on vendor lock-in.
- Spark and Databricks are both prospering, and of course enhancing their technology as well.
- Ditto DataStax.
- Flink is interesting as the streaming technology it’s now positioned to be, rather than the overall Spark alternative it used to be positioned as but which the world didn’t need.
Subjects I’d like to add to that list include:
- MemSQL, Zoomdata, and Neo Technology (also prospering).
- Cloudera (multiple topics, as usual).
- Analytic SQL engines (“traditional” analytic RDBMS aren’t doing well).
- Microsoft’s reinvention (it feels real).
- Metadata (it’s ever more of a thing).
- Machine learning (it’s going to be a big portion of my research going forward).
- Transitions to the cloud — this subject affects almost everything else.
- Kafka has gotten considerable attention and adoption in streaming.
- Kafka is open source, out of LinkedIn.
- Folks who built it there, led by Jay Kreps, now have a company called Confluent.
- Confluent seems to be pursuing a fairly standard open source business model around Kafka.
- Confluent seems to be in the low to mid teens in paying customers.
- Confluent believes 1000s of Kafka clusters are in production.
- Confluent reports 40 employees and $31 million raised.
At its core Kafka is very simple:
- Kafka accepts streams of data in substantially any format, and then streams the data back out, potentially in a highly parallel way.
- Any producer or consumer of data can connect to Kafka, via what can reasonably be called a publish/subscribe model.
- Kafka handles various issues of scaling, load balancing, fault tolerance and so on.
So it seems fair to say:
- Kafka offers the benefits of hub vs. point-to-point connectivity.
- Kafka acts like a kind of switch, in the telecom sense. (However, this is probably not a very useful metaphor in practice.)
|Categories: Data integration and middleware, Humor, Kafka and Confluent, Market share and customer counts, Microsoft and SQL*Server, Open source, Specific users, Streaming and complex event processing (CEP)||10 Comments|
- I’ve suggested in the past that multi-data-center capabilities are important for “data sovereignty”/geo-compliance.
- The need for geo-compliance just got a lot stronger, with the abolition of the European Union’s Safe Harbour rule for the US. If you collect data in multiple countries, you should be at least thinking about geo-compliance.
- Cassandra is an established leader in multi-data-center operation.
But when I made that connection and checked in accordingly with my client Patrick McFadin at DataStax, I discovered that I’d been a little confused about how multi-data-center Cassandra works. The basic idea holds water, but the details are not quite what I was envisioning.
The story starts:
- Cassandra groups nodes into logical “data centers” (i.e. token rings).
- As a best practice, each physical data center can contain one or more logical data center, but not vice-versa.
- There are two levels of replication — within a single logical data center, and between logical data centers.
- Replication within a single data center is planned in the usual way, with the principal data center holding a database likely to have a replication factor of 3.
- However, copies of the database held elsewhere may have different replication factors …
- … and can indeed have different replication factors for different parts of the database.
In particular, a remote replication factor for Cassandra can = 0. When that happens, then you have data sitting in one geographical location that is absent from another geographical location; i.e., you can be in compliance with laws forbidding the export of certain data. To be clear (and this contradicts what I previously believed and hence also implied in this blog):
- General multi-data-center operation is not what gives you geo-compliance, because the default case is that the whole database is replicated to each data center.
- Instead, you get that effect by tweaking your specific replication settings.
|Categories: Cassandra, Clustering, DataStax, HBase, NoSQL, Open source, Specific users, Surveillance and privacy||3 Comments|