The United States and consequently much of the world are in political uproar. Much of that is about very general and vital issues such as war, peace or the treatment of women. But quite a lot of it is to some extent tech-industry-specific. The purpose of this post is outline how and why that is.
- There’s a worldwide backlash against “elites” — and tech industry folks are perceived as members of those elites.
- That perception contains a lot of truth, and not just in terms of culture/education/geography. Indeed, it may even be a bit understated, because trends commonly blamed on “trade” or “globalization” often have their roots in technological advances.
- There’s a worldwide trend towards authoritarianism. Surveillance/ privacy and censorship issues are strongly relevant to that trend.
- Social media companies are up to their neck in political considerations.
Because they involve grave threats to liberty, I see surveillance/privacy as the biggest technology-specific policy issues in the United States. (In other countries, technology-driven censorship might loom larger yet.) My views on privacy and surveillance have long been:
- Fixing the legal frameworks around information use is a difficult and necessary job. The tech community should be helping more than it is.
- Until those legal frameworks are indeed cleaned up, the only responsible alternative is to foot-drag on data collection, on data retention, and on the provision of data to governmental agencies.
Given the recent election of a US president with strong authoritarian tendencies, that foot-dragging is much more important than it was before.
Other important areas of technology/policy overlap include: Read more
The United States presidency was recently assumed by an Orwellian lunatic.* Sadly, this is not an exaggeration. The dangers — both of authoritarianism and of general mis-governance — are massive. Everybody needs in some way to respond.
*”Orwellian lunatic” is by no means an oxymoron. Indeed, many of the most successful tyrants in modern history have been delusional; notable examples include Hitler, Stalin, Mao and, more recently, Erdogan. (By way of contrast, I view most other Soviet/Russian leaders and most jumped-up-colonel coup leaders as having been basically sane.)
There are many candidates for what to focus on, including:
- Technology-specific issues — e.g. privacy/surveillance, network neutrality, etc.
- Issues in which technology plays a large role — e.g. economic changes that affect many people’s employment possibilities.
- Subjects that may not be tech-specific, but are certainly of great importance. The list of candidates here is almost endless, such as health care, denigration of women, maltreatment of immigrants, or the possible breakdown of the whole international order.
But please don’t just go on with your life and leave the politics to others. Those “others” you’d like to rely on haven’t been doing a very good job.
What I’ve chosen to do personally includes: Read more
Crate.io and CrateDB basics include:
- Crate.io makes CrateDB.
- CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs.
- CrateDB’s creators were perhaps a little slow to realize that the “R” part was needed, but are playing catch-up in that regard.
- Crate.io is an outfit founded by Austrian guys, headquartered in Berlin, that is turning into a San Francisco company.
- Crate.io says it has 22 employees and 5 paying customers.
- Crate.io cites bigger numbers than that for confirmed production users, clearly active clusters, and overall product downloads.
In essence, CrateDB is an open source and less mature alternative to MemSQL. The opportunity for MemSQL and CrateDB alike exists in part because analytic RDBMS vendors didn’t close it off.
CrateDB’s not-just-relational story starts:
- A column can contain ordinary values (of usual-suspect datatypes) or “objects”, …
- … where “objects” presumably are the kind of nested/hierarchical structures that are common in the NoSQL/internet-backend world, …
- … except when they’re just BLOBs (Binary Large OBjects).
- There’s a way to manually define “strict schemas” on the structured objects, and a syntax for navigating their structure in WHERE clauses.
- There’s also a way to automagically infer “dynamic schemas”, but it’s simplistic enough to be more suitable for development/prototyping than for serious production.
|Categories: Columnar database management, Data models and architecture, Databricks, Spark and BDAS, GIS and geospatial, MemSQL, NoSQL, Open source, Structured documents||3 Comments|
After a July visit to DataStax, I wrote
The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things.
- Handle the database design part of application development. In NoSQL environments, this part of the job is indeed largely refactored away. More precisely, it is integrated into the general app developer/architect role.
- Manage production databases. This part of the DBA job is, if anything, a bigger deal in the NoSQL world than in more mature and automated relational environments. It’s likely to be called part of “devops” rather than “DBA”, but by whatever name it’s very much a thing.
That turns out to understate the core point, which is that DBAs still matter in non-RDBMS environments. Specifically, it’s too narrow in two ways.
- First, it’s generally too narrow as to what DBAs do; people with DBA-like skills are also involved in other areas such as “data governance”, “information lifecycle management”, storage, or what I like to call data mustering.
- Second — and more narrowly — the first bullet point of the quote is actually incorrect. In fact, the database design part of application development can be done by a specialized person up front in the NoSQL world, just as it commonly is for RDBMS apps.
My wake-up call for that latter bit was a recent MongoDB 3.4 briefing. MongoDB certainly has various efforts in administrative tools, which I won’t recapitulate here. But to my surprise, MongoDB also found a role for something resembling relational database design. The idea is simple: A database administrator defines a view against a MongoDB database, where views: Read more
|Categories: Databricks, Spark and BDAS, Hadoop, MongoDB, NoSQL, Streaming and complex event processing (CEP)||Leave a Comment|
“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the train as well, but they’ve taken a clear and interesting stance:
- A query layer with multiple ways to query and analyze data.
- A separate data storage layer in which you have a choice of data storage engines …
- … each of which has the same logical (JSON-based) data structure.
When I pointed out that it would make sense to call this “multimodel query” — because the storage isn’t “multimodel” at all — they quickly agreed.
To be clear: While there are multiple ways to read data in MongoDB, there’s still only one way to write it. Letting that sink in helps clear up confusion as to what about MongoDB is or isn’t “multimodel”. To spell that out a bit further: Read more
|Categories: Database diversity, Emulation, transparency, portability, MongoDB, MySQL, NoSQL, Open source, RDF and graphs, Structured documents, Text||2 Comments|
“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call for rapid human decision-making. Here are some notes on that conundrum.
1. I recently posted that “real-time” is getting real. But there are multiple technology challenges involved, including:
- General streaming. Some of my posts on that subject are linked at the bottom of my August post on Flink.
- Low-latency ingest of data into structures from which it can be immediately analyzed. That helps drive the (re)integration of operational data stores, analytic data stores, and other analytic support — e.g. via Spark.
- Business intelligence that can be used quickly enough. This is a major ongoing challenge. My clients at Zoomdata may be thinking about this area more clearly than most, but even they are still in the early stages of providing what users need.
- Advanced analytics that can be done quickly enough. Answers there may come through developments in anomaly management, but that area is still in its super-early days.
- Alerting, which has been under-addressed for decades. Perhaps the anomaly management vendors will finally solve it.
|Categories: Business intelligence, Databricks, Spark and BDAS, In-memory DBMS, Investment research and trading, Log analysis, Streaming and complex event processing (CEP), Text, Web analytics, Zoomdata||3 Comments|
Then felt I like some watcher of the skies
When a new planet swims into his ken
— John Keats, “On First Looking Into Chapman’s Homer”
1. In June I wrote about why anomaly management is hard. Well, not only is it hard to do; it’s hard to talk about as well. One reason, I think, is that it’s hard to define what an anomaly is. And that’s a structural problem, not just a semantic one — if something is well enough understood to be easily described, then how much of an anomaly is it after all?
Artificial intelligence is famously hard to define for similar reasons.
“Anomaly management” and similar terms are not yet in the software marketing mainstream, and may never be. But naming aside, the actual subject matter is important.
2. Anomaly analysis is clearly at the heart of several sectors, including:
- IT operations
- Factory and other physical-plant operations
Each of those areas features one or both of the frameworks:
- Surprises are likely to be bad.
- Coincidences are likely to be suspicious.
So if you want to identify, understand, avert and/or remediate bad stuff, data anomalies are the first place to look.
3. The “insights” promised by many analytics vendors — especially those who sell to marketing departments — are also often heralded by anomalies. Already in the 1970s, Walmart observed that red clothing sold particularly well in Omaha, while orange flew off the shelves in Syracuse. And so, in large college towns, they stocked their stores to the gills with clothing in the colors of the local football team. They also noticed that fancy dresses for little girls sold especially well in Hispanic communities … specifically for girls at the age of First Communion.
|Categories: Business intelligence, Log analysis, Predictive modeling and advanced analytics, Web analytics||Leave a Comment|
1. The cloud is super-hot. Duh. And so, like any hot buzzword, “cloud” means different things to different marketers. Four of the biggest things that have been called “cloud” are:
- The Amazon cloud, Microsoft Azure, and their competitors, aka public cloud.
- Software as a service, aka SaaS.
- Co-location in off-premises data centers, aka colo.
- On-premises clusters (truly on-prem or colo as the case may be) designed to run a broad variety of applications, aka private cloud.
Further, there’s always the idea of hybrid cloud, in which a vendor peddles private cloud systems (usually appliances) running similar technology stacks to what they run in their proprietary public clouds. A number of vendors have backed away from such stories, but a few are still pushing it, including Oracle and Microsoft.
This is a good example of Monash’s Laws of Commercial Semantics.
2. Due to economies of scale, only a few companies should operate their own data centers, aka true on-prem(ises). The rest should use some combination of colo, SaaS, and public cloud.
This fact now seems to be widely understood.
I’ve been an analyst for 35 years, and debates about “real-time” technology have run through my whole career. Some of those debates are by now pretty much settled. In particular:
- Yes, interactive computer response is crucial.
- Into the 1980s, many apps were batch-only. Demand for such apps dried up.
- Business intelligence should occur at interactive speeds, which is a major reason that there’s a market for high-performance analytic RDBMS.
- Theoretical arguments about “true” real-time vs. near-real-time are often pointless.
- What matters in most cases is human users’ perceptions of speed.
- Most of the exceptions to that rule occur when machines race other machines, for example in automated bidding (high frequency trading or otherwise) or in network security.
A big issue that does remain open is: How fresh does data need to be? My preferred summary answer is: As fresh as is needed to support the best decision-making. I think that formulation starts with several advantages:
- It respects the obvious point that different use cases require different levels of data freshness.
- It cautions against people who think they need fresh information but aren’t in a position to use it. (Such users have driven much bogus “real-time” demand in the past.)
- It covers cases of both human and automated decision-making.
Straightforward applications of this principle include: Read more
I used to spend most of my time — blogging and consulting alike — on data warehouse appliances and analytic DBMS. Now I’m barely involved with them. The most obvious reason is that there have been drastic changes in industry structure:
- Many of the independent vendors were swooped up by acquisition.
- None of those acquisitions was a big success.
- Microsoft did little with DATAllegro.
- Netezza struggled with R&D after being bought by IBM. An IBMer recently told me that their main analytic RDBMS engine was BLU.
- I hear about Vertica more as a technology to be replaced than as a significant ongoing market player.
- Pivotal open-sourced Greenplum. I have detected few people who care.
- Ditto for Actian’s offerings.
- Teradata claimed a few large Aster accounts, but I never hear of Aster as something to compete or partner with.
- Smaller vendors fizzled too. Hadapt and Kickfire went to Teradata as more-or-less acquihires. InfiniDB folded. Etc.
- Impala and other Hadoop-based alternatives are technology options.
- Oracle, Microsoft, IBM and to some extent SAP/Sybase are still pedaling along … but I rarely talk with companies that big.
Simply reciting all that, however, begs the question of whether one should still care about analytic RDBMS at all.
My answer, in a nutshell, is:
Analytic RDBMS — whether on premises in software, in the form of data warehouse appliances, or in the cloud – are still great for hard-core business intelligence, where “hard-core” can refer to ad-hoc query complexity, reporting/dashboard concurrency, or both. But they aren’t good for much else.