I’m taking a few weeks defocused from work, as a kind of grandpaternity leave. That said, the venue for my Dances of Infant Calming is a small-but-nice apartment in San Francisco, so a certain amount of thinking about tech industries is inevitable. I even found time last Tuesday to meet or speak with my clients at WibiData, MemSQL, Cloudera, Citus Data, and MongoDB. And thus:
1. I’ve been sloppy in my terminology around “geo-distribution”, in that I don’t always make it easy to distinguish between:
- Storing different parts of a database in different geographies, often for reasons of data privacy regulatory compliance.
- Replicating an entire database into different geographies, often for reasons of latency and/or availability/disaster recovery.
The latter case can be subdivided further depending on whether multiple copies of the data can accept first writes (aka active-active, multi-master, or multi-active), or whether there’s a clear single master for each part of the database.
What made me think of this was a phone call with MongoDB in which I learned that the limit on number of replicas had been raised from 12 to 50, to support the full-replication/latency-reduction use case.
2. Three years ago I posted about agile (predictive) analytics. One of the points was:
… if you change your offers, prices, ad placement, ad text, ad appearance, call center scripts, or anything else, you immediately gain new information that isn’t well-reflected in your previous models.
Subsequently I’ve been hearing more about predictive experimentation such as bandit testing. WibiData, whose views are influenced by a couple of Very Famous Department Store clients (one of which is Macy’s), thinks experimentation is quite important. And it could be argued that experimentation is one of the simplest and most direct ways to increase the value of your data.
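For readers who haven't seen bandit testing in code, here is a minimal sketch of one common variant, epsilon-greedy. It is an illustration only, not a description of what WibiData or anybody else actually runs. The function name, the `true_rates` conversion probabilities and the 10% exploration share are all assumptions made up for the simulation.

```python
import random

def epsilon_greedy(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy bandit testing over several offer variants.

    true_rates are hypothetical conversion probabilities. The algorithm
    never reads them directly; they only drive the simulated responses.
    """
    rng = random.Random(seed)
    n = len(true_rates)
    shows = [0] * n  # how often each variant was shown
    wins = [0] * n   # conversions observed for each variant

    for _ in range(rounds):
        if 0 in shows or rng.random() < epsilon:
            # Explore: show a random variant.
            arm = rng.randrange(n)
        else:
            # Exploit: show the variant with the best observed rate so far.
            arm = max(range(n), key=lambda i: wins[i] / shows[i])
        shows[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
    return shows, wins

if __name__ == "__main__":
    shows, wins = epsilon_greedy([0.05, 0.5])
    print("shown:", shows, "converted:", wins)
```

The point of the design is that most traffic flows to whatever currently looks best, while a small share keeps testing the alternatives, so the model keeps learning as offers change.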
3. I’d further say that a number of developments, trends or possibilities I’m seeing are or could be connected. These include agile and experimental predictive analytics in general, as noted in the previous point, along with:
The name of this blog comes from an August, 2005 column. 8 1/2 years later, that analysis holds up pretty well. Indeed, I’d keep the first two precepts exactly as I proposed back then:
- Task-appropriate data managers. Much of this blog is about task-appropriate data stores, so I won’t say more about them in this post.
- Drastic limitations on relational schema complexity. I think I’ve been vindicated on that one by, for example:
  - NoSQL and dynamic schemas.
  - Schema-on-read, and its smarter younger brother schema-on-need.
  - Limitations on the performance and/or allowed functionality of joins in scale-out short-request RDBMS, and the relative lack of complaints about same.
  - Funky database design from major Software as a Service (SaaS) vendors such as Workday and Salesforce.com.
  - A whole lot of logs.
I’d also keep the general sense of the third precept, namely appropriately-capable data integration, but for that one the specifics do need some serious rework.
For starters, let me say:
I’m not having a productive week, part of the reason being a hard drive crash that took out early drafts of what were to be last weekend’s blog posts. Now I’m operating from a laptop, rather than my preferred dual-monitor set-up. So please pardon me if I’m concise even by comparison to my usual standards.
- My recent posts based on surveillance news have been partly superseded by – well, by more news. Some of that news, along with some good discussion, may be found in the comment threads.
- The same goes for my recent Hadoop posts.
- The replay for my recent webinar on real-time analytics is now available. My part ran <25 minutes.
- One of my numerous clients using or considering a “real-time analytics” positioning is Sqrrl, the company behind the NoSQL DBMS Accumulo. Last month, Derrick Harris reported on a remarkable Accumulo success story – multiple US intelligence instances managing tens of petabytes each, and supporting a variety of analytic (I think mainly query/visualization) approaches.
- Several sources have told me that MemSQL’s Zynga sale is (in part) for Membase replacement. This is noteworthy because Zynga was the original pay-for-some-of-the-development Membase customer.
- More generally, the buzz out of Couchbase is distressing. Ex-employees berate the place; job-seekers check around and then decide not to go there; rivals tell me of resumes coming out in droves. Yes, there’s always some of that, even at obviously prospering companies, but this feels like more than the inevitable low-level buzz one hears anywhere.
- I think the predictive modeling state of the art has become:
  - Cluster in some way.
  - Model separately on each cluster.
  - And if you still want to do something that looks like a regression – linear or otherwise – then you might want to use a tool that lets you shovel training data in WITHOUT a whole lot of preparation* and receive a model back out. Even if you don’t accept that as your final model, it can at least be a great guide to feature selection (in the statistical sense of the phrase) and the like.
  - Champion/challenger model testing is also a good idea, at least if you’re in some kind of personalization/recommendation space, and have enough traffic to test like that.**
- Most companies have significant turnover after being acquired, perhaps after a “golden handcuff” period. Vertica is no longer an exception.
- Speaking of my clients at HP Vertica – they’ve done a questionable job of communicating that they’re willing to price their product quite reasonably. (But at least they allowed me to write about $2K/terabyte for hardware/software combined.)
- I’m hearing a little more Amazon Redshift buzz than I expected to. Just a little.
- StreamBase was bought by TIBCO. The rumor says $40 million.
*Basic and unavoidable ETL (Extract/Transform/Load) of course excepted.
**I could call that ABC (Always Be Comparing) or ABT (Always Be Testing), but they each sound like – well, like The Glove and the Lions.
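The “cluster, then model separately on each cluster” pattern mentioned above can be sketched in a few lines. This is a toy under stated assumptions, not anybody's production method: the clustering is a tiny 1-D k-means, the per-cluster model is an ordinary least-squares line, and every function name here is hypothetical.

```python
from statistics import mean

def kmeans_1d(xs, k=2, iters=20):
    """Tiny 1-D k-means; returns (cluster label per x, cluster centers)."""
    srt = sorted(xs)
    # Spread the initial centers across the sorted data.
    centers = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels, centers

def fit_line(points):
    """Ordinary least squares for y = a*x + b on one cluster's points."""
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    a = sum((x - mx) * (y - my) for x, y in points) / sxx if sxx else 0.0
    return a, my - a * mx

def cluster_then_model(data, k=2):
    """Cluster on x, then fit a separate (slope, intercept) per cluster."""
    labels, centers = kmeans_1d([x for x, _ in data], k)
    models = {}
    for j in range(k):
        pts = [p for p, lab in zip(data, labels) if lab == j]
        if pts:
            models[j] = fit_line(pts)
    return centers, models
```

A single regression across both groups would average away two quite different relationships; fitting per cluster keeps them apart, which is the whole appeal of the approach.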
Email delivery of posts has been screwed up; multiple people tell me they haven’t gotten their email for months. (In the future, please tell me of such difficulties!) So it’s time for a change, and I’m asking for your advice as to what you’d suggest for our mailing list.
Yes, I’m asking via a blog post, even though the core problem is that people who want to see my posts via e-mail aren’t getting them. Please work with me on this anyway.
My two basic questions are:
- What should be the frequency of delivery? To date, it’s been nightly (at least in theory).
- What delivery technology should be used? To date, it’s been FeedBlitz.
1. The nightly scheduling has been an artifact of an RSS-to-email link that no longer seems stable. So I’m thinking of just manually pasting each post into a list email, in which case:
- Posts could be sent without delay.
- Every post would be delivered by separate mail. (As opposed to having only one post per night be mailed, while others just get linked to.)
It’s a bit more work for me, but probably nothing dire. Does lower latency sound good to everybody?
2. The main technical options seem to be:
Analyzing companies of any size is hard. Analyzing large ones, however, is harder yet.
- I get (much) less substance in an hour on the phone with a megacorp than I do when I talk with a smaller company.
- What large companies say is less reliable than what I hear from smaller ones.
- Large companies have policies, procedures, bureaucracy and attitudes that get in the way of communicating in the first place.
Such limitations should be borne in mind in connection with anything I write about, for example, Oracle, Microsoft, IBM, or SAP.
There are many reasons for large companies to communicate less usefully with analysts than smaller ones do. Some of the biggest are:
- For reasons of internal information flow, the people I talk with just know less than their counterparts at smaller companies. Similarly, what they do “know” is more often wrong, since different parts of the same company may not hold identical views.
- That’s when we talk about real issues at all, which can get crowded out by large companies’ voluminous efforts in complex positioning, messaging, and product names.
- Huge companies have huge bureaucracies, and they hurt.
- A small company C-level executive can make smart decisions about what to say or not say. A large company minion doesn’t have the same freedom.
- Just the process of getting access to even a mid-level spokesminion at a large company is harder than reaching a senior person at a smaller outfit.
- Large firms are clearest when communicating with their existing customers and those organizations’ key influencers. They’re less effective or clear when opening themselves up to competitive comparisons.
- If a company wants to behave unethically in its analyst dealings, there are economies of scale to doing so.
Please disregard any intentions I expressed of traveling in October, in particular a trip to visit 20 or so California clients. I’m under doctor’s orders not to fly for several weeks, and also don’t feel like driving (or walking) any significant distances. Any meetings I have in the very near future will either be telephonic, or else within a few minutes’ drive of my home office in Acton, MA.
The story behind this is:
- Istanbul sidewalks have a lot of knee/shin-height metal poles to separate streets and driveways from sidewalks.
- I stumbled over one of the shin-height ones.
- I have some pretty dramatic bruising, and associated pain.
- Bruises + plane flights = risk of blood clot.
Fortunately, that’s all it is — no fracture, and the sprain per se is mild. But about 4 doctors and nurses have told me this is really unusual bruising. Nobody has offered a precise opinion as to how soon it will clear up, but I gather the good case is 2-4 weeks and the bad case is twice that.
I should have plenty of opportunity to blog.
What are the central challenges in internet system design? We probably all have similar lists, comprising issues such as scale, scale-out, throughput, availability, security, programming ease, UI, or general cost-effectiveness. Screw those up, and you don’t have an internet business.
Much new technology addresses those challenges, with considerable success. But the success is usually one silo at a time — a short-request application here, an analytic database there. When it comes to integration, unsolved problems abound.
The top integration and integration-like challenges for me, from a practical standpoint, are:
- Integrating silos — a decades-old problem still with us in a big way.
- Dynamic schemas with joins.
- Low-latency business intelligence.
- Human real-time personalization.
Other concerns that get mentioned include:
- Geographical distribution due to privacy laws, which for some users is a major requirement for compliance.
- Logical data warehouse, a term that doesn’t actually mean anything real.
- In-memory data grids, which some day may no longer always be hand-coupled to the application and data stacks they accelerate.
Let’s skip those latter issues for now, focusing instead on the first four.
Enterprise software terminology is too often mired in confusion. I hope to lessen that by publishing a series of web pages that define and describe various industry terms, with one or several paragraphs per subject, and plenty of internal and external links.
Absent a better name, I’ll refer to this as an “analytic glossary”. I want users of the analytic glossary to learn or confirm:
- What terms mean.
- Which products exemplify which product categories.
- Which additional subjects to look into.
I will do or closely direct the core writing myself. I may hire outside help for ancillary tasks, such as adding links, or for various kinds of wordsmithing.
All this presupposes a site redesign, which hasn’t begun. But I’ve started to draft the content. As I do, I’ll post it here. And I very much want you to comment.
If you think I got something wrong or left anything out, even if it’s just a nuance, please speak up! Later, when the glossary pages are live, I’ll link them back to the original blog post discussions. Thus, your comments will be part of the permanent glossary record.
After several hours of DBMS 2 being down, I put out a “We’re broken” note from another blog. Naturally, the next fix I tried seems to have worked. My joy in that far outweighs my embarrassment. This kind of thing just happens once in a while when one has business-critical software that isn’t good at having a test-to-production staging cycle.
In case anybody ever runs into the same problems, the short form of the story is:
1. DBMS2.com came down due to a corrupted automatic upgrade of WordPress.
2. The fix was to do an automatic install of WordPress to a dummy domain, then copy over the files in the domain’s root and wp-includes directories.
3. The one file that needed to be copied back from the old installation was wp-config.php. (Once it occurred to me to start reading from index.php, it took me about 1 minute to figure that out …)
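For anybody who would rather script that than do it by hand, here is a rough sketch of one reading of the steps above. It assumes the configuration file is WordPress's standard wp-config.php, and the function name and the `fresh`/`live` directory arguments are hypothetical. Back up the live directory before trying anything like this.

```python
import shutil
from pathlib import Path

def restore_core_files(fresh, live, keep=("wp-config.php",)):
    """Copy root-level files and wp-includes from a fresh install over a broken one.

    fresh: directory of the clean install (the "dummy domain").
    live:  directory of the broken site.
    Files named in `keep` are never overwritten in `live`.
    Returns the names of the root-level files that were replaced.
    """
    fresh, live = Path(fresh), Path(live)
    replaced = []
    for src in sorted(fresh.iterdir()):
        # Only plain files in the root; subdirectories are handled separately.
        if src.is_file() and src.name not in keep:
            shutil.copy2(src, live / src.name)
            replaced.append(src.name)
    # Copy the wp-includes tree, overwriting files that already exist (Python 3.8+).
    shutil.copytree(fresh / "wp-includes", live / "wp-includes", dirs_exist_ok=True)
    return replaced
```

Leaving wp-config.php untouched is the scripted equivalent of copying it back afterwards, since that file holds the site's own database settings.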
Our blogs have been moved to a new hosting company, and everything should be working. Ditto our business site.
If you notice any counterexamples, please be so kind as to ping me.