Hadapt is moving forward
I’ve talked with my clients at Hadapt a couple of times recently. News highlights include:
- The Hadapt 1.0 product is going “Early Access” today.
- General availability of Hadapt 1.0 is targeted for an officially unspecified time frame, but it’s soon.
- Hadapt raised a nice round of venture capital.
- Hadapt added Sharmila Mulligan to the board.
- Dave Kellogg is in the picture too, albeit not as involved as Sharmila.
- Hadapt has moved to Cambridge, which is preferable to Yale environs for obvious reasons. (First location = space they’re borrowing from their investors at Bessemer.)
- Headcount is in the low teens, with a target of doubling fast.
The Hadapt product story hasn’t changed significantly from what it was before.
Lessons from T-Mobile’s epic fail
When my electric power came back on but my Verizon FiOS internet connection didn’t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile’s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and possibly an instructive one at that.
T-Mobile’s instructions tell you that you need to know the factory defaults for network name and password. That makes sense. What they don’t tell you is that you also need your SIM card number (included), your IMEI number (included), and an authorization number (not included).
That’s right — you need a number that T-Mobile doesn’t tell you you need. But the story gets a lot worse from there, because it’s almost impossible to get that number from them. I eventually talked with approximately eight T-Mobile call center associates over the course of the evening before getting successfully connected.
MarkLogic’s Hadoop connector
It’s time to circle back to a subject I skipped when I otherwise wrote about MarkLogic 5: MarkLogic’s new Hadoop connector.
Most of what’s confusing about the MarkLogic Hadoop Connector lies in two pairs of options it presents you with (a hedged configuration sketch follows the list):
- Hadoop can talk XQuery to MarkLogic. But alternatively, Hadoop can use a long-established simple(r) Java API for streaming documents into or out of a MarkLogic database.
- Hadoop can make requests to MarkLogic in MarkLogic’s normal mode of operation, namely to address any node in the MarkLogic cluster, which then serves as a “head” node for the duration of that particular request. But alternatively, Hadoop can use a long-standing MarkLogic option to circumvent the whole DBMS cluster and only talk to one specific MarkLogic node.
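Since the connector is configuration-driven, here is a minimal job-setup sketch. Caveat: the com.marklogic.mapreduce class names and configuration property names below reflect my reading of the connector’s documentation, and should be treated as assumptions rather than confirmed specifics.

```java
// A minimal sketch of a Hadoop job that reads documents out of MarkLogic and
// writes results back in. The com.marklogic.mapreduce class names and the
// configuration property names are assumptions based on the connector docs.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import com.marklogic.mapreduce.ContentOutputFormat;
import com.marklogic.mapreduce.DocumentInputFormat;

public class MarkLogicRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connection settings live in configuration, not code. In MarkLogic's
        // normal mode of operation, the host below is just one node of the
        // cluster, which acts as the "head" node for each request it receives.
        conf.set("mapreduce.marklogic.input.host", "ml-node.example.com");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.output.host", "ml-node.example.com");
        conf.set("mapreduce.marklogic.output.port", "8000");

        Job job = Job.getInstance(conf, "marklogic-round-trip");
        job.setJarByClass(MarkLogicRoundTrip.class);
        // Parallel reads from MarkLogic: each map task gets a slice of the
        // database rather than everything funneling through one connection.
        job.setInputFormatClass(DocumentInputFormat.class);
        // Parallel writes back into MarkLogic.
        job.setOutputFormatClass(ContentOutputFormat.class);
        // Mapper and reducer classes are omitted here; to them, MarkLogic
        // documents look like ordinary Hadoop key/value pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```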
Otherwise, the whole thing is just what you would think:
- Hadoop can read from and write to MarkLogic, in parallel at both ends.
- If Hadoop is just writing to MarkLogic, there’s a good chance the process is properly called “ETL.”
- If Hadoop is reading a lot from MarkLogic, there’s a good chance the process is properly called “batch analytics.”
MarkLogic said that it wrote this Hadoop connector itself.
The cool aspects of Odiago WibiData
Christophe Bisciglia and Aaron Kimball have a new company.
- It’s called Odiago, and is one of my gratifyingly more numerous tiny clients.
- Odiago’s product line is called WibiData, after the justly popular We Be Sushi restaurants.
- We’ve agreed on a split exclusive de-stealthing launch. You can read about the company/founder/investor stuff on TechCrunch. But this is the place for — well, for the tech crunch.
WibiData is designed for management of, investigative analytics on, and operational analytics on consumer internet data, the main examples of which are web site traffic and personalization and their analogues for games and/or mobile devices. The core WibiData technology, built on HBase and Hadoop,* is a data management and analytic execution layer. That’s where the secret sauce resides. Also included are:
- REST APIs for interactive access.
- Import/export tools, including JDBC access.
- Management tools.
- Analytic libraries — data mining, predictive analytics, machine learning, and so on.
The whole thing is in beta, with about three (paying) beta customers.
*And Avro and so on.
The core ideas of WibiData include:
- ALL data pertaining to a single user (or mobile device) is kept in a single, possibly very long, HBase row.
- There are two primary operators in WibiData: Produce and Gather. (A hypothetical sketch of Produce follows this list.)
- Produce operates on single rows. It can operate on one row at HBase speed (milliseconds) if you need to inform an interactive user response. Or it can operate on the whole database in batch via Hadoop MapReduce.
- It is reasonable to think of Produce as mainly doing two things. One is the aforementioned serving of data out of WibiData into interactive applications. The other is scoring, classifying, recommending, etc. on individual users (i.e. rows), in line with an analytic model.
- Gather typically operates on all your rows at once, and emits suitable input for a MapReduce Reduce step. It is reasonable to think of Gather as being a key cog in the training of analytic models.
- HBase schema management is done at the WibiData system level, not directly in applications. There’s a WibiData HBase data dictionary, powered by a set of system tables, that specifies cell data types/record types and, in effect, primitive schemas.
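WibiData hasn’t published its API, so what follows is strictly hypothetical: every name in it (Producer, UserRow, ProducerContext, ChurnScoreProducer) is invented for illustration. The point it illustrates is real, though: Produce reads one user’s row, applies a model, and writes derived columns back into that same row.

```java
// Strictly hypothetical sketch; WibiData's real API is not public, and every
// name here is invented. What it illustrates is from the post: Produce reads
// a single HBase row holding ALL of one user's data, applies an analytic
// model, and writes the derived result back into the row.
public class WibiProduceSketch {

    /** Read-only view of one user's (possibly very long) HBase row. */
    interface UserRow {
        long getCounter(String family, String qualifier);
    }

    /** Sink for derived data written back into the same row. */
    interface ProducerContext {
        void put(String family, String qualifier, double value);
    }

    /** The hypothetical per-row operator. */
    interface Producer {
        void produce(UserRow row, ProducerContext context);
    }

    /** Toy example: score one user for churn risk, in line with a "model". */
    static class ChurnScoreProducer implements Producer {
        @Override
        public void produce(UserRow row, ProducerContext context) {
            long clicks = row.getCounter("clickstream", "click_count");
            long purchases = row.getCounter("commerce", "purchase_count");
            double score = (purchases == 0)
                ? 0.9
                : Math.min(1.0, clicks / (10.0 * purchases));
            // The same operator can run on one row at HBase speed to inform
            // an interactive response, or over the whole table in batch via
            // Hadoop MapReduce.
            context.put("derived", "churn_score", score);
        }
    }
}
```

Gather, by contrast, would sweep across all rows and emit Reduce-ready input, which is why it is the natural cog for model training rather than model application.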
MarkLogic 5, and why you might care
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
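If the footnote’s LRU point seems abstract, here is a generic illustration of a caching-style flush, built on Java’s LinkedHashMap in access order. It is deliberately toy-grade and has nothing to do with MarkLogic’s actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Generic LRU illustration only; not MarkLogic code. Hot entries stay in the
// fast (think solid-state) tier, and the least recently used entry is
// "flushed" to the disk tier once capacity is exceeded.
public class LruTier<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruTier(int capacity) {
        super(16, 0.75f, true); // accessOrder = true yields LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > capacity) {
            flushToDisk(eldest.getKey(), eldest.getValue());
            return true; // evict from the fast tier
        }
        return false;
    }

    private void flushToDisk(K key, V value) {
        // Stub: a real tiered store would persist the evicted page here.
    }
}
```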
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the get-go for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
Where Datameer is positioned
I’ve chatted with Datameer a couple of times recently, mainly with CEO Stefan Groschupf, most recently after XLDB last Tuesday. Nothing I learned greatly contradicts what I wrote about Datameer 1 1/2 years ago. In a nutshell, Datameer is designed to let you do simple stuff on large amounts of data, where “large amounts of data” typically means data in Hadoop, and “simple stuff” includes basic versions of a spreadsheet, of BI, and of EtL (Extract/transform/Load, without much in the way of T).
Stefan reports that these capabilities are appealing to a significant fraction of enterprise or other commercial Hadoop users, especially the EtL and the BI. I don’t doubt him.
NoSQL notes
Last week I visited with James Phillips of Couchbase, Max Schireson and Eliot Horowitz of 10gen, and Todd Lipcon, Eric Sammer, and Omer Trajman of Cloudera. I guess it’s time for a round-up NoSQL post. 🙂
Views of the NoSQL market horse race are reasonably consistent, with perhaps some elements of “Where you stand depends upon where you sit.”
- As James tells it, NoSQL is simply a three-horse race among Couchbase, MongoDB, and Cassandra.
- Max would include HBase on the list.
- Further, Max pointed out that metrics such as job listings suggest MongoDB has the most development activity, and Couchbase/Membase/CouchDB perhaps have less.
- The Cloudera guys remarked on some serious HBase adopters.
- Everybody I spoke with agreed that Riak had little current market presence, although some Basho guys could surely be found who’d disagree.
Transparent relational OLTP scale-out
There’s a perception that, if you want (relatively) worry-free database scale-out, you need a non-relational/NoSQL strategy. That perception is false. In the analytic case it’s completely ridiculous, as has been demonstrated by Teradata, Vertica, Netezza, and various other MPP (Massively Parallel Processing) analytic DBMS vendors. And now it’s false for short-request/OLTP (OnLine Transaction Processing) use cases as well.
My favorite relational OLTP scale-out choice these days is the SchoonerSQL/dbShards partnership. Schooner Information Technology (SchoonerSQL) and Code Futures (dbShards) are young, small companies, but I’m not too concerned about that, because the APIs they want you to write to are just MySQL’s. The main scenarios in which I can see them failing are ones in which they are competitively leapfrogged, either by other small competitors (e.g., ScaleBase, Akiban, TokuDB, or ScaleDB) or by Oracle/MySQL itself. While that could suck for my clients Schooner and Code Futures, it would still provide users relying on MySQL scale-out with one or more good product alternatives.
Relying on non-MySQL NewSQL startups, by way of contrast, would leave me somewhat more concerned. (However, if their code is open sourced, you have at least some vendor-failure protection.) And big-vendor scale-out offerings, such as Oracle RAC or DB2 pureScale, may be more complex to deploy and administer than the MySQL and NewSQL alternatives.
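To make “the APIs they want you to write to are just MySQL’s” concrete: application code stays ordinary MySQL JDBC, while a dbShards-style driver or proxy routes each statement to the appropriate shard. A minimal sketch, with the connection URL and shard-key details purely hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of "transparent" sharding from the application's point of view.
// The URL below is hypothetical; with a dbShards-style product, a driver or
// proxy at that address routes each statement to the right shard, and the
// application code is indistinguishable from plain single-server MySQL.
public class ShardedQueryExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://shard-router.example.com:3306/orders",
                 "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT order_id, total FROM orders WHERE customer_id = ?")) {
            ps.setLong(1, 12345L); // customer_id doubles as the shard key
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%d -> %.2f%n",
                        rs.getLong(1), rs.getDouble(2));
                }
            }
        }
    }
}
```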
Schooner pivots further
Schooner Information Technology started out as a complete-system MySQL appliance vendor. Then Schooner went software-only, but continued to brag about great performance in configurations with solid-state drives. Now Schooner has pivoted further, and is emphasizing high availability, clustered performance, and other hardware-agnostic OLTP (OnLine Transaction Processing) features. Fortunately, Schooner has some interesting stuff in those areas to talk about.
The short form of the SchoonerSQL (as Schooner’s product is now called) story goes roughly like this:
- SchoonerSQL replicates data — synchronously if the replication target is local, asynchronously if it is remote. (An illustrative sketch follows this list.)
- Local synchronous replication provides high availability; remote asynchronous replication provides disaster recovery.
- SchoonerSQL’s local synchronous replication also provides read scale-out.
- Schooner has a partnership with Code Futures/dbShards to provide write scale-out via transparent sharding.
- SchoonerSQL has some secret sauce in replication performance. This has the effect of significantly increasing write performance (assuming you were going to replicate anyway), because otherwise you might have to slow down the master server’s write performance so that the slaves can keep up with it.
- Schooner believes it still has some single-server performance advantages as well.
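To make the synchronous/asynchronous split concrete, here is a purely illustrative sketch, in no way Schooner’s actual code: a commit is acknowledged only after every local replica has the log record, while remote replicas receive it in the background.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Purely illustrative; not Schooner's code. Local replicas are updated
// synchronously (high availability: commit waits for them), while remote
// replicas are updated asynchronously (disaster recovery: no waiting).
public class ReplicationManager {
    private final List<Replica> localReplicas;   // same site: synchronous
    private final List<Replica> remoteReplicas;  // other site: asynchronous

    public ReplicationManager(List<Replica> local, List<Replica> remote) {
        this.localReplicas = local;
        this.remoteReplicas = remote;
    }

    public void commit(byte[] logRecord) {
        // Do not acknowledge the commit until every local replica has it.
        localReplicas.forEach(r -> r.apply(logRecord));
        // Ship to the remote site without blocking the caller.
        remoteReplicas.forEach(r ->
            CompletableFuture.runAsync(() -> r.apply(logRecord)));
    }

    public interface Replica {
        void apply(byte[] logRecord);
    }
}
```

The “secret sauce” claim in the list above amounts to doing the synchronous half of this fast enough that the master never has to throttle its own writes for the slaves’ benefit.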
More notes on Oracle NoSQL
A reporter asked me for some thoughts on Oracle’s new NoSQL product. For the most part, I stand by my previous comments on Oracle NoSQL. Still, NoSQL in general deserves a place in Oracle shops, so it makes sense for Oracle to try to co-opt it.
Oracle’s core DBMS is not well suited to track interactions (e.g. web clicks), even in cases where it’s the choice for transactions; it’s unnecessarily heavyweight. What’s worse, using the same database to store actions and interactions can lead to serious reliability problems. If a better architecture is to dump the clicks into some NoSQL store, massage the information, and eventually put some derived data into a relational DBMS, then Oracle will naturally try to own each step of the data pipeline.
Dynamic schemas are another area of Oracle weakness, leading in some cases to outright Oracle replacements. However, pure key-value stores go too far to the opposite extreme; you should at least be able to index and retrieve data one field at a time. Based on what I’ve seen of Oracle’s marketing literature, that feature will be missing from the first release of Oracle’s NoSQL.* Until it’s in there, and until it works well, I don’t see why anybody should use Oracle’s NoSQL product.
*Frankly, that choice makes no sense to me on any level. Yet it’s the way Oracle seems to have elected to go — or, if it isn’t, then there’s somebody writing Oracle marketing collateral who’s clearly in the wrong line of work.
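To see why field-level retrieval matters, consider a toy sketch (all names invented, and nothing here is Oracle’s actual API): get-by-key is the one question a pure key-value store answers natively, while “all the clicks” forces a full scan unless the store maintains a secondary index for you.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the field-access complaint; every name is invented and
// none of this is Oracle's API.
public class KeyValueVsFieldAccess {
    static class UserEvent {
        final String userId, eventType;
        UserEvent(String userId, String eventType) {
            this.userId = userId;
            this.eventType = eventType;
        }
    }

    public static void main(String[] args) {
        Map<String, UserEvent> store = new HashMap<>();
        store.put("evt-1", new UserEvent("u-42", "click"));
        store.put("evt-2", new UserEvent("u-42", "purchase"));
        store.put("evt-3", new UserEvent("u-7", "click"));

        // A pure key-value store answers exactly one kind of question:
        UserEvent byKey = store.get("evt-2");

        // Retrieving by field ("all the clicks") means a full scan...
        List<UserEvent> clicks = new ArrayList<>();
        for (UserEvent e : store.values()) {
            if ("click".equals(e.eventType)) {
                clicks.add(e);
            }
        }
        // ...unless the store maintains a secondary index for you, which is
        // the capability the first release appears to lack.
        System.out.println(byKey.userId + " / " + clicks.size() + " clicks");
    }
}
```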