Data types
Analysis of data management technology optimized for specific datatypes, such as text, geospatial, object, RDF, or XML. Related subjects include:
- Any subcategory
- Database diversity
Sumo Logic and UIs for text-oriented data
I talked with the Sumo Logic folks for an hour Thursday. Highlights included:
- Sumo Logic does SaaS (Software as a Service) log management.
- Sumo Logic is text indexing/Lucene-based. Thus, it is reasonable to think of Sumo Logic as “Splunk-like”. (However, Sumo Logic seems to have a stricter security/trouble-shooting orientation than Splunk, which is trying to branch out.)
- Sumo Logic has hacked Lucene for faster indexing, and says 10-30 second latencies are typical.
- Sumo Logic’s main differentiation is automated classification of events.
- There’s some kind of streaming engine in the mix, to update counters and drive alerts.
- Sumo Logic has around 30 “customers,” free (mainly) or paying (around 5) as the case may be.
- A truly typical Sumo Logic customer has single to low double digits of gigabytes of log data per day. However, Sumo Logic seems highly confident in its ability to handle a terabyte per customer per day, give or take a factor of 2.
- When I asked about the implications of shipping that much data to a remote data center, Sumo Logic observed that log data compresses really well.
- Sumo Logic recently raised a bunch of venture capital.
- Sumo Logic’s founders are out of ArcSight, a log management company HP paid a bunch of money for.
- Sumo Logic coined a marketing term “LogReduce”, but it has nothing to do with “MapReduce”. Sumo Logic seems to find this amusing.
What interests me about Sumo Logic is that automated classification story. I thought I heard Sumo Logic say: Read more
| Categories: Log analysis, Market share and customer counts, Predictive modeling and advanced analytics, Software as a Service (SaaS), Text | 2 Comments |
Splunk update
Splunk is announcing the Splunk 4.3 point release. Before discussing it, let’s recall a few things about Splunk, starting with:
- Splunk is first and foremost an analytic DBMS …
- … used to manage logs and similar multistructured data.
- Splunk’s DML (Data Manipulation Language) is based on text search, not on SQL.
- Splunk has extended its DML in natural ways (e.g., you can use it to do calculations and even some statistics).
- Splunk bundles some (very) basic, Splunk-specific business intelligence capabilities.
- The paradigmatic use of Splunk is to monitor IT operations in real time. However:
- There also are plenty of non-real-time uses for Splunk.
- Splunk is proudest of its growth in non-IT quasi-real-time uses, such as the marketing side of web operations.
As in any release, a lot of Splunk 4.3 is about “Oh, you didn’t have that before?” features and Bottleneck Whack-A-Mole performance speed-up. One performance enhancement is Bloom filters, which are a very hot topic these days. More important is a switch from Flash to HTML5, so as to accommodate mobile devices with less server-side rendering. Splunk reports that its users — especially the non-IT ones — really want to get Splunk information on the tablet devices. While this somewhat contradicts what I wrote a few days ago pooh-poohing mobile BI, let me hasten to point out:
- Splunk is used for a lot of (quasi) real-time monitoring.
- Splunk’s desktop user interfaces are, by BI standards, quite primitive.
That’s pretty much the ideal scenario for mobile BI: Timeliness matters and prettiness doesn’t.
| Categories: Business intelligence, Data models and architecture, Data warehousing, Log analysis, Specific users, Splunk, Structured documents, Web analytics | 3 Comments |
Lessons from T-Mobile’s epic fail
When my electric power came back on but my Verizon FiOS internet connection didn’t, it was time for a mobile hotspot/prepaid wireless internet service. T-Mobile’s 4G Mobile Hotspot/Prepaid Mobile Broadband offering seemed like a good choice. But the experience of setting it up was a nightmare, and a possible instructive nightmare at that.
T-Mobile’s instructions tell you that you need to know the factory defaults for network name and password. That makes sense. They don’t also tell you that you need to know your SIM card number (included), IMEI number (included), or authorization number (not included).
That’s right — you need a number that T-Mobile doesn’t tell you you need. But the story gets a lot worse from there, because it’s almost impossible to get the number from them. I eventually talked with approximately 8 T-Mobile call center associates over the course of the evening before getting successfully connected.
| Categories: Specific users, Text | Leave a Comment |
MarkLogic 5, and why you might care
MarkLogic is releasing MarkLogic 5. Key elements of the announcement are:
- More-of-the-same in line with MarkLogic’s core positioning.
- A new bi-directional Hadoop connector.
- A free MarkLogic Express edition, limited in license terms more than in actual features, as per Slide 27 of the deck MarkLogic graciously supplied for me to post.
Also, MarkLogic is early with a feature that most serious DBMS vendors will soon have – support for tiered storage, with writes going first to solid-state storage, then being flushed to disk via a caching-style algorithm.* And as befits a sometime search-engine-substitute, MarkLogic has finally licensed a large set of document filters, from an Australian company called Isys. Apparently, the special virtue of the Isys filters is that they’re good at extracting not only text, but metadata as well.
*If there’s a caching algorithm that doesn’t contain a major element of LRU (Least Recently Used), I don’t recall ever hearing about it.
MarkLogic seems to have settled on a positioning that, although distressingly buzzword-heavy, is at least partly based upon reality. The real part includes:
- MarkLogic is a serious, enterprise-class DBMS (see for example Slide 12 of the MarkLogic deck) …
- … which has been optimized from the getgo for poly-structured data.
- MarkLogic can and does scale out to handle large amounts of data.
- MarkLogic is a general-purpose DBMS, suitable for both short-request and analytic tasks.
- MarkLogic is particularly well suited for analyses with long chains of “progressive enhancement” (MarkLogic’s favorite term when talking about derived data).
- MarkLogic often plays the role of a content assembler and/or search engine, and the people who use MarkLogic in those ways are commonly doing things that can be described as research and analysis.
Based on that reality, MarkLogic talks a lot about Volume, Velocity, Variety, Big Data, unstructured data, semi-structured data, and big data analytics.
| Categories: Hadoop, MarkLogic, Market share and customer counts, Scientific research, Solid-state memory, Structured documents, Text | 1 Comment |
Text data management, Part 3: Analytic and progressively enhanced
This is Part 3 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve gone on for two long posts about text data management already, but even so I’ve glossed over a major point:
Using text data commonly involves a long series of data enhancement steps.
Even before you do what we’d normally think of as “analysis”, text markup can include steps such as:
- Figure out where the words break.
- Figure out where the clauses and sentences break.
- Figure out where the paragraphs, sections, and chapters break.
- (Where necessary) map the words to similar ones — spelling correction, stemming, etc.
- Figure out which words are grammatically which parts of speech.
- Figure out which pronouns and so on refer to which other words. (Technical term: Anaphora resolution.)
- Figure out what was being said, one clause at a time.
- Figure out the emotion — or “sentiment” — associated with it.
Those processes can add up to dozens of steps. And maybe, six months down the road, you’ll think of more steps yet.
| Categories: Data warehousing, Hadoop, NoSQL, Text | 3 Comments |
Text data management, Part 2: General and short-request
This is Part 2 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
I’ve recently given widely varied advice about managing text (and similar files — images and so on), ranging from
Sure, just keep going with your old strategy of keeping .PDFs in the file system and pointing to them from the relational database. That’s an easy performance optimization vs. having the RDBMS manage them as BLOBs.
to
I suspect MongoDB isn’t heavyweight enough for your document management needs, let alone just dumping everything into Hadoop. Why don’t you take a look at MarkLogic?
Here are some reasons why.
There are three basic kinds of text management use case:
- Text as payload.
- Text as search parameter.
- Text as analytic input.
| Categories: MarkLogic, NoSQL, Text | 5 Comments |
Text data management, Part 1: Confusion
This is Part 1 of a three post series. The posts cover:
- Confusion about text data management.
- Choices for text data management (general and short-request).
- Choices for text data management (analytic).
There’s much confusion about the management of text data, among technology users, vendors, and investors alike. Reasons seems to include:
- The terminology around text data is inaccurate.
- Data volume estimates for text are misleading.
- Multiple different technologies are in the mix, including:
- Enterprise text search.
- Text analytics — text mining, sentiment analysis, etc.
- Document stores — e.g. document-oriented NoSQL, or MarkLogic.
- Log management and parsing — e.g. Splunk.
- Text archiving — e.g., various specialty email archiving products I couldn’t even name.
- Public web search — Google et al.
- Text search vendors have disappointed, especially technically.
- Text analytics vendors have disappointed, especially financially.
- Other analytic technology vendors ignore what the text analytic vendors actually have accomplished, and reinvent inferior wheels rather than OEM the state of the art.
Above all: The use cases for text data vary greatly, just as the use cases for simply-structured databases do.
There are probably fewer people now than there were six years ago who need to be told that text and relational database management are very different things. Other misconceptions, however, appear to be on the rise. Specific points that are commonly overlooked include: Read more
| Categories: Analytic technologies, Archiving and information preservation, Google, Log analysis, MarkLogic, NoSQL, Oracle, Splunk, Text | 2 Comments |
Defining NoSQL
A reporter tweeted: ”Is there a simple plain English definition for NoSQL?” After reminding him of my cynical yet accurate Third Law of Commercial Semantics, I gave it a serious try, and came up with the following. More precisely, I tweeted the bolded parts of what’s below; the rest is commentary added for this post.
NoSQL is most easily defined by what it excludes: SQL, joins, strong analytic alternatives to those, and some forms of database integrity. If you leave all four out, and you have a strong scale-out story, you’re in the NoSQL mainstream. Read more
| Categories: Cassandra, MarkLogic, MySQL, Object, Open source, Petabyte-scale data management, Schooner Information Technology, dbShards and CodeFutures | 6 Comments |
Oracle NoSQL is unlikely to be a big deal
Alex Williams noticed that there will be a NoSQL session at Oracle OpenWorld next week, and is wondering whether this will be a big deal. I think it won’t be.
There really are three major points to NoSQL.
- Dynamic schemas. This is the only one of the three that truly depends on NoSQL.
- Scale-out short-request processing. If you want to scale out efficiently at high request volumes, you’re best off not using all the flexibility SQL/relational DBMS offer. (In particular, you don’t want to do cross-node joins). Not coincidentally, a number of the best scale-out offerings were built to be NoSQL.
- Open source. Doing a relational DBMS is a big project. It seems easier to build NoSQL ones.
Oracle can address the latter two points as aggressively as it wishes via MySQL. It so happens I would generally recommend MySQL enhanced by dbShards, Schooner, and/or dbShards/Schooner, rather than Oracle-only MySQL … but that’s a detail. In some form or other, Oracle’s MySQL is a huge player in the scale-out, open source, short-request database management market.
So that leaves us with dynamic schemas. Oracle has at least four different sets of technology in that area:
- As Workday noticed years ago, MySQL can be used as a functional, basic key-value store.
- Oracle also has XML-based Berkeley DB/SleepyCat kicking around.*
- The XML extensions to Oracle’s core DBMS could be alleged to have a dynamic schema/NoSQL flavor. (Blech.)
- A dynamic schema argument could also be made for object-oriented DBMS technology. While Oracle doesn’t to my knowledge exactly sell that, it does have the Tangosol Coherence line of technology, with a potentially similar programming model.
If Oracle is now refreshing and rebranding one or more of these as “NoSQL”, there’s no reason to view that as a big deal at all.
*That’s Mike Olson’s former company, if you’re keeping score at home.
| Categories: MySQL, NoSQL, OLTP, Object, Open source, Oracle, Parallelization, Schooner Information Technology, Structured documents | 11 Comments |
The database architecture of salesforce.com, force.com, and database.com
salesforce.com, force.com, and database.com use exactly the same database infrastructure and architecture. That’s the good news. The bad news is that salesforce.com is somewhat obscure about technical details, for reasons such as:
- A long-ago marketing decision to not give infrastructure details, so as to convey a “Don’t worry; we’ll take care of everything” message.
- Even so, a long-ago and perhaps now-regretted marketing decision to disclose and even exaggerate salesforce.com’s reliance on Oracle, as part of an early-days attempt to prove salesforce was using enterprise-class technology.
- A desire to hide the recipe for salesforce.com’s secret sauce.
- Force of habit — I’m not sure salesforce even knows how to tell its technical story with any clarity.
Actually, salesforce.com has moved some kinds of data out of Oracle that previously used to be stored there. Besides Oracle, salesforce uses at least a file system and a RAM-based data store about which I have no details. Even so, much of salesforce.com’s data is stored in Oracle — a single instance of Oracle, which it believes may be the largest instance of Oracle in the world.
