Complex event processing (CEP)
Discussion of complex event processing (CEP), aka event processing or stream processing – i.e., of technology that executes queries before data is ever stored on disk. Related subjects include:
When we scheduled a call to talk about Sentry, Cloudera’s Charles Zedlewski and I found time to discuss other stuff as well. One interesting part of our discussion was around the processing “frameworks” Cloudera sees as most important.
- The four biggies are:
- MapReduce. Duh.
- SQL, specifically Impala. This is as opposed to the uneasy Hive/MapReduce layering.
- “Math” , which seems to mainly be through partnerships with SAS and Revolution Analytics. I don’t know a lot about how these work, but I presume they bypass MapReduce, in which case I could imagine them greatly outperforming Mahout.
- Stream processing (Storm) is next in line.
- Graph — e.g. Giraph — rises to at least the proof-of-concept level. Again, the hope would be that this well outperforms graph-on-MapReduce.
- Charles is also seeing at least POC interest in Spark.
- But MPI (Message Passing Interface) on Hadoop isn’t going anywhere fast, except to the extent it’s baked into SAS or other “math” frameworks. Generic MPI use cases evidently turn out to be a bad fit for Hadoop, due to factors such as:
- Low data volumes.
- Latencies in various parts of the system
HBase was artificially omitted from this “frameworks” discussion because Cloudera sees it as a little bit more of a “storage” system than a processing one.
Another good subject was offloading work to Hadoop, in a couple different senses of “offload”: Read more
|Categories: BDAS, Spark, and Shark, Cloudera, Complex event processing (CEP), Endeca, Hadoop, HP and Neoview, MapReduce, Predictive modeling and advanced analytics, RDF and graphs, Revolution Analytics, SAS Institute, Teradata||22 Comments|
It’s hard to make data easy to analyze. While everybody seems to realize this — a few marketeers perhaps aside — some remarks might be useful even so.
Many different technologies purport to make data easy, or easier, to an analyze; so many, in fact, that cataloguing them all is forbiddingly hard. Major claims, and some technologies that make them, include:
- “We get data into a form in which it can be analyzed.” This is the story behind, among others:
- Most of the data integration and ETL (Extract/Transform/Load) industries, software vendors and consulting firms alike.
- Many things that purport to be “analytic applications” or data warehouse “quick starts”.
- “Data reduction” use cases in event processing.*
- Text analytics tools.
- “Forget all that transformation foofarah — just load (or write) data into our thing and start analyzing it immediately.” This at various times has been much of the story behind:
- Relational DBMS, according to their inventor E. F. Codd.
- MOLAP (Multidimensional OnLine Analytic Processing), also according to RDBMS inventor E. F. Codd.
- Any kind of analytic DBMS, or general purpose DBMS used for data warehousing.
- Newer kinds of analytic DBMS that are faster than older kinds.
- The “data mart spin-out” feature of certain analytic DBMS.
- In-memory analytic data stores.
- NoSQL DBMS that have a few analytic features.
- TokuDB, similarly.
- Electronic spreadsheets, from VisiCalc to Datameer.
- “Our tools help you with specific kinds of analyses or analytic displays.” This is the story underlying, among others:
- The business intelligence industry.
- The predictive analytics industry.
- Algorithmic trading use cases in complex event processing.*
- Some analytic applications.
*Complex event/stream processing terminology is always problematic.
My thoughts on all this start: Read more
I recently proposed a 2×2 matrix of BI use cases:
- Is there an operational business process involved?
- Is there a focus on root cause analysis?
Let me now introduce another 2×2 matrix of analytic scenarios:
- Is there a compelling need for super-fresh data?
- Who’s consuming the results — humans or machines?
My point is that there are at least three different cool things people might think about when they want their analytics to be very fast:
- Fast investigative analytics — e.g., business intelligence with great query response.
- Computations on very fresh data, presented to humans — e.g. “heartbeat” graphics monitoring a network.
- Computations on very fresh data, presented back to a machine — e.g., a recommendation engine that includes makes good use of data about a user’s last few seconds of actions.
There’s also one slightly boring one that however drives a lot of important applications: Read more
|Categories: Business intelligence, Complex event processing (CEP), Games and virtual worlds, Log analysis, Predictive modeling and advanced analytics, Splunk, WibiData||3 Comments|
These are three closely-related draft entries for the DBMS2 analytic glossary. Please comment with any ideas you have for their improvement!
1. We coined the term memory-centric data management to comprise several kinds of technology that manage data in RAM (Random Access Memory), including:
- In-memory DBMS (DataBase Management Systems).
- Hybrid memory-centric DBMS.
- Other kinds of in-memory data stores, such as:
- Caching layers.
- In-memory data stores that are tightly tied to specific analytic tools, for example the in-memory data management part of QlikView.
- Complex event/stream processing.
- Many examples of memory-centric data management (April, 2012)
2. An in-memory DBMS is a DBMS designed under the assumption that substantially all database operations will be performed in RAM (Random Access Memory). Thus, in-memory DBMS form a subcategory of memory-centric data management systems.
Ways in which in-memory DBMS are commonly different from those that query and update persistent storage include: Read more
|Categories: Analytic glossary, Cache, Complex event processing (CEP), In-memory DBMS, Memory-centric data management||5 Comments|
Ron Pressler of Parallel Universe/SpaceBase pinged me about a data grid product he was open sourcing, called Galaxy. The idea is that a distributed RAM grid will allocate data, not randomly or via consistent hashing, but rather via a locality-sensitive approach. Notes include:
- The original technology was developed to track moving objects on behalf of the Israeli Air Force.
- The commercial product is focused on MMO (Massively MultiPlayer Online) games (or virtual worlds).
- The underpinnings are being open sourced.
- Ron suggests that, among other use cases, Galaxy might work well for graphs.
- Ron argues that one benefit is that when lots of things cluster together — e.g. characters in a game — there’s a natural way to split them elastically (shrink the radius for proximity).
- The design philosophy seems to be to adapt as many ideas as possible from the way CPUs manage (multiple levels of) RAM cache.
|Categories: Cache, Clustering, Complex event processing (CEP), Games and virtual worlds, GIS and geospatial, Open source, RDF and graphs, Scientific research||2 Comments|
I’m frequently asked to generalize in some way about in-memory or memory-centric data management. I can start:
- The desire for human real-time interactive response naturally leads to keeping data in RAM.
- Many databases will be ever cheaper to put into RAM over time, thanks to Moore’s Law. (Most) traditional databases will eventually wind up in RAM.
- However, there will be exceptions, mainly on the machine-generated side. Where data creation and RAM data storage are getting cheaper at similar rates … well, the overall cost of RAM storage may not significantly decline.
Getting more specific than that is hard, however, because:
- The possibilities for in-memory data storage are as numerous and varied as those for disk.
- The individual technologies and products for in-memory storage are much less mature than those for disk.
- Solid-state options such as flash just confuse things further.
Consider, for example, some of the in-memory data management ideas kicking around. Read more
I find myself in need of a word or phrase that means bring data together from various sources so that it’s ready to be used, where the use can be analysis or operations. The first words I thought of were “aggregation” and “collection,” but they both have other meanings in IT. Even “data marshalling” has a specific meaning different from what I want. So instead, I’ll go with data mustering.
I mean for the term “data mustering” to encompass at least three scenarios:
- Integrated (relational) data warehouse.
- Big bit bucket.
- Big bit stream.
Let me explain what I mean by each. Read more
|Categories: Complex event processing (CEP), Data warehousing, Investment research and trading, Sybase, Teradata||11 Comments|
My clients at StreamBase are coming out with a new product line called LiveView, and I agreed they could launch it via this blog. Key points about StreamBase LiveView Version 1.0 include:
- LiveView is a business intelligence and alerting suite built on/in the rest of StreamBase’s technology, meant to operate on streaming data.
- LiveView is positioned by StreamBase as having a true push event-driven architecture rather than pull/poll.
- StreamBase LiveView is designed to query in-memory data and then have the results change in real time as the data set changes.
- The LiveView user interface is a rapidly changing work in progress.
- LiveView has other Version 1 limitations as well
- LiveView is targeted squarely at StreamBase’s financial trading core market until some of the Version 1 limitations are lifted.
The basic StreamBase LiveView pipeline goes something like: Read more
|Categories: Business intelligence, Complex event processing (CEP), Data warehousing, Memory-centric data management, StreamBase||2 Comments|
While I was cryptic in my general CEP/streaming catchup, I’ll say a bit more regarding StreamBase in particular. At the highest level, non-technically:
- StreamBase once planned to conquer the world.
- However, StreamBase really only sold effectively in the financial trading and intelligence markets.
- StreamBase retrenched, focusing almost exclusively on the financial trading market.
- With StreamBase LiveView, StreamBase is expanding from embedded operational analytics to do (also operational) business intelligence as well.
- StreamBase is hopeful that, perhaps starting with Version 2 or so, LiveView will be successful outside the financial trading market.
|Categories: Complex event processing (CEP), Investment research and trading, Parallelization, StreamBase||2 Comments|
When I agreed to launch the StreamBase LiveView product via DBMS 2, I planned to catch up on the whole CEP/streaming area first. Due to the power and internet outages last week, that didn’t entirely happen. So I’ll do a bit of that now, albeit more cryptically than I hoped and intended.
- The upshot of my what to call CEP thread in August was that “streaming” and “event processing” are not the same concept, but it so happens that they have the most traction where they intersect. That said, I both observe and endorse an apparent shift from “event” to “stream” as the core of the terminology, in a reversal of my opinion of several years ago.
- IBM continues to throw a lot of resources at its System S/ InfoSphere Streams product, but I haven’t heard yet of much marketplace success. That said, I believe IBM is still pretty serious about Streams, as one would expect from an effort whose code name so cheekily references System R. In particular, Streams shows up prominently on IBM’s top-level analytic architecture slide.
- Sybase recently released its ESP (Event Stream Processor) 5.0, which it says is the full merger of the Aleri and Coral8 predecessors. You can still get Sybase ESP without buying into the full Sybase RAP stack, and Sybase has no plans to change that.
- Sybase has discontinued all the business intelligence types of products Aleri and Coral8 were developing. Rather, Sybase is OEMing Panopticon, which it reports has been well received. Other than the discontinuation of the BI efforts, there seem to be few Aleri or Coral8 features missing from the merged Sybase ESP product.
- Truviso continues to be out of the picture.
- I have more to say about StreamBase separately.
- I have more to say about Sybase and IBM, which I’ll get to when I can.
- I have nothing new on Progress Apama. I also know little about any of the open source efforts.
Meanwhile, if you want to see technically nitty-gritty posts about the CEP/streaming area, you may want to look at my CEP/streaming coverage circa 2007-9, based on conversations with (among others) Mike Stonebraker, John Bates, and Mark Tsimelzon.
|Categories: Business intelligence, Complex event processing (CEP), IBM and DB2, StreamBase, Sybase, Truviso||3 Comments|