I frequently badger my clients to tell their story in the form of a company blog, where they can say what needs saying without being restricted by the rules of other formats. KXEN actually listened, and put up a pair of CTO posts that make the company story a lot clearer.
Excerpts from the first post include (with minor edits for formatting, including added emphasis):
Back in 1995, Vladimir Vapnik … changed the machine learning game with his new ‘Statistical Learning Theory’: he provided the machine learning guys with a mathematical framework that allowed them finally to understand, at the core, why some techniques were working and some others were not. All of a sudden, a new realm of algorithms could be written that would use mathematical equations instead of engineering data science tricks (don’t get me wrong here: I am an engineer at heart and I know the value of “tricks,” but tricks cannot overcome the drawbacks of a bad mathematical framework). Here was a foundation for automated data mining techniques that would perform as well as the best data scientists deploying these tricks. Luck is not enough though; it was because we knew a lot about statistics and machine learning that we were able to decipher the nuggets of gold in Vladimir’s theory.
The market needed a system able to perform classification and regression (we later added clustering/segmentation, time series analysis, association rules and social network analysis), with the following characteristics:
- Non-parametric: little user intervention and tuning should be required — it should work well out of the box.
- Independent of the data and target distribution:
  - Target: the classification system should be able to handle rates of positive values even as low as 0.1% (such as in fraud, for example), or be able to forecast a continuous value with only 1% of non-zero values.
  - Data: it should automate mixing, matching, and comparing influence for ordinal, nominal, continuous, and textual variables without any user intervention.
- Scalable in number of rows: the training time should be linear with the number of rows, and the quality of the models should increase with the number of rows.
- Scalable in number of columns: the training time should be close to linear with respect to the number of columns, and the quality of the models should increase with the number of columns. It is well known that most algorithms present a problem of over-fitting in high dimensions; it is quite ironic that companies spend billions of dollars collecting data but often cannot take advantage of it all, because most first-generation analytical workbenches collapse trying to handle the high dimensionality inherent in all this data.
- Descriptive: a good predictive analytics package must be able to present its findings in a way that a business user can understand. We have always believed that there is a continuum between predictive and descriptive analytics: predictive models should be descriptive enough and descriptive models should be usable in a predictive manner to make decisions.
- Deployable: the scoring equations should be simple enough to be deployed in any operational environment: SQL for databases, Java code for the web (or even for smartphones), etc.
Vapnik’s theory provided us with a mathematical framework for capabilities 1, 2 and 4 above; what remained was 3, 5 and 6, which we solved with a well-known pattern in machine learning: using linear systems in a properly encoded space (the trick is to find the right encoded space).
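To make that last point concrete: the "linear system in a properly encoded space" pattern generally means encoding heterogeneous inputs (nominal, continuous, etc.) into numeric columns, then fitting a regularized linear model on the encoded columns. Here is a minimal, self-contained sketch of that idea in Python. The encoding scheme (one-hot for a nominal variable, min-max scaling for a continuous one) and the ridge objective are illustrative assumptions on my part; KXEN's actual encoding is proprietary and certainly more sophisticated.

```python
# Sketch of "linear model in an encoded space" (illustrative only).
# A nominal variable is one-hot encoded, a continuous variable is
# min-max scaled, and a ridge-regularized linear model is fit on
# the encoded columns by plain gradient descent.

def encode(rows, categories, lo, hi):
    """Encode (color, size) pairs: one-hot for color, min-max for size."""
    encoded = []
    for color, size in rows:
        onehot = [1.0 if color == c else 0.0 for c in categories]
        scaled = (size - lo) / (hi - lo)
        encoded.append(onehot + [scaled, 1.0])  # trailing 1.0 is the bias term
    return encoded

def fit_ridge(X, y, lam=0.01, lr=0.1, steps=2000):
    """Minimize mean squared error + L2 penalty via gradient descent."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        grad = [lam * wj for wj in w]              # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            err = sum(a * b for a, b in zip(w, xi)) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj / n            # gradient of the squared error
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def score(w, x):
    """The scoring equation is just a dot product over encoded columns."""
    return sum(a * b for a, b in zip(w, x))

categories = ["red", "green", "blue"]
train = [("red", 10), ("green", 20), ("blue", 30), ("red", 25)]
target = [1.0, 0.0, 0.0, 1.0]
X = encode(train, categories, lo=10, hi=30)
w = fit_ridge(X, target)
```

Note that a model of this shape also explains the "deployable" requirement above: the learned weights translate directly into a flat expression such as `w1*(color='red') + w2*(color='green') + ... + w4*size_scaled + w5`, which is trivially expressible in SQL or Java.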
The second post seems to make some strong model-quality benchmark claims, but there also seems to be a mismatch between results checked in-house and those verified publicly.