March 23, 2009

SAS in its own cloud

The Register has a fairly detailed article about SAS expanding its cloud/SaaS offerings.  I disagree with one part, namely:

SAS may not have a choice but to build its own cloud. Given the sensitive nature of the data its customers analyze, moving that data out to a public cloud such as the Amazon EC2 and S3 combo is just not going to happen.

And even if rugged security could make customers comfortable with that idea, moving large data sets into clouds (as Sun Microsystems discovered with the Sun Grid) is problematic. Even if you can parallelize the uploads of large data sets, it takes time.

But if you run the applications locally in the SAS cloud, then doing further analysis on that data is no big deal. It’s all on the same SAN anyway, locked down locally just as you would do in your own data center.

I fail to see why SAS's campus would be better than leading hosting companies' data centers for either data privacy/security or data upload speed. Rather, I think the major reasons for SAS building its own data center for cloud computing probably focus on:

Comments

15 Responses to “SAS in its own cloud”

  1. Mark Callaghan on March 23rd, 2009 9:38 am

    Curt,
    Why would the ‘smp-oriented’ nature of SAS be a problem on Amazon? EC2 provides servers that appear to be SMP servers from a client’s perspective — http://aws.amazon.com/ec2/#instance.

  2. Curt Monash on March 23rd, 2009 11:59 am

    Mark,

    I don’t know exactly what SAS is or isn’t best optimized for. And http://www.sas.com/partners/directory/sun/ZencosSPDS-X4500.pdf would seem, if anything, to speak against what I was suggesting.

    But anyway — whether SAS needs to control its own hardware and whether SAS wants to control its own hardware aren’t exactly the same question. 😉

    CAM

  3. Guy Bayes on March 23rd, 2009 1:59 pm

    What SAS really needs to perform well is I/O to support the constant sorting it does. Lots and lots of I/O. The quants I have supported in general don't make heavy demands on CPU or memory; what they really need is solid-state disk.

    You have to code SAS specifically to take advantage of more than one CPU, I think.

    My knowledge is limited though, especially when it comes to the SAS application suite.

    Just speculating, but I wonder if part of this SaaS play might be some kind of proprietary map/reduce framework for SAS? Would be kind of a natural fit in some ways.

  4. Curt Monash on March 23rd, 2009 3:46 pm

    SAS has to be careful, or versions of R integrated into cheap MPP data warehouse DBMS — with or without MapReduce — will be an increasingly big threat.

  5. Jeff Wright on March 24th, 2009 8:04 am

    @Guy, SAS has had built-in multi-threading for 5 years now.

    One of the applications that SAS has offered in SaaS mode for years is SAS Drug Development. I believe that 21 CFR Part 11 regulation for these pharma users is part of the self-hosting decision.

    I can think of one or two other technical factors that may be considerations, but SAS also just has a company culture of going their own way.

  6. Guy Bayes on March 24th, 2009 5:37 pm

    Yes, multithreading is built in in the sense that you don't have to pay extra for it. However, to take advantage of it, I believe you have to rewrite the code to use the special multithreaded calls. It's not an abstracted, configurable parameter at the global level, like it would be in a relational database.

    http://www2.sas.com/proceedings/forum2007/036-2007.pdf

  7. Jeff Wright on March 24th, 2009 8:09 pm

    You don’t have to code for it. Multi-threading is built-in in the sense that it just happens for you. Starting with SAS9, certain SAS procedures have been modified to take advantage of multi-threading (unless options are used to suppress this). The current list of multi-threaded procedures may be found at:

    http://support.sas.com/rnd/scalability/procs/index.html

    However, it is true that this happens at the level of an individual processing step, not an entire SAS program.

  8. Hans Gilde on March 25th, 2009 9:23 am

    I think that R has farther to go than many people realize in order to really catch up to SAS.

    For example, you can hook SAS up to a terabyte of data and run some analysis. It might be slow and disk intensive, but it’ll work. Try a huge data set with R and it’ll fail; you’ll either need some custom programming or it’ll just be impossible, depending on the analysis.

    For that and several other reasons, SAS seems safe for now. But I agree that doesn’t mean they should consider themselves safe in the long run.

  9. Curt Monash on March 25th, 2009 12:49 pm

    Hans,

    That depends on the specific implementation of R, doesn't it? 🙂

    CAM

  10. Hans Gilde on March 25th, 2009 4:53 pm

    Well, not exactly. The built-in SAS analytics have always been coded to keep as little as possible in memory. Input data is read one line at a time, and temporary data is stored by default in a disk file. This is why SAS is by default so disk intensive, although there are many possible configurations.

    R (all versions to my knowledge) only deals with data in memory. If you want to handle more data than you have memory, you will have to chunk it through a piece at a time. There are libraries to help, but still it can be hard to get a handle on. And if your analytics library wants all of the data at once – well, you’re out of luck and it needs to be rewritten to handle chunks.

    As of fairly recently, S-PLUS has a feature called Big Data that replaces the memory cache with a disk cache. So programs operate thinking they're dealing with memory, but really they read from disk. There is work to implement this in R, but I don't know of a widely used and stable version yet. So for now, the state is this: most R libraries don't handle chunking data, so most libraries can only handle as much data as you have memory. R users have been known to upgrade to 64-bit because of this. In the future, there are many possible solutions, including allowing R working "memory" to be cached on disk, or rewriting libraries to handle data differently. (A minimal sketch of this kind of chunked processing appears after the comments.)

  11. Richard Boire on March 25th, 2009 5:32 pm

    What about the cost implications here? SAS charges an arm and a leg for their software. My question to you folks is whether you think their charges might start to decrease with SAS being available, like other software, within this cloud computing environment.

  12. Hans Gilde on March 25th, 2009 5:49 pm

    I wanted to point out the difference between SAS-as-a-service and SAS deployed in a computing cloud.

    SAS-as-a-service makes a lot of sense. With SAS, you load in data and get out reports, summaries, etc. It's not like a database where you're constantly reading the data you load in. Internet latency is a killer for DBMS access, but not for getting reports from SAS. So it makes perfect sense for SAS to do SaaS. This will come with a completely different pricing structure.

    Licensing SAS for the Amazon cloud – well, that is just a SAS license. I don't see why it would be cheaper just because it's on a computer that runs in a "cloud".

  13. Curt Monash on March 25th, 2009 6:19 pm

    Hans,

    My point was more that R on a sufficiently parallel system might be able to process more data …

    Best,

    CAM

  14. Hans Gilde on March 25th, 2009 7:17 pm

    Well, the answer really depends on the usage. The real answer is a little technical.

    But many statisticians love SAS because so often they don't have to worry at all about things like memory use, whereas in R you always do.

  15. Guy Bayes on March 26th, 2009 12:55 pm

    @Hans, it’s not uncommon to have SAS filesystem “datamarts” holding persistent data to support the quants. The overhead between SAS and the RDBMS is still high; caching data locally and persistently makes sense in many situations.

    @Curt, even a very distributed environment, like Hadoop, makes it very clear you should not expect to fit everything into memory.

    The thing that makes the SAS map/reduce concept kind of reasonable to me is that SAS is natively much more in a map/reduce file-processing mode than in a distributed RDBMS mode. I could also imagine using something like Hadoop Streaming to take existing SAS and run it distributed, provided you were clever with your hashing (and could find a way around the ridiculous license cost). (See the second sketch after the comments.)

    Trying to get SAS to run natively inside an RDBMS, on the other hand, is an order-of-magnitude different problem.

    SAS is very far from SQL.

    So far, SAS/Teradata seems to be mostly trying to reverse engineer SAS procedures as compiled C objects inside Teradata, one at a time.
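
To make the chunked processing discussed in comments 8 and 10 concrete, here is a minimal sketch in Python rather than R or SAS; the file name and column are hypothetical. The point is simply that only one chunk is ever held in memory at a time, which is the same row-at-a-time spirit Hans attributes to SAS's built-in analytics.

```python
# Minimal out-of-core ("chunked") aggregation sketch.
# "measurements.csv" and the "value" column are hypothetical; only one
# chunk is held in memory at a time, so the file can be far larger than RAM.
import pandas as pd

total = 0.0
count = 0
for chunk in pd.read_csv("measurements.csv", chunksize=1_000_000):
    total += chunk["value"].sum()
    count += len(chunk)

print("mean over all rows:", total / count)
```

The trade-off is the one Hans describes: an aggregate like this chunks naturally, but an analysis whose library wants all of the data at once would have to be rewritten around the loop.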
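Similarly, here is a rough sketch of the Hadoop Streaming pattern Guy mentions in comment 15. Hadoop Streaming pipes input lines to any executable's stdin and reads tab-delimited key/value lines from its stdout, hash-partitioning and sorting on the key between the map and reduce stages. The field layout and per-key sum below are hypothetical stand-ins for whatever a SAS step would actually compute, and the script is Python rather than a wrapped SAS program.

```python
#!/usr/bin/env python
# Hadoop Streaming sketch: run with "map" or "reduce" as the first argument.
# The mapper emits "key<TAB>value"; Hadoop partitions and sorts by key and
# feeds the grouped lines to the reducer, which sums values per key.
import sys


def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split(",")
        key, value = fields[0], fields[1]  # hypothetical input layout
        print(f"{key}\t{value}")


def reducer():
    current_key, running_total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{running_total}")
            current_key, running_total = key, 0.0
        running_total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{running_total}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

It would be launched with the standard streaming jar, roughly: hadoop jar hadoop-streaming.jar -input in -output out -mapper "sketch.py map" -reducer "sketch.py reduce" -file sketch.py (details vary by Hadoop version). The "clever with your hashing" point above is exactly the choice of key: whatever you partition on determines which rows end up processed together.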
