Comments on: SAS in its own cloud

By: Guy Bayes

Guy Bayes — Thu, 26 Mar 2009 16:55:52 +0000

@Hans it’s not uncommon to have SAS filesystem “datamarts” holding persistent data to support the quants. The overhead between SAS and the RDBMS is still high, so caching data locally and persistently makes sense in many situations.

@Curt Even a very distributed environment, like Hadoop, makes it very clear you should not expect to fit everything into memory.

The thing that makes the SAS map/reduce concept kind of reasonable to me is that SAS is natively much more in a map/reduce file processing kind of mode than in a distributed RDBMS mode. I could also imagine using something like Hadoop Streaming to take existing SAS code and run it distributed, provided you were clever with your hashing (and could find a way around the ridiculous license cost).
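To illustrate what "clever with your hashing" could mean, here is a minimal in-process Python sketch of hash partitioning. It is not the Hadoop Streaming API (there the mapper and reducer are separate executables reading stdin and writing stdout), and it is not SAS code. The names `partition_for`, `map_phase`, `reduce_partition` and `run_job` are hypothetical, and a per-key sum stands in for whatever per-group analysis the real job would run. The point is only that a stable hash sends every record for a given key to the same partition, so each partition can be processed independently.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition with a stable hash (crc32), so a key always lands in the same place."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_phase(records, num_partitions):
    """Route each (key, value) record to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def reduce_partition(partition):
    """Stand-in for the per-group analysis: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def run_job(records, num_partitions=4):
    """Run the partitions one after another; on a cluster each would go to a different node."""
    result = {}
    for partition in map_phase(records, num_partitions):
        # Keys never span partitions, so merging the outputs cannot overwrite a total.
        result.update(reduce_partition(partition))
    return result


if __name__ == "__main__":
    sample = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
    print(run_job(sample, num_partitions=3))
```

If the analysis needs all rows for a customer or account together, that identifier would be the natural hash key. A badly chosen key either splits a group across nodes or piles most of the data onto one node.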

Trying to get SAS to run natively inside an RDBMS, on the other hand, is a problem of a different order of magnitude.

SAS is very far from SQL.

So far SAS/Teradata seems to be mostly trying to reverse engineer SAS procedures as compiled C objects inside Teradata, one at a time.

By: Hans Gilde

Hans Gilde — Wed, 25 Mar 2009 23:17:26 +0000

Well, the answer really depends on the usage. The real answer is a little technical.

But many statisticians love SAS because so often they don’t have to worry at all about things like memory use, whereas in R you always do.

By: Curt Monash

Curt Monash — Wed, 25 Mar 2009 22:19:28 +0000

Hans,

My point was more that R on a sufficiently parallel system might be able to process more data …

Best,

CAM

By: Hans Gilde

Hans Gilde — Wed, 25 Mar 2009 21:49:57 +0000

I wanted to point out the difference between SAS-as-a-service and SAS deployed in a computing cloud.

SAS-as-a-service makes a lot of sense. With SAS, you load in data and get out reports, summaries, etc. It’s not like a database where you’re constantly reading the data you load in. Internet latency is a killer for DBMS access, but not for getting reports from SAS. So it makes perfect sense for SAS to do SaaS. This will come with a completely different pricing structure.

Licensing SAS for the Amazon cloud – well that is just a SAS license. I don’t see why it would be cheaper just because it’s in a computer that runs in a “cloud”.

By: Richard Boire

Richard Boire — Wed, 25 Mar 2009 21:32:29 +0000

What about cost implications here? SAS charges an arm and a leg for their software. My question to you folks is whether or not you think their charges might start to decrease with SAS being available like other software within this cloud computing environment.

By: Hans Gilde

Hans Gilde — Wed, 25 Mar 2009 20:53:42 +0000

Well, not exactly. The built-in SAS analytics have always been coded to keep as little as possible in memory. Input data is read one line at a time and temporary data is stored by default in a disk file. This is why SAS is by default so disk intensive, although there are many possible configurations.

R (all versions to my knowledge) only deals with data in memory. If you want to handle more data than you have memory, you will have to chunk it through a piece at a time. There are libraries to help, but still it can be hard to get a handle on. And if your analytics library wants all of the data at once – well, you’re out of luck and it needs to be rewritten to handle chunks.
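As a rough illustration of what "chunking it through a piece at a time" involves, here is a minimal Python sketch (Python rather than R, and not taken from any R library). It computes a mean and sample variance over a stream of numbers while holding only one chunk in memory, combining each chunk's statistics into running totals with the standard pairwise update. The function names and the one-number-per-line input format are assumptions made for the example.

```python
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size floats from an iterable of text lines."""
    iterator = iter(lines)
    while True:
        chunk = [float(line) for line in islice(iterator, chunk_size)]
        if not chunk:
            return
        yield chunk


def chunked_mean_var(lines, chunk_size=1000):
    """Return (count, mean, sample variance) without loading all the data at once."""
    n = 0        # values seen so far
    mean = 0.0   # running mean
    m2 = 0.0     # running sum of squared deviations from the mean
    for chunk in read_chunks(lines, chunk_size):
        k = len(chunk)
        chunk_mean = sum(chunk) / k
        chunk_m2 = sum((x - chunk_mean) ** 2 for x in chunk)
        # Merge the chunk's statistics into the running totals.
        delta = chunk_mean - mean
        total = n + k
        mean += delta * k / total
        m2 += chunk_m2 + delta * delta * n * k / total
        n = total
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


if __name__ == "__main__":
    # An open file object could be passed here instead of a list of strings.
    data = [str(i) for i in range(1, 11)]
    print(chunked_mean_var(data, chunk_size=3))
```

A mean or variance is easy to compute this way because chunk results can be merged. An analysis that needs every row at once (many model-fitting routines, for example) has no such shortcut, which is the rewrite problem described above.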

As of fairly recently, S-PLUS has a feature called Big Data that replaces the memory cache with a disk cache. So programs operate thinking they’re dealing with memory, but really the data is read from disk. There is work to implement this in R, but I don’t know of a widely used and stable version yet. So for now, the state is this: most R libraries don’t handle chunking data, so most libraries can only handle as much data as you have memory. R users have been known to upgrade to 64-bit because of this. In the future, there are many possible solutions, including allowing R working “memory” to be cached on disk, or rewriting libraries to handle data differently.

By: Curt Monash

Curt Monash — Wed, 25 Mar 2009 16:49:24 +0000

Hans,

That depends on the specific implementation of R, doesn’t it? 🙂

CAM

By: Hans Gilde

Hans Gilde — Wed, 25 Mar 2009 13:23:14 +0000

I think that R has farther to go than many people realize to really catch up to SAS.

For example, you can hook SAS up to a terabyte of data and run some analysis. It might be slow and disk intensive, but it’ll work. Try a huge data set with R and it’ll fail; you’ll either need some custom programming or it’ll just be impossible, depending on the analysis.

For that and several other reasons, SAS seems safe for now. But I agree that doesn’t mean they should consider themselves safe in the long run.

By: Jeff Wright

Jeff Wright — Wed, 25 Mar 2009 00:09:21 +0000

You don’t have to code for it. Multi-threading is built-in in the sense that it just happens for you. Starting with SAS9, certain SAS procedures have been modified to take advantage of multi-threading (unless options are used to suppress this). The current list of multi-threaded procedures may be found at:

http://support.sas.com/rnd/scalability/procs/index.html

However, it is true that this happens at the level of an individual processing step, not an entire SAS program.

By: Guy Bayes

Guy Bayes — Tue, 24 Mar 2009 21:37:50 +0000

Yes, multithreading is built-in in the sense that you don’t have to pay extra for it. However, to take advantage of it, I believe you have to rewrite the code to use the special multithreaded calls. It’s not an abstracted, configurable parameter at the global level, like it would be in a relational database.

http://www2.sas.com/proceedings/forum2007/036-2007.pdf