Cray’s strategy these days seems to be:
- Move forward with the classic supercomputer business.
- Diversify into related areas.
At the moment, the main diversifications are:
- Boxes that are like supercomputers, but at a lower price point.
- “(Big) data”.
The last of these is what Cray subsidiary Yarcdata is all about.
“Yarc” = “Cray” spelled backwards.
To a first approximation, Yarcdata is a bunch of Cray guys, with an overlay out of Informatica/Siperian and other database-oriented software companies.

Yarcdata’s first effort is to manage graph data, via an appliance product called uRika.* More precisely, uRika manages RDF triples, with SPARQL as the query language. More precisely yet, uRika manages quadruples, with the fourth field being for “subgraph ID”. Having multiple subgraphs sounds like it’s somewhere between having:
- Multiple tables in one database.
- Multiple databases managed by one DBMS.
A natural way to wind up with multiple subgraphs is to import data from different sources.
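To make the quad idea concrete, here’s a minimal sketch in Python. All names and data below are invented for illustration; uRika’s actual storage format is of course proprietary, and real queries would go through SPARQL rather than anything like this:

```python
# A triple is (subject, predicate, object); a quad adds a subgraph ID,
# so data imported from different sources can be kept apart yet queried together.
quads = [
    # (subject, predicate, object, subgraph_id)
    ("alice", "knows", "bob", "crm_import"),
    ("bob", "knows", "carol", "crm_import"),
    ("alice", "calls", "carol", "phone_records"),
]

def match(quads, s=None, p=None, o=None, g=None):
    """Return quads matching the given fields; None acts as a wildcard."""
    return [q for q in quads
            if (s is None or q[0] == s)
            and (p is None or q[1] == p)
            and (o is None or q[2] == o)
            and (g is None or q[3] == g)]

# Restrict a query to one subgraph -- loosely like querying one table/database ...
print(match(quads, s="alice", g="crm_import"))
# ... or query across all subgraphs at once.
print(match(quads, s="alice"))
```

Leaving the subgraph field as a wildcard is what makes the “somewhere between tables and databases” characterization apt: the subgraphs are separate when you want them to be, and one graph when you don’t.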
Yarcdata is still trying to figure out exactly which relationship analytics application areas it is pursuing. Yarcdata’s big multi-year design partner was a large intelligence agency, for an unspecified application that obviously has a lot to do with terrorism and national security. Also mentioned, as is appropriate for a Cray subsidiary, are application areas that feel more scientific or technical (life sciences, financial services). Not mentioned much so far — except perhaps by me — are telecom/influencer-detection and anti-fraud.
The last time Yarcdata gave me a customer count, it was 5, but that was some months ago.
As best I understand, uRika has two tiers of servers. One tier features commodity hardware, and runs a stack of data access software from or at least based on the Apache Jena project. The other tier has classic Cray hardware, running a proprietary data store. This data store is in-memory, except that like most in-memory analytic stores, it can be initialized from disk. Notes on the data store part include:
- It’s shared-everything, with one global address space for RAM. There’s no explicit data partitioning.
- Cray talks a lot about half a petabyte of RAM, to the point that I’m guessing that that’s what the classified first customer actually has. But of course you can get uRika in various different sizes.
- A key point is that Cray lets you have lots of threads going. Figures on that included 128 threads/processor and 8000 processors, for 1 million threads.
- Why so many threads? To help “tolerate” memory latency. If one thread is delayed, just switch to the next.
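The latency-tolerance idea can be modeled in a few lines of Python — a toy round-robin scheduler over generators, nothing resembling Cray’s actual hardware threading:

```python
from collections import deque

def worker(name, n_memory_ops):
    """A toy 'thread': each yield models stalling on a slow memory access."""
    for _ in range(n_memory_ops):
        yield  # stalled on memory; let another thread run
    return f"{name} done"

def run(threads):
    """Round-robin scheduler: when one thread stalls, switch to the next,
    so the 'processor' always has useful work despite memory latency."""
    queue = deque(threads)
    finished = []
    while queue:
        t = queue.popleft()
        try:
            next(t)          # run until the next stall
            queue.append(t)  # stalled: rotate to the back of the queue
        except StopIteration as e:
            finished.append(e.value)
    return finished

results = run([worker(f"t{i}", 3) for i in range(4)])
print(results)
```

The point of having 1 million threads rather than 4 is the same as in the toy: with enough runnable threads, there is essentially always one that isn’t waiting on memory.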
On the graph analytic functionality, there seems to be less in the way of uRika secret sauce at this time. SPARQL 1.0 and Jena get mentioned, but innovative extensions are discussed not so much in the present tense as in future or hypothetical terms. Anyhow, I haven’t spent a lot of time looking at what SPARQL can or can’t do, but I gather that if you want to do a straightforward graph query, SPARQL can handle it. But for graph analytics such as centrality measures or whatever, you need tools or extensions.
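To illustrate that distinction, here’s a hedged Python sketch over invented triples (again, a real deployment would express the query in SPARQL): a straightforward “whom does X know?” lookup is a pattern match, while even the simplest centrality measure — degree centrality — is a computation over the whole graph:

```python
from collections import Counter

triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "knows", "alice"),
    ("dave", "knows", "alice"),
]

# A straightforward graph query -- the kind of thing SPARQL handles directly:
# "whom does alice know?"
alice_knows = [o for s, p, o in triples if s == "alice" and p == "knows"]

# Degree centrality (count of edges touching each node) -- an analytic over
# the whole graph, the kind of thing plain SPARQL 1.0 needs tools or
# extensions for:
degree = Counter()
for s, _, o in triples:
    degree[s] += 1
    degree[o] += 1

most_central = degree.most_common(1)[0][0]
```

The query touches a handful of triples; the centrality pass touches all of them, which is exactly why that kind of workload gets pushed into separate tools.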