At the highest level:
- Presto is, roughly speaking, Facebook’s replacement for Hive, at least for queries that are supposed to run at interactive speeds.
- Teradata is announcing support for Presto with a classic open source pricing model.
- Presto will also become, roughly speaking, Teradata’s replacement for Hive.
- Teradata’s Presto efforts are being conducted by the former Hadapt.
Now let’s make that all a little more precise.
Regarding Presto (and I got most of this from Teradata)::
- To a first approximation, Presto is just another way to write SQL queries against HDFS (Hadoop Distributed File System). However …
- … Presto queries other data stores too, such as various kinds of RDBMS, and federates query results.
- Facebook at various points in time created both Hive and now Presto.
- Facebook started the Presto project in 2012 and now has 10 engineers on it.
- Teradata has named 16 engineers – all from Hadapt – who will be contributing to Presto.
- Known serious users of Presto include Facebook, Netflix, Groupon and Airbnb. Airbnb likes Presto well enough to have 1/3 of its employees using it, via an Airbnb-developed tool called Airpal.
- Facebook is known to have a cluster cited at 300 petabytes and 4000 users where Presto is presumed to be a principal part of the workload.
Daniel Abadi said that Presto satisfies what he sees as some core architectural requirements for a modern parallel analytic RDBMS project:
- Data is pipelined between operators, with no gratuitous writing to disk the way you might have in something MapReduce-based. This is different from the sense of “pipelining” in which one query might keep an intermediate result set hanging around because another query is known to need those results as well.
- Presto processing is vectorized; functions don’t need to be re-invoked a tuple at a time. This is different from the sense of vectorization in which several tuples are processed at once, exploiting SIMD (Single Instruction Multiple Data). Dan thinks SIMD is useful mainly for column stores, and Presto tries to be storage-architecture-agnostic.
- Presto query operators and hence query plans are dynamically compiled, down to byte code.
- Although it is generally written in Java, Presto uses direct memory management rather than relying on what Java provides. Dan believes that, despite being written in Java, Presto performs as if it were written in C.
More precisely, this is a checklist for interactive-speed parallel SQL. There are some query jobs long enough that Dan thinks you need the fault-tolerance obtained from writing intermediate results to disk, ala’ HadoopDB (which was of course the MapReduce-based predecessor to Hadapt).
That said, Presto is a newish database technology effort, there’s lots of stuff missing from it, and there still will be lots of stuff missing from Presto years from now. Teradata has announced contribution plans to Presto for, give or take, the next year, in three phases:
- Phase 1 (released immediately, and hence in particular already done):
- An installer.
- More documentation, especially around installation.
- Command-line monitoring and management.
- Phase 2 (later in 2015)
- Integrations with YARN, Ambari and soon thereafter Cloudera Manager.
- Expanded SQL coverage.
- Phase 3 (some time in 2016)
- An ODBC driver, which is of course essential for business intelligence tool connectivity.
- Other connectors (e.g. more targets for query federation).
- Further SQL coverage.
Absent from any specific plans that were disclosed to me was anything about optimization or other performance hacks, and anything about workload management beyond what can be gotten from YARN. I also suspect that much SQL coverage will still be lacking after Phase 3.
Teradata’s basic business model for Presto is:
- Teradata is selling subscriptions, for which the principal benefit is support.
- Teradata reserves the right to make some of its Presto-enhancing code subscription-only, but has no immediate plans to do so.
- Teradata being Teradata, it would love to sell you Presto-related professional services. But you’re absolutely welcome to consume Presto on the basis of license-plus-routine-support-only.
And of course Presto is usurping Hive’s role wherever that makes sense in Teradata’s data connectivity story, e.g. Teradata QueryGrid.
Finally, since I was on the phone with Justin Borgman and Dan Abadi, discussing a project that involved 16 former Hadapt engineers, I asked about Hadapt’s status. That may be summarized as:
- There are currently no new Hadapt sales.
- Only a few large Hadapt customers are still being supported by Teradata.
- The former Hadapt folks would love Hadapt or Hadapt-like technology to be integrated with Presto, but no such plans have been finalized at this time.