Comments on: Kickfire’s FPGA-based technical strategy

By: Kickfire capacity and pricing | DBMS2 -- DataBase Management System Services

Sun, 18 Oct 2009 09:16:22 +0000

[…] strategy, e.g. from Daniel Abadi, Merv Adrian and, kicking things off — as it were — me. Weeks after a recent Kickfire product release, there’s finally a fairly accurate data sheet […]

By: Kickfire Disrupts DW Economics, Targets Mainstream ADBMS Opportunities « Market Strategies for IT Suppliers

Tue, 06 Oct 2009 20:24:11 +0000

[…] architecture; you can read about it in Daniel’s blog and in a great discussion thread on Curt Monash’s blog post. As a result of all this work, the system can keep intermediate data sets in the chip’s […]

By: Tony Bain

Tony Bain — Wed, 02 Sep 2009 02:46:08 +0000

I agree a Kickfire/XtremeData comparison doesn’t make too much sense given the different product focus. I will be more interested in a Kickfire/VectorWise comparison, as both seem to have a similar target (but I guess we are 12 months away from that). One could assume that an FPGA approach will outperform, but the price/performance gap will be interesting.

Regardless of FPGA/non-FPGA strategies, a cheap “plug and play” appliance for mid-range MySQL data marts still seems like a pretty good idea to me.

By: Geno Valente

Geno Valente — Fri, 28 Aug 2009 18:59:38 +0000

You are correct about the QDR size, but there are 4 of them (vs just one) – each with separate addresses, in addition to the two controllers to the Mobo DIMMS.

However, I’m sure under the hood the data flow and execution models are pretty different. We’ve tried to optimize for large data problems and MPP, to allow the system to be data model agnostic. (Query time on data with HASH = Query Time on same data round-robin). I’ll let KF comment, but I’m positive it is not the same “goal”.
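For readers unfamiliar with the two layouts mentioned above, here is a minimal sketch of hash versus round-robin distribution across nodes. It is not dbX code: the toy table, the three-node cluster, and the function names are all assumptions made for illustration. It only shows why the layouts normally differ, which is the difference the comment says dbX query times do not depend on.

```python
from zlib import crc32

def round_robin_partition(rows, num_nodes):
    """Assign row i to node i mod num_nodes, ignoring row contents."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes

def hash_partition(rows, num_nodes, key):
    """Assign each row to a node chosen by hashing its distribution key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        h = crc32(str(row[key]).encode("utf-8"))
        nodes[h % num_nodes].append(row)
    return nodes

def colocated(nodes, key):
    """True if all rows sharing a key value live on a single node."""
    home = {}
    for node_id, node in enumerate(nodes):
        for row in node:
            if home.setdefault(row[key], node_id) != node_id:
                return False
    return True

# Hypothetical toy table of (customer_id, amount); each customer appears twice.
rows = [(cid, 10 * cid) for cid in range(1, 9)] * 2

rr = round_robin_partition(rows, 3)   # evenly spread, keys scattered
hp = hash_partition(rows, 3, key=0)   # same key always on the same node

print("round-robin co-located:", colocated(rr, 0))
print("hash co-located:       ", colocated(hp, 0))
```

With hash distribution, a join or GROUP BY on the distribution key can run node-locally. With round-robin, rows for one key are spread across nodes, so a system typically has to move data at query time.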

The In-Socket concept allows us to mix/match as we need to. If we felt we needed more FPGA and/or more memory for something in the future, we’d just move to the HP-DL785, for example. For now, we don’t need to.

If/when we publish our benchmark numbers publicly, it would not be for the 300GB size, but the largest sizes (3TB, 10TB, and 30TB sizes). We haven’t because our customers aren’t asking for it. They want a rack(s) on site, with their data, running their queries. This is what we are doing with 100% effort.

By: Eric Wendel

Eric Wendel — Fri, 28 Aug 2009 18:23:45 +0000

Hi Geno,

I don’t have any association with Kickfire whatsoever, so I can’t address your patented vs. pending question, but Kickfire’s Joseph Chamdani (for one) has a very long and impressive resume in the area, has many patents issued, and has worked for Sun in related areas as well. It’s entirely possible that there are previously issued patents that Kickfire has licensed, in which case it can legitimately use “Patented” even if none of the current applications have been issued as patents.

Anyway, I think we should assume (especially here in a public forum) that the Kickfire team and their lawyers are smart enough to use the term “patented” appropriately.

Now, the QDR memory on your device is (a) an SRAM cache, and (b) only 32Mbytes. The need for a custom memory interface I refer to applies to the system DRAM, which needs to be GBytes in size and support massive concurrency of “in flight” memory transfers (in addition to bandwidth) for Data Warehousing apps.

If the XtremeData approach of using the server mobo memory-subsystem works as well as Kickfire’s, then (by virtue of being a much cheaper way to go) it should deliver even better cost/performance benefits in TPC-H…right?

Eric

By: Geno Valente

Geno Valente — Fri, 28 Aug 2009 03:11:55 +0000

@ Eric
Two things. First, everything about dbX is dramatically different than KickFire. I’ll start with the fact that we scale to data sets that are 1000x bigger than KickFire’s, and finish with the hardware: the In-Socket Accelerator has QDR memory on it and uses the largest of FPGAs, which have massive internal memory bandwidth, and we use the local DDR DIMMS on the motherboard via two separate controllers. That gives us all the power we need to offer 1GB/Sec/Node regardless of data partitioning. I’ll recommend this ChalkTalk for anyone that wants to know more: http://www.xtremedata.com/parallelism.php

Second, I’d love to learn more about KickFire. I’ve read the website but I wanted more. So I just did a search at http://www.uspto.gov (US Patent and Trademark Office) and could only find a patent application, not an approval. I’m guessing I just missed it or their search engine isn’t finding it. (Noting that calling something “patented” that is only “patent pending” is highly illegal, so perhaps I just can’t find it.) Can you post or email us the patent number? That would probably clear some things up about what you are doing and help everyone reading understand the information. A tall request I know, but you have basically made this public via the filing for a patent in the first place.

By: Justin Swanhart

Justin Swanhart — Thu, 27 Aug 2009 22:41:48 +0000

Eric,

Consider further that Kickfire doesn’t use sequential compression algorithms but instead uses dictionary compression which doesn’t require decompression during query processing.
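A minimal sketch of the general idea, assuming a sorted (order-preserving) dictionary: the source does not describe Kickfire's actual encoding, so the function names and the toy column are illustrative only. The predicate literal is translated to a code once, and the scan then compares integer codes without ever rebuilding the original strings.

```python
from bisect import bisect_left

def dictionary_encode(values):
    """Return (sorted dictionary, integer codes) for one column."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    codes = [code_of[v] for v in values]
    return dictionary, codes

def count_equal(dictionary, codes, literal):
    """COUNT(*) WHERE col = literal, evaluated on codes only."""
    pos = bisect_left(dictionary, literal)
    if pos == len(dictionary) or dictionary[pos] != literal:
        return 0  # literal not in the column at all
    return sum(1 for c in codes if c == pos)

def count_less_than(dictionary, codes, literal):
    """COUNT(*) WHERE col < literal; valid because the dictionary is sorted."""
    bound = bisect_left(dictionary, literal)
    return sum(1 for c in codes if c < bound)

# Hypothetical toy column of country codes.
column = ["DE", "US", "FR", "US", "DE", "US", "JP"]
dictionary, codes = dictionary_encode(column)

print(dictionary)                                # ['DE', 'FR', 'JP', 'US']
print(codes)                                     # [0, 3, 1, 3, 0, 3, 2]
print(count_equal(dictionary, codes, "US"))      # 3
print(count_less_than(dictionary, codes, "JP"))  # 3
```

By contrast, a sequential (stream) compressor such as an LZ-family algorithm generally has to decompress a block before any value in it can be compared.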

By: Eric Wendel

Eric Wendel — Thu, 27 Aug 2009 17:37:36 +0000

TS,

These may help illustrate the value of the FPGA in the Kickfire architecture:

1) “Where’s the Beef? Why FPGAs Are So Fast” http://research.microsoft.com/apps/pubs/default.aspx?id=70636

So…orders of magnitude speedups are typical for FPGA across a wide range of apps…but note the importance of the customized memory subsystem and interface:

“Our results show that custom memory interfaces are the most effective way at enabling much greater performance on the FPGA, and that memory interfaces traditional software use become a bottleneck when the FPGA uses the same interface.”

This is why Kickfire uses a separate memory subsystem. It’s also why in-socket FPGA accelerator approaches like XtremeData DBX fall far short.

Now consider, just for the job of compression/de-compression ALONE (crucial in the column-store context), a single FPGA is faster than dozens of multi-GHz cores, and uses a tiny fraction of the power.

For a specific example, take a look at this:

2) “Streaming implementation of a sequential decompression algorithm on an FPGA”

http://portal.acm.org/citation.cfm?id=1508195#abstract

And of course compression becomes exponentially more important in the Flash SSD context.

Hope this helps.

Eric Wendel

By: Raj

Raj — Tue, 25 Aug 2009 22:02:55 +0000

Hi Curt,

Thank you for your post on our technology. I have a couple of additional details to provide:

1) I think it would be interesting for your readers to understand the key design philosophy behind our SQL chip. In a nutshell, with our chip, we have done for memory bandwidth what column store has done for I/O bandwidth. Why is this important? The SQL chip is based on a dataflow architecture which employs direct transistor-based processing engines to natively execute high-level relational operations and database algorithms. This approach delivers an order of magnitude more query processing capability than today’s state-of-the-art microprocessors. This results in the need for an order of magnitude greater memory bandwidth than today’s microprocessors have available. There are several techniques that are employed in our SQL chip and systems stack to get this increased memory bandwidth. As we discussed, the top two are a) the ability to keep the intermediate data sets/tuple sets, resulting from the processing of complex queries, live on the chip without spilling to memory and b) using “deep-indexing” which helps avoid memory-intensive column scans which we can discuss later at your convenience. The net result is that you get the processing power and memory bandwidth equivalent of 20 or so Nehalem-based CPUs in a single chip.

2) A little more detail on how we manage the MySQL interface. After receiving the parse tree from MySQL, our optimizer generates a plan that uses either or both the SQL chip (our FPGA) and our software SQL execution engine (executed on the x86 in our base server). The majority of queries run natively in hardware or with just a small component in software. A smaller set will run just in our software execution engine. Neither of these paths uses the MySQL optimizer or execution engine, thereby ensuring consistently high performance.
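The split described in point 2 can be pictured with a toy sketch like the one below. It is purely illustrative: the operator names, the `HARDWARE_OPS` set, and the `PlanOp` type are hypothetical, since the comment does not say which operations the SQL chip supports or how the optimizer represents a plan.

```python
from dataclasses import dataclass

# Hypothetical set of operators the hardware path can run; the source
# does not list which operations the SQL chip actually supports.
HARDWARE_OPS = {"scan", "filter", "join", "aggregate"}

@dataclass
class PlanOp:
    name: str

def assign_engines(plan):
    """Tag each operator with the engine assumed to execute it."""
    return [
        (op.name, "hardware" if op.name in HARDWARE_OPS else "software")
        for op in plan
    ]

def classify(plan):
    """Label a non-empty plan as 'hardware', 'software', or 'mixed'."""
    engines = {engine for _, engine in assign_engines(plan)}
    if engines == {"hardware"}:
        return "hardware"
    if engines == {"software"}:
        return "software"
    return "mixed"

all_hw = [PlanOp("scan"), PlanOp("filter"), PlanOp("aggregate")]
mixed = [PlanOp("scan"), PlanOp("join"), PlanOp("string_udf")]
all_sw = [PlanOp("string_udf")]

print(classify(all_hw))  # hardware
print(classify(mixed))   # mixed
print(classify(all_sw))  # software
```

The three labels correspond to the three cases in the comment: queries that run natively in hardware, queries with a small software component, and queries that run only in the software engine.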

Thanks,

Raj

By: Justin Swanhart

Justin Swanhart — Tue, 25 Aug 2009 17:35:53 +0000

TS – Get back to me when your 1TB machine beats our appliance in Price/Performance on the TPCH 300G benchmark.