Why Panel is 10x slower with Dask than with Pandas?

I ended up using DuckDB. Some benchmarks with my 24Gb parquet file (71Gb in memory. All of that will not fit into memory at once. Tested on my laptop):

  • Dask: 5:00min initially, 1:30min with optimizations
  • Polars: 40 seconds out of the box. 11 seconds with some fine tuning.
  • DuckDB: 0.5-2 seconds out of the box (used same optimized table structure as with Polars)

I liked the Polars API perhaps a bit more than writing SQL for DuckDB, but decided to go with DuckDB as it was so fast out of the box.

3 Likes