Can a large Parquet file created with PySpark be used with Panel?

Hello All,

I have a use case to build a dashboard, which will ultimately create directed graphs based on a Parquet file created with PySpark (on a single node).
By “large”, I mean that the Parquet file can be up to 30 GB and could contain around 120 million data points.

I can work with the file quite quickly in JupyterLab, so in terms of RAM it should be fine.

I am curious whether I am on the wrong track trying to use Panel, or whether this is doable.

Tips & tricks are appreciated, thank you!

On second thought, I don’t think this is possible with a PySpark DataFrame.
As far as I know, Panel supports Pandas DataFrames, and converting a PySpark DataFrame to Pandas is very expensive.
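
That said, the conversion might only be expensive for the full dataset. One possible workaround would be to filter and aggregate in Spark first and only call `.toPandas()` on the reduced result. A minimal sketch, where the session setup, file path, and column names are all made up for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Hypothetical local session and Parquet path, for illustration only
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("events.parquet")

# Filter and aggregate in Spark, where the 120M rows live
small = (
    df.filter(F.col("date") >= "2021-01-01")
      .groupBy("source", "target")
      .count()
)

# Only the reduced result crosses into Pandas, so toPandas() stays cheap
pdf = small.toPandas()
```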

Hi @sorin

I think it should be doable. But it would depend on:

  1. how many users you have
  2. what “servers” you have access to
  3. the kind of dashboard
  4. how you implement it

My hypothesis would be that it could be beneficial to use Dask.
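
For instance, here is a rough sketch of reading the Parquet file lazily with Dask and handing a filtered slice to Panel. The file path and the `date` column are assumptions, not something from your data:

```python
import dask.dataframe as dd
import pandas as pd
import panel as pn

pn.extension()

# Lazy read: nothing is loaded into memory until .compute() is called
ddf = dd.read_parquet("events.parquet")  # hypothetical path

start = pn.widgets.DatePicker(name="Start date")

@pn.depends(start)
def view(start_date):
    if start_date is None:
        return "Pick a start date"
    # The filter runs in Dask; only the selected slice is materialized
    subset = ddf[ddf["date"] >= pd.Timestamp(start_date)].compute()
    return pn.pane.DataFrame(subset.head(100))

pn.Column(start, view).servable()
```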

I’m working on a “How to test” guide which includes performance and load testing. This might be relevant: panel/test.md at de3b7c0ad97c14904d43bc842c3caceea45badd9 · holoviz/panel (github.com).

Panel just runs on top of Python, so in principle you can develop on top of any Python framework, data format, or server you can interact with, including Spark if needed.
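
For example, a rough sketch of a Panel widget driving a Spark query directly, where the session, file path, column, and option values are all placeholders:

```python
import panel as pn
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

pn.extension()

# Hypothetical session and Parquet file; adjust to your environment
spark = SparkSession.builder.master("local[*]").getOrCreate()
events = spark.read.parquet("events.parquet")

category = pn.widgets.Select(name="Category", options=["a", "b", "c"])

@pn.depends(category)
def table(cat):
    # Spark does the filtering; at most 1000 rows reach Pandas/Panel
    pdf = (
        events.filter(F.col("category") == cat)
              .limit(1000)
              .toPandas()
    )
    return pn.pane.DataFrame(pdf)

pn.Column(category, table).servable()
```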

Hi @Marc!

To quickly answer your questions:

  • The use case will serve only one user locally, on their laptop
  • The laptop is supposed to be powerful enough (32-64 GB RAM)
  • The current idea is to create a directed graph that can be filtered by date and/or by certain attributes from the data