Can a large Parquet file created with PySpark be used with Panel?

Hello All,

I have a use case to build a dashboard, which will ultimately create directed graphs based on a Parquet file created with PySpark (on a single node).
By “large”, I mean that the Parquet file can be up to 30 GB and could contain around 120 million data points.

I can work with the file quite quickly in JupyterLab, so in terms of RAM it should be fine.

I am curious whether I am on the wrong track trying to use Panel, or whether this is doable.

Tips & tricks are appreciated, thank you!

On second thought, I don’t think this is possible with a PySpark DataFrame.
As far as I know, Panel supports Pandas DataFrames, and converting a PySpark DataFrame to Pandas is very expensive.
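
That said, the conversion might only be expensive for the full dataset. One possible workaround would be to filter and aggregate in Spark first and only call `.toPandas()` on the reduced result. A minimal sketch, where the session setup, file path, and column names are all made up for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Hypothetical local session and Parquet path, for illustration only
spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.read.parquet("events.parquet")

# Filter and aggregate in Spark, where the 120M rows live
small = (
    df.filter(F.col("date") >= "2021-01-01")
      .groupBy("source", "target")
      .count()
)

# Only the reduced result crosses into Pandas, so toPandas() stays cheap
pdf = small.toPandas()
```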

Hi @sorin

I think it should be doable. But it would depend on:

  1. how many users you have
  2. what “servers” you have access to
  3. the kind of dashboard
  4. how you implement it

My hypothesis would be that it could be beneficial to use Dask.
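
For instance, here is a rough sketch of reading the Parquet file lazily with Dask and handing a filtered slice to Panel. The file path and the `date` column are assumptions, not something from your data:

```python
import dask.dataframe as dd
import pandas as pd
import panel as pn

pn.extension()

# Lazy read: nothing is loaded into memory until .compute() is called
ddf = dd.read_parquet("events.parquet")  # hypothetical path

start = pn.widgets.DatePicker(name="Start date")

@pn.depends(start)
def view(start_date):
    if start_date is None:
        return "Pick a start date"
    # The filter runs in Dask; only the selected slice is materialized
    subset = ddf[ddf["date"] >= pd.Timestamp(start_date)].compute()
    return pn.pane.DataFrame(subset.head(100))

pn.Column(start, view).servable()
```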

I’m working on a “How to test” guide which includes performance and load testing. This might be relevant: panel/test.md at de3b7c0ad97c14904d43bc842c3caceea45badd9 · holoviz/panel (github.com).

Panel just runs on top of Python, so in principle you can develop on top of any Python framework, data format, or server you can interact with, including Spark if needed.
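
For example, a rough sketch of a Panel widget driving a Spark query directly, where the session, file path, column, and option values are all placeholders:

```python
import panel as pn
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

pn.extension()

# Hypothetical session and Parquet file; adjust to your environment
spark = SparkSession.builder.master("local[*]").getOrCreate()
events = spark.read.parquet("events.parquet")

category = pn.widgets.Select(name="Category", options=["a", "b", "c"])

@pn.depends(category)
def table(cat):
    # Spark does the filtering; at most 1000 rows reach Pandas/Panel
    pdf = (
        events.filter(F.col("category") == cat)
              .limit(1000)
              .toPandas()
    )
    return pn.pane.DataFrame(pdf)

pn.Column(category, table).servable()
```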

Hi @Marc!

To quickly answer your questions:

  • The use case will serve only one user locally, on their laptop
  • The laptop is supposed to be powerful enough (32-64 GB RAM)
  • The current idea is to create a directed graph that can be filtered by date and/or by certain attributes from the data