@vicki as part of that I discovered that loading a particular (compressed) 20MB Parquet file takes between 280MB and 600MB of RAM, depending on the library used to load it. In the Pandas case this will apparently vary a lot depending on whether the user installed pyarrow or fastparquet.
Starting to think that predicting Pandas memory usage is just too hard to even try, and the only thing to do is measure (and/or reduce).
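For the "measure" part, a crude sketch of what I mean: check peak RSS before and after the load (Unix-only; note ru_maxrss is KiB on Linux but bytes on macOS, and the file name here is made up).

```python
import resource
import pandas as pd

# Peak resident set size before and after the read; the difference shows
# roughly how much the load cost at its worst, not just the final DataFrame.
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
df = pd.read_parquet("data.parquet")
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS grew by ~{(after - before) / 1024:.0f} MiB")  # Linux units
```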
@itamarst something I noticed while trying Polars out of the box on a 2GB JSON file is that it OOMed the Jupyter kernel immediately. So I guess I now need to understand Polars' limitations better as well.
@vicki Looks like there's no lazy API for JSON in Polars. So I'd probably use something like https://pypi.org/project/ijson/ + Python's csv writer to convert the JSON to CSV with fixed memory, and then use the lazy API on the CSV.
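Roughly something like this (a sketch, assuming the JSON is a top-level array of flat objects with a consistent set of keys; file names are placeholders):

```python
import csv
import ijson

with open("big.json", "rb") as src, open("big.csv", "w", newline="") as dst:
    records = ijson.items(src, "item")  # streams one object at a time
    first = next(records)
    writer = csv.DictWriter(dst, fieldnames=list(first.keys()))
    writer.writeheader()
    writer.writerow(first)
    for record in records:
        writer.writerow(record)
```

Memory stays fixed because only one record is in flight at a time, no matter how big the file is.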
@vicki That's annoying though, insofar as CSV has less type information than even JSON. So maybe something based on a combination of ijson and https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
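Something along these lines, streaming the JSON into Parquet in batches (again a sketch: it assumes a top-level array of flat objects, and that the schema pyarrow infers from the first batch holds for the rest):

```python
import itertools
import ijson
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 10_000  # arbitrary; trade memory for fewer, larger row groups

with open("big.json", "rb") as src:
    records = ijson.items(src, "item")
    writer = None
    while True:
        batch = list(itertools.islice(records, BATCH_SIZE))
        if not batch:
            break
        table = pa.Table.from_pylist(batch)
        if writer is None:
            # First batch fixes the schema for the whole Parquet file
            writer = pq.ParquetWriter("big.parquet", table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```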
And sometimes your data just doesn't fit in memory, so don't try. Use Polars + lazy + streaming.
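I.e. something like this minimal sketch, with made-up column names, once the data is in a format Polars can scan lazily:

```python
import polars as pl

# Lazy scan: nothing is loaded eagerly, and the streaming engine runs the
# query in bounded memory instead of materializing the whole file.
lazy = (
    pl.scan_csv("big.csv")
      .filter(pl.col("amount") > 0)
      .group_by("category")
      .agg(pl.col("amount").sum())
)
result = lazy.collect(streaming=True)  # newer Polars spells this engine="streaming"
print(result)
```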