Processing large amounts of data with Pandas can be difficult; it's quite easy to run out of memory and either slow down or crash. The Polars dataframe library is a potential solution: if you use the right APIs, it can significantly reduce memory usage.

pythonspeed.com/articles/polar

Python⇒Speed · Why Polars uses less memory than Pandas
Polars is an alternative to Pandas that can often run faster, and use less memory!

@vicki As part of that I discovered that loading a particular (compressed) 20MB Parquet file takes between 280MB and 600MB of RAM, depending on the library used to load it. In the Pandas case this apparently varies a lot depending on whether the user installed pyarrow or fastparquet.

Starting to think that predicting Pandas memory usage is just too hard to even try, and the only thing to do is measure (and/or reduce).
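One way to do that measuring, sketched with the stdlib's `tracemalloc` (the CSV here is synthetic; with a real workload you'd load your actual file):

```python
import csv
import tempfile
import tracemalloc

import pandas as pd

# Build a small CSV to load (a stand-in for a real dataset).
path = tempfile.mkstemp(suffix=".csv")[1]
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b"])
    writer.writerows((i, i * 2) for i in range(10_000))

# Instead of predicting memory usage, measure it: tracemalloc reports
# peak allocations routed through Python's allocators (NumPy's array
# allocations participate, so DataFrame memory shows up here).
tracemalloc.start()
df = pd.read_csv(path)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak while loading: {peak / 1e6:.2f} MB")
```

Note that `tracemalloc` only sees allocations made through Python's memory APIs; C extensions that allocate directly (e.g. via `malloc`) won't be counted, so treat the number as a lower bound.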

@itamarst Something I noticed while trying Polars out of the box on a 2GB JSON file is that it immediately OOMed the Jupyter kernel. So I guess I now need to understand Polars' limitations better as well.

Itamar Turner-Trauring

@vicki Looks like there's no lazy API for JSON in Polars 😢 So I'd probably use something like pypi.org/project/ijson/ + Python's csv writer to convert the JSON to CSV with fixed memory, and then use the lazy API on the CSV.

PyPI · ijson · Iterative JSON parser with standard Python iterator interfaces

@astrojuanlu @itamarst @vicki

And sometimes your data just doesn't fit in memory, so don't try to load it all at once. Use Polars + lazy + streaming.