@vicki as part of that I discovered that loading a particular (compressed) 20MB Parquet file takes between 280MB and 600MB of RAM, depending on the library used to load it. In the Pandas case this will apparently vary a lot depending on whether the user installed pyarrow or fastparquet.
Starting to think that predicting Pandas memory usage is just too hard to even try, and the only thing to do is measure (and/or reduce).
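For the "measure" part, a crude sketch of what I mean: check peak RSS before and after the load (Unix-only; note ru_maxrss is KiB on Linux but bytes on macOS, and the file name here is made up).

```python
import resource
import pandas as pd

# Peak resident set size before and after the read; the difference shows
# roughly how much the load cost at its worst, not just the final DataFrame.
before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
df = pd.read_parquet("data.parquet")
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS grew by ~{(after - before) / 1024:.0f} MiB")  # Linux units
```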
@itamarst something I noticed while trying Polars out of the box on a 2GB JSON file is that it OOMed the Jupyter kernel immediately. So I guess I now need to understand Polars' limitations better as well.
@vicki Looks like there's no lazy API for JSON in Polars. So I'd probably use something like https://pypi.org/project/ijson/ + Python's csv writer to convert the JSON to CSV with fixed memory, and then use the lazy API on the CSV.
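Roughly something like this (a sketch, assuming the JSON is a top-level array of flat objects with a consistent set of keys; file names are placeholders):

```python
import csv
import ijson

with open("big.json", "rb") as src, open("big.csv", "w", newline="") as dst:
    records = ijson.items(src, "item")  # streams one object at a time
    first = next(records)
    writer = csv.DictWriter(dst, fieldnames=list(first.keys()))
    writer.writeheader()
    writer.writerow(first)
    for record in records:
        writer.writerow(record)
```

Memory stays fixed because only one record is in flight at a time, no matter how big the file is.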
@vicki That's annoying though, insofar as CSV has less type information than even JSON. So maybe something based on a combination of ijson and https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
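Something along these lines, streaming the JSON into Parquet in batches (again a sketch: it assumes a top-level array of flat objects, and that the schema pyarrow infers from the first batch holds for the rest):

```python
import itertools
import ijson
import pyarrow as pa
import pyarrow.parquet as pq

BATCH_SIZE = 10_000  # arbitrary; trade memory for fewer, larger row groups

with open("big.json", "rb") as src:
    records = ijson.items(src, "item")
    writer = None
    while True:
        batch = list(itertools.islice(records, BATCH_SIZE))
        if not batch:
            break
        table = pa.Table.from_pylist(batch)
        if writer is None:
            # First batch fixes the schema for the whole Parquet file
            writer = pq.ParquetWriter("big.parquet", table.schema)
        writer.write_table(table)
    if writer is not None:
        writer.close()
```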
And sometimes your data just doesn't fit in memory, so don't try. Use Polars + lazy + streaming.
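I.e. something like this minimal sketch, with made-up column names, once the data is in a format Polars can scan lazily:

```python
import polars as pl

# Lazy scan: nothing is loaded eagerly, and the streaming engine runs the
# query in bounded memory instead of materializing the whole file.
lazy = (
    pl.scan_csv("big.csv")
      .filter(pl.col("amount") > 0)
      .group_by("category")
      .agg(pl.col("amount").sum())
)
result = lazy.collect(streaming=True)  # newer Polars spells this engine="streaming"
print(result)
```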