# general
k
When pyspark saves parquet files to a folder with partitioning, it creates subfolders of the form partition=some_value. When I use daft to read_parquet on the parent folder (through a glob or otherwise), would it be possible to get back the columns of the table that were used as partitions?
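For context, a minimal sketch of the pyspark side (assuming a local Spark session and a hypothetical output directory `out/`):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2023-01-01", 1), ("2023-01-02", 2)],
    ["date", "value"],
)
# partitionBy writes one subdirectory per partition value,
# e.g. out/date=2023-01-01/part-....parquet; the "date" column
# lives only in the path, not inside the parquet files.
df.write.partitionBy("date").parquet("out")
```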
j
Yes, this is called a hive-style read, where we read the file path and parse the `key=value` pairs into columns. It’s currently a TODO item. I’m guessing one workaround for now would be to at least give you the filenames (the PR for that is almost ready to merge!). Could you create an issue for us for parsing Hive partition paths? We’ll get right on it.
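Until then, a rough workaround sketch in plain Python: glob the files yourself, parse the `key=value` path segments, and attach them as literal columns. This assumes the `out/` layout above, a daft version with `read_parquet`, `with_column`, `lit`, and `concat`, and that string-typed partition values are acceptable (cast afterwards if not):
```python
import glob
import os

import daft

# Read each file individually and recover its partition values
# from the key=value segments of its path.
paths = glob.glob("out/**/*.parquet", recursive=True)
dfs = []
for path in paths:
    df = daft.read_parquet(path)
    for segment in path.split(os.sep):
        if "=" in segment:
            key, value = segment.split("=", 1)
            df = df.with_column(key, daft.lit(value))
    dfs.append(df)

# Combine the per-file frames into one; daft DataFrames are lazy,
# so nothing is materialized until you collect.
result = dfs[0]
for df in dfs[1:]:
    result = result.concat(df)
```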
k
Okay!
Thanks!!