Phil Chen
08/27/2024, 6:11 PM
jay
08/27/2024, 7:09 PM
You can run ray.init(include_dashboard=True) first before running Daft! This will spin up Ray first (with a dashboard), and then Daft should automatically pick it up when you do run it.
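[Editor's note: a minimal sketch of that ordering. The query string and connection URL are hypothetical placeholders; set_runner_ray() is shown for explicitness, though Daft may also attach to the running cluster automatically as described above.]

import ray
import daft

# Start Ray yourself first so the dashboard is enabled;
# Daft should then attach to this existing Ray instance.
ray.init(include_dashboard=True)

# Optional if auto-detection works, but makes the intent clear.
daft.context.set_runner_ray()

# Hypothetical query and connection, just to show where Daft runs afterwards.
df = daft.read_sql("SELECT * FROM events", "postgresql://user:pass@host:5432/db")
df.show()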
Colin Ho
08/27/2024, 7:12 PM
This works by adding a where clause in your query, i.e. WHERE partition_key >= lower_bound AND partition_key < upper_bound.
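[Editor's note: a sketch of how such a partitioned read might look. The table name, column, connection string, and partition count are made up for illustration.]

import daft

# With partition_col and num_partitions, Daft splits the query into ranges,
# conceptually one "WHERE partition_key >= lo AND partition_key < hi"
# clause per partition.
df = daft.read_sql(
    "SELECT * FROM my_table",               # hypothetical table
    "postgresql://user:pass@host:5432/db",  # hypothetical connection string
    partition_col="partition_key",
    num_partitions=8,
)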
Phil Chen
08/27/2024, 7:29 PM
Phil Chen
08/27/2024, 7:44 PM
Colin Ho
08/27/2024, 8:44 PM
You can do print(df.write_parquet(..)), which will show you the number of files and the file paths that Daft wrote to. These files are guaranteed to contain the results of your query.
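[Editor's note: a small sketch of inspecting those paths; the input glob and output directory are placeholders.]

import daft

df = daft.read_parquet("input/*.parquet")  # placeholder input

# write_parquet returns a DataFrame listing the written files;
# printing it shows how many files were produced and where.
written = df.write_parquet("s3://my-bucket/output/")  # hypothetical path
print(written)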
08/27/2024, 8:48 PMdf.write_parquet(…) produces a dataframe that has the written filepaths) and save that somewhere as a manifest of the complete dataset written
• You may also consider using something like DeltaLake or Iceberg to get transactional guarantees around writing datajay
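[Editor's note: a sketch of the manifest idea, assuming the returned DataFrame exposes the written filepaths in a "path" column, as in the listing later in this thread. Input, output, and manifest locations are placeholders.]

import json
import daft

df = daft.read_parquet("input/*.parquet")             # placeholder input
written = df.write_parquet("s3://my-bucket/output/")  # hypothetical path

# Materialize the written filepaths and persist them as a manifest,
# so downstream readers only trust files listed here.
paths = written.to_pydict()["path"]
with open("manifest.json", "w") as f:
    json.dump({"files": paths}, f)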
jay
08/27/2024, 9:28 PM
jay
08/27/2024, 9:28 PM
Phil Chen
08/27/2024, 9:31 PM
Phil Chen
08/28/2024, 1:32 PM
Colin Ho
08/28/2024, 6:02 PM
1. Are you using the partition_cols argument in write_parquet? This will subdivide a partition further based on the given columns.
2. Are you performing any additional operations between read_sql and write_parquet? This could affect the number of partitions.
3. In the list of files returned from write_parquet, do you notice any files with the same uuid prefix, but different suffix? For example:
"path"
"c2f241b6-a54e-46bd-8453-fb7fb9aff4eb-1.parquet"
"c2f241b6-a54e-46bd-8453-fb7fb9aff4eb-0.parquet"
"c2f241b6-a54e-46bd-8453-fb7fb9aff4eb-2.parquet"
Notice that these files have the same uuid prefix, but are suffixed differently with '-0', '-1', or '-2'. This can happen if the partition is too large: Daft will create multiple files per partition (you can actually configure the parquet_target_filesize, see: https://getdaft.io/projects/docs/en/stable/api_docs/doc_gen/configuration_functions/daft.set_execution_config.html). Apologies, I only just realised that we have this behavior 😅.
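[Editor's note: a sketch of tuning that target size via the linked set_execution_config; the 512 MB default and the parquet_target_filesize kwarg follow the discussion in this thread, and the chosen value and paths are illustrative.]

import daft

# Raise the target parquet file size so large partitions are split
# into fewer output files (the default discussed here is 512 MB).
daft.set_execution_config(parquet_target_filesize=2 * 1024**3)  # 2 GB

df = daft.read_parquet("input/*.parquet")  # placeholder input
df.write_parquet("output/")                # hypothetical path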
Phil Chen
08/28/2024, 10:22 PM
Colin Ho
08/28/2024, 10:48 PM
> So, if I understand correctly, the final number of files may not be the same as the num_partitions value passed to read_sql?
Yes, this is correct: the number of files can be greater. By default, Daft's target parquet file size is 512 MB. write_parquet will split partitions accordingly to match this target size.
Phil Chen
08/29/2024, 3:05 PM