Phil Chen
09/09/2024, 7:10 PM
Phil Chen
09/09/2024, 7:16 PM
Phil Chen
09/09/2024, 7:31 PM
Phil Chen
09/09/2024, 7:52 PM
Phil Chen
09/09/2024, 8:04 PM
Colin Ho
09/09/2024, 8:27 PM
Colin Ho
09/09/2024, 8:32 PM
> In addition, id is the index and its type is int. Why are the range limits in the generated query's WHERE clause floats?
To support partitioned reads, Daft first calculates percentiles of the partition column via PERCENTILE_DISC. These percentiles can be floats. It then uses them as the bounds for each read. Here's a more in-depth explanation: https://www.getdaft.io/projects/docs/en/stable/user_guide/integrations/sql.html#parallel-distributed-reads
Colin Ho
09/09/2024, 8:44 PM
> I am wondering whether any of those objects could be cached (each Ray worker only needs to create one engine and one connection for the same connection_url).
This would be quite tricky to implement, as Daft currently uses Ray tasks, which are stateless. We also can't reuse the same connection across tasks, as connections are not serializable.
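To make the serialization point concrete, here is a minimal sketch. It uses the standard-library sqlite3 module as a stand-in for a SQLAlchemy connection (an assumption made so the example runs anywhere), and the helper names is_picklable and run_task are hypothetical, not part of Daft or Ray. It shows that a live connection cannot be pickled while a plain connection string can, which is why each stateless task would open its own connection rather than receive a shared one.

```python
import pickle
import sqlite3
from contextlib import closing


def is_picklable(obj) -> bool:
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False


def run_task(database: str, query: str) -> list:
    """Stand-in for one stateless task: open a connection, read, close it."""
    with closing(sqlite3.connect(database)) as conn:
        return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
print(is_picklable(conn))        # a live connection holds OS-level state
print(is_picklable(":memory:"))  # a connection string is just data
conn.close()

# Each call builds and tears down its own connection from the string.
print(run_task(":memory:", "SELECT 1 + 1"))
```

Shipping the connection string (or a factory that closes over it) to each task keeps the task self-contained, at the cost of one connection setup per task.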
Colin Ho
09/09/2024, 8:44 PM
> Does read_sql close the connection created by the factory?
Yes, the connection is closed once the read is complete.
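Returning to the float bounds in the partitioned-read answer above, the following is a rough sketch of how a list of percentile bounds could be turned into one range predicate per partition. It is not Daft's actual query generation: the function names, the example bounds, and the exact predicate format are assumptions for illustration. The point it shows is that float bounds on an integer column still assign every row to exactly one partition.

```python
def partition_predicates(column: str, bounds: list[float]) -> list[str]:
    """Turn N+1 sorted bounds into N non-overlapping range predicates."""
    predicates = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        # Only the final range is closed on the right, so the maximum
        # value is included exactly once.
        upper_op = "<=" if i == last else "<"
        predicates.append(f"{column} >= {lo} AND {column} {upper_op} {hi}")
    return predicates


def matching_partitions(value: float, bounds: list[float]) -> list[int]:
    """Indices of the ranges that contain value (same rules as above)."""
    last = len(bounds) - 2
    hits = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        below_upper = value <= hi if i == last else value < hi
        if value >= lo and below_upper:
            hits.append(i)
    return hits


# Hypothetical percentile bounds for an int column holding ids 1..10.
bounds = [1.0, 3.25, 5.5, 7.75, 10.0]
for predicate in partition_predicates("id", bounds):
    print(predicate)

# Every integer id lands in exactly one partition despite float bounds.
print(all(len(matching_partitions(i, bounds)) == 1 for i in range(1, 11)))
```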
Colin Ho
09/09/2024, 8:47 PM
Colin Ho
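09/09/2024, 9:15 PM
You can use https://tenacity.readthedocs.io/en/latest/ to implement the retry logic:
from sqlalchemy import create_engine
from tenacity import retry, stop_after_attempt, wait_fixed

# connection_url is assumed to be defined elsewhere (your database URL).
# Retry up to 3 times, waiting 5 seconds between attempts.
@retry(stop=stop_after_attempt(3), wait=wait_fixed(5))
def get_connection():
    engine = create_engine(connection_url)
    return engine.connect()
Phil Chen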
09/09/2024, 10:35 PM
Phil Chen
09/10/2024, 4:02 PM
Phil Chen
09/10/2024, 5:14 PM
Colin Ho
09/10/2024, 5:31 PM
Phil Chen
09/10/2024, 8:04 PM
Phil Chen
09/10/2024, 10:02 PM
Phil Chen
09/10/2024, 10:04 PM
Phil Chen
09/11/2024, 1:39 PM
Phil Chen
09/11/2024, 2:15 PM
Colin Ho
09/11/2024, 4:57 PM