Community for the Daft project and all things distributed data

Distributed Data Community

If I have a variable that holds a dictionary that is read and updated by several different UDFs, where would be the best place to store that dictionary?

Could you give a better idea of what your overall workflow is?

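In general, having UDFs communicate via an external variable is quite tricky because it breaks when running in a distributed environment: each worker typically operates on its own copy of the variable, so updates made in one UDF are not visible to the others!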
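To make that concrete, here is a minimal sketch that does not use Daft at all. It assumes a distributed runner ships captured state to workers as a serialized copy, and it simulates that with `pickle`. The names `seen` and `record` are hypothetical.

```python
import pickle

seen = {}  # hypothetical shared dictionary living on the driver


def record(state, key):
    """Mark a key as seen in whatever dictionary this function is handed."""
    state[key] = True
    return state


# Simulate shipping the dictionary to a worker: it arrives as a pickled copy
# (an assumption about how a distributed runner moves captured state around).
worker_copy = pickle.loads(pickle.dumps(seen))
record(worker_copy, "row-1")

print(seen)         # the driver's dictionary is unchanged
print(worker_copy)  # only the worker's copy received the update
```

The update lands in the copy, so a second UDF reading `seen` on the driver or on another worker would never observe it.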

I'm trying to run near-deduplication as well. I have a MinHashLSH index that one UDF updates with the MinHash of every row, and I intend to use that index, held in the external variable, from another UDF to query every row and get the keys of the similar rows. After that, I intend to create another external variable to build a graph of all the similarity edges, so I can assign groups based on its connected components.

dedupe.py

I see! Take a look at the attached Python file for some examples of how you could do minhash-based dedup.
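Independently of the attached file, whose contents are not shown here, the workflow described above can be sketched in plain Python without any shared mutable index. This is only an illustration under assumptions: `NUM_PERM`, `BANDS`, the word-shingle size, and all function names are made up for the example, and the hashing uses `hashlib` instead of a MinHash library. Each row gets a signature, rows sharing an LSH band key become edges, and union-find assigns connected components.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 32  # number of hash functions per signature (assumed value)
BANDS = 8      # number of LSH bands; each band covers NUM_PERM // BANDS rows


def shingles(text, k=3):
    """Return the set of k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text):
    """Compute a MinHash signature: one minimum per seeded hash function."""
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles(text)
        ))
    return tuple(sig)


def band_keys(sig):
    """Split a signature into (band index, band slice) keys for bucketing."""
    rows = NUM_PERM // BANDS
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def dedupe_groups(docs):
    """Map each document id to the smallest id in its connected component."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        for key in band_keys(minhash(text)):
            buckets[key].append(doc_id)

    parent = {doc_id: doc_id for doc_id in docs}
    for ids in buckets.values():
        # Every id in a bucket is a candidate near-duplicate of the first one.
        for other in ids[1:]:
            ra, rb = find(parent, ids[0]), find(parent, other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    return {doc_id: find(parent, doc_id) for doc_id in docs}


docs = {
    1: "the quick brown fox jumps over the lazy dog",
    2: "the quick brown fox jumps over the lazy dog",
    3: "completely unrelated sentence about distributed dataframes",
}
groups = dedupe_groups(docs)
print(groups)
```

The bucketing step is a group-by on the band key and the signature step is a per-row map, so in principle neither needs a dictionary shared between UDFs. How that maps onto Daft operations is left open here.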