If I have a variable that keeps a dictionary that ...
# general
k
If I have a variable that holds a dictionary, and that dictionary is read and updated in multiple different UDFs, where would be the best place to store it?
j
Could you give a better idea of what your overall workflow is? In general, having UDFs communicate via an external variable is quite tricky because it doesn’t work when running in a distributed environment, where each worker process gets its own copy of the variable!
k
@jay I'm trying to run near-deduplication as well. I have a MinHashLSH index that one UDF updates with the MinHash of every row. I then intend to use the index in that external variable to run a query on every row in another UDF, to get the keys of the similar rows. After that, I intend to create another external variable to build a graph of all the similarity edges, and assign groups based on its connected components.
j
I see! Take a look at the attached Python file for some examples of how you could do MinHash-based dedup.
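The attachment isn't reproduced in this log. As a rough stand-in (not the attached file), here is a minimal standard-library sketch of the workflow described above: MinHash signatures, LSH banding to find candidate pairs, and union-find to assign connected-component groups. All names and parameters (`NUM_PERM`, `BANDS`, `shingles`, `minhash`, `dedup_groups`) are hypothetical, and the band settings are assumed values, not tuned ones. It keeps the index and the graph as local state inside one function instead of sharing them across UDFs through an external variable.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # hash functions per signature (assumed value)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS rows


def shingles(text, k=3):
    """Return the set of word k-grams for one document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(text):
    """Signature: for each seed, the minimum 64-bit hash over all shingles."""
    encoded = [s.encode() for s in shingles(text)]
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "little")
            for s in encoded
        ))
    return sig


def dedup_groups(docs):
    """Map each row key to a group id (the smallest key in its component)."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)  # (band index, band values) -> row keys
    for key, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(key)

    # Union-find over row keys; rows sharing any bucket are linked by an edge.
    parent = {key: key for key in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for keys in buckets.values():
        for other in keys[1:]:
            a, b = find(keys[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)  # smaller key becomes the root
    return {key: find(key) for key in docs}


docs = {
    0: "the quick brown fox jumps over the lazy dog",
    1: "the quick brown fox jumps over the lazy dog",
    2: "completely unrelated sentence about distributed query engines",
}
groups = dedup_groups(docs)
```

Here rows 0 and 1 are identical, so they share every bucket and end up in the same group, while row 2 shares no shingles with them. In a distributed setting the bucket step would probably be expressed as a group-by on the band key rather than a shared dictionary, but that part depends on the engine and is not shown here.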
k
Super helpful!!! Thanks!!