@jay I'm trying to run near-deduplication as well. My plan is to keep a MinHashLSH index in an external variable and update it inside a UDF with the MinHash of every row. In a second UDF, I'll query that index for every row to get the keys of its similar rows. After that, I'll create another external variable holding a graph of all the similarity edges, and assign groups based on its connected components.
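
To make the plan concrete, here is a rough single-process sketch of the three steps (build the index, query every row, group by connected components). It uses only the standard library rather than a real MinHashLSH implementation, and every name in it (`NUM_PERM`, `BANDS`, `shingles`, `minhash`, `build_index`, `query`, `connected_components`, `dedup_groups`) is made up for illustration, as are the parameter values:

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS signature rows

def shingles(text, k=3):
    """Word k-grams of the text; short texts become a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def minhash(text):
    """MinHash signature: the minimum salted hash per seed over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big")
            for item in items))
    return tuple(sig)

def build_index(signatures):
    """Step 1: bucket every key by each band of its signature."""
    rows = NUM_PERM // BANDS
    index = defaultdict(set)
    for key, sig in signatures.items():
        for b in range(BANDS):
            index[(b, sig[b * rows:(b + 1) * rows])].add(key)
    return index

def query(index, sig):
    """Step 2: keys that share at least one band bucket with this signature."""
    rows = NUM_PERM // BANDS
    found = set()
    for b in range(BANDS):
        found |= index.get((b, sig[b * rows:(b + 1) * rows]), set())
    return found

def connected_components(keys, edges):
    """Step 3: union-find over similarity edges; group id = smallest key."""
    parent = {k: k for k in keys}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {k: find(k) for k in keys}

def dedup_groups(docs):
    """Run all three steps on a {key: text} mapping; returns {key: group id}."""
    signatures = {k: minhash(t) for k, t in docs.items()}
    index = build_index(signatures)
    edges = [(k, other)
             for k, sig in signatures.items()
             for other in query(index, sig) if other != k]
    return connected_components(docs.keys(), edges)
```

Calling `dedup_groups({0: "some text", 1: "some text", 2: "something unrelated"})` returns a dict that maps each key to a group id, so rows with the same id can be collapsed to one. Note that the sketch keeps the index and the edge list in one process; whether an external variable updated inside a UDF is visible to a second UDF depends on how your framework runs UDFs.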