Boruch Chalk (07/11/2024, 2:16 PM)
[original question not captured in the export]
jay (07/11/2024, 4:07 PM)
1. do a .limit(10_000)
2. then do a .repartition(10)
which should randomly shuffle your data into partitions of roughly 1000 rows each
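A minimal sketch of the pipeline jay describes, assuming a hypothetical Parquet source at data.parquet (any Daft-readable source would work the same way):

```python
import daft

# Hypothetical input path; substitute any Daft-readable source
df = daft.read_parquet("data.parquet")

# Step 1: cap the DataFrame at 10,000 rows
# Step 2: shuffle those rows randomly into 10 partitions
#         (roughly 1,000 rows per partition)
sample = df.limit(10_000).repartition(10)
sample.show()
```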
Boruch Chalk (07/14/2024, 6:11 AM)
into_partitions, right?
jay (07/15/2024, 6:28 PM)
.into_partitions will try to avoid shuffling your data, and performs "in-place" splitting and coalescing instead!
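A small sketch contrasting the two calls, based on jay's description: .into_partitions only splits or coalesces the existing partitions in place, while .repartition moves rows between partitions:

```python
import daft

df = daft.from_pydict({"x": list(range(100))})

# Splits/coalesces existing partitions without moving rows between
# them -- cheap, but preserves whatever ordering the source had
split = df.into_partitions(10)

# Randomly shuffles rows across 10 new partitions -- more expensive,
# but breaks up row locality from the source
shuffled = df.repartition(10)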
Sagi (07/31/2024, 10:54 AM)
With .repartition, the resulting df is sometimes far from being randomly shuffled:

```python
import daft

df_a = daft.from_pydict({"char": ["a"] * 10})
df_b = daft.from_pydict({"char": ["b"] * 10})
df = df_a.concat(df_b)
df.repartition(2).show()
```

Can you shed some light on how .repartition is implemented?
jay / Sagi (07/31/2024 – 08/03/2024)
[several replies in this thread not captured in the export]
Sagi (08/04/2024, 7:24 AM)
```python
import daft
import numpy as np
import torch


class MyDataset(torch.utils.data.IterableDataset):
    df: daft.DataFrame
    ...

    def __iter__(self):
        # Passing num=None keeps the partition count but reshuffles
        # rows randomly across partitions
        df = self.df.repartition(None)
        for partition in df.iter_partitions():
            # Materialize the partition as a list of row dicts and
            # shuffle within the partition before yielding
            samples = partition.to_pylist()
            np.random.shuffle(samples)
            yield from samples
```
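A hedged usage sketch for Sagi's dataset; it assumes the elided "..." above includes an __init__ that stores the DataFrame (e.g. self.df = df), which the snippet does not show:

```python
import daft
from torch.utils.data import DataLoader

df = daft.from_pydict({"char": ["a"] * 10 + ["b"] * 10})
dataset = MyDataset(df)

# batch_size=None disables automatic batching, so samples pass
# through one at a time; each epoch re-runs __iter__, reshuffling
# rows both across and within partitions
loader = DataLoader(dataset, batch_size=None)
for sample in loader:
    print(sample)
```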