# daft-dev
a
I'm trying to move a Dask workload over to Daft and running into an issue reading data from
s3
The following Dask code works and returns the first 5 rows of the dataframe:
import dask.dataframe as dd

ddf = dd.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    storage_options={"anon": True},
)
ddf.head()
When I try to read the same files with Daft, my Jupyter kernel dies. The same thing happens if I run the Daft code before the Dask code.
import daft
from daft.io import IOConfig, S3Config

io_config = IOConfig(s3=S3Config(anonymous=True))

df = daft.read_parquet(
    "s3://dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet",
    io_config=io_config,
)
df.show()
Is there a syntax difference here that I should be aware of? Happy to create a GH issue if this merits further investigation! I'm running Daft 0.2.21
It must be something to do with the path definition or the files themselves, as the following works fine:
import daft
from daft.io import IOConfig, S3Config

io_config = IOConfig(s3=S3Config(anonymous=True))

df = daft.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
    io_config=io_config,
)
df.show()
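Since a hard crash takes down the whole kernel, one way to narrow this down is to try each matched file in a throwaway subprocess, so the parent Python process survives whichever file triggers the abort. A minimal sketch, assuming s3fs is available for anonymous listing; the helper names here are hypothetical, not part of Daft:

```python
import subprocess
import sys


def run_isolated(snippet: str) -> bool:
    """Run a Python snippet in a child process; True if it exits cleanly.

    A hard abort (e.g. a Rust panic) kills only the child, not us.
    """
    proc = subprocess.run([sys.executable, "-c", snippet], capture_output=True)
    return proc.returncode == 0


def find_bad_files(uris):
    """Return the S3 URIs whose read crashes (hypothetical helper)."""
    template = (
        "import daft; from daft.io import IOConfig, S3Config; "
        "cfg = IOConfig(s3=S3Config(anonymous=True)); "
        "daft.read_parquet({uri!r}, io_config=cfg).show()"
    )
    return [uri for uri in uris if not run_isolated(template.format(uri=uri))]


# Example usage (needs s3fs installed; left commented out here):
# import s3fs
# fs = s3fs.S3FileSystem(anon=True)
# uris = ["s3://" + p for p in fs.glob("dask-data/nyc-taxi/nyc-2015.parquet/part.*.parquet")]
# print(find_bad_files(uris))
```

This only tells you which file trips the panic, not why, but it gives a single file to attach to a GH issue.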
j
Hmm there aren’t any error logs or anything to go off of?
a
let me take a look in the Jupyter logs and I’ll get back to you
👋 1
j
Also, what machine are you running this on? We should take a look at the explain plan too. I’ll try to reproduce on my end as well
a
Running on M1 mac
j
Woof….
ScanWithTask-LocalLimit [Stage:2]:   0%|          | 0/1 [00:00<?, ?it/s]
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `8`,
 right: `0`', /Users/runner/.cargo/git/checkouts/arrow2-4f48cbcad4539e8b/c0764b0/src/io/parquet/read/deserialize/primitive/basic.rs:53:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Rayon: detected unexpected panic; aborting
[1]    51086 abort      ipython
/Users/jaychia/.pyenv/versions/3.10.8/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown             
  warnings.warn('resource_tracker: There appear to be %d '
With RUST_BACKTRACE=1 set:
ScanWithTask-LocalLimit [Stage:2]:   0%|          | 0/1 [00:00<?, ?it/s]
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `8`,
 right: `0`', /Users/runner/.cargo/git/checkouts/arrow2-4f48cbcad4539e8b/c0764b0/src/io/parquet/read/deserialize/primitive/basic.rs:53:9
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
   4: arrow2::io::parquet::read::deserialize::utils::next
   5: <arrow2::io::parquet::read::deserialize::primitive::basic::Iter<T,I,P,F> as core::iter::traits::iterator::Iterator>::next
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
   7: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
   8: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
   9: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  10: rayon_core::registry::WorkerThread::wait_until_cold
Looks like an error that’s happening deep inside the arrow2 deserialization code for Parquet. Taking a further look now
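A quick cross-check while digging: reading the same object with PyArrow's Parquet reader, which is an entirely separate implementation from arrow2, would tell us whether the file itself is malformed or whether this is an arrow2 deserialization bug. A sketch, assuming pyarrow and s3fs are installed; the exact part filename in the usage comment is a guess at one of the matched files:

```python
def read_with_pyarrow(key: str):
    """Read one suspect S3 object anonymously with PyArrow's reader.

    If this succeeds where Daft panics, the bug is more likely in
    arrow2's Parquet deserialization than in the files themselves.
    """
    import pyarrow.parquet as pq
    import s3fs  # anonymous access, mirroring S3Config(anonymous=True)

    fs = s3fs.S3FileSystem(anon=True)
    return pq.read_table(key, filesystem=fs)


# Hypothetical usage -- the part number is a guess, not from the logs:
# table = read_with_pyarrow("dask-data/nyc-taxi/nyc-2015.parquet/part.0.parquet")
# print(table.num_rows)
```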
👍 1