# general
  • g

    Garrett Weaver

    09/26/2025, 4:58 AM
    when running a UDF with the native runner and `use_process=True`, everything works fine locally (mac), but I'm seeing `/usr/bin/bash: line 1: 58 Bus error (core dumped)` when run on k8s (argo workflows). any thoughts?
    c
    • 2
    • 3
  • g

    Garrett Weaver

    09/26/2025, 5:11 PM
    I am using `explode`, with the average result being 1 row --> 12 rows and the max 1 row --> 366 rows (~5M rows --> ~66M rows). Seeing decently high memory usage during the explode even with a repartition prior to the explode. Is the only remedy more partitions and/or a reduced number of CPUs to reduce parallelism?
    c
    k
    • 3
    • 10
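    A minimal sketch of the pattern discussed above, splitting into more partitions before the explode so each partition's exploded output stays small (the path, column name, and partition count are illustrative):

    ```python
    import daft
    from daft import col

    df = daft.read_parquet("events.parquet")  # hypothetical input

    # Split into more, smaller partitions before exploding so each
    # partition's ~12x (worst case ~366x) row growth stays manageable.
    df = df.into_partitions(256)

    # Each list element becomes its own row.
    df = df.explode(col("items"))
    ```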
  • n

    Nathan Cai

    09/29/2025, 1:00 PM
    Hi there, just a quick question, what value does Daft provide compared to me just using multi-threaded programming? In which use cases does Daft go above and beyond just multi-threading?
    c
    k
    • 3
    • 3
  • r

    Robert Howell

    09/30/2025, 4:39 PM
    👋 Looking to get started with Daft? We’ve curated some approachable Good First Issues which the team is happy to help you with!
    New Expressions: these are fun because you'll get to introduce entirely new capabilities.
    • Add `Expression.var`
    • Add `Expression.pow`
    • Add `Expression.product`
    • Support `ddof` argument for `stddev`
    • Support simple case and searched case with a list of branches
    • UUID function <-- very easy and very 🆒
    Enhancements: these are interesting because you'll get to learn from the existing functionality and extend current capabilities.
    • Support `Series[start:end]` slicing
    • Support image hash functions for deduplication
    • Equality on structured types <-- this would be a great addition
    • Hash rows of dataframes
    • Custom CSV delimiters <-- also a great addition!!
    • Custom CSV quote characters
    • Custom CSV date/time formatting
    Documentation: we are ALWAYS interested in documentation PRs that improve docs.daft.ai; here's a mega-list you could work through with coding agents to curate something nice!
    • https://github.com/Eventual-Inc/Daft/issues/4125
    Full List
    • https://github.com/Eventual-Inc/Daft/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22good%20first%20issue%22&page=1
    ❤️ 6
    🙌 1
  • d

    Dan Reverri

    09/30/2025, 9:50 PM
    I’m trying to read a large csv file and want to iterate over arrow record batches but daft is reading the whole file before starting to yield batches. Are there any options to process the csv file in chunks?
    d
    c
    • 3
    • 19
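    For reference, a minimal sketch of one workaround while the question above is sorted out: stream the CSV in fixed-size blocks with PyArrow's streaming reader (this is plain PyArrow, not a documented Daft option; the path, block size, and process() are placeholders):

    ```python
    import pyarrow.csv as pacsv

    # Stream the file in ~64 MiB blocks instead of materializing it all at once.
    reader = pacsv.open_csv(
        "big_file.csv",
        read_options=pacsv.ReadOptions(block_size=64 << 20),
    )

    for batch in reader:      # yields pyarrow.RecordBatch objects
        process(batch)        # placeholder for the per-batch work
    ```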
  • c

    ChanChan Mao

    10/03/2025, 9:35 PM
    If you're new to Daft and our community, I highly recommend reading this blog post by @Euan Lim giving an overview of Daft! The technical concepts were explained very well and were easily digestible. Thanks for this writeup!
    🙌 6
    e
    • 2
    • 1
  • c

    Colin Ho

    10/15/2025, 4:14 PM
    Hey everyone! We just pushed some updates to the Daft docs on optimization and debugging — especially around memory and partitioning.
    • Managing memory: tips on how to tune Daft to lower memory usage and avoid OOMs.
    • Optimizing partitions / batches: details on how Daft parallelizes data across cores / machines, and how to control it.
    Do check them out if you face any difficulties like high memory usage or low CPU utilization within your workloads!
    daft party 4
    s
    • 2
    • 1
  • f

    fr1ll

    10/15/2025, 11:36 PM
    Hi there, an API question for you. I’m working on a pipeline to generate input data for Apple’s new `embedding-atlas` library. At the end, I need to run UMAP against the full set of embeddings. I currently resort to the clunky approach below. Am I missing a better way to add a column from a numpy array of the same length as the dataframe?
    ```python
    import numpy as np
    import pyarrow as pa
    import daft

    vecs = np.stack([r["embedding"] for r in df.select("embedding").to_pylist()], axis=0)

    # little helper wrapping umap
    xy, knn_indices, knn_distances = umap_with_neighbors(vecs).values()

    df = df.with_column("_row_index", daft.sql_expr("row_number()") - 1)

    umap_cols = daft.from_pydict({
        "_row_index": list(range(xy.shape[0])),
        "umap_x": daft.Series.from_arrow(pa.array(xy[:, 0])),
        "umap_y": daft.Series.from_arrow(pa.array(xy[:, 1])),
    })

    df = df.join(umap_cols, on="_row_index")
    ```
    m
    • 2
    • 5
  • n

    Nathan Cai

    10/17/2025, 4:00 PM
    Hi there, I'm making a contribution for https://github.com/Eventual-Inc/Daft/issues/3786. I think I've done it, but I was wondering if someone can help me understand how testing works. I already see function signatures in tests/connect/test_io.py:
    ```python
    @pytest.mark.skip(reason="https://github.com/Eventual-Inc/Daft/issues/3786")
    def test_write_csv_with_delimiter(make_df, make_spark_df, spark_session, tmp_path):
        pass
    ```
    But the thing is, I don't know what these arguments mean, and I don't see that function being invoked anywhere else, so I don't know where it's being provided those arguments. The other thing is, its function signature is different from the ones above:
    ```python
    def test_csv_basic_roundtrip(make_spark_df, assert_spark_equals, spark_session, tmp_path):
    ```
    And I think I might need the `assert_spark_equals` function in my scenario; I was wondering if someone is willing to guide me in the right direction.
    r
    m
    • 3
    • 16
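    For context on the question above: those arguments are pytest fixtures, injected by name from conftest.py files (and from pytest built-ins such as tmp_path), which is why the test is never called explicitly anywhere. A minimal illustration of the mechanism, using made-up fixture names rather than Daft's actual ones:

    ```python
    # conftest.py: fixtures are discovered by pytest and injected by argument name
    import pytest

    @pytest.fixture
    def make_numbers():
        # Returns a factory the test can call; fixtures like make_spark_df work similarly.
        def _make(n):
            return list(range(n))
        return _make

    # test_example.py
    def test_sum(make_numbers, tmp_path):
        # pytest supplies make_numbers (from conftest.py) and tmp_path (built-in) automatically.
        data = make_numbers(5)
        (tmp_path / "out.txt").write_text(str(sum(data)))
        assert sum(data) == 10
    ```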
  • b

    Ben Cornelis

    10/17/2025, 5:37 PM
    Hello! I have a question about flotilla runner memory usage. I'm running a daft query against a ray cluster running on k8s (azure). I'm attempting to read a large parquet table (~1.8T, 50k files) from azure blob store, performing a group by with a count aggregation, and then writing the resulting dataframe to another directory. The script and query look like:
    ```python
    import daft
    from daft import col

    df = daft.read_parquet(
        "abfs://my_container@my_service_account.dfs.core.windows.net/in_table_path/"
    )
    df = df.groupby("group_col").agg(col("count_col").count())
    df.write_parquet("abfs://my_container@my_service_account.dfs.core.windows.net/out_table_path/")
    ```
    (it's worth noting that I have daft pinned to 0.5.19 since I currently get an error using `write_parquet` to azure blob store on later versions. It is the same issue as here: https://github.com/Eventual-Inc/Daft/issues/5336) When I submit a job running this script to my ray cluster, it eventually fails with this error:
    ```
    ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
    Memory on the node (IP: 10.224.2.36, ID: 637c809d2d9dfed8dd5c109e7ff9b81e4c10a024322086029a8f6def) where the task (actor ID: 53ff539b6349c1bee7efc01802000000, name=flotilla-plan-runner:RemoteFlotillaRunner.__init__, pid=1631, memory used=24.82GB) was running was 30.46GB / 32.00GB (0.951874), which exceeds the memory usage threshold of 0.95.
    ```
    Is it expected that the flotilla runner on the head node consumes this much memory?
    c
    • 2
    • 12
  • b

    Ben Cornelis

    10/17/2025, 9:34 PM
    I'm getting a connection reset error periodically reading parquet from ADLS:
    ```
    daft.exceptions.DaftCoreException: DaftError::External Cached error: Unable to open file abfs://path/to/my.parquet: Error { context: Full(Custom { kind: Io, error: reqwest::Error { kind: Decode, source: hyper::Error(Body, Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" }) } }, "error converting `reqwest` request into a byte stream") }
    ```
    With the 50k-file job I mentioned above, it's happening often enough that the job always fails. Has anyone ever seen this?
    k
    • 2
    • 35
  • b

    Ben Cornelis

    10/20/2025, 8:13 PM
    Is it possible to pass an ADLS uri containing a storage account name instead of passing env / config variables? I'm getting this error: `Azure Storage Account not set and is required. Set either AzureConfig.storage_account or the AZURE_STORAGE_ACCOUNT environment variable.` But I also see code attempting to parse it from the uri: https://github.com/Eventual-Inc/Daft/blob/main/src/daft-io/src/azure_blob.rs#L116. I'm not sure why that's not working. I'm using an abfs uri of the form `abfs://container@account_name.dfs.core.windows.net/path/`
    k
    • 2
    • 6
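    A minimal sketch of the explicit-config route named in that error, passing the storage account through an IOConfig rather than relying on URI parsing (account, container, and path are placeholders, and credentials are omitted):

    ```python
    import daft
    from daft.io import AzureConfig, IOConfig

    io_config = IOConfig(
        azure=AzureConfig(
            storage_account="account_name",  # placeholder storage account
        )
    )

    df = daft.read_parquet(
        "abfs://container@account_name.dfs.core.windows.net/path/",
        io_config=io_config,
    )
    ```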
  • n

    Nathan Cai

    10/21/2025, 9:03 PM
    By the way, does anyone actually run `make test` on their own devices? It burns a hole in my RAM when I run it on my laptop
    d
    m
    s
    • 4
    • 3
  • v

    VOID 001

    10/23/2025, 7:40 AM
    Hi, is there any native datasource support for daft reading from hive?
    k
    • 2
    • 1
  • b

    Ben Cornelis

    10/24/2025, 7:40 PM
    I'm running a group by query on the 1.8TB dataset I mentioned above. The ray job is failing because some RaySwordfishActors are running out of memory, either during the repartition or write parquet step (it looks like they're interleaved, so it's hard to tell). The log I'm seeing is:
    (raylet) node_manager.cc:3193: 1 Workers (tasks / actors) killed due to memory pressure (OOM)
    I've tried various k8s pod sizes - currently I have 80 worker pods with 8 CPU / 32 GB. Does anyone know why so much memory would be used, or how to tune / think about queries like this?
    ✅ 1
    d
    • 2
    • 14
  • g

    Garrett Weaver

    10/24/2025, 8:17 PM
    Is `parquet_inflation_factor` still respected in flotilla? I have a parquet file that is ~2GB but expands to ~16GB in memory. When trying to run a select with a window function, it is trying to load all the data into memory and struggling (OOMing). What is the best course of action here?
    ✅ 1
    m
    c
    • 3
    • 8
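    A minimal sketch of the knob being asked about, assuming it is still exposed via daft.set_execution_config (whether the Flotilla runner honors it is exactly the open question; the value and path below are illustrative):

    ```python
    import daft

    # Hint that parquet data expands roughly 8x when decoded into memory,
    # so the planner schedules smaller scan tasks.
    daft.set_execution_config(parquet_inflation_factor=8.0)

    df = daft.read_parquet("data.parquet")  # hypothetical input
    ```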
  • g

    Garrett Weaver

    10/24/2025, 9:35 PM
    another question: in the latest version, I am seeing an arrow schema mismatch on write, with some new metadata in the schema; it works fine with the pyarrow writer (`native_parquet_writer=False`)
    ✅ 1
    d
    • 2
    • 7
  • t

    takumi hayase

    10/28/2025, 12:39 PM
    I am looking for a new position as a full-stack developer now.
  • b

    Ben Cornelis

    10/28/2025, 3:40 PM
    I'm seeing lower-than-expected CPU utilization of 30-40% (I originally posted 5-20% but misread it) on my k8s pods running a query - any tips on debugging or optimizing this?
    j
    • 2
    • 7
  • v

    VOID 001

    10/29/2025, 1:33 PM
    I've found that daft is now moving to a new stateful UDF implementation: daft.cls. For developers who are already using daft.udf, does the community suggest we switch to daft.cls now (for production-related workloads)? And what functionality does daft.cls not yet support? Also, what's the cost of migrating from daft.udf to daft.cls? IMO users only need to change the decorator name plus the function call arguments. Is there anything else I need to be concerned about? Thanks!
    c
    e
    • 3
    • 8
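    For readers comparing the two styles, a rough before/after sketch of the kind of change being discussed, based on the row-wise daft.func API; exact decorator options, and the daft.cls equivalent for stateful UDFs, should be checked against the migration guide:

    ```python
    import daft
    from daft import col

    # Legacy batch UDF: each column argument arrives as a daft.Series.
    @daft.udf(return_dtype=daft.DataType.string())
    def shout_legacy(texts):
        return [t.upper() for t in texts.to_pylist()]

    # New-style function UDF: a plain Python function with type hints,
    # applied row-wise via daft.func.
    @daft.func
    def shout(text: str) -> str:
        return text.upper()

    df = daft.from_pydict({"text": ["hello", "world"]})
    df = df.with_column("shouted", shout(col("text")))
    ```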
  • n

    Nathan Cai

    10/30/2025, 10:30 PM
    Hi there, I was planning on working on this issue of adding a custom delimiter to the `write_csv` option (https://github.com/Eventual-Inc/Daft/issues/3786), but I realized that since I'm using one field of PyArrow's csv WriteOptions (https://arrow.apache.org/docs/python/generated/pyarrow.csv.WriteOptions.html#pyarrow.csv.WriteOptions), I might as well implement support for all the fields, as it shouldn't be too difficult. This is the original PyArrow class:
    ```python
    @dataclass(kw_only=True)
    class WriteOptions(lib._Weakrefable):
        include_header: bool = field(default=True, kw_only=False)
        batch_size: int = 1024
        delimiter: str = ","
        quoting_style: Literal["needed", "all_valid", "none"] = "needed"

        def validate(self) -> None: ...
    ```
    I want to implement the corresponding class in Rust like this:
    ```rust
    #[derive(Debug, Clone)]
    pub struct CSVWriteOptions {
        pub include_header: bool,
        pub batch_size: usize,
        pub delimiter: char,
        pub quoting_style: CSVQuotingStyle,
    }

    #[derive(Debug, Clone)]
    pub enum CSVQuotingStyle {
        Needed,
        AllValid,
        None,
    }
    ```
    However, there are a couple of obstacles. The first is that Rust doesn't support default field values, so I guess I create a special constructor for the defaults? But the main problem is the `quoting_style` field: it accepts only specific kinds of strings. Normally this is fine, but when that Rust struct gets passed to Daft's Python CSV writer class, which uses those options with PyArrow, PyArrow accepts a string for `quoting_style`, not an enum. What is the cleanest approach to this? As far as I'm aware you can't restrict a String to specific values in Rust; you have to use an enum. Should I just create two structs: the one above, and a new one that's the same but accepts `quoting_style` as a String, with the original `CSVWriteOptions` having a function that converts it into the new struct by turning the `quoting_style` into a string? How does Daft deal with interoperability between Python string literals and Rust enums? I feel like this is something you have encountered before, and I want to know your standard way of doing this.
    d
    • 2
    • 4
  • m

    maxidl

    11/03/2025, 7:17 PM
    I got a question regarding observed disk read speed. For the operation shown in the screenshot, the swordfish progress shows 707 MiB read, which is very much on point. However, in htop I always see >1.5G/s DISK READ while this operation is going on, which does not really make sense to me. The input data (all 150M rows) is about 4.5GB of compressed parquet (the column I am reading). So seeing read speeds of >1.5G/s for a minute does not really make sense. Does anyone know what leads to such a high read speed being shown in htop?
    c
    • 2
    • 2
  • e

    Everett Kleven

    11/04/2025, 7:13 PM
    Big Deal for Daft: As mentioned in #C08424P0A8K, we cut the 0.6.9 release today. This was a quick update to get @Kevin Wang’s new experimental vLLM provider merged for the Ray Summit:
    > This new vLLM Prefix Caching provider is able to accomplish this by combining the power of the vLLM serving engine with Daft’s distributed execution Flotilla to do two things:
    > • Dynamic Prefix Bucketing: improving LLM cache usage by bucketing and routing by prompt prefix.
    > • Streaming-Based Continuous Batching: pipelining data processing with LLM inference to fully utilize GPUs.
    > Combined, these two strategies yield significant performance improvements and cost savings that scale to massive workloads. We observe that on a cluster of 128 GPUs (Nvidia L4), we are able to complete an inference workload of 200k prompts totaling 128 million tokens up to 50.7% faster.
    If you are interested in learning more, I highly recommend you read Kevin's blog post. It's pretty awesome: https://www.daft.ai/blog/cutting-llm-batch-inference-time-in-half-dynamic-prefix-bucketing-at-scale
    ❤️ 6
    🚀 4
    🔥 6
  • o

    Oliver

    11/04/2025, 11:08 PM
    Hi, I have a question. I'm running all these DBT (data build tool) jobs using Apache Spark as the compute framework. Is there any way I can run the compute using the Daft framework while still building on DBT?
    c
    • 2
    • 5
  • e

    Everett Kleven

    11/06/2025, 12:24 AM
    This was mentioned in #C08424P0A8K, but I wanted to quickly draw the community's attention to a few important points from today's 0.6.10 release:
    1. This was a rapid-fire release to fix a bug in `embed_text` and `embed_image` introduced in 0.6.8.
    2. Please note that as of this release, Daft no longer actively maintains support for Python version 3.9.
    3. Additional highlights include documentation updates for Daft's new `daft.func` and `daft.cls` UDFs, which now both support async and batch capabilities. If you are still using `daft.udf`, I'd highly recommend taking a look at our comparison guide to learn how to migrate from legacy UDFs.
    4. Finally, a major upgrade for the already powerful `prompt` function, with multi-image and file input support for the OpenAI provider.
    daft party 3
    ❤️ 2
  • v

    VOID 001

    11/06/2025, 2:05 PM
    I have a question about set_runner_ray(): why do we need the `noop_if_initialized=False` option? Since initializing the ray runner will always return the same OnceLock<Runner> to the caller, why can't we just change the default value to True? Or even remove this option?
    k
    • 2
    • 3
  • g

    Garrett Weaver

    11/06/2025, 5:36 PM
    👋 is it an expected breaking change that `is_in` no longer works with sets and requires lists?
    👀 1
    c
    • 2
    • 3
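    A tiny sketch of the behavior change being asked about and the obvious workaround (column name and values are illustrative; whether set support was dropped intentionally is the open question):

    ```python
    import daft
    from daft import col

    df = daft.from_pydict({"x": [1, 2, 3, 4]})

    allowed = {2, 4}  # a set, as described above

    # Reported to fail on newer versions when a set is passed directly;
    # converting to a list first sidesteps the issue.
    df = df.where(col("x").is_in(list(allowed)))
    ```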
  • g

    Giridhar Pathak

    11/07/2025, 4:33 PM
    hey!! it's awesome that openai has been added as a provider for the llm_generate expression!! is there a way to configure it so that we can use a proxy url for openai?
    r
    j
    • 3
    • 9
  • n

    Nathan Cai

    11/13/2025, 5:38 PM
    Hi there, I've been trying to contribute to this issue: https://github.com/Eventual-Inc/Daft/issues/3786. But it turns out that when it comes to implementing it on the Rust side, things get very tricky, namely because the options passed to the logical plan builder's `write_tabular` aren't known to the executor, `connect`, as that function takes no arguments. Because of how complex the code is, it feels like I'm doing surgery, so I was wondering if any of the contributors at Eventual could hop on a quick Google Meet call with me to figure it out?
  • a

    Adrian Lisko

    11/14/2025, 2:00 PM
    👋 Hi everyone! I am trying to figure out if it's possible to run an ML inference pipeline on dataframe batches using all columns, e.g. read data -> run inference (vectorized predict(row)) -> store predictions. I was hoping to use Python batch UDFs, but those seem to work only with df series. Is there another way that I can use?
    c
    • 2
    • 3
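    A minimal sketch of one way to approach the pattern described above: a batch UDF can take several column arguments at once (each arriving as a daft.Series), so the batch can be reassembled into rows for a vectorized predict. The column names and the toy predict logic are placeholders:

    ```python
    import daft
    from daft import col

    @daft.udf(return_dtype=daft.DataType.float64())
    def predict_udf(feature_a: daft.Series, feature_b: daft.Series):
        # Each argument is one column of the current batch; zip them back into rows.
        rows = list(zip(feature_a.to_pylist(), feature_b.to_pylist()))
        # Placeholder for a real vectorized model.predict over the batch.
        return [a + b for a, b in rows]

    df = daft.from_pydict({"feature_a": [1.0, 2.0], "feature_b": [3.0, 4.0]})
    df = df.with_column("prediction", predict_udf(col("feature_a"), col("feature_b")))
    df.show()
    ```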