POC: Early Exit 2 - sdf.collect(...).to_pandas() #729

gwaramadze · 2025-01-29T14:21:19Z

Introduces the ability to collect a small batch of data and inspect it interactively in a Python interpreter or a tool like Jupyter Notebook.

Like POC 1, this makes use of ExitManager, an internal class that stops the Application once a certain number of messages has been processed or a timeout (in seconds) is reached.
Adds the StreamingDataFrame.collect method:
- Accepts number and timeout parameters and configures ExitManager.
- Ensures that a single update operation is appended to the sdf. This operation collects processed messages into a list. It is appended only on the first call, not on consecutive sdf.collect calls.
- Triggers Application.run and Application.reset.
- The collect name is temporary. I don't like the clash with the windowed aggregation function named collect.
Adds to_pandas and to_polars methods, inspired by to_topic. These methods return a Pandas or Polars DataFrame, not our StreamingDataFrame.
Note: Application reset does not reset the consumer group. This means that consecutive runs will consume new messages (according to the committed offset) instead of reprocessing the stream.
Note: There is no need for the user to call app.run().
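
For readers who want the mechanics spelled out, here is a minimal, self-contained sketch of the behaviour described above. It is not the code from this PR: ExitCondition and ToyFrame are hypothetical stand-ins for ExitManager and StreamingDataFrame, a plain Python iterable replaces the Kafka consumer, and clearing the list on each call is an assumption about how the reset behaves.

```python
import time
from typing import Any, Callable, Iterable, List, Optional


class ExitCondition:
    """Hypothetical stand-in for ExitManager (not the real class)."""

    def __init__(
        self,
        number: Optional[int] = None,
        timeout: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.number = number
        self.timeout = timeout
        self._clock = clock
        self._seen = 0
        self._started = clock()

    def record_message(self) -> None:
        self._seen += 1

    def should_stop(self) -> bool:
        # Stop when either limit is hit; a limit of None is ignored.
        if self.number is not None and self._seen >= self.number:
            return True
        if self.timeout is not None and self._clock() - self._started >= self.timeout:
            return True
        return False


class ToyFrame:
    """Hypothetical stand-in for StreamingDataFrame."""

    def __init__(self, source: Iterable[Any]) -> None:
        self._source = iter(source)  # plays the role of the consumer
        self._operations: List[Callable[[Any], Any]] = []
        self._collected: List[Any] = []
        self._collector_added = False

    def collect(
        self, number: Optional[int] = None, timeout: Optional[float] = None
    ) -> List[Any]:
        # Append the collecting operation only on the first call.
        if not self._collector_added:
            self._operations.append(self._collected.append)
            self._collector_added = True
        # Assumption: each call starts from an empty list.
        self._collected.clear()
        exit_condition = ExitCondition(number=number, timeout=timeout)
        # Stands in for Application.run: process until a limit is reached.
        for message in self._source:
            for operation in self._operations:
                operation(message)
            exit_condition.record_message()
            if exit_condition.should_stop():
                break
        return list(self._collected)
```

With `frame = ToyFrame(range(10))`, `frame.collect(number=2)` returns `[0, 1]` and a following `frame.collect(number=3)` returns `[2, 3, 4]`: the source keeps its position between calls, which mirrors the note about the consumer group not being reset.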

Example:

from uuid import uuid4
from quixstreams import Application

app = Application(
    broker_address="localhost:19092",
    auto_offset_reset="earliest",
    consumer_group=str(uuid4()),
    use_changelog_topics=False,
)

topic = app.topic("1000000-numbers-100-keys")
sdf = app.dataframe(topic=topic)
# Run the application until 2 messages are processed (no app.run() needed)
sdf.collect(number=2)

# Inspect
sdf.to_pandas()
sdf.to_polars()

# Collect new data: run for up to 3 seconds, continuing from the committed offset
sdf.collect(timeout=3)

# Inspect new data
sdf.to_pandas()
sdf.to_polars()


daniil-quix · 2025-01-30T12:26:28Z

I like the to_pandas() idea: it looks neat and returns a usable DataFrame (there is no need to initialize it upfront).
We could also fold the .collect() call into to_pandas() to simplify the API.
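
A rough sketch of what that simplification might look like, assuming the collected records are dictionaries. LazyCollectFrame and its _collect helper are hypothetical names used only for illustration, the source iterable stands in for the consumer, and the timeout parameter is left out for brevity. The only real library call is pandas.DataFrame built from a list of dicts.

```python
from typing import Any, Dict, Iterable, List, Optional


class LazyCollectFrame:
    """Hypothetical wrapper where to_pandas() performs the collection itself."""

    def __init__(self, source: Iterable[Dict[str, Any]]) -> None:
        self._source = iter(source)  # stands in for the consumer
        self._rows: List[Dict[str, Any]] = []

    def _collect(self, number: Optional[int]) -> None:
        # Gather up to `number` records; None means "until the source ends".
        self._rows = []
        for row in self._source:
            self._rows.append(row)
            if number is not None and len(self._rows) >= number:
                break

    def to_pandas(self, number: Optional[int] = None):
        # Imported lazily so pandas stays an optional dependency, and
        # imported first so no records are consumed if it is missing.
        import pandas as pd

        self._collect(number)
        return pd.DataFrame(self._rows)
```

The caller would then write a single call such as `frame.to_pandas(number=2)` instead of collecting first and converting afterwards. The trade-off is that every conversion triggers a new run, so inspecting the same batch as both Pandas and Polars would need the rows to be cached.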

gwaramadze added 7 commits January 29, 2025 16:58

POC: inspect

61c843c

Revert "POC: inspect"

2b58282

This reverts commit c68503e.

Create ExitManager

4fcd9d3

ExitManager is responsible for stopping the application once a certain number of messages arrives or a timeout is reached.

Create LocalSink

d5eafca

Revert "Create LocalSink"

b642fde

This reverts commit f54f601.

Introduce sdf.collect(...).to_pandas() notation

7fd9c3e

Move Application run to collect

91aa9a3

gwaramadze force-pushed the feature/early-exit-poc-2 branch from 78a8fd7 to 91aa9a3 · January 29, 2025 15:58

daniil-quix closed this Feb 4, 2025
