POC: Early Exit 2 - sdf.collect(...).to_pandas() #729
Closed
Introduces the ability to collect a small batch of data and inspect it interactively in a Python interpreter or a tool like Jupyter Notebook.

- `ExitManager`, an internal class that facilitates stopping the Application once a certain number of messages has been processed or a timeout in seconds is reached.
- `StreamingDataFrame.collect` method:
  - Accepts `number` and `timeout` parameters and configures `ExitManager`.
  - An `update` operation is appended to the sdf. This operation collects processed messages into a list. This step does not happen again on consecutive `sdf.collect` calls.
  - Calls `Application.run` and `Application.reset`.
  - The `collect` name is temporary. I don't like the clash with the windowed aggregation function named `collect`.
- `to_pandas` and `to_polars` methods, inspired by `to_topic`. These methods return either a Pandas or a Polars dataframe, not our StreamingDataFrame.
- `reset` will not reset the consumer group. This means that consecutive runs will consume new messages (according to the committed offset) instead of reprocessing the stream.
- `app.run()`
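
The stop condition described for `ExitManager` can be illustrated with a small standalone sketch. This is not the class from this PR: the method names (`reset`, `message_processed`, `should_exit`) and the `run` helper are hypothetical, and the only thing taken from the description is the idea of stopping after `number` messages or `timeout` seconds, whichever comes first.

```python
import time
from typing import Iterable, List, Optional


class ExitManager:
    """Hypothetical stand-in: decides when a run should stop."""

    def __init__(self, number: Optional[int] = None, timeout: Optional[float] = None):
        self.number = number    # stop after this many messages (None = no limit)
        self.timeout = timeout  # stop after this many seconds (None = no limit)
        self.reset()

    def reset(self) -> None:
        # Start a fresh run: zero the counter and restart the clock.
        self._processed = 0
        self._started_at = time.monotonic()

    def message_processed(self) -> None:
        self._processed += 1

    def should_exit(self) -> bool:
        if self.number is not None and self._processed >= self.number:
            return True
        if self.timeout is not None:
            return time.monotonic() - self._started_at >= self.timeout
        return False


def run(messages: Iterable[int], exit_manager: ExitManager) -> List[int]:
    """Process messages until the exit manager asks to stop."""
    exit_manager.reset()
    processed = []
    for message in messages:
        if exit_manager.should_exit():
            break
        processed.append(message)
        exit_manager.message_processed()
    return processed
```

With `number=3` the loop stops after three messages. With `timeout=0` it stops before processing anything, because the elapsed time is already at least zero when the first check runs.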
Example:
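
A minimal, self-contained sketch of the flow described above, not the PR's actual code. `FakeApplication` and `FakeStreamingDataFrame` are hypothetical stand-ins that read from an in-memory list instead of Kafka, and the `timeout` parameter is left out for brevity. The sketch only illustrates two points from the description: the collecting `update` step is appended once, and `reset` leaves the committed offset alone, so a second `collect` call sees new messages.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class FakeApplication:
    """Hypothetical stand-in for the Application, reading from a list."""

    def __init__(self, topic: List[Record]):
        self.topic = topic
        self.committed_offset = 0  # survives reset(), like a consumer group offset

    def run(self, handler: Callable[[Record], None], number: int) -> None:
        # Consume at most `number` messages, starting at the committed offset.
        start = self.committed_offset
        batch = self.topic[start:start + number]
        for record in batch:
            handler(record)
        self.committed_offset += len(batch)

    def reset(self) -> None:
        # Deliberately leaves committed_offset untouched.
        pass


class FakeStreamingDataFrame:
    """Hypothetical stand-in for StreamingDataFrame with a collect() method."""

    def __init__(self, app: FakeApplication):
        self.app = app
        self.operations: List[Callable[[Record], None]] = []
        self._collected: List[Record] = []
        self._collector_added = False

    def update(self, func: Callable[[Record], None]) -> "FakeStreamingDataFrame":
        self.operations.append(func)
        return self

    def collect(self, number: int) -> List[Record]:
        # Append the collecting step only on the first call.
        if not self._collector_added:
            self.update(self._collected.append)
            self._collector_added = True
        self._collected.clear()
        self.app.run(self._process, number)
        self.app.reset()
        return list(self._collected)

    def _process(self, record: Record) -> None:
        for operation in self.operations:
            operation(record)


topic = [{"id": i, "value": i * 10} for i in range(5)]
sdf = FakeStreamingDataFrame(FakeApplication(topic))
first = sdf.collect(number=2)   # records with id 0 and 1
second = sdf.collect(number=2)  # records with id 2 and 3, not a replay
```

Each call returns a plain list of dicts here. In the PR the collected batch would then go through `to_pandas` or `to_polars`; outside the PR, a list like this can be passed directly to `pandas.DataFrame(...)`.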