
How do we do versioning? #7

Open
caro401 opened this issue Nov 9, 2023 · 14 comments
Labels
discussion open questions

Comments

@caro401
Collaborator

caro401 commented Nov 9, 2023

Summary from https://github.com/DHARPA-Project/kiara-website/blob/main/concepts/versioning.md

  • how does versioning work for kiara core, how do I know if I can/should update

    • At the moment, while kiara is in alpha (0.x.y), assume that plugins at version 0.X will work (only) with kiara core version 0.X - i.e. if you are running kiara core 0.5.x, all your plugins also need to be 0.5.x (see the pip sketch after this list)
  • how does versioning work for plugins? how do I check what's changed between versions, how do I know if I can/should update

    • at the moment, there is no way to know unless you read the git diff or wrote the plugin yourself
  • is there also versioning of data types, operations, modules? Is this independent of plugins? How do I know what version I'm using, if I can/should update, and what Python package I need to update to do that

    • currently no, but at some point it was mentioned this might be useful?
  • are versions of everything visible in your data lineage? currently no? should they be?

  • are/how are versions represented when you have a pipeline or workflow file?

  • what does all this mean for plugin authors and how they should version their things? Are there utilities in the template for bumping a version, writing a changelog etc?
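
As a minimal sketch of that compatibility rule (the plugin package name is just an example, and `~=` is pip's compatible-release specifier, i.e. >=0.5.0, <0.6.0):

```shell
pip install "kiara~=0.5.0" "kiara_plugin.network_analysis~=0.5.0"
```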

At the moment, almost every piece of existing code/jupyter notebook/streamlit app I can find doesn't work. What can we do to make this less of a problem? Encouraging users to pin dependencies is one thing, but we also need to think about when things like names of operations need to be changed.

This seems to be a pain point in the team: a decent chunk of time is spent working out why materials from previous workshops or examples no longer work, and worrying about whether updating anything will break stuff.

@caro401 caro401 added the discussion open questions label Nov 9, 2023
@makkus
Collaborator

makkus commented Nov 10, 2023

There is no established way of doing things yet. For now, until we're out of alpha, I've been doing a loose sort of semantic versioning, bumping the minor version for breaking changes and keeping the plugin packages on the same 0.x.0 version as kiara core (even though that is not always necessary).

My main assumption is (was) that since our app would take care of creating the environment, we can pin all versions of kiara and officially supported plugins in each release, after making sure they work together. If we want to support the use of kiara as a library as well, then we need to think about this again, spend some time on it, and establish some sort of (best) practice.

Once we are out of alpha I also intend to maintain a changelog (automatically generated or not, not sure), but I didn't want to spend the time doing that as long as I'm the only main user of everything.

@makkus
Collaborator

makkus commented Nov 10, 2023

is there also versioning of data types, operations, modules?
are versions of everything visible in your data lineage? currently no? should they be?

Each item in the lineage of a dataset comes with the details of the Python environment it was created in (packages, versions of packages, Python version, etc.). This can be queried via the kiara API (or, if some information is not yet exposed, it can be added easily).

The rest is not fully established yet. For now, the module and data type versions mirror the version of the package they are contained in. We might want to add a secondary version that indicates whether the data type/module changed in an important way. For example, for modules we could bump that version whenever the module API (basically the input or output schema) changes, since that indicates it might break pipelines the module is part of.
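
A minimal sketch of what such a secondary version could look like (this is hypothetical and not an existing kiara structure; the names are made up): the package version is simply mirrored, and a separate interface version is only bumped when the input/output schema changes.

```python
from dataclasses import dataclass

@dataclass
class ModuleVersionInfo:
    """Hypothetical sketch, not part of the current kiara API."""
    package_version: str    # mirrors the plugin package version, e.g. "0.5.2"
    interface_version: int  # bumped only when the input/output schema changes
```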

Most of this is stubbed out, and can be added if necessary. But I'm not sure what the best way is here, and figured it's best to wait until we have some sort of experience (how often those changes happen, how severe the impact is each time, ...).

@makkus
Collaborator

makkus commented Nov 10, 2023

are/how are versions represented when you have a pipeline or workflow file?

Not sure what you mean. Pipelines themselves don't need versions, because for kiara's purposes they sort of 'dissolve' once used, and only the steps inside are important (each step has a module name and associated module/plugin version -- see above).

@makkus
Collaborator

makkus commented Nov 10, 2023

what does all this mean for plugin authors and how they should version their things? Are there utilities in the template for bumping a version, writing a changelog etc?

No, this still needs to be established and implemented. I intend to add automatic changelog creation to the project template, but other than that I have not really thought about this. I guess external plugin authors can version their plugins as they please. If we decide to have data-type/module sub-versions, then we'd have to document how plugin creators should do that; it would probably have to happen manually.

@makkus
Collaborator

makkus commented Nov 10, 2023

At the moment, almost every piece of existing code/jupyter notebook/streamlit app I can find doesn't work. What can we do to make this less of a problem? Encouraging users to pin dependencies is one thing, but we also need to think about when things like names of operations need to be changed.

This seems to be a pain point in the team: a decent chunk of time is spent working out why materials from previous workshops or examples no longer work, and worrying about whether updating anything will break stuff.

I did not really plan for kiara to be used in so many heterogeneous places without having full control over creating the environment in a scripted way. In the case of our app (which was the only thing I was working towards), we'd have that full control.

Pinning dependencies is not trivial, especially if you work with multiple plugin packages. Pin too strictly, and the resolver might barf; pin too loosely, and the thing will break in two months. Not sure what the solution is here. Offering kiara as a fully fledged library (with plugins) that will work reliably for everyone would incur some meaningful development and maintenance cost, so we'd have to figure out who would take on that responsibility, I guess.

Names of operations and modules are a different thing, and that's why I've been banging on about how important it is to get their interfaces right. They are not really supposed to change, or at least not often. In the kind of thing we are building we rely on module and data-type interfaces not changing, because we can't really afford to implement a dependency resolution mechanism on top of pip or conda. This is as much a practical consideration/limitation as it is me not being smart enough to figure out how to implement something more flexible. If anyone has any ideas here, I'm all ears.

@CBurge95
Collaborator

Probably more simple and/or can-of-worms questions from me:

Once we are out of alpha I also intend to maintain a changelog

What do we perceive as being 'out of alpha'? I.e. at what stage do the core functionalities need to be for it to have made it to beta? And at that point, will previous versions be effectively irrelevant, so we only need internal documentation for these early parts, rather than anything that is 'user-facing' (i.e. beyond this team)?

Pipelines themselves don't need versions, because for kiaras purposes they sort of 'dissolve' once used, and only the steps inside are important

Do these still need to have pinned dependencies though? What if the steps inside have been renamed or removed? Should this metadata be attached to the pipeline/workflow as a whole, so that we don't have to investigate the separate modules to work out which versions are needed?

Names of operations and modules is a different thing, and that's why I've been banging on about how important it is to get their interfaces right. They are not really supposed to change, at least change often.

For me at least, the issue is not so much that they change but that I don't know how or when they do, beyond trying to run something and discovering that it no longer works - the network analysis notebook, for example, no longer works, certainly not without pinned dependencies (though I personally am also still having some problems operating that, which is more likely a me thing than anything else), but also because the modules used are no longer in the network analysis plugin and only exist in the dh_tagung plugin now. It would just be helpful to have a log of where these are, how they have changed, and what is new with each plugin update. Again, just for my part, I would prefer that the name stayed the same and the internal workings changed in an update, rather than the module getting renamed with any changes. That is easier to find/work out from the module metadata/documentation than trying to find the new operation (even with 'list operation'). Finally, I think the closer we can keep to existing function names, the better - I know we aren't copying other things, but it just increases accessibility/usability for those moving across from different systems.

I don't think any of this has to be 'pretty' or end-user friendly, just findable and readable. Is there an easy way, for example, for us to see what is in the different previous versions of plugins, so we can decide which one might have the modules associated with a certain existing pipeline/notebook?

@makkus
Collaborator

makkus commented Nov 16, 2023

What do we perceive as being 'out of alpha'? I.e. at what stage do the core functionalities need to be for it to have made it to beta? And at that point, will previous versions be effectively irrelevant, so we only need internal documentation for these early parts, rather than anything that is 'user-facing' (i.e. beyond this team)?

Originally I'd have said once we have an initial frontend (which in itself can be alpha-quality code), have established which parts of the backend it uses (Python code as well as exactly which modules/pipelines), and those parts are stable in terms of their API. There will still be bugs, but client developers can assume the API they are using won't change anymore.

Generic usage like with Jupyter will considerably increase the surface of code and modules that need to be tested, maintained, documented and supported.

Do these still need to have pinned dependencies though? What if the steps inside have been renamed or removed? Should this metadata be attached to the pipeline/workflow as a whole, so that we don't have to investigate the separate modules to work out which versions are needed?

It's complicated; it's different when considering the metadata in relation to values that were produced by a pipeline in the past versus future uses of the pipelines themselves. In general, I'd probably treat them the same as modules, and try our hardest not to change their external interfaces. Internals can be changed, since users would not be exposed to them.
But that is only half the story, and a lot would depend on how frontends implement support for and use pipelines on a technical level.

For me at least, the issue is not so much that they change but that I don't know how or when they do, beyond trying to run something and discovering that it no longer works - the network analysis notebook, for example, no longer works, certainly not without pinned dependencies (though I personally am also still having some problems operating that, which is more likely a me thing than anything else), but also because the modules used are no longer in the network analysis plugin and only exist in the dh_tagung plugin now.

Yes, but we determined that from the start, and that is why we put the tagung stuff in their own plugins, right? Those will only keep working if you pin the exact dependency versions you used at that point in time. If you do that, the modules that are in that version of the tagung plugin and the pinned downstream network analysis plugin will still exist. Minus the normal Python problems that come with not updating a dependency for a month... But I don't think there is anything we can do here aside from freezing every single dependency in the environment.

I don't think any of this has to be 'pretty' or end-user friendly, just findable and readable. Is there an easy way, for example, for us to see what is in the different previous versions of plugins, so we can decide which one might have the modules associated with a certain existing pipeline/notebook?

Up until now I've auto-generated plugin documentation for each plugin release, which includes all the modules that were included at that time (e.g. https://dharpa.org/kiara_plugin.network_analysis/0.4.14/ -- you can change the version there). This will break from now on until the new auto-doc generation is implemented, but after that it will probably work the same way.

But the main problem, I guess, is really that we don't have well-documented, tested & production-ready plugins we consider stable so far (apart from core_types and tabular, sort of -- but even those need more polish, documentation, usage guides, tests). Once a plugin is published officially, I'd expect (or require) that all of the pipelines that use it (or Python code, for that matter) are not allowed to break, and backwards compatibility needs to be maintained as far as practically possible. Again, that is the reason why module interfaces can't change (in incompatible ways) after they are stable, and why we so far haven't assigned that label to basically anything.

If anyone wants to implement something that tracks modules while they are still in development, go for it; I'm happy to contribute anything lower-level and technical (probably API endpoints) required for such a system. The new auto-generation of plugin docs is probably a good starting point, since that will have all of the metadata of all included modules and data types in a machine-readable way ( #2 ). A system like that would mainly be useful for the phase until a plugin is published officially for the first time, but it could be doable.

@caro401
Collaborator Author

caro401 commented Dec 7, 2023

@makkus please can you explain why SemVer is not an appropriate solution for versioning modules/plugins? You mentioned there were technical reasons, can you clarify please?

@makkus
Collaborator

makkus commented Dec 7, 2023

It would mean there would have to be a dependency resolution mechanism within kiara, and that's not the type of complexity I feel competent enough to implement. Or maybe there is a way to do it without dependency resolution?

How would the version of the (plugin) Python package relate to the version of a module? What if a pipeline needs an older version of a module, which is contained in an older version of the Python package? What if the data types used in one of a module's inputs/outputs changed in a significant, breaking way? Should we also version data types? Then we'd also need to version each input/output field with not just the type of data, but also the version of that type. I just don't have any idea how to do this technically, or from an API perspective that pipeline and module creators would use.

How would you implement it? Maybe I'm just missing the obvious, simple solution.

@makkus
Collaborator

makkus commented Dec 7, 2023

Maybe let me add more context, this is as good a place as any...

So, as it stands now, kiara centers around the concept of pipelines that can be described in a declarative way, consisting of a series of steps (atomic modules or other pipelines), arranged in a specific way by connecting certain outputs of some steps to certain inputs of other steps. Declarative is the key word here, and that quality is fairly important in different ways. Jupyter notebooks are not declarative, for example, and that is why we can't map certain aspects of doing data science 1:1 between how things are done in Jupyter (which people are used to and sort of intuitively assume is the only way, or at least the default) and kiara.

This way of connecting steps results in a DAG (Directed Acyclic Graph). For all practical purposes those pipelines are 'static' data, but it's possible to sort of simulate the procedural way of doing traditional data science (input - decide which transformation - output - decide which next transformation - output - ...) by gradually adding new steps, resulting in a new 'static' pipeline structure -- this is important conceptually: we don't 'change' the pipeline, we create a new one (potentially, those changes can be tracked in a pipeline/workflow session history). It is a frontend architectural decision how to assemble said pipeline (and whether/how to track that history).
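
For readers who haven't seen one, a purely illustrative sketch of such a declarative pipeline description might look roughly like this (the module names are invented, and the exact keys of real kiara pipeline files may differ):

```yaml
# Illustrative pipeline description: two steps forming a small DAG.
pipeline_name: example_filter_and_count
steps:
  - step_id: filter_rows
    module_type: example.filter_table      # invented module name
  - step_id: count_rows
    module_type: example.count_rows        # invented module name
    input_links:
      table: filter_rows.filtered_table    # connect a previous step's output to this input
```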

Result values that are created by running inputs through such a pipeline carry the information contained in the pipeline within their lineage tree (just upside down, and with 'internal sub-pipelines' resolved). That means that with kiara it's possible to receive a result value, read the metadata, and either re-run the same chain of operations that led to the result value (given we have all the inputs), or run the same pipeline with our own input values. This is one of the reasons why we have lineage, along with being able to communicate the history of data to whoever looks at it, and other more technical reasons.

One of my assumptions is that there will be plenty of pipelines in existence we don't know about or have control over. For example, in my frontend-prototyping explorations I created a lot of small, short-lived ones to do utility stuff like rendering/filtering/other things that are not relevant for the actual data science part (those will never end up in a result value's lineage because they are only used to display temporarily relevant information to a user, for example). And of course, as I said, those pipelines will exist within each kiara value, encoded in its lineage. Those are probably the two most important places, but there are others.

Now, re: versioning. Each pipeline step represents an operation (a combination of a specific kiara module type and its configuration). As is probably obvious, if any of the module types here change their type, name, or 'meaning' (whatever that means in a specific context), any pipeline (remember: that's static, declarative configuration data, not code) that references that module will break. With two exceptions:

  • if we add a new input field to a module that also has a default, and the default value makes the module behave the same way the module behaved without that field before
  • if we add an additional output field, but leaving all previously existing output fields in place as before

Change anything apart from that, and you break every pipeline structure that referenced this module in the past.
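
To make that concrete, here is a small illustrative sketch (field names and types are invented, and plain dicts only stand in for whatever schema representation kiara actually uses) of changes to a module's input/output schema and whether existing pipelines survive them:

```python
# Illustrative only: plain dicts standing in for a module's input/output schemas.
inputs_v1 = {"table": "table", "column": "string"}
outputs_v1 = {"filtered_table": "table"}

# Compatible: a new input field whose default (e.g. True) preserves the old behaviour.
inputs_v2 = dict(inputs_v1, case_sensitive="boolean")

# Compatible: an additional output field, all previous output fields unchanged.
outputs_v2 = dict(outputs_v1, row_count="integer")

# Breaking: renaming, removing or re-typing any existing field.
inputs_v3 = {"source_table": "table", "column": "string"}  # 'table' renamed -> referencing pipelines break
```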

We do have some protection from the lineage at least being totally useless: the value metadata also contains information about the exact Python packages that were in the environment when it was created, so in theory we could re-create that environment and then re-run the included pipeline. But the immediate, intuitive value of having this information is severely diminished, in my opinion. Just imagine a 'perfect world' where module interfaces never change in a breaking way: you could share your results as much as you want, and you'd immediately share a useful workflow that others could use, or maybe build upon/extend. I could imagine a frontend that can dynamically create a graphical mini-app from any result value that someone shares with you, similar to the streamlit example I shared (which is really only missing the 'reading lineage and converting to pipeline' part).

Now, I'm not saying we must have exactly that, and it's only one example of many I could think of. Either way, if I'm right and pipelines are central to using kiara, then I think it's highly desirable to keep breakage of existing ones to an absolute minimum.

As we've talked about, versioning modules separately from the Python package they are contained in could potentially be a way to fix this. In one of the posts above I've indicated that there are some stubs/placeholders for having a secondary version (apart from the Python package version) for modules. This is not something I have thought through 100%, but it could be used to indicate that the interface of a module changed in a non-breaking way, which could be helpful for some frontend tasks.

In theory we could also use it to indicate breaking changes, using semver or whatever. But even if we do that, existing pipelines like the ones I talked about above will break; they will just do so with more information. Re-creating a working pipeline from a lineage will still not be possible, unless we find a way to have multiple versions of a module in a plugin, and also think about how changes in data-type versions would affect this. It would also mean more work for pipeline creators, since they'd have to specify module and data type versions in the declarative pipeline config. And, most importantly, it would mean a fairly serious increase in complexity within the kiara codebase (which is already quite a bit more complex than I'd like it to be). And honestly, I'm not sure I'm a good enough programmer to implement this.

I can go into details why having to version data types would be even worse, if anyone is interested.

Another option would be to decree that we don't care about pipelines (and code like the Jupyter notebooks etc.) breaking from release to release, and that pipelines must be updated just like you need to update your Python code when you upgrade a dependency that breaks a usage pattern. I think this would make some really nice functionality we could otherwise have impossible (or at least very, very impractical), and cause a lot more pain in the medium/long term than spending the time really thinking about modules and their interfaces. But that's just my opinion, and I'd be more than happy for anyone else to propose a way forward.

@caro401
Collaborator Author

caro401 commented Dec 7, 2023

Ok, for now I don't care at all about versioning modules or anything, I just want to get a handle on if/how I can know that a version of a plugin is compatible with other plugins and with kiara core. Is semver appropriate for that or not?

Can we codify how we expect plugin authors to specify their dependencies (>=, ~=, ==, something else?) such that their plugin at a particular version can continue to be used to do actual research with, rather than breaking for unknown reasons? And from there, so that I can pick a bunch of plugins to install with my mini-apps and be sure that they will reliably work and work together, and that I can opt into breaking changes.

@caro401
Collaborator Author

caro401 commented Dec 7, 2023

If pipeline code is meant to be declarative and reproducible, surely each version of each plugin and core should very precisely pin all of its dependencies/have everything in some kind of lockfile, so the pipeline is actually reproducible regardless of changes in the broader Python ecosystem? Does a pipeline create its own runtime dependencies with the right plugin versions when it is run, or do you just hope the end user happened to install them? What is actually kept in the lineage? Every package installed in your conda env or just the one version of the plugin?

@makkus
Collaborator

makkus commented Dec 8, 2023

Ok, for now I don't care at all about versioning modules or anything, I just want to get a handle on if/how I can know that a version of a plugin is compatible with other plugins and with kiara core. Is semver appropriate for that or not?

The question in our meeting was about versioning modules. If you are talking about Python packages and their versions (like our plugin packages), then yes, I'd suggest semver, with the caveat that we don't do 'proper' semver until we've reached version 1 (as a lot of other projects seem to do).

Can we codify how we expect plugin authors to specify their dependencies (>=, ~=, ==, something else?) such that their plugin at a particular version can continue to be used to do actual research with, rather than breaking for unknown reasons? And from there, so that I can pick a bunch of plugins to install with my mini-apps and be sure that they will reliably work and work together, and that I can opt into breaking changes.

I don't know a good solution here, so feel free to suggest something. Packages have their dependencies specified in pyproject.toml. It's not trivial to figure out what to do, and it depends on the package. As with every Python package, you need to specify the minimum version of a dependency that contains all the features you use. If there is a bug-fix release in a dependency you can update or not; it hopefully won't break anything. If the dependency has a breaking change, you have to decide on an individual basis whether you want to raise your minimum version, or specify a maximum (like the last version that still works). If your package is used by others, your users might run into problems because you'll force them into a specific version range that might conflict with another version range that another dependency requests from them.
Long story short, it's the 'normal' package dependency story, which is particularly bad in Python and has nothing to do with kiara plugins.
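
For illustration, the trade-off in pyproject.toml might look something like this (package names follow the kiara_plugin.* naming used elsewhere in this thread; the versions and the choice of specifiers are made up, not a recommendation):

```toml
[project]
dependencies = [
    "kiara>=0.5.0",                          # lower bound only: resolver-friendly, but may break later
    "kiara_plugin.core_types~=0.5.0",        # compatible release: >=0.5.0, <0.6.0
    "kiara_plugin.network_analysis==0.5.2",  # exact pin: most reproducible, most likely to conflict
]
```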

Can we codify how we expect plugin authors to specify their dependencies (>=, ~=, ==, something else?) such that their plugin at a particular version can continue to be used to do actual research with, rather than breaking for unknown reasons?

Sure, go ahead; if you have an idea or opinion here, we'll do that. I don't think it's realistic to come up with a solution that gives such a strong guarantee, but maybe there is a 'best' way to cover most situations? I'd be more than happy if we had a solution here.

And from there, so that I can pick a bunch of plugins to install with my mini-apps and be sure that they will reliably work and work together, and that I can opt into breaking changes.

I don't understand the exact circumstances of how that is supposed to work, apart from specifying the dependencies you need in pyproject.toml, but if you can sort of spec that out and maybe tell me how you would 'configure' that on your side, I can try to implement anything that would be necessary in the backend or the plugin template.

If pipeline code is meant to be declarative and reproducible, surely each version of each plugin and core should very precisely pin all of its dependencies/have everything in some kind of lockfile, so the pipeline is actually reproducible regardless of changes in the broader Python ecosystem?

Well, no, because kiara collects this information at runtime:

kiara context info runtime environment explain python

That information is attached to every value that gets stored, and can also be queried and persisted in other situations. That doesn't mean we can't have a lockfile somewhere, but it means it's not required to manage this particular information outside of kiara or let it influence the strategy for handling Python (or conda) packages, which, as we've established, is hard enough without additional constraints.

Does a pipeline create its own runtime dependencies with the right plugin versions when it is run, or do you just hope the end user happened to install them?

The pipeline is just the structure of how modules are connected together; it does not contain any dependencies. I guess we could consider adding a metadata field if there is a strong use case for it, but so far it wasn't needed.

If you run a pipeline and the environment is missing some plugins, kiara will tell you it can't run it because it is missing module_types 'x' and 'y'. I haven't really made that error message pretty, but in theory I can probably add some instructions like: pip install this and this specific plugin package. We could think about doing this in an automated way for the user, but we'd probably have to restart the Python runtime if we do. So, if this is a requirement from the frontend side, we can add it to the to-do list. So far it hasn't been an issue in practice for me, so I didn't think through the options. Happy to work through them with you though.

What is actually kept in the lineage? Every package installed in your conda env or just the one version of the plugin?

The lineage only contains the basic structure (basically reverse pipeline information), as well as the ids of all the values involved (inputs, intermediate results, and the value that contains the lineage itself). Package information can then be retrieved by accessing the Python environment info that is attached to each of those values; I can't remember exactly how the information is stored and whether an additional lookup step is required or not, but if there is a need I can create a data structure that contains that info both ways. In practice, since all of those values could contain different versions of Python packages (the values further down the line could have older versions in some cases), it's probably best to pick the Python packages from the value that contains the lineage itself, since in most cases it would include all the packages that were required to create it. There might be some edge cases I haven't thought through yet, but this isn't something we have needed so far. So again, whatever the requirements from your side, just let me know.
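
Purely to picture the structure described above (this is not the actual kiara data model, and every field name here is invented), the information attached to a value might look roughly like:

```python
# Invented field names; only meant to illustrate the described structure.
lineage_sketch = {
    "value_id": "c1f4...",                 # the value that carries this lineage
    "operation": "example.filter_table",   # the step that produced it
    "inputs": {
        "table": "9a72...",                # ids of input / intermediate values
        "column": "4be0...",
    },
    # each referenced value carries its own Python environment snapshot
    "environment": {"python": "3.11.4", "packages": {"kiara": "0.5.9"}},
}
```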

This all works as long as the interfaces of data-types and modules don't change. As soon as we allow (or have to consider) breaking changes here, it becomes a lot more complicated. This was flagged as a risk back when we decided to go the modular route (before your time).

@makkus
Collaborator

makkus commented Dec 8, 2023

Oh, another complication I haven't talked about is the Python version, because that also affects which versions of dependencies are available; for example, currently we have a problem with duckdb not being available for 3.12. Or which architecture you are running on.
But I've sort of given up on the idea that we'll ever be able to figure all this out for our users, since it's the same problem that everyone else is having, with or without kiara.

For frontends like yours it's a bit easier, because we can freeze the exact environment at release (incl. Python version) and test on all architectures/OSs we want to support. If it turns out we have some plugins we don't want to install initially but want to give users the option to install (I don't see a need for this in a mini app like this), we could at least do it in a managed, tested way.
For a use case like 'kiara-in-jupyter', it's harder.
