Checking in #2
We've actually been working on this in reticulate. Right now, our intention is to download and install Miniconda onto the user's filesystem (using a path from rappdirs), which can then be discovered and used by reticulate.
The ultimate goal is to insulate the user from having to worry about managing a Python installation -- that is, R packages using reticulate shouldn't need to ask their users to set up Python themselves. We're also hoping reticulate will be able to work with a single, shared Python environment, with the option later for creating and using project-specific Python environments as necessary. This implies having client packages of reticulate target that shared environment.

One concern I have with your approach is that it becomes difficult to switch to different versions of Python for users -- one is effectively stuck with the version that basilisk was built with.

One other worry: if 'basilisk' were to upgrade the bundled version of Python, I think this would also imply upgrading any pre-existing virtual environments (since those are often created using symlinks back to the "real" Python installation). This may or may not be the desired behavior for users. Perhaps this is okay if Python version bumps would only occur with R version bumps as well?
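For concreteness, here is a minimal sketch of what that discovery step might look like from R. The "r-miniconda" directory name and the `find_shared_python()` helper are illustrative assumptions, not reticulate's actual implementation:

```r
# Locate a shared Miniconda installation under an rappdirs-style data
# directory and point reticulate at it (hypothetical sketch).
find_shared_python <- function() {
  root <- rappdirs::user_data_dir("r-miniconda")  # assumed install root
  bin <- if (.Platform$OS.type == "windows") {
    file.path(root, "python.exe")
  } else {
    file.path(root, "bin", "python")
  }
  if (file.exists(bin)) bin else NULL
}

py <- find_shared_python()
if (!is.null(py)) {
  reticulate::use_python(py, required = TRUE)
}
```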
Doesn't this imply losing the main benefits of virtual environments?
Thanks Kevin;
That sounds great; that would save me from having to manage that particular bit of functionality. If there are also advanced options to control the path, that would be perfect for us.
Yes, it would require a bit of discipline on our end - but we already manage coordinated versions of Bioconductor core packages across the project, so it is not much of a stretch to say that "BioC packages are strongly recommended to use this Python version". I guess each package could also install their own Python version if they wanted, but I figured we could save a bit of space and time by having a single installation for everyone rather than dealing with twenty different miniconda installations scattered around the place. At the time, I was also thinking that every R package would also use the same Python session (and thus should use the same Python version) but this doesn't matter as much now.
Yes, that's right. And additionally, the Bioconductor core team has the ability to bump all dependent packages to trigger recompilation in case of updates to core packages. This is not infrequent when dealing with changes to the S4 method table, and I've had to request this a few times when updating some C++ APIs that are linked to by other packages.
Yes. I think it depends on how you view this capability. From a package developer viewpoint, I have been thinking of using Python in a manner similar to the way compiled code is used -- as an internal implementation detail that the user never touches.
That's a good point. I guess I should add some code that (i) checks whether a Python instance is already loaded, (ii) uses the virtual environment directly in the current process if not, and (iii) spins up a new process if the loaded instance is not compatible with the one that I want.
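A rough sketch of that three-way check, assuming a helper that receives the client's virtualenv path (the function name and dispatch details are mine, not basilisk's API):

```r
# Run `fun` against the virtualenv at `envpath`, in-process when
# possible and in a fresh R process otherwise (hypothetical sketch).
run_with_env <- function(fun, envpath) {
  if (!reticulate::py_available(initialize = FALSE)) {
    # (ii) no Python loaded yet: claim this process for our virtualenv
    reticulate::use_virtualenv(envpath, required = TRUE)
    fun()
  } else if (identical(reticulate::py_config()$python,
                       file.path(envpath, "bin", "python"))) {
    # (i) a compatible instance is already loaded: just use it
    fun()
  } else {
    # (iii) an incompatible instance is loaded: evaluate in a new
    # R process where our virtualenv can be loaded cleanly
    callr::r(function(fun, envpath) {
      reticulate::use_virtualenv(envpath, required = TRUE)
      fun()
    }, args = list(fun = fun, envpath = envpath))
  }
}
```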
Package authors may want to retain references to Python objects within the R session, though -- even if they only expose R objects to their users. As an example, you could imagine an R package that maintains a reference to e.g. an HTTP server implemented with Python, with the R package providing an interface to that server. In that case, the Python object would need to remain persistent. IMHO, any tooling built around reticulate should support this kind of use case.

That said, having an API to execute some code in a separate process would still be useful.
That's an interesting perspective. I hadn't realized that the Python objects were persistent like that. If I may throw another opinion into the ring: for the packages I develop, I never assume any level of persistence between function calls. The general philosophy is to avoid side effects so as to make my life easier; the basilisk way of using Python code reflects this mentality. (I don't know whether global variables are particularly common in Python packages, but persistence could make R package debugging harder if those globals can be modified by reticulate operations in other packages.)

The practical reason for using a separate R process was to accommodate different (and possibly mutually incompatible) Python package dependencies from different R packages within a single R analysis session. This would ensure that R packages calling some Python code would work regardless of what else was going on, e.g., what other packages were run before that, or what version of Python was loaded manually by the user. If you know of a better way of doing this, then that would be great -- but it seems like it would be very difficult to swap out one set of Python modules for another "on the fly", especially given that the objects need to persist in the same Python session.

And yes, it's true that basilisk won't be able to handle cases where persistence is necessary. I'll admit that's not a thing that I was considering for my packages. But I don't know of a solution that allows persistence and still guarantees that the R packages will work reliably. Tell you what -- if you think this "separate process" idea is too niche, that's fine and understandable. I can just piggy-back off your miniconda deployment code (provided I can control the installation path).
basilisk will probably have to exist to standardize the Python installation across BioC packages anyway, and at least within the project, I can imagine a few governance things we can do to encourage people to use compatible Python packages - with the separate processes being a fallback.
I do think that client packages of reticulate will expect that a persistent Python session is available and can be accessed from the same R session. Handling mutually incompatible Python dependencies is definitely the main challenge in this approach, but I believe it can be at least mitigated if we document for R package authors that they should aim for a common target (e.g. Python 3.6).
Okay, thanks. I guess I was hoping for a nice solution that would guarantee that Python-dependent functions in client packages can work regardless of what has already happened in the R session... oh well. I'll ask around the BioC developers to see what they feel about the persistence vs. isolation trade-off that basilisk makes compared to raw reticulate - see if there's any appetite for this...
Thanks to @lawremi, the persistence problem has been solved by simply setting up a separate R process. This retains the advantage of isolation, as discussed above, while allowing the process to last as long as the client package wants it to. I suspect most people will just keep it alive for the duration of the R session.

There is still the cost of serialization, but it should be possible to mitigate this by forking where possible. For Windows users, where forking is unavailable, the serialization cost will remain.
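As a sketch of the idea, here is a persistent worker process along the lines described above, using callr::r_session directly rather than whatever wrapper basilisk ends up exposing (the virtualenv path is a placeholder):

```r
# Start a long-lived R process that owns its own Python instance.
sess <- callr::r_session$new()
sess$run(function() {
  # Load the client's virtualenv inside the worker only; the parent
  # R session's Python (if any) is untouched.
  reticulate::use_virtualenv("/path/to/client/venv", required = TRUE)
  TRUE
})

# Later calls reuse the same process, so Python state created in it
# persists between calls for as long as the session is alive.
res <- sess$run(function() {
  np <- reticulate::import("numpy")
  np$mean(c(1, 2, 3))
})

sess$close()
```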
From my understanding of this discussion, and from admittedly superficial day-to-day experience with Python, my impression is that R users treat pythonic code as scripts rather than library calls, and as such persistence and interoperability (between ad hoc collections of Python modules) is an important goal. I'll tag @simon-anders (if he's on github these days...)
My understanding is that basilisk has two goals: (1) define a common Python environment for all of Bioconductor, (2) enable multiple, potentially conflicting Python environments to exist within a single R session. I think these two goals could be achieved with separate packages.

For (1), it seems that basilisk is just specifying (and providing) a single version of Python itself. Ideally it would specify (and potentially provide) a broader environment including the versions of popular Python extension modules. This would be the official Python environment of Bioconductor and serve to minimize conflicts. This is probably just a simple wrapper around Miniconda.

For (2), there should be an abstraction layer that enables packages to transparently interact with Python (via reticulate), whether Python is running in the current session (using the official Bioconductor environment), or whether Python is running in a remote session (using an alternative, conflicting environment). This seems like a reasonable compromise that enables full compatibility with any Python code while being efficient for the common, standard case. Ideally reticulate itself would provide this flexibility.
That's right. At least, from the perspective of basilisk's intended client packages.
I strongly agree with both of the points raised by @lawremi. I think having an official, centrally specified Python environment for Bioconductor would be valuable.

We'll have to think about how (2) could be handled on the reticulate side.
Yes, if (2) could be handled on the reticulate side, that would make life easier. In fact, if (1) and (2) are provided by reticulate, there wouldn't be any need for a dedicated package like basilisk at all, and we could just have a Python integration style guide on the BioC website somewhere. Of course, if there is no appetite for (2), then I guess basilisk would need to exist, but that's okay; it wouldn't be much work to slap some tests and docs on it for BioC submission. I'll go and collect some feedback from developers of potential client packages before I make a go/no-go decision.
I think it still makes sense for (1) to be solved by a package like basilisk; especially since it takes care of making sure Python is available in the binaries that would be distributed on Windows + macOS, and also freezes the version of Python for a particular version of the package. It is likely out-of-scope for reticulate.
Why would this be a Bioconductor package? Seems like the main consequence of that would be to enable Bioc developers to go off in their own incompatible direction...

One thing I'd hope to get from this iteration is a unified solution that allows the Bioc (and CRAN?) builders to actually run Python code in the same way that we do R code during build / check. We can't do that if there is a plurality of solutions, or if solutions are intrinsically fragile or resource-intensive. I believe also that even the major Python-based CRAN packages do not evaluate Python code, which undermines a major strength of the system...

Kind of a weird discussion for an issue on a package that's still under development... ;)
I could see a case for solution (1) to be in Bioconductor, because it would give Bioconductor control over the Python environment to which all packages must adhere. Getting the entire R/CRAN ecosystem to agree on a single environment would be a lot more difficult.
Based on the commentary above, I have added some control over where and how the Python installation is set up. Users and developers can influence these choices by directly or indirectly setting the relevant environment variables.
Branch vc_basilisk of https://github.com/vjcitn/BiocSklearn uses basilisk to manage Python packages related to scikit-learn. BiocSklearn's current incarnation in Bioc 3.10, which simply uses reticulate::import as needed, without specifying Python package versions, may illustrate some advantages to the basilisk approach, as today's build/check outcomes for linux and macos are inconsistent. I reported that a drawback appears to be the installation of a complete Python runtime in the BiocSklearn installation inst folder (~330MB) ... but I may be failing to take advantage of certain symbolic link opportunities.
Some investigation indicates the size is due to the installed packages, which get installed fresh for every virtual environment, so it's not a symlink issue. The cleanest solution is probably to have basilisk pull down some "core" packages (e.g., scipy, numpy, pandas) and allow the virtual environments to re-use them to avoid chewing up space with unnecessary copies across packages. This should be possible by creating each virtual environment with access to the core installation's site packages.
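One way to get that re-use, sketched with Python's standard venv module from R (paths are placeholders, and basilisk may well implement this differently):

```r
# Create a per-client virtual environment that can see the "core"
# installation's site-packages, so common modules (numpy, scipy, ...)
# are not copied into every environment (hypothetical sketch).
core_python <- "/path/to/basilisk/miniconda/bin/python"
system2(core_python,
        c("-m", "venv", "--system-site-packages",
          "/path/to/client/envs/myenv"))
# Packages the client installs on top of this go into its own env;
# anything already in the core installation is resolved via the
# shared site-packages and occupies no extra space.
```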
@vjcitn The proposed solution is active as of 8b319f3 and works pretty well for me:
This shifts the content into basilisk, which should be fine if it reduces the redundancy that would otherwise occur if client packages all had their own copies of, e.g., numpy. (Installing a separate copy is still permissible if it is necessary to resolve version conflicts; we've just switched the default.)

Now we just have to decide on a set of "core" packages... currently it's scipy, numpy, pandas and matplotlib. Adding more is possible but will inflate basilisk, so it had better be good.
If we're aiming to define a common base environment for Bioconductor, the more modules+versions we can specify, the greater the compatibility between packages. That doesn't mean we have to include the modules in the download of basilisk itself; it could lazily grow its installation.
The idea of a 'common base environment' converges with efforts @kevinushey is pursuing, and is of course highly desirable from a user and build system perspective. |
Well, basilisk doesn't make any statements or guarantees about Python-Python compatibility across R packages, nor do I feel that would be a common use case. However, there would be benefits from avoiding the process-based fallbacks and reducing installation size of client R packages.
We'd have to design appropriate fallbacks for situations when the downstream processes requesting lazy installation don't have access to the basilisk installation directory. This is... possible, I guess, but it could get messy pretty quickly. We could move the miniconda installation out of basilisk's directory, but that raises other questions about where this thing would live, e.g., when installed by a user vs a sysadmin on a server environment.

There is also the problem of what happens when basilisk updates, as the lazily installed Python packages will all be lost when the installation directory is wiped. You would have to correspondingly bump all downstream packages to trigger lazy reinstallation. This is less of a problem with the current setup provided the miniconda installation itself doesn't change, as the symlinks in the client packages will just keep pointing to the same place.

Seems a lot cleaner to just have basilisk pull down a fixed set to start with.
basilisk now has support for lazy installation of "core" packages. The deal is this:
The idea is to avoid redundant installations and thus decrease the disk footprint of the basilisk framework while also avoiding a large upfront cost in Python package installation. We can thus add any number of packages to the "core" list without penalizing clients that don't use them.

For any developer of a client package (well, just @vjcitn right now), there are several major changes:
Again, check out the vc_basilisk branch of BiocSklearn linked above for a working example.
I wonder if there is a little hitch... if you use devtools::document() in a client package, inst/basilisk will be populated with a Python environment (111MB in my case) in the source folders. You have to get rid of that before proceeding to install. Maybe my development pattern (devtools::document() inside the source tree to generate man files) is obsolete?
Otherwise this new approach works for the code at https://github.com/vjcitn/BiocSklearn/tree/vc_basilisk
I see this too. Don't know why it happens... guess it's due to some devtools behavior.

Edit: Well, it turns out that the problem was on basilisk's end after all.

Edit II: Fixed by the simple expedient of checking whether the installation directory for an R package actually ends with the name of that package.
The final piece is now in play. basilisk will create a virtual environment that is transparently used by any client package that does not require Python packages outside of the core set.
Another change is that we cache the miniconda installer script via BiocFileCache, to avoid having to redownload it every time we reinstall basilisk. It's actually surprisingly small for what it does, but it still has to do quite a lot, so we can cut out ~10 seconds from the install time by caching it. There is a remaining question of whether we want to match the package versions used by the full Anaconda distribution.

Edit: And now the circle is complete. Core lists for every OS are defined by scraping the Anaconda website and pulling out those tables for use as a constraint file for the installation.
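The caching idiom might look something like this (a sketch with BiocFileCache; the helper name and the installer URL are illustrative, not basilisk's actual code):

```r
library(BiocFileCache)

# Return a local path for the Miniconda installer, downloading it into
# the cache only on the first call (hypothetical helper).
cached_installer <- function(url) {
  bfc <- BiocFileCache(ask = FALSE)
  hits <- bfcquery(bfc, url, field = "rname", exact = TRUE)
  if (nrow(hits) == 0L) {
    bfcadd(bfc, rname = url, fpath = url)  # downloads into the cache
  } else {
    bfc[[hits$rid[1]]]  # path to the previously cached copy
  }
}

installer <- cached_installer(
  "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
)
```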
Looks like RStudio / reticulate have moved forward with their approach: https://stat.ethz.ch/pipermail/r-package-devel/2020q1/004821.html. Any source for details, @kevinushey?
Please see:
for a vignette discussing the scheme. |
Looks cool. Could we re-use whatever code you wrote to handle the miniconda installation? Getting the (Ana)conda installer working is giving me endless grief on the BioC Windows build machines. |
The function `reticulate::install_miniconda()` handles the download and installation of Miniconda, so you could likely re-use that.

In addition, if you find anything that doesn't work for your use case, please let me know.
Thanks @kevinushey. It looks pretty good, and I learnt a few things from it.

And on that note, I just noticed the comment in the source...
@kevinushey:
@mtmorgan suggested that you might already have some ideas/implementations for the delivery of a consistent version of Python to reticulate users. If this is possible, it would certainly be helpful for our efforts to reliably wrap and deploy Python functionality inside Bioconductor packages.
Our current approach in basilisk basically uses a sledgehammer to kill a mouse. We drag in a miniconda installation of Python into

${R_HOME}/library/basilisk/inst

upon installation of basilisk, thus providing a self-contained Python installation owned by R. Each downstream BioC package that wants to use this Python instance sets up its own virtual environment upon its own installation, pulling down and installing whatever Python packages are necessary. Execution of Python code takes place in a separate R process via callr, thus allowing multiple R packages relying on incompatible versions of Python packages to be used in the same analysis session.

Perhaps you might have something that's more elegant? Our main focus is on the operation of Python code within R packages, with developers being the intended audience.