Slim installation option or model inference export option #689

Open
twoertwein opened this issue Nov 11, 2024 · 5 comments · May be fixed by #691

Comments

@twoertwein

RSMTool has many large dependencies, and many of them are not needed at inference time.

It would be nice to either:

  1. Use pip-compatible install options (extras), such as pip install "rsmtool[full]", so that the full set of dependencies is installed only when needed
  2. Add a method to export an already trained model so that it can be run without requiring RSMTool (for example, as a sklearn pipeline). Ideally, it would allow specifying all the options that fast_predict supports.
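
For option 1, here is a minimal sketch of how such a split could be declared with setuptools extras. The extra names (excel, parquet, full), the core list, and the grouping are assumptions for illustration only, not RSMTool's actual packaging. The optional package names are the ones mentioned in this thread.

```python
# Hypothetical dependency split for a setuptools-based setup.py.
# The grouping and the extra names are assumptions, not RSMTool's real config.

# Assumed minimal set needed at inference time.
CORE_REQUIRES = ["numpy", "pandas", "scikit-learn", "skll"]

# Optional groups, installed only on request, e.g. pip install "rsmtool[excel]".
EXTRAS_REQUIRE = {
    "excel": ["openpyxl", "xlrd", "xlwt"],
    "parquet": ["pyarrow"],
}

# "full" is the union of all optional groups, deduplicated and sorted.
EXTRAS_REQUIRE["full"] = sorted(
    {pkg for group in EXTRAS_REQUIRE.values() for pkg in group}
)

# In setup.py these would be passed as:
#   setup(..., install_requires=CORE_REQUIRES, extras_require=EXTRAS_REQUIRE)
```

With a layout like this, a plain pip install would pull in only the core list, and the "full" extra would restore the current install-everything behavior.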
@desilinguist
Member

desilinguist commented Nov 11, 2024

I think (2) is a good idea. To start with, since RSMTool models are SKLL models, we should be able to easily set the pipeline attribute, and then you would only need SKLL on the inference side. You'd still have more dependencies than with plain scikit-learn, but SKLL adds far fewer on top of scikit-learn than RSMTool does. Of course, this would require extra disk space as well.

@twoertwein
Author

It might also be nice to remove pyarrow, openpyxl, xlrd, and xlwt from requirements.txt. These libraries are never called directly in RSMTool. If users want pandas to read parquet files, they should install pandas's optional dependencies (fastparquet or pyarrow) themselves.
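
If these packages became optional, one way to keep the failure mode friendly is to check for them only at the point of use. The sketch below uses only the standard library. The helper names require_optional and parquet_engine_available are hypothetical and are not part of RSMTool or pandas.

```python
import importlib.util


def require_optional(module_name: str, purpose: str) -> None:
    """Raise a helpful ImportError if an optional top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is needed for {purpose} but is not installed. "
            f"Install it with: pip install {module_name}"
        )


def parquet_engine_available() -> bool:
    """Return True if either parquet backend named in this thread is importable."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ("pyarrow", "fastparquet")
    )
```

A reader function could call require_optional("openpyxl", "reading .xlsx files") just before handing an Excel path to pandas, so that users who never touch Excel files never need the package.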

@desilinguist
Member

I believe openpyxl and xlrd are needed for Excel support in Pandas or at least used to be? pyarrow was added because pandas is going to make it required starting with 3.0.

@twoertwein
Author

> I believe openpyxl and xlrd are needed for Excel support in Pandas or at least used to be?

Yes, these are optional dependencies of pandas that are needed to read and write Excel files. Personally, I think users should be responsible for installing them; not even pandas installs them by default.

> pyarrow was added because pandas is going to make it required starting with 3.0.

I believe that was reverted :) (and fastparquet is much smaller and available on more architectures)

@desilinguist
Member

I think the reason for pre-installing those libraries was that RSMTool was pitched as a fully self-contained solution that works out of the box, and Excel spreadsheets were the main input files at ETS. But if that's no longer the case, then removing them is probably fine.

Good to know about pyarrow - we should definitely remove that then.

twoertwein linked a pull request on Jan 2, 2025 that will close this issue