RFC: `Nodes` and `DataContainers` extension for supporting scikit-learn #823
Comments
Nice, very detailed @jjerphan. By DataFrame support I'm guessing you mean all the functions like .max, .pivot, .apply, etc.? (see the sidebar here) I also want to ask the people who built the backend (like @smahmed776) whether they think this will require major changes to the backend beyond DataContainer. Adding the ability to pass classes in Flojoy is a bit different from what we're currently doing.
By DataFrame support I meant supporting passing `pandas.DataFrame`s.
Hi Julien, I wanted to add an example that should be a good target for this integration, using an industry application we've already been contacted about: semiconductor wafer quality assessment. What is the data: greyscale images of semiconductor wafers (resolution ~ 50x50). The failure types we are interested in are the following:

Given the complexity of categorizing each image into any of these categories, it is a perfect test case for an ML application. For reference, to train the model, please use the dataset found here, which is a cleaned version of the data found here. I've included a brief visualization of 100 wafers from this dataset in the video below, generated from a little gist here. Once you get a model trained to correctly identify the images in the example dataset, the functionality can then be ported to Flojoy, at which point I will have finished integrating batch processing into Flojoy. (video attachment: wafers.mp4)
Hi Tristan, I have several questions:
Hi Julien, With regards to the example case, I do think something like an MLP / CNN would be best for that. As far as I know, the functionality in

Other than that, the proposed approach sounds good to me (with the addition above). The scope you've defined seems to be very nice for this first integration. I would say you can go ahead with this plan (if @Roulbac @dstrande @Ben-Epstein approve as well). I do think it would be valuable for users to be able to input their own pre-trained models. Many industry partners have already spent massive computational resources on various models, and if they can easily insert them into Flojoy, I think it will make our product and its functionality more attractive to potential customers.
👋 I'll break my thoughts into a few sections.

The wafer quality example
Definitely agree, this is not a feasible use-case for sklearn in my opinion. And I don't think it aligns well with the typical use-cases of sklearn users. Sklearn models are often
I would focus on examples that map to these criteria.

Scope and design

If I'm understanding your proposal correctly (building a node for each of the components listed under

I would instead suggest considering a framework that has a node for
Each of those has required parameters such as
You could even extend that to have dynamic parameters that are based on the class chosen. For example
This will let you scale much more easily, both from a development perspective and from a UX perspective, as having a node per classifier in the UI might be hard to navigate.

Out of scope

If you want to make Pipelines out of scope, you should consider talking to your prospective audience and understanding their use-cases. For example, a very common practice is to have pipelines that employ

Similarly, dataframes are pretty standard in ML over arrays/ordered pairs. They offer that necessary structure, so I'd again consider talking to your customers to get a better idea of their wants.

Pre-trained models

This is incredibly valuable and should definitely be considered. There are 2 components to this
What I meant is that CNNs might be better suited for classification or regression problems on such data, since those architectures make use of the hierarchical structure of n-d signals much more than MLPs do. I think the use-cases you are provided with motivate the introduction of nodes from (or at least workflows using) deep learning frameworks. Even though this might be out of the scope of scikit-learn support within Flojoy, some frameworks (like Keras) have a really similar API and UX to scikit-learn's, and the work on integrating scikit-learn might help with theirs. If supporting those frameworks makes sense, we might want to open discussions for that. What do you think?
I agree it's a different topic, and one worth having. But I'd just toss in that you should strongly consider using HuggingFace over Keras. It's a much simpler framework, and I imagine that a large percentage of use-cases from customers will have pre-trained models already on the Hub.
I confirm that supporting HuggingFace's pipelines would help users solve a variety of problems scikit-learn is not suited as a solution for. Depending on Flojoy's vision or targeted use-cases (which I do not know entirely), scikit-learn might not be as relevant as other solutions. Would you like to provide your users with:
Thank you @jjerphan for initiating this conversation, here is my feedback on the matter. Firstly, I want to bring to everyone's attention the utility of model inference in the context of Flojoy versus model training. With a myriad of complexities around model training, Flojoy can shine much more easily by catering to pre-trained models which users want to deploy with ease. Please bear in mind that this doesn't mean we should drop model training at all, but rather that we should focus more energy on model inference while still catering for simpler model training scenarios. Why Prioritize Inference over Training:
Feedback on the Issue:
TL;DR: There is a lot to gain in supporting pre-trained pipelines and simple model training/fine-tuning use-cases, and it would be much harder to make Flojoy a fully-fledged model training platform. This thought is important to keep in mind while making design choices for the platform. HF pipelines are an excellent example of what Flojoy could do very well.
Thank you for this comprehensive comment, @Roulbac. I agree with everything you have laid out. After identifying Flojoy's direction and relevant use-cases, I think that the support of scikit-learn (which QuantStack was contacted for) might not be as relevant (for now) as deploying models. I propose that we open another RFC dedicated to Model Deployment within Flojoy to pursue discussions. What do you think?
+1 I agree ☝️
Context: scikit-learn's usage and specificities
While the current `Nodes` and `DataContainers` design is sufficient for most libraries, like SciPy and NumPy, which can entirely be used with free functions, other libraries — like scikit-learn — have workflows relying on stateful instances of classes they define.

In the case of scikit-learn:

- Users mainly interact with `Estimators` (i.e. generally `Regressors`, `Classifiers` and `Transformers`) and a few methods on those instances (basically `fit`, `predict`, `predict_proba`, `score`, `score_samples`).
- `Estimators` can be composed in `sklearn.Pipeline`s, themselves being `sklearn.MetaEstimator`s.
- `Estimators` accept and return NumPy arrays and common Python objects (`int`, `float`, `str`, `dict`, `list`, `tuple`). As of 1.2, scikit-learn has extended support for `pandas.DataFrame` (pandas is not a dependency of scikit-learn).
- Once `Estimators` are `fit`, public fitted attributes (parts of those instances' states) can be accessed to retrieve relevant information.
  - For some `Estimators`, public fitted attributes' access is useful (it provides additional information) but is not strictly required.
  - For other `Estimators`, public fitted attributes' access was the goal of having the `Estimators` fit and thus is required.
- Some `Estimators` have specific public methods (e.g. `cost_complexity_pruning_path` for `sklearn.tree.DecisionTreeClassifier`). Those are defined either in final classes or in common mixin or base classes.

There are already some existing nodes that are using scikit-learn under the hood in `AI_ML` and `GENERATORS`, such as:

Depending on the use-cases Flojoy wants to target, we might want to develop `Nodes`:

This RFC mainly aims at defining this second option.
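The "public fitted attributes" point above can be made concrete with a tiny, self-contained example: after `fit`, an estimator exposes its learned state through trailing-underscore attributes, which a node could surface after fitting.

```python
# After fitting, scikit-learn estimators expose learned state via
# trailing-underscore attributes (here: slope and intercept).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted attributes, ~[2.] and ~1.0
```

For `LinearRegression` these attributes are informative extras; for something like `sklearn.decomposition.PCA`, accessing fitted attributes (e.g. `components_`) is the whole point of fitting, which is the distinction the two sub-bullets above draw.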
Proposed scope: focus only on the minimal required steps
The minimal required steps are the following:

- Load or generate a dataset: `X`, `y`, two NumPy arrays.
- When doing model selection manually, `X` and `y` get split as:
  - `X_train` and `y_train`: to fit an estimator
  - `X_val` and `y_val`: to evaluate an estimator's performance during the model selection
  - `X_test` and `y_test`: to evaluate the final chosen estimator's performance
- Model selection tools (e.g. `sklearn.model_selection.GridSearchCV`) generally take care of training and validation, so `X` and `y` then get split as:
  - `X_train` and `y_train`: to fit estimators and evaluate their performance during the model selection (they are further split in the process)
  - `X_test` and `y_test`: to evaluate the final chosen model's performance
- Fit the estimator (a `MetaEstimator` if model selection is used).

In scikit-learn, this scope non-exhaustively targets the following interfaces:

- `sklearn.model_selection.train_test_split`
- `sklearn.datasets.make_blobs`
- `sklearn.datasets.make_classification`
- `sklearn.datasets.make_regression`
- `sklearn.datasets.load_iris`
- `sklearn.preprocessing.StandardScaler`
- `sklearn.preprocessing.OneHotEncoder`
- `sklearn.model_selection.GridSearchCV`
- `sklearn.linear_model.LinearRegression`
- `sklearn.tree.DecisionTreeRegressor`
- `sklearn.linear_model.LogisticRegression`
- `sklearn.tree.DecisionTreeClassifier`
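The minimal steps, exercised with several of the targeted interfaces, can be sketched as a plain script: generate a dataset, split off a test set, preprocess step by step (no `Pipeline`), run model selection with `GridSearchCV`, and evaluate on the held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Load or generate a dataset: X, y, two NumPy arrays.
X, y = make_classification(n_samples=200, random_state=0)

# Split off a held-out test set; GridSearchCV (a MetaEstimator) handles
# the train/validation splitting internally during model selection.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Preprocessing chained manually, step by step (sklearn.Pipeline is
# proposed as out of scope).
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit with model selection, then evaluate on the test set.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, None]})
search.fit(X_train, y_train)
test_score = search.score(X_test, y_test)
print(search.best_params_, test_score)
```

Each statement here is a candidate `Node` boundary: the arrays and the fitted estimator are the values that would flow between nodes as `DataContainers`.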
For a first minimal support of scikit-learn, I propose considering the following as out of scope for now:

- `sklearn.Pipeline`
- `pandas.DataFrame` support within scikit-learn

Proposed design
- `DataContainers` specifically for most of scikit-learn's `Estimator` and `Transformer` classes, usable as `Nodes`' inputs alongside the existing `DataContainers`
- `Nodes` to load or create datasets (e.g. `sklearn.datasets.load_*`, `sklearn.datasets.make_*`, `pandas.read_csv`)
- `Nodes` for the main methods (`fit`, `predict`, `predict_proba`, `score`, `score_samples`), returning `DataContainers` (`OrderedPair` generally)

Proposed metric of success
Being able to produce examples similar to the ones of scikit-learn in Flojoy, such as:
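As one concrete target, here is a hypothetical sketch of how an estimator-wrapping container and `fit`/`predict` nodes could combine. None of these names (`EstimatorContainer`, `fit_node`, `predict_node`) are Flojoy's actual API; they only illustrate the shape of the proposed design.

```python
# Hypothetical sketch (illustrative names, not Flojoy's actual API):
# a DataContainer-like wrapper holding a scikit-learn estimator, plus
# "fit" and "predict" nodes operating on it.
from dataclasses import dataclass

import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression


@dataclass
class EstimatorContainer:
    """Assumed container type holding a scikit-learn estimator."""
    estimator: BaseEstimator


def fit_node(container: EstimatorContainer, X, y) -> EstimatorContainer:
    # A "fit" node fits the wrapped estimator and passes the container on.
    container.estimator.fit(X, y)
    return container


def predict_node(container: EstimatorContainer, X):
    # A "predict" node returns something OrderedPair-like: (x, predictions).
    return X, container.estimator.predict(X)


X = np.array([[0.0], [1.0], [10.0], [11.0]])
y = np.array([0, 0, 1, 1])
fitted = fit_node(EstimatorContainer(LogisticRegression()), X, y)
_, predictions = predict_node(fitted, X)
print(predictions)  # the two clusters are well separated
```

Passing the container between nodes is what lets stateful estimators flow through a Flojoy graph the way arrays already do.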
References