Depending on the use case, the way you copy a scikit-learn model may differ. In my case, last week I had to train and test estimators against different targets. I work with EMG/EEG data, and I need to estimate finger joint sensor readings. Unlike random forests and neural networks, most other estimators do not support multi-output problems 1. That means that for every single finger joint in my hand, I have to create a separate estimator. I recall I needed to create 10 different estimators for a single hand.
So I wrote some dumb logic:
clf = ModelClassifier(many confusing parameters you do not want to know)
clf_list = list()
for finger in finger_list:
    clf_temp = clf.fit(data)
    clf_list.append(clf_temp)
That logic won’t work because every element inside clf_list is the same object: fit() returns self, so clf_temp is just clf on every iteration. Silly me. So I went with the second option in this post, since I only need to create multiple unfitted estimators with the same parameters.
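Here is a quick check of that claim, with MultinomialNB and a tiny toy dataset standing in for my actual estimator and data:

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf_list = list()
for _ in range(3):
    clf_temp = clf.fit([[0, 1], [1, 0], [1, 1]], [0, 1, 1])  # fit() returns self
    clf_list.append(clf_temp)

print(clf_list[0] is clf_list[1] is clf)  # True: every entry is the same fitted object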
Let’s create a Multinomial Naive Bayes classifier and fit it with the digits dataset:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X, y = datasets.load_digits(return_X_y=True)
X = X / X.max()

# Keep only 10 percent of the dataset, then split it into
# (X_train1, y_train1), (X_train2, y_train2), and (X_test, y_test).
_, X, _, y = train_test_split(X, y, test_size=0.1, random_state=42)
X, X_test, y, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train1, X_train2, y_train1, y_train2 = train_test_split(X, y, test_size=0.25, random_state=42)

clf = MultinomialNB()
clf.fit(X_train1, y_train1)
print('Classifier score: %f' % clf.score(X_test, y_test))
# Classifier score: 0.716667
With the three train_test_split calls, I reduce the dataset to 10 percent of its original size and then split it into three pairs: (X_train1, y_train1), (X_train2, y_train2), and (X_test, y_test). The classifier score after being fitted with (X_train1, y_train1) is 0.716667.
Copy/clone trained scikit-learn estimator
from copy import deepcopy

clf_deepcopy = deepcopy(clf)

# The copy starts with the same learned parameters, so it scores the same.
print('Classifier score: %f' % clf_deepcopy.score(X_test, y_test))
# Classifier score: 0.716667

# Continue training only the copy on the second training pair.
clf_deepcopy.partial_fit(X_train2, y_train2)
print('Classifier score: %f' % clf_deepcopy.score(X_test, y_test))
# Classifier score: 0.800000

# The original estimator is untouched.
print('Classifier score: %f' % clf.score(X_test, y_test))
# Classifier score: 0.716667
clf_deepcopy is a different instance from clf, but it copies clf’s learned parameters (its ‘weights’). Thus, the first score call on clf_deepcopy returns the same score as clf. Then clf_deepcopy is trained again with (X_train2, y_train2) through partial_fit. After that, the score of clf_deepcopy improves while the score of clf stays the same. See reference 2.
Note: partial_fit is not available for all scikit-learn estimators, since not every estimator can learn from new data without retraining on the whole existing dataset.
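If you are not sure whether a given estimator supports incremental learning, one simple check is whether it exposes a partial_fit method at all (MultinomialNB and SVC here are just examples of estimators that do and do not):

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Only estimators that implement incremental learning expose partial_fit.
print(hasattr(MultinomialNB(), 'partial_fit'))  # True
print(hasattr(SVC(), 'partial_fit'))            # False

The scikit-learn documentation on out-of-core learning also lists which estimators implement it.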
Copy/clone clean scikit-learn estimator
from sklearn.base import clone

# clone() copies the estimator's parameters but not its fitted state.
clf_clone = clone(clf)
print('Classifier score: %f' % clf_clone.score(X_test, y_test))
# NotFittedError: This MultinomialNB instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
The clone call creates clf_clone from clf. While the two are different instances, clf_clone is an unfitted copy of clf that only carries over its parameters. When we try to score it, it throws a NotFittedError because, well, clf_clone is not fitted yet. We must train it with (X_train1, y_train1), (X_train2, y_train2), or both, as sketched below. See reference 3.
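For example, a minimal sketch of fitting the clone on both training pairs before scoring it (the exact score will depend on the splits):

import numpy as np

# The clone has to be fitted before it can score anything.
clf_clone.fit(np.vstack([X_train1, X_train2]), np.hstack([y_train1, y_train2]))
print('Classifier score: %f' % clf_clone.score(X_test, y_test))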
Do not do this!
# This is not a copy: clf_new is just another name for the same object.
clf_new = clf
print('Classifier score: %f' % clf_new.score(X_test, y_test))
# Classifier score: 0.716667
print('Classifier score: %f' % clf.score(X_test, y_test))
# Classifier score: 0.716667

# Partially fitting clf_new also changes clf, because they are one object.
clf_new.partial_fit(X_train2, y_train2)
print('Classifier score: %f' % clf_new.score(X_test, y_test))
# Classifier score: 0.800000
print('Classifier score: %f' % clf.score(X_test, y_test))
# Classifier score: 0.800000
The assignment clf_new = clf makes clf_new and clf refer to the same object in memory. They are one instance with two different names. Thus, clf_new and clf report the same score both before and after the partial fit: refitting clf_new silently refits clf as well.
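One way to see that there is only one object involved:

# clf_new and clf are two names bound to the same object.
print(clf_new is clf)          # True
print(id(clf_new) == id(clf))  # True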