Python module to perform under-sampling and over-sampling with various techniques.
Hi,
I am getting an error when loading a trained imblearn.pipeline Pipeline saved with joblib. This is the error message:
```
ModuleNotFoundError: No module named 'imblearn.over_sampling._smote.base'; 'imblearn.over_sampling._smote' is not a package
```
The trained pipeline was saved via `joblib.dump(pipeline, 'filename.joblib')`. Any tips as to where the saving and loading process went wrong?
Hello everyone.
I have a usage question about EasyEnsembleClassifier. I have a dataset with 450,000 rows and 13 columns (12 features, 1 target). My dataset is imbalanced (1:50), so I decided to use EasyEnsembleClassifier. I noticed that all the subsets are exactly the same for all the estimators.
I found this issue which is similar to my problem: scikit-learn-contrib/imbalanced-learn#116
In theory, the classifier should create a subset for each estimator. Each subset should contain all minority-class samples plus the same number of samples drawn from the majority class. In my case I should have roughly 18,000 samples in each subset (I have roughly 9,000 samples in the minority class). However, when I inspect the `estimators_samples_` attribute, the output arrays for my estimators are exactly the same, and all of them have the size of the complete training set (80% of my dataset). So I decided to run a test:
```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.ensemble import EasyEnsembleClassifier

X, y = make_classification(n_classes=2, class_sep=2, weights=[0.3, 0.7],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=10, random_state=1)
clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1)
clf.fit(X, y)
arr = clf.estimators_samples_
arr
```

Output:

```
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]
```
What am I doing wrong here? I am obviously missing something.
```python
In [13]: for est in clf.estimators_:
    ...:     print(est[0].sample_indices_)
[4 6 7 0 9 2]
[4 6 7 5 8 3]
[4 6 7 1 2 5]
[4 6 7 3 1 5]
[4 6 7 3 5 2]
```
So each estimator does draw a different subset; it is just not what `estimators_samples_` is reporting. It might be a bug inherited from `BaggingClassifier` from scikit-learn, then.
The code you provided works fine with my generated dataset, but when I use it on my real dataset, this is what I get:
```python
clf = EasyEnsembleClassifier(n_estimators=5, n_jobs=-1, sampling_strategy=1.0)
clf.fit(X_train, y_train)
for est in clf.estimators_:
    print(est[0].sample_indices_)
```

Output:

```
[279507 240017  23859 ...  94249  87790 120830]
[277730  75855  70104 ... 341432 318980 130029]
[166614    207  72374 ...  93568  76905 142951]
[304630  28272 143132 ... 159062 264981  41332]
[ 35943 358917  68200 ... 121931 209190 284075]
```
Is this a normal result? I would expect the first few indices in each row to be the same, since all of the samples belonging to the minority class should be used in every subset. I am not saying this is wrong; I am just asking whether it is normal.
`RandomOverSampler` will accept such a dataset; having non-numerical data inside will not be an issue.
I'm using a multiclass dataset (CIC-IDS-2017); the target column is categorical (more than 4 classes), and I used `pd.get_dummies` for one-hot encoding. The dataset is very imbalanced, and when I tried to oversample it using SMOTE, it didn't work. I also tried to put the steps into a pipeline, but the pipeline cannot use `get_dummies`, so I replaced it with `OneHotEncoder`; unfortunately, it is still not working:
```python
X = dataset.drop(columns=['Label'])
y = dataset.Label
steps = [('onehot', OneHotEncoder()), ('smt', SMOTE())]
pipeline = Pipeline(steps=steps)
X, y = pipeline.fit_resample(X, y)
```
Does anyone have a suggestion?
This question has to do with the SMOTEBoost implementation found at https://github.com/gkapatai/MaatPy, but I believe the issue is related to the imblearn library. I tried using the library to re-sample all classes in a multiclass problem and got an `AttributeError: 'int' object has no attribute 'flatten'` error:
How to reproduce (in a Colab notebook):

Clone the repo:

```
!git clone https://github.com/gkapatai/MaatPy.git
cd MaatPy/
```

Create dummy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           weights=[.1, .15, .75])
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=.2,
                                                random_state=123)
```

And then:

```python
from maatpy.classifiers import SMOTEBoost

model = SMOTEBoost()
model.fit(xtrain, ytrain)
```
```
/usr/local/lib/python3.7/dist-packages/imblearn/over_sampling/_smote.py in _make_samples(self, X, y_dtype, y_type, nn_data, nn_num, n_samples, step_size)
    106         random_state = check_random_state(self.random_state)
    107         samples_indices = random_state.randint(
--> 108             low=0, high=len(nn_num.flatten()), size=n_samples)
    109         steps = step_size * random_state.uniform(size=n_samples)
    110         rows = np.floor_divide(samples_indices, nn_num.shape[1])

AttributeError: 'int' object has no attribute 'flatten'
```
You just need to do the proper reshaping. I once worked with time-series activity data in which I created chunks of N time-steps. The shape of each input window was (1, 100, 4), so the training sample had shape (n_samples, 1, 100, 4), and it was a five-class, multi-minority problem that I wanted to oversample using SMOTE. The way I went about it was to flatten the input, like so:
```python
# reshape (flatten) Train_X for SMOTE resampling
nsamples, k, nx, ny = Train_X.shape
Train_X = Train_X.reshape((nsamples, k * nx * ny))
smote = SMOTE(sampling_strategy='not majority', random_state=42, k_neighbors=5)
X_resample, Y_resample = smote.fit_resample(Train_X, Train_Y)
```
And then reshape the instances back to the original input shape, like so:
```python
# reshape the resampled data back to the CNN input shape
X_resample = X_resample.reshape((len(X_resample), k, nx, ny))
```