Computer Vision News - May 2021

Computer Vision Tool

Resample the Dataset

From the initial class distribution it is clear that the dataset is strongly imbalanced.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

print('Distribution of the Classes in the dataset')
print(new_train['Finding Labels'].value_counts() / len(new_train))

labels_train = pd.get_dummies(new_train, columns=["Finding Labels"])
labels = labels_train[labels_train.columns[-3:]]
y = np.array(labels)
print("Y original:", y.shape, np.sum(y, axis=0))

# Display barplot for class distributions
sns.barplot(x=options, y=np.sum(y, axis=0))
plt.title('Class Distributions', fontsize=14)
plt.show()

The first thing to try in such cases is to operate directly on the original data, by increasing the number of cases in the under-represented class. When this is not possible, two families of techniques remain: 1) over-sampling (adding copies of instances of the under-represented class) and 2) under-sampling (deleting instances of the more frequent classes).

Below we show how to perform random under-sampling and over-sampling, plus two other common techniques: the Tomek links algorithm, which removes instances that lie on the boundary between classes or are considered noisy, and SMOTE, which creates synthetic samples of the minority class rather than mere copies. Note that the samplers expect a 2-D feature matrix, so the images are flattened with reshape before resampling.

from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Random under-sampling
undersample = RandomUnderSampler()
X_under, y_under = undersample.fit_resample(X.reshape(433, 150 * 150 * 3), y)

# Tomek links under-sampling
undersample = TomekLinks()
X_under, y_under = undersample.fit_resample(X.reshape(433, 150 * 150 * 3), y)

# Random over-sampling of the minority class
oversample = RandomOverSampler(sampling_strategy='minority')
X_over, y_over = oversample.fit_resample(X.reshape(433, 150 * 150 * 3), y)

# SMOTE over-sampling
oversample = SMOTE()
X_over, y_over = oversample.fit_resample(X.reshape(433, 150 * 150 * 3), y)

The corresponding bar plots for each case can be produced with the seaborn call shown above.
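To see what random over-sampling does without any dataset at hand, the core idea can be sketched in plain NumPy: duplicate randomly chosen minority-class rows until every class matches the majority count. This is a minimal illustration of the principle, not the imbalanced-learn implementation; the function name and the toy data are ours.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    # Naive sketch of random over-sampling: duplicate randomly
    # chosen rows of each minority class until all classes reach
    # the majority-class count.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for c, n in zip(classes, counts):
        idx = np.where(y == c)[0]
        # Sample (with replacement) the extra rows this class needs
        extra = rng.choice(idx, size=target - n, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Toy imbalanced dataset: 10 samples of class 0, 3 of class 1
X = np.arange(13, dtype=float).reshape(13, 1)
y = np.array([0] * 10 + [1] * 3)

X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))  # both classes now have 10 samples
```

Under-sampling is the mirror image: instead of duplicating minority rows, majority rows are randomly dropped down to the minority count, at the cost of discarding data.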
