Interview question at J.P. Morgan

How do you deal with unbalanced data in classification?

Interview answer

Anonymous user

August 8, 2020

From a data perspective, oversampling and undersampling are the techniques that can be used. If the majority class has a lot of data (say 10 million samples), undersampling can be used, but it generally carries a risk of losing information. It is therefore preferable to use oversampling algorithms such as SMOTE, which increase the number of minority-class samples by generating synthetic ones. From an algorithm perspective, one should refrain from using Random Forest and neural network techniques and stick to techniques like SVM. If the data is extremely unbalanced, with a class ratio of say 1:100, choose anomaly detection techniques such as one-class SVM.
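To make the two resampling ideas above concrete, here is a rough sketch in plain Python. It is not the reference SMOTE implementation: the function names `smote_like_oversample` and `random_undersample`, the toy data, and the choice of `k=3` are all assumptions made for illustration. It only shows the core idea of interpolating between a minority sample and one of its nearest minority neighbours, next to plain random undersampling of the majority class.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Hypothetical helper for illustration, not a library function."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class (this discards information).
    Hypothetical helper for illustration, not a library function."""
    rng = random.Random(seed)
    return rng.sample(majority, min(n_keep, len(majority)))


# Toy data (made up): 4 minority points and 20 majority points in 2-D.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
majority = [(float(x), float(y)) for x in range(5, 10) for y in range(5, 9)]

# Option 1: grow the minority class to the size of the majority class.
new_points = smote_like_oversample(minority, n_new=16)
balanced_minority = minority + new_points

# Option 2: shrink the majority class instead.
reduced_majority = random_undersample(majority, n_keep=10)

print(len(balanced_minority), len(majority), len(reduced_majority))  # prints: 20 20 10
```

In practice one would more likely reach for an existing library implementation than hand-roll this. Resampling should be applied to the training split only, so that the evaluation data keeps the original class ratio.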