Using GridSearchCV to Find Optimal Parameters for Scikit-Learn Classifiers
Sometimes a model's accuracy falls well short of the target. The cause is not always a poor dataset or weak preprocessing; the choice of classifier parameters can also be responsible. In Scikit-Learn you can use GridSearchCV to search for the best parameters for the classifier you want to use. The search is brute force: it evaluates every combination of the candidate values you supply and reports which combination scores best.
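To make "brute force" concrete, here is a small sketch in plain Python (no Scikit-Learn needed) that expands a parameter grid into every combination. The grid values are the ones searched later in this tutorial; the helper name expand_grid is made up for illustration and is not part of Scikit-Learn.

```python
from itertools import product

# Candidate values per parameter (the same grid this tutorial searches later).
param_grid = {
    "vect__max_df": (0.75, 1.0),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (1e-05, 1e-06),
}


def expand_grid(grid):
    """Return every combination of the candidate values as a list of dicts.

    Hypothetical helper for illustration only.
    """
    names = sorted(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


candidates = expand_grid(param_grid)
n_folds = 10  # each candidate is fitted once per cross-validation fold
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Two values for each of three parameters give 8 candidates, and with 10-fold cross-validation that is 80 model fits, which is why the search time grows quickly as the grid gets larger.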
For the details, let's walk through the tutorial below :D.
Preparation
The computer specification and software needed for this tutorial are:
- 4 GB of RAM or more
- Intel Core i3 with four cores
- 4 GB of swap (if you use Linux)
- Python
- Scikit-Learn
- Scipy
- Numpy
- The 20 newsgroups dataset
Preparing the example classifier code
First create a file named gridsearchcv-demo.py, then put the following code in it:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.grid_search import GridSearchCV

    import json
    import datetime

    # prepare the dataset
    dataset_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

    # set up the classifier
    clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', SGDClassifier())
    ])

    # candidate values to search; keys are <pipeline step>__<parameter>
    params = {
        'vect__max_df': (0.75, 1.0),
        #'vect__max_features': (None, 5000, 10000, 50000),
        # 'vect__ngram_range': ((1, 1), (1, 2)),
        #'tfidf__use_idf': (True, False),
        'tfidf__norm': ('l1', 'l2'),
        'clf__alpha': (0.00001, 0.000001),
        # 'clf__penalty': ('l2', 'elasticnet'),
        #'clf__n_iter': (10, 50, 80),
    }

    grid = GridSearchCV(
        clf,
        params,
        n_jobs=4,
        cv=10,
        verbose=4,
    )

    clf = grid.fit(dataset_train.data, dataset_train.target)

    print "\nBest estmator:"
    print
    print clf.best_estimator_

    print "\nGrid score:"
    print
    for params, mean_score, scores in clf.grid_scores_:
        print "%0.3f (+/-%0.03f) for %r" % (mean_score, scores.std() / 2, params)
    print

First we prepare the training data, taken from the 20 newsgroups dataset. Then we set up a pipeline with default settings made up of CountVectorizer, TfidfTransformer, and SGDClassifier. Next we define the parameter combinations we want to test on the pipeline to find the best result; in this tutorial we examine max_df, norm, and alpha. Finally we create a GridSearchCV instance that receives the pipeline, the parameters to search, n_jobs=4 (four parallel jobs), cv=10 (10-fold cross-validation), and verbose=4 for detailed console output.

The listing is written for Python 2 and an older scikit-learn release, which matches the output shown later. In scikit-learn 0.18 and newer, GridSearchCV is imported from sklearn.model_selection, and grid_scores_ was later replaced by cv_results_.
After that we feed the dataset into GridSearchCV via fit(), and once the parameter search has finished the script prints a report.
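For readers on Python 3 and a recent scikit-learn, the same flow might look like the sketch below. It is not the author's script: it assumes GridSearchCV is imported from sklearn.model_selection and reads results from cv_results_, and it swaps 20 newsgroups for a tiny hypothetical corpus (docs, labels) so it runs offline in a moment. The cv=3 and n_jobs=1 settings are chosen only to suit that toy data; the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for 20 newsgroups: 0 = space, 1 = cooking.
docs = [
    "rocket launch orbit satellite",
    "astronaut station orbit mission",
    "telescope galaxy star planet",
    "moon lander rocket fuel",
    "comet asteroid telescope sky",
    "planet mars rover mission",
    "recipe flour sugar oven",
    "bake bread yeast flour",
    "soup onion garlic simmer",
    "grill steak pepper salt",
    "pasta sauce tomato basil",
    "salad onion tomato pepper",
]
labels = [0] * 6 + [1] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)),
])

# Same three parameters as the tutorial; keys are <step name>__<parameter>.
params = {
    "vect__max_df": (0.75, 1.0),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (1e-05, 1e-06),
}

# cv=3 and n_jobs=1 suit the toy corpus; the tutorial uses cv=10 and n_jobs=4.
grid = GridSearchCV(pipeline, params, cv=3, n_jobs=1)
grid.fit(docs, labels)

print("Best parameters:", grid.best_params_)
results = grid.cv_results_
for mean, std, candidate in zip(
    results["mean_test_score"], results["std_test_score"], results["params"]
):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std / 2, candidate))
```

To run it on the real data, replace docs and labels with dataset_train.data and dataset_train.target from fetch_20newsgroups and restore cv=10, n_jobs=4, and verbose=4.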
Getting Started with GridSearchCV
Now let's run the script. The run took about 6 minutes.

Once GridSearchCV has finished, we can pick the best parameters. In the grid scores at the end of the output below, the combination {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05} has the highest score, 0.921, so that is the one to pass into the pipeline we built: max_df=1.0 to CountVectorizer(), norm='l2' to TfidfTransformer(), and alpha=1e-05 to SGDClassifier(). Tuning this way can give noticeably better accuracy than leaving the parameters unsearched; in this run the candidates range from 0.888 to 0.921.

The output printed during the parameter search is:

    $ python gridsearchcv-demo.py
    Fitting 10 folds for each of 8 candidates, totalling 80 fits
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888596 - 13.4s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.890158 - 14.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.886643 - 15.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.901408 - 16.2s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.894876 - 14.6s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.877768 - 15.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.889184 - 14.8s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888099 - 13.7s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.887011 - 15.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=0.75, score=0.888691 - 13.6s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886842 - 16.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.891916 - 15.2s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.894552 - 13.3s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.893486 - 14.2s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.877768 - 14.4s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886042 - 15.3s
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.887411 - 13.9s
    [Parallel(n_jobs=4)]: Done 17 tasks | elapsed: 1.2min
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.886323 - 14.3s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.882562 - 13.7s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l1, vect__max_df=1.0, score=0.888691 - 14.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.907895 - 15.4s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.913884 - 14.3s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.920035 - 13.9s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.930458 - 14.7s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.904340 - 14.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.939929 - 15.4s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.921099 - 15.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.909414 - 16.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.919858 - 14.6s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=0.75, score=0.922598 - 17.3s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.911404 - 15.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.925308 - 14.6s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.918278 - 11.8s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.923415 - 13.1s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.936396 - 13.0s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.908769 - 12.5s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.921099 - 12.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.922735 - 12.7s
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.920819 - 12.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-05, tfidf__norm=l2, vect__max_df=1.0, score=0.918967 - 12.2s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.912281 - 14.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.915641 - 13.2s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.914763 - 13.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.929577 - 13.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.931095 - 12.1s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.906997 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.921099 - 12.0s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.917407 - 12.2s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.923488 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=0.75, score=0.913624 - 11.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.907895 - 11.7s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.921793 - 12.1s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.907733 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.926056 - 12.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.929329 - 12.1s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.902569 - 11.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.909574 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.912966 - 12.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.925267 - 11.7s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l1, vect__max_df=1.0, score=0.911843 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.890351 - 11.6s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.901582 - 11.6s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.903339 - 12.2s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.916373 - 11.9s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.921378 - 11.0s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.885740 - 11.2s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75 .............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.898936 - 11.9s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.899645 - 11.9s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.903025 - 11.4s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=0.75, score=0.910062 - 11.8s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.895614 - 11.9s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.898946 - 11.7s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.905097 - 11.6s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.919014 - 11.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.917845 - 11.6s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.888397 - 11.8s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0 ..............
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.901596 - 11.3s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.894316 - 10.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.903915 - 9.5s
    [CV] clf__alpha=1e-06, tfidf__norm=l2, vect__max_df=1.0, score=0.910062 - 9.6s
    [Parallel(n_jobs=4)]: Done 80 out of 80 | elapsed: 4.5min finished

    Best estmator:

    Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None, st... penalty='l2', power_t=0.5, random_state=None, shuffle=True, verbose=0, warm_start=False))])

    Grid score:

    0.889 (+/-0.003) for {'vect__max_df': 0.75, 'tfidf__norm': 'l1', 'clf__alpha': 1e-05}
    0.888 (+/-0.002) for {'vect__max_df': 1.0, 'tfidf__norm': 'l1', 'clf__alpha': 1e-05}
    0.919 (+/-0.005) for {'vect__max_df': 0.75, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05}
    0.921 (+/-0.004) for {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-05}
    0.919 (+/-0.004) for {'vect__max_df': 0.75, 'tfidf__norm': 'l1', 'clf__alpha': 1e-06}
    0.916 (+/-0.004) for {'vect__max_df': 1.0, 'tfidf__norm': 'l1', 'clf__alpha': 1e-06}
    0.903 (+/-0.005) for {'vect__max_df': 0.75, 'tfidf__norm': 'l2', 'clf__alpha': 1e-06}
    0.903 (+/-0.005) for {'vect__max_df': 1.0, 'tfidf__norm': 'l2', 'clf__alpha': 1e-06}
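Picking the winner from the grid scores, and working out which value goes to which pipeline step, can be sketched in plain Python as follows. The scores are copied from the "Grid score" report above; the helper names best_candidate and split_by_step are made up for this illustration and are not Scikit-Learn functions.

```python
# Mean cross-validation scores copied from the "Grid score" report above.
grid_scores = [
    (0.889, {"vect__max_df": 0.75, "tfidf__norm": "l1", "clf__alpha": 1e-05}),
    (0.888, {"vect__max_df": 1.0, "tfidf__norm": "l1", "clf__alpha": 1e-05}),
    (0.919, {"vect__max_df": 0.75, "tfidf__norm": "l2", "clf__alpha": 1e-05}),
    (0.921, {"vect__max_df": 1.0, "tfidf__norm": "l2", "clf__alpha": 1e-05}),
    (0.919, {"vect__max_df": 0.75, "tfidf__norm": "l1", "clf__alpha": 1e-06}),
    (0.916, {"vect__max_df": 1.0, "tfidf__norm": "l1", "clf__alpha": 1e-06}),
    (0.903, {"vect__max_df": 0.75, "tfidf__norm": "l2", "clf__alpha": 1e-06}),
    (0.903, {"vect__max_df": 1.0, "tfidf__norm": "l2", "clf__alpha": 1e-06}),
]


def best_candidate(scores):
    """Return the (score, params) pair with the highest mean score."""
    return max(scores, key=lambda item: item[0])


def split_by_step(params):
    """Group 'step__name' keys into {step: {name: value}} keyword arguments."""
    per_step = {}
    for key, value in params.items():
        step, _, name = key.partition("__")
        per_step.setdefault(step, {})[name] = value
    return per_step


best_score, best_params = best_candidate(grid_scores)
step_kwargs = split_by_step(best_params)

print(best_score)
print(step_kwargs)
```

The grouped result maps vect to max_df=1.0, tfidf to norm='l2', and clf to alpha=1e-05, which are the keyword arguments to give CountVectorizer, TfidfTransformer, and SGDClassifier respectively. The fitted search object also exposes the winning pipeline directly as best_estimator_, as the script's report shows.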
If you have more powerful hardware, you can uncomment all of the commented-out entries in params. The search will then try many more combinations, which may turn up a setting with higher accuracy.
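Before uncommenting everything, it helps to estimate how large the search becomes. The sketch below counts the candidates for the full grid and scales the timing from the run above (80 fits in about 4.5 minutes with n_jobs=4). The time figure is a rough assumption: it treats every fit as equally expensive, which is optimistic, since bigrams and more iterations make individual fits slower.

```python
from math import prod

# The params dict from the script with every commented-out entry enabled.
full_grid = {
    "vect__max_df": (0.75, 1.0),
    "vect__max_features": (None, 5000, 10000, 50000),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
    "clf__n_iter": (10, 50, 80),
}

n_folds = 10
n_candidates = prod(len(values) for values in full_grid.values())
total_fits = n_candidates * n_folds

# Rough scaling from the run above: 80 fits took about 4.5 minutes.
minutes_per_fit = 4.5 / 80
estimated_hours = total_fits * minutes_per_fit / 60

print(n_candidates, "candidates,", total_fits, "fits")
print("roughly %.1f hours at the same pace" % estimated_hours)
```

The full grid has 768 candidates, or 7680 fits with 10 folds, so on the same machine the search would run for hours rather than minutes. Enabling the extra parameters a few at a time is a cheaper way to explore.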