Example for XGBoost

Mahitha Singirikonda
4 min read · Nov 6, 2020

XGBoost (eXtreme Gradient Boosting) is a very powerful supervised machine learning algorithm that can reach high accuracy when tuned through its wide range of parameters. XGBoost works on parallel tree boosting, which predicts the target by combining the results of multiple weak models. The XGBoost library implements the gradient boosting decision tree algorithm. Let us explore it with an example.

Here we are using the Heart Disease UCI data set from Kaggle. We are trying to predict the likelihood of heart disease and to find which feature matters most for that prediction. Here is the link for the data.

First let us import pandas to read the data using the following line of code.

import pandas as pd

Now let us read the data into a DataFrame.

data = pd.read_csv('../input/heart-disease-uci/heart.csv')
data

This reads the data into a DataFrame called data and displays the following output.
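
If you only want a quick glance rather than the full table, the first few rows can also be previewed with head() (a small sketch, not part of the original walkthrough):

# Preview the first five rows and the overall shape of the DataFrame
print(data.head())
print(data.shape)   # (303, 14): 303 patients, 13 features plus the target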

Now let us look at the info of the data to explore it further, using the following code.

data.info()

This will give the following information about the data. We can see the data types, column names, null-value counts, etc.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -------
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trestbps  303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalach   303 non-null    int64
 8   exang     303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64
 11  ca        303 non-null    int64
 12  thal      303 non-null    int64
 13  target    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

Let us see another view of the data through the following code.

data.describe()

This will give a statistical description of the DataFrame in the following way.

The description gives the count, mean, std, min, 25%, 50%, 75% and max of every numeric column, which helps us understand the data further. Now let us import the numpy, xgboost and sklearn.metrics libraries. We need mean squared error for the regression part.

import xgboost as xgb
from sklearn.metrics import mean_squared_error
import numpy as np

Next we need to assign the X and Y values. Here we separate the target variable from the rest of the features using .iloc to subset the data.

X, Y = data.iloc[:,:-1],data.iloc[:,-1]

Now we convert the dataset into DMatrix, an optimized data structure that XGBoost supports internally and that gives it much of its acclaimed performance and efficiency.

data_dmatrix = xgb.DMatrix(data=X,label=Y)
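
The scikit-learn style wrapper used below does not require the DMatrix, but the DMatrix can be passed directly to xgb.cv for k-fold cross-validation. Here is a minimal sketch under that assumption; the params dict simply mirrors the hyper-parameters chosen for the regressor later, and nfold=3 is an arbitrary choice:

# 3-fold cross-validation on the DMatrix built above
# ('reg:linear' was later renamed 'reg:squarederror' in newer XGBoost releases)
params = {'objective': 'reg:linear', 'colsample_bytree': 0.3,
          'learning_rate': 0.1, 'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=10, metrics='rmse',
                    as_pandas=True, seed=123)
print(cv_results.tail())   # train/test RMSE for the last boosting rounds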

Now, we will create the train and test sets for validating the results, using the train_test_split function from sklearn's model_selection module with test_size equal to 20% of the data. To keep the results reproducible, a random_state is also assigned.

from sklearn.model_selection import train_test_split 
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=123)

The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library with the hyper-parameters passed as arguments.

xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3,
                          learning_rate=0.1, max_depth=5, alpha=10,
                          n_estimators=10)

Now let us fit the regressor to the training set using the .fit() method.

xg_reg.fit(X_train,Y_train)

The output after fitting shows the full set of hyperparameters, as follows.

XGBRegressor(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.3, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=5,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=10, n_jobs=0, num_parallel_tree=1,
             objective='reg:linear', random_state=0, reg_alpha=10,
             reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

Let us generate predictions from the model on the test set using the .predict() method.

preds = xg_reg.predict(X_test) 
preds

The output is as follows:

array([0.62361383, 0.436409  , 0.4638252 , 0.4234442 , 0.5576368 ,
       0.36139676, 0.6453048 , 0.48385164, 0.61654013, 0.53615165,
       0.5271714 , 0.5342451 , 0.36847046, 0.36139676, 0.5126163 ,
       0.42248273, 0.47246233, 0.52830946, 0.4810629 , 0.5204763 ,
       0.590331  , 0.47579026, 0.6621046 , 0.39579174, 0.49821952,
       0.53383183, 0.4865605 , 0.5409056 , 0.43799824, 0.53313845,
       0.4234442 , 0.5955741 , 0.5409056 , 0.45628968, 0.5528887 ,
       0.60049963, 0.61878484, 0.6078624 , 0.62996906, 0.59981364,
       0.64041364, 0.55104494, 0.5254018 , 0.62684685, 0.55517054,
       0.48204178, 0.36139676, 0.44480985, 0.60275817, 0.6277096 ,
       0.44678292, 0.42796794, 0.6621046 , 0.5477844 , 0.49884164,
       0.62160766, 0.55104494, 0.59651303, 0.5327918 , 0.55396104,
       0.5179344 ], dtype=float32)
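
These are continuous scores because we fitted a regressor on a binary target. If hard 0/1 class labels were needed, the scores could be thresholded at 0.5 and scored for accuracy; a small sketch, not part of the original post:

from sklearn.metrics import accuracy_score

# Convert the regression scores to 0/1 labels at a 0.5 threshold
class_preds = (preds > 0.5).astype(int)
print("Accuracy: %.3f" % accuracy_score(Y_test, class_preds))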

Let us calculate the RMSE by using the mean_squared_error function from sklearn's metrics module.

rmse = np.sqrt(mean_squared_error(Y_test, preds)) 
print("RMSE: %f" % (rmse))

The output will be as follows.

RMSE: 0.449886

We can see that the RMSE for the heart disease prediction came out to about 0.45, which is the typical error the regression model makes on the 0/1 target. To analyse which feature is the most important factor we switch to classification, and XGBClassifier() is used for that. Let us import the required libraries for this task.

from numpy import loadtxt 
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

Let us instantiate the classifier with its default parameters using the following line.

model = XGBClassifier()

Now we fit the classifier, this time on the full data set.

model.fit(X, Y)

The above gives the following output with hyperparameters.

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

Here we plot the feature importances of the fitted model using this code.

plot_importance(model)  
pyplot.show()

The output is a bar chart that shows the features in order of importance.
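
If you prefer the same information as numbers rather than a plot, the fitted classifier exposes feature_importances_, which can be paired with the column names; a small sketch, not part of the original post (note that plot_importance uses the 'weight' importance type by default, so its ordering may differ slightly from these gain-based scores):

# Pair each feature name with its importance score and sort in descending order
importances = sorted(zip(X.columns, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print("%-10s %.4f" % (name, score))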

Conclusion

We have built an XGBoost model for predicting the likelihood of heart disease, and the regression model has an RMSE of about 0.45. We have also found that the most important factor in detecting heart disease is cholesterol.

Originally published at https://www.numpyninja.com on November 6, 2020.
