# Example for XGBoost

XGBoost is a powerful machine learning algorithm for supervised learning that can reach high accuracy when its wide range of parameters is tuned well. XGBoost stands for eXtreme Gradient Boosting. It uses parallel tree boosting, which predicts the target by combining the results of multiple weak models. The XGBoost library implements the gradient boosting decision tree algorithm. Let us explore more using an example.
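To make the idea of combining weak models concrete, here is a toy sketch of boosting for squared error in plain Python. It is not how XGBoost is implemented: the helper names (`fit_stump`, `boost`) and the tiny data set are made up for illustration, and each weak model is a single-split "stump" fitted to the residuals of the models before it.

```python
# Toy gradient boosting for squared error with one-split "stumps" as weak models.
# Illustrative only: XGBoost adds regularization, second-order gradients,
# column subsampling and much more.

def fit_stump(xs, residuals):
    """Find the threshold on one feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=10, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical one-feature data with a 0/1 target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

base, stumps, preds = boost(xs, ys, n_rounds=10, learning_rate=0.1)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, preds)
print(f"MSE with the mean only: {mse_before:.3f}")
print(f"MSE after 10 boosting rounds: {mse_after:.3f}")
```

Each stump on its own is a poor predictor, but every round removes a little more of the remaining error. The `n_rounds` and `learning_rate` arguments play the same role as `n_estimators` and `learning_rate` in XGBoost.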

Here we use the Heart Disease UCI data set from Kaggle. We try to predict the likelihood of heart disease and to find out which features matter most for that prediction.

First let us import pandas to read the data using the following line of code.

`import pandas as pd`

Now let us read the data into a dataframe.

`data = pd.read_csv('../input/heart-disease-uci/heart.csv')`

`data`

This reads the data into a dataframe called `data` and displays it.

Now let us look at a summary of the dataframe using the following code.

`data.info()`

This will give the following information about the data. We can see data types, columns, null values etc.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 303 entries, 0 to 302
    Data columns (total 14 columns):
     #   Column    Non-Null Count  Dtype
    ---  ------    --------------  -----
     0   age       303 non-null    int64
     1   sex       303 non-null    int64
     2   cp        303 non-null    int64
     3   trestbps  303 non-null    int64
     4   chol      303 non-null    int64
     5   fbs       303 non-null    int64
     6   restecg   303 non-null    int64
     7   thalach   303 non-null    int64
     8   exang     303 non-null    int64
     9   oldpeak   303 non-null    float64
     10  slope     303 non-null    int64
     11  ca        303 non-null    int64
     12  thal      303 non-null    int64
     13  target    303 non-null    int64
    dtypes: float64(1), int64(13)
    memory usage: 33.3 KB

Let us get summary statistics for each column with the following code.

`data.describe()`

This gives the count, mean, std, min, 25%, 50%, 75% and max of every column, which helps us understand the data further. Now let us import the xgboost, sklearn.metrics and numpy libraries. We need the mean squared error to evaluate the regression.

`import xgboost as xgb`

`from sklearn.metrics import mean_squared_error`

`import numpy as np`

Next we need to assign the X and Y values. Here we separate the target variable from the rest of the variables, using .iloc to subset the data: X takes every column except the last, and Y takes the last column.

`X, Y = data.iloc[:, :-1], data.iloc[:, -1]`
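As a quick illustration of what the two `.iloc` slices select, here is a sketch on a tiny made-up dataframe. The values and the names `toy`, `X_toy` and `Y_toy` are hypothetical, not taken from the heart disease data.

```python
import pandas as pd

# Hypothetical three-row frame; the real data set has 13 feature columns plus "target".
toy = pd.DataFrame({
    "age": [50, 60, 40],
    "chol": [200, 240, 180],
    "target": [0, 1, 0],
})

X_toy = toy.iloc[:, :-1]  # every row, all columns except the last
Y_toy = toy.iloc[:, -1]   # every row, last column only

print(list(X_toy.columns))  # ['age', 'chol']
print(Y_toy.name)           # target

# Selecting by name gives the same features and does not depend on column order.
X_by_name = toy.drop(columns="target")
print(X_by_name.equals(X_toy))  # True
```

Positional slicing works here because the target is the last column. Dropping the column by name is a safer choice when the column order is not guaranteed.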

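Now we convert the dataset into DMatrix, an optimized data structure that XGBoost supports and that contributes to its performance and efficiency. The scikit-learn style estimators used later build this structure internally, so `data_dmatrix` is only needed when calling XGBoost's native API, such as `xgb.train` or `xgb.cv`.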

`data_dmatrix = xgb.DMatrix(data=X,label=Y)`

Now we create the training and test sets used to evaluate the results. This is a single hold-out split, made with the train_test_split function from sklearn's model_selection module, with test_size equal to 20% of the data. To keep the results reproducible, a random_state is also assigned.

`from sklearn.model_selection import train_test_split`

`X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=123)`
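To show what such a split does, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation: `holdout_split` is a hypothetical helper, and a different random generator means it will not pick the same rows as `train_test_split`.

```python
import math
import random


def holdout_split(n_rows, test_size=0.2, seed=123):
    """Shuffle the row indices and cut them into a train part and a test part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)    # fixed seed -> same shuffle every run
    n_test = math.ceil(n_rows * test_size)  # assumption: round the test share up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(303)
print(len(train_idx), len(test_idx))  # 242 61

# The same seed reproduces the same split, which is what random_state is for.
print(holdout_split(303) == (train_idx, test_idx))  # True
```

With 303 rows and a 20% test share, the test part holds 61 rows, which matches the number of predictions printed later in this example.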

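The next step is to instantiate an XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library with the hyper-parameters passed as arguments. Note that newer XGBoost releases deprecate the objective name 'reg:linear' in favour of 'reg:squarederror', which is the same squared-error objective, so you may need to use that name on a recent version.

`xg_reg = xgb.XGBRegressor(objective='reg:linear', colsample_bytree=0.3, learning_rate=0.1, max_depth=5, alpha=10, n_estimators=10)`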

Now let us fit the regressor to the training set using the .fit() method.

`xg_reg.fit(X_train,Y_train)`

The output after fitting lists the model's hyperparameters as follows.

`XGBRegressor(alpha=10, base_score=0.5, booster='gbtree',colsample_bylevel=1,colsample_bynode=1, colsample_bytree=0.3, gamma=0, gpu_id=-1,importance_type='gain', interaction_constraints='',learning_rate=0.1, max_delta_step=0, max_depth=5,min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=10, n_jobs=0, num_parallel_tree=1,objective='reg:linear', random_state=0, reg_alpha=10, reg_lambda=1,scale_pos_weight=1, subsample=1, tree_method='exact',validate_parameters=1, verbosity=None)`

Let us generate predictions for the test set using the .predict() method.

`preds = xg_reg.predict(X_test)`

`preds`

The output is as follows:

`array([0.62361383, 0.436409 , 0.4638252 , 0.4234442 , 0.5576368 , 0.36139676, 0.6453048 , 0.48385164, 0.61654013, 0.53615165, 0.5271714 , 0.5342451 , 0.36847046, 0.36139676, 0.5126163 , 0.42248273, 0.47246233, 0.52830946, 0.4810629 , 0.5204763 , 0.590331 , 0.47579026, 0.6621046 , 0.39579174, 0.49821952, 0.53383183, 0.4865605 , 0.5409056 , 0.43799824, 0.53313845, 0.4234442 , 0.5955741 , 0.5409056 , 0.45628968, 0.5528887 , 0.60049963, 0.61878484, 0.6078624 , 0.62996906, 0.59981364, 0.64041364, 0.55104494, 0.5254018 , 0.62684685, 0.55517054, 0.48204178, 0.36139676, 0.44480985, 0.60275817, 0.6277096 , 0.44678292, 0.42796794, 0.6621046 , 0.5477844 , 0.49884164, 0.62160766, 0.55104494, 0.59651303, 0.5327918 , 0.55396104, 0.5179344 ], dtype=float32)`
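The values above are continuous scores rather than class labels, because the model is a regressor. One common way to read them as a yes/no prediction is to apply a cut-off. The sketch below uses the first five scores (rounded) together with made-up true labels; the 0.5 threshold and the helpers `to_labels` and `accuracy` are assumptions for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Map each score to 1 if it reaches the threshold, otherwise 0."""
    return [1 if s >= threshold else 0 for s in scores]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


scores = [0.62, 0.44, 0.46, 0.42, 0.56]  # first five predictions above, rounded
true_labels = [1, 0, 1, 0, 1]            # hypothetical labels, not from the data set

predicted_labels = to_labels(scores)
print(predicted_labels)                         # [1, 0, 0, 0, 1]
print(accuracy(true_labels, predicted_labels))  # 0.8
```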

Let us calculate the RMSE using the mean_squared_error function from sklearn's metrics module.

`rmse = np.sqrt(mean_squared_error(Y_test, preds))`

`print("RMSE: %f" % (rmse))`

The output will be as follows.

`RMSE: 0.449886`
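To put this number in context, it helps to compute RMSE by hand for a trivial baseline. For a 0/1 target, always predicting 0.5 gives an RMSE of exactly 0.5, because every error is 0.5. The sketch below uses made-up labels and a hypothetical `rmse` helper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


labels = [0, 1, 1, 0, 1, 0]           # hypothetical 0/1 targets
constant_guess = [0.5] * len(labels)  # always predict 0.5

baseline_rmse = rmse(labels, constant_guess)
print(baseline_rmse)  # 0.5
```

Seen against that baseline, an RMSE of about 0.45 is a modest improvement, which is plausible for a model limited to 10 small trees.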

The RMSE of the heart disease predictions is about 0.45, so the predicted scores deviate from the true 0/1 labels by roughly that much. To analyse which features are most important, we treat the task as classification and use XGBClassifier. Let us import the required libraries for this task.

`from numpy import loadtxt`

`from xgboost import XGBClassifier`

`from xgboost import plot_importance`

`from matplotlib import pyplot`

Let us instantiate the classifier with its default settings using the following line.

`model = XGBClassifier()`

Now we fit the model on the full data set.

`model.fit(X, Y)`

The above gives the following output with hyperparameters.

`XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,gamma=0,gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1,missing=nan,monotone_constraints='()', n_estimators=100,n_jobs=0,num_parallel_tree=1,random_state=0, reg_alpha=0, reg_lambda=1,scale_pos_weight=1,subsample=1, tree_method='exact',validate_parameters=1,verbosity=None)`

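Now we plot the feature importances of the model using this code.

`plot_importance(model)`

`pyplot.show()`

The plot ranks the features in order of importance. By default, plot_importance uses the "weight" measure, which counts how many times a feature is used to split the data across all trees.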

**Conclusion**

We have built an XGBoost model for predicting the likelihood of heart disease, and the model has an RMSE of 0.45. The importance plot also indicates that cholesterol is the most important factor in predicting heart disease.

*Originally published at **https://www.numpyninja.com** on November 6, 2020.*