Machine Learning — Introduction to Modeling #2

Ufuk Çolak · Published in Nerd For Tech · 7 min read · May 1, 2021

In the previous article, I gave a general introduction to Machine Learning: what data is, where it is used, and the most frequently used concepts. In this article, I will continue with these concepts and then examine model validation methods.

Model Training

When developing machine learning models, the trained model should perform well on new, unseen data. To simulate such data, we divide our existing data into two subsets: a training set and a test set. The first, larger subset is used for training (for example, 80% of the original data), and the second, smaller subset is used for testing (the remaining 20%).

For example, we give data to an algorithm, and through this data the algorithm learns a structure (as in a house price prediction model). In other words, it learns the effects of the factors and the directions of those effects. We split the data set in two to test the success of the model we have trained. Say we have a data set of 1000 observations: we allocate 800 for training and use the remaining 200 to test how well the model has learned.
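As a concrete sketch (scikit-learn and the synthetic X and y below are my own illustration, not part of the original example), an 80/20 split of 1000 observations looks like this:

```python
# A minimal sketch of the 80/20 train/test split described above.
# X and y are synthetic placeholders for real features and targets.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # 1000 observations, 5 features
y = rng.normal(size=1000)       # target variable

# 80% (800 rows) for training, 20% (200 rows) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (800, 5) (200, 5)
```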

Variable Selection

During a modeling study we may have 5, 10, or even 100 independent variables, depending on the size of our data set. We try to estimate the dependent variable Y with these independent variables. We do not try to keep every variable in the model; the aim is to achieve the greatest explanatory power with the fewest variables.
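As one possible illustration (univariate selection with scikit-learn is my choice here; the article does not prescribe a specific technique), we can keep only the few most explanatory variables out of a large pool:

```python
# A sketch of variable selection: keep the few features that explain
# the target best, out of a large pool of candidates.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 100 candidate variables, only 5 of which actually drive y.
X, y = make_regression(n_samples=200, n_features=100,
                       n_informative=5, random_state=0)

# Keep the 5 variables with the strongest univariate relation to y.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (200, 5)
```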

Model Selection

Two approaches come to the fore:

  • Choosing the best model among the models built from the possible combinations of variables.
  • Choosing a model from among different models that have already been fitted.

How to Choose the Model?

For regression problems, an explanation ratio such as R² and an error metric derived from RMSE (root mean squared error) are used.
For classification problems, metrics derived from the correct classification rate (accuracy) are used to evaluate model success.
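A small, self-contained illustration of these metrics with scikit-learn (the numbers are made up):

```python
# RMSE and R^2 for a regression; accuracy for a classification.
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

# Regression: R^2 (explanation ratio) and RMSE (error magnitude).
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.6])
print("R^2:", r2_score(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))

# Classification: accuracy, the correct classification rate.
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 0, 0, 1, 0]
print("accuracy:", accuracy_score(labels_true, labels_pred))  # 0.8
```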

Overfitting

The model learns patterns specific to the training data but does not predict well on new, unseen data.

In other words, we divide the data into two parts, train and test. The algorithm learns the training set very well. However, when we ask the model to predict on a data set it has not seen, prediction performance starts to decrease. This situation is called overfitting.
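A minimal sketch of this effect (a decision tree on synthetic data, my own illustration): an unconstrained tree fits the training set almost perfectly but scores noticeably lower on unseen data:

```python
# Demonstrating overfitting: near-perfect training accuracy,
# noticeably lower test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)  # no depth limit
tree.fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # ~1.0
print("test accuracy:", tree.score(X_test, y_test))     # clearly lower
```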

Deterministic Models vs Stochastic Models

In deterministic models, it is assumed that there is an exact relationship between the variables. The relationship between two variables is expressed with a line; in other words, the output of the model is fully determined by the parameter values.

Stochastic models are probabilistic models: they include a random error term.

In stochastic models, when we try to predict the relationship between X and Y, we cannot express it exactly with a straight line; there is always a margin of error.
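In symbols, the distinction can be written as follows (a standard simple-regression formulation, added here for illustration):

```latex
% Deterministic: the output is fully determined by the parameters.
Y = \beta_0 + \beta_1 X

% Stochastic: a random error term \varepsilon captures the margin of error.
Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)
```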

Linear vs Nonlinear Models

The relationship between X and Y is linear if it can be expressed with a straight line. If the relationship between the variables is modeled with a curve, tree-based methods, or other techniques instead of a line, these are called nonlinear methods.

Machine learning can be seen as a transition from mathematics to statistics. While mathematics deals with exactness, statistics deals with probability. There is no certainty in statistics; there is always error and estimation.

Model Validation Methods

We build a model to find relationships between dependent and independent variables. For example, the dependent variable we want to predict is the price of a house, and our independent variables are the size of the house, its location, its floor, and so on. After fitting the model, we need to evaluate its results. These studies are called model validation methods. Different methods are used for regression models and for classification models.

Holdout Method

Let’s say we have an original data set of 1000 observations. We split it 80% / 20% into training and test sets: we train with 800 observations and test with 200. For example, we learn the coefficients of the house price prediction model on the training set, and then we test how good its predictions are on the 200 held-out observations.

In the holdout method, if the number of observations is low, we may not be able to split the data into training and test sets. For example, with only 50 observations, a split may leave too little data either to train the model or to test it.

K Fold Cross Validation Method

The data set is divided into k sections (folds). In the first iteration, the first fold is used to test the model and the remaining folds are used to train it. In the second iteration, the second fold is used as the test set while the rest serve as the training set. This process is repeated until every fold has been used as the test set.

Averaging the errors obtained across folds gives our validation (training) error. Then we test our model on the test set that we set aside at the beginning of the study.

We always need to divide our data set into two, a test set and a training set. After this separation, we apply the k-fold method on the training set only: we calculate a proper validation error on the training set, keep it aside, and then test the model we have built on the test set.

In a nutshell: we have an original data set; we divide it 80% / 20% into train and test; we validate on the 80% training portion using 5 or 10 folds and build a model from it; then, with the training error of the resulting model in hand, we evaluate it on the test data.
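A sketch of this workflow, assuming scikit-learn, synthetic data, and a linear regression model (all my own illustration choices):

```python
# Split first, cross-validate on the training set, then do one final
# evaluation on the untouched test set.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=1000, n_features=5, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
# 10-fold cross-validation on the training set: the validation error.
cv_scores = cross_val_score(model, X_train, y_train, cv=10,
                            scoring="neg_root_mean_squared_error")
print("mean CV RMSE:", -cv_scores.mean())

# Fit on the full training set, then evaluate once on the test set.
model.fit(X_train, y_train)
print("test R^2:", model.score(X_test, y_test))
```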

Leave One Out Method

It is a special case of the k-fold method. In k-fold, we split the data set into 5 to 10 parts, exclude one fold at each iteration, train the model on the other folds, and test it on the part left out. Here, the number of subsets equals the number of observations n; that is, n subsets are assumed, and as in k-fold, each is tested in turn.

For example, suppose we have 1000 observations. Each time, a model is fitted with 999 observations and then tested on the single remaining observation. In the second iteration, another observation is excluded, the model is fitted with all the others, and the excluded one is tested. In this way, the whole data set is worked through.

Although this method can be applied in theory, it becomes difficult to use as the data set grows, since one model must be fitted per observation.
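A short sketch with scikit-learn's LeaveOneOut splitter on a small synthetic data set (50 observations, so 50 model fits):

```python
# Leave-one-out: train on n-1 observations, test on the one left out,
# and repeat for every observation.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0,
                       random_state=0)

loo = LeaveOneOut()  # one split per observation: 50 in total
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")
print("LOO mean squared error:", -scores.mean())
```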

Bootstrap Method

Like the other methods, it is based on somehow dividing the data set in two, training the model with one part, and testing with the other. In addition to what the other methods do, bootstrap resamples the data with replacement.

For example, from the original data set, bootstrap samples are drawn, each containing fewer observations than the original; say there are 10 of them: Bootstrap1, Bootstrap2, ..., Bootstrap10. A model is fitted on each of these 10 samples. The fitted models are tested with the test set approach, and the results are evaluated by averaging the training and test results separately.

To sum up, the bootstrap is used to derive new data sets from an existing one by sampling with replacement. A model is fitted on each of the newly created data sets; these models are tested, and the results are evaluated accordingly.
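A hedged sketch of this idea, using scikit-learn's resample utility and a linear model on synthetic data; the ten resamples mirror the Bootstrap1…Bootstrap10 example above:

```python
# Bootstrap: draw resamples with replacement, fit a model on each,
# and average the results across resamples.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.utils import resample

X, y = make_regression(n_samples=200, n_features=3, noise=10.0,
                       random_state=0)

coefs = []
for i in range(10):  # e.g. Bootstrap1 ... Bootstrap10
    # Each resample has fewer rows than the original data set.
    X_boot, y_boot = resample(X, y, n_samples=150, random_state=i)
    model = LinearRegression().fit(X_boot, y_boot)
    coefs.append(model.coef_)

# Averaging across resamples shows how stable the estimates are.
print("mean coefficients:", np.mean(coefs, axis=0))
```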

In conclusion, the most common of the methods above is k-fold cross-validation. Given a new data set, we first divide it into training and test sets, then evaluate candidate models with k-fold cross-validation on the training set; this is how we arrive at our final model and obtain our test error.

In the next article, we will look at methods for evaluating the prediction success of models. See you then…
