Machine Learning — Missing Data and Data Transformation #6

Ufuk Çolak
Jun 6, 2021


Hello everyone! In my previous article, we examined data preprocessing for machine learning models. In this article, we will look at missing data analysis and data transformation methods.

Missing Data Analysis

What is Missing Data?

It refers to observations that are absent from the data set under examination. Data may have missing values for many reasons, such as observations that were never recorded or data corruption. Real-world data also often contain missing values. Since many machine learning algorithms do not support data with missing values, it is critical to handle them before the modeling step. Rows with missing data can be deleted, or the missing values can be filled in. Each of these solutions, however, has drawbacks of its own.

The Damages of Direct Deletion of Missing Data

Before deleting missing data, it is necessary to check whether the missingness in the data set is structural. If the missingness is structural, deleting those rows may introduce bias into the model.

For example, consider a customer dataset. We first try to determine whether the missingness arose randomly or structurally. By structural, we mean that the missingness is related to another variable. Suppose a customer's credit card spending is NA, and the Credit Card Ownership column shows that this customer does not own a credit card. Here, we cannot fill in the spending of a customer without a credit card using, say, the column mean.

NA doesn’t always mean lack!

Building on the previous example, suppose the customer's spending is again NA, but this time the Credit Card Ownership column shows that they do own a card. Here the NA might mean that the customer simply did not spend, or that spending was not measured. That is why we should not assume that NA values must always be discarded.

Loss of information!

For example, suppose we have a dataset of 100 variables, and one variable has a missing value in just a single observation. If we delete that entire column, we throw away all the valid information it contains.

Summary statistics, correlation tests, and visualization techniques can be used to test the randomness of missing data.
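As a concrete starting point, a quick missingness summary can be computed with pandas. This is a minimal sketch on a hypothetical customer table; the column names (`age`, `card_owner`, `card_spend`) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; the values and column names are assumptions.
df = pd.DataFrame({
    "age":        [25, 32, np.nan, 41, 29],
    "card_owner": [1, 0, 1, 1, 0],
    "card_spend": [1200.0, np.nan, 800.0, 1500.0, np.nan],
})

# Count and ratio of missing values per variable.
missing_count = df.isna().sum()
missing_ratio = df.isna().mean()

# A quick structural check: is spending missing exactly when the
# customer does not own a card?
crosstab = pd.crosstab(df["card_owner"], df["card_spend"].isna())

print(missing_count)
print(missing_ratio)
print(crosstab)
```

In this toy table, `card_spend` is missing precisely for the rows where `card_owner` is 0, which is the structural pattern described above.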

How to Fix Missing Data Problem?

Deletion Methods: As mentioned above, if a variable we have flagged has, say, 70% of its values missing, we may delete that variable (or the affected observation units) from the data; so little information remains that keeping it would not benefit us.

  • Observation or variable deletion (dropping rows or columns)
  • Listwise method
  • Pairwise method
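The deletion methods above map directly onto pandas operations. A minimal sketch, using toy values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Toy data; values are illustrative assumptions.
df = pd.DataFrame({
    "age":        [25.0, 32.0, np.nan, 41.0],
    "spend":      [1200.0, np.nan, 800.0, 1500.0],
    "card_owner": [1, 0, 1, 1],
})

# Listwise deletion: drop any row with at least one missing value.
listwise = df.dropna()

# Variable (column) deletion: drop any column containing missing values.
col_drop = df.dropna(axis=1)

# Pairwise deletion: pandas computes correlations from
# pairwise-complete observations, so no explicit dropping is needed.
corr = df.corr()
```

Note that pairwise deletion keeps more data than listwise deletion, since each pair of variables uses all rows where both are observed.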

Methods of Assigning Value: It is a very simple and common method to fill in the observations that we have identified as missing. We can do it using various statistical methods.

  • Median, mean, quantile
  • Assignment to the most similar unit (hot deck)
  • Assignment from an external source
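The simple statistical assignments in this list are one-liners in pandas. A minimal sketch on a hypothetical spending series (hot-deck and external-source assignment require more machinery and are omitted here):

```python
import numpy as np
import pandas as pd

# Hypothetical spending values with gaps.
s = pd.Series([1000.0, np.nan, 2000.0, 3000.0, np.nan])

mean_filled   = s.fillna(s.mean())          # mean imputation
median_filled = s.fillna(s.median())        # median imputation
q_filled      = s.fillna(s.quantile(0.25))  # lower-quartile imputation
```

For this series the mean and median happen to coincide (2000), while the lower quartile fills in 1500, showing how the choice of statistic changes the imputed values.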

Predictive Methods: Estimate the missing values by modeling them from the other variables.

  • Machine learning models (e.g., k-nearest neighbors, regression)
  • EM (Expectation-Maximization)
  • Multiple imputation
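As one example of a predictive method, scikit-learn's `KNNImputer` fills each missing value using the most similar rows. A minimal sketch, with illustrative ages and spending amounts chosen as assumptions:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: age, spend. Values are illustrative assumptions.
X = np.array([
    [25.0, 1200.0],
    [26.0, np.nan],
    [40.0, 3000.0],
    [41.0, 3100.0],
])

# Each missing value is replaced by the average of that feature
# over the 2 most similar rows (by distance on the observed features).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Here the row with age 26 borrows its spending value from its two nearest neighbors, yielding an estimate informed by similar customers rather than a global mean.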

Data Transformation

We talked about how machine learning algorithms use numerical input variables. Many of these algorithms perform better when those variables are scaled to a standard range. For example, suppose the dataset contains the age, gender, and monthly expenses of our customers. These variables have different distributions: age may take values like 20, 21, or 22, while expenditure amounts may be 1,000, 2,000, or 3,000. Models fitted on such variables may not always perform well, so the necessary transformations need to be applied first. This applies in particular to algorithms that use a weighted sum of the inputs, such as linear regression, and algorithms that use distance measures, such as k-nearest neighbors.

The two most popular techniques used to scale numerical data before modeling are normalization and standardization.

Normalization rescales each input variable into the range 0 to 1. To apply this transformation, we need to know the minimum and maximum values of the data.
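This min-max rule can be written in a couple of lines of numpy. A minimal sketch on made-up age values:

```python
import numpy as np

# Illustrative age values.
x = np.array([20.0, 21.0, 22.0, 30.0])

# Min-max normalization: (x - min) / (max - min) maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())
```

The smallest value always maps to 0 and the largest to 1, which is why an out-of-range value at prediction time needs special handling.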

Standardization scales each input variable individually by subtracting the mean (also called centering) and dividing by the standard deviation, shifting the distribution so that the mean is 0 and the standard deviation is 1. Subtracting the mean is called centering, and dividing by the standard deviation is called scaling; for this reason, the method is sometimes called central scaling.

When the variable is standardized, the information structure within the variable itself does not deteriorate. But it is set to a certain standard.

Let's assume that the variable is equal to 120 in its initial state, and it ranks 70th in order. When the variable is standardized, the value 120 will be mapped to a new value on a scale centered at 0 (unlike normalization, standardized values are not confined to the 0-1 range). However, when the variable is sorted from smallest to largest, its rank will not change: it will still be in 70th place.

Therefore, when a variable is standardized, its values change and are placed on a common scale, but the shape of its distribution and the relative ordering of its values do not change.
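Both properties, a mean of 0 with a standard deviation of 1 and an unchanged ordering, are easy to verify directly. A minimal sketch on illustrative values:

```python
import numpy as np

# Illustrative values.
x = np.array([100.0, 110.0, 120.0, 150.0, 90.0])

# Standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# The ordering of the values is unchanged by the transformation.
order_preserved = (np.argsort(z) == np.argsort(x)).all()
```

Because standardization is a monotonic (affine, positive-slope) transformation, rank-based statistics computed on the variable are unaffected.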

In this article, we focused on two data preprocessing steps: missing data analysis and data transformation. We learned that when we encounter missing data, we can remove it or fill it with another value, and we learned how to transform variables with different scales into the same structure, in other words, the data transformation steps. I hope you enjoyed it; see you in the next one :)
