Dealing with missing values in Datasets
Author: Jimmy Rousseau | Published: 8/28/2023
Machine Learning

How missing values can affect learning

Most machine learning libraries raise an error if you try to train a model on data that contains missing values. Here I will outline the main approaches to dealing with them.

The approaches and scores below follow the example from Kaggle Learning.

To compare the approaches, create a helper function that measures the quality of each approach's predictions as mean absolute error (MAE). Lower is better.
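The helper itself is not reproduced here. A minimal sketch of what it might look like, assuming scikit-learn and a small random forest as the model (the function name score_dataset, the model choice, and the toy columns are assumptions for illustration, not necessarily what produced the scores in this article):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a model on the training split and return the validation MAE."""
    # Assumption: a small random forest; any regressor could be used here.
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)


# Toy data with hypothetical columns, only to show how the helper is called.
X_train = pd.DataFrame({"rooms": [2, 3, 4, 5], "area": [50, 70, 90, 120]})
y_train = pd.Series([100, 150, 200, 260])
X_valid = pd.DataFrame({"rooms": [3, 4], "area": [65, 95]})
y_valid = pd.Series([140, 210])

mae = score_dataset(X_train, X_valid, y_train, y_valid)
print(f"Validation MAE: {mae:.2f}")
```

Keeping the model and the random seed fixed means any difference in MAE comes from how the missing values were handled, not from the model.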

Simple Option

The easiest solution is to drop every column that contains missing values. This is only a good idea when most of the values in a column are missing; otherwise we may be throwing away important information our algorithm could learn from.

MAE from Approach 1 (Drop columns with missing values): 183550.22137772635
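As a rough illustration of Approach 1, here is a minimal pandas sketch on a toy DataFrame. The column names and values are hypothetical and are not the data behind the MAE reported above. The columns to drop are chosen from the training data and then removed from both splits, so the two frames keep the same features.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical columns; NaN marks a missing value.
X_train = pd.DataFrame({
    "rooms": [2, 3, 4, 5],
    "area": [50.0, np.nan, 90.0, 120.0],
    "year_built": [np.nan, np.nan, 1995.0, np.nan],
})
X_valid = pd.DataFrame({
    "rooms": [3, 4],
    "area": [65.0, 95.0],
    "year_built": [2001.0, np.nan],
})

# Find every training column that has at least one missing value.
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop the same columns from both the training and validation data.
reduced_X_train = X_train.drop(columns=cols_with_missing)
reduced_X_valid = X_valid.drop(columns=cols_with_missing)

print(cols_with_missing)              # ['area', 'year_built']
print(list(reduced_X_train.columns))  # ['rooms']
```

Note that "area" is dropped even though only one of its four training values is missing, which is exactly the loss of information described above.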

A Better Option: Imputation

Here we fill in the missing values with some number, most commonly the mean of the column. This usually leads to more accurate models than dropping the columns altogether.

MAE from Approach 2 (Imputation): 178166.46269899711
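A minimal sketch of mean imputation, assuming scikit-learn's SimpleImputer and a toy DataFrame with hypothetical columns (not the data behind the MAE reported above). The imputer learns the column means from the training data only and reuses them for the validation data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical columns; NaN marks a missing value.
X_train = pd.DataFrame({"rooms": [2, 3, 4, 5], "area": [50.0, np.nan, 90.0, 130.0]})
X_valid = pd.DataFrame({"rooms": [3, 4], "area": [np.nan, 95.0]})

# Learn each column's mean from the training data, then fill the gaps.
imputer = SimpleImputer(strategy="mean")
imputed_X_train = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
# Reuse the training means for the validation data (transform, not fit_transform).
imputed_X_valid = pd.DataFrame(
    imputer.transform(X_valid), columns=X_valid.columns, index=X_valid.index
)

print(imputed_X_train["area"].tolist())  # [50.0, 90.0, 90.0, 130.0]
print(imputed_X_valid["area"].tolist())  # [90.0, 95.0]
```

The training mean of "area" is (50 + 90 + 130) / 3 = 90, so both the missing training value and the missing validation value become 90.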

Extending Imputation

Imputed values can end up systematically above or below the actual values (which, perhaps, were simply never collected in the dataset), or the rows with missing values may differ from the other rows in some way. In this case we can extend imputation: fill in the missing values as before, and also add a new column for each affected feature that tags whether each row had to be filled in. This way we give the algorithm some indication that those values are different or special in some way, as a new feature to learn from.

MAE from Approach 3 (An Extension to Imputation): 178927.503183954
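A minimal sketch of the extension, again assuming scikit-learn's SimpleImputer and a toy DataFrame. The column names and the "_was_missing" suffix are hypothetical choices for illustration, and this is not the data behind the MAE reported above. The indicator columns are created before imputing, while the missing entries can still be detected.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical columns; NaN marks a missing value.
X_train = pd.DataFrame({"rooms": [2, 3, 4, 5], "area": [50.0, np.nan, 90.0, 130.0]})
X_valid = pd.DataFrame({"rooms": [3, 4], "area": [np.nan, 95.0]})

# Work on copies so the original frames are left untouched.
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# For each training column with gaps, add a 0/1 column flagging the filled-in rows.
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
for col in cols_with_missing:
    X_train_plus[col + "_was_missing"] = X_train_plus[col].isnull().astype(int)
    X_valid_plus[col + "_was_missing"] = X_valid_plus[col].isnull().astype(int)

# Impute with the training means; the indicator columns have no gaps and pass through.
imputer = SimpleImputer(strategy="mean")
imputed_X_train_plus = pd.DataFrame(
    imputer.fit_transform(X_train_plus),
    columns=X_train_plus.columns,
    index=X_train_plus.index,
)
imputed_X_valid_plus = pd.DataFrame(
    imputer.transform(X_valid_plus),
    columns=X_valid_plus.columns,
    index=X_valid_plus.index,
)

print(imputed_X_train_plus)
```

SimpleImputer also accepts add_indicator=True, which appends similar indicator columns itself; the explicit loop is used here only to make the step visible.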

Information for this article is from Kaggle Learning: Missing Values by Alexis Cook