Often in statistics and machine learning, we normalize variables such that the range of the values is between 0 and 1.
The most common reason to normalize variables is when we conduct some type of multivariate analysis (i.e. we want to understand the relationship between several predictor variables and a response variable) and we want each variable to contribute equally to the analysis.
When variables are measured at different scales, they often do not contribute equally to the analysis. For example, if the values of one variable range from 0 to 100,000 and the values of another variable range from 0 to 100, the variable with the larger range will be given a larger weight in the analysis.
By normalizing the variables, we place them on a common 0-to-1 scale, so that no variable dominates the analysis simply because of its range.
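To make the effect of scale concrete, here is a minimal sketch that compares the Euclidean distance between two observations before and after scaling. The two observations and the helper names (`euclidean`, `scale`) are hypothetical, and the sketch assumes the variable ranges are known to be 0 to 100,000 and 0 to 100, as in the example above.

```python
import math

# Hypothetical observations: (variable A in 0..100,000, variable B in 0..100)
p = (20_000.0, 10.0)
q = (21_000.0, 90.0)

# Assumed (min, max) for each variable
ranges = [(0.0, 100_000.0), (0.0, 100.0)]

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def scale(point, ranges):
    """Min-max scale each coordinate using its (min, max) pair."""
    return tuple((v - lo) / (hi - lo) for v, (lo, hi) in zip(point, ranges))

raw_distance = euclidean(p, q)
scaled_distance = euclidean(scale(p, ranges), scale(q, ranges))

# Share of the squared distance that comes from variable A
raw_share_a = (p[0] - q[0]) ** 2 / raw_distance ** 2
scaled_share_a = (scale(p, ranges)[0] - scale(q, ranges)[0]) ** 2 / scaled_distance ** 2

print(round(raw_share_a, 4), round(scaled_share_a, 4))
```

On the raw values, variable A accounts for almost all of the distance even though the two observations differ by only 1% of A's range and by 80% of B's range. After scaling, the difference in variable B drives the distance instead.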
To normalize the values to be between 0 and 1, we can use the following formula:
x_norm = (x_i - x_min) / (x_max - x_min)
where:
- x_norm: The ith normalized value in the dataset
- x_i: The ith value in the dataset
- x_max: The maximum value in the dataset
- x_min: The minimum value in the dataset
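As a quick illustration of the formula, the sketch below applies it to a short list in plain Python. The function name `min_max_normalize` and the sample values are hypothetical. Raising an error when every value is identical is an assumption of this sketch, made because the denominator would be zero in that case.

```python
def min_max_normalize(values):
    """Scale values to the range [0, 1] using (x - min) / (max - min)."""
    x_min = min(values)
    x_max = max(values)
    if x_max == x_min:
        # The range is zero, so the formula is undefined
        raise ValueError("all values are equal, so the range is zero")
    return [(x - x_min) / (x_max - x_min) for x in values]

# Minimum is 2 and maximum is 10, so the range is 8
normalized = min_max_normalize([2, 4, 10])
print(normalized)  # [0.0, 0.25, 1.0]
```

The smallest value always maps to 0, the largest maps to 1, and every other value lands in between in proportion to its distance from the minimum.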
The following examples show how to normalize one or more variables in Python.
Example 1: Normalize a NumPy Array
The following code shows how to normalize all values in a NumPy array:
import numpy as np

#create NumPy array
data = np.array([[13, 16, 19, 22, 23, 38, 47, 56, 58, 63, 65, 70, 71]])

#normalize all values in array
data_norm = (data - data.min()) / (data.max() - data.min())

#view normalized values
data_norm

array([[0.        , 0.05172414, 0.10344828, 0.15517241, 0.17241379,
        0.43103448, 0.5862069 , 0.74137931, 0.77586207, 0.86206897,
        0.89655172, 0.98275862, 1.        ]])

Each value in the normalized array is now between 0 and 1.
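The example above uses a single minimum and maximum for the whole array. When each column of a 2D array holds a different variable, it may make more sense to normalize each column separately. The following is a minimal sketch of that variation using the `axis` argument of `min()` and `max()`. The sample array is hypothetical.

```python
import numpy as np

# Hypothetical array: each row is an observation, each column a variable
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Column-wise minimum and maximum (one value per column)
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Broadcasting applies each column's min and max to that column only
data_norm = (data - col_min) / (col_max - col_min)

print(data_norm)
```

Each column now runs from 0 to 1 on its own, regardless of the scale of the other column.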
Example 2: Normalize All Variables in Pandas DataFrame
The following code shows how to normalize all variables in a pandas DataFrame:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#normalize values in every column
df_norm = (df - df.min()) / (df.max() - df.min())

#view normalized DataFrame
df_norm

     points  assists  rebounds
0  0.764706    0.125  0.857143
1  0.000000    0.375  0.428571
2  0.176471    0.375  0.714286
3  0.117647    0.625  0.142857
4  0.411765    1.000  0.142857
5  0.647059    0.625  0.000000
6  0.764706    0.625  0.571429
7  1.000000    0.000  1.000000

Each value in every column is now between 0 and 1.
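One way to sanity-check a normalized DataFrame is to confirm that every column has a minimum of 0 and a maximum of 1, and that the original values can be recovered by inverting the formula. The sketch below does both. The variable names `col_min`, `col_max`, `spans_unit_interval`, and `df_restored` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# Keep the per-column minimum and maximum so the scaling can be undone later
col_min = df.min()
col_max = df.max()
df_norm = (df - col_min) / (col_max - col_min)

# Every column should now have a minimum of 0 and a maximum of 1
spans_unit_interval = bool((df_norm.min() == 0).all() and (df_norm.max() == 1).all())

# Invert the formula: x = x_norm * (max - min) + min
df_restored = df_norm * (col_max - col_min) + col_min
max_error = (df_restored - df).abs().max().max()

print(spans_unit_interval, max_error)
```

Storing the minimum and maximum is also what makes it possible to apply the same scaling to new rows later, although new values outside the original range would then fall outside 0 to 1.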
Example 3: Normalize Specific Variables in Pandas DataFrame
The following code shows how to normalize specific variables in a pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

#define columns to normalize
x = df.iloc[:, 0:2]

#normalize values in first two columns only
#(assigning by column name replaces the integer columns with float columns)
df[x.columns] = (x - x.min()) / (x.max() - x.min())

#view normalized DataFrame
df

     points  assists  rebounds
0  0.764706    0.125        11
1  0.000000    0.375         8
2  0.176471    0.375        10
3  0.117647    0.625         6
4  0.411765    1.000         6
5  0.647059    0.625         5
6  0.764706    0.625         9
7  1.000000    0.000        12
Notice that just the values in the first two columns are normalized.
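If the columns to normalize are easier to identify by name than by position, the same idea can be wrapped in a small helper. This is only a sketch: the function `normalize_columns` is hypothetical, and returning a copy rather than modifying the DataFrame in place is a design choice made here so that the original values stay available.

```python
import pandas as pd

def normalize_columns(df, columns):
    """Return a copy of df with the named columns min-max scaled to [0, 1]."""
    result = df.copy()
    subset = result[columns]
    result[columns] = (subset - subset.min()) / (subset.max() - subset.min())
    return result

df = pd.DataFrame({'points': [25, 12, 15, 14, 19, 23, 25, 29],
                   'assists': [5, 7, 7, 9, 12, 9, 9, 4],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, 12]})

# Scale only the named columns; 'rebounds' is left as-is
df_scaled = normalize_columns(df, ['points', 'assists'])

print(df_scaled)
```

Because the helper works on a copy, `df` still holds the original values after the call.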