My Digital Garden

One-Hot encoding

One-hot encoding is a method used to represent categorical data as binary vectors. It's commonly used in machine learning and data preprocessing to convert non-numeric labels (like names or categories) into a numerical format that algorithms can understand.

You would use this in data analysis for these reasons:

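  • Prevents ordinality: Label encoding maps categories to integers (for example Red = 0, Green = 1, Blue = 2), which a model can misread as an ordering. One-hot encoding doesn't assume any order between categories.
  • Works with algorithms: Many machine learning models (like linear regression or neural networks) expect numeric input and can't handle categorical data directly.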
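
To make the ordinality point concrete, here is a minimal sketch in plain Python that compares distances between categories under both encodings. The integer codes (Red = 0, Green = 1, Blue = 2) and the helper names `label_distance` and `one_hot_distance` are assumptions made for illustration only.

```python
import math

categories = ["Red", "Green", "Blue"]

# Label encoding: each category gets an integer code (Red=0, Green=1, Blue=2).
label_codes = {name: index for index, name in enumerate(categories)}

# One-hot encoding: each category gets a vector with a single 1.
one_hot_codes = {
    name: [1 if i == index else 0 for i in range(len(categories))]
    for index, name in enumerate(categories)
}


def label_distance(a, b):
    """Absolute difference between the integer codes of two categories."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two categories."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])


# With integer codes, Red looks "closer" to Green than to Blue.
print(label_distance("Red", "Green"), label_distance("Red", "Blue"))

# With one-hot vectors, every pair of distinct categories is equally far apart.
print(one_hot_distance("Red", "Green"), one_hot_distance("Red", "Blue"))
```

Under the integer codes the gap between Red and Blue is twice the gap between Red and Green, even though the colors have no natural order. Under one-hot encoding both gaps are the same.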

Suppose you have a categorical feature like:

Color: [Red, Green, Blue]

Each category is assigned a binary vector:

Red   → [1, 0, 0]  
Green → [0, 1, 0]  
Blue  → [0, 0, 1]

Exactly one element of each vector is "hot" (i.e., 1) and the rest are 0, hence the name "one-hot".
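
The mapping above can be sketched by hand in a few lines of plain Python. The function name `one_hot` and the choice to pass the category order explicitly are assumptions for this illustration, not part of any library.

```python
def one_hot(values, categories):
    """Return one binary vector per value, with columns in `categories` order."""
    index_of = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index_of[value]] = 1  # raises KeyError for an unseen category
        encoded.append(vector)
    return encoded


colors = ["Red", "Green", "Blue"]
rows = one_hot(["Red", "Green", "Blue", "Green"], colors)
for row in rows:
    print(row)
```

Because the column order is taken from `colors`, Red lands in the first column here, matching the mapping shown above.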

Code example

Here’s a simple example of one-hot encoding using scikit-learn's OneHotEncoder:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
data = np.array([['Red'], ['Green'], ['Blue'], ['Green']])

# Initialize the OneHotEncoder
# sparse_output=False returns a dense NumPy array
# (this parameter was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoded_data = encoder.fit_transform(data)

# Show result
print(encoded_data)

# To see the feature names (optional)
print(encoder.get_feature_names_out())

which gives this output:

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
['x0_Blue' 'x0_Green' 'x0_Red']

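Each color is now represented as a binary vector, ready for use in a machine learning model. Note that the encoder sorts the categories, so the columns are ordered Blue, Green, Red: here Red maps to [0, 0, 1], not to [1, 0, 0] as in the hand-written mapping earlier.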
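
To read the encoded output back as labels, each row can be mapped to the column that holds its single 1. The sketch below does this in plain Python on values copied from the output above; the helper name `decode` is hypothetical. scikit-learn's encoder also offers an `inverse_transform` method for the same purpose.

```python
# Encoded rows and column order copied from the output above.
encoded_rows = [
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
columns = ["Blue", "Green", "Red"]


def decode(rows, columns):
    """Map each one-hot row back to the category of its single hot column."""
    return [columns[row.index(max(row))] for row in rows]


decoded = decode(encoded_rows, columns)
print(decoded)
```

The decoded list reproduces the original input order (Red, Green, Blue, Green), which confirms that no information is lost by the encoding.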
