One-Hot encoding
One-hot encoding is a method used to represent categorical data as binary vectors. It's commonly used in machine learning and data preprocessing to convert non-numeric labels (like names or categories) into a numerical format that algorithms can understand.
It is useful in data analysis for two reasons:
- Avoids implied order: Unlike label encoding, which maps each category to an integer, one-hot encoding doesn't assume any order between categories.
- Works with numeric algorithms: Many machine learning models (like linear regression or neural networks) require numeric input and can't handle categorical data directly.
Suppose you have a categorical feature like:
Color: [Red, Green, Blue]
Each category is assigned a binary vector:
Red → [1, 0, 0]
Green → [0, 1, 0]
Blue → [0, 0, 1]
Only one bit is "hot" (i.e., 1) and the rest are 0 — hence the name "one-hot".
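The mapping above can be sketched in plain Python without any library. This is only an illustration: the helper name one_hot and the variables categories and samples are hypothetical, and the column order is simply the order of the categories list.

```python
# Hypothetical helper for illustration; not part of any library.
def one_hot(value, categories):
    """Return a one-hot list for `value`, given an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # Put a 1 in the position of the matching category and 0 everywhere else
    return [1 if value == category else 0 for category in categories]


categories = ["Red", "Green", "Blue"]  # column order follows this list
samples = ["Red", "Green", "Blue", "Green"]

encoded = [one_hot(sample, categories) for sample in samples]

for sample, vector in zip(samples, encoded):
    print(sample, vector)
```

Each resulting vector has the same length as the list of categories and contains exactly one 1.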
Code example
Here’s a simple example of one-hot encoding using scikit-learn's OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Sample categorical data
data = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False) # sparse_output=False returns a dense NumPy array (the parameter was named sparse before scikit-learn 1.2)
# Fit and transform the data
encoded_data = encoder.fit_transform(data)
# Show result
print(encoded_data)
# To see the feature names (optional)
print(encoder.get_feature_names_out())
This gives the following output:
[[0. 0. 1.]
[0. 1. 0.]
[1. 0. 0.]
[0. 1. 0.]]
['x0_Blue' 'x0_Green' 'x0_Red']
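Each color is now represented as a binary vector, ready for use in a machine learning model. Note that scikit-learn orders the columns by sorted category (Blue, Green, Red), so Red is encoded here as [0, 0, 1] rather than [1, 0, 0] as in the earlier illustration.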
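If a specific column order is wanted, for instance Red, Green, Blue as in the earlier illustration, the categories can be passed explicitly. The following is a minimal sketch under that assumption; the variable names encoded_fixed and decoded are chosen here for illustration. It leaves the encoder's output at its default (a sparse matrix) and converts it with toarray().

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Same sample data as in the example above
data = np.array([['Red'], ['Green'], ['Blue'], ['Green']])

# One list of categories per input column; output columns follow this order
encoder = OneHotEncoder(categories=[['Red', 'Green', 'Blue']])

# The default output is a SciPy sparse matrix; toarray() converts it to a dense array
encoded_fixed = encoder.fit_transform(data).toarray()
print(encoded_fixed)

# inverse_transform maps the binary vectors back to the original labels
decoded = encoder.inverse_transform(encoded_fixed)
print(decoded.ravel())
```

With the order fixed this way, the encoding no longer depends on how the category names happen to sort.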