Categorical variables are features in data that consist of categories or classes rather than numeric values. Some common examples of categorical variables include gender (male, female), credit card type (Visa, MasterCard, American Express) and color (red, green, blue). Most machine learning algorithms operate on numerical inputs, so in order to use categorical variables in modeling, they usually need to be converted to numeric representations.
The most common approach for converting categorical variables to numeric format is known as one-hot encoding. A closely related variant, dummy coding, uses the same columns but drops one of them as a reference. In one-hot encoding, each unique category is represented as a binary variable that can take the value 0 or 1. For example, consider a categorical variable ‘Gender’ with possible values ‘Male’ and ‘Female’. We would encode this as:
Male = [1, 0]
Female = [0, 1]
In this representation, the feature vector will have two dimensions – one for ‘Male’ and one for ‘Female’. If an example is female, it will be encoded as [0, 1]. Similarly, a male example will be [1, 0].
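The mapping above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation; the helper name `one_hot` and the constant `GENDER_CATEGORIES` are hypothetical names chosen for this example.

```python
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Fixed column order (an assumption of this sketch):
# position 0 is 'Male', position 1 is 'Female'.
GENDER_CATEGORIES = ["Male", "Female"]

male_vector = one_hot("Male", GENDER_CATEGORIES)
female_vector = one_hot("Female", GENDER_CATEGORIES)

print(male_vector)    # [1, 0]
print(female_vector)  # [0, 1]
```

The column order must stay fixed between training and prediction, otherwise the same vector would mean different categories at different times.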
This allows us to represent categorical information in a format that machine learning models can understand and work with. Some key things to note about one-hot encoding:
Full one-hot encoding creates one binary column per unique category, so a variable with ‘n’ unique categories yields ‘n’ columns (as in the Gender example above). Dummy coding drops one of these columns as a reference, so it generates ‘n-1’ dummy variables.
These dummy variables are usually added as separate columns to the original dataset. So the number of columns increases after one-hot encoding.
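In the full encoding, exactly one of the columns will be ‘1’ and the rest ‘0’ for each example. When a reference category is dropped, examples in the reference category are ‘0’ in every column and all other examples have exactly one ‘1’. Either way, the categorical information is maintained while mapping it to numeric format.
The dummy variable columns can then be treated as separate binary features by machine learning models. They do not imply any order among the categories.
For linear models with an intercept, one category needs to be omitted as the base level or reference category to avoid the dummy variable trap: the full set of columns always sums to 1, which makes it perfectly collinear with the intercept. The effect of this reference category gets embedded in the model intercept.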
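To make the difference between the full encoding and the reference-dropped encoding concrete, here is a small sketch using the credit card types mentioned earlier. The helper `dummy_code` and the choice of ‘Visa’ as the reference level are assumptions made for illustration only.

```python
def dummy_code(value, categories, reference):
    """Encode `value` with one column per category except `reference`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    kept = [c for c in categories if c != reference]
    return [1 if value == c else 0 for c in kept]


CARD_TYPES = ["Visa", "MasterCard", "American Express"]
rows = ["Visa", "MasterCard", "American Express"]

# Full one-hot encoding: n = 3 columns, one per category.
full = [[1 if value == c else 0 for c in CARD_TYPES] for value in rows]

# Every row of the full encoding sums to 1. That constant sum duplicates
# an intercept column, which is the dummy variable trap.
row_sums = [sum(row) for row in full]

# Dummy coding with 'Visa' as the reference: n - 1 = 2 columns
# (MasterCard, American Express). The 'Visa' row is all zeros.
reduced = [dummy_code(value, CARD_TYPES, reference="Visa") for value in rows]

print(row_sums)  # [1, 1, 1]
print(reduced)   # [[0, 0], [1, 0], [0, 1]]
```

Models that do not fit an intercept, or that are regularized or tree-based, are generally not harmed by keeping all n columns, so dropping a reference is mainly a concern for unregularized linear models.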
Now, let’s look at an extended example to demonstrate the one-hot encoding process step-by-step:
Let’s consider a categorical variable ‘Color’ with 3 unique categories – Red, Green, Blue.
Original categorical data:
Example 1, Color: Red
Example 2, Color: Green
Example 3, Color: Blue
Steps:
Identify the unique categories – Red, Green, Blue
Create dummy variables/columns for each category
Column for Red
Column for Green
Column for Blue
Select a category as the base/reference level and exclude its dummy column
Let’s select Red as the reference level
For each example, set the column matching its category to 1 and every other column to 0; an example in the reference level has 0 in all remaining columns
Data after one-hot encoding:
Example 1, Green: 0, Blue: 0 (Red, the reference level)
Example 2, Green: 1, Blue: 0
Example 3, Green: 0, Blue: 1
We have now converted the categorical variable ‘Color’ to numeric dummy variables that machine learning models can understand and learn from as separate features.
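The steps of the Color example can be reproduced with a short plain-Python sketch. It assumes Red is the reference level, so only the Green and Blue columns remain after encoding; the variable names are illustrative.

```python
# Original categorical data (Examples 1 to 3).
colors = ["Red", "Green", "Blue"]

# Step 1: identify the unique categories, keeping first-seen order.
categories = list(dict.fromkeys(colors))

# Steps 2 and 3: one column per category, then drop the reference level.
reference = "Red"
columns = [c for c in categories if c != reference]

# Step 4: 1 in the column matching the example's category, 0 elsewhere.
# An example in the reference level ends up with 0 in every column.
encoded = [{col: int(color == col) for col in columns} for color in colors]

for number, row in enumerate(encoded, start=1):
    print(f"Example {number}:", row)
# Example 1: {'Green': 0, 'Blue': 0}
# Example 2: {'Green': 1, 'Blue': 0}
# Example 3: {'Green': 0, 'Blue': 1}
```

In practice, data-frame libraries offer helpers for this; for instance, pandas provides `get_dummies` with a `drop_first` option, although that option drops the first category in sorted order rather than one you name.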
This one-hot encoding process is applicable to any categorical variable with multiple classes. It allows representing categorical information in a numeric format required by ML algorithms, while retaining the categorical differences between classes. The dummy variables can then be readily used in modeling, feature selection, dimensionality reduction etc.
Some key advantages of one-hot encoding include:
It is a simple and effective approach to convert categorical text data to numeric form.
The categorical differences are maintained in the final numeric representation as dummy variables.
Dummy variables can be used directly as binary numeric features in downstream modeling, without implying any order among the categories.
With a sparse representation that stores only the non-zero entries, it remains workable for variables with a large number of categories, since each encoded vector is mostly 0s.
The encoding is easy to invert, so model predictions in one-hot form can be mapped back to the original category labels.
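As a rough illustration of the last point, the sketch below maps an encoded vector, or a vector of per-class scores from a model, back to a category label. The function names and the example scores are hypothetical, and the column order is assumed to match the one used when encoding.

```python
def decode_one_hot(vector, categories):
    """Return the category whose position holds the single 1 in `vector`."""
    if len(vector) != len(categories):
        raise ValueError("vector length must match number of categories")
    if any(v not in (0, 1) for v in vector) or sum(vector) != 1:
        raise ValueError("vector must contain exactly one 1 and otherwise 0s")
    return categories[vector.index(1)]


def decode_scores(scores, categories):
    """Return the category with the highest score (argmax over positions)."""
    if len(scores) != len(categories):
        raise ValueError("scores length must match number of categories")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return categories[best]


COLORS = ["Red", "Green", "Blue"]

print(decode_one_hot([0, 1, 0], COLORS))       # Green
print(decode_scores([0.1, 0.2, 0.7], COLORS))  # Blue
```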
One-hot encoding also has some disadvantages: the dimensionality of the data increases after encoding, and any intrinsic ordering between categories (for ordinal variables) is lost. Techniques like target encoding and feature hashing can help alleviate these issues to some extent.
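For orientation, here is a deliberately simplified sketch of the two alternatives named above. Target encoding replaces each category with the mean of the target for that category, and feature hashing maps each category to one of a fixed number of buckets. The function names, the toy data and the bucket count are assumptions for illustration; production implementations typically add smoothing and out-of-fold computation for target encoding to limit target leakage.

```python
import hashlib
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


def hashed_vector(category, n_buckets):
    """One-hot vector of fixed length whose position is chosen by a hash."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    vector = [0] * n_buckets
    vector[int(digest, 16) % n_buckets] = 1
    return vector


# Toy data (hypothetical): a color feature and a binary target.
colors = ["Red", "Green", "Red", "Blue"]
targets = [1, 0, 0, 1]

means = target_encode(colors, targets)
print(means)  # {'Red': 0.5, 'Green': 0.0, 'Blue': 1.0}

# The vector length stays at n_buckets no matter how many categories exist.
# Distinct categories may collide in the same bucket.
vector = hashed_vector("Green", n_buckets=8)
print(len(vector), sum(vector))  # 8 1
```

Target encoding yields a single numeric column per variable, and feature hashing caps the number of columns, so both keep dimensionality under control at the cost of some information.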
One-hot encoding is a fundamental preprocessing technique, widely used to convert categorical features to numeric dummy variables, which most machine learning algorithms require as input. It maintains categorical differences effectively while mapping to suitable numeric representations.