To evaluate the performance of the various regression models, I used multiple evaluation metrics and performed both internal and external validation. For internal validation, I split the dataset into training and validation sets with a 70%/30% split and used the training set to fine-tune each model's hyperparameters. On the training set, I fit each regression model (linear regression, lasso regression, ridge regression, elastic net regression, random forest regression, gradient boosting regression) and tuned its hyperparameters, such as the regularization strength (alpha/lambda) for the penalized linear models and the number of trees and tree depth for the ensemble methods, using grid search cross-validation on the training set only.
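A minimal sketch of this split-and-tune step is shown below, assuming scikit-learn; the variables X and y and the parameter grids are illustrative placeholders rather than the exact values used:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor

# 70%/30% train/validation split (X and y assumed already loaded)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Illustrative hyperparameter grids for a few of the models
models = {
    "lasso": (Lasso(max_iter=10_000), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "ridge": (Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestRegressor(random_state=42),
        {"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    ),
}

# Grid search with cross-validation, fit on the training set only
tuned = {}
for name, (estimator, grid) in models.items():
    search = GridSearchCV(estimator, grid, cv=5,
                          scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    tuned[name] = search.best_estimator_
```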
The grid search produced hyperparameters for each model tuned specifically to the training data. I then used these tuned models to make predictions on the held-out validation set, giving an internal estimate of each model's performance for model selection. On the validation set I calculated several metrics, including:
Mean Absolute Error (MAE) – measures the average magnitude of the errors in a set of predictions, without considering their direction; every individual difference is weighted equally.
Mean Squared Error (MSE) – the average squared difference between the predicted and actual values. MSE is a risk function, corresponding to the expected value of the squared error loss. Because the errors are squared before averaging, larger errors are penalized disproportionately more than smaller ones, which makes this metric highly sensitive to outliers.
Root Mean Squared Error (RMSE) – the square root of the MSE, corresponding to the standard deviation of the residuals (prediction errors). It aggregates the magnitudes of the errors across the dataset into a single measure expressed in the same units as the target variable. Like MSE, it penalizes larger errors more heavily.
R-squared (R2) – measures how closely the data points fall to the fitted regression line. It is a statistical measure representing the proportion of the variance in the dependent variable that is explained by the independent variable(s) in the regression model. For a model fit by least squares, R2 on the training data ranges from 0 to 1, with higher values indicating less unexplained variance; an R2 of 1 means the regression line fits the data perfectly. (On held-out data, R2 can be negative when a model predicts worse than simply using the mean of the target.)
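These metrics can be computed directly from the validation-set predictions, for example with scikit-learn's metrics module (continuing the illustrative sketch above, where tuned holds the grid-searched estimators):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

model = tuned["ridge"]            # any one of the tuned estimators
y_pred = model.predict(X_val)

mae  = mean_absolute_error(y_val, y_pred)   # mean(|y - y_hat|)
mse  = mean_squared_error(y_val, y_pred)    # mean((y - y_hat)^2)
rmse = np.sqrt(mse)                         # sqrt(MSE), in the units of y
r2   = r2_score(y_val, y_pred)              # 1 - SS_res / SS_tot
```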
Calculating multiple performance metrics on the validation set for each regression model let me judge which model performed best overall on previously unseen data during internal model selection. The model with the lowest MAE, MSE, and RMSE and the highest R2 was generally considered the best model internally.
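A sketch of that comparison step, again with the hypothetical tuned dictionary from above and RMSE as the headline criterion (one reasonable choice among several):

```python
# Evaluate every tuned model on the validation set and pick a winner
results = {}
for name, est in tuned.items():
    pred = est.predict(X_val)
    results[name] = {
        "MAE": mean_absolute_error(y_val, pred),
        "RMSE": np.sqrt(mean_squared_error(y_val, pred)),
        "R2": r2_score(y_val, pred),
    }

best_name = min(results, key=lambda n: results[n]["RMSE"])
best_model = tuned[best_name]
```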
In addition to internal validation, I performed external validation by randomly setting aside 20% of the original dataset as an external test set, ensuring that no data from this set was used in any part of the model-building process, for either training or validation. I then refit the final optimized models on the full training set and generated predictions on the external test set, again calculating the evaluation metrics. This gave an unbiased estimate of how each model would generalize to completely new data, simulating real-world application of the models.
Some key points about the external validation process:
The test set remained untouched throughout model fitting, tuning, and validation
The final selected models from the internal validation step were refitted on the full training data
Performance was then evaluated on the external test set
This estimate of out-of-sample performance was a better indicator of real-world generalization ability than the internal validation scores
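Put together, the external validation workflow looks roughly like the sketch below; names such as X_all and best_model are illustrative and carry over, along with the imports, from the earlier snippets:

```python
from sklearn.base import clone
from sklearn.model_selection import train_test_split

# 1. Set aside 20% as an external test set before any model building
X_dev, X_test, y_dev, y_test = train_test_split(
    X_all, y_all, test_size=0.20, random_state=42
)

# 2. All internal splitting, tuning, and model selection happen on
#    (X_dev, y_dev) only, as in the earlier sketches. The selected model
#    is then refitted on the full training data with its tuned
#    hyperparameters (clone keeps the hyperparameters, drops the fit).
final_model = clone(best_model).fit(X_dev, y_dev)

# 3. A single evaluation on the untouched test set
y_test_pred = final_model.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
test_r2 = r2_score(y_test, y_test_pred)
```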
By conducting both internal validation, via the training/validation split, and external validation, using a test set kept entirely separate from model building, I was able to evaluate and compare the performance of the different regression techniques more rigorously and objectively. This process identified not just the model that performed best on the data it was trained on but, more importantly, the model that generalized best to new, unseen examples, giving the most reliable predictive performance in real applications. The model with the best and most consistent performance across both the internal validation metrics and the external test set evaluation was selected as the optimal regression algorithm for the given problem and dataset.
This systematic process of evaluating the regression techniques with multiple performance metrics on both an internal validation set and a truly external test set allowed for fair model selection based on reliable estimates of out-of-sample predictive ability. It guarded against issues such as overfitting to the validation data and favored the technique that generalized robustly over one that merely achieved high scores on a specific data split. This multi-stage validation methodology produced the most confident assessment of how each regression model would perform in practice on new examples.