The predictive models were evaluated using different classification and regression performance metrics depending on the type of dataset – whether it contained categorical/discrete class labels or continuous target variables. For classification problems with discrete class labels, the most commonly used metrics included accuracy, precision, recall, F1 score and AUC-ROC.

Accuracy is the proportion of correct predictions (both true positives and true negatives) out of the total number of cases evaluated. It provides an overall view of how well the model predicts the class. However, it does not distinguish between types of error (false positives versus false negatives) and can be misleading when the classes are imbalanced, since a model that always predicts the majority class can still score highly.
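To make the imbalance caveat concrete, the following is a minimal sketch in plain Python. The `accuracy` helper and the 95/5 class split are illustrative assumptions, not data or code from the evaluation itself.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical imbalanced dataset: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority (negative) class.
y_pred_majority = [0] * 100

majority_accuracy = accuracy(y_true, y_pred_majority)
print(majority_accuracy)  # 0.95, even though no positive case is detected
```

The score looks strong although the model never identifies a single positive case, which is why accuracy alone is a weak guide on imbalanced data.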

Precision is the number of correct positive predictions divided by the total number of positive predictions made by the model. It tells us what proportion of positive predictions were actually correct. High precision means that few of the predicted positives are false positives, which matters in applications where a false alarm is costly.

Recall calculates the number of correct positive predictions made by the model out of all the actual positive cases in the dataset. It indicates what proportion of actual positive cases were predicted correctly as positive by the model. A model with high recall has a low false negative rate.

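The F1 score is the harmonic mean of precision and recall, and summarizes both in a single number. Because the harmonic mean is pulled toward the smaller of the two values, a model scores well on F1 only when precision and recall are both high. It reaches its best value at 1 and its worst at 0.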
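Precision, recall and F1 can all be derived from the same confusion counts. The sketch below assumes binary labels with 1 as the positive class. The function names, the toy labels and the convention of returning 0.0 when a denominator is zero are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1; undefined ratios are reported as 0.0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# Here tp=2, fp=1, fn=2, so precision = 2/3, recall = 1/2 and F1 = 4/7.
print(precision, recall, f1)
```

In this toy case F1 (about 0.57) sits closer to the lower of the two component scores, reflecting the behavior of the harmonic mean.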

AUC-ROC is the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate across all classification thresholds. The higher the AUC, the better the model is at ranking positive cases above negative ones. An AUC of 0.5 corresponds to a random classifier and an AUC of 1 to a perfect one.
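One way to see what AUC measures is through its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python. The `roc_auc` name and the toy scores are assumptions for illustration, and the pairwise loop is chosen for clarity and would be slow on large datasets.

```python
def roc_auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```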

For regression problems with continuous target variables, the main metrics used were Mean Absolute Error (MAE), Mean Squared Error (MSE) and R-squared.

MAE is the mean of the absolute values of the errors – the differences between the actual and predicted values. It measures the average magnitude of the errors in a set of predictions, without considering their direction. Lower values mean better predictions.

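MSE is the mean of the squared errors. It is widely used because it is differentiable and convenient to optimize, although its value is expressed in squared units of the target, which makes it harder to interpret directly than MAE. Because the errors are squared, it penalizes large errors more heavily than MAE does. Lower values indicate better predictions.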
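The different sensitivity of MAE and MSE to large errors can be shown with a small sketch. The helper names and the toy values are illustrative assumptions. Both prediction sets have the same total absolute error, but one concentrates it in a single large miss.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and two sets of predictions.
y_true = [10.0, 12.0, 14.0, 16.0]
pred_even = [11.0, 13.0, 15.0, 17.0]     # every prediction is off by 1
pred_outlier = [10.0, 12.0, 14.0, 20.0]  # three exact, one off by 4

mae_even = mean_absolute_error(y_true, pred_even)        # 1.0
mae_outlier = mean_absolute_error(y_true, pred_outlier)  # 1.0
mse_even = mean_squared_error(y_true, pred_even)         # 1.0
mse_outlier = mean_squared_error(y_true, pred_outlier)   # 4.0
print(mae_even, mae_outlier, mse_even, mse_outlier)
```

MAE rates the two sets of predictions as equally good, while MSE is four times worse for the set containing the single large error.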

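R-squared (the coefficient of determination) measures the proportion of the variance in the target variable that the model explains, relative to a baseline that always predicts the mean. Its best value is 1, indicating a perfect fit to the actual data. A value of 0 means the model does no better than the mean baseline, and on held-out data the value can be negative when the model does worse than that baseline.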
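As a sketch of the usual definition, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values. The `r_squared` helper and the toy values below are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical targets with mean 6.0.
y_true = [3.0, 5.0, 7.0, 9.0]

r2_perfect = r_squared(y_true, [3.0, 5.0, 7.0, 9.0])   # 1.0: exact predictions
r2_baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # 0.0: always predicts the mean
r2_reversed = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # -3.0: worse than the mean
print(r2_perfect, r2_baseline, r2_reversed)
```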

These metrics were calculated for the different predictive models on designated test datasets that were held out and not used during model building or hyperparameter tuning. This approach helped evaluate how well the models would generalize to new, previously unseen data samples.
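A held-out test set of this kind can be produced by shuffling the data once and setting a fixed fraction aside before any model fitting or tuning. The sketch below is a minimal illustration only. The `holdout_split` name, the 20% test fraction and the seed are assumptions, since the text does not state how the split was made.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train and test sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # local RNG, so global state is untouched
    n_test = max(1, int(round(len(rows) * test_fraction)))
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Hypothetical dataset of 10 rows, identified here by integers.
rows = list(range(10))
train_rows, test_rows = holdout_split(rows, test_fraction=0.2, seed=42)

# The test rows are set aside and never used for fitting or tuning.
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed keeps the split reproducible, so every model is compared on exactly the same unseen rows.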

For classification models, precision, recall, F1 and AUC-ROC were the primary metrics, whereas for regression tasks MAE, MSE and R-squared formed the core evaluation criteria. Accuracy was also calculated for classification, but the other metrics provided a more robust assessment of model performance, especially when dealing with imbalanced class distributions.

The metric values were tracked and compared across different predictive algorithms, model architectures, hyperparameters and preprocessing/feature engineering techniques to help identify the best performing combinations. Benchmark metric thresholds were also established based on domain expertise and prior literature to determine whether a given model’s predictive capabilities could be considered satisfactory or required further refinement.

Ensembling and stacking approaches that combined the outputs of different base models were also experimented with to achieve further boosts in predictive performance. The same evaluation metrics on holdout test sets helped compare the performance of ensembles versus single best models.
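The text does not specify how the base model outputs were combined. As one simple possibility, the sketch below averages the predicted positive-class probabilities of several models (often called soft voting) and applies a threshold. The function name, the probabilities and the 0.5 threshold are all hypothetical. Stacking would instead train a further model on the base model outputs.

```python
def average_probabilities(model_probs):
    """Average per-example positive-class probabilities across models."""
    if not model_probs:
        raise ValueError("at least one model's probabilities are required")
    n_examples = len(model_probs[0])
    if any(len(probs) != n_examples for probs in model_probs):
        raise ValueError("all models must score the same number of examples")
    n_models = len(model_probs)
    return [sum(probs[i] for probs in model_probs) / n_models
            for i in range(n_examples)]


# Hypothetical probabilities from two base models on three test examples.
probs_a = [0.9, 0.2, 0.6]
probs_b = [0.7, 0.4, 0.2]

ensemble_probs = average_probabilities([probs_a, probs_b])
ensemble_labels = [1 if p >= 0.5 else 0 for p in ensemble_probs]
print(ensemble_labels)  # [1, 0, 0]
```

The resulting ensemble predictions can then be scored on the held-out test set with the same metrics used for the individual models.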

This rigorous and standardized process of model building, validation and evaluation on independent datasets helped ensure the predictive models achieved good real-world generalization capability and avoided issues like overfitting to the training data. The experimentally identified best models could then be deployed with confidence on new incoming real-world data samples.