Accuracy: Accuracy is one of the most common and straightforward evaluation metrics in machine learning. It measures the fraction of predictions the model got right, calculated as the number of correct predictions divided by the total number of predictions. Accuracy gives an overall sense of a model’s performance but has limitations: a model can be highly accurate overall yet poor on certain types of examples. On an imbalanced dataset, for instance, a model that always predicts the majority class scores high accuracy while never detecting the minority class.
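
As a minimal sketch of the calculation and of its main limitation, the following computes accuracy for a degenerate classifier on imbalanced data. The function name, the 95/5 class split, and the labels are illustrative assumptions, not taken from any particular library or dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A degenerate "model" that always predicts the negative class.
y_pred_always_negative = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred_always_negative)
print(baseline_accuracy)  # 0.95, although no positive example is ever found
```

The high score here says nothing about the positive class, which is why accuracy is usually read alongside per-class metrics.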

Precision: Precision measures the ability of a model to avoid labeling negative examples as positive. It is calculated as the number of true positives (TP) divided by the sum of true positives and false positives (FP), that is, TP / (TP + FP). A high precision means that when the model predicts an example as positive, it is usually truly positive. Precision is important when misclassifying a negative example as positive has serious consequences, for example when a medical test incorrectly diagnoses a healthy person as sick.

Recall/Sensitivity: Recall measures the ability of a model to find all positive examples. It is calculated as the number of true positives (TP) divided by the sum of true positives and false negatives (FN), that is, TP / (TP + FN). A high recall means the model captured most of the truly positive examples. Recall is important when missing a positive example is costly, for example when identifying diseases from medical scans.
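
The precision and recall formulas can be sketched from raw confusion counts as follows. The helper names and the toy labels are assumptions made for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    # Assumed convention: 0.0 when the model made no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Assumed convention: 0.0 when there are no positive examples.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)     # 2 1 2 5
print(precision(tp, fp))  # 2 of 3 positive predictions are correct
print(recall(tp, fn))     # 0.5: 2 of 4 actual positives are found
```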

F1 Score: The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). It combines both into a single measure that weights them equally, reaching its best value at 1 and its worst at 0. Because the harmonic mean is pulled toward the smaller of the two values, a high F1 score requires both precision and recall to be high. The F1 score is commonly used when there is an imbalance between the positive and negative classes.
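
To illustrate why the harmonic mean is used, this small sketch compares it with a plain arithmetic mean. The precision and recall values are made up for the example, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical model with high precision but very low recall.
p, r = 0.9, 0.1

arithmetic_mean = (p + r) / 2  # 0.5, which hides the poor recall
f1 = f1_score(p, r)            # about 0.18, dominated by the weaker value

print(arithmetic_mean, round(f1, 2))
```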

Specificity: Specificity, also called the true negative rate, measures the ability of a model to correctly identify the absence of a condition. It is calculated as the number of true negatives (TN) divided by the sum of true negatives and false positives (FP), that is, TN / (TN + FP). A high specificity means the model correctly labeled most examples that did not have the condition as negative. Specificity is important where false alarms are costly, such as a disease screening program in which every false positive triggers unnecessary follow-up tests.
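
A brief sketch of the specificity calculation, which also shows its relationship to the false positive rate (FPR = 1 − specificity). The counts are hypothetical, and the zero-denominator handling is an assumed convention.

```python
def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    # Assumed convention: 0.0 when there are no negative examples.
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical counts: 100 healthy patients, 10 of them wrongly flagged.
tn, fp = 90, 10

spec = specificity(tn, fp)
false_positive_rate = fp / (tn + fp)

print(spec)                 # 0.9
print(false_positive_rate)  # 0.1, the complement of specificity
```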

AUC ROC Curve: AUC ROC stands for Area Under the Receiver Operating Characteristic curve. The ROC curve plots the true positive rate against the false positive rate as the classification threshold varies, and the AUC summarizes the curve as a single number that measures how well the model separates the classes. AUC ranges between 0 and 1, with a higher score representing better performance; a score of 0.5 corresponds to random guessing. Unlike accuracy, AUC does not depend on a single threshold and is less sensitive to class imbalance, although with heavily skewed classes it can still look optimistic. The ROC curve helps visualize performance across thresholds, and the AUC makes it easy to compare models.
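
One way to make the threshold sweep concrete is the sketch below, which builds ROC points by lowering the threshold through the sorted scores and then integrates with the trapezoidal rule. It assumes binary labels encoded as 0 and 1; the function names and the four-example dataset are illustrative only.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct score threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC needs at least one positive and one negative")

    # Highest scores first: lowering the threshold admits them in this order.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example: one negative (score 0.4) outranks one positive (score 0.35).
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(y_true, scores)
print(curve)       # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc(curve))  # 0.75
```

The result of 0.75 matches the ranking interpretation of AUC: of the four positive/negative pairs, three are ordered correctly.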

Cross Validation: To properly evaluate a machine learning model, it is important to validate it using techniques like k-fold cross validation. In k-fold cross validation, the dataset is divided into k smaller sets, or folds. The model is trained k times, each time using k-1 folds for training and the remaining fold for validation, so that each of the k folds is used exactly once for validation. The k results are then averaged to give an overall validation score. This method reduces the variance of the performance estimate compared with a single train/validation split and gives insight into how the model will generalize to an independent dataset.
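
The splitting procedure might be sketched as follows. The generator name, the toy labels, and the majority-class "model" are assumptions for illustration; the sketch also splits in index order, whereas in practice the data is often shuffled first.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def majority_class_score(labels, train_idx, val_idx):
    """Toy stand-in for 'train a model, then score it' (hypothetical)."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in val_idx) / len(val_idx)


labels = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]  # made-up binary labels

fold_scores = [majority_class_score(labels, train, val)
               for train, val in k_fold_indices(len(labels), k=5)]
mean_score = sum(fold_scores) / len(fold_scores)

print(fold_scores)  # one validation score per fold
print(mean_score)   # averaged estimate
```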

A/B Testing: A/B testing involves comparing two versions of a model or system by evaluating them on key metrics with real users. For example, a production model can be A/B tested against a proposed new model to see whether the new model actually performs better. Because it uses real data exactly as the model will encounter it, A/B testing is an excellent way to compare models and select the better one for deployment. Metrics such as conversion rate, clicks, and purchases can help decide which model provides the better user experience.
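
Deciding whether an observed difference in conversion rate is more than noise usually requires a statistical test. The sketch below uses a two-proportion z-test, which is one common choice and an assumption here rather than something the text prescribes; the function name and the traffic numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: model A (production) vs. model B (candidate).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000,
                                   conv_b=250, n_b=5000)
print(round(z, 2), round(p_value, 3))
```

A small p-value suggests the difference is unlikely to be due to chance alone, but the significance level and sample size should be fixed before the experiment starts.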

Model Explainability: For high-stakes applications, it is critical that models are explainable and auditable. We should be able to explain why a model made a particular prediction for an example. Techniques for interpreting individual predictions include LIME, SHAP, and integrated gradients. Global explanations, such as SHAP summary plots, help in understanding feature importance and overall model behavior. Domain experts can manually analyze the explanations to check that predictions are made for scientifically valid reasons rather than spurious correlations. A lack of robust explanations can be a sign that the model will fail to generalize.

Testing on Blind Data: To convincingly evaluate the real effectiveness of a model, it must be rigorously tested on completely new blind data that was not used during any part of model building, including data selection, feature engineering, model tuning, and parameter optimization. Only then can we say with confidence how well the model would generalize to new real-world data after deployment. Testing on truly blind data guards against overfitting to the dev/test datasets. Key metrics on the blind data should be close to, or better than, those on the initial dev/test data before claiming generalizability.
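
One simple way to keep blind data truly untouched is to set it aside once, before any modeling work begins. The following is a sketch under assumed split fractions and a fixed seed; the function name and the 70/15/15 proportions are illustrative, not a recommendation from the text.

```python
import random


def three_way_split(n_samples, dev_fraction=0.15, blind_fraction=0.15, seed=0):
    """Split sample indices into train, dev, and blind sets, once, up front."""
    indices = list(range(n_samples))
    # A fixed seed (an assumption here) keeps the blind set stable across runs.
    random.Random(seed).shuffle(indices)

    n_blind = round(n_samples * blind_fraction)
    n_dev = round(n_samples * dev_fraction)

    blind = indices[:n_blind]               # evaluated only at the very end
    dev = indices[n_blind:n_blind + n_dev]  # used for tuning and selection
    train = indices[n_blind + n_dev:]       # used for fitting
    return train, dev, blind


train_idx, dev_idx, blind_idx = three_way_split(100)
print(len(train_idx), len(dev_idx), len(blind_idx))  # 70 15 15
```

Feature engineering and tuning would then use only the train and dev indices, with the blind indices consulted a single time for the final report.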