# Metrics

## Binary classification

### Accuracy

Accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined.
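As a minimal pure-Python sketch (the function name is illustrative, not part of any API described here), accuracy is simply the share of matching labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```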

### Logloss

This metric is used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is

-log P(yt|yp) = -(yt log(yp) + (1 - yt) log(1 - yp))
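The formula above can be sketched in pure Python, averaged over samples. The function name and the epsilon clipping (which avoids log(0) for hard 0/1 predictions) are implementation choices, not part of the definition:

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of true binary labels.

    y_true: labels in {0, 1}; y_prob: estimated probability that label is 1.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for yt, yp in zip(y_true, y_prob):
        yp = min(max(yp, eps), 1 - eps)
        total += -(yt * math.log(yp) + (1 - yt) * math.log(1 - yp))
    return total / len(y_true)
```

Confident correct predictions (e.g. yp = 0.9 for yt = 1) contribute a small loss; confident wrong ones are penalized heavily.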

### F1

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = 2 * (p * r) / (p + r).
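A minimal sketch of the binary F1 computation from raw labels, assuming the positive class is labeled 1 (the function name is illustrative):

```python
def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```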

### ROC AUC

A receiver operating characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity).

The ROC curve can also be thought of as a plot of the power as a function of the Type I error of the decision rule (when the performance is calculated from just a sample of the population, these can be thought of as estimators of those quantities). The ROC curve is thus the sensitivity as a function of fall-out.

In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −infinity to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.

When using normalized units, the area under the curve (often referred to as simply the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
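The probabilistic interpretation above translates directly into a brute-force O(n_pos * n_neg) sketch: count the positive/negative pairs where the positive is scored higher, with ties counted as half. This is a didactic implementation, not an efficient one:

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a randomly chosen positive instance
    is scored higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Production implementations compute the same quantity from sorted scores in O(n log n).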

### Average Precision

Average precision (AP) is computed from prediction scores. It summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

AP = sum_n (R_n - R_{n-1}) * P_n,

where P_n and R_n are the precision and recall at the nth threshold. This metric is restricted to the binary classification task.
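A simplified sketch of this summation, taking one threshold per ranked prediction in decreasing score order (a real implementation must also handle tied scores, which this sketch ignores; the function name is illustrative):

```python
def average_precision(y_true, y_score):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
        precision = tp / rank      # precision among the top `rank` predictions
        recall = tp / n_pos        # share of all positives recovered so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```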

### Cohen's kappa

Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of kappa is:

kappa = (p_o - p_e) / (1 - p_e),

where p_o is the relative observed agreement among raters (identical to accuracy), and p_e is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then kappa=1. If there is no agreement among the raters other than what would be expected by chance (as given by p_e), kappa=0. It is possible for the statistic to be negative, which implies that there is no effective agreement between the two raters or the agreement is worse than random.
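A minimal sketch of this definition, estimating p_e from each rater's observed category frequencies (the function name is illustrative):

```python
def cohen_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two raters' label lists."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # p_o: observed agreement (identical to accuracy between the two raters)
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # p_e: chance agreement from each rater's marginal category frequencies
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)
```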

### Gini coefficient

The Gini coefficient is the summary statistic of the Cumulative Accuracy Profile (CAP) chart. It is calculated as the ratio of the area enclosed between the model's CAP curve and the diagonal (random model) to the corresponding area for an ideal (perfect) model. For binary classification it is related to ROC AUC by:

GINI = 2 * ROC_AUC - 1.

## Multiclass classification

### Accuracy

Accuracy is the fraction of correctly classified samples among all samples.

### Log loss

Log loss, also known as cross-entropy loss, measures how closely the probability q_i produced by the predictive model matches the true probability p_i (given by the true label). It is defined as:

-log P(p|q) = -sum_i p_i * log(q_i),

where sum is done over all classes.

### F1

In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter, which can be:

- micro: Calculate metrics globally by counting the total true positives, false negatives and false positives.
- macro: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
- weighted: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance; it can result in an F-score that is not between precision and recall.
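The three averaging modes can be sketched for the single-label multiclass case as follows (function names illustrative; note that for single-label problems micro-F1 reduces to accuracy, since every misclassification is one false positive and one false negative):

```python
def f1_per_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def f1_average(y_true, y_pred, average="macro"):
    classes = sorted(set(y_true) | set(y_pred))
    if average == "micro":
        # Global tp/fp/fn counts; equals accuracy for single-label tasks.
        return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    scores = [f1_per_class(y_true, y_pred, c) for c in classes]
    if average == "macro":
        return sum(scores) / len(scores)
    if average == "weighted":
        support = [sum(1 for t in y_true if t == c) for c in classes]
        return sum(s * w for s, w in zip(scores, support)) / sum(support)
    raise ValueError(average)
```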

## Regression

### Median absolute error

This metric calculates the median of the absolute differences between the true and predicted values. It is more robust to outliers than the MAE and MSE metrics.
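A one-line sketch illustrating the robustness claim: a single wildly wrong prediction can leave the median error unchanged (function name illustrative):

```python
import statistics

def median_absolute_error(y_true, y_pred):
    """Median of |y_true - y_pred|; one huge outlier barely moves it."""
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))
```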

### Mean absolute error (MAE)

In statistics, mean absolute error (MAE) is a measure of difference between two continuous variables. Assume X and Y are variables of paired observations that express the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. Consider a scatter plot of n points, where point i has coordinates (xi, yi)... Mean Absolute Error (MAE) is the average vertical distance between each point and the identity line. MAE is also the average horizontal distance between each point and the identity line.

The mean absolute error is given by:

MAE = (1/n) * sum_i |y_i - x_i|.
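A direct sketch of the average absolute difference (function name illustrative):

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - x_i| over all n paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```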

### Mean squared error (MSE)

In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between the estimated values and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss. The fact that MSE is almost always strictly positive (and not zero) is because of randomness or because the estimator does not account for information that could produce a more accurate estimate.

The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

The MSE is the second moment (about the origin) of the error, and thus incorporates both the variance of the estimator (how widely spread the estimates are from one data sample to another) and its bias (how far off the average estimated value is from the truth). For an unbiased estimator, the MSE is the variance of the estimator. Like the variance, MSE has the same units of measurement as the square of the quantity being estimated. In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.

If Y_hat is a vector of n predictions generated from a sample of n data points on all variables, and Y is the vector of observed values of the variable being predicted, then the within-sample MSE of the predictor is computed as

MSE = (1/n) * sum_i (Y_i - Y_hat_i)^2.
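As a minimal sketch of the within-sample computation (function name illustrative):

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Note how squaring magnifies large errors relative to MAE, which is why MSE is more sensitive to outliers.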

### Mean squared log error (MSLE)

MSLE is typically used when you do not want to heavily penalize large absolute differences between predicted and true values in cases where both are large numbers; it penalizes relative rather than absolute differences.
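A sketch using the common log1p formulation (MSE on log(1 + y), which requires non-negative targets and predictions; the function name is illustrative):

```python
import math

def mean_squared_log_error(y_true, y_pred):
    """MSE computed on log(1 + y): relative error matters, scale does not."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)
```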

### Coefficient of Determination (R2)

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

There are several definitions of R2 that are only sometimes equivalent. One class of such cases includes that of simple linear regression where r2 is used instead of R2. When an intercept is included, then r2 is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values. If additional regressors are included, R2 is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination ranges from 0 to 1.
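The standard R2 = 1 - SS_res / SS_tot definition can be sketched as follows (function name illustrative). A perfect model scores 1; a model that always predicts the mean of the observed values scores 0; worse models can go negative:

```python
def r2_score(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot: proportion of variance explained."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)   # total variation
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual
    return 1 - ss_res / ss_tot
```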

## Timeseries

### Mean absolute scaled error (MASE)

In statistics, the mean absolute scaled error (MASE) is a measure of the accuracy of forecasts. It was proposed in 2005 by statistician Rob J. Hyndman and Professor of Decision Sciences Anne B. Koehler, who described it as a "generally applicable measurement of forecast accuracy without the problems seen in the other measurements." The mean absolute scaled error has favorable properties when compared to other methods for calculating forecast errors, such as root-mean-square deviation, and is therefore recommended for determining comparative accuracy of forecasts.

For a non-seasonal time series, the mean absolute scaled error is estimated by

MASE = mean(|e_t|) / ( (1/(n-1)) * sum_{t=2..n} |Y_t - Y_{t-1}| ),

where e_t is the forecast error at time t and the denominator is the in-sample MAE of the naive one-step forecast (each value predicted by the previous observation). A MASE below 1 means the forecast beats the naive method on average.
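A simplified sketch, scaling by the naive one-step error computed on the same series. The full definition scales by the naive error on the *training* data, which this sketch does not distinguish; the function name is illustrative:

```python
def mase(y_true, y_pred):
    """Forecast MAE divided by the MAE of the naive one-step forecast."""
    n = len(y_true)
    mae_forecast = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    # Naive forecast: each value predicted by the previous observation.
    mae_naive = sum(
        abs(y_true[i] - y_true[i - 1]) for i in range(1, n)
    ) / (n - 1)
    return mae_forecast / mae_naive
```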

### Mean directional accuracy (MDA)

Mean directional accuracy (MDA), also known as mean direction accuracy, is a measure of the prediction accuracy of a forecasting method in statistics. It compares the forecast direction (upward or downward) to the actual realized direction. It is defined by the following formula:

MDA = (1/N) * sum_t 1[sign(A_t - A_{t-1}) = sign(F_t - A_{t-1})],

where A_t is the actual value and F_t the forecast at time t, and 1[...] is the indicator function.
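The direction comparison can be sketched as follows, counting the time steps where the forecast moves the same way as the actual series (function name illustrative):

```python
def mean_directional_accuracy(actual, forecast):
    """Fraction of steps where sign(F_t - A_{t-1}) matches sign(A_t - A_{t-1})."""
    def sign(x):
        return (x > 0) - (x < 0)
    hits = sum(
        1
        for t in range(1, len(actual))
        if sign(actual[t] - actual[t - 1]) == sign(forecast[t] - actual[t - 1])
    )
    return hits / (len(actual) - 1)
```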

## Scoring Function Names for API

Listed below are the values you can use with the API to set the scoring function to be used by Auger to evaluate models.

### Regression and Time Series

- explained_variance
- neg_median_absolute_error
- neg_mean_absolute_error
- neg_mean_squared_error
- neg_mean_squared_log_error
- r2
- neg_rmsle
- neg_mase
- mda
- neg_rmse

### Classification

- accuracy
- f1
- f1_macro
- f1_micro
- f1_weighted
- neg_log_loss
- precision
- precision_macro
- precision_micro
- precision_weighted
- recall
- recall_macro
- recall_micro
- recall_weighted
- roc_auc
- gini
- cohen_kappa_score
- matthews_corrcoef