


Goodness of fit


Stochastic models usually return multiple predictions, as samples from a presumed or modeled population. Each new input (set of features) may produce a different population, and we wish to have a probabilistic model that learns them all in a training process. The probabilistic model may be an ensemble of individual deterministic models, or a formula for a combined distribution from which samples are drawn. The process of estimating the accuracy of predicted probabilities is called calibration in the literature. In order to calibrate a model, we need to know the true distribution of the predicted target. Such distributions are hard to obtain for datasets recorded from physical objects, and for that reason we have to make synthetic data.
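For illustration of the ensemble case, here is a minimal sketch; the three linear "experts" and their coefficients are invented for this example and are not a real income model:
#include <cstdio>
#include <vector>
#include <functional>

int main() {
    // Hypothetical ensemble: each member is a deterministic model that
    // maps the same features (age, education level) to one prediction.
    std::vector<std::function<double(double, double)>> ensemble = {
        [](double age, double edu) { return 900.0 * edu + 20.0 * age; },
        [](double age, double edu) { return 850.0 * edu + 25.0 * age; },
        [](double age, double edu) { return 950.0 * edu + 15.0 * age; }
    };
    // For a single input the collected outputs form a sample from the
    // predicted distribution rather than a single scalar.
    for (const auto& model : ensemble) {
        printf("%.1f\n", model(50.0, 2.0));
    }
}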

For example, if our dataset contains the incomes of salaried individuals as a function of their demographic data, the model has to be stochastic due to the nature of the object. When we ask the model how much 50 year old males with professional education are making, the answer must be a distribution, not a scalar. Obviously, such a distribution should be significantly different for a different input, such as 20 year old females without professional education, and not significantly different when the demographic parameters are not significantly different. This presumed gradual change in distributions makes it possible to build accurate stochastic models even when every input is different.

The dataset that could have been a very good benchmark for such probabilistic models, had it been correctly collected, is called adult. The word 'COULD' points to the missed opportunity. The data was collected in a very unprofessional way and cannot be used: the income (which is the target) is provided only as a binary parameter, either exceeding 50K or not, and no samples with a true distribution are provided, such as groups with the same, or at least close, demographic parameters.

This dataset is vivid evidence of one known statistical rule: when unprofessional people collect statistical data, the data is useless.

Since I could not find experimental data for calibration of probabilistic models, I made it myself. One simple way is to use a rather complex algebraic formula with noisy arguments: $$T = f(x^*_1, x^*_2, x^*_3, \ldots),$$ $$x^*_i = x_i + n_i,$$ where $x_i$ are the observed features, $x^*_i$ are the unobserved values, and $n_i$ is a random value which is used for computing the target $T$ but is not recorded.

The probabilistic model returns samples for the provided features, which can be compared to Monte Carlo generated samples obtained by collecting computed targets $T$ for different noise values. Obviously, for such a formula, the target distribution will depend on the features.
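Here is a minimal sketch of such a generator; the particular formula f, the fixed input, and the uniform noise below are assumptions made only for illustration:
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>

// Hypothetical target formula; any sufficiently nonlinear f will do.
static double f(double x1, double x2, double x3) {
    return x1 * x2 + sqrt(fabs(x3)) * x1;
}

int main() {
    srand((unsigned int)time(NULL));
    // One fixed observed input; the noise n_i is redrawn for every sample.
    double x1 = 2.0, x2 = 3.0, x3 = 5.0;
    const int nSamples = 1000;
    for (int i = 0; i < nSamples; ++i) {
        double n1 = (rand() % 100) / 100.0 - 0.5;   // uniform on [-0.5, 0.5)
        double n2 = (rand() % 100) / 100.0 - 0.5;
        double n3 = (rand() % 100) / 100.0 - 0.5;
        // The collected targets approximate the true distribution for this input.
        printf("%f\n", f(x1 + n1, x2 + n2, x3 + n3));
    }
}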
To calibrate a trained model we have to conduct a goodness of fit test, which tells how close our prediction is to the true distribution. The sample returned by a probabilistic model is usually small, typically 20 to 100 values, while the true distribution is obtained by Monte Carlo and may be of a much bigger size. The next section explains how the goodness of fit test is conducted.

Median tree

When sample sizes are significantly different, the samples can be compared by their nested medians. A sample is sorted and the first median is found. Then the sample is divided into two subsets over this median, and two more medians, one from each subset, are added to the list. This process of dividing and taking medians continues to the depth allowed by the size of the shortest sample; the longer sample is divided the same number of times, so that both median lists have the same size. The median lists are sorted and compared. The distance between the sorted median lists tells how statistically close the two tested samples are. Here is an example:
    srand((unsigned int)time(NULL));
    int nLong = 4048;    // large sample, plays the role of the population
    auto xLong = std::make_unique<double[]>(nLong);
    for (int i = 0; i < nLong; ++i) {
        // average of three uniform draws from [0, 100)
        xLong[i] = (rand() % 100 + rand() % 100 + rand() % 100) / 3.0;
    }

    int nShort = 32;     // small sample to be tested against the population
    auto xShort = std::make_unique<double[]>(nShort);
    for (int i = 0; i < nShort; ++i) {
        // average of two uniform draws: similar location, larger spread
        xShort[i] = (rand() % 100 + rand() % 100) / 2.0;
    }
The two arrays xLong and xShort are generated with a certain statistical similarity; their median lists are printed side by side below (xShort in the left column, xLong in the right):
27.0 29.7
37.5 37.3
44.8 43.3
49.2 49.0
53.8 54.3
57.8 60.7
62.8 68.7
The proximity of these two lists is measured by the relative distance $S$: $$S = \frac{2\,\|X - Y\|}{\|X\| + \|Y\|},$$ where the vectors $X$ and $Y$ are the two columns of the table above. The large sample, obtained by Monte Carlo, is considered the population. We take 100 subsamples from it and compute, via median trees, 100 relative distances $S_i$. Then we compute the same distance $S_t$ for the tested short sample. By comparing $S_t$ with the $S_i$ we can tell whether the short sample could have been drawn from the same population. In this test we compare it to the maximum $\max(S_i)$: if $S_t$ is less than $\max(S_i)$, we consider the test passed at the 1% significance level.
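As a worked check with the table values above (rounded to three digits): $$\|X - Y\| \approx 7.29, \qquad \|X\| \approx 129.4, \qquad \|Y\| \approx 133.8, \qquad S \approx \frac{2 \cdot 7.29}{129.4 + 133.8} \approx 0.055.$$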

The measure $S$ is called the test statistic. It can be proven that this test is related to the Cramér-von Mises test. Below I provide an elementary demo for clarity.
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <cmath>
#include <memory>
#include <vector>
#include <algorithm>

// Splits the sorted vector x into its lower half y and upper half z
// and returns the median of x.
static double SplitVector(std::vector<double>& x, std::vector<double>& y, std::vector<double>& z) {
    size_t n1 = x.size() / 2;
    double m;
    if (x.size() % 2 == 0) {
        m = (x[n1] + x[n1 - 1]) / 2.0;
        for (size_t i = 0; i < n1; ++i) y.push_back(x[i]);
        for (size_t i = n1; i < x.size(); ++i) z.push_back(x[i]);
    }
    else {
        m = x[n1];
        for (size_t i = 0; i < n1; ++i) y.push_back(x[i]);
        for (size_t i = n1 + 1; i < x.size(); ++i) z.push_back(x[i]);
    }
    return m;
}

// Builds a sorted list of seven nested medians for the N values in z:
// the overall median, the medians of the two halves, and the medians
// of the four quarters (a median tree of depth three).
static std::unique_ptr<double[]> ArrayToMedians7(std::unique_ptr<double[]>& z, int N) {
    std::vector<double> x(z.get(), z.get() + N);
    std::sort(x.begin(), x.end());

    std::vector<double> m;
    std::vector<double> left, right;
    m.push_back(SplitVector(x, left, right));       // median of the whole sample

    std::vector<double> ll, lr, rl, rr;
    m.push_back(SplitVector(left, ll, lr));         // median of the lower half
    m.push_back(SplitVector(right, rl, rr));        // median of the upper half

    std::vector<double> a, b;
    for (std::vector<double>* quarter : { &ll, &lr, &rl, &rr }) {
        a.clear();
        b.clear();
        m.push_back(SplitVector(*quarter, a, b));   // medians of the four quarters
    }

    std::sort(m.begin(), m.end());
    auto medians = std::make_unique<double[]>(m.size());
    std::copy(m.begin(), m.end(), medians.get());
    return medians;
}

// Relative distance S = 2 * ||X - Y|| / (||X|| + ||Y||) between two
// median lists x and y of length N.
static double RelativeDistance(std::unique_ptr<double[]>& x, std::unique_ptr<double[]>& y, int N) {
    double normX = 0.0;
    double normY = 0.0;
    double normDiff = 0.0;
    for (int i = 0; i < N; ++i) {
        normDiff += (x[i] - y[i]) * (x[i] - y[i]);
        normX += x[i] * x[i];
        normY += y[i] * y[i];
    }
    normDiff = sqrt(normDiff);
    normX = sqrt(normX);
    normY = sqrt(normY);
    return 2.0 * normDiff / (normX + normY);
}

// Tests whether xShort could be a sample from the population represented
// by xLong, at the 1% significance level.
static bool PassedNestedMedianTest(std::unique_ptr<double[]>& xLong, std::unique_ptr<double[]>& xShort, int nLong, int nShort) {
    auto medianLong = ArrayToMedians7(xLong, nLong);
    auto sampleFromLong = std::make_unique<double[]>(nShort);

    // Draw 100 random subsamples from the long sample and record the
    // largest relative distance max(S_i) between their median trees and
    // the median tree of the long sample itself.
    double maxDist = 0.0;
    for (int k = 0; k < 100; ++k) {
        for (int i = 0; i < nShort; ++i) {
            sampleFromLong[i] = xLong[rand() % nLong];
        }
        auto median = ArrayToMedians7(sampleFromLong, nShort);
        double dist = RelativeDistance(medianLong, median, 7);
        if (dist > maxDist) maxDist = dist;
    }

    // Compute the same statistic S_t for the tested short sample.
    auto medianShort = ArrayToMedians7(xShort, nShort);
    for (int i = 0; i < 7; ++i) {
        printf("%4.1f %4.1f\n", medianShort[i], medianLong[i]);
    }
    printf("\n");
    double testedDist = RelativeDistance(medianLong, medianShort, 7);

    // The test passes when S_t does not exceed max(S_i).
    return testedDist < maxDist;
}

int main()
{
    srand((unsigned int)time(NULL));
    int nLong = 4048;    // large sample, plays the role of the population
    auto xLong = std::make_unique<double[]>(nLong);
    for (int i = 0; i < nLong; ++i) {
        // average of three uniform draws from [0, 100)
        xLong[i] = (rand() % 100 + rand() % 100 + rand() % 100) / 3.0;
    }

    int nShort = 32;     // small sample to be tested against the population
    auto xShort = std::make_unique<double[]>(nShort);
    for (int i = 0; i < nShort; ++i) {
        // average of two uniform draws: similar location, larger spread
        xShort[i] = (rand() % 100 + rand() % 100) / 2.0;
    }

    bool isOk = PassedNestedMedianTest(xLong, xShort, nLong, nShort);
    if (isOk) {
        printf("The test passed\n");
    }
    else {
        printf("The test failed\n");
    }
}