Cross Validation

dkregression.cross_validation.CrossValidation(max_evaluations=100, max_data_withhold=0.1)

This class handles all the functionality necessary for cross-validating models and evaluating the likelihood under the cross-validation paradigm. The cross-validation method is used to fit the kernel (hyper)parameters to the dataset: merely maximizing the likelihood of the training dataset would lead to severe overfitting of the model. Depending on the setup and the dataset size, cross-validation, especially leave-one-out cross-validation, may become computationally intractable. This module gives the user full control over the computational budget and granularity of the cross-validation procedure through two arguments: max_evaluations (the maximum allowed number of cross-validation runs) and max_data_withhold (the maximum fraction of the dataset that can be withheld). Given these two parameters and the dataset size n, the CrossValidation module exhibits one of three cross-validation behaviors.
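The splits produced by this class might be consumed along these lines; `cv_log_likelihood` and `log_likelihood` are hypothetical stand-ins (not part of dkregression) for fitting a model on the "train" part of each trial and scoring the held-out part:

```python
# Hypothetical sketch (not the library's API): sum the held-out
# log-likelihood over all cross-validation trials. Hyperparameters are
# then chosen to maximize this sum rather than the training likelihood
# alone, which avoids the overfitting described above.

def cv_log_likelihood(splits, params, log_likelihood):
    # Each item in `splits` is a dict with 'x', 'y' ("train" set) and
    # 'xq', 'yq' ("test" set), as returned by split_data.
    return sum(log_likelihood(params, s['x'], s['y'], s['xq'], s['yq'])
               for s in splits)
```

A hyperparameter search would call this once per candidate `params` and keep the maximizer.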

Case 1: max_evaluations >= n

In this case a full leave-one-out cross-validation procedure can be carried out.

Case 2: n * max_data_withhold * max_evaluations >= n

In this case, there are too many datapoints for a full leave-one-out cross-validation to be carried out; however, it is still possible for each datapoint to be part of the "test" set exactly once. Note that all max_evaluations cross-validation trials will be used in this case to keep the number of held-out samples per trial as small as possible.

Case 3: n * max_data_withhold * max_evaluations < n

In this case, there are too many datapoints for each point to be part of the "test" set even once. There will be max_evaluations cross-validation trials, each with n * max_data_withhold held-out datapoints. While not every datapoint will be part of the "test" set, it is guaranteed that no datapoint will appear in the "test" set more than once.
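The case selection can be summarized in a small sketch; `cv_case` is an illustrative helper mirroring the thresholds described above, not part of the library:

```python
# Illustrative helper (not part of dkregression): decide which of the
# three cross-validation cases applies and how many trials result.

def cv_case(n, max_evaluations=100, max_data_withhold=0.1):
    total_withhold = int(n * max_data_withhold * max_evaluations)
    if max_evaluations >= n:
        return 1, n                    # case 1: leave-one-out, n trials
    if total_withhold >= n:
        return 2, max_evaluations      # case 2: every point tested once
    return 3, max_evaluations          # case 3: only total_withhold points tested

print(cv_case(40, max_evaluations=50, max_data_withhold=0.05))    # (1, 40)
print(cv_case(100, max_evaluations=50, max_data_withhold=0.05))   # (2, 50)
print(cv_case(100, max_evaluations=5, max_data_withhold=0.05))    # (3, 5)
```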

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `max_evaluations` | `int` | The maximum number of cross-validation trials. | `100` |
| `max_data_withhold` | `float` | The maximum fraction of datapoints to be withheld for each cross-validation run. Needs to be between 0 and 1. | `0.1` |

Examples:

```py
import torch
from dkregression.cross_validation import CrossValidation

# cross-validation object with a maximum of 500 trials;
# each trial can withhold at most 5% of the entire dataset for testing
cv = CrossValidation(max_evaluations=500, max_data_withhold=0.05)
```
Source code in src/dkregression/cross_validation/cross_validation.py
def __init__(self,max_evaluations=100,max_data_withhold=0.1) -> None:
    """
    Args:
        max_evaluations (int, optional): The maximum number of cross-validation trials.
        max_data_withhold (float, optional): The maximum fraction of datapoints to be withheld for each cross-validation run. Needs to be between 0 and 1.

    Examples:
        ```py
        import torch
        from dkregression.cross_validation import CrossValidation

        # cross-validation object with a maximum of 500 trials;
        # each trial can withhold at most 5% of the entire dataset for testing
        cv = CrossValidation(max_evaluations=500, max_data_withhold=0.05)
        ```

    """
    self.max_evaluations = max_evaluations   #how many times the log_likelihood function should be called at max
    self.max_data_withhold = max_data_withhold   #what is the maximum percentage of data points to be removed from the training dataset and withheld for testing

split_data(X, Y)

Given the configuration with max_evaluations and max_data_withhold, this function splits the dataset, given by X and Y, into the cross-validation trials. Based on max_evaluations, max_data_withhold, and the dataset size, one of the cross-validation cases outlined above is used.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `X` | `Tensor` | The tensor of input points. Needs to be of shape `(n,d_input)`, even if `d_input==1`. | required |
| `Y` | `Tensor` | The tensor of output points. Needs to be of shape `(n,d_output)`, even if `d_output==1`. | required |

Returns:

| Type | Description |
| --- | --- |
| `list` | Each item in the returned list is a dictionary with the keys `'x'`, `'y'`, `'xq'`, and `'yq'`, where `'x'` and `'y'` form the "train" set and `'xq'` and `'yq'` form the "test" set. |

Examples:

```py
import torch
from dkregression.cross_validation import CrossValidation

X1, Y1 = torch.rand((40,2)), torch.rand((40,1))
X2, Y2 = torch.rand((100,2)), torch.rand((100,1))

# cross-validation object with a maximum of 50 trials;
# each trial can withhold at most 5% of the entire dataset for testing
cv = CrossValidation(max_evaluations=50, max_data_withhold=0.05)

print(len(cv.split_data(X1,Y1)))
print(len(cv.split_data(X2,Y2)))
```

This results in the following output:

```
40
50
```
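The structure of each returned item can also be illustrated without torch; the following is a stdlib-only reimplementation of case 2 for illustration, not the library code:

```python
import random

# Stdlib-only sketch mirroring case 2: shuffle the indices, split them
# across max_evaluations trials (like torch.tensor_split), and build the
# {'x', 'y', 'xq', 'yq'} dictionary for each trial.

def split_case2(xs, ys, max_evaluations):
    n = len(xs)
    order = random.sample(range(n), n)    # like torch.randperm(n)
    k, r = divmod(n, max_evaluations)     # tensor_split chunk sizes
    dataset, start = [], 0
    for i in range(max_evaluations):
        size = k + (1 if i < r else 0)
        test = set(order[start:start + size]); start += size
        dataset.append({
            'x':  [xs[j] for j in range(n) if j not in test],  # "train" inputs
            'y':  [ys[j] for j in range(n) if j not in test],  # "train" outputs
            'xq': [xs[j] for j in sorted(test)],               # "test" inputs
            'yq': [ys[j] for j in sorted(test)],               # "test" outputs
        })
    return dataset

xs = [[random.random(), random.random()] for _ in range(100)]
ys = [[random.random()] for _ in range(100)]
splits = split_case2(xs, ys, max_evaluations=50)
print(len(splits))                                 # 50
print(sorted(splits[0].keys()))                    # ['x', 'xq', 'y', 'yq']
print(len(splits[0]['xq']), len(splits[0]['x']))   # 2 98
```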

Source code in src/dkregression/cross_validation/cross_validation.py
def split_data(self,X,Y):
    """Given the configuration with `max_evaluations` and `max_data_withhold`, this function splits the dataset, given by `X` and `Y` into the cross-validation trials. Based on `max_evaluations`, `max_data_withhold`, and the dataset size the cross-validation case as outlined above is used.

    Args:
        X (torch.Tensor): The tensor of input points. Needs to be of shape `(n,d_input)`, even if `d_input==1`.
        Y (torch.Tensor): The tensor of output points. Needs to be of shape `(n,d_output)`, even if `d_output==1`.

    Returns:
        list: Each item in the returned list is a dictionary with the keys `'x'`, `'y'`, `'xq'`, and `'yq'`, where `'x'` and `'y'` form the "train" set and `'xq'` and `'yq'` form the "test" set.

    Examples:
        ```py
        import torch
        from dkregression.cross_validation import CrossValidation

        X1, Y1 = torch.rand((40,2)), torch.rand((40,1))
        X2, Y2 = torch.rand((100,2)), torch.rand((100,1))

        # cross-validation object with a maximum of 50 trials;
        # each trial can withhold at most 5% of the entire dataset for testing
        cv = CrossValidation(max_evaluations=50, max_data_withhold=0.05)

        print(len(cv.split_data(X1,Y1)))
        print(len(cv.split_data(X2,Y2)))
        ```

        This results in the following output:
        ```
        40
        50
        ```
    """
    # output is a list of dictionaries with the keys: 'x', 'y', 'xq', and 'yq'
    x_raw = X
    y_raw = Y
    n = x_raw.shape[0]
    total_max_data_withhold = int(n*self.max_data_withhold*self.max_evaluations)

    # case 1: max_evaluations >= n (leave-one-out cross validation possible)
    if self.max_evaluations >= n:
        dataset = []
        for i in range(n):
            xq = x_raw[[i]]
            yq = y_raw[[i]]

            other_indices = torch.ones((n,),dtype=torch.bool)
            other_indices[i] = False

            x = x_raw[[other_indices]]
            y = y_raw[[other_indices]]

            dataset.append({'x':x, 'y':y, 'xq':xq, 'yq':yq})

        return dataset


    # case 2: total_max_data_withhold >= n (full cross validation is possible)
    # In this case, we will try to max out the number of evaluations to keep the number of samples per trial as small as possible
    if total_max_data_withhold >= n:
        rand_idx = torch.randperm(n)
        split = torch.tensor_split(rand_idx,self.max_evaluations)
        dataset = []
        for s in split:
            xq = x_raw[[s]]
            yq = y_raw[[s]]

            other_indices = torch.ones((n,),dtype=torch.bool)
            other_indices[[s]] = False

            x = x_raw[[other_indices]]
            y = y_raw[[other_indices]]

            dataset.append({'x':x, 'y':y, 'xq':xq, 'yq':yq})

        return dataset


    # case 3: total_max_data_withhold < n (withheld samples are randomly selected)
    # In this case, we will always select max_data_withhold samples
    if total_max_data_withhold < n:
        rand_idx = torch.randperm(n)[:total_max_data_withhold]
        split = torch.tensor_split(rand_idx,self.max_evaluations)
        dataset = []
        for s in split:
            xq = x_raw[[s]]
            yq = y_raw[[s]]

            other_indices = torch.ones((n,),dtype=torch.bool)
            other_indices[[s]] = False

            x = x_raw[[other_indices]]
            y = y_raw[[other_indices]]

            dataset.append({'x':x, 'y':y, 'xq':xq, 'yq':yq})

        return dataset
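Case 3's guarantee that no datapoint is tested more than once follows from drawing the held-out indices without replacement before splitting them across trials. A stdlib-only sketch of that step (mirroring the code above, not library API):

```python
import random

# Stdlib-only sketch mirroring case 3: sample total_max_data_withhold
# indices without replacement, then split them across max_evaluations
# trials the way torch.tensor_split would (first chunks one larger).

def case3_test_indices(n, max_evaluations, max_data_withhold):
    total = int(n * max_data_withhold * max_evaluations)
    chosen = random.sample(range(n), total)   # no index drawn twice
    k, r = divmod(total, max_evaluations)
    splits, start = [], 0
    for i in range(max_evaluations):
        size = k + (1 if i < r else 0)
        splits.append(chosen[start:start + size])
        start += size
    return splits

splits = case3_test_indices(n=100, max_evaluations=5, max_data_withhold=0.05)
all_test = [i for s in splits for i in s]
assert len(all_test) == len(set(all_test))   # each point tested at most once
print(len(splits), len(all_test))            # 5 25
```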