| Title: | Resampling Algorithms for 'mlr3' Framework |
|---|---|
| Description: | A supervised learning algorithm inputs a train set, and outputs a prediction function, which can be used on a test set. If each data point belongs to a subset (such as geographic region, year, etc), then how do we know if subsets are similar enough so that we can get accurate predictions on one subset, after training on Other subsets? And how do we know if training on All subsets would improve prediction accuracy, relative to training on the Same subset? SOAK, Same/Other/All K-fold cross-validation, <doi:10.1002/sam.70055> can be used to answer these questions, by fixing a test subset, training models on Same/Other/All subsets, and then comparing test error rates (Same versus Other and Same versus All). Also provides code for estimating how many train samples are required to get accurate predictions on a test set. |
| Authors: | Toby Hocking [aut, cre] (ORCID: <https://orcid.org/0000-0002-3146-0865>), Daniel Agyapong [ctb] (ORCID: <https://orcid.org/0009-0004-0857-3150>), Michel Lang [ctb] (ORCID: <https://orcid.org/0000-0001-9754-0393>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Bernd Bischl [ctb] (ORCID: <https://orcid.org/0000-0001-6002-6980>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Jakob Richter [ctb] (ORCID: <https://orcid.org/0000-0003-4481-5554>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Patrick Schratz [ctb] (ORCID: <https://orcid.org/0000-0003-0748-6624>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Giuseppe Casalicchio [ctb] (ORCID: <https://orcid.org/0000-0001-5324-5966>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Stefan Coors [ctb] (ORCID: <https://orcid.org/0000-0002-7465-2146>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Quay Au [ctb] (ORCID: <https://orcid.org/0000-0002-5252-8902>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Martin Binder [ctb], Florian Pfisterer [ctb] (ORCID: <https://orcid.org/0000-0001-8867-762X>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Raphael Sonabend [ctb] (ORCID: <https://orcid.org/0000-0001-9225-4654>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Lennart Schneider [ctb] (ORCID: <https://orcid.org/0000-0003-4152-5308>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Marc Becker [ctb] (ORCID: <https://orcid.org/0000-0002-8115-0400>, Author of mlr3 when Resampling/ResamplingCV was copied/modified), Sebastian Fischer [ctb] (ORCID: <https://orcid.org/0000-0002-9609-3197>, Author of mlr3 when Resampling/ResamplingCV was copied/modified) |
| Maintainer: | Toby Hocking <[email protected]> |
| License: | LGPL-3 |
| Version: | 2026.5.19 |
| Built: | 2026-05-29 19:38:59 UTC |
| Source: | https://github.com/tdhock/mlr3resampling |
Classification data set with polygons (groups which should not be split in CV) and subsets (region3 or region4).
data("AZtrees")
A data frame with 5956 observations on the following 25 variables.
region3: a character vector
region4: a character vector
polygon: a numeric vector
y: a character vector
ycoord: latitude
xcoord: longitude
SAMPLE_1: a numeric vector
SAMPLE_2: a numeric vector
SAMPLE_3: a numeric vector
SAMPLE_4: a numeric vector
SAMPLE_5: a numeric vector
SAMPLE_6: a numeric vector
SAMPLE_7: a numeric vector
SAMPLE_8: a numeric vector
SAMPLE_9: a numeric vector
SAMPLE_10: a numeric vector
SAMPLE_11: a numeric vector
SAMPLE_12: a numeric vector
SAMPLE_13: a numeric vector
SAMPLE_14: a numeric vector
SAMPLE_15: a numeric vector
SAMPLE_16: a numeric vector
SAMPLE_17: a numeric vector
SAMPLE_18: a numeric vector
SAMPLE_19: a numeric vector
SAMPLE_20: a numeric vector
SAMPLE_21: a numeric vector
Paul Nelson Arellano, [email protected]

    data.table::setDTthreads(1L)
    data(AZtrees)
    task.obj <- mlr3::TaskClassif$new("AZtrees3", AZtrees, target="y")
    task.obj$col_roles$feature <- grep("SAMPLE", names(AZtrees), value=TRUE)
    task.obj$col_roles$group <- "polygon"
    task.obj$col_roles$subset <- "region3"
    str(task.obj)
    same_other_sizes_cv <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    same_other_sizes_cv$instantiate(task.obj)
    same_other_sizes_cv$instance$iteration.dt

AutoTunerTorch_epochs inherits from
mlr3tuning::AutoTuner, with an initialize method that
takes arguments to construct a torch module learner. It runs gradient
descent up to max_epochs and then re-runs using the best number
of epochs. Its edit_learner method sets number of epochs to 2
(for quick learning during proj_test),
and its save_learner method returns a history data table (one
row per epoch).
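
The "re-run using the best number of epochs" idea can be illustrated with a minimal base R sketch. It does not call the package; the validation loss values and the names `valid_loss`, `history`, and `best_epochs` are hypothetical, and selecting by minimum validation loss is an assumption about the criterion.

```r
# Hypothetical validation loss after each epoch (made-up numbers).
max_epochs <- 5L
valid_loss <- c(0.90, 0.55, 0.40, 0.45, 0.60)
# One row per epoch, similar in spirit to a history table.
history <- data.frame(epoch = seq_len(max_epochs), valid_loss = valid_loss)
# Assumption: the best number of epochs minimizes validation loss.
best_epochs <- history$epoch[which.min(history$valid_loss)]
best_epochs
```

A final model would then be trained again for `best_epochs` epochs.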
LearnerRegrCVGlmnetSave inherits from
LearnerRegrCVGlmnet; its save_learner method returns a
data table of regularized linear model weights (no edit_learner
method).
LearnerClassifCVGlmnetSave is similar.
Toby Dylan Hocking

    ## Simulate regression data.
    N <- 80
    library(data.table)
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      noise=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), each=0.5*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^3,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- c("x","noise")
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.task.list # two regression tasks.
    ## Create a list of learners.
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("mlr3torch") && torch::torch_is_installed()){
      gen_linear <- torch::nn_module(
        "my_linear",
        initialize = function(task) {
          self$weight = torch::nn_linear(task$n_features, 1)
        },
        forward = function(x) {
          self$weight(x)
        }
      )
      reg.learner.list$torch_linear <- mlr3resampling::AutoTunerTorch_epochs$new(
        "torch_linear",
        module_generator=gen_linear,
        max_epochs=3,
        batch_size=10,
        loss=mlr3torch::t_loss("mse"),
        measure_list=mlr3::msrs(c("regr.mse","regr.mae")))
    }
    if(requireNamespace("mlr3learners")){
      reg.learner.list$cv_glmnet <- mlr3resampling::LearnerRegrCVGlmnetSave$new()
      reg.learner.list$cv_glmnet$param_set$values$nfolds <- 3
    }
    reg.learner.list # a list of learners.
    # 2-fold CV.
    kfold <- mlr3::ResamplingCV$new()
    kfold$param_set$values$folds <- 2
    # Create project grid.
    pkg.proj.dir <- tempfile()
    pgrid <- mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      score_args=mlr3::msrs("regr.rmse"),
      kfold)
    ## Not run:
    test_out <- mlr3resampling::proj_test(pkg.proj.dir)
    test_out$learners_history.csv # from AutoTunerTorch_epochs, 2 epochs for testing.
    test_out$learners_weights.csv # from LearnerRegrCVGlmnetSave
    torch.job.i <- which(pgrid$learner_id=="torch_linear")[1]
    mlr3resampling::proj_compute(torch.job.i, pkg.proj.dir)
    mlr3resampling::proj_results_save(pkg.proj.dir)
    full_out <- mlr3resampling::proj_fread(pkg.proj.dir)
    full_out$learners_history.csv # from AutoTunerTorch_epochs, 3 epochs.
    ## End(Not run)

Runs train() and predict() for one job in the grid, then saves a data table
with one row of results to an RDS file in the grid_jobs
directory.
proj_compute(grid_job_i, proj_dir, verbose=FALSE, process_fun=Sys.getpid)
grid_job_i |
integer from 1 to number of jobs (rows in |
proj_dir |
Project directory created by |
verbose |
Logical: print messages? |
process_fun |
function called with no arguments
(default |
If everything goes well, the user should not need to run this
function.
Instead, the user runs proj_submit as Step 2 out of the
typical 3 step pipeline (init grid, submit, read results).
proj_compute can sometimes be useful for testing or debugging the submit step,
since it runs one split at a time.
proj_compute returns a data table with one row of results.
Toby Dylan Hocking

    N <- 80
    library(data.table)
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), each=0.5*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^2,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- "x"
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("rpart")){
      reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
    }
    pkg.proj.dir <- tempfile()
    mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      SOAK,
      order_jobs = function(DT)1:2, # for CRAN.
      score_args=mlr3::msrs(c("regr.rmse", "regr.mae")))
    mlr3resampling::proj_compute(1, pkg.proj.dir)

A project grid consists of all combinations of tasks, learners, resampling types, and resampling iterations, to be computed in parallel. This function creates a project directory with files to describe the grid.
proj_grid(proj_dir, tasks, learners, resamplings, order_jobs = NULL, score_args = NULL, save_learner = save_learner_default, save_pred = FALSE, train_seed = 1L, resampling_seed = 1L)
proj_dir |
Path to directory to create. |
tasks |
List of Tasks, or a single Task. |
learners |
List of Learners, or a single Learner. |
resamplings |
List of Resamplings, or a single Resampling. |
order_jobs |
Function which takes split table as input, and
returns integer vector of row numbers of the split table to write to
|
score_args |
Passed to |
save_learner |
Function to process Learner, after
training/prediction, but before saving result to disk.
For interpreting complex models, you should write a function that
returns only the parts of the model that you need (and discards the
other parts which would take up disk space for no reason).
Default is to call |
save_pred |
Function to process Prediction before saving to disk.
Default |
train_seed |
integer: random seed to set before training (default
|
resampling_seed |
integer: random seed to set before
instantiating each resampling (default |
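
To illustrate the order_jobs argument, here is a minimal base R sketch of a function that takes a split table and returns row numbers. The table `split_table` is a hypothetical stand-in (the real split table has more columns); the column names `task_id`, `learner_id`, and `iteration` are assumed for illustration.

```r
# Hypothetical split table; the real one is created by proj_grid.
split_table <- data.frame(
  task_id = rep(c("easy", "impossible"), each = 4),
  learner_id = rep(c("featureless", "rpart"), times = 4),
  iteration = rep(c(1L, 1L, 2L, 2L), times = 2))
# An order_jobs-style function: keep only the first resampling iteration.
order_first_iteration <- function(DT) which(DT$iteration == 1)
keep_rows <- order_first_iteration(split_table)
split_table[keep_rows, ]
```

A function of this shape could be passed as `order_jobs` to restrict or reorder the jobs, for example to keep examples fast.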
This is Step 1 out of the
typical 3 step pipeline (init grid, submit, read results).
It creates a grid_jobs.csv table which has a column status;
each row is initialized to "not started" or "done",
depending on whether the corresponding result RDS file exists already.
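
The status initialization described above can be sketched in base R as follows. This is not the package's code: the file naming scheme (`<job>.RDS` inside a `grid_jobs` directory) is an assumption made only for this illustration.

```r
# Assumption: one result file per job, named <job>.RDS, in grid_jobs/.
proj_dir <- tempfile()
jobs_dir <- file.path(proj_dir, "grid_jobs")
dir.create(jobs_dir, recursive = TRUE)
job_ids <- 1:3
# Pretend that job 2 was already computed in an earlier run.
saveRDS(data.frame(job = 2L), file.path(jobs_dir, "2.RDS"))
result_file <- file.path(jobs_dir, paste0(job_ids, ".RDS"))
# Each row is "done" if its result file exists, otherwise "not started".
grid_jobs <- data.frame(
  job = job_ids,
  status = ifelse(file.exists(result_file), "done", "not started"))
grid_jobs
```

Initializing from existing files means that re-creating a grid in the same directory does not discard work that was already finished.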
Data table of splits to be processed (same as table saved to grid_jobs.rds).
Toby Dylan Hocking

    N <- 80
    library(data.table)
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), each=0.5*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^2,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- "x"
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("rpart")){
      reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
    }
    pkg.proj.dir <- tempfile()
    mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      SOAK,
      score_args=mlr3::msrs(c("regr.rmse", "regr.mae")))
    mlr3resampling::proj_compute(2, pkg.proj.dir)

proj_results globs the RDS result files in the project
directory, and combines them into a result table via rbindlist().
proj_results_save saves that result table to results.rds
and one or more CSV files (redundant with RDS data, but more
convenient).
proj_fread reads the CSV files, adding columns from
proj_grid.csv to learners*.csv.

    proj_results(proj_dir, verbose=FALSE)
    proj_results_save(proj_dir, verbose=FALSE)
    proj_fread(proj_dir)

proj_dir |
Project directory created via
|
verbose |
logical: cat progress? (default FALSE) |
This is Step 3 out of the typical 3 step pipeline (init grid, submit, read results).
If step 2 worked as intended, then
proj_results_save was already called at the end of step 2,
so the following result files are on disk and can be read directly:
results.csv: contains test measures for each split.
results.rds: contains additional list columns for learner
and pred (useful for model interpretation), and can be read via
readRDS().
learners.csv: only exists if learner column is a
data frame, in which case it contains the atomic columns, along
with meta-data describing each split.
learners_*.csv: only exists if learner column is
a named list of data frames: star in file name is expanded using
list names, with CSV data taken from atomic columns.
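
The expansion of the star in learners_*.csv can be illustrated with a small base R sketch. The list `saved` is a hypothetical stand-in for what a save_learner function might return; only the file-name construction is shown, and the actual writing logic in the package may differ.

```r
# Hypothetical named list of data frames, one element per kind of output.
saved <- list(
  rpart = data.frame(node = 1:3, n = c(40, 25, 15)),
  weights = data.frame(feature = c("x", "noise"), weight = c(1.2, 0)))
# The star in learners_*.csv is replaced by each list name.
csv_names <- paste0("learners_", names(saved), ".csv")
csv_names
```

With these list names, the result files would be learners_rpart.csv and learners_weights.csv.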
proj_results returns a data table with all columns, whereas
proj_results_save returns the same table with only atomic columns.
proj_fread returns a list
with names corresponding to CSV files in the project directory, and
values are the data tables that result from fread.
Toby Dylan Hocking

    N <- 80
    library(data.table)
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      noise=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), each=0.5*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^2,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- c("x","noise")
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("rpart")){
      reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
    }
    pkg.proj.dir <- tempfile()
    mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      SOAK,
      save_learner=function(L){
        if(inherits(L, "LearnerRegrRpart")){
          list(rpart=L$model$frame)
        }
      },
      order_jobs = function(DT)which(DT$iteration==1)[1:2], # for CRAN.
      score_args=mlr3::msrs(c("regr.rmse", "regr.mae")))
    computed <- mlr3resampling::proj_compute_all(pkg.proj.dir)
    result_list <- mlr3resampling::proj_fread(pkg.proj.dir)
    result_list$learners_rpart.csv # one row per node in decision tree.
    result_list$results.csv # test error in regr.* columns.

proj_todo determines which jobs remain to be computed.
proj_compute_all computes all remaining jobs using
future_lapply if available, otherwise
lapply.
proj_compute_mpi computes all remaining jobs in parallel using
MPI (should be run in an R session activated by mpirun or
srun).
proj_submit is a non-blocking call to SLURM sbatch,
asking for a single job with several tasks that run proj_compute_mpi.
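
The "future_lapply if available, otherwise lapply" fallback can be sketched as below. This is only an illustration of the pattern, not the package's internal code; `todo` and the squared-number computation are placeholders for real job indices and real work.

```r
# Pick a parallel-capable lapply when future.apply is installed,
# otherwise fall back to base lapply.
LAPPLY <- if (requireNamespace("future.apply", quietly = TRUE)) {
  future.apply::future_lapply
} else {
  lapply
}
todo <- 1:3  # hypothetical indices of jobs that remain to be computed
# Placeholder computation standing in for one train/predict job.
computed <- LAPPLY(todo, function(job_i) job_i^2)
unlist(computed)
```

Because both functions share the same calling convention, the rest of the code does not need to know which one was selected.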

    proj_todo(proj_dir)
    proj_compute_mpi(proj_dir, verbose=FALSE)
    proj_compute_all(proj_dir, verbose=FALSE, LAPPLY)
    proj_submit(proj_dir, tasks = 2, hours = 1, gigabytes = 1, verbose = FALSE)

proj_dir |
Project directory created via |
tasks |
Positive integer: |
hours |
Hours of walltime to ask the SLURM scheduler. |
gigabytes |
Gigabytes of memory to ask the SLURM scheduler. |
verbose |
Logical: print messages? |
LAPPLY |
Function like |
This is Step 2 out of the typical 3 step pipeline (init grid, submit, read results).
proj_submit returns the ID of the submitted SLURM job.
proj_compute_all and proj_compute_mpi return a data
table of results computed.
proj_todo returns a vector of job IDs not yet computed.
Toby Dylan Hocking

    N <- 80
    library(data.table)
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), each=0.5*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^2,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- "x"
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("rpart")){
      reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
    }
    pkg.proj.dir <- tempfile()
    mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      SOAK,
      score_args=mlr3::msrs(c("regr.rmse", "regr.mae")))
    ## Not run:
    if(requireNamespace("future.apply"))future::plan("multisession")
    mlr3resampling::proj_compute_all(pkg.proj.dir)
    if(requireNamespace("future.apply"))future::plan("sequential")
    ## End(Not run)

Like testJob,
"test" means trying an example with a few small jobs
(default one train/test split per algorithm and data set)
before running the big calculation with all jobs.
Runs proj_grid to create a new project in the
test sub-directory, with a smaller
number of samples in each task, and with only one iteration
per Resampling. Runs proj_compute_all on this new
test project, and then reads any CSV result files.
proj_test(proj_dir, min_samples_per_stratum = 10, edit_learner=edit_learner_default, max_jobs=Inf, verbose=FALSE, LAPPLY=NULL)
proj_dir |
Project directory created by |
min_samples_per_stratum |
Minimum number of samples to include in the smallest stratum. Other strata will be down-sampled proportionally. |
edit_learner |
Function which inputs a learner object, and
changes it to take less time for testing. Default calls
|
max_jobs |
Numeric, max number of jobs to test (default Inf). |
verbose |
Logical: print messages? |
LAPPLY |
Function like |
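
The effect of min_samples_per_stratum can be illustrated with a rough base R sketch. The stratum counts are hypothetical, and the rounding rule is an assumption; the package may compute the down-sampled sizes differently.

```r
# Hypothetical stratum sizes in the full task (10% Alice, 90% Bob).
stratum_counts <- c(Alice = 800, Bob = 7200)
min_samples_per_stratum <- 10
# Scale every stratum so the smallest one has min_samples_per_stratum
# samples (assumption: simple proportional scaling with rounding).
test_counts <- round(
  stratum_counts * min_samples_per_stratum / min(stratum_counts))
test_counts
```

Under these assumptions the test project would keep 10 Alice samples and 90 Bob samples, preserving the 1:9 ratio.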
Same value as proj_fread on test project (list of data tables).
Toby Dylan Hocking

    data.table::setDTthreads(1L)
    library(data.table)
    N <- 8000
    set.seed(1)
    reg.dt <- data.table(
      x=runif(N, -2, 2),
      person=factor(rep(c("Alice","Bob"), c(0.1,0.9)*N)))
    reg.pattern.list <- list(
      easy=function(x, person)x^2,
      impossible=function(x, person)(x^2)*(-1)^as.integer(person))
    kfold <- mlr3::ResamplingCV$new()
    kfold$param_set$values$folds <- 2
    reg.task.list <- list()
    for(pattern in names(reg.pattern.list)){
      f <- reg.pattern.list[[pattern]]
      task.dt <- data.table(reg.dt)[, y := f(x,person)+rnorm(N, sd=0.5)][]
      task.obj <- mlr3::TaskRegr$new(pattern, task.dt, target="y")
      task.obj$col_roles$feature <- "x"
      task.obj$col_roles$stratum <- "person"
      task.obj$col_roles$subset <- "person"
      reg.task.list[[pattern]] <- task.obj
    }
    reg.learner.list <- list(
      featureless=mlr3::LearnerRegrFeatureless$new())
    if(requireNamespace("rpart")){
      reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
    }
    pkg.proj.dir <- tempfile()
    mlr3resampling::proj_grid(
      pkg.proj.dir,
      reg.task.list,
      reg.learner.list,
      kfold,
      save_learner=function(L){
        if(inherits(L, "LearnerRegrRpart")){
          list(rpart=L$model$frame)
        }
      },
      score_args=mlr3::msrs(c("regr.rmse", "regr.mae")))
    mlr3resampling::proj_test(pkg.proj.dir)

Same/Other/All K-fold cross-validation (SOAK) results in K measures of test error/accuracy. This function computes P-values (two-sided T-test) between Same and All/Other.
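
As a rough illustration of the comparison, the following base R sketch runs a two-sided t-test on K = 5 test error values for Same versus Other. The numbers are made up for illustration (not real results), and pairing by fold is an assumption of this sketch.

```r
# Made-up test RMSE for K = 5 folds; not real results.
rmse_same  <- c(0.52, 0.48, 0.55, 0.50, 0.47)
rmse_other <- c(0.61, 0.58, 0.66, 0.57, 0.59)
# Assumption: values are paired by fold, since each fold gives one Same
# and one Other measurement on the same test set.
test_out <- t.test(
  rmse_same, rmse_other, paired = TRUE, alternative = "two.sided")
test_out$p.value
```

A small P-value here would suggest that training on Other subsets gives a different test error than training on the Same subset.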
pvalue(score_in, value.var = NULL, digits=3)
score_in |
Data table output from |
value.var |
Name of column to use as the evaluation metric in T-test. Default
NULL means to use the first column matching |
digits |
Number of decimal places to show for mean and standard deviation. |
List of class "pvalue" with named elements value.var,
stats, pvalues.
Toby Dylan Hocking

    data.table::setDTthreads(1L)
    library(data.table)
    set.seed(2)
    N <- 150L
    x <- runif(N, -3, 3)
    sim.dt <- data.table(
      x = x,
      y = sin(x) + rnorm(N, sd = 0.5),
      Subset = rep(c("large", "large", "small"), length.out = N))
    sim.task <- mlr3::TaskRegr$new("iid_easy", sim.dt, target = "y")
    sim.task$col_roles$feature <- "x"
    sim.task$col_roles$subset <- "Subset"
    soak <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    soak$param_set$values$folds <- 5L
    learner_list <- list(
      mlr3::lrn("regr.featureless"),
      if(requireNamespace("rpart"))mlr3::lrn("regr.rpart"))
    bgrid <- mlr3::benchmark_grid(sim.task, learner_list, soak)
    result <- mlr3::benchmark(bgrid)
    score.dt <- mlr3resampling::score(result, mlr3::msr("regr.rmse"))
    if(requireNamespace("ggplot2")) plot(score.dt)
    p.list <- mlr3resampling::pvalue(score.dt)
    if(requireNamespace("ggplot2")) plot(p.list)

For one test.subset, compute summary statistics and paired
two-sided t-test P-values that compare same versus
all/other, for two train-size panels:
full (n.train.groups == groups) and down-sampled
(n.train.groups == min(groups)).
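
The two train-size panels can be illustrated by filtering a small table in base R. The data frame `score` is a hypothetical stand-in for the score table of one test subset; the column `train.subsets` and the numbers are assumed for illustration.

```r
# Hypothetical rows for one test subset: groups is the number of train
# groups available, n.train.groups is the number actually used.
score <- data.frame(
  train.subsets = c("same", "other", "other", "all", "all"),
  groups = c(40, 80, 80, 120, 120),
  n.train.groups = c(40, 80, 40, 120, 40))
# Full panel: every available train group was used.
full_panel <- score[score$n.train.groups == score$groups, ]
# Down-sampled panel: train size equals the smallest available size.
down_panel <- score[score$n.train.groups == min(score$groups), ]
full_panel
down_panel
```

In this toy table the "same" row appears in both panels, because its full train set is already the smallest one.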
pvalue_downsample(score_in, value.var = NULL, digits = 3)
score_in |
Data table output from |
value.var |
Name of column to use as the evaluation metric in T-test. Default
NULL means to use the first column matching |
digits |
Non-negative integer number of digits to print in mean/SD text labels. |
List of class "pvalue_downsample" with named elements
subset_name, model_name, value.var,
stats, and pvalues. subset_name and
model_name are inferred from score_in.
Daniel Agyapong

    data.table::setDTthreads(1L)
    library(data.table)
    set.seed(2)
    N <- 150L
    x <- runif(N, -3, 3)
    sim.dt <- data.table(
      x = x,
      y = sin(x) + rnorm(N, sd = 0.5),
      Subset = rep(c("large", "large", "small"), length.out = N))
    sim.task <- mlr3::TaskRegr$new("iid_easy", sim.dt, target = "y")
    sim.task$col_roles$feature <- "x"
    sim.task$col_roles$subset <- "Subset"
    soak <- mlr3resampling::ResamplingSameOtherSizesCV$new()
    soak$param_set$values$folds <- 5L
    soak$param_set$values$sizes <- 0L
    learner_list <- list(
      mlr3::lrn("regr.featureless"),
      if(requireNamespace("rpart"))mlr3::lrn("regr.rpart"))
    bgrid <- mlr3::benchmark_grid(sim.task, learner_list, soak)
    result <- mlr3::benchmark(bgrid)
    score.dt <- mlr3resampling::score(result, mlr3::msr("regr.rmse"))
    if(requireNamespace("ggplot2")) plot(score.dt)
    down.list <- mlr3resampling::pvalue_downsample(score.dt[
      task_id == "iid_easy" & test.subset == "large" & algorithm == "rpart"])
    if(requireNamespace("ggplot2")) plot(down.list)

ResamplingSameOtherSizesCV
defines how a task is partitioned for
resampling, for example in
resample() or
benchmark().
Resampling objects can be instantiated on a
Task,
which can use two new roles: subset and fold.
After instantiation, sets can be accessed via
$train_set(i) and
$test_set(i), respectively.
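As a minimal sketch of instantiation and set access, the code below builds a toy regression task, assigns the subset role, and reads back the first train and test sets. The data and the column names x, y and Subset are made up for illustration. Attaching mlr3resampling before creating the task is an assumption made here so that the subset role is known to mlr3 when the task is built.

```r
library(mlr3resampling)  # assumption: attach first so the subset role is registered

# Toy data; column names are illustrative only.
set.seed(1)
toy <- data.frame(
  x = runif(60),
  y = rnorm(60),
  Subset = rep(c("A", "B"), each = 30))
task <- mlr3::TaskRegr$new("toy", toy, target = "y")
task$col_roles$feature <- "x"
task$col_roles$subset <- "Subset"

soak <- ResamplingSameOtherSizesCV$new()
soak$param_set$values$folds <- 3L
soak$instantiate(task)

# Row ids of the first split.
train.ids <- soak$train_set(1)
test.ids <- soak$test_set(1)
length(intersect(train.ids, test.ids))  # train and test rows should not overlap
```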
This is an implementation of SOAK, Same/Other/All K-fold cross-validation. A supervised learning algorithm inputs a train set and outputs a prediction function, which can be used on a test set. If each data point belongs to a subset (such as geographic region, year, etc.), then how do we know if it is possible to train on one subset and predict accurately on another subset? Cross-validation can be used to determine the extent to which this is possible, by first assigning fold IDs from 1 to K to all data (possibly using stratification, usually by subset and label). Then we loop over test sets (subset/fold combinations) and train sets (same subset, other subsets, all subsets), and compute test/prediction accuracy for each combination. Comparing test/prediction accuracy between same and other shows the extent to which this is possible: it is perfect if same and other have similar test accuracy for each subset; other is usually somewhat less accurate than same; and other can be just as bad as the featureless baseline when the subsets have different patterns.
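The loop described above can be sketched in base R, without the package. This is only an illustration under stated assumptions: the subset labels north and south are hypothetical, folds are assigned deterministically within each subset, and every train set is taken from folds other than the test fold.

```r
n <- 12
K <- 3
subset.vec <- rep(c("north", "south"), each = n / 2)  # hypothetical subsets
fold.vec <- rep(seq_len(K), length.out = n)           # folds 1..K within each subset

splits <- list()
for (test.subset in unique(subset.vec)) {
  for (test.fold in seq_len(K)) {
    is.test <- subset.vec == test.subset & fold.vec == test.fold
    is.train.fold <- fold.vec != test.fold
    # Three candidate train sets for this fixed test set.
    train.list <- list(
      same  = which(is.train.fold & subset.vec == test.subset),
      other = which(is.train.fold & subset.vec != test.subset),
      all   = which(is.train.fold))
    for (train.name in names(train.list)) {
      splits[[length(splits) + 1L]] <- list(
        test.subset = test.subset,
        test.fold = test.fold,
        train.subsets = train.name,
        train = train.list[[train.name]],
        test = which(is.test))
    }
  }
}
length(splits)  # 2 subsets * 3 folds * 3 train sets = 18
```

A learner would then be trained on each `train` index vector and evaluated on the matching `test` vector, so that same, other and all are compared on identical test data.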
ResamplingSameOtherSizesCV supports stratified sampling.
The stratification variables are assumed to be discrete,
and must be stored in the Task with column role
stratum.
If that role is not set, then all data are assigned to one stratum.
In case of multiple stratification variables,
each combination of the values of the stratification variables forms a
stratum.
When fold role is set, stratum role is not used for fold
assignment (folds are taken from fold role in that case).
When sizes param is at least 0, then downsampled train sets are
created, and if stratum role is set, we attempt to preserve
strata proportions in downsampled train sets.
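To illustrate how strata are formed from several variables, here is a base R sketch (not the package code) that treats each combination of two hypothetical columns, label and region, as one stratum, then assigns folds separately within each stratum so that every fold gets a near-equal share of each stratum.

```r
set.seed(1)
K <- 3
d <- data.frame(
  label = rep(c("pos", "neg"), times = c(6, 12)),   # hypothetical stratification variable
  region = rep(c("east", "west"), length.out = 18)) # second hypothetical variable

# Each combination of values is one stratum.
stratum <- interaction(d$label, d$region, drop = TRUE)

d$fold <- NA_integer_
for (s in levels(stratum)) {
  idx <- which(stratum == s)
  # Shuffle a balanced vector of fold ids within this stratum.
  d$fold[idx] <- sample(rep(seq_len(K), length.out = length(idx)))
}
tab <- table(stratum, d$fold)
tab
```

Each row of `tab` is a stratum and each column is a fold; the counts within a row differ by at most one.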
ResamplingSameOtherSizesCV supports grouping of
observations that will not be split in cross-validation.
The grouping variable is assumed to be discrete,
and must be stored in the Task with column role
group.
If that role is not set, then each row is assigned to a different group.
When fold role is set, group role is not used for fold
assignment (folds are taken from fold role in that case).
When sizes param is at least 0, then downsampled train sets are
created, and if group role is set, we keep groups together
in downsampled train sets.
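The idea of keeping groups together can be sketched in base R as follows. This is an illustration rather than the package implementation: folds are assigned to groups instead of rows, and downsampling draws whole groups. The group names g1 to g6 are hypothetical.

```r
set.seed(1)
K <- 3
group <- rep(paste0("g", 1:6), times = c(1, 2, 3, 1, 2, 3))  # hypothetical groups
group.ids <- unique(group)

# Assign one fold per group, then copy it to every row of that group.
group.fold <- setNames(
  sample(rep(seq_len(K), length.out = length(group.ids))),
  group.ids)
row.fold <- unname(group.fold[group])
folds.per.group <- tapply(row.fold, group, function(f) length(unique(f)))

# Downsample the train set for test fold 1 to two whole groups.
train.groups <- group.ids[group.fold != 1]
kept.groups <- sample(train.groups, 2)
train.rows <- which(group %in% kept.groups)
```

Because sizes are counted in groups, the number of rows in `train.rows` depends on which groups are drawn.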
ResamplingSameOtherSizesCV supports fixing a test subset,
then training on same/other/all subsets.
The subset variable is assumed to be discrete,
and must be stored in the Task with column role subset.
The number of cross-validation folds K should be defined as the
folds parameter, default 3.
The number of random seeds for downsampling should be defined as the
seeds parameter, default 1.
The number of downsampling train set sizes should be defined as
the sizes parameter, which can also take two special values:
default -1 means no downsampling at all, and 0 means only downsampling
to the min of same/other sets (in units of groups).
The ratio for downsampling should be defined as the ratio
parameter, default 0.5. The min size of same and other sets is
repeatedly multiplied by this ratio, to obtain smaller sizes (in units
of groups).
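The arithmetic behind the sizes and ratio parameters can be sketched as below. The helper `downsample_sizes` is hypothetical and not part of the package. It assumes that a value of sizes = s yields the min size followed by s smaller sizes, and that fractional sizes are rounded down; the package may differ in both details.

```r
# Hypothetical helper: train set sizes (in units of groups) after downsampling.
downsample_sizes <- function(n.same, n.other, sizes = 2L, ratio = 0.5) {
  if (sizes < 0) return(integer(0))  # -1 means no downsampling at all
  n.min <- min(n.same, n.other)
  # Multiply the min size by ratio repeatedly: n.min * ratio^0, ratio^1, ...
  unique(as.integer(floor(n.min * ratio^seq(0, sizes))))
}

downsample_sizes(100, 50, sizes = 0)   # 50
downsample_sizes(100, 50, sizes = 3)   # 50 25 12 6
downsample_sizes(100, 50, sizes = -1)  # integer(0)
```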
The ignore_subset parameter should be either TRUE or
FALSE (default), whether to ignore the subset
role. TRUE only creates splits for same subset (even if task
defines subset role), and is useful for subtrain/validation
splits (hyper-parameter learning).
The subsets parameter should specify the train subsets of
interest: "S" (same),
"O" (other), "A" (all), "SO", "SA",
"SOA" (default).
If fold column role is set, then that column will be used for
non-random fold assignment (see Reproducibility vignette).
Otherwise, the group_stratum_algo parameter controls the algorithm used for
random fold assignment.
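For reference, the parameters described above can be set through the param_set of the resampling object, in the same way as folds in the examples. The values chosen below are only an example configuration.

```r
library(mlr3resampling)

soak <- ResamplingSameOtherSizesCV$new()
soak$param_set$values$folds <- 5L            # K in K-fold cross-validation
soak$param_set$values$seeds <- 1L            # random seeds for downsampling
soak$param_set$values$sizes <- 0L            # downsample only to the min of same/other
soak$param_set$values$ratio <- 0.5           # shrink factor for smaller sizes
soak$param_set$values$ignore_subset <- FALSE # use the subset role if the task defines it
soak$param_set$values$subsets <- "SA"        # train on same and all, skip other
soak$param_set$values$group_stratum_algo <- "RSS"  # heuristic for random fold assignment
soak$param_set$values
```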
We want to assign folds such that groups are not split (hard
constraint), and strata proportions are preserved (objective to
optimize).
To see examples of how this works,
read the vignette, Using subset with group and stratum.
Note that this feature works on tasks with both stratum and
group roles (unlike ResamplingCV).
Computing an optimal fold assignment is NP-hard (discrete, need
to try all possible assignments of groups to folds to find best
solution).
Instead of computing the optimal solution (slow), we implement heuristic
algorithms (fast), which yield approximate solutions.
The choice of which heuristic is controlled by the
group_stratum_algo parameter, which can take these values:
"RSS": Default, novel heuristic algorithm which attempts to
minimize the residual sum of squares between actual and ideal
values of the strata/fold count matrix.
First groups are sorted by RSS (group counts - ideal counts per
stratum); ties are broken using number of samples in group,
mean(ideal*actual) counts (mean over strata), random group order.
Then each group is greedily assigned a fold in order; best fold
results in min RSS, ties broken using total counts above ideal.
For N data, K folds, G groups, and S strata, the algorithm is
O(N + K G S) time and O(N + S K) memory.
"Wasikowski": Same heuristic algorithm as in scikit-learn
StratifiedGroupKFold, adapted from
https://www.kaggle.com/code/jakubwasikowski/stratified-group-k-fold-cross-validation.
First groups are sorted by standard deviation over strata counts;
ties are broken by random group order.
Then each group is greedily assigned a fold in order; best fold results in the
smallest mean SD of the strata/fold count matrix (SD over folds is
computed for each stratum, then mean of SDs, minimized over folds),
ties broken using number of samples in fold.
For N data, K folds, G groups, and S strata, the algorithm is
O(N + K^2 G S) time and O(N + S K + S G) memory (could be problematic when G and
S are both large).
"WasikowskiLimitedMemory": Same logic and time complexity as "Wasikowski",
but only O(N + S K) memory.
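The greedy idea behind the RSS heuristic can be sketched in base R. This is a simplified illustration, not the package implementation: the function `greedy_rss_folds` is hypothetical, groups are processed from largest to smallest instead of the RSS-based order described above, and ties between folds go to the lowest fold number.

```r
# Hypothetical sketch: assign whole groups to folds, greedily minimizing the
# residual sum of squares between actual and ideal strata/fold counts.
greedy_rss_folds <- function(group, stratum, K) {
  counts <- unclass(table(group, stratum))  # G x S matrix of counts
  ideal <- colSums(counts) / K              # ideal count per stratum in each fold
  ord <- order(-rowSums(counts))            # simplification: largest groups first
  fold.counts <- matrix(0, K, ncol(counts)) # K x S matrix, filled as groups are placed
  group.fold <- integer(nrow(counts))
  for (g in ord) {
    # RSS that would result from putting group g in each fold.
    rss <- sapply(seq_len(K), function(k) {
      trial <- fold.counts
      trial[k, ] <- trial[k, ] + counts[g, ]
      sum(sweep(trial, 2, ideal)^2)
    })
    best <- which.min(rss)
    group.fold[g] <- best
    fold.counts[best, ] <- fold.counts[best, ] + counts[g, ]
  }
  names(group.fold) <- rownames(counts)
  unname(group.fold[as.character(group)])   # one fold id per data row
}

group <- rep(1:6, each = 2)
stratum <- rep(c("a", "b"), times = 6)
fold <- greedy_rss_folds(group, stratum, K = 3)
table(fold, stratum)  # every fold gets 2 rows of each stratum
```

In this toy case each group holds one row per stratum, so the greedy pass reaches a perfectly balanced assignment; with uneven groups the result is only approximate.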
The train/test splits are defined by all possible combinations of
test subset, test fold, train subsets (same/other/all), downsampling
sizes, and random seeds.
After $instantiate() is called,
$instance is a list of data about the splits, with elements:
data table with one row per split.
data table with same number of rows as task data.
new()
Creates a new instance of this R6 class.
Resampling$new(id, param_set = ps(), duplicated_ids = FALSE, label = NA_character_, man = NA_character_)
id (character(1))
Identifier for the new instance.
param_set (paradox::ParamSet)
Set of hyperparameters.
duplicated_ids (logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
label (character(1))
Label for the new instance.
man (character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object.
The referenced help package can be opened via method $help().
train_set()
Returns the row ids of the i-th training set.
Resampling$train_set(i)
i (integer(1))
Iteration.
(integer()) of row ids.
test_set()
Returns the row ids of the i-th test set.
Resampling$test_set(i)
i (integer(1))
Iteration.
(integer()) of row ids.
arXiv paper https://arxiv.org/abs/2410.08643 describing SOAK algorithm.
Articles https://github.com/tdhock/mlr3resampling/wiki/Articles
Package mlr3 for standard
Resampling, which does not support comparing
train on Same/Other/All subsets.
vignette(package="mlr3resampling") for more detailed examples.
same_other_sizes <- mlr3resampling::ResamplingSameOtherSizesCV$new()
same_other_sizes$param_set$values$folds <- 5
Computes a data table of scores.
score(bench.result, ...)
| bench.result | Output of |
|---|---|
| ... | Additional arguments to pass to |
data table with scores.
Toby Dylan Hocking
N <- 80
library(data.table)
set.seed(1)
reg.dt <- data.table(
  x=runif(N, -2, 2),
  person=rep(1:2, each=0.5*N))
reg.pattern.list <- list(
  easy=function(x, person)x^2,
  impossible=function(x, person)(x^2)*(-1)^person)
SOAK <- mlr3resampling::ResamplingSameOtherSizesCV$new()
reg.task.list <- list()
for(pattern in names(reg.pattern.list)){
  f <- reg.pattern.list[[pattern]]
  yname <- paste0("y_",pattern)
  reg.dt[, (yname) := f(x,person)+rnorm(N, sd=0.5)][]
  task.dt <- reg.dt[, c("x","person",yname), with=FALSE]
  task.obj <- mlr3::TaskRegr$new(
    pattern, task.dt, target=yname)
  task.obj$col_roles$stratum <- "person"
  task.obj$col_roles$subset <- "person"
  reg.task.list[[pattern]] <- task.obj
}
reg.learner.list <- list(
  mlr3::LearnerRegrFeatureless$new())
if(requireNamespace("rpart")){
  reg.learner.list$rpart <- mlr3::LearnerRegrRpart$new()
}
(bench.grid <- mlr3::benchmark_grid(
  reg.task.list,
  reg.learner.list,
  SOAK))
bench.result <- mlr3::benchmark(bench.grid)
bench.score <- mlr3resampling::score(bench.result, mlr3::msr("regr.rmse"))
plot(bench.score)