The goal of this vignette is to show how to use custom asymptotic references. As an example, we explore the differences in asymptotic time complexity between different implementations of binary segmentation.
The code below uses the following arguments:
N
is a numeric vector of data sizes,setup
is an R expression to create the data,library(data.table)
seg.result <- atime::atime(
N=2^seq(2, 20),
setup={
max.segs <- as.integer(N/2)
max.changes <- max.segs-1L
set.seed(1)
data.vec <- 1:N
},
"changepoint\n::cpt.mean"={
cpt.fit <- changepoint::cpt.mean(data.vec, method="BinSeg", Q=max.changes)
sort(c(N,cpt.fit@cpts.full[max.changes,]))
},
"binsegRcpp\nmultiset"={
binseg.fit <- binsegRcpp::binseg(
"mean_norm", data.vec, max.segs, container.str="multiset")
sort(binseg.fit$splits$end)
},
"fpop::\nmultiBinSeg"={
mbs.fit <- fpop::multiBinSeg(data.vec, max.changes)
sort(c(mbs.fit$t.est, N))
},
"wbs::sbs"={
wbs.fit <- wbs::sbs(data.vec)
split.dt <- data.table(wbs.fit$res)[order(-min.th, scale)]
sort(split.dt[, c(N, cpt)][1:max.segs])
},
"binsegRcpp\nlist"={
binseg.fit <- binsegRcpp::binseg(
"mean_norm", data.vec, max.segs, container.str="list")
sort(binseg.fit$splits$end)
})
plot(seg.result)
#> Warning in ggplot2::scale_y_log10("median line, min/max band"): log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
#> log-10 transformation introduced infinite values.
The plot method creates a log-log plot of median time and memory vs
data size, for each of the specified R expressions. The plot above shows
some speed differences between binary segmentation algorithms, but they
could be even easier to see for larger data sizes (exercise for the
reader: try modifying the N
and seconds.limit
arguments). You can also see that memory usage is much larger for
changepoint than for the other packages.
You can use references_best
to get a tall/long data
table that can be plotted to show both empirical time and memory
complexity:
seg.best <- atime::references_best(seg.result)
plot(seg.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced
#> infinite values.
The figure above shows asymptotic references which are best fit for each expression. We do the best fit by adjusting each reference to the largest N, and then ranking each reference by distance to the measurement of the second to largest N. For each panel/facet (method and unit), what appears to be the best fit asymptotic complexity? Which methods use quadratic time and/or memory?
The predict method estimates the data size N
which each
method can handle for a given unit value. The default is to estimate
N
for a time limit of 0.01 seconds, as shown in the plot
below:
(seg.pred <- predict(seg.best))
#> atime_prediction object
#> unit expr.name unit.value N
#> <char> <char> <num> <num>
#> 1: seconds changepoint\n::cpt.mean 0.01 629.2871
#> 2: seconds binsegRcpp\nmultiset 0.01 18438.3041
#> 3: seconds fpop::\nmultiBinSeg 0.01 20917.7532
#> 4: seconds wbs::sbs 0.01 20912.1263
#> 5: seconds binsegRcpp\nlist 0.01 2958.3351
plot(seg.pred)
#> Warning in ggplot2::scale_x_log10("N", breaks = meas[,
#> 10^seq(ceiling(min(log10(N))), : log-10 transformation introduced infinite
#> values.
If you have one or more expected time complexity classes that you
want to compare with your empirical measurements, you can use the
fun.list
argument. Note that each function in that list
should take as input the data size N
and output log base 10
of the reference function, as below:
my.refs <- list(# names should be LaTeX math mode expressions.
"N"=function(N)log10(N),
"N \\log N"=function(N)log10(N) + log10(log(N)),
"N^2"=function(N)2*log10(N),
"N^3"=function(N)3*log10(N))
my.best <- atime::references_best(seg.result, fun.list=my.refs)
dcast(my.best$references, expr.name + fun.name ~ unit, length)
#> Key: <expr.name, fun.name>
#> expr.name fun.name kilobytes seconds
#> <char> <char> <int> <int>
#> 1: binsegRcpp\nlist N 10 8
#> 2: binsegRcpp\nlist N log N 9 7
#> 3: binsegRcpp\nlist N^2 5 4
#> 4: binsegRcpp\nlist N^3 4 3
#> 5: binsegRcpp\nmultiset N 13 8
#> 6: binsegRcpp\nmultiset N log N 11 7
#> 7: binsegRcpp\nmultiset N^2 7 4
#> 8: binsegRcpp\nmultiset N^3 5 3
#> 9: changepoint\n::cpt.mean N 9 9
#> 10: changepoint\n::cpt.mean N log N 9 8
#> 11: changepoint\n::cpt.mean N^2 7 5
#> 12: changepoint\n::cpt.mean N^3 5 3
#> 13: fpop::\nmultiBinSeg N 10 8
#> 14: fpop::\nmultiBinSeg N log N 9 7
#> 15: fpop::\nmultiBinSeg N^2 5 4
#> 16: fpop::\nmultiBinSeg N^3 4 3
#> 17: wbs::sbs N 13 8
#> 18: wbs::sbs N log N 11 7
#> 19: wbs::sbs N^2 7 4
#> 20: wbs::sbs N^3 5 3
#> expr.name fun.name kilobytes seconds
The table above shows the counts of rows in the references table,
which can be used to draw asymptotic reference curves, for each
expr.name
, fun.name
, and unit (kilobytes and
seconds). To the best fit among these references, we use the code
below:
plot(my.best)
#> Warning in ggplot2::scale_y_log10(""): log-10 transformation introduced
#> infinite values.
The default plot above shows only the closest references which are
above and below the empirical/black measurements. To plot different
references and measurements, you can specify the
plot.references
and measurements
elements. For
example, in the code below, we subset the data so that only the seconds
unit is shown (not kilobytes), and only three expected references are
shown (log-linear, quadratic, cubic):
some.best <- my.best
some.best$plot.references <- my.best$ref[unit=="seconds" & fun.name %in% c("N log N","N^2","N^3")]
some.best$measurements <- my.best$meas[unit=="seconds"]
plot(some.best)
From the plot above, you should be able to see the asymptotic time complexity class of each algorithm.
seconds.limit
to see the differences more
clearly.