Title: | Penalty Learning |
---|---|
Description: | Implementations of algorithms from Learning Sparse Penalties for Change-point Detection using Max Margin Interval Regression, by Hocking, Rigaill, Vert, and Bach <http://proceedings.mlr.press/v28/hocking13.html>, published in the proceedings of ICML 2013. |
Authors: | Toby Dylan Hocking [aut, cre] |
Maintainer: | Toby Dylan Hocking <[email protected]> |
License: | GPL-3 |
Version: | 2024.9.3 |
Built: | 2024-11-01 03:24:04 UTC |
Source: | https://github.com/tdhock/penaltylearning |
character vector of change-point label colors, to be used with ggplot2::scale_*_manual
"change.colors"
"change.colors"
data.table of meta-data for label types.
"change.labels"
"change.labels"
Describe an annotated region label for supervised change-point detection.
changeLabel(annotation, min.changes, max.changes, color)
annotation |
character string: name of the label, e.g. "1change" or "0changes". |
min.changes |
minimum number of changes allowed in the labeled region. |
max.changes |
maximum number of changes allowed in the labeled region. |
color |
color used to display the label. |
Toby Dylan Hocking <[email protected]> [aut, cre]
stop with an informative error if there is a problem with the feature or target matrix.
check_features_targets(feature.mat, target.mat)
feature.mat |
n x p numeric input feature matrix. |
target.mat |
n x 2 matrix of target interval limits. |
number of observations/rows.
Toby Dylan Hocking <[email protected]> [aut, cre]
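A minimal sketch of calling this check on consistent synthetic matrices (the matrices here are hypothetical; per the Value section above, the return value should be the number of rows):
library(penaltyLearning)
set.seed(1)
feature.mat <- matrix(rnorm(20), 10, 2)
## Lower and upper log(penalty) limits, lower < upper in every row.
target.mat <- cbind(rnorm(10), rnorm(10)+5)
check_features_targets(feature.mat, target.mat) # should return 10.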
stop with an informative error if there are problems with the target matrix or predicted values.
check_target_pred(target.mat, pred)
target.mat |
n x 2 numeric matrix of target interval limits. |
pred |
numeric vector of predicted log(penalty) values. |
number of observations.
Toby Dylan Hocking <[email protected]> [aut, cre]
Get the learned coefficients of an IntervalRegression model.
## S3 method for class 'IntervalRegression' coef(object, ...)
object |
IntervalRegression model fit, e.g. output of IntervalRegressionCV. |
... |
ignored. |
numeric matrix [features x regularizations] of learned weights (on the original feature scale), which can be used for prediction via cbind(1, features) %*% weights.
Toby Dylan Hocking <[email protected]> [aut, cre]
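For instance, following the Value description above, predictions can be computed manually from the weight matrix; a minimal sketch (fit is assumed to be an IntervalRegression model, e.g. from the IntervalRegressionCV examples below, and feature.mat a matching feature matrix):
weight.mat <- coef(fit) # features x regularizations
## Affine prediction as described above: prepend a column of ones
## for the intercept, then multiply by the learned weights.
pred.log.lambda <- cbind(1, feature.mat) %*% weight.mat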
PeakSegFPOP demo data set with 8 observations
data("demo8")
data("demo8")
A list of two objects: feature.mat is an 8 x 36 input feature matrix, and target.mat is an 8 x 2 output limit matrix.
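A quick way to inspect the structure documented above:
library(penaltyLearning)
data(demo8, envir=environment())
str(demo8, max.level=1) # feature.mat: 8 x 36, target.mat: 8 x 2.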
Compute a feature matrix (segmentation problems x features).
featureMatrix(data.sequences, problem.vars, data.var)
data.sequences |
data.frame of sorted sequences of data to segment. |
problem.vars |
character vector of columns of data.sequences that identify each segmentation problem. |
data.var |
character vector of length 1 (column of data.sequences with the data to segment). |
Numeric feature matrix. Some entries may be missing or infinite; these columns should be removed before model training.
Toby Dylan Hocking <[email protected]> [aut, cre]
test.df <- data.frame(
  id=rep(1:2, each=10),
  x=rnorm(20))
penaltyLearning::featureMatrix(test.df, "id", "x")
if(requireNamespace("neuroblastoma")){
  data(neuroblastoma, package="neuroblastoma", envir=environment())
  one <- subset(neuroblastoma$profiles, profile.id %in% c(1,2))
  f.mat <- penaltyLearning::featureMatrix(
    one, c("profile.id", "chromosome"), "logratio")
}
Compute a feature vector of constant length which can be used as an input for supervised penalty learning; the corresponding output is a target interval of log(penalty) values that achieve minimum incorrect labels (see targetIntervals).
featureVector(data.vec)
data.vec |
numeric vector of ordered data. |
Numeric vector of features.
Toby Dylan Hocking <[email protected]> [aut, cre]
x <- rnorm(10)
penaltyLearning::featureVector(x)
if(requireNamespace("neuroblastoma")){
  data(neuroblastoma, package="neuroblastoma", envir=environment())
  one <- subset(
    neuroblastoma$profiles, profile.id=="1" & chromosome=="1")
  (f.vec <- penaltyLearning::featureVector(one$logratio))
}
ggplot2 geom with xmin and xmax aesthetics that covers the entire y range, useful for clickSelects background elements.
geom_tallrect(mapping = NULL, data = NULL, stat = "identity",
  position = "identity", ..., na.rm = FALSE, show.legend = NA,
  inherit.aes = TRUE)
mapping |
set of aesthetic mappings created by aes(); must include xmin and xmax. |
data |
the data to be displayed in this layer. |
stat |
the statistical transformation to use on the data for this layer. |
position |
position adjustment. |
... |
other arguments passed on to the layer, e.g. color and fill. |
na.rm |
if FALSE, the default, missing values are removed with a warning; if TRUE, missing values are silently removed. |
show.legend |
logical: should this layer be included in the legends? NA, the default, includes it if any aesthetics are mapped. |
inherit.aes |
if FALSE, overrides the default aesthetics rather than combining with them. |
Toby Dylan Hocking <[email protected]> [aut, cre]
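A minimal usage sketch with synthetic data (the annotation names are assumed to match entries of change.colors, documented above):
library(ggplot2)
library(penaltyLearning)
signal.df <- data.frame(position=1:10, value=rnorm(10))
label.df <- data.frame(
  min=c(2, 6), max=c(3, 8),
  annotation=c("1change", "0changes"))
ggplot()+
  geom_tallrect(aes(
    xmin=min, xmax=max, fill=annotation),
    data=label.df, alpha=0.5)+
  geom_point(aes(position, value), data=signal.df)+
  scale_fill_manual(values=change.colors)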
Use cross-validation to fit an L1-regularized linear interval regression model by optimizing margin and/or regularization parameters. This function repeatedly calls IntervalRegressionRegularized, and by default assumes that margin=1. To optimize the margin, specify the margin.vec parameter manually, or use IntervalRegressionCVmargin (which takes more computation time but yields more accurate models). If the future package is available, two levels of future_lapply are used to parallelize on validation.fold and margin.
IntervalRegressionCV(feature.mat, target.mat,
  n.folds = ifelse(nrow(feature.mat) < 10, 3L, 5L),
  fold.vec = sample(rep(1:n.folds, l = nrow(feature.mat))),
  verbose = 0, min.observations = 10, reg.type = "min",
  incorrect.labels.db = NULL, initial.regularization = 0.001,
  margin.vec = 1, LAPPLY = NULL, check.unlogged = TRUE, ...)
feature.mat |
Numeric feature matrix, n observations x p features. |
target.mat |
Numeric target matrix, n observations x 2 limits. These should be real-valued (possibly negative). If your data are interval censored positive-valued survival times, you need to log them to obtain target.mat. |
n.folds |
Number of cross-validation folds. |
fold.vec |
Integer vector of fold id numbers. |
verbose |
numeric: 0 for silent, bigger numbers (1 or 2) for more output. |
min.observations |
stop with an error if there are fewer than this many observations. |
reg.type |
Either "1sd" or "min" which specifies how the regularization parameter is chosen during the internal cross-validation loop. min: first take the mean of the K-CV error functions, then minimize it (this is the default since it tends to yield the least test error). 1sd: take the most regularized model with the same margin which is within one standard deviation of that minimum (this model is typically a bit less accurate, but much less complex, so better if you want to interpret the coefficients). |
incorrect.labels.db |
either NULL or a data.table, which specifies the error function to compute for selecting the regularization parameter on the validation set. NULL means to minimize the squared hinge loss, which measures how far the predicted log(penalty) values are from the target intervals. If a data.table is specified, its first key should correspond to the rownames of feature.mat, and it is used to compute the number of incorrect labels (and AUC) on the validation set. |
initial.regularization |
Passed to IntervalRegressionRegularized. |
margin.vec |
numeric vector of margin size hyper-parameters. The computation time is linear in the number of elements of margin.vec. |
LAPPLY |
Function to use for parallelization, by default future.apply::future_lapply if that package is available, and lapply otherwise. |
check.unlogged |
If TRUE, stop with an error if target matrix is non-negative and has any big difference in successive quantiles (this is an indicator that the user probably forgot to log their outputs). |
... |
passed to IntervalRegressionRegularized. |
List representing regularized linear model.
Toby Dylan Hocking <[email protected]> [aut, cre]
if(interactive()){
  library(penaltyLearning)
  data("neuroblastomaProcessed", package="penaltyLearning", envir=environment())
  if(require(future)){
    plan(multisession) # multiprocess is deprecated in recent versions of future.
  }
  set.seed(1)
  i.train <- 1:100
  fit <- with(neuroblastomaProcessed, IntervalRegressionCV(
    feature.mat[i.train,], target.mat[i.train,],
    verbose=0))
  ## When only features and target matrices are specified for
  ## training, the squared hinge loss is used as the metric to
  ## minimize on the validation set.
  plot(fit)
  ## Create an incorrect labels data.table (first key is same as
  ## rownames of feature.mat and target.mat).
  library(data.table)
  errors.per.model <- data.table(neuroblastomaProcessed$errors)
  errors.per.model[, pid.chr := paste0(profile.id, ".", chromosome)]
  setkey(errors.per.model, pid.chr)
  set.seed(1)
  fit <- with(neuroblastomaProcessed, IntervalRegressionCV(
    feature.mat[i.train,], target.mat[i.train,],
    ## The incorrect.labels.db argument is optional, but can be used if
    ## you want to use AUC as the CV model selection criterion.
    incorrect.labels.db=errors.per.model))
  plot(fit)
}
Use cross-validation to fit an L1-regularized linear interval regression model by optimizing both margin and regularization parameters. This function just calls IntervalRegressionCV with a margin.vec parameter that is computed based on the finite target interval limits. With default parameters, this function should be about 10 times slower than IntervalRegressionCV (since it computes n.margin=10 models per regularization parameter, whereas IntervalRegressionCV computes only one). On large (N > 1000 rows) data sets, this function should yield a model which is a little more accurate than IntervalRegressionCV (since the margin parameter is optimized).
IntervalRegressionCVmargin(feature.mat, target.mat, log10.diff = 2,
  n.margin = 10L, ...)
feature.mat |
Numeric feature matrix, n observations x p features. |
target.mat |
Numeric target matrix, n observations x 2 limits. |
log10.diff |
Numeric scalar: factors of 10 below the largest finite limit difference to use as a minimum margin value (difference on the log10 scale which is used to generate margin parameters). Bigger values mean a grid of margin parameters with a larger range. For example, if the largest finite limit difference in target.mat is 26 and log10.diff is the default of 2, then the margin parameter grid is logarithmically spaced from 0.26 to 26. |
n.margin |
Integer scalar: number of margin parameters, by default 10. |
... |
Passed to IntervalRegressionCV. |
Model fit list from IntervalRegressionCV.
Toby Dylan Hocking <[email protected]> [aut, cre]
if(interactive()){
  library(penaltyLearning)
  data(
    "neuroblastomaProcessed",
    package="penaltyLearning",
    envir=environment())
  if(require(future)){
    plan(multisession) # multiprocess is deprecated in recent versions of future.
  }
  set.seed(1)
  fit <- with(neuroblastomaProcessed, IntervalRegressionCVmargin(
    feature.mat, target.mat, verbose=1))
  plot(fit)
  print(fit$plot.heatmap)
}
Solve the squared hinge loss interval regression problem for one regularization parameter: w* = argmin_w L(w) + regularization * ||w||_1, where L(w) is the average squared hinge loss with respect to the targets, and ||w||_1 is the L1-norm of the weight vector (excluding the first element, which is the un-regularized intercept or bias term). This function performs no scaling of input features, and is meant for internal use only! To learn a regression model, use IntervalRegressionCV or IntervalRegressionUnregularized instead.
IntervalRegressionInternal(features, targets, initial.param.vec,
  regularization, threshold = 0.001, max.iterations = 1000,
  weight.vec = NULL, Lipschitz = NULL, verbose = 2, margin = 1,
  biggest.crit = 100)
features |
Scaled numeric feature matrix (problems x features). |
targets |
Numeric target matrix (problems x 2). |
initial.param.vec |
initial guess for weight vector (one element per column of features). |
regularization |
Degree of L1-regularization. |
threshold |
When the stopping criterion gets below this threshold, the algorithm stops and returns the current weight vector. |
max.iterations |
If the algorithm has not found an optimal solution after this many iterations, increase max.iterations. |
weight.vec |
A numeric vector of weights for each training example. |
Lipschitz |
A numeric scalar, or NULL, which means to compute the Lipschitz constant from the feature matrix. |
verbose |
Cat messages: for restarts and at the end if >= 1, and for every iteration if >= 2. |
margin |
Margin size hyper-parameter, default 1. |
biggest.crit |
Restart FISTA with a bigger Lipschitz constant (smaller step size) if the criterion gets bigger than this. |
Numeric vector of scaled weights w of the affine function f_w(X) = X %*% w, for a scaled feature matrix X with the first column entirely ones.
Toby Dylan Hocking <[email protected]> [aut, cre]
Repeatedly use IntervalRegressionInternal to solve interval regression problems for a path of regularization parameters. This function does not perform automatic selection of the regularization parameter; instead, it returns regression models for a range of regularization parameters, and it is up to you to select which one to use. For automatic regularization parameter selection, use IntervalRegressionCV.
IntervalRegressionRegularized(feature.mat, target.mat,
  initial.regularization = 0.001, factor.regularization = 1.2,
  verbose = 0, margin = 1, ...)
feature.mat |
Numeric feature matrix. |
target.mat |
Numeric target matrix. |
initial.regularization |
Initial regularization parameter. |
factor.regularization |
Increase regularization by this factor after finding an optimal solution, or NULL to compute just one model (with regularization=initial.regularization). |
verbose |
Print messages if >= 1. |
margin |
Non-negative margin size hyper-parameter, default 1. |
... |
Other parameters to pass to IntervalRegressionInternal. |
List representing fit model. You can do fit$predict(feature.matrix) to get a matrix of predicted log penalty values. The param.mat is the n.features * n.regularization numeric matrix of optimal coefficients (on the original scale).
Toby Dylan Hocking <[email protected]> [aut, cre]
if(interactive()){
  library(penaltyLearning)
  data("neuroblastomaProcessed", package="penaltyLearning", envir=environment())
  i.train <- 1:500
  fit <- with(neuroblastomaProcessed, IntervalRegressionRegularized(
    feature.mat[i.train,], target.mat[i.train,]))
  plot(fit)
}
Use IntervalRegressionRegularized with initial.regularization=0 and factor.regularization=NULL, meaning fit one un-regularized interval regression model.
IntervalRegressionUnregularized(...)
... |
passed to IntervalRegressionRegularized. |
List representing fit model; see help(IntervalRegressionRegularized) for details.
Toby Dylan Hocking <[email protected]> [aut, cre]
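No example is given above; a minimal sketch, assuming the log2.n feature column used elsewhere on this page:
if(interactive()){
  library(penaltyLearning)
  data("neuroblastomaProcessed", package="penaltyLearning", envir=environment())
  i.train <- 1:500
  fit <- with(neuroblastomaProcessed, IntervalRegressionUnregularized(
    feature.mat[i.train, "log2.n", drop=FALSE], target.mat[i.train,]))
  ## fit$predict gives predicted log(penalty) values, as documented
  ## in help(IntervalRegressionRegularized).
  fit$predict(neuroblastomaProcessed$feature.mat[
    501:505, "log2.n", drop=FALSE])
}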
Compute incorrect labels for several change-point detection problems and models. Use this function after having computed changepoints, loss values, and model selection functions (see modelSelection). The next step after labelError is typically computing target intervals of log(penalty) values that predict changepoints with minimum incorrect labels for each problem (see targetIntervals).
labelError(models, labels, changes, change.var = "chromStart",
  label.vars = c("min", "max"), model.vars = "n.segments",
  problem.vars = character(0), annotations = change.labels)
models |
data.frame with one row per (problem,model) combination, typically the output of modelSelection(...). There is a row for each changepoint model that could be selected for a particular segmentation problem. There should be columns problem.vars and model.vars. |
labels |
data.frame with one row per (problem,region). Each label defines a region in a particular segmentation problem, and a range of predicted changepoints which are consistent in that region. There should be a column "annotation" which takes one of the corresponding values in the annotation column of annotations, as well as columns problem.vars and label.vars. |
changes |
data.frame with one row per (problem,model,change), for each predicted changepoint (in each model and segmentation problem). Should have columns problem.vars, model.vars, and change.var. |
change.var |
character(length=1): column name of predicted change-point position in changes. |
label.vars |
character(length=2): column names of start and end positions of labels. |
model.vars |
character: column names used to identify model complexity. The default "n.segments" is for change-point models. |
problem.vars |
character: column names used to identify data set / segmentation problem; should be present in all three data tables (models, labels, changes). |
annotations |
data.table with columns annotation, min.changes, max.changes, possible.fn, possible.fp, which is joined to labels in order to determine the possible false positives and false negatives in each label. |
list of two data.tables: label.errors has one row for every combination of models and labels, with a status column that indicates whether or not that model commits an error in that particular label; model.errors has one row per model, with columns for computing target intervals and ROC curves (see targetIntervals and ROChange).
Toby Dylan Hocking <[email protected]> [aut, cre]
label <- function(annotation, min, max){
  data.frame(profile.id=4, chrom="chr14", min, max, annotation)
}
label.df <- rbind(
  label("1change", 70e6, 80e6),
  label("0changes", 20e6, 60e6))
model.df <- data.frame(chrom="chr14", n.segments=1:3)
change.df <- data.frame(chrom="chr14", rbind(
  data.frame(n.segments=2, changepoint=75e6),
  data.frame(n.segments=3, changepoint=c(75e6, 50e6))))
penaltyLearning::labelError(
  model.df, label.df, change.df,
  problem.vars="chrom", # for all three data sets.
  model.vars="n.segments", # for changes and selection.
  change.var="changepoint", # column of changes with breakpoint position.
  label.vars=c("min", "max")) # limit of labels in ann.
Find the run of minimum cost with the largest size. This function uses a linear time C implementation, and is meant for internal use. Use targetIntervals for real data.
largestContinuousMinimumC(cost, size)
cost |
numeric vector of cost values. |
size |
numeric vector of interval sizes. |
Integer vector of length 2 (start and end of the target interval, relative to cost and size).
Toby Dylan Hocking <[email protected]> [aut, cre]
library(penaltyLearning)
data(neuroblastomaProcessed, envir=environment())
one.problem.error <- neuroblastomaProcessed$errors[
  profile.id=="4" & chromosome=="1"]
indices <- one.problem.error[, largestContinuousMinimumC(
  errors, max.log.lambda-min.log.lambda)]
one.problem.error[indices[["start"]]:indices[["end"]],]
Find the run of minimum cost with the largest size. This function uses a two-pass R implementation, and is meant for internal use. Use targetIntervals for real data.
largestContinuousMinimumR(cost, size)
cost |
numeric vector of cost values. |
size |
numeric vector of interval sizes. |
Integer vector of length 2 (start and end of the target interval, relative to cost and size).
Toby Dylan Hocking <[email protected]> [aut, cre]
library(penaltyLearning)
data(neuroblastomaProcessed, envir=environment())
one.problem.error <- neuroblastomaProcessed$errors[
  profile.id=="4" & chromosome=="1"]
indices <- one.problem.error[, largestContinuousMinimumR(
  errors, max.log.lambda-min.log.lambda)]
one.problem.error[indices[["start"]]:indices[["end"]],]
Given loss.vec L_i and model.complexity K_i, the model selection function is i*(lambda) = argmin_i L_i + lambda*K_i; this computes all of the solutions (i, min.lambda, max.lambda), with i being the solution for every lambda in (min.lambda, max.lambda). Use this function after having computed changepoints and loss values for each model, and before using labelError. This function uses the linear time algorithm implemented in C code (modelSelectionC).
modelSelection(models, loss = "loss", complexity = "complexity")
models |
data.frame with one row per model. There must be at least two columns models[[loss]] and models[[complexity]], but there can also be other meta-data columns. |
loss |
character: column name of models to interpret as loss L_i. |
complexity |
character: column name of models to interpret as complexity K_i. |
data.frame with a row for each model that can be selected for at least one lambda value, and the following columns: (min.lambda, max.lambda) and (min.log.lambda, max.log.lambda) are intervals of optimal penalty constants, on the original and log scale; the other columns (and rownames) are taken from models. This should be used as the models argument of labelError.
Toby Dylan Hocking <[email protected]> [aut, cre]
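No example is given above; a minimal sketch, re-using the decreasing loss values from the modelSelectionC example below:
library(penaltyLearning)
models.df <- data.frame(
  loss=c(-9.9, -12.8, -19.2, -22.1, -24.5),
  complexity=1:5)
## One row per model selected for at least one lambda, with
## (min.lambda, max.lambda) intervals of optimal penalty constants.
penaltyLearning::modelSelection(models.df, loss="loss", complexity="complexity")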
Given loss.vec L_i and model.complexity K_i, the model selection function is i*(lambda) = argmin_i L_i + lambda*K_i; this computes all of the solutions (i, min.lambda, max.lambda), with i being the solution for every lambda in (min.lambda, max.lambda). This function uses the linear time algorithm implemented in C code. It is mostly meant for internal use; it is recommended to use modelSelection instead.
modelSelectionC(loss.vec, model.complexity, model.id)
loss.vec |
numeric vector: loss L_i |
model.complexity |
numeric vector: model complexity K_i |
model.id |
vector: indices i |
data.frame with a row for each model that can be selected for at least one lambda value, and the following columns: (min.lambda, max.lambda) and (min.log.lambda, max.log.lambda) are intervals of optimal penalty constants, on the original and log scale; model.complexity are the K_i values; model.id are the model identifiers (also used for row names); and model.loss are the L_i values.
Toby Dylan Hocking <[email protected]> [aut, cre]
loss.vec <- c(
  -9.9, -12.8, -19.2, -22.1, -24.5, -26.1, -28.5, -30.1, -32.2,
  -33.7, -35.2, -36.8, -38.2, -39.5, -40.7, -41.8, -42.8, -43.9,
  -44.9, -45.8)
seg.vec <- seq_along(loss.vec)
exact.df <- penaltyLearning::modelSelectionC(loss.vec, seg.vec, seg.vec)
## Solve the optimization using grid search.
L.grid <- with(exact.df,{
  seq(min(max.log.lambda)-1, max(min.log.lambda)+1, l=100)
})
lambda.grid <- exp(L.grid)
kstar.grid <- sapply(lambda.grid, function(lambda){
  crit <- with(exact.df, model.complexity * lambda + model.loss)
  picked <- which.min(crit)
  exact.df$model.id[picked]
})
grid.df <- data.frame(log.lambda=L.grid, segments=kstar.grid)
library(ggplot2)
## Compare the results.
ggplot()+
  ggtitle("grid search (red) agrees with exact path computation (black)")+
  geom_segment(aes(
    min.log.lambda, model.id,
    xend=max.log.lambda, yend=model.id),
    data=exact.df)+
  geom_point(aes(log.lambda, segments),
    data=grid.df, color="red", pch=1)+
  ylab("optimal model complexity (segments)")+
  xlab("log(lambda)")
Given loss.vec L_i and model.complexity K_i, the model selection function is i*(lambda) = argmin_i L_i + lambda*K_i; this computes all of the solutions (i, min.lambda, max.lambda), with i being the solution for every lambda in (min.lambda, max.lambda). This function uses the quadratic time algorithm implemented in R code. It is mostly meant for internal use and comparison; it is recommended to use modelSelection instead.
modelSelectionR(loss.vec, model.complexity, model.id)
loss.vec |
numeric vector: loss L_i |
model.complexity |
numeric vector: model complexity K_i |
model.id |
vector: indices i |
data.frame with a row for each model that can be selected for at least one lambda value, and the following columns: (min.lambda, max.lambda) and (min.log.lambda, max.log.lambda) are intervals of optimal penalty constants, on the original and log scale; model.complexity are the K_i values; model.id are the model identifiers (also used for row names); and model.loss are the L_i values.
Toby Dylan Hocking <[email protected]> [aut, cre]
loss.vec <- c(
  -9.9, -12.8, -19.2, -22.1, -24.5, -26.1, -28.5, -30.1, -32.2,
  -33.7, -35.2, -36.8, -38.2, -39.5, -40.7, -41.8, -42.8, -43.9,
  -44.9, -45.8)
seg.vec <- seq_along(loss.vec)
penaltyLearning::modelSelectionR(loss.vec, seg.vec, seg.vec)
Features are inputs and targets are outputs for penalty learning functions like penaltyLearning::IntervalRegressionCV. data(neuroblastoma, package="neuroblastoma") was processed by computing optimal Gaussian segmentation models from 1 to 20 segments (cghseg:::segmeanCO or Segmentor3IsBack::Segmentor), then label error was computed using neuroblastoma$annotations (penaltyLearning::labelError), then target intervals were computed (penaltyLearning::targetIntervals). Features were also computed based on neuroblastoma$profiles.
data("neuroblastomaProcessed")
data("neuroblastomaProcessed")
List of two matrices: feature.mat is n.observations x n.features, and target.mat is n.observations x 2, where n.observations=3418 and n.features=117.
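A quick way to check the documented dimensions:
library(penaltyLearning)
data(neuroblastomaProcessed, envir=environment())
dim(neuroblastomaProcessed$feature.mat) # 3418 117
dim(neuroblastomaProcessed$target.mat)  # 3418 2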
A small data set for which a previous implementation of IntervalRegressionCV diverged.
data("notConverging")
data("notConverging")
A list with elements X.mat (numeric input matrix), y.mat (numeric output matrix), and fold.vec (an integer vector of fold ID numbers).
github.com/tdhock/neuroblastoma-data, data/H3K4me3_TDH_other/cv/equal_labels/testFolds/3/sampleSelectionGP_erf/5/order.csv
Loss and model complexity values for which one of the models is never selected when using a linear penalty.
data("oneSkip")
data("oneSkip")
A list of two data.frames (input and output).
example(exactModelSelection) in the PeakSegDP package.
Plot an IntervalRegression model.
## S3 method for class 'IntervalRegression' plot(x, ...)
x |
IntervalRegression model fit. |
... |
ignored. |
a ggplot.
Toby Dylan Hocking <[email protected]> [aut, cre]
Compute model predictions.
## S3 method for class 'IntervalRegression' predict(object, X, ...)
object |
IntervalRegression model fit. |
X |
numeric input feature matrix (n observations x p features). |
... |
ignored. |
numeric matrix of predicted log(penalty) values.
Toby Dylan Hocking <[email protected]> [aut, cre]
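A minimal sketch (fit is assumed to come from the IntervalRegressionCV example above):
## numeric matrix of predicted log(penalty) values, one row per
## row of the input feature matrix.
pred.mat <- predict(fit, neuroblastomaProcessed$feature.mat[101:110, ])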
print learned model parameters.
## S3 method for class 'IntervalRegression' print(x, ...)
x |
IntervalRegression model fit. |
... |
ignored. |
Toby Dylan Hocking <[email protected]> [aut, cre]
Compute a Receiver Operating Characteristic curve for a penalty function.
ROChange(models, predictions, problem.vars = character())
models |
data.frame describing the number of incorrect labels as a function of log(lambda), with columns min.log.lambda, max.log.lambda, fp, fn, possible.fp, possible.fn, etc. This can be computed via labelError(modelSelection(...), ...)$model.errors – see examples. |
predictions |
data.frame with a column named pred.log.lambda, the predicted log(penalty) value for each segmentation problem. |
problem.vars |
character: column names used to identify data set / segmentation problem. |
named list of results:
roc |
a data.table with one row for each point on the ROC curve |
thresholds |
two rows of roc which correspond to the predicted and minimal error thresholds |
auc.polygon |
a data.table with one row for each vertex of the polygon used to compute AUC |
auc |
numeric Area Under the ROC curve |
aum |
numeric Area Under Min(FP,FN) |
aum.grad |
data.table with one row for each prediction, and columns hi/lo bound for the aum generalized gradient. |
Toby Dylan Hocking <[email protected]> [aut, cre]
library(penaltyLearning)
library(data.table)
data(neuroblastomaProcessed, envir=environment())
## Get incorrect labels data for one profile.
pid <- 11
pro.errors <- neuroblastomaProcessed$errors[
  profile.id==pid][order(chromosome, min.log.lambda)]
dcast(pro.errors, n.segments ~ chromosome, value.var="errors")
## Get the feature that corresponds to the BIC penalty = log(n),
## meaning log(penalty) = log(log(n)).
chr.vec <- paste(c(1:4, 11, 17))
pid.names <- paste0(pid, ".", chr.vec)
BIC.feature <- neuroblastomaProcessed$feature.mat[pid.names, "log2.n"]
pred <- data.table(pred.log.lambda=BIC.feature, chromosome=chr.vec)
## edit one prediction so that it ends up having the same threshold
## as another one, to illustrate an aum sub-differential with
## un-equal lo/hi bounds.
err.changes <- pro.errors[, {
  .SD[c(NA, diff(errors) != 0), .(min.log.lambda)]
}, by=chromosome]
(ch.vec <- err.changes[, structure(min.log.lambda, names=chromosome)])
other <- "11"
(diff.other <- ch.vec[[other]]-pred[other, pred.log.lambda, on=.(chromosome)])
pred["1", pred.log.lambda := ch.vec[["1"]]-diff.other, on=.(chromosome)]
pred["4", pred.log.lambda := 2, on=.(chromosome)]
ch.vec[["1"]]-pred["1", pred.log.lambda, on=.(chromosome)]
result <- ROChange(pro.errors, pred, "chromosome")
library(ggplot2)
## Plot the ROC curves.
ggplot()+
  geom_path(aes(FPR, TPR), data=result$roc)+
  geom_point(aes(FPR, TPR, color=threshold), data=result$thresholds, shape=1)
## Plot the number of incorrect labels as a function of threshold.
ggplot()+
  geom_segment(aes(
    min.thresh, errors,
    xend=max.thresh, yend=errors),
    data=result$roc)+
  geom_point(aes((min.thresh+max.thresh)/2, errors, color=threshold),
    data=result$thresholds, shape=1)+
  xlab("log(penalty) constant added to BIC penalty")
## Plot area under Min(FP,FN).
err.colors <- c(
  "fp"="red",
  "fn"="deepskyblue",
  "min.fp.fn"="black")
err.sizes <- c(
  "fp"=3,
  "fn"=2,
  "min.fp.fn"=1)
roc.tall <- melt(result$roc, measure.vars=names(err.colors))
area.rects <- data.table(
  chromosome="total",
  result$roc[0<min.fp.fn])
(gg.total <- ggplot()+
  geom_vline(
    xintercept=0,
    color="grey")+
  geom_rect(aes(
    xmin=min.thresh, xmax=max.thresh,
    ymin=0, ymax=min.fp.fn),
    data=area.rects,
    alpha=0.5)+
  geom_text(aes(
    min.thresh, min.fp.fn/2,
    label=sprintf(
      "Area Under Min(FP,FN)=%.3f ",
      result$aum)),
    data=area.rects[1],
    hjust=1,
    color="grey50")+
  geom_segment(aes(
    min.thresh, value,
    xend=max.thresh, yend=value,
    color=variable, size=variable),
    data=data.table(chromosome="total", roc.tall))+
  scale_size_manual(values=err.sizes)+
  scale_color_manual(values=err.colors)+
  theme_bw()+
  theme(panel.grid.minor=element_blank())+
  scale_x_continuous(
    "Prediction threshold")+
  scale_y_continuous(
    "Incorrectly predicted labels",
    breaks=0:10))
## Add individual error curves.
tall.errors <- melt(
  pro.errors[pred, on=.(chromosome)],
  measure.vars=c("fp", "fn"))
gg.total+
  geom_segment(aes(
    min.log.lambda-pred.log.lambda, value,
    xend=max.log.lambda-pred.log.lambda, yend=value,
    size=variable, color=variable),
    data=tall.errors)+
  facet_grid(chromosome ~ ., scales="free", space="free")+
  theme(panel.spacing=grid::unit(0, "lines"))+
  geom_blank(aes(
    0, errors),
    data=data.table(errors=c(1.5, -0.5)))
print(result$aum.grad)
if(interactive()){#this can be too long for CRAN.
  ## Plot how Area Under Min(FP,FN) changes with each predicted value.
  aum.dt <- pred[, {
    data.table(log.pen=seq(0, 4, by=0.5))[, {
      chr <- paste(chromosome)
      new.pred.dt <- data.table(pred)
      new.pred.dt[chr, pred.log.lambda := log.pen, on=.(chromosome)]
      with(
        ROChange(pro.errors, new.pred.dt, "chromosome"),
        data.table(aum))
    }, by=log.pen]
  }, by=chromosome]
  bounds.dt <- melt(
    result$aum.grad,
    measure.vars=c("lo", "hi"),
    variable.name="bound",
    value.name="slope")[pred, on=.(chromosome)]
  bounds.dt[, intercept := result$aum-slope*pred.log.lambda]
  ggplot()+
    geom_abline(aes(
      slope=slope, intercept=intercept),
      size=1,
      data=bounds.dt)+
    geom_text(aes(
      2, 2,
      label=sprintf("directional derivatives = [%d, %d]", lo, hi)),
      data=result$aum.grad)+
    scale_color_manual(
      values=c(
        predicted="red",
        new="black"))+
    geom_point(aes(
      log.pen, aum, color=type),
      data=data.table(type="new", aum.dt))+
    geom_point(aes(
      pred.log.lambda, result$aum, color=type),
      shape=1,
      data=data.table(type="predicted", pred))+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_wrap("chromosome", labeller=label_both)+
    coord_equal()+
    xlab("New log(penalty) value for chromosome")+
    ylab("Area Under Min(FP,FN) using new log(penalty) for this chromosome and predicted log(penalty) for others")
}
The squared hinge loss.
squared.hinge(x, e = 1)
x |
numeric vector of inputs to the loss. |
e |
numeric scalar: margin size hyper-parameter, default 1. |
Toby Dylan Hocking <[email protected]> [aut, cre]
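No value section is given above; the squared hinge used for interval regression is zero past the margin e and quadratic inside it. A minimal sketch of that definition (assumed to match the package's internal convention):
## Quadratic penalty (x-e)^2 when x < e, zero otherwise.
sq.hinge <- function(x, e=1) ifelse(x < e, (x-e)^2, 0)
sq.hinge(c(-1, 0.5, 1, 2)) # 4.00 0.25 0.00 0.00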
Compute residual of predicted penalties with respect to target intervals. This function is useful for visualizing the errors in a plot of log(penalty) versus a feature.
targetIntervalResidual(target.mat, pred)
target.mat |
n x 2 numeric matrix: target intervals of log(penalty) values that yield minimal incorrect labels. |
pred |
numeric vector: predicted log(penalty) values. |
numeric vector of n residuals. Predictions that are too high (above target.mat[,2]) get positive residuals (too few changepoints), and predictions that are too low (below target.mat[,1]) get negative residuals.
Toby Dylan Hocking <[email protected]> [aut, cre]
library(penaltyLearning)
library(data.table)
data(neuroblastomaProcessed, envir=environment())
## The BIC model selection criterion is lambda = log(n), where n is
## the number of data points to segment. This implies log(lambda) =
## log(log(n)), which is the log2.n feature.
row.name.vec <- grep(
  "^(4|520)[.]",
  rownames(neuroblastomaProcessed$feature.mat),
  value=TRUE)
feature.mat <- neuroblastomaProcessed$feature.mat[row.name.vec, ]
target.mat <- neuroblastomaProcessed$target.mat[row.name.vec, ]
pred.dt <- data.table(
  row.name=row.name.vec,
  target.mat,
  feature.mat[, "log2.n", drop=FALSE])
pred.dt[, pred.log.lambda := log2.n ]
pred.dt[, residual := targetIntervalResidual(
  cbind(min.L, max.L),
  pred.log.lambda)]
library(ggplot2)
limits.dt <- pred.dt[, data.table(
  log2.n,
  log.penalty=c(min.L, max.L),
  limit=rep(c("min", "max"), each=.N))][is.finite(log.penalty)]
ggplot()+
  geom_abline(slope=1, intercept=0)+
  geom_point(aes(
    log2.n, log.penalty, fill=limit),
    data=limits.dt,
    shape=21)+
  geom_segment(aes(
    log2.n, pred.log.lambda,
    xend=log2.n, yend=pred.log.lambda-residual),
    data=pred.dt,
    color="red")+
  scale_fill_manual(values=c(min="white", max="black"))
Compute a ROC curve using a target interval matrix. A prediction less than the lower limit is considered a false positive (penalty too small, too many changes), and a prediction greater than the upper limit is a false negative (penalty too large, too few changes). WARNING: this ROC curve is less detailed than the one you get from ROChange! Use ROChange if possible.
targetIntervalROC(target.mat, pred)
target.mat |
n x 2 numeric matrix: target intervals of log(penalty) values that yield minimal incorrect labels. |
pred |
numeric vector: predicted log(penalty) values. |
list describing ROC curves, in the same format as ROChange.
Toby Dylan Hocking <[email protected]> [aut, cre]
library(penaltyLearning)
library(data.table)
data(neuroblastomaProcessed, envir=environment())
pid.vec <- c("1", "4")
chr <- 2
incorrect.labels <- neuroblastomaProcessed$errors[
  profile.id%in%pid.vec & chromosome==chr]
pid.chr <- paste0(pid.vec, ".", chr)
target.mat <- neuroblastomaProcessed$target.mat[pid.chr, , drop=FALSE]
pred.dt <- data.table(profile.id=pid.vec, pred.log.lambda=1.5)
roc.list <- list(
  labels=ROChange(incorrect.labels, pred.dt, "profile.id"),
  targets=targetIntervalROC(target.mat, pred.dt$pred.log.lambda))
err <- data.table(incorrect=names(roc.list))[, {
  roc.list[[incorrect]]$roc
}, by=incorrect]
library(ggplot2)
ggplot()+
  ggtitle("incorrect targets is an approximation of incorrect labels")+
  scale_size_manual(values=c(labels=2, targets=1))+
  geom_segment(aes(
    min.thresh, errors,
    color=incorrect, size=incorrect,
    xend=max.thresh, yend=errors),
    data=err)
Compute target intervals of log(penalty) values that result in predicted changepoint models with minimum incorrect labels. Use this function after labelError, and before the IntervalRegression* functions.
targetIntervals(models, problem.vars)
models |
data.table with columns errors, min.log.lambda, max.log.lambda, typically labelError()$model.errors. |
problem.vars |
character: column names used to identify data set / segmentation problem. |
data.table with columns problem.vars, one row for each segmentation problem. The "min.log.lambda" and "max.log.lambda" columns give the largest interval of log(penalty) values which results in the minimum incorrect labels for that problem. This can be used to create the target.mat parameter of the IntervalRegression* functions.
Toby Dylan Hocking <[email protected]> [aut, cre]
data.table::setDTthreads(1)
library(penaltyLearning)
data(neuroblastomaProcessed, envir=environment())
targets.dt <- targetIntervals(
  neuroblastomaProcessed$errors,
  problem.vars=c("profile.id", "chromosome"))
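As the Value section notes, the result can be turned into the target.mat argument of the IntervalRegression* functions; a minimal sketch continuing the example above:
## Two-column matrix of lower/upper log(penalty) limits, one row per
## segmentation problem.
target.mat <- targets.dt[, cbind(min.log.lambda, max.log.lambda)]
rownames(target.mat) <- targets.dt[, paste0(profile.id, ".", chromosome)]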
ggplot2 theme element for no space between panels.
theme_no_space(...)
... |
passed to ggplot2::theme. |
Toby Dylan Hocking <[email protected]> [aut, cre]
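A minimal usage sketch with synthetic data, removing the spacing between facet panels:
library(ggplot2)
library(penaltyLearning)
df <- data.frame(
  x=rep(1:5, 2), y=rnorm(10),
  panel=rep(c("a", "b"), each=5))
ggplot()+
  geom_point(aes(x, y), data=df)+
  facet_grid(panel ~ .)+
  theme_bw()+
  theme_no_space()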