FLOPART is short for “Functional Labeled Optimal Partitioning” and the key words there which make it different from previous algorithms are

  • Functional: the cost functions are computed for pruning/speed, and to enforce a constrained/interpretable model which alternates between up/peak and down/background states.
  • Labeled: the labels constrain the positions where it is possible (or not) to have change-points.

In this vignette we present a comparison of FLOPART with other algorithms which do not have these two properties.

No label constraints

As a simple example, consider the following genomic data set:

library(data.table)
data("Mono27ac.simple", package="FLOPART")
Mono27ac.simple
#> $coverage
#>        chrom chromStart chromEnd count
#>       <char>      <int>    <int> <int>
#>    1:  chr11     145000   146765     0
#>    2:  chr11     146765   146807     1
#>    3:  chr11     146807   175254     0
#>    4:  chr11     175254   175296     1
#>    5:  chr11     175296   175738     0
#>   ---                                 
#> 2975:  chr11     326980   326981     5
#> 2976:  chr11     326981   326983     6
#> 2977:  chr11     326983   326985     5
#> 2978:  chr11     326985   326992     4
#> 2979:  chr11     326992   327000     3
#> 
#> $label
#>     chrom chromStart chromEnd annotation
#>    <char>      <num>    <num>     <char>
#> 1:  chr11     180000   200000    noPeaks
#> 2:  chr11     208000   220000    peakEnd
#> 3:  chr11     300000   308250  peakStart
#> 4:  chr11     308260   320000    peakEnd

The data tables above are visualized in the plot below,

ann.colors <- c(
  noPeaks="orange",
  peakStart="#efafaf",
  peakEnd="#ff4c4c")
if(require("ggplot2")){
  ggplot()+
    theme_bw()+
    scale_fill_manual("label", values=ann.colors)+
    geom_rect(aes(
      xmin=chromStart-0.5, xmax=chromEnd+0.5,
      ymin=-Inf, ymax=Inf,
      fill=annotation),
      alpha=0.5,
      color="grey",
      data=Mono27ac.simple[["label"]])+
    geom_step(aes(
      chromStart, count),
      data=Mono27ac.simple[["coverage"]],
      color="grey50")
}
#> Loading required package: ggplot2

The raw aligned read count data are shown in grey (larger values indicate more likely Mono27ac histone modification), and the labels are shown in colored rectangles (these indicate where the algorithm should detect changes, or not).

The code below computes the FLOPART model, for a properly chosen penalty:

label.pen <- 1400
fit <- with(Mono27ac.simple, FLOPART::FLOPART(coverage, label, label.pen))
lapply(fit, head)
#> $coverage_dt
#> Key: <chromStart, chromEnd>
#>    chromStart chromEnd count weight
#>         <int>    <int> <int>  <int>
#> 1:     145000   146765     0   1765
#> 2:     146765   146807     1     42
#> 3:     146807   175254     0  28447
#> 4:     175254   175296     1     42
#> 5:     175296   175738     0    442
#> 6:     175738   175780     1     42
#> 
#> $label_dt
#> Key: <chromEnd, chromStart>
#>    chromStart chromEnd annotation  type firstRow lastRow
#>         <int>    <int>     <char> <int>    <int>   <int>
#> 1:     180000   200000    noPeaks     0       13     117
#> 2:     208000   220000    peakEnd    -1      724    1321
#> 3:     300000   308250  peakStart     1     2722    2769
#> 4:     308260   320000    peakEnd    -1     2772    2870
#> 
#> $cost_mat
#>            [,1]       [,2]
#> [1,] 0.00000000 0.00000000
#> [2,] 0.11067717 0.11067717
#> [3,] 0.01052251 0.01052251
#> [4,] 0.01909784 0.01909784
#> [5,] 0.01886280 0.01886280
#> [6,] 0.02660139 0.02660139
#> 
#> $intervals_mat
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    1
#> [3,]    3    3
#> [4,]    3    5
#> [5,]    4    4
#> [6,]    4    5
#> 
#> $segments_dt
#>    chromStart chromEnd     status        mean
#>         <int>    <int>     <char>       <num>
#> 1:     145000   206725 background  0.06556501
#> 2:     206725   209216       peak 13.17743878
#> 3:     209216   236120 background  0.22431609
#> 4:     236120   237515       peak  7.38781362
#> 5:     237515   267598 background  0.19459495
#> 6:     267598   270853       peak  3.68970814
str(fit)
#> List of 5
#>  $ coverage_dt  :Classes 'data.table' and 'data.frame':  2986 obs. of  4 variables:
#>   ..$ chromStart: int [1:2986] 145000 146765 146807 175254 175296 175738 175780 175897 175913 175914 ...
#>   ..$ chromEnd  : int [1:2986] 146765 146807 175254 175296 175738 175780 175897 175913 175914 175940 ...
#>   ..$ count     : int [1:2986] 0 1 0 1 0 1 0 1 0 1 ...
#>   ..$ weight    : int [1:2986] 1765 42 28447 42 442 42 117 16 1 26 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr [1:2] "chromStart" "chromEnd"
#>  $ label_dt     :Classes 'data.table' and 'data.frame':  4 obs. of  6 variables:
#>   ..$ chromStart: int [1:4] 180000 208000 300000 308260
#>   ..$ chromEnd  : int [1:4] 200000 220000 308250 320000
#>   ..$ annotation: chr [1:4] "noPeaks" "peakEnd" "peakStart" "peakEnd"
#>   ..$ type      : int [1:4] 0 -1 1 -1
#>   ..$ firstRow  : int [1:4] 13 724 2722 2772
#>   ..$ lastRow   : int [1:4] 117 1321 2769 2870
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>   ..- attr(*, "sorted")= chr [1:2] "chromEnd" "chromStart"
#>  $ cost_mat     : num [1:2986, 1:2] 0 0.1107 0.0105 0.0191 0.0189 ...
#>  $ intervals_mat: int [1:2986, 1:2] 1 2 3 3 4 4 5 5 7 6 ...
#>  $ segments_dt  :Classes 'data.table' and 'data.frame':  10 obs. of  4 variables:
#>   ..$ chromStart: int [1:10] 145000 206725 209216 236120 237515 267598 270853 307841 308655 326129
#>   ..$ chromEnd  : int [1:10] 206725 209216 236120 237515 267598 270853 307841 308655 326129 327000
#>   ..$ status    : chr [1:10] "background" "peak" "background" "peak" ...
#>   ..$ mean      : num [1:10] 0.0656 13.1774 0.2243 7.3878 0.1946 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr>

Above we can see that fit is a list with several components:

  • coverage_dt is a data table of aligned read count coverage, same as input but with an additional weight column.
  • label_dt is a data table of labels, same as input but with additional columns type, firstRow, lastRow, which are used as input to the C++ code.
  • cost_mat is a N x 2 matrix of constrained optimal cost values, for each coverage data point (row) and state (column).
  • intervals_mat is a N x 2 matrix of counts of intervals used to represent the cost function, for each coverage data point (row) and state (column).
  • segments_dt is a data table which represents the segmentation/change-point model.

Below, for comparison, we compute models without label constraints:

penalty.vec <- c(
  "7"=1400,
  "6"=1450,
  "5"=1460)
unlabeled.segs.dt.list <- list()
for(penalty in penalty.vec){
  ufit <- FLOPART::FLOPART(Mono27ac.simple[["coverage"]], penalty=penalty)
  pen.segs <- ufit[["segments_dt"]]
  pen.peaks <- pen.segs[status=="peak"]
  n.peaks <- nrow(pen.peaks)
  unlabeled.segs.dt.list[[paste(penalty)]] <- data.table(
    penalty, n.peaks, pen.segs)
}
(unlabeled.segs.dt <- do.call(rbind, unlabeled.segs.dt.list))
#>     penalty n.peaks chromStart chromEnd     status         mean
#>       <num>   <int>      <int>    <int>     <char>        <num>
#>  1:    1400       7     145000   183846 background  0.007568347
#>  2:    1400       7     183846   183925       peak  1.063291139
#>  3:    1400       7     183925   206737 background  0.162414519
#>  4:    1400       7     206737   208583       peak 16.135969664
#>  5:    1400       7     208583   208584 background  3.883027523
#>  6:    1400       7     208584   209455       peak  3.883027523
#>  7:    1400       7     209455   236036 background  0.206312780
#>  8:    1400       7     236036   237515       peak  7.081135903
#>  9:    1400       7     237515   267598 background  0.194594954
#> 10:    1400       7     267598   270853       peak  3.689708141
#> 11:    1400       7     270853   307841 background  0.105088137
#> 12:    1400       7     307841   308655       peak  1.900491400
#> 13:    1400       7     308655   326129 background  0.100034337
#> 14:    1400       7     326129   327000       peak  2.469575201
#> 15:    1450       6     145000   183846 background  0.007568347
#> 16:    1450       6     183846   183925       peak  1.063291139
#> 17:    1450       6     183925   206737 background  0.162414519
#> 18:    1450       6     206737   208583       peak 16.135969664
#> 19:    1450       6     208583   208584 background  3.883027523
#> 20:    1450       6     208584   209455       peak  3.883027523
#> 21:    1450       6     209455   236036 background  0.206312780
#> 22:    1450       6     236036   237515       peak  7.081135903
#> 23:    1450       6     237515   267598 background  0.194594954
#> 24:    1450       6     267598   270853       peak  3.689708141
#> 25:    1450       6     270853   326129 background  0.129929807
#> 26:    1450       6     326129   327000       peak  2.469575201
#> 27:    1460       5     145000   206725 background  0.065565006
#> 28:    1460       5     206725   208583       peak 16.051130248
#> 29:    1460       5     208583   208584 background  3.883027523
#> 30:    1460       5     208584   209455       peak  3.883027523
#> 31:    1460       5     209455   236036 background  0.206312780
#> 32:    1460       5     236036   237515       peak  7.081135903
#> 33:    1460       5     237515   267598 background  0.194594954
#> 34:    1460       5     267598   270853       peak  3.689708141
#> 35:    1460       5     270853   326129 background  0.129929807
#> 36:    1460       5     326129   327000       peak  2.469575201
#>     penalty n.peaks chromStart chromEnd     status         mean

The table above represents the models without label constraints, for three different input penalties, which result in three different numbers of detected peaks. Below we visualize those models:

model.color <- "blue"
if(require("ggplot2")){
  ggplot()+
    ggtitle("Models without label constraints")+
    geom_step(aes(
      chromStart, count),
      data=Mono27ac.simple[["coverage"]],
      color="grey50")+
    geom_step(aes(
      chromStart, mean),
      data=unlabeled.segs.dt,
      color=model.color)+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(penalty + n.peaks ~ ., labeller=label_both)
}

The plot above shows the segmentation models without label constraints. Below we compute a data table to represent the segmentation, label errors, and predicted peaks of all models:

cons.segs <- fit[["segments_dt"]]
model.list <- c(unlabeled.segs.dt.list, list(FLOPART=data.table(
  penalty=label.pen, n.peaks=sum(cons.segs[["status"]]=="peak"), cons.segs)))
peaks.dt.list <- list()
err.dt.list <- list()
segs.dt.list <- list()
for(model in names(model.list)){
  model.segs <- model.list[[model]]
  model.peaks <- model.segs[status=="peak"]
  model.err <- if(requireNamespace("PeakError")){
    perr <- PeakError::PeakErrorChrom(model.peaks, Mono27ac.simple[["label"]])
    data.table(perr)[, .(chromStart, chromEnd, status)]
  }else{
    data.table(chromStart=integer(), chromEnd=integer(), status=character())
  }
  segs.dt.list[[model]] <- data.table(model, model.segs)
  err.dt.list[[model]] <- data.table(model, model.err)
  peaks.dt.list[[model]] <- data.table(model, model.peaks)
}
#> Loading required namespace: PeakError
(segs.dt <- do.call(rbind, segs.dt.list))
#>       model penalty n.peaks chromStart chromEnd     status         mean
#>      <char>   <num>   <int>      <int>    <int>     <char>        <num>
#>  1:    1400    1400       7     145000   183846 background  0.007568347
#>  2:    1400    1400       7     183846   183925       peak  1.063291139
#>  3:    1400    1400       7     183925   206737 background  0.162414519
#>  4:    1400    1400       7     206737   208583       peak 16.135969664
#>  5:    1400    1400       7     208583   208584 background  3.883027523
#>  6:    1400    1400       7     208584   209455       peak  3.883027523
#>  7:    1400    1400       7     209455   236036 background  0.206312780
#>  8:    1400    1400       7     236036   237515       peak  7.081135903
#>  9:    1400    1400       7     237515   267598 background  0.194594954
#> 10:    1400    1400       7     267598   270853       peak  3.689708141
#> 11:    1400    1400       7     270853   307841 background  0.105088137
#> 12:    1400    1400       7     307841   308655       peak  1.900491400
#> 13:    1400    1400       7     308655   326129 background  0.100034337
#> 14:    1400    1400       7     326129   327000       peak  2.469575201
#> 15:    1450    1450       6     145000   183846 background  0.007568347
#> 16:    1450    1450       6     183846   183925       peak  1.063291139
#> 17:    1450    1450       6     183925   206737 background  0.162414519
#> 18:    1450    1450       6     206737   208583       peak 16.135969664
#> 19:    1450    1450       6     208583   208584 background  3.883027523
#> 20:    1450    1450       6     208584   209455       peak  3.883027523
#> 21:    1450    1450       6     209455   236036 background  0.206312780
#> 22:    1450    1450       6     236036   237515       peak  7.081135903
#> 23:    1450    1450       6     237515   267598 background  0.194594954
#> 24:    1450    1450       6     267598   270853       peak  3.689708141
#> 25:    1450    1450       6     270853   326129 background  0.129929807
#> 26:    1450    1450       6     326129   327000       peak  2.469575201
#> 27:    1460    1460       5     145000   206725 background  0.065565006
#> 28:    1460    1460       5     206725   208583       peak 16.051130248
#> 29:    1460    1460       5     208583   208584 background  3.883027523
#> 30:    1460    1460       5     208584   209455       peak  3.883027523
#> 31:    1460    1460       5     209455   236036 background  0.206312780
#> 32:    1460    1460       5     236036   237515       peak  7.081135903
#> 33:    1460    1460       5     237515   267598 background  0.194594954
#> 34:    1460    1460       5     267598   270853       peak  3.689708141
#> 35:    1460    1460       5     270853   326129 background  0.129929807
#> 36:    1460    1460       5     326129   327000       peak  2.469575201
#> 37: FLOPART    1400       5     145000   206725 background  0.065565006
#> 38: FLOPART    1400       5     206725   209216       peak 13.177438780
#> 39: FLOPART    1400       5     209216   236120 background  0.224316087
#> 40: FLOPART    1400       5     236120   237515       peak  7.387813620
#> 41: FLOPART    1400       5     237515   267598 background  0.194594954
#> 42: FLOPART    1400       5     267598   270853       peak  3.689708141
#> 43: FLOPART    1400       5     270853   307841 background  0.105088137
#> 44: FLOPART    1400       5     307841   308655       peak  1.900491400
#> 45: FLOPART    1400       5     308655   326129 background  0.100034337
#> 46: FLOPART    1400       5     326129   327000       peak  2.469575201
#>       model penalty n.peaks chromStart chromEnd     status         mean
(err.dt <- do.call(rbind, err.dt.list))
#>       model chromStart chromEnd         status
#>      <char>      <int>    <int>         <char>
#>  1:    1400     180000   200000 false positive
#>  2:    1400     208000   220000 false positive
#>  3:    1400     300000   308250        correct
#>  4:    1400     308260   320000        correct
#>  5:    1450     180000   200000 false positive
#>  6:    1450     208000   220000 false positive
#>  7:    1450     300000   308250 false negative
#>  8:    1450     308260   320000 false negative
#>  9:    1460     180000   200000        correct
#> 10:    1460     208000   220000 false positive
#> 11:    1460     300000   308250 false negative
#> 12:    1460     308260   320000 false negative
#> 13: FLOPART     180000   200000        correct
#> 14: FLOPART     208000   220000        correct
#> 15: FLOPART     300000   308250        correct
#> 16: FLOPART     308260   320000        correct
(peaks.dt <- do.call(rbind, peaks.dt.list))
#>       model penalty n.peaks chromStart chromEnd status      mean
#>      <char>   <num>   <int>      <int>    <int> <char>     <num>
#>  1:    1400    1400       7     183846   183925   peak  1.063291
#>  2:    1400    1400       7     206737   208583   peak 16.135970
#>  3:    1400    1400       7     208584   209455   peak  3.883028
#>  4:    1400    1400       7     236036   237515   peak  7.081136
#>  5:    1400    1400       7     267598   270853   peak  3.689708
#>  6:    1400    1400       7     307841   308655   peak  1.900491
#>  7:    1400    1400       7     326129   327000   peak  2.469575
#>  8:    1450    1450       6     183846   183925   peak  1.063291
#>  9:    1450    1450       6     206737   208583   peak 16.135970
#> 10:    1450    1450       6     208584   209455   peak  3.883028
#> 11:    1450    1450       6     236036   237515   peak  7.081136
#> 12:    1450    1450       6     267598   270853   peak  3.689708
#> 13:    1450    1450       6     326129   327000   peak  2.469575
#> 14:    1460    1460       5     206725   208583   peak 16.051130
#> 15:    1460    1460       5     208584   209455   peak  3.883028
#> 16:    1460    1460       5     236036   237515   peak  7.081136
#> 17:    1460    1460       5     267598   270853   peak  3.689708
#> 18:    1460    1460       5     326129   327000   peak  2.469575
#> 19: FLOPART    1400       5     206725   209216   peak 13.177439
#> 20: FLOPART    1400       5     236120   237515   peak  7.387814
#> 21: FLOPART    1400       5     267598   270853   peak  3.689708
#> 22: FLOPART    1400       5     307841   308655   peak  1.900491
#> 23: FLOPART    1400       5     326129   327000   peak  2.469575
#>       model penalty n.peaks chromStart chromEnd status      mean

Finally we use the code below to visualize the models with and without label constraints.

peak.y <- -2
if(require("ggplot2")){
  ggplot()+
    ggtitle("Models with label constraints (FLOPART) and without (penalty values)")+
    scale_fill_manual("label", values=ann.colors)+
    geom_rect(aes(
      xmin=chromStart, xmax=chromEnd,
      ymin=-Inf, ymax=Inf,
      fill=annotation),
      alpha=0.5,
      color="grey",
      data=Mono27ac.simple[["label"]])+
    geom_step(aes(
      chromStart, count),
      data=Mono27ac.simple[["coverage"]],
      color="grey50")+
    geom_step(aes(
      chromStart, mean),
      data=segs.dt,
      color=model.color)+
    geom_segment(aes(
      chromStart, peak.y,
      xend=chromEnd, yend=peak.y),
      color=model.color,
      size=1,
      data=peaks.dt)+
    geom_point(aes(
      chromEnd, peak.y),
      color=model.color,
      shape=21,
      fill="white",
      data=peaks.dt)+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(model ~ ., labeller=label_both)+
    scale_linetype_manual(
      "error type",
      values=c(
        correct=0,
        "false negative"=3,
        "false positive"=1))+
    geom_rect(aes(
      xmin=chromStart, xmax=chromEnd,
      ymin=-Inf, ymax=Inf,
      linetype=status),
      fill=NA,
      color="black",
      size=1,
      data=err.dt)
}
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

In the plot above, the advantages of FLOPART are clear (no label errors), relative to the models without label constraints (always have some false positives or negatives with respect to the labels).