FLOPART is short for “Functional Labeled Optimal Partitioning” and the key words there which make it different from previous algorithms are
In this vignette we present a comparison of FLOPART with other algorithms which do not have these two properties.
As a simple example, consider the following genomic data set:
library(data.table)
data("Mono27ac.simple", package="FLOPART")
Mono27ac.simple
#> $coverage
#> chrom chromStart chromEnd count
#> <char> <int> <int> <int>
#> 1: chr11 145000 146765 0
#> 2: chr11 146765 146807 1
#> 3: chr11 146807 175254 0
#> 4: chr11 175254 175296 1
#> 5: chr11 175296 175738 0
#> ---
#> 2975: chr11 326980 326981 5
#> 2976: chr11 326981 326983 6
#> 2977: chr11 326983 326985 5
#> 2978: chr11 326985 326992 4
#> 2979: chr11 326992 327000 3
#>
#> $label
#> chrom chromStart chromEnd annotation
#> <char> <num> <num> <char>
#> 1: chr11 180000 200000 noPeaks
#> 2: chr11 208000 220000 peakEnd
#> 3: chr11 300000 308250 peakStart
#> 4: chr11 308260 320000 peakEnd
The data tables above are visualized in the plot below,
ann.colors <- c(
noPeaks="orange",
peakStart="#efafaf",
peakEnd="#ff4c4c")
if(require("ggplot2")){
ggplot()+
theme_bw()+
scale_fill_manual("label", values=ann.colors)+
geom_rect(aes(
xmin=chromStart-0.5, xmax=chromEnd+0.5,
ymin=-Inf, ymax=Inf,
fill=annotation),
alpha=0.5,
color="grey",
data=Mono27ac.simple[["label"]])+
geom_step(aes(
chromStart, count),
data=Mono27ac.simple[["coverage"]],
color="grey50")
}
#> Loading required package: ggplot2
The raw aligned read count data are shown in grey (larger values indicate more likely Mono27ac histone modification), and the labels are shown in colored rectangles (these indicate where the algorithm should detect changes, or not).
The code below computes the FLOPART model, for a properly chosen penalty:
label.pen <- 1400
fit <- with(Mono27ac.simple, FLOPART::FLOPART(coverage, label, label.pen))
lapply(fit, head)
#> $coverage_dt
#> Key: <chromStart, chromEnd>
#> chromStart chromEnd count weight
#> <int> <int> <int> <int>
#> 1: 145000 146765 0 1765
#> 2: 146765 146807 1 42
#> 3: 146807 175254 0 28447
#> 4: 175254 175296 1 42
#> 5: 175296 175738 0 442
#> 6: 175738 175780 1 42
#>
#> $label_dt
#> Key: <chromEnd, chromStart>
#> chromStart chromEnd annotation type firstRow lastRow
#> <int> <int> <char> <int> <int> <int>
#> 1: 180000 200000 noPeaks 0 13 117
#> 2: 208000 220000 peakEnd -1 724 1321
#> 3: 300000 308250 peakStart 1 2722 2769
#> 4: 308260 320000 peakEnd -1 2772 2870
#>
#> $cost_mat
#> [,1] [,2]
#> [1,] 0.00000000 0.00000000
#> [2,] 0.11067717 0.11067717
#> [3,] 0.01052251 0.01052251
#> [4,] 0.01909784 0.01909784
#> [5,] 0.01886280 0.01886280
#> [6,] 0.02660139 0.02660139
#>
#> $intervals_mat
#> [,1] [,2]
#> [1,] 1 1
#> [2,] 2 1
#> [3,] 3 3
#> [4,] 3 5
#> [5,] 4 4
#> [6,] 4 5
#>
#> $segments_dt
#> chromStart chromEnd status mean
#> <int> <int> <char> <num>
#> 1: 145000 206725 background 0.06556501
#> 2: 206725 209216 peak 13.17743878
#> 3: 209216 236120 background 0.22431609
#> 4: 236120 237515 peak 7.38781362
#> 5: 237515 267598 background 0.19459495
#> 6: 267598 270853 peak 3.68970814
str(fit)
#> List of 5
#> $ coverage_dt :Classes 'data.table' and 'data.frame': 2986 obs. of 4 variables:
#> ..$ chromStart: int [1:2986] 145000 146765 146807 175254 175296 175738 175780 175897 175913 175914 ...
#> ..$ chromEnd : int [1:2986] 146765 146807 175254 175296 175738 175780 175897 175913 175914 175940 ...
#> ..$ count : int [1:2986] 0 1 0 1 0 1 0 1 0 1 ...
#> ..$ weight : int [1:2986] 1765 42 28447 42 442 42 117 16 1 26 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr [1:2] "chromStart" "chromEnd"
#> $ label_dt :Classes 'data.table' and 'data.frame': 4 obs. of 6 variables:
#> ..$ chromStart: int [1:4] 180000 208000 300000 308260
#> ..$ chromEnd : int [1:4] 200000 220000 308250 320000
#> ..$ annotation: chr [1:4] "noPeaks" "peakEnd" "peakStart" "peakEnd"
#> ..$ type : int [1:4] 0 -1 1 -1
#> ..$ firstRow : int [1:4] 13 724 2722 2772
#> ..$ lastRow : int [1:4] 117 1321 2769 2870
#> ..- attr(*, ".internal.selfref")=<externalptr>
#> ..- attr(*, "sorted")= chr [1:2] "chromEnd" "chromStart"
#> $ cost_mat : num [1:2986, 1:2] 0 0.1107 0.0105 0.0191 0.0189 ...
#> $ intervals_mat: int [1:2986, 1:2] 1 2 3 3 4 4 5 5 7 6 ...
#> $ segments_dt :Classes 'data.table' and 'data.frame': 10 obs. of 4 variables:
#> ..$ chromStart: int [1:10] 145000 206725 209216 236120 237515 267598 270853 307841 308655 326129
#> ..$ chromEnd : int [1:10] 206725 209216 236120 237515 267598 270853 307841 308655 326129 327000
#> ..$ status : chr [1:10] "background" "peak" "background" "peak" ...
#> ..$ mean : num [1:10] 0.0656 13.1774 0.2243 7.3878 0.1946 ...
#> ..- attr(*, ".internal.selfref")=<externalptr>
Above we can see that fit
is a list with several
components:
coverage_dt
is a data table of aligned read count
coverage, same as input but with an additional weight
column.label_dt
is a data table of labels, same as input but
with additional columns type
, firstRow
,
lastRow
, which are used as input to the C++ code.cost_mat
is a N x 2 matrix of constrained optimal cost
values, for each coverage data point (row) and state (column).intervals_mat
is a N x 2 matrix of counts of intervals
used to represent the cost function, for each coverage data point (row)
and state (column).segments_dt
is a data table which represents the
segmentation/change-point model.Below, for comparison, we compute models without label constraints:
penalty.vec <- c(
"7"=1400,
"6"=1450,
"5"=1460)
unlabeled.segs.dt.list <- list()
for(penalty in penalty.vec){
ufit <- FLOPART::FLOPART(Mono27ac.simple[["coverage"]], penalty=penalty)
pen.segs <- ufit[["segments_dt"]]
pen.peaks <- pen.segs[status=="peak"]
n.peaks <- nrow(pen.peaks)
unlabeled.segs.dt.list[[paste(penalty)]] <- data.table(
penalty, n.peaks, pen.segs)
}
(unlabeled.segs.dt <- do.call(rbind, unlabeled.segs.dt.list))
#> penalty n.peaks chromStart chromEnd status mean
#> <num> <int> <int> <int> <char> <num>
#> 1: 1400 7 145000 183846 background 0.007568347
#> 2: 1400 7 183846 183925 peak 1.063291139
#> 3: 1400 7 183925 206737 background 0.162414519
#> 4: 1400 7 206737 208583 peak 16.135969664
#> 5: 1400 7 208583 208584 background 3.883027523
#> 6: 1400 7 208584 209455 peak 3.883027523
#> 7: 1400 7 209455 236036 background 0.206312780
#> 8: 1400 7 236036 237515 peak 7.081135903
#> 9: 1400 7 237515 267598 background 0.194594954
#> 10: 1400 7 267598 270853 peak 3.689708141
#> 11: 1400 7 270853 307841 background 0.105088137
#> 12: 1400 7 307841 308655 peak 1.900491400
#> 13: 1400 7 308655 326129 background 0.100034337
#> 14: 1400 7 326129 327000 peak 2.469575201
#> 15: 1450 6 145000 183846 background 0.007568347
#> 16: 1450 6 183846 183925 peak 1.063291139
#> 17: 1450 6 183925 206737 background 0.162414519
#> 18: 1450 6 206737 208583 peak 16.135969664
#> 19: 1450 6 208583 208584 background 3.883027523
#> 20: 1450 6 208584 209455 peak 3.883027523
#> 21: 1450 6 209455 236036 background 0.206312780
#> 22: 1450 6 236036 237515 peak 7.081135903
#> 23: 1450 6 237515 267598 background 0.194594954
#> 24: 1450 6 267598 270853 peak 3.689708141
#> 25: 1450 6 270853 326129 background 0.129929807
#> 26: 1450 6 326129 327000 peak 2.469575201
#> 27: 1460 5 145000 206725 background 0.065565006
#> 28: 1460 5 206725 208583 peak 16.051130248
#> 29: 1460 5 208583 208584 background 3.883027523
#> 30: 1460 5 208584 209455 peak 3.883027523
#> 31: 1460 5 209455 236036 background 0.206312780
#> 32: 1460 5 236036 237515 peak 7.081135903
#> 33: 1460 5 237515 267598 background 0.194594954
#> 34: 1460 5 267598 270853 peak 3.689708141
#> 35: 1460 5 270853 326129 background 0.129929807
#> 36: 1460 5 326129 327000 peak 2.469575201
#> penalty n.peaks chromStart chromEnd status mean
The table above represents the models without label constraints, for three different input penalties, which result in three different numbers of detected peaks. Below we visualize those models:
model.color <- "blue"
if(require("ggplot2")){
ggplot()+
ggtitle("Models without label constraints")+
geom_step(aes(
chromStart, count),
data=Mono27ac.simple[["coverage"]],
color="grey50")+
geom_step(aes(
chromStart, mean),
data=unlabeled.segs.dt,
color=model.color)+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(penalty + n.peaks ~ ., labeller=label_both)
}
The plot above shows the segmentation models without label constraints. Below we compute a data table to represent the segmentation, label errors, and predicted peaks of all models:
cons.segs <- fit[["segments_dt"]]
model.list <- c(unlabeled.segs.dt.list, list(FLOPART=data.table(
penalty=label.pen, n.peaks=sum(cons.segs[["status"]]=="peak"), cons.segs)))
peaks.dt.list <- list()
err.dt.list <- list()
segs.dt.list <- list()
for(model in names(model.list)){
model.segs <- model.list[[model]]
model.peaks <- model.segs[status=="peak"]
model.err <- if(requireNamespace("PeakError")){
perr <- PeakError::PeakErrorChrom(model.peaks, Mono27ac.simple[["label"]])
data.table(perr)[, .(chromStart, chromEnd, status)]
}else{
data.table(chromStart=integer(), chromEnd=integer(), status=character())
}
segs.dt.list[[model]] <- data.table(model, model.segs)
err.dt.list[[model]] <- data.table(model, model.err)
peaks.dt.list[[model]] <- data.table(model, model.peaks)
}
#> Loading required namespace: PeakError
(segs.dt <- do.call(rbind, segs.dt.list))
#> model penalty n.peaks chromStart chromEnd status mean
#> <char> <num> <int> <int> <int> <char> <num>
#> 1: 1400 1400 7 145000 183846 background 0.007568347
#> 2: 1400 1400 7 183846 183925 peak 1.063291139
#> 3: 1400 1400 7 183925 206737 background 0.162414519
#> 4: 1400 1400 7 206737 208583 peak 16.135969664
#> 5: 1400 1400 7 208583 208584 background 3.883027523
#> 6: 1400 1400 7 208584 209455 peak 3.883027523
#> 7: 1400 1400 7 209455 236036 background 0.206312780
#> 8: 1400 1400 7 236036 237515 peak 7.081135903
#> 9: 1400 1400 7 237515 267598 background 0.194594954
#> 10: 1400 1400 7 267598 270853 peak 3.689708141
#> 11: 1400 1400 7 270853 307841 background 0.105088137
#> 12: 1400 1400 7 307841 308655 peak 1.900491400
#> 13: 1400 1400 7 308655 326129 background 0.100034337
#> 14: 1400 1400 7 326129 327000 peak 2.469575201
#> 15: 1450 1450 6 145000 183846 background 0.007568347
#> 16: 1450 1450 6 183846 183925 peak 1.063291139
#> 17: 1450 1450 6 183925 206737 background 0.162414519
#> 18: 1450 1450 6 206737 208583 peak 16.135969664
#> 19: 1450 1450 6 208583 208584 background 3.883027523
#> 20: 1450 1450 6 208584 209455 peak 3.883027523
#> 21: 1450 1450 6 209455 236036 background 0.206312780
#> 22: 1450 1450 6 236036 237515 peak 7.081135903
#> 23: 1450 1450 6 237515 267598 background 0.194594954
#> 24: 1450 1450 6 267598 270853 peak 3.689708141
#> 25: 1450 1450 6 270853 326129 background 0.129929807
#> 26: 1450 1450 6 326129 327000 peak 2.469575201
#> 27: 1460 1460 5 145000 206725 background 0.065565006
#> 28: 1460 1460 5 206725 208583 peak 16.051130248
#> 29: 1460 1460 5 208583 208584 background 3.883027523
#> 30: 1460 1460 5 208584 209455 peak 3.883027523
#> 31: 1460 1460 5 209455 236036 background 0.206312780
#> 32: 1460 1460 5 236036 237515 peak 7.081135903
#> 33: 1460 1460 5 237515 267598 background 0.194594954
#> 34: 1460 1460 5 267598 270853 peak 3.689708141
#> 35: 1460 1460 5 270853 326129 background 0.129929807
#> 36: 1460 1460 5 326129 327000 peak 2.469575201
#> 37: FLOPART 1400 5 145000 206725 background 0.065565006
#> 38: FLOPART 1400 5 206725 209216 peak 13.177438780
#> 39: FLOPART 1400 5 209216 236120 background 0.224316087
#> 40: FLOPART 1400 5 236120 237515 peak 7.387813620
#> 41: FLOPART 1400 5 237515 267598 background 0.194594954
#> 42: FLOPART 1400 5 267598 270853 peak 3.689708141
#> 43: FLOPART 1400 5 270853 307841 background 0.105088137
#> 44: FLOPART 1400 5 307841 308655 peak 1.900491400
#> 45: FLOPART 1400 5 308655 326129 background 0.100034337
#> 46: FLOPART 1400 5 326129 327000 peak 2.469575201
#> model penalty n.peaks chromStart chromEnd status mean
(err.dt <- do.call(rbind, err.dt.list))
#> model chromStart chromEnd status
#> <char> <int> <int> <char>
#> 1: 1400 180000 200000 false positive
#> 2: 1400 208000 220000 false positive
#> 3: 1400 300000 308250 correct
#> 4: 1400 308260 320000 correct
#> 5: 1450 180000 200000 false positive
#> 6: 1450 208000 220000 false positive
#> 7: 1450 300000 308250 false negative
#> 8: 1450 308260 320000 false negative
#> 9: 1460 180000 200000 correct
#> 10: 1460 208000 220000 false positive
#> 11: 1460 300000 308250 false negative
#> 12: 1460 308260 320000 false negative
#> 13: FLOPART 180000 200000 correct
#> 14: FLOPART 208000 220000 correct
#> 15: FLOPART 300000 308250 correct
#> 16: FLOPART 308260 320000 correct
(peaks.dt <- do.call(rbind, peaks.dt.list))
#> model penalty n.peaks chromStart chromEnd status mean
#> <char> <num> <int> <int> <int> <char> <num>
#> 1: 1400 1400 7 183846 183925 peak 1.063291
#> 2: 1400 1400 7 206737 208583 peak 16.135970
#> 3: 1400 1400 7 208584 209455 peak 3.883028
#> 4: 1400 1400 7 236036 237515 peak 7.081136
#> 5: 1400 1400 7 267598 270853 peak 3.689708
#> 6: 1400 1400 7 307841 308655 peak 1.900491
#> 7: 1400 1400 7 326129 327000 peak 2.469575
#> 8: 1450 1450 6 183846 183925 peak 1.063291
#> 9: 1450 1450 6 206737 208583 peak 16.135970
#> 10: 1450 1450 6 208584 209455 peak 3.883028
#> 11: 1450 1450 6 236036 237515 peak 7.081136
#> 12: 1450 1450 6 267598 270853 peak 3.689708
#> 13: 1450 1450 6 326129 327000 peak 2.469575
#> 14: 1460 1460 5 206725 208583 peak 16.051130
#> 15: 1460 1460 5 208584 209455 peak 3.883028
#> 16: 1460 1460 5 236036 237515 peak 7.081136
#> 17: 1460 1460 5 267598 270853 peak 3.689708
#> 18: 1460 1460 5 326129 327000 peak 2.469575
#> 19: FLOPART 1400 5 206725 209216 peak 13.177439
#> 20: FLOPART 1400 5 236120 237515 peak 7.387814
#> 21: FLOPART 1400 5 267598 270853 peak 3.689708
#> 22: FLOPART 1400 5 307841 308655 peak 1.900491
#> 23: FLOPART 1400 5 326129 327000 peak 2.469575
#> model penalty n.peaks chromStart chromEnd status mean
Finally we use the code below to visualize the models with and without label constraints.
peak.y <- -2
if(require("ggplot2")){
ggplot()+
ggtitle("Models with label constraints (FLOPART) and without (penalty values)")+
scale_fill_manual("label", values=ann.colors)+
geom_rect(aes(
xmin=chromStart, xmax=chromEnd,
ymin=-Inf, ymax=Inf,
fill=annotation),
alpha=0.5,
color="grey",
data=Mono27ac.simple[["label"]])+
geom_step(aes(
chromStart, count),
data=Mono27ac.simple[["coverage"]],
color="grey50")+
geom_step(aes(
chromStart, mean),
data=segs.dt,
color=model.color)+
geom_segment(aes(
chromStart, peak.y,
xend=chromEnd, yend=peak.y),
color=model.color,
size=1,
data=peaks.dt)+
geom_point(aes(
chromEnd, peak.y),
color=model.color,
shape=21,
fill="white",
data=peaks.dt)+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(model ~ ., labeller=label_both)+
scale_linetype_manual(
"error type",
values=c(
correct=0,
"false negative"=3,
"false positive"=1))+
geom_rect(aes(
xmin=chromStart, xmax=chromEnd,
ymin=-Inf, ymax=Inf,
linetype=status),
fill=NA,
color="black",
size=1,
data=err.dt)
}
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
In the plot above, the advantages of FLOPART are clear (no label errors), relative to the models without label constraints (always have some false positives or negatives with respect to the labels).