Title: | Disk-Based Constrained Change-Point Detection |
---|---|
Description: | Disk-based implementation of Functional Pruning Optimal Partitioning with up-down constraints <doi:10.18637/jss.v101.i10> for single-sample peak calling (independently for each sample and genomic problem), can handle huge data sets (10^7 or more). |
Authors: | Toby Dylan Hocking [aut, cre] |
Maintainer: | Toby Dylan Hocking <[email protected]> |
License: | GPL-3 |
Version: | 2024.10.1 |
Built: | 2024-11-01 03:24:28 UTC |
Source: | https://github.com/tdhock/peaksegdisk |
These data are raw aligned reads which have been mapped to the human genome, hg19. One is sample ID McGill0004, experiment H3K36me3, chr9, chunk H3K36me3_AM_immune/8. The other is sample ID McGill0002, experiment H3K4me3, chr2, chunk H3K4me3_PGP_immune/7. The chunk ID numbers refer to parts of the McGill benchmark data set, https://rcdata.nau.edu/genomic-ml/chip-seq-chunk-db/
data("ChIPreads")
data("ChIPreads")
A data frame with 40396 observations on the following 4 variables.
experiment
either H3K36me3 or H3K4me3
chrom
either chr9 or chr2
chromStart
0-based start position of read
chromEnd
1-based end position of read
count
number of times a read occured with the given chromStart/end in this sample/experiment
Peak detection algorithm are typically run on a sequence of non-negative integer count data, one data point for each genomic position. These data are useful for proving that peak detection methods are robust to different sequences: (1) spatially correlated, non-independent aligned read coverage; (2) un-correlated, independent representations such as first or last read.
Create a list of data tables describing PeakSegFPOP model and data.
## S3 method for class 'PeakSegFPOP_df' coef(object, ...)
## S3 method for class 'PeakSegFPOP_df' coef(object, ...)
object |
object |
... |
... |
list of data tables with named elements segments, loss, data, changes, peaks.
Toby Dylan Hocking <[email protected]> [aut, cre]
Compute changes and peaks to display/plot.
## S3 method for class 'PeakSegFPOP_dir' coef(object, ...)
## S3 method for class 'PeakSegFPOP_dir' coef(object, ...)
object |
object |
... |
... |
model list with additional named elements peaks and changes.
Toby Dylan Hocking <[email protected]> [aut, cre]
Named list of character vectors (column names of bed/bedGraph/tsv
files), used to read data files, which do not contain a header /
column names. Each name corresponds to a data/file type, and each
value is a character vector of column names expected in that
file. loss is for the coverage.bedGraph_penalty=VALUE_loss.tsv
file generated by PeakSegFPOP_file
; segments is for the
coverage.bedGraph_penalty=VALUE_segments.bed generated by
PeakSegFPOP_file
; coverage is for the coverage.bedGraph file which
is used as input to PeakSegFPOP_file
.
"col.name.list"
"col.name.list"
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) read.table( tmp, col.names=col.name.list$coverage) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } fread.first(outf("segments.bed"), col.name.list$segments) fread.first(outf("loss.tsv"), col.name.list$loss)
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) read.table( tmp, col.names=col.name.list$coverage) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } fread.first(outf("segments.bed"), col.name.list$segments) fread.first(outf("loss.tsv"), col.name.list$loss)
Read the first line of a text file. Useful for quickly checking if
the coverage.bedGraph_penalty=VALUE_segments.bed file is
consistent with the coverage.bedGraph_penalty=VALUE_loss.tsv
file. (used by the PeakSegFPOP_dir
caching mechanism)
fread.first(file.name, col.name.vec)
fread.first(file.name, col.name.vec)
file.name |
Name of file to read. |
col.name.vec |
Character vector of column names. |
Data table with one row.
Toby Dylan Hocking <[email protected]> [aut, cre]
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } segments.bed <- outf("segments.bed") first.seg.line <- fread.first(segments.bed, col.name.list$segments) last.seg.line <- fread.last(segments.bed, col.name.list$segments) loss.tsv <- outf("loss.tsv") loss.row <- fread.first(loss.tsv, col.name.list$loss) seg.bases <- first.seg.line$chromEnd - last.seg.line$chromStart loss.row$bases == seg.bases
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } segments.bed <- outf("segments.bed") first.seg.line <- fread.first(segments.bed, col.name.list$segments) last.seg.line <- fread.last(segments.bed, col.name.list$segments) loss.tsv <- outf("loss.tsv") loss.row <- fread.first(loss.tsv, col.name.list$loss) seg.bases <- first.seg.line$chromEnd - last.seg.line$chromStart loss.row$bases == seg.bases
Read the last line of a text file. Useful for quickly checking if
the coverage.bedGraph_penalty=VALUE_segments.bed file is
consistent with the coverage.bedGraph_penalty=VALUE_loss.tsv
file. (used by the PeakSegFPOP_dir
caching mechanism)
fread.last(file.name, col.name.vec)
fread.last(file.name, col.name.vec)
file.name |
Name of file to read. |
col.name.vec |
Character vector of column names. |
Data table with one row.
Toby Dylan Hocking <[email protected]> [aut, cre]
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } segments.bed <- outf("segments.bed") first.seg.line <- fread.first(segments.bed, col.name.list$segments) last.seg.line <- fread.last(segments.bed, col.name.list$segments) loss.tsv <- outf("loss.tsv") loss.row <- fread.first(loss.tsv, col.name.list$loss) seg.bases <- first.seg.line$chromEnd - last.seg.line$chromStart loss.row$bases == seg.bases
library(PeakSegDisk) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) write.table( four, tmp <- tempfile(), sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" PeakSegFPOP_file(tmp, pstr) outf <- function(suffix){ paste0(tmp, "_penalty=", pstr, "_", suffix) } segments.bed <- outf("segments.bed") first.seg.line <- fread.first(segments.bed, col.name.list$segments) last.seg.line <- fread.last(segments.bed, col.name.list$segments) loss.tsv <- outf("loss.tsv") loss.row <- fread.first(loss.tsv, col.name.list$loss) seg.bases <- first.seg.line$chromEnd - last.seg.line$chromStart loss.row$bases == seg.bases
The data come from an H3K27ac ChIP-seq experiment which was aligned to the human reference genome (hg19), aligned read counts were used to produce the coverage data; looking at these data in a genome browser was used to produce the labels. ChIP-seq means Chromatin Immunoprecipitation followed by high-throughput DNA sequencing; it is an assay used to characterize genome-wide DNA-protein interactions. In this experiment the protein of interest is histone H3, with the specific modification K27ac (hence the name H3K27ac). Large counts (peaks) therefore indicate regions of the reference genome with high likelihood of interaction between DNA and that specific protein, in the specific Monocyte sample tested.
data("Mono27ac")
data("Mono27ac")
A list of 2 data.tables: coverage has 4 columns (chrom, chromStart, chromEnd, count=number of aligned reads at each position on chrom:chromStart-chromEnd); labels has 4 columns (chrom, chromStart, chromEnd, annotation=label at chrom:chromStart-chromEnd). chrom refers to the chromosome on whcih the data were gathered (chr11), chromStart is the 0-based position before the first base of the data/label, chromEnd is the 1-based position which is the last base of the data/label. Therefore, each chromEnd on each row should be equal to the chromStart of the next row.
UCI Machine Learning Repository, chipseq data set, problem directory H3K27ac-H3K4me3_TDHAM_BP/samples/Mono1_H3K27ac/S001YW_NCMLS/problems/chr11:60000-580000 Links: https://archive.ics.uci.edu/ml/datasets/chipseq for the UCI web page; https://github.com/tdhock/feature-learning-benchmark for a more detailed explanation.
Write data frame to disk then run PeakSegFPOP_dir
solver.
PeakSegFPOP_df(count.df, pen.num, base.dir = tempdir())
PeakSegFPOP_df(count.df, pen.num, base.dir = tempdir())
count.df |
data.frame with columns count, chromStart, chromEnd. These data
will be saved via |
pen.num |
Non-negative numeric scalar. |
base.dir |
base.dir/chrXX-start-end/coverage.bedGraph will be written, where chrXX is the chrom column, start is the first chromStart position, and end is the last chromEnd position. |
List of solver results, same as PeakSegFPOP_dir
.
Toby Dylan Hocking <[email protected]> [aut, cre]
## Simulate a sequence of Poisson count data. sim.seg <- function(seg.mean, size.mean=15){ seg.size <- rpois(1, size.mean) rpois(seg.size, seg.mean) } set.seed(1) seg.mean.vec <- c(1.5, 3.5, 0.5, 4.5, 2.5) z.list <- lapply(seg.mean.vec, sim.seg) z.rep.vec <- unlist(z.list) ## Plot the simulated data sequence. if(require(ggplot2)){ count.df <- data.frame( position=seq_along(z.rep.vec), count=z.rep.vec) gg.count <- ggplot()+ geom_point(aes( position, count), shape=1, data=count.df) gg.count } ## Plot the true changes. n.segs <- length(seg.mean.vec) seg.size.vec <- sapply(z.list, length) seg.end.vec <- cumsum(seg.size.vec) change.vec <- seg.end.vec[-n.segs]+0.5 change.df <- data.frame( changepoint=change.vec) gg.change <- gg.count+ geom_vline(aes( xintercept=changepoint), data=change.df) gg.change ## Plot the run-length encoding of the same data. z.rle.vec <- rle(z.rep.vec) chromEnd <- cumsum(z.rle.vec$lengths) coverage.df <- data.frame( chrom="chrUnknown", chromStart=c(0L, chromEnd[-length(chromEnd)]), chromEnd, count=z.rle.vec$values) gg.rle <- gg.change+ geom_segment(aes( chromStart+0.5, count, xend=chromEnd+0.5, yend=count), data=coverage.df) gg.rle ## Fit a peak model and plot the segment means. fit <- PeakSegDisk::PeakSegFPOP_df(coverage.df, 10.5) gg.rle+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), color="green", data=fit$segments) ## Default plot method shows data as geom_step. (gg <- plot(fit)) ## Plot data as points to verify the step representation. gg+ geom_point(aes( position, count), color="grey", shape=1, data=count.df)
## Simulate a sequence of Poisson count data. sim.seg <- function(seg.mean, size.mean=15){ seg.size <- rpois(1, size.mean) rpois(seg.size, seg.mean) } set.seed(1) seg.mean.vec <- c(1.5, 3.5, 0.5, 4.5, 2.5) z.list <- lapply(seg.mean.vec, sim.seg) z.rep.vec <- unlist(z.list) ## Plot the simulated data sequence. if(require(ggplot2)){ count.df <- data.frame( position=seq_along(z.rep.vec), count=z.rep.vec) gg.count <- ggplot()+ geom_point(aes( position, count), shape=1, data=count.df) gg.count } ## Plot the true changes. n.segs <- length(seg.mean.vec) seg.size.vec <- sapply(z.list, length) seg.end.vec <- cumsum(seg.size.vec) change.vec <- seg.end.vec[-n.segs]+0.5 change.df <- data.frame( changepoint=change.vec) gg.change <- gg.count+ geom_vline(aes( xintercept=changepoint), data=change.df) gg.change ## Plot the run-length encoding of the same data. z.rle.vec <- rle(z.rep.vec) chromEnd <- cumsum(z.rle.vec$lengths) coverage.df <- data.frame( chrom="chrUnknown", chromStart=c(0L, chromEnd[-length(chromEnd)]), chromEnd, count=z.rle.vec$values) gg.rle <- gg.change+ geom_segment(aes( chromStart+0.5, count, xend=chromEnd+0.5, yend=count), data=coverage.df) gg.rle ## Fit a peak model and plot the segment means. fit <- PeakSegDisk::PeakSegFPOP_df(coverage.df, 10.5) gg.rle+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), color="green", data=fit$segments) ## Default plot method shows data as geom_step. (gg <- plot(fit)) ## Plot data as points to verify the step representation. gg+ geom_point(aes( position, count), color="grey", shape=1, data=count.df)
Main function/interface for the PeakSegDisk package.
Run the low-level solver, PeakSegFPOP_file
,
on one genomic segmentation problem
directory, and read the result files into R. Actually, this
function will first check if the result files are already present
(and consistent), and if so, it will simply read them into R
(without running PeakSegFPOP_file
) – this is a caching mechanism
that can save a lot of time.
To run the algo on an integer vector, use PeakSegFPOP_vec
;
for a data.frame, use PeakSegFPOP_df
.
To compute the optimal model for a given number of peaks,
use sequentialSearch_dir
.
PeakSegFPOP_dir(problem.dir, penalty.param, db.file = NULL)
PeakSegFPOP_dir(problem.dir, penalty.param, db.file = NULL)
problem.dir |
Path to a directory like sampleID/problems/chrXX-start-end which
contains a coverage.bedGraph file with the aligned read counts for
one genomic segmentation problem. This must be a plain text file
with the following four columns: chrom (character chromosome
name), chromStart (integer start position), chromEnd (integer end
position), count (integer aligned read count on chrom from
chromStart+1 to chromEnd); see also
https://genome.ucsc.edu/goldenPath/help/bedgraph.html. Note that
the standard coverage.bedGraph file name is required; for full
flexibility the user can run the algo on an arbitrarily named file
via |
penalty.param |
non-negative numeric penalty parameter (larger values for fewer peaks), or character scalar which can be interpreted as such. 0 means max peaks, Inf means no peaks. |
db.file |
character scalar: file for writing temporary cost function database – there will be a lot of disk writing to this file. Default NULL means to write the same disk where the input bedGraph file is stored; another option is tempfile() which may result in speedups if the input bedGraph file is on a slow network disk and the temporary storage is a fast local disk. |
Finds the optimal change-points using the Poisson loss
and the PeakSeg constraint (changes in mean alternate between
non-decreasing and non-increasing). For data points, the
functional pruning algorithm is
memory. It is
time and disk space. It computes the
exact solution to the optimization problem in
vignette("Examples", package="PeakSegDisk")
.
Named list of two data.tables:
segments |
has one row for every segment in the optimal model, |
loss |
has one row and contains the following columns: |
same as input parameter
number of segments in optimal model
number of peaks in optimal model
number of positions described in bedGraph file
number of lines in bedGraph file
total Poisson loss
= =
mean.pen.cost*bases-penalty*peaks
mean penalized cost = (total.loss+penalty*peaks)/bases
number of adjacent segment means that have equal values in the optimal solution
mean number of intervals/candidate changepoints stored in optimal cost functions – useful for characterizing the computational complexity of the algorithm
maximum number of intervals
disk usage of *.db file
timing of PeakSegFPOP_file
Toby Dylan Hocking <[email protected]> [aut, cre]
data(Mono27ac, package="PeakSegDisk", envir=environment()) data.dir <- file.path( tempfile(), "H3K27ac-H3K4me3_TDHAM_BP", "samples", "Mono1_H3K27ac", "S001YW_NCMLS", "problems", "chr11-60000-580000") dir.create(data.dir, recursive=TRUE, showWarnings=FALSE) write.table( Mono27ac$coverage, file.path(data.dir, "coverage.bedGraph"), col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t") ## Compute one model with penalty=1952.6 (fit <- PeakSegDisk::PeakSegFPOP_dir(data.dir, 1952.6)) summary(fit)#same as fit$loss ## Visualize that model. ann.colors <- c( noPeaks="#f6f4bf", peakStart="#ffafaf", peakEnd="#ff4c4c", peaks="#a445ee") if(require(ggplot2)){ lab.min <- Mono27ac$labels[1, chromStart] lab.max <- Mono27ac$labels[.N, chromEnd] plist <- coef(fit) gg <- ggplot()+ theme_bw()+ geom_rect(aes( xmin=chromStart/1e3, xmax=chromEnd/1e3, ymin=-Inf, ymax=Inf, fill=annotation), color="grey", alpha=0.5, data=Mono27ac$labels)+ scale_fill_manual("label", values=ann.colors)+ geom_step(aes( chromStart/1e3, count), color="grey50", data=Mono27ac$coverage)+ geom_segment(aes( chromStart/1e3, mean, xend=chromEnd/1e3, yend=mean), color="green", size=1, data=plist$segments)+ geom_vline(aes( xintercept=chromEnd/1e3, linetype=constraint), color="green", data=plist$changes)+ scale_linetype_manual( values=c( inequality="dotted", equality="solid")) print(gg) print(gg+coord_cartesian(xlim=c(lab.min, lab.max)/1e3, ylim=c(0, 10))) ## Default plotting method only shows model. print(gg <- plot(fit)) ## Data can be added on top of model. print( gg+ geom_step(aes( chromStart, count), color="grey50", data=Mono27ac$coverage) ) }
data(Mono27ac, package="PeakSegDisk", envir=environment()) data.dir <- file.path( tempfile(), "H3K27ac-H3K4me3_TDHAM_BP", "samples", "Mono1_H3K27ac", "S001YW_NCMLS", "problems", "chr11-60000-580000") dir.create(data.dir, recursive=TRUE, showWarnings=FALSE) write.table( Mono27ac$coverage, file.path(data.dir, "coverage.bedGraph"), col.names=FALSE, row.names=FALSE, quote=FALSE, sep="\t") ## Compute one model with penalty=1952.6 (fit <- PeakSegDisk::PeakSegFPOP_dir(data.dir, 1952.6)) summary(fit)#same as fit$loss ## Visualize that model. ann.colors <- c( noPeaks="#f6f4bf", peakStart="#ffafaf", peakEnd="#ff4c4c", peaks="#a445ee") if(require(ggplot2)){ lab.min <- Mono27ac$labels[1, chromStart] lab.max <- Mono27ac$labels[.N, chromEnd] plist <- coef(fit) gg <- ggplot()+ theme_bw()+ geom_rect(aes( xmin=chromStart/1e3, xmax=chromEnd/1e3, ymin=-Inf, ymax=Inf, fill=annotation), color="grey", alpha=0.5, data=Mono27ac$labels)+ scale_fill_manual("label", values=ann.colors)+ geom_step(aes( chromStart/1e3, count), color="grey50", data=Mono27ac$coverage)+ geom_segment(aes( chromStart/1e3, mean, xend=chromEnd/1e3, yend=mean), color="green", size=1, data=plist$segments)+ geom_vline(aes( xintercept=chromEnd/1e3, linetype=constraint), color="green", data=plist$changes)+ scale_linetype_manual( values=c( inequality="dotted", equality="solid")) print(gg) print(gg+coord_cartesian(xlim=c(lab.min, lab.max)/1e3, ylim=c(0, 10))) ## Default plotting method only shows model. print(gg <- plot(fit)) ## Data can be added on top of model. print( gg+ geom_step(aes( chromStart, count), color="grey50", data=Mono27ac$coverage) ) }
Run the PeakSeg Functional Pruning Optimal Partitioning algorithm,
using a file on disk to store the O(N) function piece lists,
each of size O(log N).
This is a low-level function that just runs the algo
and produces the result files (without reading them into R),
so normal users are recommended to instead use PeakSegFPOP_dir
,
which calls this function then reads the result files into R.
PeakSegFPOP_file(bedGraph.file, pen.str, db.file = NULL)
PeakSegFPOP_file(bedGraph.file, pen.str, db.file = NULL)
bedGraph.file |
character scalar: tab-delimited tabular text file with four columns: chrom, chromStart, chromEnd, coverage. The algorithm creates a large temporary file in the same directory, so make sure that there is disk space available on that device. |
pen.str |
character scalar that can be converted to a numeric scalar via
as.numeric: non-negative penalty. More penalty means fewer
peaks. "0" and "Inf" are OK. Character is required rather than
numeric, so that the user can reliably find the results in the
output files, which are in the same directory as |
db.file |
character scalar: file for writing temporary cost function database – there will be a lot of disk writing to this file. Default NULL means to write the same disk where the input bedGraph file is stored; another option is tempfile() which may result in speedups if the input bedGraph file is on a slow network disk and the temporary storage is a fast local disk. |
A named list of input parameters, and the temporary cost function database file size in megabytes.
Toby Dylan Hocking <[email protected]> [aut, cre]
r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) dir.create(prob.dir <- tempfile()) coverage.bedGraph <- file.path(prob.dir, "coverage.bedGraph") write.table( four, coverage.bedGraph, sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" result.list <- PeakSegDisk::PeakSegFPOP_file(coverage.bedGraph, pstr) dir(prob.dir) ## segments file can be read to see optimal segment means. outf <- function(suffix){ paste0(coverage.bedGraph, "_penalty=", pstr, suffix) } segments.bed <- outf("_segments.bed") seg.df <- read.table(segments.bed) names(seg.df) <- col.name.list$segments seg.df ## loss file can be read to see optimal Poisson loss, etc. loss.tsv <- outf("_loss.tsv") loss.df <- read.table(loss.tsv) names(loss.df) <- col.name.list$loss loss.df
r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } four <- rbind( r("chr1", 0, 10, 2), r("chr1", 10, 20, 10), r("chr1", 20, 30, 14), r("chr1", 30, 40, 13)) dir.create(prob.dir <- tempfile()) coverage.bedGraph <- file.path(prob.dir, "coverage.bedGraph") write.table( four, coverage.bedGraph, sep="\t", row.names=FALSE, col.names=FALSE) pstr <- "10.5" result.list <- PeakSegDisk::PeakSegFPOP_file(coverage.bedGraph, pstr) dir(prob.dir) ## segments file can be read to see optimal segment means. outf <- function(suffix){ paste0(coverage.bedGraph, "_penalty=", pstr, suffix) } segments.bed <- outf("_segments.bed") seg.df <- read.table(segments.bed) names(seg.df) <- col.name.list$segments seg.df ## loss file can be read to see optimal Poisson loss, etc. loss.tsv <- outf("_loss.tsv") loss.df <- read.table(loss.tsv) names(loss.df) <- col.name.list$loss loss.df
Convert integer data vector to run-length encoding,
then run PeakSegFPOP_df
.
PeakSegFPOP_vec(count.vec, pen.num)
PeakSegFPOP_vec(count.vec, pen.num)
count.vec |
integer vector, noisy non-negatve count data to segment. |
pen.num |
Non-negative numeric scalar. |
List of solver results, same as PeakSegFPOP_dir
.
Toby Dylan Hocking <[email protected]> [aut, cre]
## Simulate a sequence of Poisson data. sim.seg <- function(seg.mean, size.mean=15){ seg.size <- rpois(1, size.mean) rpois(seg.size, seg.mean) } set.seed(1) seg.mean.vec <- c(1.5, 3.5, 0.5, 4.5, 2.5) z.list <- lapply(seg.mean.vec, sim.seg) z.rep.vec <- unlist(z.list) ## Plot the simulated data. if(require(ggplot2)){ count.df <- data.frame( position=seq_along(z.rep.vec), count=z.rep.vec) gg.count <- ggplot()+ geom_point(aes( position, count), shape=1, data=count.df) print(gg.count) ## Plot the true changepoints. n.segs <- length(seg.mean.vec) seg.size.vec <- sapply(z.list, length) seg.end.vec <- cumsum(seg.size.vec) change.vec <- seg.end.vec[-n.segs]+0.5 change.df <- data.frame( changepoint=change.vec) gg.change <- gg.count+ geom_vline(aes( xintercept=changepoint), data=change.df) print(gg.change) ## Fit a peak model and plot it. fit <- PeakSegDisk::PeakSegFPOP_vec(z.rep.vec, 10.5) print( gg.change+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), color="green", data=fit$segments) ) ## A pathological data set. z.slow.vec <- 1:length(z.rep.vec) fit.slow <- PeakSegDisk::PeakSegFPOP_vec(z.slow.vec, 10.5) rbind(fit.slow$loss, fit$loss) }
## Simulate a sequence of Poisson data. sim.seg <- function(seg.mean, size.mean=15){ seg.size <- rpois(1, size.mean) rpois(seg.size, seg.mean) } set.seed(1) seg.mean.vec <- c(1.5, 3.5, 0.5, 4.5, 2.5) z.list <- lapply(seg.mean.vec, sim.seg) z.rep.vec <- unlist(z.list) ## Plot the simulated data. if(require(ggplot2)){ count.df <- data.frame( position=seq_along(z.rep.vec), count=z.rep.vec) gg.count <- ggplot()+ geom_point(aes( position, count), shape=1, data=count.df) print(gg.count) ## Plot the true changepoints. n.segs <- length(seg.mean.vec) seg.size.vec <- sapply(z.list, length) seg.end.vec <- cumsum(seg.size.vec) change.vec <- seg.end.vec[-n.segs]+0.5 change.df <- data.frame( changepoint=change.vec) gg.change <- gg.count+ geom_vline(aes( xintercept=changepoint), data=change.df) print(gg.change) ## Fit a peak model and plot it. fit <- PeakSegDisk::PeakSegFPOP_vec(z.rep.vec, 10.5) print( gg.change+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), color="green", data=fit$segments) ) ## A pathological data set. z.slow.vec <- 1:length(z.rep.vec) fit.slow <- PeakSegDisk::PeakSegFPOP_vec(z.slow.vec, 10.5) rbind(fit.slow$loss, fit$loss) }
Plot a PeakSeg model with attached data.
## S3 method for class 'PeakSegFPOP_df' plot(x, ...)
## S3 method for class 'PeakSegFPOP_df' plot(x, ...)
x |
x |
... |
... |
a ggplot.
Toby Dylan Hocking <[email protected]> [aut, cre]
Plot a PeakSeg model with attached data.
## S3 method for class 'PeakSegFPOP_dir' plot(x, ...)
## S3 method for class 'PeakSegFPOP_dir' plot(x, ...)
x |
x |
... |
... |
a ggplot.
Toby Dylan Hocking <[email protected]> [aut, cre]
Compute the most likely peak model with at most the number of
peaks given by peaks.int
. This function repeated calls
PeakSegFPOP_dir
with different penalty values, until either
(1) it finds the peaks.int
model, or (2) it concludes that there
is no peaks.int
model, in which case it returns the next simplest
model (with fewer peaks than peaks.int
).
The first pair of penalty values (0, Inf) is run in parallel
via the user-specified future plan,
if the future.apply package is available.
sequentialSearch_dir(problem.dir, peaks.int, verbose = 0)
sequentialSearch_dir(problem.dir, peaks.int, verbose = 0)
problem.dir |
problemID directory in which coverage.bedGraph has already been computed. If there is a labels.bed file then the number of incorrect labels will be computed in order to find the target interval of minimal error penalty values. |
peaks.int |
int: target number of peaks. |
verbose |
numeric verbosity: if >0 then |
Same result list from PeakSegFPOP_dir
, with an additional
component "others" describing the other models that were computed
before finding the optimal model with peaks.int
(or fewer)
peaks. Additional loss columns are as follows: under=number of
peaks in smaller model during binary search; over=number of peaks
in larger model during binary search; iteration=number of times
PeakSegFPOP has been run.
Toby Dylan Hocking <[email protected]> [aut, cre]
## Create simple 6 point data set discussed in supplementary ## materials. GFPOP/GPDPA computes up-down model with 2 peaks, but ## neither CDPA (PeakSegDP::cDPA) nor PDPA (jointseg) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } supp <- rbind( r("chr1", 0, 1, 3), r("chr1", 1, 2, 9), r("chr1", 2, 3, 18), r("chr1", 3, 4, 15), r("chr1", 4, 5, 20), r("chr1", 5, 6, 2) ) data.dir <- file.path(tempfile(), "chr1-0-6") dir.create(data.dir, recursive=TRUE) write.table( supp, file.path(data.dir, "coverage.bedGraph"), sep="\t", row.names=FALSE, col.names=FALSE) ## register a parallel future plan to compute the first two ## penalties in parallel during the sequential search. if(interactive() && requireNamespace("future"))future::plan("multisession") ## Compute optimal up-down model with 2 peaks via sequential search. fit <- PeakSegDisk::sequentialSearch_dir(data.dir, 2L) if(require(ggplot2)){ ggplot()+ theme_bw()+ geom_point(aes( chromEnd, coverage), data=supp)+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), data=fit$segments, color="green") }
## Create simple 6 point data set discussed in supplementary ## materials. GFPOP/GPDPA computes up-down model with 2 peaks, but ## neither CDPA (PeakSegDP::cDPA) nor PDPA (jointseg) r <- function(chrom, chromStart, chromEnd, coverage){ data.frame(chrom, chromStart, chromEnd, coverage) } supp <- rbind( r("chr1", 0, 1, 3), r("chr1", 1, 2, 9), r("chr1", 2, 3, 18), r("chr1", 3, 4, 15), r("chr1", 4, 5, 20), r("chr1", 5, 6, 2) ) data.dir <- file.path(tempfile(), "chr1-0-6") dir.create(data.dir, recursive=TRUE) write.table( supp, file.path(data.dir, "coverage.bedGraph"), sep="\t", row.names=FALSE, col.names=FALSE) ## register a parallel future plan to compute the first two ## penalties in parallel during the sequential search. if(interactive() && requireNamespace("future"))future::plan("multisession") ## Compute optimal up-down model with 2 peaks via sequential search. fit <- PeakSegDisk::sequentialSearch_dir(data.dir, 2L) if(require(ggplot2)){ ggplot()+ theme_bw()+ geom_point(aes( chromEnd, coverage), data=supp)+ geom_segment(aes( chromStart+0.5, mean, xend=chromEnd+0.5, yend=mean), data=fit$segments, color="green") }
Summary of PeakSegFPOP_dir
object
.
## S3 method for class 'PeakSegFPOP_dir' summary(object, ...)
## S3 method for class 'PeakSegFPOP_dir' summary(object, ...)
object |
object |
... |
... |
Data table with one row and columns describing model summary.
Toby Dylan Hocking <[email protected]> [aut, cre]
Convert wc output to integer number of lines.
wc2int(wc.output)
wc2int(wc.output)
wc.output |
Character scalar: output from wc. |
integer
Toby Dylan Hocking <[email protected]> [aut, cre]
Write a data.frame in R to a bedGraph file on disk. This must be a plain text file with the following four columns: chrom (character chromosome name), chromStart (integer start position), chromEnd (integer end position), count (integer aligned read count on chrom from chromStart+1 to chromEnd); see also https://genome.ucsc.edu/goldenPath/help/bedgraph.html
writeBedGraph(count.df, coverage.bedGraph)
writeBedGraph(count.df, coverage.bedGraph)
count.df |
data.frame with four columns: chrom, chromStart, chromEnd, count. |
coverage.bedGraph |
file path where data will be saved in plain text / bedGraph format. |
NULL (same as write.table).
Toby Dylan Hocking <[email protected]> [aut, cre]
library(PeakSegDisk) data(Mono27ac, envir=environment()) coverage.bedGraph <- file.path( tempfile(), "H3K27ac-H3K4me3_TDHAM_BP", "samples", "Mono1_H3K27ac", "S001YW_NCMLS", "problems", "chr11-60000-580000", "coverage.bedGraph") dir.create( dirname(coverage.bedGraph), recursive=TRUE, showWarnings=FALSE) writeBedGraph(Mono27ac$coverage, coverage.bedGraph) fread.first(coverage.bedGraph, col.name.list$coverage) fread.last(coverage.bedGraph, col.name.list$coverage)
library(PeakSegDisk) data(Mono27ac, envir=environment()) coverage.bedGraph <- file.path( tempfile(), "H3K27ac-H3K4me3_TDHAM_BP", "samples", "Mono1_H3K27ac", "S001YW_NCMLS", "problems", "chr11-60000-580000", "coverage.bedGraph") dir.create( dirname(coverage.bedGraph), recursive=TRUE, showWarnings=FALSE) writeBedGraph(Mono27ac$coverage, coverage.bedGraph) fread.first(coverage.bedGraph, col.name.list$coverage) fread.last(coverage.bedGraph, col.name.list$coverage)