This vignette contains a number of examples which explain how to use capture_first_glob to read data from a set of regularly named files.
We begin with a simple example: iris data have 150 rows, as shown below.
library(data.table)
dir.create(iris.dir <- tempfile())
icsv <- function(sp)file.path(iris.dir, paste0(sp, ".csv"))
(iris.dt <- data.table(iris))
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 5.1 3.5 1.4 0.2 setosa
#> 2: 4.9 3.0 1.4 0.2 setosa
#> 3: 4.7 3.2 1.3 0.2 setosa
#> 4: 4.6 3.1 1.5 0.2 setosa
#> 5: 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 6.7 3.0 5.2 2.3 virginica
#> 147: 6.3 2.5 5.0 1.9 virginica
#> 148: 6.5 3.0 5.2 2.0 virginica
#> 149: 6.2 3.4 5.4 2.3 virginica
#> 150: 5.9 3.0 5.1 1.8 virginica
In the code below, we save one CSV file for each of the three Species.
iris.dt[, fwrite(.SD, icsv(Species)), by=Species]
#> Empty data.table (0 rows and 1 cols): Species
dir(iris.dir)
#> [1] "setosa.csv" "versicolor.csv" "virginica.csv"
The output above shows that there are three CSV files, one for each Species in the iris data. Below we read the first two rows of one file,
data.table::fread(file.path(iris.dir,"setosa.csv"), nrows=2)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <num> <num> <num> <num>
#> 1: 5.1 3.5 1.4 0.2
#> 2: 4.9 3.0 1.4 0.2
The output above shows that the CSV data file itself does not contain a Species column (the Species is instead encoded in the file name). Below we construct a glob, which is a string for matching files,
(iglob <- file.path(iris.dir,"*.csv"))
#> [1] "/tmp/RtmptbwqVT/file10b4623581c8/*.csv"
Sys.glob(iglob)
#> [1] "/tmp/RtmptbwqVT/file10b4623581c8/setosa.csv"
#> [2] "/tmp/RtmptbwqVT/file10b4623581c8/versicolor.csv"
#> [3] "/tmp/RtmptbwqVT/file10b4623581c8/virginica.csv"
The output above indicates that iglob matches the three data files. Below we read those files into R, using the following syntax:
- iglob is a string/glob which indicates the files to read,
- Species matches that part of the file name, and is captured to the resulting column of the same name,
- "[.]csv" indicates that suffix must be matched (but since the argument is not named, it is not captured, nor saved as a column in the output).
nc::capture_first_glob(iglob, Species="[^/]+", "[.]csv")
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <char> <num> <num> <num> <num>
#> 1: setosa 5.1 3.5 1.4 0.2
#> 2: setosa 4.9 3.0 1.4 0.2
#> 3: setosa 4.7 3.2 1.3 0.2
#> 4: setosa 4.6 3.1 1.5 0.2
#> 5: setosa 5.0 3.6 1.4 0.2
#> ---
#> 146: virginica 6.7 3.0 5.2 2.3
#> 147: virginica 6.3 2.5 5.0 1.9
#> 148: virginica 6.5 3.0 5.2 2.0
#> 149: virginica 6.2 3.4 5.4 2.3
#> 150: virginica 5.9 3.0 5.1 1.8
The output above indicates that we have successfully read the iris data back into R, including the Species column which was not present in the CSV data files.
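To make the matching step more concrete, the sketch below uses only base R to apply a roughly equivalent regular expression to some file paths. The paths are hypothetical stand-ins for the ones matched by iglob, and the regex string is our own approximation of the pattern (one capture group for the named argument, no group for the unnamed suffix), not the exact regex that nc builds internally.

```r
# Hypothetical file paths, shaped like the ones matched by iglob.
demo.paths <- c(
  "/tmp/demo/setosa.csv",
  "/tmp/demo/versicolor.csv",
  "/tmp/demo/virginica.csv")
# One capture group for the Species part, followed by the literal suffix.
demo.regex <- "([^/]+)[.]csv"
match.list <- regmatches(demo.paths, regexec(demo.regex, demo.paths))
# Element 1 of each match is the full match; element 2 is the first group.
(Species <- vapply(match.list, function(m) m[2], character(1)))
```

Each captured value is what ends up in the Species column, repeated for every row read from the corresponding file.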
Consider the example below, which is slightly more complex. The code below defines a glob for matching several data files.
db <- system.file("extdata/chip-seq-chunk-db", package="nc", mustWork=TRUE)
(glob <- paste0(db, "/*/*/counts/*gz"))
#> [1] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/chip-seq-chunk-db/*/*/counts/*gz"
(matched.files <- Sys.glob(glob))
#> [1] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/chip-seq-chunk-db/H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.gz"
#> [2] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/chip-seq-chunk-db/H3K36me3_TDH_other/1/counts/McGill0019.bedGraph.gz"
#> [3] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/chip-seq-chunk-db/H3K4me3_TDH_immune/9/counts/McGill0024.bedGraph.gz"
#> [4] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/chip-seq-chunk-db/H3K4me3_XJ_immune/2/counts/McGill0024.bedGraph.gz"
The output above indicates there are four data files that are matched by the glob. Below we read the first five lines of the first one,
readLines(matched.files[1], n=5)
#> [1] "track type=bedGraph db=hg19 visibility=full graphType=points name=101K36monocyte description=\"McGill0101 H3K36me3 aligned read counts\""
#> [2] "chr10\t111456281\t111456338\t2"
#> [3] "chr10\t111456338\t111456381\t1"
#> [4] "chr10\t111456381\t111459312\t0"
#> [5] "chr10\t111459312\t111459316\t5"
We can see from the output above that this data file has a header of meta-data (not column names) on the first line, whereas the other lines contain tab-delimited data. We can read it with fread, as long as we provide a couple of non-default arguments, as in the code below:
read.bedGraph <- function(f)data.table::fread(
f, skip=1, col.names = c("chrom","start", "end", "count"))
read.bedGraph(matched.files[1])
#> chrom start end count
#> <char> <int> <int> <int>
#> 1: chr10 111456281 111456338 2
#> 2: chr10 111456338 111456381 1
#> 3: chr10 111456381 111459312 0
#> 4: chr10 111459312 111459316 5
#> 5: chr10 111459316 111459409 10
#> ---
#> 7130: chr10 111721272 111721347 4
#> 7131: chr10 111721347 111721354 2
#> 7132: chr10 111721354 111722459 0
#> 7133: chr10 111722459 111722461 2
#> 7134: chr10 111722461 111722555 4
The output above indicates the data have been correctly read into R as a table with four columns. To do that for each of the files, we pass this function as the READ argument in the code below,
data.chunk.pattern <- list(
data="H.*?",
"/",
chunk="[0-9]+", as.integer)
(data.chunk.dt <- nc::capture_first_glob(glob, data.chunk.pattern, READ=read.bedGraph))
#> data chunk chrom start end count
#> <char> <int> <char> <int> <int> <int>
#> 1: H3K36me3_AM_immune 9 chr10 111456281 111456338 2
#> 2: H3K36me3_AM_immune 9 chr10 111456338 111456381 1
#> 3: H3K36me3_AM_immune 9 chr10 111456381 111459312 0
#> 4: H3K36me3_AM_immune 9 chr10 111459312 111459316 5
#> 5: H3K36me3_AM_immune 9 chr10 111459316 111459409 10
#> ---
#> 20297: H3K4me3_XJ_immune 2 chr22 20689768 20689770 0
#> 20298: H3K4me3_XJ_immune 2 chr22 20689770 20689870 1
#> 20299: H3K4me3_XJ_immune 2 chr22 20689870 20689995 0
#> 20300: H3K4me3_XJ_immune 2 chr22 20689995 20690080 1
#> 20301: H3K4me3_XJ_immune 2 chr22 20690080 20691400 0
The output above indicates the data files have been read into R as a table, with two additional columns (data and chunk), which correspond to the capture group names used in the regular expression pattern above.
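One detail of data.chunk.pattern is worth a closer look: the non-greedy data="H.*?" can match across slashes. In the paths shown above no earlier path component contains a capital H, so the result is as intended. The base R sketch below uses a hypothetical path (and a hypothetical helper, first.groups) to show what could happen otherwise, along with a stricter variant. The regex strings are our own approximations of the pattern, and the stricter variant is a suggestion rather than something taken from the vignette.

```r
# Hypothetical path: the temporary directory name happens to contain "H".
demo.path <- "/tmp/RtmpH1/db/H3K4me3_XJ_immune/2/counts/McGill0024.bedGraph.gz"
first.groups <- function(regex, subject){
  # Element 1 is the full match; the remaining elements are the capture groups.
  regmatches(subject, regexec(regex, subject, perl=TRUE))[[1]][-1]
}
# Non-greedy pattern, as in data.chunk.pattern: the match starts at the first H.
(lazy.match <- first.groups("(H.*?)/([0-9]+)", demo.path))
# Stricter variant (an assumption, not from the vignette): no slash allowed in data.
(strict.match <- first.groups("(H[^/]*)/([0-9]+)", demo.path))
# The conversion function in the pattern plays the role of as.integer here.
as.integer(strict.match[2])
```

With the non-greedy pattern, the data group starts at the H in the temporary directory name and extends over several path components, whereas the stricter variant captures only the directory name that starts with H.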
We can also use base R functions (dirname and basename) to extract the variables from the file paths, but it takes a bit more code, as shown below.
base.df.list <- list()
for(file.csv in matched.files){
file.df <- read.bedGraph(file.csv)
counts.path <- dirname(file.csv)
chunk.path <- dirname(counts.path)
data.path <- dirname(chunk.path)
base.df.list[[file.csv]] <- data.frame(
data=basename(data.path),
chunk=basename(chunk.path),
file.df)
}
base.df <- do.call(rbind, base.df.list)
rownames(base.df) <- NULL
head(base.df)
#> data chunk chrom start end count
#> 1 H3K36me3_AM_immune 9 chr10 111456281 111456338 2
#> 2 H3K36me3_AM_immune 9 chr10 111456338 111456381 1
#> 3 H3K36me3_AM_immune 9 chr10 111456381 111459312 0
#> 4 H3K36me3_AM_immune 9 chr10 111459312 111459316 5
#> 5 H3K36me3_AM_immune 9 chr10 111459316 111459409 10
#> 6 H3K36me3_AM_immune 9 chr10 111459409 111459411 8
str(base.df)
#> 'data.frame': 20301 obs. of 6 variables:
#> $ data : chr "H3K36me3_AM_immune" "H3K36me3_AM_immune" "H3K36me3_AM_immune" "H3K36me3_AM_immune" ...
#> $ chunk: chr "9" "9" "9" "9" ...
#> $ chrom: chr "chr10" "chr10" "chr10" "chr10" ...
#> $ start: int 111456281 111456338 111456381 111459312 111459316 111459409 111459411 111459415 111463412 111463512 ...
#> $ end : int 111456338 111456381 111459312 111459316 111459409 111459411 111459415 111463412 111463512 111466726 ...
#> $ count: int 2 1 0 5 10 8 5 0 2 0 ...
The output above shows that we have read a data frame into R, and that it is consistent with the data table returned by nc::capture_first_glob, which should be preferred for simplicity when the files are regularly named. In contrast, this section shows how arbitrary R code can be used, so this approach should be preferred when the data in the file path cannot be captured using regular expressions.
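One difference between the two results is visible in the str output: chunk is a character column in base.df, whereas it is an integer column in data.chunk.dt (because the pattern included as.integer). A minimal sketch of the extra conversion step is shown below, using a small hypothetical data frame (demo.df) in place of base.df.

```r
# Hypothetical stand-in for base.df, with chunk stored as character.
demo.df <- data.frame(
  data = c("H3K36me3_AM_immune", "H3K4me3_XJ_immune"),
  chunk = c("9", "2"),
  count = c(2L, 0L),
  stringsAsFactors = FALSE)
# basename() returns character, so an explicit conversion is needed
# to get the same column type as in the data table.
demo.df$chunk <- as.integer(demo.df$chunk)
str(demo.df)
```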
In the code below, we use arrow::write_dataset to write the same data to a set of CSV files with different names (partitioned by data and chunk, with at most 1000 rows per file),
arrow.available <- requireNamespace("arrow") && arrow::arrow_with_dataset()
#> Loading required namespace: arrow
if(arrow.available){
path <- tempfile()
arrow::write_dataset(
dataset=data.chunk.dt,
path=path,
format="csv",
partitioning=c("data","chunk"),
max_rows_per_file=1000)
hive.glob <- file.path(path, "*", "*", "*.csv")
(hive.files <- Sys.glob(hive.glob))
}
#> [1] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-0.csv"
#> [2] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-1.csv"
#> [3] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-2.csv"
#> [4] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-3.csv"
#> [5] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-4.csv"
#> [6] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-5.csv"
#> [7] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-6.csv"
#> [8] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_AM_immune/chunk=9/part-7.csv"
#> [9] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-0.csv"
#> [10] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-1.csv"
#> [11] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-10.csv"
#> [12] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-11.csv"
#> [13] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-12.csv"
#> [14] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-2.csv"
#> [15] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-3.csv"
#> [16] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-4.csv"
#> [17] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-5.csv"
#> [18] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-6.csv"
#> [19] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-7.csv"
#> [20] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-8.csv"
#> [21] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K36me3_TDH_other/chunk=1/part-9.csv"
#> [22] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K4me3_TDH_immune/chunk=9/part-0.csv"
#> [23] "/tmp/RtmptbwqVT/file10b43512fbf6/data=H3K4me3_XJ_immune/chunk=2/part-0.csv"
In the output above, we can see that there are regularly named files with three variables encoded in the file path (data, chunk, part). The code below reads one of the files back into R:
if(arrow.available){
data.table::fread(hive.files[1])
}
#> chrom start end count
#> <char> <int> <int> <int>
#> 1: chr10 111456281 111456338 2
#> 2: chr10 111456338 111456381 1
#> 3: chr10 111456381 111459312 0
#> 4: chr10 111459312 111459316 5
#> 5: chr10 111459316 111459409 10
#> ---
#> 996: chr10 111619010 111619035 1
#> 997: chr10 111619035 111619092 2
#> 998: chr10 111619092 111619101 3
#> 999: chr10 111619101 111619128 2
#> 1000: chr10 111619128 111619129 3
The output above indicates that the file only has four columns (and is missing the variables which are encoded in the file path). In the code below, we read all those files back into R:
if(arrow.available){
hive.pattern <- list(
nc::field("data","=",".*?"),
"/",
nc::field("chunk","=",".*?", as.integer),
"/",
nc::field("part","-","[0-9]+", as.integer))
print(hive.dt <- nc::capture_first_glob(hive.glob, hive.pattern))
hive.dt[, .(rows=.N), keyby=.(data,chunk,part)]
}
#> data chunk part chrom start end count
#> <char> <int> <int> <char> <int> <int> <int>
#> 1: H3K36me3_AM_immune 9 0 chr10 111456281 111456338 2
#> 2: H3K36me3_AM_immune 9 0 chr10 111456338 111456381 1
#> 3: H3K36me3_AM_immune 9 0 chr10 111456381 111459312 0
#> 4: H3K36me3_AM_immune 9 0 chr10 111459312 111459316 5
#> 5: H3K36me3_AM_immune 9 0 chr10 111459316 111459409 10
#> ---
#> 20297: H3K4me3_XJ_immune 2 0 chr22 20689768 20689770 0
#> 20298: H3K4me3_XJ_immune 2 0 chr22 20689770 20689870 1
#> 20299: H3K4me3_XJ_immune 2 0 chr22 20689870 20689995 0
#> 20300: H3K4me3_XJ_immune 2 0 chr22 20689995 20690080 1
#> 20301: H3K4me3_XJ_immune 2 0 chr22 20690080 20691400 0
#> Key: <data, chunk, part>
#> data chunk part rows
#> <char> <int> <int> <int>
#> 1: H3K36me3_AM_immune 9 0 1000
#> 2: H3K36me3_AM_immune 9 1 1000
#> 3: H3K36me3_AM_immune 9 2 1000
#> 4: H3K36me3_AM_immune 9 3 1000
#> 5: H3K36me3_AM_immune 9 4 1000
#> 6: H3K36me3_AM_immune 9 5 1000
#> 7: H3K36me3_AM_immune 9 6 1000
#> 8: H3K36me3_AM_immune 9 7 134
#> 9: H3K36me3_TDH_other 1 0 1000
#> 10: H3K36me3_TDH_other 1 1 1000
#> 11: H3K36me3_TDH_other 1 2 1000
#> 12: H3K36me3_TDH_other 1 3 1000
#> 13: H3K36me3_TDH_other 1 4 1000
#> 14: H3K36me3_TDH_other 1 5 1000
#> 15: H3K36me3_TDH_other 1 6 1000
#> 16: H3K36me3_TDH_other 1 7 1000
#> 17: H3K36me3_TDH_other 1 8 1000
#> 18: H3K36me3_TDH_other 1 9 1000
#> 19: H3K36me3_TDH_other 1 10 1000
#> 20: H3K36me3_TDH_other 1 11 1000
#> 21: H3K36me3_TDH_other 1 12 109
#> 22: H3K4me3_TDH_immune 9 0 886
#> 23: H3K4me3_XJ_immune 2 0 172
#> data chunk part rows
The output above indicates that we have successfully read the data back into R.
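The integer conversion of part matters for ordering: in the Sys.glob listing above, part-10 comes before part-2 (alphabetical order), whereas the keyed summary table lists the parts in numeric order. The base R sketch below illustrates this with two hypothetical relative paths. The regex is our own approximation of hive.pattern, assuming each nc::field matches the field name, then the separator, then captures the value.

```r
# Hypothetical paths, shaped like the hive-partitioned files above.
demo.hive <- c(
  "data=H3K36me3_TDH_other/chunk=1/part-10.csv",
  "data=H3K36me3_TDH_other/chunk=1/part-2.csv")
# Approximation of hive.pattern: field name, separator, captured value.
hive.regex <- "data=(.*?)/chunk=(.*?)/part-([0-9]+)"
# One row per path; column 1 is the full match, columns 2-4 are the groups.
m <- do.call(rbind, regmatches(
  demo.hive, regexec(hive.regex, demo.hive, perl=TRUE)))
demo.parts <- data.frame(
  data = m[, 2],
  chunk = as.integer(m[, 3]),
  part = as.integer(m[, 4]),
  stringsAsFactors = FALSE)
# After conversion, sorting by part gives numeric order (2 before 10).
demo.parts[order(demo.parts$part), ]
```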
In the code below, we read the same data files, with a more complex pattern that has two additional capture groups (name and id).
(count.dt <- nc::capture_first_glob(
glob,
data.chunk.pattern,
"/counts/",
name=list("McGill", id="[0-9]+", as.integer),
READ=read.bedGraph))
#> data chunk name id chrom start end count
#> <char> <int> <char> <int> <char> <int> <int> <int>
#> 1: H3K36me3_AM_immune 9 McGill0101 101 chr10 111456281 111456338 2
#> 2: H3K36me3_AM_immune 9 McGill0101 101 chr10 111456338 111456381 1
#> 3: H3K36me3_AM_immune 9 McGill0101 101 chr10 111456381 111459312 0
#> 4: H3K36me3_AM_immune 9 McGill0101 101 chr10 111459312 111459316 5
#> 5: H3K36me3_AM_immune 9 McGill0101 101 chr10 111459316 111459409 10
#> ---
#> 20297: H3K4me3_XJ_immune 2 McGill0024 24 chr22 20689768 20689770 0
#> 20298: H3K4me3_XJ_immune 2 McGill0024 24 chr22 20689770 20689870 1
#> 20299: H3K4me3_XJ_immune 2 McGill0024 24 chr22 20689870 20689995 0
#> 20300: H3K4me3_XJ_immune 2 McGill0024 24 chr22 20689995 20690080 1
#> 20301: H3K4me3_XJ_immune 2 McGill0024 24 chr22 20690080 20691400 0
count.dt[, .(count=.N), by=.(data, chunk, name, id, chrom)]
#> data chunk name id chrom count
#> <char> <int> <char> <int> <char> <int>
#> 1: H3K36me3_AM_immune 9 McGill0101 101 chr10 7134
#> 2: H3K36me3_TDH_other 1 McGill0019 19 chr21 12109
#> 3: H3K4me3_TDH_immune 9 McGill0024 24 chr1 886
#> 4: H3K4me3_XJ_immune 2 McGill0024 24 chr22 172
The output above indicates that we have successfully read the data into R, with two additional columns (name and id). These data can be visualized using the code below,
if(require(ggplot2)){
ggplot()+
facet_wrap(~data+chunk+name+chrom, labeller=label_both, scales="free")+
geom_step(aes(
start/1e3, count),
data=count.dt)
}
The plot above includes panel/facet titles which come from the variables which were stored in the file names.
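The pattern used to create count.dt contains a nested group: name=list("McGill", id="[0-9]+", as.integer) captures the whole sample name, and the id group inside it captures only the digits. The base R sketch below shows a roughly equivalent regex with nested parentheses, applied to a hypothetical relative path. It approximates the pattern and is not the regex that nc generates.

```r
# Hypothetical relative path, shaped like one of the matched files.
demo.file <- "H3K36me3_AM_immune/9/counts/McGill0101.bedGraph.gz"
# Outer group: whole name; inner group: digits only.
nested.regex <- "/counts/(McGill([0-9]+))"
nested.match <- regmatches(demo.file, regexec(nested.regex, demo.file))[[1]]
# Element 1 is the full match, then groups in order of opening parenthesis.
(name <- nested.match[2])
# as.integer drops the leading zero, as in the id column of count.dt.
(id <- as.integer(nested.match[3]))
```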
The following example demonstrates how non-CSV data may be parsed, using a custom READ function. Consider the vignette data files,
vignettes <- system.file("extdata/vignettes", package="nc", mustWork=TRUE)
(vglob <- paste0(vignettes, "/*.Rmd"))
#> [1] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/*.Rmd"
(vfiles <- Sys.glob(vglob))
#> [1] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v0-overview.Rmd"
#> [2] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v1-capture-first.Rmd"
#> [3] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v2-capture-all.Rmd"
#> [4] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v3-capture-melt.Rmd"
#> [5] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v4-comparisons.Rmd"
#> [6] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v5-helpers.Rmd"
#> [7] "/tmp/RtmpmIrAPO/Rinstffe26a07f05/nc/extdata/vignettes/v6-engines.Rmd"
The output above includes the glob and the files it matches. Below we define a function for parsing one of those files,
non.greedy.lines <- list(
list(".*\n"), "*?")
optional.name <- list(
list(" ", chunk_name="[^,}]+"), "?")
chunk.pattern <- list(
before=non.greedy.lines,
"```\\{r",
optional.name,
parameters=".*",
"\\}\n",
code=non.greedy.lines,
"```")
READ.vignette <- function(f)nc::capture_all_str(f, chunk.pattern)
str(READ.vignette(vfiles[1]))
#> Classes 'data.table' and 'data.frame': 7 obs. of 4 variables:
#> $ before : chr "<!--\n%\\VignetteEngine{knitr::knitr}\n%\\VignetteIndexEntry{vignette 0: Overview}\n-->\n\n# Overview of nc functionality\n\n" "\n\nHere is an index of topics which are explained in the different\nvignettes, along with an overview of funct"| __truncated__ "\n\nA variant is doing the same thing, but with input\nsubjects coming from a data table/frame with character columns.\n\n" "\n\n## Capture all matches in a single subject\n \n[Capture all](v2-capture-all.html) is for the situation whe"| __truncated__ ...
#> $ chunk_name: chr "setup" "" "" "" ...
#> $ parameters: chr ", include = FALSE" "" "" "" ...
#> $ code : chr "knitr::opts_chunk$set(\n collapse = TRUE,\n comment = \"#>\"\n)\n" "subject.vec <- c(\n \"chr10:213054000-213,055,000\",\n \"chrM:111000\",\n \"chr1:110-111 chr2:220-222\")\nnc"| __truncated__ "subject.dt <- data.table::data.table(\n JobID = c(\"13937810_25\", \"14022192_1\"),\n Elapsed = c(\"07:04:42\"| __truncated__ "nc::capture_all_str(\n subject.vec, chrom=\"chr.*?\", \":\", chromStart=\"[0-9,]+\", as.integer)\n" ...
#> - attr(*, ".internal.selfref")=<externalptr>
The output above shows a data table with 7 rows, one for each code chunk defined in the vignette data file. We read all of the vignette files using the code below.
chunk.dt <- nc::capture_first_glob(
vglob,
"/v",
vignette_number="[0-9]", as.integer,
"-",
vignette_name=".*?",
".Rmd",
READ=READ.vignette
)[
, chunk_number := seq_along(chunk_name), by=vignette_number
]
chunk.dt[, .(
vignette_number, vignette_name, chunk_number, chunk_name,
lines=nchar(code))]
#> vignette_number vignette_name chunk_number chunk_name lines
#> <int> <char> <int> <char> <int>
#> 1: 0 overview 1 setup 61
#> 2: 0 overview 2 192
#> 3: 0 overview 3 314
#> 4: 0 overview 4 91
#> 5: 0 overview 5 242
#> ---
#> 104: 6 engines 1 setup 61
#> 105: 6 engines 2 115
#> 106: 6 engines 3 67
#> 107: 6 engines 4 122
#> 108: 6 engines 5 320
The output above is a data table with one row for each chunk in each data file. Some columns (vignette_number and vignette_name) come from the file path, and others come from the data file contents, including the chunk number, the chunk name, and the lines column (which is computed with nchar, so it counts the characters of code in each chunk rather than the number of lines). The files also contain code which has been parsed and can be extracted via the code below, for example: