---
title: "Capture first match"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Capture first match}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Capture first match in each character vector element

Consider the following vector which contains genome position strings,

```{r}
pos.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "chr1:110-111 chr2:220-222") # two possible matches.
```

To capture the first genome position in each string, we use the following syntax.
The first argument is the subject character vector, 
and the other arguments are pasted together 
to make a capturing regular expression.
Each named argument generates a capture group; 
the R argument name is used for the column name of the result.

```{r}
(chr.dt <- nc::capture_first_vec(
  pos.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+"))
str(chr.dt)
```

We can add a type conversion function after each named argument (here written on the same line):

```{r}
keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(int.dt <- nc::capture_first_vec(
  pos.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits))
str(int.dt)
```

Below we use list variables to create re-usable patterns,
and an un-named list (followed by `"?"`) to generate a non-capturing optional group:

```{r}
pos.pattern <- list("[0-9,]+", keep.digits)
range.pattern <- list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")
nc::capture_first_vec(pos.vec, range.pattern)
```

In summary, `nc::capture_first_vec` takes a variable number of arguments:

* The first argument is the subject character vector.
* The other arguments specify the pattern, via character strings,
  functions, and/or lists.
* If a pattern (character/list) is named, a capture group is generated in the regex.
* Each function is used to convert the text extracted by the previous
  named pattern argument.
* Lists may be used to avoid repetition in the definition of the pattern and type conversion functions.
* Each list generates a group in the regex (named list => named capture group, un-named list => non-capturing group).
* All patterns are pasted together in the order that they appear in
  the argument list, as in the short sketch below.
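
For example, here is a short sketch (with a made-up subject vector) combining
these rules: named character arguments create capture groups, each function
converts the text captured by the previous group, and the un-named list
followed by `"?"` creates a non-capturing optional group.

```r
## Short sketch of the rules above; the subject vector is made up.
nc::capture_first_vec(
  c("chrX:100", "chrY:200-300"),
  chrom="chr.*?",                  # named argument => capture group.
  ":",
  chromStart="[0-9]+", as.integer, # function converts the previous group.
  list(                            # un-named list => non-capturing group,
    "-",
    chromEnd="[0-9]+", as.integer
  ), "?")                          # made optional by the "?" quantifier.
```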
  
## View generated regex

To see the generated regular expression pattern string, call
`nc::var_args_list` with the variable number of arguments that
specify the pattern:

```{r}
nc::var_args_list(range.pattern)
```

The generated regex is the `pattern` element of the resulting list
above. The other element `fun.list` indicates the names and type
conversion functions to use with the capture groups.
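
Since `pattern` is an ordinary regex string, it could also be passed to base R
regex functions; below is a minimal sketch using `regexpr` with `perl=TRUE`
(the PCRE engine), which finds only the whole first match, whereas `nc`
additionally extracts and type-converts each capture group.

```r
## Sketch: use the generated pattern string with base R functions.
generated <- nc::var_args_list(range.pattern)
m <- regexpr(generated$pattern, pos.vec, perl=TRUE)
regmatches(pos.vec, m) # whole first match in each subject.
```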

## Error/NA if any subjects do not match

The default is to stop with an error if any subject does not match:

```{r, error=TRUE, purl=FALSE}
bad.vec <- c(bad="does not match", pos.vec)
nc::capture_first_vec(bad.vec, range.pattern)
```

Sometimes you instead want a row of missing values (NA) when a subject does
not match. In that case, use `nomatch.error=FALSE`:

```{r}
nc::capture_first_vec(bad.vec, range.pattern, nomatch.error=FALSE)
```

## Other regex engines

By default, nc uses the PCRE regex engine; other choices include ICU
and RE2. Each engine has different features, which are discussed in
[my R Journal paper](https://github.com/tdhock/namedCapture-article).

The engine is configurable via the `engine` argument or the
`nc.engine` option:

```{r, error=TRUE, purl=FALSE}
u.subject <- "a\U0001F60E#"
u.pattern <- list(emoji="\\p{EMOJI_Presentation}")
old.opt <- options(nc.engine="ICU")
nc::capture_first_vec(u.subject, u.pattern)
nc::capture_first_vec(u.subject, u.pattern, engine="PCRE") 
nc::capture_first_vec(u.subject, u.pattern, engine="RE2")
options(old.opt)
```

For more details, see the "engines" vignette:

```r
vignette("v6-engines", package="nc")
```

## Capture first match from one or more character columns in a table

We also provide `nc::capture_first_df`, which extracts text
from several columns of a data.frame, using a different
regular expression for each column.

* It requires a `data.frame` (or `data.table`) as the first argument.
* It takes a variable number of other arguments, all of which must be
  named. For each of these arguments, we call `nc::capture_first_vec` on
  one column of the input table.
* Each argument name specifies a column of the input table which will
  be used as the subject in `nc::capture_first_vec`.
* Each argument value specifies a pattern to be used with
  `nc::capture_first_vec`, in list/character/function format as
  explained in the previous section.
* The return value is a `data.table` with the same number of rows as
  the input, but with an additional column for each named capture
  group. New columns have the same names as the capture groups, so
  make sure they are unique.
* This function inputs a `data.frame` and outputs a `data.table`, so it can be
  used in a [pipe](https://r4ds.had.co.nz/pipes.html); see the sketch after this list.
* For memory efficiency, when the input is a `data.table` it is modified
  in place, to avoid copying the entire table.
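
As a minimal sketch of the pipe usage mentioned above (the one-row table and
pattern here are made up for illustration):

```r
## Sketch: capture_first_df in a base R pipe (|> requires R >= 4.1);
## the input table and pattern are made up for illustration.
data.frame(JobID="13937810_25") |>
  nc::capture_first_df(JobID=list(job="[0-9]+", as.integer))
```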
  
This function can greatly simplify the code required to create numeric
data columns from character data columns. For example, consider the
following data, which was output by the
[sacct](https://slurm.schedmd.com/sacct.html) program.

```{r}
(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))
```

Say we want to filter by the total Elapsed time (which is reported as
hours:minutes:seconds) and by the base job ID (the number before
the underscore in the JobID column). We could start by converting
those character columns to integers via:

```{r}
int.pattern <- list("[0-9]+", as.integer)
range.pattern <- list(
  "\\[",
  task1=int.pattern,
  list(
    "-",#begin optional end of range.
    taskN=int.pattern
  ), "?", #end is optional.
  "\\]")
nc::capture_first_df(sacct.df, JobID=range.pattern, nomatch.error=FALSE)
```

The result shown above is another data table with an additional column
for each capture group. Next, we define another pattern that
matches either a single task ID or the previously defined range pattern:

```{r}
task.pattern <- list(
  "_",
  list(
    task=int.pattern,
    "|",#either one task(above) or range(below)
    range.pattern))
nc::capture_first_df(sacct.df, JobID=task.pattern)
```

Below we match the complete JobID column:

```{r}
job.pattern <- list(
  job=int.pattern,
  task.pattern,
  list(
    "[.]",
    type=".*"
  ), "?")
nc::capture_first_df(sacct.df, JobID=job.pattern)
```

Below we match the Elapsed column with a different regex:

```{r}
elapsed.pattern <- list(
  hours=int.pattern,
  ":",
  minutes=int.pattern,
  ":",
  seconds=int.pattern)
nc::capture_first_df(sacct.df, JobID=job.pattern, Elapsed=elapsed.pattern)
```

Overall, the result is another data table with an additional column for
each capture group. Note in the code below that the input `sacct.df`
(a `data.frame`) is not modified, but if the input is a `data.table` then
it is modified in place:

```{r}
nc::capture_first_df(sacct.df, JobID=job.pattern)
sacct.df
(sacct.DT <- data.table::as.data.table(sacct.df))
nc::capture_first_df(sacct.DT, JobID=job.pattern)
sacct.DT
```

## Complex example: fsa file names

In this section we explain how to use various features to parse the
fsa file names in
[PROVEDIt](https://lftdi.camden.rutgers.edu/provedit/files/). Here are
a few representative examples:

```{r}
fsa.vec <- c(#control samples:
  "A01-Ladder-PP16-001.10sec.fsa",
  "D07_Ladder-IP_004.5sec.fsa",
  "A12_RB121514ADG_001.10sec.fsa", 
  "A10_RB102191515LEA-IP_001.10sec.fsa",
  ##single-source samples:
  "A02-RD12-0002-35-0.5PP16-001.10sec.fsa", 
  "G01_RD14-0003-35d3S30-0.01563P-Q10.0_003.10sec.fsa",
  "A06_RD14-0003-24d3a-0.0625IP-Q0.8_001.10sec.fsa", 
  "A08-RD12-0002-01d-0.125PP16-001.10sec.fsa",
  "A10-RD12-0002-04d1-0.0625PP16-001.10sec.fsa", 
  "C02_RD14-0003-15d2b-0.25IP-Q0.5_003.5sec.fsa",
  ##mixture samples:
  "A02-RD12-0002-1_2-1;9-0.125PP16-001.10sec.fsa", 
  "H07_RD14-0003-35_36_37_38_39-1;4;4;4;1-M2I35-0.75IP-QLAND_004.5sec.fsa")
```

The goal is to build a regex that can convert this character vector to
a data table with different columns for the different variables. The
structure of the file names is explained in [the supplementary
materials
PDF](https://lftdi.camden.rutgers.edu/wp-content/uploads/2019/12/PROVEDIt-Database-Naming-Convention-Laboratory-Methodsv1.pdf). We
will build a complex regex in terms of simpler sub-patterns. First
let's just match the start of each file name, which has the row/column
of the 96-well plate in which the sample was tested:

```{r}
well.pattern <- list(
  "^",
  well.letter="[A-H]",
  well.number="[0-9]+", as.integer)
nc::capture_first_vec(fsa.vec, well.pattern)
```

Now, let's match the end of each file name:

```{r}
end.pattern <- list(
  "[-_]",
  capillary="[0-9]+", as.integer,
  "[.]",
  seconds="[0-9]+", as.integer,
  "sec[.]fsa$")
nc::capture_first_vec(fsa.vec, end.pattern)
```

Now, let's take a look at what's in between:

```{r}
between.dt <- nc::capture_first_vec(
  fsa.vec, well.pattern, between=".*?", end.pattern)
between.dt$between
```

Notice that PP16/P/IP are the kit types. There may optionally be a DNA
template mass before, and optionally a Q score after. Let's match those,
using `nc::quantifier` for optional groups and `nc::alternatives` for
alternation:

```{r}
mass.pattern <- list(
  template.nanograms="[0-9.]*", as.numeric)
kit.pattern <- nc::quantifier(
  kit=nc::alternatives("PP16", "IP", "P"),
  "?")
q.pattern <- nc::quantifier(
  "-Q",
  Q.chr=nc::alternatives("LAND", "[0-9.]+"),
  "?")
old.opt <- options(width=100)
(before.dt <- nc::capture_first_vec(
  between.dt$between,
  "^",
  before=".*?",
  mass.pattern, kit.pattern, q.pattern, "$"))
options(old.opt)
```

Now, to match the `before` column, we need some alternatives. The first
four rows are controls, and the other rows are samples, which are
indicated by a project ID (prefix RD) followed by a sample ID. The
mixture samples at the bottom also contain semicolon-delimited mixture
proportions.

```{r}
project.pattern <- list(project="RD[0-9]+-[0-9]+", "-")
single.pattern <- list(id="[0-9]+", as.integer)
mixture.pattern <- list(
  ids.chr="[0-9_]+",
  "-",
  parts.chr="[0-9;]+",
  "-")
single.or.mixture <- list(
  project.pattern,
  nc::alternatives(mixture.pattern, single.pattern))
control.pattern <- list(control="[^-_]+")
single.mixture.control <- list("[-_]", nc::alternatives(
  single.or.mixture, control.pattern))
(rest.dt <- nc::capture_first_vec(
  before.dt$before, single.mixture.control, rest=".*"))
```

The `rest` column may contain information about the dilution and
treatment, which we can parse as follows:

```{r}
dilution.pattern <- nc::quantifier(
  "[dM]",
  dilution.number="[0-9]?", as.integer,
  "?")
treatment.pattern <- nc::quantifier(
  treatment.letter="[a-zA-Z]",
  nc::quantifier(treatment.number="[0-9]+", as.integer, "?"),
  "?")
dilution.treatment <- list(
  dilution.pattern,
  treatment.pattern,
  "[_-]?")
nc::capture_first_vec(rest.dt$rest, "^", dilution.treatment, "$")
```

Now we just have to put all of those patterns together to get a full
match:

```{r}
fsa.pattern <- list(
  well.pattern,
  single.mixture.control,
  dilution.treatment,
  mass.pattern, kit.pattern, q.pattern,  
  end.pattern)
(match.dt <- nc::capture_first_vec(fsa.vec, fsa.pattern))
```

To verify that we parsed everything correctly, we print each subject
next to its corresponding row of the result:

```{r}
disp.dt <- data.table::data.table(match.dt)
names(disp.dt) <- paste(1:ncol(disp.dt))
old.opt <- options("datatable.print.colnames"="none")
split(disp.dt, fsa.vec)
options(old.opt)
```

Incidentally, if you print the generated regex as a string, it looks like
something that would be essentially impossible to write or maintain by
hand (without the `nc` helper functions):

```{r}
nc::var_args_list(fsa.pattern)$pattern
```

In conclusion, we have shown how `nc` can help build complex regex
patterns from simple, understandable sub-patterns in R code. The
general workflow we followed in this section can be applied to other
projects. First, identify a small set of subjects to use for testing
(`fsa.vec` in the code above). Second, create sub-patterns for the
start/end of the subjects, along with a sub-pattern that captures
everything else in between. Third, create more sub-patterns to match
the "everything else" from the previous step (e.g. `before.dt$before`,
`between.dt$between`, `rest.dt$rest` above). Repeat the process until
there is nothing left to match, and then concatenate the sub-patterns
to match the whole subject.