---
title: "Helper functions"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Helper functions}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

There are several "helper" functions which can simplify the
definition of complex patterns. First we define some functions that
will help us display the patterns:

```{r}
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}
```

# `nc::field` for reducing repetition

The `nc::field` function can be used to avoid repetition when defining
patterns of the form `variable: value`. The example below shows three
(mostly) equivalent ways to write a regex that captures the text after
the colon and space; the captured text is stored in the `variable`
group or output column:

```{r}
show.patterns(
  "variable: (?<variable>.*)",      #repetitive regex string
  list("variable: ", variable=".*"),#repetitive nc R code
  nc::field("variable", ": ", ".*"))#helper function avoids repetition
```

Note that the first version above has a named capture group, whereas
the second and third patterns generated by nc have an un-named capture
group and some non-capturing groups (but they all match the same
pattern).

Another example:

```{r}
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
```

Another example:

```{r}
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
```

# `nc::quantifier` for fewer parentheses

Another helper function is `nc::quantifier` which makes patterns
easier to read by reducing the number of parentheses required to
define sub-patterns with quantifiers.
For example, all three patterns below create an optional non-capturing
group which contains a named capture group:

```{r}
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),    #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?"))#quantifier helper function
```

Another example with a named capture group inside an optional
non-capturing group:

```{r}
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
```

# `nc::alternatives` for simplified alternation

We also provide a helper function for defining regex patterns with
[alternation](https://www.regular-expressions.info/alternation.html).
The following three patterns are equivalent.

```{r}
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
```

# `nc::alternatives_with_shared_groups` for alternatives with identical named sub-pattern groups

Sometimes each alternative is just a re-arrangement of the same
sub-patterns. For example, consider the following subjects, each of
which is a date in one of two formats.

```{r}
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
```

In each of the two formats, the month consists of three lower-case
letters, the day consists of two digits, and the year consists of four
digits. Is there a single pattern that can match each of these
subjects? Yes, such a pattern can be defined using the code below:

```{r}
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
```

In the code above, we used `nc::alternatives_with_shared_groups`,
which requires two kinds of arguments:

* named arguments (month, day, year) define sub-pattern groups that
  are used in each alternative.
* un-named arguments (last two) define alternative patterns, each of
  which can use the sub-pattern group names (month, day, year).

The pattern can be used for matching, and the result is a data table
with one column for each unique name:

```{r}
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
```

After having parsed the dates into these three columns, we can add a
date column:

```{r}
Sys.setlocale(locale="C")#to recognize months in English.
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
```

Another example is parsing given and family names, in two different
formats:

```{r}
nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
```
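
The `nc::quantifier` section only displays the generated pattern, so
as a closing illustration here is a minimal sketch of using such a
pattern for matching with `nc::capture_first_vec`. The subjects in
`range.vec` and the names `range.vec`, `range.pattern`, and `range.dt`
are made up for this sketch and do not come from the examples above.

```{r}
## Hypothetical subjects, invented for this illustration: the first has
## an end coordinate after a dash, the second does not.
range.vec <- c("chr1:100-200", "chr2:300")
## The dash and the chromEnd group are wrapped in an optional
## non-capturing group by nc::quantifier, as described earlier.
range.pattern <- list(
  chrom="chr[0-9]+",
  ":",
  chromStart="[0-9]+",
  nc::quantifier("-", chromEnd="[0-9]+", "?"))
## One row per subject, one column per named group.
(range.dt <- nc::capture_first_vec(range.vec, range.pattern))
```

Both subjects should match, because the end coordinate is optional;
only the first subject has text to capture in the `chromEnd` group.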