There are several “helper” functions that can simplify the definition of
complex patterns. First we define some functions that will help us
display the patterns:
nc::field for reducing repetition
The nc::field function can be used to avoid repetition when defining
patterns of the form variable: value. The example below shows three
(mostly) equivalent ways to write a regex that captures the text after
the colon and space; the captured text is stored in the variable group
or output column:
show.patterns(
  "variable: (?<variable>.*)",       # repetitive regex string
  list("variable: ", variable=".*"), # repetitive nc R code
  nc::field("variable", ": ", ".*")) # helper function avoids repetition
#> List of 3
#> $ : chr "variable: (?<variable>.*)"
#> $ : chr "(?:variable: (.*))"
#> $ : chr "(?:variable: (?:(.*)))"
Note that the first version above has a named capture group, whereas
the second and third patterns generated by nc have an un-named capture
group and some non-capturing groups (but they all match the same
pattern).
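The claim that all three patterns match the same text can be checked without nc. The sketch below is only an illustration in base R, assuming the pattern strings printed above are given to a PCRE-based matcher (perl = TRUE). The subject string and the first.capture helper are hypothetical and are not part of nc.

```r
# Hypothetical subject; any text of the form "variable: value" would do.
subject <- "variable: some value"

# The three pattern strings printed by show.patterns above.
patterns <- c(
  regex = "variable: (?<variable>.*)",
  nc.list = "(?:variable: (.*))",
  nc.field = "(?:variable: (?:(.*)))")

# Hypothetical helper: return the text matched by the first capture group.
first.capture <- function(pattern, subject) {
  match.info <- regexec(pattern, subject, perl = TRUE)
  # Element 1 is the whole match, element 2 is the first capture group.
  unname(regmatches(subject, match.info)[[1]][2])
}

captured <- vapply(patterns, first.capture, character(1), subject = subject)
print(captured)
```

Each of the three patterns should capture the same text after the colon and space, whether or not the capture group is named.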
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
#> List of 3
#> $ : chr "Alignment (?<Alignment>[0-9]+)"
#> $ : chr "(?:Alignment ([0-9]+))"
#> $ : chr "(?:Alignment (?:([0-9]+)))"
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
#> List of 3
#> $ : chr "Chromosome:\t+(?<Chromosome>.*)"
#> $ : chr "(?:Chromosome:\t+(.*))"
#> $ : chr "(?:Chromosome:\t+(?:(.*)))"
nc::quantifier for fewer parentheses
Another helper function is nc::quantifier, which makes patterns easier
to read by reducing the number of parentheses required to define
sub-patterns with quantifiers. For example, all three patterns below
create an optional non-capturing group which contains a named capture
group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                 # regex string
  list(list("-", chromEnd="[0-9]+"), "?"),     # nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?")) # quantifier helper function
#> List of 3
#> $ : chr "(?:-(?<chromEnd>[0-9]+))?"
#> $ : chr "(?:(?:-([0-9]+))?)"
#> $ : chr "(?:(?:-([0-9]+))?)"
Another example with a named capture group inside an optional
non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
#> List of 3
#> $ : chr "(?: (?<name>[^,}]+))?"
#> $ : chr "(?:(?: ([^,}]+))?)"
#> $ : chr "(?:(?: ([^,}]+))?)"
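To see what an optional non-capturing group does when matching, the following base R sketch embeds the chromEnd pattern from above in a slightly larger regex. The chromStart group, the ^ and $ anchors, and the subjects are assumptions added for illustration; only the optional group itself comes from the example above, and nc is not used.

```r
# Optional non-capturing group around the named group chromEnd, as above.
# The chromStart group and the ^ and $ anchors are added for illustration.
pattern <- "^(?<chromStart>[0-9]+)(?:-(?<chromEnd>[0-9]+))?$"
subjects <- c("100-200", "100", "100-")

# The ? quantifier makes "-digits" optional, but a dangling "-" is rejected.
is.match <- grepl(pattern, subjects, perl = TRUE)
print(is.match)

# Extract chromEnd from the first subject, using the capture attributes
# that regexpr returns when perl = TRUE.
match.info <- regexpr(pattern, subjects[1], perl = TRUE)
start <- attr(match.info, "capture.start")[1, "chromEnd"]
len <- attr(match.info, "capture.length")[1, "chromEnd"]
chromEnd <- substring(subjects[1], start, start + len - 1L)
print(chromEnd)
```

The quantifier applies to the whole group, so the dash and the digits are either both present or both absent.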
nc::alternatives_with_shared_groups for alternatives with identical
named sub-pattern groups
Sometimes each alternative is just a re-arrangement of the same
sub-patterns. For example, consider the following subjects, each of
which is a date in one of two formats.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
In each of the two formats, the month consists of three lower-case
letters, the day consists of two digits, and the year consists of four
digits. Is there a single pattern that can match each of these subjects?
Yes, such a pattern can be defined using the code below:
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
In the code above, we used nc::alternatives_with_shared_groups, which
requires two kinds of arguments:
- named arguments (month, day, year) define sub-pattern groups that
are used in each alternative.
- un-named arguments (last two) define alternative patterns, each of
which can use the sub-pattern group names (month, day, year).
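For comparison, the idea of re-using the same sub-patterns in two arrangements can be sketched in base R without nc. This is not how nc implements the helper; it is a minimal illustration that assumes plain un-named capture groups and backreferences in sub(). The names american, european, normalized, and parsed are hypothetical.

```r
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")

# Shared sub-patterns, each wrapped in an un-named capture group.
month <- "([a-z]{3})"
day <- "([0-9]{2})"
year <- "([0-9]{4})"

# Two arrangements of the same three sub-patterns.
american <- paste0("^", month, " ", day, ", ", year, "$")
european <- paste0("^", day, " ", month, " ", year, "$")

# Rewrite every subject as "month day year", using whichever
# arrangement matches; the group order differs between the two.
normalized <- ifelse(
  grepl(american, subject.vec, perl = TRUE),
  sub(american, "\\1 \\2 \\3", subject.vec, perl = TRUE),
  sub(european, "\\2 \\1 \\3", subject.vec, perl = TRUE))

# Split into one column per shared group, converting day and year.
parts <- do.call(rbind, strsplit(normalized, " ", fixed = TRUE))
parsed <- data.frame(
  month = parts[, 1],
  day = as.integer(parts[, 2]),
  year = as.integer(parts[, 3]))
print(parsed)
```

The manual version has to track the group order of each alternative by hand, which is the bookkeeping that the named arguments of the helper are meant to avoid.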
The pattern can be used for matching, and the result is a data table
with one column for each unique name:
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
#>        american  month   day  year    european
#>          <char> <char> <int> <int>      <char>
#> 1: mar 17, 1983    mar    17  1983
#> 2:                 sep    26  2017 26 sep 2017
#> 3:                 mar    17  1984 17 mar 1984
After having parsed the dates into these three columns, we can add a
date column:
Sys.setlocale(locale="C") # to recognize months in English.
#> [1] "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
#>        american  month   day  year    european       date
#>          <char> <char> <int> <int>      <char>     <IDat>
#> 1: mar 17, 1983    mar    17  1983             1983-03-17
#> 2:                 sep    26  2017 26 sep 2017 2017-09-26
#> 3:                 mar    17  1984 17 mar 1984 1984-03-17
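The conversion above relies on the locale, because %b matches month abbreviations in the current locale. A locale-independent alternative is sketched below in base R. It looks up each month in the built-in constant month.abb and uses as.Date rather than data.table::as.IDate. The three vectors are copied from the table above, and month.int is a hypothetical name.

```r
# Columns copied from the parsed table above.
month <- c("mar", "sep", "mar")
day <- c(17L, 26L, 17L)
year <- c(1983L, 2017L, 1984L)

# month.abb is the base R constant c("Jan", "Feb", ..., "Dec"), so
# matching against its lower-case form does not depend on the locale.
month.int <- match(month, tolower(month.abb))

# Build ISO 8601 strings (YYYY-MM-DD), which as.Date parses by default.
date <- as.Date(sprintf("%04d-%02d-%02d", year, month.int, day))
print(date)
```

This avoids changing the locale of the R session, at the cost of supporting only English three-letter abbreviations.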
Another example is parsing given and family names, in two different
formats:
nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)
#>           given_first      given  family        family_first
#>                <char>     <char>  <char>              <char>
#> 1: Toby Dylan Hocking Toby Dylan Hocking
#> 2:                    Toby Dylan Hocking Hocking, Toby Dylan