---
title: "Capture all matches in a single subject string"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Capture all matches in a single subject string}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The `nc::capture_all_str` function handles the common case of
extracting every match from a multi-line text file, which is treated
as a single large subject string. In this section we demonstrate how
to extract data tables from such loosely structured text data. For
example, consider the following [track
hub](http://genome.cse.ucsc.edu/goldenPath/help/hgTrackHubHelp.html)
meta-data file:

```{r}
trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="nc")
trackDb.vec <- readLines(trackDb.txt.gz)
```

Some representative lines from that file are shown below.

```{r}
cat(trackDb.vec[78:107], sep="\n")
```

## Match all tracks in the text file

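Each block of text begins with "track" and includes several lines of
data before the block ends at the first blank line (two consecutive
newlines). `readLines` returns one element per line, and
`nc::capture_all_str` pastes those elements together with newlines
before matching, so the pattern can refer to `\n` directly. That
pattern is coded below using a regex: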

```{r}
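tracks.dt <- nc::capture_all_str(
  trackDb.vec,
  "track ",
  track="\\S+",            # track name: one or more non-space characters
  fields="(?:\n[^\n]+)*",  # zero or more non-empty lines
  "\n")                    # newline that ends the last line of the block
str(tracks.dt)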
```

The result is a data.table with one row for each track block that
matches the regex. There are two character columns: `track` is a
unique name, and `fields` is a string with the rest of the data
in that block:

```{r}
tracks.dt[, .(track, fields.start=substr(fields, 1, 30))]
```
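
To see the block pattern in isolation, the following minimal sketch
applies it to a small made-up subject. The `toy.lines` values are
hypothetical and are not taken from `trackDb.txt.gz`; the sketch
assumes, as in the code above, that the elements of the character
vector are joined with newlines before matching.

```{r}
# Hypothetical input: two track blocks, each ended by an empty line.
toy.lines <- c(
  "track a_McGill0001Coverage",
  "type bigWig",
  "color 0,0,255",
  "",
  "track b_McGill0002Peaks",
  "type bigBed 5",
  "")
toy.tracks <- nc::capture_all_str(
  toy.lines,
  "track ",
  track="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")
# One row per block, with character columns track and fields.
toy.tracks
```

Each `fields` value starts with a newline, because every repetition of
`\n[^\n]+` consumes the newline that precedes the line it matches.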

## Match all fields in each track

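Each block has a variable number of lines/fields. Each line starts
with a field name, followed by a space, followed by the field
value. That pattern is coded below; the non-greedy `.*?` makes
`variable` stop at the first space, so the value itself may contain
spaces: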

```{r}
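(fields.dt <- tracks.dt[, nc::capture_all_str(
  fields,
  "\\s+",           # newline (and any other whitespace) before a field
  variable=".*?",   # field name: shortest text before the first space
  " ",
  value="[^\n]+"),  # field value: rest of the line
  by=track])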
str(fields.dt)
```
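
The same by-group idiom can be tried on a small hand-made table. This
is a minimal sketch: `toy.blocks` and its contents are hypothetical,
chosen only to mimic the shape of `tracks.dt`.

```{r}
library(data.table)
# Hypothetical stand-in for tracks.dt: one row per track block.
toy.blocks <- data.table(
  track=c("a", "b"),
  fields=c("\ntype bigWig\ncolor 0,0,255", "\ntype bigBed 5"))
# Run the field pattern once per track, then stack the results.
toy.fields <- toy.blocks[, nc::capture_all_str(
  fields,
  "\\s+",
  variable=".*?",
  " ",
  value="[^\n]+"),
  by=track]
toy.fields
```

Track `b` shows why `variable` is non-greedy: its value `bigBed 5`
contains a space, yet the field name is still `type`.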

Note that because `by=track` was specified, `nc::capture_all_str` is
called for each unique value of `track` (i.e. each row). The results
are combined into a single data.table with one row for each
field. This data.table can be easily queried, e.g. to look up one
field of one track, or to count how often each field name occurs:

```{r}
fields.dt[
  J("tcell_McGill0107Coverage", "bigDataUrl"),
  value,
  on=.(track, variable)]
fields.dt[, .(count=.N), by=variable][order(count)]
```

For more information about data.table syntax, read
`vignette("datatable-intro", package="data.table")`.
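
Because the result is in long format (one row per field), a wide table
with one column per field name can be obtained with
`data.table::dcast`. The following is a minimal sketch on a
hypothetical long table (`toy.long` is made up for illustration and
only mimics the columns of `fields.dt`):

```{r}
library(data.table)
# Hypothetical long table: one row per (track, field) pair.
toy.long <- data.table(
  track=c("a", "a", "b"),
  variable=c("type", "color", "type"),
  value=c("bigWig", "0,0,255", "bigBed 5"))
# One row per track, one column per field name.
toy.wide <- dcast(toy.long, track ~ variable, value.var="value")
toy.wide
```

Fields that a track does not define (here `color` for track `b`) are
filled with `NA`.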

## Match all tracks and some fields with one regex

In the examples above we extracted all fields from all tracks (using
two regexes, one for the track, one for the field). In the example
below we extract only the track name, split into separate columns
(using a single regex for the track).

```{r}
cell.sample.type <- list(
  cellType="[^ ]*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks")
nc::capture_all_str(trackDb.vec, cell.sample.type)
```
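
The nested groups and the `as.integer` conversion are easier to
inspect on a short subject. The sketch below repeats the same pattern
structure on a made-up string; `toy.subject` is hypothetical (its
second track name is invented to follow the same naming scheme).

```{r}
# Hypothetical subject with two track names.
toy.subject <- paste(
  "track tcell_McGill0107Coverage",
  "track bcell_McGill0091Peaks",
  sep="\n")
toy.names <- nc::capture_all_str(
  toy.subject,
  cellType="[^ ]*?",
  "_",
  sampleName=list(          # outer group: e.g. McGill0107
    "McGill",
    sampleID="[0-9]+", as.integer),  # inner group, converted to integer
  dataType="Coverage|Peaks")
str(toy.names)
```

The outer group `sampleName` keeps the full text it matched, whereas
the inner group `sampleID` is converted, so leading zeros are dropped.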

Note that the pattern above defines nested capture groups via named
lists (e.g. the `sampleID` group is nested inside the `sampleName`
group). The pattern below matches either the previously specified
track pattern, or any other type of track name:

```{r}
sample.or.anything <- list(
  cell.sample.type,
  "|",
  "[^\n]+")
track.pattern.old <- list(
  "track ",
  track=sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern.old)
```

Notice the repetition of `track` in the pattern above. This can be
avoided by using the `nc::field` helper function, which takes three
arguments that are pasted together to form a pattern:

* `field.name` is used as a pattern, and as the capture group
  (column) name for the pattern specified in the third argument.
* `between.pattern` is a pattern that matches between the other two patterns.
* `field.pattern` is the pattern that matches the text to be extracted
  in a capture group.
  
The example above can thus be rewritten as follows:

```{r}
track.pattern <- nc::field("track", " ", sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern)
```
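
To make the role of the three arguments concrete, the sketch below
compares a pattern written by hand with the same pattern written using
`nc::field`. The subject `toy.text` is a made-up string, not taken
from the track hub file.

```{r}
# Hypothetical subject with two type fields.
toy.text <- "track foo\ntype bigWig\n\ntrack bar\ntype bigBed\n"
# Written by hand: "type" appears as literal text and as group name.
by.hand <- nc::capture_all_str(
  toy.text,
  "type", " ", type="[^\n]+")
# Written with nc::field: "type" appears only once.
with.field <- nc::capture_all_str(
  toy.text,
  nc::field("type", " ", "[^\n]+"))
all.equal(by.hand, with.field)
with.field
```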

Finally we use `nc::field` again to also match the `type` field:

```{r}
any.lines.pattern <- "(?:\n[^\n]+)*"
nc::capture_all_str(
  trackDb.vec,
  track.pattern,
  any.lines.pattern,
  "\\s+",
  nc::field("type", " ", "[^\n]+"))
```

Exercise for the reader (easy): modify the above regex in order to capture
the bigDataUrl field, and three additional columns (red, green, blue)
from the color field. Assume that `bigDataUrl` occurs before `color`
in each track. Note that this is a limitation of the single regex
approach: using two regexes, as described in previous sections, could
extract any/all fields, even if they appear in different orders in
different tracks.

Exercise for the reader (hard): note that the last code block only
matches tracks which define the type field. How would you optionally
match the type field? Hint: the current `any.lines.pattern` can match
the type field.

# Parsing SweeD output files

Thanks to Marc Tollis for providing the example data used in this
section (from the SweeD bioinformatics program). Some representative
lines from one output file are shown below.

```{r}
info.txt.gz <- system.file(
  "extdata", "SweeD_Info.txt.gz", package="nc")
info.vec <- readLines(info.txt.gz)
info.vec[20:50]
```

The Alignment numbers in the info file must be matched with the
numbers after the double slashes (`//`) in the report file:

```{r}
report.txt.gz <- system.file(
  "extdata", "SweeD_Report.txt.gz", package="nc")
report.vec <- readLines(report.txt.gz)
cat(report.vec[1:10], sep="\n")
cat(report.vec[1000:1010], sep="\n")
```

The goal is to produce a bed file, which has tab-separated values with
four columns: chrom, chromStart, chromEnd, Likelihood. The chrom
values appear in the info file (Chromosome) so we will need to join
the two files based on alignment ID.
First we capture all alignments in the info file:

```{r}
(info.dt <- nc::capture_all_str(
  info.vec,
  "Alignment ",
  alignment="[0-9]+",
  "\n\n\t\tChromosome:\t\t",
  chrom=".*",
  "\n"))
```
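
The info pattern can be checked on a few hand-written lines. This is a
minimal sketch: `toy.info.lines` is hypothetical, and only its
`Alignment` and `Chromosome` lines follow the layout that the pattern
above expects (the scaffold names and the extra line are made up).

```{r}
# Hypothetical excerpt in the same layout as the info file.
toy.info.lines <- c(
  " Alignment 1",
  "",
  "\t\tChromosome:\t\tscaffold_0",
  "\t\tSequences:\t\t14",
  "",
  " Alignment 2",
  "",
  "\t\tChromosome:\t\tscaffold_1",
  "\t\tSequences:\t\t14")
toy.info.dt <- nc::capture_all_str(
  toy.info.lines,
  "Alignment ",
  alignment="[0-9]+",
  "\n\n\t\tChromosome:\t\t",
  chrom=".*",   # . does not match newline, so this stops at end of line
  "\n")
toy.info.dt
```

No conversion function is given, so `alignment` is a character column.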

Then we capture all alignment/csv blocks in the report file:

```{r}
(report.dt <- nc::capture_all_str(
  report.vec,
  "//",
  alignment="[0-9]+",
  "\n",
  csv="[^/]+"
)[, {
  data.table::fread(text=csv)
}, by=alignment])
```
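
The capture-then-`fread` idiom can be sketched on a small made-up
report. `toy.report` is hypothetical: it assumes tab-separated blocks
with a header line, each introduced by `//` and an alignment number.

```{r}
library(data.table)
# Hypothetical report with two blocks of tab-separated values.
toy.report <- paste0(
  "//1\nPosition\tLikelihood\n100\t0.5\n200\t1.5\n\n",
  "//2\nPosition\tLikelihood\n300\t2.5\n")
toy.blocks.dt <- nc::capture_all_str(
  toy.report,
  "//",
  alignment="[0-9]+",
  "\n",
  csv="[^/]+")   # everything up to the next slash
# Parse the text of each block, keeping alignment as a key column.
toy.report.dt <- toy.blocks.dt[, {
  data.table::fread(text=csv)
}, by=alignment]
toy.report.dt
```

This relies on the csv text never containing a slash, since `[^/]+`
stops at the first one.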

Note that because `by=alignment` was specified, `fread` is called for
each unique value of `alignment` (i.e. each row). The results are
combined into a single data.table with all of the csv data from the
original file, plus the additional `alignment` column. Next, we join
this table to the previous table in order to get the `chrom` column:

```{r}
(join.dt <- report.dt[info.dt, on=.(alignment)])
```

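Finally the desired bed table can be created as follows. Intervals in
a bed file are zero-based and half-open, so a single `Position` P
becomes the interval from P-1 to P: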

```{r}
join.dt[, .(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood)]
```
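
To finish with an actual tab-separated file, the table can be written
with `data.table::fwrite`. The sketch below uses small hypothetical
tables (`toy.chrom.dt` and `toy.like.dt` are made up to mimic
`info.dt` and `report.dt`) and writes to a temporary file, without a
header line.

```{r}
library(data.table)
# Hypothetical stand-ins for info.dt and report.dt.
toy.chrom.dt <- data.table(
  alignment=c("1", "2"),
  chrom=c("scaffold_0", "scaffold_1"))
toy.like.dt <- data.table(
  alignment=c("1", "1", "2"),
  Position=c(100, 200, 300),
  Likelihood=c(0.5, 1.5, 2.5))
# Join on alignment, then build the four bed columns.
toy.bed <- toy.like.dt[toy.chrom.dt, on=.(alignment)][, .(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood)]
# Write tab-separated values to a temporary file.
toy.bed.path <- tempfile(fileext=".bed")
data.table::fwrite(toy.bed, toy.bed.path, sep="\t", col.names=FALSE)
cat(readLines(toy.bed.path), sep="\n")
```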

Exercise for the reader (easy): notice that the code above for
creating `info.dt` involves repetition in the pattern and group names
(`alignment`, `Alignment`, `chrom`, `Chromosome`). Re-write the
pattern using `nc::field` in order to eliminate that repetition.

Exercise for the reader (hard): notice that Chromosome is only the
first field. How could you extract the other fields as well? Hint:
use `nc::field` in a helper function in order to avoid repetition.