# Peg (Parsing Expression Grammars)
A common task for developers is to recognize patterns in text, be it
filtering emails from a list or extracting data from a CSV file. Programming
languages and libraries usually offer a number of tools for this, including prebuilt
parsers, simple operations on strings (splitting a string on commas), and regular expressions.
The pre-built or custom-built parser is usually the most robust solution, but can
be very complex to maintain and may not exist for many languages. String functions are not
powerful enough for a large class of languages, and regular expressions can be hard to read
(which characters are escaped?) and underpowered (don't parse HTML with regex!).

PEGs, or Parsing Expression Grammars, are another formalism for recognizing languages. They are
easier to write than a custom parser and more powerful than regular expressions. They also
can produce grammars that are easily understandable and moderately fast. PEGs can also be compiled
to a bytecode format that can be reused.

Below is a simple example for checking if a string is a valid IP address. Notice how
the grammar is descriptive enough that you can read it even if you don't know the peg
syntax (the example is translated from a [Red language blog post](https://www.red-lang.org/2013/11/041-introducing-parse.html)).
```
(def ip-address
 '{:dig (range "09")
   :0-4 (range "04")
   :0-5 (range "05")
   :byte (choice
           (sequence "25" :0-5)
           (sequence "2" :0-4 :dig)
           (sequence "1" :dig :dig)
           (between 1 2 :dig))
   :main (sequence :byte "." :byte "." :byte "." :byte)})

(peg/match ip-address "0.0.0.0") # -> @[]
(peg/match ip-address "elephant") # -> nil
(peg/match ip-address "256.0.0.0") # -> nil
```
## Primitive Patterns
Larger patterns are built up with primitive patterns, which recognize individual
characters, string literals, or a given number of characters. A character in Janet
is considered a byte, so PEGs will work on any string of bytes. No special meaning is
given to the 0 byte, which serves as the string terminator in many languages.

| Pattern | Alias | What it Matches |
| --- | --- | --- |
| string (`"cat"`) | | Matches the literal string and advances past it. |
| integer (`3`) | | Matches that number of characters and advances that many characters. If negative, matches only if fewer than that many characters remain, and does not advance. For example, `-1` matches the end of a string. |
| `(range "az" "AZ")` | | Matches one character in any of the given ranges and advances 1 character. Multiple ranges can be combined together. |
| `(set "abcd")` | | Matches any one character in the argument string and advances 1 character. |
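
As a rough sketch of how the primitives in the table behave, the calls below pass each one directly to `peg/match`. This assumes `peg/match` accepts a bare primitive as a whole pattern and anchors the match at the start of the input, as the examples above suggest. The input strings are made up for illustration.

```janet
# String literal: matches at the start of the input; trailing text is allowed.
(peg/match "cat" "catalog")          # -> @[]
(peg/match "cat" "dog")              # -> nil

# Positive integer: needs at least that many characters.
(peg/match 3 "abcd")                 # -> @[]
(peg/match 3 "ab")                   # -> nil

# Negative integer: -1 succeeds only when no characters remain.
(peg/match -1 "")                    # -> @[]
(peg/match -1 "a")                   # -> nil

# range and set each consume exactly one character.
(peg/match '(range "az" "AZ") "Q")   # -> @[]
(peg/match '(range "az" "AZ") "7")   # -> nil
(peg/match '(set "abcd") "c")        # -> @[]
(peg/match '(set "abcd") "e")        # -> nil
```

A successful match with no captures returns an empty array, and a failed match returns `nil`.
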
## Combining Patterns
These primitive patterns are combined with a few specials to match a wide range of languages.
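
A minimal sketch of combining primitives, using only the specials that already appear in the IP address example (`sequence`, `choice`, and `between`). The patterns and inputs here are illustrative, not taken from the Janet sources.

```janet
# sequence: every subpattern must match, one after another.
(peg/match '(sequence "a" "b") "ab")     # -> @[]
(peg/match '(sequence "a" "b") "ac")     # -> nil

# choice: alternatives are tried in order; the first one that matches wins.
(peg/match '(choice "cat" "dog") "dog")  # -> @[]
(peg/match '(choice "cat" "dog") "cow")  # -> nil

# between: match the subpattern at least 1 and at most 2 times.
# The trailing -1 requires the end of the input, so a third digit fails.
(peg/match '(sequence (between 1 2 (range "09")) -1) "42")   # -> @[]
(peg/match '(sequence (between 1 2 (range "09")) -1) "123")  # -> nil
```
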
## Grammars and Recursion
Parsing Expression Grammars try to match an input text with a pattern in a greedy manner.
This means that if a rule fails to match, that rule will fail and not try again. The only
backtracking provided in a peg is provided by the `(choice x y z ...)` special, which will
try rules in order until one succeeds, at which point the whole pattern succeeds. If no subpattern
succeeds, then the whole pattern fails. Note that this means that the order of `x y z` in choice
DOES matter. If y matches everything that z matches, z will never succeed.
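
The effect of ordering can be sketched with two small patterns. The names `short-first` and `long-first` are hypothetical and exist only for this illustration. Each pattern ends with `-1`, so it succeeds only if the whole input is consumed.

```janet
# "a" is tried first and succeeds, so "ab" is never attempted.
# The trailing -1 then fails because "b" is still unread,
# and choice does not go back to try its later alternatives.
(def short-first '(sequence (choice "a" "ab") -1))

# Putting the longer alternative first lets both inputs match.
(def long-first '(sequence (choice "ab" "a") -1))

(peg/match short-first "a")   # -> @[]
(peg/match short-first "ab")  # -> nil
(peg/match long-first "a")    # -> @[]
(peg/match long-first "ab")   # -> @[]
```

A common rule of thumb that follows from this is to list longer or more specific alternatives before shorter ones that match a prefix of them.
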