Add parser documentation.

2025-11-27 04:25:14 +00:00 · 2018-12-15 20:14:34 -05:00
parent 24b9ae7820
commit a412eecd36
1 changed files with 245 additions and 0 deletions
--- a/doc/Parser.md
+++ b/doc/Parser.md
@@ -0,0 +1,245 @@
+# The Parser
+
+A Janet program begins life as a text file, just a sequence of bytes like
+any other on your system. Janet source files should be UTF-8 or ASCII
+encoded. Before Janet can compile or run your program, it must transform
+your source code into a data structure. Janet is a lisp, which means it is
+homoiconic - code is data, so all of the facilities in the language for
+manipulating arrays, tuples, strings, and tables can be used for manipulating
+your source code as well.
+
+But before Janet code is represented as a data structure, it must be read, or parsed,
+by the Janet parser. Called the reader in many other lisps, the parser is a machine
+that takes in plain text and outputs data structures which can be used by both
+the compiler and macros. In Janet, it is a parser rather than a reader because
+there is no code execution at read time. This is safer and simpler, and also
+lets Janet syntax serve as a robust data interchange format. While the parser
+is not extensible, the Janet philosophy is to extend the language via macros
+rather than reader macros.
+
+## Nil, True and False
+
+Nil, true and false are all literals that can be entered as such
+in the parser.
+
+```
+nil
+true
+false
+```
+
+## Symbols
+
+Janet symbols are represented as a sequence of alphanumeric characters
+not starting with a digit. They can also contain the characters
+\!, @, $, \%, \^, \&, \*, -, \_, +, =, \|, \~, :, \<, \>, ., \?, \\, /, as
+well as any Unicode codepoint not in the ASCII range.
+
+By convention, most symbols should be all lower case and use dashes to connect words
+(sometimes called kebab case).
+
+Symbols that come from another module often contain a forward slash that separates
+the name of the module from the name of the definition in the module.
+
+```
+symbol
+kebab-case-symbol
+snake_case_symbol
+my-module/my-function
+*****
+!%$^*__--__._+++===~-crazy-symbol
+*global-var*
+你好
+```
+
+## Keywords
+
+Janet keywords are really just symbols that begin with the character :. However, they
+are used differently and treated by the compiler as a constant rather than a name for
+something. Keywords are used mostly for keys in tables and structs, or pieces of syntax
+in macros.
+
+```
+:keyword
+:range
+:0x0x0x0
+:a-keyword
+::
+:
+```
+
+## Numbers
+
+Janet numbers are represented by either 32 bit integers or
+IEEE-754 floating point numbers. The syntax is similar to that of many other
+languages. Numbers can be written in base 10, with
+underscores used to separate digits into groups. A decimal point can be used for floating
+point numbers. Numbers can also be written in other bases by prefixing the number with the desired
+base and the character 'r'. For example, 16 can be written as `16`, `1_6`, `16r10`, `4r100`, or `0x10`. The
+`0x` prefix can be used for hexadecimal as it is so common. The radix itself must be written in base 10, and
+can be any integer from 2 to 36. For any radix above 10, use letters as digits (not case sensitive).
+
+```
+0
+12
+-65912
+4.98
+1.3e18
+1.3E18
+18r123C
+11raaa&a
+1_000_000
+0xbeef
+```
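+
+As a small sketch of the equivalence described above, the following compares
+the five spellings of 16 given in the text. It assumes only the core `=` and
+`print` functions, and should print `true` if the literals all parse to the
+same number.
+
+```
+# Every literal below is a different spelling of the number 16.
+(print (= 16 1_6 16r10 4r100 0x10))
+```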
+
+## Strings
+
+Strings in Janet are surrounded by double quotes. Strings are 8-bit clean,
+meaning they can contain any arbitrary sequence of bytes, including embedded
+0s. To insert a double quote into a string itself, escape
+the double quote with a backslash. For unprintable characters, you can either use
+one of a few common escapes, or use the `\xHH` escape to escape a single byte in
+hexadecimal. The supported escapes are:
+
+    - \\xHH Escape a single arbitrary byte in hexadecimal.
+    - \\n Newline (ASCII 10)
+    - \\t Tab character (ASCII 9)
+    - \\r Carriage Return (ASCII 13)
+    - \\0 Null (ASCII 0)
+    - \\z Null (ASCII 0)
+    - \\f Form Feed (ASCII 12)
+    - \\e Escape (ASCII 27)
+    - \\" Double Quote (ASCII 34)
+    - \\\\ Backslash (ASCII 92)
+
+Strings can also contain literal newline characters that will be ignored.
+This lets one define a multiline string that does not contain newline characters.
+
+An alternative way of representing strings in Janet is the long string, or the backquote
+delimited string. A long string starts with a certain number of
+backquotes, and ends with the same number of backquotes. Long strings
+do not contain escape sequences; all bytes will be parsed literally until
+the ending delimiter is found. This is useful
+for defining multiline strings with literal newline characters, unprintable
+characters, or strings that would otherwise require many escape sequences.
+
+```
+"This is a string."
+"This\nis\na\nstring."
+"This
+is
+a
+string."
+``
+This
+is
+a
+string
+``
+```
+
+## Buffers
+
+Buffers are similar to strings except they are mutable data structures. Strings in Janet
+cannot be mutated after they are created, whereas a buffer can be changed after creation.
+The syntax for a buffer is the same as that for a string or long string, but
+the buffer must be prefixed with the '@' character.
+
+```
+@""
+@"Buffer."
+@``Another buffer``
+```
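+
+To illustrate the difference in mutability, here is a minimal sketch that
+appends to a buffer literal in place. It assumes the core library function
+`buffer/push-string`, which is not covered in this document; the name `buf`
+is only for illustration.
+
+```
+# A buffer literal can be changed after it is created.
+(def buf @"Buffer.")
+# buffer/push-string is assumed from the core library.
+(buffer/push-string buf " Now longer.")
+(print buf)
+```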
+
+## Tuples
+
+Tuples are a sequence of whitespace-separated values surrounded by either parentheses
+or brackets. The parser considers any of the characters ASCII 32, \\0, \\f, \\n, \\r or \\t
+to be whitespace.
+
+```
+(do 1 2 3)
+[do 1 2 3]
+```
+
+## Arrays
+
+Arrays are the same as tuples, but have a leading @ to indicate mutability.
+
+```
+@(:one :two :three)
+@[:one :two :three]
+```
+
+## Structs
+
+Structs are represented by a sequence of whitespace delimited key value pairs
+surrounded by curly braces. The sequence is defined as key1, value1, key2, value2, etc.
+There must be an even number of items between curly braces or the parser will
+signal a parse error. Any value can be a key or value. Using nil as a key or
+value, however, will drop that pair from the parsed struct.
+
+```
+{}
+{:key1 "value1" :key2 :value2 :key3 3}
+{(1 2 3) (4 5 6)}
+{@[] @[]}
+{1 2 3 4 5 6}
+```
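+
+The nil rule above can be checked with a short sketch. It assumes the core
+`length` function; the name `s` is only for illustration. If the pair with
+the nil value is dropped as described, the struct should report one entry.
+
+```
+# The :b pair has a nil value, so it should be dropped from the struct.
+(def s {:a 1 :b nil})
+(print (length s))
+```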
+
+## Tables
+
+Tables have the same syntax as structs, except they have the @ prefix to indicate
+that they are mutable.
+
+```
+@{}
+@{:key1 "value1" :key2 :value2 :key3 3}
+@{(1 2 3) (4 5 6)}
+@{@[] @[]}
+@{1 2 3 4 5 6}
+```
+
+## Comments
+
+Comments begin with a \# character and continue until the end of the line.
+There are no multiline comments. To get the effect of a multiline comment, use a
+string literal.
+
+## Shorthands
+
+Often called reader macros in other lisps, Janet provides several shorthand
+notations for some forms.
+
+### 'x
+
+Shorthand for `(quote x)`
+
+### ;x
+
+Shorthand for `(splice x)`
+
+### \`x
+
+Shorthand for `(quasiquote x)`
+
+### ,x
+
+Shorthand for `(unquote x)`
+
+These shorthand notations can be combined in any order, allowing
+forms like `''x` (`(quote (quote x))`), or `,;x` (`(unquote (splice x))`).
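+
+As a quick sketch of how the shorthands expand, the following compares a
+doubly quoted symbol with the same form written out using `quote`. It assumes
+only the core `=` and `print` functions, and should print `true` if the two
+forms parse to the same data.
+
+```
+# ''x is read as (quote (quote x)), which evaluates to the tuple (quote x).
+(print (= ''x '(quote x)))
+```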
+
+## API
+
+The parser contains the following functions, which expose
+the parser state machine as a Janet abstract object.
+
+- `parser/byte`
+- `parser/consume`
+- `parser/error`
+- `parser/flush`
+- `parser/new`
+- `parser/produce`
+- `parser/state`
+- `parser/status`
+- `parser/where`