## Goal

Implement a correct parser for the language described in SYNTAX.md, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).
Code quality is not the primary concern. Correctness, clarity, and reasonable error messages are.
## Overall architecture
The parser is split into two stages:
- **Lexing (tokenization):** converts source text into a stream of tokens, each with precise source location info.
- **Parsing:** consumes the token stream and constructs the AST using recursive-descent parsing.
This split is deliberate and should be preserved.
## Stage 1: Lexer (Tokenizer)

### Purpose
The lexer exists to:
- Normalize the input into a small set of token types
- Track line / column / offset precisely
- Make parsing simpler and more reliable
- Enable good error messages later
The lexer is intentionally simple and dumb:
- No semantic decisions
- No AST construction
- Minimal lookahead
### Unicode handling
The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.
Important rule:
- Iterate over Unicode code points, not UTF-16 code units.
In TypeScript:
- Use `for (const ch of input)` or equivalent
- Do not index into strings with `input[i]`
Column counting:
- Increment column by 1 per code point
- Exact visual width is not required
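The two rules above can be sketched as a small code-point cursor. This is illustrative only; the `codePoints` name and the `[char, position]` pair it yields are assumptions, not part of the spec.

```typescript
// Hypothetical sketch: iterate code points (not UTF-16 units) while
// tracking offset / line / column per the rules above.
type Pos = { offset: number; line: number; column: number };

function* codePoints(input: string): Generator<[string, Pos]> {
  let offset = 0;
  let line = 1;
  let column = 1;
  // `for...of` steps by code point, so an emoji advances the column by 1.
  for (const ch of input) {
    yield [ch, { offset, line, column }];
    offset += 1;
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
}
```

Note that indexing with `input[i]` would split a surrogate pair like `😀` into two bogus "characters", which is exactly what this iteration style avoids.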
### Source positions and spans
All tokens must carry precise location information.
Suggested types (can be adjusted):
```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```
Each token has a span.
### Token types
Suggested minimal token set:
```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```

Where:

```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```
Notes:
- Operators like `+`, `==`, `<=`, `*` are identifiers
- `=` is treated as a keyword (same for `|`)
- Identifiers are lexed first, then checked against the keyword set
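A sketch of that order: lex the whole identifier-shaped run first, then reclassify. The keyword list follows the suggested `Keyword` type above; `classifyWord` is a hypothetical helper name.

```typescript
// Lex an identifier run first, then check it against the keyword set;
// anything not in the set stays an identifier (including operators).
const KEYWORDS = new Set(["let", "fn", "match", "apply", "=", "!", "|"]);

function classifyWord(word: string): { kind: "keyword" | "identifier"; value: string } {
  return KEYWORDS.has(word)
    ? { kind: "keyword", value: word }
    : { kind: "identifier", value: word };
}
```

This keeps the lexer "dumb": it never needs lookahead to decide whether `match` is a keyword or a variable, because that is purely a set-membership check on the finished word.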
### Lexer responsibilities
The lexer should:
- Skip whitespace (spaces, tabs, newlines)
- Track line and column numbers
- Emit tokens with correct spans
- Fail immediately on:
  - Unterminated string literals
  - Invalid characters
The lexer should not:
- Attempt error recovery
- Guess intent
- Validate grammar rules
## Stage 2: Parser

### Parsing strategy
Use recursive-descent parsing.
The grammar is:
- Context-free
- Non-left-recursive
- No precedence rules
- No implicit associativity
This makes recursive descent ideal.
### Parser state
The parser operates over:
```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```
Helper methods are encouraged:
```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```
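One possible shape for these helpers, with the token type trimmed to what the sketch needs. The error message format is an assumption, not mandated:

```typescript
// Sketch of the suggested helpers. `match*` consumes on success and
// reports whether it did; `expect*` turns a failed match into an error.
type Tok = { kind: string; value?: string };

class ParserSketch {
  constructor(private tokens: Tok[], private pos = 0) {}

  peek(): Tok {
    return this.tokens[this.pos];
  }

  advance(): Tok {
    const tok = this.tokens[this.pos];
    if (tok.kind !== "eof") this.pos++; // never step past EOF
    return tok;
  }

  matchKeyword(kw: string): boolean {
    const tok = this.peek();
    if (tok.kind === "keyword" && tok.value === kw) {
      this.pos++;
      return true;
    }
    return false;
  }

  expectKeyword(kw: string): Tok {
    const tok = this.peek();
    if (!this.matchKeyword(kw)) {
      throw new Error(`expected keyword '${kw}', found ${tok.kind}`);
    }
    return tok;
  }
}
```

Keeping `eof` as a real token (rather than returning `undefined` past the end) means `peek()` never needs a bounds check.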
### Error handling
Error recovery is not required.
On error:
- Throw a `ParseError`
- Include:
  - A clear message
  - A span pointing to the offending token (or best approximation)
The goal is:
- One good error
- Accurate location
- No cascading failures
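A minimal `ParseError` along these lines, assuming the suggested `Position`/`Span` shapes; embedding the start position in the message is one possible choice, not a requirement:

```typescript
// Sketch: a ParseError carrying an optional span; when a span is
// available, the message is prefixed with its start line:column.
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

class ParseError extends Error {
  constructor(message: string, public span?: Span) {
    super(span ? `${span.start.line}:${span.start.column}: ${message}` : message);
    this.name = "ParseError";
  }
}
```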
### Expression parsing
There is no precedence hierarchy.
`parseExpr()` should:

1. Look at the next token
2. Dispatch to the correct parse function based on:
   - keyword (e.g. `let`, `fn`, `match`, `apply`)
   - symbol (e.g. `$`, `#`, `(`, `{`)
   - identifier (e.g. a top-level function call)
Order matters.
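The dispatch can be sketched as a routing function. The returned labels are placeholders for calls into hypothetical per-form parsers (`parseLet`, `parseTuple`, ...); the real function would call them and return AST nodes.

```typescript
// Sketch of parseExpr()'s dispatch order: keyword, then symbol, then
// identifier. Returned strings stand in for the real parse calls.
type HeadTok = { kind: string; value?: string };

function dispatchExpr(tok: HeadTok): string {
  if (tok.kind === "keyword") {
    switch (tok.value) {
      case "let":   return "parseLet";
      case "fn":    return "parseFn";
      case "match": return "parseMatch";
      case "apply": return "parseApply";
    }
  }
  if (tok.kind === "symbol") {
    switch (tok.value) {
      case "$": return "parseVarUse";
      case "#": return "parseTag";
      case "(": return "parseTuple";
      case "{": return "parseBraceForm"; // placeholder name
    }
  }
  if (tok.kind === "identifier") return "parseCall";
  throw new Error(`unexpected token at start of expression: ${tok.kind}`);
}
```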
### Important parsing rules

#### Variable use

```
$x
```

- `$` immediately followed by an identifier
- No whitespace allowed
#### Tag expressions

```
#foo
#foo expr
```

Parsing rule:

1. After `#tag`, look at the next token
2. If the next token can start an expression and is not a terminator (`)`, `}`, `,`, `|`, `.`): parse a tagged-expr
3. Otherwise: parse a tag-expr
This rule is intentional and should be implemented directly.
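Assuming the token classification above (`|` lexes as a keyword, the other terminators as symbols), the lookahead test might look like this; `tagHasPayload` is a hypothetical name:

```typescript
// Sketch: does the token after #tag begin a payload expression?
// A terminator or EOF means the tag stands alone (tag-expr).
type NextTok = { kind: string; value?: string };

const SYMBOL_TERMINATORS = new Set([")", "}", ",", "."]);

function tagHasPayload(tok: NextTok): boolean {
  if (tok.kind === "eof") return false;
  if (tok.kind === "symbol" && SYMBOL_TERMINATORS.has(tok.value!)) return false;
  if (tok.kind === "keyword" && tok.value === "|") return false;
  return true; // anything else can start the tagged expression
}
```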
#### Tuples vs grouping

Parentheses always construct tuples.

```
()
(123)
(1, 2, 3)
```

Parentheses are not used for grouping expressions, so `(123)` is NOT the same as `123`.
#### Lists with separators

Many constructs use:

```
list-sep-by(p, sep)
```
This allows:
- Empty lists
- Optional leading separator
- Optional trailing separator
Implement a reusable helper that:
- Stops at a known terminator token
- Does not allow repeated separators without elements
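A sketch of the helper, with `peek`/`next` callbacks standing in for the real token stream (names and shape are illustrative, not mandated):

```typescript
// Sketch of list-sep-by(p, sep): allows empty lists, an optional
// leading separator, and an optional trailing separator. A doubled
// separator falls through to parseElem, which is expected to fail on it.
function listSepBy<T>(
  peek: () => string,
  next: () => string,
  parseElem: () => T,
  sep: string,
  terminator: string,
): T[] {
  const items: T[] = [];
  if (peek() === sep) next(); // optional leading separator
  while (peek() !== terminator) {
    items.push(parseElem());
    if (peek() === sep) {
      next(); // separator consumed; a terminator next means it was trailing
    } else if (peek() !== terminator) {
      throw new Error(`expected '${sep}' or '${terminator}'`);
    }
  }
  return items;
}
```

Passing the terminator in explicitly means the helper never has to guess where a list ends, which keeps it reusable across tuple bodies, match branches, and any other separated construct.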
### Parsing patterns

Patterns are parsed only in specific contexts:

- `match` branches
- `let` bindings
- lambda parameters

There are two distinct pattern parsers:

- `parsePattern()`: full patterns (including tags)
- `parseProductPattern()`: no tags allowed
These should be separate functions.
## AST construction
Parser functions should construct AST nodes directly, matching the existing AST types exactly.
If necessary, spans may be:
- Stored directly on AST nodes, or
- Discarded after parsing
Either is acceptable.
## Division of responsibility
Lexer:
- Characters → tokens
- Unicode-safe
- Tracks positions
Parser:
- Tokens → AST
- Grammar enforcement
- Context-sensitive decisions
- Error reporting
Do not merge these stages.
## Final notes
- Favor clarity over cleverness
- Favor explicit structure over abstraction
- Assume the grammar in `SYNTAX.md` is authoritative
- It is acceptable to tweak helper types or utilities if needed
Correct parsing is the goal. Performance and elegance are not.