## Goal

Implement a correct parser for the language described in `SYNTAX.md`, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).

Code quality is **not** the primary concern. Correctness, clarity, and reasonable error messages are.

---

## Overall architecture

The parser is split into **two stages**:

1. **Lexing (tokenization)**
   Converts source text into a stream of tokens, each with precise source location info.
2. **Parsing**
   Consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.

---

## Stage 1: Lexer (Tokenizer)

### Purpose

The lexer exists to:

* Normalize the input into a small set of token types
* Track **line / column / offset** precisely
* Make parsing simpler and more reliable
* Enable good error messages later

The lexer is intentionally **simple and dumb**:

* No semantic decisions
* No AST construction
* Minimal lookahead

---

### Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

**Important rule**:

* Iterate over Unicode *code points*, not UTF-16 code units.

In TypeScript:

* Use `for (const ch of input)` or equivalent
* Do **not** index into strings with `input[i]`

Column counting:

* Increment column by **1 per code point**
* Exact visual width is not required

---

### Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```

Each token has a `span`.
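The code-point iteration and position-tracking rules above can be sketched as a small generator. This is a minimal sketch, not part of the spec: the name `positions` and the yielded shape are illustrative; a real lexer would consume this stream while building tokens.

```typescript
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

// Walk the input one code point at a time. `for...of` over a string
// iterates code points (not UTF-16 units), so an emoji counts as one
// step. Column advances by 1 per code point; newlines reset it.
function* positions(input: string): Generator<{ ch: string; pos: Position }> {
  let offset = 0;
  let line = 1;
  let column = 1;
  for (const ch of input) {
    yield { ch, pos: { offset, line, column } };
    offset += 1;
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
}
```

For example, iterating `"a\n🦀"` yields three entries: the crab emoji is a single code point at offset 2, line 2, column 1, even though it occupies two UTF-16 units.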
---

### Token types

Suggested minimal token set:

```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```

Where:

```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```

Notes:

* Operators like `+`, `==`, `<=`, `*` are **identifiers**
* `=` is treated as a keyword (same for `|`)
* Identifiers are lexed first, then checked against the keyword list

---

### Lexer responsibilities

The lexer should:

* Skip whitespace (spaces, tabs, newlines)
* Track line and column numbers
* Emit tokens with correct spans
* Fail immediately on:
  * Unterminated string literals
  * Invalid characters

The lexer **should not**:

* Attempt error recovery
* Guess intent
* Validate grammar rules

---

## Stage 2: Parser

### Parsing strategy

Use **recursive-descent parsing**.

The grammar:

* Is context-free
* Is non-left-recursive
* Has no precedence rules
* Has no implicit associativity

This makes recursive descent ideal.

---

### Parser state

The parser operates over:

```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```

Helper methods are encouraged:

```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```

---

### Error handling

Error recovery is **not required**. On error:

* Throw a `ParseError`
* Include:
  * A clear message
  * A span pointing to the offending token (or best approximation)

The goal is:

* One good error
* Accurate location
* No cascading failures

---

### Expression parsing

There is **no precedence hierarchy**.

`parseExpr()` should:

* Look at the next token
* Dispatch to the correct parse function based on:
  * keyword (e.g. `let`, `fn`, `match`, `apply`)
  * symbol (e.g. `$`, `#`, `(`, `{`)
  * identifier (e.g. a top-level function call)

Order matters.

---

### Important parsing rules

#### Variable use

```txt
$x
```

* `$` immediately followed by an identifier
* No whitespace allowed between them

#### Tag expressions

```txt
#foo
#foo expr
```

Parsing rule:

* After `#tag`, look at the next token
* If the next token can start an expression **and is not a terminator** (`)`, `}`, `,`, `|`, `.`):
  * Parse a `tagged-expr`
* Otherwise:
  * Parse a `tag-expr`

This rule is intentional and should be implemented directly.

---

#### Tuples vs grouping

Parentheses always construct **tuples**:

```txt
()
(123)
(1, 2, 3)
```

Parentheses are **not** used for grouping expressions, so `(123)` is NOT the same as `123` — it is a one-element tuple.

---

#### Lists with separators

Many constructs use:

```txt
list-sep-by(p, sep)
```

This allows:

* Empty lists
* An optional leading separator
* An optional trailing separator

Implement a reusable helper that:

* Stops at a known terminator token
* Does not allow repeated separators without elements between them

---

### Parsing patterns

Patterns are parsed only in specific contexts:

* `match` branches
* `let` bindings
* Lambda parameters

There are **two distinct pattern parsers**:

* `parsePattern()` — full patterns (including tags)
* `parseProductPattern()` — no tags allowed

These should be separate functions.

---

### AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

* Stored directly on AST nodes, or
* Discarded after parsing

Either is acceptable.

---

## Division of responsibility

**Lexer**:

* Characters → tokens
* Unicode-safe
* Tracks positions

**Parser**:

* Tokens → AST
* Grammar enforcement
* Context-sensitive decisions
* Error reporting

Do **not** merge these stages.
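The `Parser` class and helper methods suggested earlier might be fleshed out as follows. This is a simplified sketch: the token shape is cut down to what the helpers need, span plumbing is omitted, and the error-message wording is illustrative rather than prescribed.

```typescript
// Simplified token shape for this sketch; the real one carries spans.
type Tok =
  | { kind: "symbol"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "eof" };

class ParseError extends Error {}

class Parser {
  pos = 0;
  constructor(public tokens: Tok[]) {}

  // Look at the current token without consuming it.
  peek(): Tok {
    return this.tokens[this.pos];
  }

  // Consume and return the current token.
  advance(): Tok {
    return this.tokens[this.pos++];
  }

  // Consume the symbol if it is next; otherwise leave the stream untouched.
  matchSymbol(sym: string): boolean {
    const t = this.peek();
    if (t.kind === "symbol" && t.value === sym) {
      this.pos++;
      return true;
    }
    return false;
  }

  // Consume the symbol or throw one clear ParseError at the offending token.
  expectSymbol(sym: string): Tok {
    const t = this.peek();
    if (t.kind === "symbol" && t.value === sym) return this.advance();
    throw new ParseError(`expected "${sym}", found ${t.kind}`);
  }
}
```

The `match*`/`expect*` split keeps call sites honest: `match*` is for optional syntax and never fails, while `expect*` is for required syntax and produces the single accurate error the spec asks for.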
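The reusable `list-sep-by` helper described above can be sketched as a standalone function over three callbacks, so it works for any element parser and separator. The name `listSepBy` and the callback interface are illustrative; inside the real `Parser` it would use `peek`/`match*` directly.

```typescript
// list-sep-by(p, sep): allows empty lists, an optional leading separator,
// and an optional trailing separator; stops at a known terminator.
// `parseItem` consumes one element (and is responsible for rejecting a
// stray separator, which is how repeated separators without elements
// become errors); `matchSep` consumes a separator if one is next;
// `atTerminator` peeks at the terminator without consuming it.
function listSepBy<T>(
  parseItem: () => T,
  matchSep: () => boolean,
  atTerminator: () => boolean,
): T[] {
  const items: T[] = [];
  matchSep(); // optional leading separator
  while (!atTerminator()) {
    items.push(parseItem());
    if (!matchSep()) break; // no separator: the list must end here
    // a separator was consumed; next is either another item or the
    // terminator (which makes this a trailing separator)
  }
  return items;
}
```

For example, over a toy token list `["1", ",", "2", ",", "3", ",", ")"]` with `")"` as the terminator, the helper yields the elements `1, 2, 3` and leaves the `")"` unconsumed for the caller.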
---

## Final notes

* Favor clarity over cleverness
* Favor explicit structure over abstraction
* Assume the grammar in `SYNTAX.md` is authoritative
* It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.