scrowl/src/PARSER-PLAN.md

Goal

Implement a correct parser for the language described in SYNTAX.md, producing the existing AST types (Expr, Pattern, ProductPattern, etc.).

Code quality is not the primary concern. Correctness, clarity, and reasonable error messages are.


Overall architecture

The parser is split into two stages:

  1. Lexing (tokenization): converts source text into a stream of tokens, each with precise source location info.
  2. Parsing: consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.


Stage 1: Lexer (Tokenizer)

Purpose

The lexer exists to:

  • Normalize the input into a small set of token types
  • Track line / column / offset precisely
  • Make parsing simpler and more reliable
  • Enable good error messages later

The lexer is intentionally simple and dumb:

  • No semantic decisions
  • No AST construction
  • Minimal lookahead

Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

Important rule:

  • Iterate over Unicode code points, not UTF-16 code units.

In TypeScript:

  • Use for (const ch of input) or equivalent
  • Do not index into strings with input[i]

Column counting:

  • Increment column by 1 per code point
  • Exact visual width is not required
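
A minimal sketch of why this matters, using a hypothetical `codePointLength` helper (not part of the plan, just an illustration):

```typescript
// Sketch: code points vs. UTF-16 code units.
// for...of iterates code points, so an astral character (e.g. an
// emoji) counts as one; .length counts UTF-16 code units, so the
// same character counts as two.
function codePointLength(s: string): number {
  let n = 0;
  for (const _ of s) n += 1;
  return n;
}
```

`"a😀b".length` is 4, but `codePointLength("a😀b")` is 3; columns must advance by the latter.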

Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};

Each token has a span.


Token types

Suggested minimal token set:

type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };

Where:

type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";

Notes:

  • Operators like +, ==, <=, * are identifiers
  • = is treated as a keyword (same for |)
  • Identifiers are lexed first, then checked against the keyword list
  • The suggested Symbol type shadows TypeScript's built-in Symbol; rename it (e.g. Sym) if that causes friction

Lexer responsibilities

The lexer should:

  • Skip whitespace (spaces, tabs, newlines)

  • Track line and column numbers

  • Emit tokens with correct spans

  • Fail immediately on:

    • Unterminated string literals
    • Invalid characters

The lexer should not:

  • Attempt error recovery
  • Guess intent
  • Validate grammar rules
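
The position-tracking part of the loop can be sketched as follows; `advancePosition` is illustrative, not prescribed, and mirrors the suggested Position type:

```typescript
type Position = { offset: number; line: number; column: number };

// Sketch: advance a position over one code point. A newline starts a
// new line and resets the 1-based column; any other code point adds
// one column. Offsets are counted in code points.
function advancePosition(pos: Position, ch: string): Position {
  if (ch === "\n") {
    return { offset: pos.offset + 1, line: pos.line + 1, column: 1 };
  }
  return { offset: pos.offset + 1, line: pos.line, column: pos.column + 1 };
}
```

Folding this over `for (const ch of input)` keeps all three fields consistent with the Unicode rule above.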

Stage 2: Parser

Parsing strategy

Use recursive-descent parsing.

The grammar is:

  • Context-free
  • Non-left-recursive
  • No precedence rules
  • No implicit associativity

This makes recursive descent ideal.


Parser state

The parser operates over:

class Parser {
  tokens: Token[];
  pos: number;
}

Helper methods are encouraged:

peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
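
A sketch of these helpers over a deliberately simplified token type (the real Token, Keyword, and Span types come from the lexer section; the error body is a stand-in for throwing a proper ParseError):

```typescript
type Span = { start: number; end: number }; // simplified spans for this sketch
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Token =
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "eof"; span: Span };

class Parser {
  constructor(private tokens: Token[], private pos = 0) {}

  peek(): Token {
    return this.tokens[this.pos];
  }

  advance(): Token {
    return this.tokens[this.pos++];
  }

  // Consume the token only if it is the given keyword.
  matchKeyword(kw: Keyword): boolean {
    const t = this.peek();
    if (t.kind === "keyword" && t.value === kw) {
      this.pos += 1;
      return true;
    }
    return false;
  }

  // Like matchKeyword, but failing to match is a parse error.
  expectKeyword(kw: Keyword): Token {
    const t = this.peek();
    if (!this.matchKeyword(kw)) {
      this.error(`expected '${kw}'`, t.span);
    }
    return t;
  }

  error(message: string, span?: Span): never {
    throw new Error(span ? `${message} at ${span.start}..${span.end}` : message);
  }
}
```

matchSymbol/expectSymbol follow the same shape over symbol tokens.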

Error handling

Error recovery is not required.

On error:

  • Throw a ParseError

  • Include:

    • A clear message
    • A span pointing to the offending token (or best approximation)

The goal is:

  • One good error
  • Accurate location
  • No cascading failures
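
A minimal ParseError carrying the span might look like this (field names are a suggestion, and the Span shape is simplified):

```typescript
type Span = { start: number; end: number }; // simplified for this sketch

// One clear message plus the offending span, so callers can render
// "line:col: message" without re-scanning the input.
class ParseError extends Error {
  constructor(message: string, readonly span?: Span) {
    super(message);
    this.name = "ParseError";
  }
}
```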

Expression parsing

There is no precedence hierarchy.

parseExpr() should:

  • Look at the next token

  • Dispatch to the correct parse function based on:

    • keyword (e.g. let, fn, match, apply)
    • symbol (e.g. $, #, (, {)
    • identifier (e.g. top-level function call)

Order matters.
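
The dispatch step can be sketched as a pure lookup from the next token to a handler name; the parseXxx names are placeholders, not required function names, and what `{` introduces is whatever SYNTAX.md says it introduces:

```typescript
type Tok =
  | { kind: "keyword"; value: string }
  | { kind: "symbol"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "number"; value: number }
  | { kind: "string"; value: string };

// Returns which parse function should handle an expression beginning
// with this token; "error" means no expression can start here.
function dispatch(t: Tok): string {
  switch (t.kind) {
    case "keyword":
      if (t.value === "let") return "parseLet";
      if (t.value === "fn") return "parseFn";
      if (t.value === "match") return "parseMatch";
      if (t.value === "apply") return "parseApply";
      return "error";
    case "symbol":
      if (t.value === "$") return "parseVariableUse";
      if (t.value === "#") return "parseTagOrTagged";
      if (t.value === "(") return "parseTuple";
      if (t.value === "{") return "parseBraced"; // placeholder name
      return "error";
    case "identifier":
      return "parseCall"; // e.g. a top-level function call
    case "number":
    case "string":
      return "parseLiteral";
  }
}
```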


Important parsing rules

Variable use

$x
  • $ immediately followed by identifier
  • No whitespace allowed
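
Because spans are recorded in code-point offsets, the no-whitespace rule can be checked by span adjacency. A sketch, assuming end offsets are exclusive (one past the last code point):

```typescript
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

// "$x" is valid only if the identifier starts exactly where the $
// token ends, i.e. zero code points in between.
function isAdjacent(dollar: Span, ident: Span): boolean {
  return dollar.end.offset === ident.start.offset;
}
```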

Tag expressions

#foo
#foo expr

Parsing rule:

  • After #tag, look at the next token

  • If the next token can start an expression and is not a terminator (one of ) } , | . ):

    • Parse a tagged-expr
  • Otherwise:

    • Parse a tag-expr

This rule is intentional and should be implemented directly.
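
A sketch of the lookahead test over the token set above. Note that in that set `|` is a keyword while `)`, `}`, `,`, `.` are symbols; whether other keywords (such as `=` or `!`) can start an expression is determined by SYNTAX.md, so the keyword branch here is an assumption:

```typescript
type Tok =
  | { kind: "number"; value: number }
  | { kind: "string"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "keyword"; value: string }
  | { kind: "symbol"; value: string }
  | { kind: "eof" };

const TERMINATOR_SYMBOLS = new Set([")", "}", ",", "."]);

// After #tag: true means "parse a tagged-expr"; false means the tag
// stands alone and we parse a tag-expr.
function startsTaggedValue(next: Tok): boolean {
  if (next.kind === "eof") return false;
  if (next.kind === "symbol") return !TERMINATOR_SYMBOLS.has(next.value);
  if (next.kind === "keyword") return next.value !== "|"; // assumption
  return true; // numbers, strings, identifiers all start expressions
}
```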


Tuples vs grouping

Parentheses always construct tuples.

()
(123)
(1, 2, 3)

Parentheses are not used for grouping expressions. So (123) is NOT the same as 123.


Lists with separators

Many constructs use:

list-sep-by(p, sep)

This allows:

  • Empty lists
  • Optional leading separator
  • Optional trailing separator

Implement a reusable helper that:

  • Stops at a known terminator token
  • Does not allow repeated separators without elements
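
One way to sketch the helper is to parameterize it over three callbacks into the parser (the names are illustrative): `p` parses one element, `trySep` consumes a separator if one is present, and `atTerminator` peeks for the stop token without consuming it.

```typescript
// Allows an optional leading separator, an optional trailing
// separator, and the empty list; never consumes the terminator.
// Repeated separators are rejected indirectly: trySep consumes at
// most one, so a second separator reaches p() and fails there.
function listSepBy<T>(
  p: () => T,
  trySep: () => boolean,
  atTerminator: () => boolean
): T[] {
  const items: T[] = [];
  trySep(); // optional leading separator
  while (!atTerminator()) {
    items.push(p());
    if (!trySep()) break; // no separator means the list must end here
  }
  return items;
}

// Usage sketch over a plain string array standing in for tokens:
function parseNumberList(tokens: string[]): number[] {
  let i = 0;
  return listSepBy(
    () => Number(tokens[i++]),
    () => (tokens[i] === "," ? (i++, true) : false),
    () => tokens[i] === ")"
  );
}
```

If the loop breaks before the terminator, the caller's subsequent expectSymbol on the terminator produces the parse error.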

Parsing patterns

Patterns are parsed only in specific contexts:

  • match branches
  • let bindings
  • lambda parameters

There are two distinct pattern parsers:

  • parsePattern() — full patterns (including tags)
  • parseProductPattern() — no tags allowed

These should be separate functions.


AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

  • Stored directly on AST nodes, or
  • Discarded after parsing

Either is acceptable.


Division of responsibility

Lexer:

  • Characters → tokens
  • Unicode-safe
  • Tracks positions

Parser:

  • Tokens → AST
  • Grammar enforcement
  • Context-sensitive decisions
  • Error reporting

Do not merge these stages.


Final notes

  • Favor clarity over cleverness
  • Favor explicit structure over abstraction
  • Assume the grammar in SYNTAX.md is authoritative
  • It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.