scrowl/src/PARSER-PLAN.md

Goal

Implement a correct parser for the language described in SYNTAX.md, producing the existing AST types (Expr, Pattern, ProductPattern, etc.).

Code quality is not the primary concern. Correctness, clarity, and reasonable error messages are.


Overall architecture

The parser is split into two stages:

  1. Lexing (tokenization): converts source text into a stream of tokens, each with precise source location info.
  2. Parsing: consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.


Stage 1: Lexer (Tokenizer)

Purpose

The lexer exists to:

  • Normalize the input into a small set of token types
  • Track line / column / offset precisely
  • Make parsing simpler and more reliable
  • Enable good error messages later

The lexer is intentionally simple and dumb:

  • No semantic decisions
  • No AST construction
  • Minimal lookahead

Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

Important rule:

  • Iterate over Unicode code points, not UTF-16 code units.

In TypeScript:

  • Use for (const ch of input) or equivalent
  • Do not index into strings with input[i]

Column counting:

  • Increment column by 1 per code point
  • Exact visual width is not required
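
A minimal sketch of why this matters, using a hypothetical `codePointLength` helper (not part of the plan, just an illustration):

```typescript
// Sketch: code points vs. UTF-16 code units.
// for...of iterates code points, so an astral character (e.g. an
// emoji) counts as one; .length counts UTF-16 code units, so the
// same character counts as two.
function codePointLength(s: string): number {
  let n = 0;
  for (const _ of s) n += 1;
  return n;
}
```

`"a😀b".length` is 4, but `codePointLength("a😀b")` is 3; columns must advance by the latter.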

Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};

Each token has a span.


Token types

Suggested minimal token set:

type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };

Where:

type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";

Notes:

  • Operators like +, ==, <=, * are identifiers
  • = is treated as a keyword (same for |)
  • Identifiers are lexed first, then checked against the keyword list
  • The suggested Symbol type shadows TypeScript's built-in Symbol; rename it (e.g. Sym) if that causes friction

Lexer responsibilities

The lexer should:

  • Skip whitespace (spaces, tabs, newlines)

  • Track line and column numbers

  • Emit tokens with correct spans

  • Fail immediately on:

    • Unterminated string literals
    • Invalid characters

The lexer should not:

  • Attempt error recovery
  • Guess intent
  • Validate grammar rules
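
The position-tracking part of the loop can be sketched as follows; `advancePosition` is illustrative, not prescribed, and mirrors the suggested Position type:

```typescript
type Position = { offset: number; line: number; column: number };

// Sketch: advance a position over one code point. A newline starts a
// new line and resets the 1-based column; any other code point adds
// one column. Offsets are counted in code points.
function advancePosition(pos: Position, ch: string): Position {
  if (ch === "\n") {
    return { offset: pos.offset + 1, line: pos.line + 1, column: 1 };
  }
  return { offset: pos.offset + 1, line: pos.line, column: pos.column + 1 };
}
```

Folding this over `for (const ch of input)` keeps all three fields consistent with the Unicode rule above.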

Stage 2: Parser

Parsing strategy

Use recursive-descent parsing.

The grammar is:

  • Context-free
  • Non-left-recursive
  • No precedence rules
  • No implicit associativity

This makes recursive descent ideal.


Parser state

The parser operates over:

class Parser {
  tokens: Token[];
  pos: number;
}

Helper methods are encouraged:

peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
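
A sketch of these helpers over a deliberately simplified token type (the real Token, Keyword, and Span types come from the lexer section; the error body is a stand-in for throwing a proper ParseError):

```typescript
type Span = { start: number; end: number }; // simplified spans for this sketch
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Token =
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "eof"; span: Span };

class Parser {
  constructor(private tokens: Token[], private pos = 0) {}

  peek(): Token {
    return this.tokens[this.pos];
  }

  advance(): Token {
    return this.tokens[this.pos++];
  }

  // Consume the token only if it is the given keyword.
  matchKeyword(kw: Keyword): boolean {
    const t = this.peek();
    if (t.kind === "keyword" && t.value === kw) {
      this.pos += 1;
      return true;
    }
    return false;
  }

  // Like matchKeyword, but failing to match is a parse error.
  expectKeyword(kw: Keyword): Token {
    const t = this.peek();
    if (!this.matchKeyword(kw)) {
      this.error(`expected '${kw}'`, t.span);
    }
    return t;
  }

  error(message: string, span?: Span): never {
    throw new Error(span ? `${message} at ${span.start}..${span.end}` : message);
  }
}
```

matchSymbol/expectSymbol follow the same shape over symbol tokens.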

Error handling

Error recovery is not required.

On error:

  • Throw a ParseError

  • Include:

    • A clear message
    • A span pointing to the offending token (or best approximation)

The goal is:

  • One good error
  • Accurate location
  • No cascading failures
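
A minimal ParseError carrying the span might look like this (field names are a suggestion, and the Span shape is simplified):

```typescript
type Span = { start: number; end: number }; // simplified for this sketch

// One clear message plus the offending span, so callers can render
// "line:col: message" without re-scanning the input.
class ParseError extends Error {
  constructor(message: string, readonly span?: Span) {
    super(message);
    this.name = "ParseError";
  }
}
```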

Expression parsing

There is no precedence hierarchy.

parseExpr() should:

  • Look at the next token

  • Dispatch to the correct parse function based on:

    • keyword (e.g. let, fn, match, apply)
    • symbol (e.g. $, #, (, {)
    • identifier (e.g. top-level function call)

Order matters.
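
The dispatch step can be sketched as a pure lookup from the next token to a handler name; the parseXxx names are placeholders, not required function names, and what `{` introduces is whatever SYNTAX.md says it introduces:

```typescript
type Tok =
  | { kind: "keyword"; value: string }
  | { kind: "symbol"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "number"; value: number }
  | { kind: "string"; value: string };

// Returns which parse function should handle an expression beginning
// with this token; "error" means no expression can start here.
function dispatch(t: Tok): string {
  switch (t.kind) {
    case "keyword":
      if (t.value === "let") return "parseLet";
      if (t.value === "fn") return "parseFn";
      if (t.value === "match") return "parseMatch";
      if (t.value === "apply") return "parseApply";
      return "error";
    case "symbol":
      if (t.value === "$") return "parseVariableUse";
      if (t.value === "#") return "parseTagOrTagged";
      if (t.value === "(") return "parseTuple";
      if (t.value === "{") return "parseBraced"; // placeholder name
      return "error";
    case "identifier":
      return "parseCall"; // e.g. a top-level function call
    case "number":
    case "string":
      return "parseLiteral";
  }
}
```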


Important parsing rules

Variable use

$x
  • $ immediately followed by identifier
  • No whitespace allowed
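
Because spans are recorded in code-point offsets, the no-whitespace rule can be checked by span adjacency. A sketch, assuming end offsets are exclusive (one past the last code point):

```typescript
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

// "$x" is valid only if the identifier starts exactly where the $
// token ends, i.e. zero code points in between.
function isAdjacent(dollar: Span, ident: Span): boolean {
  return dollar.end.offset === ident.start.offset;
}
```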

Tag expressions

#foo
#foo expr

Parsing rule:

  • After #tag, look at the next token

  • If the next token can start an expression and is not a terminator (one of ) } , | . ):

    • Parse a tagged-expr
  • Otherwise:

    • Parse a tag-expr

This rule is intentional and should be implemented directly.
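
A sketch of the lookahead test over the token set above. Note that in that set `|` is a keyword while `)`, `}`, `,`, `.` are symbols; whether other keywords (such as `=` or `!`) can start an expression is determined by SYNTAX.md, so the keyword branch here is an assumption:

```typescript
type Tok =
  | { kind: "number"; value: number }
  | { kind: "string"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "keyword"; value: string }
  | { kind: "symbol"; value: string }
  | { kind: "eof" };

const TERMINATOR_SYMBOLS = new Set([")", "}", ",", "."]);

// After #tag: true means "parse a tagged-expr"; false means the tag
// stands alone and we parse a tag-expr.
function startsTaggedValue(next: Tok): boolean {
  if (next.kind === "eof") return false;
  if (next.kind === "symbol") return !TERMINATOR_SYMBOLS.has(next.value);
  if (next.kind === "keyword") return next.value !== "|"; // assumption
  return true; // numbers, strings, identifiers all start expressions
}
```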


Tuples vs grouping

Parentheses always construct tuples.

()
(123)
(1, 2, 3)

Parentheses are not used for grouping expressions. So (123) is NOT the same as 123.


Lists with separators

Many constructs use:

list-sep-by(p, sep)

This allows:

  • Empty lists
  • Optional leading separator
  • Optional trailing separator

Implement a reusable helper that:

  • Stops at a known terminator token
  • Does not allow repeated separators without elements
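
One way to sketch the helper is to parameterize it over three callbacks into the parser (the names are illustrative): `p` parses one element, `trySep` consumes a separator if one is present, and `atTerminator` peeks for the stop token without consuming it.

```typescript
// Allows an optional leading separator, an optional trailing
// separator, and the empty list; never consumes the terminator.
// Repeated separators are rejected indirectly: trySep consumes at
// most one, so a second separator reaches p() and fails there.
function listSepBy<T>(
  p: () => T,
  trySep: () => boolean,
  atTerminator: () => boolean
): T[] {
  const items: T[] = [];
  trySep(); // optional leading separator
  while (!atTerminator()) {
    items.push(p());
    if (!trySep()) break; // no separator means the list must end here
  }
  return items;
}

// Usage sketch over a plain string array standing in for tokens:
function parseNumberList(tokens: string[]): number[] {
  let i = 0;
  return listSepBy(
    () => Number(tokens[i++]),
    () => (tokens[i] === "," ? (i++, true) : false),
    () => tokens[i] === ")"
  );
}
```

If the loop breaks before the terminator, the caller's subsequent expectSymbol on the terminator produces the parse error.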

Parsing patterns

Patterns are parsed only in specific contexts:

  • match branches
  • let bindings
  • lambda parameters

There are two distinct pattern parsers:

  • parsePattern() — full patterns (including tags)
  • parseProductPattern() — no tags allowed

These should be separate functions.


AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

  • Stored directly on AST nodes, or
  • Discarded after parsing

Either is acceptable.


Division of responsibility

Lexer:

  • Characters → tokens
  • Unicode-safe
  • Tracks positions

Parser:

  • Tokens → AST
  • Grammar enforcement
  • Context-sensitive decisions
  • Error reporting

Do not merge these stages.


Final notes

  • Favor clarity over cleverness
  • Favor explicit structure over abstraction
  • Assume the grammar in SYNTAX.md is authoritative
  • It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.