## Goal

Implement a correct parser for the language described in `SYNTAX.md`, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).

Code quality is **not** the primary concern. Correctness, clarity, and reasonable error messages are.

---

## Overall architecture

The parser is split into **two stages**:

1. **Lexing (tokenization)**
   Converts source text into a stream of tokens, each with precise source location info.
2. **Parsing**
   Consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.

---

## Stage 1: Lexer (Tokenizer)

### Purpose

The lexer exists to:

* Normalize the input into a small set of token types
* Track **line / column / offset** precisely
* Make parsing simpler and more reliable
* Enable good error messages later

The lexer is intentionally **simple and dumb**:

* No semantic decisions
* No AST construction
* Minimal lookahead

---

### Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

**Important rule**:

* Iterate over Unicode *code points*, not UTF-16 code units.

In TypeScript:

* Use `for (const ch of input)` or equivalent
* Do **not** index into strings with `input[i]`

Column counting:

* Increment column by **1 per code point**
* Exact visual width is not required

---

### Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```

Each token has a `span`.
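The code-point iteration and position-tracking rules above can be sketched as a small generator. This is a minimal sketch, not part of the spec: the name `positions` and the yielded shape are illustrative; a real lexer would consume this stream while building tokens.

```typescript
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

// Walk the input one code point at a time. `for...of` over a string
// iterates code points (not UTF-16 units), so an emoji counts as one
// step. Column advances by 1 per code point; newlines reset it.
function* positions(input: string): Generator<{ ch: string; pos: Position }> {
  let offset = 0;
  let line = 1;
  let column = 1;
  for (const ch of input) {
    yield { ch, pos: { offset, line, column } };
    offset += 1;
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1;
    }
  }
}
```

For example, iterating `"a\n🦀"` yields three entries: the crab emoji is a single code point at offset 2, line 2, column 1, even though it occupies two UTF-16 units.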
---

### Token types

Suggested minimal token set:

```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```

Where:

```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```

Notes:

* Operators like `+`, `==`, `<=`, `*` are **identifiers**
* `=` is treated as a keyword (same for `|`)
* Identifiers are lexed first, then checked against the keyword list

---

### Lexer responsibilities

The lexer should:

* Skip whitespace (spaces, tabs, newlines)
* Track line and column numbers
* Emit tokens with correct spans
* Fail immediately on:
  * Unterminated string literals
  * Invalid characters

The lexer **should not**:

* Attempt error recovery
* Guess intent
* Validate grammar rules

---

## Stage 2: Parser

### Parsing strategy

Use **recursive-descent parsing**.

The grammar:

* Is context-free
* Is non-left-recursive
* Has no precedence rules
* Has no implicit associativity

This makes recursive descent ideal.

---

### Parser state

The parser operates over:

```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```

Helper methods are encouraged:

```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```

---

### Error handling

Error recovery is **not required**. On error:

* Throw a `ParseError`
* Include:
  * A clear message
  * A span pointing to the offending token (or best approximation)

The goal is:

* One good error
* Accurate location
* No cascading failures

---

### Expression parsing

There is **no precedence hierarchy**.

`parseExpr()` should:

* Look at the next token
* Dispatch to the correct parse function based on:
  * keyword (e.g. `let`, `fn`, `match`, `apply`)
  * symbol (e.g. `$`, `#`, `(`, `{`)
  * identifier (e.g. a top-level function call)

Order matters.

---

### Important parsing rules

#### Variable use

```txt
$x
```

* `$` immediately followed by an identifier
* No whitespace allowed between them

#### Tag expressions

```txt
#foo
#foo expr
```

Parsing rule:

* After `#tag`, look at the next token
* If the next token can start an expression **and is not a terminator** (`)`, `}`, `,`, `|`, `.`):
  * Parse a `tagged-expr`
* Otherwise:
  * Parse a `tag-expr`

This rule is intentional and should be implemented directly.

---

#### Tuples vs grouping

Parentheses always construct **tuples**:

```txt
()
(123)
(1, 2, 3)
```

Parentheses are **not** used for grouping expressions, so `(123)` is NOT the same as `123` — it is a one-element tuple.

---

#### Lists with separators

Many constructs use:

```txt
list-sep-by(p, sep)
```

This allows:

* Empty lists
* An optional leading separator
* An optional trailing separator

Implement a reusable helper that:

* Stops at a known terminator token
* Does not allow repeated separators without elements between them

---

### Parsing patterns

Patterns are parsed only in specific contexts:

* `match` branches
* `let` bindings
* Lambda parameters

There are **two distinct pattern parsers**:

* `parsePattern()` — full patterns (including tags)
* `parseProductPattern()` — no tags allowed

These should be separate functions.

---

### AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

* Stored directly on AST nodes, or
* Discarded after parsing

Either is acceptable.

---

## Division of responsibility

**Lexer**:

* Characters → tokens
* Unicode-safe
* Tracks positions

**Parser**:

* Tokens → AST
* Grammar enforcement
* Context-sensitive decisions
* Error reporting

Do **not** merge these stages.
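The `Parser` class and helper methods suggested earlier might be fleshed out as follows. This is a simplified sketch: the token shape is cut down to what the helpers need, span plumbing is omitted, and the error-message wording is illustrative rather than prescribed.

```typescript
// Simplified token shape for this sketch; the real one carries spans.
type Tok =
  | { kind: "symbol"; value: string }
  | { kind: "identifier"; value: string }
  | { kind: "eof" };

class ParseError extends Error {}

class Parser {
  pos = 0;
  constructor(public tokens: Tok[]) {}

  // Look at the current token without consuming it.
  peek(): Tok {
    return this.tokens[this.pos];
  }

  // Consume and return the current token.
  advance(): Tok {
    return this.tokens[this.pos++];
  }

  // Consume the symbol if it is next; otherwise leave the stream untouched.
  matchSymbol(sym: string): boolean {
    const t = this.peek();
    if (t.kind === "symbol" && t.value === sym) {
      this.pos++;
      return true;
    }
    return false;
  }

  // Consume the symbol or throw one clear ParseError at the offending token.
  expectSymbol(sym: string): Tok {
    const t = this.peek();
    if (t.kind === "symbol" && t.value === sym) return this.advance();
    throw new ParseError(`expected "${sym}", found ${t.kind}`);
  }
}
```

The `match*`/`expect*` split keeps call sites honest: `match*` is for optional syntax and never fails, while `expect*` is for required syntax and produces the single accurate error the spec asks for.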
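The reusable `list-sep-by` helper described above can be sketched as a standalone function over three callbacks, so it works for any element parser and separator. The name `listSepBy` and the callback interface are illustrative; inside the real `Parser` it would use `peek`/`match*` directly.

```typescript
// list-sep-by(p, sep): allows empty lists, an optional leading separator,
// and an optional trailing separator; stops at a known terminator.
// `parseItem` consumes one element (and is responsible for rejecting a
// stray separator, which is how repeated separators without elements
// become errors); `matchSep` consumes a separator if one is next;
// `atTerminator` peeks at the terminator without consuming it.
function listSepBy<T>(
  parseItem: () => T,
  matchSep: () => boolean,
  atTerminator: () => boolean,
): T[] {
  const items: T[] = [];
  matchSep(); // optional leading separator
  while (!atTerminator()) {
    items.push(parseItem());
    if (!matchSep()) break; // no separator: the list must end here
    // a separator was consumed; next is either another item or the
    // terminator (which makes this a trailing separator)
  }
  return items;
}
```

For example, over a toy token list `["1", ",", "2", ",", "3", ",", ")"]` with `")"` as the terminator, the helper yields the elements `1, 2, 3` and leaves the `")"` unconsumed for the caller.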
---

## Final notes

* Favor clarity over cleverness
* Favor explicit structure over abstraction
* Assume the grammar in `SYNTAX.md` is authoritative
* It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.