Move language files into dedicated folder
This commit is contained in:
parent 3d1cd89067
commit 1b406899e0

15 changed files with 7 additions and 343 deletions

@@ -1,336 +0,0 @@
## Goal

Implement a correct parser for the language described in `SYNTAX.md`, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).

Code quality is **not** the primary concern. Correctness, clarity, and reasonable error messages are.

---
## Overall architecture

The parser is split into **two stages**:

1. **Lexing (tokenization)**
   Converts source text into a stream of tokens, each with precise source location info.

2. **Parsing**
   Consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.

---
## Stage 1: Lexer (Tokenizer)

### Purpose

The lexer exists to:

* Normalize the input into a small set of token types
* Track **line / column / offset** precisely
* Make parsing simpler and more reliable
* Enable good error messages later

The lexer is intentionally **simple and dumb**:

* No semantic decisions
* No AST construction
* Minimal lookahead

---
### Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

**Important rule**:

* Iterate over Unicode *code points*, not UTF-16 code units.

In TypeScript:

* Use `for (const ch of input)` or equivalent
* Do **not** index into strings with `input[i]`

Column counting:

* Increment column by **1 per code point**
* Exact visual width is not required
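The position rules above can be sketched as follows. `advance` and `positionsOf` are hypothetical helper names for illustration, not part of the existing codebase:

```typescript
// Hypothetical sketch: advancing a Position one code point at a time.
type Position = { offset: number; line: number; column: number };

function advance(pos: Position, ch: string): Position {
  // `ch` is one full code point (as produced by `for..of` over a string).
  if (ch === "\n") {
    return { offset: pos.offset + 1, line: pos.line + 1, column: 1 };
  }
  return { offset: pos.offset + 1, line: pos.line, column: pos.column + 1 };
}

function positionsOf(input: string): Position[] {
  const out: Position[] = [];
  let pos: Position = { offset: 0, line: 1, column: 1 };
  // `for..of` iterates code points, so an emoji like "🦊" counts as one step.
  for (const ch of input) {
    out.push(pos);
    pos = advance(pos, ch);
  }
  return out;
}
```

Note that `"a🦊b"` yields three positions under this scheme, even though the emoji occupies two UTF-16 code units.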
---
### Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```

Each token has a `span`.

---
### Token types

Suggested minimal token set:

```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```

Where:

```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```

Notes:

* Operators like `+`, `==`, `<=`, `*` are **identifiers**
* `=` is treated as a keyword (same for `|`)
* Identifiers are parsed first, then checked against keywords

---
### Lexer responsibilities

The lexer should:

* Skip whitespace (spaces, tabs, newlines)
* Track line and column numbers
* Emit tokens with correct spans
* Fail immediately on:
  * Unterminated string literals
  * Invalid characters

The lexer **should not**:

* Attempt error recovery
* Guess intent
* Validate grammar rules
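A minimal sketch of such a lexer loop (without spans or an `eof` token, and with a deliberately naive identifier rule) might look like this; `lex`, `KEYWORDS`, and `SYMBOLS` are illustrative names:

```typescript
// Naive lexer loop sketch: skip whitespace, emit symbols, scan identifier-like
// runs, then reclassify them as keywords. Spans are omitted for brevity.
type Tok = { kind: "identifier" | "keyword" | "symbol"; value: string };

const KEYWORDS = new Set(["let", "fn", "match", "apply", "=", "!", "|"]);
const SYMBOLS = new Set(["#", "$", "(", ")", "{", "}", ",", "."]);

function lex(input: string): Tok[] {
  const chars = [...input]; // spread iterates code points, not UTF-16 units
  const toks: Tok[] = [];
  let i = 0;
  while (i < chars.length) {
    const ch = chars[i];
    if (/\s/u.test(ch)) { i++; continue; } // skip whitespace
    if (SYMBOLS.has(ch)) {
      toks.push({ kind: "symbol", value: ch });
      i++;
      continue;
    }
    // Anything else starts an identifier-like run; keywords are checked afterwards.
    let word = "";
    while (i < chars.length && !/\s/u.test(chars[i]) && !SYMBOLS.has(chars[i])) {
      word += chars[i++];
    }
    toks.push({ kind: KEYWORDS.has(word) ? "keyword" : "identifier", value: word });
  }
  return toks;
}
```

This also illustrates the "identifier first, keyword second" rule: `=` and `|` fall out of the same scan as `==` or `+`, then get reclassified.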
---
## Stage 2: Parser

### Parsing strategy

Use **recursive-descent parsing**.

The grammar is:

* Context-free
* Non-left-recursive
* No precedence rules
* No implicit associativity

This makes recursive descent ideal.

---
### Parser state

The parser operates over:

```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```

Helper methods are encouraged:

```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```
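One plausible implementation of a few of these helpers, sketched over a simplified token shape (the real `Token`, `Keyword`, and error types differ):

```typescript
// Sketch of the core cursor helpers; the token shape is simplified here.
type Tok = { kind: string; value?: string };

class ParserSketch {
  private pos = 0;
  constructor(private tokens: Tok[]) {}

  peek(): Tok {
    return this.tokens[this.pos];
  }

  advance(): Tok {
    return this.tokens[this.pos++];
  }

  // match*: consume the token and report success, or leave it and report failure.
  matchKeyword(kw: string): boolean {
    const t = this.peek();
    if (t.kind === "keyword" && t.value === kw) {
      this.pos++;
      return true;
    }
    return false;
  }

  // expect*: like match*, but a mismatch is a hard parse error.
  expectKeyword(kw: string): Tok {
    if (!this.matchKeyword(kw)) {
      throw new Error(`expected keyword '${kw}', found '${this.peek().kind}'`);
    }
    return this.tokens[this.pos - 1];
  }
}
```

Because the token list ends with an `eof` token, `peek()` never runs off the end of the array in practice.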
---
### Error handling

Error recovery is **not required**.

On error:

* Throw a `ParseError`
* Include:
  * A clear message
  * A span pointing to the offending token (or best approximation)

The goal is:

* One good error
* Accurate location
* No cascading failures
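A `ParseError` meeting these goals can be as small as the following sketch; the exact shape is a suggestion, not the existing API:

```typescript
// A minimal ParseError carrying a span; the message embeds the start location
// so a bare `throw` already prints something useful.
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

class ParseError extends Error {
  constructor(message: string, public span: Span) {
    super(`${message} at ${span.start.line}:${span.start.column}`);
    this.name = "ParseError";
  }
}
```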
---
### Expression parsing

There is **no precedence hierarchy**.

`parseExpr()` should:

* Look at the next token
* Dispatch to the correct parse function based on:
  * keyword (e.g. `let`, `fn`, `match`, `apply`)
  * symbol (e.g. `$`, `#`, `(`, `{`)
  * identifier (e.g. top-level function call)

Order matters.
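The dispatch reads naturally as one chain of checks. In this sketch the returned strings stand in for calls to the corresponding parse functions; those names, and the number/string literal case, are assumptions for illustration:

```typescript
// Dispatch on the first token of an expression; each returned label stands in
// for a call such as parseLet(), parseLambda(), parseTuple(), etc.
type Tok = { kind: string; value?: string };

function dispatchExpr(t: Tok): string {
  if (t.kind === "keyword") {
    switch (t.value) {
      case "let":   return "let-expr";
      case "fn":    return "lambda";
      case "match": return "match-expr";
      case "apply": return "apply-expr";
    }
  }
  if (t.kind === "symbol") {
    switch (t.value) {
      case "$": return "var-use";
      case "#": return "tag-expr";
      case "(": return "tuple";
      case "{": return "brace-expr";
    }
  }
  if (t.kind === "identifier") return "call";
  if (t.kind === "number" || t.kind === "string") return "literal";
  throw new Error(`token cannot start an expression: ${t.kind}`);
}
```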
---
### Important parsing rules

#### Variable use

```txt
$x
```

* `$` immediately followed by identifier
* No whitespace allowed

#### Tag expressions

```txt
#foo
#foo expr
```

Parsing rule:

* After `#tag`, look at the next token
* If the next token can start an expression **and is not a terminator** (`)`, `}`, `,`, `|`, `.`):
  * Parse a `tagged-expr`
* Otherwise:
  * Parse a `tag-expr`

This rule is intentional and should be implemented directly.
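Written directly, the decision is a one-token peek. This sketch assumes a `canStartExpr` predicate consistent with the expression dispatch; all names are illustrative:

```typescript
// Decide between tag-expr (#foo) and tagged-expr (#foo expr) by peeking one token.
type Tok = { kind: string; value?: string };

const TERMINATORS = new Set([")", "}", ",", "|", "."]);

function canStartExpr(t: Tok): boolean {
  if (t.kind === "identifier" || t.kind === "number" || t.kind === "string") return true;
  if (t.kind === "keyword") return ["let", "fn", "match", "apply"].includes(t.value ?? "");
  if (t.kind === "symbol") return ["$", "#", "(", "{"].includes(t.value ?? "");
  return false;
}

function tagKind(next: Tok): "tagged-expr" | "tag-expr" {
  // The terminator check is spelled out even though canStartExpr already
  // excludes these tokens, to mirror the prose rule exactly.
  const isTerminator = TERMINATORS.has(next.value ?? "");
  return canStartExpr(next) && !isTerminator ? "tagged-expr" : "tag-expr";
}
```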
---
#### Tuples vs grouping

Parentheses always construct **tuples**.

```txt
()
(123)
(1, 2, 3)
```

Parentheses are **not** used for grouping expressions, so `(123)` is **not** the same as `123`.

---
#### Lists with separators

Many constructs use:

```txt
list-sep-by(p, sep)
```

This allows:

* Empty lists
* Optional leading separator
* Optional trailing separator

Implement a reusable helper that:

* Stops at a known terminator token
* Does not allow repeated separators without elements
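One way to shape that helper, abstracted over callbacks so the sketch stays self-contained; the signatures are illustrative, not the existing API:

```typescript
// Reusable separated-list helper sketch over a stream of tokens.
type Tok = { kind: string; value?: string };

function listSepBy<T>(
  peek: () => Tok,
  isSep: (t: Tok) => boolean,
  isTerminator: (t: Tok) => boolean,
  skip: () => void,
  parseElem: () => T,
): T[] {
  const items: T[] = [];
  if (isSep(peek())) skip(); // optional leading separator
  while (!isTerminator(peek())) {
    items.push(parseElem());
    if (isSep(peek())) {
      // Separator consumed; a trailing one is fine because the loop re-checks
      // the terminator. A doubled separator reaches parseElem, which errors.
      skip();
    } else if (!isTerminator(peek())) {
      throw new Error("expected separator or terminator");
    }
  }
  return items;
}
```

Every bracketed construct (tuples, match branches, etc.) can then pass its own terminator and separator predicates.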
---
### Parsing patterns

Patterns are parsed only in specific contexts:

* `match` branches
* `let` bindings
* lambda parameters

There are **two distinct pattern parsers**:

* `parsePattern()` — full patterns (including tags)
* `parseProductPattern()` — no tags allowed

These should be separate functions.

---

### AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

* Stored directly on AST nodes, or
* Discarded after parsing

Either is acceptable.

---
## Division of responsibility

**Lexer**:

* Characters → tokens
* Unicode-safe
* Tracks positions

**Parser**:

* Tokens → AST
* Grammar enforcement
* Context-sensitive decisions
* Error reporting

Do **not** merge these stages.

---
## Final notes

* Favor clarity over cleverness
* Favor explicit structure over abstraction
* Assume the grammar in `SYNTAX.md` is authoritative
* It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.
```diff
@@ -1,10 +1,8 @@
-import { CARRIAGE_RETURN, char, NEW_LINE, SPACE, TAB } from './source_text';
+import { CARRIAGE_RETURN, char, NEW_LINE } from './source_text';
-import type { SourceText, Span, SourceLocation, CodePoint, StringIndex, CodePointIndex } from './source_text';
+import type { Span, CodePoint } from './source_text';
 import { isDigit, isWhitespace, scanNumber, scanString } from './cursor';
-import type { Cursor, CursorState, GenericScanError, NumberError, StringError } from './cursor';
+import type { Cursor, GenericScanError, NumberError, StringError } from './cursor';
-import { Result } from '../result';
-import { Expr } from 'src/value';

 export function skipWhitespaceAndComments(cursor: Cursor): number {
   let totalConsumed = 0;
```

```diff
@@ -9,3 +9,5 @@ let {
 }

 }
```

```diff
@@ -15,7 +15,7 @@ npm install -D sass-embedded
 npx ts-node src/parser/cursor.test.ts

-npx ts-node src/debug/repl.ts
+npx ts-node src/lang/debug/repl.ts

-npx ts-node src/debug/repl.ts tmp_repl/test.flux
+npx ts-node src/lang/debug/repl.ts tmp_repl/test.flux
```