Move language files into dedicated folder

This commit is contained in:
Yura Dupyn 2026-02-07 10:43:30 +01:00
parent 3d1cd89067
commit 1b406899e0
15 changed files with 7 additions and 343 deletions


@@ -1,336 +0,0 @@
## Goal
Implement a correct parser for the language described in `SYNTAX.md`, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).
Code quality is **not** the primary concern.
Correctness, clarity, and reasonable error messages are.
---
## Overall architecture
The parser is split into **two stages**:
1. **Lexing (tokenization)**
Converts source text into a stream of tokens, each with precise source location info.
2. **Parsing**
Consumes the token stream and constructs the AST using recursive-descent parsing.
This split is deliberate and should be preserved.
---
## Stage 1: Lexer (Tokenizer)
### Purpose
The lexer exists to:
* Normalize the input into a small set of token types
* Track **line / column / offset** precisely
* Make parsing simpler and more reliable
* Enable good error messages later
The lexer is intentionally **simple and dumb**:
* No semantic decisions
* No AST construction
* Minimal lookahead
---
### Unicode handling
The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.
**Important rule**:
* Iterate over Unicode *code points*, not UTF-16 code units.
In TypeScript:
* Use `for (const ch of input)` or equivalent
* Do **not** index into strings with `input[i]`
Column counting:
* Increment column by **1 per code point**
* Exact visual width is not required
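As a non-authoritative sketch, code-point iteration with this column-counting rule could look like the following (`positions` is an illustrative helper, not part of the codebase):

```ts
// Sketch: walk the input by Unicode code points and track 1-based
// line/column, incrementing the column once per code point.
type Step = { ch: string; line: number; column: number };

function positions(input: string): Step[] {
  const out: Step[] = [];
  let line = 1;
  let column = 1;
  for (const ch of input) { // code points: "😀" is a single step
    out.push({ ch, line, column });
    if (ch === "\n") {
      line += 1;
      column = 1;
    } else {
      column += 1; // one column per code point; visual width is ignored
    }
  }
  return out;
}
```

Indexing with `input[i]` would instead split an emoji into its two surrogate halves, which is exactly what the rule above forbids.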
---
### Source positions and spans
All tokens must carry precise location information.
Suggested types (can be adjusted):
```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```
Each token has a `span`.
---
### Token types
Suggested minimal token set:
```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```
Where:
```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```
Notes:
* Operators like `+`, `==`, `<=`, `*` are **identifiers**
* `=` is treated as a keyword (same for `|`)
* Identifiers are parsed first, then checked against keywords
---
### Lexer responsibilities
The lexer should:
* Skip whitespace (spaces, tabs, newlines)
* Track line and column numbers
* Emit tokens with correct spans
* Fail immediately on:
  * Unterminated string literals
  * Invalid characters
The lexer **should not**:
* Attempt error recovery
* Guess intent
* Validate grammar rules
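To make the shape concrete, here is a minimal loop in that spirit. It handles only whitespace, symbols, and identifier-like runs; `lexSketch` and its character classes are illustrative, and real code would also produce numbers, strings, keywords, and spans:

```ts
// Sketch of the lexer's main loop: skip whitespace, emit single-char
// symbols, treat any other run of non-space, non-symbol code points as
// an identifier/operator. Strings are deliberately out of scope here.
function lexSketch(input: string): string[] {
  const symbols = "#$(){},.";
  const isSpace = (ch: string) => ch === " " || ch === "\t" || ch === "\n";
  const cps = [...input]; // code points, not UTF-16 units
  const tokens: string[] = [];
  let i = 0;
  while (i < cps.length) {
    const ch = cps[i];
    if (isSpace(ch)) { i++; continue; }
    if (symbols.includes(ch)) { tokens.push(ch); i++; continue; }
    if (ch === '"') throw new Error("strings not handled in this sketch");
    const start = i;
    while (i < cps.length && !isSpace(cps[i]) && !symbols.includes(cps[i])) i++;
    tokens.push(cps.slice(start, i).join(""));
  }
  return tokens;
}
```

Note how operators like `==` come out as plain identifier-style tokens, matching the rule above.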
---
## Stage 2: Parser
### Parsing strategy
Use **recursive-descent parsing**.
The grammar is:
* Context-free
* Non-left-recursive
* No precedence rules
* No implicit associativity
This makes recursive descent ideal.
---
### Parser state
The parser operates over:
```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```
Helper methods are encouraged:
```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```
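The helpers above might be implemented along these lines; the token shape is simplified for illustration and `MiniParser` is not the real class:

```ts
// Simplified token and parser-state sketch for the helper methods.
type Tok = { kind: string; value?: string };

class MiniParser {
  constructor(private tokens: Tok[], private pos = 0) {}

  peek(): Tok {
    // Never run past the end: report eof instead.
    return this.tokens[this.pos] ?? { kind: "eof" };
  }

  advance(): Tok {
    const t = this.peek();
    if (t.kind !== "eof") this.pos++;
    return t;
  }

  matchKeyword(kw: string): boolean {
    const t = this.peek();
    if (t.kind === "keyword" && t.value === kw) {
      this.advance();
      return true;
    }
    return false;
  }

  expectKeyword(kw: string): Tok {
    const t = this.peek();
    if (!this.matchKeyword(kw)) {
      throw new Error(`expected keyword '${kw}', found '${t.kind}'`);
    }
    return t;
  }
}
```

`match*` consumes on success and returns a boolean; `expect*` consumes or throws. Keeping both avoids scattering ad-hoc peeks through the grammar functions.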
---
### Error handling
Error recovery is **not required**.
On error:
* Throw a `ParseError`
* Include:
  * A clear message
  * A span pointing to the offending token (or best approximation)
The goal is:
* One good error
* Accurate location
* No cascading failures
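One possible `ParseError` shape, reusing the suggested `Position`/`Span` types; rendering the location into the message is an assumption, not a requirement:

```ts
// Sketch: an error type carrying a message plus an optional span,
// folded into the message text for quick diagnostics.
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

class ParseError extends Error {
  constructor(message: string, readonly span?: Span) {
    super(
      span ? `${message} at ${span.start.line}:${span.start.column}` : message,
    );
    this.name = "ParseError";
  }
}
```

Since recovery is out of scope, the first `ParseError` simply propagates out of the top-level parse call.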
---
### Expression parsing
There is **no precedence hierarchy**.
`parseExpr()` should:
* Look at the next token
* Dispatch to the correct parse function based on:
  * keyword (e.g. `let`, `fn`, `match`, `apply`)
  * symbol (e.g. `$`, `#`, `(`, `{`)
  * identifier (e.g. top-level function call)
Order matters.
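The branching shape of `parseExpr` can be sketched as a single dispatch on the head token. Each branch is reduced here to the name of the sub-parser it would call; the names, and the mapping for `{`, are illustrative assumptions:

```ts
// Sketch: dispatch on the head token of an expression.
type Head = { kind: string; value?: string };

function dispatchExpr(t: Head): string {
  if (t.kind === "keyword") {
    switch (t.value) {
      case "let": return "parseLet";
      case "fn": return "parseFn";
      case "match": return "parseMatch";
      case "apply": return "parseApply";
    }
  }
  if (t.kind === "symbol") {
    switch (t.value) {
      case "$": return "parseVariable";
      case "#": return "parseTag";
      case "(": return "parseTuple";
      case "{": return "parseBrace"; // placeholder name
    }
  }
  if (t.kind === "identifier") return "parseCall";
  if (t.kind === "number" || t.kind === "string") return "parseLiteral";
  throw new Error(`expression cannot start with ${t.kind} '${t.value ?? ""}'`);
}
```

Tokens that cannot start an expression (e.g. `)` or a bare `=`) fall through every branch and raise the error immediately.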
---
### Important parsing rules
#### Variable use
```txt
$x
```
* `$` immediately followed by identifier
* No whitespace allowed
#### Tag expressions
```txt
#foo
#foo expr
```
Parsing rule:
* After `#tag`, look at the next token
* If the next token can start an expression **and is not a terminator** (`)`, `}`, `,`, `|`, `.`):
  * Parse a `tagged-expr`
* Otherwise:
  * Parse a `tag-expr`
This rule is intentional and should be implemented directly.
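A direct transcription of this rule might look as follows. The set of tokens that "can start an expression" is inferred from the token kinds above and should be checked against `SYNTAX.md`:

```ts
// Sketch: after consuming `#tag`, classify the next token.
type Next = { kind: string; value?: string };

// Terminators from the rule above; `|` is a keyword, the rest are symbols.
const TERMINATORS = new Set([")", "}", ",", "|", "."]);

function tagForm(next: Next): "tag-expr" | "tagged-expr" {
  if (next.value !== undefined && TERMINATORS.has(next.value)) {
    return "tag-expr";
  }
  const canStartExpr =
    next.kind === "number" ||
    next.kind === "string" ||
    next.kind === "identifier" ||
    (next.kind === "keyword" &&
      ["let", "fn", "match", "apply"].includes(next.value ?? "")) ||
    (next.kind === "symbol" &&
      ["$", "#", "(", "{"].includes(next.value ?? ""));
  return canStartExpr ? "tagged-expr" : "tag-expr";
}
```

Note that `eof` falls into neither category and correctly yields a bare `tag-expr`.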
---
#### Tuples vs grouping
Parentheses always construct **tuples**.
```txt
()
(123)
(1, 2, 3)
```
Parentheses are **not** used for grouping expressions, so `(123)` is a single-element tuple, NOT the same as `123`.
---
#### Lists with separators
Many constructs use:
```txt
list-sep-by(p, sep)
```
This allows:
* Empty lists
* Optional leading separator
* Optional trailing separator
Implement a reusable helper that:
* Stops at a known terminator token
* Does not allow repeated separators without elements
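One way to shape the helper, here over a flat array of token strings for illustration; the real version would operate on the parser state with token predicates instead:

```ts
// Sketch: parse `list-sep-by(p, sep)` up to a known terminator.
// Empty lists and one optional leading/trailing separator are allowed;
// two separators in a row with no element between them are rejected.
function listSepBy<T>(
  tokens: string[],
  parseItem: (tok: string) => T,
  sep: string,
  terminator: string,
): { items: T[]; rest: string[] } {
  const items: T[] = [];
  let i = 0;
  if (tokens[i] === sep) i++; // optional leading separator
  while (i < tokens.length && tokens[i] !== terminator) {
    items.push(parseItem(tokens[i]));
    i++;
    if (tokens[i] === sep) {
      i++; // consumed; a trailing separator is fine if terminator follows
      if (tokens[i] === sep) throw new Error("repeated separator");
    } else if (tokens[i] !== terminator) {
      throw new Error(`expected '${sep}' or '${terminator}'`);
    }
  }
  return { items, rest: tokens.slice(i) }; // rest starts at the terminator
}
```

Stopping at the terminator without consuming it lets the caller `expectSymbol` the closing token and attach a span to it.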
---
### Parsing patterns
Patterns are parsed only in specific contexts:
* `match` branches
* `let` bindings
* lambda parameters
There are **two distinct pattern parsers**:
* `parsePattern()` — full patterns (including tags)
* `parseProductPattern()` — no tags allowed
These should be separate functions.
---
### AST construction
Parser functions should construct AST nodes directly, matching the existing AST types exactly.
If necessary, spans may be:
* Stored directly on AST nodes, or
* Discarded after parsing
Either is acceptable.
---
## Division of responsibility
**Lexer**:
* Characters → tokens
* Unicode-safe
* Tracks positions
**Parser**:
* Tokens → AST
* Grammar enforcement
* Context-sensitive decisions
* Error reporting
Do **not** merge these stages.
---
## Final notes
* Favor clarity over cleverness
* Favor explicit structure over abstraction
* Assume the grammar in `SYNTAX.md` is authoritative
* It is acceptable to tweak helper types or utilities if needed
Correct parsing is the goal. Performance and elegance are not.


@@ -1,10 +1,8 @@
-import { CARRIAGE_RETURN, char, NEW_LINE, SPACE, TAB } from './source_text';
-import type { SourceText, Span, SourceLocation, CodePoint, StringIndex, CodePointIndex } from './source_text';
+import { CARRIAGE_RETURN, char, NEW_LINE } from './source_text';
+import type { Span, CodePoint } from './source_text';
 import { isDigit, isWhitespace, scanNumber, scanString } from './cursor';
-import type { Cursor, CursorState, GenericScanError, NumberError, StringError } from './cursor';
-import { Result } from '../result';
-import { Expr } from 'src/value';
+import type { Cursor, GenericScanError, NumberError, StringError } from './cursor';
 export function skipWhitespaceAndComments(cursor: Cursor): number {
   let totalConsumed = 0;


@@ -9,3 +9,5 @@ let {
}
}


@@ -15,7 +15,7 @@ npm install -D sass-embedded
 npx ts-node src/parser/cursor.test.ts
-npx ts-node src/debug/repl.ts
+npx ts-node src/lang/debug/repl.ts
-npx ts-node src/debug/repl.ts tmp_repl/test.flux
+npx ts-node src/lang/debug/repl.ts tmp_repl/test.flux