Move language files into dedicated folder
parent 3d1cd89067
commit 1b406899e0
15 changed files with 7 additions and 343 deletions
@@ -1,336 +0,0 @@
## Goal

Implement a correct parser for the language described in `SYNTAX.md`, producing the existing AST types (`Expr`, `Pattern`, `ProductPattern`, etc.).

Code quality is **not** the primary concern. Correctness, clarity, and reasonable error messages are.

---

## Overall architecture

The parser is split into **two stages**:

1. **Lexing (tokenization)**
   Converts source text into a stream of tokens, each with precise source location info.
2. **Parsing**
   Consumes the token stream and constructs the AST using recursive-descent parsing.

This split is deliberate and should be preserved.

---

## Stage 1: Lexer (Tokenizer)

### Purpose

The lexer exists to:

* Normalize the input into a small set of token types
* Track **line / column / offset** precisely
* Make parsing simpler and more reliable
* Enable good error messages later

The lexer is intentionally **simple and dumb**:

* No semantic decisions
* No AST construction
* Minimal lookahead

---

### Unicode handling

The input may contain arbitrary Unicode (including emoji) inside identifiers and strings.

**Important rule**: iterate over Unicode *code points*, not UTF-16 code units.

In TypeScript:

* Use `for (const ch of input)` or equivalent
* Do **not** index into strings with `input[i]`

Column counting:

* Increment the column by **1 per code point**
* Exact visual width is not required

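
A minimal sketch of this rule (the helper names here are hypothetical, not part of the required API; the `Position` shape matches the suggestion in the next section):

```ts
type Position = { offset: number; line: number; column: number };

// Advance a position over one code point. A newline starts a new line;
// every other code point (including a multi-unit emoji) adds exactly 1
// to the column, because `for...of` yields whole code points.
function advancePosition(pos: Position, ch: string): Position {
  if (ch === "\n") {
    return { offset: pos.offset + 1, line: pos.line + 1, column: 1 };
  }
  return { offset: pos.offset + 1, line: pos.line, column: pos.column + 1 };
}

function positionAfter(input: string): Position {
  let pos: Position = { offset: 0, line: 1, column: 1 };
  for (const ch of input) pos = advancePosition(pos, ch); // code points, not input[i]
  return pos;
}
```

Note that `"a🦀b".length` is 4 (UTF-16 code units) but the string is only 3 code points, so after lexing it the column is 4, not 5.
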
---

### Source positions and spans

All tokens must carry precise location information.

Suggested types (can be adjusted):

```ts
type Position = {
  offset: number; // code-point index from start of input
  line: number;   // 1-based
  column: number; // 1-based
};

type Span = {
  start: Position;
  end: Position;
};
```

Each token has a `span`.

---

### Token types

Suggested minimal token set:

```ts
type Token =
  | { kind: "number"; value: number; span: Span }
  | { kind: "string"; value: string; span: Span }
  | { kind: "identifier"; value: string; span: Span }
  | { kind: "keyword"; value: Keyword; span: Span }
  | { kind: "symbol"; value: Symbol; span: Span }
  | { kind: "eof"; span: Span };
```

Where:

```ts
type Keyword = "let" | "fn" | "match" | "apply" | "=" | "!" | "|";
type Symbol = "#" | "$" | "(" | ")" | "{" | "}" | "," | ".";
```

Notes:

* Operators like `+`, `==`, `<=`, `*` are **identifiers**
* `=` is treated as a keyword (same for `|`)
* Identifiers are scanned first, then checked against the keyword set

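
The identifier-first rule can be sketched as follows (a hypothetical helper, not the required API):

```ts
// The lexer scans a maximal identifier/operator word first, then
// reclassifies it as a keyword if it is in the keyword set.
const KEYWORDS = new Set(["let", "fn", "match", "apply", "=", "!", "|"]);

function classifyWord(word: string): { kind: "keyword" | "identifier"; value: string } {
  return KEYWORDS.has(word)
    ? { kind: "keyword", value: word }
    : { kind: "identifier", value: word };
}
```

With this rule, `=` comes out as a keyword while `==` stays an identifier.
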
---

### Lexer responsibilities

The lexer should:

* Skip whitespace (spaces, tabs, newlines)
* Track line and column numbers
* Emit tokens with correct spans
* Fail immediately on:
  * Unterminated string literals
  * Invalid characters

The lexer **should not**:

* Attempt error recovery
* Guess intent
* Validate grammar rules

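
As an illustration of the fail-fast rule, a string scanner might look like this (a hypothetical sketch; escape sequences are omitted):

```ts
// Scan a double-quoted string; `chars` is the input split into code points
// and `i` points at the opening quote. No recovery: an unterminated
// literal is an immediate error.
function scanStringLiteral(chars: string[], i: number): { value: string; next: number } {
  let out = "";
  for (let j = i + 1; j < chars.length; j++) {
    if (chars[j] === '"') return { value: out, next: j + 1 };
    out += chars[j];
  }
  throw new Error(`Unterminated string literal starting at code point ${i}`);
}
```
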
---

## Stage 2: Parser

### Parsing strategy

Use **recursive-descent parsing**.

The grammar is:

* Context-free
* Non-left-recursive
* No precedence rules
* No implicit associativity

This makes recursive descent ideal.

---

### Parser state

The parser operates over:

```ts
class Parser {
  tokens: Token[];
  pos: number;
}
```

Helper methods are encouraged:

```ts
peek(): Token
advance(): Token
matchKeyword(kw: Keyword): boolean
matchSymbol(sym: Symbol): boolean
expectKeyword(kw: Keyword): Token
expectSymbol(sym: Symbol): Token
error(message: string, span?: Span): never
```

---

### Error handling

Error recovery is **not required**.

On error:

* Throw a `ParseError`
* Include:
  * A clear message
  * A span pointing to the offending token (or the best approximation)

The goal is:

* One good error
* An accurate location
* No cascading failures

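
One possible shape for the error type (a sketch; the field names are assumptions):

```ts
type Position = { offset: number; line: number; column: number };
type Span = { start: Position; end: Position };

// A single fatal error carrying the span of the offending token, so the
// caller can print `line:column` and underline the source range.
class ParseError extends Error {
  constructor(message: string, readonly span: Span) {
    super(`${span.start.line}:${span.start.column}: ${message}`);
    this.name = "ParseError";
  }
}
```

Throwing this from `error()` aborts the parse immediately, which is what produces "one good error" instead of cascading failures.
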
---

### Expression parsing

There is **no precedence hierarchy**.

`parseExpr()` should:

* Look at the next token
* Dispatch to the correct parse function based on:
  * keyword (e.g. `let`, `fn`, `match`, `apply`)
  * symbol (e.g. `$`, `#`, `(`, `{`)
  * identifier (e.g. a top-level function call)

Order matters.

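
A dispatch skeleton for `parseExpr()` (the returned labels are placeholders standing in for calls to the real parse functions, which are not shown here):

```ts
type Token = { kind: string; value?: string };

// Decide which parse function handles the expression starting at `t`.
// In the real parser each branch calls the dedicated parse function.
function dispatchExpr(t: Token): string {
  if (t.kind === "keyword") {
    switch (t.value) {
      case "let": return "let-expr";
      case "fn": return "fn-expr";
      case "match": return "match-expr";
      case "apply": return "apply-expr";
    }
  }
  if (t.kind === "symbol") {
    switch (t.value) {
      case "$": return "variable";
      case "#": return "tag";
      case "(": return "tuple";
      case "{": return "record-or-block";
    }
  }
  if (t.kind === "identifier") return "call";
  if (t.kind === "number" || t.kind === "string") return "literal";
  throw new Error(`Unexpected '${t.kind}' at start of expression`);
}
```
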
---

### Important parsing rules

#### Variable use

```txt
$x
```

* `$` immediately followed by an identifier
* No whitespace allowed between them

#### Tag expressions

```txt
#foo
#foo expr
```

Parsing rule:

* After `#tag`, look at the next token
* If the next token can start an expression **and is not a terminator** (`)`, `}`, `,`, `|`, `.`), parse a `tagged-expr`
* Otherwise, parse a `tag-expr`

This rule is intentional and should be implemented directly.

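
The terminator check can be written directly (a sketch; the "can start an expression" half of the rule is simplified here to an eof-and-terminator check):

```ts
type Token = { kind: string; value?: string };

// Tokens that end the surrounding construct: after `#tag`, seeing one of
// these (or eof) means the tag carries no payload expression.
const TAG_TERMINATORS = new Set([")", "}", ",", "|", "."]);

function tagTakesPayload(next: Token): boolean {
  if (next.kind === "eof") return false;
  if (next.value !== undefined && TAG_TERMINATORS.has(next.value)) return false;
  return true; // anything else is assumed to start an expression
}
```
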
---

#### Tuples vs grouping

Parentheses always construct **tuples**.

```txt
()
(123)
(1, 2, 3)
```

Parentheses are **not** used for grouping expressions, so `(123)` is **not** the same as `123`.

---

#### Lists with separators

Many constructs use:

```txt
list-sep-by(p, sep)
```

This allows:

* Empty lists
* An optional leading separator
* An optional trailing separator

Implement a reusable helper that:

* Stops at a known terminator token
* Does not allow repeated separators without elements

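
A reusable helper along those lines might look like this (a sketch written over callback probes rather than a concrete token stream, so it stays independent of the parser class):

```ts
// Parse `list-sep-by(p, sep)`: empty lists, an optional leading separator,
// and an optional trailing separator are allowed; `sep sep` with no
// element in between is an error. `atSep`/`atEnd` inspect the current
// token; `consumeSep` skips one separator; `parseItem` consumes an element.
function parseListSepBy<T>(
  parseItem: () => T,
  atSep: () => boolean,
  consumeSep: () => void,
  atEnd: () => boolean,
): T[] {
  const items: T[] = [];
  if (atSep()) consumeSep(); // optional leading separator
  while (!atEnd()) {
    items.push(parseItem());
    if (atSep()) {
      consumeSep();
      if (atSep()) throw new Error("Repeated separator without an element");
    } else if (!atEnd()) {
      throw new Error("Expected separator or list terminator");
    }
  }
  return items;
}
```

Driving it with a cursor whose `atEnd` checks for the construct's known terminator (`)`, `}`, etc.) keeps the helper agnostic about which construct is being parsed.
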
---

### Parsing patterns

Patterns are parsed only in specific contexts:

* `match` branches
* `let` bindings
* lambda parameters

There are **two distinct pattern parsers**:

* `parsePattern()` — full patterns (including tags)
* `parseProductPattern()` — no tags allowed

These should be separate functions.

---

### AST construction

Parser functions should construct AST nodes directly, matching the existing AST types exactly.

If necessary, spans may be:

* Stored directly on AST nodes, or
* Discarded after parsing

Either is acceptable.

---

## Division of responsibility

**Lexer**:

* Characters → tokens
* Unicode-safe
* Tracks positions

**Parser**:

* Tokens → AST
* Grammar enforcement
* Context-sensitive decisions
* Error reporting

Do **not** merge these stages.

---

## Final notes

* Favor clarity over cleverness
* Favor explicit structure over abstraction
* Assume the grammar in `SYNTAX.md` is authoritative
* It is acceptable to tweak helper types or utilities if needed

Correct parsing is the goal. Performance and elegance are not.


```diff
@@ -1,10 +1,8 @@
-import { CARRIAGE_RETURN, char, NEW_LINE, SPACE, TAB } from './source_text';
-import type { SourceText, Span, SourceLocation, CodePoint, StringIndex, CodePointIndex } from './source_text';
+import { CARRIAGE_RETURN, char, NEW_LINE } from './source_text';
+import type { Span, CodePoint } from './source_text';
 import { isDigit, isWhitespace, scanNumber, scanString } from './cursor';
-import type { Cursor, CursorState, GenericScanError, NumberError, StringError } from './cursor';
-import { Result } from '../result';
-import { Expr } from 'src/value';
+import type { Cursor, GenericScanError, NumberError, StringError } from './cursor';

 export function skipWhitespaceAndComments(cursor: Cursor): number {
   let totalConsumed = 0;
```


```diff
@@ -9,3 +9,5 @@ let {
 }

 }
+
+
```


```diff
@@ -15,7 +15,7 @@ npm install -D sass-embedded
 npx ts-node src/parser/cursor.test.ts

-npx ts-node src/debug/repl.ts
+npx ts-node src/lang/debug/repl.ts

-npx ts-node src/debug/repl.ts tmp_repl/test.flux
+npx ts-node src/lang/debug/repl.ts tmp_repl/test.flux
```
