| Commit message (Collapse) | Author | Age |
| | |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes #23
Up to this point, the compiler has used the ordinary `String` type to represent declaration names, which means a bunch of lookup structures throughout the compiler were string-to-whatever maps, which can reduce efficiency.
It also means that things like the `Token` type end up carying a `String` by value and paying for things like reference-counting.
This change adds a `Name` type that is used to represent names of variables, types, macros, etc.
Names are cached and unique'd globally for a session, and the string-to-name mapping gets done during lexing.
From that point on, most mapping is from pointers, which should make all the various table lookups faster.
More importantly (possibly), this brings us one step closer to being able to pool-allocate the AST nodes.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The existing parser code was doing string-based matching on the lookahead token to figure out how to parse a declaration, e.g.:
```
if(lookAhead == "struct") { /* do struct thing */ }
else if(lookAhead == "interface") { /* do interface thing * }
...
```
That approach has some annoying down-sides:
- It is slower than it needs to be
- It is annoying to deal with cases where the available declaration keywords might differ by language
- Most importantly, it is not possible for us to introduce "extended" keywords that the user can make use of, but which can be ignored by the user and treated as an ordinary identifier.
That last part is important. Suppose the user wanted to have a local variable named `import`, but we also had a Slang extension that added an `import` keyword. Then a line of code like `import += 1` would lead to a failure because we'd try to parse an import declaration, even when it is obvious that the user meant their local variable. This would mean that Slang can't parse existing user code that might clash with syntax extensions. This issue is the reason why we currently have keywords like `__import`.
A traditional solution in a compiler is to map keywords to distinct token codes as part of lexing, which eliminates the first conern (performance) because now we can dispatch with `switch`. It can also aleviate the second concern if we add/remove names from the string->code mapping based on language (the rest of the parsing logic doesn't have to know about keywords being added/removed).
The solution we go for here is more aggressive.
Instead of mapping keyword names to special token codes during lexing, we instead introduce logical "syntax declarations" into the AST, which are looked up using the ordinary scoping rules of the language.
Depending on what code is imported into the scope where parsing is going on, different keywords may then be visible.
This solves our last concern, since a user-defined variable that just happens to use the same name as a keyword is now allowed to shadow the imported declaration for syntax (this is akin to, e.g., Scheme where there really aren't any "keywords").
This also opens the door to the possibility of eventually allowing user to define their own syntax (again, like Scheme).
For now I'm only using this for the declaration keywords.
With this change it should be pretty easy to also add statement keywords in the same fashion.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fixes #24
So far the code has used a representation for source locations that is heavy-weight, but typical of research or hobby compilers: a `struct` type containing a line number and a (heap-allocated) string.
This is actually very convenient for debugging, but it means that any data structure that might contain a source location needs careful memory management (because of those strings) and has a tendency to bloat.
The new represnetation is that a source location is just a pointer-sized integer.
In the simplest mental model, you can think of this as just counting every byte of source text that is passed in, and using those to name locations.
Finding the path and line number that corresponds to a location involves a lookup step, but we can arrange to store all the files in an array sorted by their start locations, and do a binary search.
Finding line numbers inside a file is similarly fast (one you pay a one-time cost to build an array of starting offsets for lines).
More advanced compilers like clang actually go further and create a unique range of source locations to represent a file each time it gets included, so that they can track the include stack and reproduce it in diagnostic messages.
I'm not doing anything that clever here.
|
| |
|
|
|
|
|
|
|
|
| |
- `ExpressionSyntaxNode` becomes `Expr`
- `StatementSyntaxNode` becomes `Stmt`
- `StructSyntaxNode` becomes `StructDecl`
- `ProgramSyntaxNode` becomes `ModuleDecl`
- `ExpressionType` becomes `Type`
- Existing fields names `Type` become `type`
- There might be some collateral damage here if there were, e.g., `enum`s named `Type`, but I can live with that for now and fix those up as a I see them
|
| |
|
|
|
|
|
| |
The change is mostly about trying to make sure the compiler "fails safe" when it encounters an internal assumption that isn't met.
Most internal errors will now throw exceptions (yes, exceptions are evil, but this will work for now), and these get caught in `spCompile` so that they don't propagate to the user (they just see a message that compilation aborted due to an internal error).
Subsequent changes are going to need to work on diagnosing as many of these situations as possible, so that users can at least know what construct in their code was unexpected or unhandled by the compiler.
|
| |
|
|
|
|
|
|
|
|
|
| |
This is in anticipation of needing to have more complete knowledge to be able to handle user code that `import`s library functionality.
The big picture of this change is just to remove the `UnparsedStmt` class that was used to hold the bodies of user functions as opaque token streams, and thus to let the full parser and compiler loose on that code. That is the easy part, of course, and the hard part is all the fixes that this requires in the rest of the compielr to make this even remotely work.
Subsequent commit address a lot of other issues, so this particular commit mostly represents work-in-progress.
One detail is that this change puts a conditional around nearly every diagnostic message in `check.cpp` to suppress thing when in rewriter mode.
I have yet to check how that works out if there are errors in anything we actually need to understand for the purposes of generating reflection data.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- `RefPtr` no longer tries to have distinct cases for interal-vs-external reference counts. Instead we always require an internal reference count.
- Types the used `RefPtr` but weren't `RefObject` were made to inherit `RefObject`
- The `ReferenceCounted` base class was removed, so that only `RefObject` remains
- Implicit conversion from `RefPtr<T>` to `T*` added
- This created some complicates for other types that relied on implicit conversions, so this isn't a net cleanup right now
- The main type that got messed up by the above was `String`, which previously held a `RefPtr<char, ...>`. This change thus *also* includes a major overhaul of `String`:
- `String` now holds all its data via indirection, using a `StringRepresentation` that is a `RefObject`. This object holds a length, capacity, and directly stores the character data in its allocation. This means that `sizeof(String)==sizeof(void*)`
- It is now possible to directly mutate a `String` by appending to its representation (we just need to ensure it has a reference count of `1`, possibly by cloning it). This means that `StringBuilder` is now basically just an idomatic use of `String`
- A couple operations that just return sub-ranges of a `String` now return `StringSlice` to avoid allocation when it isn't needed. This required more work.
- Indices into strings changed from `int` to `UInt` (which is pointer-sized). This had a bunch of follow-on changes because the value `-1` sometimes needs to be special-cased in code that uses indices. Further cleanups are probably needed here.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Add logic to extract the value and suffix from a numeric literal
- This duplicates some of the lexing logic, but this is hard to avoid without redundant runtime work
- Note that I'm not using and stdlib string-to-number code. This should be more robust once it is working, but it is obviously error prone in the near term. The main up-sides to this are:
- We can handle binary integer literals
- We can handle hexadecimal floating-point literals without stdlib support
- We can hypothetically support digit separators, if we ever wanted
- The parser looks at the suffix characters sliced off by the lexer, and tries to pick a type to use for a literal
- It uses `NULL` if there is no suffix, to avoid some nasty order dependencies where the stdlib might need to parse a number before it has seen the definition of `int`
- Right now I only handle a few cases, so there may be bugs lurking here
- The emit logic needs to handle the fact that a literal node in the AST might have a non-default type attached.
- Right now I just quickly check for the most likely types, and emit the literal with a matching suffix. This doesn't seem robust if any source language supports a suffix for a type where a target has no corresponding suffix. In the long term some amount of casting is probably required.
|
| |
|
|
|
| |
These had a typo (`Literial`), so they needed a fix eventually.
I also went ahead and made things a bit more verbose (`IntegerLiteral`, `FloatingPointLiteral`) because these names don't get used often enough for the brevity to pay off.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
The previous changes had left out logic for "scrubbing" a token value that includes an escaped newline, because I expected it would only occur within whitespace. Unfortunately, some user code looked like this:
```
a + b
```
That is, there was a token at the very start of the line, after the escaped newline.
As a result, after consuming the leading whitespace (which didn't end up consuming the escaped newline - but we could consider making it do so in future), the lexer started to lex a token that *starts* with an escaped newline, but turns out to be an identifer (which gets an invalid name).
This change adds some ad-hoc code to "scrub" the value of *every* token, which wasteful but at least solves the problem.
|
| |
|
|
| |
This gets rid of one unecessary namespace.
|
| |
|
|
| |
This is really messy and I'm not entirely happy with the result, but at least we handle a couple more corner cases.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is mostly to allow for the idiomatic style of defining a multi-line macro in C:
#define FOO(a,b) \
x(a) \
y(b) \
/* end */
The handling is reasonably general: in the lexer whenever we need to consume or "peek" the next code point, we check if we are at the start of a backslash-newline sequence, and if so we skip past that to find what we were looking for.
However, the way I'm handling things right now there is no step taken to "clean" a token and remove the backslash or newline from its value, so downstream code that actually inspects token values will probably break if users start putting escaped newlines in the middle of names.
We can fix that issue when (if) it comes up.
|
| |
|