gemini://gemini.circumlunar.space/users/emptyhallway/halfbaking/no-fail-scripting-language.gmi

I like the idea of a scripting language that can run any text file without failing. There is no such thing as bad syntax, since any string of characters (even binary!??) can be unambiguously interpreted. The result of a nonsense script would likely be boring most of the time: an "undefined" value, for example. A few mistakes in an otherwise good script would generally give plausible, but useless, results. What would this kind of language look like? Who would even want to use it? I can't answer the second question, but I like to think about the first one.

Please keep in mind that this is a rambling collection of thoughts. There will be contradictions and unresolved questions. I haven't really done any research for this. If you are looking to learn something, you are probably in the wrong place.

Just to lay some groundwork, what are the minimum building blocks for a language? There are probably several good articles or books about this, but I haven't looked very hard yet.

Here is a list of some ideas. This language should probably be quite small, so I expect several of these will be dropped before the final specification.

Numeric: negation, addition, subtraction, multiplication, division, remainder, comparisons

The only built-in language features in this proposed language are operators and literals (strings and numbers, for example). There are no built-in functions. All operators are made of ASCII punctuation. This was an arbitrary choice, mostly unrelated to the primary goals of the language. I could have used "if" and "and" as operators, but I'm experimenting with things like "?:" instead. I may write more about the reasons for this later.

Human-friendly languages follow a "natural" flow, called infix notation, where each operator is surrounded by its operands. Infix notation can often be read aloud in an English-like sentence. For example:

This requires a formal order of operations for each operator to get "natural" results. In the example above, it's important to evaluate + before = to get the expected result of a equal to 8.

Of course, there's no requirement that operators follow a "natural" order of preference. They could simply evaluate from left to right, consuming values immediately. In this case, the above example would evaluate to "a equals three, add five to the result of an assignment". This is odd, depending on what happens. Does the = operator even return a result? Maybe this could be evaluated as "a equals three, then add five to it". This gives the same result as the "natural" example, but the internal logic is different. There will be other examples that are not equivalent.

In any event, I don't think it's possible to write a usable infix notation without a way to prioritize segments of the expression above others, commonly parentheses. Starting from an example from Wikipedia, rewriting 5 - (6 * 7) without parentheses results in the unwieldy -6 * 7 + 5.

An alternative is prefix notation (also called Polish notation), where the operator is printed first, followed by its arguments:

This is less human-readable, especially in complex statements where the second operand might be far away from the operator and the start of the first operand. But complex nested parentheses can be hard to make sense of, too, and the prefix notation is easier for a machine to parse. Part of the benefit, though, comes from the abandonment of operator priority. As seen above, we can abandon operator priority in infix operator order as well.

Prefix notation also has the advantage of obvious support for operators that require more than two operands. There aren't many of these, though, so the benefit is probably small. In fact, I can only think of one, which I know as ternary comparison, which typically requires two operators:

In prefix notation, this could be defined to ingest three parameters and be written with a single operator:

Let's go with prefix notation for now. I think its formality and simplicity will make it easier to introduce flexibility in other areas where it will be necessary.
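To make the fixed-operand-count idea concrete, here is a minimal Python sketch of a prefix evaluator. It is not part of the proposed language: the ARITY table, the function names, and the choice of "?:" as a three-operand operator are all assumptions for illustration. It only understands integer literals, it assumes well-formed input, and it evaluates both branches of "?:" before choosing one.

```python
# Hypothetical operator table: each operator takes a fixed number of operands.
ARITY = {"+": 2, "-": 2, "*": 2, "?:": 3}

def evaluate(symbols):
    """Evaluate one prefix expression from a list of symbols.

    Returns (value, remaining_symbols)."""
    head, rest = symbols[0], symbols[1:]
    if head not in ARITY:
        return int(head), rest  # this sketch only knows integer literals
    args = []
    for _ in range(ARITY[head]):
        value, rest = evaluate(rest)
        args.append(value)
    if head == "+":
        return args[0] + args[1], rest
    if head == "-":
        return args[0] - args[1], rest
    if head == "*":
        return args[0] * args[1], rest
    # "?:" returns its second operand if the first is non-zero, else the third.
    return (args[1] if args[0] else args[2]), rest

def run(text):
    """Evaluate a single space-separated prefix expression."""
    value, _ = evaluate(text.split())
    return value
```

With this sketch, run("- 5 * 6 7") gives -37 without any parentheses or priority rules, and run("?: 0 10 20") gives 20.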

The whole point of this language is that it should be impossible for a program to fail during execution. This means that any string of characters must evaluate to something. Syntax must be extremely flexible.

Despite this, I don't want to discard any input. Simply ignoring chunks of code that the parser doesn't understand feels like a major cop-out. Instead, my goal is for every character to be significant (excepting things like comments, obviously, which are explicitly ignored).

Any of the common line break sequences (\r\n, \r, \n) are treated as a line break.

I think Unicode has a list of white space characters, such as the space, tab, and more, so assume that we accept all of them as our white space (not including the line break sequences, if they are on the Unicode list). Wait, no, this is a terrible idea because Unicode is a living standard, so the script could be parsed differently based on which Unicode library the parser uses if the list of white space characters changes. Let's go the other direction: the only white space character is space. I'll still talk about "white space characters" as a group in case I change my mind on this later.

Any characters which are not line breaks and not white space are called symbolic characters. Each group of consecutive symbolic characters is called a symbol. A symbol might be an operator, a variable name, a function name, a numeric literal, and so on. Heck, even binary data that isn't valid character strings could be interpreted if the parser supports it. Why not?

A line break sequence can be escaped by preceding it with a backslash. This causes the line break to be treated as white space instead of a line break, which allows long statements to span multiple lines.

A white space character can be escaped by preceding it with a backslash. This causes the white space to be treated as a symbolic character. (Is this a good idea? Probably not. Is there any valid use case at all? I'll have to think about it.)
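Here is a rough Python sketch of a tokenizer that follows the rules so far: space is the only white space, any of \r\n, \r, \n ends a line, a backslash before a line break turns it into white space, and a backslash before a space makes that space part of the symbol. The function name is made up, and treating any other backslash as an ordinary symbolic character is my assumption, since nothing above says what should happen there.

```python
def tokenize(source):
    """Split source text into lines, each a list of symbols."""
    lines, symbols, current = [], [], ""
    i = 0
    while i < len(source):
        ch = source[i]
        nxt = source[i + 1] if i + 1 < len(source) else ""
        if ch == "\\" and nxt in ("\r", "\n"):
            # Escaped line break: acts as white space, so the line continues.
            if current:
                symbols.append(current)
                current = ""
            i += 3 if source[i + 1:i + 3] == "\r\n" else 2
        elif ch == "\\" and nxt == " ":
            # Escaped space: becomes a symbolic character.
            current += " "
            i += 2
        elif ch in ("\r", "\n"):
            # Unescaped line break: finish the symbol and the line.
            if current:
                symbols.append(current)
                current = ""
            lines.append(symbols)
            symbols = []
            i += 2 if source[i:i + 2] == "\r\n" else 1
        elif ch == " ":
            if current:
                symbols.append(current)
                current = ""
            i += 1
        else:
            # Assumption: everything else, including a lone backslash,
            # is an ordinary symbolic character.
            current += ch
            i += 1
    if current:
        symbols.append(current)
    lines.append(symbols)
    return lines
```

Nothing in here can raise on odd input: every character is either a line break, a space, or part of a symbol.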

Note that white space is required between operators and their operands. I think this is normal in prefix notation, but it may be unexpected for users of infix notation.

If an expression is missing parameters or operands when a line break occurs, all missing values are parsed as the "undefined" value. All operators must be able to handle "undefined" as input, although often the result will be "undefined" in that situation.

Every operator takes a fixed number of operands. Because of this, the parser can know with certainty when an expression has ended. If it encounters additional input on the same line, it can simply start parsing it as a new expression.

The above code assigns the value 5 to the variable a, then assigns the sum a+1 to the variable b. This is terrible programming style. I don't know why you would want to do this.

This is two expressions. The first expression, "a", returns "undefined", but it doesn't matter because nothing is handling that result. The second expression, "= 5" could also be written "= 5 undefined" because the end of the line terminates the expression and passes "undefined" in place of any missing operands. This creates a new variable, "5", and assigns it the value "undefined". See more below about variable naming and why this is not as dangerous as it looks (but still obviously bad practice).
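The two rules in this part (missing operands become "undefined" at the end of the line, and leftover input on a line starts a new expression) can be sketched in Python roughly like this. The names UNDEFINED, ARITY and parse_line are hypothetical, the sketch builds a tree instead of evaluating anything, and it ignores the rule that an operand in a name position is always treated as a name. The inputs in the usage note are my reconstruction from the prose descriptions above, not the original examples.

```python
UNDEFINED = None  # stand-in for the language's "undefined" value
ARITY = {"=": 2, "+": 2}  # hypothetical operator table

def parse_line(symbols):
    """Parse the symbols of one line into a list of expression trees."""
    pos = 0

    def parse_expression():
        nonlocal pos
        if pos >= len(symbols):
            return UNDEFINED  # the line ended early: fill in "undefined"
        symbol = symbols[pos]
        pos += 1
        if symbol in ARITY:
            operands = [parse_expression() for _ in range(ARITY[symbol])]
            return (symbol, *operands)
        return symbol  # a literal or a name

    expressions = []
    while pos < len(symbols):
        # Leftover input after a complete expression starts a new one.
        expressions.append(parse_expression())
    return expressions
```

Feeding it the symbols of "= a 5 = b + a 1" gives two assignment trees, and the symbols of "a = 5" give the bare symbol "a" followed by ("=", "5", None).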

The comment symbol is probably the -- character pair. When a comment symbol appears on a line, all characters are ignored until the next line break. This includes escape characters, so it's not possible to continue the same comment onto a new line.

Like all symbols, note that the comment symbol must be bounded by white space. If combined with other symbolic characters, it will be parsed as a different symbol, such as a variable.
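As a small illustration, comment stripping on a single raw line might look like the Python below. The function name is invented, and the sketch deliberately ignores string literals, so a " -- " inside a string would be cut off too. Because it works on the raw line, a trailing backslash inside the comment is simply discarded along with the rest.

```python
def strip_comment(line):
    """Return the part of a line before a white-space-bounded "--" symbol."""
    symbols = line.split(" ")  # keeps empty strings, so spacing survives a rejoin
    if "--" in symbols:
        return " ".join(symbols[:symbols.index("--")])
    return line
```

Here "= a 5 -- set a" becomes "= a 5", while "a--b 1" is left alone because "a--b" is a different symbol.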

A block is a multi-line subset of code. It's used in some styles of if-else statements, loop statements, function definitions, and maybe other areas that I'm forgetting.

Many languages mark blocks by surrounding them with brackets, like { at the beginning of the block and } at the end.

In this language, blocks are indented with any white space character. Here is an example:

(The colon is not part of the block syntax, but it does connect the if operator to the block. See the section on block replacement below.)

(If I end up allowing non-space white space characters, all white space characters will count equally towards indentation. For example, two spaces would be the same indentation level as two tabs, or one space and one tab. I probably won't do that, though.)

The number of white space characters at the beginning of a line is called the indentation depth. A block stack keeps track of the current hierarchy of indentation depths.

For each line, compare the new indentation depth with the top of the block stack. If the value is the same, then the new line is part of the same block. If the value is larger, then it is added to the block stack and a new block is started. If the value is smaller, then the previous block is closed and the value is removed from the block stack, then the evaluation repeats for the current top of the block stack.

Any line that contains only white space and line breaks is called an empty line. Empty lines do not trigger indentation checks. Any line that contains only white space and a comment is called a comment line. Comment lines do not trigger indentation checks.
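The block stack procedure above can be sketched in Python as follows. The event names and the function name are mine, and starting the stack at depth 0 is an assumption. Note that a dedent to a depth that was never opened (say from 4 back to 2 when the stack holds 0 and 4) just closes one block and opens another, so there is still no way to fail.

```python
def block_events(lines):
    """Turn a list of lines into ("open" | "close" | "line", payload) events."""
    stack = [0]  # assumption: the top level sits at indentation depth 0
    events = []
    for line in lines:
        stripped = line.lstrip(" ")
        if stripped == "" or stripped == "--" or stripped.startswith("-- "):
            continue  # empty lines and comment lines skip the indentation check
        depth = len(line) - len(stripped)
        while depth < stack[-1]:
            # Smaller depth: close the block and compare with the new top.
            stack.pop()
            events.append(("close", None))
        if depth > stack[-1]:
            # Larger depth: push it and start a new block.
            stack.append(depth)
            events.append(("open", depth))
        events.append(("line", stripped))
    while len(stack) > 1:
        # End of input closes any blocks that are still open.
        stack.pop()
        events.append(("close", None))
    return events
```

Escaped line breaks are not handled here; the sketch assumes continuation lines have already been joined before it sees them.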

Remember that line breaks can be escaped. In these cases, spaces at the start of the following line would be considered just a continuation of the preceding line's white space, not indentation for a block.

Blocks are necessary with complex if-else and loop statements, but they may also appear alone. In these cases, is there any meaning to a block? Are variables localized within stand-alone blocks? Is there a good reason for this or is it just a justification for a syntax that is otherwise meaningless?

Maybe a multi-block statement could be written without an else clause, by expecting consecutive blocks (of different depths), like so:

This is unusual and probably confusing. But it has a certain elegance in the way that it avoids the need for an "else" operator. I'll have to think about it.

Here's a questionable idea. Consider defining all operators as in-line operators. In order to use a block, as is often desirable with if-then or loops, the line must end with the : operator. Any parameter that would have appeared at or after the : operator is instead parsed as a following block.

With this syntax, all functions would be in-line by default, but could be converted to block statements where convenient.

What is a number? How is it stored? Like JavaScript, where every number is a float? I don't really like it, but it does seem easy to have only a single number type. Otherwise you either require explicit casting between numeric types (which feels counter to the spirit of this language) or you cast automatically, which could still introduce unexpected errors, just not as often.

I'll think about this. In the meantime, it's safe to say that generally a number looks like this:

A string is some text. A string literal is a symbol that begins with a double-quote character. A string ends when it encounters another double-quote character. If the parser encounters the end of the line before a double-quote character, the string is closed as if it had encountered a double-quote character.

To include a double-quote character within a string itself, precede it with the backslash. To include a line break in a string, use the \r character pair, I guess. Probably you can also use the backslash at the end of the line (i.e. before a literal line break character) to include it in the string and continue the string on the next line? Is that a good idea?
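A rough Python sketch of reading a string literal from a single line, covering only the parts I'm fairly sure about: the closing double-quote, the backslash-escaped double-quote, and the end of the line closing an unterminated string. The function name is made up, and the undecided line break escapes are left out, so any other backslash is kept as-is.

```python
def read_string(line, start):
    """Read a string literal that begins at line[start], which must be '"'.

    Returns (text, index just past the literal)."""
    chars = []
    i = start + 1
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line) and line[i + 1] == '"':
            chars.append('"')  # escaped double-quote stays inside the string
            i += 2
        elif ch == '"':
            return "".join(chars), i + 1  # normal closing quote
        else:
            chars.append(ch)
            i += 1
    # End of line reached first: close the string as if a quote had appeared.
    return "".join(chars), i
```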

I tried to think of a way to define " as an operator instead of a more primitive parser construct, but I felt like it introduced too many questions that I couldn't answer.

A name is a user-defined symbol used to refer to variables and functions. A name can contain any symbolic characters, but names that would be interpreted as literals or operators are probably a terrible idea and should be avoided.

When an operator expects a name, such as when defining a variable, the operand is treated as a name, even if it would be parsed as a literal or operator in another context.

When there is ambiguity, literals and operators take precedence over a name with the same representation. In order to use a name that looks like a core symbol, the ambiguous characters must be escaped. But this is a bad idea. Don't do this.

Because names are always bounded by white space, operators within names don't need to be escaped, as seen above with names that contain - which is the subtraction operator.

Names cannot contain white space, even if it's escaped. Right? That's going too far.

A list is an ordered collection of values. Each value in a list can be a different type.

The { operator creates a list from any number of objects. It ingests objects until it encounters the . operator. If a line break is encountered before the . operator, the list is closed as if it had encountered the . operator.

Possibly it would be better to incorporate parentheses into the list operator and use ) as the closing operator. Then reuse the parentheses pattern for other variable-length purposes, like passing function parameters. But I like the finality of using the full-stop character to terminate lists. And I think using a closing parenthesis would lead to misunderstandings of the language.

Use the # operator to access a value from a list. The first operand is the list and the second operand is the number to access. Lists start at 1.
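Here is a small Python sketch of the two list operators as described. The function names are invented, the sketch keeps list items as raw symbols and does not handle nested lists, and returning "undefined" for an out-of-range or non-numeric index is my assumption (it seems like the only answer that fits the no-fail goal).

```python
UNDEFINED = None  # stand-in for the language's "undefined" value

def read_list(symbols):
    """Collect the symbols after a "{" until "." or the end of the line."""
    values = []
    for symbol in symbols:
        if symbol == ".":
            break
        values.append(symbol)
    return values  # running out of symbols closes the list as well

def list_index(values, n):
    """The "#" operator: lists start at 1.

    Assumption: anything out of range, or not an integer, gives UNDEFINED."""
    if isinstance(n, int) and 1 <= n <= len(values):
        return values[n - 1]
    return UNDEFINED
```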

The %= operator defines a function. It ingests any number of names and one block. The first name is the name of the function. The remaining names are names of the parameters passed to the function. The block is the content of the function.

In keeping with the "in-line first" proposal above, maybe this could be redefined using the list terminator to end the parameter list and a single expression as the function body.

I don't know how to square this behavior with the if-then block. In that case, the first block represents the next required parameter, but in this case the block terminates the list of parameters and then begins the final parameter (the function itself). Is this just the nature of the : operator? It terminates unbounded lists?

A function returns the value of the final expression in the function block. Maybe there needs to be a -> operator or something to return early.

On the other hand, maybe a function should be an anonymous thing that is only incidentally assigned a name. Consider:

I'm not sure why I wrote it differently before. This feels more consistent than having a special assignment operator for functions. I'm trying to get this posted, so I'm not going to go back and change anything, but I'm really leaning this direction now.

The %! operator performs a function. It ingests any number of expressions until it encounters the . operator. The first operand is the function to evaluate and any remaining operands are the parameters for the function.

If a function call contains fewer parameters than the function defines, then the missing parameters are assigned the undefined value. (Or, to say it differently, they are not assigned any value.)

Maybe all the parameters passed to a function are placed in a list, call it %# for now. If a function call contains more parameters than are named in the function definition, they can be accessed from the list.

(I haven't defined a loop operator yet, so treat that part as pseudo-code for now.)

Ooh, or, even easier, they are simply assigned names which are their position! This gives us a completely reckless justification for the ability to use numbers as variable names. I don't know how to actually reference numbers indirectly, though, as you would want to do in an example like above. Like, you could write \3 to get the value of the third parameter, but how do you get the nth parameter while in a loop?
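The parameter ideas above could be sketched in Python like this, with both the %# list and the positional names thrown into the same scope. The function name and the use of a plain dictionary as the scope are assumptions, and the sketch doesn't say what happens if a named parameter is itself called "1".

```python
UNDEFINED = None  # stand-in for the language's "undefined" value

def bind_parameters(names, arguments):
    """Build the local scope for a function call."""
    scope = {}
    for position, name in enumerate(names):
        # Missing parameters are assigned the undefined value.
        scope[name] = arguments[position] if position < len(arguments) else UNDEFINED
    scope["%#"] = list(arguments)  # every passed parameter, in order
    for position, value in enumerate(arguments, start=1):
        scope[str(position)] = value  # positional names: "1", "2", ...
    return scope
```

With the dictionary approach, looking up the nth parameter in a loop is just a lookup of the name built from n, which hints that the language would need some way to turn a number into a name.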

What should happen if a script attempts to perform an operation on incompatible data types? For example:

Some languages raise an error and stop execution in this case. Obviously that option is off the table for this language.

Another option is to attempt to cast one value or the other. For example "5" + 3 would automatically cast the string "5" to the number 5, and then apply the addition for 8. Other languages might cast the number 3 to the string "3" and concatenate the strings for "53".

A third option is to simply return an "undefined" or "not a number" value if an operator receives incompatible parameters. I'm kind of leaning this direction.
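For what it's worth, the third option is very little code. A Python sketch of an addition operator that never casts and never raises might look like this (the function name and the use of None for "undefined" are assumptions):

```python
UNDEFINED = None  # stand-in for the language's "undefined" value

def add(left, right):
    """Addition that returns UNDEFINED instead of casting or failing."""
    def is_number(value):
        # bool is excluded because Python treats True and False as integers.
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    if is_number(left) and is_number(right):
        return left + right
    return UNDEFINED
```

So add(5, 3) is 8, while add("5", 3) is undefined rather than 8 or "53". Since "undefined" is itself not a number, it propagates through any later arithmetic.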

Oops, I've moved on to a different hobby horse. I'm posting this as-is for now. Maybe I'll revisit it in the future. Certainly there is more to explore, and I really would like to write a functional parser at some point.

The "can't fail" scripting language

What is a programming language?

Data types

"Non-data" types (what are these commonly called?)

Data collection types

Flow control

Operators

Input/output

Operator characters

Infix vs prefix notation

Syntax flexibility

Line breaks, white space and symbols

Line breaks terminate expressions

Lines may contain multiple expressions

Comments

Blocks

Block replacement?

Number literals

String literals

Names

Lists

Defining functions

Evaluating functions

Missing or excess parameters

Other design topics

To cast or not to cast

The end