The "can't fail" scripting language


I like the idea of a scripting language that can run any text file without failing. There is no such thing as bad syntax, since any string of characters (even binary!??) can be unambiguously interpreted. The result of a nonsense script would likely be boring most of the time: an "undefined" value, for example. A few mistakes in an otherwise good script would generally give plausible, but useless, results. What would this kind of language look like? Who would even want to use it? I can't answer the second question, but I like to think about the first one.


Please keep in mind that this is a rambling collection of thoughts. There will be contradictions and unresolved questions. I haven't really done any research for this. If you are looking to learn something, you are probably in the wrong place.


What is a programming language?


Just to lay some groundwork, what are the minimum building blocks for a language? There are probably several good articles or books about this, but I haven't looked very hard yet.


Here is a list of some ideas. This language should probably be quite small, so I expect several of these will be dropped before the final specification.


Data types

Boolean

Integer numbers

Floating-point numbers

Text strings

Dates

Times

Date-times


"Non-data" types (what are these commonly called?)

Undefined

Null

Not a number


Data collection types

Lists

Dictionaries


Flow control

Operation priority (parentheses)

Functions

If/else statements

Loops (while, until, for each)


Operators

Boolean: and, or, not

Numeric: negation, addition, subtraction, multiplication, division, remainder, comparisons

Binary: binary and, binary or, binary exclusive or

Text: concatenation

Dates: addition of days, subtraction of days, comparisons

Times: addition of seconds, subtraction of seconds, comparisons

Date-times: addition of seconds, subtraction of seconds, comparisons

Comparison: equal to, less than (or equal), greater than (or equal)


Input/output

Read from a file

Write to a file

Read from standard input

Write to standard output

Write to standard error


Operator characters


The only built-in language features in this proposed language are operators and literals (strings and numbers, for example). There are no built-in functions. All operators are made of ASCII punctuation. This was an arbitrary choice, mostly unrelated to the primary goals of the language. I could have used "if" and "and" as operators, but I'm experimenting with things like "?:" instead. I may write more about the reasons for this later.


Infix vs prefix notation


Human-friendly languages follow a "natural" flow, called infix notation, where each operator is surrounded by its operands. Infix notation can often be read aloud in an English-like sentence. For example:

  a = 3 + 5
  "a equals three plus five"

This requires a formal order of operations for each operator to get "natural" results. In the example above, it's important to evaluate + before = to get the expected result of a equal to 8.


Of course, there's no requirement that operators follow a "natural" order of precedence. They could simply evaluate from left to right, consuming values immediately. In this case, the above example would evaluate to "a equals three, add five to the result of an assignment". How odd this is depends on what the = operator does. Does it even return a result? Maybe this could be evaluated as "a equals three, then add five to it". This gives the same result as the "natural" example, but the internal logic is different. There will be other examples that are not equivalent.


In any event, I don't think it's possible to write a usable infix notation without a way to prioritize segments of the expression above others, most commonly parentheses. Starting from an example from Wikipedia, rewriting 5 - (6 * 7) without parentheses results in the unwieldy -6 * 7 + 5.


An alternative is prefix notation (also called Polish notation), where the operator is printed first, followed by its arguments:

  = a + 3 5
  "assign to a the sum of three and five"

This is less human-readable, especially in complex statements where the second operand might be far away from the operator and the start of the first operand. But complex nested parentheses can be hard to make sense of, too, and the prefix notation is easier for a machine to parse. Part of the benefit, though, comes from the abandonment of operator priority. As seen above, we can abandon operator priority in infix notation as well.


Prefix notation also has the advantage of obvious support for operators that require more than two operands. There aren't many of these, though, so the benefit is probably small. In fact, I can only think of one, the ternary conditional (which I know as ternary comparison), which in infix notation typically requires two operator characters:

  condition ? if-true-result : if-false-result

In prefix notation, this could be defined to ingest three parameters and be written with a single operator:

  ? condition if-true-result if-false-result

Let's go with prefix notation for now. I think its formality and simplicity will make it easier to introduce flexibility in other areas where it will be necessary.
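
To make the parsing argument concrete, here is a minimal Python sketch of a prefix evaluator. The operator table and the evaluate function are hypothetical names invented for illustration, and the sketch assumes every non-operator token is an integer. It does not yet pad missing operands with "undefined", so it can still fail on incomplete input.

```python
# Hypothetical operator table: symbol -> (operand count, implementation).
OPERATORS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "?": (3, lambda cond, if_true, if_false: if_true if cond else if_false),
}


def evaluate(tokens, pos=0):
    """Evaluate one prefix expression starting at tokens[pos].

    Returns (value, position of the first token after the expression).
    """
    token = tokens[pos]
    if token in OPERATORS:
        arity, implementation = OPERATORS[token]
        pos += 1
        operands = []
        for _ in range(arity):
            # Each operand is itself a complete prefix expression.
            value, pos = evaluate(tokens, pos)
            operands.append(value)
        return implementation(*operands), pos
    # Assumption: anything that is not an operator is an integer literal.
    return int(token), pos + 1


# 5 - (6 * 7), written without parentheses or precedence rules.
value, _ = evaluate("- 5 * 6 7".split())
print(value)
```

Because every operator has a fixed operand count, the evaluator never needs parentheses or a precedence table. Note that this sketch evaluates both branches of ? before choosing one, which a real implementation might want to avoid.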


Syntax flexibility


The whole point of this language is that it should be impossible for a program to fail during execution. This means that any string of characters must evaluate to something. Syntax must be extremely flexible.


Despite this, I don't want to discard any input. Simply ignoring chunks of code that the parser doesn't understand feels like a major cop-out. Instead, my goal is for every character to be significant (excepting things like comments, obviously, which are explicitly ignored).


Line breaks, white space and symbols


Any of the common line break sequences (\r\n, \r, \n) are treated as a line break.


I think Unicode has a list of white space characters, such as the space, tab, and more, so assume that we accept all of them as our white space (not including the line break sequences, if they are on the Unicode list). Wait, no, this is a terrible idea because Unicode is a living standard, so the script could be parsed differently based on which Unicode library the parser uses if the list of white space characters changes. Let's go the other direction: the only white space character is space. I'll still talk about "white space characters" as a group in case I change my mind on this later.


Any characters which are not line breaks and not white space are called symbolic characters. Each group of consecutive symbolic characters is called a symbol. A symbol might be an operator, a variable name, a function name, a numeric literal, and so on. Heck, even binary data that isn't valid character strings could be interpreted if the parser supports it. Why not?


A line break sequence can be escaped by preceding it with a backslash. This causes the line break to be treated as white space instead of a line break, which allows long statements to span multiple lines.


A white space character can be escaped by preceding it with a backslash. This causes the white space to be treated as a symbolic character. (Is this a good idea? Probably not. Is there any valid use case at all? I'll have to think about it.)


Note that white space is required between operators and their operands. I think this is normal in prefix notation, but it may be unexpected for users of infix notation.
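
The rules in this section can be sketched as a small Python tokenizer. The function name tokenize is hypothetical, and the sketch makes some simplifying assumptions: it handles escaped line breaks but not escaped spaces, it knows nothing about string literals, and it throws away indentation.

```python
import re


def tokenize(source):
    """Split source text into lines, and each line into symbols."""
    # Treat \r\n, \r and \n alike.
    source = re.sub(r"\r\n|\r", "\n", source)
    # An escaped line break becomes ordinary white space.
    source = source.replace("\\\n", " ")
    lines = []
    for line in source.split("\n"):
        # Only the space character separates symbols; tabs are symbolic.
        lines.append([symbol for symbol in line.split(" ") if symbol])
    return lines


for symbols in tokenize("= a 5\r\n= b + a \\\n  1"):
    print(symbols)
```

Since the space is the only white space character, a tab ends up inside a symbol rather than separating two of them.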


Line breaks terminate expressions


If an expression is missing parameters or operands when a line break occurs, all missing values are parsed as the "undefined" value. All operators must be able to handle "undefined" as input, although often the result will be "undefined" in that situation.


Lines may contain multiple expressions


Every operator takes a fixed number of operands. Because of this, the parser can know with certainty when an expression has ended. If it encounters additional input on the same line, it can simply start parsing it as a new expression.


For example:

  = a 5 = b + a 1

The above code assigns the value 5 to the variable a, then assigns the sum a+1 to the variable b. This is terrible programming style. I don't know why you would want to do this.


Consider this blunder:

  a = 5

This is two expressions. The first expression, "a", returns "undefined", but it doesn't matter because nothing is handling that result. The second expression, "= 5", could also be written "= 5 undefined" because the end of the line terminates the expression and passes "undefined" in place of any missing operands. This creates a new variable, "5", and assigns it the value "undefined". See more below about variable naming and why this is not as dangerous as it looks (but still obviously bad practice).
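
Here is a rough Python sketch of how a parser could split one line into several expressions and fill in missing operands. The ARITY table and parse_line are hypothetical names, "undefined" is represented by a plain string for readability, and the sketch does not give the name operand of = any special treatment.

```python
UNDEFINED = "undefined"  # stand-in for the language's undefined value

# Hypothetical arity table: every operator takes a fixed number of operands.
ARITY = {"=": 2, "+": 2, "-": 2, "*": 2, "?": 3}


def parse_line(symbols):
    """Parse the symbols of one line into a list of expression trees."""
    pos = 0

    def parse_one():
        nonlocal pos
        if pos >= len(symbols):
            return UNDEFINED  # the line ended before the operand appeared
        symbol = symbols[pos]
        pos += 1
        if symbol in ARITY:
            operands = [parse_one() for _ in range(ARITY[symbol])]
            return (symbol, *operands)
        return symbol

    expressions = []
    # Leftover input on the same line simply starts a new expression.
    while pos < len(symbols):
        expressions.append(parse_one())
    return expressions


print(parse_line("= a 5 = b + a 1".split()))
print(parse_line("a = 5".split()))
```

The first call yields two assignment trees. The second yields the lone symbol a followed by an assignment whose missing second operand has been replaced with "undefined".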


Comments


The comment symbol is probably the -- character pair. When a comment symbol appears on a line, all characters are ignored until the next line break. This includes escape characters, so it's not possible to continue the same comment onto a new line.


Like all symbols, note that the comment symbol must be bounded by white space. If combined with other symbolic characters, it will be parsed as a different symbol, such as a variable.


  = a 5 -- This is a comment.
  = a 5 --None of this is a comment. It's all code to be interpreted!
  = a 5-- This is not a comment either, and 5-- is probably an undefined variable.
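
A minimal Python sketch of the comment rule, using the hypothetical helper name strip_comment. It assumes the comment symbol is -- and ignores the open question of what happens when -- appears inside a string literal.

```python
def strip_comment(line):
    """Drop everything from the first stand-alone -- symbol onward."""
    symbols = line.split(" ")
    if "--" in symbols:
        # Only an exact -- bounded by spaces counts as a comment symbol.
        symbols = symbols[: symbols.index("--")]
    return " ".join(symbols).rstrip(" ")


print(strip_comment("= a 5 -- This is a comment."))
print(strip_comment("= a 5 --None of this is a comment."))
```

Splitting on single spaces keeps any leading indentation intact, which matters once blocks come into play.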

Blocks


A block is a multi-line subset of code. It's used in some styles of if-else statements, loop statements, function definitions, and maybe other areas that I'm forgetting.


I'm aware of two common ways to specify blocks:

By surrounding them with brackets, like { at the beginning of the block and } at the end.

By indenting each line in the block, as popularized by Python.


In this language, blocks are indented with any white space character. Here is an example:


  ? true-or-false :
    = a + a 1
    = b a

(The colon is not part of the block syntax, but it does connect the if operator to the block. See the section on block replacement below.)


(If I end up allowing non-space white space characters, all white space characters will count equally towards indentation. For example, two spaces would be the same indentation level as two tabs, or one space and one tab. I probably won't do that, though.)


The number of white space characters at the beginning of a line is called the indentation depth. A block stack keeps track of the current hierarchy of indentation depths.


For each line, compare the new indentation depth with the top of the block stack. If the value is the same, then the new line is part of the same block. If the value is larger, then it is added to the block stack and a new block is started. If the value is smaller, then the previous block is closed and the value is removed from the block stack, then the evaluation repeats for the current top of the block stack.


Any line that contains only white space and line breaks is called an empty line. Empty lines do not trigger indentation checks. Any line that contains only white space and a comment is called a comment line. Comment lines do not trigger indentation checks.
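
The block stack procedure can be sketched in Python as follows. The function name block_events and the event tuples are hypothetical, and the sketch assumes the top level sits at depth 0, that line breaks are already normalized to \n, and that escaped line breaks have already been joined.

```python
def block_events(source):
    """Turn indented source text into open/line/close events."""
    stack = [0]  # assumption: the top level has indentation depth 0
    events = []
    for raw in source.split("\n"):
        text = raw.lstrip(" ")
        # Empty lines and comment lines do not trigger indentation checks.
        if text == "" or text.split(" ")[0] == "--":
            continue
        depth = len(raw) - len(text)
        # Smaller depth: close blocks, re-checking against the new top.
        while depth < stack[-1]:
            events.append(("close", stack.pop()))
        # Larger depth: start a new block.
        if depth > stack[-1]:
            stack.append(depth)
            events.append(("open", depth))
        events.append(("line", text))
    # Close whatever is still open at the end of the input.
    while len(stack) > 1:
        events.append(("close", stack.pop()))
    return events


for event in block_events("? cond :\n    = a + a 1\n  = a - a 1\n= c 1"):
    print(event)
```

One consequence of re-checking against the new top of the stack is that a line can close one block and immediately open another at a shallower depth.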


Remember that line breaks can be escaped. In these cases, spaces at the start of the following line would be considered just a continuation of the preceding line's white space, not indentation for a block.


Blocks are necessary with complex if-else and loop statements, but they may also appear alone. In these cases, is there any meaning to a block? Are variables localized within stand-alone blocks? Is there a good reason for this or is it just a justification for a syntax that is otherwise meaningless?


Maybe a multi-block statement could be written without an else clause, by expecting consecutive blocks (of different depths), like so:


  ? true-or-false :
      -- do this if true
      = a + a 1
      = b a
    -- do this if false
    = a - a 1
    = b 0

This is unusual and probably confusing. But it has a certain elegance in the way that it avoids the need for an "else" operator. I'll have to think about it.


Block replacement?


Here's a questionable idea. Consider defining all operators as in-line operators. In order to use a block, as is often desirable with if-then or loops, the line must end with the : operator. Any parameter that would have appeared at or after the : operator is instead parsed as a following block.


For example, an if statement can be written on one line like this:


  ? condition if-true if-false

Or it can be written with blocks using the : operator:


  ? condition :
      if-true
    if-false

With this syntax, all functions would be in-line by default, but could be converted to block statements where convenient.


In this scenario, what is the point of blocks without the block operator?


Number literals


What is a number? How is it stored? Like JavaScript, where every number is a float? I don't really like it, but it does seem easy to have only a single number type. Otherwise you either require explicit casting between numeric types (which feels counter to the spirit of this language) or you cast automatically, which could still introduce unexpected errors, just not as often.


I'll think about this. In the meantime, it's safe to say that generally a number looks like this:


  [+-] [0-9] [.] (0-9) [e0-9]
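
One possible reading of that pattern, sketched as a Python regular expression. This is an assumption, not a settled grammar: it treats the square brackets as optional parts and the parentheses as the one required run of digits, and the name is_number is hypothetical.

```python
import re

# Assumed reading: optional sign, optional leading digits, optional point,
# required digits, optional exponent.
NUMBER = re.compile(r"[+-]?[0-9]*\.?[0-9]+(e[0-9]+)?")


def is_number(symbol):
    """Return True if the whole symbol looks like a number literal."""
    return NUMBER.fullmatch(symbol) is not None


for symbol in ["23", "-1.5", ".5", "6e3", "5--", "max-height"]:
    print(symbol, is_number(symbol))
```

Anything that fails the match is not an error. It just falls through to being parsed as some other kind of symbol, such as a name.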

String literals


A string is some text. A string literal is a symbol that begins with a double-quote character. A string ends when it encounters another double-quote character. If the parser encounters the end of the line before a double-quote character, the string is closed as if it had encountered a double-quote character.


  = name "lazy dog"

To include a double-quote character within a string itself, precede it with the backslash. To include a line break in a string, use the \r character pair, I guess. Probably you can also use the backslash at the end of the line (i.e. before a literal line break character) to include it in the string and continue the string on the next line? Is that a good idea?
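
A small Python sketch of reading a string literal under these rules. The helper name read_string is hypothetical, and only the escaped double-quote is handled, since the other escapes are still undecided.

```python
def read_string(line, start):
    """Read a string literal whose opening quote is at line[start].

    Returns (string value, index just past the literal).
    """
    chars = []
    pos = start + 1
    while pos < len(line):
        char = line[pos]
        if char == "\\" and pos + 1 < len(line) and line[pos + 1] == '"':
            chars.append('"')  # escaped quote stays inside the string
            pos += 2
        elif char == '"':
            return "".join(chars), pos + 1
        else:
            chars.append(char)
            pos += 1
    # End of line reached: close the string as if a quote had been found.
    return "".join(chars), pos


print(read_string('= name "lazy dog"', 7))
print(read_string('= name "lazy dog', 7))
```

Both calls produce the same string value, because an unterminated string is closed by the end of the line instead of being rejected.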


I tried to think of a way to define " as an operator instead of a more primitive parser construct, but I felt like it introduced too many questions that I couldn't answer.


Names


A name is a user-defined symbol used to refer to variables and functions. A name can contain any symbolic characters, but names that would be interpreted as literals or operators are probably a terrible idea and should be avoided.


When an operator expects a name, such as when defining a variable, the operand is treated as a name, even if it would be parsed as a literal or operator in another context.


  = max-height 23
  = total + 7 5
  = !!! "exciting" -- This creates a variable named !!!. Questionable practice, but not a problem.
  = 5 23 -- This creates a variable named 5 with a value of 23. Why would you do this?
  = "apple" 7 -- This creates a variable named "apple" (the quotes are part of the name). Why would you do this?

When there is ambiguity, literals and operators take precedence over a name with the same representation. In order to use a name that looks like a core symbol, the ambiguous characters must be escaped. But this is a bad idea. Don't do this.


  = 5 23 -- This creates a variable named 5 with a value of 23.
  = add-up + \5 5 -- This adds the variable 5 to the literal number 5, for a total of 28.

Because names are always bounded by white space, operators within names don't need to be escaped, as seen above with names that contain - which is the subtraction operator.
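
The precedence rule can be sketched in Python like this. The classify helper, the operator subset, and the number syntax are all assumptions made for illustration, as is the choice to treat a single leading backslash as the escape.

```python
import re

OPERATORS = {"=", "+", "-", "*", "?", ":"}  # hypothetical subset
NUMBER = re.compile(r"[+-]?[0-9]*\.?[0-9]+(e[0-9]+)?")  # assumed number syntax


def classify(symbol):
    """Return (kind, text) for a symbol that appears where a value is expected."""
    if symbol.startswith("\\"):
        return ("name", symbol[1:])  # escaped: always a name
    if symbol in OPERATORS:
        return ("operator", symbol)
    if NUMBER.fullmatch(symbol) or symbol.startswith('"'):
        return ("literal", symbol)
    return ("name", symbol)


for symbol in ["5", "\\5", "max-height", "-", "!!!"]:
    print(symbol, classify(symbol))
```

Only whole symbols are compared against the operator table, which is why max-height is a name even though it contains the subtraction operator.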


Names cannot contain white space, even if it's escaped. Right? That's going too far.


Lists


A list is an ordered collection of values. Each value in a list can be a different type.


The { operator creates a list from any number of objects. It ingests objects until it encounters the . operator. If a line break is encountered before the . operator, the list is closed as if it had encountered the . operator.


  = some-list { 1 5 23 6 12
  = some-list { 1 5 23 6 12 .

Possibly it would be better to incorporate parentheses into the list operator and use ) as the closing operator. Then reuse the parentheses pattern for other variable-length purposes, like passing function parameters. But I like the finality of using the full-stop character to terminate lists. And I think using a closing parenthesis would lead to misunderstandings of the language.


Use the # operator to access a value from a list. The first operand is the list and the second operand is the position of the value to access. Lists start at 1.


  = first-value # some-list 1
  = third-value # some-list 3
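
A Python sketch of both list operators. The names read_list and list_get are hypothetical. The sketch assumes list items are single symbols rather than nested expressions, and that an out-of-range position gives "undefined", which the text above does not actually specify.

```python
UNDEFINED = object()  # stand-in for the language's undefined value


def read_list(symbols, start):
    """Collect items after a { at symbols[start] until a . or end of line.

    Returns (items, index of the first symbol after the list).
    """
    items = []
    pos = start + 1
    while pos < len(symbols) and symbols[pos] != ".":
        items.append(symbols[pos])
        pos += 1
    # Step past the . if there was one; otherwise stay at end of line.
    return items, min(pos + 1, len(symbols))


def list_get(values, position):
    """1-based access that never raises (assumed out-of-range behavior)."""
    if isinstance(position, int) and 1 <= position <= len(values):
        return values[position - 1]
    return UNDEFINED


items, _ = read_list("= some-list { 1 5 23 6 12".split(), 2)
print(items)
print(list_get(items, 3))
```

The list is read the same way whether or not the closing . is present on the line.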

Defining functions


The %= operator defines a function. It ingests any number of names and one block. The first name is the name of the function. The remaining names are names of the parameters passed to the function. The block is the content of the function.


  %= some-function a b :
    -- add one to a and multiply it by b for some reason
    = c * + a 1 b
    c

In keeping with the "in-line first" proposal above, maybe this could be redefined using the list terminator to end the parameter list and a single expression as the function body.


  %= in-line-function a b . + a b -- This function returns the result of a+b.

I don't know how to square this behavior with the if-then block. In that case, the first block represents the next required parameter, but in this case the block terminates the list of parameters and then begins the final parameter (the function itself). Is this just the nature of the : operator? It terminates unbounded lists?


A function returns the value of the final expression in the function block. Maybe there needs to be a -> operator or something to return early.


On the other hand, maybe a function should be an anonymous thing that is only incidentally assigned a name. Consider:


  = some-function %{ a b :
    + a b

I'm not sure why I wrote it differently before. This feels more consistent than having a special assignment operator for functions. I'm trying to get this posted, so I'm not going to go back and change anything, but I'm really leaning this direction now.


Evaluating functions


The %! operator performs a function. It ingests any number of expressions until it encounters the . operator. The first operand is the function to evaluate and any remaining operands are the parameters for the function.


  = some-total %! some-function 3 7 .

As always, the . operator is optional at the end of a line:


  = some-total %! some-function 3 7

Missing or excess parameters


If a function call contains fewer parameters than the function defines, then the missing parameters are assigned the undefined value. (Or, to say it differently, they are not assigned any value.)


Maybe all the parameters passed to a function are placed in a list, call it %# for now. If a function call contains more parameters than are named in the function definition, they can be accessed from the list.


(I haven't defined a loop operator yet, so treat that part as pseudo-code for now.)


  %= max :
    = highest # %# 1
    = index 0
    loop %! length %# :
      = index + index 1
      if > # %# index highest :
        = highest # %# index
    -> highest

Ooh, or, even easier, they are simply assigned names which are their position! This gives us a completely reckless justification for the ability to use numbers as variable names. I don't know how to actually reference numbers indirectly, though, as you would want to do in an example like above. Like, you could write \3 to get the value of the third parameter, but how do you get the nth parameter while in a loop?
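
The parameter handling discussed in this section can be sketched in Python. The name bind_parameters is hypothetical, and the sketch implements both ideas side by side: the %# list and the positional names. A real design would probably pick one.

```python
UNDEFINED = object()  # stand-in for the language's undefined value


def bind_parameters(names, arguments):
    """Build the variable scope for one function call."""
    scope = {"%#": list(arguments)}  # every argument, named or not
    for position, value in enumerate(arguments, start=1):
        scope[str(position)] = value  # positional names: "1", "2", ...
    for position, name in enumerate(names):
        # Missing arguments are left undefined rather than raising.
        if position < len(arguments):
            scope[name] = arguments[position]
        else:
            scope[name] = UNDEFINED
    return scope


scope = bind_parameters(["a", "b"], [3])
print(scope["a"], scope["b"] is UNDEFINED, scope["%#"])
```

Neither too few nor too many arguments is an error here: missing ones are undefined, and extras stay reachable through %# or their position.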


Other design topics


To cast or not to cast


What should happen if a script attempts to perform an operation on incompatible data types? For example:

  a = "apple" + 7

Some languages raise an error and stop execution in this case. Obviously that option is off the table for this language.


Another option is to attempt to cast one value or the other. For example "5" + 3 would automatically cast the string "5" to the number 5, and then apply the addition for 8. Other languages might cast the number 3 to the string "3" and concatenate the strings for "53".


A third option is to simply return an "undefined" or "not a number" value if an operator receives incompatible parameters. I'm kind of leaning this direction.
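
The third option could look something like this in Python. The Undefined class and the add function are hypothetical, and the sketch assumes that + is numeric only, so two strings also give "undefined".

```python
class Undefined:
    """Stand-in for the language's undefined value."""

    def __repr__(self):
        return "undefined"


UNDEFINED = Undefined()


def add(a, b):
    """Addition that never raises: mismatched operands give undefined."""
    numeric = (int, float)
    # Python treats booleans as integers, so rule them out explicitly.
    if isinstance(a, bool) or isinstance(b, bool):
        return UNDEFINED
    if isinstance(a, numeric) and isinstance(b, numeric):
        return a + b
    return UNDEFINED


print(add(5, 3))
print(add("apple", 7))
print(add(UNDEFINED, 7))
```

Because "undefined" is itself not a number, it propagates through later operations, so a single mismatch quietly makes the whole expression undefined.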


The end


Oops, I've moved on to a different hobby horse. I'm posting this as-is for now. Maybe I'll revisit it in the future. Certainly there is more to explore, and I really would like to write a functional parser at some point.


emptyhallway

2020-11-01

