Computer languages concern themselves with data, its properties and operations. Designing a language consists in part of designing a type system which entails notions of how to store, address and operate on data. Types define what you can and cannot do to data. Types also help manage complexity by organizing data into collections, and by providing a means to verify correctness. Furthermore, we discriminate between "primitive" types (which the system, whatever that is, operates on) and "user-defined" types, which the programmer creates within the language.
In designing a language meant to express the configuration of an interactive simulation with visualization, what primitive types should we support, and how shall we express the means of combining those primitives into compound types useful for a specific simulation? As with this entire series, we further constrain the solution to our problem to one which a graduate student can implement in a single semester.
Data Type System
As with other dynamically typed languages, variables in our game engine configuration language do not have types; values do.
The language provides no mechanism for defining a type. Our language is "prototype-oriented" (in contrast to object-oriented) so we make no distinction between "class" and "object".
In the parlance, our language has "structural equivalence under naming" which means that if 2 objects have members with the same name and type, and an operation exists which requires those properties, then in that sense the objects have the same "type". This is also called "duck typing".
Primitive Types
In "The Structure and Interpretation of Computer Programs", Abelson and Sussman state that for any computer language we should ask, what are the primitives and what are the means of combination? (They also ask, "what are the means of abstraction?" which we defer to another article.)
Primitive types are those which the system (either the hardware or the virtual machine) can process natively, i.e. without instructions from within the language.
Numbers, such as integers and real numbers, are the most common form of primitive, both in hardware and in computer languages. Computer hardware has arithmetic units that operate on integers and floating-point values, but these are separate units. The integer unit is distinct from the floating-point unit, and mixing those operations causes tremendous inefficiencies. We therefore strive to keep computations within a single realm and avoid conversion between those types. The configuration language should reflect that choice, so we discriminate between integers and floating-point values.
Integers come in many flavors, based on their size (byte, short, long) and range (signed, unsigned) but for the sake of simplicity and computational efficiency we restrict our configuration language to signed integers of the size the native machine handles most efficiently, i.e. a plain-old "int". Likewise, floating-point values come in a variety of precisions (half, single, double and quad) but gaming consoles typically treat only single-precision floats efficiently, therefore our configuration language only supports that flavor.
In contrast, Lua only supports a "number" type, and you can choose its representation at compile (e.g. int, float, double, long or whatever) but I find this too restrictive. Integers and floats have rather distinct usages and mixing them leads to grotesquely slow arithmetic operations.
Interactive simulations with visualization (such as video games) commonly require vector and matrix operations, and modern CPU's have special instructions (e.g. SSE, VMX) that operate on those types, so we incorporate "vector" and "matrix" as primitive types in our language. This again departs from the choice in most scripting languages, but in my opinion, to the detriment of speed.
Some languages (e.g. Lua, C++) treat Boolean values as primitives but we shall not since, following the same logic as above, the native hardware does not recognize a "Boolean" type -- it only operates on integers (which are a superset of booleans anyway). C++ restricts the value of "bool" to be 0 or 1, and I imagine that takes some overhead. We can have our system give some integers special treatment, though. We can adopt the convention that integers whose names start with "b" shall have a value of either 0 or 1, and in debug builds we shall assert on that condition. We can therefore be informed of departures from that, but in release builds we can avoid the computation overhead associated with that restriction. So we'll consider this the "trust but verify" policy of bools.
Means of Combination
Most computer languages support some means of combining primitive types into compound types, called "records", "user defined types", usually implemented as a list, structure or table. The various implementations have different properties.
Lists are extremely simple but not cache friendly and can only represent name-value pairs by nesting lists -- elegant but slow.
Structures order the data such that we can access members by an integer offset from the first element of the structure. This is the fastest technique since all the offsets can be computed at compile time.
Tables, or associative arrays, implemented as hash maps, allow access to each member by a "name" which is usually a string of human-readable characters, but sometimes is an integer (as in an array).
Array access is fast but access by name is more useful for abstraction. Lua uses a hybrid table with 2 subtables, one using a hash map, the other using an array, and uses a sophisticated algorithm to decide which subtable to use to store any given member. Our language should strive for the benefits of this approach, but in the interest of time we want a simpler solution. So in our language, hash tables represent records, but each element in each table is, in general, an array. Note that each element in that array has the same type, so this lacks the versatility of a Lua "array" in that, in a Lua array, each element can have a different type. But our language also supports that in the sense that our language supports arrays of references, and also supports a table whose index values can be any string, including a representation of a sequence of integers.
Structs provide the fastest access of these options, and we can obtain a similar fast access of offsets by using a dynamic array container to store values, and using a hash map to access the contents of that array by name. This fixed ordering implies that we can never delete members from a table (though we can remove values from any member, or otherwise make them "invisible") -- but I have never wanted to remove members from a table, and many languages do not support deleting members anyway. This array&table scheme lets us access any member by index (which is much faster than even hash table lookup). In many cases, we know that index because we can prescribe it at compile-time. For types defined in native code at compile-time, we can initially populate a table's attributes using a well-known order, and thereafter access those "prescribed attributes" by index. We can also still append more values to the same table, at runtime, so prescribing attributes at compile time does not restrict the utility or extensibility of the table. Furthermore, no script code needs to know or care about this optimization. It's a nice bonus for a very small price.
Functions and methods
[Note: In a subsequent post, I retract some of the proposals made in the following subsection: That functions do not need to have their own type. While this is technically true and I did try this notion on for size, and it does work as described below; the benefits, if any, do not outweigh their costs.]
[Also note that, in a subsequent post, I clarify the mechanism for binding actual arguments to formal arguments, and it turns out that the ideas expressed in this subsection suggest using something akin to dynamic binding which, although it has some nifty benefits, turn out not to do what I want. So in a way, I also retract yet more of the following subsection.]
Lua and most other computer languages represent procedures as their own type, distinct from other types. A procedure differs from other types in that it can be invoked. But in a dynamic language (such as our configuration language) this distinction only matters when invoking the procedure. I argue that this distinction is one that the runtime needs in order to execute the process, but not one which our language grammar needs to know about, because of the peculiar way in which we represent procedures.
In Alonzo Church's treatment of lambda-calculus, a function binds variables (parameters, a.k.a. arguments) to a scope. Parameters have 2 flavors: Formal parameters which the code in the function body sees and actual parameters, which the caller passes into the function. Usually, the meaning of those parameters depends on their order in the argument list. These are sometimes called "positional parameters", in contrast to passing parameters by name. In addition to parameters, functions also have "local variables" and "free variables". The local variables are defined within the scope of the function and "free variables" are defined outside the lexical scope of the procedure, somewhere in the "environment". When a function is created, the "free" variables are bound using a "closure", which is a topic for another article, and which some languages do not support, so let's ignore them for now. The remaining variables, the arguments and locals, are similar in that their lifetime matches the function call itself. When the function is not "active", those variables do not exist. When the function is "active", those variables exist. So those variables are part of the "activation record", which is a piece of memory valid only during the function call.
In our configuration language, we have not yet made a distinction between an "argument" and a local variable -- and until we find a reason to make such a distinction, we will avoid it. We shall (effectively) pass in parameters by virtue of establishing an "environment". In other words, it turns out that we can treat all 3 kinds of variables (free, local and bound) as the same thing -- a table. And indeed, the table is the procedure. In its present incarnation, our type system has no separate "function" type. It is just another table. Behind the scenes, it is a table that happens to know how to execute its content -- and not all tables know how to do that -- but we could equivalently say that all tables respond to messages, but most do nothing in response.
This treatment of functions as tables simplifies our system because we can copy functionality just as we can copy other properties -- and in a prototype-oriented language, that means we can assign behaviors to objects by copying them from something that already has that behavior. Nice!
References, Pointers and Passing Semantics
Imagine you want to write a function that changes the value of an argument. Maybe the function increments a variable, or swaps the values in 2 variables. You can only accomplish this by passing the variable itself (not its value) into a function. Languages can accomplish this using "pass-by-reference" or "pass-by-name". Suffice it to say that "pass-by-name" is problematic but that if we really want, we can support that by passing in a string, and looking up that string, at runtime, in the current environment. It's slow but powerful -- probably more powerful than we need most of the time. So let's consider pass-by-reference.
Technically, C only has pass-by-value, meaning that when you write the code associated with a function call, the values of the arguments are delivered to the function. While that is true, one of the value types in C is an address, which pointer variables can store. C provides a syntax for "dereferencing" pointers, meaning that the expression refers to what the pointer points to, i.e. its contents, a.k.a. the referent. So in this sense, C supports pass-by-reference semantics, even though the syntax differs.
Java, its pundits claim, has only pass-by-value semantics. But that is either pedantry or misleading, depending on your tolerance for misinformation. Either way, I call it "false". Their loophole is that some variables are primitive "value types" and some are object "reference types". That means that depending on the type of the value, the variable either contains the value itself, or simply points to the place in memory that contains the value -- and you as a programmer have little or no control over which is which. So if you write code that syntactically has the effect of changing the value in a variable, then the same syntax has drastically different meanings depending on the variable type. And I don't mean "drastically different meaning" in the prosaic sense that different types have different interpretations -- because when you swap 2 values, regardless of whether the values are integers or Canada geese, at the end of the day, you expect the 2 to be swapped. But in Java, that is not the case. Functions that operate on value types affect only local copies of the values, whereas functions that operate on reference types affect the referents themselves. So you can call that "pass-by-value" if you feel like living in denial and confusing would-be Java programmers, but I call it pass-by-reference, and I can usually sleep well at night.
In any case, we just want to answer the question, how should we facilitate authoring functions that can operate on referent values (not just copies of values passed in for temporary manipulation). Or put simply, what if we want to write a function that swaps 2 values? Once again, I take inspiration from asking, what does the underlying hardware support? The answer is that it supports addresses, i.e. pointers. Our language therefore supports these, in 2 flavors: references to other data (i.e. data that the configuration language understands, including primitives and tables) and opaque pointers to "native" data. The key here is that references can point to either primitives or tables.
What is this abomination of opaque pointers? Remember that our configuration language spans two realms: the script side, which we configure at runtime, and the native side, which we configure at compile time. In the interest of speed and memory, we sometimes (often) implement functionality in native code (i.e. C++) but which to expose only a fraction of its data to script (i.e. not all of that data is driven by configuration files). The "opaque pointer" gives us that out. Lua also supports this notion of opaque pointers, which it calls "user data".
Lua does not, as far as I know, support references to primitive types. Java and C# do effectively support passing primitives by reference, by creating an object manged by the garbage collected heap, which wraps the primitive value. In C# this is called "boxing" and it's build into the virtual machine. It's ludicrously slow, though, for something so simple -- so slow that you just wouldn't use it. So although those languages technically support the feature I describe, you wouldn't use it for a performance-sensitive application like interactive simulations.
Names and Strings
We can use a hash table to map names to values, but what if we want to find the name of a variable, given its "address" (e.g. index or offset or whatever) in a table? We have a few choices, and I consider this somewhat of an open issue because it's not yet clear which choice works best. The most obvious choices include "fast but messy" and "slow but simple". The fast-but-messy solution entails entangling data and names by having data point to their names, or by representing name-value pairs in a way that gives us fast access to both given an index or address. Sounds simple but I believe it is unnecessary -- especially when you consider that the name of a variable can often be discarded after parse time. That turns out to be a complicated subject. Anyway, in contrast, the slow-but-simple solution entails doing an exhaustive search through the table, to find the entry which matches the given address. "Exhaustive search" here sounds worse than it is, since the search tends to be localized to a single table, which rarely has more than a few members. Furthermore, we have to ask, why would we perform these "reverse lookups" anyway? Examples entail reflection and serialization. Do we need to perform those during the execution of the simulation, or only during authoring? If only during authoring then we can probably tolerate slow-but-simple. But during some copy operations, we might need reverse-lookup during the normal operation of a game. I'm not yet sure. Ask me later.
Usually, strings used for the names of table data do not change. Furthermore, those names tend to repeat, i.e. the names of members in one object tend to appear as the names of members in another object, because our system uses the prototype-oriented paradigm (i.e. objects tend to be copies of other objects). This means we can "recycle" strings by using a reference-counted string subsystem. This reduces memory usage since each unique string occurs only once. By clever use of reference counts and operator overloading, we can automate this so that the client code does nothing onerous to gain these benefits -- just use a different type, e.g. RString instead of String. Furthermore, when using these ref-counted strings, string comparisons reduce to comparing pointers, instead of the notoriously slow strcpy, which iterates over every character and performs at least 2 (usually 3) conditional branches per character. Nightmare. So, use ref-counted, consolidated strings for the "key" part of the key-value pair in the hash tables.
Summary
By gleaning examples from other scripting languages but paying close attention to the specific needs of our configuration language, we constructed a type system which is simple, powerful and efficient. It has familiar aspects, it supports the means of combination to allow runtime extension and its underlying implementation mirrors what the hardware supports natively. Even with the restriction that students have only 1 semester to implement this system, it should provide adequate functionality to serve as the basis of more sophisticated systems they develop in their professional careers after graduation.
Monday, March 2, 2009
Subscribe to:
Post Comments (Atom)

0 comments:
Post a Comment