Programming languages: lecture notes

Names, types, binding, and scope

Names and named components

Within a language there are typically a number of symbols and keywords associated with language features and constructs.

Special words, the words used to name actions or control forms within a language, may be keywords, reserved words, or predefined words:

A keyword is special only in certain contexts: for example, in FORTRAN the word REAL can be used as a (user-defined) variable name, but if it appears at the beginning of a statement and is followed by another identifier then it is taken to represent a type within a declaration.
REAL MYVAR is an example of a declaration, whereas REAL = 3.4 is an example of using REAL as a variable name.
A reserved word is a special word that cannot be used as a (user-defined) name.
These reduce the flexibility a user has in naming, but improve readability since the word has a single clear meaning.
A predefined word assigns a meaning to a name, but the user has the ability to alter, or re-define, the name later in the program.

User-defined names may be applied to:

Variables, or data storage areas
Labels, or points within the sequence of program statements
Subprograms, or functions, procedures, macros, subroutines, etc
Formal parameters - the names by which passed values are referred to within subroutines

In addition to the use to which a user-defined name is put, different languages give users different degrees of freedom in determining valid names (or identifiers).

Among the choices a language designer must make are:

Are the names case sensitive? E.g. should SQUAREROOT refer to the same entity as SquareRoot and squareroot?
Case sensitive names allow for greater flexibility, but can cause confusion (imagine having several versions of many functions, with names distinguished only by various capitalization characteristics).
Is there a maximum length for identifiers in the language? Longer identifiers tend to allow for greater clarity in naming, but what should the upper bound be?
What is the available alphabet for identifiers? E.g. alphanumeric? can blanks and/or punctuation be included? are different alphabets used for different kinds of identifier?
The choice of alphabet may have a dramatic effect on the readability of the language and the ease with which a language is compiled.
(E.g. if brackets are used to distinguish blocks of code, can they also appear as characters within an identifier, and if so how does the compiler identify which use is valid for any particular bracket in the source code?)
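
As a small illustration of the case-sensitivity question, the sketch below uses C++, which is case sensitive. The three function names mirror the example above, and the deliberately different bodies are an assumption made purely so that the three versions can be told apart.

```cpp
#include <cmath>

// In a case-sensitive language these are three unrelated functions,
// even though a reader could easily mistake one for another.
double SQUAREROOT(double v) { return std::sqrt(v); }
double SquareRoot(double v) { return std::sqrt(v) + 1.0; }  // deliberately different
double squareroot(double v) { return std::sqrt(v) + 2.0; }  // deliberately different
```

A call site that mistypes the capitalization still compiles, but silently selects a different function, which is exactly the confusion described above.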

Later in the semester we will consider the use of identifiers with labels and subroutines; first we will consider variables in some detail.

Variables: types, binding, and scope

A program variable encompasses a number of attributes, each of which is implicitly or explicitly defined according to the language characteristics:

A name, by which the user refers to the variable (and possibly one or more aliases, or alternative names by which the variable can also be referred to);
A memory address, or storage location for the data currently associated with the variable (note that the location may change over time, depending on the capabilities of the language and operating system);
A value, currently associated with the variable and stored in the noted memory location;
A type, denoting how the current variable data is to be interpreted (e.g. integer, character, string, the particular range of values, etc);
The variable lifetime, characterizing when during program execution the variable is a viable entity;
The variable scope, defining the program segments from which it is possible to refer to the variable.

Note that the variable address and value are sometimes referred to as its l-value and r-value, respectively.
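
The name, address, value, and alias attributes can be seen together in a short C++ sketch. The function and variable names here (sameLocation, aliasDemo, count, total) are hypothetical, chosen only for illustration.

```cpp
// Returns true when two names refer to the same storage location.
bool sameLocation(const int& a, const int& b) { return &a == &b; }

int aliasDemo()
{
    int count = 10;      // name: count, type: int, value: 10
    int& total = count;  // total is an alias: a second name for the same storage

    // On the left of the assignment the l-value (location) of total is used;
    // on the right its r-value (the stored 10) is used.
    total = total + 5;

    // Both names share one address, so count now holds 15.
    return sameLocation(count, total) ? count : -1;
}
```

Calling aliasDemo() yields 15, since the update made through the alias is visible through the original name.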

Binding of attributes to variables: depending on the language, the attributes mentioned might exist for the entire life of the variable, or might change over time.

To clarify when the attributes of a variable take effect, we use the concept of binding.

A binding is static if the variable attribute is fixed before run time, and is unchanged throughout program execution.

A binding is dynamic, on the other hand, if the attribute can change at some point during execution.

Consider the different variable attributes with respect to binding:

Name: at what point is the identifier bound to the variable? (Think of some identifiers, such as i, which may be associated with many different index variables over the course of program execution.)
Address/location: can the address used to store the variable's data change over time? when is it allocated and deallocated?
This is (generally) highly hardware dependent, and is an aspect we will not focus closely on.
Value: when is the variable initialized, and when is its value re-assigned?
Type: does the variable always contain the same type of data, or does this change during execution?
Lifetime: at what point is the variable first declared, and at what point is it implicitly or explicitly destroyed?
Scope: from which blocks of code (or which files?) can we refer to the variable?

Bindings may be explicitly declared by the user, or may be implicitly declared through the rules and conventions of the language, applied to the usage of the variable in the program itself.

Consider the following C++ code segment:

#include <iostream>
using namespace std;

int mysquare(int x);   // forward declaration

int y;                 // global variable

int main()
{
   cout << "Please enter an integer" << endl;
   cin >> y;
   cout << "The square of " << y << " is ";
   y = mysquare(y);
   cout << y << endl;
}

int mysquare(int x)
{
   int result;
   result = x * x;
   return(result);
}

Variable x has

Static type int (defined before run time)
Local scope (can only be referred to within mysquare)
Its lifetime is only the time a specific call to mysquare is active
Its value is initialized to the actual value of passed parameter y, and is unchanged thereafter

While variable y has

Static type int (defined before run time)
Global scope (it can be referred to from anywhere within the program)
Its lifetime is the entire execution life of the program
Dynamic value, uninitialized until the user-entered value is bound to it, then subsequently the square of that value is bound to it

Aside: constants: Observe that a named constant might be bound to both a value and a storage location at the same time. (And, if truly a constant, those bindings will hold for the lifetime of the constant.)

Most programming languages require explicit declaration of variables - supplying at least the name, usually the type, and occasionally an initial value for the variable.

Some languages, such as PERL, FORTRAN, and BASIC, allow implicit declarations: when a variable is first used it is automatically or implicitly declared, and language rules are applied to attempt to derive the other attributes (value, type, etc).

In FORTRAN, for example, an implicitly declared variable is an integer if its identifier begins with one of the letters I through N, and is a real otherwise.
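
The FORTRAN first-letter rule is simple enough to state as code. The C++ helper below is a hypothetical sketch of that rule only (implicitType is not part of any compiler), and it assumes an ASCII-style character set in which the letters I through N are contiguous.

```cpp
#include <cctype>
#include <string>

// Mimics FORTRAN's default implicit typing: identifiers starting with
// I..N are integers, all others are reals.
std::string implicitType(const std::string& identifier)
{
    if (identifier.empty()) return "INVALID";   // assumption: empty names are rejected
    char first = static_cast<char>(
        std::toupper(static_cast<unsigned char>(identifier[0])));
    return (first >= 'I' && first <= 'N') ? "INTEGER" : "REAL";
}
```

Under this rule INDEX and N would be integers while X and TOTAL would be reals, with no declaration appearing anywhere in the program.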

Explicit declarations guarantee the compiler has complete information with which to apply type and error checking, but place extra restrictions on the programmer.

Languages which use dynamic type binding do not assign a type to a variable until a value is assigned to the variable: the type that is bound is one appropriate to the value assigned.

In some cases, the type can also be dynamically changed - e.g. you assign a variable an integer value at one point, and a string value at some later point.

Dynamic typing makes a language much more flexible, but has several disadvantages:

it limits the ability to perform error checking,
it can lead to readability/reliability problems,
any type checking must take place at run time, making execution significantly slower,
there must be a way to vary the size of storage bound to a variable, since the data type of the variable may change over time
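
C++ itself is statically typed, but the run-time cost of dynamic typing can be sketched by carrying a type tag alongside each value. Everything in the example below (DynVar, makeInt, makeStr, add) is a hypothetical illustration, not how any particular dynamically typed language is implemented.

```cpp
#include <stdexcept>
#include <string>

// A "dynamically typed" variable: the type tag travels with the value.
struct DynVar {
    enum Tag { Int, Str } tag;
    int i;
    std::string s;
};

DynVar makeInt(int v)                { DynVar d; d.tag = DynVar::Int; d.i = v; return d; }
DynVar makeStr(const std::string& v) { DynVar d; d.tag = DynVar::Str; d.i = 0; d.s = v; return d; }

// Every operation must inspect the tags at run time, and a type
// mismatch is only discovered when this code actually executes.
DynVar add(const DynVar& a, const DynVar& b)
{
    if (a.tag == DynVar::Int && b.tag == DynVar::Int) return makeInt(a.i + b.i);
    if (a.tag == DynVar::Str && b.tag == DynVar::Str) return makeStr(a.s + b.s);
    throw std::runtime_error("type error: cannot add int and string");
}
```

A single DynVar can hold an integer at one point and a string later, but each call to add pays for the tag checks, and mixing the two kinds fails only at run time.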

In C++ variables are statically typed; however, some implicit type conversion takes place at run time when the type of an evaluated value (e.g. the right hand side of an assignment statement) does not match the expected type (i.e. the left hand side of the statement). This causes some of the same complications as dynamic typing.
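
Two short C++ functions show how such implicit conversions can quietly change a value. The function names truncated and halfOf are hypothetical.

```cpp
// A double assigned to an int is converted implicitly,
// and the fractional part is discarded.
int truncated(double d)
{
    int i = d;
    return i;
}

// The division is carried out on integers first; only the
// (already truncated) result is converted to double.
double halfOf(int n)
{
    double r = n / 2;
    return r;
}
```

Here truncated(3.9) gives 3 and halfOf(7) gives 3.0 rather than 3.5, and most compilers accept both without an error (though they may warn if asked to).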

Variable lifetimes: the lifetime of a variable is typically defined as the period during which it has storage space allocated to it.

We will consider three classifications of variables, based on the way in which storage locations are bound to the variables:

Static variables: are bound to storage before execution begins, and remain bound until execution completes.
In C++ functions, a local variable preceded by the word static is a static variable, and uses the same storage location throughout program execution.
Such a variable could be used to track information useful from one call of a subroutine to another. For example, in the function below the static variable invocations tracks how many times the function has been executed.
```
void DoSomeStuff(int data)
{
   static int invocations = 0;
   int x, y, z;

   // do whatever the function is supposed to do

   invocations++;
}
```
Stack-dynamic variables: have their data types bound before execution begins (and constant through execution completion) but are not bound to specific storage until their declarations are executed.
In C++ functions, local variables not preceded by the static keyword are stack-dynamic: specific storage is allocated on each call of the function, and deallocated when the function completes.
In the example above, x, y, z are stack-dynamic variables, and their values are lost between invocations of DoSomeStuff.
Heap-dynamic variables: have their storage locations allocated and bound at run time, possibly through explicit instructions from the user.
In C++, the new operator, when applied to a type name, calls for the allocation of memory space for the appropriate data type.
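
The three storage classifications can be compared side by side in a small C++ sketch. The function names callCount and heapDemo, and the way callCount packs two counters into one return value, are assumptions made for illustration.

```cpp
int callCount()
{
    static int invocations = 0;  // static: one storage location for the whole run
    int scratch = 0;             // stack-dynamic: fresh storage on every call

    scratch++;                   // always becomes 1
    invocations++;               // keeps growing from call to call

    return invocations * 10 + scratch;
}

int heapDemo()
{
    int* p = new int(42);  // heap-dynamic: storage allocated at run time by new
    int v = *p;
    delete p;              // and released explicitly by delete
    return v;
}
```

Successive calls to callCount() return 11, 21, 31, and so on: the static counter survives between calls while the stack-dynamic one starts over each time.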

Type Checking: is the process of ensuring that the operands of an operator are of the correct type.

(Here we include subroutines and assignment statements under the umbrella term "operator", and hence actual parameter values and statement r-values under the "operand" label.)

A compatible type is one that is legal for the operator, or one which may be implicitly converted (or coerced) into a legal type.

Types which are not compatible provoke type errors.

Type checking is most efficiently carried out prior to execution, but is not possible when dynamic type binding is allowed, or in cases (such as C++ unions) where the same memory location is permitted to store values of different data types at different times during execution.

A programming language is strongly typed if type errors are always detected.

This is most easily done if each name in the language has a single type associated with it, and that type is known at compile time.
There are few languages which are truly strongly typed, though there are some that come close:
- Pascal's design of variant records prevents strong typing
- Ada programmers can specifically request suspension of type checking for particular types of conversions (Java has a similar loophole, created by allowing users to explicitly cast types)
Fortran, C, and C++ all use implicit and explicit coercion of data types frequently, and as such are not strongly typed languages.
In C++, for instance, the statements below result in an implicit conversion of the integer value from x into a floating point value for y:
```
int x = 1;
float y = 3.0;
y = x;
```
(Coercion will be addressed again in a couple of weeks.)

We consider two types of type compatibility: name type compatibility and structure type compatibility.

Name type compatibility means two variables are type compatible only if they are in declarations that use the same type name.
This means that only the variables' type names need be compared to determine if they are compatible: fast and easy, but not as flexible as structure type compatibility.
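C++ uses name type compatibility for class, struct, union, and enum types (a typedef only introduces an alias for an existing type, not a new type).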
Structure type compatibility means two variables are compatible if their types have identical structures, regardless of the actual names given them.
This is more flexible than name type compatibility, but requires more checking and prevents us from distinguishing between two data types if they do happen to have the same underlying structure.
C uses structure type compatibility for most types, though each struct or union declaration creates a distinct type.
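
A minimal C++ sketch of name type compatibility follows. The type names Celsius, Fahrenheit, and Temperature are hypothetical; std::is_same is the standard library trait from the type_traits header.

```cpp
#include <type_traits>

struct Celsius    { double degrees; };
struct Fahrenheit { double degrees; };  // same structure, different name
typedef Celsius Temperature;            // an alias, not a new type

// Identical structure is not enough for two struct types to be the same type...
const bool sameByStructure = std::is_same<Celsius, Fahrenheit>::value;   // false
// ...but an alias names exactly the same type.
const bool sameByAlias     = std::is_same<Celsius, Temperature>::value;  // true

// Given "Celsius c; Fahrenheit f;" the assignment "c = f;" is rejected
// by the compiler, even though both types hold a single double.
```

This is the trade-off described above: the compiler can distinguish Celsius from Fahrenheit, which a purely structural comparison could not do.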

In fact, object-oriented languages also face the issue of object compatibility, but this will be addressed later in the semester.

Variable scopes: the scope of a variable is the range of program statements in which the variable is "visible".

For example, the scope of a variable declared within a C++ function is local to that function - it cannot be referenced from outside the function.

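Referencing environment: the referencing environment of a statement is the collection of all variables visible at that statement.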

The local variables of a program unit or block are the variables which are visible within the block and which are also declared within that block.

The nonlocal variables of a program unit or block are the variables which are visible within the block but which are not declared within it.

Static scoping means the scopes of variables are identified prior to run time, whereas dynamic scoping means variable scopes are identified during execution.

In languages like Pascal, subprograms can be nested, creating a hierarchy of scopes.

If a language uses static scoping, then it is possible prior to execution time (e.g. by the compiler) to determine which variable is referenced by the use of an identifier at any point in the code.

In the case of C++, if an identifier matches a local variable (or parameter) then it is assumed that variable is the one being referenced, otherwise the match is to any global variable using the identifier. (Though you can force a match to the global variable, bypassing the local, by preceding the identifier name with "::".)

For example:

int result;    // global variable

int AddThree(int y)
{
   int result;        // local variable, hides the global result

   result = y + 3;    // assign y+3 to local variable

   // now add contents of global variable to local variable
   result = result + ::result;

   return(result);
}

In addition to allowing variables to be local to a subroutine, some languages allow variables to be local to a block.

In C++, for example, a variable can be declared within a compound statement such as a for loop, while loop, or if statement:

 // assorted code
 while (x < 3) {
    int i; // i is visible only in the while loop
    // assorted code
 }
 // and more assorted code

(Actually, you'll find some non-standard C++ compilers fail to support these scoping rules.)
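
Block-level scope also means an inner declaration can hide an outer variable of the same name. The sketch below assumes a standard-conforming compiler; shadowDemo is a hypothetical name.

```cpp
int shadowDemo()
{
    int i = 1;     // i local to the whole function
    int seen = 0;

    {
        int i = 2; // a new i, local to this block, hides the outer i
        seen = i;  // refers to the inner i
    }

    // The inner i no longer exists here; i again means the outer variable.
    return seen * 10 + i;
}
```

shadowDemo() returns 21: the block saw the value 2, while the outer i kept the value 1 throughout.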

Dynamic scoping is supported in some versions of APL, LISP, and SNOBOL.

In dynamic scoping, the scope rules are based on the calling sequence of subroutines, not on the way in which they are "structurally" nested.

If an identifier does not match any variables in the current function, we search the function which called it to find any matching variables, then the function which called that, and so on until a match is found.

Thus, an identifier in a particular function statement can refer to completely different variables in different executions of the same program!

Obviously this can raise significant concerns with respect to readability and reliability, and can require significantly more run time checking.

However, dynamic scoping can eliminate a great deal of parameter passing, since the relevant variables (and hence values) are implicitly visible to the called routine!
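
C++ is statically scoped, but the lookup rule for dynamic scoping can be simulated with an explicit stack of bindings. The sketch below is only a model of the idea: the names bindings, lookup, report, callerA, and callerB are hypothetical, and the use of -1 for "no binding found" is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Run-time environment: (name, value) bindings, pushed when a routine
// declares a variable and popped when that routine returns.
static std::vector<std::pair<std::string, int> > bindings;

// Search from the most recent binding back along the call chain.
int lookup(const std::string& name)
{
    for (std::size_t k = bindings.size(); k > 0; --k)
        if (bindings[k - 1].first == name) return bindings[k - 1].second;
    return -1;  // assumption: -1 stands for "no binding found"
}

// report declares no x of its own, so it sees whichever x its caller bound.
int report() { return lookup("x"); }

int callerA()
{
    bindings.push_back(std::make_pair(std::string("x"), 1));  // callerA's x
    int r = report();
    bindings.pop_back();
    return r;
}

int callerB()
{
    bindings.push_back(std::make_pair(std::string("x"), 2));  // callerB's x
    int r = report();
    bindings.pop_back();
    return r;
}
```

The same statement in report resolves x to 1 when reached through callerA and to 2 when reached through callerB, with no parameter being passed in either case. The search in lookup is also a reminder of the extra work that dynamic scoping pushes to run time.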