Programming languages: lecture notes

Data Types

The usefulness of a language depends heavily on the types of data which can be "naturally" modeled using the language
In this section we consider a number of the most commonly supported data types, and the implementation issues associated with them
One of the most significant issues is that of "user defined types", in which the user combines existing simpler data types to create a more complex, more specific data type
This allows improved readability and more refined type checking
Again we will describe variables in terms of descriptors, listing the relevant variable attributes (name, type, value, etc.)
If any of the attributes are dynamic - i.e. can change during program execution - then extra storage space must be associated with the variable to track the current value of the attribute
We also need to address how data types are supported in terms of operations on the data

Primitive data types

Primitive data types are the core set of types supported by a language, and which are not expressed in terms of other data types.
These commonly include the numeric data types (integer, floating point, decimal), Boolean data types, and character data types.
Numeric data types: these commonly include integers, floats, and decimals
Boolean data types: these data types are used to distinguish between the two logical values: true and false
Most modern programming languages directly support Boolean types, with the notable exception of C (before the C99 standard added _Bool), which simulates Booleans by treating the numeric value 0 as logical false and any nonzero value as logical true
(For backwards compatibility, C++ also allows numeric values to be used as Booleans, with 0 treated as false and nonzero values as true.)
The typical implementation is to use a single byte (or the smallest addressable memory unit supported by the hardware) to store a Boolean value, although in fact only a single bit would be necessary.
The most common operations based on Booleans are assignment, comparison (equal/not-equal) and the logical operations (such as AND, OR, NOT, XOR)
Character data types: the set of available characters is limited by the amount of storage space allocated to hold a character (or, conversely, a lower bound on the amount of storage space needed per character is determined by the size of the desired character set)
Some common implementations include ASCII (which uses the binary representations of values 0..127 to represent 128 different characters), EBCDIC (0..255) and Unicode (originally a 16-bit code able to handle 65536 different characters, and since extended to code points up to 0x10FFFF - allowing for representation of alphabets from most of the world's natural languages)
While most of the common programming languages support ASCII, Java's char type is based on 16-bit Unicode
The most common character operations are assignment and comparison, the latter of which may be purely equal/not-equal, or may allow for some relative ranking based on the underlying implementation

Strings

Because of the frequency with which character sequences are manipulated, a data type called the string is commonly implemented to handle them
Because many of the common string operations (truncation, concatenation, manipulation of substrings, etc) call for access to particular subsequences of characters, a natural implementation has been to treat the string as an array of characters
Typically a library of string manipulation functions or operations is also supported by a language
One major issue is whether strings have static or dynamic length, and how the length should be identified
If a string has dynamic length, then storage needs to be allocated and deallocated during execution.
Some method is also required to identify where the end of the string currently resides - usually either by having an attribute which records the current length of the string or by having a special termination character which marks the end of the string.
Pascal, Fortran, Cobol, and Ada have fixed string lengths; Perl has unlimited dynamic string lengths; C and C++ (when using character arrays) have dynamic string lengths but with a fixed upper bound on length.
An interesting issue is how much information needs to be maintained about the string during execution:
- C/C++ use a termination character (the null character), which the user is responsible for finding, and do not perform any checking on array bounds. Thus the "current length" of the string need not be tracked. When storage is allocated for a string, sufficient storage is set aside for the "maximum length" string, so no dynamic allocation is needed (the string can grow and shrink within the allotted space)
- Languages with fixed length strings also avoid the problem of tracking maximum and current string lengths during execution
- Languages such as Perl require that the current string length be tracked, and some means found to allocate sufficient storage when the string grows.
  Typically this is either by some form of linkage (a new section of storage is allocated and linked to the end of the existing string) or by allocating a completely new section of storage that is sufficiently large to hold the entire string (then deallocating the old string space)
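The two ways of locating the end of a string, and the "allocate a completely new section of storage" strategy for growth, can be sketched in Python. This is only an illustration under stated assumptions: the names terminated_length and LengthTrackedString, the zero byte as terminator, and the doubling of capacity are choices made for this sketch, not the implementation used by any particular language.

```python
def terminated_length(buffer: bytearray) -> int:
    """Find the length by scanning for the terminator (a zero byte).

    Like a C-style string, nothing is stored about the current length;
    a buffer with no terminator makes the scan run off the end
    (here that raises IndexError).
    """
    length = 0
    while buffer[length] != 0:
        length += 1
    return length


class LengthTrackedString:
    """String whose descriptor records the current length and capacity."""

    def __init__(self, text: bytes, capacity: int = 8):
        self.capacity = max(capacity, len(text))
        self.storage = bytearray(self.capacity)
        self.storage[:len(text)] = text
        self.length = len(text)

    def append(self, text: bytes) -> None:
        needed = self.length + len(text)
        if needed > self.capacity:
            # Allocate a completely new, larger block and copy the old
            # contents across; the old block is then discarded.
            new_capacity = max(needed, 2 * self.capacity)
            new_storage = bytearray(new_capacity)
            new_storage[:self.length] = self.storage[:self.length]
            self.storage = new_storage
            self.capacity = new_capacity
        self.storage[self.length:needed] = text
        self.length = needed

    def value(self) -> bytes:
        return bytes(self.storage[:self.length])
```

With a 16-byte zero-filled buffer holding b"hello", terminated_length has to examine six bytes to learn the length, whereas a LengthTrackedString reads it directly from its length attribute at the cost of storing that attribute.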

Ordinal types

Ordinal types have a fixed, finite set or range of possible values
This means that the values can be associated with a range of integers
E.g. an ordinal type might be created for the seven days of the week, with the integer value 1 associated with Monday, 2 with Tuesday, 3 with Wednesday, etc.
(Some implementations will begin enumeration at 0, others at 1.)
This improves both the readability of the language, by allowing the use of domain-relevant literal constants, and the reliability of the language, by allowing more restrictive type checking
In many languages, the integers, characters, and Booleans are the core ordinal types, and users can declare their own types in addition to these
The two most common forms of user-defined ordinal types are enumeration and subranges
Enumeration types require the user to list all the values associated with the ordinal type
The days of the week example falls into this category - we need to list Monday, Tuesday, etc explicitly for the system to understand they are part of the set
An important issue is whether or not the same literal constant - e.g. the identifier Monday - is allowed to appear in two different enumerated types. (E.g. DaysofWeek and WorkDays.) If so, how should any given occurrence of the literal be treated?
In Pascal, C, and C++ this is not allowed - the same literal constant cannot appear in more than one enumeration type definition in any given referencing environment.
In Ada it is allowed (overloading of literals) if rules for identifying the correct use of the literal constant are determinable from its context.
Subrange Types allow the user to specify a subset or subsequence of an existing ordinal type, e.g. all integers in the range 5-10, or all characters in the range 'a'-'z'
These are less universally applicable (obviously) but allow more compact definitions (also obviously)
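Enumeration and subrange types can be sketched in Python as follows. The Day values follow the days-of-the-week example (starting at 1); the Subrange class and the names WORK_DAYS and LOWER_CASE are hypothetical helpers for this sketch, and the range check happens at run time, which is only one of the ways a language might enforce a subrange.

```python
from enum import IntEnum


class Day(IntEnum):
    """Enumeration type: every value is listed explicitly."""
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7


class Subrange:
    """Subrange of an existing ordinal type, checked when a value is used."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def check(self, value):
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value!r} is outside {self.low!r}..{self.high!r}")
        return value


WORK_DAYS = Subrange(Day.MONDAY, Day.FRIDAY)   # subrange of an enumeration
LOWER_CASE = Subrange("a", "z")                # subrange of the characters
```

WORK_DAYS.check(Day.FRIDAY) returns the value unchanged, while WORK_DAYS.check(Day.SATURDAY) raises ValueError, which is the more restrictive type checking the section refers to.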

Arrays

Arrays are ordered collections of like-typed data elements, in which individual elements can be referred to by their position (or subscript) within the ordering
The subscript (or expression describing the subscript) must be evaluated to identify the specific memory cells associated with the array element
Common questions associated with the way a language will address arrays include:
- What types are legal for subscripts?
- Are subscript ranges implicitly or explicitly set, and when?
- How are arrays described and subscripts delimited?
- Is range checking performed?
- When does storage allocation take place?
- Can/must initialization take place at the time of allocation?
- Are array slices supported?
Typically an array element is referenced using an identifier for the array as a whole and a subscript for the relative position of the element within the array.
Typically delimiters such as brackets are used to separate the subscript from the array, e.g. myarray[3]
Readability is improved if the delimiting characters are different from those used to delimit parameter lists or program blocks.
(In Fortran, both parameter lists and array subscripts are delimited using parentheses, making it difficult to determine if foo(3) refers to the third element of array foo or a call to function foo with the single parameter 3.)
In some languages the lower bound for subscript ranges is implicitly set (typically at 0, as in C, C++, and Java, or at 1, as in Fortran)
In other languages, such as Pascal and Ada, the programmer must explicitly set both the upper and lower bounds on valid array subscripts.
The binding of array subscript ranges and storage can be categorized as
- static - ranges and storage are bound before run time
- fixed-stack dynamic - ranges are bound before run time but allocation actually takes place during execution
- stack-dynamic - ranges and storage are bound dynamically, but once bound they do not change during the variable lifetime
- heap-dynamic - ranges and storage are bound dynamically and can be altered during the variable's lifetime
C and C++ provide dynamic arrays in which the user is responsible for allocation and deallocation, while Perl's dynamic arrays are implicitly extended whenever the user assigns to a subscript beyond the current array bounds
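The implicit extension of a heap-dynamic array can be sketched in Python. GrowableArray and its fill parameter are hypothetical names for this sketch; it extends the array on assignment and range-checks reads, which is a design choice made here rather than a description of Perl's exact behaviour.

```python
class GrowableArray:
    """Heap-dynamic array that extends itself on out-of-range assignment."""

    def __init__(self, fill=None):
        self.fill = fill        # value given to cells created by extension
        self.cells = []

    def __setitem__(self, index, value):
        if index < 0:
            raise IndexError("negative subscripts are not supported in this sketch")
        if index >= len(self.cells):
            # Extend the storage up to the requested subscript.
            self.cells.extend([self.fill] * (index + 1 - len(self.cells)))
        self.cells[index] = value

    def __getitem__(self, index):
        # Run-time range check on reads.
        if not 0 <= index < len(self.cells):
            raise IndexError(f"subscript {index} is outside the current bounds")
        return self.cells[index]

    def __len__(self):
        return len(self.cells)
```

Assigning to subscript 3 of an empty GrowableArray(fill=0) leaves an array of length 4 whose first three cells hold the fill value.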
Multidimensional arrays can be supported by allowing additional subscripts, e.g. myarray(3, 7, 9) or myarray[3][7][9]
Some languages limit the number of subscripts, and hence the dimensionality of the arrays (e.g. 3 dimensions in early Fortran versions)
Some languages provide methods to initialize arrays at the time of storage allocation, in some cases even using the initialization to implicitly set the array sizes
Array operations are operations that take the entire array as an operand. Some examples might be pairwise addition of array elements (e.g. array1 + array2 adds the elements pairwise), reversing the elements of an array, etc.
Array slices refer to substructures of an array. This is most commonly applied when dealing with multi-dimensional arrays, e.g. taking a single row or column of a two-dimensional array
It might also be used to take a subsequence of the elements of an array
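Whole-array operations and slices can be sketched in Python using plain lists. The function names add_arrays, row and column are hypothetical, and a two-dimensional array is assumed to be stored as a list of rows.

```python
def add_arrays(left, right):
    """Whole-array operation: pairwise addition of two equal-length arrays."""
    if len(left) != len(right):
        raise ValueError("arrays must have the same length")
    return [a + b for a, b in zip(left, right)]


def row(matrix, r):
    """Slice: one row of a two-dimensional array (a list of rows)."""
    return list(matrix[r])


def column(matrix, c):
    """Slice: one column of a two-dimensional array."""
    return [matrix_row[c] for matrix_row in matrix]


grid = [[1, 2, 3],
        [4, 5, 6]]
second_row = row(grid, 1)                       # [4, 5, 6]
third_column = column(grid, 2)                  # [3, 6]
subsequence = [10, 20, 30, 40, 50][1:4]         # [20, 30, 40]
reversed_copy = [10, 20, 30][::-1]              # [30, 20, 10]
```

A row slice is cheap when rows are stored contiguously, whereas a column slice has to visit one element from every row.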
Implementation issues to be considered include whether or not run time range checking takes place, and how the memory locations of array elements are computed.
For instance, attributes of an array might include its base memory address and the (standard) size of individual elements. If subscripting begins at 0 and array elements are stored contiguously in memory then the location of the i'th element might then be computed as start address + i * (size of individual element)
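If several subscripts are given in a multidimensional array then this formula must be adjusted to consider the number of elements preceding the desired element. E.g. if a two-dimensional array with n columns is stored row by row and subscripting begins at 0, the location of element [i][j] might be computed as start address + (i * n + j) * (size of individual element)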
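The address computation described above can be sketched in Python for any number of dimensions. Row-major storage (the last subscript varying fastest) is an assumption of this sketch, as are the function name element_address and its parameters; the notes do not fix a storage order.

```python
def element_address(base, element_size, subscripts, dimensions):
    """Address of an element in a contiguous, row-major, 0-based array.

    subscripts -- one subscript per dimension, e.g. (i, j, k)
    dimensions -- the number of elements along each dimension
    """
    if len(subscripts) != len(dimensions):
        raise ValueError("one subscript is required per dimension")
    position = 0
    for subscript, extent in zip(subscripts, dimensions):
        if not 0 <= subscript < extent:          # run-time range check
            raise IndexError(f"subscript {subscript} out of range 0..{extent - 1}")
        # Skip all the elements that precede this subscript.
        position = position * extent + subscript
    return base + position * element_size
```

For a 4 x 5 array of 4-byte elements starting at address 1000, element (2, 3) is preceded by 2 * 5 + 3 = 13 elements, so element_address(1000, 4, (2, 3), (4, 5)) gives 1052.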

Associative arrays have a completely different context of reference: the content elements are addressed by keys rather than subscripts

Each element of an associative array is actually a (key, value) pair

Associative arrays are supported in Perl (referred to as hashes) and Java (through library classes such as HashMap)

Below we give some Perl examples:

 # initializing an associative array
 my %salaries = ("Cedric" => 75000,
                 "Perry"  => 57000,
                 "Mary"   => 55750,
                 "Gary"   => 47850);
 
 # assigning a new value
 $salaries{"Mary"} = 76000;
 
 # removing a value
 delete $salaries{"Gary"};
 
 # looking for a value with a given key
 if (exists $salaries{"Perry"}) { ... }
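
For comparison, the same four operations can be sketched with Python's built-in dictionary type, which is also an associative array. The variable perry_is_listed is a name introduced only for this sketch.

```python
# initializing an associative array
salaries = {"Cedric": 75000,
            "Perry": 57000,
            "Mary": 55750,
            "Gary": 47850}

# assigning a new value
salaries["Mary"] = 76000

# removing a value
del salaries["Gary"]

# looking for a value with a given key
perry_is_listed = "Perry" in salaries
```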

Records

Unlike arrays, records are based on a logical collection of data elements which are not necessarily of uniform type.
For instance, an employee record might be composed of the employee's name, salary, employee number, etc.
Most modern programming languages support records in some way:
The elements, or fields, of a record are usually referenced by name (rather than the use of subscripts in arrays) for readability.
Record (or struct) declarations usually involve naming the overall type of record, then specifying names and data types for each of the record fields.
Referencing a record element then involves identifying both the record and field of interest, e.g. employee.name = "Bob"
Record operations (i.e. on entire records, rather than individual fields) usually include assignment (copying the field values of one record to another) and comparison (for equality in all fields)
To access record fields it is usually necessary to record the offset of each field relative to the start of the record
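Field access by offset can be sketched in Python with the standard struct module. The employee layout (a 4-byte number, an 8-byte salary, a 12-byte name) and the helper names are assumptions for this sketch, and the fields are packed with no padding between them; real compilers often insert padding for alignment.

```python
import struct

# Hypothetical employee record: (field name, struct format code).
# The "<" prefix used below selects standard sizes with no padding.
EMPLOYEE_FIELDS = [("number", "i"), ("salary", "d"), ("name", "12s")]


def field_offsets(fields):
    """Return {field name: (offset from start of record, format code)}."""
    offsets, position = {}, 0
    for name, code in fields:
        offsets[name] = (position, code)
        position += struct.calcsize("<" + code)
    return offsets


def record_size(fields):
    """Total storage needed for one record."""
    return sum(struct.calcsize("<" + code) for _, code in fields)


def get_field(record, offsets, name):
    """Read one field by adding its offset to the start of the record."""
    offset, code = offsets[name]
    return struct.unpack_from("<" + code, record, offset)[0]


def set_field(record, offsets, name, value):
    """Write one field at its offset within the record."""
    offset, code = offsets[name]
    struct.pack_into("<" + code, record, offset, value)
```

With this layout the offsets are 0, 4 and 12, and a record occupies 24 bytes; a reference such as employee.name then amounts to "start of record + 12".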

Unions

Unions allow a single set of memory cells to be used to store different data types at different times - for instance a date union might enable a block of storage to hold the date in string format or numeric formats at different times.
The data stored in the union is interpreted based on the data type the union is believed to currently represent.
Unions can substantially improve program flexibility, but their implementations often prevent a language from being strongly typed by failing to allow type checking (since any type checking on a union must take place dynamically)
Fortran, C, and C++ provide unions with no type checking.
Pascal's record variants store a tag for the current type as part of unions within a record, but the user can change the tag without changing the variant, rendering type checking invalid.
Ada has an enforced version of tags for safe type checking.
Java does not support unions at all.
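The difference between an unchecked union and a tagged one can be sketched in Python using the standard ctypes module. FreeUnion and TaggedUnion are hypothetical names; the tagged version sets the tag together with the value, which is in the spirit of the enforced tags described above but is not a model of Ada's actual mechanism. The sketch assumes IEEE 754 single-precision floats.

```python
import ctypes


class FreeUnion(ctypes.Union):
    """Both fields occupy the same four bytes; nothing records which is live."""
    _fields_ = [("as_int", ctypes.c_uint32), ("as_float", ctypes.c_float)]


class TaggedUnion:
    """Discriminated union: the tag can only change together with the value."""

    def __init__(self):
        self._cell = FreeUnion()
        self._tag = None

    def set_int(self, value):
        self._cell.as_int = value
        self._tag = "int"

    def set_float(self, value):
        self._cell.as_float = value
        self._tag = "float"

    def get_int(self):
        if self._tag != "int":      # dynamic type check
            raise TypeError(f"union currently holds {self._tag}, not int")
        return self._cell.as_int

    def get_float(self):
        if self._tag != "float":    # dynamic type check
            raise TypeError(f"union currently holds {self._tag}, not float")
        return self._cell.as_float
```

Storing 1.0 through as_float and reading as_int from a FreeUnion yields 0x3F800000, the raw bit pattern of the float, with no error reported; the same misuse on a TaggedUnion raises TypeError.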

Sets

Sets are unordered collections of elements drawn from some base ordinal data type.
Most current languages do not explicitly support sets, one exception being Pascal.
(Ada does provide the set membership operator, which may be applied to enumerated types.)
The declaration of a set requires either explicit listing of the set elements (as with enumerated types) or subrange specifications
Commonly supported set operations would include union, intersection, difference, and equality.
Typical implementations use a bit string, with a single bit for each element in the universe of set elements - the bit is then set to 1 or 0 to indicate the element's current presence or absence in the set
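The bit-string implementation can be sketched in Python, using an integer as the bit string. BitSet is a hypothetical class for this sketch, and the universe is assumed to be the ordinal values 0..size-1; each set operation becomes a single bitwise operation on the underlying integers.

```python
class BitSet:
    """Set over the universe 0..size-1, stored as one bit per element."""

    def __init__(self, size, members=()):
        self.size = size
        self.bits = 0
        for member in members:
            self.add(member)

    def add(self, element):
        if not 0 <= element < self.size:
            raise ValueError(f"{element} is not in the universe 0..{self.size - 1}")
        self.bits |= 1 << element           # set the element's bit to 1

    def __contains__(self, element):
        return 0 <= element < self.size and bool((self.bits >> element) & 1)

    def _from_bits(self, bits):
        result = BitSet(self.size)
        result.bits = bits
        return result

    def union(self, other):
        return self._from_bits(self.bits | other.bits)

    def intersection(self, other):
        return self._from_bits(self.bits & other.bits)

    def difference(self, other):
        return self._from_bits(self.bits & ~other.bits)

    def __eq__(self, other):
        return self.size == other.size and self.bits == other.bits
```

For example, with a universe of 7 elements, the sets {0, 1, 2} and {2, 3} are stored as 0b0000111 and 0b0001100, and their union is the bitwise OR 0b0001111.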

Pointers

A pointer is used to describe memory addresses, and the values a pointer variable may assume include the range of all valid memory addresses plus the special value nil, which indicates the pointer currently does not reference any memory location
A pointer used for indirect addressing is a pointer variable, whereas variables which are dynamically allocated from the heap are referred to as heap-dynamic variables
Pointers are used to provide indirect memory addressing and to provide a method for dynamic memory allocation
Pointers are typically regarded as distinct from both structured types and scalar variables
The major design issues associated with pointers include:
- What are the scope and lifetime of pointer variables and heap-dynamic variables?
- Are there limitations on the types of data to which specific pointers can refer (e.g. is something declared as a pointer to a character incompatible with something declared as a pointer to an integer?)
- Are both pointer variables and heap-dynamic variables supported?
The typical pointer operations are assignment (making a pointer refer to a specific memory location) and dereferencing (looking at the contents of memory at the location a pointer refers to)
Languages that support heap-dynamic variables must also provide an operator for allocating memory and returning its reference address
The actual notation for pointer assignment and dereferencing varies considerably by language, but is frequently closely related to the notation used in referencing record fields
Pointers substantially increase program flexibility, but detract from readability by effectively creating aliases for specific memory locations.
Other pointer problems include:
- Dangling pointers refer to the case in which an existing pointer references a section of memory which has been deallocated, and possibly reallocated for other use.
- Memory leakage (or lost heap-dynamic variables) refers to the case in which storage has been allocated, but there is no remaining variable referencing the allocated space, hence the space is effectively unusable
Pointers in specific languages:
- Pascal uses pointers for dynamic allocation and deallocation of memory (through new and dispose) with one of four techniques commonly applied for dispose:
  1. ignore it (deallocation doesn't actually take place until program termination!)
  2. remove dispose from the language
  3. deallocate the memory pointed to by the given variable, and leave the programmer responsible for the problem of dangling pointers
  4. implement dispose to search out all pointers referring to the deallocated memory and set them all to nil -- this approach is the safest, but is extremely costly and rarely (never?) implemented
  Note: one other alternative has been suggested, the tombstone, in which all pointers point to the tombstone, which in turn references the allocated space.
    When the space is deallocated the tombstone is set to reflect this fact, and any dangling pointers really only reference the tombstone and can be detected during execution.
- Ada follows a process similar to Pascal, but some implementations implicitly deallocate heap-dynamic variables at the end of the scope of the pointer type - thus some dynamic deallocation automatically takes place
- C and C++ treat pointers in a manner very similar to addresses in assembly language - making them a very powerful and potentially dangerous tool
  These pointers can point at almost any variable or function in memory, and in fact array names are treated like constant pointers to the first cell in the array
  Pointers can be indexed as though they are array names
  Generic pointers (of type void *) can point at values of any type, and they are often used as parameters for functions operating on memory.
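The tombstone alternative mentioned under Pascal's dispose can be sketched in Python. All of the names here (Tombstone, Pointer, new, copy_pointer, dispose, DanglingPointerError) are hypothetical, and Python objects stand in for memory cells; the point is only that every pointer reaches its target through one shared tombstone, so a dangling reference can be detected during execution.

```python
class DanglingPointerError(Exception):
    """Raised when a pointer to deallocated storage is dereferenced."""


class Tombstone:
    """Intermediate cell that references the allocated space."""

    def __init__(self, target):
        self.target = target
        self.alive = True


class Pointer:
    """Pointer that reaches its target only through a tombstone."""

    def __init__(self, tombstone):
        self.tombstone = tombstone

    def dereference(self):
        if not self.tombstone.alive:
            raise DanglingPointerError("pointer refers to deallocated storage")
        return self.tombstone.target


def new(value):
    """Allocate a heap-dynamic variable and return a pointer to it."""
    return Pointer(Tombstone(value))


def copy_pointer(pointer):
    """Pointer assignment: the copy shares the same tombstone."""
    return Pointer(pointer.tombstone)


def dispose(pointer):
    """Deallocate: mark the tombstone and release the target."""
    pointer.tombstone.alive = False
    pointer.tombstone.target = None
```

After q = copy_pointer(p) and dispose(p), calling q.dereference() raises DanglingPointerError instead of silently reading reused memory. The cost is an extra level of indirection on every dereference and the storage for the tombstones themselves.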
Reference types: reference types are used in C++ to implement pass-by-reference parameters without requiring the explicit use of pointers (Java also has reference types, but the references themselves are passed by value)
Unless pass-by-reference is used, parameters in C++ are always pass-by-value
Java references actually refer to class instances, rather than memory addresses - making them a much safer (though less flexible) feature
Heap management covers the dynamic allocation and deallocation of memory segments from the available heap.
One implementation (common to LISP approaches) is to treat the heap as a linked list of identical available cells - allocation and deallocation simply take cells from, or return cells to, the list.
Two approaches are taken towards deallocating space:
- Garbage collection: the lazy approach: when the heap runs out of space, go through the allocated segments of memory and the list of pointers and identify which memory segments are no longer referenced, then return these to the free pool
- Reference counters: the eager approach: for each allocated memory segment maintain a counter of how many pointers refer to the segment - increment or decrement the counter as necessary during execution, and whenever the count hits 0 return the memory segment to the free pool
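The two approaches can be sketched in Python over a toy heap. The Cell class, the free_pool list and the function names are hypothetical, and the lazy collector is written as a simple mark-and-sweep pass, which is one common way of finding the segments that are no longer referenced.

```python
class Cell:
    """A heap cell that may point to other cells."""

    def __init__(self, name):
        self.name = name
        self.pointers = []      # outgoing references to other cells
        self.count = 0          # used by reference counting
        self.marked = False     # used by mark-and-sweep


# Eager approach: reference counters.

def add_reference(cell):
    """Call whenever a new pointer to the cell is created."""
    cell.count += 1


def drop_reference(cell, free_pool):
    """Call whenever a pointer to the cell is removed."""
    cell.count -= 1
    if cell.count == 0:
        # Nothing refers to this cell any more, so reclaim it now and
        # release whatever it pointed to in turn.
        free_pool.append(cell)
        for target in cell.pointers:
            drop_reference(target, free_pool)
        cell.pointers = []


# Lazy approach: mark-and-sweep garbage collection.

def collect(heap, roots, free_pool):
    """Return the cells still in use; unreachable cells go to free_pool."""
    for cell in heap:
        cell.marked = False
    pending = list(roots)
    while pending:                      # mark everything reachable
        cell = pending.pop()
        if not cell.marked:
            cell.marked = True
            pending.extend(cell.pointers)
    survivors = []
    for cell in heap:                   # sweep the unmarked cells
        (survivors if cell.marked else free_pool).append(cell)
    return survivors
```

One consequence of the design choice is worth noting: two cells that point at each other never reach a count of 0 under reference counting, whereas collect reclaims them as soon as neither is reachable from the roots.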