C SC 320 Lecture Notes: Recognizing languages

Recognizing languages

When we are given a particular string, and are asked to determine whether or not it is in a particular language, we will proceed as follows:

Work through the string, one character at a time, from left to right
At each character encountered, determine (based on a supplied description of the language) whether or not the string may be a valid string in the language
For instance, if the description of the language is given as the regular expression (1101)^*, and the string we are given is 11110111011101, then proceeding from left to right we know the string is not valid as soon as we reach the third character.
As soon as we realize a string is not in the language we can end the process
If we reach the end of the string and our rules still indicate that the string can be in the language then the string is a valid string in the language.

Memory requirements: one interesting aspect of a language is the amount of information we have to remember about a string while we are trying to determine whether or not it is a valid member of the language.

For the (1101)^* language we only need to remember one of the following five things at any given point in time:

the substring seen so far (in the left-to-right examination) has been of the form (1101)^* -- this includes our initial state, when we haven't seen any characters yet so (1101)⁰ would be a valid interpretation, or
the substring seen so far has been of the form (1101)^*1, so may still turn out to be valid, or
the substring seen so far has been of the form (1101)^*11, so may still turn out to be valid, or
the substring seen so far has been of the form (1101)^*110, so may still turn out to be valid, or
the substring seen so far is not valid for the language

By remembering which of these five points we are at so far, we can look at the next character in the string and make a valid judgement as to whether or not the substring seen so far is still potentially valid for the language.

Using these five potential states, we could design an abstract machine capable of recognizing whether or not a particular string was in the language.
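As a rough sketch of such a machine, the five cases above can be written as a small transition table in Python. The state numbering, the DEAD name, and the recognize function are illustrative assumptions, not notation from these notes; only the transitions themselves follow from the five cases listed above.

```python
# States mirror the five cases listed above (numbering is illustrative):
#   0: seen (1101)^*       1: seen (1101)^*1
#   2: seen (1101)^*11     3: seen (1101)^*110
#   DEAD: the substring seen so far is not valid for the language
DEAD = 4

# TRANSITIONS[state][character] gives the next state.
TRANSITIONS = {
    0: {"1": 1, "0": DEAD},
    1: {"1": 2, "0": DEAD},
    2: {"1": DEAD, "0": 3},
    3: {"1": 0, "0": DEAD},
    DEAD: {"1": DEAD, "0": DEAD},
}


def recognize(string):
    """Scan left to right; return (accepted, position).

    position is the 1-based index of the character at which the string
    was first known to be invalid, or None if that never happened.
    """
    state = 0
    for position, char in enumerate(string, start=1):
        state = TRANSITIONS[state][char]
        if state == DEAD:
            # We can end the process as soon as the string is known invalid.
            return False, position
    # At the end, only "seen (1101)^*" counts as a valid string.
    return state == 0, None


print(recognize("11110111011101"))  # rejected at the third character
print(recognize("11011101"))
```

Under these assumptions, recognize("11110111011101") returns (False, 3), matching the earlier observation that the string is known to be invalid at the third character.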

We could, theoretically, design such a machine that had one state for every different string in the language, but for an infinite language that would require an infinite number of things to remember (i.e. an infinite number of states).

A useful property of regular languages is that they are precisely the set of languages that can be recognized by such an abstract machine with a finite number of states -- i.e. for every regular language we can design a recognizer without having to remember every string in the language.

The abstract recognizing machines we are discussing will be termed finite automata, and are the subject of our next section.

Finite Automata/State Machines

Definitions and theorems

Transition functions

We will use § to represent the transition function for a finite automaton, taking you from a state and character to a new state. Thus §: Q × ∑ -> Q

Acceptance and rejection

Let machine M = (Q, ∑, q₀, A, §) be a finite automaton. A string x in ∑^* is accepted by M if §^*(q₀,x) is in A.

(I.e. by repeatedly applying the transition function to the characters of x, we are finally left in an accept state.)

The language accepted by M, or the language recognized by M, is the set L(M) = { x ∈ ∑^* | x is accepted by M }

If L is any language over ∑, L is accepted, or recognized, by M if and only if L = L(M)

(I.e. to accept (or recognize) a language L, a FA must accept all the strings in L and reject all the strings in L'.)
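A minimal sketch of these definitions in Python, assuming a machine is stored as a tuple (Q, ∑, q₀, A, §) with § held as a dictionary keyed by (state, character). The names delta_star and accepts, and the EVEN_ONES example machine (strings with an even number of 1s), are hypothetical and chosen only for illustration.

```python
def delta_star(delta, state, string):
    """Extended transition function: apply delta to each character in turn."""
    for char in string:
        state = delta[(state, char)]
    return state


def accepts(machine, string):
    """x is accepted by M if delta_star(q0, x) is in the accept set A."""
    states, alphabet, start, accept, delta = machine
    return delta_star(delta, start, string) in accept


# Hypothetical example machine: strings over {0, 1} with an even number of 1s.
EVEN_ONES = (
    {"even", "odd"},          # Q
    {"0", "1"},               # alphabet
    "even",                   # q0
    {"even"},                 # A
    {                         # transition function
        ("even", "0"): "even",
        ("even", "1"): "odd",
        ("odd", "0"): "odd",
        ("odd", "1"): "even",
    },
)

print(accepts(EVEN_ONES, "0110"))  # True: two 1s
print(accepts(EVEN_ONES, "1"))     # False: one 1
```

Storing § as a dictionary keeps the code close to the state tables used later in these notes, where each row is a (state, symbol, next state) triple.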

Theorem 3.1

A language L over the alphabet ∑ is regular if and only if there is a FA that accepts L.

(we'll prove this one later)

Distinguishing strings

The whole purpose of finite automata is to distinguish some strings from others.

From a practical viewpoint, we want two strings, x, y, to be treated differently (i.e. distinguishable) if we can follow each of them with the same substring, z, and wind up with one string in a language, L, and the other string in L'

Example: over alphabet { 0, 1 }

L = (1010)^*

let x = 101 and y = 10
Following each with z = 0 gives xz = 1010, which is in L, and yz = 100, which is in L', so strings x and y are said to be distinguishable with respect to L
if, on the other hand, x = 101 and y = 1010101, then for any string z, xz is in L iff yz is in L, so x and y are said to be indistinguishable with respect to L
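The example above can be explored with a bounded brute-force search, sketched here in Python. The function names and the max_len bound are assumptions made for the illustration. A suffix that is found is a genuine witness of distinguishability, but a result of None only says that no witness exists up to the bound; it is evidence, not a proof, of indistinguishability.

```python
import re
from itertools import product


def in_language(string):
    """Membership test for L = (1010)^* over the alphabet {0, 1}."""
    return re.fullmatch(r"(1010)*", string) is not None


def find_distinguishing_suffix(x, y, max_len=8):
    """Return the shortest z (up to max_len) with exactly one of xz, yz in L.

    Returns None if no such z of length <= max_len exists.
    """
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            z = "".join(chars)
            if in_language(x + z) != in_language(y + z):
                return z
    return None


print(find_distinguishing_suffix("101", "10"))       # finds z = "0"
print(find_distinguishing_suffix("101", "1010101"))  # None within the bound
```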

Lemma 3.1

Suppose that L is a language over the alphabet ∑ and M = (Q, ∑, q₀, A, §) is any FA recognizing L. If x and y are any two strings over the alphabet for which

§^*(q₀,x) = §^*(q₀,y)

then x and y are indistinguishable with respect to L.

Proof:

Let z be any string over the alphabet
Consider the two strings xz and yz
We can show by induction that, for any q in Q, §^*(q,xy) = §^*( §^*(q,x),y)
(that will be part of assignment 2)
Thus, we know that
- §^*(q₀,xz) = §^*(§^*( q₀,x),z)
- §^*(q₀,yz) = §^*(§^*( q₀,y),z)
Assume that §^*(q₀,x) = §^*(q₀,y) (the "if" part of our Lemma)
Then it must also be the case that §^*(q₀,xz) = §^*(q₀,yz)
Since M is assumed to recognize L, the two strings xz and yz must either both be in L or both be in L'
Thus, x and y are (by our definition) indistinguishable with respect to L

Theorem 3.2

Suppose that L is a language over the alphabet ∑, and for some positive integer, n, there are n strings over the alphabet, any two of which are distinguishable with respect to L. Then there can be no FA recognizing L with fewer than n states.

Proof by contradiction:

Suppose x₁, ..., x_n are n pairwise distinguishable strings with respect to L
Suppose M is any FA with fewer than n states
Then, by the pigeonhole principle, the states §^*(q₀,x₁), §^*(q₀,x₂), ... §^*(q₀,x_n) cannot all be distinct
So, for some i ≠ j, §^*(q₀,x_i) = §^*(q₀,x_j)
If M recognized L then, by Lemma 3.1, x_i and x_j would be indistinguishable with respect to L
Since x_i and x_j were defined to be distinguishable with respect to L, it follows that M cannot recognize L

This is important, in that if we can find n strings that are pairwise distinguishable, we know that any FA for the language must have at least n states.
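To illustrate how this lower bound can be used, the sketch below checks that five strings are pairwise distinguishable with respect to the (1101)^* language from the start of these notes. The choice of candidate strings, the helper names, and the suffix-length bound are assumptions made for this example. Because every distinguishing suffix found is a genuine witness, a True result here does show that any FA for (1101)^* needs at least five states, which agrees with the five things to remember listed earlier.

```python
import re
from itertools import combinations, product


def in_language(string):
    """Membership test for the language (1101)^*."""
    return re.fullmatch(r"(1101)*", string) is not None


def distinguishable(x, y, max_len=4):
    """True if some z of length <= max_len puts exactly one of xz, yz in L."""
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            z = "".join(chars)
            if in_language(x + z) != in_language(y + z):
                return True
    return False


# One candidate string per "thing to remember" (an assumption of this sketch):
# nothing seen, seen 1, seen 11, seen 110, and an already-invalid string.
CANDIDATES = ["", "1", "11", "110", "0"]

PAIRWISE = all(distinguishable(x, y) for x, y in combinations(CANDIDATES, 2))
print(PAIRWISE)
```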

Theorem 3.3

The language pal of palindromes (strings that read the same forwards as backwards) over the alphabet { 0, 1 } is not regular.

Proof:

We can show that there are an infinite number of strings which are pairwise distinguishable with respect to pal
```
x₁ = 01
x₂ = 001
x₃ = 0001
x₄ = 00001
...
```
To distinguish any two strings x_i, x_j where i ≠ j, use the string z = 10^j: this results in the two strings x_i z = 0^i110^j (which is not in pal) and x_j z = 0^j110^j (which is in pal)
Then, by Theorem 3.2, a FA recognizing pal would need at least n states for every positive integer n, i.e. an infinite number of states
Since a finite automaton must have a finite number of states, no FA can recognize pal
Therefore pal is not regular.
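The distinguishing argument in this proof can be spot-checked for small i and j with a short Python sketch. The function names and the range checked are assumptions for illustration; the check covers only finitely many pairs and is a sanity test of the construction, not a replacement for the proof.

```python
def is_palindrome(string):
    """True if the string reads the same forwards as backwards."""
    return string == string[::-1]


def x(i):
    """The i-th string from the proof: i zeros followed by a single 1."""
    return "0" * i + "1"


def suffix_for(j):
    """The distinguishing suffix 10^j used in the proof."""
    return "1" + "0" * j


def check_pairs(limit):
    """Check that suffix_for(j) separates x(i) from x(j) for all i != j."""
    for i in range(1, limit + 1):
        for j in range(1, limit + 1):
            if i == j:
                continue
            z = suffix_for(j)
            # x(i) + z must not be a palindrome, x(j) + z must be one.
            if is_palindrome(x(i) + z) or not is_palindrome(x(j) + z):
                return False
    return True


print(check_pairs(10))
```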

Theorem 3.4

(Paraphrased: the regular languages are closed under union, intersection, and difference, and hence under complement, since L' = ∑^* - L. We will show this by showing how to construct a recognizing FA from the FAs for the languages being operated on.)

Suppose that M₁ = (Q₁, ∑, q₁, A₁, §₁) accepts L₁, and M₂ = (Q₂, ∑, q₂, A₂, §₂) accepts L₂, both over the same alphabet ∑.

Let M = (Q, ∑, q₀, A, §) where Q = Q₁ × Q₂ and q₀ = (q₁,q₂) and §((p,q),a) = (§₁(p,a), §₂(q,a)) for any p in Q₁, q in Q₂, and a in ∑.

Then


If A = { (p,q) | p ∈ A₁ or q ∈ A₂ } then M accepts the language L₁ ∪ L₂
If A = { (p,q) | p ∈ A₁ and q ∈ A₂ } then M accepts the language L₁ ∩ L₂
If A = { (p,q) | p ∈ A₁ and q ∉ A₂ } then M accepts the language L₁ - L₂


     
Intuitive argument:

for any input string, our new machine M simply needs to be able to track which state the string would put us in for machine M₁ and which state the string would put us in for machine M₂
The three cases:
For union, we accept if either M₁ or M₂ is in an accept state.
For intersection, we accept if both M₁ and M₂ are in an accept state.
For difference, we accept if M₁ is in an accept state and M₂ is not in an accept state.
         
    
Example: Suppose that, over the alphabet { 0, 1 }, we have the two regular languages L₁ = { x | 00 is not a substring of x }, L₂ = { x | x ends with 01 }.

From the FAs which recognize L₁ and L₂, construct an FA which recognizes L₁ - L₂
    

     State table for L₁
    
    State Symbol Next State
      X   0   Y  
      X   1   X  
      Y   0   Z  
      Y   1   X  
      Z   0   Z  
      Z   1   Z  
    
    X is the start state and
    X, Y are the accept states.
    
 State table for L₂
    
    State Symbol Next State
      T   0   V  
      T   1   T  
      V   0   V  
      V   1   W  
      W   0   V  
      W   1   T  
    
    T is the start state and
    W is the accept state.
    
Constructing the state table for L₁ - L₂

There are nine potential states for the new machine, corresponding to the state combinations XT, XV, XW, YT, YV, YW, ZT, ZV, ZW
The start state for the new machine corresponds to the start pair from the two originals, i.e. XT
From this point, state XT, if we observe a 1 we would stay in state XT, while on a 0 we would move to state YV
From state YV on 0 we move to state ZV while on 1 we move to state XW
From state ZV on 0 we stay in state ZV while on 1 we move to state ZW
From state XW on 0 we move to state YV while on 1 we move to state XT
From state ZW on 0 we move to state ZV while on 1 we move to state ZT
From state ZT on 0 we move to state ZV while on 1 we stay in state ZT
Observe that there are three unused, or inaccessible, states - these can be safely removed from the FA
The accept states are those that correspond to accept for L₁ and reject for L₂, i.e. XT and YV. (XV and YT would have been valid, but they were two of the unreachable states.)
         
         State Symbol Next State
           XT   0   YV  
           XT   1   XT  
           YV   0   ZV  
           YV   1   XW  
           XW   0   YV  
           XW   1   XT  
           ZV   0   ZV  
           ZV   1   ZW  
           ZW   0   ZV  
           ZW   1   ZT  
           ZT   0   ZV  
           ZT   1   ZT  
         
         Where XT is the start state and the
         accept states are XT and YV
         
         

Note that for L₁ ∪ L₂ and L₁ ∩ L₂ the transition function/state tables are the same as this, but the set of accept states differs!
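The construction for this example can be sketched in Python as follows. The dictionary layout and the names D1, D2, and reachable_product are assumptions made for the sketch; the transitions, start states, and accept states are copied from the two state tables above. Only pairs reachable from the start pair are generated, so the inaccessible states never appear.

```python
# Transition tables for the two machines, keyed by (state, symbol).
D1 = {("X", "0"): "Y", ("X", "1"): "X",
      ("Y", "0"): "Z", ("Y", "1"): "X",
      ("Z", "0"): "Z", ("Z", "1"): "Z"}
D2 = {("T", "0"): "V", ("T", "1"): "T",
      ("V", "0"): "V", ("V", "1"): "W",
      ("W", "0"): "V", ("W", "1"): "T"}
START1, ACCEPT1 = "X", {"X", "Y"}
START2, ACCEPT2 = "T", {"W"}


def reachable_product(d1, d2, start1, start2, alphabet="01"):
    """Build the product transition table, visiting only reachable pairs."""
    start = (start1, start2)
    table = {}
    seen = {start}
    pending = [start]
    while pending:
        p, q = pending.pop()
        for a in alphabet:
            # The product machine steps both component machines at once.
            nxt = (d1[(p, a)], d2[(q, a)])
            table[((p, q), a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                pending.append(nxt)
    return table, seen


TABLE, STATES = reachable_product(D1, D2, START1, START2)

# Same table, three different accept sets.
UNION = {(p, q) for (p, q) in STATES if p in ACCEPT1 or q in ACCEPT2}
INTERSECTION = {(p, q) for (p, q) in STATES if p in ACCEPT1 and q in ACCEPT2}
DIFFERENCE = {(p, q) for (p, q) in STATES if p in ACCEPT1 and q not in ACCEPT2}

print(sorted(STATES))
print(sorted(DIFFERENCE))
```

With these tables the search reaches six of the nine pairs, and DIFFERENCE contains only (X, T) and (Y, V), in line with the hand construction above.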
         
         
Simplification of the table for L₁ - L₂: we can observe that, once state ZV is reached, the only possible result is rejection, so we can collapse states ZV, ZW, ZT into a single reject state, R, from which there is no escape...
         
         State Symbol Next State
           XT   0   YV  
           XT   1   XT  
           YV   0   R  
           YV   1   XW  
           XW   0   YV  
           XW   1   XT  
           R    0   R   
           R    1   R   
         
         Where XT is the start state and the
         accept states are XT and YV
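As a sanity check on the simplification, the sketch below runs the six-state table and the four-state table on every string up to a chosen length and compares both against the direct definition of L₁ - L₂. The nested-dictionary layout, the function names, and the length bound are assumptions for illustration; agreement on short strings is supporting evidence rather than a proof of equivalence.

```python
from itertools import product

# Six-state machine from the full table above.
FULL = {
    "XT": {"0": "YV", "1": "XT"},
    "YV": {"0": "ZV", "1": "XW"},
    "XW": {"0": "YV", "1": "XT"},
    "ZV": {"0": "ZV", "1": "ZW"},
    "ZW": {"0": "ZV", "1": "ZT"},
    "ZT": {"0": "ZV", "1": "ZT"},
}
# Four-state machine with ZV, ZW, ZT collapsed into R.
SIMPLE = {
    "XT": {"0": "YV", "1": "XT"},
    "YV": {"0": "R", "1": "XW"},
    "XW": {"0": "YV", "1": "XT"},
    "R": {"0": "R", "1": "R"},
}
ACCEPT = {"XT", "YV"}


def run(table, string, start="XT"):
    """Return True if the machine ends in an accept state on this string."""
    state = start
    for char in string:
        state = table[state][char]
    return state in ACCEPT


def in_difference(string):
    """Direct definition: no 00 substring, and does not end with 01."""
    return "00" not in string and not string.endswith("01")


def machines_agree(max_len=10):
    """Compare both machines and the definition on all short strings."""
    for length in range(max_len + 1):
        for chars in product("01", repeat=length):
            s = "".join(chars)
            if not (run(FULL, s) == run(SIMPLE, s) == in_difference(s)):
                return False
    return True


print(machines_agree())
```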
         
    
Proof of theorem 3.4:

First, we create an inductive proof (left as an exercise for the reader) of the following:

For any string x over the alphabet, and any pair of states (p,q) from Q, §^*((p,q),x) = (§₁^*(p,x), §₂^*(q,x))

A string, x, is accepted by M iff §^*((q₁,q₂),x) is in A
By the formula above, this is true iff (§₁^*(q₁,x), §₂^*(q₂,x)) is in A
For the three cases of defining set A,
If set A is defined as in case 1 (union) then this is the same as saying either §₁^*(q₁,x) is in A₁ or §₂^*(q₂,x) is in A₂, i.e. x is in L₁ ∪ L₂
If set A is defined as in case 2 (intersection) then this is the same as saying that both §₁^*(q₁,x) is in A₁ and §₂^*(q₂,x) is in A₂, i.e. x is in L₁ ∩ L₂
If set A is defined as in case 3 (difference) then this is the same as saying that both §₁^*(q₁,x) is in A₁ and §₂^*(q₂,x) is not in A₂, i.e. x is in L₁ - L₂


Practice: (based on exercises from text)

based on 3.23
For each of the following regular expressions, draw a finite automaton that recognizes the corresponding language:

(1+110)^*0
(111+100)^*0
1(01+10)^*+0(11+10)^*

based on 3.30
Let z be a fixed string of length n over the alphabet { 0, 1 }.
What is the smallest number of states an FA can have if it accepts the language { 0, 1 }^*z? Prove your answer.

based on 3.41
Consider the finite automata described by the following tables
   
   FA to recognize L₁
    
    State Symbol Next State
      T   0   V  
      T   1   T  
      V   0   V  
      V   1   W  
      W   0   V  
      W   1   T  
    
    T is the start state and
    W is the accept state.
   
FA to recognize L₂
    
    State Symbol Next State
      R   0   S  
      R   1   S  
      S   0   R  
      S   1   R  
    
    R is the start state and
    S is the accept state.
   
   Draw the finite automaton corresponding to L₂ - L₁
