A string is a finite sequence of characters. The empty string is denoted by ε {\displaystyle \varepsilon } . The concatenation of two string s {\displaystyle s} and t {\displaystyle t} is denoted by s ⋅ t {\displaystyle s\cdot t} , or shorter by s t {\displaystyle st} . Concatenating with the empty string makes no difference: s ⋅ ε = s = ε ⋅ s {\displaystyle s\cdot \varepsilon =s=\varepsilon \cdot s} . Concatenation of strings is associative: s ⋅ ( t ⋅ u ) = ( s ⋅ t ) ⋅ u {\displaystyle s\cdot (t\cdot u)=(s\cdot t)\cdot u} .
For example, ( ⟨ b ⟩ ⋅ ⟨ l ⟩ ) ⋅ ( ε ⋅ ⟨ a h ⟩ ) = ⟨ b l ⟩ ⋅ ⟨ a h ⟩ = ⟨ b l a h ⟩ {\displaystyle (\langle b\rangle \cdot \langle l\rangle )\cdot (\varepsilon \cdot \langle ah\rangle )=\langle bl\rangle \cdot \langle ah\rangle =\langle blah\rangle } .
A language is a finite or infinite set of strings. Besides the usual set operations like union, intersection etc., concatenation can be applied to languages: if both S {\displaystyle S} and T {\displaystyle T} are languages, their concatenation S ⋅ T {\displaystyle S\cdot T} is defined as the set of concatenations of any string from S {\displaystyle S} and any string from T {\displaystyle T} , formally S ⋅ T = { s ⋅ t ∣ s ∈ S ∧ t ∈ T } {\displaystyle S\cdot T=\{s\cdot t\mid s\in S\land t\in T\}} . Again, the concatenation dot ⋅ {\displaystyle \cdot } is often omitted for brevity.
The language { ε } {\displaystyle \{\varepsilon \}} consisting of just the empty string is to be distinguished from the empty language { } {\displaystyle \{\}} . Concatenating any language with the former doesn't make any change: S ⋅ { ε } = S = { ε } ⋅ S {\displaystyle S\cdot \{\varepsilon \}=S=\{\varepsilon \}\cdot S} , while concatenating with the latter always yields the empty language: S ⋅ { } = { } = { } ⋅ S {\displaystyle S\cdot \{\}=\{\}=\{\}\cdot S} . Concatenation of languages is associative: S ⋅ ( T ⋅ U ) = ( S ⋅ T ) ⋅ U {\displaystyle S\cdot (T\cdot U)=(S\cdot T)\cdot U} .
For example, abbreviating D = { ⟨ 0 ⟩ , ⟨ 1 ⟩ , ⟨ 2 ⟩ , ⟨ 3 ⟩ , ⟨ 4 ⟩ , ⟨ 5 ⟩ , ⟨ 6 ⟩ , ⟨ 7 ⟩ , ⟨ 8 ⟩ , ⟨ 9 ⟩ } {\displaystyle D=\{\langle 0\rangle ,\langle 1\rangle ,\langle 2\rangle ,\langle 3\rangle ,\langle 4\rangle ,\langle 5\rangle ,\langle 6\rangle ,\langle 7\rangle ,\langle 8\rangle ,\langle 9\rangle \}} , the set of all three-digit decimal numbers is obtained as D ⋅ D ⋅ D {\displaystyle D\cdot D\cdot D} . The set of all decimal numbers of arbitrary length is an example for an infinite language.
The alphabet of a string is the set of all of the characters that occur in a particular string. If s is a string, its alphabet is denoted by
The alphabet of a language S {\displaystyle S} is the set of all characters that occur in any string of S {\displaystyle S} , formally: Alph ( S ) = ⋃ s ∈ S Alph ( s ) {\displaystyle \operatorname {Alph} (S)=\bigcup _{s\in S}\operatorname {Alph} (s)} .
For example, the set { ⟨ a ⟩ , ⟨ c ⟩ , ⟨ o ⟩ } {\displaystyle \{\langle a\rangle ,\langle c\rangle ,\langle o\rangle \}} is the alphabet of the string ⟨ c a c a o ⟩ {\displaystyle \langle cacao\rangle } , and the above D {\displaystyle D} is the alphabet of the above language D ⋅ D ⋅ D {\displaystyle D\cdot D\cdot D} as well as of the language of all decimal numbers.
Let L be a language, and let Σ be its alphabet. A string substitution or simply a substitution is a mapping f that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a character a ∈ Σ, one has f(a)=La where La ⊆ Δ* is some language whose alphabet is Δ. This mapping may be extended to strings as
for the empty string ε, and
for string s ∈ L and character a ∈ Σ. String substitutions may be extended to entire languages as 1
Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.2 Similarly, context-free languages are closed under string substitution.34
A simple example is the conversion fuc(.) to uppercase, which may be defined e.g. as follows:
For the extension of fuc to strings, we have e.g.
For the extension of fuc to languages, we have e.g.
A string homomorphism (often referred to simply as a homomorphism in formal language theory) is a string substitution such that each character is replaced by a single string. That is, f ( a ) = s {\displaystyle f(a)=s} , where s {\displaystyle s} is a string, for each character a {\displaystyle a} .56
String homomorphisms are monoid morphisms on the free monoid, preserving the empty string and the binary operation of string concatenation. Given a language L {\displaystyle L} , the set f ( L ) {\displaystyle f(L)} is called the homomorphic image of L {\displaystyle L} . The inverse homomorphic image of a string s {\displaystyle s} is defined as
f − 1 ( s ) = { w ∣ f ( w ) = s } {\displaystyle f^{-1}(s)=\{w\mid f(w)=s\}}
while the inverse homomorphic image of a language L {\displaystyle L} is defined as
f − 1 ( L ) = { s ∣ f ( s ) ∈ L } {\displaystyle f^{-1}(L)=\{s\mid f(s)\in L\}}
In general, f ( f − 1 ( L ) ) ≠ L {\displaystyle f(f^{-1}(L))\neq L} , while one does have
f ( f − 1 ( L ) ) ⊆ L {\displaystyle f(f^{-1}(L))\subseteq L}
and
L ⊆ f − 1 ( f ( L ) ) {\displaystyle L\subseteq f^{-1}(f(L))}
for any language L {\displaystyle L} .
The class of regular languages is closed under homomorphisms and inverse homomorphisms.7 Similarly, the context-free languages are closed under homomorphisms8 and inverse homomorphisms.9
A string homomorphism is said to be ε-free (or e-free) if f ( a ) ≠ ε {\displaystyle f(a)\neq \varepsilon } for all a in the alphabet Σ {\displaystyle \Sigma } . Simple single-letter substitution ciphers are examples of (ε-free) string homomorphisms.
An example string homomorphism guc can also be obtained by defining similar to the above substitution: guc(‹a›) = ‹A›, ..., guc(‹0›) = ε, but letting guc be undefined on punctuation chars. Examples for inverse homomorphic images are
For the latter language, guc(guc−1({ ‹A›, ‹bb› })) = guc({ ‹a› }) = { ‹A› } ≠ { ‹A›, ‹bb› }. The homomorphism guc is not ε-free, since it maps e.g. ‹0› to ε.
A very simple string homomorphism example that maps each character to just a character is the conversion of an EBCDIC-encoded string to ASCII.
If s is a string, and Σ {\displaystyle \Sigma } is an alphabet, the string projection of s is the string that results by removing all characters that are not in Σ {\displaystyle \Sigma } . It is written as π Σ ( s ) {\displaystyle \pi _{\Sigma }(s)\,} . It is formally defined by removal of characters from the right hand side:
Here ε {\displaystyle \varepsilon } denotes the empty string. The projection of a string is essentially the same as a projection in relational algebra.
String projection may be promoted to the projection of a language. Given a formal language L, its projection is given by
The right quotient of a character a from a string s is the truncation of the character a in the string s, from the right hand side. It is denoted as s / a {\displaystyle s/a} . If the string does not have a on the right hand side, the result is the empty string. Thus:
The quotient of the empty string may be taken:
Similarly, given a subset S ⊂ M {\displaystyle S\subset M} of a monoid M {\displaystyle M} , one may define the quotient subset as
Left quotients may be defined similarly, with operations taking place on the left of a string.
Hopcroft and Ullman (1979) define the quotient L1/L2 of the languages L1 and L2 over the same alphabet as L1/L2 = { s | ∃t∈L2. st∈L1 }.10 This is not a generalization of the above definition, since, for a string s and distinct characters a, b, Hopcroft's and Ullman's definition implies yielding {}, rather than { ε }.
The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton language L1 and an arbitrary language L2 is known as Brzozowski derivative; if L2 is represented by a regular expression, so can be the left quotient.11
The right quotient of a subset S ⊂ M {\displaystyle S\subset M} of a monoid M {\displaystyle M} defines an equivalence relation, called the right syntactic relation of S. It is given by
The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if
is finite. In the case that M is the monoid of words over some alphabet, S is then a regular language, that is, a language that can be recognized by a finite-state automaton. This is discussed in greater detail in the article on syntactic monoids.
The right cancellation of a character a from a string s is the removal of the first occurrence of the character a in the string s, starting from the right hand side. It is denoted as s ÷ a {\displaystyle s\div a} and is recursively defined as
The empty string is always cancellable:
Clearly, right cancellation and projection commute:
The prefixes of a string is the set of all prefixes to a string, with respect to a given language:
where s ∈ L {\displaystyle s\in L} .
The prefix closure of a language is
Example: L = { a b c } then Pref ( L ) = { ε , a , a b , a b c } {\displaystyle L=\left\{abc\right\}{\mbox{ then }}\operatorname {Pref} (L)=\left\{\varepsilon ,a,ab,abc\right\}}
A language is called prefix closed if Pref ( L ) = L {\displaystyle \operatorname {Pref} (L)=L} .
The prefix closure operator is idempotent:
The prefix relation is a binary relation ⊑ {\displaystyle \sqsubseteq } such that s ⊑ t {\displaystyle s\sqsubseteq t} if and only if s ∈ Pref L ( t ) {\displaystyle s\in \operatorname {Pref} _{L}(t)} . This relation is a particular example of a prefix order.
Hopcroft, Ullman (1979), Sect.3.2, p.60 ↩
Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.4, p.60 ↩
Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.2, p.131 ↩
Although every regular language is also context-free, the previous theorem is not implied by the current one, since the former yields a shaper result for regular languages. ↩
Strictly formally, a homomorphism yields a language consisting of just one string, i.e. f ( a ) = { s } {\displaystyle f(a)=\{s\}} . ↩
Hopcroft, Ullman (1979), Sect.3.2, p.60-61 ↩
Hopcroft, Ullman (1979), Sect.3.2, Theorem 3.5, p.61 ↩
This follows from the above-mentioned closure under arbitrary substitutions. ↩
Hopcroft, Ullman (1979), Sect.6.2, Theorem 6.3, p.132 ↩
Hopcroft, Ullman (1979), Sect.3.2, p.62 ↩
Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". J ACM. 11 (4): 481–494. doi:10.1145/321239.321249. S2CID 14126942. /wiki/Janusz_Brzozowski_(computer_scientist) ↩