Let S = ( s 1 , s 2 , s 3 , … s T ) {\displaystyle S=(s_{1},s_{2},s_{3},\ldots s_{T})} be a sequence of states s i {\displaystyle s_{i}} belonging to a finite set of possible states. Let us denote S {\displaystyle {\mathbf {S} }} the sequence space, i.e. the set of all possible sequences of states.
Optimal matching algorithms work by defining simple operator algebras that manipulate sequences, i.e. a set of operators a i : S → S {\displaystyle a_{i}:{\mathbf {S} }\rightarrow {\mathbf {S} }} . In the most simple approach, a set composed of only three basic operations to transform sequences is used:
Imagine now that a cost c ( a i ) ∈ R 0 + {\displaystyle c(a_{i})\in {\mathbf {R} }_{0}^{+}} is associated to each operator. Given two sequences S 1 {\displaystyle S_{1}} and S 2 {\displaystyle S_{2}} , the idea is to measure the cost of obtaining S 2 {\displaystyle S_{2}} from S 1 {\displaystyle S_{1}} using operators from the algebra. Let A = a 1 , a 2 , … a n {\displaystyle A={a_{1},a_{2},\ldots a_{n}}} be a sequence of operators such that the application of all the operators of this sequence A {\displaystyle A} to the first sequence S 1 {\displaystyle S_{1}} gives the second sequence S 2 {\displaystyle S_{2}} : S 2 = a 1 ∘ a 2 ∘ … ∘ a n ( S 1 ) {\displaystyle S_{2}=a_{1}\circ a_{2}\circ \ldots \circ a_{n}(S_{1})} where a 1 ∘ a 2 {\displaystyle a_{1}\circ a_{2}} denotes the compound operator. To this set we associate the cost c ( A ) = ∑ i = 1 n c ( a i ) {\displaystyle c(A)=\sum _{i=1}^{n}c(a_{i})} , that represents the total cost of the transformation. One should consider at this point that there might exist different such sequences A {\displaystyle A} that transform S 1 {\displaystyle S_{1}} into S 2 {\displaystyle S_{2}} ; a reasonable choice is to select the cheapest of such sequences. We thus call distance d ( S 1 , S 2 ) = min A { c ( A ) s u c h t h a t S 2 = A ( S 1 ) } {\displaystyle d(S_{1},S_{2})=\min _{A}\left\{c(A)~{\rm {such~that}}~S_{2}=A(S_{1})\right\}} that is, the cost of the least expensive set of transformations that turn S 1 {\displaystyle S_{1}} into S 2 {\displaystyle S_{2}} . Notice that d ( S 1 , S 2 ) {\displaystyle d(S_{1},S_{2})} is by definition nonnegative since it is the sum of positive costs, and trivially d ( S 1 , S 2 ) = 0 {\displaystyle d(S_{1},S_{2})=0} if and only if S 1 = S 2 {\displaystyle S_{1}=S_{2}} , that is there is no cost. The distance function is symmetric if insertion and deletion costs are equal c ( a I n s ) = c ( a D e l ) {\displaystyle c(a^{\rm {Ins}})=c(a^{\rm {Del}})} ; the term indel cost usually refers to the common cost of insertion and deletion.
Considering a set composed of only the three basic operations described above, this proximity measure satisfies the triangular inequality. Transitivity however, depends on the definition of the set of elementary operations.
Although optimal matching techniques are widely used in sociology and demography, such techniques also have their flaws. As was pointed out by several authors (for example L. L. Wu2), the main problem in the application of optimal matching is to appropriately define the costs c ( a i ) {\displaystyle c(a_{i})} .
A. Abbott and A. Tsay, (2000) Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect Sociological Methods & Research], Vol. 29, 3-33. doi:10.1177/0049124100029001001 http://smr.sagepub.com/cgi/content/abstract/29/1/3 ↩
L. L. Wu. (2000) Some Comments on "Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect" Archived 2006-10-24 at the Wayback Machine Sociological Methods & Research, 29 41-64. doi:10.1177/0049124100029001003 http://smr.sagepub.com/cgi/content/refs/29/1/41 ↩