Range concatenation grammar

<h2 id="description">Description</h2>
<h3>Formal definition</h3>
A Positive Range Concatenation Grammar (PRCG) is a tuple \(G=(N,~T,~V,~S,~P)\), where:

<ul><li>\(N\), \(T\) and \(V\) are disjoint finite sets of (respectively) predicate names, terminal symbols and variable names. Each predicate name has an associated arity given by the function \(\dim :N\rightarrow \mathbb{N}\setminus \{0\}\).</li>
<li>\(S\in N\) is the start predicate name and satisfies \(\dim(S)=1\).</li>
<li>\(P\) is a finite set of clauses of the form \(\psi_{0}\rightarrow \psi_{1}\ldots \psi_{m}\), where the \(\psi_{i}\) are predicates of the form \(A_{i}(\alpha_{1},\ldots ,\alpha_{\dim(A_{i})})\) with \(A_{i}\in N\) and \(\alpha_{i}\in (T\cup V)^{\star}\).</li></ul>
A Negative Range Concatenation Grammar (NRCG) is defined like a PRCG, except that some predicates occurring in the right-hand side of a clause can have the form \(\overline{A_{i}(\alpha_{1},\ldots ,\alpha_{\dim(A_{i})})}\). Such predicates are called negative predicates.
A Range Concatenation Grammar is a positive or a negative one. Although PRCGs are technically NRCGs, the terms are used to highlight the absence (PRCG) or presence (NRCG) of negative predicates.
A range in a word \(w\in T^{\star}\) is a pair \(\langle l,r\rangle_{w}\), with \(0\leq l\leq r\leq n\), where \(n\) is the length of \(w\). Variables bind to ranges, not to arbitrary strings of terminals. Two ranges \(\langle l_{1},r_{1}\rangle_{w}\) and \(\langle l_{2},r_{2}\rangle_{w}\) can be concatenated iff \(r_{1}=l_{2}\), and we then have: \(\langle l_{1},r_{1}\rangle_{w}\cdot \langle l_{2},r_{2}\rangle_{w}=\langle l_{1},r_{2}\rangle_{w}\). When a clause is instantiated and an argument consists of multiple elements from \(T\cup V\), their ranges must concatenate.
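The concatenation rule can be sketched in Python as follows. This is a minimal illustration, not part of the formal definition: the names `Range`, `concat` and `text` are hypothetical, and the sketch assumes that positions count the gaps between letters, so that the range with bounds l and r covers the Python slice `w[l:r]`.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    """A range <l, r> in a fixed word w, with 0 <= l <= r <= len(w)."""
    l: int
    r: int


def concat(first: Range, second: Range) -> Optional[Range]:
    """Return first . second, or None when the ranges are not adjacent."""
    if first.r != second.l:
        return None
    return Range(first.l, second.r)


def text(w: str, rng: Range) -> str:
    """Substring covered by the range (assumption: positions are gaps between letters)."""
    return w[rng.l:rng.r]


w = "abbabb"
left, right = Range(0, 3), Range(3, 6)
joined = concat(left, right)      # adjacent: r1 == l2
not_adjacent = concat(right, left)  # 6 != 0, so no concatenation exists
```

Note that `concat` is partial: two ranges that spell the same letters but sit at different positions in the word are still different ranges, which is why variables are said to bind to ranges rather than to strings.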
For a word \(w=w_{1}w_{2}\ldots w_{n}\), with \(w_{i}\in T\), the dotted notation for ranges is: \(\langle l,r\rangle_{w}=w_{1}\ldots w_{l}\bullet w_{l+1}\ldots w_{r}\bullet w_{r+1}\ldots w_{n}\).

<h3>Recognition of strings</h3>
The strings of predicates being rewritten represent constraints that the string being tested has to satisfy (if positive), or in the case of negative predicates not satisfy. The order of predicates is irrelevant. Rewrite steps amount to replacing one constraint by zero or more simpler constraints.
Like literal movement grammars (LMGs), RCG clauses have the general schema \(A(x_{1},\ldots ,x_{n})\to \alpha\), where in an RCG, \(\alpha\) is either the empty string or a string of predicates. The arguments \(x_{i}\) consist of strings of terminal symbols and/or variable symbols, which pattern match against actual argument values as in LMG. Adjacent variables constitute a family of matches against partitions, so that the argument \(xy\), with two variables, matches the literal string \(ab\) in three different ways: \(x=\epsilon ,\ y=ab;\ x=a,\ y=b;\ x=ab,\ y=\epsilon\). These would give rise to three different instantiations of the clause containing that argument \(xy\).
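The family of matches for adjacent variables can be enumerated mechanically. The sketch below is only an illustration; the helper name `splits` is hypothetical. It lists every way to cut a string into k consecutive, possibly empty pieces, one piece per variable.

```python
def splits(s: str, k: int):
    """Yield every way to cut s into k consecutive (possibly empty) pieces, k >= 1."""
    if k == 1:
        yield (s,)
        return
    for i in range(len(s) + 1):
        # The first variable takes s[:i]; the remaining k - 1 variables share s[i:].
        for rest in splits(s[i:], k - 1):
            yield (s[:i],) + rest


# The argument xy matched against the literal string ab.
matches = list(splits("ab", 2))
# → [("", "ab"), ("a", "b"), ("ab", "")]
```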
Predicate terms come in two forms: positive (which produce the empty string on success) and negative (which produce the empty string on failure, that is, when the positive term does not produce the empty string). Negative terms are written like positive terms, with an overbar, as in \(\overline{A(x_{1},\ldots ,x_{n})}\).
The rewrite semantics for RCGs is rather simple, identical to the corresponding semantics of LMGs. Given a predicate string \(A(\alpha_{1},\ldots ,\alpha_{n})\), where the symbols \(\alpha_{i}\) are terminal strings, if there is a rule \(A(x_{1},\ldots ,x_{n})\to \beta\) in the grammar that the predicate string matches, the predicate string is replaced by \(\beta\), substituting for the matched variables in each \(x_{i}\).
For example, given the rule \(A(x,ayb)\to B(axb,y)\), where \(x\) and \(y\) are variable symbols and \(a\) and \(b\) are terminal symbols, the predicate string \(A(a,abb)\) can be rewritten as \(B(aab,b)\), because \(A(a,abb)\) matches \(A(x,ayb)\) when \(x=a,\ y=b\). Similarly, if there were a rule \(A(x,ayb)\to A(x,x)\ A(y,y)\), \(A(a,abb)\) could be rewritten as \(A(a,a)\ A(b,b)\).
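The rewrite step in this example can be reproduced with a small matcher. The following is a sketch under stated assumptions, not a reference implementation: the function names are hypothetical, every symbol is a single character, and arguments are treated as plain strings, so a variable that reappears only has to spell the same string (a simplification of the range-based semantics).

```python
def match(pattern, value, variables, binding=None):
    """Yield variable bindings under which `pattern` spells out `value`."""
    binding = dict(binding or {})
    if not pattern:
        if not value:
            yield binding
        return
    head, tail = pattern[0], pattern[1:]
    if head not in variables:            # terminal: must match literally
        if value.startswith(head):
            yield from match(tail, value[1:], variables, binding)
    elif head in binding:                # variable already bound: reuse its string
        bound = binding[head]
        if value.startswith(bound):
            yield from match(tail, value[len(bound):], variables, binding)
    else:                                # fresh variable: try every prefix
        for i in range(len(value) + 1):
            extended = dict(binding)
            extended[head] = value[:i]
            yield from match(tail, value[i:], variables, extended)


def match_args(patterns, values, variables):
    """Yield bindings that match all argument patterns against all argument values."""
    def go(i, binding):
        if i == len(patterns):
            yield binding
            return
        for extended in match(patterns[i], values[i], variables, binding):
            yield from go(i + 1, extended)
    yield from go(0, {})


def substitute(pattern, binding):
    """Replace each bound variable in `pattern` by its string."""
    return "".join(binding.get(symbol, symbol) for symbol in pattern)


VARIABLES = {"x", "y"}
# Rule A(x, ayb) -> B(axb, y) applied to the predicate string A(a, abb).
bindings = list(match_args(["x", "ayb"], ["a", "abb"], VARIABLES))
rewritten = [tuple(substitute(p, b) for p in ["axb", "y"]) for b in bindings]
```

Here `bindings` contains the single match x = a, y = b, and `rewritten` holds the arguments of the resulting predicate B(aab, b).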
A proof/recognition of a string \(\alpha\) is done by showing that \(S(\alpha)\) produces the empty string. For the individual rewrite steps, when multiple alternative variable matches are possible, any rewrite which could lead the whole proof to succeed is considered. Thus, if there is at least one way to produce the empty string from the initial string \(S(\alpha)\), the proof is considered a success, regardless of how many other ways to fail exist.

<h2 id="example">Example</h2>
RCGs are capable of recognizing the non-linear index language \(\{www:w\in \{a,b\}^{*}\}\) as follows:
Letting x, y, and z be variable symbols:

\[\begin{aligned}S(xyz)&\to A(x,y,z)\\A(ax,ay,az)&\to A(x,y,z)\\A(bx,by,bz)&\to A(x,y,z)\\A(\epsilon ,\epsilon ,\epsilon )&\to \epsilon \end{aligned}\]

The proof for abbabbabb is then

\[S(abbabbabb)\Rightarrow A(abb,abb,abb)\Rightarrow A(bb,bb,bb)\Rightarrow A(b,b,b)\Rightarrow A(\epsilon ,\epsilon ,\epsilon )\Rightarrow \epsilon\]

Or, using the more correct dotted notation for ranges:

\[\begin{aligned}S(\bullet abbabbabb\bullet )&\Rightarrow A(\bullet abb\bullet abbabb,\ abb\bullet abb\bullet abb,\ abbabb\bullet abb\bullet )\\&\Rightarrow A(a\bullet bb\bullet abbabb,\ abba\bullet bb\bullet abb,\ abbabba\bullet bb\bullet )\\&\Rightarrow A(ab\bullet b\bullet abbabb,\ abbab\bullet b\bullet abb,\ abbabbab\bullet b\bullet )\\&\Rightarrow A(\epsilon ,\epsilon ,\epsilon )\Rightarrow \epsilon \end{aligned}\]

For a string of \(3n\) letters, there are \(\binom{3n+2}{2}=\frac{(3n+2)(3n+1)}{2}\) different instantiations of that first clause, but only the one in which \(x\), \(y\) and \(z\) have \(n\) letters each allows the derivation to reach \(\epsilon\).
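The four clauses of this example can be turned into a naive recognizer. The sketch below is illustrative only: it works on plain strings rather than ranges, the function names mirror the predicate names, and `instantiations` is a hypothetical helper that counts the ways the first clause can be instantiated.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def A(x: str, y: str, z: str) -> bool:
    """True iff A(x, y, z) can be rewritten to the empty string."""
    if x == "" and y == "" and z == "":      # A(eps, eps, eps) -> eps
        return True
    for letter in "ab":                      # A(cx, cy, cz) -> A(x, y, z) for c in {a, b}
        if x.startswith(letter) and y.startswith(letter) and z.startswith(letter):
            return A(x[1:], y[1:], z[1:])
    return False


def S(w: str) -> bool:
    """S(xyz) -> A(x, y, z): succeed if any instantiation of x, y, z succeeds."""
    n = len(w)
    return any(A(w[:i], w[i:j], w[j:])
               for i in range(n + 1) for j in range(i, n + 1))


def instantiations(w: str) -> int:
    """Number of ways to cut w into three consecutive, possibly empty, pieces."""
    n = len(w)
    return sum(1 for i in range(n + 1) for j in range(i, n + 1))


accepted = S("abbabbabb")
rejected = S("abbabbaba")
count = instantiations("abbabbabb")   # (3n+2)(3n+1)/2 with n = 3
```

For the nine-letter word, `count` is 55, and only the instantiation with three pieces of three letters each makes `A` succeed.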

<h2 id="properties">Properties</h2>
Every <a href="/facts/Context-free_grammar/yu1FFdK4">context-free grammar</a> (CFG) can be converted into a range concatenation grammar:

<ul><li>For every nonterminal \(A\) of the CFG, the RCG has an arity \(1\) predicate \(A(x)\).</li>
<li>For every CFG rule \(A\to BC\), the RCG has \(A(xy)\to B(x)C(y)\).</li>
<li>For every CFG rule \(A\to a\) (where \(a\) is a terminal), the RCG has \(A(a)\to \epsilon\).</li></ul>
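The conversion listed above can be sketched as a simple translation of rules into clause strings. This is an illustration under assumptions: the CFG is given with only the two rule shapes mentioned (two nonterminals, or one terminal), the function name `cfg_to_rcg` and the input encoding are hypothetical, and the empty string is written `eps`.

```python
def cfg_to_rcg(rules):
    """Translate CFG rules into RCG clauses written as strings.

    Each rule is (lhs, rhs), where rhs is either a pair of nonterminals
    (for A -> B C) or a single terminal character (for A -> a).
    """
    clauses = []
    for lhs, rhs in rules:
        if isinstance(rhs, tuple):
            b, c = rhs
            # A -> B C becomes A(xy) -> B(x) C(y)
            clauses.append(f"{lhs}(xy) -> {b}(x) {c}(y)")
        else:
            # A -> a becomes A(a) -> eps
            clauses.append(f"{lhs}({rhs}) -> eps")
    return clauses


cfg = [("S", ("A", "B")), ("A", "a"), ("B", "b")]
rcg = cfg_to_rcg(cfg)
```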
The intersection and union of two range concatenation languages are trivially range concatenation languages:

<ul><li>For \(S\) the intersection of \(A\) and \(B\), you have \(S(x)\to A(x)B(x)\).</li>
<li>For \(S\) the union of \(A\) and \(B\), you have \(S(x)\to A(x)\) and \(S(x)\to B(x)\).</li></ul>
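Viewed as recognizers, the two constructions above amount to conjunction and disjunction over the same argument. The sketch below only illustrates that reading: the stand-in recognizers `equal_a_b` and `equal_b_c` are ordinary Python functions chosen for the example, not range concatenation grammars.

```python
from typing import Callable

Recognizer = Callable[[str], bool]


def intersection(a: Recognizer, b: Recognizer) -> Recognizer:
    # S(x) -> A(x) B(x): both predicates must succeed on the same argument.
    return lambda x: a(x) and b(x)


def union(a: Recognizer, b: Recognizer) -> Recognizer:
    # S(x) -> A(x) and S(x) -> B(x): either clause may be chosen.
    return lambda x: a(x) or b(x)


def equal_a_b(x: str) -> bool:
    """Stand-in recognizer: as many a's as b's."""
    return x.count("a") == x.count("b")


def equal_b_c(x: str) -> bool:
    """Stand-in recognizer: as many b's as c's."""
    return x.count("b") == x.count("c")


both = intersection(equal_a_b, equal_b_c)
either = union(equal_a_b, equal_b_c)
```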
The languages of range concatenation grammars that may contain negative predicates are also closed under set complement.
A consequence of the above is that it is <a href="/facts/Undecidable_problem/QBxmXBbY">undecidable</a> whether a (positive) range concatenation language is nonempty, because it is undecidable whether the intersection of two context-free languages is nonempty. Hence range concatenation grammars are not generative.

