Parser combinator

<h2 id="basic-idea">Basic idea</h2>
In any <a href="/facts/Programming_language/cWTYbgWa">programming language</a> that has <a href="/facts/First-class_function/dtMzp81k">first-class functions</a>, parser combinators can be used to combine basic parsers to construct parsers for more complex rules. For example, a <a href="/facts/Formal_grammar/QLxaQNAz">production rule</a> of a <a href="/facts/Context-free_grammar/yu1FFdK4">context-free grammar</a> (CFG) may have one or more alternatives and each alternative may consist of a sequence of non-terminal(s) and/or terminal(s), or the alternative may consist of a single non-terminal or terminal or the empty string. If a simple parser is available for each of these alternatives, a parser combinator can be used to combine each of these parsers, returning a new parser which can recognise any or all of the alternatives.
In languages that support <a href="/facts/Operator_overloading/hRyHs4Qf">operator overloading</a>, a parser combinator can take the form of an <a href="/facts/Infix_operator/LY0dMhay">infix operator</a>, used to glue different parsers to form a complete rule. Parser combinators thereby enable parsers to be defined in an embedded style, in code which is similar in structure to the rules of the formal grammar. As such, implementations can be thought of as executable specifications with all the associated advantages such as readability.
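
As an illustration of the infix style, the following Python sketch overloads `|` for alternatives and `+` for sequencing. The class name `P`, the helper `lit`, and the rule `ab_or_c` are hypothetical names chosen for this example, and the representation of a parser as a function from an input and a start index to a set of finishing indices is an assumption of the sketch.

```python
class P:
    """Wraps a recognizer function so that | and + can glue parsers together."""

    def __init__(self, run):
        # run(text, j) returns a frozenset of indices where recognition can finish
        self.run = run

    def __or__(self, other):
        # Alternative: try both parsers at the same index and merge the results.
        return P(lambda text, j: self.run(text, j) | other.run(text, j))

    def __add__(self, other):
        # Sequence: run the second parser from every index where the first finished.
        return P(lambda text, j: frozenset(
            end for mid in self.run(text, j) for end in other.run(text, mid)))


def lit(ch):
    """Parser for a single literal character (hypothetical helper)."""
    return P(lambda text, j: frozenset({j + 1})
             if j < len(text) and text[j] == ch else frozenset())


# Grammar rule: ab_or_c ::= 'a' 'b' | 'c'
# Python's precedence makes + bind tighter than |, mirroring the grammar.
ab_or_c = lit("a") + lit("b") | lit("c")

print(ab_or_c.run("ab", 0))  # frozenset({2})
```

The definition of `ab_or_c` reads almost like the production rule it implements, which is the point of the embedded style.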

<h2 id="the-combinators">The combinators</h2>
To keep the discussion relatively straightforward, we discuss parser combinators in terms of recognizers only. If the input string is of length #input and its members are accessed through an index j, a recognizer is a <a href="/facts/Parser/nRKB01S3">parser</a> which returns, as output, the set of indices at which it successfully finished recognizing a sequence of tokens beginning at index j. An empty result set indicates that the recognizer failed to recognize any sequence beginning at index j.

<ul><li>The empty recognizer recognizes the empty string. This parser always succeeds, returning a singleton set containing the input index:</li></ul>

\[\mathit{empty}(j) = \{j\}\]

<ul><li>A recognizer term x recognizes the terminal x. If the token at index j in the input string is x, this parser returns a singleton set containing j + 1; otherwise, it returns the empty set.</li></ul>

\[\mathit{term}(x,j) = \begin{cases} \{\}, & j \geq \#\mathit{input} \\ \{j+1\}, & j^{\text{th}} \text{ element of } \mathit{input} = x \\ \{\}, & \text{otherwise} \end{cases}\]
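
The two basic recognizers can be sketched in Python as follows. This is a minimal sketch, assuming zero-based indices and an input passed explicitly as the argument `tokens` (the definitions above leave the input implicit); Python sets stand in for the result sets.

```python
def empty(tokens, j):
    """Recognize the empty string: always succeeds without consuming input."""
    return {j}


def term(x):
    """Build a recognizer for the terminal x."""
    def recognize(tokens, j):
        if j >= len(tokens):      # past the end of the input
            return set()
        if tokens[j] == x:        # the j-th element matches the terminal
            return {j + 1}
        return set()              # otherwise: no match
    return recognize


print(empty("abc", 1))       # {1}
print(term("b")("abc", 1))   # {2}
print(term("b")("abc", 0))   # set()
```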

Given two recognizers p and q, we can define two major parser combinators, one for matching alternative rules and one for sequencing rules:

<ul><li>The ‘alternative’ parser combinator, ⊕, applies each of the recognizers on the same index j and returns the union of the finishing indices of the recognizers:</li></ul>

\[(p \oplus q)(j) = p(j) \cup q(j)\]

<ul><li>The ‘sequence’ combinator, ⊛, applies the first recognizer p to the input index j and, for each finishing index, applies the second recognizer q with that index as its starting index. It returns the union of the finishing indices returned from all invocations of q:</li></ul>

\[(p \circledast q)(j) = \bigcup \{\, q(k) : k \in p(j) \,\}\]
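
Both combinators translate directly into higher-order functions. The Python sketch below assumes zero-based indices and recognizers of the form `p(tokens, j)`; the names `alt`, `seq`, and `opt_a_then_b` are illustrative, and `term` and `empty` are repeated so the block runs on its own.

```python
def empty(tokens, j):
    return {j}


def term(x):
    def recognize(tokens, j):
        return {j + 1} if j < len(tokens) and tokens[j] == x else set()
    return recognize


def alt(p, q):
    """Alternative: apply p and q at the same index and take the union."""
    return lambda tokens, j: p(tokens, j) | q(tokens, j)


def seq(p, q):
    """Sequence: apply q at every finishing index of p and take the union."""
    def recognize(tokens, j):
        result = set()
        for k in p(tokens, j):
            result |= q(tokens, k)
        return result
    return recognize


# Rule: opt_a_then_b ::= ('a' | empty) 'b'
opt_a_then_b = seq(alt(term("a"), empty), term("b"))

print(opt_a_then_b("ab", 0))  # {2}
print(opt_a_then_b("b", 0))   # {1}
```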

There may be multiple distinct ways to parse a string while finishing at the same index, indicating an <a href="/facts/Ambiguous_grammar/elcVDDL4">ambiguous grammar</a>. Simple recognizers do not acknowledge these ambiguities; each possible finishing index is listed only once in the result set. For a more complete set of results, a more complicated object such as a <a href="/facts/Parse_tree/FeBEtXcz">parse tree</a> must be returned.
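
One way to make the ambiguity visible, sketched here in Python, is to have every parser return pairs of a finishing index and a tree instead of bare indices. The nested-tuple tree representation and the rule `s` (for the grammar s ::= 'x' s s | ε) are assumptions made for this illustration.

```python
def empty(tokens, j):
    return {(j, ())}                       # () is the tree for the empty string


def term(x):
    def parse(tokens, j):
        if j < len(tokens) and tokens[j] == x:
            return {(j + 1, x)}            # the terminal itself is the leaf
        return set()
    return parse


def alt(p, q):
    return lambda tokens, j: p(tokens, j) | q(tokens, j)


def seq(p, q):
    def parse(tokens, j):
        # Pair up the subtrees of every way of running p and then q.
        return {(m, (t1, t2))
                for (k, t1) in p(tokens, j)
                for (m, t2) in q(tokens, k)}
    return parse


def s(tokens, j):
    # s ::= 'x' s s | empty
    return alt(seq(term("x"), seq(s, s)), empty)(tokens, j)


results = s("xx", 0)
trees_ending_at_2 = [tree for (end, tree) in results if end == 2]
print(len(trees_ending_at_2))  # 2
```

A plain recognizer would report the finishing index 2 once; here the two distinct trees that reach it are kept apart.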

<h2 id="examples">Examples</h2>
Consider a highly ambiguous <a href="/facts/Context-free_grammar/yu1FFdK4">context-free grammar</a>, s ::= ‘x’ s s | ε. Using the combinators defined earlier, we can modularly define an executable notation for this grammar in a modern <a href="/facts/Functional_programming/cdxqPEIM">functional programming</a> language (e.g., <a href="/facts/Haskell/ydEsHuGy">Haskell</a>) as s = term ‘x’ &lt;*&gt; s &lt;*&gt; s &lt;+&gt; empty, where &lt;*&gt; and &lt;+&gt; stand for the sequence combinator ⊛ and the alternative combinator ⊕ respectively. When the recognizer s is applied at index 2 of the input sequence x x x x x, it returns the result set {2,3,4,5}, indicating that there are matches starting at index 2 and finishing at every index from 2 to 5 inclusive.
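
The result set for this grammar can be reproduced with a short Python sketch. It assumes zero-based indices and recognizers of the form `p(tokens, j)`; the rule `s` is written as an ordinary function so that its recursive references are resolved only when it is called.

```python
def empty(tokens, j):
    return {j}


def term(x):
    def recognize(tokens, j):
        return {j + 1} if j < len(tokens) and tokens[j] == x else set()
    return recognize


def alt(p, q):
    return lambda tokens, j: p(tokens, j) | q(tokens, j)


def seq(p, q):
    def recognize(tokens, j):
        result = set()
        for k in p(tokens, j):
            result |= q(tokens, k)
        return result
    return recognize


def s(tokens, j):
    # s ::= 'x' s s | empty
    return alt(seq(term("x"), seq(s, s)), empty)(tokens, j)


print(sorted(s("xxxxx", 2)))  # [2, 3, 4, 5]
```

The recursion terminates because every recursive use of `s` sits behind `term("x")`, which must consume a token first.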

<h2 id="shortcomings-and-solutions">Shortcomings and solutions</h2>
Parser combinators, like all <a href="/facts/Recursive_descent_parser/ARhcuAv1">recursive descent parsers</a>, are not limited to <a href="/facts/Context-free_grammar/yu1FFdK4">context-free grammars</a> and thus do no global search for ambiguities in the <a href="/facts/LL_parser/ot65DwYm">LL(k) parsing</a> First<sub>k</sub> and Follow<sub>k</sub> sets. Ambiguities are therefore not known until run time, if and when the input triggers them. In such cases, the recursive descent parser may default (perhaps unknown to the grammar designer) to one of the possible ambiguous paths, resulting in semantic confusion (aliasing) in the use of the language. This leads to bugs for users of ambiguous programming languages, which are not reported at compile time and which are introduced not by human error but by the ambiguous grammar. The only solution that eliminates these bugs is to remove the ambiguities and use a context-free grammar.
Simple implementations of parser combinators have some shortcomings, which are common in top-down parsing. Naïve combinatory parsing requires <a href="/facts/Exponential_time/77T62gmf">exponential</a> time and space when parsing an ambiguous context-free grammar. In 1996, Frost and Szydlowski demonstrated how <a href="/facts/Memoization/u08VhUKm">memoization</a> can be used with parser combinators to reduce the time complexity to polynomial.<a class="footnote-ref" id="fnref:6" href="#fn:6">6</a> Later, Frost used <a href="/facts/Monad_(functional_programming)/kYazRoF1">monads</a> to construct the combinators for systematic and correct threading of the memo table throughout the computation.<a class="footnote-ref" id="fnref:7" href="#fn:7">7</a>
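
The effect of memoization can be observed with a small Python experiment on the grammar s ::= 'x' s s | ε. This is a sketch using a plain dictionary as the memo table, not the formulation published by Frost and Szydlowski; the helpers `memo`, `counted`, and `build_s` are hypothetical names introduced for the illustration.

```python
def empty(tokens, j):
    return {j}


def term(x):
    def recognize(tokens, j):
        return {j + 1} if j < len(tokens) and tokens[j] == x else set()
    return recognize


def alt(p, q):
    return lambda tokens, j: p(tokens, j) | q(tokens, j)


def seq(p, q):
    def recognize(tokens, j):
        result = set()
        for k in p(tokens, j):
            result |= q(tokens, k)
        return result
    return recognize


def memo(p):
    """Cache the result of p for each (input, start index) pair."""
    table = {}
    def recognize(tokens, j):
        key = (tokens, j)          # assumes tokens is hashable, e.g. a str
        if key not in table:
            table[key] = p(tokens, j)
        return table[key]
    return recognize


def counted(p, counter):
    """Count how often the body of p is actually evaluated."""
    def recognize(tokens, j):
        counter[0] += 1
        return p(tokens, j)
    return recognize


def build_s(memoized):
    counter = [0]
    def rule(tokens, j):
        # s ::= 'x' s s | empty; `s` below is looked up when rule runs
        return alt(seq(term("x"), seq(s, s)), empty)(tokens, j)
    s = counted(rule, counter)
    if memoized:
        s = memo(s)
    return s, counter


tokens = "x" * 10
naive_s, naive_calls = build_s(memoized=False)
memo_s, memo_calls = build_s(memoized=True)

assert naive_s(tokens, 0) == memo_s(tokens, 0)   # same result set either way
print(naive_calls[0], memo_calls[0])             # naive count is far larger
```

With the memo table in place, the rule body is evaluated once per start index, whereas the naive version re-evaluates the same positions repeatedly.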
Like any top-down <a href="/facts/Recursive_descent_parser/ARhcuAv1">recursive descent parser</a>, the conventional parser combinators (like the combinators described above) will not terminate while processing a <a href="/facts/Left_recursion/QNtQvgPr">left-recursive grammar</a> (e.g. s ::= s &lt;*&gt; term ‘x’ | empty). A recognition <a href="/facts/Algorithm/fnl5NmRt">algorithm</a> that accommodates ambiguous grammars with direct left-recursive rules was described by Frost and Hafiz in 2006.<a class="footnote-ref" id="fnref:8" href="#fn:8">8</a> The algorithm curtails the otherwise ever-growing left-recursive parse by imposing depth restrictions. In 2007, Frost, Hafiz and Callaghan extended that algorithm to a complete parsing algorithm that accommodates indirect as well as direct left recursion in <a href="/facts/Polynomial_time/77T62gmf">polynomial time</a> and generates compact polynomial-size representations of the potentially exponential number of parse trees for highly ambiguous grammars.<a class="footnote-ref" id="fnref:9" href="#fn:9">9</a> This extended algorithm accommodates indirect left recursion by comparing its ‘computed context’ with ‘current context’. The same authors also described their implementation of a set of parser combinators written in the Haskell language based on the same algorithm.<a class="footnote-ref" id="fnref:10" href="#fn:10">10</a><a class="footnote-ref" id="fnref:11" href="#fn:11">11</a>
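
The idea of curtailing a left-recursive rule by a depth restriction can be illustrated with the simplified Python sketch below. It is not the published algorithm: it omits memoization and the context comparison entirely, and the bound used here (at most one more nested activation at index j than there are remaining tokens) is an assumption of this sketch. The wrapper name `curtail` is hypothetical.

```python
def empty(tokens, j):
    return {j}


def term(x):
    def recognize(tokens, j):
        return {j + 1} if j < len(tokens) and tokens[j] == x else set()
    return recognize


def alt(p, q):
    return lambda tokens, j: p(tokens, j) | q(tokens, j)


def seq(p, q):
    def recognize(tokens, j):
        result = set()
        for k in p(tokens, j):
            result |= q(tokens, k)
        return result
    return recognize


def curtail(rule):
    """Fail a rule once it is nested too deeply at the same start index."""
    depth = {}                              # start index -> active nesting depth
    def recognize(tokens, j):
        limit = len(tokens) - j + 1         # assumed bound: remaining tokens + 1
        if depth.get(j, 0) >= limit:
            return set()                    # deeper nesting cannot consume more
        depth[j] = depth.get(j, 0) + 1
        try:
            return rule(tokens, j)
        finally:
            depth[j] -= 1
    return recognize


def s_rule(tokens, j):
    # Left-recursive rule: s ::= s 'x' | empty
    return alt(seq(s, term("x")), empty)(tokens, j)


s = curtail(s_rule)

print(sorted(s("xxx", 0)))  # [0, 1, 2, 3]
```

Without the `curtail` wrapper, calling `s_rule` directly would recurse at index 0 without ever consuming a token.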

<h2 id="notes">Notes</h2>

<ul><li>Burge, William H. (1975). <a href="https://archive.org/details/recursiveprogram0000burg">Recursive Programming Techniques</a>. The Systems programming series. Addison-Wesley. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0201144505.</li>
<li>Frost, Richard; Launchbury, John (1989). <a href="https://web.archive.org/web/20130606162214/https://courses.cit.cornell.edu/ling4424/frost-launchbury.pdf">"Constructing natural language interpreters in a lazy functional language"</a> (PDF). The Computer Journal. Special edition on Lazy Functional Programming. 32 (2): 108–121. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1093%2Fcomjnl%2F32.2.108">10.1093/comjnl/32.2.108</a>. Archived from the original on 2013-06-06.</li>
<li>Frost, Richard A.; Szydlowski, Barbara (1996). <a href="http://cs.uwindsor.ca/~richard/PUBLICATIONS/SCOMP_96.pdf">"Memoizing Purely Functional Top-Down Backtracking Language Processors"</a> (PDF). Sci. Comput. Program. 27 (3): 263–288. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1016%2F0167-6423%2896%2900014-7">10.1016/0167-6423(96)00014-7</a>.</li>
<li>Frost, Richard A. (2003). "Monadic Memoization towards Correctness-Preserving Reduction of Search". <a href="http://cs.uwindsor.ca/~richard/PUBLICATIONS/AI_03.pdf">Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence (AI'03)</a> (PDF). Springer. pp. 66–80. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-3-540-40300-5.</li>
<li>Frost, Richard A.; Hafiz, Rahmatullah (2006). <a href="http://cs.uwindsor.ca/~hafiz/pub/p46-frost.pdf">"A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time"</a> (PDF). ACM SIGPLAN Notices. 41 (5): 46–54. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1145%2F1149982.1149988">10.1145/1149982.1149988</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:8006549">8006549</a>.</li>
<li>Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2007). "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars". Proceedings of the 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE: 109–120. <a href="/facts/CiteSeerX_(identifier)/SceDmd3c">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.8915">10.1.1.97.8915</a>.</li>
<li>Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2008). "Parser Combinators for Ambiguous Left-Recursive Grammars". Practical Aspects of Declarative Languages. ACM-SIGPLAN. Vol. 4902. pp. 167–181. <a href="/facts/CiteSeerX_(identifier)/SceDmd3c">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.2132">10.1.1.89.2132</a>. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1007%2F978-3-540-77442-6_12">10.1007/978-3-540-77442-6_12</a>. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-3-540-77441-9.</li>
<li>Hutton, Graham (1992). "Higher-order functions for parsing". Journal of Functional Programming. 2 (3): 323–343. <a href="/facts/CiteSeerX_(identifier)/SceDmd3c">CiteSeerX</a> <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.1287">10.1.1.34.1287</a>. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1017%2Fs0956796800000411">10.1017/s0956796800000411</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:31067887">31067887</a>.</li>
<li>Okasaki, Chris (1998). <a href="https://doi.org/10.1017%2FS0956796898003001">"Even higher-order functions for parsing or Why would anyone ever want to use a sixth-order function?"</a>. Journal of Functional Programming. 8 (2): 195–199. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1017%2FS0956796898003001">10.1017/S0956796898003001</a>. <a href="/facts/S2CID_(identifier)/ldJsHa2Y">S2CID</a> <a href="https://api.semanticscholar.org/CorpusID:59694674">59694674</a>.</li>
<li>Swierstra, S. Doaitse (2001). <a href="https://doi.org/10.1016%2FS1571-0661%2805%2980545-6">"Combinator parsers: From toys to tools"</a>. Electronic Notes in Theoretical Computer Science. 41: 38–59. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1016%2FS1571-0661%2805%2980545-6">10.1016/S1571-0661(05)80545-6</a>.</li>
<li><a href="/facts/Philip_Wadler/ShX91nId">Wadler, Philip</a> (1985). "How to replace failure by a list of successes a method for exception handling, backtracking, and pattern matching in lazy functional languages". Functional Programming Languages and Computer Architecture. Lecture Notes in Computer Science. Vol. 201. pp. 113–128. <a href="/facts/Doi_(identifier)/muM9Etpq">doi</a>:<a href="https://doi.org/10.1007%2F3-540-15975-4_33">10.1007/3-540-15975-4_33</a>. <a href="/facts/ISBN_(identifier)/15AdSPa9">ISBN</a> 978-0-387-15975-1 – via Proceedings of a Conference on Functional Programming Languages and Computer Architecture.</li></ul>

<h2 id="references">References</h2>

<ol>
<li id="fn:1">Frost & Launchbury 1989. - Frost, Richard; Launchbury, John (1989). "Constructing natural language interpreters in a lazy functional language" (PDF). The Computer Journal. Special edition on Lazy Functional Programming. 32 (2): 108–121. doi:10.1093/comjnl/32.2.108. Archived from the original on 2013-06-06. <a href="https://web.archive.org/web/20130606162214/https://courses.cit.cornell.edu/ling4424/frost-launchbury.pdf" target="_blank">https://web.archive.org/web/20130606162214/https://courses.cit.cornell.edu/ling4424/frost-launchbury.pdf</a> <a href="#fnref:1" class="footnote-back-ref">↩</a></li>
<li id="fn:2">Hutton 1992. - Hutton, Graham (1992). "Higher-order functions for parsing". Journal of Functional Programming. 2 (3): 323–343. CiteSeerX 10.1.1.34.1287. doi:10.1017/s0956796800000411. S2CID 31067887. <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.1287" target="_blank">https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.1287</a> <a href="#fnref:2" class="footnote-back-ref">↩</a></li>
<li id="fn:3">Hutton, Graham; Meijer, Erik. Monadic Parser Combinators (PDF) (Report). University of Nottingham. Retrieved 13 February 2023. <a href="http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf" target="_blank">http://www.cs.nott.ac.uk/~pszgmh/monparsing.pdf</a> <a href="#fnref:3" class="footnote-back-ref">↩</a></li>
<li id="fn:4">Swierstra 2001. - Swierstra, S. Doaitse (2001). "Combinator parsers: From toys to tools". Electronic Notes in Theoretical Computer Science. 41: 38–59. doi:10.1016/S1571-0661(05)80545-6. <a href="https://doi.org/10.1016%2FS1571-0661%2805%2980545-6" target="_blank">https://doi.org/10.1016%2FS1571-0661%2805%2980545-6</a> <a href="#fnref:4" class="footnote-back-ref">↩</a></li>
<li id="fn:5">Frost, Hafiz & Callaghan 2008. - Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2008). "Parser Combinators for Ambiguous Left-Recursive Grammars". Practical Aspects of Declarative Languages. ACM-SIGPLAN. Vol. 4902. pp. 167–181. CiteSeerX 10.1.1.89.2132. doi:10.1007/978-3-540-77442-6_12. ISBN 978-3-540-77441-9. <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.2132" target="_blank">https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.2132</a> <a href="#fnref:5" class="footnote-back-ref">↩</a></li>
<li id="fn:6">Frost & Szydlowski 1996. - Frost, Richard A.; Szydlowski, Barbara (1996). "Memoizing Purely Functional Top-Down Backtracking Language Processors" (PDF). Sci. Comput. Program. 27 (3): 263–288. doi:10.1016/0167-6423(96)00014-7. <a href="http://cs.uwindsor.ca/~richard/PUBLICATIONS/SCOMP_96.pdf" target="_blank">http://cs.uwindsor.ca/~richard/PUBLICATIONS/SCOMP_96.pdf</a> <a href="#fnref:6" class="footnote-back-ref">↩</a></li>
<li id="fn:7">Frost 2003. - Frost, Richard A. (2003). "Monadic Memoization towards Correctness-Preserving Reduction of Search". Proceedings of the 16th Canadian Society for Computational Studies of Intelligence Conference on Advances in Artificial Intelligence (AI'03) (PDF). Springer. pp. 66–80. ISBN 978-3-540-40300-5. <a href="http://cs.uwindsor.ca/~richard/PUBLICATIONS/AI_03.pdf" target="_blank">http://cs.uwindsor.ca/~richard/PUBLICATIONS/AI_03.pdf</a> <a href="#fnref:7" class="footnote-back-ref">↩</a></li>
<li id="fn:8">Frost & Hafiz 2006. - Frost, Richard A.; Hafiz, Rahmatullah (2006). "A New Top-Down Parsing Algorithm to Accommodate Ambiguity and Left Recursion in Polynomial Time" (PDF). ACM SIGPLAN Notices. 41 (5): 46–54. doi:10.1145/1149982.1149988. S2CID 8006549. <a href="http://cs.uwindsor.ca/~hafiz/pub/p46-frost.pdf" target="_blank">http://cs.uwindsor.ca/~hafiz/pub/p46-frost.pdf</a> <a href="#fnref:8" class="footnote-back-ref">↩</a></li>
<li id="fn:9">Frost, Hafiz & Callaghan 2007. - Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2007). "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars". Proceedings of the 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE: 109–120. CiteSeerX 10.1.1.97.8915. <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.8915" target="_blank">https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.97.8915</a> <a href="#fnref:9" class="footnote-back-ref">↩</a></li>
<li id="fn:10">Frost, Hafiz & Callaghan 2008. - Frost, Richard A.; Hafiz, Rahmatullah; Callaghan, Paul (2008). "Parser Combinators for Ambiguous Left-Recursive Grammars". Practical Aspects of Declarative Languages. ACM-SIGPLAN. Vol. 4902. pp. 167–181. CiteSeerX 10.1.1.89.2132. doi:10.1007/978-3-540-77442-6_12. ISBN 978-3-540-77441-9. <a href="https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.2132" target="_blank">https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89.2132</a> <a href="#fnref:10" class="footnote-back-ref">↩</a></li>
<li id="fn:11">cf. X-SAIGA — executable specifications of grammars <a href="http://www.cs.uwindsor.ca/~hafiz/proHome.html" target="_blank">http://www.cs.uwindsor.ca/~hafiz/proHome.html</a> <a href="#fnref:11" class="footnote-back-ref">↩</a></li>
</ol>
