Recursive descent parsers are the simplest thing. Code below is a parser for a simple expression language with a left-associative '+', a right-associative '^', parens, and integers. This covers all the commonest patterns, after you write a number of such parsers you'll be able to write them in your sleep provided you know the language grammar.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
 

open System

type Expr =
    | Add of Expr * Expr
    | Pow of Expr * Expr
    | Num of int

// every parser takes the string and the current index, returns Expr and next index
let rec ParseExpr(s:string, i:int) =
    // Start with lowest operator precedence and work way up
    ParseLowPrecOp(s,i)
and ParseLowPrecOp(s:string, i:int) =
    // left-associative operators use a while loop to accumulate values of next-highest precedence
    let mutable e, j = ParseHighPrecOp(s,i)
    while j < s.Length && s.[j] = '+' do
        let mutable e2, k = ParseHighPrecOp(s,j+1)
        j <- k
        e <- Add(e,e2)
    e, j
and ParseHighPrecOp(s:string, i:int) =
    // right-associative operators parse one expr of next-highest precedence, then recurse as needed to accumulate
    let leftExpr, j = ParseParenExpr(s,i)
    if j >= s.Length || s.[j] <> '^' then
        leftExpr, j
    else
        let rightExpr, k = ParseHighPrecOp(s,j+1)
        Pow(leftExpr, rightExpr), k
and ParseParenExpr(s:string, i:int) =
    // parentheses bind tightest
    if s.[ i] = '(' then
        let e, j = ParseExpr(s, i+1)
        if s.[j] <> ')' then
            failwith "parse error"
        else
            e, j+1
    else
        // if gone through all ops, literals are only remaining forms
        ParseNum(s, i)    
and ParseNum(s:string, i:int) =
    if not(Char.IsDigit(s.[ i])) then
        failwith "parse error"
    let mutable j = i
    while Char.IsDigit(s.[j]) do
        j <- j + 1
    Num(int(s.Substring(i, j-i))), j

let s = "1+2+3^4+5+6+((7+8)^9^10)"
printfn "%s" s
printfn "%A" (ParseExpr(s,0))        

Parsing is a whole sub-field of CS, one of the oldest and most studied CS things. Some would say it is a "solved problem", but it still a very rich and active area of CS innovation (especially regarding error diagnostics for failed parses, error recovery, etc). I am personally a great fan of the monadic parser combinator approach, a la

[link:lorgonblog.spaces.live.com]

but there are a variety of parsing strategies, including recursive descent parsers (like above), table-driven parsers (like those creates by lex/yacc DSLs), EDSLs like monadic parser combinatorsm etc, which all have various strengths and weaknesses along such axes as performance, diagnostics, and classes of languages they can easily handle.

The parsing fool,
Brian

By on 3/9/2010 10:51 PM ()

Thanks Brian!

I will try working through (and understanding!) that code.

If I understand correctly (I'm a software+electrical engineering major, not a computer science major ;)), "combinatorial" parsers are those that are combined from simple parser functions (like those in chapter 8 of this book: [link:www.cs.nott.ac.uk] There, the parser is combined from a series of small and individually simple functions. That's very appealing to me; the problem with the parser above is that I have a hard time decomposing it.

Also, from a brief excursion on Wikipedia, I believe I'm going after a top-down parser; is that correct?

[link:en.wikipedia.org]

Well, I'm off to try this out. Thanks again!

By on 3/10/2010 6:07 AM ()
IntelliFactory Offices Copyright (c) 2011-2012 IntelliFactory. All rights reserved.
Home | Products | Consulting | Trainings | Blogs | Jobs | Contact Us | Terms of Use | Privacy Policy | Cookie Policy
Built with WebSharper