表达式求值之优先关系矩阵法

YACC 或者 ANTLR 或者 PLY 之类的东西是很不错啦, 不过当面对一个类似 C(++) 这么变态的需要将语法和符号表勾搭在一起的语言形式时, 还是可以考虑一下使用 Python 纯手工打造.

搞定表达式求值这种现代高级程序设计语言中基础部分, 用 LR(n), LALR 分析之类的当然是可行的, 假定需要支持括号和下面这些运算

	算符
优先级由高到低	+ - (正负号)	^ (幂运算)	* /	+ - (加减号)
结合方式	数值左边	右结合	左结合	左结合

    先搞个产生式集合 (运算符就不搞什么转义了, | 表示或)
Expr := Term +|- Expr
     := Term
Term := Exponential *|/ Term
     := Exponential
Exponential := Factor ^ Exponential
            := Factor
Factor := +|- Factor
       := Number
       := ( Expr )
    接着还要再计算 FIRST 集啊什么的, 等到最后写成代码, 秀吉都嫁人了.

    所幸这个世界上无痛的解决方案还是挺多的, 比如为符号设定优先级, 使用优先级来指导算符是应该移进还是规约. 这个方法可称为优先关系矩阵法.
    这里仍然会用到移进 (Shift) 和规约 (Reduce) 这两个术语, 不过它们的意义跟 LR 分析法里面的有些不同. 在 LR 分析中, 符号指的是词法分析中得到的 Token, 而不是运算符, 也就是说数和运算符, 以及其它任何东西都放在同一个栈中; 而在运用优先关系法的表达式分析过程中, 符号栈包括两个不同的栈, 一个专门用来存运算符 (这个栈称为算子栈), 另一个存放数或者已解析的表达式 (这个栈称为表达式栈). 在这里, 移进表示将运算符压入算子栈, 或将数压入表达式栈; 而规约指的是, 根据运算符的元数, 从表达式栈中弹出指定数目的
    遇到一个数时无条件移进, 而遇到运算符则要视情况决定移进还是规约, 而所谓的情况就是算符优先级. 比如下面这个表达式
3 + 4 * 5 + 6
    首先读入 3, 移进

表达式栈 3 算子栈 (空)

然后是 +, 算子栈栈顶为空, 没得比, 那好, 先移进

表达式栈 3 算子栈 +

继续, 4 移进, 接着是 *, 比一下, 发现 * 优先级较栈顶的 + (加号, 不是正号) 优先级高, 那肯定得移进

表达式栈 3 4 算子栈 + *

接着移进 5, 再往后又是一个 + (加号), 这时栈顶的 * 优先级高于它, 所以可以开始规约流程: 从算子栈弹出 *, 它是二元的, 于是再从表达式栈弹出栈顶 (刚刚压入的 5) 和次栈顶 (4), 将这三者组成表达式 4 * 5, 再压回表达式栈 (下面以一个二叉树的形象描述)

表达式栈 3 * / \ 4 5 算子栈 +

还没完呢, 现在的算子栈顶 + (加号) 与手头上拿着的 + (加号) 平级, 由于加号是左结合的, 所以应当继续规约, 得到

表达式栈 + / \ 3 * / \ 4 5 算子栈 (空)

现在算子栈又空了, 于是 + (最后强调一次, 加号) 压栈, 最后是 6, 移进之, 结束时, 整个算符栈从头到尾规约一次, 大功告成.

回顾一下刚才的过程, 规则是

前面说过的, 数无条件入栈
如果算子栈为空, 移进当前算子
如果算子栈顶的优先级高于当前算子, 或者两者优先级相等, 但当前符号是左结合的, 那么进行规约, 直到不满足规约条件
算子无法规约时则移进

这一些初步地写成代码类似这样

def handleNumber(token, expr_stack, op_stack):
    expr_stack.append(token)

def handleOperator(token, expr_stack, op_stack):
    while 0 != len(op_stack) and priorityJudge(token, op_stack[-1]):
        reduceOp(op_stack.pop(), expr_stack)
    op_stack.append(token)

def buildExpression(tokens):
    want_number = True
    expr_stack = []
    op_stack = []

    for token in tokens:
        if want_number:
            handleNumber(token, expr_stack, op_stack)
            want_number = False
        else:
            handleOperator(token, expr_stack, op_stack)
            want_number = True

    while len(op_stack) > 0:
        reduceOp(op_stack.pop(), expr_stack)
    return expr_stack.pop()

上面代码中, token 就是词法分析得到的结果; priorityJudge 和 reduceOp 两个函数尚未实现. 在完善这两个函数之前, 有一点需要先改进, + 和 - 这两个符号在语法中有歧义, 必须设法区分它们是加减还是正负. 如果 priorityJudge 和 reduceOp 直接拿着 token 来用就有问题, 至少在判别加减正负号这一点是完全不可能的. 所以, 必须先将 token 包装一下

class Token:
    def __init__(self, image, is_number, is_pre_unary_operator):
        self.image = image
        self.is_number = is_number
        self.is_pre_unary_operator = is_pre_unary_operator

POS, NEG, ADD, SUB, MUL, DIV, POW = range(0, 7)

PRE_UNARY_SIGN_OP_MAP = {
    '+': POS,
    '-': NEG,
}

BINARY_SIGN_OP_MAP = {
    '+': ADD,
    '-': SUB,
    '*': MUL,
    '/': DIV,
    '^': POW,
}

class Operator:
    def __init__(self, op_image, op_type):
        self.op_image = op_image
        self.op_type = op_type

class PreUnaryOperator(Operator):
    def __init__(self, op_image):
        Operator.__init__(self, op_image, PRE_UNARY_SIGN_OP_MAP[op_image])

class BinaryOperator(Operator):
    def __init__(self, op_image):
        Operator.__init__(self, op_image, BINARY_SIGN_OP_MAP[op_image])

    Token 是词法分析结果的类型, 假定词法分析还是比较勤奋的, 能加上 is_number 和 is_pre_unary_operator 这两个简单属性, 而不仅仅简单给个字符串结果 (字符串存在 image 中).
POS, NEG, ADD, SUB, MUL, DIV, POW = range(0, 7)
这句实际上弄了一串枚举值, 分别表示正 (positive) 负 (negative) 加 (addition) 减 (subtract) 乘 (multiplication) 除 (division) 幂 (power).
    Operator 类型就是算子 Token 包装之后的类型, 以后算子栈就压这玩意儿进去.

    此外, reduceOp 这个行为也可以封入 Operator

class Operator:
    def reduce(self, expr_stack):
        pass

    def __init__(self, op_image, op_type):
        self.op_image = op_image
        self.op_type = op_type

class PreUnaryOperator(Operator):
    def reduce(self, expr_stack):
        rhs = expr_stack.pop()
        return PreUnaryOperation(self.op_image, rhs)

    def __init__(self, op_image):
        Operator.__init__(self, op_image, PRE_UNARY_SIGN_OP_MAP[op_image])

class BinaryOperator(Operator):
    def reduce(self, expr_stack):
        rhs = expr_stack.pop()
        lhs = expr_stack.pop()
        return BinaryOperation(self.op_image, lhs, rhs)

    def __init__(self, op_image):
        Operator.__init__(self, op_image, BINARY_SIGN_OP_MAP[op_image])

def handleOperator(token, expr_stack, op_stack):
    while 0 != len(op_stack) and priorityJudge(token, op_stack[-1]):
        op_stack.pop().reduce(expr_stack)
    op_stack.append(token)

def buildExpression(tokens):
    want_number = True
    expr_stack = []
    op_stack = []

    for token in tokens:
        if want_number:
            handleNumber(token, expr_stack, op_stack)
            want_number = False
        else:
            handleOperator(token, expr_stack, op_stack)
            want_number = True

    while len(op_stack) > 0:
        op_stack.pop().reduce(expr_stack)
    return expr_stack.pop()

    reduce 方法会从表达式栈中弹出特定数目的表达式并构建新表达式再压回栈中.

    回到之前的问题, 怎么判别加减和正负呢? 看 want_number 吧, 如果这货为 True, 那么当前符号应该是正负号 (如果 image 是 + 或 - 的话), 否则就是二元运算符. 也就是说 want_number 这名字其实也没起好, 它实际上 want 的并不只是数字, 而是...
    表达式的 FIRST 集. 准确的说其实是最开始产生式中的 Factor 的 FIRST 集, 不过无所谓了, 两个实际上相等.
    所以这个变量改成 want_factor 比较好, 而且, 这东西在每处理一个 token 之后赋值的方式也不对, 可以考虑变成下面这个样子

def handleNumber(token, expr_stack, op_stack):
    if token.is_pre_unary_operator:
        op_stack.append(PreUnaryOperator(token.image))
        return True

    expr_stack.append(token.image)
    return False

def handleOperator(token, expr_stack, op_stack):
    operator = BinaryOperator(token.image)
    while 0 != len(op_stack) and priorityJudge(operator, op_stack[-1]):
        op_stack.pop().reduce(expr_stack)
    op_stack.append(operator)
    return True

def buildExpression(tokens):
    want_number = True
    expr_stack = []
    op_stack = []

    for token in tokens:
        if want_number:
            want_number = handleNumber(token, expr_stack, op_stack)
        else:
            want_number = handleOperator(token, expr_stack, op_stack)

    while len(op_stack) > 0:
        op_stack.pop().reduce(expr_stack)
    return expr_stack.pop()

那么, 现在就只剩下 priorityJudge. 之前说到的 "算子栈顶的优先级高于当前算子, 或者两者优先级相等, 但当前符号是左结合" 这个符合条件, 看起来很复杂, 貌似要三个条件结合在一起, 让代码搅成一团的样子.
好吧, 终于要开始切题了. 其实可以用一个优先关系矩阵来表示. 这个矩阵看起来像下面这样子

OP_PRIO_MATRIX = (
  #  Stack top:
  #  ADD    SUB    MUL    DIV     | Encounters:
    (True,  True,  True,  True), #| ADD
    (True,  True,  True,  True), #| SUB
    (False, False, True,  True), #| MUL
    (False, False, True,  True), #| DIV
)

每一行表示当前遇到的算子, 每一列表示当前栈顶的算子. 比如现在遇到加法 (ADD), 而栈顶是乘法 (MUL) 的话, 查表看 ADD 行, MUL 列得到 True, 这个 True 即表示, 好, 现在需要规约. 而如果当前算子是乘法, 栈顶算子为加法, 那么查表得 False, 则不应规约算子, 所以移进吧.
上面这两个例子只说明了之前那个复杂条件中, "算子栈顶的优先级高于当前算子" 这最基本的一个满足与否时应该怎么做. 不过后两个条件实际上也体现在其中, 请仔细看对角线上的真值, 这一条下来可不是随手填作 True 的哦, 它们的意思就是, 如果现在遇到的是某个算符, 比如加法, 而栈顶也同样是加法算子, 那么查表得 True 实施规约. 这正是 "两者优先级相等, 但当前符号是左结合" 这一组条件的描述 (嗯, 上面 ADD 对 SUB, MUL 对 DIV 为 True 也是如此). 相对的, 如果是右结合算符, 比如幂 (POW) 运算, 那么对角线上这一项就应当是 False, 如

OP_PRIO_MATRIX = (
  #  Stack top:
  #  ADD    SUB    MUL    DIV    POW      | Encounters:
    (True,  True,  True,  True,  True ), #| ADD
    (True,  True,  True,  True,  True ), #| SUB
    (False, False, True,  True,  True ), #| MUL
    (False, False, True,  True,  True ), #| DIV
    (False, False, False, False, False), #| POW
)

    现在的问题就剩把表填完 (加入单目算子), 再拿着 op_type 查这个表了, 吗?
    当然不是, 还有括号.
    括号相当于一层防火墙, 阻隔规约进行到括号之外, 并且仅当出现一个反括号时才会被弹出. 要合理地使用括号的阻隔功能, 自然只能把括号也当作算子压入算子栈了. 而且, 括号的优先级必须无比的低, 以至于任何算子碰到括号都会移进 (一个括号碰到另一个括号也移进).

POS, NEG, ADD, SUB, MUL, DIV, POW, PAREN = range(0, 8)

class OpenParen(Operator):
    def __init__(self):
        Operator.__init__(self, '(', PAREN)

OP_PRIO_MATRIX = (
  #  Stack top:
  #  ADD    SUB    MUL    DIV    POW    PAREN    | Encounters:
    (True,  True,  True,  True,  True,  False), #| ADD
    (True,  True,  True,  True,  True,  False), #| SUB
    (False, False, True,  True,  True,  False), #| MUL
    (False, False, True,  True,  True,  False), #| DIV
    (False, False, False, False, False, False), #| POW
    (False, False, False, False, False, False), #| PAREN
)

    这样括号就像 AT 力场一样能抵御一切了, 除非出现反括号才会被从栈中弹出来.

    除此之外, "如果算子栈为空, 移进当前算子" 这个动作也可以用括号来合并成一个一般情况, 毕竟分支太多代码会比较挫. 很简单, 在分析开始时移进一个括号, 结束之后再来一个反括号, 这样表达式整体语义不会变, 但是程序不用再为算子栈的判空费心了.

    现在补全单目算子的优先级, 差不多整个过程的轮廓就很清晰了

OP_PRIO_MATRIX = (
  #  Stack top:
  #  POS    NEG    ADD    SUB    MUL    DIV    POW    PAREN    | Encounters:
    (False, False, False, False, False, False, False, False), #| POS
    (False, False, False, False, False, False, False, False), #| NEG
    (True,  True,  True,  True,  True,  True,  True,  False), #| ADD
    (True,  True,  True,  True,  True,  True,  True,  False), #| SUB
    (True,  True,  False, False, True,  True,  True,  False), #| MUL
    (True,  True,  False, False, True,  True,  True,  False), #| DIV
    (True,  True,  False, False, False, False, False, False), #| POW
    (False, False, False, False, False, False, False, False), #| PAREN
)

def closeParenthesis(expr_stack, op_stack):
    while len(op_stack) > 0:
        operator = op_stack.pop()
        if '(' == operator.op_image:
            break
        op_stack.pop().reduce(expr_stack)

def handleOperator(token, expr_stack, op_stack):
    if ')' == token.image:
        closeParenthesis(expr_stack, op_stack)
        return False

    operator = BinaryOperator(token.image)
    while OP_PRIO_MATRIX[operator.op_type][op_stack[-1].op_type]:
        op_stack.pop().reduce(expr_stack)
    op_stack.append(operator)
    return True

def buildExpression(tokens):
    want_number = True
    expr_stack = []
    op_stack = [OpenParen()]

    for token in tokens:
        if want_number:
            want_number = handleNumber(token, expr_stack, op_stack)
        else:
            want_number = handleOperator(token, expr_stack, op_stack)

    handleOperator(Token(')', False, False), expr_stack, op_stack)
    return expr_stack.pop()

到此为止, Token 的一个属性 is_number 还没用到呢, 当然, 话说回来整个程序有个重要的功能还没照顾到: 报错.
为了简单起见, 这里的行为是报错而不是容错, 也就是说遇到第一个错误程序扔个异常出去算了, 这样的话在下面这几处加些代码就好了

def handleNumber(token, expr_stack, op_stack):
    if '(' == token.image:
        op_stack.append(OpenParen())
        return True

    if token.is_pre_unary_operator:
        op_stack.append(PreUnaryOperator(token.image))
        return True

    if token.is_number:
        expr_stack.append(Integer(token.image))
        return False

    raise ValueError('Want a number, actually: ' + token.image)

def closeParenthesis(expr_stack, op_stack):
    while len(op_stack) > 0:
        operator = op_stack.pop()
        if '(' == operator.op_image:
            break
        expr_stack.append(operator.reduce(expr_stack))
    else:
        raise ValueError('Parenthesis not blanced')

def handleOperator(token, expr_stack, op_stack):
    if token.is_number or '(' == token.image:
        raise ValueError('Want a operator, actually: ' + token.image)

    if ')' == token.image:
        closeParenthesis(expr_stack, op_stack)
        return False

    operator = BinaryOperator(token.image)
    while OP_PRIO_MATRIX[operator.op_type][op_stack[-1].op_type]:
        expr_stack.append(op_stack.pop().reduce(expr_stack))
    op_stack.append(operator)
    return True

def buildExpression(tokens):
    want_factor = True
    expr_stack = []
    op_stack = [OpenParen()]

    for token in tokens:
        if want_factor:
            want_factor = handleNumber(token, expr_stack, op_stack)
        else:
            want_factor = handleOperator(token, expr_stack, op_stack)

    handleOperator(Token(')', False, False), expr_stack, op_stack)
    if len(expr_stack) != 1 or len(op_stack) != 0:
        raise ValueError('Invalid expression')
    return expr_stack.pop()

最后给上完整的可以跑起来的程序, 附带三个不会引发异常的案例

class NodeBase:
    def evaluate(self):
        return 0

class Integer(NodeBase):
    def evaluate(self):
        return self.value

    def __init__(self, value):
        self.value = int(value)

PRE_UNARY_OP_MAP = {
    '+': lambda x: x,
    '-': lambda x: -x,
}

BINARY_OP_MAP = {
    '+': lambda x, y: x + y,
    '-': lambda x, y: x - y,
    '*': lambda x, y: x * y,
    '/': lambda x, y: x / y,
    '^': lambda x, y: x ** y,
}

class PreUnaryOperation(NodeBase):
    def evaluate(self):
        return PRE_UNARY_OP_MAP[self.op](self.rhs.evaluate())

    def __init__(self, op, rhs):
        self.op = op
        self.rhs = rhs

class BinaryOperation(NodeBase):
    def evaluate(self):
        return BINARY_OP_MAP[self.op](self.lhs.evaluate(), self.rhs.evaluate())

    def __init__(self, op, lhs, rhs):
        self.op = op
        self.lhs = lhs
        self.rhs = rhs

class Token:
    def __init__(self, image, is_number, is_pre_unary_operator):
        self.image = image
        self.is_number = is_number
        self.is_pre_unary_operator = is_pre_unary_operator

POS, NEG, ADD, SUB, MUL, DIV, POW, PAREN = range(0, 8)

PRE_UNARY_SIGN_OP_MAP = {
    '+': POS,
    '-': NEG,
    '(': PAREN,
}

BINARY_SIGN_OP_MAP = {
    '+': ADD,
    '-': SUB,
    '*': MUL,
    '/': DIV,
    '^': POW,
}

OP_PRIO_MATRIX = (
  #  Stack top:
  #  POS    NEG    ADD    SUB    MUL    DIV    POW    PAREN    | Encounters:
    (False, False, False, False, False, False, False, False), #| POS
    (False, False, False, False, False, False, False, False), #| NEG
    (True,  True,  True,  True,  True,  True,  True,  False), #| ADD
    (True,  True,  True,  True,  True,  True,  True,  False), #| SUB
    (True,  True,  False, False, True,  True,  True,  False), #| MUL
    (True,  True,  False, False, True,  True,  True,  False), #| DIV
    (True,  True,  False, False, False, False, False, False), #| POW
    (False, False, False, False, False, False, False, False), #| PAREN
)

class Operator:
    def reduce(self, expr_stack):
        pass

    def __init__(self, op_image, op_type):
        self.op_image = op_image
        self.op_type = op_type

class OpenParen(Operator):
    def __init__(self):
        Operator.__init__(self, '(', PAREN)

class PreUnaryOperator(Operator):
    def reduce(self, expr_stack):
        rhs = expr_stack.pop()
        return PreUnaryOperation(self.op_image, rhs)

    def __init__(self, op_image):
        Operator.__init__(self, op_image, PRE_UNARY_SIGN_OP_MAP[op_image])

class BinaryOperator(Operator):
    def reduce(self, expr_stack):
        rhs = expr_stack.pop()
        lhs = expr_stack.pop()
        return BinaryOperation(self.op_image, lhs, rhs)

    def __init__(self, op_image):
        Operator.__init__(self, op_image, BINARY_SIGN_OP_MAP[op_image])

def handleNumber(token, expr_stack, op_stack):
    if '(' == token.image:
        op_stack.append(OpenParen())
        return True

    if token.is_pre_unary_operator:
        op_stack.append(PreUnaryOperator(token.image))
        return True

    if token.is_number:
        expr_stack.append(Integer(token.image))
        return False

    raise ValueError('Want a number, actually: ' + token.image)

def closeParenthesis(expr_stack, op_stack):
    while len(op_stack) > 0:
        operator = op_stack.pop()
        if '(' == operator.op_image:
            break
        expr_stack.append(operator.reduce(expr_stack))
    else:
        raise ValueError('Parenthesis not blanced')

def encounterBinaryOperator(operator, expr_stack, op_stack):
    while OP_PRIO_MATRIX[operator.op_type][op_stack[-1].op_type]:
        expr_stack.append(op_stack.pop().reduce(expr_stack))
    op_stack.append(operator)

def handleOperator(token, expr_stack, op_stack):
    if token.is_number or '(' == token.image:
        raise ValueError('Want a operator, actually: ' + token.image)

    if ')' == token.image:
        closeParenthesis(expr_stack, op_stack)
        return False

    operator = BinaryOperator(token.image)
    encounterBinaryOperator(operator, expr_stack, op_stack)
    return True

def buildExpression(tokens):
    want_factor = True
    expr_stack = []
    op_stack = [OpenParen()]

    for token in tokens:
        if want_factor:
            want_factor = handleNumber(token, expr_stack, op_stack)
        else:
            want_factor = handleOperator(token, expr_stack, op_stack)

    handleOperator(Token(')', False, False), expr_stack, op_stack)
    if len(expr_stack) != 1 or len(op_stack) != 0:
        raise ValueError('Invalid expression')
    return expr_stack.pop()

if '__main__' == __name__:
    assert (-(6 * 5)) + (4 ** (3 ** 2)) == buildExpression([
        Token('-', False, True),
        Token('(', False, False),
        Token('6', True,  False),
        Token('*', False, True),
        Token('5', True,  False),
        Token(')', False, False),
        Token('+', False, False),
        Token('4', True,  False),
        Token('^', False, False),
        Token('3', True,  False),
        Token('^', False, False),
        Token('2', True,  False),
    ]).evaluate()

    assert (-6) + 5 * (-4) + 3 + 2 == buildExpression([
        Token('-', False, True),
        Token('6', True,  False),
        Token('+', False, True),
        Token('5', True,  False),
        Token('*', False, False),
        Token('-', False, True),
        Token('4', True,  False),
        Token('+', False, False),
        Token('3', True,  False),
        Token('+', False, False),
        Token('2', True,  False),
    ]).evaluate()

    assert (+6) + (-5) * ((-4) ** (3 ** 2)) == buildExpression([
        Token('+', False, True),
        Token('6', True,  False),
        Token('+', False, True),
        Token('-', False, True),
        Token('5', True,  False),
        Token('*', False, False),
        Token('-', False, True),
        Token('4', True,  False),
        Token('^', False, False),
        Token('3', True,  False),
        Token('^', False, False),
        Token('2', True,  False),
    ]).evaluate()

感谢双木成林对本文的审阅和指导, 他是一位热衷于 Python 的上进青年 & geek.

	Back to Tobary book
	Tobary book - Copyright (C) ZhePlus @ Bit Focus
	About