Parsing ASL psuedocode with Rust's nom crate

Recently, I’ve been working a parser for ARM’s XML description of the AArch64 instruction set architecture in Rust. This has been a great way to write more Rust and having a parser for ARM’s XML specification sets the stage for all sorts of fun when it comes to writing disassemblers and emulators for the AArch64 architecture.

Using a combination of quick_xml and serde has made parsing XML mostly a breeze. Initially my focus was on the parts of the spec describing the encoding of instructions. But soon it came to my attention that ARM’s psuedocode language (aka ASL, Architecture Specification Language) is used in quite a few important places. One example is instruction aliases. In AArch64, under certain conditions an orr instruction should be decoded as a mov instruction. In the spec, these conditions are described as a boolean expression in ASL:

Rn == '11111' && !MoveWidePreferred(sf, N, imms, immr)

To make the most out of these expressions we’ll probably want to parse it into an abstract syntax tree that is easier to work with. This seemed like the perfect excuse to add another dependency to my project so I settled on using nom, a parser-combinator library for Rust, to parse these boolean expressions.

Note: before taking a stab at this task, I didn’t bother to take a look at ARM’s reference manual for their Architecture Specification Language 🤪 so all the grammers in this post are ones I came up with based off what I saw in the AArch64 spec.

That being said for my initial use case, I just cared about the boolean expressions in alias conditions. Here are few more examples I worked with:

op1 == '011' && CRm IN {'0xx0'} && op2 == '011'
imms != '111111' && UInt(imms) + 1 == UInt(immr)
!(cond IN {'111x'}) && Rn == Rm

From these examples we can extrapolate a few different types we’ll need. We have identifiers like op1, binary operators like &&, +, IN, and &&, function calls, and ! as a negation operator.

Given these individual parts we can start building up our nom parser. I started by defining the type that I want my parsers to return:

#[derive(Debug, PartialEq)]
pub enum Expr {
    Equal(EqualOperator),
    NotEqual(NotEqualOperator),
    LessThan(LessThanOperator),
    GreaterThan(GreaterThanOperator),
    And(AndOperator),
    Or(OrOperator),
    Identifier(String),
    BinaryConstant(BinaryConstantExpr),
    BinaryPattern(BinaryPatternExpr),
    Call(CallExpr),
}

I really enjoy using nom. You get a lot of handy combinators that you can use to build up your parser little by little. Here I define an identifier as a sequence of uninterrupted non-whitespace characters:

pub fn identifier(code: &str) -> IResult<&str, Expr, Error<&str>> {
    let (input, value) = take_while(not_whitespace)(code)?;
    Ok((input, Expr::Identifier(value.into())))
}

We can just as easily define a parser for binary constants which are in single quotes in ASL:

pub fn binary_constant(code: &str) -> IResult<&str, Expr, Error<&str>> {
    let (input, constant) = delimited(
        char('\''),
        take_while(is_alphanumeric),
        char('\'')
    )(code)?;
    Ok((input, Expr::BinaryConstant(BinaryConstantExpr { value: constant.into() })))
}

Any parser we define we can reuse in subsequent parsers. For example I defined call arguments as a list of comma-seperated identifiers and then use that call arguments combinator in my call parser.

pub fn call_arguments(code: &str) -> IResult<&str, Vec<Expr>, Error<&str>> {
    delimited(
        tag("("),
        separated_list0(tag(", "), identifier),
        tag(")")
    )(code)
}

pub fn call(code: &str) -> IResult<&str, Expr, Error<&str>> {
    let (i, (identifier, arguments)) = separated_pair(
        identifier,
        tag(" "),
        call_arguments
    )(code)?;
    Ok((i, Expr::Call(CallExpr {
        identifier: Box::new(identifier),
        arguments
    })))
}

Eventually, I get to the point where I want to implement some of the binary operations I’m seeing. At a high level, I figure that ASL’s grammar should look something like

expr ::= binary_expression | binary_constant | identifier
binary_expression ::= or | and | in | cmp
and ::= expr "&&" expr

However, if we try to literally implement this using nom, we’ll fall into the trap of left recursion. Take the following code snippet as an example.

pub fn and(code: &str) -> IResult<&str, Expr, Error<&str>> {
    let (input, (left, right)) = separated_pair(
        expr,
        tag(" && "),
        expr
    )(code)?;
    Ok((input, Expr::And(AndOperator {
        left: Box::new(left),
        right: Box::new(right)
    })))
}

pub fn expr(code: &str) -> IResult<&str, Expr, Error<&str>> {
    alt((
        and,
        eq,
        binary_pattern,
        binary_constant,
        identifier,
    ))(code)
}

This will result in a stack overflow because we’ll end up constantly recursing into our expr parser. Instead we’ll want to come up with a grammar that doesn’t have recursion. One approach is to just replace the left expr with a terminal type which will only contain terminal items.

expr ::= binary_expression | term
term ::= binary_constant | identifier
binary_expression ::= or | and | in | cmp
and ::= term "&&" expr

Now we can rewrite our nom parser to implement this new approach.

pub fn and(code: &str) -> IResult<&str, Expr, Error<&str>> {
    let (input, (left, right)) = separated_pair(
        term,
        tag(" && "),
        expr
    )(code)?;
    Ok((input, Expr::And(AndOperator {
        left: Box::new(left),
        right: Box::new(right)
    })))
}

pub fn term(code: &str) -> IResult<&str, Expr, Error<&str>> {
    alt((
        binary_pattern,
        binary_constant,
        identifier,
    ))(code)
}

pub fn expr(code: &str) -> IResult<&str, Expr, Error<&str>> {
    alt((
        and,
        eq,
        term,
    ))(code)
}

Finally, with this parser we can build parsers for the different binary expressions we want to support and successfully parse them into an AST we can work with. However, there’s one big piece missing.

Because we always treat the leftmost item as a terminal, we end up parsing operators in a way that is always left-associative. What we really want is to be smarter about the way we reason about operator precedence in our code. But I’ll leave the task of implementing operator precedence to a subsequent post!