Algebraic Data Type and Entities
Algebraic data types are powerful programming constructs that allows user to define custom data structures and variants.
Consider a traditional functional definition of a List
:
type IntList = Nil()
| Cons(i32, List)
We are saying that a IntList
can be one of two variants, Nil
and Cons
:
Nil
denotes the end of a list;Cons
contains the currenti32
integer and a continuation of the list.
In this representation, we can represent a list like [1, 2, 3]
with Cons(1, Cons(2, Cons(3, Nil())))
.
This is indeed what we can write in Scallop.
We can declare such a list as a constant:
const MY_LIST = Cons(1, Cons(2, Cons(3, Nil())))
In general, we call the type definition of such data structure Algebraic Data Type definitions, or ADT definitions.
The name Entity is used to refer to objects of such data types.
In the example above, the constant MY_LIST
is an entity of the ADT named IntList
.
In this section, we describe in detail the definition and use of ADT and Entities. We also touch on the internals.
Defining Algebraic Data Types (ADT)
We use the following syntax to define ADTs:
type TYPE_NAME = VARIANT_NAME(ARG_TYPE_1, ARG_TYPE_2, ...) | ...
An ADT named TYPE_NAME
is defined to have multiple (at least 2) named variants with VARIANT_NAME
.
Each variant holds a tuple of values typed by ARG_TYPE_1
, ARG_TYPE_2
, etc.
We call variants that have no argument terminal variants.
Parenthesis are still needed for those variants.
Please note that there cannot be duplicated variant names, either within the same ADT or different ADTs. For example, the following code would result in compilation failure:
type IntList = Cons(i32, IntList) | Nil()
type BoolList = Cons(bool, BoolList) | Nil() // Failure: Cons and Nil are already defined
Currently, ADTs do not support generics.
In the above case, the IntList
and BoolList
needs to be defined separately with differently named variants.
Using ADT to represent arithmetic expressions
Common data that can be expressed through ADT could be structured expressions. The following definition describes the abstract syntax tree (AST) of simple arithmetic expressions:
type Expr = Int(i32) // An expression could be a simple integer,
| Add(Expr, Expr) // a summation of two expressions
| Sub(Expr, Expr) // a substraction of two expressions
The following code encodes a simple expression
// The expression (1 + 3) - 5
const MY_EXPR = Sub(Add(Int(1), Int(3)), Int(5))
Using ADT to represent data structures
Data structures such as binary trees can also be represented:
type Tree = Node(i32, Tree, Tree) | Nil()
Here, Node(i32, Tree, Tree)
represents a node in a tree holding three things:
an integer (i32
), a left sub-tree Tree
, and a right sub-tree Tree
.
The other variant Nil
represents an empty sub-tree.
In this encoding, Node(5, Nil(), Nil())
would be representing a leaf-node holding a number 5.
The following code encodes a balanced binary search tree:
// 3
// / \
// 1 5
// / \ / \
// 0 2 4 6
const MY_TREE =
Node(3,
Node(1,
Node(0, Nil(), Nil()),
Node(2, Nil(), Nil()),
),
Node(5,
Node(4, Nil(), Nil()),
Node(6, Nil(), Nil()),
)
)
Working with Entities
Entities are most commonly created as constants using the const
keyword.
Let us revisit the List
example and see how we can use the defined constant in our analysis.
type List = Cons(i32, List) | Nil()
const MY_LIST = Cons(1, Cons(2, Cons(3, Nil()))) // [1, 2, 3]
Using Entities in Relations
We can include the constant entities as part of a fact:
rel target(MY_LIST)
query target
As a result of the above program, we are going to get the value of the entity MY_LIST
:
target: {(entity(0xff08d5d60a201f17))}
The value is going to be a 64-bit integer encoded in hex. It is a unique identifier for the created entity.
Note that, identical entities are going to have the same identifier.
In the following example, MY_LIST_1
and MY_LIST_2
are identical, and therefore their hex identifier are the same.
const MY_LIST_1 = Cons(1, Nil()),
MY_LIST_2 = Cons(1, Nil()),
MY_LIST_3 = Cons(2, Nil())
rel lists = {
(1, MY_LIST_1),
(2, MY_LIST_2),
(3, MY_LIST_3),
}
query lists
// lists: {
// (1, entity(0x678defa0a65c83ab)), // Notice that the entity 1 and 2 are the same
// (2, entity(0x678defa0a65c83ab)),
// (3, entity(0x3734567c3d9f8d3f)), // This one is different than above
// }
Decomposing Entities in Rules
To peek into the content of an Entity, we can destruct it using the case
-is
operator.
We look at an example of computing the length of a list:
type length(list: List, len: i32)
rel length(list, 0) = case list is Nil()
rel length(list, l + 1) = case list is Cons(_, tl) and length(tl, l)
We define a recursive relation length
to compute the length of a list.
There are two cases.
When the list is Nil()
, this means the list has already ended.
Therefore the list has a length of 0
For the second case, the list is Cons(_, tl)
.
Here, the length of list is the length of tl
plus 1.
We can then compute the length of a list by query
ing the length
relationship on a constant list.
query length(MY_LIST, l) // l = 3
Case Study: Decomposing Entities for Pretty-Printing
We can look at more examples of using the case
-is
operators.
The following set of rules pretty-prints expressions:
type Expr = Int(i32) | Add(Expr, Expr) | Sub(Expr, Expr)
type to_string(expr: Expr, str: String)
rel to_string(e, $format("{}", i)) = case e is Int(i)
rel to_string(e, $format("({} + {})", a, b)) = case e is Add(e1, e2) and to_string(e1, a) and to_string(e2, b)
rel to_string(e, $format("({} - {})", a, b)) = case e is Sub(e1, e2) and to_string(e1, a) and to_string(e2, b)
Shown in the example, we have written three to_string
rules for pretty-printing the Expr
data structure.
Each rule correspond to handling exactly one of the variants.
For the inductive cases Add
and Sub
, we have the to_string
rule defined recursively so that the sub-expressions are also converted to strings.
For pretty-printing, we have used the $format
foreign function.
At the end, running the following snippet
const MY_EXPR = Sub(Add(Int(3), Int(5)), Int(1))
query to_string(MY_EXPR, s)
would give the following result, suggesting that the pretty-printed expression is ((3 + 5) - 1)
to_string(MY_EXPR, s): {(entity(0xa97605c2703c6249), "((3 + 5) - 1)")}
Case Study: Checking Regular Expressions
With ADT, we can specify the language of regular expressions (regex) with ease.
Let's consider a very simple regex with union (|
) and star (*
), while phrases can be grouped together.
For example, the regex "a*b"
expresses that character a
can be repeated arbitrary amount of time (including 0-times), followed by a single b
.
This regex can be used to match strings like "aaaab"
and "b"
, but not "ba"
.
Let's try to define this regex language in Scallop!
type Regex = Char(char) // a single character
| Star(Regex) // the star of a regex
| Union(Regex, Regex) // a union of two regexes
| Concat(Regex, Regex) // concatenation of two regexes
As can be seen, we have defined 4 variants of this regex language.
With this, our regex "a*b"
can be expressed as follows:
// a*b
const A_STAR_B = Concat(Star(Char('a')), Char('b'))
Now, let's define the actual semantics of this regex language and write a relation matches
to check if the regex matches with a given sub-string.
We first setup the types of such relations.
input_regex
is a unary-relation for holding the regex to be checked against;input_string
is a unary-relation for holding the string to be checked against;matches_substr
is for checking if a sub-regexr
can be matched with the input string betweenbegin
andend
indices, whereend
is exclusive;matches
is a boolean relation telling whether theA_STAR_B
regex matches with the input string or not.
type input_regex(r: Regex)
type input_string(s: String)
type matches_substr(r: Regex, begin: usize, end: usize)
type matches()
The main bulk of the code will then be dedicated to define the matches_substr
relation.
At a high level, we decompose on each type of regex, and match on sub-strings.
The first rule that we are going to write would be for the Char
variant.
rel matches_substr(r, i, i + 1) = case r is Char(c) and input_string(s) and string_chars(s, i, c)
The rule suggests that if the regex r
is a single character c
, then we go into the input string s
and find all the index i
such that its corresponding character is c
.
The matched sub-string would start at index i
and end at index i + 1
.
Note that the string_chars
relation is a foreign predicate that decomposes the string into characters.
Similarly, we can write the rules for other variants:
// For star; it matches empty sub-strings [i, i) and recursively on sub-regex
rel matches_substr(r, i, i) = case r is Star(_) and input_string(s) and string_chars(s, i, _)
rel matches_substr(r, b, e) = case r is Star(r1) and matches_substr(r, b, c) and matches_substr(r1, c, e)
// For union; any string that matches left or right sub-regex would match the union
rel matches_substr(r, b, e) = case r is Union(r1, r2) and matches_substr(r1, b, e)
rel matches_substr(r, b, e) = case r is Union(r1, r2) and matches_substr(r2, b, e)
// For concat; we need strings to match in a consecutive matter
rel matches_substr(r, b, e) = case r is Concat(r1, r2) and matches_substr(r1, b, c) and matches_substr(r2, c, e)
Lastly, we add the rule to derive the final matches
relation.
Basically, it checks if the regex matches the start-to-end of the input string
rel matches() = input_regex(r) and input_string(s) and matches_substr(r, 0, $string_length(s))
Let us test the result!
rel input_regex(A_STAR_B)
rel input_string("aaaab")
query matches // {()}
Dynamically Creating Entities
There are cases where we want to create new entities during the deductive process.
This is done through the new
keyword followed by the entity to create.
Suppose we have the definition of List
and some pretty-printing code for it:
type List = Cons(i32, List) | Nil()
rel to_string_2(l, "]") = case l is Nil()
rel to_string_2(l, $format("{}]", i)) = case l is Cons(i, Nil())
rel to_string_2(l, $format("{}, {}", i, ts)) = case l is Cons(i, tl) and case tl is Cons(_, _) and to_string_2(tl, ts)
rel to_string(l, $format("[{}", tl)) = to_string_2(l, tl)
The following example shows that, given an input list l
, we generate a result list Cons(1, l)
.
type input_list(List)
rel result_list(new Cons(1, l)) = input_list(l)
Given an actual list defined as a constant, we will be able to specify that the constant is the input list:
const MY_INPUT_LIST = Cons(2, Cons(3, Nil()))
rel input_list(MY_INPUT_LIST)
Now, let's visualize the results!
rel input_list_str(s) = to_string(MY_INPUT_LIST, s)
rel result_list_str(s) = result_list(l) and to_string(l, s)
query input_list_str // [2, 3]
query result_list_str // [1, 2, 3]
As can be seen, through the new
operator, we have essentially created a new list containing the element 1
.
We note that the rule for result_list
is not recursive.
In general, extra care needs to be taken to ensure that the program does not go into infinite loop.`
Case Study: Creating Entities for Equality Saturation
In this case study we look at the problem of equality saturation. Given an symbolic expression, there might be ways to simplify it, which are defined through rewrite rules. Notice that after simplification, the program should be equivalent to the input. The problem is challenging as there might be multiple ways to apply the rewrite rules. How do we then systematically derive the simplest equivalent program?
A simple example here is the symbolic arithmetic expression language, with constant, variables, and summation rule:
type Expr = Const(i32) | Var(String) | Add(Expr, Expr)
One example expression that we can express in this language would be
const MY_EXPR = Add(Add(Const(-3), Var("a")), Const(3)) // (-3 + a) + 3
For visualization, we write a to_string
function
rel to_string(p, i as String) = case p is Const(i)
rel to_string(p, v) = case p is Var(v)
rel to_string(p, $format("({} + {})", s1, s2)) =
case p is Add(p1, p2) and to_string(p1, s1) and to_string(p2, s2)
If we query on to_string
for MY_EXPR
, we would get
query to_string(MY_EXPR, s) // s = "((-3 + a) + 3)"
Now let us deal with the actual simplification.
The expression (-3 + a) + 3
could be simplified to just a
, as the -3
and 3
cancels out.
The way to do the simplification is to write two things:
- rewrite rules in the form of equivalence relation;
- the weight function giving each expression a weight to tell which expression is simpler.
For this, the following set of relations needs to be defined.
type input_expr(expr: Expr)
type equivalent(expr_1: Expr, expr_2: Expr)
type weight(expr: Expr, w: i32)
type simplest(expr: Expr)
Note that we need set a prior knowledge on equivalent
: the expr_1
is always more complex than the expr_2
.
This is to prevent the simplification to go to arbitrary direction and result in infinite-loop.
In such case, equivalent
would not be commutative.
Let us start with equivalent
and define its basic property of identity and transitivity:
// Identity
rel equivalent(e, e) = case e is Const(_) or case e is Var(_) or case e is Add(_, _)
// Transitivity
rel equivalent(e1, e3) = equivalent(e1, e2) and equivalent(e2, e3)
Now, we can write the rewrite rules.
The first one we are going to write states that, if e1
and e1p
are equivalent and e2
and e2p
are equivalent,
their additions (Add(e1, e2)
and Add(e1p, e2p)
) are equivalent too.
// e1 == e1p, e2 == e2p ==> (e1 + e2) == (e1p + e2p)
rel equivalent(e, new Add(e1p, e2p)) = case e is Add(e1, e2) and equivalent(e1, e1p) and equivalent(e2, e2p)
The next rule states that Addition is commutative, such that Add(a, b)
is equivalent to Add(b, a)
:
// (a + b) == (b + a)
rel equivalent(e, new Add(b, a)) = case e is Add(a, b)
We also have a rule for associativity:
// (a + (b + c)) == ((a + b) + c)
rel equivalent(e, new Add(new Add(a, b), c)) = case e is Add(a, Add(b, c))
A rule for simplifying adding summation identity 0:
// a + 0 = a
rel equivalent(e, a) = case e is Add(a, Const(0))
A rule for reducing two constants addition:
rel equivalent(e, Const(a + b)) = case e is Add(Const(a), Const(b))
Now we have 5 rewrite-rules in place, let us define how to compute the weight of each expression.
The leaf nodes (Var
and Const
) have weight of 1
, and the addition have the weight from left and right sub-expr added together plus 1.
rel weight(e, 1) = case e is Var(_) or case e is Const(_)
rel weight(e, l + r + 1) = case e is Add(a, b) and weight(a, l) and weight(b, r)
Lastly, we use the aggregation to find the equivalent programs with the minimum weight, which is our definition of the "simplest" program.
Note that we have used an argmax
aggregation denoted by min[p]
here:
rel best_program(p) = p := argmin[p](w: input_expr(e) and equivalent(e, p) and weight(p, w))
If we query for the best program and turn it into string, we will get our expected output, a single variable "a"
!
rel best_program_str(s) = best_program(p) and to_string(p, s)
query best_program_str // {("a")}
Parsing Entities from String
Scallop provides foreign functions and predicates for dynamically parsing entities from string input. Consider the following example:
type Expr = Const(f32) | Add(Expr, Expr)
rel expr_str = {"Add(Const(1), Const(2.5))"}
Let us say that we want to parse an expression from the expr_str
, we can do the following:
rel expr($parse_entity(s)) = expr_str(s)
Here, we are using the foreign function of $parse_entity
.
We would get the following result:
query expr
// expr: {(entity(0xadea13a2621dd155))}