Scallop, a Language for Neurosymbolic Programming
Scallop is a language based on DataLog that supports differentiable logical and relational reasoning. Scallop program can be easily integrated in Python and even with a PyTorch learning module. You can also use it as another DataLog solver. This book aims to give both high-level overview of the language usage and also low-level documentation on how each language feature is used.
The following example shows how knowledge base facts, rules, and probabilistic facts recognized from images can operate together.
// Knowledge base facts
rel is_a("giraffe", "mammal")
rel is_a("tiger", "mammal")
rel is_a("mammal", "animal")
// Knowledge base rules
rel name(a, b) :- name(a, c), is_a(c, b)
// Recognized from an image, maybe probabilistic
rel name = {
0.3::(1, "giraffe"),
0.7::(1, "tiger"),
0.9::(2, "giraffe"),
0.1::(2, "tiger"),
}
// Count the animals
rel num_animals(n) :- n = count(o: name(o, "animal"))
Table of Content
Please refer to the side-bar for a detailed table of content. At a high-level, we organize this book into the following 5 sections:
Installation and Crash Course
Installation gives instructions on how to install the Scallop on your machine. Crash Course gives a quick introduction to what the language is and how it is used. Both sections are designed so that you can start quickly with Scallop.
Scallop and Logic Programming
Scallop and Logic Programming aims to give you a detailed introduction on the language. It introduces language features such as relational programming, negation and aggregation, queries, foreign constructs, and etc. Reading through all of these you should be well-versed in Scallop's core functionality and you will be able to use Scallop as a Datalog engine.
type fib(bound x: i32, y: i32)
rel fib = {(0, 1), (1, 1)}
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) and x > 1
query fib(10, y)
Scallop and Probabilistic Programming
Scallop and Probabilistic Programming introduces the probabilistic side of Scallop. You will learn to tag facts with probabilities, its underlying algorithms and frameworks, and additional programming constructs for probabilistic semantics. By the end of this section, you will be familiar with using Scallop as a probabilistic programming language.
rel attr = { 0.99::(OBJECT_A, "blue"), 0.01::(OBJECT_B, "red"), ... }
rel relate = { 0.01::(OBJECT_A, "holds", OBJECT_B), ... }
Scallopy and Neurosymbolic Programming
Scallopy and Neurosymbolic Programming goes into the heart of Scallop to introduce applying Scallop to write Neurosymbolic applications.
Neurosymbolic methods are for methods that have both neural and logical components.
For this, we are going to use the Python binding of Scallop, scallopy
, to integrate with machine learning libraries such as PyTorch.
This section will be describing the API of scallopy
.
sum_2 = scallopy.Module(
program="""type digit_1(i32), digit_2(i32)
rel sum_2(a + b) = digit_1(a) and digit_2(b)""",
input_mappings={"digit_1": range(10), "digit_2": range(10)},
output_mapping=("sum_2", range(19)))
For Developers
For Developers discusses how developers and researchers who are interested in extending Scallop can step into the source code of Scallop and program extensions.
Installation
There are many ways in which you can use Scallop, forming a complete toolchain. We specify how to installing the toolchain from source. The following instructions assume you have access to the Scallop source code and have basic pre-requisites installed.
Requirements
- Rust - nightly 2023-03-07 (please visit here to learn more about Rust nightly and how to install them)
- Python 3.7+ (for connecting Scallop with Python and PyTorch)
Scallop Interpreter
The interpreter of Scallop is named scli
.
To install it, please do
$ make install-scli
From here, you can use scli
to test and run simple programs
$ scli examples/datalog/edge_path.scl
Scallop Interactive Shell
Scallop Python Interface
Scallop and Logic Programming
In this part of the book we introduce Scallop as a relational and logical programming language. Scallop is a Datalog based language extended with various language features such as negation, aggregation, disjunctive head, algebraic data type, foreign functions, and foreign predicates. We will explain all of these concepts in detail, aiming to provide a comprehensive introduction to the core language constructs.
Relations and Facts
Scallop is a relational and logical programming language. As described in the Wikipedia:
Logic programming is a programming paradigm which is largely based on formal logic. Any program written in a logic programming language is a set of sentences in logical form, expressing facts and rules about some problem domain.
In Scallop, relations are the most fundamental building blocks of program.
In the following example, we have declared the type of a relation called edge
, using the type
keyword:
type edge(a: i32, b: i32)
We say that the name edge
is a predicate or a relation.
Inside of the parenthesis, we have two arguments
, a: i32
and b: i32
.
Therefore, we have edge
being an arity-2 relation, due to it having 2 arguments.
For the argument a: i32
, we give a name of the field (a
) and a type of that argument i32
.
Here, both of the arguments are of the i32
type, which means signed-integer, 32-bit.
For more information on value types, refer to the Value Types section.
The above line only declares the type of the relation but not the content of the relation.
The actual information stored in the relations are called facts.
Here we define a single fact under the relation edge
:
rel edge(0, 1)
Assuming 0
and 1
each denote an ID of a node, this fact declares that there is an edge going from node 0
to node 1
.
There are two arguments in this fact, matching the arity of this relation.
Regardless of the predicate edge
, one also simply consider the (0, 1)
as a tuple, more specifically, a 2-tuple.
To declare multiple facts, one can simply write multiple single fact declaration using the rel
keyword, like
rel edge(0, 1)
rel edge(1, 2)
One can also use the set syntax to declare multiple facts of a relation.
The following line reads: "the relation edge
contains a set of tuples, including (0, 1)
and (1, 2)
":
rel edge = {(0, 1), (1, 2)}
Note that it is possible to declare multiple fact sets for the same relation.
rel edge = {(0, 1), (1, 2)}
rel edge = {(2, 3)}
With the above two lines the edge relation now contains 3 facts, (0, 1)
, (1, 2)
, and (2, 3)
.
Examples of Relations
Boolean and 0-arity Relation
Many things can be represented as relations.
We start with the most basic programming construct, boolean.
While Scallop allows value to have the boolean type, relations themselves can encode boolean values.
The following example contains an arity-0 relation named is_target
:
type is_target()
There is only one possible tuple that could form a fact in this relation, that is the empty tuple ()
.
Assuming that we are treating the relation is_target
as a set, then if the set contains no element (i.e., empty), then it encodes boolean "false".
Otherwise, if the set contains exactly one (note: it can contain at most one) tuple, then it encodes boolean "true".
Declaring only the type of is_target
as above, would assume that the relation is empty.
To declare the fact, we can do:
rel is_target()
// or
rel is_target = {()}
Unary Relations
Unary relations are relations of arity 1.
We can define unary relations for "variables" as we see in other programming languages.
The following example declares a relation named greeting
containing one single string of "hello world!"
.
rel greeting("hello world!")
// or
rel greeting = {("hello world!",)}
Note that for the second way of expressing the fact, we may omit the parenthesis and make it cleaner:
rel greeting = {"hello world!"}
In light of this, we may write the following rule:
rel possible_digit = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Integer Arithmetics as Relations
Integer arithmetics can be represented as relations as well.
Consider a simple summation in algebra, a + b = c
encodes the sum relationship among two operands (a
and b
) and their summation (c
).
Encoded in Scallop, they form arity-3 relations:
type add(op1: i32, op2: i32, result: i32)
Note that, in Scallop, relations are not polymorphic. That is, every relation, no matter declared or inferred, only has one type annotation.
We are working on an update in the future to relax this restriction.
To declare facts of this add
relation, such as 3 + 4 = 7
, we write
rel add(3, 4, 7) // 3 + 4 = 7
However, you might notice that the add
relation is theoretically infinite.
That is, there are infinite amount of facts that can satisfy the add
relation.
There is no way that we can possibly enumerate or declare all the facts.
In such case, we resort to declaring rules using foreign functions or predicates, which we will discuss later.
For now, let's use add
as an example relation that encodes integer arithmetics.
Terminologies
We have the following terminologies for describing relations.
- Boolean Relation: arity-0 relation
- Unary Relation: arity-1 relation
- Binary Relation: arity-2 relation
- Ternary Relation: arity-3 relation
Type Inference
Scallop supports Type Inference. One does not need to fully annotate every relation on their types. Types are inferred during the compilation process.
For example, given the following code,
rel edge = {(0, 1), (1, 2)}
we can infer that the relation edge
is a binary-relation where both arguments are integers.
Note that when integers are specified, they are set default to the type of i32
.
Type inference will fail if conflicts are detected.
In the following snippet, we have the second argument being 1
as integer and also "1"
as string.
rel edge = {(0, 1), (0, "1")}
Having this code will raise the following compile error, suggesting that the types cannot be unified.
Note that the following response is generated in sclrepl
command line interface.
[Error] cannot unify types `numeric` and `string`, where the first is declared here
REPL:0 | rel edge = {(0, 1), (0, "1")}
| ^
and the second is declared here
REPL:0 | rel edge = {(0, 1), (0, "1")}
| ^^^
For more information on values and types, please refer to the next section
Rules
Rules are the fundamental to computation in Scallop.
Each rule defines the value and data flowing from some relation to another relation.
In the following program, we have defined a few facts for the edge
relation.
On the second line, we have defined that, for each edge (a, b)
, there is also a path (a, b)
.
We note that here, a
and b
are variables instead of constants as we have with defining facts.
During computation, the two facts in edge
will populate the path
relation.
This way, we have defined a rule for the path
, which is executed during computation.
rel edge = {(0, 1), (1, 2)}
rel path(a, b) = edge(a, b) // (0, 1), (1, 2)
In this section, we talk about how we write rules in Scallop and how intricate computation can be done through it.
Syntax
In general, the basic rules in Scallop are of the form
RULE ::= rel ATOM = FORMULA
FORMULA ::= ATOM
| not ATOM
| CONSTRAINT
| AGGREGATION
| FORMULA and FORMULA
| FORMULA or FORMULA
| ( FORMULA )
For each rule, we name the atom on the left to be the head of the rule, and the formula on the right to be the body. We read it from right to left: when the body formula holds, the head also holds. The formula might contain atoms, negated atoms, aggregations, conjunction, disjunction, and a few more constructs. For this section, we focus on simple (positive) atom, constraints, and their conjunctions and disjunctions. We will leave the discussion of negation and aggregation to the next sections.
Atom
Simple atoms are of the form RELATION(ARG_1, ARG_2, ...)
.
Similar to facts, we have the relation name followed by a tuple of numerous arguments.
Now, the arguments can be of richer forms, involving variables, constants, expressions, function calls, and many more.
Considering the most basic example from above:
rel path(a, b) = edge(a, b)
We have two variables a
and b
grounded by the edge
relation.
This means we are treating the variables a
and b
as source of information, which can be propagated to the head.
In this example, the head also contains two variables, both being grounded by the body.
Therefore the whole rule is well formed.
In case the head variables are not grounded by the body, such as the following,
rel path(a, c) = edge(a, b)
we would get an error that looks like the following:
[Error] Argument of the head of a rule is ungrounded
REPL:1 | rel path(a, c) = edge(a, b)
| ^
The error message points us to the variable c
that has not being grounded in the body.
For basic atoms, such as the ones that the user has defined, can be used to directly ground variables which are directly arguments of the atoms.
They can be used to ground other variables or expressions.
In the following example, although the rule itself might not make any sense, the variable a
is used to ground the expression a + 1
.
Therefore, the rule is completely valid.
rel output_relation(a, a + 1) = input_relation(a)
In certain cases, expressions can be used to bound variables as well!
rel output_relation(a, b) = input_relation(a, b + 1)
In the above example, the expression b + 1
can be used to derive b
, and thus making the variable b
grounded.
However, this might not be true for other expressions:
rel output_relation(b, c) = input_relation(b + c) // FAILURE
The input_relation
can ground the expression b + c
directly, however, the two arguments b
and c
cannot be derived from their sum, as there are (theoretically) infinite amount of combinations.
In this case, we will get a compilation failure.
There can be constraints present in atoms as well. For example, consider the following rule:
rel self_edge(a) = edge(a, a)
The atom edge(a, a)
in the body grounds only one variable a
.
But the pattern is used to match any edge that goes from a
and to a
itself.
Therefore, instead of grounding two values representing the "from" and "to" of an edge
, we are additionally posing constraint on the type of edge that we are matching.
Conceptually, we can view the above rule as the following equivalent rule:
rel self_edge(a) = edge(a, b) and a == b
where there is an additional constraint posed on the equality of a
and b
.
We are going to touch on and
and constraints in the upcoming sections.
Disjunction (Or)
The body formula can contain logical connectives such as and
, or
, not
, and implies
, used to connect basic formulas such as Atom.
In the following example, we are defining that if a
is b
's father or mother, then a
is b
's parent:
rel parent(a, b) = father(a, b)
rel parent(a, b) = mother(a, b)
In this program, we have divided the derivation of parent
into two separate rules, one processing the father
relationship and the other processing the mother
relationship.
This natually form a disjunction (or), as the derivation of parent
can come from 2 disjunctive sources.
Note that in Scallop (or Datalog in general), the ordering of the two rules does not matter.
Therefore, given that
rel father = {("Bob", "Alice")}
rel mother = {("Christine", "Alice")}
we can derive that the parent
relation holding two tuples, ("Bob", "Alice")
and ("Christine", "Alice")
.
The above program can be rewritten into a more compact form that looks like the following:
rel parent(a, b) = father(a, b) or mother(a, b)
// or
rel parent(a, b) = father(a, b) \/ mother(a, b)
We have used an explicit or
(\/
) keyword to connect the two atoms, father(a, b)
and mother(a, b)
.
The \/
symbol, which is commonly seen in the formal logics as the symbol vee (\(\vee\)), is also supported.
Notice that written in this way, each branch of the disjunction need to fully bound the variables/expressions in the head.
Conjunction (And)
To demonstrate the use of and
, let's look at the following example computing the relation of grandmother
based on father
and mother
:
rel grandmother(a, c) = mother(a, b) and father(b, c)
// or
rel grandmother(a, c) = mother(a, b) /\ father(b, c)
Notice that the symbol /\
is a replacement for the and
operator, which resembles the wedge (\(\wedge\)) symbol seen in formal logics.
As can be seen from the rule, the body grounds three variables a
, b
, and c
.
The variables a
and b
comes from mother
and the variables b
and c
comes from father
.
Notice that there is one variable, b
, in common.
In this case, we are joining the relation of mother
and father
on the variable b
.
Constraints
Rule body can have boolean constraints. For example, the conjunctive rule above can be re-written as
rel grandmother(a, c) = mother(a, b) and father(bp, c) and b == bp
Here, we are posing an equality (==
) constraint on b
and bp
.
Normally, constraints are such kind of binary expressions involving predicates such as
- equality and inequality (
==
and!=
) - numerical comparisons (
<
,>
,<=
, and>=
)
Other constructs
There are other constructs available for defining rules, which we continue to discuss in detail in other sections:
Traditional Datalog Syntax
If you are familiar with traditional Datalog, you can have it by swapping the =
with :-
, and the and
to ,
For example, the rule for defining grandmother
can be rewritten as
rel grandmother(a, c) :- mother(a, b), father(b, c)
Values and Types
Scallop has a built-in set of basic value types, following Rust's naming convention.
From there, we have types such as Symbol
, DateTime
, Entity
, and Tensor
, which are special types to Scallop.
Type | Description |
---|---|
i8 | Signed-integer, 8-bit |
i16 | Signed-integer, 16-bit |
i32 | Signed-integer, 32-bit |
i64 | Signed-integer, 64-bit |
i128 | Signed-integer, 128-bit |
isize | Signed size; its size is dependent on the system |
u8 | Unsigned-integer, 8-bit |
u16 | Unsigned-integer, 16-bit |
u32 | Unsigned-integer, 32-bit |
u64 | Unsigned-integer, 64-bit |
u128 | Unsigned-integer, 128-bit |
usize | Unsigned size; its size is dependent on the system |
f32 | Floating-point number, 32-bit |
f64 | Floating-point number, 64-bit |
bool | Boolean |
char | Character |
String | Variable-length string |
Symbol | Symbol |
DateTime | Date and time |
Duration | Duration |
Entity | Entity |
Tensor | Tensor |
Integers
Integers are the most basic data-type in Scallop.
If not specified, the default integer type that the system will pick is the i32
(signed integer 32-bit) type:
rel edge = {(0, 1), (1, 2)} // (i32, i32)
If an unsigned integer type is specified but a negative number is used in the declared facts, a type inference error will be raised.
We demonstrate this in the sclrepl
environment:
scl> type my_edge(usize, usize)
scl> rel my_edge = {(-1, -5), (0, 3)}
[Error] cannot unify types `usize` and `signed integer`, where the first is declared here
REPL:0 | type my_edge(usize, usize)
| ^^^^^
and the second is declared here
REPL:1 | rel my_edge = {(-1, -5), (0, 3)}
| ^^
Primitive operations that can be used along with integers are
- Comparators:
==
(equality)!=
(inequality)>
(greater-than)>=
(greater-than-or-equal-to)<
(less-than)<=
(less-than-or-equal-to)
- Arithmetic operators:
+
(plus)-
(minus/negate)*
(mult)/
(div)%
(mod)
All of the above operations need to operate on two integers of the same type.
For instance, you cannot compare an i32
value with a usize
value.
Floating Point Numbers
Floating point numbers are supported in Scallop as well. The following example shows the definition of student and their class grades:
type student_grade(name: String, class: String, grade: f32)
rel student_grade = {
("alice", "cse 100", 95.2),
("bob", "cse 100", 90.8),
}
It is possible derive special floating points such as inf
and -inf
, though we cannot declare such values directly.
For the floating point that is nan
(not-a-number), we will omit the whole fact from the database to maintain sanity.
Specifically, the derivation of nan
is treated as a failure of foreign functions, which we explain in detail here.
All the basic operations that can work on integers would be able to work for floating point numbers as well.
Boolean
Scallop allows the use of boolean values (true
and false
).
type variable_assign(String, bool)
rel variable_assign = {("a", true), ("b", false)}
We support the following boolean operations:
- Comparisons
==
(equality)!=
(inequality)
- Logical operations
!
(unary negate)&&
(binary and)||
(binary or)^
(binary xor)
For example, we can have the following code
rel result(a ^ b) = variable_assign("a", a) and variable_assign("b", b) // true
Character
Scallop allows definition of characters such as 'a'
, '*'
.
They are single-quoted, and can contain escaped characters such as '\n'
(new-line) and '\t'
(tab).
type my_chars = {(0, 'h'), (1, 'e'), (2, 'l'), (3, 'l'), (4, 'o')}
Comparisons operations ==
and !=
are available for characters.
String
Scallop support variable length strings of the type String
.
Strings are declared using the double quote ("
), and can contain escaped characters such as \n
and \t
.
rel greeting = {"Hello World"}
Strings can certainly be compared using ==
and !=
.
The main ways for interacting with strings are through foreign functions such as $string_length
, $substring
, $string_concat
, and etc.
Please refer to the foreign functions section for more information.
Symbols
Symbols are internally registered strings.
They are most commonly created through loading from external files.
But they can still be specified using the s
-quoted-string notation:
rel symbols = {s"NAME", s"AGE", s"GENDER"}
DateTime and Duration
DateTime
and Duration
are natively supported data structures by Scallop.
We commonly specify DateTime
and Duration
using their string form.
In the following example, we specify the DateTime
values using the t
-quoted-string notation (t
represents time):
rel event_dates = {("enroll", t"2020-01-01"), ("finish", t"2020-03-01")}
The dates will be all transformed into UTC time-zone.
When the date part is specified and the time is not specified, we will fill the time 00:00:00 UTC
.
When the time is specified but the date is not, we will use the current date when the program is invoked.
Any reasonable date-time format are acceptable, common ones include
t"2019-11-29 08:08:05-08"
t"4/8/2014 22:05"
t"September 17, 2012 10:09am"
t"2014/04/2 03:00:51"
t"2014年04月08日"
Duration
s can be specified using the d
-quoted-string notation (d
represents duration):
rel event_durations = {("e1", d"12 days"), ("e2", d"15 days 20 seconds")}
The string can contain numbers followed by their units. When specifying durations, the following units are accepted:
- nanoseconds (
n
) - microseconds (
usecs
) - milliseconds (
msecs
) - seconds (
secs
) - minutes (
m
) - hours (
h
) - days (
d
) - weeks (
w
) - months (
M
) - years (
y
)
We can operate between Duration
and DateTime
using simple operations such as +
and -
:
DateTime + Duration ==> DateTime
Duration + Duration ==> Duration
DateTime - DateTime ==> Duration
DateTime - Duration ==> DateTime
Duration - Duration ==> Duration
Entity
Entity values are 64-bit unsigned integers created through hashing. They are used to represent pointers of created entities. They cannot be directly created. Rather, they are managed by Scallop through the creation of entities. For example,
type List = Nil() | Cons(i32, List)
const MY_LIST = Cons(1, Cons(2, Nil()))
rel input_list(MY_LIST)
query input_list
The result is then
input_list: {(entity(0x4cd0d9e6652cdfc7))}
Please refer to this section for more informaiton on algebraic data types and entities.
Type Conversions
In Scallop, types can be converted using the as
operator.
For example, we can have
rel numbers = {1, 2, 3, 4, 5}
rel num_str(n as String) = numbers(n)
to derive the numbers
to be {"1", "2", "3", "4", "5"}
.
In general, we can have all numbers castable to each other.
We also have every type being castable to String
.
For converting String
to other types, it undergoes a parsing process.
When the parsing does not go through, no result will be returned.
Writing Queries
Consider the following example of classes, students, and enrollments, and that we want to compute the number of students who have enrolled in at least one CS class.
// There are three classes
rel classes = {0, 1, 2}
// Each student is enrolled in a course (Math or CS)
rel enroll = {
("tom", "CS"), ("jenny", "Math"), // Class 0
("alice", "CS"), ("bob", "CS"), // Class 1
("jerry", "Math"), ("john", "Math"), // Class 2
}
// Count how many student enrolls in CS course
rel num_enroll_cs(n) = n := count(s: enroll(s, "CS"))
Normally, executing a program would result in scli
outputting every single relation.
classes: {(0), (1), (2)}
num_enroll_cs: {(3)}
enroll: {("alice", "CS"), ("bob", "CS"), ("jenny", "Math"), ...}
However, we might only be interested in the relation named num_enroll_cs
.
In this case, we write a query using the query
keyword:
query num_enroll_cs
In this case, only the relation num_enroll_cs
will be output:
num_enroll_cs: {(3)}
Atomic Query
One can also write atomic query if we just want to get a part of the relation. For instance, consider the fibonacci example:
type fib(x: i32, y: i32)
rel fib = {(0, 1), (1, 1)}
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) and x <= 10
query fib(8, y) // fib(8, y): {(8, 34)}
In this case, we are just looking at the 8-th fibonacci number, which is 34.
Recursive Rules
One very powerful programming construct with Scallop is to declaratively define recursion. Inside of a rule, if a relational predicate appearing in the head appears in the body, the predicate is recursive. For example, the definition of fibonacci number is recursive:
\[ \text{fib}(x) = \left\{ \begin{array}{ll} \text{fib}(x - 1) + \text{fib}(x - 2), & \text{if}~ x > 1 \\ 1, & \text{otherwise} \end{array} \right. \]
Written in Scallop, we encode the function fib
as a binary relation between the integer input and output:
type fib(x: i32, y: i32)
We can define the base cases for \(\text{fib}(0)\) and \(\text{fib}(1)\):
rel fib = {(0, 1), (1, 1)}
Now it comes to the definition of recursive cases, which peeks into \(\text{fib}(x - 1)\) and \(\text{fib}(x - 2)\) and sums them.
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) // infinite-loop
However, when actually executing this, it would not terminate as we are attempting to compute all fibonacci numbers, and there are infinite amount of them.
In order to stop it, we can temporarily add a constraint to limit the value of x
, so that we only compute the fibonacci number up to 10:
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) and x <= 10
At the end, we would get a the fib
relation to contain the following facts:
fib: {(0, 1), (1, 1), (2, 2), (3, 3), (4, 5), (5, 8), (6, 13), (7, 21), (8, 34), (9, 55), (10, 89)}
As suggested by the result, the 10-th fibonacci number is 89.
Case Study: Graphs and Transitive Closure
Following is one of the most widely known Datalog program: computing the path
s inside of a graph.
By definition, an edge or a sequence of edges constitute a path.
This is reflected by the following two rules:
type edge(i32, i32)
rel path(a, b) = edge(a, b)
rel path(a, c) = path(a, b) and edge(b, c)
The first line states that an edge can form a path.
The second line states that a path, connected to a new edge, forms a new path.
As can be seen from the second line, the relation path
appears in both the body and the head, making it a recursive relation.
In this example, suppose we have
rel edge = {(0, 1), (1, 2)}
we would get the set of paths to be
path: {(0, 1), (0, 2), (1, 2)}
Notice that the path (0, 2)
is a compound path obtained from joining the two edges (0, 1)
and (1, 2)
.
Relation Dependency
Given a rule with head and body, we say that the predicate appearing in the head depends on the predicates of the atoms appearing in the body. This forms a dependency graph. The above edge-path example would have the following dependency graph:
edge <--- path <---+
| |
+------+
The relation edge
depends on nothing, while path
depends on edge
and also path
itself.
This forms a loop in the dependency graph.
In general, if a program has a dependency graph with a loop, then the program requires recursion.
Any relation that is involved in a loop would be a recursive relation.
Notice that we are mostly talking about positive dependency here, as the atoms in the body of the rule are positive atoms (i.e., without annotation of negation or aggregation). In more complex scenarios, there will be negation or aggregation in a rule, which we explain in detail in future sections.
Fixed-point Iteration
The recursion in Scallop happens in fixed-point iteration. In plain terms, the recursion will continue until there is no new fact being derived in an iteration. In hind-sight, the whole Scallop program is executed in a loop. Within one iteration, all of the rules inside of the program are executed. Let us digest the actual execution happens when executing the above edge-path program:
rel edge = {(0, 1), (1, 2), (2, 3)}
rel path(a, b) = edge(a, b) // rule 1
rel path(a, c) = path(a, b) and edge(b, c) // rule 2
Before the first iteration, the edge
has already been filled with 3 facts, namely (0, 1)
, (1, 2)
, and (2, 3)
.
But the path
is empty.
Let's now go through all the iterations:
Iter 0: path = {}
Iter 1: path = {(0, 1), (1, 2), (2, 3)}
Δpath = {(0, 1), (1, 2), (2, 3)} // through applying rule 1
Iter 2: path = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)}
Δpath = {(0, 2), (1, 3)} // through applying rule 2
Iter 3: path = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)}
Δpath = {(0, 3)} // through applying rule 2
Iter 4: path = {(0, 1), (1, 2), (2, 3), (0, 2), (1, 3), (0, 3)}
Δpath = {}
In the above note, we also include Δpath
, which contains the new paths derived during the current iteration.
As can be seen, during iteration 1, paths of length 1 are derived; during iteration 2, paths of length 2 are derived.
During iteration 4, there is no more path to be derived, and therefore the Δpath
is empty.
This tells us that no new facts are derived and the whole fixed-point iteration is stopped, giving us the final result.
Infinite Relations
As we have described in the fixed-point iteration, the recursion will continue until no more fact is derived. However, we are capable of writing rules that are infinite. As shown in the first example:
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2)
gives you an infinite relation as there can always be a new x
to be derived.
In this case, the fixed-point iteration never stops.
The root cause of this is Scallop's support for value creationg, i.e., the creation of new values.
Typically, database systems work in closed-world assumption, that is, all the items being reasoned about are already there.
No computation is done on arbitrarily created values.
But in the above example, we have derived x
from the grounded expression x - 1
, hence creating a new value.
Typically, the way to resolve this is to create bounds on the created values. For example, the rule
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) and x <= 10
restricts that x
cannot be greater than 10.
This makes the fixed-point iteration to stop after around 10 iterations.
Other way of getting around with this involve the use of Magic-Set Transformations, which we describe its equivalent in Scallop in a later section.
Negations
Scallop supports negation to be attached to atoms to form negations. In the following example, we are trying to obtain the set of people with no children:
rel person = {"bob", "alice", "christine"} // There are three persons of interest
rel father = {("bob", "alice")} // Bob is Alice's father
rel mother = {("alice", "christine")} // Alice is Christine's mother
rel has_no_child(n) = person(n) and not father(n, _) and not mother(n, _)
The last rule basically says that if there is a person n
who is neither anyone's father nor anyone's mother then the person n
has no child.
This is indeed what we are going to obtain:
has_no_child: {("christine",)}
It is clear that negations are very helpful in writing such kind of the rules. However, there are many restrictions on negations. We explain in detail such restrictions.
Negation and Variable Grounding
If we look closely to the rule of has_no_child
above, we will find that there is an atom person(n)
being used in the body.
Why can't we remove it and just say "if one is neither father nor mother then the one has no child"?
rel has_no_child(n) = not father(n, _) and not mother(n, _) // Error: variable `n` is not grounded
The problem is with variable grounding.
For the variable n
to be appeared in the head, there is no positive atom that grounds it.
All we are saying are what n
is not, but not what n
is.
With only "what it is not", it could be literally anything else in the world.
Therefore, we need to ground it with a positive atom such as person(n)
.
With this rule, we have basically
Stratified Negation
Expanding upon our definition of dependency graph,
if a predicate occurs in a negative atom in a body,
we say that the predicate of the rule head negatively depends on this predicate.
For example, the above has_no_child
example has the following dependency graph.
Notice that we have marked the positive (pos
) and negative (neg
) on each edge:
person <--pos-- has_no_child --neg--> father
|
+-----neg-----> mother
Scallop supports stratified negation, which states that there is never a loop in the dependency graph which involves a negative dependency edge. In other words, if there exists such a loop, the program will be rejected by the Scallop compiler. Consider the following example:
rel is_true() = not is_true() // Rejected
The relation is_true
negatively depends on the relation is_true
itself, making it a loop containing a negative dependency edge.
The error message would show that this program "cannot be stratified".
If we draw the dependency graph of this program, it look like the following:
is_true <---+
| |
+--neg---+
Since there is a loop (is_true -> is_true
) and the loop contains a negative edge, this program cannot be stratified.
The reason that stratified negation is named such way is that, if there is no negative dependency edge in a loop, the whole dependency can be decomposed in to strongly connected components, where inside of each strongly connected component (SCC), there is no negative dependency. In other words, the negation has been stratified, so that the negative edge can only happen between SCCs. We call each SCC a stratum, and the collection of them a strata. Any non-recursive program has a dependency graph forming a Directed Acyclic Graph (DAG), and is therefore always stratifiable.
The following program, although containing both negation and recursion, can be stratified:
rel path(a, b) = edge(a, b) and not sanitized(b)
rel path(a, c) = path(a, b) and edge(b, c) and not sanitized(b)
For it, the following dependency graph can be drawn:
sanitized <--neg-- path <----+
| | |
edge <--pos---+ +--pos-+
In this program, we have three SCCs (or strata):
- Stratum 1:
{edge}
- Stratum 2:
{sanitized}
- Stratum 3:
{path}
Negative dependency only occurs between stratum 2 and 3. Therefore, the program can be accepted.
Aggregations
Aggregations in Scallop can be viewed as operations that aggregates over multiple facts. Such operations include counting, summation and product, finding min and max, and logical quantifiers such as exists and forall. Aggregations appear in the body of a rule, and can be nested for abbrevity.
As a concrete example, we look at a program which counts over a set of people:
rel person = {"alice", "bob", "christine"}
rel num_people(n) = n := count(p: person(p)) // n = 3
In general, we use the following syntax for aggregation formulas.
R1, R2, ... := AGGREGATOR(V1, V2, ...: FORMULA (where U1, U2, ...: FORMULA)?)
We name R1, ...
to be the aggregation result variable, V1, ...
to be the binding variable, and the formula inside of the aggregation the body.
When the where
keyword is used, we have the aggregation associated with explicit group-by clause.
Here, we call the set of variables U1, ...
as group-by variables.
The formula under the where
clause is named the group-by body.
The binding variables need to be fully grounded by the body formula, and the group-by variables (if presented) need to also be fully grounded by the group-by body.
For different types of aggregation, the AGGREGATOR
might also change and annotated with different information.
The number of result variables, the number of binding variables, and their types differ for each aggregation.
Here is a high-level overview of each supported aggregator and their configurations.
In the table, ...
is used to denote an arbitrary amount of variables.
Aggregator | Binding Variables | Result Variables |
---|---|---|
count | Any... | usize |
sum | Number | the same as the binding variable |
prod | Number | the same as the binding variable |
min | Any | the same as the binding variables |
max | Any | the same as the binding variables |
exists | Any... | bool |
forall | Any... | bool |
Below, we elaborate on each aggregators and describe their usages.
Count
To count the number of facts, we can use the count
aggregator.
Just repeating the examples shown in the beginning:
rel person = {"alice", "bob", "christine"}
rel num_people(n) = n := count(p: person(p)) // n = 3
We are counting the number of persons appear in the person
relation.
To be more concrete, let's read out the aggregation formula:
We count the number of
p
such thatp
is aperson
, and assign the result to the variablen
.
For count
, there could be arbitrary (> 0) number of binding variables which can be typed arbitrarily.
It will only have a single result variable which is typed usize
.
For example, you may count the number of edge
s:
rel num_edges(n) = n := count(a, b: edge(a, b))
Here, we have two binding variables a
and b
, meaning that we are counting the number of distinct pairs of a
and b
.
Note that we can use the syntax sugar for aggregation to omit the repeated n
:
rel num_edges = count(a, b: edge(a, b))
Implicit Group-By
With group-by
, we may count the number of facts under a pre-defined group.
Consider the example where there is a scene with differet colored objects,
rel obj_color = {(0, "red"), (1, "red"), (2, "blue"), (3, "red")}
rel num_obj_per_color(col, num) = num := count(obj: obj_color(obj, col))
As suggested by the facts inside of obj_color
, there are 4
objects indexed using 0, 1, 2, 3
, each associated with a different color.
The object #0, #1, and #3 are red
and the object #2 is blue
.
Therefore, we will get 3 red objects and 1 blue object, as computed in the result of num_obj_per_color
:
num_obj_per_color: {("blue", 1), ("red", 3)}
Let's analyze the rule in detail.
We find that we are counting over obj
such that the object obj
has a certain color col
.
But col
is also a variable occurring in the head of the rule.
This is an implicit group-by, in that the variable col
is being used as an implicit group-by variable.
That is, we are conditioning the counting procedure under each group that is defined by the col
variable.
Since there are two colors appearing in the obj_color
relation, we are performing count for each of the two groups.
In general, if a variable is positively grounded in the body and appear in the head of a parent rule, we call the variable an implicit group-by variable.
Explicit Group-By
In the above example, there is no green colored object. However, how do we know that the number of green object is 0? The result does not seem to address this problem.
The missing piece is a domain of the possible groups. Without explicitly setting the domain, Scallop could only search inside of the database on possible groups. However, we can explicitly tell Scallop about what are the groups. Consider the following rewrite of the above program:
rel colors = {"red", "green", "blue"}
rel obj_color = {(0, "red"), (1, "red"), (2, "blue"), (3, "red")}
rel num_obj_per_color(col, num) = num := count(obj: obj_color(obj, col) where col: colors(col))
With the where
clause, we have explicitly declared that col
is a group-by variable which is grounded by the colors
relation.
If we look into the colors
relation, we find that there are three possible colors that we care about, red, green, and blue.
In this case, we will consider "green"
as the third group and try to count the number of green objects -- which there are 0:
num_obj_per_color: {("blue", 1), ("green", 0), ("red", 3)}
Sum and Product
We can use the aggregator of sum and product to aggregate multiple numerical values. Consider the following example of sales:
rel sales = {("alice", 1000.0), ("bob", 1200.0), ("christine", 1000.0)}
We can compute the sum of all the sales:
rel total_sales(s) = s := sum[p](sp: sales(p, sp)) // 3200.0
// or
rel total_sales = sum[p](sp: sales(p, sp)) // 3200.0
Notice that the result type of s
is the same as the type of the binding variable sp
, which is f32
as indicated by the decimals in the definition of sales
.
Here, the argument variable p
is necessary since it is the key to index each sale number.
The above rule body is equivalent to the following math formula:
\[ s = \sum_p \text{sale}_p \]
If we do not use the argument variable, we get the following:
rel total_sales_wrong(s) = s := sum(sp: sales(p, sp)) // 2200.0, since the two 1000.0 will be deduplicated without its key
The product aggregator prod
can be used in a similar manner as sum
.
Min, Max, Argmin, and Argmax
Scallop can compute the minimum or maximum among a set of values. In the following example, we find the maximum grade of an exam:
rel exam_grades = {("a", 95.2), ("b", 87.3), ("c", 99.9)}
rel min_score(m) = m := max(s: exam_grades(_, s)) // 99.9
// or, succinctly
rel min_score = max(s: exam_grades(_, s)) // 99.9
The number (and types) of binding variables can be arbitrary, but the result variables must match the binding variables.
In the above case, since s
is of type f32
, m
will be of type f32
as well.
It is also possible to get argmax/argmin. Suppose we want to get the person (along with their grade) who scored the best, we write:
rel best_student(n, s) = (n, s) := max[n](s: exam_grades(n, s))
// or, succinctly
rel best_student = max[n](s: exam_grades(n, s))
Here, we are still finding the maximum score s
, but along with max
we have specified the "arg" ([n]
) which associates with the maximum score.
We call n
an arg variable for min
/max
aggregator.
The arg variable is grounded by the aggregation body, and can be directly used in the head of the rule.
If we do not care about the grade and just want to know who has the best grade, we can use wildcard _
to ignore the result variable, like
rel best_student(n) = (n, _) := max[n](s: exam_grades(n, s))
Alternatively, we can also use argmax
:
rel best_student(n) = n := argmax[n](s: exam_grades(n, s))
// or, succinctly
rel best_student = argmax[n](s: exam_grades(n, s))
Exists and Forall
Logical quantifier such as exists and forall can also be encoded as aggregations. They will return value of boolean as the aggregation result.
Existential Quantifier
Let us start with discussing the easier of the two, exists
.
Technically, all variables in the body of Scallop rule are existentially quantified.
We can use exists
aggregation to make it explicit.
For example, we can check if there exists an object that is blue:
rel obj_color = {(0, "red"), (1, "green")}
rel has_blue(b) = b := exists(o: obj_color(o, "blue"))
Specifically, we are checking "if there exists an object o
such that its color is blue
".
The result is being assigned to a variable b
.
Since there is no blue object, we will get a result of has_blue(false)
.
In case when we just want the result boolean to be true
or false
, we can omit the result variables.
For example, we can rewrite the recursive case of edge-path transitive closure as
rel path(a, c) = exists(b: path(a, b) and edge(b, c))
We note that this is just a syntax sugar equivalent to the following:
rel path(a, c) = r := exists(b: path(a, b) and edge(b, c)) and r == true
When we want to know the inexistence of something, we can do
rel no_red() = not exists(o: obj_color(o, "red"))
Note that there can be arbitrary amount of binding variables.
Universal Quantifier
We can also have universal quantifier forall
.
For this, there is a special requirement for universal quantification: the body formula has to be an implies
(=>
) formula.
This restriction is enforced so that all the binding variables have bounds being specified on the left-hand-side of the implies
formula.
In the following example, we check if all the objects are spherical:
type Shape = CUBE | SPHERE | CONE | CYLINDER
rel object = {0, 1, 2}
rel obj_shape = {(0, CUBE), (1, SPHERE), (2, SPHERE)}
rel target(b) = b := forall(o: object(o) implies obj_shape(o, SPHERE))
Notice that we have a relation which defines the domain of object
, suggesting that there are just 3 objects for us to work with.
In the aggregation, we are checking "for all o
such that o
is an object, is the object a sphere?"
The result is stored in the variable b
and propagated to the target
relation.
The reason we need to have an implies formula is that we need to use the left-hand-side of implies
to give bounds to the universally quantified variables.
Scallop cannot reason about open domain variables.
Note that similar to exists
, we can also remove the result variable.
The following program derives a boolean (arity-0) relation target
denoting whether all the red objects are cubes:
type Shape = CUBE | SPHERE | CONE | CYLINDER
type Color = RED | GREEN | BLUE
rel obj_shape = {(0, CUBE), (1, SPHERE), (2, SPHERE)}
rel obj_color = {(0, RED), (1, GREEN), (2, GREEN)}
rel target() = forall(o: obj_color(o, RED) implies obj_shape(o, CUBE)) // {()}
Here, we directly use obj_color
to serve as the left-hand-side of the implies
.
There will be one empty tuple being derived, suggesting that the statement is true.
String Join
If you have multiple facts containing strings and you want to join them together, you can use the string_join
aggregator:
rel R = {"hello", "world"}
rel P1(n) = n := string_join(s: R(s)) // P1("helloworld")
rel P2(n) = n := string_join<" ">(s: R(s)) // P2("hello world")
In the above example,
we can either directly join, producing the string "helloworld",
or join with separator " "
, producing the string "hello world".
Note that the order of the strings in the joined string is determined by the strings.
Here, "hello"
starts with "h"
, which is smaller than the "w"
in "world"
, therefore occurring before "world"
.
If you want to specify an explicit order, use the argument variable:
rel R = {(2, "hello"), (1, "world")}
rel P(n) = n := string_join<" ">[i](s: R(i, s)) // P("world hello")
Since we have specified the variable i
to be the argument of string_join
, it serves to order the tuples.
Here, we have (1, "world")
and (2, "hello")
, so the joined string will be "world hello"
instead of "hello world"
.
Declaring Constants
We can declare constants and give it names. The general syntax is the following:
const NAME (: TYPE)? = CONSTANT
For example, we can define the value of PI
:
const PI = 3.1415926
Notice that here we have not specified the type of PI
.
By default, a float value would resort to the place where the constant is used.
If we want to specify a non-default type, we can do
const PI: f64 = 3.1415926
We can also declare multiple constants at a time:
const LEFT = 0, UP = 1, RIGHT = 2, DOWN = 3
Enum Types
We sometimes want to define enum types which contain constant variables.
Common examples include RED
, GREEN
, and BLUE
under the Color
type, and LEFT
, RIGHT
, UP
under the Action
type.
These can be achieved by defining enum types:
type Color = RED | GREEN | BLUE
type Action = LEFT | UP | RIGHT | DOWN
Internally, the values such as RED
and UP
are unsigned integer constants.
If not specified, the values start from 0 and goes up 1 at a time.
For example, given the type definition above, RED = 0
, GREEN = 1
, and BLUE = 2
.
For Action
s, LEFT = 0
, UP = 1
, and etc.
Notice that even when Color
and Action
are different types, their values can overlap.
One can specify the values of these enum variants by attaching actual numbers to them. In the following example, we have explicitly assigned three values to the colors.
type Color = RED = 3 | GREEN = 5 | BLUE = 7
We can also just set a few of those:
type Color = RED | GREEN = 10 | BLUE
In this case, RED = 0
, GREEN = 10
, and BLUE = 11
.
Notice how blue's value is incremented from GREEN
.
Displaying Constants
Constants are just values and many of them are integer values. They are not explicitly associated with any symbols. If you want to display them correctly, we advise you create auxilliary relations storing the mapping from each constant to its string form. For example, we can have
rel color_to_string = {(RED, "red"), (GREEN, "green"), (BLUE, "blue")}
In this case, following the result with color_to_string
relation will display their desired meanings properly.
Algebraic Data Type and Entities
Algebraic data types are powerful programming constructs that allows user to define custom data structures and variants.
Consider a traditional functional definition of a List
:
type IntList = Nil()
| Cons(i32, List)
We are saying that a IntList
can be one of two variants, Nil
and Cons
:
Nil
denotes the end of a list;Cons
contains the currenti32
integer and a continuation of the list.
In this representation, we can represent a list like [1, 2, 3]
with Cons(1, Cons(2, Cons(3, Nil())))
.
This is indeed what we can write in Scallop.
We can declare such a list as a constant:
const MY_LIST = Cons(1, Cons(2, Cons(3, Nil())))
In general, we call the type definition of such data structure Algebraic Data Type definitions, or ADT definitions.
The name Entity is used to refer to objects of such data types.
In the example above, the constant MY_LIST
is an entity of the ADT named IntList
.
In this section, we describe in detail the definition and use of ADT and Entities. We also touch on the internals.
Defining Algebraic Data Types (ADT)
We use the following syntax to define ADTs:
type TYPE_NAME = VARIANT_NAME(ARG_TYPE_1, ARG_TYPE_2, ...) | ...
An ADT named TYPE_NAME
is defined to have multiple (at least 2) named variants with VARIANT_NAME
.
Each variant holds a tuple of values typed by ARG_TYPE_1
, ARG_TYPE_2
, etc.
We call variants that have no argument terminal variants.
Parenthesis are still needed for those variants.
Please note that there cannot be duplicated variant names, either within the same ADT or different ADTs. For example, the following code would result in compilation failure:
type IntList = Cons(i32, IntList) | Nil()
type BoolList = Cons(bool, BoolList) | Nil() // Failure: Cons and Nil are already defined
Currently, ADTs do not support generics.
In the above case, the IntList
and BoolList
needs to be defined separately with differently named variants.
Using ADT to represent arithmetic expressions
Common data that can be expressed through ADT could be structured expressions. The following definition describes the abstract syntax tree (AST) of simple arithmetic expressions:
type Expr = Int(i32) // An expression could be a simple integer,
| Add(Expr, Expr) // a summation of two expressions
| Sub(Expr, Expr) // a substraction of two expressions
The following code encodes a simple expression
// The expression (1 + 3) - 5
const MY_EXPR = Sub(Add(Int(1), Int(3)), Int(5))
Using ADT to represent data structures
Data structures such as binary trees can also be represented:
type Tree = Node(i32, Tree, Tree) | Nil()
Here, Node(i32, Tree, Tree)
represents a node in a tree holding three things:
an integer (i32
), a left sub-tree Tree
, and a right sub-tree Tree
.
The other variant Nil
represents an empty sub-tree.
In this encoding, Node(5, Nil(), Nil())
would be representing a leaf-node holding a number 5.
The following code encodes a balanced binary search tree:
// 3
// / \
// 1 5
// / \ / \
// 0 2 4 6
const MY_TREE =
Node(3,
Node(1,
Node(0, Nil(), Nil()),
Node(2, Nil(), Nil()),
),
Node(5,
Node(4, Nil(), Nil()),
Node(6, Nil(), Nil()),
)
)
Working with Entities
Entities are most commonly created as constants using the const
keyword.
Let us revisit the List
example and see how we can use the defined constant in our analysis.
type List = Cons(i32, List) | Nil()
const MY_LIST = Cons(1, Cons(2, Cons(3, Nil()))) // [1, 2, 3]
Using Entities in Relations
We can include the constant entities as part of a fact:
rel target(MY_LIST)
query target
As a result of the above program, we are going to get the value of the entity MY_LIST
:
target: {(entity(0xff08d5d60a201f17))}
The value is going to be a 64-bit integer encoded in hex. It is a unique identifier for the created entity.
Note that, identical entities are going to have the same identifier.
In the following example, MY_LIST_1
and MY_LIST_2
are identical, and therefore their hex identifier are the same.
const MY_LIST_1 = Cons(1, Nil()),
MY_LIST_2 = Cons(1, Nil()),
MY_LIST_3 = Cons(2, Nil())
rel lists = {
(1, MY_LIST_1),
(2, MY_LIST_2),
(3, MY_LIST_3),
}
query lists
// lists: {
// (1, entity(0x678defa0a65c83ab)), // Notice that the entity 1 and 2 are the same
// (2, entity(0x678defa0a65c83ab)),
// (3, entity(0x3734567c3d9f8d3f)), // This one is different than above
// }
Decomposing Entities in Rules
To peek into the content of an Entity, we can destruct it using the case
-is
operator.
We look at an example of computing the length of a list:
type length(list: List, len: i32)
rel length(list, 0) = case list is Nil()
rel length(list, l + 1) = case list is Cons(_, tl) and length(tl, l)
We define a recursive relation length
to compute the length of a list.
There are two cases.
When the list is Nil()
, this means the list has already ended.
Therefore the list has a length of 0
For the second case, the list is Cons(_, tl)
.
Here, the length of list is the length of tl
plus 1.
We can then compute the length of a list by query
ing the length
relationship on a constant list.
query length(MY_LIST, l) // l = 3
Case Study: Decomposing Entities for Pretty-Printing
We can look at more examples of using the case
-is
operators.
The following set of rules pretty-prints expressions:
type Expr = Int(i32) | Add(Expr, Expr) | Sub(Expr, Expr)
type to_string(expr: Expr, str: String)
rel to_string(e, $format("{}", i)) = case e is Int(i)
rel to_string(e, $format("({} + {})", a, b)) = case e is Add(e1, e2) and to_string(e1, a) and to_string(e2, b)
rel to_string(e, $format("({} - {})", a, b)) = case e is Sub(e1, e2) and to_string(e1, a) and to_string(e2, b)
Shown in the example, we have written three to_string
rules for pretty-printing the Expr
data structure.
Each rule correspond to handling exactly one of the variants.
For the inductive cases Add
and Sub
, we have the to_string
rule defined recursively so that the sub-expressions are also converted to strings.
For pretty-printing, we have used the $format
foreign function.
At the end, running the following snippet
const MY_EXPR = Sub(Add(Int(3), Int(5)), Int(1))
query to_string(MY_EXPR, s)
would give the following result, suggesting that the pretty-printed expression is ((3 + 5) - 1)
to_string(MY_EXPR, s): {(entity(0xa97605c2703c6249), "((3 + 5) - 1)")}
Case Study: Checking Regular Expressions
With ADT, we can specify the language of regular expressions (regex) with ease.
Let's consider a very simple regex with union (|
) and star (*
), while phrases can be grouped together.
For example, the regex "a*b"
expresses that character a
can be repeated arbitrary amount of time (including 0-times), followed by a single b
.
This regex can be used to match strings like "aaaab"
and "b"
, but not "ba"
.
Let's try to define this regex language in Scallop!
type Regex = Char(char) // a single character
| Star(Regex) // the star of a regex
| Union(Regex, Regex) // a union of two regexes
| Concat(Regex, Regex) // concatenation of two regexes
As can be seen, we have defined 4 variants of this regex language.
With this, our regex "a*b"
can be expressed as follows:
// a*b
const A_STAR_B = Concat(Star(Char('a')), Char('b'))
Now, let's define the actual semantics of this regex language and write a relation matches
to check if the regex matches with a given sub-string.
We first setup the types of such relations.
input_regex
is a unary-relation for holding the regex to be checked against;input_string
is a unary-relation for holding the string to be checked against;matches_substr
is for checking if a sub-regexr
can be matched with the input string betweenbegin
andend
indices, whereend
is exclusive;matches
is a boolean relation telling whether theA_STAR_B
regex matches with the input string or not.
type input_regex(r: Regex)
type input_string(s: String)
type matches_substr(r: Regex, begin: usize, end: usize)
type matches()
The main bulk of the code will then be dedicated to define the matches_substr
relation.
At a high level, we decompose on each type of regex, and match on sub-strings.
The first rule that we are going to write would be for the Char
variant.
rel matches_substr(r, i, i + 1) = case r is Char(c) and input_string(s) and string_chars(s, i, c)
The rule suggests that if the regex r
is a single character c
, then we go into the input string s
and find all the index i
such that its corresponding character is c
.
The matched sub-string would start at index i
and end at index i + 1
.
Note that the string_chars
relation is a foreign predicate that decomposes the string into characters.
Similarly, we can write the rules for other variants:
// For star; it matches empty sub-strings [i, i) and recursively on sub-regex
rel matches_substr(r, i, i) = case r is Star(_) and input_string(s) and string_chars(s, i, _)
rel matches_substr(r, b, e) = case r is Star(r1) and matches_substr(r, b, c) and matches_substr(r1, c, e)
// For union; any string that matches left or right sub-regex would match the union
rel matches_substr(r, b, e) = case r is Union(r1, r2) and matches_substr(r1, b, e)
rel matches_substr(r, b, e) = case r is Union(r1, r2) and matches_substr(r2, b, e)
// For concat; we need strings to match in a consecutive matter
rel matches_substr(r, b, e) = case r is Concat(r1, r2) and matches_substr(r1, b, c) and matches_substr(r2, c, e)
Lastly, we add the rule to derive the final matches
relation.
Basically, it checks if the regex matches the start-to-end of the input string
rel matches() = input_regex(r) and input_string(s) and matches_substr(r, 0, $string_length(s))
Let us test the result!
rel input_regex(A_STAR_B)
rel input_string("aaaab")
query matches // {()}
Dynamically Creating Entities
There are cases where we want to create new entities during the deductive process.
This is done through the new
keyword followed by the entity to create.
Suppose we have the definition of List
and some pretty-printing code for it:
type List = Cons(i32, List) | Nil()
rel to_string_2(l, "]") = case l is Nil()
rel to_string_2(l, $format("{}]", i)) = case l is Cons(i, Nil())
rel to_string_2(l, $format("{}, {}", i, ts)) = case l is Cons(i, tl) and case tl is Cons(_, _) and to_string_2(tl, ts)
rel to_string(l, $format("[{}", tl)) = to_string_2(l, tl)
The following example shows that, given an input list l
, we generate a result list Cons(1, l)
.
type input_list(List)
rel result_list(new Cons(1, l)) = input_list(l)
Given an actual list defined as a constant, we will be able to specify that the constant is the input list:
const MY_INPUT_LIST = Cons(2, Cons(3, Nil()))
rel input_list(MY_INPUT_LIST)
Now, let's visualize the results!
rel input_list_str(s) = to_string(MY_INPUT_LIST, s)
rel result_list_str(s) = result_list(l) and to_string(l, s)
query input_list_str // [2, 3]
query result_list_str // [1, 2, 3]
As can be seen, through the new
operator, we have essentially created a new list containing the element 1
.
We note that the rule for result_list
is not recursive.
In general, extra care needs to be taken to ensure that the program does not go into infinite loop.`
Case Study: Creating Entities for Equality Saturation
In this case study we look at the problem of equality saturation. Given an symbolic expression, there might be ways to simplify it, which are defined through rewrite rules. Notice that after simplification, the program should be equivalent to the input. The problem is challenging as there might be multiple ways to apply the rewrite rules. How do we then systematically derive the simplest equivalent program?
A simple example here is the symbolic arithmetic expression language, with constant, variables, and summation rule:
type Expr = Const(i32) | Var(String) | Add(Expr, Expr)
One example expression that we can express in this language would be
const MY_EXPR = Add(Add(Const(-3), Var("a")), Const(3)) // (-3 + a) + 3
For visualization, we write a to_string
function
rel to_string(p, i as String) = case p is Const(i)
rel to_string(p, v) = case p is Var(v)
rel to_string(p, $format("({} + {})", s1, s2)) =
case p is Add(p1, p2) and to_string(p1, s1) and to_string(p2, s2)
If we query on to_string
for MY_EXPR
, we would get
query to_string(MY_EXPR, s) // s = "((-3 + a) + 3)"
Now let us deal with the actual simplification.
The expression (-3 + a) + 3
could be simplified to just a
, as the -3
and 3
cancels out.
The way to do the simplification is to write two things:
- rewrite rules in the form of equivalence relation;
- the weight function giving each expression a weight to tell which expression is simpler.
For this, the following set of relations needs to be defined.
type input_expr(expr: Expr)
type equivalent(expr_1: Expr, expr_2: Expr)
type weight(expr: Expr, w: i32)
type simplest(expr: Expr)
Note that we need set a prior knowledge on equivalent
: the expr_1
is always more complex than the expr_2
.
This is to prevent the simplification to go to arbitrary direction and result in infinite-loop.
In such case, equivalent
would not be commutative.
Let us start with equivalent
and define its basic property of identity and transitivity:
// Identity
rel equivalent(e, e) = case e is Const(_) or case e is Var(_) or case e is Add(_, _)
// Transitivity
rel equivalent(e1, e3) = equivalent(e1, e2) and equivalent(e2, e3)
Now, we can write the rewrite rules.
The first one we are going to write states that, if e1
and e1p
are equivalent and e2
and e2p
are equivalent,
their additions (Add(e1, e2)
and Add(e1p, e2p)
) are equivalent too.
// e1 == e1p, e2 == e2p ==> (e1 + e2) == (e1p + e2p)
rel equivalent(e, new Add(e1p, e2p)) = case e is Add(e1, e2) and equivalent(e1, e1p) and equivalent(e2, e2p)
The next rule states that Addition is commutative, such that Add(a, b)
is equivalent to Add(b, a)
:
// (a + b) == (b + a)
rel equivalent(e, new Add(b, a)) = case e is Add(a, b)
We also have a rule for associativity:
// (a + (b + c)) == ((a + b) + c)
rel equivalent(e, new Add(new Add(a, b), c)) = case e is Add(a, Add(b, c))
A rule for simplifying adding summation identity 0:
// a + 0 = a
rel equivalent(e, a) = case e is Add(a, Const(0))
A rule for reducing two constants addition:
rel equivalent(e, Const(a + b)) = case e is Add(Const(a), Const(b))
Now we have 5 rewrite-rules in place, let us define how to compute the weight of each expression.
The leaf nodes (Var
and Const
) have weight of 1
, and the addition have the weight from left and right sub-expr added together plus 1.
rel weight(e, 1) = case e is Var(_) or case e is Const(_)
rel weight(e, l + r + 1) = case e is Add(a, b) and weight(a, l) and weight(b, r)
Lastly, we use the aggregation to find the equivalent programs with the minimum weight, which is our definition of the "simplest" program.
Note that we have used an argmax
aggregation denoted by min[p]
here:
rel best_program(p) = p := argmin[p](w: input_expr(e) and equivalent(e, p) and weight(p, w))
If we query for the best program and turn it into string, we will get our expected output, a single variable "a"
!
rel best_program_str(s) = best_program(p) and to_string(p, s)
query best_program_str // {("a")}
Parsing Entities from String
Scallop provides foreign functions and predicates for dynamically parsing entities from string input. Consider the following example:
type Expr = Const(f32) | Add(Expr, Expr)
rel expr_str = {"Add(Const(1), Const(2.5))"}
Let us say that we want to parse an expression from the expr_str
, we can do the following:
rel expr($parse_entity(s)) = expr_str(s)
Here, we are using the foreign function of $parse_entity
.
We would get the following result:
query expr
// expr: {(entity(0xadea13a2621dd155))}
On-Demand Relations
There are often times relations/predicates where you know that would not need to be fully computed. This would include the infinite relations. This means that, we want to define such relations without worrying about its infinite-ness while also being able to supply it with information needed for the computation. Such relations are called On-Demand Relations.
We show here one on-demand relation which is the fibonacci
number relation:
type fib(bound x: i32, y: i32)
rel fib = {(0, 1), (1, 1)}
rel fib(x, y1 + y2) = fib(x - 1, y1) and fib(x - 2, y2) and x > 1
query fib(10, y)
Normally, if we define the fibonacci relation, it would only contain the second and the third line, which respectively defines the base cases and the recursive cases. However, as we all know, there are infinitely many fibonacci numbers and it would not be wise to compute the relation fully. Usually, we want to infer some fact inside of the infinite relation, based on some inputs. In this case, as noted on the last line, we want to know the 10th fibonacci number.
It is hinted that when we want to compute a fibonacci number, we usually supply the x
value, in this case, 10, in order to get the value y
.
This is exactly what we tell the compiler in the first line.
Inside of the type declaration, we provide an additional adornment to each of the variables.
x
is adorned bybound
, denoting that it is treated as an input (or bounded) variable to the relationy
is not adorned by anything, suggesting that it is a free variable which will be computed by the rules of the relation
Getting
x
based ony
is out-of-scope in this tutorial.
By providing the adornments (with at least one bound
), we are telling Scallop that the relation should be computed on-demand.
From there, Scallop will search for every place where the relation is demanded, and restrict the computation of the relation only on the demand.
In our case, there is just one single place where the fib
relation is demanded (where x
is 10
).
Therefore, Scallop will compute only the necessary facts in order to derive the final solution.
Adornments
There are only two kinds of adornments:
bound
free
Annotating whether the variable is treated as bounded variable or free variable.
If an adornment is not provided on a variable, then it is by default a free
variable.
In this sense, all normal relations without any adornment would be treated as non-on-demand relations.
When at least one bound
adornment is annotated on a relation type declaration, we know that the relation needs to be computed on-demand.
More Examples
On-Demand Path
Let's go back to our example of edge-and-path. Consider that there is a huge graph, but we only want to know a path ending at a specific node:
rel path(a, b) = edge(a, b) or (edge(a, c) and path(c, b))
query path(a, 1024)
In this case, enumerating all paths would be strictly more expensive than just exploring from the end point.
Therefore, we add an adornment to the path
relation like the following:
type path(free i32, bound i32)
We say the second argument is bound
and the first argument is free
, matching what we expect from the query.
On-Demand To-String
Let's consider an simple arithmetic expression language and a to_string
predicate for the language:
type Expr = Const(i32) | Var(String) | Add(Expr, Expr) | Sub(Expr, Expr)
rel to_string(e, $format("{}", i)) = case e is Const(i)
rel to_string(e, $format("{}", v)) = case e is Var(e)
rel to_string(e, $format("({} + {})", s1, s2)) = case e is Add(e1, e2) and to_string(e1, s1) and to_string(e2, s2)
rel to_string(e, $format("({} - {})", s1, s2)) = case e is Sub(e1, e2) and to_string(e1, s1) and to_string(e2, s2)
Now that let's say there are many expressions declared as constants:
const EXPR_1 = Add(Const(1), Add(Const(5), Const(3)))
const EXPR_2 = Add(Const(1), Var("x"))
const EXPR_3 = Const(13)
Scallop would have automatically generated string for all of the expressions.
However, let's say we are only interested in one of the expressions:
query to_string(EXPR_3, s)
Then most of the computations for to_string
would be redundant.
In this case, we would also declare to_string
as an on-demand predicate, like this:
type to_string(bound Expr, String)
Then only the queried expression will be to_string
-ed.
Internals
Internally, when there are relations being annotated with adornments, the whole Scallop program is undergone a program transformation. This transformation is traditionally called Magic-Set Transformation or Demand Transformation. There are multiple papers on the topic, which we reference below:
- Extended Magic for Negation (Tuncay Tekle et. al. 2019)
- Precise complexity analysis for efficient datalog queries (Tuncay Tekle et. al. 2010)
- Efficient bottom-up computation of queries on stratified databases (Balbin et. al. 1991)
Loading from CSV
Scallop can be used along with existing datasets loaded from CSVs.
This is usually achieved with annotating on specific relations.
For example, assuming we have a file edge.csv
,
0,1
1,2
we can load the content of it into a relation edge
in Scallop using the following syntax
@file("edge.csv")
type edge(from: usize, to: usize)
rel path(a, c) = edge(a, c) or path(a, b) and edge(b, c)
query path
In particular, we annotate the @file(...)
attribute onto the relation type declaration type edge(...)
.
The file name is written inside the @file
attribute.
We require the relation to be declared with types in order for it to be loaded with CSV file content.
Depending on the type declaration, the file content will be parsed into values of certain types.
From here, the edge
relation will be loaded with the content (0, 1)
and (1, 2)
.
After executing the Scallop program above, we would obtain the result path
being (0, 1)
, (0, 2)
, and (1, 2)
.
Certainly, there are many ways to load CSV. In this section, we introduce the various ways to configure the CSV loading.
Headers
There are CSV files with headers. Suppose we have the following CSV file
from,to
0,1
1,2
To load this file, we would need to add an additional argument header=true
to the @file
attribute:
@file("edge.csv", header=true)
type edge(from: usize, to: usize)
Note that by default we assume that CSV files don't have headers.
Deliminators
By default, we assume the values inside of the CSV file are deliminated by commas ','
.
In case where CSV files have values deliminated by other characters, such as tabs '\t'
, we would need to specify that in the @file
attribute:
@file("edge.csv", deliminator="\t")
type edge(from: usize, to: usize)
Note that deliminators cannot be of multiple characters.
Parsing Field-Value Pairs
There are many CSV tables which have a lot of columns. One way is to specify all the fields and their types, like the following.
type table(field1: type1, field2: type2, ..., fieldn: typen)
However, this might be very hard to encode. Therefore, we provide another way of parsing CSV files into relations, by using primary keys and field-value pairs. Let's assume we have the following CSV file:
student_id,name,year,gender
0001,alice,2020,female
0002,bob,2021,male
We see that student_id
can serve as the primary key of this table.
With this, it can be loaded into the following relation
@file("student.csv", keys="student_id")
type table(student_id: usize, field: String, value: String)
By specifying keys="student"
, we tell Scallop that student_id
should be viewed as unique primary keys.
The rest of the two elements are field
and value
, both need to be typed String
s.
As a result, it produces the following 6 facts in the table
relation:
(1, "name", "alice"), (1, "year", "2020"), (1, "gender", "female"),
(2, "name", "bob"), (2, "year", "2021"), (2, "gender", "male")
Note that there could be more than one keys. Consider the following table
student_id,course_id,enroll_time,grade
0001,cse100,fa2020,a
0001,cse101,sp2021,a
0002,cse120,sp2021,b
We see that the combination of student_id
and course_id
form the unique primary keys.
In this case, they can be loaded using the following syntax:
@file("enrollment.csv", keys=["student_id", "course_id"])
type enrollment(student_id: usize, course_id: String, field: String, value: String)
By setting keys
to be a list ["student_id", "course_id"]
, the student_id
field is the first primary key and course_id
is the second.
There are still two additional arguments for the enrollment
relation.
In general, the arity of the relation will be the number of primary keys plus 2.
Specifying Fields to Load
In case not all fields are desired when loading, one can use the fields
argument to specify what to load.
Consider the same enrollment table encoded in CSV:
student_id,course_id,enroll_time,grade
0001,cse100,fa2020,a
0001,cse101,sp2021,a
0002,cse120,sp2021,b
If we only want to get everything but omit the enroll_time
column, we can do
@file("enrollment.csv", fields=["student_id", "course_id", "grade"])
type enrollment(student_id: usize, course_id: String, grade: String)
This can also work in conjunction with the keys
argument.
In this case, we do not need to specify the primary keys.
@file("enrollment.csv", keys=["student_id", "course_id"], fields=["grade"])
type enrollment(student_id: usize, course_id: String, field: String, value: String)
// The following facts will be obtained
// enrollment(1, "cse100", "grade", "a")
// enrollment(1, "cse101", "grade", "a")
// enrollment(2, "cse120", "grade", "b")
Foreign Functions
Foreign functions allows for complex value manipulation in Scallop.
Conceptually, they are pure and partial functions that operate on value(s) and return one single value only.
Functions with states, such as random
, are not allowed as foreign functions.
Function Types
In Scallop, foreign functions are generically typed with optional and variable arguments.
All the functions have a dollar sign ($
) associated with the function name.
We use the following syntax to denote a function signature
$FUNC_NAME<ARG(: FAMILY)?, ...>(
POS_ARG: POS_ARG_TYPE, ...,
OPT_ARG: OPT_ARG_TYPE?, ...,
VAR_ARG_TYPE...
) -> RETURN_TYPE
The generic arguments are specified in the <...>
after the function name, and can be annotated by optional type family.
For the arguments of the function, optional arguments have to appear after all positional arguments, and the variable arg type must appear after all positional and optional arguments.
Functions must have a return type.
For example, the function $string_char_at(s: String, i: usize) -> char
takes in a string s
and an index i
, and returns the character at that location.
The two arguments s
and i
are both positional arguments.
In the function $substring(s: String, b: usize, e: usize?)
, we have 2 positional arguments (s
and b
) and 1 optional argument (e
).
This means that this substring function can be invoked with 2 or 3 arguments.
Invoking $substring("hello", 3)
would give us "lo"
, and invoking $substring("hello", 1, 3)
would give us "el"
.
For functions like $abs<T: Number>(T) -> T
, we have absolute value function taking in value of any type that is a number (including integers and floating points).
The function also returns a type the same as the input.
For a function like $format(f: String, Any...)
, it looks at the format string and fill all the {}
symbol in the string with the latter arguments.
Notice how there can be arbitrary number of arguments (variable arg) of Any
type.
For example, we can have $format("{} + {}", 3, "a") ==> "3 + a"
and $format("{}", true) ==> "true"
.
Function Failures
Foreign functions may fail with errors such as divide-by-zero, index-out-of-bounds. When error happens, values will not be propagated along the computation, and will be dropped silently.
For example, the following program makes use of the foreign function $string_char_at
.
It walks through the indices 1, 3, and 5, and get the character on those indices of the string "hello"
.
rel indices = {1, 3, 5}
rel output(i, $string_char_at("hello", i)) = indices(i)
However, there are only 5 characters in the string, meaning that getting the 5
-th character would result in an index-out-of-bounds error.
Scallop will drop this invokation silently, resulting in only two facts being derived:
output: {(1, 'e'), (3, 'l')}
Similar things happen when nan
is derived from floating point operations, or that the foreign function fails.
Library of Foreign Functions
We hereby list all the available foreign functions, their signatures, descriptions, and an example of how to invoke them.
The functions here are ordered alphabetically.
For some of the functions that are slightly more complicated (e.g. $format
), please refer to the section below for more information.
Function | Description | Example |
---|---|---|
$abs<T: Number>(x: T) -> T | Absolute value function \(\lvert x \rvert\) | $abs(-1) => 1 |
$acos<T: Float>(x: T) -> T | Arc cosine function \(\text{acos}(x)\) | $acos(0.0) => 1.5708 |
$atan<T: Float>(x: T) -> T | Arc tangent function \(\text{atan}(x)\) | $atan(0.0) => 0.0 |
$atan2<T: Float>(y: T, x: T) -> T | 2-argument arc tangent function \( \text{atan}(y, x) \) | $atan2(0.0, 1.0) => 0.0 |
$ceil<T: Number>(x: T) -> T | Round up to closest integer \( \lceil x \rceil \) | $ceil(0.5) => 1.0 |
$cos<T: Float>(x: T) -> T | Cosine function \(\text{cos}(x)\) | $cos(0.0) => 1.0 |
$datetime_day(d: DateTime) -> u32 | Get the day component of a DateTime , starting from 1 | $datetime_day(t"2023-01-01") => 1 |
$datetime_month(d: DateTime) -> u32 | Get the month component of a DateTime , starting from 1 | $datetime_month(t"2023-01-01") => 1 |
$datetime_month0(d: DateTime) -> u32 | Get the month component of a DateTime , starting from 0 | $datetime_month0(t"2023-01-01") => 0 |
$datetime_year(d: DateTime) -> i32 | Get the year component of a DateTime | $datetime_month0(t"2023-01-01") => 2023 |
$dot(a: Tensor, b: Tensor) -> Tensor | Dot product of two tensors \(a \cdot b\); only available when compiled with torch-tensor | |
$exp<T: Float>(x: T) -> T | Exponential function \(e^x\) | $exp(0.0) => 1.0 |
$exp2<T: Float>(x: T) -> T | Exponential function \(2^x\) (base 2) | $exp2(2.0) => 4.0 |
$floor<T: Number>(x: T) -> T | Round down to closest integer \(\lfloor x \rfloor\) | $exp2(2) => 4 |
$format(String, Any...) -> String | Formatting string | $format("{} + {}", 3, "a") => "3 + a" |
$hash(Any...) -> u64 | Hash the given values | $hash("a", 3, 5.5) => 5862532063111067262 |
$log<T: Float>(x: T) -> T | Natural logarithm function \(\text{log}_e(x)\) | $log(1.0) => 0.0 |
$log2<T: Float>(x: T) -> T | Logarithm function \(\text{log}_2(x)\) (base 2) | $log2(4.0) => 2.0 |
$max<T: Number>(T...) -> T | Maximum \(\text{max}(x_1, x_2, \dots)\) | $max(4.0, 1.0, 9.5) => 9.5 |
$min<T: Number>(T...) -> T | Minimum \(\text{min}(x_1, x_2, \dots)\) | $max(4.0, 1.0, 9.5) => 1.0 |
$pow<T: Integer>(x: T, y: u32) -> T | Integer power function \(x^y\) | $pow(2.2, 2) => 4.84 |
$powf<T: Float>(x: T, y: T) -> T | Float power function \(x^y\) | $powf(4.0, 0.5) => 2.0 |
$sign<T: Number>(x: T) -> T | Sign function that returns \({-1, 0, 1}\) in respective types | $sign(-3.0) => -1.0 |
$sin<T: Float>(x: T) -> T | Sine function \(\text{sin}(x)\) | $sin(0.0) => 0.0 |
$string_char_at(s: String, i: usize) -> char | Get the i -th character of string s | $string_char_at("hello", 2) => 'l' |
$string_concat(String...) -> String | Concatenate multiple strings | $string_concat("hello", " ", "world") => "hello world" |
$string_index_of(s: String, pat: String) -> usize | Find the index of the first occurrence of the pattern pat in string s | $string_index_of("hello world", "world") => 6 |
$string_length(s: String) -> usize | Get the length of the string | $string_length("hello") => 5 |
$string_lower(s: String) -> String | To lower-case | $string_lower("LisA") => "lisa" |
$string_trim(s: String) -> String | Trim a string | $string_trim(" hello ") => "hello" |
$string_upper(s: String) -> String | To upper-case | $string_upper("LisA") => "LISA" |
$substring(s: String, b: usize, e: usize?) -> String | Find the substring given begin index and optional the end index | $substring("hello world", 6) => "world" |
$tan<T: Number>(x: T) -> T | Tangent function \(\text{tan}(x)\) | $tan(0.0) => 0.0 |
Foreign Predicates
Foreign predicates aim to provide programmers with extra capabilities with relational predicates.
Traditional Datalog program defines relational predicate using only horn-rules.
Given the assumption that the input database is finite, these derived relational predicates will also be finite.
However, there are many relational predicates that are infinite and could not be easily expressed by horn-rules.
One such example is the range
relation.
Suppose it is defined as range(begin, end, i)
where i
could be between begin
(inclusive) and end
(exclusive).
There could be infinitely many triplets, and we cannot simply enumerate all of them.
But if the first two arguments begin
and end
are given, we can reasonably enumerate the i
.
In Scallop, range
is available to be used as a foreign predicate.
Notice that range
can be applied on any integer data, making it a generic predicate.
For example, to use range
on i32
data, we will need to invoke range<i32>
:
rel result(x) = range<i32>(0, 5, x)
Here we enumerate the value of x
from 0 (inclusive) to 5 (exclusive), meaning that we will obtain that result = {0, 1, 2, 3, 4}
.
For the rest of this section, we describe in detail how foreign predicates are constructed in Scallop and why are they useful.
Foreign Predicate Types
Foreign predicates can be generic and are statically typed. In addition to just providing the argument types, we also need to provide a boundness pattern.
A boundness pattern is a string of length equal to the relation arity and consisting of b
and f
.
The character b
means bounded, reflecting the variable on that position is taken as input to the predicate.
The character f
means free, suggesting that the variable on that position will be generated as output by the predicate.
For example, the full definition of range
is
range<T: Integer>(begin: T, end: T, i: T)[bbf]
Notice that at the end of the definition we have [bbf]
.
Here, bbf
is a boundness pattern for range, suggesting that begin
and end
will be provided as input, and i
will be generated as output.
In the future, we plan to allow the definition of multiple boundness patterns
Standard Library of Foreign Predicates (Part A)
In this part, we give an overview to the foreign predicates that are discrete only.
Foreign Predicate | Description |
---|---|
datetime_ymd(d: DateTime, y: i32, m: u32, d: u32)[bfff] | Get the y ear, m onth, and day from a DateTime value |
range<T: Integer>(begin: T, end: T, i: T)[bbf] | Generate all the integers i starting from begin and end with end - 1 |
string_chars(s: String, i: usize, c: char)[bff] | Generate all the index-character tuples inside of string s |
string_find(s: String, pat: String, begin: usize, end: usize)[bbff] | Generate all the begin-end ranges of the pattern pat 's occurrence in the string s |
string_split(s: String, pat: String, out: String)[bbf] | Split the string s using the pattern pat and generate the out strings |
Reference Guide
We list all the language features supported by Scallop.
Import Files
import "path/to/other/file.scl"
Type Definition
Type Alias Definition
type ObjectId = usize
Sub-Type Definition
type Name <: String
Enum Type Definition
type Action = LEFT | RIGHT | UP | DOWN
Algebraic Data Type Definition
type Expr = Const(i32) | Add(Expr, Expr) | Sub(Expr, Expr)
Relation Type Definition
type edge(x: i32, y: i32)
Constant Definition
const PI: f32 = 3.1415
Relation Definition
Fact Definition
rel edge(1, 2)
Set-of-Tuples Definition
rel edge = {(1, 2), (2, 3), (3, 4)}
Rule Definition
rel path(a, b) = edge(a, b) or path(a, c) and edge(c, b)
Disjunctive Head
rel { assign(v, false); assign(v, true) } = variable(v)
Atom
fib(x - 1, y)
Negation
rel has_no_child(p) = person(p) and not father(p, _) and not mother(p, _)
Constraint
rel number(0)
rel number(i + 1) = number(i) and i < 10
Aggregation
rel person = {"alice", "bob", "christine"}
rel num_people(n) = n := count(p: person(p))
Foreign Predicate
rel grid(x, y) = range<i32>(0, 5, x) and range<i32>(0, 5, y)
Query Definition
query path
Scallop and Probabilistic Programming
One fundamental concept in machine learning is probability. Scallop, being a neurosymbolic programming language, supports probability and probabilistic programming natively. For example, one can write the following program:
type Action = UP | DOWN | LEFT | RIGHT
rel predicted_action = {0.05::UP, 0.09::DOWN, 0.82::LEFT, 0.04::RIGHT}
where the predicted_action
relation encodes a distribution of actions and their probabilities.
In particular, the UP
action is predicted to have a \(0.05\) probability.
Here, the ::
symbol is used to suggest that probabilities (such as 0.05) are used to tag the facts (such as UP
).
Since we can define probability on user declared facts, the derivation of new facts will be associated with probabilities too. This means that Scallop is doing probabilistic reasoning. The whole probabilistic reasoning semantics of Scallop is defined with the theory of provenance semiring. In this chapter, we give detailed explanation to the probabilities appeared in Scallop.
Tags and Provenance
Scallop's probabilistic semantics is realized by the Provenance Semiring framework. Inside of this framework, each fact can be tagged by an extra piece of information, which we call tag. Such information is propagated throughout the execution of Scallop program according to the provenance, which is the mathematical object defining how tags propagate.
Motivating Probabilistic Example
The following example shows a fact earthquake()
being tagged by a probability 0.03
(earthquake could happen with a 0.03 probability):
rel 0.03::earthquake()
Concretely, we have an (external) tag space of \([0, 1]\), which contains real numbers between 0 and 1, which is the space of probabilities.
Similarly, we define another tagged fact burglary()
:
rel 0.20::burglary()
We can declare a rule saying that, "when earthquake or burglary happens, an alarm will go off".
rel alarm() = earthquake() or burglary()
query alarm
Remember that the facts earthquake()
and burglary()
are probabilistic.
Intuitively, the derived fact alarm()
will also be associated with a derived probability.
Based on probability theory, we have
\[ \begin{align} \Pr(\text{alarm}) &= \Pr(\text{earthquake} \vee \text{burglary}) \\ &= 1 - \Pr(\neg \text{earthquake} \wedge \neg \text{burglary}) \\ &= 1 - \Pr(\neg \text{earthquake}) \cdot \Pr(\neg \text{burglary}) \\ &= 1 - (1 - \Pr(\text{earthquake})) \cdot (1 - \Pr(\text{burglary})) \\ &= 1 - (1 - 0.03) (1 - 0.20) \\ &= 1 - 0.97 \times 0.8 \\ &= 0.224 \end{align} \]
This is indeed what we get if we use the topkproofs
provenance (which we discuss later in the chapter) with the scli
Scallop interpreter:
> scli alarm.scl
alarm: {0.224::()}
Proofs Provenance
Fact with Probability
Logic and Probability
Aggregation with Probability
With the introduction of probabilities, many existing aggregators are augmented with new semantics, which we typically call multi-world semantics.
What's more, there are new aggregators, such as softmax
, rank
, and weighted_avg
, that make use of the probabilities.
We introduce these aggregators one-by-one in this section.
Multi-world Semantics with Aggregators
Let us take the count
aggregator as an example.
Consider we have 2 objects, each could be big or small with their respective probabilities:
type OBJ = OBJ_A | OBJ_B
rel size = {0.8::(OBJ_A, "big"); 0.2::(OBJ_A, "small")} // obj A is very likely big
rel size = {0.1::(OBJ_B, "big"); 0.9::(OBJ_B, "small")} // obj B is very likely small
Now let's say we want to count how many big objects are there, by using the following
Note that even when using probabilites, one can opt to not use the multi-world semantics by adding !
sign to the end of the aggregator.
New Aggregators using Probabilities
Softmax and Normalize
Rank
Weighted Average and Weighted Sum
Sampling with Probability
In Scallop, samplers share the same syntax as aggregators. They usually work with probabilistic provenances, but can also work without them. Here are some example samplers:
top
: get the $k$ facts with top probabilitiescategorical
: treat the relation as a categorical distribution and sample from ituniform
: treat the relation as a uniform distribution and sample from it
Let's take top
as an example.
We can obtain the top ranked symbol by using the following rule:
rel symbols = {0.9::"+", 0.05::"-", 0.02::"3"}
rel top_symbol(s) = s := top<1>(s: symbols(s)) // 0.9::top_symbol("+")
The Scallop Python Binding scallopy
scallopy
is the Python binding for Scallop, offering an interface for computationg with Scallop in Python.
In addition, it can be integrated with PyTorch, allowing users to write Neuro-Symbolic applications that can be connected to PyTorch.
In this section, we elaborate on how to install, configure, and use the scallopy
library.
For an example, please look at Getting Started. To start reading the documentation, proceed to Scallopy Context
Installation
TODO: Installation with venv
TODO: Installation with Conda
Getting Started with Scallopy
Motivating Example
Let's start with a very simple example illustrating the usage of scallopy
.
import scallopy
ctx = scallopy.Context()
ctx.add_relation("edge", (int, int))
ctx.add_facts("edge", [(1, 2), (2, 3)])
ctx.add_rule("path(a, c) = edge(a, c) or path(a, b) and edge(b, c)")
ctx.run()
print(list(ctx.relation("path"))) # [(1, 2), (1, 3), (2, 3)]
In this very simple edge-path example, we are interacting with Scallop through a Python class called Context
.
Basically, a Context
manages a Scallop program, along with the relations, facts, and execution results corresponding to that Scallop program.
We create a Context
by ctx = scallopy.Context
.
Relations, facts, and rules are added through the functions add_relation(...)
, add_facts(...)
, and add_rule(...)
.
With everything set, we can execute the program inside the context by calling run()
Lastly, we pull the result from ctx
by using relation(...)
.
Please refer to a more detailed explanation of this example in Scallop Context.
Machine Learning with Scallopy and PyTorch
When doing machine learning, we usually want to have batched inputs and outputs.
Instead of building the Scallop context incrementally and explicitly run the program, we can create a Module
at once and be able to run the program for a batch of inputs.
This offers a few advantages, such as optimization during compilation, batched execution for integration with machine learning pipelines, simplified interaction between data structures, and so on.
For example, we can create a module and run it like the following:
import scallopy
import torch
# Creating a module for execution
my_sum2 = scallopy.Module(
program="""
type digit_1(a: i32), digit_2(b: i32)
rel sum_2(a + b) = digit_1(a) and digit_2(b)
""",
input_mappings={"digit_1": range(10), "digit_2": range(10)},
output_mappings={"sum_2": range(19)},
provenance="difftopkproofs")
# Invoking the module with torch tensors. `result` is a tensor of 16 x 19
result = my_sum2(
digit_1=torch.softmax(torch.randn(16, 10), dim=0),
digit_2=torch.softmax(torch.randn(16, 10), dim=0))
As can be seen in this example, we have defined a Module
which can be treated also as a PyTorch module.
Similar to other PyTorch modules, it can take in torch tensors and return torch tensors.
The logical symbols (such as the i32
numbers used in digit_1
and digit_2
) are configured in input_mappings
and output_mappings
, and can be automatically converted from tensors.
We also see that it is capable of handling a batch of inputs (here, the batch size is 16).
Internally, Scallop also knows to execute in parallel, making it performing much faster than normal.
Please refer to Scallop Module for more information.
Scallop Context
The most fundamental point of interaction of scallopy
is ScallopContext
.
The following is a very simple example setting up a ScallopContext
to compute the edge-path
program:
import scallopy
# Creating a new context
ctx = scallopy.ScallopContext()
# Add relation of `edge`
ctx.add_relation("edge", (int, int))
ctx.add_facts("edge", [(0, 1), (1, 2)])
# Add rule of `path`
ctx.add_rule("path(a, c) = edge(a, c) or path(a, b) and edge(b, c)")
# Run!
ctx.run()
# Check the result!
print(list(ctx.relation("path"))) # [(0, 1), (0, 2), (1, 2)]
Roughly, the program above can be divided into three phases:
- Setup the context: this involves defining relations, adding facts to relations, and adding rules that do the computation
- Running the program inside of context
- Fetch the results
While the 2nd and 3rd steps are the place where the computation really happens, it's more important for the programmers to correctly setup the full context for computation. We now elaborate on what are the high-level things to do when setting up the context
Configurations
When creating a new ScallopContext
, one should configure it with intended provenance.
If no argument is supplied, as shown in the above example, the context will be initialized with the default provenance, unit
, which resembles untagged semantics (a.k.a. discrete Datalog).
To explicitly specify this, you can do
ctx = scallopy.ScallopContext(provenance="unit")
Of course, Scallop can be used to perform reasoning on probabilistic and differentiable inputs. For instance, you can write the following
ctx = scallopy.ScallopContext(provenance="minmaxprob") # Probabilistic
# or
ctx = scallopy.ScallopContext(provenance="diffminmaxprob") # Differentiable
For more information on possible provenance information, please refer to the provenance section.
It it worth noting that some provenance, such as topkproofs
, accept additional parameters such as k
.
In this case, you can supply this as additional arguments when creating the context:
ctx = scallopy.ScallopContext(provenance="topkproofs", k=5) # top-k-proofs provenance with k = 5
Adding Program
Given that a context has been configured and initialized, we can set it up the quickest by loading a program into the context.
One can either load an external .scl
file, or directly inserting a program written as Python string.
To directly add a full program string to the context, one can do
ctx.add_program("""
rel edge = {(0, 1), (1, 2)}
rel path(a, c) = edge(a, c) or path(a, b) and edge(b, c)
""")
On the other hand, assuming that there is a file edge_path.scl
that contains the same content as the above string, one can do
ctx.import_file("edge_path.scl")
Adding Relations
Instead of adding program as a whole, one can also add relations one-at-a-time.
When adding new relations, one would need to supply the name as well as the type of the relation.
For example, the edge
relation can be defined as follows
ctx.add_relation("edge", (int, int))
Here, we are saying that edge
is an arity-2 relation storing pairs of integers.
Note that we are specifying the type using Python's int
type.
This is equivalent to the i32
type inside Scallop.
Therefore, the above instruction tranlates to the following Scallop code:
rel edge(i32, i32)
Many existing Python types can directly translate to Scallop type. In particular, we have the mapping listed as follows:
Python Type | Scallop Type |
---|---|
int | i32 |
bool | bool |
float | f32 |
str | String |
In case one want to use types other than the listed ones (e.g., usize
), they can be accessed directly using the string "usize"
, or they can be accessed through predefined types such as scallopy.usize
.
The example below defines a relation of type (usize, f64, i32)
:
ctx.add_relation("my_relation", (scallopy.usize, "f64", int))
Specifically for arity-1 relations, users don't need to use a tuple to specify the type. For instance,
ctx.add_relation("digit", int)
Adding Facts
The most basic version of adding facts into an existing relation inside of an existing context.
We are assuming that the context has a provenance of "unit"
.
ctx.add_facts("edge", [(1, 2), (2, 3)])
If the relation is declared to be having arity-1 and that the type is a singleton type instead of a 1-tuple, then the facts inside of the list do not need to be a tuple.
ctx.add_relation("digit", int)
ctx.add_facts("digit", [1, 2, 3])
Probabilistic Facts (Tagged Facts)
When the Scallop context is configured to use a provenance other than.
If one wants to add facts along with probabilities, they can wrap their non-probabilistic facts into tuples whose first element is a simple probability.
For example, if originally we have a fact 1
, wrapping it with a corresponding probability gives us (0.1, 1)
, where 0.1
is the probability.
ctx.add_facts("digit", [1, 2, 3]) # without probability
ctx.add_facts("digit", [(0.1, 1), (0.2, 2), (0.7, 3)]) # with probability
Of course, if the original facts are tuples, the ones with probability will be required to wrap further:
ctx.add_facts("color", [("A", "blue"), ("A", "green"), ...]) # without probability
ctx.add_facts("color", [(0.1, ("A", "blue")), (0.2, ("A", "green")), ...]) # with probability
We can extend this syntax into tagged facts in general.
Suppose we are using the boolean semiring (boolean
), we are going to tag each fact using values such as True
or False
.
ctx = scallopy.Context(provenance="boolean")
ctx.add_relation("edge", (int, int))
ctx.add_facts("edge", [(True, (1, 2)), (False, (2, 3))])
Non-tagged Facts in Tagged Context
Adding Rules
Tagged Rules
Running
Additional Features
There are more features provided by the ScallopContext
interface.
We hereby list them for reference.
Cloning
One can copy a context to create a new context. The resulting context will contain all the program, configurations, and provenance information.
new_ctx = ctx.clone()
The cloning feature relates to pseudo-incremental computation and branching computation. We elaborate on this in the Branching Computation section.
Compiling
Iteration Count Limit
One can configure the
Early Discarding
Obtaining Context Information
Foreign Functions and Predicates
Saving and Loading
Please refer to the Saving and Loading section for more information.
Branching Executions
One cool feature that scallopy
supports is branching execution.
People can create a context, clone it to form new contexts, and modify the new context without touching the old ones.
This is particularly useful when incremental computation is desired.
# Create the first version of the context
ctx = scallopy.ScallopContext()
ctx.add_relation(...)
ctx.add_facts(...)
# Branch it into another context
ctx1 = ctx.clone()
ctx1.add_relation(...)
ctx1.add_facts(...)
ctx1.run() # Running the first context
# Branch it into one more context; `ctx1` and `ctx2` are completely disjoint
ctx2 = ctx.clone()
ctx2.add_relation(...)
ctx2.add_facts(...)
ctx2.run() # Running the second context
Configuring Provenance
Foreign Functions
While there are existing foreign functions such as $hash
and $abs
, people sometimes want more functions to be included for specialized computation.
scallopy
provides such interface and allows user to define foreign functions in Python.
Here is an example defining a custom $sum
function in Python which is later used in Scallop:
# Create a new foreign function by annotating an existing function with `@scallopy.foreign_function`
# Note that this function has variable arguments!
@scallopy.foreign_function
def my_sum(*args: int) -> int:
s = 0
for x in args:
s += x
return s
# Create a context
ctx = scallopy.ScallopContext()
# Register the declared foreign function (`my_sum`)
# Note that the function needs to be registered before it is used
ctx.register_foreign_function(my_sum)
# Add some relations
ctx.add_relation("I", (int, int))
ctx.add_facts("I", [(1, 2), (2, 3), (3, 4)])
# Add a rule which uses the registered function!
ctx.add_rule("R($my_sum(a, b)) = I(a, b)")
# Run the context
ctx.run()
# See the result, should be [(3,), (5,), (7,)]
print(list(ctx.relation("R")))
Now we elaborate on how we define new foreign functions in Python.
Function Signature
The annotator @scallopy.foreign_function
performs analysis of the annotated Python function and makes sure that it is accepted as a Scallop foreign function.
We require that types are annotated on all arguments and the return value.
For simplicity, Python types such as int
, bool
, and str
are mapped to Scallop types (and type families) as following:
Python type | Scallop type | Scallop base types |
---|---|---|
int | Integer family | i8 , i16 , ..., u8 , u16 , ..., usize |
float | Float family | f32 , f64 |
bool | bool | bool |
str | String | String |
If one desires to use a more fine-grained type
Argument Types
Optional Arguments
Variable Arguments
Error Handling
Debugging Proofs
We offer capability in Scallop to debug the internal proofs generated by provenance. This can be done through using special provenances specifically designed for debugging.
Debugging Top-K Proofs
We can use the provenance difftopkproofsdebug
.
Take the sum_2
application as an example, we have
ctx = scallopy.ScallopContext(provenance="difftopkproofsdebug", k=3)
ctx.add_relation("digit_a", int)
ctx.add_relation("digit_b", int)
ctx.add_rule("sum_2(a + b) = digit_a(a) and digit_b(b)")
# !!! SPECIAL TREATMENT WHEN INSERTING FACTS !!!
ctx.add_facts("digit_a", [((torch.tensor(0.1), 1), (1,)), ((torch.tensor(0.9), 2), (2,))])
ctx.add_facts("digit_b", [((torch.tensor(0.9), 3), (1,)), ((torch.tensor(0.1), 4), (2,))])
ctx.run()
result = ctx.relation("sum_2")
A majority of the code will look identical to the original example,
but special treatment is required when adding new facts to the context.
Originally, tags are just torch.tensor(SOME_PROBABILITY)
.
But now, we need to supply an extra positive integer which we call Fact ID.
Each of the fact added should be the following:
ctx.add_facts("SOME_RELATION", [
( ( torch.tensor(0.1), 1 ) , ( 0, ) ) # Fact 1
# -- PROBABILITY -- -- FACT ID --
# --------------- TAG -------------- -- TUPLE --
( ( torch.tensor(0.9), 2 ) , ( 1, ) ) # Fact 2
( ( torch.tensor(0.1), 3 ) , ( 0, ) ) # Fact 2
# ...
])
Basically, fact IDs are labels that the programmers can add to instrument the computation process.
All IDs should start from 1, and should be distinct and contiguous for all added facts.
The programmers will need to be responsible for managing the IDs so that there is no collision.
Let's say 10 facts are added with probability, then there should be 10 fact IDs ranging from
1 to 10, inclusive.
Of course, for non-probabilistic facts, the whole tag can be specified as None
.
Debug in Forward Mode
When used in forward
mode, one should do the following.
The forward module should be setup just like before, with the only change being the provenance configuration:
sum_2 = scallopy.Module(
provenance="difftopkproofsdebug",
k=3,
program="""
type digit_a(a: i32), digit_b(b: i32)
rel sum_2(a + b) = digit_a(a) and digit_b(b)
""",
output_relation="sum_2",
output_mapping=[2, 3, 4])
Let's assume that only 1 and 2 are added for each digit in the sum_2
task.
That is, we have digit A and digit B.
Digit A can be 1 or 2, and the digit B can be 1 or 2 as well.
Looking at the following code, we have the fact IDs being
- Digit
A
is 1: Fact ID 1 - Digit
A
is 2: Fact ID 2 - Digit
B
is 1: Fact ID 3 - Digit
B
is 2: Fact ID 4
Notice that the fact IDs all start from 1 and are contiguous (i.e., no gap in the used fact IDs).
Also notice that, when in forward mode, the inputs need to be batched.
In this example, let's only focus on one single data-point, say Datapoint 1
.
digit_a = [
[((torch.tensor(0.1), 1), (1,)), ((torch.tensor(0.9), 2), (2,))], # Datapoint 1
# ...
]
digit_b = [
[((torch.tensor(0.9), 3), (1,)), ((torch.tensor(0.1), 4), (2,))], # Datapoint 1
# ...
]
After preparing the input facts, we can invoke the debug module as the following
(result_tensor, proofs) = sum_2(digit_a=digit_a, digit_b=digit_b)
Here, the output will be a tuple (result_tensor, proofs)
, unlike the non-debug version where only one single PyTorch tensor is returned.
Specifically, the two components will be
result_tensor = torch.tensor([
[0.09, 0.8119, 0.09], # Datapoint 1
# ...
])
proofs = [
[ # Datapoint 1
[ # Proofs of (2,)
[ (True, 1), (True, 3) ], # The first proof is 1 + 1 (using positive fact 1 and 3)
],
[ # Proofs of (3,)
[ (True, 1), (True, 4) ], # The first proof is 1 + 2 (using positive fact 1 and 4)
[ (True, 2), (True, 3) ], # The second proof is 2 + 1 (using positive fact 2 and 3)
],
[ # Proofs of (4,)
[ (True, 2), (True, 4) ], # The first proof is 2 + 2 (using positive fact 2 and 4)
]
],
# ...
]
Notice that result_tensor
is just the original output probability tensor.
The proofs
will be List[List[List[List[Tuple[bool, int]]]]]
.
In particular, it is Batch -> Datapoint Results -> Proofs -> Proof -> Literal
.
Where each Literal
is Tuple[bool, int]
where the boolean indicates the positivity of the literal and the integer indicates the used fact ID.
Scallop CLI
Scallop Interpreter
Scallop REPL
Scallop Compiler
Full Scallop Grammar
SCALLOP_PROGRAM ::= ITEM*
ITEM ::= TYPE_DECL
| RELATION_DECL
| CONST_DECL
| QUERY_DECL
TYPE ::= u8 | u16 | u32 | u64 | u128 | usize
| i8 | i16 | i32 | i64 | i128 | isize
| f32 | f64 | char | bool
| String
| CUSTOM_TYPE_NAME
TYPE_DECL ::= type CUSTOM_TYPE_NAME = TYPE
| type CUSTOM_TYPE_NAME <: TYPE
| type ENUM_TYPE_NAME = VARIANT1 [= VAL1] | VARIANT2 [= VAL2] | ...
| type ADT_TYPE_NAME = CONSTRUCTOR1(TYPE*) | CONSTRUCTOR2(TYPE*) | ...
| type RELATION_NAME(TYPE*)
| type RELATION_NAME(VAR1: TYPE1, VAR2: TYPE2, ...)
| type $FUNCTION_NAME(VAR1: TYPE1, VAR2: TYPE2, ...) -> TYPE_RET
CONST_DECL ::= const CONSTANT_NAME : TYPE = CONSTANT
| const CONSTANT_NAME1 [: TYPE1] = CONSTANT1, CONSTANT_NAME2 [: TYPE2] = CONSTANT2, ...
RELATION_DECL ::= FACT_DECL
| FACTS_SET_DECL
| RULE_DECL
CONSTANT ::= true | false | NUMBER_LITERAL | STRING_LITERAL
CONST_TUPLE ::= CONSTANT | (CONSTANT1, CONSTANT2, ...)
FOREIGN_FN ::= hash | string_length | string_concat | substring | abs
BIN_OP ::= + | - | * | / | % | == | != | <= | < | >= | > | && | || | ^
UNARY_OP ::= ! | -
CONST_EXPR ::= CONSTANT
| CONST_EXPR BIN_OP CONST_EXPR | UNARY_OP CONST_EXPR
| $ FOREIGN_FN(CONST_EXPR*)
| if CONST_EXPR then CONST_EXPR else CONST_EXPR
| ( CONST_EXPR )
TAG ::= true | false | NUMBER_LITERAL // true/false is for boolean tags; NUMBER_LITERAL is used for probabilities
FACT_DECL ::= rel RELATION_NAME(CONST_EXPR*) // Untagged fact
| rel TAG :: RELATION_NAME(CONST_EXPR*) // Tagged fact
FACTS_SET_DECL ::= rel RELATION_NAME = {CONST_TUPLE1, CONST_TUPLE2, ...} // Untagged tuples
| rel RELATION_NAME = {TAG1 :: CONST_TUPLE1, TAG2 :: CONST_TUPLE2, ...} // Tagged tuples
| rel RELATION_NAME = {TAG1 :: CONST_TUPLE1; TAG2 :: CONST_TUPLE2; ...} // Tagged tuples forming annotated disjunction
EXPR ::= VARIABLE
| CONSTANT
| EXPR BIN_OP EXPR | UNARY_OP EXPR
| new CONSTRUCTOR(EXPR*)
| $ FOREIGN_FN(EXPR*)
| if EXPR then EXPR else EXPR
| ( EXPR )
ATOM ::= RELATION_NAME(EXPR*)
RULE_DECL ::= rel ATOM :- FORMULA | rel ATOM = FORMULA // Normal rule
| rel TAG :: ATOM :- FORMULA | rel TAG :: ATOM = FORMULA // Tagged rule
FORMULA ::= ATOM
| not ATOM | ~ ATOM // negation
| FORMULA1, FORMULA2, ... | FORMULA and FORMULA | FORMULA /\ FORMULA // conjunction
| FORMULA or FORMULA | FORMULA \/ FORMULA // disjunction
| FORMULA implies FORMULA | FORMULA => FORMULA // implies
| case VARIABLE is ENTITY
| CONSTRAINT
| AGGREGATION
| ( FORMULA )
ENTITY ::= CONSTRUCTOR(ENTITY*)
CONSTRAINT ::= EXPR // When expression returns a boolean value
AGGREGATOR ::= count | sum | prod | min | max | exists | forall | unique
AGGREGATION ::= VAR* = AGGREGATOR(VAR* : FORMULA) // Normal aggregation
| VAR* = AGGREGATOR(VAR* : FORMULA where VAR* : FORMULA) // Aggregation with group-by condition
| VAR* = AGGREGATOR[VAR*](VAR* : FORMULA) // Aggregation with arg (only applied to AGGREGATOR = min or max)
| VAR* = AGGREGATOR[VAR*](VAR* : FORMULA where VAR* : FORMULA) // Aggregation with arg and group-by condition (only applied to AGGREGATOR = min or max)
| forall(VAR* : FORMULA)
| exists(VAR* : FORMULA)
QUERY_DECL ::= query RELATION_NAME
| query ATOM