Aggregations

Aggregations in Scallop can be viewed as operations that aggregates over multiple facts. Such operations include counting, summation and product, finding min and max, and logical quantifiers such as exists and forall. Aggregations appear in the body of a rule, and can be nested for abbrevity.

As a concrete example, we look at a program which counts over a set of people:

rel person = {"alice", "bob", "christine"}
rel num_people(n) = n := count(p: person(p)) // n = 3

In general, we use the following syntax for aggregation formulas.

R1, R2, ... := AGGREGATOR(V1, V2, ...: FORMULA (where U1, U2, ...: FORMULA)?)

We name R1, ... to be the aggregation result variable, V1, ... to be the binding variable, and the formula inside of the aggregation the body. When the where keyword is used, we have the aggregation associated with explicit group-by clause. Here, we call the set of variables U1, ... as group-by variables. The formula under the where clause is named the group-by body. The binding variables need to be fully grounded by the body formula, and the group-by variables (if presented) need to also be fully grounded by the group-by body. For different types of aggregation, the AGGREGATOR might also change and annotated with different information. The number of result variables, the number of binding variables, and their types differ for each aggregation.

Here is a high-level overview of each supported aggregator and their configurations. In the table, ... is used to denote an arbitrary amount of variables.

Aggregator	Binding Variables	Result Variables
`count`	`Any...`	`usize`
`sum`	`Number`	the same as the binding variable
`prod`	`Number`	the same as the binding variable
`min`	`Any`	the same as the binding variables
`max`	`Any`	the same as the binding variables
`exists`	`Any...`	`bool`
`forall`	`Any...`	`bool`

Below, we elaborate on each aggregators and describe their usages.

Count

To count the number of facts, we can use the count aggregator. Just repeating the examples shown in the beginning:

rel person = {"alice", "bob", "christine"}
rel num_people(n) = n := count(p: person(p)) // n = 3

We are counting the number of persons appear in the person relation. To be more concrete, let's read out the aggregation formula:

We count the number of p such that p is a person, and assign the result to the variable n.

For count, there could be arbitrary (> 0) number of binding variables which can be typed arbitrarily. It will only have a single result variable which is typed usize. For example, you may count the number of edges:

rel num_edges(n) = n := count(a, b: edge(a, b))

Here, we have two binding variables a and b, meaning that we are counting the number of distinct pairs of a and b.

Note that we can use the syntax sugar for aggregation to omit the repeated n:

rel num_edges = count(a, b: edge(a, b))

Implicit Group-By

With group-by, we may count the number of facts under a pre-defined group. Consider the example where there is a scene with differet colored objects,

rel obj_color = {(0, "red"), (1, "red"), (2, "blue"), (3, "red")}
rel num_obj_per_color(col, num) = num := count(obj: obj_color(obj, col))

As suggested by the facts inside of obj_color, there are 4 objects indexed using 0, 1, 2, 3, each associated with a different color. The object #0, #1, and #3 are red and the object #2 is blue. Therefore, we will get 3 red objects and 1 blue object, as computed in the result of num_obj_per_color:

num_obj_per_color: {("blue", 1), ("red", 3)}

Let's analyze the rule in detail. We find that we are counting over obj such that the object obj has a certain color col. But col is also a variable occurring in the head of the rule. This is an implicit group-by, in that the variable col is being used as an implicit group-by variable. That is, we are conditioning the counting procedure under each group that is defined by the col variable. Since there are two colors appearing in the obj_color relation, we are performing count for each of the two groups.

In general, if a variable is positively grounded in the body and appear in the head of a parent rule, we call the variable an implicit group-by variable.

Explicit Group-By

In the above example, there is no green colored object. However, how do we know that the number of green object is 0? The result does not seem to address this problem.

The missing piece is a domain of the possible groups. Without explicitly setting the domain, Scallop could only search inside of the database on possible groups. However, we can explicitly tell Scallop about what are the groups. Consider the following rewrite of the above program:

rel colors = {"red", "green", "blue"}
rel obj_color = {(0, "red"), (1, "red"), (2, "blue"), (3, "red")}
rel num_obj_per_color(col, num) = num := count(obj: obj_color(obj, col) where col: colors(col))

With the where clause, we have explicitly declared that col is a group-by variable which is grounded by the colors relation. If we look into the colors relation, we find that there are three possible colors that we care about, red, green, and blue. In this case, we will consider "green" as the third group and try to count the number of green objects -- which there are 0:

num_obj_per_color: {("blue", 1), ("green", 0), ("red", 3)}

Sum and Product

We can use the aggregator of sum and product to aggregate multiple numerical values. Consider the following example of sales:

rel sales = {("alice", 1000.0), ("bob", 1200.0), ("christine", 1000.0)}

We can compute the sum of all the sales:

rel total_sales(s) = s := sum[p](sp: sales(p, sp)) // 3200.0
// or
rel total_sales = sum[p](sp: sales(p, sp)) // 3200.0

Notice that the result type of s is the same as the type of the binding variable sp, which is f32 as indicated by the decimals in the definition of sales. Here, the argument variable p is necessary since it is the key to index each sale number. The above rule body is equivalent to the following math formula:

\[ s = \sum_p \text{sale}_p \]

If we do not use the argument variable, we get the following:

rel total_sales_wrong(s) = s := sum(sp: sales(p, sp)) // 2200.0, since the two 1000.0 will be deduplicated without its key

The product aggregator prod can be used in a similar manner as sum.

Min, Max, Argmin, and Argmax

Scallop can compute the minimum or maximum among a set of values. In the following example, we find the maximum grade of an exam:

rel exam_grades = {("a", 95.2), ("b", 87.3), ("c", 99.9)}
rel min_score(m) = m := max(s: exam_grades(_, s)) // 99.9
// or, succinctly
rel min_score = max(s: exam_grades(_, s)) // 99.9

The number (and types) of binding variables can be arbitrary, but the result variables must match the binding variables. In the above case, since s is of type f32, m will be of type f32 as well.

It is also possible to get argmax/argmin. Suppose we want to get the person (along with their grade) who scored the best, we write:

rel best_student(n, s) = (n, s) := max[n](s: exam_grades(n, s))
// or, succinctly
rel best_student = max[n](s: exam_grades(n, s))

Here, we are still finding the maximum score s, but along with max we have specified the "arg" ([n]) which associates with the maximum score. We call n an arg variable for min/max aggregator. The arg variable is grounded by the aggregation body, and can be directly used in the head of the rule.

If we do not care about the grade and just want to know who has the best grade, we can use wildcard _ to ignore the result variable, like

rel best_student(n) = (n, _) := max[n](s: exam_grades(n, s))

Alternatively, we can also use argmax:

rel best_student(n) = n := argmax[n](s: exam_grades(n, s))
// or, succinctly
rel best_student = argmax[n](s: exam_grades(n, s))

Exists and Forall

Logical quantifier such as exists and forall can also be encoded as aggregations. They will return value of boolean as the aggregation result.

Existential Quantifier

Let us start with discussing the easier of the two, exists. Technically, all variables in the body of Scallop rule are existentially quantified. We can use exists aggregation to make it explicit. For example, we can check if there exists an object that is blue:

rel obj_color = {(0, "red"), (1, "green")}
rel has_blue(b) = b := exists(o: obj_color(o, "blue"))

Specifically, we are checking "if there exists an object o such that its color is blue". The result is being assigned to a variable b. Since there is no blue object, we will get a result of has_blue(false).

In case when we just want the result boolean to be true or false, we can omit the result variables. For example, we can rewrite the recursive case of edge-path transitive closure as

rel path(a, c) = exists(b: path(a, b) and edge(b, c))

We note that this is just a syntax sugar equivalent to the following:

rel path(a, c) = r := exists(b: path(a, b) and edge(b, c)) and r == true

When we want to know the inexistence of something, we can do

rel no_red() = not exists(o: obj_color(o, "red"))

Note that there can be arbitrary amount of binding variables.

Universal Quantifier

We can also have universal quantifier forall. For this, there is a special requirement for universal quantification: the body formula has to be an implies (=>) formula. This restriction is enforced so that all the binding variables have bounds being specified on the left-hand-side of the implies formula. In the following example, we check if all the objects are spherical:

type Shape = CUBE | SPHERE | CONE | CYLINDER
rel object = {0, 1, 2}
rel obj_shape = {(0, CUBE), (1, SPHERE), (2, SPHERE)}
rel target(b) = b := forall(o: object(o) implies obj_shape(o, SPHERE))

Notice that we have a relation which defines the domain of object, suggesting that there are just 3 objects for us to work with. In the aggregation, we are checking "for all o such that o is an object, is the object a sphere?" The result is stored in the variable b and propagated to the target relation.

The reason we need to have an implies formula is that we need to use the left-hand-side of implies to give bounds to the universally quantified variables. Scallop cannot reason about open domain variables.

Note that similar to exists, we can also remove the result variable. The following program derives a boolean (arity-0) relation target denoting whether all the red objects are cubes:

type Shape = CUBE | SPHERE | CONE | CYLINDER
type Color = RED | GREEN | BLUE
rel obj_shape = {(0, CUBE), (1, SPHERE), (2, SPHERE)}
rel obj_color = {(0, RED),  (1, GREEN),  (2, GREEN)}
rel target() = forall(o: obj_color(o, RED) implies obj_shape(o, CUBE)) // {()}

Here, we directly use obj_color to serve as the left-hand-side of the implies. There will be one empty tuple being derived, suggesting that the statement is true.

String Join

If you have multiple facts containing strings and you want to join them together, you can use the string_join aggregator:

rel R = {"hello", "world"}
rel P1(n) = n := string_join(s: R(s)) // P1("helloworld")
rel P2(n) = n := string_join<" ">(s: R(s)) // P2("hello world")

In the above example, we can either directly join, producing the string "helloworld", or join with separator " ", producing the string "hello world". Note that the order of the strings in the joined string is determined by the strings. Here, "hello" starts with "h", which is smaller than the "w" in "world", therefore occurring before "world". If you want to specify an explicit order, use the argument variable:

rel R = {(2, "hello"), (1, "world")}
rel P(n) = n := string_join<" ">[i](s: R(i, s)) // P("world hello")

Since we have specified the variable i to be the argument of string_join, it serves to order the tuples. Here, we have (1, "world") and (2, "hello"), so the joined string will be "world hello" instead of "hello world".

Scallop Book