Loading from CSV

Scallop can be used along with existing datasets loaded from CSVs. This is usually achieved with annotating on specific relations. For example, assuming we have a file edge.csv,

0,1
1,2

we can load the content of it into a relation edge in Scallop using the following syntax

@file("edge.csv")
type edge(from: usize, to: usize)

rel path(a, c) = edge(a, c) or path(a, b) and edge(b, c)

query path

In particular, we annotate the @file(...) attribute onto the relation type declaration type edge(...). The file name is written inside the @file attribute. We require the relation to be declared with types in order for it to be loaded with CSV file content. Depending on the type declaration, the file content will be parsed into values of certain types.

From here, the edge relation will be loaded with the content (0, 1) and (1, 2). After executing the Scallop program above, we would obtain the result path being (0, 1), (0, 2), and (1, 2).

Certainly, there are many ways to load CSV. In this section, we introduce the various ways to configure the CSV loading.

Headers

There are CSV files with headers. Suppose we have the following CSV file

from,to
0,1
1,2

To load this file, we would need to add an additional argument header=true to the @file attribute:

@file("edge.csv", header=true)
type edge(from: usize, to: usize)

Note that by default we assume that CSV files don't have headers.

Deliminators

By default, we assume the values inside of the CSV file are deliminated by commas ','. In case where CSV files have values deliminated by other characters, such as tabs '\t', we would need to specify that in the @file attribute:

@file("edge.csv", deliminator="\t")
type edge(from: usize, to: usize)

Note that deliminators cannot be of multiple characters.

Parsing Field-Value Pairs

There are many CSV tables which have a lot of columns. One way is to specify all the fields and their types, like the following.

type table(field1: type1, field2: type2, ..., fieldn: typen)

However, this might be very hard to encode. Therefore, we provide another way of parsing CSV files into relations, by using primary keys and field-value pairs. Let's assume we have the following CSV file:

student_id,name,year,gender
0001,alice,2020,female
0002,bob,2021,male

We see that student_id can serve as the primary key of this table. With this, it can be loaded into the following relation

@file("student.csv", keys="student_id")
type table(student_id: usize, field: String, value: String)

By specifying keys="student", we tell Scallop that student_id should be viewed as unique primary keys. The rest of the two elements are field and value, both need to be typed Strings. As a result, it produces the following 6 facts in the table relation:

(1, "name", "alice"), (1, "year", "2020"), (1, "gender", "female"),
(2, "name", "bob"),   (2, "year", "2021"), (2, "gender", "male")

Note that there could be more than one keys. Consider the following table

student_id,course_id,enroll_time,grade
0001,cse100,fa2020,a
0001,cse101,sp2021,a
0002,cse120,sp2021,b

We see that the combination of student_id and course_id form the unique primary keys. In this case, they can be loaded using the following syntax:

@file("enrollment.csv", keys=["student_id", "course_id"])
type enrollment(student_id: usize, course_id: String, field: String, value: String)

By setting keys to be a list ["student_id", "course_id"], the student_id field is the first primary key and course_id is the second. There are still two additional arguments for the enrollment relation. In general, the arity of the relation will be the number of primary keys plus 2.

Specifying Fields to Load

In case not all fields are desired when loading, one can use the fields argument to specify what to load. Consider the same enrollment table encoded in CSV:

student_id,course_id,enroll_time,grade
0001,cse100,fa2020,a
0001,cse101,sp2021,a
0002,cse120,sp2021,b

If we only want to get everything but omit the enroll_time column, we can do

@file("enrollment.csv", fields=["student_id", "course_id", "grade"])
type enrollment(student_id: usize, course_id: String, grade: String)

This can also work in conjunction with the keys argument. In this case, we do not need to specify the primary keys.

@file("enrollment.csv", keys=["student_id", "course_id"], fields=["grade"])
type enrollment(student_id: usize, course_id: String, field: String, value: String)
// The following facts will be obtained
//   enrollment(1, "cse100", "grade", "a")
//   enrollment(1, "cse101", "grade", "a")
//   enrollment(2, "cse120", "grade", "b")