Databases

A database is a collection of data with some organizational structure. A database can be used to store, process, and query large amounts of information. A database can be used to give short-lived programs persistence by allowing them to save and retrieve program state between executions.

We will consider two classes of databases: relational and non-relational databases.

Relational databases

A relational database stores data in tables. Each table consists of a number of named columns that each store some type of data. Each row in the database has exactly one entry per column, and represents a data record.

Common relational databases include MySQL, PostgreSQL, and SQLite. Both MySQL and PostgreSQL are run as dedicated services which your program would then connect to. SQLite is actually just a C library that is embedded in your program, and so requires no external process. One benefit of using SQLite is that your database is a single file that can be easily emailed, uploaded, and downloaded, unlike MySQL or PostgreSQL which store their data across many files on the server. All three databases use the SQL language for constructing queries.

SQL Queries

We can use queries written in SQL, the structured query language. These queries perform operations such as reading from or writing to the database. Throughout this section we assume we have a single database called example with a single table called IrisData containing Fisher’s iris plant data. It has the following structure:

Id SepalLength SepalWidth PetalLength PetalWidth IrisClass
1 5.1 3.5 1.4 0.2 Iris-setosa

To read all data from the database, we can write

SELECT
  Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass
FROM
  IrisData

Here extra whitespace is ignored, and so we use it for clarity. This query would return a number of rows, each containing the requested columns. If we were only interested in a subset of the data, we could write

SELECT
  SepalLength, SepalWidth
FROM
  IrisData
WHERE
  IrisClass = "Iris-setosa"

Here we’ve elected to only retrieve two columns, and furthermore we are only interested in rows where the IrisClass value is Iris-setosa.

Storing lists, or one-to-many relationships

The types of data we can store in a column include integers, floating point numbers, references to other columns, dates, and strings. However, there is no list type. This is because of database normalization, more specifically because of the first normal form requirement that every row and column intersection contain only a single value. To get around this we must introduce more tables.

Suppose we want to store the following Python class in an SQLite database.

class Foo:
    def __init__(self, a, b, xs):
        self.a = int(a)
        self.b = float(b)
        self.xs = list(xs) # List of strings

The a and b attributes can easily be stored, but the xs entry is more difficult. We first create a table where each row contains a single Foo object.

CREATE TABLE Foos (
  FooId INTEGER PRIMARY KEY AUTOINCREMENT,
  a     INTEGER,
  b     DOUBLE PRECISION
)

We will now create a new table where each row contains a single element of the xs attribute for some Foo object.

CREATE TABLE FooXs (
  FooXsId INTEGER PRIMARY KEY AUTOINCREMENT,
  FooId   INTEGER REFERENCES Foos (FooId),
  SeqNum  INTEGER,
  x       TEXT
)

The FooId column points to the Foo object that this data is stored in, the x column contains a single entry of the associated xs attribute, and the SeqNum column contains the index of the value. Suppose we wanted to store the following three Foo objects.

f1 = Foo(7, 10.1, ['Hello', 'world'])
f2 = Foo(8, 3.14, ['Hello', 'there', '!'])
f3 = Foo(1, 0, [])

We would end up with a Foo table like the following,

FooId a b
1 7 10.1
2 8 3.14
3 1 0

and a FooXs table like the following:

FooXsId FooId SeqNum x
1 1 0 “Hello”
2 1 1 “world”
3 2 0 “Hello”
4 2 1 “there”
5 2 2 ”!”

To recover this information, we would use a JOIN query such as

SELECT
  Foos.FooId, a, b, x, SeqNum
FROM
  Foos
LEFT OUTER JOIN
  FooXs
ON
  Foos.FooId = FooXs.FooId

We use a left outer join to ensure that the third Foo object is returned; with an inner join, it would not be returned since there is no entry in FooXs for it.