Databases¶
A database is a collection of data with some organizational structure. A database can be used to store, process, and query large amounts of information. A database can be used to give short-lived programs persistence by allowing them to save and retrieve program state between executions.
We will consider two classes of databases: relational and non-relational databases.
Relational databases¶
A relational database stores data in tables. Each table consists of a number of named columns that each store some type of data. Each row in the database has exactly one entry per column, and represents a data record.
Common relational databases include MySQL, PostgreSQL, and
SQLite. Both MySQL
and PostgreSQL
are run as dedicated
services which your program would then connect to. SQLite
is
actually just a C library that is embedded in your program, and so
requires no external process. One benefit of using SQLite
is that
your database is a single file that can be easily emailed, uploaded,
and downloaded, unlike MySQL
or PostgreSQL
which store their
data across many files on the server. All three databases use the
SQL
language for constructing queries.
SQL Queries¶
We can use queries written in SQL, the structured query
language. These queries perform operations such as reading from or
writing to the database. Throughout this section we assume we have a
single database called example
with a single table called
IrisData
containing Fisher’s iris plant data. It has the
following structure:
Id | SepalLength | SepalWidth | PetalLength | PetalWidth | IrisClass |
1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
To read all data from the database, we can write
SELECT
Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass
FROM
IrisData
Here extra whitespace is ignored, and so we use it for clarity. This query would return a number of rows, each containing the requested columns. If we were only interested in a subset of the data, we could write
SELECT
SepalLength, SepalWidth
FROM
IrisData
WHERE
IrisClass = "Iris-setosa"
Here we’ve elected to only retrieve two columns, and furthermore we
are only interested in rows where the IrisClass
value is
Iris-setosa
.
Storing lists, or one-to-many relationships¶
The types of data we can store in a column include integers, floating point numbers, references to other columns, dates, and strings. However, there is no list type. This is because of database normalization, more specifically because of the first normal form requirement that every row and column intersection contain only a single value. To get around this we must introduce more tables.
Suppose we want to store the following Python class in an SQLite database.
class Foo:
def __init__(self, a, b, xs):
self.a = int(a)
self.b = float(b)
self.xs = list(xs) # List of strings
The a
and b
attributes can easily be stored, but the xs
entry is more difficult. We first create a table where each row
contains a single Foo
object.
CREATE TABLE Foos (
FooId INTEGER PRIMARY KEY AUTOINCREMENT,
a INTEGER,
b DOUBLE PRECISION
)
We will now create a new table where each row contains a single
element of the xs
attribute for some Foo
object.
CREATE TABLE FooXs (
FooXsId INTEGER PRIMARY KEY AUTOINCREMENT,
FooId INTEGER REFERENCES Foos (FooId),
SeqNum INTEGER,
x TEXT
)
The FooId
column points to the Foo
object that this data is
stored in, the x
column contains a single entry of the
associated xs
attribute, and the SeqNum
column contains the
index of the value. Suppose we wanted to store the following three
Foo
objects.
f1 = Foo(7, 10.1, ['Hello', 'world'])
f2 = Foo(8, 3.14, ['Hello', 'there', '!'])
f3 = Foo(1, 0, [])
We would end up with a Foo
table like the following,
FooId | a | b |
1 | 7 | 10.1 |
2 | 8 | 3.14 |
3 | 1 | 0 |
and a FooXs
table like the following:
FooXsId | FooId | SeqNum | x |
1 | 1 | 0 | “Hello” |
2 | 1 | 1 | “world” |
3 | 2 | 0 | “Hello” |
4 | 2 | 1 | “there” |
5 | 2 | 2 | ”!” |
To recover this information, we would use a JOIN
query such as
SELECT
Foos.FooId, a, b, x, SeqNum
FROM
Foos
LEFT OUTER JOIN
FooXs
ON
Foos.FooId = FooXs.FooId
We use a left outer join to ensure that the third Foo
object is
returned; with an inner join, it would not be returned since there is
no entry in FooXs
for it.