Databases

A database is a collection of data with some organizational structure. A database can be used to store, process, and query large amounts of information. A database can be used to give short-lived programs persistence by allowing them to save and retrieve program state between executions.

We will consider two classes of databases: relational and non-relational databases.

Relational databases

A relational database stores data in tables. Each table consists of a number of named columns that each store some type of data. Each row in the database has exactly one entry per column, and represents a data record.

Common relational databases include MySQL, PostgreSQL, and SQLite. Both MySQL and PostgreSQL are run as dedicated services which your program would then connect to. SQLite is actually just a C library that is embedded in your program, and so requires no external process. One benefit of using SQLite is that your database is a single file that can be easily emailed, uploaded, and downloaded, unlike MySQL or PostgreSQL which store their data across many files on the server. All three databases use the SQL language for constructing queries.

SQL Queries

We can use queries written in SQL, the structured query language. These queries perform operations such as reading from or writing to the database. Throughout this section we assume we have a single database called example with a single table called IrisData containing Fisher’s iris plant data. It has the following structure:

Id SepalLength SepalWidth PetalLength PetalWidth IrisClass
1 5.1 3.5 1.4 0.2 Iris-setosa

To read all data from the database, we can write

SELECT
  Id, SepalLength, SepalWidth, PetalLength, PetalWidth, IrisClass
FROM
  IrisData

Here extra whitespace is ignored, and so we use it for clarity. This query would return a number of rows, each containing the requested columns. If we were only interested in a subset of the data, we could write

SELECT
  SepalLength, SepalWidth
FROM
  IrisData
WHERE
  IrisClass = "Iris-setosa"

Here we’ve elected to only retrieve two columns, and furthermore we are only interested in rows where the IrisClass value is Iris-setosa.

Storing lists, or one-to-many relationships

The types of data we can store in a column include integers, floating point numbers, references to other columns, dates, and strings. However, there is no list type. This is because of database normalization, more specifically because of the first normal form requirement that every row and column intersection contain only a single value. To get around this we must introduce more tables.

Suppose we want to store the following Python class in an SQLite database.

class Foo:
    def __init__(self, a, b, xs):
        self.a = int(a)
        self.b = float(b)
        self.xs = list(xs) # List of strings

The a and b attributes can easily be stored, but the xs entry is more difficult. We first create a table where each row contains a single Foo object.

CREATE TABLE Foos (
  FooId INTEGER PRIMARY KEY AUTOINCREMENT,
  a     INTEGER,
  b     DOUBLE PRECISION
)

We will now create a new table where each row contains a single element of the xs attribute for some Foo object.

CREATE TABLE FooXs (
  FooXsId INTEGER PRIMARY KEY AUTOINCREMENT,
  FooId   INTEGER REFERENCES Foos (FooId),
  SeqNum  INTEGER,
  x       TEXT
)

The FooId column points to the Foo object that this data is stored in, the x column contains a single entry of the associated xs attribute, and the SeqNum column contains the index of the value. Suppose we wanted to store the following three Foo objects.

f1 = Foo(7, 10.1, ['Hello', 'world'])
f2 = Foo(8, 3.14, ['Hello', 'there', '!'])
f3 = Foo(1, 0, [])

We would end up with a Foo table like the following,

FooId a b
1 7 10.1
2 8 3.14
3 1 0

and a FooXs table like the following:

FooXsId FooId SeqNum x
1 1 0 “Hello”
2 1 1 “world”
3 2 0 “Hello”
4 2 1 “there”
5 2 2 ”!”

To recover this information, we would use a JOIN query such as

SELECT
  Foos.FooId, a, b, x, SeqNum
FROM
  Foos
LEFT OUTER JOIN
  FooXs
ON
  Foos.FooId = FooXs.FooId

We use a left outer join to ensure that the third Foo object is returned; with an inner join, it would not be returned since there is no entry in FooXs for it.

Object-relational mapping (ORM)

As seen with the above example, working with databases directly can lead to code that is less intuitive. Ideally, one would want to work with objects directly and have code that writes the correct SQL queries. This is the intuition behind an object-relational mapping tool. Its job is to bridge the gap between data stored in an object-oriented paradigm and that in a relational database.

There are a large variety of ORM Tools for many languages that provide an intuitive way to interact with relational databases. For example, a commonly used ORM for Python is SQLAlchemy. To see how it works, we can write Python code to perform the same tasks as above. The following declares an Iris object that will take advantage of the existent table.

from sqlalchemy import Column,Integer,Float,String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Iris(Base):
   __tablename__ = IrisData

   Id          = Column(Integer,primary_key=True)
   SepalLength = Column(Float)
   SepalWidth  = Column(Float)
   PetalLength = Column(Float)
   PetalWidth  = Column(Float)
   IrisType    = Column(String(20))

   def __init__(self,SepalLength=None,SepalWidth=None,
                     PetalLength=None,PetalWidth=None,
                     IrisType=None):

      self.SepalLength = SepalLength
      self.SepalWidth  = SepalWidth
      self.PetalLength = PetalLength
      self.PetalWidth  = PetalWidth
      self.IrisType    = IrisType

   def __repr__(self):
      return "Iris(%r,%r,%r,%r,%r)" % (self.SepalLength,
                                       self.SepalWidth,
                                       self.PetalLength,
                                       self.PetalWidth,
                                       self.IrisType)

Once a session object is used to connect with the database, all that is needed to query the table is are a few additional lines. The following lines query all the values in the table.

iris_data = session.query(Iris).all()
for iris in iris_data:
    print(iris)

To duplicate the second example, we could write the following lines.

sepal_query         = session.query(Iris.SepalLength,Iris.SepalWidth)
filtered_sepal_data = sepal_query.filter_by(IrisType == 'Iris-setosa').all()

for sepal_length,sepal_width in filtered_sepal_data:
   print(sepal_length,sepal_width)

For more complete code and to be able to run examples interactively, see the full SQL example repository.

As an added benefit, many web application frameworks have extensions that allow them to interact with ORMs. For example, Flask-SQLAlchemy is an extension of Flask adding support for SQLAlchemy.

To summarize, using an ORM is beneficial because it allows one to focus on the object-oriented nature of the code and not the SQL queries themselves.