Graphs and Graph Theory

Next up we’re going to take a look at a new data structure, intended to model the mathematical object of a graph. Mathematically, a graph consists of several things:

  - A set of vertices (also called nodes), \(V\).

  - A set of edges \(E \subseteq V \times V\), where an edge \((m,n)\) connects vertex \(m\) to vertex \(n\).

A graph is called undirected if every edge goes “both ways” (that is, for every \((m,n) \in E\), we also have \((n,m) \in E\)). A directed graph is one in which at least one edge does not go both ways. Usually we assume that every node has an (implicit) edge to itself (i.e., for all \(n \in V\), \((n,n) \in E\)), but sometimes we have explicit self-edges. Some graphs allow you to mix directed and undirected edges.

Some graphs have labeled edges, meaning the edges have some extra information attached to them. Some graphs have extra information attached to each vertex. Usually we assume that the “names” of vertices are integers starting at 0.

At any particular vertex \(n\), we have several attributes:

  - The in-degree of \(n\): the number of edges that point into \(n\).

  - The out-degree of \(n\): the number of edges that point out of \(n\).

  - In an undirected graph these two are equal, and we just call the value the degree of \(n\).

Some graphs allow more than one edge between a single pair of vertices (such a graph is called a multigraph), but usually we assume at most a single edge. If we only allow a single edge between any pair of vertices, then the maximum number of edges is \(|V|^2\). If \(|E|\) is close to \(|V|^2\), then we call the graph dense. If \(|E|\) is much smaller than \(|V|^2\), then we call the graph sparse.

A node with only in-edges is called a sink. A node with only out-edges is called a source. A node that has out-edges to every other node, and no in-edges, is called a universal source; similarly, a node with in-edges from every other node (and no out-edges) is a universal sink.

The transpose of a graph is another graph that is formed by reversing the directions of all the edges.

A path is a sequence of nodes connected by edges. The length of a path is one less than the number of nodes in it (so that a single node by itself is a path of length 0). Two nodes are connected if there exists at least one path between them.

An undirected graph is called connected if there is a path between any two vertices. A directed graph is called weakly connected if replacing all directed edges with undirected ones would produce a connected graph. A directed graph is called strongly connected if there is a path between all pairs of vertices.

A cycle is a path that begins and ends at the same vertex. A graph that contains no cycles is called acyclic.

A graph in which there is at most one path between any pair of nodes is actually a tree (or, if it is not connected, a forest).

Graph representations

If that’s a graph mathematically, how do we represent one in the computer? There are two general schemes, with different performance characteristics:

  - The adjacency matrix, which explicitly stores an entry for every possible edge.

  - The adjacency list, which stores, for each vertex, only the edges that actually exist.

(Often we need to store some attributes per-vertex: this is easy in either representation, as we can just give ourselves a vector of size \(|V|\).)

Adjacency matrix

An adjacency matrix is a two-dimensional, \(|V| \times |V|\) array of bools, where element \((m,n)\) is true if there is an edge from the vertex given by the row to the vertex given by the column. E.g.,

          0  1  2  3  4   (to)
        +----------------
       0|    T
       1|       T
       2|    T
       3|    T  T
       4| T        T

  (from)

If we need to attach extra information to edges, we can store it in the array elements by making them something other than bool. Similarly for a multigraph, we can make the elements ints and store the edge count, or store a linked-list of edges if edges have extra data attached to them.

In this representation, we have the following implementations of some common operations:

  - Checking whether the edge \((m,n)\) exists: \(O(1)\); just look at the element.

  - Enumerating all the neighbors of a vertex: \(O(|V|)\); we have to scan the vertex’s entire row.

  - Space used: \(O(|V|^2)\), no matter how many edges actually exist.
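
For concreteness, here is a minimal sketch of this representation in C++ (the class name adj_matrix and its methods are inventions for this example, not anything defined elsewhere in these notes):

#include <vector>
using namespace std;

class adj_matrix {
  public:
    // node_count x node_count matrix, with every edge initially absent
    adj_matrix(int node_count)
        : m(node_count, vector<bool>(node_count, false)) {}

    // Both operations are O(1): we just index into the matrix.
    void create_edge(int src, int dest) { m.at(src).at(dest) = true; }
    bool has_edge(int src, int dest)    { return m.at(src).at(dest); }

  private:
    vector<vector<bool>> m; // m[from][to]
};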

Adjacency list

An adjacency list stores a graph as an array (or vector) of linked lists. Each element of the array represents a vertex, and the elements of its list are all its neighbors (or rather, the edges to all its neighbors). Checking whether a particular edge \((m,n)\) exists now requires scanning \(m\)’s list, taking time proportional to \(m\)’s out-degree, but the whole structure uses only \(O(|V| + |E|)\) space.

Depending on the algorithm, one or the other representation may perform better. Most algorithms which run on the adjacency matrix representation require \(O(|V|^2)\) time; however, there is one interesting case where the adjacency matrix representation is faster than an adjacency list:

Finding universal sources/sinks

Recall that a universal source is a node with an in-degree of 0 and an out-degree of \(|V|-1\). A naive way to find a universal source would be to check every vertex in turn, verifying that its row contains \(|V|-1\) true values and its column contains none, which would require \(O(|V|^2)\) time. However, this ignores information which we discover while processing previous nodes.

Suppose we are looking at an entry in the matrix G[m][n]. This entry is true if there is an edge from node m to node n. Whatever its value, it eliminates one of the two nodes as a candidate universal source:

  - If G[m][n] is true, then n has an in-edge, so n cannot be a universal source.

  - If G[m][n] is false, then m is missing an out-edge, so m cannot be a universal source.

We start at m = 0, n = 1 (assuming the diagonal entries are all false) and repeatedly examine G[m][n], moving past whichever node was just eliminated. Each step eliminates one candidate, so after at most \(|V|-1\) steps only a single candidate remains, and we can verify that candidate directly by checking its row and column.

This gives us an \(O(|V|)\) algorithm for finding a universal source. It can be easily modified to find a universal sink.
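
Here is a sketch of how this might look in code, assuming the matrix is stored as in the adj_matrix example above (g[m][n] == true means there is an edge from m to n):

#include <vector>
using namespace std;

// Returns the index of the universal source, or -1 if none exists.
int universal_source(const vector<vector<bool>>& g) {
    int v = g.size();
    if (v == 0) return -1;

    // Elimination phase: each probe rules out one candidate.
    int candidate = 0;
    for (int other = 1; other < v; ++other) {
        // If candidate has no edge to other, candidate is eliminated;
        // otherwise other has an in-edge and is eliminated.
        if (!g[candidate][other])
            candidate = other;
    }

    // Verification phase: the survivor must have an out-edge to every
    // other node, and no in-edges at all.
    for (int n = 0; n < v; ++n) {
        if (n == candidate) continue;
        if (!g[candidate][n] || g[n][candidate])
            return -1;
    }
    return candidate;
}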

There are two fundamental algorithms that come up constantly when we work with graphs, and both have to deal with the problem of searching. Specifically, we want to solve this problem:

Given a starting node and an ending node, determine whether a path exists from start to end, and return the path if it exists.

In both algorithms, we begin at the starting node and proceed to search its neighbors, marking a vertex as “explored” when we have exhausted its possibilities. The difference between the two algorithms deals with how they handle branching:

  - Breadth-first search (BFS) explores all of a vertex’s neighbors before moving on to vertices further from the start.

  - Depth-first search (DFS) follows a single path as far as it will go, and only backs up to try a different branch when it gets stuck.

Both BFS and DFS rely on us being able to mark vertices as “explored” or “unexplored”. Actually, they both use three colors:

  - “White”: completely unexplored.

  - “Gray”: discovered, but still being processed.

  - “Black”: fully explored; we are finished with it.

The idea behind BFS is that when we explore a vertex, we add all its neighbors to the end of a queue. We then dequeue the next vertex and explore that. Thus, the neighbors of a vertex at distance \(d\) from the source will not be explored until all other nodes at distance \(d\) have also been explored. This gives BFS the appearance of exploring a continually growing “fringe” around the starting node. At the edge of the fringe are the nodes which are in the queue. Inside the fringe are nodes which were in the queue at one time, but which have since been explored. Beyond the fringe are the nodes which we haven’t explored yet.

We need to keep track of whether we’ve explored a node or not, so we keep an extra vector<bool> to store whether a particular vertex has been explored. (If the graph is acyclic, we don’t need this; it’s only useful if cycles exist.)

A general sketch of a BFS is something like this:

  1. Mark all vertices unexplored

  2. Mark the starting vertex explored, and enqueue it.

  3. While the queue is not empty, dequeue a node \(n\)

  4. If \(n\) is the ending node, we are done, return True.

  5. Otherwise, mark \(n\) explored, and enqueue all its neighbors that are not already explored. Go to step 3.

  6. If the queue is empty and we have not found the ending node, then no path from start to end exists, return False.

(Most descriptions of BFS use colors to distinguish explored/unexplored. A vertex is “white” if it is unexplored, “gray” while we are processing it, and “black” after all its neighbors have been enqueued and we are finished with it. But the algorithm works the same if we only distinguish between “white” and “non-white”.)

This takes time proportional to the number of vertices \(|V|\), plus the time spent enumerating neighbors (see the complexity analysis below), because once we’ve explored a vertex, we never revisit it.

Note that if we run the algorithm until it has explored everything, and there are still unexplored nodes, then the graph is not connected. It’s possible to use this to find the connected components of the graph: mark all explored nodes as being in one component, and then repeat the search from an unexplored node. Continue until there are no unexplored nodes.
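
A sketch of this idea, assuming an undirected graph (every edge stored in both directions) and reusing the adj_list::bfs traversal defined at the end of these notes (the function name label_components is our own):

#include <vector>
#include <functional>
using namespace std;

// Label every node with a component number, via repeated BFS.
// Returns the number of components found.
int label_components(adj_list& g, vector<int>& component) {
    int n = g.edges.size();
    component.assign(n, -1); // -1 means "not yet in any component"
    int count = 0;

    for (int i = 0; i < n; ++i)
        if (component.at(i) == -1) {
            // Everything reachable from i belongs to component `count`.
            g.bfs(i, [&](int node) { component.at(node) = count; });
            ++count;
        }

    return count;
}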

If we run the search only until we find the target node, then we call it a search algorithm. If, on the other hand, we run it until it visits all (connected) nodes then we call it a graph traversal algorithm, because it traverses the graph. We could, of course, “traverse” all the nodes in the graph by just doing a loop over all the nodes, but this ignores the connections between them. When we traverse a graph, we would like to do it in a way that gives us some information about connectivity.

Example…

This gives us a true/false value indicating whether or not a path exists, but it doesn’t tell us anything about the path itself. Fortunately, it’s easy to modify the algorithm to create a BFS tree which can actually tell us the path from the starting vertex to any other vertex, if we let the algorithm run until all nodes have been processed (i.e., skip step (4)). To do this, we create a vector of parent pointers, one per node. We refer to the parent of a node \(n\) as \(n.\pi\). We modify the algorithm as follows:

(step 5) Mark \(n\) explored, and enqueue all unexplored neighbors. When we enqueue a neighbor \(m\), set \(m.\pi = n\). (That is, the parent of a vertex is the vertex that we came from when we explored it.)

When we are done, we can look at the ending node (or any node) and follow its parent-pointers back to the starting node.
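
For instance, a sketch of the reconstruction step, assuming the parent pointers live in a vector<int> named parent, with unreached nodes (and the starting node itself) marked by -1:

#include <vector>
#include <algorithm>
using namespace std;

// Walk the parent pointers backwards from end, then reverse the result.
vector<int> rebuild_path(const vector<int>& parent, int start, int end) {
    vector<int> path;
    for (int n = end; n != -1; n = parent.at(n))
        path.push_back(n);

    reverse(path.begin(), path.end()); // we built it end-to-start

    if (path.empty() || path.front() != start)
        return {}; // end was never reached; no path exists

    return path;
}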

Another easy change is to record for each node its distance from the starting node. We write the distance of a node \(n\) as \(n.d\) and make the following changes:

(step 2) Mark the starting vertex explored, set its distance to 0, and enqueue it.

(step 5) Mark \(n\) explored and enqueue all its unexplored neighbors. When we enqueue a neighbor \(m\), set \(m.d = n.d + 1\).

The paths found by breadth-first search are in fact the shortest paths from the starting node to any other node. We can see that this is the case by virtue of how the BFS explores nodes at increasing distances: if the shortest path from start to finish is of length \(d\), it will be found when the fringe is \(d\) steps from the starting node. This means that the distances computed are also the minimum distances from the starting node.

Example…

Complexity analysis

The time taken by the algorithm depends on whether we use the adjacency list or adjacency matrix representation:

  - Adjacency list: each vertex is dequeued at most once, and each edge is examined at most once (twice for an undirected graph), for a total of \(O(|V| + |E|)\).

  - Adjacency matrix: finding the neighbors of a vertex requires scanning its entire row, for a total of \(O(|V|^2)\).

For this algorithm, the adjacency list representation is better.

DFS can be thought of as what you’d get if you replaced the queue from BFS with a stack. Because it uses a stack, we can implement it recursively (letting the call stack stand in for the explicit stack) without too much trouble.

Example…

Often it is useful to color nodes by their exploration: “white” is totally unexplored, “gray” is in the process of being explored (i.e., set to gray at the start of visit) and “black” is done being explored (set to black at the end of visit).

As with breadth-first search, an easy change to DFS is to add parent pointers, constructing the DFS-tree, so that we can recover the path from start to finish. Note, however, that unlike BFS, this path is most likely not the shortest path in any sense. It is merely the first path discovered by the search.

Another common change is to record two timestamps for each node:

  - \(u.d\), the “discovery” time, recorded when we first encounter the node (and color it gray).

  - \(u.f\), the “finishing” time, recorded when we are done with the node (and color it black).

For any given node \(u\), the values of \(u.d\) and \(u.f\) have some interesting properties: we always have \(u.d < u.f\), and for any two nodes \(u\) and \(v\), the intervals \([u.d, u.f]\) and \([v.d, v.f]\) are either completely disjoint or nested one inside the other; they never partially overlap.

Example…

By examining the colors of neighboring nodes when we visit them, we can extract some interesting information:

  - If a neighbor is white, the edge to it is a tree edge: it becomes part of the DFS-tree.

  - If a neighbor is gray, the edge is a back edge: it leads to one of our ancestors, which means the graph contains a cycle.

  - If a neighbor is black, the edge is a forward or cross edge: it leads to a part of the graph we have already finished exploring.

Directed acyclic graphs and topological sort

A directed acyclic graph is a directed graph which contains no cycles. DAGs occur fairly often when talking about tasks and dependencies. If we have a set of tasks, where tasks can depend on other tasks, we cannot allow cycles, or the tasks could never be completed. But it is possible for one task to depend on multiple other tasks, or for multiple tasks to all depend on the same task. Given such a graph, we want to output all the tasks in dependency order, the order we would have to complete them in so that every task’s dependencies are completed before it. This is called a topological sort of the DAG, and it’s easy to find with DFS:

When we finish visiting a node, add it to the front of a linked list. When the DFS is finished, the linked list contains the nodes in topological order (i.e., in reverse finishing-time order). A code sketch appears at the end of these notes.

Graph Implementation

There are various ways to implement graphs in C++. The “best” implementations make writing the various graph algorithms straightforward, but that requires a lot of work on the implementor’s part (writing custom iterators and working with templates and such). We’ll write a simple directed graph with weighted edges, where we can ignore the edge weights if we want to.

#include <vector>
#include <list>
#include <queue>
#include <functional>
using namespace std;

class adj_list {
  public:
    struct edge_type {
        float weight;    // extra information attached to this edge
        int destination; // index of the node this edge points to
    };

    // Construct a graph with node_count vertices and no edges.
    adj_list(int node_count) : edges(node_count) {}

    // Declared here; defined below.
    void create_edge(int src, int dest, float w = 1);
    bool has_edge(int src, int dest);
    void bfs(int start, function<void(int)> visit);
    void dfs(int start, function<void(int)> visit);

    vector<list<edge_type>> edges;

  private:
    void dfs(int start, function<void(int)> visit, vector<bool>& explored);
};

Nodes are identified by the indices in the edges vector, so we can add methods to create new edges and check whether an edge exists:

void adj_list::create_edge(int src, int dest, float w) {
    // (The default argument w = 1 lives in the declaration above.)
    edges.at(src).push_back(edge_type{w, dest});
}

bool adj_list::has_edge(int src, int dest) {
    for(edge_type& e : edges.at(src))
        if(e.destination == dest)
            return true;

    return false;
}
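
Usage might look something like this (an arbitrary, made-up graph):

adj_list g{5};

g.create_edge(0, 1);
g.create_edge(1, 2);
g.create_edge(3, 4, 2.5); // explicitly-weighted edge

g.has_edge(0, 1); // true
g.has_edge(1, 0); // false: edges are directed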

Implementing DFS and BFS over this structure is not too hard:

void adj_list::bfs(int start, function<void(int)> visit) {
    queue<int> q;
    vector<bool> explored(edges.size(), false);

    explored.at(start) = true;
    q.push(start);

    while(!q.empty()) {
        int n = q.front();
        q.pop();

        visit(n);

        // Mark neighbors as explored when we enqueue them, so that no
        // node can end up in the queue twice.
        for(auto e : edges.at(n))
            if(!explored.at(e.destination)) {
                explored.at(e.destination) = true;
                q.push(e.destination);
            }
    }
}

This version just does a traversal over the entire graph, calling the function-object visit on each node. It does not retain the parent-tree, or keep distances, though adding those is relatively easy.
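
A sketch of one way to add them, say as a separate method bfs_tree (a name of our own invention; it would also need a declaration in the class):

// BFS from start, filling in parent pointers and distances.
// Unreached nodes (and start itself) end up with parent -1.
void adj_list::bfs_tree(int start, vector<int>& parent, vector<int>& dist) {
    queue<int> q;
    vector<bool> explored(edges.size(), false);

    parent.assign(edges.size(), -1);
    dist.assign(edges.size(), -1);

    explored.at(start) = true;
    dist.at(start) = 0;
    q.push(start);

    while(!q.empty()) {
        int n = q.front();
        q.pop();

        for(auto e : edges.at(n))
            if(!explored.at(e.destination)) {
                explored.at(e.destination) = true;
                parent.at(e.destination) = n;            // the node we came from
                dist.at(e.destination) = dist.at(n) + 1; // one step further out
                q.push(e.destination);
            }
    }
}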

void adj_list::dfs(int start, function<void(int)> visit) {
    vector<bool> explored(edges.size(), false);

    dfs(start, visit, explored);
}

void adj_list::dfs(int start,
                   function<void(int)> visit,
                   vector<bool>& explored) {

    visit(start);
    explored.at(start) = true;

    // Recurse into each unexplored neighbor; the call stack plays the
    // role of the explicit stack.
    for(auto e : edges.at(start))
        if(!explored.at(e.destination))
            dfs(e.destination, visit, explored);
}

Once again, we just do a simple traversal, and don’t maintain the parent-tree or the timestamps.
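
As one final sketch, the topological sort described earlier falls out of DFS almost for free. Here we write it as a free function over adj_list (the name topological_sort and the lambda-based helper are our own; we also assume the graph really is a DAG):

#include <list>

// Run DFS from every unexplored node; prepend each node to the result
// when we *finish* it, giving reverse finishing-time order.
list<int> topological_sort(adj_list& g) {
    int n = g.edges.size();
    vector<bool> explored(n, false);
    list<int> order;

    // Recursive visit helper, written as a self-calling lambda.
    function<void(int)> visit = [&](int node) {
        explored.at(node) = true;

        for(auto e : g.edges.at(node))
            if(!explored.at(e.destination))
                visit(e.destination);

        order.push_front(node); // finished with node: put it at the front
    };

    for(int i = 0; i < n; ++i)
        if(!explored.at(i))
            visit(i);

    return order;
}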