Graphs and Graph Theory

Next up we’re going to take a look at a new data structure, intended to model the mathematical object of a graph. Mathematically, a graph consists of several things:

  - A set of vertices (also called nodes), \(V\).

  - A set of edges \(E \subseteq V \times V\), where an edge \((m,n)\) connects vertex \(m\) to vertex \(n\).

A graph is called undirected if every edge goes “both ways” (that is, for every \((m,n) \in E\), we also have \((n,m) \in E\)). A directed graph is one in which at least one edge does not go both ways. Usually we assume that every node has an (implicit) edge to itself (i.e., for all \(n \in V\), \((n,n) \in E\)), but sometimes we have explicit self-edges. Some graphs allow you to mix directed and undirected edges.

Some graphs have labeled edges, meaning the edges have some extra information attached to them. Some graphs have extra information attached to each vertex. Usually we assume that the “names” of vertices are integers starting at 0.

At any particular vertex \(n\), we have several attributes:

  - The in-degree of \(n\): the number of edges that point into \(n\).

  - The out-degree of \(n\): the number of edges that point out of \(n\).

  - In an undirected graph these two are equal, and we just call the value the degree of \(n\).

Some graphs allow more than one edge between a single pair of vertices (such a graph is called a multigraph), but usually we assume at most a single edge. If we only allow a single edge between any pair of vertices, then the maximum number of edges is \(|V|^2\). If \(|E|\) is close to \(|V|^2\), then we call the graph dense. If \(|E|\) is much smaller than \(|V|^2\), then we call the graph sparse.

A node with only in-edges is called a sink. A node with only out-edges is called a source. A node that has out-edges to every other node, and no in-edges, is called a universal source; similarly, a node with in-edges from every other node (and no out-edges) is a universal sink.

The transpose of a graph is another graph that is formed by reversing the directions of all the edges.

A path is a sequence of nodes connected by edges. The length of a path is one less than the number of nodes in it (so that a single node by itself is a path of length 0). Two nodes are connected if there exists at least one path between them.

An undirected graph is called connected if there is a path between any two vertices. A directed graph is called weakly connected if replacing all directed edges with undirected ones would produce a connected graph. A directed graph is called strongly connected if there is a path between all pairs of vertices.

A cycle is a path that begins and ends at the same vertex. A graph that contains no cycles is called acyclic.

A graph in which there is at most one path between any pair of nodes is actually a tree (or, if it is not connected, a forest).

Graph representations

If that’s a graph mathematically, how do we represent one in the computer? There are two general schemes, with different performance characteristics:

  - The adjacency matrix, which explicitly stores an entry for every possible edge.

  - The adjacency list, which stores, for each vertex, only the edges that actually exist.

(Often we need to store some attributes per-vertex: this is easy in either representation, as we can just give ourselves a vector of size \(|V|\).)

Adjacency matrix

An adjacency matrix is a two-dimensional, \(|V| \times |V|\) array of bools, where element \((m,n)\) is true if there is an edge from the vertex given by the row to the vertex given by the column. E.g.,

          0  1  2  3  4   (to)
        +----------------
       0|    T
       1|       T
       2|    T
       3|    T  T
       4| T        T

  (from)

If we need to attach extra information to edges, we can store it in the array elements by making them something other than bool. Similarly for a multigraph, we can make the elements ints and store the edge count, or store a linked-list of edges if edges have extra data attached to them.

In this representation, we have the following implementations of some common operations:

  - Checking whether the edge \((m,n)\) exists: \(O(1)\); just look at the element.

  - Enumerating all the neighbors of a vertex: \(O(|V|)\); we have to scan the vertex’s entire row.

  - Space used: \(O(|V|^2)\), no matter how many edges actually exist.
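
For concreteness, here is a minimal sketch of this representation in C++ (the class name adj_matrix and its methods are inventions for this example, not anything defined elsewhere in these notes):

#include <vector>
using namespace std;

class adj_matrix {
  public:
    // node_count x node_count matrix, with every edge initially absent
    adj_matrix(int node_count)
        : m(node_count, vector<bool>(node_count, false)) {}

    // Both operations are O(1): we just index into the matrix.
    void create_edge(int src, int dest) { m.at(src).at(dest) = true; }
    bool has_edge(int src, int dest)    { return m.at(src).at(dest); }

  private:
    vector<vector<bool>> m; // m[from][to]
};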

Adjacency list

An adjacency list stores a graph as an array (or vector) of linked lists. Each element of the array represents a vertex, and the elements of its list are all its neighbors (or rather, the edges to all its neighbors). Checking whether a particular edge \((m,n)\) exists now requires scanning \(m\)’s list, taking time proportional to \(m\)’s out-degree, but the whole structure uses only \(O(|V| + |E|)\) space.

Depending on the algorithm, one or the other representation may perform better. Most algorithms which run on the adjacency matrix representation require \(O(|V|^2)\) time; however, there is one interesting case where the adjacency matrix representation is faster than an adjacency list:

Finding universal sources/sinks

Recall that a universal source is a node with an in-degree of 0 and an out-degree of \(|V|-1\). A naive way to find a universal source would be to check every vertex in turn, verifying that its row contains \(|V|-1\) true values and its column contains none, which would require \(O(|V|^2)\) time. However, this ignores information which we discover while processing previous nodes.

Suppose we are looking at an entry in the matrix G[m][n]. This entry is true if there is an edge from node m to node n. Whatever its value, it eliminates one of the two nodes as a candidate universal source:

  - If G[m][n] is true, then n has an in-edge, so n cannot be a universal source.

  - If G[m][n] is false, then m is missing an out-edge, so m cannot be a universal source.

We start at m = 0, n = 1 (assuming the diagonal entries are all false) and repeatedly examine G[m][n], moving past whichever node was just eliminated. Each step eliminates one candidate, so after at most \(|V|-1\) steps only a single candidate remains, and we can verify that candidate directly by checking its row and column.

This gives us an \(O(|V|)\) algorithm for finding a universal source. It can be easily modified to find a universal sink.
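
Here is a sketch of how this might look in code, assuming the matrix is stored as in the adj_matrix example above (g[m][n] == true means there is an edge from m to n):

#include <vector>
using namespace std;

// Returns the index of the universal source, or -1 if none exists.
int universal_source(const vector<vector<bool>>& g) {
    int v = g.size();
    if (v == 0) return -1;

    // Elimination phase: each probe rules out one candidate.
    int candidate = 0;
    for (int other = 1; other < v; ++other) {
        // If candidate has no edge to other, candidate is eliminated;
        // otherwise other has an in-edge and is eliminated.
        if (!g[candidate][other])
            candidate = other;
    }

    // Verification phase: the survivor must have an out-edge to every
    // other node, and no in-edges at all.
    for (int n = 0; n < v; ++n) {
        if (n == candidate) continue;
        if (!g[candidate][n] || g[n][candidate])
            return -1;
    }
    return candidate;
}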

There are two fundamental algorithms that come up constantly when we work with graphs, and both have to deal with the problem of searching. Specifically, we want to solve this problem:

Given a starting node and an ending node, determine whether a path exists from start to end, and return the path if it exists.

In both algorithms, we begin at the starting node and proceed to search its neighbors, marking a vertex as “explored” when we have exhausted its possibilities. The difference between the two algorithms deals with how they handle branching:

  - Breadth-first search (BFS) explores all of a vertex’s neighbors before moving on to vertices further from the start.

  - Depth-first search (DFS) follows a single path as far as it will go, and only backs up to try a different branch when it gets stuck.

Both BFS and DFS rely on us being able to mark vertices as “explored” or “unexplored”. Actually, they both use three colors:

  - “White”: completely unexplored.

  - “Gray”: discovered, but still being processed.

  - “Black”: fully explored; we are finished with it.

The idea behind BFS is that when we explore a vertex, we add all its neighbors to the end of a queue. We then dequeue the next vertex and explore that. Thus, the neighbors of a vertex at distance \(d\) from the source will not be explored until all other nodes at distance \(d\) have also been explored. This gives BFS the appearance of exploring a continually growing “fringe” around the starting node. At the edge of the fringe are the nodes which are in the queue. Inside the fringe are nodes which were in the queue at one time, but which have since been explored. Beyond the fringe are the nodes which we haven’t explored yet.

We need to keep track of whether we’ve explored a node or not, so we keep an extra vector<bool> to store whether a particular vertex has been explored. (If the graph is acyclic, we don’t need this; it’s only useful if cycles exist.)

A general sketch of a BFS is something like this:

  1. Mark all vertices unexplored

  2. Mark the starting vertex explored, and enqueue it.

  3. While the queue is not empty, dequeue a node \(n\)

  4. If \(n\) is the ending node, we are done, return True.

  5. Otherwise, mark \(n\) explored, and enqueue all its neighbors that are not already explored. Go to step 3.

  6. If the queue is empty and we have not found the ending node, then no path from start to end exists, return False.

(Most descriptions of BFS use colors to distinguish explored/unexplored. A vertex is “white” if it is unexplored, “gray” while we are processing it, and “black” after all its neighbors have been enqueued and we are finished with it. But the algorithm works the same if we only distinguish between “white” and “non-white”.)

This takes time proportional to the number of vertices \(|V|\), plus the time spent enumerating neighbors (see the complexity analysis below), because once we’ve explored a vertex, we never revisit it.

Note that if we run the algorithm until it has explored everything, and there are still unexplored nodes, then the graph is not connected. It’s possible to use this to find the connected components of the graph: mark all explored nodes as being in one component, and then repeat the search from an unexplored node. Continue until there are no unexplored nodes.
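
A sketch of this idea, assuming an undirected graph (every edge stored in both directions) and reusing the adj_list::bfs traversal defined at the end of these notes (the function name label_components is our own):

#include <vector>
#include <functional>
using namespace std;

// Label every node with a component number, via repeated BFS.
// Returns the number of components found.
int label_components(adj_list& g, vector<int>& component) {
    int n = g.edges.size();
    component.assign(n, -1); // -1 means "not yet in any component"
    int count = 0;

    for (int i = 0; i < n; ++i)
        if (component.at(i) == -1) {
            // Everything reachable from i belongs to component `count`.
            g.bfs(i, [&](int node) { component.at(node) = count; });
            ++count;
        }

    return count;
}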

If we run the search only until we find the target node, then we call it a search algorithm. If, on the other hand, we run it until it visits all (connected) nodes then we call it a graph traversal algorithm, because it traverses the graph. We could, of course, “traverse” all the nodes in the graph by just doing a loop over all the nodes, but this ignores the connections between them. When we traverse a graph, we would like to do it in a way that gives us some information about connectivity.

Example…

This gives us a true/false value indicating whether or not a path exists, but it doesn’t tell us anything about the path itself. Fortunately, it’s easy to modify the algorithm to create a BFS tree which can actually tell us the path from the starting vertex to any other vertex, if we let the algorithm run until all nodes have been processed (i.e., skip step (4)). To do this, we create a vector of parent pointers, one per node. We refer to the parent of a node \(n\) as \(n.\pi\). We modify the algorithm as follows:

(step 5) Mark \(n\) explored, and enqueue all unexplored neighbors. When we enqueue a neighbor \(m\), set \(m.\pi = n\). (That is, the parent of a vertex is the vertex that we came from when we explored it.)

When we are done, we can look at the ending node (or any node) and follow its parent-pointers back to the starting node.
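
For instance, a sketch of the reconstruction step, assuming the parent pointers live in a vector<int> named parent, with unreached nodes (and the starting node itself) marked by -1:

#include <vector>
#include <algorithm>
using namespace std;

// Walk the parent pointers backwards from end, then reverse the result.
vector<int> rebuild_path(const vector<int>& parent, int start, int end) {
    vector<int> path;
    for (int n = end; n != -1; n = parent.at(n))
        path.push_back(n);

    reverse(path.begin(), path.end()); // we built it end-to-start

    if (path.empty() || path.front() != start)
        return {}; // end was never reached; no path exists

    return path;
}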

Another easy change is to record for each node its distance from the starting node. We write the distance of a node \(n\) as \(n.d\) and make the following changes:

(step 2) Mark the starting vertex explored, set its distance to 0, and enqueue it.

(step 5) Mark \(n\) explored and enqueue all its unexplored neighbors. When we enqueue a neighbor \(m\), set \(m.d = n.d + 1\).

The paths found by breadth-first search are in fact the shortest paths from the starting node to any other node. We can see that this is the case by virtue of how the BFS explores nodes at increasing distances: if the shortest path from start to finish is of length \(d\), it will be found when the fringe is \(d\) steps from the starting node. This means that the distances computed are also the minimum distances from the starting node.

Example…

Complexity analysis

The time taken by the algorithm depends on whether we use the adjacency list or adjacency matrix representation:

  - Adjacency list: each vertex is dequeued at most once, and each edge is examined at most once (twice for an undirected graph), for a total of \(O(|V| + |E|)\).

  - Adjacency matrix: finding the neighbors of a vertex requires scanning its entire row, for a total of \(O(|V|^2)\).

For this algorithm, the adjacency list representation is better.

DFS can be thought of as what you’d get if you replaced the queue from BFS with a stack. Because it uses a stack, we can implement it recursively (letting the call stack stand in for the explicit stack) without too much trouble.

Example…

Often it is useful to color nodes by their exploration: “white” is totally unexplored, “gray” is in the process of being explored (i.e., set to gray at the start of visit) and “black” is done being explored (set to black at the end of visit).

As with breadth-first search, an easy change to DFS is to add parent pointers, constructing the DFS-tree, so that we can recover the path from start to finish. Note, however, that unlike BFS, this path is most likely not the shortest path in any sense. It is merely the first path discovered by the search.

Another common change is to record two timestamps for each node:

  - \(u.d\), the “discovery” time, recorded when we first encounter the node (and color it gray).

  - \(u.f\), the “finishing” time, recorded when we are done with the node (and color it black).

For any given node \(u\), the values of \(u.d\) and \(u.f\) have some interesting properties: we always have \(u.d < u.f\), and for any two nodes \(u\) and \(v\), the intervals \([u.d, u.f]\) and \([v.d, v.f]\) are either completely disjoint or nested one inside the other; they never partially overlap.

Example…

By examining the colors of neighboring nodes when we visit them, we can extract some interesting information:

  - If a neighbor is white, the edge to it is a tree edge: it becomes part of the DFS-tree.

  - If a neighbor is gray, the edge is a back edge: it leads to one of our ancestors, which means the graph contains a cycle.

  - If a neighbor is black, the edge is a forward or cross edge: it leads to a part of the graph we have already finished exploring.

Directed acyclic graphs and topological sort

A directed acyclic graph is a directed graph which contains no cycles. DAGs occur fairly often when talking about tasks and dependencies. If we have a set of tasks, where tasks can depend on other tasks, we cannot allow cycles, or the tasks could never be completed. But it is possible for one task to depend on multiple other tasks, or for multiple tasks to all depend on the same task. Given such a graph, we want to output all the tasks in dependency order, the order we would have to complete them in so that every task’s dependencies are completed before it. This is called a topological sort of the DAG, and it’s easy to find with DFS:

When we finish visiting a node, add it to the front of a linked list. When the DFS is finished, the linked list contains the nodes in topological order (i.e., in reverse finishing-time order). A code sketch appears at the end of these notes.

Graph Implementation

There are various ways to implement graphs in C++. The “best” implementations make writing the various graph algorithms straightforward, but that requires a lot of work on the implementor’s part (writing custom iterators and working with templates and such). We’ll write a simple directed graph with weighted edges, where we can ignore the edge weights if we want to.

#include <vector>
#include <list>
#include <queue>
#include <functional>
using namespace std;

class adj_list {
  public:
    struct edge_type {
        float weight;    // extra information attached to this edge
        int destination; // index of the node this edge points to
    };

    // Construct a graph with node_count vertices and no edges.
    adj_list(int node_count) : edges(node_count) {}

    // Declared here; defined below.
    void create_edge(int src, int dest, float w = 1);
    bool has_edge(int src, int dest);
    void bfs(int start, function<void(int)> visit);
    void dfs(int start, function<void(int)> visit);

    vector<list<edge_type>> edges;

  private:
    void dfs(int start, function<void(int)> visit, vector<bool>& explored);
};

Nodes are identified by the indices in the edges vector, so we can add methods to create new edges and check whether an edge exists:

void adj_list::create_edge(int src, int dest, float w) {
    // (The default argument w = 1 lives in the declaration above.)
    edges.at(src).push_back(edge_type{w, dest});
}

bool adj_list::has_edge(int src, int dest) {
    for(edge_type& e : edges.at(src))
        if(e.destination == dest)
            return true;

    return false;
}
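
Usage might look something like this (an arbitrary, made-up graph):

adj_list g{5};

g.create_edge(0, 1);
g.create_edge(1, 2);
g.create_edge(3, 4, 2.5); // explicitly-weighted edge

g.has_edge(0, 1); // true
g.has_edge(1, 0); // false: edges are directed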

Implementing DFS and BFS over this structure is not too hard:

void adj_list::bfs(int start, function<void(int)> visit) {
    queue<int> q;
    vector<bool> explored(edges.size(), false);

    explored.at(start) = true;
    q.push(start);

    while(!q.empty()) {
        int n = q.front();
        q.pop();

        visit(n);

        // Mark neighbors as explored when we enqueue them, so that no
        // node can end up in the queue twice.
        for(auto e : edges.at(n))
            if(!explored.at(e.destination)) {
                explored.at(e.destination) = true;
                q.push(e.destination);
            }
    }
}

This version just does a traversal over the entire graph, calling the function-object visit on each node. It does not retain the parent-tree, or keep distances, though adding those is relatively easy.
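
A sketch of one way to add them, say as a separate method bfs_tree (a name of our own invention; it would also need a declaration in the class):

// BFS from start, filling in parent pointers and distances.
// Unreached nodes (and start itself) end up with parent -1.
void adj_list::bfs_tree(int start, vector<int>& parent, vector<int>& dist) {
    queue<int> q;
    vector<bool> explored(edges.size(), false);

    parent.assign(edges.size(), -1);
    dist.assign(edges.size(), -1);

    explored.at(start) = true;
    dist.at(start) = 0;
    q.push(start);

    while(!q.empty()) {
        int n = q.front();
        q.pop();

        for(auto e : edges.at(n))
            if(!explored.at(e.destination)) {
                explored.at(e.destination) = true;
                parent.at(e.destination) = n;            // the node we came from
                dist.at(e.destination) = dist.at(n) + 1; // one step further out
                q.push(e.destination);
            }
    }
}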

void adj_list::dfs(int start, function<void(int)> visit) {
    vector<bool> explored(edges.size(), false);

    dfs(start, visit, explored);
}

void adj_list::dfs(int start,
                   function<void(int)> visit,
                   vector<bool>& explored) {

    visit(start);
    explored.at(start) = true;

    // Recurse into each unexplored neighbor; the call stack plays the
    // role of the explicit stack.
    for(auto e : edges.at(start))
        if(!explored.at(e.destination))
            dfs(e.destination, visit, explored);
}

Once again, we just do a simple traversal, and don’t maintain the parent-tree or the timestamps.
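
As one final sketch, the topological sort described earlier falls out of DFS almost for free. Here we write it as a free function over adj_list (the name topological_sort and the lambda-based helper are our own; we also assume the graph really is a DAG):

#include <list>

// Run DFS from every unexplored node; prepend each node to the result
// when we *finish* it, giving reverse finishing-time order.
list<int> topological_sort(adj_list& g) {
    int n = g.edges.size();
    vector<bool> explored(n, false);
    list<int> order;

    // Recursive visit helper, written as a self-calling lambda.
    function<void(int)> visit = [&](int node) {
        explored.at(node) = true;

        for(auto e : g.edges.at(node))
            if(!explored.at(e.destination))
                visit(e.destination);

        order.push_front(node); // finished with node: put it at the front
    };

    for(int i = 0; i < n; ++i)
        if(!explored.at(i))
            visit(i);

    return order;
}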