Syllabus info

What class is this? CSci 133 (123 is a prereq.)

Who am I? Andrew Clifton, “Andy” or “Mr. Clifton” or “Prof. Clifton” are all fine.

What is this class about? If 123 gave you the tools to get started, 133 is about the tools themselves: how do they work, how were they built, how can we build new ones ourselves? So some of the things that you just used in 123, we’re actually going to build from scratch ourselves, to help understand how they work.

Grading

This course uses specifications grading, a point-less (ha!) grading system where your final grade is based on proving to me that you’ve mastered the material in the course. Fortunately for you, this is fairly simple: this class is divided into 8 “modules”. Each module has an assignment and a section on the midterms. To pass a module, you must pass both its assignment and its section. (Note that later midterms include all the sections from previous ones, so if you fail a section on a midterm, you can just try again on the next one. There is no penalty for failing a section as long as you later pass it.) There are three midterms and the final, but the final is basically just the fourth midterm, so each test effectively adds the two most recent modules.

Depending on how things turn out, I might apply an adjustment at the end of the semester to keep things fair (e.g., drop the requirement for an A to 7/8 if no one gets all 8) but you shouldn’t count on that.

I’ll be posting your grades to Canvas, but it will basically just be 0/1 for each module, with 1 indicating that you’ve passed it at some point.

Assignments are graded pass/fail, but you can resubmit failed assignments as many times as you like. Because each assignment includes automated tests telling you whether or not it passes, there’s really no point in submitting an assignment that you know will fail.

I probably shouldn’t point this out, but note that if you’re only aiming for a C, you could technically stop coming to class after you’ve completed your 5 modules and your grade won’t suffer at all.

The Personal Software Process

The PSP is a software development “exercise” (think pushups) to help you get some sick gainz in your software development skills. Essentially, it requires you to make estimates, before you begin coding, as to how many lines of code, how much time, how many bugs, etc. you think you will produce. Then, while you are programming you keep detailed records for all those categories. When you’re done, you compare your estimates with your real-world performance.

I wouldn’t suggest doing the PSP on the first assignment, as you’re likely to mess something up and have to resubmit. The upside is that the PSP forces you to pay attention to details, so you’re more likely to catch your own mistakes.

Note that the PSP is something you have to do at the same time as the assignment. You can’t do the assignment and then fill it out afterwards; you’ll get bad data because you’ll have forgotten about all the little bugs you fixed, and how much time you spent fixing them.

Additional requirements to get a C or better

As this is now effectively an online course, there are a few additional tasks you must complete in order to get a C or better:

Logging into the server

What you need:

On Mac/Linux:

ssh username@fccsci.fullcoll.edu -p 5150

Replace username with your username. It will prompt you for your password.

On PuTTY, fill in the host and port and then click Connect. It will prompt you for your username and password. You can also “save” your connection details from the connect screen so you don’t have to type them in every time.

Moving around

Editing and Compiling

There are four editors on the server: Micro, Nano, Emacs, and VIM. All have their advantages and disadvantages:

Compiling using GCC:

g++ -c source1.cpp
g++ -c source2.cpp
...
g++ -o program source1.o source2.o ...

or for short

g++ -o program source1.cpp source2.cpp ...

You can list multiple source/object files in the second form.

But I’ve also built a shortcut for you:

compile source1.cpp source2.cpp ...

This will compile-and-link all the listed source files, creating an executable named source1. (I.e., named after the first source file.) Even better, compile handles figuring out the right order in which to list source files! (Remember that if you use G++ you have to list source files in a particular order, based on which files use which other files.)

If you want to compile-and-link-and-run you can do

compile source1.cpp source2.cpp ... && ./source1

The program will only run if the compile was successful.

Using compile also has the advantage that it will colorize the error messages printed by G++.

You don’t have to do your development on the server; if you want to work in Visual Studio or whatever on your own machine, that’s fine. However, the server is where I will collect your submissions, and compile and test your code, so you must make sure that your source code is on the server before the due date, and that it compiles and runs on the server. I don’t accept email submissions, and “it works on my machine” isn’t good enough. (If you write correct standard C++ it will work everywhere; a program that works with one compiler and not with another is a program that has something wrong with it.)

You should also make sure that whatever compiler/IDE you use understands C++11. You might have to change an option to enable C++11 mode, otherwise you’ll get error messages when you try to write code using the ranged-for loops or nullptr.
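One way to check your setup is to compile a tiny file that uses both features. This is only a sketch: the function names sum_all and is_null are made up for the test, and older versions of G++ may need the -std=c++11 flag before they accept it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sums a vector using a C++11 ranged-for loop.
int sum_all(const std::vector<int>& values)
{
  int total = 0;
  for(int v : values)   // ranged-for: requires C++11
    total += v;
  return total;
}

// Hypothetical helper: true if p points at nothing.
bool is_null(const int* p)
{
  return p == nullptr;  // nullptr: requires C++11
}
```

If this compiles without complaints about the for loop or nullptr, your compiler is in C++11 mode (or later).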

To move files to/from the server you can use FileZilla (on Windows/Linux/Mac) or WinSCP (on Windows), or SCP from the command-line (Linux/Mac). You still need all the above information. To use SCP from the command-line you can do

scp -P 5150 path/localfile.cpp username@fccsci.fullcoll.edu:path/remotefile.cpp

to copy the local (on your computer) file path/localfile.cpp into the remote (on the server) file path/remotefile.cpp. Just as on the server, you can use ~ in either the local or remote path to refer to either your local or remote home directory.

To use FileZilla or WinSCP, you need the above information (host, port, username, password); for FileZilla set the protocol to “SFTP”. Once you connect, you should be able to drag-and-drop files between your computer and the server.

I’ve heard that you can install FUSE for Mac OS and MacFusion which will give you the ability to permanently “mount” the server like a folder in Finder, but you’re on your own if you want to do that. (On Linux it’s possible to use sshfs to mount the server as a directory.)

Finally, you can run Emacs on your personal computer and use TRAMP to both save/load files on the server and even to do your compilation on the server.

Review Example 1: The bag data structure

Although most of you just had 123 last semester, I thought it would be good to review C++, just to make sure we’re all on the same page, especially since not all of you had 123 from me.

We’re going to begin our review by building a basic data structure called a “bag”. The idea of a bag is that it’s something that supports the following operations:

An example of using a bag might look something like this:

bag b(10); // Capacity (max. size) = 10

b.insert(1);
b.insert(2);
b.insert(3);

cout << b.size() << endl;  // Prints 3
cout << b.empty() << endl; // Prints 0 (false)
cout << b.full() << endl;  // Prints 0 (false)

// This should print 1,2,3, but not necessarily in that order.
for(int i = 0; i < b.size(); ++i)
  cout << b.at(i) << endl;

b.remove(2);

cout << b.size() << endl; // Prints 2

// This should print 1,3, but not necessarily in that order.
for(int i = 0; i < b.size(); ++i)
  cout << b.at(i) << endl;

Because we can construct multiple bags, and because bags are required to be independent, we will pretty much have to implement bags as a class. The space used by the bag will have to be stored as a data member inside the class, so that each instance of the class gets its own space for its own elements.

class bag {
  public:

  private:

};

We could use a std::vector to store the contents of each bag, but I’m going to go a bit lower level and use a pointer to dynamically-allocated (i.e., heap) memory. We will have to manage this memory using new and delete ourselves. Since bags store ints, each bag will have an int* member, pointing to the space it is using:

class bag {
  public:

  private:
    int* data;
};

(If you remember 123, this means we will need to implement a destructor, copy constructor, and copy-assignment operator in order for bag to function correctly.)

To consider how to create this storage (in the constructor), we have to think about how we will use it:

Those are the only two operations that change the size of a bag (clear can be thought of as just while(!b.empty()) b.remove(0);). There are two basic ways we can implement these on top of a dynamically-allocated array:

The second approach is probably a better fit. (It’s also the method that the first assignment requires you to use.)

How do we insert a new element (assuming the bag is not full)? Because insert is not required to place the element in any particular position, the easiest place to put it is at the end:

void insert(int e) 
{
  if(!full()) {
    data[sz] = e;
    ++sz;
  }
}

This could be shortened to just data[sz++] = e; if you’re feeling clever.

How do we remove an element? If the element is at the end, it’s easy to “remove”: just decrement the size and it’s effectively gone (it’s still there in memory, of course, but it will be overwritten by a later insert). What if the element is not at the end? Again, remove is not required to preserve the order of the other elements, so we can just swap the to-be-removed element with the last element, reducing the case of remove(i) to remove(size()-1):

void remove(int i)
{
  if(i >= 0 && i < size()) {
    std::swap(data[i], data[size() - 1]);
    --sz;
  }
}

Those are the most complex methods. The rest of the class looks like:

#include <stdexcept> // std::out_of_range
#include <utility>   // std::swap

class bag {
  public:
    bag(int c)
    {
      cap = c;
      sz = 0;
      data = new int[cap];
    }

    int size()   { return sz; }
    bool empty() { return size() == 0; }
    bool full()  { return size() == cap; }

    int at(int i) 
    { 
      if(i >= 0 && i < size())
        return data[i];
      else
        throw std::out_of_range("at(i) out of range!");
    }

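    void insert(int e)
    {
      // full is a method, and the counter is sz (size is the accessor)
      if(!full())
        data[sz++] = e;
    }

    void remove(int i)
    {
      // Ignore out-of-range indexes (this also covers the empty bag)
      if(i >= 0 && i < size()) {
        std::swap(data[i], data[size()-1]);
        --sz;
      }
    }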

    void clear()
    {
      sz = 0;
    }

  private:
    int cap, sz;
    int* data;
};

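This implementation will work, however it has two problems:

  1. It never releases the memory it allocates with new, so every bag leaks its array when it is destroyed.

  2. Copying a bag (bag b2 = b1; or b2 = b1;) copies only the pointer, so both bags end up sharing the same array instead of being independent.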

To fix this, we must add the “big three”: destructor, copy constructor, and overloaded assignment operator:

// Destructor
~bag()
{
  delete[] data;
}

// Copy constructor: 
//   bag b2 = b1;
bag(const bag& other)
{
  sz = other.sz;
  cap = other.cap;
  data = new int[cap];

  // Copy bag contents:
  for(int i = 0; i < sz; ++i)
    data[i] = other.data[i];
}

// Overloaded assignment operator: 
//   b2 = b1;
bag& operator= (const bag& rhs)
{
  // "Copy and swap";
  bag copy = rhs; 

  // Swap *this with copy
  std::swap(sz, copy.sz);
  std::swap(cap, copy.cap);
  std::swap(data, copy.data);

  return *this;
}

The easiest way to write the overloaded assignment operator is to combine the steps of the destructor and copy constructor:
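A sketch of what that naive version might look like, assuming the same data, sz, and cap members as above. The class name naive_bag is a made-up stand-in for bag, trimmed down to the parts that matter here:

```cpp
#include <cassert>

// Hypothetical stand-in for bag, reduced to what operator= needs.
class naive_bag {
  public:
    naive_bag(int c)
    {
      cap = c;
      sz = 0;
      data = new int[cap];
    }

    naive_bag(const naive_bag& other)
    {
      cap = other.cap;
      sz = other.sz;
      data = new int[cap];
      for(int i = 0; i < sz; ++i)
        data[i] = other.data[i];
    }

    ~naive_bag() { delete[] data; }

    // Naive assignment: destructor steps, then copy-constructor steps.
    naive_bag& operator= (const naive_bag& rhs)
    {
      if(this == &rhs)      // without this check, b = b reads freed memory
        return *this;

      delete[] data;        // (1) same as the destructor

      // If new throws below, data still points at deleted memory.
      cap = rhs.cap;        // (2) same as the copy constructor
      sz = rhs.sz;
      data = new int[cap];
      for(int i = 0; i < sz; ++i)
        data[i] = rhs.data[i];

      return *this;
    }

    void insert(int e)  { if(sz < cap) data[sz++] = e; }
    int size() const    { return sz; }
    int at(int i) const { return data[i]; } // no bounds check, for brevity

  private:
    int cap, sz;
    int* data;
};
```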

However, this has two problems: one, it involves writing the same code twice, and two, if anything goes wrong, notice that the bag is left in a messed-up state. If something happens between delete and new, we have a bag whose data points to already-deleted memory. The “copy-and-swap” idiom reverses these steps, and reuses the code we’ve already written in the copy constructor and destructor: it first makes a copy, and then swaps the copy with *this. (When the copy goes out of scope, it will be destroyed, but it contains our original data.)

The disadvantage to the copy-and-swap idiom is that it requires twice as much memory, because both the copy and the original must exist together, for a short period of time. This is the only way to get true safety, unfortunately, so that if anything goes wrong, we still have a valid bag.

C++11 coolness

With C++11 there are a couple of additional constructors/overloads we can add, bringing the big-three up to the big-five. Unlike the copy constructor and overloaded assignment operator, these are not required for correct operation: your code will work just fine without them. These make your programs faster in certain circumstances.

To see why these are useful, let’s add a method to bag, remove_duplicates, which returns a new bag that is the same as the existing one, except that it contains only one copy of each distinct element:

bag remove_duplicates()
{
  bag b(cap);

  for(int i = 0; i < size(); ++i) {

    bool found = false;
    for(int j = 0; j < b.size(); ++j)
      if(at(i) == b.at(j)) {
        found = true;
        break;
      }

    if(!found)
      b.insert(at(i));
  }

  return b;
}

Consider what happens if we do something like this:

bag b1(10);
// Insert stuff into b1

bag b2 = b1.remove_duplicates();

In the last line, the following steps are performed:

It seems a little wasteful to create a bag, copy from it, and then immediately destroy it. Why not just steal from the temporary, and then tell the temporary somehow that it doesn’t need to bother delete-ing data in its destructor? This way, the new bag built by remove_duplicates is effectively constructed in-place, inside of b2, instead of being built separately and then copied.

This process is referred to as a move, and it can only occur in particular circumstances: we can only move from a temporary object, an object that will soon be destroyed. This is because moves are implemented as “stealing resources”; if we stole resources from a long-lived object, it would cause all kinds of problems. Stealing from a temporary is fine, though, because it was going to die soon anyway.

The move constructor looks like this:

bag(bag&& other)
{
  // Steal from other
  data = other.data;
  sz = other.sz;
  cap = other.cap;

  // Make sure other doesn't destroy *our* data
  other.data = nullptr;
}

bag&& is a special kind of reference called an rvalue-reference. “Rvalue” is the technical name for “temporary object”. bag&& can only bind to temporary objects, never to “real” bags. (Also note that unlike the normal copy constructor, we do not pass other as const, because we intend to modify other, by stealing from it.) At the end of the move-constructor, we set other.data = nullptr, so that when other is destroyed, it won’t do anything (delete[] nullptr always does nothing).

The move assignment operator is similar, except that we destroy our own data first:

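bag& operator= (bag&& rhs)
{
  // Guard against moving a bag into itself
  if(this != &rhs) {
    delete[] data;

    // Steal from rhs
    data = rhs.data;
    sz = rhs.sz;
    cap = rhs.cap;

    // Make sure rhs doesn't destroy *our* data
    rhs.data = nullptr;
  }

  return *this;
}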
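To see which assignment operator the compiler picks, here is a small sketch with counters. The class tracked_bag, its static counters, and make_bag are all invented for this illustration; the member layout is assumed to match bag. Assigning from a function's return value (a temporary) should select the move version, while assigning from a named bag should select the copy version.

```cpp
#include <cassert>
#include <utility>

// Hypothetical bag-like class that counts which operator= gets called.
class tracked_bag {
  public:
    static int copy_assigns;
    static int move_assigns;

    tracked_bag(int c) : cap(c), sz(0), data(new int[c]) {}
    ~tracked_bag() { delete[] data; }

    tracked_bag(const tracked_bag& other)
      : cap(other.cap), sz(other.sz), data(new int[other.cap])
    {
      for(int i = 0; i < sz; ++i)
        data[i] = other.data[i];
    }

    tracked_bag(tracked_bag&& other)
      : cap(other.cap), sz(other.sz), data(other.data)
    {
      other.data = nullptr;   // other must not delete our array
    }

    tracked_bag& operator= (const tracked_bag& rhs)
    {
      ++copy_assigns;
      tracked_bag copy = rhs; // copy-and-swap
      std::swap(cap, copy.cap);
      std::swap(sz, copy.sz);
      std::swap(data, copy.data);
      return *this;
    }

    tracked_bag& operator= (tracked_bag&& rhs)
    {
      ++move_assigns;
      if(this != &rhs) {
        delete[] data;
        data = rhs.data;
        sz = rhs.sz;
        cap = rhs.cap;
        rhs.data = nullptr;
      }
      return *this;
    }

    void insert(int e) { if(sz < cap) data[sz++] = e; }
    int size() const   { return sz; }

  private:
    int cap, sz;
    int* data;
};

int tracked_bag::copy_assigns = 0;
int tracked_bag::move_assigns = 0;

// Returns a temporary bag holding 1, 2, 3.
tracked_bag make_bag()
{
  tracked_bag b(10);
  b.insert(1);
  b.insert(2);
  b.insert(3);
  return b;
}
```

With this in place, b = make_bag(); bumps only move_assigns, and a = b; bumps only copy_assigns.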

Additional operations

There are a few additional operations we can add to our bag to make it more useful:

find looks like this:

int find(int x)
{
  for(int i = 0; i < size(); ++i)
    if(at(i) == x)
      return i;

  return -1; // Not found
}

while count looks like this:

int count(int x)
{
  int c = 0;
  for(int i = 0; i < size(); ++i)
    if(at(i) == x)
      ++c;

  return c;
}

(You might notice that I use the size() method instead of sz and the at method instead of accessing the array directly. This is to help “future proof” my class. If I later decide I want to change how the class is implemented, I only have to update size, at, insert, and remove; find and count don’t need to change.)

Given these two, we have a choice as to how to write exists:

bool exists(int x) { return find(x) != -1; }
bool exists(int x) { return count(x) > 0;  }

Which is better? This brings us to the question of algorithmic efficiency. Both find and count contain a loop up to size(), so they could both potentially have to scan through the entire array. However, there is a crucial difference: find, because it is only looking for the first copy, and not all copies, can exit the loop early. It returns as soon as it finds its target, which means that, on average, find will be faster than count. Hence, we should write exists in terms of find.
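To make the difference concrete, here is a sketch that counts how many elements each approach examines. It works on a std::vector<int> rather than a bag so that it stands alone, and the names find_steps, count_steps, and the steps parameter are invented for the example.

```cpp
#include <cassert>
#include <vector>

// Like find: stops at the first match.
// `steps` reports how many elements were examined.
int find_steps(const std::vector<int>& v, int x, int& steps)
{
  steps = 0;
  for(int i = 0; i < static_cast<int>(v.size()); ++i) {
    ++steps;
    if(v.at(i) == x)
      return i;       // early exit
  }
  return -1;          // Not found
}

// Like count: always scans the whole vector.
int count_steps(const std::vector<int>& v, int x, int& steps)
{
  steps = 0;
  int c = 0;
  for(int i = 0; i < static_cast<int>(v.size()); ++i) {
    ++steps;
    if(v.at(i) == x)
      ++c;
  }
  return c;
}
```

For the eight-element vector 5, 7, 7, 9, 1, 7, 3, 2 and x = 7, find_steps examines 2 elements while count_steps examines all 8. When x is not present at all, both examine every element, so the early exit only helps when the target exists.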

Const-correctness

C++ allows us to declare variables as const, indicating that we are not allowed to modify them:

const int x = 12;
++x; // ERROR!

A const bag is not particularly useful, because we can’t insert anything into it, but still, for completeness, we will make the bag class const-correct. To do so, we look at each method and ask whether or not it modifies the data members of the bag. If it does not, we label it as const, like this:

bool exists(int x) const
{
  return find(x) != -1;
}

Note that a const method cannot call a non-const method, so this forces find to be const also. In the end, only the methods that change the bag (insert, remove, and clear, along with the assignment operators) need to be non-const.

We can create a const reference to a non-const object, so we can do things like this:

bag b(10); // bag has no default constructor; it needs a capacity
b.insert(1);
b.insert(2);
b.insert(3);

const bag& br = b;
cout << br.at(0) << endl; // Fine
br.insert(12);            // ERROR
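Putting this together, a const-correct excerpt of the class might look like the sketch below. It is trimmed to a handful of methods, the name const_bag and the helper has_value are made up, and copying is disabled with = delete purely to keep the sketch short (the real class would keep the big three).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, trimmed-down bag showing which methods are const.
class const_bag {
  public:
    const_bag(int c)
    {
      cap = c;
      sz = 0;
      data = new int[cap];
    }
    ~const_bag() { delete[] data; }

    // Copying disabled only to keep this sketch short.
    const_bag(const const_bag&) = delete;
    const_bag& operator= (const const_bag&) = delete;

    // These only read the data members, so they are const.
    int size() const  { return sz; }
    bool full() const { return size() == cap; }

    int at(int i) const
    {
      if(i >= 0 && i < size())
        return data[i];
      throw std::out_of_range("at(i) out of range!");
    }

    int find(int x) const
    {
      for(int i = 0; i < size(); ++i)
        if(at(i) == x)
          return i;
      return -1; // Not found
    }

    bool exists(int x) const { return find(x) != -1; }

    // This changes sz and the array, so it is non-const.
    void insert(int e)
    {
      if(!full())
        data[sz++] = e;
    }

  private:
    int cap, sz;
    int* data;
};

// Takes the bag by const reference: only const methods may be called here.
bool has_value(const const_bag& b, int x)
{
  return b.exists(x);
}
```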

Template classes

Currently, our bag only stores ints. We can make it more flexible, so that it can store any kind of value, by making it into a template class:

template<typename T>
class bag {
  ...
};

Within the class definition, wherever we used int for the type of the elements of the bag, we replace it with T.

template<typename T>
class bag {
  public:
    bag(int c) ...

    void insert(T x) ...

  private:
    T* data;
    int sz, cap;
};

Note that we do not replace every int: some ints represent the size or capacity of the bag, or indexes within the bag. Only those that represent values within the bag become Ts.

After making this change, we can build a bag of strings via

bag<string> bs(10); // the constructor still takes a capacity
bs.insert("Hello");

Review Example 2: Text processing

Suppose we want to read a line of input from the user, split it into an array of “words”, and then filter out certain “noise” words.

We need to do several things:

Representing strings of text: some of you may have come from a class where you used char* as strings. Here we’re going to use the built-in string class which comes with “batteries included” and doesn’t require you to worry about allocation or length or anything:

#include <string>
using std::string;
...
string line;
getline(cin, line); // Read a line of text from cin into `line`

Now we need to split the line into words, separated by spaces.

How are we going to store our list of words? We could use an array, but then we’d have to decide on a maximum size, and keep track of how many words we had actually read in. All of our current commands have relatively few words, so this isn’t too bad, but we’d prefer something simpler. We’re going to use the built-in vector class.

#include <vector>
using std::vector;

A vector is like an array: you can access elements at a specific numeric index via

v.at(n)
v[n]

The first is “safe” in that if \(n \lt 0\) or \(n \ge\) the size of the vector it will throw an exception. The second is unsafe (it doesn’t do any “bounds-checking”) but it’s just as fast as a plain array access.

The nice thing about vectors is that we can add new elements onto the end, without having to worry about how big the vector is; the vector will resize itself when necessary:

vector<string> words;
words.push_back("word"); // Add another word

Vectors should normally be preferred to “raw” arrays in almost every situation (the exception is when you are implementing low-level data structures; e.g., when we sit down to write our own vector class, you’ll have to use arrays to build it). Vectors are just as fast as arrays, and safer and more flexible to boot.
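A short sketch of the difference between the two access styles. The helper safe_get is invented for the example; at() and std::out_of_range are part of the standard library.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: returns v.at(n), or `fallback` if n is out of range.
std::string safe_get(const std::vector<std::string>& v, int n,
                     const std::string& fallback)
{
  try {
    return v.at(n);               // bounds-checked access
  }
  catch(const std::out_of_range&) {
    return fallback;              // at() threw: n was not a valid index
  }
}
```

Using v[n] in place of v.at(n) here would skip the check entirely, so an invalid n would be undefined behavior rather than an exception.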

To accomplish the splitting, we have two methods we can use:

  1. We can write a loop to do it ourselves. We’ll have to keep track of whether we are “inside” a word, and if so, collect characters into a string. When we transition from non-word to word, we have to start a new word, and when we go from word to non-word, we have to add the word to the list of words.

    Note that with a string, we don’t need to pick a “maximum size”; we can add a new character to the end by just doing

     s.push_back(c);
    

    and the string will be expanded to hold it.

    To use this method, we need to figure out all the things that can “happen” while reading the next character: it can be a word character (i.e., not a space) or a space character. There’s a third option, however, that is easy to overlook: the next character might not exist, if we’ve reached the end of the string. So we have two “states” and three possible transitions for each state, which leads us to a maximum of six cases we need to consider:

    In word? \ next char | Space       | Word          | End-of-string
    ---------------------+-------------+---------------+-----------------
    WORD                 | Finish word | Continue word | Finish word, end
    SPACE                | Ignore      | Start word    | End

    “Finish word” means to add the word that is being constructed to the vector of words, clear the current word, and set the current state to SPACE. “Start word” means to add the current character to the word (which should be initially empty, because we cleared it in “Finish”), and change the state to WORD. “Continue word” means to add the character to the end of the current word. “End” means we’re done; return the vector of words. “Ignore” means we basically don’t do anything with the current character, just continue to the next one.

    We can also draw a graphical representation of this as a state machine diagram (sometimes called a statechart diagram). We’ll see an example of this later.

    The one last thing we have to think about is what state to start in. If we look at the states, if we start in WORD and the first character is a space, then we will try to “finish” a word that hasn’t been started yet! On the other hand, if we start in SPACE and see a space, it’s ignored; if we see a non-space, we’ll start a new word. Both of those are fine, so our “start state” should be SPACE.

  2. We can let the standard library do the work for us, and use a stringstream. This is basically an input stream, like cin, except that it gets its input from an existing string, rather than from the user. This is useful to us because if we do

     string w;
     cin >> w;
    

    then w will contain the next word (it will skip over any leading spaces, and stop reading when it encounters a space at the end).
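The state machine from the first method could be written roughly as follows. This is a sketch of the table in method 1, not a definitive version: the enum, its names, and the function name split_words_fsm are my own choices for the example, and only the space character is treated as a separator.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class state { SPACE, WORD };

// Hypothetical name; splits input on spaces using the two-state machine.
std::vector<std::string> split_words_fsm(const std::string& input)
{
  std::vector<std::string> words;
  std::string current;
  state st = state::SPACE;          // start state

  for(char c : input) {
    if(st == state::WORD) {
      if(c == ' ') {                // WORD + space: finish word
        words.push_back(current);
        current.clear();
        st = state::SPACE;
      }
      else                          // WORD + word char: continue word
        current.push_back(c);
    }
    else {
      if(c != ' ') {                // SPACE + word char: start word
        current.push_back(c);
        st = state::WORD;
      }
      // SPACE + space: ignore
    }
  }

  // End-of-string: finish the word in progress, if there is one
  if(st == state::WORD)
    words.push_back(current);

  return words;
}
```

Note how the end-of-string column of the table shows up as the check after the loop: it is the one case that a loop over the characters cannot see.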

Using a vector to hold the words, and a stringstream to get them we have something like this:

#include <sstream> // std::stringstream

// Splits input into words separated by whitespace
vector<string> split_words(string input) {
    vector<string> words; 
    string current_word;  

    std::stringstream in(input);

    while(in >> current_word) 
        words.push_back(current_word);

    return words;
}

(Note that a stringstream can also be used for output, like cout, so that anything you write to it gets “printed” into the string. That’s not useful for what we’re doing here, but you might find it useful in other problems.)
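As a small sketch of the output direction, the function below builds up a string by writing to a stream with <<, just as you would with cout. The function name join_words and the output format are made up for the example; str() is the standard method for getting the accumulated text back out.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: formats words as "1:first, 2:second, ...".
std::string join_words(const std::vector<std::string>& words)
{
  std::stringstream out;

  for(int i = 0; i < static_cast<int>(words.size()); ++i) {
    if(i > 0)
      out << ", ";
    out << (i + 1) << ":" << words[i]; // numbers and strings mix freely
  }

  return out.str();                    // everything "printed" so far
}
```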

One thing you should notice is that I have no problems with you using the standard library. It’s there to make your life easier, so why not use it? (The exception is, of course, if I ask you to rewrite something provided by the library; then I expect you to actually write it yourself.) When possible, you should prefer high-level methods (which is what the library provides) to lower-level ones (like doing everything yourself).

Coding standards

Some things to note about the coding standard that I use: