Git and Code Versioning for Scientists

Posted 5/10/2022

I recently gave a talk to the Joint Lab on using git to write code collaboratively with other scientists. I’m going to recycle that talk in blog form, in case anyone else would like a quick crash-course. This post assumes that you’ve used git a few times before, and are trying to get a better handle on it.

My examples use the command line and diagrams to explain the concepts behind git operations. You may find graphical git software, like GitHub Desktop, or the git integration in your text editor, more approachable. That’s perfectly fine; I just find the command line makes git operations more explicit, and so is clearer for teaching.

What is Git?

Git is a version-control system: it tracks changes to files in a folder, typically source code, allowing developers to undo their changes, examine the history of changes, and merge changes that they and others have made.

A group of changes to one or more files is called a commit. Each commit includes a reference to the previous commit, creating a timeline of every change since the start of the project.

The current state of the repository can be described as the sum of the most recent commit, and every ancestor commit that came before it. The current state of the repository is referred to as HEAD.

A git branch is a series of commits. The default branch is typically called main or master. More on branches later.

Remote Servers

Git can technically be used offline: you can create a local repository on your computer, track changes to files in a project, and never share the commits with anyone. However, we most frequently use git as a collaboration tool, typically by creating a repository on a central site like GitHub or GitLab and giving multiple developers access.

When you clone a GitHub repository, your computer assigns a name to the remote server, origin by default. Your computer then downloads the repository contents, and creates two branches: an origin/main branch, representing the commits GitHub is aware of, and a main branch, representing your own work.

Pull

When you pull new changes down from GitHub, git first downloads the commits to the origin/main branch, then fast-forwards your own main branch to match:

Push

Similarly, when you push local changes to GitHub, you’re sending changes from your own main branch to GitHub, so the origin/main branch catches up to your own work:

Conflicts

If you are the only developer on a repository, and you only develop from one computer, then you can push and pull to your heart’s content. However, if someone else has pushed changes to GitHub since you have, then you have a conflict, where GitHub has commits that you don’t, and you have commits that GitHub doesn’t:

In order to sync your local commits with GitHub, you need to resolve the conflict with one of two strategies.

Rebasing

If the commits on GitHub (C in this diagram) edit different files than the local commits (D and E), then the “conflict” is purely bureaucratic. The file changes aren’t incompatible, you just need to re-order the commits to create a coherent timeline. The rebase command will rewrite your local commits so that they come after C:

The full commands to accomplish this are:

git fetch origin main
git rebase origin/main

The first downloads changes from GitHub (origin) to the origin/main branch, but does not attempt to combine them with the local main branch. The second rebases the local commits to occur after the origin/main commits.

Merging

If commit C does change the same files as commits D or E, then the two histories may be incompatible. In order to combine them, we need to add a new commit, F, that incorporates the changes from all three commits, possibly changing files further to make the changes mesh.

Similarly, the commands for this are:

git fetch origin main
git merge origin/main

If you’ve never run these commands before, that’s because git pull is shorthand for the above two commands!

Branches

So far we’ve only considered repositories with a single branch: main. However, using multiple branches is key to successful collaboration. You can create side branches to work on a new feature, or otherwise make large breaking changes to a codebase. Branches let you work in a corner, contain your mess, and only merge those changes back to the main branch when you’re done and ready to share.

Creating Local Branches

Creating a local branch is as easy as running:

git checkout -b new_feature

This creates a new branch called new_feature, and switches HEAD to track the new branch. When you add new commits, they’ll now be added to the new_feature branch instead of main:

Switching Branches

To switch back to the main branch, run git checkout main:

This will un-apply the changes from commits D and E, reverting the codebase to the state of the main branch. If you’ve made changes that you haven’t committed, git will not let you change branches. In other words, switching branches will never destroy your work.

When you’re ready to combine your changes from your side branch to the main branch, simply checkout the main branch, and run git merge new_feature or git rebase new_feature as appropriate.

Creating Remote Branches

By default, newly created branches only exist in your local repository, and are not pushed up to GitHub. This is so that you can do as much internal bookkeeping and branching as you want, without cluttering what other developers see. Make side branches to your heart’s content!

However, if you want to share your branch with other developers, to collaborate on a side branch without merging everything to main, then you need to create a “remote” branch.

If you already have a local branch checked out, you can create a corresponding remote branch (and then push to send those changes to GitHub) with:

git branch --set-upstream-to origin/new_feature
git push

If you are creating a new branch for the first time, and already know you want to share it with others, you can create the branch like:

git checkout -b new_feature origin/new_feature

In either case, your commit tree will now look like:

From now on, you can git push and git pull in the new_feature branch to sync it to GitHub’s origin/new_feature branch.

Common Pitfalls

Editing on the Wrong Branch

If you’ve made some changes to the main branch that you wanted to put on new_feature and you haven’t committed yet, then you can move those changes over by stashing them in a corner:

git stash
git checkout new_feature
git stash pop

Stashing uncommitted changes bypasses git’s rule about not permitting you to switch branches before committing, because your changes will not be overwritten this way.

Committing to the Wrong Branch

If you’ve already committed your changes to main instead of new_feature and you haven’t pushed yet then you can fix the mishap with:

git checkout new_feature
git merge main
git checkout main
git reset --hard HEAD~3

This will move the changes to the new_feature branch (by merging them in from main), then on the main branch, undoes the last three commits, rewriting history.

Since the commits are still referenced by the new_feature branch, no work is lost, but they no longer appear in the main branch’s history.

Reversing a Catastrophic Commit

Usually when you make a mistake, like committing a typo, the solution is to add a new commit that corrects the mistake. There is rarely any need to rewrite history and un-commit something.

One glaring exception is when you commit security-sensitive information. For example, if you accidentally commit your Amazon EC2 API keys to a public GitHub repository, then adding a new commit to erase them is insufficient. Anyone could look in the commit history of the project and fetch the keys back out.

In this rare circumstance, you can roll back back the most recent commits on the local main branch with:

git reset --hard HEAD~3 # Replace '3' with the number of commits to undo

And then push those changes up to GitHub, being very explicit about forcing GitHub to update their origin/main to match your local main, even though you aren’t adding new commits:

git push origin main --force

Note that if anyone else has pulled the mistaken commit down from GitHub, they’ll still be able to view the withdrawn credentials, and they’ll now get an error when they try to pull new changes down. This is because their own git history is now incompatible with the one on GitHub. To fix this, they need to reset their main branch to overwrite their local history with GitHub’s:

git fetch origin main
git reset --hard origin/main

Needless to say, this is extremely messy, and should be avoided whenever possible.

Using .gitignore to avoid mistakes

Git has a special configuration file, .gitignore, that tells it to ignore specific files. For example, you might include a .gitignore file at the top level of the repository containing:

*.png
database_credentials.json
logs/

This will tell git to ignore all PNG files, the database credentials file, and the entire logs folder. You will never be prompted to add or commit those files, so there is no risk of accidentally pushing them to a public repository.

Rename and move files within git!

Git has poor support for moving and renaming files. When you rename a file without telling git, it compares the contents of the “new” file and the “missing” file, and if the files are identical or extremely similar, it assumes the file has simply been moved. However, if you rename a file and then make some changes to it, chances are git will no longer recognize it as the same file, and will prompt you to commit deleting the old file and creating a new file. This is inconvenient, since it destroys the history associated with the file, making it difficult to track changes. Fortunately, the solution is simple - move your files using git:

git mv oldfile newfile

This guarantees that git knows the file is renamed, and maintains history.

Conclusion

Git is a powerful tool for collaborating with others on code. It has an extraordinary breadth of functionality, because it was created by the Linux kernel developers to help them manage what is likely the single most complicated open source project in human history. However, this post covers most of the basics, and as much as I’ve needed for almost any project. I picked up git on my own, as I think many scientists do, and for several years felt like I half-understood what I was doing and stumbled my way through getting git to do what I wanted. Hopefully this can help someone in a similar place!