I recently gave a talk to the Joint Lab on using git to write code collaboratively with other scientists. I’m going to recycle that talk in blog form, in case anyone else would like a quick crash-course. This post assumes that you’ve used git a few times before, and are trying to get a better handle on it.
My examples use the command line and diagrams to explain the concepts behind git operations. You may find graphical git software, like GitHub Desktop, or the git integration in your text editor, more approachable. That’s perfectly fine; I just find the command line makes git operations more explicit, and so is clearer for teaching.
Git is a version-control system: it tracks changes to files in a folder, typically source code, allowing developers to undo their changes, examine the history of changes, and merge changes that they and others have made.
A group of changes to one or more files is called a commit. Each commit includes a reference to the previous commit, creating a timeline of every change since the start of the project.
The current state of the repository can be described as the sum of the most recent commit and every ancestor commit that came before it. The current state of the repository is referred to as HEAD.
A git branch is a series of commits. The default branch is typically called main (or master in older repositories). More on branches later.
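To make the commit timeline concrete, here is a minimal sketch that builds a two-commit repository in a throwaway directory and inspects HEAD and its parent. The file name notes.txt and the "Demo" identity are placeholders invented for this example.

```shell
set -eu
# Throwaway sandbox; the identity below is a placeholder so commits work anywhere.
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main   # name the first branch main on any git version

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
echo "second draft" >> notes.txt
git commit -q -am "Revise notes"

git --no-pager log --oneline            # the timeline, newest commit first
head_commit=$(git rev-parse HEAD)       # the commit you are currently on
parent_commit=$(git rev-parse HEAD~1)   # the commit it points back to
current_branch=$(git symbolic-ref --short HEAD)
echo "HEAD is $head_commit on $current_branch; its parent is $parent_commit"
```

HEAD~1 means "one commit before HEAD", which is the same notation the reset commands later in this post rely on.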
Git can technically be used offline: you can create a local repository on your computer, track changes to files in a project, and never share the commits with anyone. However, we most frequently use git as a collaboration tool, typically by creating a repository on a central site like GitHub or GitLab and giving multiple developers access.
When you clone a GitHub repository, your computer assigns a name to the remote server, origin by default. Your computer then downloads the repository contents and creates two branches: an origin/main branch, representing the commits GitHub is aware of, and a main branch, representing your own work.
When you pull new changes down from GitHub, git first downloads the commits to the origin/main branch, then fast-forwards your own main branch to match.
Similarly, when you push local changes to GitHub, you’re sending changes from your own main branch to GitHub, so the origin/main branch catches up to your own work.
If you are the only developer on a repository, and you only develop from one computer, then you can push and pull to your heart’s content. However, if someone else has pushed changes to GitHub since you last pulled, then you have a conflict: GitHub has commits that you don’t, and you have commits that GitHub doesn’t.
In order to sync your local commits with GitHub, you need to resolve the conflict with one of two strategies.
If the commits on GitHub (C in this example) edit different files than the local commits (D and E), then the “conflict” is purely bureaucratic. The file changes aren’t incompatible; you just need to re-order the commits to create a coherent timeline. The rebase command will rewrite your local commits so that they come after C.
The full commands to accomplish this are:
    git fetch origin main
    git rebase origin/main
The first command downloads changes from GitHub (origin) to the origin/main branch, but does not attempt to combine them with the local main branch. The second rebases the local commits to occur after the commits on origin/main.
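The rebase workflow can be tried safely with a local bare repository standing in for GitHub. This is a sketch under stated assumptions: hub.git, mine, colleague, and the commit_file helper are names invented here, and each commit touches its own file so nothing clashes.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
commit_file() {   # hypothetical helper: create <name>.txt and commit it as <name>
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}
git init -q seed
(cd seed && git symbolic-ref HEAD refs/heads/main && commit_file A && commit_file B)
git clone -q --bare seed hub.git      # local stand-in for GitHub
git clone -q hub.git mine
git clone -q hub.git colleague

(cd colleague && commit_file C && git push -q origin main)   # the remote now has C
cd mine
commit_file D
commit_file E                         # local history is A B D E

git fetch -q origin main              # origin/main becomes A B C
git rebase -q origin/main             # replay D and E after C
history=$(git --no-pager log --format=%s | tr -d '\n')
echo "$history"                       # commit subjects, newest first
git push -q origin main               # the push is now a plain fast-forward
```

After the rebase there is no merge commit: the local commits have simply been re-created on top of the fetched one.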
If C does change the same files as commits D and E, then the two histories may be incompatible. In order to combine them, we need to add a new commit, F, that incorporates the changes from all three commits, possibly changing files further to make the changes mesh.
Similarly, the commands for this are:
    git fetch origin main
    git merge origin/main
If you’ve never run these commands before, that’s because git pull is, by default, shorthand for the above two commands!
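Here is a sketch of the merge case, where both sides edit the same line and git needs help. As before, a local bare repository stands in for GitHub; the names hub.git, mine, colleague, and shared.txt, and the hand-written resolution, are assumptions made for illustration.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q seed
(cd seed && git symbolic-ref HEAD refs/heads/main &&
  echo "original" > shared.txt && git add shared.txt && git commit -q -m "A")
git clone -q --bare seed hub.git      # local stand-in for GitHub
git clone -q hub.git mine
git clone -q hub.git colleague

(cd colleague && echo "colleague version" > shared.txt &&
  git commit -q -am "C" && git push -q origin main)
cd mine
echo "my version" > shared.txt
git commit -q -am "D"
echo "my revised version" > shared.txt
git commit -q -am "E"

git fetch -q origin main
if ! git merge origin/main -m "F" >/dev/null 2>&1; then
  # Both sides edited the same line, so git stops and asks for a decision.
  printf 'colleague version\nmy revised version\n' > shared.txt   # hand-made resolution
  git add shared.txt
  git commit -q -m "F"                # concludes the merge
fi
first_parent=$(git rev-parse HEAD^1)  # E, the local side
second_parent=$(git rev-parse HEAD^2) # C, the side fetched from the remote
git --no-pager log --oneline --graph
```

The merge commit F has two parents, which is how git records that two histories were joined rather than re-ordered.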
So far we’ve only considered repositories with a single branch: main. However, using multiple branches is key to successful collaboration. You can create side branches to work on a new feature, or otherwise make large breaking changes to a codebase. Branches let you work in a corner, contain your mess, and only merge those changes back to the main branch when you’re done and ready to share.
Creating a local branch is as easy as running:
git checkout -b new_feature
This creates a new branch called new_feature, and switches HEAD to track the new branch. When you add new commits, they’ll now be added to the new_feature branch instead of main.
To switch back to the main branch, run git checkout main.
This will un-apply the changes from the commits on new_feature, reverting the codebase to the state of the main branch. If you’ve made uncommitted changes that the switch would overwrite, git will refuse to change branches until you commit or stash them. In other words, switching branches will never destroy your work.
When you’re ready to combine your changes from your side branch into the main branch, simply check out the main branch, and run git merge new_feature or git rebase new_feature as appropriate.
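The whole side-branch cycle fits in a few lines. This sketch assumes a throwaway repository with placeholder files README.txt and feature.txt; it creates the branch, commits on it, switches back to main to show the work is tucked away, then merges.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main   # name the first branch main on any git version
echo "stable code" > README.txt
git add README.txt
git commit -q -m "Initial commit"

git checkout -q -b new_feature          # create the side branch and switch to it
echo "experimental code" > feature.txt
git add feature.txt
git commit -q -m "Start new feature"

git checkout -q main                    # back on main, feature.txt leaves the working tree
if [ -e feature.txt ]; then seen_on_main=yes; else seen_on_main=no; fi
echo "feature.txt visible on main before merging: $seen_on_main"

git merge -q new_feature                # main has no new commits, so this fast-forwards
```

Because main did not move while the feature was in progress, the merge here needs no extra commit; main simply catches up to new_feature.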
By default, newly created branches only exist in your local repository, and are not pushed up to GitHub. This is so that you can do as much internal bookkeeping and branching as you want, without cluttering what other developers see. Make side branches to your heart’s content!
However, if you want to share your branch with other developers, to collaborate on a side branch without merging everything to main, then you need to create a “remote” branch.
If you already have a local branch checked out, you can create a corresponding remote branch, push your commits to it, and set it as the branch’s upstream in one step with:
    git push --set-upstream origin new_feature
If a collaborator has already pushed new_feature to GitHub and you want your own local copy that tracks it, fetch it and create the branch like:
    git fetch origin
    git checkout -b new_feature origin/new_feature
In either case, your repository now has a local new_feature branch paired with a remote origin/new_feature branch.
From now on, you can git push and git pull in the new_feature branch to sync it with GitHub’s new_feature branch.
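One way to publish a local branch and confirm what it tracks is sketched below, with a local bare repository (hub.git, a name invented here) standing in for GitHub. It uses git push --set-upstream, which creates the remote branch and records it as the upstream in a single step.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q seed
(cd seed && git symbolic-ref HEAD refs/heads/main &&
  echo "stable code" > README.txt && git add README.txt && git commit -q -m "A")
git clone -q --bare seed hub.git        # local stand-in for GitHub
git clone -q hub.git mine
cd mine

git checkout -q -b new_feature
echo "experimental code" > feature.txt
git add feature.txt
git commit -q -m "Start new feature"

# Create new_feature on the remote, push to it, and remember it as the upstream.
git push -q --set-upstream origin new_feature
upstream=$(git rev-parse --abbrev-ref 'new_feature@{upstream}')
echo "new_feature tracks $upstream"
git --no-pager branch -vv               # lists each branch with its upstream
```

Once the upstream is recorded, a bare git push or git pull on new_feature knows which remote branch to talk to.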
If you’ve made some changes on the main branch that you wanted to put on new_feature, and you haven’t committed yet, then you can move those changes over by stashing them in a corner:
    git stash
    git checkout new_feature
    git stash pop
Stashing sets your uncommitted changes aside and leaves a clean working tree, so git will let you switch branches without any risk of overwriting them. Popping the stash then re-applies the changes on new_feature.
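The stash round-trip can be checked in a sandbox. In this sketch, notes.txt is a placeholder file, and the edit is made on main by mistake before being carried over to new_feature.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main   # name the first branch main on any git version
echo "shared notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
git branch new_feature                  # side branch exists, but we are still on main

echo "idea meant for the feature branch" >> notes.txt   # uncommitted edit on main

git stash >/dev/null                    # set the edit aside; working tree is clean again
git checkout -q new_feature
git stash pop >/dev/null                # re-apply the edit here

current_branch=$(git symbolic-ref --short HEAD)
pending=$(git status --porcelain)       # " M notes.txt" means modified, not yet staged
echo "on $current_branch with pending changes:$pending"
```

The edit ends up as an ordinary uncommitted change on new_feature, ready to be committed there.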
If you’ve already committed your changes to main instead of new_feature, and you haven’t pushed yet, then you can fix the mishap with:
    git checkout new_feature
    git merge main
    git checkout main
    git reset --hard HEAD~3  # Replace '3' with the number of commits to move
This moves the changes to the new_feature branch (by merging them in from main), then undoes the last three commits on the main branch, rewriting its history. Since the commits are still referenced by the new_feature branch, no work is lost, but they no longer appear in the main branch’s history.
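A sandboxed run of this fix, with three commits (X, Y, Z) made on main by mistake, might look like the following. The commit_file helper and the commit names are invented for the example.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
commit_file() {   # hypothetical helper: create <name>.txt and commit it as <name>
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main   # name the first branch main on any git version
commit_file A
git branch new_feature                  # the branch the work should have gone to
commit_file X                           # ...but these three land on main instead
commit_file Y
commit_file Z

git checkout -q new_feature
git merge -q main                       # new_feature fast-forwards to include X, Y, Z
git checkout -q main
git reset -q --hard HEAD~3              # main steps back three commits, to A

main_history=$(git --no-pager log --format=%s main | tr -d '\n')
feature_history=$(git --no-pager log --format=%s new_feature | tr -d '\n')
echo "main: $main_history, new_feature: $feature_history"
```

The order matters: merging into new_feature first is what keeps the commits reachable after main is reset.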
Usually when you make a mistake, like committing a typo, the solution is to add a new commit that corrects the mistake. There is rarely any need to rewrite history and un-commit something.
One glaring exception is when you commit security-sensitive information. For example, if you accidentally commit your Amazon EC2 API keys to a public GitHub repository, then adding a new commit to erase them is insufficient. Anyone could look in the commit history of the project and fetch the keys back out.
In this rare circumstance, you can roll back the most recent commits on the local main branch with:
git reset --hard HEAD~3 # Replace '3' with the number of commits to undo
And then push those changes up to GitHub, being very explicit about forcing GitHub to update its main branch (your origin/main) to match your local main, even though you aren’t adding new commits:
git push origin main --force
Note that if anyone else has pulled the mistaken commit down from GitHub, they’ll still be able to view the withdrawn credentials, and they’ll now get an error when they try to pull new changes down. This is because their own git history is now incompatible with the one on GitHub. To fix this, they need to reset their main branch to overwrite their local history with GitHub’s:
    git fetch origin main
    git reset --hard origin/main
Needless to say, this is extremely messy, and should be avoided whenever possible.
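For completeness, here is a sketch of the full withdraw-and-recover sequence, run against a local bare repository (hub.git, an invented name) rather than GitHub. The file secrets.txt and its contents are placeholders, not real credentials, and only one commit is rolled back.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q seed
(cd seed && git symbolic-ref HEAD refs/heads/main &&
  echo "code" > app.txt && git add app.txt && git commit -q -m "A")
git clone -q --bare seed hub.git        # local stand-in for GitHub
git clone -q hub.git mine
git clone -q hub.git colleague

cd mine
echo "API_KEY=not-a-real-key" > secrets.txt   # placeholder, not a real credential
git add secrets.txt
git commit -q -m "Oops: commit credentials"
git push -q origin main
(cd ../colleague && git pull -q origin main)  # the colleague now has the bad commit too

git reset -q --hard HEAD~1              # drop the bad commit locally
git push -q --force origin main         # overwrite the remote branch to match

cd ../colleague
git fetch -q origin main                # origin/main is force-updated
git reset -q --hard origin/main         # discard the local copy of the bad commit
colleague_history=$(git --no-pager log --format=%s | tr -d '\n')
echo "colleague history after reset: $colleague_history"
```

This only cleans up the branch history; as noted above, anyone who already fetched the commit has seen the credentials.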
Git has a special configuration file, .gitignore, that tells it to ignore specific files. For example, you might include a .gitignore file at the top level of the repository containing:
    *.png
    database_credentials.json
    logs/
This will tell git to ignore all PNG files, the database credentials file, and the entire logs folder. You will not be prompted to add or commit those files, so there is little risk of accidentally pushing them to a public repository. Note that .gitignore only applies to files git isn’t already tracking; a file that was committed earlier stays tracked until you remove it with git rm --cached.
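You can check which rule catches a file with git check-ignore. The sketch below recreates the example .gitignore in a throwaway repository; plot.png, logs/run.log, and analysis.py are file names made up for the demonstration.

```shell
set -eu
cd "$(mktemp -d)"
git init -q demo
cd demo
printf '%s\n' '*.png' 'database_credentials.json' 'logs/' > .gitignore
mkdir logs
touch plot.png database_credentials.json logs/run.log analysis.py

# -v prints the .gitignore line responsible for each ignored path.
git check-ignore -v plot.png database_credentials.json logs/run.log

# Untracked files git would still offer to add, sorted bytewise for stable output.
visible=$(git ls-files --others --exclude-standard | LC_ALL=C sort)
echo "$visible"
```

Only .gitignore itself and analysis.py remain visible to git add; the .gitignore file is normally committed so collaborators share the same rules.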
Git does not record renames explicitly. When you rename a file without telling git, it compares the contents of the “new” file and the “missing” file, and if the files are identical or very similar, it assumes the file has simply been moved. However, if you rename a file and then make substantial changes to it before committing, chances are git will no longer recognize it as the same file, and will prompt you to commit deleting the old file and creating a new one. This is inconvenient, since it breaks the link to the file’s earlier history, making it difficult to track changes. Fortunately, the solution is simple - move your files using git, and commit the rename before making further edits:
    git mv oldfile newfile
This stages the rename in a single step, so git sees the old and new files with identical contents and can follow the file’s history across the rename.
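As a quick check, this sketch renames a file with git mv, commits the rename on its own, and then asks git log to follow the file back through the rename. The file contents and commit messages are placeholders.

```shell
set -eu
cd "$(mktemp -d)"
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"
git init -q demo
cd demo
git symbolic-ref HEAD refs/heads/main   # name the first branch main on any git version
printf 'line one\nline two\nline three\n' > oldfile
git add oldfile
git commit -q -m "Add oldfile"

git mv oldfile newfile                  # rename on disk and stage it in one step
git status --short                      # the staged change is reported as a rename
git commit -q -m "Rename oldfile to newfile"

# --follow tells git log to keep tracing the file across the rename.
followed=$(git --no-pager log --follow --format=%s -- newfile | tr -d '\n')
echo "$followed"
```

Without --follow, git log on newfile stops at the rename commit, so the flag is worth remembering when digging through a renamed file's past.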
Git is a powerful tool for collaborating with others on code. It has an extraordinary breadth of functionality, because it was created by the Linux kernel developers to help them manage what is likely the single most complicated open source project in human history. However, this post covers most of the basics, and as much as I’ve needed for almost any project. I picked up git on my own, as I think many scientists do, and for several years felt like I half-understood what I was doing and stumbled my way through getting git to do what I wanted. Hopefully this can help someone in a similar place!