Posted 5/10/2022
I recently gave a talk to the Joint Lab on using git to write code collaboratively with other scientists. I’m going to recycle that talk in blog form, in case anyone else would like a quick crash-course. This post assumes that you’ve used git a few times before, and are trying to get a better handle on it.
My examples use the command line and diagrams to explain the concepts behind git operations. You may find graphical git software, like GitHub Desktop, or the git integration in your text editor, more approachable. That’s perfectly fine; I just find the command line makes git operations more explicit, and so is clearer for teaching.
Git is a version-control system: it tracks changes to files in a folder, typically source code, allowing developers to undo their changes, examine the history of changes, and merge changes that they and others have made.
A group of changes to one or more files is called a commit. Each commit includes a reference to the previous commit, creating a timeline of every change since the start of the project.
The current state of the repository can be described as the sum of the most recent commit and every ancestor commit that came before it. The commit you currently have checked out is referred to as HEAD.
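If you'd like to see those parent links for yourself, here is a minimal sketch in a throwaway repository. The temporary directory, the file name notes.txt, and the example identity are placeholders I made up for the demo.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                              # -b needs git 2.28+

# Make three commits, each adding one line to a placeholder file
for n in 1 2 3; do
    echo "line $n" >> notes.txt
    git add notes.txt
    git commit -q -m "Add line $n"
done

git log --oneline                                # three commits, newest first
head_commit=$(git rev-parse HEAD)                # the commit HEAD points at
recorded_parent=$(git log -1 --format=%P HEAD)   # parent hash stored inside that commit
previous_commit=$(git rev-parse HEAD~1)          # "one commit before HEAD"
echo "HEAD:            $head_commit"
echo "recorded parent: $recorded_parent"
echo "HEAD~1:          $previous_commit"
```

The last two hashes should match: the parent stored inside the newest commit is exactly the commit that came before it.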
A git branch is a series of commits. The default branch is typically called main or master. More on branches later.
Git can technically be used offline: you can create a local repository on your computer, track changes to files in a project, and never share the commits with anyone. However, we most frequently use git as a collaboration tool, typically by creating a repository on a central site like GitHub or GitLab and giving multiple developers access.
When you clone a GitHub repository, your computer assigns a name to the remote server, origin by default. Your computer then downloads the repository contents, and creates two branches: an origin/main branch, representing the commits GitHub is aware of, and a main branch, representing your own work.
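Here is a small sketch of what a clone sets up. To keep it runnable offline, a bare repository on the local disk stands in for GitHub; the directory names (seed, server.git, work) and the example identity are my own placeholders.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

# Build a one-commit repository, then copy it into a bare repository
# that plays the role of GitHub in this demo
git init -q -b main seed                # -b needs git 2.28+
echo "hello" > seed/README.txt
git -C seed add README.txt
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed server.git

# Clone it the way you would clone from GitHub
git clone -q server.git work
cd work
git remote                              # prints: origin
git branch --all                        # lists main plus the remote-tracking origin/main
git rev-parse main origin/main          # both names point at the same commit
```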
When you pull new changes down from GitHub, git first downloads the commits to the origin/main branch, then fast-forwards your own main branch to match.
Similarly, when you push local changes to GitHub, you’re sending changes from your own main branch to GitHub, so the origin/main branch catches up to your own work.
If you are the only developer on a repository, and you only develop from one computer, then you can push and pull to your heart’s content. However, if someone else has pushed changes to GitHub since you have, then you have a conflict, where GitHub has commits that you don’t, and you have commits that GitHub doesn’t.
In order to sync your local commits with GitHub, you need to resolve the conflict with one of two strategies.
If the commits on GitHub (C in this diagram) edit different files than the local commits (D and E), then the “conflict” is purely bureaucratic. The file changes aren’t incompatible, you just need to re-order the commits to create a coherent timeline. The rebase command will rewrite your local commits so that they come after C.
The full commands to accomplish this are:
git fetch origin main
git rebase origin/main
The first downloads changes from GitHub (origin) to the origin/main branch, but does not attempt to combine them with the local main branch. The second rebases the local commits to occur after the origin/main commits.
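To watch the fetch-then-rebase sequence reorder history, here is a sketch you can run in a scratch directory. A local bare repository stands in for GitHub, two clones play "you" and a collaborator, and commit_file is a made-up helper, not a git command.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

# commit_file is a made-up helper: write one file in a clone and commit it
commit_file() {   # usage: commit_file <clone> <file> <message>
    echo "$3" > "$1/$2"
    git -C "$1" add "$2"
    git -C "$1" commit -q -m "$3"
}

git init -q -b main seed                # -b needs git 2.28+
commit_file seed a.txt "A"
commit_file seed b.txt "B"
git clone -q --bare seed server.git     # plays the role of GitHub
git clone -q server.git mine
git clone -q server.git theirs

commit_file theirs c.txt "C"            # a collaborator pushes C
git -C theirs push -q origin main
commit_file mine d.txt "D"              # meanwhile, you commit D and E locally
commit_file mine e.txt "E"

cd mine
git fetch -q origin main
git rebase -q origin/main
git log --format=%s                     # prints E, D, C, B, A (newest first)
```

After the rebase, D and E sit on top of C in a single straight line, with no merge commit.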
If commit C does change the same files as commits D or E, then the two histories may be incompatible. In order to combine them, we need to add a new commit, F, that incorporates the changes from all three commits, possibly changing files further to make the changes mesh.
Similarly, the commands for this are:
git fetch origin main
git merge origin/main
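If you’ve never run these commands before, that’s because git pull is, by default, shorthand for the above two commands! Recent versions of git will first ask you to choose between merging and rebasing if your branch and origin/main have diverged.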
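Here is a sketch of the merge case, again with a local bare repository standing in for GitHub. To keep it runnable without manual intervention, the two sides edit different ends of the same placeholder file (shared.txt), so git can build the merge commit on its own; a true conflict would stop and ask you to edit the file by hand. I've simplified the diagram to one local commit, D.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q -b main seed                # -b needs git 2.28+
seq 1 10 > seed/shared.txt              # a ten-line placeholder file
git -C seed add shared.txt
git -C seed commit -q -m "B"
git clone -q --bare seed server.git     # plays the role of GitHub
git clone -q server.git mine
git clone -q server.git theirs

# A collaborator edits the first line of shared.txt and pushes (commit C)
{ echo "top edit"; seq 2 10; } > theirs/shared.txt
git -C theirs commit -q -am "C"
git -C theirs push -q origin main

# You edit the last line of the same file locally (commit D)
{ seq 1 9; echo "bottom edit"; } > mine/shared.txt
git -C mine commit -q -am "D"

cd mine
git fetch -q origin main
git merge -q -m "F" origin/main         # creates merge commit F
git log -1 --format='%s has parents: %P'
head -n 1 shared.txt                    # prints: top edit
tail -n 1 shared.txt                    # prints: bottom edit
```

The merge commit F lists two parents, C and D, and the file contains both edits.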
So far we’ve only considered repositories with a single branch: main. However, using multiple branches is key to successful collaboration. You can create side branches to work on a new feature, or otherwise make large breaking changes to a codebase. Branches let you work in a corner, contain your mess, and only merge those changes back to the main branch when you’re done and ready to share.
Creating a local branch is as easy as running:
git checkout -b new_feature
This creates a new branch called new_feature, and switches HEAD to track the new branch. When you add new commits, they’ll now be added to the new_feature branch instead of main.
To switch back to the main branch, run git checkout main.
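This will un-apply the changes from commits D and E, reverting the codebase to the state of the main branch. If you have uncommitted changes that the switch would overwrite, git will refuse to change branches until you commit or stash them; uncommitted changes that don’t clash are simply carried over to the branch you switch to. In other words, switching branches will never destroy your work.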
When you’re ready to combine your changes from your side branch into the main branch, simply check out the main branch, and run git merge new_feature or git rebase new_feature as appropriate.
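The whole side-branch round trip can be tried out in a scratch repository. This is only a sketch: the file names and the example identity are placeholders, and main receives no commits of its own here, so the final merge is a simple fast-forward.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                     # -b needs git 2.28+
echo "base" > base.txt
git add base.txt
git commit -q -m "C"

git checkout -q -b new_feature          # create the branch and switch to it
for name in d e; do                     # commits D and E land on new_feature
    echo "$name" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "Add $name.txt"
done

git checkout -q main
files_on_main=$(ls)                     # only base.txt; d.txt and e.txt are gone
echo "on main before merging: $files_on_main"

git merge -q new_feature                # main had no new commits, so this fast-forwards
ls                                      # base.txt, d.txt, e.txt
```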
By default, newly created branches only exist in your local repository, and are not pushed up to GitHub. This is so that you can do as much internal bookkeeping and branching as you want, without cluttering what other developers see. Make side branches to your heart’s content!
However, if you want to share your branch with other developers, to collaborate on a side branch without merging everything to main, then you need to create a “remote” branch.
If you already have a local branch checked out, you can create a corresponding remote branch on GitHub, push your commits to it, and link the two together with:
git push --set-upstream origin new_feature
If a collaborator has already shared a branch on GitHub, and you want your own local copy that tracks it, you can fetch it and create the branch like:
git fetch origin
git checkout -b new_feature origin/new_feature
In either case, your commit tree will now look like:
From now on, you can git push and git pull in the new_feature branch to sync it to GitHub’s origin/new_feature branch.
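Here is a sketch of sharing a branch from both sides, with a local bare repository standing in for GitHub. The clone names (work, colleague) and feature.txt are placeholders. The first clone publishes the branch with git push --set-upstream; the second picks it up with the checkout -b form.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q -b main seed                # -b needs git 2.28+
echo "hello" > seed/README.txt
git -C seed add README.txt
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed server.git     # plays the role of GitHub
git clone -q server.git work
git clone -q server.git colleague

# Create a local branch, commit to it, then publish it
cd work
git checkout -q -b new_feature
echo "draft" > feature.txt
git add feature.txt
git commit -q -m "Start new feature"
git push -q --set-upstream origin new_feature
git rev-parse --abbrev-ref 'new_feature@{upstream}'    # prints: origin/new_feature

# A collaborator fetches and creates a local branch tracking the shared one
cd ../colleague
git fetch -q origin
git checkout -q -b new_feature origin/new_feature
git log -1 --format=%s                  # prints: Start new feature
```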
If you’ve made some changes to the main branch that you wanted to put on new_feature, and you haven’t committed yet, then you can move those changes over by stashing them in a corner:
git stash
git checkout new_feature
git stash pop
Stashing sets your uncommitted changes aside, so git has nothing to overwrite when you switch branches, and git stash pop then re-applies them on top of new_feature.
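The sketch below sets up a case where a plain checkout is refused, then moves the uncommitted edit across with a stash. The file notes.txt and its contents are placeholders; the two edits touch opposite ends of the file so that the stash re-applies cleanly.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                     # -b needs git 2.28+
seq 1 10 > notes.txt
git add notes.txt
git commit -q -m "Base"

# new_feature changes the first line of notes.txt
git checkout -q -b new_feature
{ echo "ONE"; seq 2 10; } > notes.txt
git commit -q -am "Edit first line on new_feature"

# Back on main, make an uncommitted edit at the end of the same file
git checkout -q main
echo "work in progress" >> notes.txt

# A plain checkout would have to overwrite notes.txt, so git refuses
if ! git checkout -q new_feature 2>/dev/null; then
    echo "plain checkout refused"
fi

git stash push -q                       # set the edit aside
git checkout -q new_feature
git stash pop -q                        # re-apply it on new_feature
git status --short                      # notes.txt shows as modified, not committed
tail -n 1 notes.txt                     # prints: work in progress
```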
If you’ve already committed your changes to main instead of new_feature, and you haven’t pushed yet, then you can fix the mishap with:
git checkout new_feature
git merge main
git checkout main
git reset --hard HEAD~3 # Replace '3' with the number of misplaced commits
This moves the changes to the new_feature branch (by merging them in from main), then, on the main branch, undoes the last three commits, rewriting history.
Since the commits are still referenced by the new_feature branch, no work is lost, but they no longer appear in the main branch’s history.
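You can rehearse this recovery safely in a scratch repository before trying it on real work. In this sketch the three misplaced commits and their file names are placeholders, and nothing has been pushed anywhere.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                     # -b needs git 2.28+
echo "base" > base.txt
git add base.txt
git commit -q -m "Base"
git branch new_feature                  # the branch exists, but...

for n in 1 2 3; do                      # ...three commits land on main by mistake
    echo "step $n" > "step$n.txt"
    git add "step$n.txt"
    git commit -q -m "Step $n"
done

git checkout -q new_feature
git merge -q main                       # new_feature now contains all three commits
git checkout -q main
git reset -q --hard HEAD~3              # main steps back three commits

git log --format=%s main                # prints: Base
git log --format=%s new_feature         # prints: Step 3, Step 2, Step 1, Base
```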
Usually when you make a mistake, like committing a typo, the solution is to add a new commit that corrects the mistake. There is rarely any need to rewrite history and un-commit something.
One glaring exception is when you commit security-sensitive information. For example, if you accidentally commit your Amazon EC2 API keys to a public GitHub repository, then adding a new commit to erase them is insufficient. Anyone could look in the commit history of the project and fetch the keys back out.
In this rare circumstance, you can roll back the most recent commits on the local main branch with:
git reset --hard HEAD~3 # Replace '3' with the number of commits to undo
And then push those changes up to GitHub, being very explicit about forcing GitHub to update its copy of main to match your local main, even though you aren’t adding new commits:
git push origin main --force
Note that if anyone else has pulled the mistaken commit down from GitHub, they’ll still be able to view the withdrawn credentials, and they’ll now get an error when they try to pull new changes down. This is because their own git history is now incompatible with the one on GitHub. To fix this, they need to reset their main branch to overwrite their local history with GitHub’s:
git fetch origin main
git reset --hard origin/main
Needless to say, this is extremely messy, and should be avoided whenever possible.
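For the curious, here is the whole mess rehearsed in a sandbox, with a local bare repository standing in for GitHub and an obviously fake key. The clone names (mine, theirs) and secrets.txt are placeholders.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q -b main seed                # -b needs git 2.28+
echo "hello" > seed/README.txt
git -C seed add README.txt
git -C seed commit -q -m "Initial commit"
git clone -q --bare seed server.git     # plays the role of GitHub
git clone -q server.git mine
git clone -q server.git theirs

# You commit a (fake) secret and push it; a collaborator pulls it
echo "api_key=not-a-real-key" > mine/secrets.txt
git -C mine add secrets.txt
git -C mine commit -q -m "Oops"
git -C mine push -q origin main
git -C theirs pull -q --ff-only origin main
leaked=$(git -C theirs rev-parse HEAD)

# You rewrite history and force-push
git -C mine reset -q --hard HEAD~1
git -C mine push -q origin main --force

# The collaborator resets to match the rewritten history
git -C theirs fetch -q origin main
git -C theirs reset -q --hard origin/main
ls theirs                               # secrets.txt is gone from the working tree
git -C theirs cat-file -t "$leaked"     # prints: commit (the old commit still exists locally)
```

The last line is the uncomfortable part: the collaborator's clone still holds the withdrawn commit, even though it no longer appears in the history of main.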
Git has a special configuration file, .gitignore, that tells it to ignore specific files. For example, you might include a .gitignore file at the top level of the repository containing:
*.png
database_credentials.json
logs/
This will tell git to ignore all PNG files, the database credentials file, and the entire logs folder. As long as those files were never committed in the first place, you will not be prompted to add or commit them, which makes it much harder to accidentally push them to a public repository.
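A quick way to check that the patterns do what you expect is to try them in a scratch repository. The file names below (plot.png, logs/run.log, analysis.py) are placeholders I picked for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                     # -b needs git 2.28+

# Write the same three patterns as above into .gitignore
printf '%s\n' '*.png' 'database_credentials.json' 'logs/' > .gitignore

# Placeholder files: three that should be ignored, one that should not
mkdir logs
touch plot.png database_credentials.json logs/run.log analysis.py

git status --short                          # lists only .gitignore and analysis.py
git check-ignore -v plot.png logs/run.log   # shows which pattern matched each path
```

git check-ignore is handy whenever a file is unexpectedly missing from git status and you want to know which rule is hiding it.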
Git does not record moves and renames explicitly. Instead, it compares the contents of the “new” file and the “missing” file, and if the files are identical or extremely similar, it assumes the file has simply been moved. However, if you rename a file and then make substantial changes to it before committing, chances are git will no longer recognize it as the same file, and will show the change as deleting the old file and creating a new file. This is inconvenient, since it breaks the link to the file’s earlier history, making it difficult to track changes. Fortunately, the solution is simple: move your files using git, and commit the move before making further edits:
git mv oldfile newfile
This stages the rename in a single step. Committing it on its own, while the old and new contents are still identical, means git reliably detects the rename, and tools like git log --follow can trace the file’s history across it.
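Here is a short sketch of a rename followed by a later edit, in a scratch repository with placeholder contents. The rename goes into its own commit before the file is changed.

```shell
set -e
# Throwaway identity, used only for this demo
export GIT_AUTHOR_NAME="Example User"
export GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User"
export GIT_COMMITTER_EMAIL="user@example.com"
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                     # -b needs git 2.28+
printf 'first draft\nwith a few lines\nof content\n' > oldfile
git add oldfile
git commit -q -m "Add oldfile"

git mv oldfile newfile
status_after_mv=$(git status --short)
echo "$status_after_mv"                 # prints: R  oldfile -> newfile
git commit -q -m "Rename oldfile to newfile"

echo "a later edit" >> newfile          # edit only after the rename is committed
git commit -q -am "Edit newfile"

git log --follow --format=%s -- newfile # all three commits, including "Add oldfile"
```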
Git is a powerful tool for collaborating with others on code. It has an extraordinary breadth of functionality, because it was created by the Linux kernel developers to help them manage what is likely the single most complicated open source project in human history. However, this post covers most of the basics, and as much as I’ve needed for almost any project. I picked up git on my own, as I think many scientists do, and for several years felt like I half-understood what I was doing and stumbled my way through getting git to do what I wanted. Hopefully this can help someone in a similar place!