Git Shallow Clone and Clone Depth

Understanding Git Shallow Clone and Clone Depth

Git is a distributed version control system, and that is one of its advantages: you don’t have to depend on a central server or repository to work locally. Everything you need regarding your module’s history is right at your fingertips. However, it can become a problem when you are dealing with repositories that contain large binary files or have a long history. If you need to download the repository fresh every time, as on a build server, the size and download times can become an issue.

Git’s solution to the problem is the shallow clone, where you use the clone depth to define how far back in history your clone should go. For example, if you use --depth 1, then during cloning Git fetches only the most recent commit and the files it needs. It can save you a lot of space and time.

Git Shallow Clone and Size

Let’s take a look at the popular Git repository for Django. If you make a full clone of the repo, you get the following:

$ git clone https://github.com/django/django.git

Cloning into 'django'...
remote: Counting objects: 409053, done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 409053 (delta 6), reused 8 (delta 1), pack-reused 409026
Receiving objects: 100% (409053/409053), 167.77 MiB | 5.95 MiB/s, done.
Resolving deltas: 100% (297045/297045), done.
Checking connectivity... done.
Checking out files: 100% (5860/5860), done.

Now if you check the size of your local copy, it is:

$ du -sh django/

225M    django/

Let’s get the same Django repository with a shallow clone:

$ git clone --depth 1 https://github.com/django/django.git

Cloning into 'django'...
remote: Counting objects: 8091, done.
remote: Compressing objects: 100% (4995/4995), done.
remote: Total 8091 (delta 2036), reused 5507 (delta 1833), pack-reused 0
Receiving objects: 100% (8091/8091), 8.82 MiB | 3.29 MiB/s, done.
Resolving deltas: 100% (2036/2036), done.
Checking connectivity... done.
Checking out files: 100% (5860/5860), done.

Now if you check the size of your local copy, it should be significantly less:

$ du -sh django/

55M       django/

When your server is dealing with hundreds of product lines, this kind of hard disk space saving can be helpful. In game projects, where there are heavy binaries, this can have a dramatic effect. It also helps with long-running projects. For example, a full clone of the Linux repository from GitHub is more than 7GB, but a shallow clone of it takes less than 1GB.

Git Shallow Clone and History

You can try out shallow cloning locally with your own repository. Let’s create a file in our local repository, then change it and commit the change 10 times. After that we can clone the repository:

$ mkdir _example
$ cd _example
$ ls
$ git init
Initialized empty Git repository in /Users/zakh/git_repo/_example/.git/
$ echo x > large_file
$ git add -A
$ git commit -m "Initial commit"
[master (root-commit) dd11686] Initial commit
1 file changed, 1 insertion(+)
create mode 100644 large_file

$ echo xx > large_file
$ git add -A
$ git commit -m "Modification to large_file 1"
[master 9efa367] Modification to large_file 1
1 file changed, 1 insertion(+), 1 deletion(-)

..........
..........

$ mkdir test
$ cd test
$ git clone file:///Users/zakh/git_repo/_example

Cloning into '_example'...
remote: Counting objects: 33, done.
remote: Compressing objects: 100% (22/22), done.
remote: Total 33 (delta 10), reused 0 (delta 0)
Receiving objects: 100% (33/33), 50.03 MiB | 42.10 MiB/s, done.
Resolving deltas: 100% (10/10), done.
Checking connectivity... done.

In this example, we have created the _example Git repository in the /Users/zakh/git_repo/ folder with a single file named large_file. Only the first two commits are shown. We then create a full clone of that repository in a different location.
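The commits elided above can be reproduced with a short loop. The following is a minimal sketch rather than the exact commands used for this article: the scratch directory from mktemp, the throwaway Git identity, and the tiny file contents are all assumptions, so the sizes it produces will not match the 50 MiB transfers shown in the output.

```shell
#!/bin/sh
# Sketch: rebuild an _example repository with one initial commit
# plus ten modifications to the same file.
set -eu

# Throwaway identity so commits work without a global Git config (assumption).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

workdir=$(mktemp -d)        # hypothetical scratch location
repo="$workdir/_example"
mkdir "$repo"
cd "$repo"
git init -q

echo x > large_file
git add -A
git commit -q -m "Initial commit"

# Ten further commits, each one changing large_file.
i=1
while [ "$i" -le 10 ]; do
    echo "revision $i" >> large_file
    git add -A
    git commit -q -m "Modification to large_file $i"
    i=$((i + 1))
done

commit_count=$(git rev-list --count HEAD)
echo "commits in $repo: $commit_count"
```

To get closer to the article’s setup, write a genuinely large file on each iteration instead of a single line.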

Then let’s check the history of our commits from inside the cloned _example directory:

$ git log --oneline

7fa451f Modification to large_file 10
648d8c9 Modification to large_file 9
772547a Modification to large_file 8
13dd9ab Modification to large_file 7
5e73b67 Modification to large_file 6
030a6e7 Modification to large_file 5
1d14922 Modification to large_file 4
bc0f2c2 Modification to large_file 3
2794f11 Modification to large_file 2
d4374fb Modification to large_file 1
924829d Initial commit

We see all the commits in the full clone.
Now let’s delete the current copy and then shallow clone with a depth of 1:

$ git clone --depth 1 file:///Users/zakh/git_repo/_example

Cloning into '_example'...
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Receiving objects: 100% (3/3), 50.02 MiB | 65.12 MiB/s, done.
Checking connectivity... done.

If we look at the history now, we see only the most recent commit:

$ git log --oneline

7fa451f Modification to large_file 10

Let’s shallow clone with a depth of 3:

$ git clone --depth 3 file:///Users/zakh/git_repo/_example

Cloning into '_example'...
remote: Counting objects: 9, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 9 (delta 2), reused 0 (delta 0)
Receiving objects: 100% (9/9), 50.02 MiB | 65.15 MiB/s, done.
Resolving deltas: 100% (2/2), done.
Checking connectivity... done.

Now we see more commits:

$ git log --oneline

7fa451f Modification to large_file 10
648d8c9 Modification to large_file 9
772547a Modification to large_file 8
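Instead of reading the log by eye, you can ask Git directly whether a clone is shallow and how many commits it can see. The sketch below builds a small throwaway repository (the paths, identity, and commit messages are assumptions) and clones it with a depth of 3. Note that git rev-parse --is-shallow-repository needs Git 2.15 or newer.

```shell
#!/bin/sh
# Sketch: confirm that a --depth 3 clone is shallow and exposes 3 commits.
set -eu

# Throwaway identity so commits work without a global Git config (assumption).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

workdir=$(mktemp -d)        # hypothetical scratch location
origin="$workdir/origin"
mkdir "$origin"
cd "$origin"
git init -q

# Five commits in the source repository.
i=1
while [ "$i" -le 5 ]; do
    echo "revision $i" > large_file
    git add -A
    git commit -q -m "Commit $i"
    i=$((i + 1))
done

# The file:// scheme matters: with a plain local path Git ignores --depth.
cd "$workdir"
git clone -q --depth 3 "file://$origin" shallow
cd shallow

is_shallow=$(git rev-parse --is-shallow-repository)   # Git 2.15+
visible_commits=$(git rev-list --count HEAD)
echo "shallow: $is_shallow, visible commits: $visible_commits"
```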

Problems with Git Shallow Clone

Users should understand that the size and download time savings depend on how the repository’s history is organized, for example how much of its size sits in old revisions rather than in the latest files. The savings can differ significantly from one repository to another. It’s a good idea to test the repository with a shallow clone to check how much hard disk space and download time it will save you.
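One way to run that test is to clone the same repository twice and compare the results with du. The sketch below does this against a throwaway local repository filled with random data; the paths, identity, and file sizes are assumptions chosen so the script runs offline. For a real measurement, replace the file:// URL with your repository’s URL and prefix each clone with time.

```shell
#!/bin/sh
# Sketch: compare the on-disk size of a full clone and a --depth 1 clone.
set -eu

# Throwaway identity so commits work without a global Git config (assumption).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

workdir=$(mktemp -d)        # hypothetical scratch location
origin="$workdir/origin"
mkdir "$origin"
cd "$origin"
git init -q

# Five revisions of incompressible data, so old history dominates the size.
i=1
while [ "$i" -le 5 ]; do
    head -c 200000 /dev/urandom > large_file
    git add -A
    git commit -q -m "Revision $i"
    i=$((i + 1))
done

cd "$workdir"
git clone -q "file://$origin" full
git clone -q --depth 1 "file://$origin" shallow

# du -sk reports kilobytes; the first tab-separated field is the number.
full_kb=$(du -sk full | cut -f1)
shallow_kb=$(du -sk shallow | cut -f1)
echo "full clone:    ${full_kb} KiB"
echo "shallow clone: ${shallow_kb} KiB"
```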

Another consideration is that even though you can push code from a shallow clone, the push might take longer because Git has to work out what the remote already has from a truncated local history. So if you regularly commit and push from the local copy, it probably makes sense to use a full clone.
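If a shallow clone later turns out to be too limiting, it does not have to be thrown away: git fetch --unshallow downloads the missing history and turns it into a full clone. The following sketch shows this on a throwaway local repository; the paths, identity, and commit messages are assumptions.

```shell
#!/bin/sh
# Sketch: convert a --depth 1 clone into a full clone with git fetch --unshallow.
set -eu

# Throwaway identity so commits work without a global Git config (assumption).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

workdir=$(mktemp -d)        # hypothetical scratch location
origin="$workdir/origin"
mkdir "$origin"
cd "$origin"
git init -q

# Five commits in the source repository.
i=1
while [ "$i" -le 5 ]; do
    echo "revision $i" > large_file
    git add -A
    git commit -q -m "Commit $i"
    i=$((i + 1))
done

cd "$workdir"
git clone -q --depth 1 "file://$origin" shallow
cd shallow

commits_before=$(git rev-list --count HEAD)
git fetch -q --unshallow            # fetch the rest of the history
commits_after=$(git rev-list --count HEAD)
echo "commits before: $commits_before, after: $commits_after"
```

To extend the history by a limited amount instead of fetching all of it, git fetch --depth=<n> sets a new depth for the existing clone.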

Multiple Branch Option

When you use the --depth flag with the clone command, Git assumes the --single-branch flag by default. But you can use the --no-single-branch flag to tell Git to fetch history to the specified depth for every branch.

Here are the Django branches without the --no-single-branch option (depth 1):

$ git branch -a
* master
remotes/origin/HEAD -> origin/master
remotes/origin/master

Only the master branch is present.

Here are the Django branches after using the --no-single-branch option:

$ git clone --depth 1 --no-single-branch https://github.com/django/django.git

Cloning into 'django'...
remote: Counting objects: 95072, done.
remote: Compressing objects: 100% (42524/42524), done.
remote: Total 95072 (delta 52343), reused 82284 (delta 42389), pack-reused 0
Receiving objects: 100% (95072/95072), 74.69 MiB | 3.95 MiB/s, done.
Resolving deltas: 100% (52343/52343), done.
Checking connectivity... done.
Checking out files: 100% (5860/5860), done.

$ du -sh django

124M        django

Notice that even though the depth is still 1, the size of the clone is 124M instead of the 55M in the previous case.
If we check the branches, we should see many more of them in this clone:

$ cd django
$ git branch -a
* master
remotes/origin/HEAD -> origin/master
remotes/origin/attic/boulder-oracle-sprint
remotes/origin/attic/full-history
remotes/origin/attic/generic-auth
remotes/origin/attic/gis
remotes/origin/attic/i18n
remotes/origin/attic/magic-removal
remotes/origin/attic/multi-auth
remotes/origin/attic/multiple-db-support
remotes/origin/attic/new-admin
remotes/origin/attic/newforms-admin
remotes/origin/attic/per-object-permissions
remotes/origin/attic/queryset-refactor
remotes/origin/attic/schema-evolution
remotes/origin/attic/schema-evolution-ng
remotes/origin/attic/search-api
remotes/origin/attic/sqlalchemy
remotes/origin/attic/unicode
remotes/origin/master
remotes/origin/soc2009/admin-ui
remotes/origin/soc2009/http-wsgi-improvements
remotes/origin/soc2009/i18n-improvements
remotes/origin/soc2009/model-validation
remotes/origin/soc2009/multidb
remotes/origin/soc2009/test-improvements
remotes/origin/soc2010/app-loading
remotes/origin/soc2010/query-refactor
remotes/origin/soc2010/test-refactor
remotes/origin/stable/0.90.x
remotes/origin/stable/0.91.x
remotes/origin/stable/0.95.x
remotes/origin/stable/0.96.x
remotes/origin/stable/1.0.x
remotes/origin/stable/1.1.x
remotes/origin/stable/1.10.x
remotes/origin/stable/1.11.x
remotes/origin/stable/1.2.x
remotes/origin/stable/1.3.x
remotes/origin/stable/1.4.x
remotes/origin/stable/1.5.x
remotes/origin/stable/1.6.x
remotes/origin/stable/1.7.x
remotes/origin/stable/1.8.x
remotes/origin/stable/1.9.x
remotes/origin/stable/2.0.x
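The same difference can be reproduced offline on a small scale. The sketch below creates a throwaway repository with two extra branches (the branch names, paths, and identity are hypothetical) and counts the remote-tracking branches in a default shallow clone and in one made with --no-single-branch.

```shell
#!/bin/sh
# Sketch: count remote-tracking branches with and without --no-single-branch.
set -eu

# Throwaway identity so commits work without a global Git config (assumption).
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

workdir=$(mktemp -d)        # hypothetical scratch location
origin="$workdir/origin"
mkdir "$origin"
cd "$origin"
git init -q
echo x > file
git add -A
git commit -q -m "Initial commit"

# Two extra branches pointing at the same commit (hypothetical names).
git branch stable/1.x
git branch stable/2.x

cd "$workdir"
git clone -q --depth 1 "file://$origin" single
git clone -q --depth 1 --no-single-branch "file://$origin" all

# Count refs under refs/remotes/origin, ignoring the symbolic origin/HEAD.
count_remote_branches() {
    git -C "$1" for-each-ref --format='%(refname)' refs/remotes/origin |
        grep -vc '/HEAD$'
}

single_count=$(count_remote_branches single)
all_count=$(count_remote_branches all)
echo "default shallow clone: $single_count remote branch(es)"
echo "--no-single-branch:    $all_count remote branch(es)"
```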

Summary

Git shallow clone can help you save time and hard disk space. But it comes at a price. If you regularly push code to remote repositories, a shallow clone can slow those pushes down. So, for regular development workflows, it’s a good idea to avoid shallow clones.

About the author

Zak H

Zak H. lives in Los Angeles. He enjoys the California sunshine and loves working in emerging technologies and writing about Linux and DevOps topics.