Just One Single History

Josh combines the advantages of monorepos with those of multirepos by leveraging a blazingly-fast, incremental, and reversible implementation of git history filtering.

Concept

Traditionally, history filtering has been viewed as an expensive operation that should only be performed to fix issues with a repository, such as purging big binary files or removing accidentally-committed secrets, or as part of a migration to a different repository structure, like switching from multirepo to monorepo (or vice versa).

The implementation shipped with git (git filter-branch) is only usable as a once-in-a-lifetime last resort for anything but tiny repositories.

Faster versions of history filtering have been implemented, such as git-filter-repo or the BFG repo cleaner. Those, while much faster, are designed for doing occasional, destructive maintenance tasks, usually with the idea already in mind that once the filtering is complete the old history should be discarded.

The idea behind josh started with two questions:

  1. What if history filtering could be so fast that it can be part of a normal, everyday workflow, running on every single push and fetch without the user even noticing?
  2. What if history filtering was a non-destructive, reversible operation?

Under those two premises a filter operation stops being a maintenance task. It seamlessly relates histories between repos, which can be used by developers and CI systems interchangeably in whatever way is most suitable to the task at hand.

How is this possible?

Filtering history is a highly predictable task: the set of filters that tend to be used for any given repository is limited, and the input to the filter (a git branch) only gets modified in an incremental way. Thus, by keeping a persistent cache between filter runs, the work needed to re-run a filter on a new commit (and its history) becomes proportional to the number of changes since the last run; it no longer depends on the total length of the history. Additionally, most filters do not depend on the size of the trees.

What has long been known to be true for performing merges also applies to history filtering: the more often it is done, the less work it takes each time.
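The caching idea described above can be sketched with a toy shell model. This is not how Josh is implemented: the commit ids (c1, c2, ...), the filter_one stand-in and the file-per-commit cache are all made up for illustration, and the model ignores that a real filtered commit depends on its filtered parents.

```shell
# Toy model of incremental filtering with a persistent cache.
# Commit ids and filter_one are hypothetical stand-ins.
cache_dir=$(mktemp -d)

# Stand-in for the real per-commit filter work.
filter_one() {
    printf 'filtered-%s\n' "$1"
}

# Walk the history newest-first and stop at the first cached commit:
# everything older was already filtered by an earlier run.
filter_history() {
    work=0
    for commit in "$@"; do
        if [ -f "$cache_dir/$commit" ]; then
            break
        fi
        filter_one "$commit" > "$cache_dir/$commit"
        work=$((work + 1))
    done
}

filter_history c3 c2 c1        # first run: nothing is cached yet
echo "first run filtered $work commits"
filter_history c4 c3 c2 c1     # one new commit on top of the old history
echo "second run filtered $work commits"
```

The first run filters 3 commits and the second run only 1: the amount of work follows the number of new commits, not the length of the history.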

To guarantee that filters are reversible, we have to restrict the kinds of filter that can be used; it is not possible to write arbitrary filters in a scripting language, as other tools allow. To still cover a wide range of use cases, we have introduced a domain-specific language (DSL) that expresses more complex filters as a combination of simpler ones. Apart from guaranteeing reversibility, the use of a DSL also enables pre-optimization of filter expressions to minimize both the amount of work needed to execute the filter and the on-disk size of the persistent cache.

From Linus Torvalds' 2007 talk at Google about git:

Audience:

Can you have just a part of files pulled out of a repository, not the entire repository?

Linus:

You can export things as tarballs, you can export things as individual files, you can rewrite the whole history to say "I want a new version of that repository that only contains that part", you can do that, it is a fairly expensive operation it's something you would do for example when you import an old repository into a one huge git repository and then you can split it later on to be multiple smaller ones, you can do it, what I am trying to say is that you should generally try to avoid it. It's not that git can not handle huge projects, git would not perform as well as it would otherwise. And you will have issues that you wish you didn't have.

So I am skipping this issue and going back to the performance issue. One of the things I want to say about performance is that a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true.

That is not what performance is all about. If you can do something really fast, really well, people will start using it differently.

Use cases

Partial cloning

Reduce scope and size of clones by treating subdirectories of the monorepo as individual repositories.

$ git clone http://josh/monorepo.git:/path/to/library.git

The partial repo will act as a normal git repository but only contain the files found in the subdirectory and only commits affecting those files. The partial repo supports both fetch and push operations.

This not only improves performance on the client due to having fewer files in the tree, it also enables collaboration on parts of the monorepo with other parties using git's normal distributed development features. For example, this makes it easy to mirror just selected parts of your repo to public GitHub repositories or to specific customers.

Project composition / Workspaces

Simplify code sharing and dependency management. Beyond just subdirectories, Josh supports filtering, re-mapping and composition of arbitrary virtual repositories from the content found in the monorepo.

The mapping itself is also stored in the repository and therefore versioned alongside the code.

Multiple projects that depend on a shared set of libraries can thus live together in a single repository. This approach is commonly referred to as “monorepo”, and was popularized by Google, Facebook, and Twitter, to name a few.

In this example, two projects (project1 and project2) coexist in the central monorepo.

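Central monorepo: [figure: folders and files in central.git]

Project workspace project1.git: [figure: folders and files in project1.git]

workspace.josh file of project1:

dependencies = :/modules:[
    ::tools/
    ::library1/
]

Project workspace project2.git: [figure: folders and files in project2.git]

workspace.josh file of project2:

libs/library1 = :/modules/library1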

Workspaces act as normal git repos:

$ git clone http://josh/central.git:workspace=workspaces/project1.git

Each of the subprojects defines a workspace.josh file, defining the mapping between the original central.git repository and the hierarchy in use inside of the project.

In this setup, project1 and project2 can seamlessly depend on the latest version of library1, while only checking out the part of the central monorepo that's needed for their purpose. What's more, any changes to a shared module will be synced in both directions.

If a developer of library1 pushes a new update, both projects will get the new version, and the developer will be able to check whether they broke any tests. If a developer of project1 needs to update the library, the changes will be automatically shared back into central and project2.

Simplified CI/CD

With everything stored in one repo, CI/CD systems only need to look into one source for each particular deliverable. However, in traditional monorepo environments dependency management is handled by the build system. Build systems are usually tailored to specific languages and need their input already checked out on the filesystem. So the question:

"What deliverables are affected by a given commit and need to be rebuilt?"

cannot be answered without cloning the entire repository and understanding how the languages used handle dependencies.

In particular when using C family languages, hidden dependencies on header files are easy to miss. For this reason, limiting the visibility of files to the compiler by sandboxing is pretty much a requirement for reproducible builds.

With Josh, each deliverable gets its own virtual git repository with dependencies declared in the workspace.josh file. This means answering the above question becomes as simple as comparing commit ids. Furthermore, due to the tree filtering, each build is guaranteed to be perfectly sandboxed and only sees those parts of the monorepo that have actually been mapped.

This also means the deliverables to be rebuilt can be determined without cloning any repos, as is typically necessary with normal build tools.
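As a rough sketch of what "comparing commit ids" could look like in a CI job: git ls-remote is a standard git command that lists remote refs without cloning. That josh-proxy answers it for a workspace URL is an assumption of this sketch, and last_built.txt is a hypothetical state file kept by the CI system.

```shell
# filtered_head URL: print the commit id of HEAD as seen through a Josh URL.
# Assumption: the proxy answers ls-remote for filtered URLs like any git host.
filtered_head() {
    git ls-remote "$1" HEAD | cut -f1
}

# needs_rebuild LAST_BUILT CURRENT: succeed if the two commit ids differ.
needs_rebuild() {
    [ "$1" != "$2" ]
}

# Hypothetical usage (last_built.txt is a made-up state file):
#   current=$(filtered_head http://josh/central.git:workspace=workspaces/project1.git)
#   if needs_rebuild "$(cat last_built.txt)" "$current"; then
#       echo "project1 changed, rebuilding"
#   fi
```

Because the filtered history only contains commits that touch mapped paths, an unchanged id means nothing relevant to that deliverable has changed.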

GraphQL API

It is often desirable to access content stored in git without requiring a clone of the repository. This is useful for CI/CD systems or web frontends such as dashboards.

Josh exposes a GraphQL API for that purpose. For example, it can be used to find all workspaces currently present in the tree:

query {
  rev(at:"refs/heads/master", filter:"::**/workspace.josh") {
    files { path }
  }
}

Caching proxy

Even without using the more advanced features like partial cloning or workspaces, josh-proxy can act as a cache to reduce traffic between locations or keep your CI from performing many requests to the main git host.

Frequently Asked Questions

How is Josh different from git sparse-checkout?

Josh operates on the git object graph and is unrelated to checking out files and the working tree on the filesystem, which is the only thing sparse-checkout is concerned with. A sparse checkout influences neither the contents of the object database nor what gets downloaded over the network. Both can certainly be used together if needed.

How is Josh different from partial clone?

A partial clone will cause git to download only parts of an object database according to some predicate. It is still the same object database, with the history having the same commits and sha1s. It still allows loading skipped parts of the object database at a later point. Josh creates an alternate history that has no reference to the skipped parts. As such, it is very similar to git filter-branch, just faster, with added features and a different user interface.

How is it different from submodules?

Where git submodules are multiple, independent repos referencing each other with SHAs, Josh supports the monorepo approach. All of the code is in one single repo which can easily be kept in sync, and Josh provides any subfolder (or, in the case of workspaces, a more complicated recombination of folders) as its own git repository. These repos are transparently synchronised both ways with the main monorepo. Josh can thus do more than submodules can, and is easier and faster to use.

How is it different from git subtree?

The basic idea behind Josh is quite similar to git subtree. However git subtree, just like git filter-branch, is way too slow for everyday use, even on medium sized repos. git subtree can only achieve acceptable performance when squashing commits and therefore losing history. One core part of Josh is essentially a much faster implementation of git subtree split which has been specifically optimized for being run frequently inside the same repository.

How is Josh different from git filter-repo?

Both josh-filter and git filter-repo enable very fast rewriting of Git history and thus can in simple cases be used for the same purpose.

Which one is right in more advanced use cases depends on your goals: git filter-repo offers more flexibility and options on what kind of filtering it supports, like rewriting commit messages or even plugging arbitrary scripts into the filtering.

Josh uses a DSL instead of arbitrary scripts for complex filters and is much more restrictive in the kind of filtering possible. In exchange for those limitations it offers incremental filtering as well as bidirectional operation, meaning changes can be converted in both directions between the original and the filtered repos.

How is Josh different from all of the above alternatives?

Josh includes josh-proxy which offers repo filtering as a service, mainly intended to support monorepo workflows.

Getting Started

This book will guide you through setting up the josh proxy to serve your own git repository.

NOTE

All the commands are included from the file gettingstarted.t which can be run with cram.

Setting up the proxy

Josh is distributed via Docker Hub, and is installed and started with the following command:

  $ docker run \
  >   --name josh-proxy \
  >   --detach \
  >   --publish 8000:8000 \
  >   --env JOSH_REMOTE=https://github.com \
  >   --volume josh-vol:/data/git \
  >   joshproject/josh-proxy:latest >/dev/null

This starts Josh as a proxy to github.com in a Docker container, creating a volume josh-vol and mounting it into the container for use by Josh.

Cloning a repository

Once Josh is running, we can clone a repository through it. For example, let's clone Josh:

  $ git clone http://localhost:8000/josh-project/josh.git
  Cloning into 'josh'...
  $ cd josh

As we can see, this repository is simply the normal Josh one:

  $ ls
  Cargo.lock
  Cargo.toml
  Dockerfile
  Dockerfile.tests
  LICENSE
  Makefile
  README.md
  docs
  josh-proxy
  run-josh.sh
  run-tests.sh
  rustfmt.toml
  scripts
  src
  static
  tests
  $ git log -2
  commit fc6af1e10c865f790bff7135d02b1fa82ddebe29
  Author: Christian Schilling <christian.schilling@esrlabs.com>
  Date:   Fri Mar 19 11:15:57 2021 +0100
  
      Update release.yml
  
  commit 975581064fa21b3a3d6871a4e888fd6dc1129a13
  Author: Christian Schilling <christian.schilling@esrlabs.com>
  Date:   Fri Mar 19 11:11:45 2021 +0100
  
      Update release.yml

Cloning a part of the repo

Josh becomes interesting when we want to clone a part of the repo. Let's check out the Josh repository again, but this time let's filter only the documentation out:

  $ cd ..
  $ git clone http://localhost:8000/josh-project/josh.git:/docs.git
  Cloning into 'docs'...
  $ cd docs

Note the addition of :/docs at the end of the url. This is called a filter, and it instructs josh to only check out the given folder.

Looking inside the repository, we now see that the history is quite different. Indeed, it contains only the commits pertaining to the subfolder that we checked out.

  $ ls
  book.toml
  src
  $ git log -2
  commit dd26c506f6d6a218903b9f42a4869184fbbeb940
  Author: Christian Schilling <christian.schilling@esrlabs.com>
  Date:   Mon Mar 8 09:22:21 2021 +0100
  
      Update docs to use docker for default setup
  
  commit ee6abba0fed9b99c9426f5224ff93cfee2813edc
  Author: Louis-Marie Givel <louis-marie.givel@esrlabs.com>
  Date:   Fri Feb 26 11:41:37 2021 +0100
  
      Update proxy.md

This repository is a real repository in which we can pull, commit, push, as with a regular one. Josh will take care of synchronizing it with the main one in a transparent fashion.

Working with workspaces

NOTE

All the commands are included from the file workspaces.t which can be run with cram.

Josh really starts to shine when using workspaces.

Simply put, they are a list of files and folders, remapped from the central repository to a new repository. For example, a shared library could be used by various workspaces, each mapping it to their appropriate subdirectory.

In this chapter, we're going to set up a new git repository with a couple of libraries, and then use it to demonstrate the use of workspaces.

Test set-up

NOTE

The following section describes how to set up a local git server with made-up content for the sake of this tutorial. You're free to follow it, or to use your own existing repository, in which case you can skip to the next section.

To host the repository for this test, we need a git server. We're going to run git as a cgi program using its provided http backend, served with the test server included in the hyper_cgi crate.

Serving the git repo

First, we create a bare repository, which will be served by hyper_cgi. We enable the option http.receivepack to allow the use of git push from the clients.

  $ git init --bare ./remote/real_repo.git/
  Initialized empty Git repository in */real_repo.git/ (glob)
  $ git config -f ./remote/real_repo.git/config http.receivepack true

Then we start the server which will allow clients to access the repository through http.

  $ GIT_DIR=./remote/ GIT_PROJECT_ROOT=${TESTTMP}/remote/ GIT_HTTP_EXPORT_ALL=1 hyper-cgi-test-server\
  >  --port=8001\
  >  --dir=./remote/\
  >  --cmd=git\
  >  --args=http-backend\
  >  > ./hyper-cgi-test-server.out 2>&1 &
  $ echo $! > ./server_pid

Our server is ready, serving all the repos in the remote folder on port 8001.

  $ git clone http://localhost:8001/real_repo.git
  Cloning into 'real_repo'...
  warning: You appear to have cloned an empty repository.

Adding some content

Of course, the repository is for now empty, and we need to populate it. The populate.sh script creates a couple of libraries, as well as two applications that use them.

  $ cd real_repo
  $ sh ${TESTDIR}/populate.sh > ../populate.out

  $ git push origin HEAD
  To http://localhost:8001/real_repo.git
   * [new branch]      HEAD -> master

  $ tree
  .
  |-- application1
  |   `-- app.c
  |-- application2
  |   `-- guide.c
  |-- doc
  |   |-- guide.md
  |   |-- library1.md
  |   `-- library2.md
  |-- library1
  |   `-- lib1.h
  `-- library2
      `-- lib2.h
  
  5 directories, 7 files
  $ git log --oneline --graph
  * f65e94b Add documentation
  * f240612 Add application2
  * 0a7f473 Add library2
  * 1079ef1 Add application1
  * 6476861 Add library1

Creating our first workspace

Now that we have a git repo populated with content, let's serve it through josh:

$ docker run -d --network="host" -e JOSH_REMOTE=http://127.0.0.1:8001 -v josh-vol:$(pwd)/git_data joshproject/josh-proxy:latest > josh.out

NOTE

For the sake of this example, we run docker with --network="host" instead of publishing the port. This is so that docker can access localhost, where our ad-hoc git repository is served.

To facilitate development on applications 1 and 2, we want to create workspaces for them. Creating a new workspace looks very similar to checking out a subfolder through josh, as explained in "Getting Started".

Instead of just the name of the subfolder, though, we also use the :workspace= filter:

  $ git clone http://127.0.0.1:8000/real_repo.git:workspace=application1.git application1
  Cloning into 'application1'...
  $ cd application1
  $ tree
  .
  `-- app.c
  
  0 directories, 1 file
  $ git log -2
  commit 50cd6112e173df4cac1aca9cb88b5c2a180bc526
  Author: Josh <josh@example.com>
  Date:   Thu Apr 7 22:13:13 2005 +0000
  
      Add application1

Looking into the newly cloned workspace, we see our expected files and the history containing the only relevant commit.

NOTE

Josh allows us to create a workspace out of any directory, even one that doesn't exist yet.

Adding workspace.josh

The workspace.josh file describes how folders from the central repository (real_repo.git) should be mapped to the workspace repository.

Since we depend on library1, let's add it to the workspace file.

  $ echo "modules/lib1 = :/library1" >> workspace.josh

  $ git add workspace.josh

  $ git commit -m "Map library1 to the application1 workspace"
  [master 06361ee] Map library1 to the application1 workspace
   1 file changed, 1 insertion(+)
   create mode 100644 workspace.josh

We decided to map library1 to modules/lib1 in the workspace. We can now sync up with the server:

  $ git sync origin HEAD
    HEAD -> refs/heads/master
  From http://127.0.0.1:8000/real_repo.git:workspace=application1
   * branch            753d62ca1af960a3d071bb3b40722471228abbf6 -> FETCH_HEAD
  HEAD is now at 753d62c Map library1 to the application1 workspace
  Pushing to http://127.0.0.1:8000/real_repo.git:workspace=application1.git
  POST git-receive-pack (477 bytes)
  remote: josh-proxy        
  remote: response from upstream:        
  remote: To http://localhost:8001/real_repo.git        
  remote:    f65e94b..37184cc  JOSH_PUSH -> master        
  remote: REWRITE(06361eedf6d6f6d7ada6000481a47363b0f0c3de -> 753d62ca1af960a3d071bb3b40722471228abbf6)        
  remote: 
  remote: 
  updating local tracking ref 'refs/remotes/origin/master'
  

Let's observe the result:

  $ tree
  .
  |-- app.c
  |-- modules
  |   `-- lib1
  |       `-- lib1.h
  `-- workspace.josh
  
  2 directories, 3 files
  $ git log --graph --oneline
  *   753d62c Map library1 to the application1 workspace
  |\  
  | * 366adba Add library1
  * 50cd611 Add application1

After pushing and fetching the result, we see that it has been successfully mapped by josh.

One surprising thing is the history: our "mapping" commit became a merge commit! This is because josh needs to merge the history of the module we want to map into the repository of the workspace. After this is done, all commits will be present in both of the histories.

NOTE

git sync is a utility provided with josh which will push contents, and, if josh tells it to, fetch the transformed result. Otherwise, it works like git push.

By the way, what does the history look like on the real_repo?

  $ cd ../real_repo
  $ git pull origin master
  From http://localhost:8001/real_repo
   * branch            master     -> FETCH_HEAD
     f65e94b..37184cc  master     -> origin/master
  Updating f65e94b..37184cc
  Fast-forward
   application1/workspace.josh | 1 +
   1 file changed, 1 insertion(+)
   create mode 100644 application1/workspace.josh
  Current branch master is up to date.

  $ tree
  .
  |-- application1
  |   |-- app.c
  |   `-- workspace.josh
  |-- application2
  |   `-- guide.c
  |-- doc
  |   |-- guide.md
  |   |-- library1.md
  |   `-- library2.md
  |-- library1
  |   `-- lib1.h
  `-- library2
      `-- lib2.h
  
  5 directories, 8 files
  $ git log --graph --oneline
  * 37184cc Map library1 to the application1 workspace
  * f65e94b Add documentation
  * f240612 Add application2
  * 0a7f473 Add library2
  * 1079ef1 Add application1
  * 6476861 Add library1

We can see the newly added commit for workspace.josh in application1, and as expected, no merge here.

Interacting with workspaces

Let's now create a second workspace, this time for application2. It depends on library1 and library2.

  $ git clone http://127.0.0.1:8000/real_repo.git:workspace=application2.git application2
  Cloning into 'application2'...
  $ cd application2
  $ echo "libs/lib1 = :/library1" >> workspace.josh
  $ echo "libs/lib2 = :/library2" >> workspace.josh
  $ git add workspace.josh && git commit -m "Create workspace for application2"
  [master 566a489] Create workspace for application2
   1 file changed, 2 insertions(+)
   create mode 100644 workspace.josh

Syncing as before:

  $ git sync origin HEAD
    HEAD -> refs/heads/master
  From http://127.0.0.1:8000/real_repo.git:workspace=application2
   * branch            5115fd2a5374cbc799da61a228f7fece3039250b -> FETCH_HEAD
  HEAD is now at 5115fd2 Create workspace for application2
  Pushing to http://127.0.0.1:8000/real_repo.git:workspace=application2.git
  POST git-receive-pack (478 bytes)
  remote: josh-proxy        
  remote: response from upstream:        
  remote: To http://localhost:8001/real_repo.git        
  remote:    37184cc..feb3a5b  JOSH_PUSH -> master        
  remote: REWRITE(566a4899f0697d0bde1ba064ed81f0654a316332 -> 5115fd2a5374cbc799da61a228f7fece3039250b)        
  remote: 
  remote: 
  updating local tracking ref 'refs/remotes/origin/master'
  

And our local folder now contains all the files requested:

  $ tree
  .
  |-- guide.c
  |-- libs
  |   |-- lib1
  |   |   `-- lib1.h
  |   `-- lib2
  |       `-- lib2.h
  `-- workspace.josh
  
  3 directories, 4 files

And the history includes the history of both of the libraries:

  $ git log --oneline --graph
  *   5115fd2 Create workspace for application2
  |\  
  | * ffaf58d Add library2
  | * f4e4e40 Add library1
  * ee8a5d7 Add application2

Note that since we created the workspace and added the dependencies in one single commit, the history just contains this one single merge commit.

Pushing a change from a workspace

While testing application2, we noticed a typo in the library1 dependency. Let's go ahead and fix it!

  $ sed -i 's/41/42/' libs/lib1/lib1.h
  $ git commit -a -m "fix lib1 typo"
  [master 82238bf] fix lib1 typo
   1 file changed, 1 insertion(+), 1 deletion(-)

We can push this change like any normal git change:

  $ git push origin master
  remote: josh-proxy        
  remote: response from upstream:        
  remote: To http://localhost:8001/real_repo.git        
  remote:    feb3a5b..31e8fab  JOSH_PUSH -> master        
  remote: 
  remote: 
  To http://127.0.0.1:8000/real_repo.git:workspace=application2.git
     5115fd2..82238bf  master -> master

Since the change was merged in the central repository, a developer can now pull from the application1 workspace.

  $ cd ../application1
  $ git pull
  From http://127.0.0.1:8000/real_repo.git:workspace=application1
   + 06361ee...c64b765 master     -> origin/master  (forced update)
  Updating 753d62c..c64b765
  Fast-forward
   modules/lib1/lib1.h | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
  Current branch master is up to date.

The change has been propagated!

  $ git log --oneline --graph
  * c64b765 fix lib1 typo
  *   753d62c Map library1 to the application1 workspace
  |\  
  | * 366adba Add library1
  * 50cd611 Add application1

Importing projects

When moving to a monorepo setup, especially in existing organisations, the need to consolidate existing project repositories commonly arises.

The simplest possible case is one where the previous history of a project does not need to be retained. In this case, the project's files can simply be copied into the monorepo at the appropriate location and committed.

If history should be retained, josh can be used for importing a project as an alternative to built-in git commands like git subtree.

Josh's filter capability lets you perform transformations on the history of a git repository to arbitrarily (re-)compose paths inside of a repository.

A key aspect of this functionality is that all transformations are reversible. This means that if you apply a transformation moving files from the root of a repository to, say, tools/project-b, followed by an inverse transformation moving files from tools/project-b back to the root, you receive the same commit hashes you put in.
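The round trip described above can be illustrated with a toy model in which a tree is just a list of paths. This only models what happens to paths, not commit hashes, and the prefix and subdir helper functions as well as the file names are made up for this sketch; they are not Josh commands.

```shell
# Toy model: a tree is a list of paths, one per line.
# prefix DIR models :prefix=DIR (move everything below DIR).
prefix() { sed "s|^|$1/|"; }
# subdir DIR models :/DIR (keep only paths below DIR and strip DIR).
subdir() { sed -n "s|^$1/||p"; }

# Hypothetical contents of "Project B".
original='Makefile
src/main.c'

moved=$(printf '%s\n' "$original" | prefix tools/project-b)
restored=$(printf '%s\n' "$moved" | subdir tools/project-b)

printf '%s\n' "$moved"
if [ "$restored" = "$original" ]; then
    echo "round trip restored the original tree"
fi
```

In this model, applying the subdirectory filter after the matching prefix filter returns exactly the paths that went in.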

We can use this feature to import a project into our monorepo while allowing external users to keep pulling on top of the same git history they already have, just with a new git remote.

There are multiple ways of doing this, with the most common ones outlined below. You can look at josh#596 for a discussion of several other methods.

Import with josh-filter

Currently, the easiest way to do this is by using the josh-filter binary which is a command-line frontend to josh's filter capabilities.

Inside of our target repository, it would work like this:

  1. Fetch the repository we want to import (say, "Project B", from $REPO_URL).

    $ git fetch $REPO_URL master
    

    This will set the FETCH_HEAD reference to the fetched repository.

  2. Rewrite the history of that repository through josh to make it look as if the project had always been developed at our target path (say, tools/project-b).

    $ josh-filter ':prefix=tools/project-b' FETCH_HEAD
    

    This will set the FILTERED_HEAD reference to the rewritten history.

  3. Merge the rewritten history into our target repository.

    $ git merge --allow-unrelated-histories FILTERED_HEAD
    

    After this merge commit, the previously external project now lives at tools/project-b as expected.

  4. Any external users can now use the :/tools/project-b josh filter to retrieve changes made in the new project location - without the git hashes of their existing commits changing (that is to say, without conflicting).

Import by pushing to josh

If your monorepo is already running a josh-proxy in front of it, you can also import a project by pushing a project merge to josh.

This has the benefit of not needing to clone the entire monorepo locally to do a merge, but the drawback of using a different, slightly slower filter mechanism when exporting the tree back out. For projects with very large history, consider using the josh-filter mechanism outlined above.

Pushing a project merge to josh works like this:

  1. Assume we have a local checkout of "Project B", and we want to merge this into our monorepo. There is a josh-proxy running at https://git.company.name/monorepo.git. We want to merge this project into /tools/project-b in the monorepo.

  2. In the checkout of "Project B", add the josh remote:

    git remote add josh https://git.company.name/monorepo.git:/tools/project-b.git
    

    Note the use of the /tools/project-b.git josh filter, which points to a path that should not yet exist in the monorepo.

  3. Push the repository to josh with the -o merge option, creating a merge commit introducing the project history at that location, while retaining its history:

    git push josh $ref -o merge
    

Note for Gerrit users

With either method, when merging a set of new commits into a Gerrit repository and going through the standard code-review process, Gerrit might complain about missing Change-IDs in the imported commits.

To work around this, the commits need to first be made "known" to Gerrit. This can be achieved by pushing the new parent of the merge commit to a separate branch in Gerrit directly (without going through the review mechanism). After this Gerrit will accept merge commits referencing that parent, as long as the merge commit itself has a Change-ID.

Some monorepo setups on Gerrit use a special unrestricted branch like merge-staging for this, to which users with permission to import projects can force-push unrelated histories.

History filtering

Josh transforms commits by applying filters to them. As any commit in git represents not just a single state of the file system but also its entire history, applying a filter to a commit produces an entirely new history. The result of a filter is a normal git commit and therefore can be filtered again, making filters chainable.

Syntax

Filters always begin with a colon and can be chained:

:filter1:filter2

When used as part of a URL, filters cannot contain white space or newlines. When read from a file, however, white space can be inserted between filters (not after the leading colon). Additionally, newlines can be used instead of , inside of composition filters.

Some filters take arguments, and arguments can optionally be quoted using double quotes, if special characters used by the filter language need to be used (like : or space):

:filter=argument1,"argument2"

Available filters

Subdirectory :/a

Take only the selected subdirectory from the input and make it the root of the filtered tree. Note that :/a/b and :/a:/b are equivalent ways to get the same result.

Directory ::a/

A shorthand for the commonly occurring filter combination :/a:prefix=a.

File ::a

Produces a tree with only the specified file in its root. Note that ::a/b is equivalent to ::a/::b.

Prefix :prefix=a

Take the input tree and place it into subdirectory a. Note that :prefix=a/b and :prefix=b:prefix=a are equivalent.

Composition :[:filter1,:filter2,...,:filterN]

Compose a tree by overlaying the outputs of :filter1 ... :filterN on top of each other. It is guaranteed that each file will appear at most once in the output. The first filter that consumes a file is the one deciding its mapped location. Therefore the order in which filters are composed matters.
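One reading of the "first filter wins" rule can be sketched as a toy model on a list of paths. The filter expression in the comment, the paths and the step-by-step evaluation are illustrative assumptions, not a description of how Josh evaluates compositions internally.

```shell
# Toy model of :[libs=:/modules/library1,::modules/] on a list of paths.
# All paths are made up for illustration.
tree='README.md
modules/library1/lib1.h
modules/tools/fmt.sh'

# Filter 1 (libs=:/modules/library1) consumes modules/library1/* ...
out1=$(printf '%s\n' "$tree" | sed -n 's|^modules/library1/|libs/|p')
# ... so later filters only see what is left over.
rest=$(printf '%s\n' "$tree" | grep -v '^modules/library1/')
# Filter 2 (::modules/) keeps the remaining paths below modules/ in place.
out2=$(printf '%s\n' "$rest" | grep '^modules/')

result=$(printf '%s\n%s\n' "$out1" "$out2")
printf '%s\n' "$result"
```

In this model lib1.h shows up only under libs/, even though the second filter would also have matched it. Swapping the two filters would leave it under modules/library1/ instead.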

Inside of a composition x=:filter can be used as an alternative spelling for :filter:prefix=x.

Exclusion :exclude[:filter]

Remove all paths present in the output of :filter from the input tree. Filters that change paths should generally be avoided here; use only filters that select paths without altering them.

Workspace :workspace=a

Similar to :/a but also looks for a workspace.josh file inside the specified directory (called the "workspace root"). The resulting tree will contain the contents of the workspace root as well as additional files specified in the workspace.josh file. (see Workspaces)

Text replacement :replace("regex_0":"replacement_0",...,"regex_N":"replacement_N")

Applies the supplied regular expressions to every file in the input tree.

Signature removal :unsign

The default behaviour of Josh is to copy the signature of the original commit, if it exists, into the filtered commit. This makes the signature invalid, but allows a perfect round-trip: josh will be able to recreate the original commit from the filtered one.

This behaviour might not be desirable, and this filter drops the signatures from the history.

Pattern filters

The following filters accept a glob-like pattern X that can contain * to match any number of characters. Note that two or more consecutive wildcards (**) are not allowed within X.

Match directories ::X/

All matching subdirectories in the input root

Match files or directories ::X

All matching files or directories in the input root

Match nested directories ::**/X/

All subdirectories matching the pattern in arbitrarily deep subdirectories of the input

Match nested files ::**/X

All files matching the pattern in arbitrarily deep subdirectories of the input
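
The pattern filters can be approximated with Python's fnmatch module. This is only a sketch under assumptions: trees are dicts of paths, the helper names are invented, and fnmatch also understands ? and [...], which the text above does not mention. Whether ::**/X also matches files directly in the root is not stated above, so the example avoids that case.

```python
from fnmatch import fnmatchcase

# Toy model of the pattern filters; not Josh code.

def match_root(tree, pattern):
    """Model of ::X - files or directories in the root whose name matches."""
    return {p: c for p, c in tree.items()
            if fnmatchcase(p.split("/", 1)[0], pattern)}

def match_root_dirs(tree, pattern):
    """Model of ::X/ - only matching directories in the root."""
    return {p: c for p, c in tree.items()
            if "/" in p and fnmatchcase(p.split("/", 1)[0], pattern)}

def match_nested_files(tree, pattern):
    """Model of ::**/X - files whose own name matches, at any depth."""
    return {p: c for p, c in tree.items()
            if fnmatchcase(p.rsplit("/", 1)[-1], pattern)}

tree = {
    "src/main.rs": 1,
    "src/lib/util.rs": 2,
    "docs/guide.md": 3,
    "README.md": 4,
    "scripts/run.sh": 5,
}

starts_with_s = match_root(tree, "s*")       # src/ and scripts/
root_markdown = match_root(tree, "*.md")     # README.md only
d_dirs = match_root_dirs(tree, "d*")         # docs/
rust_files = match_nested_files(tree, "*.rs")
```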

History filters

These filters do not modify git trees, but instead only operate on the commit graph.

Linearise history :linear

Produce a filtered history that does not contain any merge commits. This is done by simply dropping all parents except the first on every commit.
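
Dropping all parents except the first can be sketched on a toy commit graph. The model below assumes a history is a dict from commit name to its list of parents; in real git the filtered commits would get new hashes, which this sketch ignores.

```python
# Toy model of :linear on a commit graph; not Josh code.

def linear(parents, tip):
    """Walk from `tip`, keeping only the first parent of every commit."""
    result = {}
    commit = tip
    while commit is not None:
        first = parents[commit][:1]  # keep the first parent, drop the rest
        result[commit] = first
        commit = first[0] if first else None
    return result

# A is the root, B and C both branch from A, M merges C into B.
history = {"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"], "D": ["M"]}
flattened = linear(history, "D")
```

In the result M has the single parent B, and C is no longer reachable from D, so the filtered history contains no merge commits.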

Filter specific parts of the history :rev(<sha_0>:filter_0,...,<sha_N>:filter_N)

Produce a history where each commit specified by <sha_N> is replaced by the result of applying :filter_N to it.

It will appear as if <sha_N> and all its ancestors are also filtered with <filter_N>. If an ancestor also has a matching entry in the :rev(...), its filter will replace <filter_N> for all further ancestors (and so on).

The special value 0000000000000000000000000000000000000000 can be used as a <sha_N> to filter commits that don't match any of the other shas.

Filter order matters

Filters are applied in the left-to-right order they are given in the filter specification, and they are not commutative.

For example, this command will filter out just the josh documentation, and store it in a ref named FILTERED_HEAD:

$ josh-filter :/docs:prefix=josh-docs

However, this command will produce an empty branch:

$ josh-filter :prefix=josh-docs:/docs

What's happening in the latter command is that because the prefix filter is applied first, the entire josh history already lives within the josh-docs directory, as it was just transformed to exist there. Thus, to still get the docs, the command would need to be:

$ josh-filter :prefix=josh-docs:/josh-docs/docs

which will contain the josh documentation at the base of the tree. We've lost the prefix. What gives? The original git tree was first transformed by the prefix filter, and then the subdirectory filter was applied to pull the documentation from josh-docs/docs, so the prefix is gone: it was filtered out again by the subdirectory filter. Thus, the order in which filters are provided is crucial, as each filter operates on the output of the one before it.
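
The three commands above can be replayed on a toy tree. The sketch below is a simplified model rather than josh-filter itself: it assumes a tree is a dict of paths and supports only the :/dir and :prefix=dir filters, applied strictly from left to right. The function name apply_spec and the sample file names are made up.

```python
# Toy model of applying a filter chain left to right; not Josh code.

def apply_spec(tree, spec):
    """Apply a chain such as ':/docs:prefix=josh-docs' step by step."""
    for step in spec.split(":")[1:]:
        if step.startswith("/"):
            # :/dir - keep only files below dir and make it the root
            base = step.strip("/") + "/"
            tree = {p[len(base):]: c for p, c in tree.items() if p.startswith(base)}
        elif step.startswith("prefix="):
            # :prefix=dir - move the whole tree below dir
            base = step[len("prefix="):]
            tree = {f"{base}/{p}": c for p, c in tree.items()}
        else:
            raise ValueError(f"unsupported filter in this sketch: {step}")
    return tree

repo = {"docs/intro.md": "intro", "src/main.rs": "code"}

docs_prefixed = apply_spec(repo, ":/docs:prefix=josh-docs")
empty = apply_spec(repo, ":prefix=josh-docs:/docs")
docs_at_root = apply_spec(repo, ":prefix=josh-docs:/josh-docs/docs")
```

Here docs_prefixed holds josh-docs/intro.md, empty is an empty tree because no docs directory exists at the root after prefixing, and docs_at_root holds intro.md with the prefix stripped again.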

josh-proxy

Josh provides an HTTP proxy server that can be used with any git hosting service which communicates via HTTP.

It needs the URL of the upstream server and a local directory to store its data. Optionally, a port to listen on can be specified. For example, running a local josh-proxy instance for github.com on port 8000:

$ docker run -p 8000:8000 -e JOSH_REMOTE=https://github.com -v josh-vol:/data/git joshproject/josh-proxy:latest

Note: While josh-proxy is intended to be used with an HTTP upstream, it can also proxy for an SSH upstream when ssh is used instead of http in the URL. In that case it will use the SSH private key of the current user (just like git would) and take the username from the downstream HTTP request. This mode of operation can be useful for evaluation or local use by individual developers but should never be used on a normal server deployment.

For a first example of how to make use of josh, just the josh documentation can be checked out as its own repository via this command:

$ git clone http://localhost:8000/josh-project/josh.git:/docs.git

Note: This URL needs to contain the .git suffix twice: once after the original path and once more after the filter spec.

josh-proxy supports read and write access to the repository, so when making changes to any files in the filtered repository, you can just commit and push them like you are used to.

Note: The proxy is semantically stateless. The data inside the docker volume is only persisted across runs for performance reasons. This has two important implications for deployment:

  1. The data does not need to be backed up unless working with very large repos where rebuilding would be very expensive.
  2. Multiple instances of josh-proxy can be used interchangeably for availability or load balancing purposes.

URL syntax and breakdown

This is the URL of a josh-proxy instance:

http://localhost:8000

This is the repository location on the upstream host on which to perform the filter operations:

/josh-project/josh.git

This is the set of filter operations to perform:

:/docs.git

Much more information on the available filters and the syntax of all filters is covered in detail in the filters section.
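
The breakdown above can be expressed as a small helper. This is a sketch, not part of Josh: it assumes the upstream repository path ends at the first .git in the URL path (so it would mis-split a path that contains .git earlier), and the function name split_josh_url is invented.

```python
from urllib.parse import urlsplit

def split_josh_url(url):
    """Split a josh-proxy URL into (proxy, upstream repo path, filter spec)."""
    parts = urlsplit(url)
    proxy = f"{parts.scheme}://{parts.netloc}"
    # The first ".git" closes the upstream repository path.
    repo, sep, rest = parts.path.partition(".git")
    if not sep:
        raise ValueError("expected a '.git' suffix after the repository path")
    # The second ".git" closes the filter spec and is not part of it.
    filter_spec = rest[:-len(".git")] if rest.endswith(".git") else rest
    return proxy, repo + ".git", filter_spec

proxy, repo, filter_spec = split_josh_url(
    "http://localhost:8000/josh-project/josh.git:/docs.git"
)
```

For the example URL this yields http://localhost:8000, /josh-project/josh.git and :/docs.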

Repository naming

By default, a git URL is used to point to the remote repository to download and also to dictate how the local repository shall be named. It's important to learn that the last name in the URL is what the local git client will name the new, local repository. For example:

$ git clone http://localhost:8000/josh-project/josh.git:/docs.git

will create the new repository at directory docs, as docs.git is the last name in the URL.

By default, this leads to rather odd-looking repositories when the prefix filter is the final filter of a URL:

$ git clone http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git

This will still clone just the josh documentation, but the final directory structure will look like this:

- prefix=josh-docs
  - josh-docs
    - <docs>

Having the root repository directory name be the fully-specified filter is most likely not what was intended. This results from git's reuse and repurposing of the remote URL, as prefix=josh-docs is the final name in the URL. With no other alternatives, this gets used for the repository name.

To explicitly specify a repository name, provide the desired name after the URL when cloning a new repository:

$ git clone http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git my-repo
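
The directory names described above follow from how git derives a name from the URL. The function below is a simplified approximation of that behaviour, written for illustration only: it strips a trailing .git and takes the last component, treating both / and : as separators. It ignores other cases that git handles, such as credentials or port numbers.

```python
import re

def guess_dir_name(url):
    """Roughly approximate the directory name `git clone` would pick."""
    name = url.rstrip("/")
    if name.endswith(".git"):
        name = name[:-len(".git")]
    # Both '/' and ':' act as separators when picking the last component.
    return re.split(r"[/:]", name)[-1]

plain = guess_dir_name("http://localhost:8000/josh-project/josh.git:/docs.git")
prefixed = guess_dir_name(
    "http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git"
)
```

For the two clone URLs above this gives docs and prefix=josh-docs, matching the directory names shown in the text.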

Serving a github repo

To prompt for authentication, Josh relies on the server requesting it on fetch. When using a server which doesn't need authentication for fetching, Josh will not automatically prompt for authentication when pushing, and it will be impossible to provide credentials for pushing.

To solve this, you need to pass the --require-auth option to josh-proxy. This can be done with JOSH_EXTRA_OPTS when using the docker image like so:

docker run -d -p 8000:8000 -e JOSH_EXTRA_OPTS="--require-auth" -e JOSH_REMOTE=https://github.com/josh-project -v josh-vol:/data/git joshproject/josh-proxy:latest

In this example, we serve only the josh-project repositories. Be aware that if you don't add the organisation or repo URL, your instance will be able to serve all of github. You can (and should) restrict it to your repository or organisation by making it part of the URL.

Working with workspaces

For the sake of this example we will assume a josh-proxy instance is running and serving a repo on http://josh/world.git with some shared code in shared.

Create a new workspace

To create a new workspace in the path ws/hello simply clone it as if it already exists:

$ git clone http://josh/world.git:workspace=ws/hello.git

git will report that you appear to have cloned an empty repository if that path does not yet exist. If you don't get this message, it means that the path already exists in the repo but may not yet have any path mappings configured.

The next step is to add some path mapping to the workspace.josh file in the root of the workspace:

$ cd hello
$ echo "mod/a = :/shared/a" > workspace.josh

And commit the changes:

$ git add workspace.josh
$ git commit -m "add workspace"

If the path did not exist previously, the resulting commit will be a root commit that does not share any history with the world.git repo. This means a normal git push will be rejected at this point. To get correct history, the resulting commit needs to be based on the history that already exists in world.git. There is however no way to do this locally, because we don't have the data required for this. Also, the resulting tree should contain the contents of shared/a mapped to mod/a, which means it needs to be produced on the server side because we don't have the files to put there.

To accomplish that push with the create option:

$ git push -o create origin master

Note: While it is perfectly possible to use Josh without a code review system, it is strongly recommended to use some form of code review to be able to inspect commits created by Josh before they get into the immutable history of your main repository.

As the resulting commit is created on the server side we need to get it from the server:

$ git pull --rebase

Now you should see mod/a populated with the content of the shared code.
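
What the server assembles for this workspace can be pictured with a toy model. The sketch below is not Josh code and makes several assumptions: a tree is a dict of paths, workspace.josh contains only simple "destination = :/source" lines, and the file names inside shared/ are made up.

```python
# Toy model of the :workspace filter for simple mappings; not Josh code.

def parse_workspace(text):
    """Parse 'dest = :/src' lines into (dest, src) pairs."""
    mappings = []
    for line in text.splitlines():
        if not line.strip():
            continue
        dest, _, flt = line.partition("=")
        dest, flt = dest.strip(), flt.strip()
        if not flt.startswith(":/"):
            raise ValueError(f"unsupported mapping in this sketch: {line!r}")
        mappings.append((dest, flt[2:].strip("/")))
    return mappings

def workspace(tree, root):
    """Workspace root contents plus the paths mapped in workspace.josh."""
    base = root.strip("/") + "/"
    # The contents of the workspace root become the root of the result.
    result = {p[len(base):]: c for p, c in tree.items() if p.startswith(base)}
    for dest, src in parse_workspace(tree.get(base + "workspace.josh", "")):
        src_base = src + "/"
        for p, c in tree.items():
            if p.startswith(src_base):
                result[f"{dest}/{p[len(src_base):]}"] = c
    return result

world = {
    "shared/a/lib.rs": "lib",
    "shared/b/other.rs": "other",
    "ws/hello/workspace.josh": "mod/a = :/shared/a\n",
}
view = workspace(world, "ws/hello")
```

In this model view contains workspace.josh at the root and mod/a/lib.rs taken from shared/a, while shared/b is not visible because it is not mapped.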

Map a shared path into a workspace

To add a shared path to a location in the workspace that does not exist yet, first add an entry to the workspace.josh file and commit that.

You can add the mapping at the end of the file using a simple syntax, and rely on josh to rewrite it for you in a canonical way.

...
new/mapping/location/in/workspace = :/new/mapping/location/in/monorepo

At this point the path is of course empty, so the commit needs to be pushed to the server. When the same commit is then fetched back it will have the mapped path populated with the shared content.

When the commit is pushed, josh will notify you about the rewrite. You can fetch the rewritten commit using the advertised SHA. Alternatively, you can use git sync which will do it for you.

Publish a non-shared path into a shared location

The steps here are exactly the same as for the mapping example above. The only difference being that the path already exists in the workspace but not in the shared location.

Remove a mapping

To remove a mapping, remove the corresponding entry from the workspace.josh file. The content of the previously shared path will stay in the workspace. That means the main repo will have two copies of that path from that point on, effectively creating a fork of that code.

Remove a mapped path

To remove a mapped path as well as its contents, remove the entry from the workspace.josh file and also remove the path inside the workspace using git rm.

Container configuration

Container options

Variable                Meaning
JOSH_REMOTE             HTTP remote, including protocol. Example: https://github.com
JOSH_REMOTE_SSH         SSH remote, including protocol. Example: ssh://git@github.com
JOSH_HTTP_PORT          HTTP port to listen on. Default: 8000
JOSH_SSH_PORT           SSH port to listen on. Default: 8022
JOSH_SSH_MAX_STARTUPS   Maximum number of concurrent SSH authentication attempts. Default: 16
JOSH_SSH_TIMEOUT        Timeout, in seconds, for a single request when serving repos over SSH. This time should cover fetch from upstream repo, filtering, and serving repo to client. Default: 300
JOSH_EXTRA_OPTS         Extra options passed directly to josh-proxy process

Container volumes

Volume       Purpose
/data/git    Git cache volume. If this volume is not mounted, the cache will be lost every time the container is shut down.
/data/keys   SSH server keys. If this volume is not mounted, a new key will be generated on each container startup.

Configuring SSH access

Josh supports SSH access (just pull without pushing, for now). To use SSH, you need to add the following lines to your ~/.ssh/config:

Host your-josh-instance.com
    ForwardAgent yes
    PreferredAuthentications publickey

Alternatively, you can pass those options via GIT_SSH_COMMAND:

GIT_SSH_COMMAND="ssh -o PreferredAuthentications=publickey -o ForwardAgent=yes" git clone ssh://git@your-josh-instance.com/...

In other words, you need to ensure SSH agent forwarding is enabled.

josh-filter

Command to rewrite history using josh filter specs. By default it will use HEAD as input and update FILTERED_HEAD with the filtered history, taking a filter specification as argument. (Note that input and output are swapped with --reverse.)

It can be installed with the following Cargo command, assuming Rust is installed:

cargo install josh --git https://github.com/josh-project/josh.git

git-sync

A utility to make working with server side rewritten commits easier. Those commits frequently get produced when making changes to workspace.josh files.

The command is available in the script directory. It should be downloaded and added to the PATH. It can then be used as a drop-in replacement for git push. It enables the server to return commits back to the client after a push. This is done by parsing the messages sent back by the server for announcements of rewritten commits and then fetching those to update the local references. In the case of a normal git server that does not rewrite anything, git sync will do exactly the same as git push, also accepting the same arguments.

GraphQL API

Josh implements a GraphQL API to query the content of repositories without a need to clone them via a git client.

The API is exposed at:

http://hostname/~/graphql/name_of_repo.git

To explore the API and generated documentation, an interactive GraphQL shell can be found at:

http://hostname/~/graphiql/name_of_repo.git
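
A request against the API endpoint could be prepared as sketched below. Two things are assumptions and not taken from the text above: that the endpoint accepts the common GraphQL-over-HTTP convention of a POST with a JSON body containing a query field, and the host and repository names used in the example. The query itself is the standard GraphQL introspection query, so it does not depend on Josh's schema.

```python
import json
from urllib.request import Request

def graphql_request(host, repo, query):
    """Build (but do not send) a POST request for the GraphQL endpoint."""
    url = f"http://{host}/~/graphql/{repo}"
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Host and repository name are placeholders for a local josh-proxy instance.
request = graphql_request(
    "localhost:8000",
    "josh-project/josh.git",
    "{ __schema { queryType { name } } }",
)
```

Passing the request to urllib.request.urlopen would send it to a running proxy; the interactive shell mentioned above remains the easier way to discover the available fields.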

Testing

Currently the Josh project mainly uses integration tests for its verification, so make sure you are able to run and check them.

The following sections describe how to run the different kinds of tests used for the verification of the Josh project.

UnitTests & DocTests

cargo test --all

Integration Tests

1. Setup the test environment

Because the integration tests need additional tools and a more complex environment, and because they are run using cram, you will need to create an extra environment to run them. To simplify the setup, we provide a Nix Shell environment, which you can start with the nix-shell command shown below if you have Nix installed.

Attention: Currently it is still necessary to install the following tools on your host system.

  • curl
  • hyper_cgi
    cargo install hyper_cgi --features=test-server
    

Setup the Nix Shell

Attention: The first time you run this command, it will take quite a while to finish. You will also need internet access while it runs; how long it takes depends on the speed of your connection.

nix-shell shell.nix

Once the command has finished, you will be dropped into the nix-shell, which provides the shell environment needed to execute the integration tests.

2. Verify you have built all necessary binaries

cargo build
cargo build --bin josh-filter
cargo build --manifest-path josh-proxy/Cargo.toml
cargo build --manifest-path josh-ui/Cargo.toml

3. Setup static files for the josh-ui

cd josh-ui 
trunk build 
cd ..

4. Run the integration tests

Attention: Be aware that all tests except the ones in experimental should be green.

sh run-tests.sh -v tests/

UI Tests

TBD: Currently disabled. Stabilize, enable, and document the process.

Dev-Tools