Josh combines the advantages of monorepos with those of multirepos by leveraging a blazingly-fast, incremental, and reversible implementation of git history filtering.
Concept
Traditionally, history filtering has been viewed as an expensive operation that should only be performed to fix issues with a repository, such as purging big binary files or removing accidentally-committed secrets, or as part of a migration to a different repository structure, like switching from multirepo to monorepo (or vice versa).
The implementation shipped with git (git filter-branch) is only usable as a once-in-a-lifetime last resort for anything but tiny repositories.
Faster versions of history filtering have been implemented, such as git-filter-repo or the BFG repo cleaner. Those, while much faster, are designed for doing occasional, destructive maintenance tasks, usually with the idea already in mind that once the filtering is complete the old history should be discarded.
The idea behind josh started with two questions:
- What if history filtering could be so fast that it can be part of a normal, everyday workflow, running on every single push and fetch without the user even noticing?
- What if history filtering was a non-destructive, reversible operation?
Under those two premises a filter operation stops being a maintenance task. It seamlessly relates histories between repos, which can be used by developers and CI systems interchangeably in whatever way is most suitable to the task at hand.
How is this possible?
Filtering history is a highly predictable task: the set of filters that tend to be used for any given repository is limited, and the input to the filter (a git branch) only gets modified in an incremental way. Thus, by keeping a persistent cache between filter runs, the work needed to re-run a filter on a new commit (and its history) becomes proportional to the number of changes since the last run; it no longer depends on the total length of the history. Additionally, most filters do not depend on the size of the trees.
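The caching argument can be sketched in a few lines of Python. This is a toy model, not Josh's implementation: the history is a hypothetical linear chain of commit ids, `rewrite` stands in for whatever a filter does to one commit, and `cache` plays the role of the persistent cache between runs.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical toy history: commit id -> parent id (None for the root commit).
History = Dict[str, Optional[str]]


def filter_history(
    tip: str,
    parents: History,
    rewrite: Callable[[str, Optional[str]], str],
    cache: Dict[str, str],
) -> int:
    """Filter `tip` and its ancestors; return how many commits needed new work."""
    # Walk back until we reach a commit whose filtered counterpart is cached.
    pending: List[str] = []
    commit: Optional[str] = tip
    while commit is not None and commit not in cache:
        pending.append(commit)
        commit = parents[commit]
    # Rewrite oldest-first so that each filtered parent is already available.
    for c in reversed(pending):
        parent = parents[c]
        filtered_parent = cache[parent] if parent is not None else None
        cache[c] = rewrite(c, filtered_parent)
    return len(pending)


def rewrite(commit: str, filtered_parent: Optional[str]) -> str:
    # Stand-in for the real per-commit filtering work.
    return f"filtered({commit})"


parents: History = {"c1": None, "c2": "c1", "c3": "c2"}
cache: Dict[str, str] = {}

first_run = filter_history("c3", parents, rewrite, cache)   # touches all three commits
parents["c4"] = "c3"                                        # one new commit arrives
second_run = filter_history("c4", parents, rewrite, cache)  # touches only the new commit
```

The second run does work for one commit regardless of how long the history already is, which is the property the paragraph above describes.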
What has long been known to be true for performing merges also applies to history filtering: the more often it is done, the less work it takes each time.
To guarantee filters are reversible we have to restrict the kind of filter that can be used; it is not possible to write arbitrary filters in a scripting language, as other tools allow. To still be able to cover a wide range of use cases we have introduced a domain-specific language to express more complex filters as a combination of simpler ones. Apart from guaranteeing reversibility, the use of a DSL also enables pre-optimization of filter expressions to minimize both the amount of work needed to execute the filter and the on-disk size of the persistent cache.
From Linus Torvalds' 2007 talk at Google about git:
Audience:
Can you have just a part of files pulled out of a repository, not the entire repository?
Linus:
You can export things as tarballs, you can export things as individual files, you can rewrite the whole history to say "I want a new version of that repository that only contains that part", you can do that, it is a fairly expensive operation it's something you would do for example when you import an old repository into a one huge git repository and then you can split it later on to be multiple smaller ones, you can do it, what I am trying to say is that you should generally try to avoid it. It's not that git can not handle huge projects, git would not perform as well as it would otherwise. And you will have issues that you wish you didn't not have.
So I am skipping this issue and going back to the performance issue. One of the things I want to say about performance is that a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true.
That is not what performance is all about. If you can do something really fast, really well, people will start using it differently.
Use cases
Partial cloning
Reduce scope and size of clones by treating subdirectories of the monorepo as individual repositories.
$ git clone http://josh/monorepo.git:/path/to/library.git
The partial repo will act as a normal git repository but only contain the files found in the subdirectory and only the commits affecting those files. The partial repo supports both fetch and push operations.
This not only improves performance on the client due to having fewer files in the tree, it also enables collaboration on parts of the monorepo with other parties using git's normal distributed development features. For example, this makes it easy to mirror just selected parts of your repo to public GitHub repositories or to specific customers.
Project composition / Workspaces
Simplify code sharing and dependency management. Beyond just subdirectories, Josh supports filtering, re-mapping and composition of arbitrary virtual repositories from the content found in the monorepo.
The mapping itself is also stored in the repository and therefore versioned alongside the code.
Multiple projects that depend on a shared set of libraries can thus live together in a single repository. This approach is commonly referred to as “monorepo”, and was popularized by Google, Facebook, and Twitter, to name a few.
In this example, two projects (project1 and project2) coexist in the central monorepo.
Central monorepo | Project workspaces | workspace.josh file |
---|---|---|
[figure] | [figure] | dependencies = :/modules:[ ::tools/ ::library1/ ] |
| [figure] | libs/library1 = :/modules/library1 |
Workspaces act as normal git repos:
$ git clone http://josh/central.git:workspace=workspaces/project1.git
Each of the subprojects defines a workspace.josh file, defining the mapping between the original central.git repository and the hierarchy in use inside of the project.
In this setup, project1 and project2 can seamlessly depend on the latest version of library1, while only checking out the part of the central monorepo that's needed for their purpose. What's more, any changes to a shared module will be synced in both directions.
If a developer of library1 pushes a new update, both projects will get the new version, and the developer will be able to check whether they broke any tests. If a developer of project1 needs to update the library, the changes will be automatically shared back into central, and into project2.
Simplified CI/CD
With everything stored in one repo, CI/CD systems only need to look into one source for each particular deliverable. However, in traditional monorepo environments dependency management is handled by the build system. Build systems are usually tailored to specific languages and need their input already checked out on the filesystem. So the question:
"What deliverables are affected by a given commit and need to be rebuilt?"
cannot be answered without cloning the entire repository and understanding how the languages used handle dependencies.
In particular when using C family languages, hidden dependencies on header files are easy to miss. For this reason, limiting the visibility of files to the compiler by sandboxing is pretty much a requirement for reproducible builds.
With Josh, each deliverable gets its own virtual git repository with dependencies declared in the workspace.josh file. This means answering the above question becomes as simple as comparing commit ids.
Furthermore, due to the tree filtering, each build is guaranteed to be perfectly sandboxed and only sees those parts of the monorepo that have actually been mapped.
This also means the deliverables to be rebuilt can be determined without cloning any repos, as is typically necessary with normal build tools.
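As a rough illustration of "comparing commit ids", the following Python sketch assumes a CI system has already recorded the filtered tip commit of each workspace before and after a push. The workspace names and commit ids are made up, and how the ids are obtained is outside the scope of the sketch.

```python
from typing import Dict, List


def deliverables_to_rebuild(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return the deliverables whose filtered commit id changed or is new."""
    return sorted(name for name, commit in after.items() if before.get(name) != commit)


# Hypothetical filtered commit ids per workspace, before and after a push.
before = {"project1": "a1b2c3", "project2": "d4e5f6"}
after = {"project1": "a1b2c3", "project2": "0f9e8d"}

changed = deliverables_to_rebuild(before, after)
```

Here only project2 would be rebuilt, because the push did not change anything that is mapped into project1.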
GraphQL API
It is often desirable to access content stored in git without requiring a clone of the repository. This is useful for CI/CD systems or web frontends such as dashboards.
Josh exposes a GraphQL API for that purpose. For example, it can be used to find all workspaces currently present in the tree:
query {
  rev(at:"refs/heads/master", filter:"::**/workspace.josh") {
    files { path }
  }
}
Caching proxy
Even without using the more advanced features like partial cloning or workspaces, josh-proxy can act as a cache to reduce traffic between locations or keep your CI from performing many requests to the main git host.
Frequently Asked Questions
How is Josh different from git sparse-checkout?
Josh operates on the git object graph and is unrelated to checking out files and the working tree on the filesystem, which is the only thing sparse-checkout is concerned with. A sparse checkout influences neither the contents of the object database nor what gets downloaded over the network. Both can certainly be used together if needed.
How is Josh different from partial clone?
A partial clone will cause git to download only parts of an object database according to some predicate. It is still the same object database, with the history having the same commits and SHA-1 hashes. It still allows loading skipped parts of the object database at a later point. Josh creates an alternate history that has no reference to the skipped parts. As such, it is very similar to git filter-branch, just faster, with added features and a different user interface.
How is it different from submodules?
Where git submodules are multiple, independent repos referencing each other by SHA, Josh supports the monorepo approach. All of the code is in one single repo which can easily be kept in sync, and Josh provides any subfolder (or, in the case of workspaces, a more complicated recombination of folders) as its own git repository. These repos are transparently synchronised both ways with the main monorepo. Josh can thus do more than submodules can, and is easier and faster to use.
How is it different from git subtree?
The basic idea behind Josh is quite similar to git subtree. However, git subtree, just like git filter-branch, is way too slow for everyday use, even on medium-sized repos. git subtree can only achieve acceptable performance when squashing commits and therefore losing history. One core part of Josh is essentially a much faster implementation of git subtree split which has been specifically optimized for being run frequently inside the same repository.
How is Josh different from git filter-repo?
Both josh-filter and git filter-repo enable very fast rewriting of Git history and thus can in simple cases be used for the same purpose.
Which one is right in more advanced use cases depends on your goals: git filter-repo offers more flexibility and options in the kinds of filtering it supports, like rewriting commit messages or even plugging arbitrary scripts into the filtering.
Josh uses a DSL instead of arbitrary scripts for complex filters and is much more restrictive in the kind of filtering possible, but in exchange for those limitations offers incremental filtering as well as bidirectional operation, meaning changes can be converted in both directions between the original and the filtered repos.
How is Josh different from all of the above alternatives?
Josh includes josh-proxy, which offers repo filtering as a service, mainly intended to support monorepo workflows.
Getting Started
This book will guide you through setting up the josh proxy to serve your own git repository.
NOTE
All the commands are included from the file gettingstarted.t which can be run with cram.
Setting up the proxy
Josh is distributed via Docker Hub, and is installed and started with the following command:
$ docker run \
> --name josh-proxy \
> --detach \
> --publish 8000:8000 \
> --env JOSH_REMOTE=https://github.com \
> --volume josh-vol:/data/git \
> joshproject/josh-proxy:latest >/dev/null
This starts Josh as a proxy to github.com in a Docker container, creating a volume josh-vol and mounting it into the container for use by Josh.
Cloning a repository
Once Josh is running, we can clone a repository through it. For example, let's clone Josh:
$ git clone http://localhost:8000/josh-project/josh.git
Cloning into 'josh'...
$ cd josh
As we can see, this repository is simply the normal Josh one:
$ ls
Cargo.lock
Cargo.toml
Dockerfile
Dockerfile.tests
LICENSE
Makefile
README.md
docs
josh-proxy
run-josh.sh
run-tests.sh
rustfmt.toml
scripts
src
static
tests
$ git log -2
commit fc6af1e10c865f790bff7135d02b1fa82ddebe29
Author: Christian Schilling <christian.schilling@esrlabs.com>
Date: Fri Mar 19 11:15:57 2021 +0100
Update release.yml
commit 975581064fa21b3a3d6871a4e888fd6dc1129a13
Author: Christian Schilling <christian.schilling@esrlabs.com>
Date: Fri Mar 19 11:11:45 2021 +0100
Update release.yml
Cloning a part of the repo
Josh becomes interesting when we want to clone a part of the repo. Let's clone the Josh repository again, but this time let's keep only the documentation:
$ cd ..
$ git clone http://localhost:8000/josh-project/josh.git:/docs.git
Cloning into 'docs'...
$ cd docs
Note the addition of :/docs at the end of the url. This is called a filter, and it instructs josh to only check out the given folder.
Looking inside the repository, we now see that the history is quite different. Indeed, it contains only the commits pertaining to the subfolder that we checked out.
$ ls
book.toml
src
$ git log -2
commit dd26c506f6d6a218903b9f42a4869184fbbeb940
Author: Christian Schilling <christian.schilling@esrlabs.com>
Date: Mon Mar 8 09:22:21 2021 +0100
Update docs to use docker for default setup
commit ee6abba0fed9b99c9426f5224ff93cfee2813edc
Author: Louis-Marie Givel <louis-marie.givel@esrlabs.com>
Date: Fri Feb 26 11:41:37 2021 +0100
Update proxy.md
This repository is a real repository in which we can pull, commit, push, as with a regular one. Josh will take care of synchronizing it with the main one in a transparent fashion.
Working with workspaces
NOTE
All the commands are included from the file workspaces.t which can be run with cram.
Josh really starts to shine when using workspaces.
Simply put, they are a list of files and folders, remapped from the central repository to a new repository. For example, a shared library could be used by various workspaces, each mapping it to their appropriate subdirectory.
In this chapter, we're going to set up a new git repository with a couple of libraries, and then use it to demonstrate the use of workspaces.
Test set-up
NOTE
The following section describes how to set up a local git server with made-up content for the sake of this tutorial. You're free to follow it, or to use your own existing repository, in which case you can skip to the next section.
To host the repository for this test, we need a git server. We're going to run git as a cgi program using its provided http backend, served with the test server included in the hyper_cgi crate.
Serving the git repo
First, we create a bare repository, which will be served by hyper_cgi. We enable the option http.receivepack to allow the use of git push from the clients.
$ git init --bare ./remote/real_repo.git/
Initialized empty Git repository in */real_repo.git/ (glob)
$ git config -f ./remote/real_repo.git/config http.receivepack true
Then we start the server which will allow clients to access the repository through http.
$ GIT_DIR=./remote/ GIT_PROJECT_ROOT=${TESTTMP}/remote/ GIT_HTTP_EXPORT_ALL=1 hyper-cgi-test-server\
> --port=8001\
> --dir=./remote/\
> --cmd=git\
> --args=http-backend\
> > ./hyper-cgi-test-server.out 2>&1 &
$ echo $! > ./server_pid
Our server is ready, serving all the repos in the remote folder on port 8001.
$ git clone http://localhost:8001/real_repo.git
Cloning into 'real_repo'...
warning: You appear to have cloned an empty repository.
Adding some content
Of course, the repository is for now empty, and we need to populate it. The populate.sh script creates a couple of libraries, as well as two applications that use them.
$ cd real_repo
$ sh ${TESTDIR}/populate.sh > ../populate.out
$ git push origin HEAD
To http://localhost:8001/real_repo.git
* [new branch] HEAD -> master
$ tree
.
|-- application1
| `-- app.c
|-- application2
| `-- guide.c
|-- doc
| |-- guide.md
| |-- library1.md
| `-- library2.md
|-- library1
| `-- lib1.h
`-- library2
`-- lib2.h
5 directories, 7 files
$ git log --oneline --graph
* f65e94b Add documentation
* f240612 Add application2
* 0a7f473 Add library2
* 1079ef1 Add application1
* 6476861 Add library1
Creating our first workspace
Now that we have a git repo populated with content, let's serve it through josh:
$ docker run -d --network="host" -e JOSH_REMOTE=http://127.0.0.1:8001 -v josh-vol:$(pwd)/git_data joshproject/josh-proxy:latest > josh.out
NOTE
For the sake of this example, we run docker with --network="host" instead of publishing the port. This is so that docker can access localhost, where our ad-hoc git repository is served.
To facilitate development on applications 1 and 2, we want to create workspaces for them. Creating a new workspace looks very similar to checking out a subfolder through josh, as explained in "Getting Started".
Instead of just the name of the subfolder, though, we also use the :workspace= filter:
$ git clone http://127.0.0.1:8000/real_repo.git:workspace=application1.git application1
Cloning into 'application1'...
$ cd application1
$ tree
.
`-- app.c
0 directories, 1 file
$ git log -2
commit 50cd6112e173df4cac1aca9cb88b5c2a180bc526
Author: Josh <josh@example.com>
Date: Thu Apr 7 22:13:13 2005 +0000
Add application1
Looking into the newly cloned workspace, we see our expected files and a history containing only the relevant commit.
NOTE
Josh allows us to create a workspace out of any directory, even one that doesn't exist yet.
Adding workspace.josh
The workspace.josh file describes how folders from the central repository (real_repo.git) should be mapped to the workspace repository.
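To make the mapping concrete, here is a toy Python model of what a workspace filter does to a tree. It is only a sketch under simplifying assumptions: a tree is a dictionary from path to content, only mappings of the form `dest = :/src` are understood, and the helper names `subdir` and `workspace` are invented for this illustration. The sample tree mirrors this tutorial's repository with the mapping that is added in the next step.

```python
from typing import Dict

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)


def subdir(tree: Tree, d: str) -> Tree:
    """Model of :/d, keep only files under d/ and make d the new root."""
    return {p[len(d) + 1:]: c for p, c in tree.items() if p.startswith(d + "/")}


def workspace(tree: Tree, root: str) -> Tree:
    """Toy model of :workspace=root for mappings of the form 'dest = :/src'."""
    result = subdir(tree, root)
    spec = result.get("workspace.josh", "")
    for line in spec.splitlines():
        if "=" not in line:
            continue
        dest, flt = (part.strip() for part in line.split("=", 1))
        if not flt.startswith(":/"):
            raise ValueError(f"unsupported filter in this sketch: {flt}")
        # Place the selected subdirectory under dest inside the workspace.
        for path, content in subdir(tree, flt[2:]).items():
            result.setdefault(f"{dest}/{path}", content)
    return result


central = {
    "application1/app.c": "...",
    "application1/workspace.josh": "modules/lib1 = :/library1\n",
    "library1/lib1.h": "...",
    "library2/lib2.h": "...",
}
ws = workspace(central, "application1")
```

The resulting `ws` contains `app.c`, `workspace.josh` and `modules/lib1/lib1.h`, while `library2` is left out because nothing maps it into the workspace.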
Since we depend on library1, let's add it to the workspace file.
$ echo "modules/lib1 = :/library1" >> workspace.josh
$ git add workspace.josh
$ git commit -m "Map library1 to the application1 workspace"
[master 06361ee] Map library1 to the application1 workspace
1 file changed, 1 insertion(+)
create mode 100644 workspace.josh
We decided to map library1 to modules/lib1 in the workspace. We can now sync up with the server:
$ git sync origin HEAD
HEAD -> refs/heads/master
From http://127.0.0.1:8000/real_repo.git:workspace=application1
* branch 753d62ca1af960a3d071bb3b40722471228abbf6 -> FETCH_HEAD
HEAD is now at 753d62c Map library1 to the application1 workspace
Pushing to http://127.0.0.1:8000/real_repo.git:workspace=application1.git
POST git-receive-pack (477 bytes)
remote: josh-proxy
remote: response from upstream:
remote: To http://localhost:8001/real_repo.git
remote: f65e94b..37184cc JOSH_PUSH -> master
remote: REWRITE(06361eedf6d6f6d7ada6000481a47363b0f0c3de -> 753d62ca1af960a3d071bb3b40722471228abbf6)
remote:
remote:
updating local tracking ref 'refs/remotes/origin/master'
Let's observe the result:
$ tree
.
|-- app.c
|-- modules
| `-- lib1
| `-- lib1.h
`-- workspace.josh
2 directories, 3 files
$ git log --graph --oneline
* 753d62c Map library1 to the application1 workspace
|\
| * 366adba Add library1
* 50cd611 Add application1
After pushing and fetching the result, we see that it has been successfully mapped by josh.
One surprising thing is the history: our "mapping" commit became a merge commit! This is because josh needs to merge the history of the module we want to map into the repository of the workspace. After this is done, all commits will be present in both of the histories.
NOTE
git sync is a utility provided with josh which will push contents, and, if josh tells it to, fetch the transformed result. Otherwise, it works like git push.
By the way, what does the history look like in real_repo?
$ cd ../real_repo
$ git pull origin master
From http://localhost:8001/real_repo
* branch master -> FETCH_HEAD
f65e94b..37184cc master -> origin/master
Updating f65e94b..37184cc
Fast-forward
application1/workspace.josh | 1 +
1 file changed, 1 insertion(+)
create mode 100644 application1/workspace.josh
Current branch master is up to date.
$ tree
.
|-- application1
| |-- app.c
| `-- workspace.josh
|-- application2
| `-- guide.c
|-- doc
| |-- guide.md
| |-- library1.md
| `-- library2.md
|-- library1
| `-- lib1.h
`-- library2
`-- lib2.h
5 directories, 8 files
$ git log --graph --oneline
* 37184cc Map library1 to the application1 workspace
* f65e94b Add documentation
* f240612 Add application2
* 0a7f473 Add library2
* 1079ef1 Add application1
* 6476861 Add library1
We can see the newly added commit for workspace.josh in application1, and as expected, no merge here.
Interacting with workspaces
Let's now create a second workspace, this time for application2. It depends on library1 and library2.
$ git clone http://127.0.0.1:8000/real_repo.git:workspace=application2.git application2
Cloning into 'application2'...
$ cd application2
$ echo "libs/lib1 = :/library1" >> workspace.josh
$ echo "libs/lib2 = :/library2" >> workspace.josh
$ git add workspace.josh && git commit -m "Create workspace for application2"
[master 566a489] Create workspace for application2
1 file changed, 2 insertions(+)
create mode 100644 workspace.josh
Syncing as before:
$ git sync origin HEAD
HEAD -> refs/heads/master
From http://127.0.0.1:8000/real_repo.git:workspace=application2
* branch 5115fd2a5374cbc799da61a228f7fece3039250b -> FETCH_HEAD
HEAD is now at 5115fd2 Create workspace for application2
Pushing to http://127.0.0.1:8000/real_repo.git:workspace=application2.git
POST git-receive-pack (478 bytes)
remote: josh-proxy
remote: response from upstream:
remote: To http://localhost:8001/real_repo.git
remote: 37184cc..feb3a5b JOSH_PUSH -> master
remote: REWRITE(566a4899f0697d0bde1ba064ed81f0654a316332 -> 5115fd2a5374cbc799da61a228f7fece3039250b)
remote:
remote:
updating local tracking ref 'refs/remotes/origin/master'
And our local folder now contains all the files requested:
$ tree
.
|-- guide.c
|-- libs
| |-- lib1
| | `-- lib1.h
| `-- lib2
| `-- lib2.h
`-- workspace.josh
3 directories, 4 files
And the history includes the history of both of the libraries:
$ git log --oneline --graph
* 5115fd2 Create workspace for application2
|\
| * ffaf58d Add library2
| * f4e4e40 Add library1
* ee8a5d7 Add application2
Note that since we created the workspace and added the dependencies in one single commit, the history just contains this one single merge commit.
Pushing a change from a workspace
While testing application2, we noticed a typo in the library1 dependency. Let's go ahead and fix it!
$ sed -i 's/41/42/' libs/lib1/lib1.h
$ git commit -a -m "fix lib1 typo"
[master 82238bf] fix lib1 typo
1 file changed, 1 insertion(+), 1 deletion(-)
We can push this change like any normal git change:
$ git push origin master
remote: josh-proxy
remote: response from upstream:
remote: To http://localhost:8001/real_repo.git
remote: feb3a5b..31e8fab JOSH_PUSH -> master
remote:
remote:
To http://127.0.0.1:8000/real_repo.git:workspace=application2.git
5115fd2..82238bf master -> master
Since the change was merged in the central repository, a developer can now pull it from the application1 workspace.
$ cd ../application1
$ git pull
From http://127.0.0.1:8000/real_repo.git:workspace=application1
+ 06361ee...c64b765 master -> origin/master (forced update)
Updating 753d62c..c64b765
Fast-forward
modules/lib1/lib1.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Current branch master is up to date.
The change has been propagated!
$ git log --oneline --graph
* c64b765 fix lib1 typo
* 753d62c Map library1 to the application1 workspace
|\
| * 366adba Add library1
* 50cd611 Add application1
Importing projects
When moving to a monorepo setup, especially in existing organisations, the need to consolidate existing project repositories commonly arises.
The simplest possible case is one where the previous history of a project does not need to be retained. In this case, the project's files can simply be copied into the monorepo at the appropriate location and committed.
If history should be retained, josh can be used for importing a project as an alternative to built-in git commands like git subtree.
Josh's filter capability lets you perform transformations on the history of a git repository to arbitrarily (re-)compose paths inside of a repository.
A key aspect of this functionality is that all transformations are reversible. This means that if you apply a transformation moving files from the root of a repository to, say, tools/project-b, followed by an inverse transformation moving files from tools/project-b back to the root, you receive the same commit hashes you put in.
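The round trip can be illustrated on trees alone with a small Python sketch. It is a simplification: real commit hashes also cover history and metadata, whereas here a tree is just a dictionary and `tree_id` is a made-up content hash used only to show that the round trip yields identical content.

```python
import hashlib
from typing import Dict

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)


def prefix(tree: Tree, d: str) -> Tree:
    """Model of :prefix=d, move the whole tree into subdirectory d."""
    return {f"{d}/{p}": c for p, c in tree.items()}


def subdir(tree: Tree, d: str) -> Tree:
    """Model of :/d, keep only files under d/ and make d the new root."""
    return {p[len(d) + 1:]: c for p, c in tree.items() if p.startswith(d + "/")}


def tree_id(tree: Tree) -> str:
    """Hypothetical content hash; equal trees give equal ids."""
    return hashlib.sha1(repr(sorted(tree.items())).encode()).hexdigest()


original = {"README.md": "hello", "src/main.c": "int main(void) { return 0; }"}
moved = prefix(original, "tools/project-b")     # :prefix=tools/project-b
restored = subdir(moved, "tools/project-b")     # :/tools/project-b
```

After the round trip, `restored` equals `original` and both produce the same `tree_id`.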
We can use this feature to import a project into our monorepo while allowing external users to keep pulling on top of the same git history they already have, just with a new git remote.
There are multiple ways of doing this, with the most common ones outlined below. You can look at josh#596 for a discussion of several other methods.
Import with josh-filter
Currently, the easiest way to do this is by using the josh-filter binary, which is a command-line frontend to josh's filter capabilities.
Inside of our target repository, it would work like this:
- Fetch the repository we want to import (say, "Project B", from $REPO_URL).
  $ git fetch $REPO_URL master
  This will set the FETCH_HEAD reference to the fetched repository.
- Rewrite the history of that repository through josh to make it look as if the project had always been developed at our target path (say, tools/project-b).
  $ josh-filter ':prefix=tools/project-b' FETCH_HEAD
  This will set the FILTERED_HEAD reference to the rewritten history.
- Merge the rewritten history into our target repository.
  $ git merge --allow-unrelated-histories FILTERED_HEAD
  After this merge commit, the previously external project now lives at tools/project-b as expected.
- Any external users can now use the :/tools/project-b josh filter to retrieve changes made in the new project location, without the git hashes of their existing commits changing (that is to say, without conflicting).
Import by pushing to josh
If your monorepo is already running a josh-proxy in front of it, you can also import a project by pushing a project merge to josh.
This has the benefit of not needing to clone the entire monorepo locally to do a merge, but the drawback of using a different, slightly slower filter mechanism when exporting the tree back out. For projects with very large history, consider using the josh-filter mechanism outlined above.
Pushing a project merge to josh works like this:
- Assume we have a local checkout of "Project B", and we want to merge this into our monorepo. There is a josh-proxy running at https://git.company.name/monorepo.git. We want to merge this project into /tools/project-b in the monorepo.
- In the checkout of "Project B", add the josh remote:
  $ git remote add josh https://git.company.name/monorepo.git:/tools/project-b.git
  Note the use of the /tools/project-b.git josh filter, which points to a path that should not yet exist in the monorepo.
- Push the repository to josh with the -o merge option, creating a merge commit introducing the project history at that location, while retaining its history:
  $ git push josh $ref -o merge
Note for Gerrit users
With either method, when merging a set of new commits into a Gerrit repository and going through the standard code-review process, Gerrit might complain about missing Change-IDs in the imported commits.
To work around this, the commits need to first be made "known" to Gerrit. This can be achieved by pushing the new parent of the merge commit to a separate branch in Gerrit directly (without going through the review mechanism). After this Gerrit will accept merge commits referencing that parent, as long as the merge commit itself has a Change-ID.
Some monorepo setups on Gerrit use a special unrestricted branch like merge-staging for this, to which users with permission to import projects can force-push unrelated histories.
History filtering
Josh transforms commits by applying filters to them. As any commit in git represents not just a single state of the file system but also its entire history, applying a filter to a commit produces an entirely new history. The result of a filter is a normal git commit and therefore can be filtered again, making filters chainable.
Syntax
Filters always begin with a colon and can be chained:
:filter1:filter2
When used as part of a URL, filters cannot contain white space or newlines. When read from a file, however, white space can be inserted between filters (not after the leading colon).
Additionally, newlines can be used instead of , inside of composition filters.
Some filters take arguments, and arguments can optionally be quoted using double quotes if special characters used by the filter language need to be used (like : or space):
:filter=argument1,"argument2"
Available filters
Subdirectory :/a
Take only the selected subdirectory from the input and make it the root of the filtered tree.
Note that :/a/b and :/a:/b are equivalent ways to get the same result.
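The equivalence can be checked on a toy model in Python, where a tree is a dictionary from path to content. The `subdir` helper is an illustrative stand-in, not part of Josh.

```python
from typing import Dict

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)


def subdir(tree: Tree, d: str) -> Tree:
    """Model of :/d, keep only files under d/ and make d the new root."""
    return {p[len(d) + 1:]: c for p, c in tree.items() if p.startswith(d + "/")}


tree = {"a/b/file.txt": "x", "a/other.txt": "y", "c/file.txt": "z"}

direct = subdir(tree, "a/b")              # :/a/b
chained = subdir(subdir(tree, "a"), "b")  # :/a:/b
```

Both `direct` and `chained` contain just `file.txt` at the root.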
Directory ::a/
A shorthand for the commonly occurring filter combination :/a:prefix=a.
File ::a
Produces a tree with only the specified file in its root.
Note that ::a/b is equivalent to ::a/::b.
Prefix :prefix=a
Take the input tree and place it into subdirectory a.
Note that :prefix=a/b and :prefix=b:prefix=a are equivalent.
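In the same toy model (a tree as a dictionary from path to content, with invented helper names), the prefix equivalence and the Directory shorthand ::a/ for :/a:prefix=a can be sketched as follows.

```python
from typing import Dict

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)


def prefix(tree: Tree, d: str) -> Tree:
    """Model of :prefix=d, move the whole tree into subdirectory d."""
    return {f"{d}/{p}": c for p, c in tree.items()}


def subdir(tree: Tree, d: str) -> Tree:
    """Model of :/d, keep only files under d/ and make d the new root."""
    return {p[len(d) + 1:]: c for p, c in tree.items() if p.startswith(d + "/")}


tree = {"file.txt": "x", "dir/other.txt": "y"}
direct = prefix(tree, "a/b")               # :prefix=a/b
chained = prefix(prefix(tree, "b"), "a")   # :prefix=b:prefix=a

# ::a/ written out as :/a:prefix=a keeps directory a at its original path.
repo = {"a/x.txt": "1", "b/y.txt": "2"}
only_a = prefix(subdir(repo, "a"), "a")
```

Note that the chained form applies the innermost path component first, because each filter operates on the output of the one before it.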
Composition :[:filter1,:filter2,...,:filterN]
Compose a tree by overlaying the outputs of :filter1 ... :filterN on top of each other.
It is guaranteed that each file will appear at most once in the output. The first filter that consumes a file is the one deciding its mapped location. Therefore the order in which filters are composed matters.
Inside of a composition, x=:filter can be used as an alternative spelling for :filter:prefix=x.
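The "first filter that consumes a file wins" rule can be modelled in Python as below. This is a sketch of the rule as stated above, not of Josh's actual implementation: each filter is represented as a hypothetical function from an input path to an output path, returning None when it does not consume the file.

```python
from typing import Callable, Dict, List, Optional

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)
PathFilter = Callable[[str], Optional[str]]  # input path -> output path, or None


def directory(d: str) -> PathFilter:
    """Model of ::d/, keep files under d/ at their original path."""
    return lambda p: p if p.startswith(d + "/") else None


def subdir_into(src: str, dest: str) -> PathFilter:
    """Model of dest=:/src inside a composition, move files from src/ to dest/."""
    return lambda p: f"{dest}/{p[len(src) + 1:]}" if p.startswith(src + "/") else None


def compose(tree: Tree, filters: List[PathFilter]) -> Tree:
    """Each file is placed by the first filter that consumes it, at most once."""
    out: Tree = {}
    for path, content in tree.items():
        for f in filters:
            target = f(path)
            if target is not None:
                out[target] = content
                break
    return out


tree = {"lib/a.h": "A", "lib/b.h": "B", "app/main.c": "M"}
first = compose(tree, [directory("lib"), subdir_into("lib", "vendor")])
second = compose(tree, [subdir_into("lib", "vendor"), directory("lib")])
```

With the directory filter first the files stay under `lib/`; with the order swapped they end up under `vendor/`. In both cases `app/main.c` is dropped because no filter consumes it.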
Exclusion :exclude[:filter]
Remove all paths present in the output of :filter from the input tree.
Filters that change paths should generally be avoided inside :exclude; use only filters that select paths without altering them.
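A toy Python model shows why path-preserving filters are the useful ones here. Trees are dictionaries from path to content, and `select_dir`, `subdir` and `exclude` are illustrative helpers rather than Josh functions.

```python
from typing import Callable, Dict

Tree = Dict[str, str]  # path -> file content (toy stand-in for a git tree)


def select_dir(d: str) -> Callable[[Tree], Tree]:
    """Model of ::d/, select files under d/ without changing their paths."""
    return lambda tree: {p: c for p, c in tree.items() if p.startswith(d + "/")}


def subdir(d: str) -> Callable[[Tree], Tree]:
    """Model of :/d, select files under d/ and strip the d/ prefix."""
    return lambda tree: {p[len(d) + 1:]: c for p, c in tree.items() if p.startswith(d + "/")}


def exclude(tree: Tree, flt: Callable[[Tree], Tree]) -> Tree:
    """Model of :exclude[...], drop every path that appears in the filter's output."""
    removed = flt(tree)
    return {p: c for p, c in tree.items() if p not in removed}


tree = {"docs/index.md": "d", "src/lib.rs": "s", "README.md": "r"}
without_docs = exclude(tree, select_dir("docs"))  # paths match, docs/ is removed
unchanged = exclude(tree, subdir("docs"))         # output path index.md matches nothing
```

In this model the path-changing filter produces `index.md`, which is not a path in the input tree, so nothing is removed.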
Workspace :workspace=a
Similar to :/a but also looks for a workspace.josh file inside the specified directory (called the "workspace root").
The resulting tree will contain the contents of the workspace root as well as additional files specified in the workspace.josh file (see Workspaces).
Text replacement :replace("regex_0":"replacement_0",...,"regex_N":"replacement_N")
Applies the supplied regular expressions to every file in the input tree.
Signature removal :unsign
The default behaviour of Josh is to copy the signature of the original commit, if it exists, to the filtered commit. This makes the signature invalid, but allows a perfect round-trip: josh will be able to recreate the original commit from the filtered one.
This behaviour might not be desirable, and this filter drops the signatures from the history.
Pattern filters
The following filters accept a glob-like pattern X that can contain * to match any number of characters. Note that two or more consecutive wildcards (**) are not allowed.
Match directories ::X/
All matching subdirectories in the input root
Match files or directories ::X
All matching files or directories in the input root
Match nested directories ::**/X/
All subdirectories matching the pattern in arbitrarily deep subdirectories of the input
Match nested files ::**/X
All files matching the pattern in arbitrarily deep subdirectories of the input
History filters
These filters do not modify git trees, but instead only operate on the commit graph.
Linearise history :linear
Produce a filtered history that does not contain any merge commits. This is done by simply dropping all parents except the first on every commit.
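The first-parent rule can be sketched in Python on a made-up commit graph. The sketch ignores that rewritten commits receive new ids and only shows which commits remain reachable.

```python
from typing import Dict, List


def linearise(parents: Dict[str, List[str]], tip: str) -> List[str]:
    """Follow only the first parent of every commit, newest first."""
    history: List[str] = []
    commit = tip
    while True:
        history.append(commit)
        if not parents[commit]:
            break
        commit = parents[commit][0]  # drop all parents except the first
    return history


# Hypothetical graph: "m" merges "f1" (second parent) into "a2" (first parent).
parents = {"m": ["a2", "f1"], "a2": ["a1"], "f1": ["a1"], "a1": []}
linear = linearise(parents, "m")
```

The side-branch commit `f1` no longer appears in `linear`.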
Filter specific parts of the history :rev(<sha_0>:filter_0,...,<sha_N>:filter_N)
Produce a history where the commits specified by <sha_N> are replaced by the result of applying :filter_N to them.
It will appear as if <sha_N> and all its ancestors were also filtered with :filter_N. If an ancestor also has a matching entry in the :rev(...), its filter will replace :filter_N for all further ancestors (and so on).
The special value 0000000000000000000000000000000000000000 can be used as a <sha_N> to filter commits that don't match any of the other shas.
Join multiple histories into one :join(<sha_0>:filter_0,...,<sha_N>:filter_N)
Produce the history that would be the result of pushing the passed branches with the passed filters into the upstream.
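The effect of :linear described above can be approximated with plain git, without josh. The following is only a sketch: it builds a throwaway repository containing one merge and compares the full commit count with the count along the first-parent chain, which is the set of commits a linearised history keeps. The helper g, the identity demo, and the branch name side are assumptions made for illustration.

```shell
# Throwaway repository; "g" runs git inside it with a dummy identity.
repo=$(mktemp -d)
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

g init -q
g commit -q --allow-empty -m base
main=$(g symbolic-ref --short HEAD)   # whatever the default branch is called

g checkout -q -b side
g commit -q --allow-empty -m side-work

g checkout -q "$main"
g commit -q --allow-empty -m main-work
g merge -q --no-ff -m merge side      # merge commit with two parents

all=$(g rev-list --count HEAD)                  # base, side-work, main-work, merge
first=$(g rev-list --count --first-parent HEAD) # merge, main-work, base
echo "all commits: $all, first-parent chain: $first"
```

Note that `--first-parent` only walks the existing graph. The :linear filter goes further and rewrites the commits so that the former merge commit has a single parent.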
Filter order matters
Filters are applied in the left-to-right order they are given in the filter specification,
and they are not
commutative.
For example, this command will filter out just the josh documentation, and store it in a
ref named FILTERED_HEAD
:
$ josh-filter :/docs:prefix=josh-docs
However, this
command will produce an empty branch:
$ josh-filter :prefix=josh-docs:/docs
What's happening in the latter command is that because the prefix filter is applied first, the
entire josh
history already lives within the josh-docs
directory, as it was just
transformed to exist there. Thus, to still get the docs, the command would need to be:
$ josh-filter :prefix=josh-docs:/josh-docs/docs
which will contain the josh documentation at the base of the tree. We've lost the prefix, what gives? Because the original git tree was already transformed, and then the subdirectory filter was applied to pull documentation from josh-docs/docs, the prefix is gone: it was filtered out again by the subdirectory filter. Thus, the order in which filters are provided is crucial, as each filter further transforms the output of the previous one.
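The three cases above can be replayed on a toy model in which a tree is just a list of paths. This is a minimal sketch, not how josh is implemented: the functions subdir and prefix and the file names in tree are assumptions that mimic :/DIR and :prefix=DIR on path strings only.

```shell
# Toy model: a "tree" is a list of paths, one per line.
# subdir DIR keeps only paths below DIR/ and strips that prefix (like :/DIR).
subdir() { sed -n "s|^$1/||p"; }
# prefix DIR puts every path below DIR/ (like :prefix=DIR).
prefix() { sed "s|^|$1/|"; }

tree='docs/intro.md
src/main.rs'

a=$(printf '%s\n' "$tree" | subdir docs | prefix josh-docs)           # josh-docs/intro.md
b=$(printf '%s\n' "$tree" | prefix josh-docs | subdir docs)           # empty: nothing lives in docs/ any more
c=$(printf '%s\n' "$tree" | prefix josh-docs | subdir josh-docs/docs) # intro.md: the prefix is stripped again

printf 'a=[%s]\nb=[%s]\nc=[%s]\n' "$a" "$b" "$c"
```

Each function only sees the output of the one before it, which is why swapping the two stages changes the result.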
josh-proxy
Josh provides an HTTP proxy server that can be used with any git hosting service which communicates via HTTP.
It needs the URL of the upstream server and a local directory to store its data.
Optionally, a port to listen on can be specified. For example, running a local josh-proxy
instance for github.com on port 8000:
$ docker run -p 8000:8000 -e JOSH_REMOTE=https://github.com -v josh-vol:/data/git joshproject/josh-proxy:latest
Note: While josh-proxy is intended to be used with an http upstream, it can also proxy for an ssh upstream when ssh is used instead of http in the URL. In that case it will use the ssh private key of the current user (just like git would) and take the username from the downstream http request. This mode of operation can be useful for evaluation or local use by individual developers but should never be used on a normal server deployment.
For a first example of how to make use of josh, just the josh documentation can be checked out as its own repository via this command:
$ git clone http://localhost:8000/josh-project/josh.git:/docs.git
Note: This URL needs to contain the
.git
suffix twice: once after the original path and once more after the filter spec.
josh-proxy
supports read and write access to the repository, so when making changes
to any files in the filtered repository, you can just commit and push them
like you are used to.
Note: The proxy is semantically stateless. The data inside the docker volume is only persisted across runs for performance reasons. This has two important implications for deployment:
- The data does not need to be backed up unless working with very large repos where rebuilding would be very expensive.
- Multiple instances of josh-proxy can be used interchangeably for availability or load balancing purposes.
URL syntax and breakdown
This is the URL of a josh-proxy
instance:
http://localhost:8000
This is the repository location on the upstream host on which to perform the filter operations:
/josh-project/josh.git
This is the set of filter operations to perform:
:/docs.git
Much more information on the available filters and the syntax of all filters is covered in detail in the filters section.
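The breakdown above can be reproduced with POSIX parameter expansion. This is a naive sketch for illustration only: the variable names are made up, and it assumes the first ".git" in the path marks the end of the upstream repository location, which josh-proxy itself may determine differently.

```shell
url='http://localhost:8000/josh-project/josh.git:/docs.git'

scheme=${url%%://*}        # http
rest=${url#*://}           # localhost:8000/josh-project/josh.git:/docs.git
host=${rest%%/*}           # localhost:8000
path=/${rest#*/}           # /josh-project/josh.git:/docs.git

proxy="$scheme://$host"    # the josh-proxy instance
repo=${path%%.git*}.git    # repository location on the upstream host
filter=${path#*.git}       # filter spec, including the trailing .git

printf 'proxy:  %s\nrepo:   %s\nfilter: %s\n' "$proxy" "$repo" "$filter"
```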
Repository naming
By default, git uses the URL both to locate the remote repository and to name the local one: the last name in the URL becomes the name of the new local repository. For example:
$ git clone http://localhost:8000/josh-project/josh.git:/docs.git
will create the new repository at directory docs
, as docs.git
is the last name in the URL.
By default, this leads to rather odd-looking repositories when the prefix
filter is the final
filter of a URL:
$ git clone http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git
This will still clone just the josh documentation, but the final directory structure will look like this:
- prefix=josh-docs
- josh-docs
- <docs>
Having the root repository directory name be the fully-specified filter is most likely not what was
intended. This results from git's reuse and repurposing of the remote URL, as prefix=josh-docs
is the final name in the URL. With no other alternatives, this gets used for the repository name.
To explicitly specify a repository name, provide the desired name after the URL when cloning a new repository:
$ git clone http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git my-repo
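To see where the two directory names in this section come from, the following sketch approximates how git guesses the target directory from a clone URL: drop a trailing slash and a trailing .git, then keep whatever follows the last / or :. The function guess_dir is hypothetical and only mirrors that rule of thumb. It is not git's actual code.

```shell
# Rough approximation of git's directory-name guess for a clone URL.
guess_dir() {
  name=${1%/}          # drop one trailing slash, if any
  name=${name%.git}    # drop a trailing .git
  name=${name##*[/:]}  # keep the part after the last "/" or ":"
  printf '%s\n' "$name"
}

d1=$(guess_dir 'http://localhost:8000/josh-project/josh.git:/docs.git')
d2=$(guess_dir 'http://localhost:8000/josh-project/josh.git:/docs:prefix=josh-docs.git')
printf '%s\n%s\n' "$d1" "$d2"
```

The second result is the full prefix filter, which is why passing an explicit directory name to git clone is the more predictable choice.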
Serving a github repo
To prompt for authentication, Josh relies on the server requesting it on fetch. When using a server which doesn't need authentication for fetching, Josh will not automatically prompt for authentication when pushing, and it will be impossible to provide credentials for pushing.
To solve this, you need to pass the --require-auth
option to josh-proxy.
This can be done with JOSH_EXTRA_OPTS
when using the docker image like so:
$ docker run -d -p 8000:8000 -e JOSH_EXTRA_OPTS="--require-auth" -e JOSH_REMOTE=https://github.com/josh-project -v josh-vol:/data/git joshproject/josh-proxy:latest
In this example, we serve only the josh-project repositories. Be aware that if you don't add the organisation or repo URL, your instance will be able to serve all of github. You can (and should) restrict it to your repository or organisation by making it part of the URL.
Working with workspaces
For the sake of this example we will assume a josh-proxy
instance is running and serving a
repo on http://josh/world.git
with some shared code in shared
.
Create a new workspace
To create a new workspace in the path ws/hello
simply clone it as if it already exists:
$ git clone http://josh/world.git:workspace=ws/hello.git
git
will report that you appear to have cloned an empty repository if that path does not
yet exist.
If you don't get this message it means that the path already exists in the repo but may
not yet have any path mappings configured.
The next step is to add some path mapping to the workspace.josh
file in the root of the
workspace:
$ cd hello
$ echo "mod/a = :/shared/a" > workspace.josh
And commit the changes:
$ git add workspace.josh
$ git commit -m "add workspace"
If the path did not exist previously, the resulting commit will be a root commit that does not share
any history with the world.git
repo.
This means a normal git push
will be rejected at this point.
To get correct history, the resulting commit needs to be based on the history that already exists in world.git.
There is however no way to do this locally, because we don't have the data required for this.
Also, the resulting tree should contain the contents of shared/a
mapped to mod/a
which
means it needs to be produced on the server side because we don't have the files to put there.
To accomplish that, push with the create option:
$ git push -o create origin master
Note: While it is perfectly possible to use Josh without a code review system, it is strongly recommended to use some form of code review to be able to inspect commits created by Josh before they get into the immutable history of your main repository.
As the resulting commit is created on the server side we need to get it from the server:
$ git pull --rebase
Now you should see mod/a
populated with the content of the shared code.
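What the server produces for this workspace can be pictured with a toy path model. This sketch is an assumption-laden illustration, not josh's implementation: the file names in world are invented, and the sed rules hard-code the workspace root ws/hello and the single mapping mod/a = :/shared/a from the example.

```shell
# Paths in the (hypothetical) world.git main tree, one per line.
world='ws/hello/workspace.josh
ws/hello/main.rs
shared/a/lib.rs
shared/b/other.rs'

# Workspace view: contents of ws/hello at the root, plus shared/a mapped to mod/a.
view=$(printf '%s\n' "$world" | sed -n \
  -e 's|^ws/hello/||p' \
  -e 's|^shared/a/|mod/a/|p')

printf '%s\n' "$view"
# workspace.josh
# main.rs
# mod/a/lib.rs
```

Paths that are neither inside the workspace root nor mapped, such as shared/b, do not appear in the workspace at all.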
Map a shared path into a workspace
To add a shared path to a location in the workspace that does not exist yet, first add an entry to the workspace.josh file and commit that.
You can add the mapping at the end of the file using a simple syntax, and rely on josh to rewrite it for you in a canonical way.
...
new/mapping/location/in/workspace = :/new/mapping/location/in/monorepo
At this point the path is of course empty, so the commit needs to be pushed to the server. When the same commit is then fetched back it will have the mapped path populated with the shared content.
When the commit is pushed, josh will notify you about the rewrite. You can fetch the rewritten commit using the advertised SHA. Alternatively, you can use git sync which will do it for you.
Publish a non-shared path into a shared location
The steps here are exactly the same as for the mapping example above. The only difference is that the path already exists in the workspace but not in the shared location.
Remove a mapping
To remove a mapping, remove the corresponding entry from the workspace.josh file.
The content of the previously shared path will stay in the workspace. That means the main repo will have two copies of that path from that point on, effectively creating a fork of that code.
Remove a mapped path
To remove a mapped path as well as its contents, remove the entry from the workspace.josh file and also remove the path inside the workspace using git rm.
Container configuration
Container options
Variable | Meaning |
---|---|
JOSH_REMOTE | HTTP remote, including protocol. Example: https://github.com |
JOSH_REMOTE_SSH | SSH remote, including protocol. Example: ssh://git@github.com |
JOSH_HTTP_PORT | HTTP port to listen on. Default: 8000 |
JOSH_SSH_PORT | SSH port to listen on. Default: 8022 |
JOSH_SSH_MAX_STARTUPS | Maximum number of concurrent SSH authentication attempts. Default: 16 |
JOSH_SSH_TIMEOUT | Timeout, in seconds, for a single request when serving repos over SSH. This time should cover fetch from upstream repo, filtering, and serving repo to client. Default: 300 |
JOSH_EXTRA_OPTS | Extra options passed directly to the josh-proxy process |
Container volumes
Volume | Purpose |
---|---|
/data/git | Git cache volume. If this volume is not mounted, the cache will be lost every time the container is shut down. |
/data/keys | SSH server keys. If this volume is not mounted, a new key will be generated on each container startup. |
Configuring SSH access
Josh supports SSH access (just pull without pushing, for now).
To use SSH, you need to add the following lines to your ~/.ssh/config
:
Host your-josh-instance.com
ForwardAgent yes
PreferredAuthentications publickey
Alternatively, you can pass those options via GIT_SSH_COMMAND
:
GIT_SSH_COMMAND="ssh -o PreferredAuthentications=publickey -o ForwardAgent=yes" git clone ssh://git@your-josh-instance.com/...
In other words, you need to ensure SSH agent forwarding is enabled.
josh-filter
Command to rewrite history using josh
filter specs.
By default it will use HEAD
as input and update FILTERED_HEAD
with the filtered
history, taking a filter specification as argument.
(Note that input and output are swapped with --reverse
.)
It can be installed with the following Cargo command, assuming Rust is installed:
cargo install josh-filter --git https://github.com/josh-project/josh.git
git-sync
A utility to make working with server side rewritten commits easier.
Those commits frequently get produced when making changes to workspace.josh
files.
The command is available in the script directory. It should be downloaded and added to the PATH. It can then be used as a drop-in replacement for git push.
It enables the server to return commits back to the client after a push. This is done by parsing
the messages sent back by the server for announcements of rewritten commits and then fetching
those to update the local references.
In case of a normal git server that does not rewrite anything, git sync
will do exactly the
same as git push
, also accepting the same arguments.
GraphQL API
Josh implements a GraphQL API to query the content of repositories without a need to clone them via a git client.
The API is exposed at:
http://hostname/~/graphql/name_of_repo.git
To explore the API and generated documentation, an interactive GraphQL shell can be found at:
http://hostname/~/graphiql/name_of_repo.git
Testing
Currently the Josh project mainly uses integration tests for its verification, so make sure you are able to run and check them.
The following sections describe how to run the different kinds of tests used to verify the Josh project.
UnitTests & DocTests
cargo test --all
Integration Tests
1. Setup the test environment
Because the integration tests need additional tools and a more complex environment, and because they are run using cram, you will need to create an extra environment to run them. To simplify the setup, we provide a Nix Shell environment, which you can start with the nix-shell command below if you have Nix installed.
Attention: Currently it is still necessary to install the following tools on your host system.
- curl
- hyper_cgi
cargo install hyper_cgi --features=test-server
Setup the Nix Shell
Attention: The first time you run this command, it will take quite a while to finish. You also need internet access while it runs; how long it takes depends on the speed of your connection.
nix-shell shell.nix
Once the command is finished you will be prompted with the nix-shell which will provide the needed shell environment to execute the integration tests.
2. Verify you have built all necessary binaries
cargo build
cargo build --bin josh-filter
cargo build --manifest-path josh-proxy/Cargo.toml
cargo build --manifest-path josh-ui/Cargo.toml
3. Setup static files for the josh-ui
cd josh-ui
trunk build
cd ..
4. Run the integration tests
Attention: Be aware that all tests except the ones in experimental should be green.
sh run-tests.sh -v tests/
UI Tests
TBD: Currently disabled, stabilize, enable and document process.