Process for Open Sourcing MATRIX repository
Context
A core goal of Every Cure and the MATRIX project is the open sourcing of the data as well as the code we use to generate it. However, the repository was initially set up in a haste and contains code that we do not have a license to open source as well as some information that we should not release publicly such as encrypted secrets of our infrastructure. While these files are encrypted, there are better ways to store them (e.g. in a separate, private repository).
Decision
We have decided to perform the following steps to enable the OSS release:
- Automatically review all comments of the repository using an LLM prompted for risk flagging. This is done using a utility function written to extract all comments and review comments
- Extract all encrypted secrets to a separate repository
matrix-secretswhich will be added as a submodule to the main matrix repository (and thus cannot be cloned by OSS contributors) - Extract all libraries to a separate repo
matrix-legacy-proprietary-packageswhich is added as a submodule until the library replacement is complete. This will block OSS contributors from using the fabricated data pipeline (test). - Rewrite all history to remove the above resources from the repository
- Rotate all secrets
Automated Comment Review
A command was added to the matrix CLI in apps/matrix-cli which allows for automated scanning of
all comments. This library sends each comment on all issues and PRs to a Gemini model for
inspection. Critical comments are flagged to the user and can be manually reviewed and/or deleted
until we see no further risks from our comments.
All comments deemed problematic have been stored in this sheet for the team to review and potentially mitigate. Generally, we only found some banter & jokes, but it is still a good idea to review and potentially tidy up our language before we open source.
Secret Extraction
Most secrets have already been moved to GCP Secrets Manager. However something has to bootstrap
the secrets and these bootstrapped values are stored in an encrypted file which is shared with the
respective environment administrators (dev/prod). We continue to deploy git-crypt for this but
move these encrypted files into a separate repository to minimize risk exposure. The
separate repository is added as a submodule which is only accessible to Every Cure
org members.
Rotate all secrets
All secrets stored in infra/secrets should be rotated and the old values invalidated after a grace
period. This ensures even if our history rewrite is faulty, all secrets have been rotated out and
are no longer valid.
Library Extraction
The key library to extract is data_fabricator which is extracted into a separate repository. We
can simply move this library to another repo and then replace the pipelines/matrix/packages folder
with a submodule reference to this repository.
PR 1325 contains this change.
History Rewrite
- Close or merge all open PRs. This will require careful coordination with the team to ensure we have a moment where all open work has been completed.
- Create a full mirror of the repository in a separate, private repository called
matrix-backup. This is a safety precaution and ideally will not be used further. - Perform a local history rewrite with
git-filter-repoto remove key folders and files from any commit in the repository. - Push the changes to Github, overwriting all existing branches with the new history chain
- Ask all current maintainers to reset their local repository to the new history chain using
scripts/reset_local_repository.sh
Consequences
- PRs and issues referencing commit IDs will be broken. This is an unavoidable side effect of the history rewrite. We still maintain all git history which is what we want to achieve.
- Administrators need to clone with submodules included to unlock the secrets.
- Documentation needs to be added to enable OSS contributors to still install the pipeline even without access to the fabricator
Benefits
After this work, the entire mono repository can be open sourced. We can also remove authentication from our documentation page at this stage and freely share links with people to get outside perspectives and contributions.