Date: 2022-10-07
Status: In Progress
Notes: This document is still in review and may be heavily modified based on stakeholder feedback. Please add any feedback or questions in:
https://github.com/testground/testground/issues/1491
- Table of Contents
- Context
- Vision
- Our Focus for 2022-2023
- Milestones
- 1. Bootstrap libp2p's interoperability testing story
- 2. Improve our Development & Testing infrastructure to meet engineering standards
- 3. Support libp2p's interoperability testing story as a way to drive "critical" Testground improvements
- 4. Testground Is Usable by Non-Testground experts
- 5. Refresh Testground's EKS support
- 6. Provide a Testground As A Service Cluster used by libp2p & ipfs teams
Testground is a platform for testing, benchmarking, and simulating distributed and peer-to-peer systems at scale. It's designed to be multi-lingual and runtime-agnostic, scaling gracefully from 2 to 10k instances when needed.
Testground was used successfully at Protocol Labs to validate improvements like libp2p's gossipsub security extensions, IPFS's massive DHT and Bitswap improvements, and Filecoin improvements. Today, we are actively working on it to support libp2p interoperability testing.
This document consists of two sections:
- Milestones is the list of key improvements we intend to release with dates and deliverables. It is the "deliverable" oriented slice of our plan.
- Appendix: Problems we focus on is the list of areas of improvement we intend to focus on. Some contain a few examples of milestones. It is the "problem" oriented slice of our plan and might cover multiple releases and multiple "milestones".
There is an issue dedicated to Tracking Discussions around this Roadmap.
The timeline we share is our best-educated guess (not a hard commitment) around when we plan to provide critical improvements.
Where possible, we've shared a list of deliverables for every Milestone. These lists are not exhaustive or definitive; they aim to create a sense of "what are concrete outcomes" delivered by an improvement.
As we agree on this Roadmap and refine it, we'll create & organize missing EPICs.
Making Testground project management sustainable is one of our key milestones.
This section is still a very high-level draft.
Testground As A Service embodies our long-term vision.
- A single, scalable, installation that one or more organizations can use in all their projects,
- An easy-to-follow documentation and guidelines for users to jump straight into action,
- An example: example pages that hand-hold users through the learning curve, fostering community-driven tutorials, as well as an API section that helps navigate Testground's vast possibilities.
- The ability to experiment with large-scale networks and simplify the integration testing of "any" network-related code,
- An example: having the ability to define standard experiments with a set of roll-up networks that deploy their data to a different DA network. This would help teams compare solutions using the same metrics.
- The ability to track the impact of a change in terms of stability & performance across multiple projects,
- An example: having the ability to run IPFS benchmarks and simulations with different combinations of libraries. This would help us measure regression and improvements as soon as they occur in upstream dependencies.
Products with similar ideas but specialized in different areas:
- database: CockroachDB performance tracker,
- browser: Webkit Performance Dashboard
- browser automation: Selenium
Research and Templates
We focus on the following:
- Reliability: above all, Testground should be trusted by its users;
- Sustainability: implementing the processes & tools we need to maintain Testground in the medium and long term;
- Usefulness: solving needs that have been requested explicitly by projects.
We want to ensure Testground is valuable and stable before we grow its feature set.
- Delivery: Q4 2022
- Theme: usefulness
- Effort: approx. 4 developer-months
Why: Testground provides an excellent foundation for distributed testing. Supporting the libp2p team with interoperability tests generates long-term value outside of Testground. We can move faster, generate interest, and create measurable improvements by focusing on a use-case.
Deliverables:
- Tooling to support interoperability testing (mixed-builders, composition templating),
- Stability measurements & fixes to reach the libp2p team's expectations,
- A fully working example (ping test) used in go-libp2p and rust-libp2p CIs,
- An interoperability Dashboard that shows how implementations and versions are tested.
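The "mixed-builders" tooling above can be sketched as a composition file in which each group selects its own builder, so a go-libp2p group and a rust-libp2p group can run in the same test. This is an illustrative sketch only: the field names follow the composition format as we understand it today, and the exact schema may change.

```toml
# Hypothetical mixed-builder composition for the ping interoperability test.
# Field names are illustrative; check the composition spec before relying on them.

[metadata]
name = "ping-interop"

[global]
plan            = "ping"
case            = "ping"
runner          = "local:docker"
total_instances = 2

# One group built with the Go builder...
[[groups]]
id      = "go-libp2p"
builder = "docker:go"
[groups.instances]
count = 1

# ...and one with the generic Docker builder (e.g. for rust-libp2p).
[[groups]]
id      = "rust-libp2p"
builder = "docker:generic"
[groups.instances]
count = 1
```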
- Delivery: Q1 2023
- Theme: reliability & sustainability
- Effort: approx. 3 developer-months
Why: Testground has proved itself valuable multiple times over the years. However, we now need bulletproof development processes to make the project sustainable and facilitate external contributions.
Extra care is taken with testing and stability: we are building a testing platform, and Testground's own testing must be impeccable.
Deliverables:
- Automated release process with official releases, binary distribution, and Changelogs - EPIC 1430
- Well-defined Project management processes like triaging, contribution & reviewing processes, etc.
- Documentation for maintainers,
- Maintainable integration testing tooling (no more shell scripts or flakiness),
- A Stability Dashboard used to identify regressions & discuss improvements with maintainers and users,
3. Support libp2p's interoperability testing story as a way to drive "critical" Testground improvements
- Delivery: Q3 2023
- Theme: usefulness
- Effort: approx. 8 developer-months
Why: By focusing on a use case, we can move faster, generate interest, and create measurable improvements outside the project.
Deliverables:
- +3 months
- Javascript & Browser support in Testground - issue 1386
- Reliable Network simulation in Docker
- Access to public networks - issue 1472
- Network Simulation Fixes - Epic 1492
- +3 months
- Remote-Runners for transport Benchmarking
- Improved Network Simulation in Docker
- NAT simulation - issue 1299
- Complex topologies - issue 1354
- +2 months
- Tooling
- Performance benchmarking tooling
- Debugging with tcpdump-related features - Issue #1384
- Composition Improvements
- Delivery: Q3 2023
- Theme: sustainability
- Effort: approx. 5 developer-months
Why: Because Testground is an open-source project, attracting more users will also bring more contributions to the project.
Deliverables:
- +1 month
- Working Examples (tested in CI)
- SDK implementers support
- Matrix of supported languages with links to SDKs
- Instructions for SDK Implementers
- +2 months
- Documentation EPICS 1741.
- Updated documentation infrastructure
- Quickstart guides
- Updated examples which are tested in CI
- New features & parameters, etc.
- guides for most helpful use cases and features
- composition templating, etc.
- +2 months
- Usability improvements
- Outbound Communication
- Publish guides, usage reports (whitepapers), and more.
- We want to multiply the impact of this effort by attracting more users, contributors, and candidates.
- Delivery: Q2 2024
- Theme: usefulness
- Effort: approx. 6 developer-months
Why: Testground can simulate small networks in CI, but it covers more use cases when it lives in a larger cluster. When we run Testground in Kubernetes, we can support whole organizations through the Testground As A Service product.
Using a managed service (Amazon's Elastic Kubernetes Service) means our maintenance costs are lower, and the team can focus on improvements.
Deliverables:
- An EKS installation script,
- Extra care is taken with network infrastructure (CNIs).
- A (fixed) Kubernetes runner that runs on Amazon's EKS,
- The team can use the latest Testground version, create new releases, and upgrade the cluster.
- Delivery: Q3 2024
- Theme: usefulness
- Effort: approx. 2 developer-months
Why: TaaS enables tests at a much bigger scale and makes it easier to use Testground in new projects. It will improve build speed (thanks to Docker caching) and run speed (thanks to parallelization), which are critical for test plans running in CI.
Deliverables:
- A stable cluster,
- Authentication,
- Tooling for users to use EKS cluster in their testing,
- Integration of the EKS feature in our testing infrastructure
- Test the EKS cluster during integration testing,
- (for example, use short-lived clusters during nightly CI tests).
- Reliable Network simulation on EKS
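From a user's perspective, the tooling above could look roughly like the following CLI sketch: building locally but running against the shared cluster via the `cluster:k8s` runner. The command names and flags below are based on the existing Testground CLI and should be treated as assumptions; they may evolve with this milestone.

```shell
# Illustrative only: import a plan, then run it on a shared EKS cluster
# with the cluster:k8s runner instead of local:docker. Exact flags are
# assumptions based on the current CLI.
testground plan import --from ./plans/ping

testground run single \
  --plan=ping \
  --testcase=ping \
  --builder=docker:go \
  --runner=cluster:k8s \
  --instances=50
```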
Why: an unreliable testing platform is just a noise machine. We need to secure our users' trust. In addition, Testground maintainers need clear feedback about stability improvements & regressions.
- We expect strictly zero false positives (a test succeeds because Testground missed an error); these are critical bugs, and we never merge a pull request that introduces a false positive into Testground.
- However, Testground users might encounter false negatives (a test fails because Testground encountered an issue). Our stability metrics will measure this.
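The distinction above can be turned into a concrete stability metric. The sketch below is illustrative Python, not Testground code: it computes the false-negative rate a stability dashboard might report, where a "false negative" is a run that failed because of Testground infrastructure rather than the test plan itself.

```python
# Illustrative sketch: a false-negative rate metric for a stability dashboard.
from dataclasses import dataclass


@dataclass
class RunRecord:
    passed: bool
    infra_error: bool  # failure attributed to Testground itself, not the plan


def false_negative_rate(runs: list[RunRecord]) -> float:
    """Share of all runs that failed because of Testground infrastructure."""
    if not runs:
        return 0.0
    infra_failures = sum(1 for r in runs if not r.passed and r.infra_error)
    return infra_failures / len(runs)


runs = [
    RunRecord(passed=True, infra_error=False),
    RunRecord(passed=False, infra_error=True),   # e.g. sidecar crash: false negative
    RunRecord(passed=False, infra_error=False),  # genuine plan failure
    RunRecord(passed=True, infra_error=False),
]
print(false_negative_rate(runs))  # → 0.25
```

Tracking this number per runner and per language would give maintainers the "relevant axis" view described below.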
Maintainers and users have a way to measure and follow Testground's Stability over any "relevant" axis.
This might combine different languages, runners (k8s, docker, local), and contexts (developer env, CI env, k8s env).
This dashboard will explicitly describe what to expect regarding false negatives (when an error is caused by Testground itself and not by the plan or the test).
Milestone 2: We have identified our users' stability requirements and met "most" of them across:
- local runner,
- docker runner,
- EKS runner,
Maintainers and users have a way to measure and follow Testground's Performance over any "relevant" axis. Therefore, this effort will start with identifying which metrics we want to measure first.
It might contain:
- Raw build / run performance for synthetic workload
- Performance of real-life usage in CI (like libp2p/test-plans)
- Consideration for caching features, re-building, etc.
Why: We believe Testground can provide value across entire organizations. Making it easy to run a large-scale workload and efficient small-scale CI tests are core to its success.
- Deploy
- Security (authentication)
- domain setup
- Deploying short-lived clusters for benchmarking
Why: We believe testing is valuable when everyone on the team can contribute. Our platform has to be approachable.
Related - EPICS 1741.
- The documentation is up-to-date
- We generate configuration documentation
- We provide introduction guides for every language
- We provide doc for Common Patterns
- "Public Relations".
Why: Testground has already provided value across many different projects. Now, we need bulletproof development processes to make the project sustainable and facilitate external contributions.
(I feel like this should be a top-level point)
- We use a single CI (GitHub CI)
- Refactor the testing to be easier to use and simpler to maintain
- remove shell scripts
- Plan for testing EKS
- Measure & remove flakiness
- We distribute versions following an explicit release process (no more `:edge` by default)
- It's easy to follow changes (CHANGELOG)
- We distribute binaries
- The release process is automated
- public specifications
- public discussions
- community gatherings
- We have a single maintainer team for review,
- We have a clear label and triaging process,
- We have a reliable & transparent contribution process (protected branches, etc),
- We have precise project management tooling (port previous ZenHub planning into GitHub?)
5. Testground provides the networking tooling required to test complex Distributed / Decentralized Applications
Why: This is Testground's main feature.
There is a clear matrix of what features can be used with which Testground runner.
- MTU issues, networking simulation, etc.
- This will be a Matrix of feature x runner x precision (transport level, application level, etc.).
- Access to public networks - issue 1472
- NAT simulation - issue 1299
- Complex topologies - issue 1354
Why:
- composition files specification and improvements
- Logging improvements - Epic 1355
- tcpdump'ing features - Issue #1384
7. Testground covers every essential combination of languages, runtimes, and libraries required by its users
Why: Drive adoption outside of Protocol Labs.
- Provide a simple matrix about which languages and builders are supported and how
- example: go support = great, nodejs support = deprecated, python support = non-existent
- Provide a way to raise requests for more language support
- Provide an example + SDK for "most" languages using `docker:generic` (`rust`, `browser`, `nim`, `python`, …)
- Provide an official `docker:xxxx` builder for "most" languages (`docker:rust`, `docker:browser`)
  - This will require deep knowledge of the packaging systems and how to do library rewrites, etc. See the `libp2p/test-plans` rust support for this.
- Custom setup & composition generation scripts
- Record every known SDK and its level of maintenance in an official "awesome-testground" page
- example: nim-sdk.
- Provide instructions and a path for SDK implementers
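As a rough illustration of the path for SDK implementers: a minimal SDK mainly needs to read run parameters from the runtime environment and emit structured events for the runner to consume. The sketch below is illustrative Python; the `TEST_*` variable names and the JSON-lines event format are assumptions that implementers should verify against the actual runtime specification.

```python
# Illustrative sketch of the core of a minimal Testground SDK.
# Variable names and the event format are assumptions, not a spec.
import json
import os
import sys


def load_run_params(environ=os.environ) -> dict:
    """Read run parameters the runtime passes via environment variables."""
    return {
        "plan": environ.get("TEST_PLAN", ""),
        "case": environ.get("TEST_CASE", ""),
        "instances": int(environ.get("TEST_INSTANCE_COUNT", "1")),
    }


def emit_event(event: dict, out=sys.stdout) -> None:
    """Report a structured runtime event as a JSON line on stdout."""
    out.write(json.dumps(event) + "\n")


params = load_run_params(
    {"TEST_PLAN": "ping", "TEST_CASE": "ping", "TEST_INSTANCE_COUNT": "2"}
)
print(params["instances"])  # → 2
```

Capturing this contract precisely in the "Instructions for SDK Implementers" deliverable is what would let new languages plug in without reverse-engineering the Go SDK.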