Design Considerations Notes
Mark Santcroos edited this page Aug 6, 2013 · 2 revisions
The loose collection of notes below captures considerations and requirements for the saga-pilot design. The list is in no particular order.
- API
  - TROY is considered a primary API consumer, obviously
  - it should be easy to support a PilotAPI wrapper for backward compatibility
- CLI
  - this is a second-order concern, and derives from the API
- flexibility
  - support several usage types
    - BigJob-like deployment, i.e. a module bound to the application
    - AIMES / Condor / Panda-like deployment, i.e. as a community service
    - support production use
    - support research (scheduling, middleware, infrastructure level)
    - possibly support different language bindings, at least for the service-type deployment
- stability
  - agent deployment should be simple and stable
  - error reporting should be intuitive, in particular for deployment and bootstrapping errors
- ease of use
- implementability
  - we have finite resources, and need to consciously pick the points where we accept complexity
- modularity
  - apart from the different deployment modes, possibly support pluggable schedulers, pluggable coordination layers, and pluggable (and configurable) agents
  - agents in particular should be very lightweight, and have limited dependencies
- performance
  - saga-pilot should scale up (number of CUs, number of agents per backend)
  - saga-pilot should scale out (number of agents across backends)
  - notifications should work throughout the whole stack
  - development should be driven by benchmarks (see Performance Tests)
- inspection
  - logging, inspection and/or auditing are required on all levels, for:
    - operation transparency for the end user
    - support of development
    - support of experiments
    - support of SAGA-Pilot level and higher level schedulers
- security
  - are multi-user pilots needed for AIMES and gateway use cases? What exactly does multi-user mean in those contexts?
  - credential delegation is needed for pilot-initiated data transfer operations
- data
  - classic stage-in/stage-out directives on CUs will be needed, as they allow automated file movement for large numbers of jobs
  - stdout/stderr streaming would be good for interactivity and debugging, possibly via a pub-sub mechanism
  - should listing a CU working directory return a list of data access URLs? This would be in line with general CU inspection capabilities
- provenance
  - structured recording of what happened, where, by whom, etc. (also related to performance metrics)
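
To make the stage-in/stage-out item under "data" more concrete, here is a minimal sketch of what declarative staging directives on a CU description might look like. It is not the saga-pilot API: the names `StagingDirective`, `ComputeUnitDescription` and `describe_units`, the field names, and the URLs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StagingDirective:
    """One file movement from `source` to `target` (hypothetical type)."""
    source: str
    target: str


@dataclass
class ComputeUnitDescription:
    """Hypothetical CU description that carries its own staging directives."""
    executable: str
    arguments: list = field(default_factory=list)
    stage_in: list = field(default_factory=list)   # to be moved before the CU runs
    stage_out: list = field(default_factory=list)  # to be moved after the CU ends


def describe_units(n, input_url, output_url):
    """Build n CU descriptions; file movement is declared, not scripted."""
    units = []
    for i in range(n):
        units.append(ComputeUnitDescription(
            executable="/usr/bin/sort",
            arguments=["-o", "result.txt", f"input_{i}.dat"],
            stage_in=[StagingDirective(f"{input_url}/input_{i}.dat",
                                       f"input_{i}.dat")],
            stage_out=[StagingDirective("result.txt",
                                        f"{output_url}/output_{i}.txt")],
        ))
    return units
```

Because each description declares its own inputs and outputs, a call such as `describe_units(1000, "sftp://host/in", "sftp://host/out")` yields a thousand CUs without any per-job transfer code. Who performs the transfers (agent or client side) is left open here, as it is in the notes above.
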
NOT considered to be design constraints are:
- dynamically resizable pilots
  - this is considered a corner use case that introduces significant agent complexity, which we want to avoid for various reasons (see above)