Current approach to cleaning up nodes doesn't guarantee a clean test run #19

ewencp · 2015-02-18T02:10:35Z

This is potentially a pretty critical bug. Consider a test with three services: zookeeper, kafka, and a kafka client. If a previous run of the test failed and we didn't get to clean up properly, we would expect the next round to clean up before starting because we have each service do a _stop_and_clean before starting services.

However, what we actually need is for all nodes to be cleaned before we start any services. The reason is that, in the above example, we'll start the second test by making sure the zookeeper nodes are clean (killing the leftover processes, deleting their data), and then start them up. However, the Kafka nodes may still be running. Since we're also not doing any sort of port randomization, the old Kafka brokers will reconnect to the new Zookeeper nodes, and may do things like write ZK nodes in the mean time.

Cleaning up only the nodes we'll need is probably difficult because tests have no easy way to know the number of nodes they'll need. One option for fixing this bug is to just run the cleaning process on all nodes before doing anything else in a test. Another option is requiring each test to have an annotation indicating the number of nodes it needs (possibly dynamically determined after instantiation for tests that can be run on different cluster sizes) so that those nodes can all be preallocated and cleaned.

And of course all of this depends on being able to run a generic cleaning process to at least make sure that all processes are shutdown so they can't do anything bad before each individual service can do things like clean up old state before running on a node. This works ok for now because we have only simple java processes that we can find easily, but it's possible we should have a way for each service to register a cleanup method that will be run regardless of whether the current test will use that service, truly ensuring all old state is cleaned up.

ewencp · 2015-04-16T04:24:40Z

This refers to a lot of services that were specific to the particular tests, but this is a more general framework-level issue as well. We might want to provide a hook to allow users to clean nodes up before anything else is executed on them, so I'm leaving this in place as I clean up all the project-specific issues out of this repository.

Print an error and exit with a non-zero status code to indicate that the tests failed since specifying a non-existent path is probably an error on the user's part. Fixes #19.

granders · 2015-04-30T22:34:13Z

@ewencp
I think I covered this pretty well in #43

Any node allocated to a Service is cleaned via Service.force_clean_node() and the service-specific clean_node() before the Service is started.
I also added checks to warn users/developers if it appears that a Service hasn't properly cleaned up after itself.

Let me know if you think I left anything out.

granders · 2015-04-30T22:42:43Z

Also the service shutdown process is a lot more robust now given that stopping/cleaning will still occur even if running a test throws an exception, so it's much less likely that stale service process will manage to stick around.

ewencp · 2015-04-30T22:55:06Z

@granders I think we're closer now, but this isn't actually resolved. First the easy part: we still have the problem that the commands being used are muckrake specific. It's actually a combination of calls, both _kill_running_processes and force_clean_node that together get the node clean. Fixing this is probably just a matter of moving the muckrake specific code to muckrake and substituting stubbed out versions in ducttape. In muckrake we might just make a new MuckrakeService class that will be the new base class for all the Muckrake services and uses the current "kill all java processes, delete everything in /mnt" approach.

The harder problem is making sure that nodes not even allocated to the test don't cause problems -- this is the issue described in the initial report where a Kafka cluster that's still running from a previous test starts connecting to a zookeeper cluster that was started as part of the new test. This is harder to fix because it involves nodes that may never even be allocated to this test.

It might make sense to just punt on that edge case given that the cleanup procedure is much better now -- it really only becomes relevant if something goes very wrong and ducktape exits without cleaning up properly. It would be nice if the next execution of ducttape could make sure things were in a clean state. Maybe that's the solution, just make the first thing ducttape does with the cluster is to go clean up each node, then start running tests? Only issue is where you should specify what actions should be take to do that initial cleanup since it's no longer tied to an individual Service or Test.

granders · 2015-04-30T23:11:16Z

Yeah I agree that problems caused stray processes running on nodes not allocated to the test are not solved. That edge case is more unlikely now, and cleaning unallocated nodes seems tricky to make work in the parallel case. For nightly runs this is also mitigated by the fact that nodes are brought up from scratch.

I think I see your point about _kill_running_processes and force_clean_node although they are implemented in ducktape, the implementations share the assumptions made by muckrake wrt what kinds of processes are running (java) and where persistent state gets dumped (/mnt)

As far as cluster "pre-cleaning", this seems reasonable - for example it might make sense for the whole cluster to have a life cycle just like Services do - this would also give us the opportunity to try and provide robust "bring vagrant nodes up from scratch if necessary" logic.

But pre-cleaning is a bit tricky too - either we make some generic assumptions about processes (e.g. kill all java) and specific files, or do some funky thing involving Service discovery.

ewencp · 2015-04-30T23:22:13Z

Here's another idea: create a class in ducktape, maybe called something like ClusterNodeCleaner, that just defines the one method for cleaning a node. The default implementation just passes. Then, make ducktape discovery look for subclasses of this interface and use them to do a pre-clean step before any tests are run.

Then we could move the java + rm -rf /mnt/* code into a muckrake. This would clean up the cluster before tests start, and we could safely assume the test writer has done enough to clean up after themselves in their own services/tests as long as ducktape does a good job of catching exceptions.

To be really thorough, you could even have these classes only apply to tests under the directory they're defined in, e.g. you might define a top-level one, but then if you have suites of tests that run some other processes, you might override it just for that package to, e.g., kill another type of process as well. This could be quite a bit more complicated for the test loader though since it needs to search back up the directory tree to find the right cleaner class to use for each test. Although I guess if you specified a single test class you'd still have to do that search with this approach, but you could at least only do it one time during loading.

ewencp added the bug label Feb 18, 2015

ewencp self-assigned this Feb 18, 2015

ewencp added a commit that referenced this issue Apr 21, 2015

Verify input paths exist during test discovery.

fa2b866

Print an error and exit with a non-zero status code to indicate that the tests failed since specifying a non-existent path is probably an error on the user's part. Fixes #19.

granders mentioned this issue Jun 4, 2015

add --cleanup (or similar) option #56

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Current approach to cleaning up nodes doesn't guarantee a clean test run #19

Current approach to cleaning up nodes doesn't guarantee a clean test run #19

ewencp commented Feb 18, 2015

ewencp commented Apr 16, 2015

granders commented Apr 30, 2015

granders commented Apr 30, 2015

ewencp commented Apr 30, 2015

granders commented Apr 30, 2015

ewencp commented Apr 30, 2015

Current approach to cleaning up nodes doesn't guarantee a clean test run #19

Current approach to cleaning up nodes doesn't guarantee a clean test run #19

Comments

ewencp commented Feb 18, 2015

ewencp commented Apr 16, 2015

granders commented Apr 30, 2015

granders commented Apr 30, 2015

ewencp commented Apr 30, 2015

granders commented Apr 30, 2015

ewencp commented Apr 30, 2015