OV SD card corruption #310

hpmax · 2022-05-31T00:00:05Z

Inevitably after some number of improper power cycles (i.e. power cycles that occur without proper shutdown) the FLASH card seems to get corrupted. I don't see how this is acceptable behavior for a flight instrument, as it would be easy to accidentally flip a master switch, or switch from one battery to another and cause a reboot. In my last reboot I seemed to lose my map and my configuration (albeit, it's possible it's just the configuration since the configuration probably points to the map).

I googled for how this is handled on a Raspberry Pi, and found a number of suggestions although some seemed more or less doable, and I'm not sure all of them properly address the issue.

Make the filesystem read-only. This is clearly the most straightforward solution. Much of the system could be read-only. It's possible that we could have a separate partition for the configuration and maps, and a third for logs. The maps and such would only get updated periodically.
Crond could come along and periodically sync the read/write partitions. This is still no guarantee, particularly for logs, which are probably written frequently. Alternatively, XCSoar could flush after each write. This is probably the easiest solution for maximum reliability with minimum effort, but it would need @MaxKellermann's buy-in. Perhaps XCSoar could have a flag, command-line option, or configuration option that tells it to flush after each write, or something similar.

I see no good reason whatsoever not to flush after each configuration update. The logs are a bigger issue: perhaps you don't want to sync after each write. Would it be possible to change how logs are written, so that instead of writing directly to disk you write to a memory buffer and, once it is full, write the buffer out and flush it? That essentially internalizes how the file system already works, and it should decrease the risk of data corruption by writing and flushing in one step instead of writing constantly.
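
A minimal Python sketch of that buffering idea. The class name `BufferedLog` and the `max_records` threshold are hypothetical, chosen only for illustration; this is not how XCSoar writes its logs.

```python
import os


class BufferedLog:
    """Collect log records in RAM and write them out in one synced batch.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(self, path, max_records=64):
        self.path = path
        self.max_records = max_records
        self.pending = []

    def append(self, record):
        """Queue one record; write the batch once the buffer is full."""
        self.pending.append(record)
        if len(self.pending) >= self.max_records:
            self.flush()

    def flush(self):
        """Append all queued records to the file and force them to storage."""
        if not self.pending:
            return
        data = "".join(line + "\n" for line in self.pending).encode("utf-8")
        with open(self.path, "ab") as f:
            f.write(data)
            f.flush()              # move Python's buffer into the kernel
            os.fsync(f.fileno())   # ask the kernel to write this file out now
        self.pending.clear()
```

The trade-off is that anything still sitting in `pending` is lost on a power cut, so the buffer size bounds how much of the log can go missing.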

I was speaking to my father who has a fair amount of experience dealing with large databases. His (somewhat modified by me) suggestion was to write to two separate but identical files, flushing after each write. Put a transaction serial number at the end (i.e. increment by 1 each time you write something new). When the machine boots up, it should check the serial numbers at the end of each file. If the serial number in file 1 is the same as file 2 the files are almost certainly fine. If they're different there was corruption because one write completed but not the other. However, since the second write doesn't start until the first one completes, we know that by definition one of them IS good. If we can detect (via checksum, hash, etc) that one is corrupt, then just copy the good one over the corrupt one and you've fixed whatever corruption was present.
There were suggestions that an industrial SLC SD card could be used, but I think the corruption is occurring because of it not being sync'ed and I doubt a better SD card would fix the problem.
There were some suggestions of an "uninterruptible power supply." Essentially, my thought on how to implement this would be: put a Schottky diode followed by a supercapacitor on the input of the power converter. You'd also want to hang a voltage divider on the input (something like 100k - (100k || 3.3V Zener diode)). If the voltage on the CB2 pin connected to the Zener goes low, you immediately shut off the display and then issue a "sudo shutdown now." You could also use the I2C A/D converter and dispense with the additional circuitry, but I'd be inclined not to do that. But I don't know how long it would have to run to shut down properly, or how much current it would need to supply. I assume it would be on the order of 500mA @ 5V minimum, and you'd need to keep it running for 5 seconds or so? I'm thinking you'd probably want 1-2 farads minimum. Alternatively, a battery could be used, but I don't like that idea. Putting it on the output of the power converter is another alternative that might seem inherently easier, but it has an issue: as soon as input power is lost the output voltage starts dropping, and you have much less headroom between working and not working on the output than on the input.

I view this as both the best and worst solution. It should absolutely work, but it's also the most expensive, requires hardware redesign and without being fancy could keep the part powered when it shouldn't be.
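
A rough sizing check for the supercapacitor, as a Python sketch. The load figures (500mA @ 5V for about 5 seconds) are the guesses from above. The 12 V supply, the 0.4 V diode drop, the 7 V minimum converter input, and the 85% converter efficiency are all assumptions that would need to be replaced with measured values.

```python
def holdup_capacitance(load_watts, seconds, v_start, v_min, efficiency=0.85):
    """Capacitance in farads needed to power the load while the capacitor
    discharges from v_start down to v_min (the converter's minimum input).

    Uses E = 1/2 * C * (v_start^2 - v_min^2); ignores capacitor ESR and leakage.
    """
    if not v_start > v_min > 0:
        raise ValueError("need v_start > v_min > 0")
    energy_needed = load_watts * seconds / efficiency   # joules drawn from the cap
    usable_per_farad = 0.5 * (v_start**2 - v_min**2)    # joules available per farad
    return energy_needed / usable_per_farad


# Assumed numbers: 5 V * 0.5 A = 2.5 W for 5 s, 12 V supply minus 0.4 V diode drop,
# converter assumed to stop regulating below 7 V.
farads = holdup_capacitance(2.5, 5.0, v_start=11.6, v_min=7.0)
print(f"{farads:.2f} F")
```

Under those assumptions the result is roughly a third of a farad, so 1-2 farads would leave margin for a slower shutdown, a higher load, or capacitor losses.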

I think it's worth pointing out that the current file image creates a pretty small file system... 16GB cards are dirt cheap, heck even 64GB cards can be had for less than $10 in quantity 1. There's no reason why we can't use the rest of the space, and I'd prefer the partition were enlarged anyway.

I think a combination of remapping the partitions and intelligent syncing makes sense and having a backup partition (as described in #3) with a backup log/configuration would be the icing on the cake. Anyone have any other thoughts on how to improve the reliability?

MaxKellermann · 2022-05-31T07:09:18Z

What exactly does "SD card corruption" mean? What does it look like? What exactly are the symptoms?

> Make the filesystem read-only. This is clearly the most straightforward solution. Much of the system could be read-only. It's possible that we could have a separate partition for the configuration and maps, and a third for logs.

If you're talking about filesystem corruption - then this idea would not be helpful. If you lose power in the middle of a write, then files that are currently being written to may get corrupted, but not files that are not being written to. System files are not being written to.

Making the operating system (root partition) read-only is a good idea for other reasons, but not to prevent filesystem corruption.

> Crond could come along and periodically sync the read/write partitions.

All this would do is decrease performance; it would not prevent filesystem corruption. Note that the kernel already has a timer which periodically writes back dirty pages from the page cache to permanent storage, so what you want essentially already exists. The difference is that an explicit "sync" would stall all pending writes until all dirty pages are flushed, which (unlike the writeback timer) would freeze the OpenVario for a few seconds, but as I said, would not prevent filesystem corruption.
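
For reference, a small Python sketch that reads the Linux writeback timer settings. The two sysctl files are standard Linux ones; the helper name `read_writeback_settings` is hypothetical, and the values on an OpenVario image are not something this thread establishes.

```python
import os


def read_writeback_settings(vm_dir="/proc/sys/vm"):
    """Return the kernel writeback intervals in seconds (None if unreadable).

    dirty_writeback_centisecs: how often the flusher threads wake up.
    dirty_expire_centisecs: how old dirty data must be before it is written out.
    """
    names = ("dirty_writeback_centisecs", "dirty_expire_centisecs")
    settings = {}
    for name in names:
        try:
            with open(os.path.join(vm_dir, name)) as f:
                settings[name] = int(f.read().strip()) / 100.0  # centiseconds to seconds
        except (OSError, ValueError):
            settings[name] = None
    return settings


if __name__ == "__main__":
    print(read_writeback_settings())
```

Mainline Linux commonly defaults to a 5 second wake-up interval and a 30 second expiry, which gives an idea of how long unsynced data may sit in the page cache.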

> I was speaking to my father who has a fair amount of experience dealing with large databases. His (somewhat modified by me) suggestion

Your father's suggestion sounds like he was an expert in the 1980s or early 1990s, but these things have not worked on modern computers since the mid-1990s:

> However, since the second write doesn't start until the first one completes, we know that by definition one of them IS good.

This is the part that stopped working in the mid-1990s: modern kernels may reorder writes at will, so no matter what nice system you come up with, it may end up broken anyway.

If you need guaranteed storage consistency, you need system calls like fdatasync() / fsync() (that's what databases like PostgreSQL do). Check XCSoar commits XCSoar/XCSoar@a9e7334 and XCSoar/XCSoar@56b7e24 for how XCSoar 7.24 will use it to prevent profile corruption. These system calls do a "sync" on a smaller scale, not the whole filesystem, but only a certain file, which avoids the extreme slowness of a global sync.
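
A minimal Python sketch of the general per-file pattern: write a temporary file, `fsync()` it, rename it over the old file, then `fsync()` the directory. This illustrates the technique only and is not taken from the XCSoar commits referenced above; the function name and the `.tmp` suffix are assumptions.

```python
import os


def write_file_durably(path, data):
    """Replace `path` with `data` so that a power cut leaves either the old
    or the new contents on disk, never a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # Python buffer to kernel
        os.fsync(f.fileno())   # kernel page cache to storage, this file only

    os.replace(tmp_path, path)  # atomic rename on POSIX filesystems

    # Sync the directory so the rename itself is persisted (Linux/POSIX).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Because only one file and its directory entry are synced, the cost is much smaller than a global `sync`.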

> There were suggestions that an industrial SLC SD card could be used, but I think the corruption is occurring because of it not being sync'ed and I doubt a better SD card would fix the problem.

If this is a hardware problem, there's not much we can do, other than using better hardware.

mihu-ov · 2022-05-31T08:10:43Z

> Inevitably after some number of improper power cycles (i.e. power cycles that occur without proper shutdown) the FLASH card seems to get corrupted.

There are some reported issues about corrupted XCSoar config files, but this should be fixed already, as explained by Max. I don't see a significant number of other SD card corruption issues reported. I may have missed those, though.

> I don't see how this is acceptable behavior for a flight instrument, as it would be easy to accidentally flip a master switch, or switch from one battery to another and cause a reboot.

Switching batteries never caused my OV to reboot. The DC/DC converter seems to store enough energy, at least for quick flipping of a toggle switch and my not so power hungry Pixel Qi variant.

> In my last reboot I seemed to lose my map and my configuration (albeit, it's possible it's just the configuration since the configuration probably points to the map).

This should be the issue mentioned by Max which is fixed with XCSoar 7.24.

hpmax · 2022-05-31T13:36:17Z

@MaxKellermann My setup files were corrupted. It is possible that is the extent of it. And yes, I was adjusting my device settings prior to an improper shutdown so that may have been the cause of it.

I don't know what version of XCSoar I'm running and can't figure out how to find out. The splash screen displays version 7.0, and I don't see any functionality either within XCSoar or from the command line to display the version number. However, the date of the binary on the file system is April 28, 2021, and the current date is correct, so I doubt it's after that. Based on release information I can find on GitHub and elsewhere, that would correspond to around version 7.4, and hence considerably prior to version 7.24 (which doesn't appear to be released yet?).

@mihu-ov Has it been incorporated into the OV image or files, and if so, when?

Regarding "reordering writes," I disagree. Left to its own devices, sure... but as you pointed out, you can flush the stream, which forces a write. If writes are done infrequently and there is very little time between the write and the flush/sync, then this is probably sufficient. I still maintain that the double write would drastically improve safety, but simply flushing immediately after writing is probably already very safe.

The only other thing I can think of, which is NOT as critical as the setup but still important, is the flight logs. Do you have a technique for avoiding the same sort of corruption of flight logs?

linuxianer99 · 2022-08-02T14:37:51Z

@hpmax Is this still an issue?

hpmax · 2022-08-02T19:41:40Z

@linuxianer99 Honestly, I've been distracted with other things, mainly quashing radio interference (EMI) issues which appear to be emanating from the USB ports and external power converters.

If @MaxKellermann says this was a known issue that was addressed, I can accept that as I don't have any contrary information. Is there a similar solution for flight logs? I can imagine the solution is to keep a RAM buffer and perform a "disk write" followed by file close/flush after the RAM buffer fills. That would at least reduce the chance of corruption.

linuxianer99 added the "question" label (Further information is requested) on Aug 2, 2022
