protection against metadata inconsistency and/or data loss #2523

zebjonsa · 2022-08-12T12:23:35Z

zebjonsa
Aug 12, 2022

A small code review was performed on the Unlink() call in juicefs. There are constructs in the transaction handling that seem problematic.

Here is a pseudo-trace rundown of a file deletion from the FUSE layer.

func (fs *fileSystem) Unlink(cancel <-chan struct{}, header *fuse.InHeader, name string) (code fuse.Status) {
    func (v *VFS) Unlink(ctx Context, parent Ino, name string) (err syscall.Errno) {
        err = v.Meta.Unlink(ctx, parent, name)
        func (m *baseMeta) Unlink(ctx Context, parent Ino, name string) syscall.Errno {
            return m.en.doUnlink(ctx, m.checkRoot(parent), name)
            func (m *dbMeta) doUnlink(ctx Context, parent Ino, name string) syscall.Errno {
                m.fileDeleted(opened, n.Inode, n.Length) // outside tx
                func (m *baseMeta) fileDeleted(opened bool, inode Ino, length uint64) {
                    m.tryDeleteFileData(inode, length)
                        func (m *baseMeta) tryDeleteFileData(inode Ino, length uint64) {
                            m.en.doDeleteFileData(inode, length)
                                func (m *dbMeta) doDeleteFileData(inode Ino, length uint64) {
                                    err := m.deleteChunk(inode, c.Indx)
                                        func (m *dbMeta) deleteChunk(inode Ino, indx uint32) error {
                               A            err := m.txn(func(s *xorm.Session) error { UPDATE .. update jfs_chunk_ref set ...})
                               B            m.deleteSlice(s.id, s.size)
                                                func (m *baseMeta) deleteSlice(id uint64, size uint32) {
                               C                        if err := m.newMsg(DeleteSlice, id, size); err == nil || strings.Contains(err.Error(), "NoSuchKey") || strings.Contains(err.Error(), "not found") {
                                                            if err = m.en.doDeleteSlice(id, size); err != nil {
                               D                                    return m.txn(func(s *xorm.Session) error { _, err := s.Exec("delete from jfs_chunk_ref where chunkid=?", id) return err })
                               E    _ = m.txn(func(s *xorm.Session) error { _, err := s.Delete(delfile{Inode: inode}) return err })

A) deleteChunk() starts new tx, does changes to the database and commits it
B) deleteChunk() calls deleteSlice()
C) deleteSlice() calls m.newMsg(DeleteSlice) which runs the user's callback (which removes object data, as initialized by registerMetaMsg() in mount())
D) deleteSlice() starts new tx, updates the database and commits it
E) doDeleteFileData() starts a new tx, changes the database and commits it

Problem P1:

A proposal here is that if for example A) commits successfully but D) or E) fails then the resulting database state is invalid.
Alternatively, if for example D) commits successfully but E) fails then the resulting database state is invalid.

Problem P2:

A proposal here is that if C) runs successfully and the underlying object data is removed but D) and/or E) fail(s), then the user has lost object data. The next time the user tries to read this object, the metadata is found but the data is gone.

Summary:

P1. Statements that are part of the same logical operation do not execute in the same database transaction. This means that if one commit succeeds and the other fails, the properties of consistency and atomicity are violated.
P2. The coordination of object and metadata, which in itself can be seen as a transaction, may go out-of-sync, resulting in object data loss.

Proposed solutions:

P1. Perform all statements within the same transaction. If any statement fails, no changes take effect.
P2. It is not sufficient to first commit to DB and remove object later, because how then do you keep track of the need to remove the object if it fails? Nor is it sufficient to first remove the object and then commit to the database, because what if the commit fails (along with any other attempts to persist information about this situation). Therefore, commit intention (statements necessary for the operation) to database, remove object data and then perform the intention. If any step fails, it can be automatically re-tried.

Other information:

Running postgres SQL meta backend on commit fb191ed.
Only the Unlink() call was reviewed. The general issue of how transactional semantics are applied may potentially be a concern for other top-level calls as well.

davies · 2022-08-13T10:00:13Z

davies
Aug 13, 2022
Maintainer

It's slow to delete the data in object storage, so we can't put all these inside single transaction. When a file is ready to delete (no more links), we put a record into database to track that, and start a background job to delete the data in background. The deletion can fail, we will retry that by scanning the tracking records. Until we delete all the data in object storage, we delete the tracking record.

0 replies

SandyXSD · 2022-08-15T03:52:37Z

SandyXSD
Aug 15, 2022
Collaborator

You may follow the delfile table (the so-called tracking records) in sql.go to review more details.

0 replies

T-TRz879 · 2022-08-15T06:48:00Z

T-TRz879
Aug 15, 2022

JuiceFS using DelSlices,DelFiles and SliceRef ensures that even if there is a business interruption, the business result will not be affected,you can find them in JuiceFS's code. @zebjonsa

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

protection against metadata inconsistency and/or data loss #2523

{{title}}

Replies: 3 comments

{{title}}

{{title}}

{{title}}

Select a reply

protection against metadata inconsistency and/or data loss #2523

zebjonsa Aug 12, 2022

Replies: 3 comments

davies Aug 13, 2022 Maintainer

SandyXSD Aug 15, 2022 Collaborator

T-TRz879 Aug 15, 2022

zebjonsa
Aug 12, 2022

davies
Aug 13, 2022
Maintainer

SandyXSD
Aug 15, 2022
Collaborator

T-TRz879
Aug 15, 2022