mirror of https://github.com/torvalds/linux.git
The crash happens rather often when we reset some cluster nodes while nodes contend fiercely to do truncate and append. The crash backtrace is below: dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18) ocfs2: End replay journal (node 318952601, slot 2) on device (253,18) ocfs2: Beginning quota recovery on device (253,18) for slot 2 ocfs2: Finishing quota recovery on device (253,18) for slot 2 (truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode) (truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1 ------------[ cut here ]------------ kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 Supported: No, Unsupported modules are loaded CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014 task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000 RIP: 0010:[<ffffffffa05c8c30>] [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282 RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246 RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414 R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448 R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020 FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0 Call Trace: ocfs2_setattr+0x698/0xa90 [ocfs2] notify_change+0x1ae/0x380 do_truncate+0x5e/0x90 do_sys_ftruncate.constprop.11+0x108/0x160 entry_SYSCALL_64_fastpath+0x12/0x6d Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff <0f> 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff RIP [<ffffffffa05c8c30>] ocfs2_truncate_file+0x640/0x6c0 [ocfs2] It's because ocfs2_inode_lock() get us stale LVB in which the i_size is not equal to the disk i_size. We mistakenly trust the LVB because the underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with DLM_SBF_VALNOTVALID properly for us. But, why? The current code tries to downconvert lock without DLM_LKF_VALBLK flag to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even if the lock resource type needs LVB. This is not the right way for fsdlm. The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node failure happens. The following diagram briefly illustrates how this crash happens: RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB; The 1st round: Node1 Node2 RSB1: PR RSB1(master): NULL->EX ocfs2_downconvert_lock(PR->NULL, set_lvb==0) ocfs2_dlm_lock(no DLM_LKF_VALBLK) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - dlm_lock(no DLM_LKF_VALBLK) convert_lock(overwrite lkb->lkb_exflags with no DLM_LKF_VALBLK) RSB1: NULL RSB1: EX reset Node2 dlm_recover_rsbs() recover_lvb() /* The LVB is not trustable if the node with EX fails and * no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1. */ if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to return; * to invalid the LVB here. */ The 2nd round: Node 1 Node2 RSB1(become master from recovery) ocfs2_setattr() ocfs2_inode_lock(NULL->EX) /* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */ ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */ ocfs2_truncate_file() mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */ The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin is uesed. Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
||
|---|---|---|
| .. | ||
| 9p | ||
| adfs | ||
| affs | ||
| afs | ||
| autofs4 | ||
| befs | ||
| bfs | ||
| btrfs | ||
| cachefiles | ||
| ceph | ||
| cifs | ||
| coda | ||
| configfs | ||
| cramfs | ||
| crypto | ||
| debugfs | ||
| devpts | ||
| dlm | ||
| ecryptfs | ||
| efivarfs | ||
| efs | ||
| exofs | ||
| exportfs | ||
| ext2 | ||
| ext4 | ||
| f2fs | ||
| fat | ||
| freevxfs | ||
| fscache | ||
| fuse | ||
| gfs2 | ||
| hfs | ||
| hfsplus | ||
| hostfs | ||
| hpfs | ||
| hugetlbfs | ||
| isofs | ||
| jbd2 | ||
| jffs2 | ||
| jfs | ||
| kernfs | ||
| lockd | ||
| minix | ||
| ncpfs | ||
| nfs | ||
| nfs_common | ||
| nfsd | ||
| nilfs2 | ||
| nls | ||
| notify | ||
| ntfs | ||
| ocfs2 | ||
| omfs | ||
| openpromfs | ||
| orangefs | ||
| overlayfs | ||
| proc | ||
| pstore | ||
| qnx4 | ||
| qnx6 | ||
| quota | ||
| ramfs | ||
| reiserfs | ||
| romfs | ||
| squashfs | ||
| sysfs | ||
| sysv | ||
| tracefs | ||
| ubifs | ||
| udf | ||
| ufs | ||
| xfs | ||
| Kconfig | ||
| Kconfig.binfmt | ||
| Makefile | ||
| aio.c | ||
| anon_inodes.c | ||
| attr.c | ||
| bad_inode.c | ||
| binfmt_aout.c | ||
| binfmt_elf.c | ||
| binfmt_elf_fdpic.c | ||
| binfmt_em86.c | ||
| binfmt_flat.c | ||
| binfmt_misc.c | ||
| binfmt_script.c | ||
| block_dev.c | ||
| buffer.c | ||
| char_dev.c | ||
| compat.c | ||
| compat_binfmt_elf.c | ||
| compat_ioctl.c | ||
| coredump.c | ||
| dax.c | ||
| dcache.c | ||
| dcookies.c | ||
| direct-io.c | ||
| drop_caches.c | ||
| eventfd.c | ||
| eventpoll.c | ||
| exec.c | ||
| fcntl.c | ||
| fhandle.c | ||
| file.c | ||
| file_table.c | ||
| filesystems.c | ||
| fs-writeback.c | ||
| fs_pin.c | ||
| fs_struct.c | ||
| inode.c | ||
| internal.h | ||
| ioctl.c | ||
| iomap.c | ||
| libfs.c | ||
| locks.c | ||
| mbcache.c | ||
| mount.h | ||
| mpage.c | ||
| namei.c | ||
| namespace.c | ||
| no-block.c | ||
| nsfs.c | ||
| open.c | ||
| pipe.c | ||
| pnode.c | ||
| pnode.h | ||
| posix_acl.c | ||
| proc_namespace.c | ||
| read_write.c | ||
| readdir.c | ||
| select.c | ||
| seq_file.c | ||
| signalfd.c | ||
| splice.c | ||
| stack.c | ||
| stat.c | ||
| statfs.c | ||
| super.c | ||
| sync.c | ||
| timerfd.c | ||
| userfaultfd.c | ||
| utimes.c | ||
| xattr.c | ||