mirror of https://github.com/torvalds/linux.git
264 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
b898ab2336
|
ext4: convert to new timestamp accessors
Convert to using the new inode timestamp accessor functions. Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20231004185347.80880-33-jlayton@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
3ef96fcfd5 |
Many ext4 and jbd2 cleanups and bug fixes for v6.6-rc1.
* Cleanups in the ext4 remount code when going to and from read-only
* Cleanups in ext4's multiblock allocator
* Cleanups in the jbd2 setup/mounting code paths
* Performance improvements when appending to a delayed allocation file
* Miscenallenous syzbot and other bug fixes
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmTwqUMACgkQ8vlZVpUN
gaMqgwf6Aui6MlrtNJx6CrJt4dxLANQ8G6bsJ2Zr+6QNS1X/GAUrCCyLWWom1dfb
OJ/n4/JUCNc9v5yLCTqHOE5ZFTdQItOBJUKXbJYff8EdnR+zCUULpj6bPbEs5BKp
U7CiiZ9TIi9S2TWezvIJKIa2VxgPej7CH/HOt8ISh/Msq8nHvcEEJIyOEvVk9odQ
LEkiQCsikWaljB7qEOIYo+xgFffMZfttc4zuTkdr/h1I6OWhvQYmlwSnTuAiE7BS
BVf3ebD2Dg8TChUMXOsk2d743iZNWf/+yTfbXVu93/uEM9vgF0+HO6EerTK8RMeM
yxhshg9z7ccuFjdY/2NYDXe6pEuDKw==
=cMIX
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Many ext4 and jbd2 cleanups and bug fixes:
- Cleanups in the ext4 remount code when going to and from read-only
- Cleanups in ext4's multiblock allocator
- Cleanups in the jbd2 setup/mounting code paths
- Performance improvements when appending to a delayed allocation file
- Miscellaneous syzbot and other bug fixes"
* tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
ext4: fix slab-use-after-free in ext4_es_insert_extent()
libfs: remove redundant checks of s_encoding
ext4: remove redundant checks of s_encoding
ext4: reject casefold inode flag without casefold feature
ext4: use LIST_HEAD() to initialize the list_head in mballoc.c
ext4: do not mark inode dirty every time when appending using delalloc
ext4: rename s_error_work to s_sb_upd_work
ext4: add periodic superblock update check
ext4: drop dio overwrite only flag and associated warning
ext4: add correct group descriptors and reserved GDT blocks to system zone
ext4: remove unused function declaration
ext4: mballoc: avoid garbage value from err
ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple()
ext4: change the type of blocksize in ext4_mb_init_cache()
ext4: fix unttached inode after power cut with orphan file feature enabled
jbd2: correct the end of the journal recovery scan range
ext4: ext4_get_{dev}_journal return proper error value
ext4: cleanup ext4_get_dev_journal() and ext4_get_journal()
jbd2: jbd2_journal_init_{dev,inode} return proper error return value
jbd2: drop useless error tag in jbd2_journal_wipe()
...
|
|
|
|
ffb6844e28 |
ext4: drop read-only check in ext4_init_inode_table()
We better should not be initializing inode tables on read-only filesystem. The following transaction start will warn us and make the function bail anyway so drop the pointless check. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-8-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
eb8ab4443a |
ext4: make ext4_forced_shutdown() take struct super_block
Currently ext4_forced_shutdown() takes struct ext4_sb_info but most callers need to get it from struct super_block anyway. So just pass in struct super_block to save all callers from some boilerplate code. No functional changes. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230616165109.21695-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
1bc33893e7 |
ext4: convert to ctime accessor functions
In later patches, we're going to change how the inode's ctime field is used. Switch to using accessor functions instead of raw accesses of inode->i_ctime. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230705190309.579783-40-jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
|
|
|
5354b2af34 |
ext4: allow ext4_get_group_info() to fail
Previously, ext4_get_group_info() would treat an invalid group number as BUG(), since in theory it should never happen. However, if a malicious attaker (or fuzzer) modifies the superblock via the block device while it is the file system is mounted, it is possible for s_first_data_block to get set to a very large number. In that case, when calculating the block group of some block number (such as the starting block of a preallocation region), could result in an underflow and very large block group number. Then the BUG_ON check in ext4_get_group_info() would fire, resutling in a denial of service attack that can be triggered by root or someone with write access to the block device. For a quality of implementation perspective, it's best that even if the system administrator does something that they shouldn't, that it will not trigger a BUG. So instead of BUG'ing, ext4_get_group_info() will call ext4_error and return NULL. We also add fallback code in all of the callers of ext4_get_group_info() that it might NULL. Also, since ext4_get_group_info() was already borderline to be an inline function, un-inline it. The results in a next reduction of the compiled text size of ext4 by roughly 2k. Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230430154311.579720-2-tytso@mit.edu Reported-by: syzbot+e2efa3efc15a1c9e95c3@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> |
|
|
|
1df9bde48f |
ext4: remove unused group parameter in ext4_block_bitmap_csum_set
Remove unused group parameter in ext4_block_bitmap_csum_set. After this, group parameter in ext4_set_bitmap_checksums is also not used, just remove it too. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
4fd873c817 |
ext4: remove unused group parameter in ext4_inode_bitmap_csum_set
Remove unused group parameter in ext4_inode_bitmap_csum_set. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
b83acc7771 |
ext4: remove unused group parameter in ext4_inode_bitmap_csum_verify
Remove unused group parameter in ext4_inode_bitmap_csum_verify. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230221203027.2359920-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
c14329d39f
|
fs: port fs{g,u}id helpers to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
|
|
|
f2d40141d5
|
fs: port inode_init_owner() to mnt_idmap
Convert to struct mnt_idmap.
Last cycle we merged the necessary infrastructure in
|
|
|
|
deb9acc122 |
A large number of cleanups and bug fixes, with many of the bug fixes
found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc.) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented.) -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmOWqrMACgkQ8vlZVpUN gaMvmgf+P2C6vzjn13ZdF+GwFTi4fx4TJ5BZT78LQqvTZqhkfk4k1q2SFfHI7nXT ZWdu1KUQ0SYLo64oaSU9W+2B2pmGi/KgUlrwNhy8DFeGStogPuDVfmGWB63p1UQL ld42mE9q7bjY6nCZSKYXPp2jfSwsHuliHBJ4UfzVNAIwjiUEJ7pGeIrMFdLAEkVm TVNzvlUZaHUnVxhpsP6hs+5WNhHQ2IhWz4rwX01ussNgHTijYac4iaL05wpTvF5e 6NtvfmpOEMAbYrmIkJX4RVss4JNsHNOC0E8fjEHlgXJxBiAI6w8GxTxrS52Y4ELH nHXl/pc0L+I8+yh9B9+s0LBaSuPuTg== =lezv -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A large number of cleanups and bug fixes, with many of the bug fixes found by Syzbot and fuzzing. (Many of the bug fixes involve less-used ext4 features such as fast_commit, inline_data and bigalloc) In addition, remove the writepage function for ext4, since the medium-term plan is to remove ->writepage() entirely. (The VM doesn't need or want writepage() for writeback, since it is fine with ->writepages() so long as ->migrate_folio() is implemented)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: fix reserved cluster accounting in __es_remove_extent() ext4: fix inode leak in ext4_xattr_inode_create() on an error path ext4: allocate extended attribute value in vmalloc area ext4: avoid unaccounted block allocation when expanding inode ext4: initialize quota before expanding inode in setproject ioctl ext4: stop providing .writepage hook mm: export buffer_migrate_folio_norefs() ext4: switch to using write_cache_pages() for data=journal writeout jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout ext4: switch to using ext4_do_writepages() for ordered data writeout ext4: move percpu_rwsem protection into ext4_writepages() ext4: provide ext4_do_writepages() ext4: add support for writepages calls that cannot map blocks ext4: drop pointless IO submission from ext4_bio_write_page() ext4: remove nr_submitted from ext4_bio_write_page() ext4: move keep_towrite handling to ext4_bio_write_page() ext4: handle redirtying in ext4_bio_write_page() ext4: fix kernel BUG in 'ext4_write_inline_data_end()' ext4: make ext4_mb_initialize_context return void ext4: fix deadlock due to mbcache entry corruption ... |
|
|
|
6a518afcc2 |
fs.acl.rework.v6.2
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCY5bwTgAKCRCRxhvAZXjc
ovd2AQCK00NAtGjQCjQPQGyTa4GAPqvWgq1ef0lnhv+TL5US5gD9FncQ8UofeMXt
pBfjtAD6ettTPCTxUQfnTwWEU4rc7Qg=
=27Wm
-----END PGP SIGNATURE-----
Merge tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull VFS acl updates from Christian Brauner:
"This contains the work that builds a dedicated vfs posix acl api.
The origins of this work trace back to v5.19 but it took quite a while
to understand the various filesystem specific implementations in
sufficient detail and also come up with an acceptable solution.
As we discussed and seen multiple times the current state of how posix
acls are handled isn't nice and comes with a lot of problems: The
current way of handling posix acls via the generic xattr api is error
prone, hard to maintain, and type unsafe for the vfs until we call
into the filesystem's dedicated get and set inode operations.
It is already the case that posix acls are special-cased to death all
the way through the vfs. There are an uncounted number of hacks that
operate on the uapi posix acl struct instead of the dedicated vfs
struct posix_acl. And the vfs must be involved in order to interpret
and fixup posix acls before storing them to the backing store, caching
them, reporting them to userspace, or for permission checking.
Currently a range of hacks and duct tape exist to make this work. As
with most things this is really no ones fault it's just something that
happened over time. But the code is hard to understand and difficult
to maintain and one is constantly at risk of introducing bugs and
regressions when having to touch it.
Instead of continuing to hack posix acls through the xattr handlers
this series builds a dedicated posix acl api solely around the get and
set inode operations.
Going forward, the vfs_get_acl(), vfs_remove_acl(), and vfs_set_acl()
helpers must be used in order to interact with posix acls. They
operate directly on the vfs internal struct posix_acl instead of
abusing the uapi posix acl struct as we currently do. In the end this
removes all of the hackiness, makes the codepaths easier to maintain,
and gets us type safety.
This series passes the LTP and xfstests suites without any
regressions. For xfstests the following combinations were tested:
- xfs
- ext4
- btrfs
- overlayfs
- overlayfs on top of idmapped mounts
- orangefs
- (limited) cifs
There's more simplifications for posix acls that we can make in the
future if the basic api has made it.
A few implementation details:
- The series makes sure to retain exactly the same security and
integrity module permission checks. Especially for the integrity
modules this api is a win because right now they convert the uapi
posix acl struct passed to them via a void pointer into the vfs
struct posix_acl format to perform permission checking on the mode.
There's a new dedicated security hook for setting posix acls which
passes the vfs struct posix_acl not a void pointer. Basing checking
on the posix acl stored in the uapi format is really unreliable.
The vfs currently hacks around directly in the uapi struct storing
values that frankly the security and integrity modules can't
correctly interpret as evidenced by bugs we reported and fixed in
this area. It's not necessarily even their fault it's just that the
format we provide to them is sub optimal.
- Some filesystems like 9p and cifs need access to the dentry in
order to get and set posix acls which is why they either only
partially or not even at all implement get and set inode
operations. For example, cifs allows setxattr() and getxattr()
operations but doesn't allow permission checking based on posix
acls because it can't implement a get acl inode operation.
Thus, this patch series updates the set acl inode operation to take
a dentry instead of an inode argument. However, for the get acl
inode operation we can't do this as the old get acl method is
called in e.g., generic_permission() and inode_permission(). These
helpers in turn are called in various filesystem's permission inode
operation. So passing a dentry argument to the old get acl inode
operation would amount to passing a dentry to the permission inode
operation which we shouldn't and probably can't do.
So instead of extending the existing inode operation Christoph
suggested to add a new one. He also requested to ensure that the
get and set acl inode operation taking a dentry are consistently
named. So for this version the old get acl operation is renamed to
->get_inode_acl() and a new ->get_acl() inode operation taking a
dentry is added. With this we can give both 9p and cifs get and set
acl inode operations and in turn remove their complex custom posix
xattr handlers.
In the future I hope to get rid of the inode method duplication but
it isn't like we have never had this situation. Readdir is just one
example. And frankly, the overall gain in type safety and the more
pleasant api wise are simply too big of a benefit to not accept
this duplication for a while.
- We've done a full audit of every codepaths using variant of the
current generic xattr api to get and set posix acls and
surprisingly it isn't that many places. There's of course always a
chance that we might have missed some and if so I'm sure we'll find
them soon enough.
The crucial codepaths to be converted are obviously stacking
filesystems such as ecryptfs and overlayfs.
For a list of all callers currently using generic xattr api helpers
see [2] including comments whether they support posix acls or not.
- The old vfs generic posix acl infrastructure doesn't obey the
create and replace semantics promised on the setxattr(2) manpage.
This patch series doesn't address this. It really is something we
should revisit later though.
The patches are roughly organized as follows:
(1) Change existing set acl inode operation to take a dentry
argument (Intended to be a non-functional change)
(2) Rename existing get acl method (Intended to be a non-functional
change)
(3) Implement get and set acl inode operations for filesystems that
couldn't implement one before because of the missing dentry.
That's mostly 9p and cifs (Intended to be a non-functional
change)
(4) Build posix acl api, i.e., add vfs_get_acl(), vfs_remove_acl(),
and vfs_set_acl() including security and integrity hooks
(Intended to be a non-functional change)
(5) Implement get and set acl inode operations for stacking
filesystems (Intended to be a non-functional change)
(6) Switch posix acl handling in stacking filesystems to new posix
acl api now that all filesystems it can stack upon support it.
(7) Switch vfs to new posix acl api (semantical change)
(8) Remove all now unused helpers
(9) Additional regression fixes reported after we merged this into
linux-next
Thanks to Seth for a lot of good discussion around this and
encouragement and input from Christoph"
* tag 'fs.acl.rework.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: (36 commits)
posix_acl: Fix the type of sentinel in get_acl
orangefs: fix mode handling
ovl: call posix_acl_release() after error checking
evm: remove dead code in evm_inode_set_acl()
cifs: check whether acl is valid early
acl: make vfs_posix_acl_to_xattr() static
acl: remove a slew of now unused helpers
9p: use stub posix acl handlers
cifs: use stub posix acl handlers
ovl: use stub posix acl handlers
ecryptfs: use stub posix acl handlers
evm: remove evm_xattr_acl_change()
xattr: use posix acl api
ovl: use posix acl api
ovl: implement set acl method
ovl: implement get acl method
ecryptfs: implement set acl method
ecryptfs: implement get acl method
ksmbd: use vfs_remove_acl()
acl: add vfs_remove_acl()
...
|
|
|
|
5f3e240321 |
ext4: split ext4_journal_start trace for debug
we might want to know why jbd2 thread using high io for detail, split ext4_journal_start trace to ext4_journal_start_sb and ext4_journal_start_inode, show ino and handle type when possible. Signed-off-by: changfengnan <changfengnan@bytedance.com> Link: https://lore.kernel.org/r/20221008120518.74870-1-changfengnan@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
8032bf1233 |
treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by: @@ expression E; @@ - prandom_u32_max + get_random_u32_below (E) Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Reviewed-by: SeongJae Park <sj@kernel.org> # for damon Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
|
|
|
cac2f8b8d8
|
fs: rename current get acl method
The current way of setting and getting posix acls through the generic xattr interface is error prone and type unsafe. The vfs needs to interpret and fixup posix acls before storing or reporting it to userspace. Various hacks exist to make this work. The code is hard to understand and difficult to maintain in it's current form. Instead of making this work by hacking posix acls through xattr handlers we are building a dedicated posix acl api around the get and set inode operations. This removes a lot of hackiness and makes the codepaths easier to maintain. A lot of background can be found in [1]. The current inode operation for getting posix acls takes an inode argument but various filesystems (e.g., 9p, cifs, overlayfs) need access to the dentry. In contrast to the ->set_acl() inode operation we cannot simply extend ->get_acl() to take a dentry argument. The ->get_acl() inode operation is called from: acl_permission_check() -> check_acl() -> get_acl() which is part of generic_permission() which in turn is part of inode_permission(). Both generic_permission() and inode_permission() are called in the ->permission() handler of various filesystems (e.g., overlayfs). So simply passing a dentry argument to ->get_acl() would amount to also having to pass a dentry argument to ->permission(). We should avoid this unnecessary change. So instead of extending the existing inode operation rename it from ->get_acl() to ->get_inode_acl() and add a ->get_acl() method later that passes a dentry argument and which filesystems that need access to the dentry can implement instead of ->get_inode_acl(). Filesystems like cifs which allow setting and getting posix acls but not using them for permission checking during lookup can simply not implement ->get_inode_acl(). This is intended to be a non-functional change. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Suggested-by/Inspired-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org> |
|
|
|
a251c17aa5 |
treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around get_random_u32() for several releases now, and compiles down to the exact same code. Replace the deprecated wrapper with a direct call to the real function. The same also applies to get_random_int(), which is just a wrapper around get_random_u32(). This was done as a basic find and replace. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs Acked-by: Helge Deller <deller@gmx.de> # for parisc Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
|
|
|
8b3ccbc1f1 |
treewide: use prandom_u32_max() when possible, part 2
Rather than incurring a division or requesting too many random bytes for the given range, use the prandom_u32_max() function, which only takes the minimum required bytes from the RNG and avoids divisions. This was done by hand, covering things that coccinelle could not do on its own. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> # for ext2, ext4, and sbitmap Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> |
|
|
|
613c5a8589 |
ext4: make directory inode spreading reflect flexbg size
Currently the Orlov inode allocator searches for free inodes for a directory only in flex block groups with at most inodes_per_group/16 more directory inodes than average per flex block group. However with growing size of flex block group this becomes unnecessarily strict. Scale allowed difference from average directory count per flex block group with flex block group size as we do with other metrics. Tested-by: Stefan Wahren <stefan.wahren@i2se.com> Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/ Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
188c299e2a |
ext4: Support for checksumming from journal triggers
JBD2 layer support triggers which are called when journaling layer moves buffer to a certain state. We can use the frozen trigger, which gets called when buffer data is frozen and about to be written out to the journal, to compute block checksums for some buffer types (similarly as does ocfs2). This avoids unnecessary repeated recomputation of the checksum (at the cost of larger window where memory corruption won't be caught by checksumming) and is even necessary when there are unsynchronized updaters of the checksummed data. So add superblock and journal trigger type arguments to ext4_journal_get_write_access() and ext4_journal_get_create_access() so that frozen triggers can be set accordingly. Also add inode argument to ext4_walk_page_buffers() and all the callbacks used with that function for the same purpose. This patch is mostly only a change of prototype of the above mentioned functions and a few small helpers. Real checksumming will come later. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210816095713.16537-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
c89849cc02 |
ext4: fix avefreec in find_group_orlov
The avefreec should be average free clusters instead of average free blocks, otherwize Orlov's allocator will not work properly when bigalloc enabled. Cc: stable@kernel.org Signed-off-by: Pan Dong <pandong.peter@bytedance.com> Link: https://lore.kernel.org/r/20210525073656.31594-1-pandong.peter@bytedance.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
20e41d9bc8 |
Miscellaneous ext4 bug fixes for v5.13
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmC82AQACgkQ8vlZVpUN gaOkAgf+KH57P/P0sB6aVBHpAzqa9jTKJWMA5kpCqYUDkYlfF7n2hwsjMzWpJ5MY ZvFpKAflmRnve/ULUZQX6+zrcbieNs3e+6VFZrZ0PmxN0dupyISLY7jnvCRDleA7 BFO34AcH+QEst9zXJmgta9eoy3LA8sawhQ/d7ujVY+IRFk40m26fuAMiaGznlQJ5 dmrx7pHZWKFIDFIg2TdFlP+Voqbxs2VTT16gmWpGBdTyWYHKjbSOLKJFc9DwYeE9 aANf6iIzwXz7y9pZiOnTrGuKDEJcIZNESkbIqw62YgqsoObLbsbCZNmNcqxyHpYQ Mh3L59KtmjANW3iOxQfyxkNTugxchw== =BSnf -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Miscellaneous ext4 bug fixes" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Only advertise encrypted_casefold when encryption and unicode are enabled ext4: fix no-key deletion for encrypt+casefold ext4: fix memory leak in ext4_fill_super ext4: fix fast commit alignment issues ext4: fix bug on in ext4_es_cache_extent as ext4_split_extent_at failed ext4: fix accessing uninit percpu counter variable with fast_commit ext4: fix memory leak in ext4_mb_init_backend on error path. |
|
|
|
b45f189a19 |
ext4: fix accessing uninit percpu counter variable with fast_commit
When running generic/527 with fast_commit configuration, the following
issue is seen on Power. With fast_commit, during ext4_fc_replay()
(which can be called from ext4_fill_super()), if inode eviction
happens then it can access an uninitialized percpu counter variable.
This patch adds the check before accessing the counters in
ext4_free_inode() path.
[ 321.165371] run fstests generic/527 at 2021-04-29 08:38:43
[ 323.027786] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: block_validity. Quota mode: none.
[ 323.618772] BUG: Unable to handle kernel data access on read at 0x1fbd80000
[ 323.619767] Faulting instruction address: 0xc000000000bae78c
cpu 0x1: Vector: 300 (Data Access) at [c000000010706ef0]
pc: c000000000bae78c: percpu_counter_add_batch+0x3c/0x100
lr: c0000000006d0bb0: ext4_free_inode+0x780/0xb90
pid = 5593, comm = mount
ext4_free_inode+0x780/0xb90
ext4_evict_inode+0xa8c/0xc60
evict+0xfc/0x1e0
ext4_fc_replay+0xc50/0x20f0
do_one_pass+0xfe0/0x1350
jbd2_journal_recover+0x184/0x2e0
jbd2_journal_load+0x1c0/0x4a0
ext4_fill_super+0x2458/0x4200
mount_bdev+0x1dc/0x290
ext4_mount+0x28/0x40
legacy_get_tree+0x4c/0xa0
vfs_get_tree+0x4c/0x120
path_mount+0xcf8/0xd70
do_mount+0x80/0xd0
sys_mount+0x3fc/0x490
system_call_exception+0x384/0x3d0
system_call_common+0xec/0x278
Cc: stable@kernel.org
Fixes:
|
|
|
|
9f67672a81 |
New features for ext4 this cycle include support for encrypted
casefold, ensure that deleted file names are cleared in directory blocks by zeroing directory entries when they are unlinked or moved as part of a hash tree node split. We also improve the block allocator's performance on a freshly mounted file system by prefetching block bitmaps. There are also the usual cleanups and bug fixes, including fixing a page cache invalidation race when there is mixed buffered and direct I/O and the block size is less than page size, and allow the dax flag to be set and cleared on inline directories. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmCLei4ACgkQ8vlZVpUN gaPZkgf/VH08xjMf3VthC+BpvVmChQXfV4yjigHbO2pmPyYWZhyJzkEGCQD8u2eB b7ShW+B1NCifcTU34xAkKHwEtakzzEv3WIMrT1oZNWrpfo8tt850EkwQggaGGDpd /HnP1/wLtziJ5hE6DwutmX7qB4VFghVj898MjDrEPSOBqItOjWps9mn/JWL7SHyI Dqzhf5XZTYPaXWuJmSmKw3q8O70JDHnZe/rRWlfX1jLI5KDtqp71Nw1B+gszUB66 IUdncyZKvInsyjYhkbCQ8U6WFih82MrbKeuGYDp/RFvg5eMELEYkwT9j0ofuDHq8 zn62sAlbOXv1DiqkPDHKVm9GkHx8/g== =UpnH -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "New features for ext4 this cycle include support for encrypted casefold, ensure that deleted file names are cleared in directory blocks by zeroing directory entries when they are unlinked or moved as part of a hash tree node split. We also improve the block allocator's performance on a freshly mounted file system by prefetching block bitmaps. There are also the usual cleanups and bug fixes, including fixing a page cache invalidation race when there is mixed buffered and direct I/O and the block size is less than page size, and allow the dax flag to be set and cleared on inline directories" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits) ext4: wipe ext4_dir_entry2 upon file deletion ext4: Fix occasional generic/418 failure fs: fix reporting supported extra file attributes for statx() ext4: allow the dax flag to be set and cleared on inline directories ext4: fix debug format string warning ext4: fix trailing whitespace ext4: fix various seppling typos ext4: fix error return code in ext4_fc_perform_commit() ext4: annotate data race in jbd2_journal_dirty_metadata() ext4: annotate data race in start_this_handle() ext4: fix ext4_error_err save negative errno into superblock ext4: fix error code in ext4_commit_super ext4: always panic when errors=panic is specified ext4: delete redundant uptodate check for buffer ext4: do not set SB_ACTIVE in ext4_orphan_cleanup() ext4: make prefetch_block_bitmaps default ext4: add proc files to monitor new structures ext4: improve cr 0 / cr 1 group scanning ext4: add MB_NUM_ORDERS macro ext4: add mballoc stats proc file ... |
|
|
|
4811d9929c |
ext4: allow the dax flag to be set and cleared on inline directories
This is needed to allow generic/607 to pass for file systems with the inline data_feature enabled, and it allows the use of file systems where the directories use inline_data, while the files are accessed via DAX. Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
a149d2a5ca |
ext4: fix check to prevent false positive report of incorrect used inodes
Commit <50122847007> ("ext4: fix check to prevent initializing reserved
inodes") check the block group zero and prevent initializing reserved
inodes. But in some special cases, the reserved inode may not all belong
to the group zero, it may exist into the second group if we format
filesystem below.
mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda
So, it will end up triggering a false positive report of a corrupted
file system. This patch fix it by avoid check reserved inodes if no free
inode blocks will be zeroed.
Cc: stable@kernel.org
Fixes:
|
|
|
|
db998553cf
|
fs: introduce two inode i_{u,g}id initialization helpers
Give filesystem two little helpers that do the right thing when initializing the i_uid and i_gid fields on idmapped and non-idmapped mounts. Filesystems shouldn't have to be concerned with too many details. Link: https://lore.kernel.org/r/20210320122623.599086-5-christian.brauner@ubuntu.com Inspired-by: Vivek Goyal <vgoyal@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
a65e58e791
|
fs: document and rename fsid helpers
Vivek pointed out that the fs{g,u}id_into_mnt() naming scheme can be
misleading as it could be understood as implying they do the exact same
thing as i_{g,u}id_into_mnt(). The original motivation for this naming
scheme was to signal to callers that the helpers will always take care
to map the k{g,u}id such that the ownership is expressed in terms of the
mnt_users.
Get rid of the confusion by renaming those helpers to something more
sensible. Al suggested mapped_fs{g,u}id() which seems a really good fit.
Usually filesystems don't need to bother with these helpers directly
only in some cases where they allocate objects that carry {g,u}ids which
are either filesystem specific (e.g. xfs quota objects) or don't have a
clean set of helpers as inodes have.
Link: https://lore.kernel.org/r/20210320122623.599086-3-christian.brauner@ubuntu.com
Inspired-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
|
|
7d6beb71da |
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
|
|
|
|
c6bf3f0e25 |
block: use an on-stack bio in blkdev_issue_flush
There is no point in allocating memory for a synchronous flush. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
14f3db5542
|
ext4: support idmapped mounts
Enable idmapped mounts for ext4. All dedicated helpers we need for this
exist. So this basically just means we're passing down the
user_namespace argument from the VFS methods to the relevant helpers.
Let's create simple example where we idmap an ext4 filesystem:
root@f2-vm:~# truncate -s 5G ext4.img
root@f2-vm:~# mkfs.ext4 ./ext4.img
mke2fs 1.45.5 (07-Jan-2020)
Discarding device blocks: done
Creating filesystem with 1310720 4k blocks and 327680 inodes
Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
root@f2-vm:~# losetup -f --show ./ext4.img
/dev/loop0
root@f2-vm:~# mount /dev/loop0 /mnt
root@f2-vm:~# ls -al /mnt/
total 24
drwxr-xr-x 3 root root 4096 Oct 28 13:34 .
drwxr-xr-x 30 root root 4096 Oct 28 13:22 ..
drwx------ 2 root root 16384 Oct 28 13:34 lost+found
# Let's create an idmapped mount at /idmapped1 where we map uid and gid
# 0 to uid and gid 1000
root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/
root@f2-vm:/# ls -al /idmapped1/
total 24
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 13:34 .
drwxr-xr-x 30 root root 4096 Oct 28 13:22 ..
drwx------ 2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found
# Let's create an idmapped mount at /idmapped2 where we map uid and gid
# 0 to uid and gid 2000
root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/
root@f2-vm:/# ls -al /idmapped2/
total 24
drwxr-xr-x 3 2000 2000 4096 Oct 28 13:34 .
drwxr-xr-x 31 root root 4096 Oct 28 13:39 ..
drwx------ 2 2000 2000 16384 Oct 28 13:34 lost+found
Let's create another example where we idmap the rootfs filesystem
without a mapping for uid 0 and gid 0:
# Create an idmapped mount of for a full POSIX range of rootfs under
# /mnt but without a mapping for uid 0 to reduce attack surface
root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/
# Since we don't have a mapping for uid and gid 0 all files owned by
# uid and gid 0 should show up as uid and gid 65534:
root@f2-vm:/# ls -al /mnt/
total 664
drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 .
drwxr-xr-x 31 root root 4096 Oct 28 13:39 ..
lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 bin -> usr/bin
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 13:17 boot
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:48 dev
drwxr-xr-x 81 nobody nogroup 4096 Oct 28 04:00 etc
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 home
lrwxrwxrwx 1 nobody nogroup 7 Aug 25 07:44 lib -> usr/lib
lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib32 -> usr/lib32
lrwxrwxrwx 1 nobody nogroup 9 Aug 25 07:44 lib64 -> usr/lib64
lrwxrwxrwx 1 nobody nogroup 10 Aug 25 07:44 libx32 -> usr/libx32
drwx------ 2 nobody nogroup 16384 Aug 25 07:47 lost+found
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 media
drwxr-xr-x 31 nobody nogroup 4096 Oct 28 13:39 mnt
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 opt
drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 proc
drwx--x--x 6 nobody nogroup 4096 Oct 28 13:34 root
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:46 run
lrwxrwxrwx 1 nobody nogroup 8 Aug 25 07:44 sbin -> usr/sbin
drwxr-xr-x 2 nobody nogroup 4096 Aug 25 07:44 srv
drwxr-xr-x 2 nobody nogroup 4096 Apr 15 2020 sys
drwxrwxrwt 10 nobody nogroup 4096 Oct 28 13:19 tmp
drwxr-xr-x 14 nobody nogroup 4096 Oct 20 13:00 usr
drwxr-xr-x 12 nobody nogroup 4096 Aug 25 07:45 var
# Since we do have a mapping for uid and gid 1000 all files owned by
# uid and gid 1000 should simply show up as uid and gid 1000:
root@f2-vm:/# ls -al /mnt/home/ubuntu/
total 40
drwxr-xr-x 3 ubuntu ubuntu 4096 Oct 28 00:43 .
drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
-rw------- 1 ubuntu ubuntu 2936 Oct 28 12:26 .bash_history
-rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
-rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
-rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
-rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
-rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo
Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
|
|
|
|
21cb47be6f
|
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
96485e4462 |
The siginificant new ext4 feature this time around is Harshad's new
fast_commit mode. In addition, thanks to Mauricio for fixing a race where mmap'ed pages that are being changed in parallel with a data=journal transaction commit could result in bad checksums in the failure that could cause journal replays to fail. Also notable is Ritesh's buffered write optimization which can result in significant improvements on parallel write workloads. (The kernel test robot reported a 330.6% improvement on fio.write_iops on a 96 core system using DAX[1].) Besides that, we have the usual miscellaneous cleanups and bug fixes. [1] https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl+RuCQACgkQ8vlZVpUN gaNebgf/dUnQp5SG2/2zczSDqr+f8DOiuAdn9I54BAr2HwdkMbbiktKfenfpu41k SMGNV6rYSs248dWFtkzM7C2T1dpGrdAe2OCYrU6HPR/xoZlx/RcDz39u7nXBDeup NV7RnPgIzCAGZXCOY/Zu1k88T1eosLRTIWvIcNOspt75MC0vJ8GSmkx1bVEUsv8w Uq6T0OREfDiLJpEZxtfbl3o+8Rfs82t3Soj4pwN8ESL/RWBTT8PlwAGhIcdjnHy/ lsgT35IrY4OL6Eas9msUmFYrWhO6cW21kWOugYALQXZ3ny4A+r5nZZcY/wCq01NX J2Z02ZiMTZUmFFREbtc0eJukXWEVvA== =14K9 -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "The siginificant new ext4 feature this time around is Harshad's new fast_commit mode. In addition, thanks to Mauricio for fixing a race where mmap'ed pages that are being changed in parallel with a data=journal transaction commit could result in bad checksums in the failure that could cause journal replays to fail. Also notable is Ritesh's buffered write optimization which can result in significant improvements on parallel write workloads. (The kernel test robot reported a 330.6% improvement on fio.write_iops on a 96 core system using DAX) Besides that, we have the usual miscellaneous cleanups and bug fixes" Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: fix invalid inode checksum ext4: add fast commit stats in procfs ext4: add a mount opt to forcefully turn fast commits on ext4: fast commit recovery path jbd2: fast commit recovery path ext4: main fast-commit commit path jbd2: add fast commit machinery ext4 / jbd2: add fast commit initialization ext4: add fast_commit feature and handling for extended mount options doc: update ext4 and journalling docs to include fast commit feature ext4: Detect already used quota file early jbd2: avoid transaction reuse after reformatting ext4: use the normal helper to get the actual inode ext4: fix bs < ps issue reported with dioread_nolock mount opt ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() ext4: data=journal: fixes for ext4_page_mkwrite() jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable() ext4: use ext4_sb_bread() instead of sb_bread() ... |
|
|
|
8016e29f43 |
ext4: fast commit recovery path
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
2d069c0889 |
ext4: use common helpers in all places reading metadata buffers
Revome all open codes that read metadata buffers, switch to use ext4_read_bh_*() common helpers. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
d9befedaaf |
ext4: clear buffer verified flag if read meta block from disk
The metadata buffer is no longer trusted after we read it from disk again because it is not uptodate for some reasons (e.g. failed to write back). Otherwise we may get below memory corruption problem in ext4_ext_split()->memset() if we read stale data from the newly allocated extent block on disk which has been failed to async write out but miss verify again since the verified bit has already been set on the buffer. [ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000 ... [ 29.783317] Oops: 0002 [#2] SMP [ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800 [ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W [ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8 [ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364 [ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000 [ 29.789288] Workqueue: writeback wb_workfn [ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 29.790321] (flush-8:0) [ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0 [ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 29.792839] RIP: 0010:__memset+0x24/0x30 [ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033 [ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt ... [ 29.808149] Call Trace: [ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.810161] ext4_writepages+0xc85/0x17c0 ... Fix this by clearing buffer's verified bit if we read meta block from disk again. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
02ce5316af |
ext4: use fscrypt_prepare_new_inode() and fscrypt_set_context()
Convert ext4 to use the new functions fscrypt_prepare_new_inode() and fscrypt_set_context(). This avoids calling fscrypt_get_encryption_info() from within a transaction, which can deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe. For more details about this problem, see the earlier patch "fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()". Link: https://lore.kernel.org/r/20200917041136.178600-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> |
|
|
|
177cc0e710 |
ext4: factor out ext4_xattr_credits_for_new_inode()
To compute a new inode's xattr credits, we need to know whether the inode will be encrypted or not. When we switch to use the new helper function fscrypt_prepare_new_inode(), we won't find out whether the inode will be encrypted until slightly later than is currently the case. That will require moving the code block that computes the xattr credits. To make this easier and reduce the length of __ext4_new_inode(), move this code block into a new function ext4_xattr_credits_for_new_inode(). Link: https://lore.kernel.org/r/20200917041136.178600-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> |
|
|
|
3be20b6fc1 |
This is the second round of ext4 commits for 5.8 merge window. It
includes the per-inode DAX support, which was dependant on the DAX
infrastructure which came in via the XFS tree, and a number of
regression and bug fixes; most notably the "BUG: using
smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
by syzkaller.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7mgCcACgkQ8vlZVpUN
gaPftwf8C4w/7SG+CYLdwg0d2u9TKk77yDuWaioFHOcMSjZvG4TCSgtMhZxQnyty
9t4yqacILx12pCj/mZnrZp5BOSn9O2ZbuDoXNKNrFXU0BF+CsbnhvJvrrh1j/MUa
PPtcqyGFdOLSDvHSD9xPVT76juwh79aR8vB7qnQXaEO5wcLodZWoqBEFSKCl6Bo8
hjXs1EXidusKsoarQxW6mEITmnhU2S2fuCVDgVcoM/LmKwzbgqvlWrentq9u8qLH
W+XbjWgUtCM1byeDZWqe5FYyyJ8x+dTv7H5an3KR92EN6hKo5AOvzA0I41pZscq/
bJ9p2THDxJQX4rJBevGAS5mZ6hTkRw==
=z6eO
-----END PGP SIGNATURE-----
Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull more ext4 updates from Ted Ts'o:
"This is the second round of ext4 commits for 5.8 merge window [1].
It includes the per-inode DAX support, which was dependant on the DAX
infrastructure which came in via the XFS tree, and a number of
regression and bug fixes; most notably the "BUG: using
smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
by syzkaller"
[1] The pull request actually came in 15 minutes after I had tagged the
rc1 release. Tssk, tssk, late.. - Linus
* tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
ext4: support xattr gnu.* namespace for the Hurd
ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
ext4: avoid utf8_strncasecmp() with unstable name
ext4: stop overwrite the errcode in ext4_setup_super
ext4: fix partial cluster initialization when splitting extent
ext4: avoid race conditions when remounting with options that change dax
Documentation/dax: Update DAX enablement for ext4
fs/ext4: Introduce DAX inode flag
fs/ext4: Remove jflag variable
fs/ext4: Make DAX mount option a tri-state
fs/ext4: Only change S_DAX on inode load
fs/ext4: Update ext4_should_use_dax()
fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
fs/ext4: Disallow verity if inode is DAX
fs/ext4: Narrow scope of DAX check in setflags
|
|
|
|
68cd44920d |
Enable ext4 support for per-file/directory dax operations
This adds the same per-file/per-directory DAX support for ext4 as was done for xfs, now that we finally have consensus over what the interface should be. |
|
|
|
0b166a57e6 |
A lot of bug fixes and cleanups for ext4, including:
* Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
* Clean up fiemap handling in ext4
* Clean up and refactor multiple block allocator (mballoc) code
* Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
* Fixed a race in ext4_sync_parent() versus rename()
* Simplify the error handling in the extent manipulation code
* Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
ext4_make_inode_dirty()'s callers.
* Avoid passing an error pointer to brelse in ext4_xattr_set()
* Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
* Fix refcount handling if ext4_iget() fails
* Fix a crash in generic/019 caused by a corrupted extent node
-----BEGIN PGP SIGNATURE-----
iQEyBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl7Ze8kACgkQ8vlZVpUN
gaNChAf4xn0ytFSrweI/S2Sp05G/2L/ocZ2TZZk2ZdGeN1E+ABdSIv/zIF9zuFgZ
/pY/C+fyEZWt4E3FlNO8gJzoEedkzMCMnUhSIfI+wZbcclyTOSNMJtnrnJKAEtVH
HOvGZJmg357jy407RCGhZpJ773nwU2xhBTr5OFxvSf9mt/vzebxIOnw5D7HPlC1V
Fgm6Du8q+tRrPsyjv1Yu4pUEVXMJ7qUcvt326AXVM3kCZO1Aa5GrURX0w3J4mzW1
tc1tKmtbLcVVYTo9CwHXhk/edbxrhAydSP2iACand3tK6IJuI6j9x+bBJnxXitnr
vsxsfTYMG18+2SxrJ9LwmagqmrRq
=HMTs
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of bug fixes and cleanups for ext4, including:
- Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
- Clean up fiemap handling in ext4
- Clean up and refactor multiple block allocator (mballoc) code
- Fix a problem with mballoc with a smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
- Fixed a race in ext4_sync_parent() versus rename()
- Simplify the error handling in the extent manipulation code
- Make sure all metadata I/O errors are felected to
ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.
- Avoid passing an error pointer to brelse in ext4_xattr_set()
- Fix race which could result to freeing an inode on the dirty last
in data=journal mode.
- Fix refcount handling if ext4_iget() fails
- Fix a crash in generic/019 caused by a corrupted extent node"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: avoid unnecessary transaction starts during writeback
ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
fs: remove the access_ok() check in ioctl_fiemap
fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
fs: move fiemap range validation into the file systems instances
iomap: fix the iomap_fiemap prototype
fs: move the fiemap definitions out of fs.h
fs: mark __generic_block_fiemap static
ext4: remove the call to fiemap_check_flags in ext4_fiemap
ext4: split _ext4_fiemap
ext4: fix fiemap size checks for bitmap files
ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
add comment for ext4_dir_entry_2 file_type member
jbd2: avoid leaking transaction credits when unreserving handle
ext4: drop ext4_journal_free_reserved()
ext4: mballoc: use lock for checking free blocks while retrying
ext4: mballoc: refactor ext4_mb_good_group()
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
ext4: mballoc: refactor ext4_mb_discard_preallocations()
...
|
|
|
|
3bbd0ef260 |
ext4: fix buffer_head refcnt leak when ext4_iget() fails
ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a reference of the specified buffer_head object to "bitmap_bh" with increased refcnt. When ext4_orphan_get() returns, local variable "bitmap_bh" becomes invalid, so the refcount should be decreased to keep refcount balanced. The reference counting issue happens in one exception handling path of ext4_orphan_get(). When ext4_iget() fails, the function forgets to decrease the refcnt increased by ext4_read_inode_bitmap(), causing a refcnt leak. Fix this issue by calling brelse() when ext4_iget() fails. Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn> Signed-off-by: Xin Tan <tanxin.ctf@gmail.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
043546e46d |
fs/ext4: Only change S_DAX on inode load
To prevent complications with in memory inodes we only set S_DAX on inode load. FS_XFLAG_DAX can be changed at any time and S_DAX will change after inode eviction and reload. Add init bool to ext4_set_inode_flags() to indicate if the inode is being newly initialized. Assert that S_DAX is not set on an inode which is just being loaded. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
9398554fb3 |
block: remove the error_sector argument to blkdev_issue_flush
The argument isn't used by any caller, and drivers don't fill out bi_sector for flush requests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
a17a9d935d |
ext4: increase wait time needed before reuse of deleted inode numbers
Current wait times have proven to be too short to protect against inode reuses that lead to metadata inconsistencies. Now that we will retry the inode allocation if we can't find any recently deleted inodes, it's a lot safer to increase the recently deleted time from 5 seconds to a minute. Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu Google-Bug-Id: 36602237 Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
9033783c8c |
ext4: fix return-value types in several function comments
The documentation comments for ext4_read_block_bitmap_nowait and ext4_read_inode_bitmap describe them as returning NULL on error, but they return an ERR_PTR on error; update the documentation to match. The documentation comment for ext4_wait_block_bitmap describes it as returning 1 on error, but it returns -errno on error; update the documentation to match. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Ritesh Harani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
54d3adbc29 |
ext4: save all error info in save_error_info() and drop ext4_set_errno()
Using a separate function, ext4_set_errno() to set the errno is
problematic because it doesn't do the right thing once
s_last_error_errorcode is non-zero. It's also less racy to set all of
the error information all at once. (Also, as a bonus, it shrinks code
size slightly.)
Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
Fixes:
|
|
|
|
d05466b27b |
ext4: avoid ENOSPC when avoiding to reuse recently deleted inodes
When ext4 is running on a filesystem without a journal, it tries not to reuse recently deleted inodes to provide better chances for filesystem recovery in case of crash. However this logic forbids reuse of freed inodes for up to 5 minutes and especially for filesystems with smaller number of inodes can lead to ENOSPC errors returned when allocating new inodes. Fix the problem by allowing to reuse recently deleted inode if there's no other inode free in the scanned range. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200318121317.31941-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> |
|
|
|
7c990728b9 |
ext4: fix potential race between s_flex_groups online resizing and access
During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org |
|
|
|
46f870d690 |
ext4: simulate various I/O and checksum errors when reading metadata
This allows us to test various error handling code paths Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> |