Commit Graph

1218 Commits

Author SHA1 Message Date
Caleb Sander Mateos 20fb3d05a3 io_uring/uring_cmd: avoid double indirect call in task work dispatch
io_uring task work dispatch makes an indirect call to struct io_kiocb's
io_task_work.func field to allow running arbitrary task work functions.
In the uring_cmd case, this calls io_uring_cmd_work(), which immediately
makes another indirect call to struct io_uring_cmd's task_work_cb field.
Change the uring_cmd task work callbacks to functions whose signatures
match io_req_tw_func_t. Add a function io_uring_cmd_from_tw() to convert
from the task work's struct io_tw_req argument to struct io_uring_cmd *.
Define a constant IO_URING_CMD_TASK_WORK_ISSUE_FLAGS to avoid
manufacturing issue_flags in the uring_cmd task work callbacks. Now
uring_cmd task work dispatch makes a single indirect call to the
uring_cmd implementation's callback. This also allows removing the
task_work_cb field from struct io_uring_cmd, freeing up 8 bytes for
future storage.
Since fuse_uring_send_in_task() now has access to the io_tw_token_t,
check its cancel field directly instead of relying on the
IO_URING_F_TASK_DEAD issue flag.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-03 08:31:26 -07:00
Linus Torvalds 9f388a653c for-6.18-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjxIRUACgkQxWXV+ddt
 WDuzyQ//X7a7IlmhrFYVLUbxhcjD+6gWddNl+zu2zTe0YUEh7rVtzq1W8rr2ku/3
 HD7fa18LoKJOQOitamPZqPkHarorlhEfnx7ukl4dgsvLE8jvWvfYnJq+jnfkXtfO
 d5mT0JKiTuD2MUAPpPjphLjr4Ok7pdeZ85k01D0sYPJ4gFLOJ37ORE7uSodVjPph
 /ry3vEQOtKOEimcyYPaLhs4MiAsg2Y5YgVk+2GjGihWp/nECFPUvfjolxguowNzN
 dw0zCyTzE5qu6rE/2zEXWYNHmgfNbzeqIKFwUCFJY2Mfanw8YR26z8xCr18bRmfz
 kwebmv7BWp0KSSpq06tSArRQZdPFFYKXiltsmA0CkvLR3CZx995Un3AQ7em6EnUz
 ln8nd+0liaK1ZT2Lrr8aRv0+dTTXBzTXRiJtqM+3eNMLgDTwEnhrAa2GOkfGNt/W
 M/iDGeMoQhbX4NoEH9Ac6OQuMztGa07w0u6pJ/sXjg3m17piy5bU1i+lvyQWebhw
 8R4fCpmUB45xgtdN7tT+G9pveU3/Ugy46Re0UZwb3s4whDkrzprgzCjyw5WGgxeb
 JLkx9+LctdMZM1Rcf6cbn4waERwM3iOUXtUPFljFXm/tw6i+CB4uR5GKBfLVes7u
 5hPDzOJgClUnWt2yleP+4rDNrPsmnO7qFnXcF+SJ1SuIQke59g0=
 =D6mR
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - in tree-checker fix extref bounds check

 - reorder send context structure to avoid
   -Wflex-array-member-not-at-end warning

 - fix extent readahead length for compressed extents

 - fix memory leaks on error paths (qgroup assign ioctl, zone loading
   with raid stripe tree enabled)

 - fix how device specific mount options are applied, in particular the
   'ssd' option will be set unexpectedly

 - fix tracking of relocation state when tasks are running and
   cancellation is attempted

 - adjust assertion condition for folios allocated for scrub

 - remove incorrect assertion checking for block group when populating
   free space tree

* tag 'for-6.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx
  btrfs: tree-checker: fix bounds check in check_inode_extref()
  btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST
  btrfs: fix incorrect readahead expansion length
  btrfs: do not assert we found block group item when creating free space tree
  btrfs: do not use folio_test_partial_kmap() in ASSERT()s
  btrfs: only set the device specific options after devices are opened
  btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl
  btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running
2025-10-16 10:22:38 -07:00
Miquel Sabaté Solà 53a4acbfc1 btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl
On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as
provided by the user, which is kfree'd in the end. But this was not the
case when allocating memory for 'prealloc'. In this case, if it somehow
failed, then the previous code would go directly into calling
'mnt_drop_write_file', without freeing the string duplicated from the
user space.

Fixes: 4addc1ffd6 ("btrfs: qgroup: preallocate memory before adding a relation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-10-13 22:29:27 +02:00
Linus Torvalds 5832d26433 for-6.18/io_uring-20250929
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmjbLEcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnEUD/4/FgfQP2LFS/88BBF5ukZjRySe4wmyyZ2Q
 MFh2ehdxzkZxVXjbeA2wRAXdqjw2MbNhx8tzU9VrW7rweNDZxHbwi6jJIP7OAjxE
 4ZP0goAQj7P0TFyXC2KGj7k6dP20FkAltx5gGLVwsuOWDDrQKp2EykAcRnGYAD4W
 3yf+nojVr2bjHyO7dx8dM7jUDjMg7J8nmHD6zgHOlHRLblWwfzw907bhz+eBX/FI
 9kYvtX2c9MgY4Isa+43rZd5qvj9S3Cs8PD6tFPbq+n+3l7yWgMBTu/y+SNI8hupT
 W7CqjPcpvppFHhPkcXDA3yARnW7ccEx5aiQuvUCmRUioHtGwXvC63HMp8OjcQspV
 NNoIHYFsi1alzYq2kJLxY1IleWZ8j0hUkSSU8u7al8VIvtD43LGkv51xavxQUFjg
 BO9mLyS51H2agffySs4vhHJE82lZizvmh/RJfSJ0ezALzE2k42MrximX1D1rBJE6
 KPOhCiPt/jqpQMyqDYnY10FgTXQVwgPIVH1JLpo611tPFHlGW8Y4YxxR1Xduh5JX
 jbGLEjVREsDZ7EHrimLNLmJRAQpyQujv/yhf7k96gWBelVwVuISQLI4Ca5IeVQyk
 9yifgLXNGddgAwj0POMFeKXSm2We9nrrPDYLCKrsBMSN96/3SLveJC7fkW88aUZr
 ye4/K8Y3vA==
 =uc/3
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux

Pull io_uring updates from Jens Axboe:

 - Store ring provided buffers locally for the users, rather than stuff
   them into struct io_kiocb.

   These types of buffers must always be fully consumed or recycled in
   the current context, and leaving them in struct io_kiocb is hence not
   a good ideas as that struct has a vastly different life time.

   Basically just an architecture cleanup that can help prevent issues
   with ring provided buffers in the future.

 - Support for mixed CQE sizes in the same ring.

   Before this change, a CQ ring either used the default 16b CQEs, or it
   was setup with 32b CQE using IORING_SETUP_CQE32. For use cases where
   a few 32b CQEs were needed, this caused everything else to use big
   CQEs. This is wasteful both in terms of memory usage, but also memory
   bandwidth for the posted CQEs.

   With IORING_SETUP_CQE_MIXED, applications may use request types that
   post both normal 16b and big 32b CQEs on the same ring.

 - Add helpers for async data management, to make it harder for opcode
   handlers to mess it up.

 - Add support for multishot for uring_cmd, which ublk can use. This
   helps improve efficiency, by providing a persistent request type that
   can trigger multiple CQEs.

 - Add initial support for ring feature querying.

   We had basic support for probe operations, but the API isn't great.
   Rather than expand that, add support for QUERY which is easily
   expandable and can cover a lot more cases than the existing probe
   support. This will help applications get a better idea of what
   operations are supported on a given host.

 - zcrx improvements from Pavel:
        - Improve refill entry alignment for better caching
        - Various cleanups, especially around deduplicating normal
          memory vs dmabuf setup.
        - Generalisation of the niov size (Patch 12). It's still hard
          coded to PAGE_SIZE on init, but will let the user to specify
          the rx buffer length on setup.
        - Syscall / synchronous bufer return. It'll be used as a slow
          fallback path for returning buffers when the refill queue is
          full. Useful for tolerating slight queue size misconfiguration
          or with inconsistent load.
        - Accounting more memory to cgroups.
        - Additional independent cleanups that will also be useful for
          mutli-area support.

 - Various fixes and cleanups

* tag 'for-6.18/io_uring-20250929' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux: (68 commits)
  io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
  io_uring: fix nvme's 32b cqes on mixed cq
  io_uring/query: cap number of queries
  io_uring/query: prevent infinite loops
  io_uring/zcrx: account niov arrays to cgroup
  io_uring/zcrx: allow synchronous buffer return
  io_uring/zcrx: introduce io_parse_rqe()
  io_uring/zcrx: don't adjust free cache space
  io_uring/zcrx: use guards for the refill lock
  io_uring/zcrx: reduce netmem scope in refill
  io_uring/zcrx: protect netdev with pp_lock
  io_uring/zcrx: rename dma lock
  io_uring/zcrx: make niov size variable
  io_uring/zcrx: set sgt for umem area
  io_uring/zcrx: remove dmabuf_offset
  io_uring/zcrx: deduplicate area mapping
  io_uring/zcrx: pass ifq to io_zcrx_alloc_fallback()
  io_uring/zcrx: check all niovs filled with dma addresses
  io_uring/zcrx: move area reg checks into io_import_area
  io_uring/zcrx: don't pass slot to io_zcrx_create_area
  ...
2025-10-02 09:56:23 -07:00
David Sterba a929904cf7 btrfs: add unlikely annotations to branches leading to transaction abort
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.

Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.

The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
David Sterba 9264d004a6 btrfs: add unlikely annotations to branches leading to EUCLEAN
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:26 +02:00
Qu Wenruo 98077f7f21 btrfs: enable experimental bs > ps support
With all the preparation patches, we're able to finally enable btrfs
block size (sector size) larger than page size support and give it a
full fstests run.

And obviously this new feature is hidden behind experimental flags, and
should not be considered as a core feature yet as btrfs' default block
size is still 4K.

But this is still a feature that will shine in the future where 16K
block sized device are widely adopted.

For now there are some features explicitly disabled:

- Direct IO
  This is the most complex part to support, the root reason is we can
  not control the pages of iov iter passed in.

  User space programs can only ensure the virtual addresses are
  contiguous, but have no control on their physical addresses.

  Our bs > ps support heavily relies on large folios, and direct IO
  memory can easily break it.

  So direct IO is disabled and will always fall back to buffered IO.

- RAID56
  In theory we can convert RAID56 to use large folios, but it will need
  to be converted back to page based if we want to support direct IO in
  the future.
  So just reject it for now.

- Encoded send
- Encoded read
  Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send
  is utilizing vmallocated memory.
  Unfortunately for vmallocated memory we can not guarantee the minimal
  folio order.

  For send, it will just always fallback to regular writes, which reads
  from page cache and will follow the existing folio order requirement.

- Encoded write
  Encoded write itself is allocating pages by themselves, and we can
  easily change it to follow the minimal order.
  But since encoded read is already disabled, there is no need to only
  enable encoded write.

Finally just like what we did for bs < ps support in the past, add a
warning message for bs > ps mounts.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:25 +02:00
David Sterba 17dc82dc1e btrfs: fix typos in comments and strings
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-23 08:49:16 +02:00
Caleb Sander Mateos ef9f603fd3 io_uring/cmd: drop unused res2 param from io_uring_cmd_done()
Commit 79525b51ac ("io_uring: fix nvme's 32b cqes on mixed cq") split
out a separate io_uring_cmd_done32() helper for ->uring_cmd()
implementations that return 32-byte CQEs. The res2 value passed to
io_uring_cmd_done() is now unused because __io_uring_cmd_done() ignores
it when is_cqe32 is passed as false. So drop the parameter from
io_uring_cmd_done() to simplify the callers and clarify that it's not
possible to return an extra value beyond the 32-bit CQE result.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-09-23 00:15:02 -06:00
David Sterba 67e78f983e btrfs: convert several int parameters to bool
We're almost done cleaning misused int/bool parameters. Convert a bunch
of them, found by manual grepping.  Note that btrfs_sync_fs() needs an
int as it's mandated by the struct super_operations prototype.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-09-22 10:54:32 +02:00
Linus Torvalds c3018a2c6a for-6.17/io_uring-20250728
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHcP8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphu+EACp+MNKBIx7iIM1arbmafEmroaR9Whi5CJu
 8BczAKRVPNhICPXAjr6UOmRuD4nBzPysPNi73xNRdXdnd/dnOL27zMqF/0JCJsd6
 V4Xt7pXcanJ4BoeRDv9P332h6/IVo1ulxqEzhWJt2KKoq4cWbSwlvk18NPOdGIkp
 PcMPARf15d4KbLprdST5Dzrg2uqZ3jZ4Y4zvIMKery56AZCkp5ZrtPFttw4jhTff
 FQzIjY/hXXDFURubv75EHFg7KC9rcZa4CuU40duTHmsXD/EsSsa0eAJMmiVAGjhM
 PXK/70ZOoxLK1zU7zN+iAMUnVL7onoNvo8g9e7/FJNuHV25am8aUx4DzzP3ci+He
 +2C4PpOKx9Egz0oKmC6hl9eMg/muPd/3j8uxQlGOvz38icgeirQB/ebr1KLtBkM9
 pKQmCsGaIzncXWddcTMCayEj3wbvn2lgWSqSgZuwTt1AXqSTxLOmTywCYcXDBdli
 2ejh/Hk45TALnBhiiDWdOT2euh2EEP1ADOZzVsZzeR29OzLqJDRjRnitb40NUU60
 +ny2HOcplIN6aah+QdFwK31FieVKDY/ufjt1yGvwCP+kIxUEDSYi+NRRldmwznxy
 8UZwFyvd8wOxUc+iG/wCK7ccY+MtYyaZAE4ok2QEvQvMUTBJE/LvI4bkd77fx7Sy
 2cMeDukZ/w==
 =EGH5
 -----END PGP SIGNATURE-----

Merge tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe:

 - Optimization to avoid reference counts on non-cloned registered
   buffers. This is how these buffers were handled prior to having
   cloning support, and we can still use that approach as long as the
   buffers haven't been cloned to another ring.

 - Cleanup and improvement for uring_cmd, where btrfs was the only user
   of storing allocated data for the lifetime of the uring_cmd. Clean
   that up so we can get rid of the need to do that.

 - Avoid unnecessary memory copies in uring_cmd usage. This is
   particularly important as a lot of uring_cmd usage necessitates the
   use of 128b SQEs.

 - A few updates for recv multishot, where it's now possible to add
   fairness limits for limiting how much is transferred for each retry
   loop. Additionally, recv multishot now supports an overall cap as
   well, where once reached the multishot recv will terminate. The
   latter is useful for buffer management and juggling many recv streams
   at the same time.

 - Add support for returning the TX timestamps via a new socket command.
   This feature can work in either singleshot or multishot mode, where
   the latter triggers a completion whenever new timestamps are
   available. This is an alternative to using the existing error queue.

 - Add support for an io_uring "mock" file, which is the start of being
   able to do 100% targeted testing in terms of exercising io_uring
   request handling. The idea is to have a file type that can be
   anything the tester would like, and behave exactly how you want it to
   behave in terms of hitting the code paths you want.

 - Improve zcrx by using sgtables to de-duplicate and improve dma
   address handling.

 - Prep work for supporting larger pages for zcrx.

 - Various little improvements and fixes.

* tag 'for-6.17/io_uring-20250728' of git://git.kernel.dk/linux: (42 commits)
  io_uring/zcrx: fix leaking pages on sg init fail
  io_uring/zcrx: don't leak pages on account failure
  io_uring/zcrx: fix null ifq on area destruction
  io_uring: fix breakage in EXPERT menu
  io_uring/cmd: remove struct io_uring_cmd_data
  btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
  io_uring/cmd: introduce IORING_URING_CMD_REISSUE flag
  io_uring/zcrx: account area memory
  io_uring: export io_[un]account_mem
  io_uring/net: Support multishot receive len cap
  io_uring: deduplicate wakeup handling
  io_uring/net: cast min_not_zero() type
  io_uring/poll: cleanup apoll freeing
  io_uring/net: allow multishot receive per-invocation cap
  io_uring/net: move io_sr_msg->retry_flags to io_sr_msg->flags
  io_uring/net: use passed in 'len' in io_recv_buf_select()
  io_uring/zcrx: prepare fallback for larger pages
  io_uring/zcrx: assert area type in io_zcrx_iov_page
  io_uring/zcrx: allocate sgtable for umem areas
  io_uring/zcrx: introduce io_populate_area_dma
  ...
2025-07-28 16:30:12 -07:00
Linus Torvalds 57fcb7d930 vfs-6.17-rc1.fileattr
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCpgAKCRCRxhvAZXjc
 oqfFAQDcy3rROUF3W34KcSi7rDmaKVSX53d1tUoqH+1zDRpSlwEAriKDNC1ybudp
 YAnxVzkRHjHs1296WIuwKq5lfhJ60Q4=
 =geAl
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull fileattr updates from Christian Brauner:
 "This introduces the new file_getattr() and file_setattr() system calls
  after lengthy discussions.

  Both system calls serve as successors and extensible companions to
  the FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR system calls which have
  started to show their age in addition to being named in a way that
  makes it easy to conflate them with extended attribute related
  operations.

  These syscalls allow userspace to set filesystem inode attributes on
  special files. One of the usage examples is the XFS quota projects.

  XFS has project quotas which could be attached to a directory. All new
  inodes in these directories inherit project ID set on parent
  directory.

  The project is created from userspace by opening and calling
  FS_IOC_FSSETXATTR on each inode. This is not possible for special
  files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
  with empty project ID. Those inodes then are not shown in the quota
  accounting but still exist in the directory. This is not critical but
  in the case when special files are created in the directory with
  already existing project quota, these new inodes inherit extended
  attributes. This creates a mix of special files with and without
  attributes. Moreover, special files with attributes don't have a
  possibility to become clear or change the attributes. This, in turn,
  prevents userspace from re-creating quota project on these existing
  files.

  In addition, these new system calls allow the implementation of
  additional attributes that we couldn't or didn't want to fit into the
  legacy ioctls anymore"

* tag 'vfs-6.17-rc1.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  fs: tighten a sanity check in file_attr_to_fileattr()
  tree-wide: s/struct fileattr/struct file_kattr/g
  fs: introduce file_getattr and file_setattr syscalls
  fs: prepare for extending file_get/setattr()
  fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP
  selinux: implement inode_file_[g|s]etattr hooks
  lsm: introduce new hooks for setting/getting inode fsxattr
  fs: split fileattr related helpers into separate file
2025-07-28 15:24:14 -07:00
David Sterba 009b2056cb btrfs: defrag: add flag to force no-compression
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.

The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.

Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):

  $ mount -o compress=zstd:9 /dev/vda /mnt
  $ filefrag -vsb testfile
  filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
  Filesystem type is: 9123683e
  File size of testfile is 297704 (292 blocks of 1024 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     127:      13312..     13439:    128:             encoded
     1:      128..     255:      13364..     13491:    128:      13440: encoded
     2:      256..     291:      13424..     13459:     36:      13492: last,encoded,eof
  testfile: 3 extents found

  $ compsize testfile
  Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
  Type       Perc     Disk Usage   Uncompressed Referenced
  TOTAL       42%      124K         292K         292K
  zstd        42%      124K         292K         292K

Defrag to uncompressed:

  $ btrfs fi defrag --nocomp testfile
  $ filefrag -vsb testfile
  filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
  Filesystem type is: 9123683e
  File size of testfile is 297704 (292 blocks of 1024 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     291:     291840..    292131:    292:             last,eof
  testfile: 1 extent found

  $ compsize testfile
  Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
  Type       Perc     Disk Usage   Uncompressed Referenced
  TOTAL      100%      292K         292K         292K
  none       100%      292K         292K         292K

Compress again with LZO:

  $ btrfs fi defrag -clzo testfile
  $ filefrag -vsb testfile
  filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
  Filesystem type is: 9123683e
  File size of testfile is 297704 (292 blocks of 1024 bytes)
   ext:     logical_offset:        physical_offset: length:   expected: flags:
     0:        0..     127:      13312..     13439:    128:             encoded
     1:      128..     255:      13392..     13519:    128:      13440: encoded
     2:      256..     291:      13480..     13515:     36:      13520: last,encoded,eof
  testfile: 3 extents found

  $ compsize testfile
  Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
  Type       Perc     Disk Usage   Uncompressed Referenced
  TOTAL       64%      188K         292K         292K
  lzo         64%      188K         292K         292K

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 01:13:03 +02:00
Filipe Manana e560afc1a8 btrfs: set search_commit_root to false in iterate_inodes_from_logical()
There's no point in checking at iterate_inodes_from_logical() if the path
has search_commit_root set, the only caller never sets search_commit_root
to true and it doesn't make sense for it ever to be true for the current
use case (logical_to_ino ioctl). So stop checking for that and since the
only caller allocates the path just for it to be used by
iterate_inodes_from_logical(), move the path allocation into that function.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:09:20 +02:00
Filipe Manana a943812bff btrfs: qgroup: use btrfs_qgroup_enabled() in ioctls
We have a publicly exported btrfs_qgroup_enabled() and an ioctl.c private
qgroup_enabled() helper. Both of these test if qgroups are enabled, the
first check if the flag BTRFS_FS_QUOTA_ENABLED is set in fs_info->flags
while the second checks if fs_info->quota_root is not NULL while holding
the mutex fs_info->qgroup_ioctl_lock.

We can get away with the private ioctl.c:qgroup_enabled(), as all entry
points into the qgroup code check if fs_info->quota_root is NULL or not
while holding the mutex fs_info->qgroup_ioctl_lock, and returning the
error -ENOTCONN in case it's NULL.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:07:37 +02:00
Qu Wenruo de339cbfb4 btrfs: call bdev_fput() to reclaim the blk_holder immediately
As part of the preparation for btrfs blk_holder_ops, we want to ensure
the holder of a block device has a proper lifespan.

However btrfs is always using fput() to close a block device, which has
one problem:

- fput() is deferred
  Meaning we can have a block device with invalid (aka, freed) holder.

To avoid the problem and align the behavior to other code, just call
bdev_fput() instead.

There is some extra requirement on the locking, but that's all resolved
by previous patches and we should be safe to call bdev_fput().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:06:19 +02:00
Caleb Sander Mateos ea124ec327 btrfs: don't skip accounting in early ENOTTY return in btrfs_uring_encoded_read()
btrfs_uring_encoded_read() returns early with -ENOTTY if the uring_cmd
is issued with IO_URING_F_COMPAT but the kernel doesn't support compat
syscalls. However, this early return bypasses the syscall accounting.
Go to out_acct instead to ensure the syscall is counted.

Fixes: 34310c442e ("btrfs: add io_uring command for encoded reads (ENCODED_READ ioctl)")
CC: stable@vger.kernel.org # 6.15+
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:30 +02:00
David Sterba 8320febc64 btrfs: pass bool to indicate subvolume/snapshot creation type
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba a5f0e0a4df btrfs: pass dentry to btrfs_mksubvol() and btrfs_mksnapshot()
There's no reason to pass 'struct path' to btrfs_mksubvol(), though it's
been like the since the first commit 76dda93c6a ("Btrfs: add
snapshot/subvolume destroy ioctl").  We only use the dentry so we should
pass it directly.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba 34f6cc5b18 btrfs: use struct qstr for subvolume ioctl helpers
We pass name and length of subvolumes separately to the related
functions, while this can be a struct qstr which is otherwise used for
dentry interfaces.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
Brahmajit Das 164299ba11 btrfs: replace strcpy() with strscpy()
strcpy() is discouraged from use due to lack of bounds checking.
Replaces it with strscpy(), the recommended alternative for null
terminated strings, to follow best practices.

There are instances where strscpy() cannot be used such as where both
the source and destination are character pointers. In that instance we
can use sysfs_emit().

Link: https://github.com/KSPP/linux/issues/88
Suggested-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Brahmajit Das <bdas@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-22 00:05:00 +02:00
David Sterba ab5fcbb1ad btrfs: use pgoff_t for page index variables
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:05 +02:00
Filipe Manana fd00922abc btrfs: add btrfs prefix to is_fstree() and make it return bool
This is an exported function and therefore it should have a 'btrfs_'
prefix, to make it clear it's btrfs specific, avoid future name collisions
with code outside btrfs, and make its naming consistent with most other
btrfs exported functions.

So add a 'btrfs_' prefix to it and make it return bool instead of int,
since all we need is to return true or false.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:04 +02:00
David Sterba 64b8c3851f btrfs: rename error to ret in btrfs_mksubvol()
Unify naming of return value to the preferred way.

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
David Sterba bfa13b82cc btrfs: rename error to ret in btrfs_may_delete()
Unify naming of return value to the preferred way.

Reviewed-by: Daniel Vacek <neelx@suse.com>yy
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:58:03 +02:00
David Sterba 2eac2ae8b2 btrfs: switch RCU helper versions to btrfs_info()
The RCU protection is now done in the plain helpers, we can remove the
"_in_rcu" and "_rl_in_rcu".

Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-07-21 23:56:38 +02:00
Caleb Sander Mateos 9aad72b4e3 btrfs/ioctl: store btrfs_uring_encoded_data in io_btrfs_cmd
btrfs is the only user of struct io_uring_cmd_data and its op_data
field. Switch its ->uring_cmd() implementations to store the struct
btrfs_uring_encoded_data * in the struct io_btrfs_cmd, overlayed with
io_uring_cmd's pdu field. This avoids having to touch another cache line
to access the struct btrfs_uring_encoded_data *, and allows op_data and
struct io_uring_cmd_data to be removed.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Acked-by: David Sterba <dsterba@suse.com>
Link: https://lore.kernel.org/r/20250708202212.2851548-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-18 12:34:56 -06:00
Christian Brauner ca115d7e75
tree-wide: s/struct fileattr/struct file_kattr/g
Now that we expose struct file_attr as our uapi struct rename all the
internal struct to struct file_kattr to clearly communicate that it is a
kernel internal struct. This is similar to struct mount_{k}attr and
others.

Link: https://lore.kernel.org/20250703-restlaufzeit-baurecht-9ed44552b481@brauner
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-07-04 16:14:39 +02:00
Linus Torvalds 4c06e63b92 for-6.16-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhmwFQACgkQxWXV+ddt
 WDsVMA/+NuSth71V0AfiDnFyqjgDMqIlZL2+dqBiTYHXQQHKbqiUlKvYkWICCT6T
 1YgDV+95XJYy4TDBoA49Ndd/l+CiDcMLbOYeneIfbJy13ts84jVANPkl4n03gPkF
 ktibCw15h0MENVctTCPc71dX2X0cV9WPf4iDmoxUZiukDA376akGTArZKwH4tVVg
 4qVpzUtDdNOf848D+8DZKGd+ot/RWgEdLkFCZES27BMg/OFemxBK1MU6K8VjxiKF
 VoaSVJRDXuug8oVBAGNl86XpiSgd4gHyoNNA5b4mhdSWMSBMxUAaILsONT9pNQZA
 CFyHA1Jp2gLOIzQIzeXwWgXaAOQDtco8YWYaXhf0v0mySs89tweXjOibfj2mU9pS
 wPaJyeD+nyRDMwPa4VWEws64D3vXX6aKwiThUENuDmxBvrRXjrkGYH9tf0LNzDDe
 OKv/vOCfeyutxbjKhP+qElMhdh73BZnJ4UCxxYRRDq2v1Mg+k06swl+6uL6xenme
 a2KLJlwEoG6LAlkpZzV66ZEaIHDyGBZNdVYtuA/G3dDtmlt0aLXDdp1eq7NivS1j
 aV7cd0JMX89lAUtqKT932ZOw8RoDrUPPjsnXzCaZJ69mMVyEkxyCV+iYHTTJPDga
 W5Vg8Tq3d1gwxMebZHvyI6wwUhmGA0wUFG2eohYY/tcSrrUlrHQ=
 =Ke0p
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - tree-log fixes:
    - fixes of log tracking of directories and subvolumes
    - fix iteration and error handling of inode references
      during log replay

 - fix free space tree rebuild (reported by syzbot)

* tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: use btrfs_record_snapshot_destroy() during rmdir
  btrfs: propagate last_unlink_trans earlier when doing a rmdir
  btrfs: record new subvolume in parent dir earlier to avoid dir logging races
  btrfs: fix inode lookup error handling during log replay
  btrfs: fix iteration of extrefs during log replay
  btrfs: fix missing error handling when searching for inode refs during log replay
  btrfs: fix failure to rebuild free space tree using multiple transactions
2025-07-03 13:29:56 -07:00
Filipe Manana bf5bcf9a6f btrfs: record new subvolume in parent dir earlier to avoid dir logging races
Instead of recording that a new subvolume was created in a directory after
we add the entry do the directory, record it before adding the entry. This
is to avoid races where after creating the entry and before recording the
new subvolume in the directory (the call to btrfs_record_new_subvolume()),
another task logs the directory, so we end up with a log tree where we
logged a directory that has an entry pointing to a root that was not yet
committed, resulting in an invalid entry if the log is persisted and
replayed later due to a power failure or crash.

Also state this requirement in the function comment for
btrfs_record_new_subvolume(), similar to what we do for the
btrfs_record_unlink_dir() and btrfs_record_snapshot_destroy().

Fixes: 45c4102f0d ("btrfs: avoid transaction commit on any fsync after subvolume creation")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-27 19:57:24 +02:00
Linus Torvalds 5ca7fe213b for-6.16-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmhZQ9MACgkQxWXV+ddt
 WDsyvw/+K5N4zbig9D5QL5SdsQwMe/ZUk1KF0LLu6H3hFetdICeM/Z4K46EBh40X
 c9Sxb13gLnIAm8DR/IFTTlOZVrrbJ3CTazZuJbncCpaZchH863aYb/1KboxjJnpW
 KqOen20KdUh8HdevrJFhkFc7rOjp7KupfIHsbWqIxaWYPf8ORvUyK55lKxQz0HES
 E5tFXLNr6z/8Ws5pc2HnRLgnRcCHuRUNJUb1PEaTfPKxoFvTwjda6cDsYnXOJEO9
 NOnh6lluurqja+3FUEFig2f292/CbKGtByYUDgfhHO21P//IHSDhlouvwipzI/kh
 6WUoH1K+DWCxxNbIVFFbUYLxrDGu7R7/aWFHH2q0dNjqQeiQBbUnbn4WIjAAwDWf
 k9cmE+WgVqwQI+vpfG3eENUafG5MpcQQo2wKrxG0whWaC2fiA6QtI+3DfKyMj4XJ
 JI1jUhfCwHrqzoGQ4XBE3UYENqQw9RICNC+Z3UfZx+5sQMWcb+ac5qIGygvCfU8N
 Gtfx4ladZshpQUSuRneiLozxdxLyXX3LzCt2Ls1s5fPPikZft/+2QRu5rzSbb/Cp
 50TDSn/pE1N/TEMVZaP5M2PxquBVDOZ4TFSsSm3IvceqFInm0UerAGaJ7+T2eZhM
 3XHhIp6xTecHfwukvGqs+XSxB9PMLfF5M0gc+9PR+3oxzFRpowI=
 =XLWR
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Fixes:

   - fix invalid inode pointer dereferences during log replay

   - fix a race between renames and directory logging

   - fix shutting down delayed iput worker

   - fix device byte accounting when dropping chunk

   - in zoned mode, fix offset calculations for DUP profile when
     conventional and sequential zones are used together

  Regression fixes:

   - fix possible double unlock of extent buffer tree (xarray
     conversion)

   - in zoned mode, fix extent buffer refcount when writing out extents
     (xarray conversion)

  Error handling fixes and updates:

   - handle unexpected extent type when replaying log

   - check and warn if there are remaining delayed inodes when putting a
     root

   - fix assertion when building free space tree

   - handle csum tree error with mount option 'rescue=ibadroot'

  Other:

   - error message updates: add prefix to all scrub related messages,
     include other information in messages"

* tag 'for-6.16-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix alloc_offset calculation for partly conventional block groups
  btrfs: handle csum tree error with rescue=ibadroots correctly
  btrfs: fix race between async reclaim worker and close_ctree()
  btrfs: fix assertion when building free space tree
  btrfs: don't silently ignore unexpected extent type when replaying log
  btrfs: fix invalid inode pointer dereferences during log replay
  btrfs: fix double unlock of buffer_tree xarray when releasing subpage eb
  btrfs: update superblock's device bytes_used when dropping chunk
  btrfs: fix a race between renames and directory logging
  btrfs: scrub: add prefix for the error messages
  btrfs: warn if leaking delayed_nodes in btrfs_put_root()
  btrfs: fix delayed ref refcount leak in debug assertion
  btrfs: include root in error message when unlinking inode
  btrfs: don't drop a reference if btrfs_check_write_meta_pointer() fails
2025-06-23 11:16:38 -07:00
Anand Jain 65d5112b4d btrfs: scrub: add prefix for the error messages
Add a "scrub: " prefix to all messages logged by scrub so that it's
easy to filter them from dmesg for analysis.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-06-19 15:19:06 +02:00
Linus Torvalds 5e82ed5ca4 for-6.16-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmgtuJgACgkQxWXV+ddt
 WDt79g//YndozUasOP0raqNVvod4wYvmG/CX1yHOkFQpfRQSVG4av0KlTWnupXKG
 oEQvFbZ639tmXbBYlKlK8Ts8fy1dpj+2iG4ValukA4L7xkY8ML5DrGQfKYbPEm2i
 Ab9lp4qnZZutYVH2/5UGQqkEUA3/YIiOZ0hsZWir//zbkTCL9cuHwl2FUYbmFlHi
 Hxkd30QC0kZuxINdMxXGauF4JkFJFyiNnmI5dMjj07xMMWk1cv8vunoZ3LVjAlbW
 gX16+4rUmtJl33HbYqofee4Dcovvcuvt/fEM1LX0rGbKXOnKA2dQPoMQsjMAV82B
 mjhma5T709MgVHQiDdJduh86seaul4Cuv/E/OqoDj7Kfkoew/YquHEfU4TB4bvCX
 KmONEyJFd9QDq5CUyvfow7HENja6QbU31Fw6akrbfpsVcla0MKAUWPi+Vqpqf+pe
 qIWNcovorD2g/EVJV6y+w0K+kXTarPtXXmVnJnJPYtOkBWpARI3Y8wVxDCKX8Nfo
 7Kpi/h9K87+d9opjjEajydNONDL9GQa4AY4u/oeiwcSuJHvCt/rsKKwHZRyycRiI
 q+nGwsNcmY/ih/EVUzLgYomGG08H9nOcKvZOQkfHpOTI1EgvILAeV9SpGMex7du1
 PiPqVtv9Z60dKy6OValh7ttMpt7LszAK4Dk7XiyHrN1Q3sYDyrs=
 =bDOD
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "Apart from numerous cleanups, there are some performance improvements
  and one minor mount option update. There's one more radix-tree
  conversion (one remaining), and continued work towards enabling large
  folios (almost finished).

  Performance:

   - extent buffer conversion to xarray gains throughput and runtime
     improvements on metadata heavy operations doing writeback (sample
     test shows +50% throughput, -33% runtime)

   - extent io tree cleanups lead to performance improvements by
     avoiding unnecessary searches or repeated searches

   - more efficient extent unpinning when committing transaction
     (estimated run time improvement 3-5%)

  User visible changes:

   - remove standalone mount option 'nologreplay', deprecated in 5.9,
     replacement is 'rescue=nologreplay'

   - in scrub, update reporting, add back device stats message after
     detected errors (accidentally removed during recent refactoring)

  Core:

   - convert extent buffer radix tree to xarray

   - in subpage mode, move block perfect compression out of experimental
     build

   - in zoned mode, introduce sub block groups to allow managing special
     block groups, like the one for relocation or tree-log, to handle
     some corner cases of ENOSPC

   - in scrub, simplify bitmaps for block tracking status

   - continued preparations for large folios:
       - remove assertions for folio order 0
       - add support where missing: compression, buffered write, defrag,
         hole punching, subpage, send

   - fix fsync of files with no hard links not persisting deletion

   - reject tree blocks which are not nodesize aligned, a precaution
     from 4.9 times

   - move transaction abort calls closer to the error sites

   - remove usage of some struct bio_vec internals

   - simplifications in extent map

   - extent IO cleanups and optimizations

   - error handling improvements

   - enhanced ASSERT() macro with optional format strings

   - cleanups:
       - remove unused code
       - naming unifications, dropped __, added prefix
       - merge similar functions
       - use common helpers for various data structures"

* tag 'for-6.16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
  btrfs: move misplaced comment of btrfs_path::keep_locks
  btrfs: remove standalone "nologreplay" mount option
  btrfs: use a single variable to track return value at btrfs_page_mkwrite()
  btrfs: don't return VM_FAULT_SIGBUS on failure to set delalloc for mmap write
  btrfs: simplify early error checking in btrfs_page_mkwrite()
  btrfs: pass true to btrfs_delalloc_release_space() at btrfs_page_mkwrite()
  btrfs: fix wrong start offset for delalloc space release during mmap write
  btrfs: fix harmless race getting delayed ref head count when running delayed refs
  btrfs: log error codes during failures when writing super blocks
  btrfs: simplify error return logic when getting folio at prepare_one_folio()
  btrfs: return real error from __filemap_get_folio() calls
  btrfs: remove superfluous return value check at btrfs_dio_iomap_begin()
  btrfs: fix invalid data space release when truncating block in NOCOW mode
  btrfs: update Kconfig option descriptions
  btrfs: update list of features built under experimental config
  btrfs: send: remove btrfs_debug() calls
  btrfs: use boolean for delalloc argument to btrfs_free_reserved_extent()
  btrfs: use boolean for delalloc argument to btrfs_free_reserved_bytes()
  btrfs: fold error checks when allocating ordered extent and update comments
  btrfs: check we grabbed inode reference when allocating an ordered extent
  ...
2025-05-26 12:24:43 -07:00
Linus Torvalds 6d5b940e1e vfs-6.16-rc1.async.dir
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaDBN6wAKCRCRxhvAZXjc
 ok32AQD9DTiSCAoVg+7s+gSBuLTi8drPTN++mCaxdTqRh5WpRAD9GVyrGQT0s6LH
 eo9bm8d1TAYjilEWM0c0K0TxyQ7KcAA=
 =IW7H
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs directory lookup updates from Christian Brauner:
 "This contains cleanups for the lookup_one*() family of helpers.

  We expose a set of functions with names containing "lookup_one_len"
  and others without the "_len". This difference has nothing to do with
  "len". It's rater a historical accident that can be confusing.

  The functions without "_len" take a "mnt_idmap" pointer. This is found
  in the "vfsmount" and that is an important question when choosing
  which to use: do you have a vfsmount, or are you "inside" the
  filesystem. A related question is "is permission checking relevant
  here?".

  nfsd and cachefiles *do* have a vfsmount but *don't* use the non-_len
  functions. They pass nop_mnt_idmap and refuse to work on filesystems
  which have any other idmap.

  This work changes nfsd and cachefile to use the lookup_one family of
  functions and to explictily pass &nop_mnt_idmap which is consistent
  with all other vfs interfaces used where &nop_mnt_idmap is explicitly
  passed.

  The remaining uses of the "_one" functions do not require permission
  checks so these are renamed to be "_noperm" and the permission
  checking is removed.

  This series also changes these lookup function to take a qstr instead
  of separate name and len. In many cases this simplifies the call"

* tag 'vfs-6.16-rc1.async.dir' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  VFS: change lookup_one_common and lookup_noperm_common to take a qstr
  Use try_lookup_noperm() instead of d_hash_and_lookup() outside of VFS
  VFS: rename lookup_one_len family to lookup_noperm and remove permission check
  cachefiles: Use lookup_one() rather than lookup_one_len()
  nfsd: Use lookup_one() rather than lookup_one_len()
  VFS: improve interface for lookup_one functions
2025-05-26 08:02:43 -07:00
David Sterba f963e0128b btrfs: trivial conversion to return bool instead of int
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:49 +02:00
Filipe Manana 242570e80b btrfs: add btrfs prefix to main lock, try lock and unlock extent functions
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-05-15 14:30:43 +02:00
NeilBrown 5741909697
VFS: improve interface for lookup_one functions
The family of functions:
  lookup_one()
  lookup_one_unlocked()
  lookup_one_positive_unlocked()

appear designed to be used by external clients of the filesystem rather
than by filesystems acting on themselves as the lookup_one_len family
are used.

They are used by:
   btrfs/ioctl - which is a user-space interface rather than an internal
     activity
   exportfs - i.e. from nfsd or the open_by_handle_at interface
   overlayfs - at access the underlying filesystems
   smb/server - for file service

They should be used by nfsd (more than just the exportfs path) and
cachefs but aren't.

It would help if the documentation didn't claim they should "not be
called by generic code".

Also the path component name is passed as "name" and "len" which are
(confusingly?) separate by the "base".  In some cases the len in simply
"strlen" and so passing a qstr using QSTR() would make the calling
clearer.
Other callers do pass separate name and len which are stored in a
struct.  Sometimes these are already stored in a qstr, other times it
easily could be.

So this patch changes these three functions to receive a 'struct qstr *',
and improves the documentation.

QSTR_LEN() is added to make it easy to pass a QSTR containing a known
len.

[brauner@kernel.org: take a struct qstr pointer]
Signed-off-by: NeilBrown <neil@brown.name>
Link: https://lore.kernel.org/r/20250319031545.2999807-2-neil@brown.name
Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-07 09:25:32 +02:00
Sidong Yang 8e587ab43c btrfs: ioctl: don't free iov when btrfs_encoded_read() returns -EAGAIN
Fix a bug in encoded read that mistakenly frees the iov in case
btrfs_encoded_read() returns -EAGAIN assuming the structure will be
reused.  This can happen when when receiving requests concurrently, the
io_uring subsystem does not reset the data, and the last free will
happen in btrfs_uring_read_finished().

Handle the -EAGAIN error and skip freeing iov.

CC: stable@vger.kernel.org # 6.13+
Signed-off-by: Sidong Yang <sidong.yang@furiosa.ai>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-04-01 01:02:26 +02:00
Sun YangKai 140ac522de btrfs: simplify the return value handling in search_ioctl()
Move the assignment of -EFAULT to within the error condition check
in fault_in_subpage_writeable(). The previous placement outside the
condition could lead to the error value being overwritten by subsequent
assignments, cause unnecessary assignments.

Simplify loop exit logic by removing redundant goto.
The original code used 'goto err' to bypass post-loop processing after
handling errors from btrfs_search_forward(). However, the loop's
termination naturally falls through to the post-loop section, which
already handles 'ret' values. Replacing 'goto err' with 'break'
eliminates redundant control flow, consolidates error handling, and
makes the loop's exit conditions explicit.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:51 +01:00
Filipe Manana b204e5c7d4 btrfs: make btrfs_iget() return a btrfs inode instead
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:50 +01:00
David Sterba f6e8a43611 btrfs: unify inode variable naming
Rename binode to inode in local variables or parameters so it's more
unified with the rest of the code.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba f272c004d2 btrfs: pass struct to btrfs_ioctl_subvol_getflags()
Pass a struct btrfs_inode to btrfs_ioctl_subvol_getflags() as it's an
internal interface, allowing to remove some use of BTRFS_I.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba f6c2ccfc3b btrfs: simplify local variables in btrfs_ioctl_resize()
Remove some redundant variables and assignments, move variable
declarations to their closest scope.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba 4f27a69394 btrfs: pass struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags()
Pass a struct btrfs_inode to btrfs_sync_inode_flags_to_i_flags() as it's
an internal interface.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba 68dc1cb231 btrfs: pass root pointers to search tree ioctl helpers
The search tree ioctl use btrfs_root so change that from btrfs_inode
pointers so we don't have to do the conversion.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba 4e043cd196 btrfs: pass btrfs_root pointers to send ioctl parameters
The ioctl switch btrfs_ioctl() provides several parameter types for
convenience so we don't have to do the conversion in the callbacks.
Pass root pointers to the send related functions.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba 5e54f9420f btrfs: parameter constification in ioctl.c
Add const to function parameters that are not changed.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:49 +01:00
David Sterba fc11fd0cb8 btrfs: pass struct btrfs_inode to btrfs_defrag_file()
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal
interface, allowing to remove some use of BTRFS_I.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:43 +01:00
David Sterba dba6ae0b43 btrfs: unify ordering of btrfs_key initializations
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:42 +01:00
Qu Wenruo ecde48a1a6 btrfs: expose per-inode stable writes flag
The address space flag AS_STABLE_WRITES determine if FGP_STABLE for will
wait for the folio to finish its writeback.

For btrfs, due to the default data checksum behavior, if we modify the
folio while it's still under writeback, it will cause data checksum
mismatch.  Thus for quite some call sites we manually call
folio_wait_writeback() to prevent such problem from happening.

Currently there is only one call site inside btrfs really utilizing
FGP_STABLE, and in that case we also manually call folio_wait_writeback()
to do the waiting.

But it's better to properly expose the stable writes flag to a per-inode
basis, to allow call sites to fully benefit from FGP_STABLE flag.
E.g. for inodes with NODATASUM allowing beginning dirtying the page
without waiting for writeback.

This involves:

- Update the mapping's stable write flag when setting/clearing NODATASUM
  inode flag using ioctl
  This only works for empty files, so it should be fine.

- Update the mapping's stable write flag when reading an inode from disk

- Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call
  site

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-03-18 20:35:41 +01:00