Commit Graph

2601 Commits

Author SHA1 Message Date
Kees Cook ceb35b666d bpf/verifier: Use kmalloc_size_roundup() to match ksize() usage
Most allocation sites in the kernel want an explicitly sized allocation
(and not "more"), and that dynamic runtime analysis tools (e.g. KASAN,
UBSAN_BOUNDS, FORTIFY_SOURCE, etc) are looking for precise bounds checking
(i.e. not something that is rounded up). A tiny handful of allocations
were doing an implicit alloc/realloc loop that actually depended on
ksize(), and didn't actually always call realloc. This has created a
long series of bugs and problems over many years related to the runtime
bounds checking, so these callers are finally being adjusted to _not_
depend on the ksize() side-effect, by doing one of several things:

- tracking the allocation size precisely and just never calling ksize()
  at all [1].

- always calling realloc and not using ksize() at all. (This solution
  ends up actually be a subset of the next solution.)

- using kmalloc_size_roundup() to explicitly round up the desired
  allocation size immediately [2].

The bpf/verifier case is this another of this latter case, and is the
last outstanding case to be fixed in the kernel.

Because some of the dynamic bounds checking depends on the size being an
_argument_ to an allocator function (i.e. see the __alloc_size attribute),
the ksize() users are rare, and it could waste local variables, it
was been deemed better to explicitly separate the rounding up from the
allocation itself [3].

Round up allocations with kmalloc_size_roundup() so that the verifier's
use of ksize() is always accurate.

[1] e.g.:
    https://git.kernel.org/linus/712f210a457d
    https://git.kernel.org/linus/72c08d9f4c72

[2] e.g.:
    https://git.kernel.org/netdev/net-next/c/12d6c1d3a2ad
    https://git.kernel.org/netdev/net-next/c/ab3f7828c979
    https://git.kernel.org/netdev/net-next/c/d6dd508080a3

[3] https://lore.kernel.org/lkml/0ea1fc165a6c6117f982f4f135093e69cb884930.camel@redhat.com/

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20221118183409.give.387-kees@kernel.org
2022-11-21 15:07:04 +01:00
Yonghong Song a35b9af4ec bpf: Add a kfunc for generic type cast
Implement bpf_rdonly_cast() which tries to cast the object
to a specified type. This tries to support use case like below:
  #define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
where skb_end_pointer(SKB) is a 'unsigned char *' and needs to
be casted to 'struct skb_shared_info *'.

The signature of bpf_rdonly_cast() looks like
   void *bpf_rdonly_cast(void *obj, __u32 btf_id)
The function returns the same 'obj' but with PTR_TO_BTF_ID with
btf_id. The verifier will ensure btf_id being a struct type.

Since the supported type cast may not reflect what the 'obj'
represents, the returned btf_id is marked as PTR_UNTRUSTED, so
the return value and subsequent pointer chasing cannot be
used as helper/kfunc arguments.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195437.3114585-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:45:26 -08:00
Yonghong Song fd264ca020 bpf: Add a kfunc to type cast from bpf uapi ctx to kernel ctx
Implement bpf_cast_to_kern_ctx() kfunc which does a type cast
of a uapi ctx object to the corresponding kernel ctx. Previously
if users want to access some data available in kctx but not
in uapi ctx, bpf_probe_read_kernel() helper is needed.
The introduction of bpf_cast_to_kern_ctx() allows direct
memory access which makes code simpler and easier to understand.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195432.3113982-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:43:37 -08:00
Yonghong Song cfe1456440 bpf: Add support for kfunc set with common btf_ids
Later on, we will introduce kfuncs bpf_cast_to_kern_ctx() and
bpf_rdonly_cast() which apply to all program types. Currently kfunc set
only supports individual prog types. This patch added support for kfunc
applying to all program types.

Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221120195426.3113828-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:43:37 -08:00
Kumar Kartikeya Dwivedi e181d3f143 bpf: Disallow bpf_obj_new_impl call when bpf_mem_alloc_init fails
In the unlikely event that bpf_global_ma is not correctly initialized,
instead of checking the boolean everytime bpf_obj_new_impl is called,
simply check it while loading the program and return an error if
bpf_global_ma_set is false.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221120212610.2361700-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 15:38:28 -08:00
David Vernet 90660309b0 bpf: Add kfuncs for storing struct task_struct * as a kptr
Now that BPF supports adding new kernel functions with kfuncs, and
storing kernel objects in maps with kptrs, we can add a set of kfuncs
which allow struct task_struct objects to be stored in maps as
referenced kptrs. The possible use cases for doing this are plentiful.
During tracing, for example, it would be useful to be able to collect
some tasks that performed a certain operation, and then periodically
summarize who they are, which cgroup they're in, how much CPU time
they've utilized, etc.

In order to enable this, this patch adds three new kfuncs:

struct task_struct *bpf_task_acquire(struct task_struct *p);
struct task_struct *bpf_task_kptr_get(struct task_struct **pp);
void bpf_task_release(struct task_struct *p);

A follow-on patch will add selftests validating these kfuncs.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-4-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
David Vernet 3f00c52393 bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs
Kfuncs currently support specifying the KF_TRUSTED_ARGS flag to signal
to the verifier that it should enforce that a BPF program passes it a
"safe", trusted pointer. Currently, "safe" means that the pointer is
either PTR_TO_CTX, or is refcounted. There may be cases, however, where
the kernel passes a BPF program a safe / trusted pointer to an object
that the BPF program wishes to use as a kptr, but because the object
does not yet have a ref_obj_id from the perspective of the verifier, the
program would be unable to pass it to a KF_ACQUIRE | KF_TRUSTED_ARGS
kfunc.

The solution is to expand the set of pointers that are considered
trusted according to KF_TRUSTED_ARGS, so that programs can invoke kfuncs
with these pointers without getting rejected by the verifier.

There is already a PTR_UNTRUSTED flag that is set in some scenarios,
such as when a BPF program reads a kptr directly from a map
without performing a bpf_kptr_xchg() call. These pointers of course can
and should be rejected by the verifier. Unfortunately, however,
PTR_UNTRUSTED does not cover all the cases for safety that need to
be addressed to adequately protect kfuncs. Specifically, pointers
obtained by a BPF program "walking" a struct are _not_ considered
PTR_UNTRUSTED according to BPF. For example, say that we were to add a
kfunc called bpf_task_acquire(), with KF_ACQUIRE | KF_TRUSTED_ARGS, to
acquire a struct task_struct *. If we only used PTR_UNTRUSTED to signal
that a task was unsafe to pass to a kfunc, the verifier would mistakenly
allow the following unsafe BPF program to be loaded:

SEC("tp_btf/task_newtask")
int BPF_PROG(unsafe_acquire_task,
             struct task_struct *task,
             u64 clone_flags)
{
        struct task_struct *acquired, *nested;

        nested = task->last_wakee;

        /* Would not be rejected by the verifier. */
        acquired = bpf_task_acquire(nested);
        if (!acquired)
                return 0;

        bpf_task_release(acquired);
        return 0;
}

To address this, this patch defines a new type flag called PTR_TRUSTED
which tracks whether a PTR_TO_BTF_ID pointer is safe to pass to a
KF_TRUSTED_ARGS kfunc or a BPF helper function. PTR_TRUSTED pointers are
passed directly from the kernel as a tracepoint or struct_ops callback
argument. Any nested pointer that is obtained from walking a PTR_TRUSTED
pointer is no longer PTR_TRUSTED. From the example above, the struct
task_struct *task argument is PTR_TRUSTED, but the 'nested' pointer
obtained from 'task->last_wakee' is not PTR_TRUSTED.

A subsequent patch will add kfuncs for storing a task kfunc as a kptr,
and then another patch will add selftests to validate.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-3-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
David Vernet ef66c5475d bpf: Allow multiple modifiers in reg_type_str() prefix
reg_type_str() in the verifier currently only allows a single register
type modifier to be present in the 'prefix' string which is eventually
stored in the env type_str_buf. This currently works fine because there
are no overlapping type modifiers, but once PTR_TRUSTED is added, that
will no longer be the case. This patch updates reg_type_str() to support
having multiple modifiers in the prefix string, and updates the size of
type_str_buf to be 128 bytes.

Signed-off-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221120051004.3605026-2-void@manifault.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-20 09:16:21 -08:00
Kumar Kartikeya Dwivedi c22dfdd215 bpf: Add comments for map BTF matching requirement for bpf_list_head
The old behavior of bpf_map_meta_equal was that it compared timer_off
to be equal (but not spin_lock_off, because that was not allowed), and
did memcmp of kptr_off_tab.

Now, we memcmp the btf_record of two bpf_map structs, which has all
fields.

We preserve backwards compat as we kzalloc the array, so if only spin
lock and timer exist in map, we only compare offset while the rest of
unused members in the btf_field struct are zeroed out.

In case of kptr, btf and everything else is of vmlinux or module, so as
long type is same it will match, since kernel btf, module, dtor pointer
will be same across maps.

Now with list_head in the mix, things are a bit complicated. We
implicitly add a requirement that both BTFs are same, because struct
btf_field_list_head has btf and value_rec members.

We obviously shouldn't force BTFs to be equal by default, as that breaks
backwards compatibility.

Currently it is only implicitly required due to list_head matching
struct btf and value_rec member. value_rec points back into a btf_record
stashed in the map BTF (btf member of btf_field_list_head). So that
pointer and btf member has to match exactly.

Document all these subtle details so that things don't break in the
future when touching this code.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-19-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi 534e86bc6c bpf: Add 'release on unlock' logic for bpf_list_push_{front,back}
This commit implements the delayed release logic for bpf_list_push_front
and bpf_list_push_back.

Once a node has been added to the list, it's pointer changes to
PTR_UNTRUSTED. However, it is only released once the lock protecting the
list is unlocked. For such PTR_TO_BTF_ID | MEM_ALLOC with PTR_UNTRUSTED
set but an active ref_obj_id, it is still permitted to read them as long
as the lock is held. Writing to them is not allowed.

This allows having read access to push items we no longer own until we
release the lock guarding the list, allowing a little more flexibility
when working with these APIs.

Note that enabling write support has fairly tricky interactions with
what happens inside the critical section. Just as an example, currently,
bpf_obj_drop is not permitted, but if it were, being able to write to
the PTR_UNTRUSTED pointer while the object gets released back to the
memory allocator would violate safety properties we wish to guarantee
(i.e. not crashing the kernel). The memory could be reused for a
different type in the BPF program or even in the kernel as it gets
eventually kfree'd.

Not enabling bpf_obj_drop inside the critical section would appear to
prevent all of the above, but that is more of an artifical limitation
right now. Since the write support is tangled with how we handle
potential aliasing of nodes inside the critical section that may or may
not be part of the list anymore, it has been deferred to a future patch.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-18-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi 8cab76ec63 bpf: Introduce single ownership BPF linked list API
Add a linked list API for use in BPF programs, where it expects
protection from the bpf_spin_lock in the same allocation as the
bpf_list_head. For now, only one bpf_spin_lock can be present hence that
is assumed to be the one protecting the bpf_list_head.

The following functions are added to kick things off:

// Add node to beginning of list
void bpf_list_push_front(struct bpf_list_head *head, struct bpf_list_node *node);

// Add node to end of list
void bpf_list_push_back(struct bpf_list_head *head, struct bpf_list_node *node);

// Remove node at beginning of list and return it
struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head);

// Remove node at end of list and return it
struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head);

The lock protecting the bpf_list_head needs to be taken for all
operations. The verifier ensures that the lock that needs to be taken is
always held, and only the correct lock is taken for these operations.
These checks are made statically by relying on the reg->id preserved for
registers pointing into regions having both bpf_spin_lock and the
objects protected by it. The comment over check_reg_allocation_locked in
this change describes the logic in detail.

Note that bpf_list_push_front and bpf_list_push_back are meant to
consume the object containing the node in the 1st argument, however that
specific mechanism is intended to not release the ref_obj_id directly
until the bpf_spin_unlock is called. In this commit, nothing is done,
but the next commit will be introducing logic to handle this case, so it
has been left as is for now.

bpf_list_pop_front and bpf_list_pop_back delete the first or last item
of the list respectively, and return pointer to the element at the
list_node offset. The user can then use container_of style macro to get
the actual entry type. The verifier however statically knows the actual
type, so the safety properties are still preserved.

With these additions, programs can now manage their own linked lists and
store their objects in them.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-17-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi df57f38a0d bpf: Permit NULL checking pointer with non-zero fixed offset
Pointer increment on seeing PTR_MAYBE_NULL is already protected against,
hence make an exception for PTR_TO_BTF_ID | MEM_ALLOC while still
keeping the warning for other unintended cases that might creep in.

bpf_list_pop_{front,_back} helpers planned to be introduced in next
commit will return a MEM_ALLOC register with incremented offset pointing
to bpf_list_node field. The user is supposed to then obtain the pointer
to the entry using container_of after NULL checking it. The current
restrictions trigger a warning when doing the NULL checking. Revisiting
the reason, it is meant as an assertion which seems to actually work and
catch the bad case.

Hence, under no other circumstances can reg->off be non-zero for a
register that has the PTR_MAYBE_NULL type flag set.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-16-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi ac9f06050a bpf: Introduce bpf_obj_drop
Introduce bpf_obj_drop, which is the kfunc used to free allocated
objects (allocated using bpf_obj_new). Pairing with bpf_obj_new, it
implicitly destructs the fields part of object automatically without
user intervention.

Just like the previous patch, btf_struct_meta that is needed to free up
the special fields is passed as a hidden argument to the kfunc.

For the user, a convenience macro hides over the kernel side kfunc which
is named bpf_obj_drop_impl.

Continuing the previous example:

void prog(void) {
	struct foo *f;

	f = bpf_obj_new(typeof(*f));
	if (!f)
		return;
	bpf_obj_drop(f);
}

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-15-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi 958cf2e273 bpf: Introduce bpf_obj_new
Introduce type safe memory allocator bpf_obj_new for BPF programs. The
kernel side kfunc is named bpf_obj_new_impl, as passing hidden arguments
to kfuncs still requires having them in prototype, unlike BPF helpers
which always take 5 arguments and have them checked using bpf_func_proto
in verifier, ignoring unset argument types.

Introduce __ign suffix to ignore a specific kfunc argument during type
checks, then use this to introduce support for passing type metadata to
the bpf_obj_new_impl kfunc.

The user passes BTF ID of the type it wants to allocates in program BTF,
the verifier then rewrites the first argument as the size of this type,
after performing some sanity checks (to ensure it exists and it is a
struct type).

The second argument is also fixed up and passed by the verifier. This is
the btf_struct_meta for the type being allocated. It would be needed
mostly for the offset array which is required for zero initializing
special fields while leaving the rest of storage in unitialized state.

It would also be needed in the next patch to perform proper destruction
of the object's special fields.

Under the hood, bpf_obj_new will call bpf_mem_alloc and bpf_mem_free,
using the any context BPF memory allocator introduced recently. To this
end, a global instance of the BPF memory allocator is initialized on
boot to be used for this purpose. This 'bpf_global_ma' serves all
allocations for bpf_obj_new. In the future, bpf_obj_new variants will
allow specifying a custom allocator.

Note that now that bpf_obj_new can be used to allocate objects that can
be linked to BPF linked list (when future linked list helpers are
available), we need to also free the elements using bpf_mem_free.
However, since the draining of elements is done outside the
bpf_spin_lock, we need to do migrate_disable around the call since
bpf_list_head_free can be called from map free path where migration is
enabled. Otherwise, when called from BPF programs migration is already
disabled.

A convenience macro is included in the bpf_experimental.h header to hide
over the ugly details of the implementation, leading to user code
looking similar to a language level extension which allocates and
constructs fields of a user type.

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
};

void prog(void) {
	struct foo *f;

	f = bpf_obj_new(typeof(*f));
	if (!f)
		return;
	...
}

A key piece of this story is still missing, i.e. the free function,
which will come in the next patch.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-14-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:14 -08:00
Kumar Kartikeya Dwivedi a50388dbb3 bpf: Support constant scalar arguments for kfuncs
Allow passing known constant scalars as arguments to kfuncs that do not
represent a size parameter. We use mark_chain_precision for the constant
scalar argument to mark it precise. This makes the search pruning
optimization of verifier more conservative for such kfunc calls, and
each non-distinct argument is considered unequivalent.

We will use this support to then expose a bpf_obj_new function where it
takes the local type ID of a type in program BTF, and returns a
PTR_TO_BTF_ID | MEM_ALLOC to the local type, and allows programs to
allocate their own objects.

Each type ID resolves to a distinct type with a possibly distinct size,
hence the type ID constant matters in terms of program safety and its
precision needs to be checked between old and cur states inside regsafe.
The use of mark_chain_precision enables this.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-13-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:22:13 -08:00
Kumar Kartikeya Dwivedi 00b85860fe bpf: Rewrite kfunc argument handling
As we continue to add more features, argument types, kfunc flags, and
different extensions to kfuncs, the code to verify the correctness of
the kfunc prototype wrt the passed in registers has become ad-hoc and
ugly to read. To make life easier, and make a very clear split between
different stages of argument processing, move all the code into
verifier.c and refactor into easier to read helpers and functions.

This also makes sharing code within the verifier easier with kfunc
argument processing. This will be more and more useful in later patches
as we are now moving to implement very core BPF helpers as kfuncs, to
keep them experimental before baking into UAPI.

Remove all kfunc related bits now from btf_check_func_arg_match, as
users have been converted away to refactored kfunc argument handling.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-12-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:16:49 -08:00
Kumar Kartikeya Dwivedi b7ff97925b bpf: Allow locking bpf_spin_lock in inner map values
There is no need to restrict users from locking bpf_spin_lock in map
values of inner maps. Each inner map lookup gets a unique reg->id
assigned to the returned PTR_TO_MAP_VALUE which will be preserved after
the NULL check. Distinct lookups into different inner map get unique
IDs, and distinct lookups into same inner map also get unique IDs.

Hence, lift the restriction by removing the check return -ENOTSUPP in
map_in_map.c. Later commits will add comprehensive test cases to ensure
that invalid cases are rejected.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-11-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi d0d78c1df9 bpf: Allow locking bpf_spin_lock global variables
Global variables reside in maps accessible using direct_value_addr
callbacks, so giving each load instruction's rewrite a unique reg->id
disallows us from holding locks which are global.

The reason for preserving reg->id as a unique value for registers that
may point to spin lock is that two separate lookups are treated as two
separate memory regions, and any possible aliasing is ignored for the
purposes of spin lock correctness.

This is not great especially for the global variable case, which are
served from maps that have max_entries == 1, i.e. they always lead to
map values pointing into the same map value.

So refactor the active_spin_lock into a 'active_lock' structure which
represents the lock identity, and instead of the reg->id, remember two
fields, a pointer and the reg->id. The pointer will store reg->map_ptr
or reg->btf. It's only necessary to distinguish for the id == 0 case of
global variables, but always setting the pointer to a non-NULL value and
using the pointer to check whether the lock is held simplifies code in
the verifier.

This is generic enough to allow it for global variables, map lookups,
and allocated objects at the same time.

Note that while whether a lock is held can be answered by just comparing
active_lock.ptr to NULL, to determine whether the register is pointing
to the same held lock requires comparing _both_ ptr and id.

Finally, as a result of this refactoring, pseudo load instructions are
not given a unique reg->id, as they are doing lookup for the same map
value (max_entries is never greater than 1).

Essentially, we consider that the tuple of (ptr, id) will always be
unique for any kind of argument to bpf_spin_{lock,unlock}.

Note that this can be extended in the future to also remember offset
used for locking, so that we can introduce multiple bpf_spin_lock fields
in the same allocation.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-10-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi 4e814da0d5 bpf: Allow locking bpf_spin_lock in allocated objects
Allow locking a bpf_spin_lock in an allocated object, in addition to
already supported map value pointers. The handling is similar to that of
map values, by just preserving the reg->id of PTR_TO_BTF_ID | MEM_ALLOC
as well, and adjusting process_spin_lock to work with them and remember
the id in verifier state.

Refactor the existing process_spin_lock to work with PTR_TO_BTF_ID |
MEM_ALLOC in addition to PTR_TO_MAP_VALUE. We need to update the
reg_may_point_to_spin_lock which is used in mark_ptr_or_null_reg to
preserve reg->id, that will be used in env->cur_state->active_spin_lock
to remember the currently held spin lock.

Also update the comment describing bpf_spin_lock implementation details
to also talk about PTR_TO_BTF_ID | MEM_ALLOC type.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi 865ce09a49 bpf: Verify ownership relationships for user BTF types
Ensure that there can be no ownership cycles among different types by
way of having owning objects that can hold some other type as their
element. For instance, a map value can only hold allocated objects, but
these are allowed to have another bpf_list_head. To prevent unbounded
recursion while freeing resources, elements of bpf_list_head in local
kptrs can never have a bpf_list_head which are part of list in a map
value. Later patches will verify this by having dedicated BTF selftests.

Also, to make runtime destruction easier, once btf_struct_metas is fully
populated, we can stash the metadata of the value type directly in the
metadata of the list_head fields, as that allows easier access to the
value type's layout to destruct it at runtime from the btf_field entry
of the list head itself.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi 8ffa5cc142 bpf: Recognize lock and list fields in allocated objects
Allow specifying bpf_spin_lock, bpf_list_head, bpf_list_node fields in a
allocated object.

Also update btf_struct_access to reject direct access to these special
fields.

A bpf_list_head allows implementing map-in-map style use cases, where an
allocated object with bpf_list_head is linked into a list in a map
value. This would require embedding a bpf_list_node, support for which
is also included. The bpf_spin_lock is used to protect the bpf_list_head
and other data.

While we strictly don't require to hold a bpf_spin_lock while touching
the bpf_list_head in such objects, as when have access to it, we have
complete ownership of the object, the locking constraint is still kept
and may be conditionally lifted in the future.

Note that the specification of such types can be done just like map
values, e.g.:

struct bar {
	struct bpf_list_node node;
};

struct foo {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(bar, node);
	struct bpf_list_node node;
};

struct map_value {
	struct bpf_spin_lock lock;
	struct bpf_list_head head __contains(foo, node);
};

To recognize such types in user BTF, we build a btf_struct_metas array
of metadata items corresponding to each BTF ID. This is done once during
the btf_parse stage to avoid having to do it each time during the
verification process's requirement to inspect the metadata.

Moreover, the computed metadata needs to be passed to some helpers in
future patches which requires allocating them and storing them in the
BTF that is pinned by the program itself, so that valid access can be
assumed to such data during program runtime.

A key thing to note is that once a btf_struct_meta is available for a
type, both the btf_record and btf_field_offs should be available. It is
critical that btf_field_offs is available in case special fields are
present, as we extensively rely on special fields being zeroed out in
map values and allocated objects in later patches. The code ensures that
by bailing out in case of errors and ensuring both are available
together. If the record is not available, the special fields won't be
recognized, so not having both is also fine (in terms of being a
verification error and not a runtime bug).

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi 282de143ea bpf: Introduce allocated objects support
Introduce support for representing pointers to objects allocated by the
BPF program, i.e. PTR_TO_BTF_ID that point to a type in program BTF.
This is indicated by the presence of MEM_ALLOC type flag in reg->type to
avoid having to check btf_is_kernel when trying to match argument types
in helpers.

Whenever walking such types, any pointers being walked will always yield
a SCALAR instead of pointer. In the future we might permit kptr inside
such allocated objects (either kernel or program allocated), and it will
then form a PTR_TO_BTF_ID of the respective type.

For now, such allocated objects will always be referenced in verifier
context, hence ref_obj_id == 0 for them is a bug. It is allowed to write
to such objects, as long fields that are special are not touched
(support for which will be added in subsequent patches). Note that once
such a pointer is marked PTR_UNTRUSTED, it is no longer allowed to write
to it.

No PROBE_MEM handling is therefore done for loads into this type unless
PTR_UNTRUSTED is part of the register type, since they can never be in
an undefined state, and their lifetime will always be valid.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi f73e601aaf bpf: Populate field_offs for inner_map_meta
Far too much code simply assumes that both btf_record and btf_field_offs
are set to valid pointers together, or both are unset. They go together
hand in hand as btf_record describes the special fields and
btf_field_offs is compact representation for runtime copying/zeroing.

It is very difficult to make this clear in the code when the only
exception to this universal invariant is inner_map_meta which is used
as reg->map_ptr in the verifier. This is simply a bug waiting to happen,
as in verifier context we cannot easily distinguish if PTR_TO_MAP_VALUE
is coming from an inner map, and if we ever end up using field_offs for
any reason in the future, we will silently ignore the special fields for
inner map case (as NULL is not an error but unset field_offs).

Hence, simply copy field_offs from inner map together with btf_record.

While at it, refactor code to unwind properly on errors with gotos.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:32 -08:00
Kumar Kartikeya Dwivedi d48995723c bpf: Free inner_map_meta when btf_record_dup fails
Whenever btf_record_dup fails, we must free inner_map_meta that was
allocated before.

This fixes a memory leak (in case of errors) during inner map creation.

Fixes: aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi d7f5ef653c bpf: Do btf_record_free outside map_free callback
Since the commit being fixed, we now miss freeing btf_record for local
storage maps which will have a btf_record populated in case they have
bpf_spin_lock element.

This was missed because I made the choice of offloading the job to free
kptr_off_tab (now btf_record) to the map_free callback when adding
support for kptrs.

Revisiting the reason for this decision, there is the possibility that
the btf_record gets used inside map_free callback (e.g. in case of maps
embedding kptrs) to iterate over them and free them, hence doing it
before the map_free callback would be leaking special field memory, and
do invalid memory access. The btf_record keeps module references which
is critical to ensure the dtor call made for referenced kptr is safe to
do.

If doing it after map_free callback, the map area is already freed, so
we cannot access bpf_map structure anymore.

To fix this and prevent such lapses in future, move bpf_map_free_record
out of the map_free callback, and do it after map_free by remembering
the btf_record pointer. There is no need to access bpf_map structure in
that case, and we can avoid missing this case when support for new map
types is added for other special fields.

Since a btf_record and its btf_field_offs are used together, for
consistency delay freeing of field_offs as well. While not a problem
right now, a lot of code assumes that either both record and field_offs
are set or none at once.

Note that in case of map of maps (outer maps), inner_map_meta->record is
only used during verification, not to free fields in map value, hence we
simply keep the bpf_map_free_record call as is in bpf_map_meta_free and
never touch map->inner_map_meta in bpf_map_free_deferred.

Add a comment making note of these details.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Kumar Kartikeya Dwivedi c237bfa528 bpf: Fix early return in map_check_btf
Instead of returning directly with -EOPNOTSUPP for the timer case, we
need to free the btf_record before returning to userspace.

Fixes: db55911782 ("bpf: Consolidate spin_lock, timer management into btf_record")
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221118015614.2013203-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-17 19:11:31 -08:00
Jakub Kicinski 224b744abf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/bpf.h
  1f6e04a1c7 ("bpf: Fix offset calculation error in __copy_map_value and zero_map_value")
  aa3496accc ("bpf: Refactor kptr_off_tab into btf_record")
  f71b2f6417 ("bpf: Refactor map->off_arr handling")
https://lore.kernel.org/all/20221114095000.67a73239@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17 18:30:39 -08:00
Jason A. Donenfeld 8032bf1233 treewide: use get_random_u32_below() instead of deprecated function
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Reviewed-by: SeongJae Park <sj@kernel.org> # for damon
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> # for infiniband
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> # for arm
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-11-18 02:15:15 +01:00
Hou Tao 3af43ba4c6 bpf: Pass map file to .map_update_batch directly
Currently bpf_map_do_batch() first invokes fdget(batch.map_fd) to get
the target map file, then it invokes generic_map_update_batch() to do
batch update. generic_map_update_batch() will get the target map file
by using fdget(batch.map_fd) again and pass it to bpf_map_update_value().

The problem is map file returned by the second fdget() may be NULL or a
totally different file compared by map file in bpf_map_do_batch(). The
reason is that the first fdget() only guarantees the liveness of struct
file instead of file descriptor and the file description may be released
by concurrent close() through pick_file().

It doesn't incur any problem as for now, because maps with batch update
support don't use map file in .map_fd_get_ptr() ops. But it is better to
fix the potential access of an invalid map file.

Using __bpf_map_get() again in generic_map_update_batch() can not fix
the problem, because batch.map_fd may be closed and reopened, and the
returned map file may be different with map file got in
bpf_map_do_batch(), so just passing the map file directly to
.map_update_batch() in bpf_map_do_batch().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221116075059.1551277-1-houtao@huaweicloud.com
2022-11-17 17:12:35 +01:00
Eduard Zingerman befae75856 bpf: propagate nullness information for reg to reg comparisons
Propagate nullness information for branches of register to register
equality compare instructions. The following rules are used:
- suppose register A maybe null
- suppose register B is not null
- for JNE A, B, ... - A is not null in the false branch
- for JEQ A, B, ... - A is not null in the true branch

E.g. for program like below:

  r6 = skb->sk;
  r7 = sk_fullsock(r6);
  r0 = sk_fullsock(r6);
  if (r0 == 0) return 0;    (a)
  if (r0 != r7) return 0;   (b)
  *r7->type;                (c)
  return 0;

It is safe to dereference r7 at point (c), because of (a) and (b).

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221115224859.2452988-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 17:38:36 -08:00
Toke Høiland-Jørgensen 32637e3300 bpf: Expand map key argument of bpf_redirect_map to u64
For queueing packets in XDP we want to add a new redirect map type with
support for 64-bit indexes. To prepare fore this, expand the width of the
'key' argument to the bpf_redirect_map() helper. Since BPF registers are
always 64-bit, this should be safe to do after the fact.

Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20221108140601.149971-3-toke@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-15 09:00:27 -08:00
Kumar Kartikeya Dwivedi 6728aea721 bpf: Refactor btf_struct_access
Instead of having to pass multiple arguments that describe the register,
pass the bpf_reg_state into the btf_struct_access callback. Currently,
all call sites simply reuse the btf and btf_id of the reg they want to
check the access of. The only exception to this pattern is the callsite
in check_ptr_to_map_access, hence for that case create a dummy reg to
simulate PTR_TO_BTF_ID access.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi 894f2a8b16 bpf: Rename MEM_ALLOC to MEM_RINGBUF
Currently, verifier uses MEM_ALLOC type tag to specially tag memory
returned from bpf_ringbuf_reserve helper. However, this is currently
only used for this purpose and there is an implicit assumption that it
only refers to ringbuf memory (e.g. the check for ARG_PTR_TO_ALLOC_MEM
in check_func_arg_reg_off).

Hence, rename MEM_ALLOC to MEM_RINGBUF to indicate this special
relationship and instead open the use of MEM_ALLOC for more generic
allocations made for user types.

Also, since ARG_PTR_TO_ALLOC_MEM_OR_NULL is unused, simply drop it.

Finally, update selftests using 'alloc_' verifier string to 'ringbuf_'.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi 2de2669b4e bpf: Rename RET_PTR_TO_ALLOC_MEM
Currently, the verifier has two return types, RET_PTR_TO_ALLOC_MEM, and
RET_PTR_TO_ALLOC_MEM_OR_NULL, however the former is confusingly named to
imply that it carries MEM_ALLOC, while only the latter does. This causes
confusion during code review leading to conclusions like that the return
value of RET_PTR_TO_DYNPTR_MEM_OR_NULL (which is RET_PTR_TO_ALLOC_MEM |
PTR_MAYBE_NULL) may be consumable by bpf_ringbuf_{submit,commit}.

Rename it to make it clear MEM_ALLOC needs to be tacked on top of
RET_PTR_TO_MEM.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi f0c5941ff5 bpf: Support bpf_list_head in map values
Add the support on the map side to parse, recognize, verify, and build
metadata table for a new special field of the type struct bpf_list_head.
To parameterize the bpf_list_head for a certain value type and the
list_node member it will accept in that value type, we use BTF
declaration tags.

The definition of bpf_list_head in a map value will be done as follows:

struct foo {
	struct bpf_list_node node;
	int data;
};

struct map_value {
	struct bpf_list_head head __contains(foo, node);
};

Then, the bpf_list_head only allows adding to the list 'head' using the
bpf_list_node 'node' for the type struct foo.

The 'contains' annotation is a BTF declaration tag composed of four
parts, "contains:name:node" where the name is then used to look up the
type in the map BTF, with its kind hardcoded to BTF_KIND_STRUCT during
the lookup. The node defines name of the member in this type that has
the type struct bpf_list_node, which is actually used for linking into
the linked list. For now, 'kind' part is hardcoded as struct.

This allows building intrusive linked lists in BPF, using container_of
to obtain pointer to entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.

For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing items is done with
a TODO that will be resolved in a future patch.

Note that the bpf_list_head_free function moves the list out to a local
variable under the lock and releases it, doing the actual draining of
the list items outside the lock. While this helps with not holding the
lock for too long pessimizing other concurrent list operations, it is
also necessary for deadlock prevention: unless every function called in
the critical section would be notrace, a fentry/fexit program could
attach and call bpf_map_update_elem again on the map, leading to the
same lock being acquired if the key matches and lead to a deadlock.
While this requires some special effort on part of the BPF programmer to
trigger and is highly unlikely to occur in practice, it is always better
if we can avoid such a condition.

While notrace would prevent this, doing the draining outside the lock
has advantages of its own, hence it is used to also fix the deadlock
related problem.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Kumar Kartikeya Dwivedi 2d57725257 bpf: Remove BPF_MAP_OFF_ARR_MAX
In f71b2f6417 ("bpf: Refactor map->off_arr handling"), map->off_arr
was refactored to be btf_field_offs. The number of field offsets is
equal to maximum possible fields limited by BTF_FIELDS_MAX. Hence, reuse
BTF_FIELDS_MAX as spin_lock and timer no longer are to be handled
specially for offset sorting, fix the comment, and remove incorrect
WARN_ON as its rec->cnt can never exceed this value. The reason to keep
separate constant was the it was always more 2 more than total kptrs.
This is no longer the case.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221114191547.1694267-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-14 21:52:45 -08:00
Jakub Kicinski f4c4ca70de bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEET63h6RnJhTJHuKTjXOwUVIRcSScFAmNu2EkACgkQXOwUVIRc
 SSebKhAA0ffmp5jJgEJpQYNABGLYIJcwKkBrGClDbMJLtwCjevGZJajT9fpbCLb1
 eK6EIhdfR0NTO+0KtUVkZ8WMa81OmLEJYdTNtJfNE23ENMpssiAWhlhDF8AoXeKv
 Bo3j719gn3Cw9PWXQoircH3wpj+5RMDnjxy4iYlA5yNrvzC7XVmssMF+WALvQnuK
 CGrfR57hxdgmphmasRqeCzEoriwihwPsG3k6eQN8rf7ZytLhs90tMVgT9L3Cd2u9
 DafA0Xl8mZdz2mHhThcJhQVq4MUymZj44ufuHDiOs1j6nhUlWToyQuvegPOqxKti
 uLGtZul0ls+3UP0Lbrv1oEGU/MWMxyDz4IBc0EVs0k3ItQbmSKs6r9WuPFGd96Sb
 GHk68qFVySeLGN0LfKe3rCHJ9ZoIOPYJg9qT8Rd5bOhetgGwSsxZTxUI39BxkFup
 CEqwIDnts1TMU37GDjj+vssKW91k4jEzMZVtRfsL3J36aJs28k/Ez4AqLXg6WU6u
 ADqFaejVPcXbN9rX90onIYxxiL28gZSeT+i8qOPELZtqTQmNWz+tC/ySVuWnD8Mn
 Nbs7PZ1IWiNZpsKS8pZnpd6j4mlBeJnwXkPKiFy+xHGuwRSRdYl6G9e5CtlRely/
 rwQ8DtaOpRYMrGhnmBEdAOCa9t/iqzrzHzjoigjJ7iAST4ToJ5s=
 =Y+/e
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Andrii Nakryiko says:

====================
bpf-next 2022-11-11

We've added 49 non-merge commits during the last 9 day(s) which contain
a total of 68 files changed, 3592 insertions(+), 1371 deletions(-).

The main changes are:

1) Veristat tool improvements to support custom filtering, sorting, and replay
   of results, from Andrii Nakryiko.

2) BPF verifier precision tracking fixes and improvements,
   from Andrii Nakryiko.

3) Lots of new BPF documentation for various BPF maps, from Dave Tucker,
   Donald Hunter, Maryam Tahhan, Bagas Sanjaya.

4) BTF dedup improvements and libbpf's hashmap interface clean ups, from
   Eduard Zingerman.

5) Fix veth driver panic if XDP program is attached before veth_open, from
   John Fastabend.

6) BPF verifier clean ups and fixes in preparation for follow up features,
   from Kumar Kartikeya Dwivedi.

7) Add access to hwtstamp field from BPF sockops programs,
   from Martin KaFai Lau.

8) Various fixes for BPF selftests and samples, from Artem Savkov,
   Domenico Cerasuolo, Kang Minchul, Rong Tao, Yang Jihong.

9) Fix redirection to tunneling device logic, preventing skb->len == 0, from
   Stanislav Fomichev.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (49 commits)
  selftests/bpf: fix veristat's singular file-or-prog filter
  selftests/bpf: Test skops->skb_hwtstamp
  selftests/bpf: Fix incorrect ASSERT in the tcp_hdr_options test
  bpf: Add hwtstamp field for the sockops prog
  selftests/bpf: Fix xdp_synproxy compilation failure in 32-bit arch
  bpf, docs: Document BPF_MAP_TYPE_ARRAY
  docs/bpf: Document BPF map types QUEUE and STACK
  docs/bpf: Document BPF ARRAY_OF_MAPS and HASH_OF_MAPS
  docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
  docs/bpf: Document BPF_MAP_TYPE_LPM_TRIE map
  libbpf: Hashmap.h update to fix build issues using LLVM14
  bpf: veth driver panics when xdp prog attached before veth_open
  selftests: Fix test group SKIPPED result
  selftests/bpf: Tests for btf_dedup_resolve_fwds
  libbpf: Resolve unambigous forward declarations
  libbpf: Hashmap interface update to allow both long and void* keys/values
  samples/bpf: Fix sockex3 error: Missing BPF prog type
  selftests/bpf: Fix u32 variable compared with less than zero
  Documentation: bpf: Escape underscore in BPF type name prefix
  selftests/bpf: Use consistent build-id type for liburandom_read.so
  ...
====================

Link: https://lore.kernel.org/r/20221111233733.1088228-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11 18:33:04 -08:00
Xu Kuohai 4b45cd81f7 bpf: Initialize same number of free nodes for each pcpu_freelist
pcpu_freelist_populate() initializes nr_elems / num_possible_cpus() + 1
free nodes for some CPUs, and then possibly one CPU with fewer nodes,
followed by remaining cpus with 0 nodes. For example, when nr_elems == 256
and num_possible_cpus() == 32, CPU 0~27 each gets 9 free nodes, CPU 28 gets
4 free nodes, CPU 29~31 get 0 free nodes, while in fact each CPU should get
8 nodes equally.

This patch initializes nr_elems / num_possible_cpus() free nodes for each
CPU firstly, then allocates the remaining free nodes by one for each CPU
until no free nodes left.

Fixes: e19494edab ("bpf: introduce percpu_freelist")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221110122128.105214-1-xukuohai@huawei.com
2022-11-11 12:05:14 -08:00
Maryam Tahhan 161939abc8 docs/bpf: Document BPF_MAP_TYPE_CPUMAP map
Add documentation for BPF_MAP_TYPE_CPUMAP including
kernel version introduced, usage and examples.

Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20221107165207.2682075-2-mtahhan@redhat.com
2022-11-11 11:32:54 -08:00
Jakub Kicinski 966a9b4903 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/pch_can.c
  ae64438be1 ("can: dev: fix skb drop check")
  1dd1b521be ("can: remove obsolete PCH CAN driver")
https://lore.kernel.org/all/20221110102509.1f7d63cc@canb.auug.org.au/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-10 17:43:53 -08:00
Wang Yufen eb86559a69 bpf: Fix memory leaks in __check_func_call
kmemleak reports this issue:

unreferenced object 0xffff88817139d000 (size 2048):
  comm "test_progs", pid 33246, jiffies 4307381979 (age 45851.820s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000045f075f0>] kmalloc_trace+0x27/0xa0
    [<0000000098b7c90a>] __check_func_call+0x316/0x1230
    [<00000000b4c3c403>] check_helper_call+0x172e/0x4700
    [<00000000aa3875b7>] do_check+0x21d8/0x45e0
    [<000000001147357b>] do_check_common+0x767/0xaf0
    [<00000000b5a595b4>] bpf_check+0x43e3/0x5bc0
    [<0000000011e391b1>] bpf_prog_load+0xf26/0x1940
    [<0000000007f765c0>] __sys_bpf+0xd2c/0x3650
    [<00000000839815d6>] __x64_sys_bpf+0x75/0xc0
    [<00000000946ee250>] do_syscall_64+0x3b/0x90
    [<0000000000506b7f>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root case here is: In function prepare_func_exit(), the callee is
not released in the abnormal scenario after "state->curframe--;". To
fix, move "state->curframe--;" to the very bottom of the function,
right when we free callee and reset frame[] pointer to NULL, as Andrii
suggested.

In addition, function __check_func_call() has a similar problem. In
the abnormal scenario before "state->curframe++;", the callee also
should be released by free_func_state().

Fixes: 69c087ba62 ("bpf: Add bpf_for_each_map_elem() helper")
Fixes: fd978bf7fd ("bpf: Add reference tracking to verifier")
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Link: https://lore.kernel.org/r/1667884291-15666-1-git-send-email-wangyufen@huawei.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-08 09:21:35 -08:00
Nathan Chancellor a679120edf bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE()
When building with clang:

  kernel/bpf/dispatcher.c:126:33: error: pointer type mismatch ('void *' and 'unsigned int (*)(const void *, const struct bpf_insn *, bpf_func_t)' (aka 'unsigned int (*)(const void *, const struct bpf_insn *, unsigned int (*)(const void *, const struct bpf_insn *))')) [-Werror,-Wpointer-type-mismatch]
          __BPF_DISPATCHER_UPDATE(d, new ?: &bpf_dispatcher_nop_func);
                                     ~~~ ^  ~~~~~~~~~~~~~~~~~~~~~~~~
  ./include/linux/bpf.h:1045:54: note: expanded from macro '__BPF_DISPATCHER_UPDATE'
          __static_call_update((_d)->sc_key, (_d)->sc_tramp, (_new))
                                                              ^~~~
  1 error generated.

The warning is pointing out that the type of new ('void *') and
&bpf_dispatcher_nop_func are not compatible, which could have side
effects coming out of a conditional operator due to promotion rules.

Add the explicit cast to 'void *' to make it clear that this is
expected, as __BPF_DISPATCHER_UPDATE() expands to a call to
__static_call_update(), which expects a 'void *' as its final argument.

Fixes: c86df29d11 ("bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)")
Link: https://github.com/ClangBuiltLinux/linux/issues/1755
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221107170711.42409-1-nathan@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-11-07 16:07:35 -08:00
Peter Zijlstra c86df29d11 bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)
The dispatcher function is currently abusing the ftrace __fentry__
call location for its own purposes -- this obviously gives trouble
when the dispatcher and ftrace are both in use.

A previous solution tried using __attribute__((patchable_function_entry()))
which works, except it is GCC-8+ only, breaking the build on the
earlier still supported compilers. Instead use static_call() -- which
has its own annotations and does not conflict with ftrace -- to
rewrite the dispatch function.

By using: return static_call()(ctx, insni, bpf_func) you get a perfect
forwarding tail call as function body (iow a single jmp instruction).
By having the default static_call() target be bpf_dispatcher_nop_func()
it retains the default behaviour (an indirect call to the argument
function). Only once a dispatcher program is attached is the target
rewritten to directly call the JIT'ed image.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lkml.kernel.org/r/Y1/oBlK0yFk5c/Im@hirez.programming.kicks-ass.net
Link: https://lore.kernel.org/bpf/20221103120647.796772565@infradead.org
2022-11-04 23:13:57 +01:00
Peter Zijlstra 18acb7fac2 bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop")
Because __attribute__((patchable_function_entry)) is only available
since GCC-8 this solution fails to build on the minimum required GCC
version.

Undo these changes so we might try again -- without cluttering up the
patches with too many changes.

This is an almost complete revert of:

  dbe69b2998 ("bpf: Fix dispatcher patchable function entry to 5 bytes nop")
  ceea991a01 ("bpf: Move bpf_dispatcher function out of ftrace locations")

(notably the arch/x86/Kconfig hunk is kept).

Reported-by: David Laight <David.Laight@aculab.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lkml.kernel.org/r/439d8dc735bb4858875377df67f1b29a@AcuMS.aculab.com
Link: https://lore.kernel.org/bpf/20221103120647.728830733@infradead.org
2022-11-04 23:13:08 +01:00
Andrii Nakryiko 7a830b53c1 bpf: aggressively forget precise markings during state checkpointing
Exploit the property of about-to-be-checkpointed state to be able to
forget all precise markings up to that point even more aggressively. We
now clear all potentially inherited precise markings right before
checkpointing and branching off into child state. If any of children
states require precise knowledge of any SCALAR register, those will be
propagated backwards later on before this state is finalized, preserving
correctness.

There is a single selftests BPF program change, but tremendous one: 25x
reduction in number of verified instructions and states in
trace_virtqueue_add_sgs.

Cilium results are more modest, but happen across wider range of programs.

SELFTESTS RESULTS
=================

$ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results.csv ~/imprecise-aggressive-results.csv | grep -v '+0'
File                 Program                  Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
loop6.bpf.linked1.o  trace_virtqueue_add_sgs           398057            15114   -382943 (-96.20%)              8717               336      -8381 (-96.15%)
-------------------  -----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

CILIUM RESULTS
==============

$ ./veristat -C -e file,prog,insns,states ~/imprecise-early-results-cilium.csv ~/imprecise-aggressive-results-cilium.csv | grep -v '+0'
File           Program                           Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
bpf_host.o     tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
bpf_host.o     tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_host.o     tail_nodeport_nat_ipv6_egress                3446             3406        -40 (-1.16%)               203               198          -5 (-2.46%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4                    23426            23221       -205 (-0.88%)              1537              1515         -22 (-1.43%)
bpf_lxc.o      tail_handle_nat_fwd_ipv6                    13009            12904       -105 (-0.81%)               719               708         -11 (-1.53%)
bpf_lxc.o      tail_ipv4_ct_egress                          5074             4897       -177 (-3.49%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv4_ct_ingress                         5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv4_ct_ingress_policy_only             5100             4923       -177 (-3.47%)               255               248          -7 (-2.75%)
bpf_lxc.o      tail_ipv6_ct_egress                          4558             4536        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_ipv6_ct_ingress                         4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_ipv6_ct_ingress_policy_only             4578             4556        -22 (-0.48%)               188               187          -1 (-0.53%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6               5261             5196        -65 (-1.24%)               247               243          -4 (-1.62%)
bpf_overlay.o  tail_nodeport_nat_ipv6_egress                3482             3442        -40 (-1.15%)               204               201          -3 (-1.47%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4               17200            15619      -1581 (-9.19%)              1111              1010        -101 (-9.09%)
-------------  --------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-6-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko f63181b6ae bpf: stop setting precise in current state
Setting reg->precise to true in current state is not necessary from
correctness standpoint, but it does pessimise the whole precision (or
rather "imprecision", because that's what we want to keep as much as
possible) tracking. Why is somewhat subtle and my best attempt to
explain this is recorded in an extensive comment for __mark_chain_precise()
function. Some more careful thinking and code reading is probably required
still to grok this completely, unfortunately. Whiteboarding and a bunch
of extra handwaiving in person would be even more helpful, but is deemed
impractical in Git commit.

Next patch pushes this imprecision property even further, building on top of
the insights described in this patch.

End results are pretty nice, we get reduction in number of total instructions
and states verified due to a better states reuse, as some of the states are now
more generic and permissive due to less unnecessary precise=true requirements.

SELFTESTS RESULTS
=================

$ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results.csv ~/imprecise-early-results.csv | grep -v '+0'
File                                     Program                 Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_iter_ksym.bpf.linked1.o              dump_ksym                           347              285       -62 (-17.87%)                20                19          -1 (-5.00%)
pyperf600_bpf_loop.bpf.linked1.o         on_event                           3678             3736        +58 (+1.58%)               276               285          +9 (+3.26%)
setget_sockopt.bpf.linked1.o             skops_sockopt                      4038             3947        -91 (-2.25%)               347               343          -4 (-1.15%)
test_l4lb.bpf.linked1.o                  balancer_ingress                   4559             2611     -1948 (-42.73%)               118               105        -13 (-11.02%)
test_l4lb_noinline.bpf.linked1.o         balancer_ingress                   6279             6268        -11 (-0.18%)               237               236          -1 (-0.42%)
test_misc_tcp_hdr_options.bpf.linked1.o  misc_estab                         1307             1303         -4 (-0.31%)               100                99          -1 (-1.00%)
test_sk_lookup.bpf.linked1.o             ctx_narrow_access                   456              447         -9 (-1.97%)                39                38          -1 (-2.56%)
test_sysctl_loop1.bpf.linked1.o          sysctl_tcp_mem                     1389             1384         -5 (-0.36%)                26                25          -1 (-3.85%)
test_tc_dtime.bpf.linked1.o              egress_fwdns_prio101                518              485        -33 (-6.37%)                51                46          -5 (-9.80%)
test_tc_dtime.bpf.linked1.o              egress_host                         519              468        -51 (-9.83%)                50                44         -6 (-12.00%)
test_tc_dtime.bpf.linked1.o              ingress_fwdns_prio101               842             1000      +158 (+18.76%)                73                88        +15 (+20.55%)
xdp_synproxy_kern.bpf.linked1.o          syncookie_tc                     405757           373173     -32584 (-8.03%)             25735             22882      -2853 (-11.09%)
xdp_synproxy_kern.bpf.linked1.o          syncookie_xdp                    479055           371590   -107465 (-22.43%)             29145             22207      -6938 (-23.81%)
---------------------------------------  ----------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Slight regression in test_tc_dtime.bpf.linked1.o/ingress_fwdns_prio101
is left for a follow up, there might be some more precision-related bugs
in existing BPF verifier logic.

CILIUM RESULTS
==============

$ ./veristat -C -e file,prog,insns,states ~/subprog-precise-results-cilium.csv ~/imprecise-early-results-cilium.csv | grep -v '+0'
File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     cil_from_host                               762              556      -206 (-27.03%)                43                37         -6 (-13.95%)
bpf_host.o     tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
bpf_host.o     tail_nodeport_nat_egress_ipv4             33592            33566        -26 (-0.08%)              2163              2161          -2 (-0.09%)
bpf_lxc.o      tail_handle_nat_fwd_ipv4                  23541            23426       -115 (-0.49%)              1538              1537          -1 (-0.07%)
bpf_overlay.o  tail_nodeport_nat_egress_ipv4             33581            33543        -38 (-0.11%)              2160              2157          -3 (-0.14%)
bpf_xdp.o      tail_handle_nat_fwd_ipv4                  21659            20920       -739 (-3.41%)              1440              1376         -64 (-4.44%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6                  17084            17039        -45 (-0.26%)               907               905          -2 (-0.22%)
bpf_xdp.o      tail_lb_ipv4                              73442            73430        -12 (-0.02%)              4370              4369          -1 (-0.02%)
bpf_xdp.o      tail_lb_ipv6                             152114           151895       -219 (-0.14%)              6493              6479         -14 (-0.22%)
bpf_xdp.o      tail_nodeport_nat_egress_ipv4             17377            17200       -177 (-1.02%)              1125              1111         -14 (-1.24%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6             6405             6397         -8 (-0.12%)               309               308          -1 (-0.32%)
bpf_xdp.o      tail_rev_nodeport_lb4                      7126             6934       -192 (-2.69%)               414               402         -12 (-2.90%)
bpf_xdp.o      tail_rev_nodeport_lb6                     18059            17905       -154 (-0.85%)              1105              1096          -9 (-0.81%)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-5-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko be2ef81615 bpf: allow precision tracking for programs with subprogs
Stop forcing precise=true for SCALAR registers when BPF program has any
subprograms. Current restriction means that any BPF program, as soon as
it uses subprograms, will end up not getting any of the precision
tracking benefits in reduction of number of verified states.

This patch keeps the fallback mark_all_scalars_precise() behavior if
precise marking has to cross function frames. E.g., if subprogram
requires R1 (first input arg) to be marked precise, ideally we'd need to
backtrack to the parent function and keep marking R1 and its
dependencies as precise. But right now we give up and force all the
SCALARs in any of the current and parent states to be forced to
precise=true. We can lift that restriction in the future.

But this patch fixes two issues identified when trying to enable
precision tracking for subprogs.

First, prevent "escaping" from top-most state in a global subprog. While
with entry-level BPF program we never end up requesting precision for
R1-R5 registers, because R2-R5 are not initialized (and so not readable
in correct BPF program), and R1 is PTR_TO_CTX, not SCALAR, and so is
implicitly precise. With global subprogs, though, it's different, as
global subprog a) can have up to 5 SCALAR input arguments, which might
get marked as precise=true and b) it is validated in isolation from its
main entry BPF program. b) means that we can end up exhausting parent
state chain and still not mark all registers in reg_mask as precise,
which would lead to verifier bug warning.

To handle that, we need to consider two cases. First, if the very first
state is not immediately "checkpointed" (i.e., stored in state lookup
hashtable), it will get correct first_insn_idx and last_insn_idx
instruction set during state checkpointing. As such, this case is
already handled and __mark_chain_precision() already handles that by
just doing nothing when we reach to the very first parent state.
st->parent will be NULL and we'll just stop. Perhaps some extra check
for reg_mask and stack_mask is due here, but this patch doesn't address
that issue.

More problematic second case is when global function's initial state is
immediately checkpointed before we manage to process the very first
instruction. This is happening because when there is a call to global
subprog from the main program the very first subprog's instruction is
marked as pruning point, so before we manage to process first
instruction we have to check and checkpoint state. This patch adds
a special handling for such "empty" state, which is identified by having
st->last_insn_idx set to -1. In such case, we check that we are indeed
validating global subprog, and with some sanity checking we mark input
args as precise if requested.

Note that we also initialize state->first_insn_idx with correct start
insn_idx offset. For main program zero is correct value, but for any
subprog it's quite confusing to not have first_insn_idx set. This
doesn't have any functional impact, but helps with debugging and state
printing. We also explicitly initialize state->last_insns_idx instead of
relying on is_state_visited() to do this with env->prev_insns_idx, which
will be -1 on the very first instruction. This concludes necessary
changes to handle specifically global subprog's precision tracking.

Second identified problem was missed handling of BPF helper functions
that call into subprogs (e.g., bpf_loop and few others). From precision
tracking and backtracking logic's standpoint those are effectively calls
into subprogs and should be called as BPF_PSEUDO_CALL calls.

This patch takes the least intrusive way and just checks against a short
list of current BPF helpers that do call subprogs, encapsulated in
is_callback_calling_function() function. But to prevent accidentally
forgetting to add new BPF helpers to this "list", we also do a sanity
check in __check_func_call, which has to be called for each such special
BPF helper, to validate that BPF helper is indeed recognized as
callback-calling one. This should catch any missed checks in the future.
Adding some special flags to be added in function proto definitions
seemed like an overkill in this case.

With the above changes, it's possible to remove forceful setting of
reg->precise to true in __mark_reg_unknown, which turns on precision
tracking both inside subprogs and entry progs that have subprogs. No
warnings or errors were detected across all the selftests, but also when
validating with veristat against internal Meta BPF objects and Cilium
objects. Further, in some BPF programs there are noticeable reduction in
number of states and instructions validated due to more effective
precision tracking, especially benefiting syncookie test.

$ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/subprog-precise-results.csv  | grep -v '+0'
File                                      Program                     Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
pyperf600_bpf_loop.bpf.linked1.o          on_event                               3966             3678       -288 (-7.26%)               306               276         -30 (-9.80%)
pyperf_global.bpf.linked1.o               on_event                               7563             7530        -33 (-0.44%)               520               517          -3 (-0.58%)
pyperf_subprogs.bpf.linked1.o             on_event                              36358            36934       +576 (+1.58%)              2499              2531         +32 (+1.28%)
setget_sockopt.bpf.linked1.o              skops_sockopt                          3965             4038        +73 (+1.84%)               343               347          +4 (+1.17%)
test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect                          64965            64901        -64 (-0.10%)              4619              4612          -7 (-0.15%)
test_misc_tcp_hdr_options.bpf.linked1.o   misc_estab                             1491             1307      -184 (-12.34%)               110               100         -10 (-9.09%)
test_pkt_access.bpf.linked1.o             test_pkt_access                         354              349         -5 (-1.41%)                25                24          -1 (-4.00%)
test_sock_fields.bpf.linked1.o            egress_read_sock_fields                 435              375       -60 (-13.79%)                22                20          -2 (-9.09%)
test_sysctl_loop2.bpf.linked1.o           sysctl_tcp_mem                         1508             1501         -7 (-0.46%)                29                28          -1 (-3.45%)
test_tc_dtime.bpf.linked1.o               egress_fwdns_prio100                    468              435        -33 (-7.05%)                45                41          -4 (-8.89%)
test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio100                   398              408        +10 (+2.51%)                42                39          -3 (-7.14%)
test_tc_dtime.bpf.linked1.o               ingress_fwdns_prio101                  1096              842      -254 (-23.18%)                97                73        -24 (-24.74%)
test_tcp_hdr_options.bpf.linked1.o        estab                                  2758             2408      -350 (-12.69%)               208               181        -27 (-12.98%)
test_urandom_usdt.bpf.linked1.o           urand_read_with_sema                    466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urand_read_without_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urandlib_read_with_sema                 466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_urandom_usdt.bpf.linked1.o           urandlib_read_without_sema              466              448        -18 (-3.86%)                31                28          -3 (-9.68%)
test_xdp_noinline.bpf.linked1.o           balancer_ingress_v6                    4302             4294         -8 (-0.19%)               257               256          -1 (-0.39%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_tc                         583722           405757   -177965 (-30.49%)             35846             25735     -10111 (-28.21%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp                        609123           479055   -130068 (-21.35%)             35452             29145      -6307 (-17.79%)
----------------------------------------  --------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko 529409ea92 bpf: propagate precision across all frames, not just the last one
When equivalent completed state is found and it has additional precision
restrictions, BPF verifier propagates precision to
currently-being-verified state chain (i.e., including parent states) so
that if some of the states in the chain are not yet completed, necessary
precision restrictions are enforced.

Unfortunately, right now this happens only for the last frame (deepest
active subprogram's frame), not all the frames. This can lead to
incorrect matching of states due to missing precision marker. Currently
this doesn't seem possible as BPF verifier forces everything to precise
when validated BPF program has any subprograms. But with the next patch
lifting this restriction, this becomes problematic.

In fact, without this fix, we'll start getting failure in one of the
existing test_verifier test cases:

  #906/p precise: cross frame pruning FAIL
  Unexpected success to load!
  verification time 48 usec
  stack depth 0+0
  processed 26 insns (limit 1000000) max_states_per_insn 3 total_states 17 peak_states 17 mark_read 8

This patch adds precision propagation across all frames.

Fixes: a3ce685dd0 ("bpf: fix precision tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Andrii Nakryiko a3b666bfa9 bpf: propagate precision in ALU/ALU64 operations
When processing ALU/ALU64 operations (apart from BPF_MOV, which is
handled correctly already; and BPF_NEG and BPF_END are special and don't
have source register), if destination register is already marked
precise, this causes problem with potentially missing precision tracking
for the source register. E.g., when we have r1 >>= r5 and r1 is marked
precise, but r5 isn't, this will lead to r5 staying as imprecise. This
is due to the precision backtracking logic stopping early when it sees
r1 is already marked precise. If r1 wasn't precise, we'd keep
backtracking and would add r5 to the set of registers that need to be
marked precise. So there is a discrepancy here which can lead to invalid
and incompatible states matched due to lack of precision marking on r5.
If r1 wasn't precise, precision backtracking would correctly mark both
r1 and r5 as precise.

This is simple to fix, though. During the forward instruction simulation
pass, for arithmetic operations of `scalar <op>= scalar` form (where
<op> is ALU or ALU64 operations), if destination register is already
precise, mark source register as precise. This applies only when both
involved registers are SCALARs. `ptr += scalar` and `scalar += ptr`
cases are already handled correctly.

This does have (negative) effect on some selftest programs and few
Cilium programs.  ~/baseline-tmp-results.csv are veristat results with
this patch, while ~/baseline-results.csv is without it. See post
scriptum for instructions on how to make Cilium programs testable with
veristat. Correctness has a price.

$ ./veristat -C -e file,prog,insns,states ~/baseline-results.csv ~/baseline-tmp-results.csv | grep -v '+0'
File                     Program               Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_cubic.bpf.linked1.o  bpf_cubic_cong_avoid              997             1700      +703 (+70.51%)                62                90        +28 (+45.16%)
test_l4lb.bpf.linked1.o  balancer_ingress                 4559             5469      +910 (+19.96%)               118               126          +8 (+6.78%)
-----------------------  --------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

$ ./veristat -C -e file,prog,verdict,insns,states ~/baseline-results-cilium.csv ~/baseline-tmp-results-cilium.csv | grep -v '+0'
File           Program                         Total insns (A)  Total insns (B)  Total insns (DIFF)  Total states (A)  Total states (B)  Total states (DIFF)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------
bpf_host.o     tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_host.o     tail_nodeport_nat_ipv6_egress              3396             3446        +50 (+1.47%)               201               203          +2 (+1.00%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6             4448             5261      +813 (+18.28%)               234               247         +13 (+5.56%)
bpf_xdp.o      tail_lb_ipv4                              71736            73442      +1706 (+2.38%)              4295              4370         +75 (+1.75%)
-------------  ------------------------------  ---------------  ---------------  ------------------  ----------------  ----------------  -------------------

P.S. To make Cilium ([0]) programs libbpf-compatible and thus
veristat-loadable, apply changes from topmost commit in [1], which does
minimal changes to Cilium source code, mostly around SEC() annotations
and BPF map definitions.

  [0] https://github.com/cilium/cilium/
  [1] https://github.com/anakryiko/cilium/commits/libbpf-friendliness

Fixes: b5dc0163d8 ("bpf: precise scalar_value tracking")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221104163649.121784-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-04 11:51:45 -07:00
Kumar Kartikeya Dwivedi f71b2f6417 bpf: Refactor map->off_arr handling
Refactor map->off_arr handling into generic functions that can work on
their own without hardcoding map specific code. The btf_fields_offs
structure is now returned from btf_parse_field_offs, which can be reused
later for types in program BTF.

All functions like copy_map_value, zero_map_value call generic
underlying functions so that they can also be reused later for copying
to values allocated in programs which encode specific fields.

Later, some helper functions will also require access to this
btf_field_offs structure to be able to skip over special fields at
runtime.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-9-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 23:09:40 -07:00
Kumar Kartikeya Dwivedi db55911782 bpf: Consolidate spin_lock, timer management into btf_record
Now that kptr_off_tab has been refactored into btf_record, and can hold
more than one specific field type, accomodate bpf_spin_lock and
bpf_timer as well.

While they don't require any more metadata than offset, having all
special fields in one place allows us to share the same code for
allocated user defined types and handle both map values and these
allocated objects in a similar fashion.

As an optimization, we still keep spin_lock_off and timer_off offsets in
the btf_record structure, just to avoid having to find the btf_field
struct each time their offset is needed. This is mostly needed to
manipulate such objects in a map value at runtime. It's ok to hardcode
just one offset as more than one field is disallowed.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-8-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 22:19:40 -07:00
Kumar Kartikeya Dwivedi aa3496accc bpf: Refactor kptr_off_tab into btf_record
To prepare the BPF verifier to handle special fields in both map values
and program allocated types coming from program BTF, we need to refactor
the kptr_off_tab handling code into something more generic and reusable
across both cases to avoid code duplication.

Later patches also require passing this data to helpers at runtime, so
that they can work on user defined types, initialize them, destruct
them, etc.

The main observation is that both map values and such allocated types
point to a type in program BTF, hence they can be handled similarly. We
can prepare a field metadata table for both cases and store them in
struct bpf_map or struct btf depending on the use case.

Hence, refactor the code into generic btf_record and btf_field member
structs. The btf_record represents the fields of a specific btf_type in
user BTF. The cnt indicates the number of special fields we successfully
recognized, and field_mask is a bitmask of fields that were found, to
enable quick determination of availability of a certain field.

Subsequently, refactor the rest of the code to work with these generic
types, remove assumptions about kptr and kptr_off_tab, rename variables
to more meaningful names, etc.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-7-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 21:44:53 -07:00
Kumar Kartikeya Dwivedi a28ace782e bpf: Drop reg_type_may_be_refcounted_or_null
It is not scalable to maintain a list of types that can have non-zero
ref_obj_id. It is never set for scalars anyway, so just remove the
conditional on register types and print it whenever it is non-zero.

Acked-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-6-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi f5e477a861 bpf: Fix slot type check in check_stack_write_var_off
For the case where allow_ptr_leaks is false, code is checking whether
slot type is STACK_INVALID and STACK_SPILL and rejecting other cases.
This is a consequence of incorrectly checking for register type instead
of the slot type (NOT_INIT and SCALAR_VALUE respectively). Fix the
check.

Fixes: 01f810ace9 ("bpf: Allow variable-offset stack access")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-5-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi 261f4664ca bpf: Clobber stack slot when writing over spilled PTR_TO_BTF_ID
When support was added for spilled PTR_TO_BTF_ID to be accessed by
helper memory access, the stack slot was not overwritten to STACK_MISC
(and that too is only safe when env->allow_ptr_leaks is true).

This means that helpers who take ARG_PTR_TO_MEM and write to it may
essentially overwrite the value while the verifier continues to track
the slot for spilled register.

This can cause issues when PTR_TO_BTF_ID is spilled to stack, and then
overwritten by helper write access, which can then be passed to BPF
helpers or kfuncs.

Handle this by falling back to the case introduced in a later commit,
which will also handle PTR_TO_BTF_ID along with other pointer types,
i.e. cd17d38f8b ("bpf: Permits pointers on stack for helper calls").

Finally, include a comment on why REG_LIVE_WRITTEN is not being set when
clobber is set to true. In short, the reason is that while when clobber
is unset, we know that we won't be writing, when it is true, we *may*
write to any of the stack slots in that range. It may be a partial or
complete write, to just one or many stack slots.

We cannot be sure, hence to be conservative, we leave things as is and
never set REG_LIVE_WRITTEN for any stack slot. However, clobber still
needs to reset them to STACK_MISC assuming writes happened. However read
marks still need to be propagated upwards from liveness point of view,
as parent stack slot's contents may still continue to matter to child
states.

Cc: Yonghong Song <yhs@meta.com>
Fixes: 1d68f22b3d ("bpf: Handle spilled PTR_TO_BTF_ID properly when checking stack_boundary")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:18 -07:00
Kumar Kartikeya Dwivedi 23da464dd6 bpf: Allow specifying volatile type modifier for kptrs
This is useful in particular to mark the pointer as volatile, so that
compiler treats each load and store to the field as a volatile access.
The alternative is having to define and use READ_ONCE and WRITE_ONCE in
the BPF program.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/r/20221103191013.1236066-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-11-03 19:31:17 -07:00
Youlin Li f1db20814a bpf: Fix wrong reg type conversion in release_reference()
Some helper functions will allocate memory. To avoid memory leaks, the
verifier requires the eBPF program to release these memories by calling
the corresponding helper functions.

When a resource is released, all pointer registers corresponding to the
resource should be invalidated. The verifier use release_references() to
do this job, by apply  __mark_reg_unknown() to each relevant register.

It will give these registers the type of SCALAR_VALUE. A register that
will contain a pointer value at runtime, but of type SCALAR_VALUE, which
may allow the unprivileged user to get a kernel pointer by storing this
register into a map.

Using __mark_reg_not_init() while NOT allow_ptr_leaks can mitigate this
problem.

Fixes: fd978bf7fd ("bpf: Add reference tracking to verifier")
Signed-off-by: Youlin Li <liulin063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221103093440.3161-1-liulin063@gmail.com
2022-11-04 00:24:12 +01:00
Jakub Kicinski b54a0d4094 bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY2GuKgAKCRDbK58LschI
 gy32AP9PI0e/bUGDExKJ8g97PeeEtnpj4TTI6g+XKILtYnyXlgD/Rk4j2D/f3IBF
 Ha9TmqYvAUim+U/g50vUrNuoNLNJ5w8=
 =OKC1
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
bpf-next 2022-11-02

We've added 70 non-merge commits during the last 14 day(s) which contain
a total of 96 files changed, 3203 insertions(+), 640 deletions(-).

The main changes are:

1) Make cgroup local storage available to non-cgroup attached BPF programs
   such as tc BPF ones, from Yonghong Song.

2) Avoid unnecessary deadlock detection and failures wrt BPF task storage
   helpers, from Martin KaFai Lau.

3) Add LLVM disassembler as default library for dumping JITed code
   in bpftool, from Quentin Monnet.

4) Various kprobe_multi_link fixes related to kernel modules,
   from Jiri Olsa.

5) Optimize x86-64 JIT with emitting BMI2-based shift instructions,
   from Jie Meng.

6) Improve BPF verifier's memory type compatibility for map key/value
   arguments, from Dave Marchevsky.

7) Only create mmap-able data section maps in libbpf when data is exposed
   via skeletons, from Andrii Nakryiko.

8) Add an autoattach option for bpftool to load all object assets,
   from Wang Yufen.

9) Various memory handling fixes for libbpf and BPF selftests,
   from Xu Kuohai.

10) Initial support for BPF selftest's vmtest.sh on arm64,
    from Manu Bretelle.

11) Improve libbpf's BTF handling to dedup identical structs,
    from Alan Maguire.

12) Add BPF CI and denylist documentation for BPF selftests,
    from Daniel Müller.

13) Check BPF cpumap max_entries before doing allocation work,
    from Florian Lehner.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits)
  samples/bpf: Fix typo in README
  bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
  bpf: check max_entries before allocating memory
  bpf: Fix a typo in comment for DFS algorithm
  bpftool: Fix spelling mistake "disasembler" -> "disassembler"
  selftests/bpf: Fix bpftool synctypes checking failure
  selftests/bpf: Panic on hard/soft lockup
  docs/bpf: Add documentation for new cgroup local storage
  selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x
  selftests/bpf: Add selftests for new cgroup local storage
  selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str
  bpftool: Support new cgroup local storage
  libbpf: Support new cgroup local storage
  bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  bpf: Refactor some inode/task/sk storage functions for reuse
  bpf: Make struct cgroup btf id global
  selftests/bpf: Tracing prog can still do lookup under busy lock
  selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection
  bpf: Add new bpf_task_storage_delete proto with no deadlock detection
  bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
  ...
====================

Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-02 08:18:27 -07:00
Kees Cook 42378a9ca5 bpf, verifier: Fix memory leak in array reallocation for stack state
If an error (NULL) is returned by krealloc(), callers of realloc_array()
were setting their allocation pointers to NULL, but on error krealloc()
does not touch the original allocation. This would result in a memory
resource leak. Instead, free the old allocation on the error handling
path.

The memory leak information is as follows as also reported by Zhengchao:

  unreferenced object 0xffff888019801800 (size 256):
  comm "bpf_repo", pid 6490, jiffies 4294959200 (age 17.170s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000b211474b>] __kmalloc_node_track_caller+0x45/0xc0
    [<0000000086712a0b>] krealloc+0x83/0xd0
    [<00000000139aab02>] realloc_array+0x82/0xe2
    [<00000000b1ca41d1>] grow_stack_state+0xfb/0x186
    [<00000000cd6f36d2>] check_mem_access.cold+0x141/0x1341
    [<0000000081780455>] do_check_common+0x5358/0xb350
    [<0000000015f6b091>] bpf_check.cold+0xc3/0x29d
    [<000000002973c690>] bpf_prog_load+0x13db/0x2240
    [<00000000028d1644>] __sys_bpf+0x1605/0x4ce0
    [<00000000053f29bd>] __x64_sys_bpf+0x75/0xb0
    [<0000000056fedaf5>] do_syscall_64+0x35/0x80
    [<000000002bd58261>] entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: c69431aab6 ("bpf: verifier: Improve function state reallocation")
Reported-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reported-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Bill Wendling <morbo@google.com>
Cc: Lorenz Bauer <oss@lmb.io>
Link: https://lore.kernel.org/bpf/20221029025433.2533810-1-keescook@chromium.org
2022-11-01 14:29:16 +01:00
Thomas Gleixner 97c4090bad bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
Now that the 32bit UP oddity is gone and 32bit uses always a sequence
count, there is no need for the fetch_irq() variants anymore.

Convert to the regular interface.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/bpf/20221026123110.331690-1-bigeasy@linutronix.de
2022-10-31 23:40:24 +01:00
Florian Lehner e39e739ab5 bpf: check max_entries before allocating memory
For maps of type BPF_MAP_TYPE_CPUMAP memory is allocated first before
checking the max_entries argument. If then max_entries is greater than
NR_CPUS additional work needs to be done to free allocated memory before
an error is returned.
This changes moves the check on max_entries before the allocation
happens.

Signed-off-by: Florian Lehner <dev@der-flo.net>
Link: https://lore.kernel.org/r/20221028183405.59554-1-dev@der-flo.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-10-28 15:45:58 -07:00
Xu Kuohai b6d207999c bpf: Fix a typo in comment for DFS algorithm
There is a typo in comment for DFS algorithm in bpf/verifier.c. The top
element should not be popped until all its neighbors have been checked.
Fix it.

Fixes: 475fb78fbf ("bpf: verifier (add branch/goto checks)")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221027034458.2925218-1-xukuohai@huaweicloud.com
2022-10-27 13:10:32 -07:00
Yonghong Song c4bcfb38a9 bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
Similar to sk/inode/task storage, implement similar cgroup local storage.

There already exists a local storage implementation for cgroup-attached
bpf programs.  See map type BPF_MAP_TYPE_CGROUP_STORAGE and helper
bpf_get_local_storage(). But there are use cases such that non-cgroup
attached bpf progs wants to access cgroup local storage data. For example,
tc egress prog has access to sk and cgroup. It is possible to use
sk local storage to emulate cgroup local storage by storing data in socket.
But this is a waste as it could be lots of sockets belonging to a particular
cgroup. Alternatively, a separate map can be created with cgroup id as the key.
But this will introduce additional overhead to manipulate the new map.
A cgroup local storage, similar to existing sk/inode/task storage,
should help for this use case.

The life-cycle of storage is managed with the life-cycle of the
cgroup struct.  i.e. the storage is destroyed along with the owning cgroup
with a call to bpf_cgrp_storage_free() when cgroup itself
is deleted.

The userspace map operations can be done by using a cgroup fd as a key
passed to the lookup, update and delete operations.

Typically, the following code is used to get the current cgroup:
    struct task_struct *task = bpf_get_current_task_btf();
    ... task->cgroups->dfl_cgrp ...
and in structure task_struct definition:
    struct task_struct {
        ....
        struct css_set __rcu            *cgroups;
        ....
    }
With sleepable program, accessing task->cgroups is not protected by rcu_read_lock.
So the current implementation only supports non-sleepable program and supporting
sleepable program will be the next step together with adding rcu_read_lock
protection for rcu tagged structures.

Since map name BPF_MAP_TYPE_CGROUP_STORAGE has been used for old cgroup local
storage support, the new map name BPF_MAP_TYPE_CGRP_STORAGE is used
for cgroup storage available to non-cgroup-attached bpf programs. The old
cgroup storage supports bpf_get_local_storage() helper to get the cgroup data.
The new cgroup storage helper bpf_cgrp_storage_get() can provide similar
functionality. While old cgroup storage pre-allocates storage memory, the new
mechanism can also pre-allocate with a user space bpf_map_update_elem() call
to avoid potential run-time memory allocation failure.
Therefore, the new cgroup storage can provide all functionality w.r.t.
the old one. So in uapi bpf.h, the old BPF_MAP_TYPE_CGROUP_STORAGE is alias to
BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED to indicate the old cgroup storage can
be deprecated since the new one can provide the same functionality.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042850.673791-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Yonghong Song c83597fa5d bpf: Refactor some inode/task/sk storage functions for reuse
Refactor codes so that inode/task/sk storage implementation
can maximally share the same code. I also added some comments
in new function bpf_local_storage_unlink_nolock() to make
codes easy to understand. There is no functionality change.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042845.672944-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Yonghong Song 5e67b8ef12 bpf: Make struct cgroup btf id global
Make struct cgroup btf id global so later patch can reuse
the same btf id.

Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221026042840.672602-1-yhs@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:19:19 -07:00
Martin KaFai Lau 8a7dac37f2 bpf: Add new bpf_task_storage_delete proto with no deadlock detection
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_delete_elem()
which is called from the syscall map_delete_elem.  It does not need
deadlock detection.  Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_delete() helper.

This patch adds bpf_task_storage_delete proto that does not do deadlock
detection.  It will be used by bpf_lsm and bpf_iter program.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau fda64ae0bb bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
Similar to the earlier change in bpf_task_storage_get_recur.
This patch changes bpf_task_storage_delete_recur such that it
does the lookup first.  It only returns -EBUSY if it needs to
take the spinlock to do the deletion when potential deadlock
is detected.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau 4279adb094 bpf: Add new bpf_task_storage_get proto with no deadlock detection
The bpf_lsm and bpf_iter do not recur that will cause a deadlock.
The situation is similar to the bpf_pid_task_storage_lookup_elem()
which is called from the syscall map_lookup_elem.  It does not need
deadlock detection.  Otherwise, it will cause unnecessary failure
when calling the bpf_task_storage_get() helper.

This patch adds bpf_task_storage_get proto that does not do deadlock
detection.  It will be used by bpf_lsm and bpf_iter programs.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau e8b02296a6 bpf: Avoid taking spinlock in bpf_task_storage_get if potential deadlock is detected
bpf_task_storage_get() does a lookup and optionally inserts
new data if BPF_LOCAL_STORAGE_GET_F_CREATE is present.

During lookup, it will cache the lookup result and caching requires to
acquire a spinlock.  When potential deadlock is detected (by the
bpf_task_storage_busy pcpu-counter added in
commit bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")),
the current behavior is returning NULL immediately to avoid deadlock.  It is
too pessimistic.  This patch will go ahead to do a lookup (which is a
lockless operation) but it will avoid caching it in order to avoid
acquiring the spinlock.

When lookup fails to find the data and BPF_LOCAL_STORAGE_GET_F_CREATE
is set, an insertion is needed and this requires acquiring a spinlock.
This patch will still return NULL when a potential deadlock is detected.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau 6d65500c34 bpf: Refactor the core bpf_task_storage_get logic into a new function
This patch creates a new function __bpf_task_storage_get() and
moves the core logic of the existing bpf_task_storage_get()
into this new function.   This new function will be shared
by another new helper proto in the latter patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau 0593dd34e5 bpf: Append _recur naming to the bpf_task_storage helper proto
This patch adds the "_recur" naming to the bpf_task_storage_{get,delete}
proto.  In a latter patch, they will only be used by the tracing
programs that requires a deadlock detection because a tracing
prog may use bpf_task_storage_{get,delete} recursively and cause a
deadlock.

Another following patch will add a different helper proto for the non
tracing programs because they do not need the deadlock prevention.
This patch does this rename to prepare for this future proto
additions.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Martin KaFai Lau 271de525e1 bpf: Remove prog->active check for bpf_lsm and bpf_iter
The commit 64696c40d0 ("bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline")
removed prog->active check for struct_ops prog.  The bpf_lsm
and bpf_iter is also using trampoline.  Like struct_ops, the bpf_lsm
and bpf_iter have fixed hooks for the prog to attach.  The
kernel does not call the same hook in a recursive way.
This patch also removes the prog->active check for
bpf_lsm and bpf_iter.

A later patch has a test to reproduce the recursion issue
for a sleepable bpf_lsm program.

This patch appends the '_recur' naming to the existing
enter and exit functions that track the prog->active counter.
New __bpf_prog_{enter,exit}[_sleepable] function are
added to skip the prog->active tracking. The '_struct_ops'
version is also removed.

It also moves the decision on picking the enter and exit function to
the new bpf_trampoline_{enter,exit}().  It returns the '_recur' ones
for all tracing progs to use.  For bpf_lsm, bpf_iter,
struct_ops (no prog->active tracking after 64696c40d0), and
bpf_lsm_cgroup (no prog->active tracking after 69fd337a97),
it will return the functions that don't track the prog->active.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20221025184524.3526117-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-25 23:11:46 -07:00
Jakub Kicinski 96917bb3a3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/linux/net.h
  a5ef058dc4 ("net: introduce and use custom sockopt socket flag")
  e993ffe3da ("net: flag sockets supporting msghdr originated zerocopy")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 13:44:11 -07:00
Jakub Kicinski e28c44450b bpf-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmNVkYkACgkQ6rmadz2v
 bTqzHw/+NYMwfLm5Ck+BK0+HiYU5VVLoG4jp8G7B3sJL/6nUDduajzoqa+nM19Xl
 +HEjbMza7CizmhkCRkzIs1VVtx8mtvKdTxbhvm77SU2+GBn+X1es+XhtFd4EOpok
 MINNHs+cOC/HlnPD/QbFgvxKiKkjyjWxInjUp6Y/mLMcKCn7l9KOkc07/la9Dj3j
 RI0gXCywq1pJaPuTCnt0/wcYLJvzn6QsZnKmmksQwt59GQqOd11HWid3rBWZhDp6
 beEoHDIMGHROtu60vm4DB0p4l6tauZfeXyPCeu3Tx5ZSsypJIyU1iTdKiIUjG963
 ilpy55nrX9bWxadB7LIKHyYfW3in4o+D1ZZaUvLIato/69CZJZ7Uc4kU1RF4Ay1F
 Df1Fmal2WeNAxxETPmQPvVeCePvQvwLHl4KNogdZZvd/67cyc1cDhnuTJp37iPak
 FALHaaw0VOrTdTvxsWym7yEbkhPbCHpPrKYFZFHgGrRTFk/GM2k38mM07lcLxFGw
 aKyooS+eoIZMEgtK5Hma2wpeIVSlkJiJk1d0K20OxdnIUyYEsMXmI+uV1gMxq/8z
 EHNi0+296xOoxy22I1Bd5Tu7fIeecHFN44q7YFmpGsB54UNLpFsP0vYUmYT/6hLI
 Y0KVZu4c3oQDX7ttifMvkeOCURDJBPrZx37bpNpNXF55fB5ehNk=
 =eV7W
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2022-10-23

We've added 7 non-merge commits during the last 18 day(s) which contain
a total of 8 files changed, 69 insertions(+), 5 deletions(-).

The main changes are:

1) Wait for busy refill_work when destroying bpf memory allocator, from Hou.

2) Allow bpf_user_ringbuf_drain() callbacks to return 1, from David.

3) Fix dispatcher patchable function entry to 5 bytes nop, from Jiri.

4) Prevent decl_tag from being referenced in func_proto, from Stanislav.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use __llist_del_all() whenever possbile during memory draining
  bpf: Wait for busy refill_work when destroying bpf memory allocator
  bpf: Fix dispatcher patchable function entry to 5 bytes nop
  bpf: prevent decl_tag from being referenced in func_proto
  selftests/bpf: Add reproducer for decl_tag in func_proto return type
  selftests/bpf: Make bpf_user_ringbuf_drain() selftest callback return 1
  bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
====================

Link: https://lore.kernel.org/r/20221023192244.81137-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24 10:32:01 -07:00
Dave Marchevsky d167330409 bpf: Consider all mem_types compatible for map_{key,value} args
After the previous patch, which added PTR_TO_MEM | MEM_ALLOC type
map_key_value_types, the only difference between map_key_value_types and
mem_types sets is PTR_TO_BUF and PTR_TO_MEM, which are in the latter set
but not the former.

Helpers which expect ARG_PTR_TO_MAP_KEY or ARG_PTR_TO_MAP_VALUE
already effectively expect a valid blob of arbitrary memory that isn't
necessarily explicitly associated with a map. When validating a
PTR_TO_MAP_{KEY,VALUE} arg, the verifier expects meta->map_ptr to have
already been set, either by an earlier ARG_CONST_MAP_PTR arg, or custom
logic like that in process_timer_func or process_kptr_func.

So let's get rid of map_key_value_types and just use mem_types for those
args.

This has the effect of adding PTR_TO_BUF and PTR_TO_MEM to the set of
compatible types for ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE.

PTR_TO_BUF is used by various bpf_iter implementations to represent a
chunk of valid r/w memory in ctx args for iter prog.

PTR_TO_MEM is used by networking, tracing, and ringbuf helpers to
represent a chunk of valid memory. The PTR_TO_MEM | MEM_ALLOC
type added in previous commit is specific to ringbuf helpers.
Presence or absence of MEM_ALLOC doesn't change the validity of using
PTR_TO_MEM as a map_{key,val} input.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221020160721.4030492-2-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:23:28 -07:00
Dave Marchevsky 9ef40974a8 bpf: Allow ringbuf memory to be used as map key
This patch adds support for the following pattern:

  struct some_data *data = bpf_ringbuf_reserve(&ringbuf, sizeof(struct some_data, 0));
  if (!data)
    return;
  bpf_map_lookup_elem(&another_map, &data->some_field);
  bpf_ringbuf_submit(data);

Currently the verifier does not consider bpf_ringbuf_reserve's
PTR_TO_MEM | MEM_ALLOC ret type a valid key input to bpf_map_lookup_elem.
Since PTR_TO_MEM is by definition a valid region of memory, it is safe
to use it as a key for lookups.

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20221020160721.4030492-1-davemarchevsky@fb.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:22:42 -07:00
Hou Tao fa4447cb73 bpf: Use __llist_del_all() whenever possbile during memory draining
Except for waiting_for_gp list, there are no concurrent operations on
free_by_rcu, free_llist and free_llist_extra lists, so use
__llist_del_all() instead of llist_del_all(). waiting_for_gp list can be
deleted by RCU callback concurrently, so still use llist_del_all().

Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:17:38 -07:00
Hou Tao 3d05818707 bpf: Wait for busy refill_work when destroying bpf memory allocator
A busy irq work is an unfinished irq work and it can be either in the
pending state or in the running state. When destroying bpf memory
allocator, refill_work may be busy for PREEMPT_RT kernel in which irq
work is invoked in a per-CPU RT-kthread. It is also possible for kernel
with arch_irq_work_has_interrupt() being false (e.g. 1-cpu arm32 host or
mips) and irq work is inovked in timer interrupt.

The busy refill_work leads to various issues. The obvious one is that
there will be concurrent operations on free_by_rcu and free_list between
irq work and memory draining. Another one is call_rcu_in_progress will
not be reliable for the checking of pending RCU callback because
do_call_rcu() may have not been invoked by irq work yet. The other is
there will be use-after-free if irq work is freed before the callback
of irq work is invoked as shown below:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor instruction fetch in kernel mode
 #PF: error_code(0x0010) - not-present page
 PGD 12ab94067 P4D 12ab94067 PUD 1796b4067 PMD 0
 Oops: 0010 [#1] PREEMPT_RT SMP
 CPU: 5 PID: 64 Comm: irq_work/5 Not tainted 6.0.0-rt11+ #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 RIP: 0010:0x0
 Code: Unable to access opcode bytes at 0xffffffffffffffd6.
 RSP: 0018:ffffadc080293e78 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffffcdc07fb6a388 RCX: ffffa05000a2e000
 RDX: ffffa05000a2e000 RSI: ffffffff96cc9827 RDI: ffffcdc07fb6a388
 ......
 Call Trace:
  <TASK>
  irq_work_single+0x24/0x60
  irq_work_run_list+0x24/0x30
  run_irq_workd+0x23/0x30
  smpboot_thread_fn+0x203/0x300
  kthread+0x126/0x150
  ret_from_fork+0x1f/0x30
  </TASK>

Considering the ease of concurrency handling, no overhead for
irq_work_sync() under non-PREEMPT_RT kernel and has-irq-work-interrupt
kernel and the short wait time used for irq_work_sync() under PREEMPT_RT
(When running two test_maps on PREEMPT_RT kernel and 72-cpus host, the
max wait time is about 8ms and the 99th percentile is 10us), just using
irq_work_sync() to wait for busy refill_work to complete before memory
draining and memory freeing.

Fixes: 7c8199e24f ("bpf: Introduce any context BPF specific memory allocator.")
Acked-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221021114913.60508-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-21 19:17:38 -07:00
Jiri Olsa dbe69b2998 bpf: Fix dispatcher patchable function entry to 5 bytes nop
The patchable_function_entry(5) might output 5 single nop
instructions (depends on toolchain), which will clash with
bpf_arch_text_poke check for 5 bytes nop instruction.

Adding early init call for dispatcher that checks and change
the patchable entry into expected 5 nop instruction if needed.

There's no need to take text_mutex, because we are using it
in early init call which is called at pre-smp time.

Fixes: ceea991a01 ("bpf: Move bpf_dispatcher function out of ftrace locations")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221018075934.574415-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-20 18:57:51 -07:00
Jakub Kicinski 94adb5e29e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
No conflicts.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-20 17:49:10 -07:00
Jakub Kicinski 3566a79c9e bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCY08JQQAKCRDbK58LschI
 g0M0AQCWGrJcnQFut1qwR9efZUadwxtKGAgpaA/8Smd8+v7c8AD/SeHQuGfkFiD6
 rx18hv1mTfG0HuPnFQy6YZQ98vmznwE=
 =DaeS
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2022-10-18

We've added 33 non-merge commits during the last 14 day(s) which contain
a total of 31 files changed, 874 insertions(+), 538 deletions(-).

The main changes are:

1) Add RCU grace period chaining to BPF to wait for the completion
   of access from both sleepable and non-sleepable BPF programs,
   from Hou Tao & Paul E. McKenney.

2) Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
   values. In the wild we have seen OS vendors doing buggy backports
   where helper call numbers mismatched. This is an attempt to make
   backports more foolproof, from Andrii Nakryiko.

3) Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions,
   from Roberto Sassu.

4) Fix libbpf's BTF dumper for structs with padding-only fields,
   from Eduard Zingerman.

5) Fix various libbpf bugs which have been found from fuzzing with
   malformed BPF object files, from Shung-Hsi Yu.

6) Clean up an unneeded check on existence of SSE2 in BPF x86-64 JIT,
   from Jie Meng.

7) Fix various ASAN bugs in both libbpf and selftests when running
   the BPF selftest suite on arm64, from Xu Kuohai.

8) Fix missing bpf_iter_vma_offset__destroy() call in BPF iter selftest
   and use in-skeleton link pointer to remove an explicit bpf_link__destroy(),
   from Jiri Olsa.

9) Fix BPF CI breakage by pointing to iptables-legacy instead of relying
   on symlinked iptables which got upgraded to iptables-nft,
   from Martin KaFai Lau.

10) Minor BPF selftest improvements all over the place, from various others.

* tag 'for-netdev' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (33 commits)
  bpf/docs: Update README for most recent vmtest.sh
  bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
  bpf: Use rcu_trace_implies_rcu_gp() in local storage map
  bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
  rcu-tasks: Provide rcu_trace_implies_rcu_gp()
  selftests/bpf: Use sys_pidfd_open() helper when possible
  libbpf: Fix null-pointer dereference in find_prog_by_sec_insn()
  libbpf: Deal with section with no data gracefully
  libbpf: Use elf_getshdrnum() instead of e_shnum
  selftest/bpf: Fix error usage of ASSERT_OK in xdp_adjust_tail.c
  selftests/bpf: Fix error failure of case test_xdp_adjust_tail_grow
  selftest/bpf: Fix memory leak in kprobe_multi_test
  selftests/bpf: Fix memory leak caused by not destroying skeleton
  libbpf: Fix memory leak in parse_usdt_arg()
  libbpf: Fix use-after-free in btf_dump_name_dups
  selftests/bpf: S/iptables/iptables-legacy/ in the bpf_nf and xdp_synproxy test
  selftests/bpf: Alphabetize DENYLISTs
  selftests/bpf: Add tests for _opts variants of bpf_*_get_fd_by_id()
  libbpf: Introduce bpf_link_get_fd_by_id_opts()
  libbpf: Introduce bpf_btf_get_fd_by_id_opts()
  ...
====================

Link: https://lore.kernel.org/r/20221018210631.11211-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-18 18:56:43 -07:00
Hou Tao 4835f9ee98 bpf: Use rcu_trace_implies_rcu_gp() for program array freeing
To support both sleepable and normal uprobe bpf program, the freeing of
trace program array chains a RCU-tasks-trace grace period and a normal
RCU grace period one after the other.

With the introduction of rcu_trace_implies_rcu_gp(),
__bpf_prog_array_free_sleepable_cb() can check whether or not a normal
RCU grace period has also passed after a RCU-tasks-trace grace period
has passed. If it is true, it is safe to invoke kfree() directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-5-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Hou Tao d39d1445d3 bpf: Use rcu_trace_implies_rcu_gp() in local storage map
Local storage map is accessible for both sleepable and non-sleepable bpf
program, and its memory is freed by using both call_rcu_tasks_trace() and
kfree_rcu() to wait for both RCU-tasks-trace grace period and RCU grace
period to pass.

With the introduction of rcu_trace_implies_rcu_gp(), both
bpf_selem_free_rcu() and bpf_local_storage_free_rcu() can check whether
or not a normal RCU grace period has also passed after a RCU-tasks-trace
grace period has passed. If it is true, it is safe to call kfree()
directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Hou Tao 59be91e5e7 bpf: Use rcu_trace_implies_rcu_gp() in bpf memory allocator
The memory free logic in bpf memory allocator chains a RCU Tasks Trace
grace period and a normal RCU grace period one after the other, so it
can ensure that both sleepable and non-sleepable programs have finished.

With the introduction of rcu_trace_implies_rcu_gp(),
__free_rcu_tasks_trace() can check whether or not a normal RCU grace
period has also passed after a RCU Tasks Trace grace period has passed.
If it is true, freeing these elements directly, else freeing through
call_rcu().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20221014113946.965131-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-10-18 10:27:02 -07:00
Linus Torvalds bb1a114646 cgroup fixes for v6.1-rc1
* Fix a recent regression where a sleeping kernfs function is called with
   css_set_lock (spinlock) held.
 
 * Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file().
   Multiple users assume that the lookup only works for cgroup2 and breaks
   when fed a cgroup1 file. Instead, introduce a separate set of functions to
   lookup both v1 and v2 and use them where the user explicitly wants to
   support both versions.
 
 * Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.
 
 * Add Josef Bacik as a blkcg maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCY03MlA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGTkUAQD7fNcSPuc2m/BvW+gySKQkp9PZMA6E6yOIqirc
 QKmIgwEAwWECW7RR1alhOGD50RtYkuYVsLD1+6Ka4eMHe+EhwA4=
 =kGLI
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - Fix a recent regression where a sleeping kernfs function is called
   with css_set_lock (spinlock) held

 - Revert the commit to enable cgroup1 support for cgroup_get_from_fd/file()

   Multiple users assume that the lookup only works for cgroup2 and
   breaks when fed a cgroup1 file. Instead, introduce a separate set of
   functions to lookup both v1 and v2 and use them where the user
   explicitly wants to support both versions.

 - Compat update for tools/perf/util/bpf_skel/bperf_cgroup.bpf.c.

 - Add Josef Bacik as a blkcg maintainer.

* tag 'cgroup-for-6.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  blkcg: Update MAINTAINERS entry
  mm: cgroup: fix comments for get from fd/file helpers
  perf stat: Support old kernels for bperf cgroup counting
  bpf: cgroup_iter: support cgroup1 using cgroup fd
  cgroup: add cgroup_v1v2_get_from_[fd/file]()
  Revert "cgroup: enable cgroup_get_from_file() on cgroup1"
  cgroup: Reorganize css_set_lock and kernfs path processing
2022-10-17 18:52:43 -07:00
Stanislav Fomichev ea68376c8b bpf: prevent decl_tag from being referenced in func_proto
Syzkaller was able to hit the following issue:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 3609 at kernel/bpf/btf.c:1946
btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Modules linked in:
CPU: 0 PID: 3609 Comm: syz-executor361 Not tainted
6.0.0-syzkaller-02734-g0326074ff465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 09/22/2022
RIP: 0010:btf_type_id_size+0x2d5/0x9d0 kernel/bpf/btf.c:1946
Code: ef e8 7f 8e e4 ff 41 83 ff 0b 77 28 f6 44 24 10 18 75 3f e8 6d 91
e4 ff 44 89 fe bf 0e 00 00 00 e8 20 8e e4 ff e8 5b 91 e4 ff <0f> 0b 45
31 f6 e9 98 02 00 00 41 83 ff 12 74 18 e8 46 91 e4 ff 44
RSP: 0018:ffffc90003cefb40 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
RDX: ffff8880259c0000 RSI: ffffffff81968415 RDI: 0000000000000005
RBP: ffff88801270ca00 R08: 0000000000000005 R09: 000000000000000e
R10: 0000000000000011 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000011 R14: ffff888026ee6424 R15: 0000000000000011
FS:  000055555641b300(0000) GS:ffff8880b9a00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000f2e258 CR3: 000000007110e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 btf_func_proto_check kernel/bpf/btf.c:4447 [inline]
 btf_check_all_types kernel/bpf/btf.c:4723 [inline]
 btf_parse_type_sec kernel/bpf/btf.c:4752 [inline]
 btf_parse kernel/bpf/btf.c:5026 [inline]
 btf_new_fd+0x1926/0x1e70 kernel/bpf/btf.c:6892
 bpf_btf_load kernel/bpf/syscall.c:4324 [inline]
 __sys_bpf+0xb7d/0x4cf0 kernel/bpf/syscall.c:5010
 __do_sys_bpf kernel/bpf/syscall.c:5069 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5067 [inline]
 __x64_sys_bpf+0x75/0xb0 kernel/bpf/syscall.c:5067
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f0fbae41c69
Code: 28 c3 e8 2a 14 00 00 66 2e 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffc8aeb6228 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0fbae41c69
RDX: 0000000000000020 RSI: 0000000020000140 RDI: 0000000000000012
RBP: 00007f0fbae05e10 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000ffffffff R11: 0000000000000246 R12: 00007f0fbae05ea0
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

Looks like it tries to create a func_proto which return type is
decl_tag. For the details, see Martin's spot on analysis in [0].

0: https://lore.kernel.org/bpf/CAKH8qBuQDLva_hHxxBuZzyAcYNO4ejhovz6TQeVSk8HY-2SO6g@mail.gmail.com/T/#mea6524b3fcd6298347432226e81b1e6155efc62c

Cc: Yonghong Song <yhs@fb.com>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Fixes: bd16dee66a ("bpf: Add BTF_KIND_DECL_TAG typedef support")
Reported-by: syzbot+d8bd751aef7c6b39a344@syzkaller.appspotmail.com
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/r/20221015002444.2680969-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2022-10-17 10:56:12 -07:00
David Vernet c92a7a5224 bpf: Allow bpf_user_ringbuf_drain() callbacks to return 1
The bpf_user_ringbuf_drain() helper function allows a BPF program to
specify a callback that is invoked when draining entries from a
BPF_MAP_TYPE_USER_RINGBUF ring buffer map. The API is meant to allow the
callback to return 0 if it wants to continue draining samples, and 1 if
it's done draining. Unfortunately, bpf_user_ringbuf_drain() landed shortly
after commit 1bfe26fb08 ("bpf: Add verifier support for custom
callback return range"), which changed the default behavior of callbacks
to only support returning 0.

This patch corrects that oversight by allowing bpf_user_ringbuf_drain()
callbacks to return 0 or 1. A follow-on patch will update the
user_ringbuf selftests to also return 1 from a bpf_user_ringbuf_drain()
callback to prevent this from regressing in the future.

Fixes: 2057156738 ("bpf: Add bpf_user_ringbuf_drain() helper")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20221012232015.1510043-2-void@manifault.com
2022-10-13 08:27:37 -07:00
Jason A. Donenfeld a251c17aa5 treewide: use get_random_u32() when possible
The prandom_u32() function has been a deprecated inline wrapper around
get_random_u32() for several releases now, and compiles down to the
exact same code. Replace the deprecated wrapper with a direct call to
the real function. The same also applies to get_random_int(), which is
just a wrapper around get_random_u32(). This was done as a basic find
and replace.

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
Acked-by: Chuck Lever <chuck.lever@oracle.com> # for nfsd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> # for thunderbolt
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Acked-by: Helge Deller <deller@gmx.de> # for parisc
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:58 -06:00
Jason A. Donenfeld 81895a65ec treewide: use prandom_u32_max() when possible, part 1
Rather than incurring a division or requesting too many random bytes for
the given range, use the prandom_u32_max() function, which only takes
the minimum required bytes from the RNG and avoids divisions. This was
done mechanically with this coccinelle script:

@basic@
expression E;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
typedef u64;
@@
(
- ((T)get_random_u32() % (E))
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ((E) - 1))
+ prandom_u32_max(E * XXX_MAKE_SURE_E_IS_POW2)
|
- ((u64)(E) * get_random_u32() >> 32)
+ prandom_u32_max(E)
|
- ((T)get_random_u32() & ~PAGE_MASK)
+ prandom_u32_max(PAGE_SIZE)
)

@multi_line@
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
identifier RAND;
expression E;
@@

-       RAND = get_random_u32();
        ... when != RAND
-       RAND %= (E);
+       RAND = prandom_u32_max(E);

// Find a potential literal
@literal_mask@
expression LITERAL;
type T;
identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
position p;
@@

        ((T)get_random_u32()@p & (LITERAL))

// Add one to the literal.
@script:python add_one@
literal << literal_mask.LITERAL;
RESULT;
@@

value = None
if literal.startswith('0x'):
        value = int(literal, 16)
elif literal[0] in '123456789':
        value = int(literal, 10)
if value is None:
        print("I don't know how to handle %s" % (literal))
        cocci.include_match(False)
elif value == 2**32 - 1 or value == 2**31 - 1 or value == 2**24 - 1 or value == 2**16 - 1 or value == 2**8 - 1:
        print("Skipping 0x%x for cleanup elsewhere" % (value))
        cocci.include_match(False)
elif value & (value + 1) != 0:
        print("Skipping 0x%x because it's not a power of two minus one" % (value))
        cocci.include_match(False)
elif literal.startswith('0x'):
        coccinelle.RESULT = cocci.make_expr("0x%x" % (value + 1))
else:
        coccinelle.RESULT = cocci.make_expr("%d" % (value + 1))

// Replace the literal mask with the calculated result.
@plus_one@
expression literal_mask.LITERAL;
position literal_mask.p;
expression add_one.RESULT;
identifier FUNC;
@@

-       (FUNC()@p & (LITERAL))
+       prandom_u32_max(RESULT)

@collapse_ret@
type T;
identifier VAR;
expression E;
@@

 {
-       T VAR;
-       VAR = (E);
-       return VAR;
+       return E;
 }

@drop_var@
type T;
identifier VAR;
@@

 {
-       T VAR;
        ... when != VAR
 }

Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: KP Singh <kpsingh@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz> # for ext4 and sbitmap
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com> # for drbd
Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390
Acked-by: Ulf Hansson <ulf.hansson@linaro.org> # for mmc
Acked-by: Darrick J. Wong <djwong@kernel.org> # for xfs
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-11 17:42:55 -06:00
Yosry Ahmed 35256d673a bpf: cgroup_iter: support cgroup1 using cgroup fd
Use cgroup_v1v2_get_from_fd() in cgroup_iter to support attaching to
both cgroup v1 and v2 using fds.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-10-11 06:50:15 -10:00
Linus Torvalds 27bc50fc90 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any negative
   reports (or any positive ones, come to that).
 
 - Also the Maple Tree from Liam R.  Howlett.  An overlapping range-based
   tree for vmas.  It it apparently slight more efficient in its own right,
   but is mainly targeted at enabling work to reduce mmap_lock contention.
 
   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.
 
   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   (https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com).
   This has yet to be addressed due to Liam's unfortunately timed
   vacation.  He is now back and we'll get this fixed up.
 
 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer.  It uses
   clang-generated instrumentation to detect used-unintialized bugs down to
   the single bit level.
 
   KMSAN keeps finding bugs.  New ones, as well as the legacy ones.
 
 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.
 
 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to support
   file/shmem-backed pages.
 
 - userfaultfd updates from Axel Rasmussen
 
 - zsmalloc cleanups from Alexey Romanov
 
 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and memory-failure
 
 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.
 
 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.
 
 - memcg cleanups from Kairui Song.
 
 - memcg fixes and cleanups from Johannes Weiner.
 
 - Vishal Moola provides more folio conversions
 
 - Zhang Yi removed ll_rw_block() :(
 
 - migration enhancements from Peter Xu
 
 - migration error-path bugfixes from Huang Ying
 
 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths.  For optimizations by PMEM drivers, DRM
   drivers, etc.
 
 - vma merging improvements from Jakub Matěn.
 
 - NUMA hinting cleanups from David Hildenbrand.
 
 - xu xin added aditional userspace visibility into KSM merging activity.
 
 - THP & KSM code consolidation from Qi Zheng.
 
 - more folio work from Matthew Wilcox.
 
 - KASAN updates from Andrey Konovalov.
 
 - DAMON cleanups from Kaixu Xia.
 
 - DAMON work from SeongJae Park: fixes, cleanups.
 
 - hugetlb sysfs cleanups from Muchun Song.
 
 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCY0HaPgAKCRDdBJ7gKXxA
 joPjAQDZ5LlRCMWZ1oxLP2NOTp6nm63q9PWcGnmY50FjD/dNlwEAnx7OejCLWGWf
 bbTuk6U2+TKgJa4X7+pbbejeoqnt5QU=
 =xfWx
 -----END PGP SIGNATURE-----

Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It it apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially onverted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-unintialized bugs down
   to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added aditional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
2022-10-10 17:53:04 -07:00
Linus Torvalds 3871d93b82 Perf events updates for v6.1:
- PMU driver updates:
 
      - Add AMD Last Branch Record Extension Version 2 (LbrExtV2)
        feature support for Zen 4 processors.
 
      - Extend the perf ABI to provide branch speculation information,
        if available, and use this on CPUs that have it (eg. LbrExtV2).
 
      - Improve Intel PEBS TSC timestamp handling & integration.
 
      - Add Intel Raptor Lake S CPU support.
 
      - Add 'perf mem' and 'perf c2c' memory profiling support on
        AMD CPUs by utilizing IBS tagged load/store samples.
 
      - Clean up & optimize various x86 PMU details.
 
  - HW breakpoints:
 
      - Big rework to optimize the code for systems with hundreds of CPUs and
        thousands of breakpoints:
 
         - Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
 	  per-CPU rwsem that is read-locked during most of the key operations.
 
 	- Improve the O(#cpus * #tasks) logic in toggle_bp_slot()
 	  and fetch_bp_busy_slots().
 
 	- Apply micro-optimizations & cleanups.
 
   - Misc cleanups & enhancements.
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmM/2pMRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iIMA/+J+MCEVTt9kwZeBtHoPX7iZ5gnq1+McoQ
 f6ALX19AO/ZSuA7EBA3cS3Ny5eyGy3ofYUnRW+POezu9CpflLW/5N27R2qkZFrWC
 A09B86WH676ZrmXt+oI05rpZ2y/NGw4gJxLLa4/bWF3g9xLfo21i+YGKwdOnNFpl
 DEdCVHtjlMcOAU3+on6fOYuhXDcYd7PKGcCfLE7muOMOAtwyj0bUDBt7m+hneZgy
 qbZHzDU2DZ5L/LXiMyuZj5rC7V4xUbfZZfXglG38YCW1WTieS3IjefaU2tREhu7I
 rRkCK48ULDNNJR3dZK8IzXJRxusq1ICPG68I+nm/K37oZyTZWtfYZWehW/d/TnPa
 tUiTwimabz7UUqaGq9ZptxwINcAigax0nl6dZ3EseeGhkDE6j71/3kqrkKPz4jth
 +fCwHLOrI3c4Gq5qWgPvqcUlUneKf3DlOMtzPKYg7sMhla2zQmFpYCPzKfm77U/Z
 BclGOH3FiwaK6MIjPJRUXTePXqnUseqCR8PCH/UPQUeBEVHFcMvqCaa15nALed8x
 dFi76VywR9mahouuLNq6sUNePlvDd2B124PygNwegLlBfY9QmKONg9qRKOnQpuJ6
 UprRJjLOOucZ/N/jn6+ShHkqmXsnY2MhfUoHUoMQ0QAI+n++e+2AuePo251kKWr8
 QlqKxd9PMQU=
 =LcGg
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf events updates from Ingo Molnar:
 "PMU driver updates:

   - Add AMD Last Branch Record Extension Version 2 (LbrExtV2) feature
     support for Zen 4 processors.

   - Extend the perf ABI to provide branch speculation information, if
     available, and use this on CPUs that have it (eg. LbrExtV2).

   - Improve Intel PEBS TSC timestamp handling & integration.

   - Add Intel Raptor Lake S CPU support.

   - Add 'perf mem' and 'perf c2c' memory profiling support on AMD CPUs
     by utilizing IBS tagged load/store samples.

   - Clean up & optimize various x86 PMU details.

  HW breakpoints:

   - Big rework to optimize the code for systems with hundreds of CPUs
     and thousands of breakpoints:

      - Replace the nr_bp_mutex global mutex with the bp_cpuinfo_sem
        per-CPU rwsem that is read-locked during most of the key
        operations.

      - Improve the O(#cpus * #tasks) logic in toggle_bp_slot() and
        fetch_bp_busy_slots().

      - Apply micro-optimizations & cleanups.

  - Misc cleanups & enhancements"

* tag 'perf-core-2022-10-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
  perf/hw_breakpoint: Annotate tsk->perf_event_mutex vs ctx->mutex
  perf: Fix pmu_filter_match()
  perf: Fix lockdep_assert_event_ctx()
  perf/x86/amd/lbr: Adjust LBR regardless of filtering
  perf/x86/utils: Fix uninitialized var in get_branch_type()
  perf/uapi: Define PERF_MEM_SNOOPX_PEER in kernel header file
  perf/x86/amd: Support PERF_SAMPLE_PHY_ADDR
  perf/x86/amd: Support PERF_SAMPLE_ADDR
  perf/x86/amd: Support PERF_SAMPLE_{WEIGHT|WEIGHT_STRUCT}
  perf/x86/amd: Support PERF_SAMPLE_DATA_SRC
  perf/x86/amd: Add IBS OP_DATA2 DataSrc bit definitions
  perf/mem: Introduce PERF_MEM_LVLNUM_{EXTN_MEM|IO}
  perf/x86/uncore: Add new Raptor Lake S support
  perf/x86/cstate: Add new Raptor Lake S support
  perf/x86/msr: Add new Raptor Lake S support
  perf/x86: Add new Raptor Lake S support
  bpf: Check flags for branch stack in bpf_read_branch_records helper
  perf, hw_breakpoint: Fix use-after-free if perf_event_open() fails
  perf: Use sample_flags for raw_data
  perf: Use sample_flags for addr
  ...
2022-10-10 09:27:46 -07:00
Linus Torvalds 0326074ff4 Networking changes for 6.1.
Core
 ----
 
  - Introduce and use a single page frag cache for allocating small skb
    heads, clawing back the 10-20% performance regression in UDP flood
    test from previous fixes.
 
  - Run packets which already went thru HW coalescing thru SW GRO.
    This significantly improves TCP segment coalescing and simplifies
    deployments as different workloads benefit from HW or SW GRO.
 
  - Shrink the size of the base zero-copy send structure.
 
  - Move TCP init under a new slow / sleepable version of DO_ONCE().
 
 BPF
 ---
 
  - Add BPF-specific, any-context-safe memory allocator.
 
  - Add helpers/kfuncs for PKCS#7 signature verification from BPF
    programs.
 
  - Define a new map type and related helpers for user space -> kernel
    communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
 
  - Allow targeting BPF iterators to loop through resources of one
    task/thread.
 
  - Add ability to call selected destructive functions.
    Expose crash_kexec() to allow BPF to trigger a kernel dump.
    Use CAP_SYS_BOOT check on the loading process to judge permissions.
 
  - Enable BPF to collect custom hierarchical cgroup stats efficiently
    by integrating with the rstat framework.
 
  - Support struct arguments for trampoline based programs.
    Only structs with size <= 16B and x86 are supported.
 
  - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
    sockets (instead of just TCP and UDP sockets).
 
  - Add a helper for accessing CLOCK_TAI for time sensitive network
    related programs.
 
  - Support accessing network tunnel metadata's flags.
 
  - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
 
  - Add support for writing to Netfilter's nf_conn:mark.
 
 Protocols
 ---------
 
  - WiFi: more Extremely High Throughput (EHT) and Multi-Link
    Operation (MLO) work (802.11be, WiFi 7).
 
  - vsock: improve support for SO_RCVLOWAT.
 
  - SMC: support SO_REUSEPORT.
 
  - Netlink: define and document how to use netlink in a "modern" way.
    Support reporting missing attributes via extended ACK.
 
  - IPSec: support collect metadata mode for xfrm interfaces.
 
  - TCPv6: send consistent autoflowlabel in SYN_RECV state
    and RST packets.
 
  - TCP: introduce optional per-netns connection hash table to allow
    better isolation between namespaces (opt-in, at the cost of memory
    and cache pressure).
 
  - MPTCP: support TCP_FASTOPEN_CONNECT.
 
  - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
 
  - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
 
  - Open vSwitch:
    - Allow specifying ifindex of new interfaces.
    - Allow conntrack and metering in non-initial user namespace.
 
  - TLS: support the Korean ARIA-GCM crypto algorithm.
 
  - Remove DECnet support.
 
 Driver API
 ----------
 
  - Allow selecting the conduit interface used by each port
    in DSA switches, at runtime.
 
  - Ethernet Power Sourcing Equipment and Power Device support.
 
  - Add tc-taprio support for queueMaxSDU parameter, i.e. setting
    per traffic class max frame size for time-based packet schedules.
 
  - Support PHY rate matching - adapting between differing host-side
    and link-side speeds.
 
  - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
 
  - Validate OF (device tree) nodes for DSA shared ports; make
    phylink-related properties mandatory on DSA and CPU ports.
    Enforcing more uniformity should allow transitioning to phylink.
 
  - Require that flash component name used during update matches one
    of the components for which version is reported by info_get().
 
  - Remove "weight" argument from driver-facing NAPI API as much
    as possible. It's one of those magic knobs which seemed like
    a good idea at the time but is too indirect to use in practice.
 
  - Support offload of TLS connections with 256 bit keys.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Microchip KSZ9896 6-port Gigabit Ethernet Switch
    - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
    - Analog Devices ADIN1110 and ADIN2111 industrial single pair
      Ethernet (10BASE-T1L) MAC+PHY.
    - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
 
  - Ethernet SFPs / modules:
    - RollBall / Hilink / Turris 10G copper SFPs
    - HALNy GPON module
 
  - WiFi:
    - CYW43439 SDIO chipset (brcmfmac)
    - CYW89459 PCIe chipset (brcmfmac)
    - BCM4378 on Apple platforms (brcmfmac)
 
 Drivers
 -------
 
  - CAN:
    - gs_usb: HW timestamp support
 
  - Ethernet PHYs:
    - lan8814: cable diagnostics
 
  - Ethernet NICs:
    - Intel (100G):
      - implement control of FCS/CRC stripping
      - port splitting via devlink
      - L2TPv3 filtering offload
    - nVidia/Mellanox:
      - tunnel offload for sub-functions
      - MACSec offload, w/ Extended packet number and replay
        window offload
      - significantly restructure, and optimize the AF_XDP support,
        align the behavior with other vendors
    - Huawei:
      - configuring DSCP map for traffic class selection
      - querying standard FEC statistics
      - querying SerDes lane number via ethtool
    - Marvell/Cavium:
      - egress priority flow control
      - MACSec offload
    - AMD/SolarFlare:
      - PTP over IPv6 and raw Ethernet
    - small / embedded:
      - ax88772: convert to phylink (to support SFP cages)
      - altera: tse: convert to phylink
      - ftgmac100: support fixed link
      - enetc: standard Ethtool counters
      - macb: ZynqMP SGMII dynamic configuration support
      - tsnep: support multi-queue and use page pool
      - lan743x: Rx IP & TCP checksum offload
      - igc: add xdp frags support to ndo_xdp_xmit
 
  - Ethernet high-speed switches:
    - Marvell (prestera):
      - support SPAN port features (traffic mirroring)
      - nexthop object offloading
    - Microchip (sparx5):
      - multicast forwarding offload
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - support RGMII cmode
    - NXP (felix):
      - standardized ethtool counters
    - Microchip (lan966x):
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
      - traffic policing and mirroring
      - link aggregation / bonding offload
      - QUSGMII PHY mode support
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - cold boot calibration support on WCN6750
    - support to connect to a non-transmit MBSSID AP profile
    - enable remain-on-channel support on WCN6750
    - Wake-on-WLAN support for WCN6750
    - support to provide transmit power from firmware via nl80211
    - support to get power save duration for each client
    - spectral scan support for 160 MHz
 
  - MediaTek WiFi (mt76):
    - WiFi-to-Ethernet bridging offload for MT7986 chips
 
  - RealTek WiFi (rtw89):
    - P2P support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmM7vtkACgkQMUZtbf5S
 Irvotg//dmh53rC+UMKO3OgOqPlSMnaqzbUdDEfN6mj4Mpox7Csb8zERVURHhBHY
 fvlXWsDgxmvgTebI5fvNC5+f1iW5xcqgJV2TWnNmDOKWwvQwb6qQfgixVmunvkpe
 IIukMXYt0dAf9bXeeEfbNXcCb85cPwB76stX0tMV6BX7osp3T0TL1fvFk0NJkL0j
 TeydLad/yAQtPb4TbeWYjNDoxPVDf0cVpUrevLGmWE88UMYmgTqPze+h1W5Wri52
 bzjdLklY/4cgcIZClHQ6F9CeRWqEBxvujA5Hj/cwOcn/ptVVJWUGi7sQo3sYkoSs
 HFu+F8XsTec14kGNC0Ab40eVdqs5l/w8+E+4jvgXeKGOtVns8DwoiUIzqXpyty89
 Ib04mffrwWNjFtHvo/kIsNwP05X2PGE9HUHfwsTUfisl/ASvMmQp7D7vUoqQC/4B
 AMVzT5qpjkmfBHYQQGuw8FxJhMeAOjC6aAo6censhXJyiUhIfleQsN0syHdaNb8q
 9RZlhAgQoVb6ZgvBV8r8unQh/WtNZ3AopwifwVJld2unsE/UNfQy2KyqOWBES/zf
 LP9sfuX0JnmHn8s1BQEUMPU1jF9ZVZCft7nufJDL6JhlAL+bwZeEN4yCiAHOPZqE
 ymSLHI9s8yWZoNpuMWKrI9kFexVnQFKmA3+quAJUcYHNMSsLkL8=
 =Gsio
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Introduce and use a single page frag cache for allocating small skb
     heads, clawing back the 10-20% performance regression in UDP flood
     test from previous fixes.

   - Run packets which already went thru HW coalescing thru SW GRO. This
     significantly improves TCP segment coalescing and simplifies
     deployments as different workloads benefit from HW or SW GRO.

   - Shrink the size of the base zero-copy send structure.

   - Move TCP init under a new slow / sleepable version of DO_ONCE().

  BPF:

   - Add BPF-specific, any-context-safe memory allocator.

   - Add helpers/kfuncs for PKCS#7 signature verification from BPF
     programs.

   - Define a new map type and related helpers for user space -> kernel
     communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).

   - Allow targeting BPF iterators to loop through resources of one
     task/thread.

   - Add ability to call selected destructive functions. Expose
     crash_kexec() to allow BPF to trigger a kernel dump. Use
     CAP_SYS_BOOT check on the loading process to judge permissions.

   - Enable BPF to collect custom hierarchical cgroup stats efficiently
     by integrating with the rstat framework.

   - Support struct arguments for trampoline based programs. Only
     structs with size <= 16B and x86 are supported.

   - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
     sockets (instead of just TCP and UDP sockets).

   - Add a helper for accessing CLOCK_TAI for time sensitive network
     related programs.

   - Support accessing network tunnel metadata's flags.

   - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.

   - Add support for writing to Netfilter's nf_conn:mark.

  Protocols:

   - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
     (MLO) work (802.11be, WiFi 7).

   - vsock: improve support for SO_RCVLOWAT.

   - SMC: support SO_REUSEPORT.

   - Netlink: define and document how to use netlink in a "modern" way.
     Support reporting missing attributes via extended ACK.

   - IPSec: support collect metadata mode for xfrm interfaces.

   - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
     packets.

   - TCP: introduce optional per-netns connection hash table to allow
     better isolation between namespaces (opt-in, at the cost of memory
     and cache pressure).

   - MPTCP: support TCP_FASTOPEN_CONNECT.

   - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.

   - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.

   - Open vSwitch:
      - Allow specifying ifindex of new interfaces.
      - Allow conntrack and metering in non-initial user namespace.

   - TLS: support the Korean ARIA-GCM crypto algorithm.

   - Remove DECnet support.

  Driver API:

   - Allow selecting the conduit interface used by each port in DSA
     switches, at runtime.

   - Ethernet Power Sourcing Equipment and Power Device support.

   - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
     traffic class max frame size for time-based packet schedules.

   - Support PHY rate matching - adapting between differing host-side
     and link-side speeds.

   - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.

   - Validate OF (device tree) nodes for DSA shared ports; make
     phylink-related properties mandatory on DSA and CPU ports.
     Enforcing more uniformity should allow transitioning to phylink.

   - Require that flash component name used during update matches one of
     the components for which version is reported by info_get().

   - Remove "weight" argument from driver-facing NAPI API as much as
     possible. It's one of those magic knobs which seemed like a good
     idea at the time but is too indirect to use in practice.

   - Support offload of TLS connections with 256 bit keys.

  New hardware / drivers:

   - Ethernet:
      - Microchip KSZ9896 6-port Gigabit Ethernet Switch
      - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
      - Analog Devices ADIN1110 and ADIN2111 industrial single pair
        Ethernet (10BASE-T1L) MAC+PHY.
      - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).

   - Ethernet SFPs / modules:
      - RollBall / Hilink / Turris 10G copper SFPs
      - HALNy GPON module

   - WiFi:
      - CYW43439 SDIO chipset (brcmfmac)
      - CYW89459 PCIe chipset (brcmfmac)
      - BCM4378 on Apple platforms (brcmfmac)

  Drivers:

   - CAN:
      - gs_usb: HW timestamp support

   - Ethernet PHYs:
      - lan8814: cable diagnostics

   - Ethernet NICs:
      - Intel (100G):
         - implement control of FCS/CRC stripping
         - port splitting via devlink
         - L2TPv3 filtering offload
      - nVidia/Mellanox:
         - tunnel offload for sub-functions
         - MACSec offload, w/ Extended packet number and replay window
           offload
         - significantly restructure, and optimize the AF_XDP support,
           align the behavior with other vendors
      - Huawei:
         - configuring DSCP map for traffic class selection
         - querying standard FEC statistics
         - querying SerDes lane number via ethtool
      - Marvell/Cavium:
         - egress priority flow control
         - MACSec offload
      - AMD/SolarFlare:
         - PTP over IPv6 and raw Ethernet
      - small / embedded:
         - ax88772: convert to phylink (to support SFP cages)
         - altera: tse: convert to phylink
         - ftgmac100: support fixed link
         - enetc: standard Ethtool counters
         - macb: ZynqMP SGMII dynamic configuration support
         - tsnep: support multi-queue and use page pool
         - lan743x: Rx IP & TCP checksum offload
         - igc: add xdp frags support to ndo_xdp_xmit

   - Ethernet high-speed switches:
      - Marvell (prestera):
         - support SPAN port features (traffic mirroring)
         - nexthop object offloading
      - Microchip (sparx5):
         - multicast forwarding offload
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - support RGMII cmode
      - NXP (felix):
         - standardized ethtool counters
      - Microchip (lan966x):
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
         - traffic policing and mirroring
         - link aggregation / bonding offload
         - QUSGMII PHY mode support

   - Qualcomm 802.11ax WiFi (ath11k):
      - cold boot calibration support on WCN6750
      - support to connect to a non-transmit MBSSID AP profile
      - enable remain-on-channel support on WCN6750
      - Wake-on-WLAN support for WCN6750
      - support to provide transmit power from firmware via nl80211
      - support to get power save duration for each client
      - spectral scan support for 160 MHz

   - MediaTek WiFi (mt76):
      - WiFi-to-Ethernet bridging offload for MT7986 chips

   - RealTek WiFi (rtw89):
      - P2P support"

* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
  eth: pse: add missing static inlines
  once: rename _SLOW to _SLEEPABLE
  net: pse-pd: add regulator based PSE driver
  dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
  ethtool: add interface to interact with Ethernet Power Equipment
  net: mdiobus: search for PSE nodes by parsing PHY nodes.
  net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
  net: add framework to support Ethernet PSE and PDs devices
  dt-bindings: net: phy: add PoDL PSE property
  net: marvell: prestera: Propagate nh state from hw to kernel
  net: marvell: prestera: Add neighbour cache accounting
  net: marvell: prestera: add stub handler neighbour events
  net: marvell: prestera: Add heplers to interact with fib_notifier_info
  net: marvell: prestera: Add length macros for prestera_ip_addr
  net: marvell: prestera: add delayed wq and flush wq on deinit
  net: marvell: prestera: Add strict cleanup of fib arbiter
  net: marvell: prestera: Add cleanup of allocated fib_nodes
  net: marvell: prestera: Add router nexthops ABI
  eth: octeon: fix build after netif_napi_add() changes
  net/mlx5: E-Switch, Return EBUSY if can't get mode lock
  ...
2022-10-04 13:38:03 -07:00
Linus Torvalds 26b84401da lsm/stable-6.1 PR 20221003
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmM68YIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOTbA//TR8i+Wy8iswUCmtfmYg91h1uebpl
 /kjNsSmfgivAUTGamr3eN2WRlGhZfkFDPIHa25uybSA6Q+75p4lst83Rt3HDbjkv
 Ga7grCXnHwSDwJoHOSeFh0pojV2u7Zvfmiib2U5hPZEmd3kBw3NCgAJVcSGN80B2
 dct36fzZNXjvpWDbygmFtRRkmEseslSkft8bUVvNZBP+B0zvv3vcNY1QFuKuK+W2
 8wWpvO/cCSmke5i2c2ktHSk2f8/Y6n26Ik/OTHcTVfoKZLRaFbXEzLyxzLrNWd6m
 hujXgcxszTtHdmoXx+J6uBauju7TR8pi1x8mO2LSGrlpRc1cX0A5ED8WcH71+HVE
 8L1fIOmZShccPZn8xRok7oYycAUm/gIfpmSLzmZA76JsZYAe+mp9Ze9FA6fZtSwp
 7Q/rfw/Rlz25WcFBe4xypP078HkOmqutkCk2zy5liR+cWGrgy/WKX15vyC0TaPrX
 tbsRKuCLkipgfXrTk0dX3kmhz+3bJYjqeZEt7sfPSZYpaOGkNXVmAW0wnCOTuLMU
 +8pIVktvQxMmACEj2gBMz11iooR4DpWLxOcQQR/impgCpNdZ60nA0a6KPJoIXC+5
 NfTa422FZkc99QRVblUZyWSgJBW78Z3ZAQcQlo1AGLlFydbfrSFTRLbmNJZo/Nkl
 KwpGvWs5nB0rVw0=
 =VZl5
 -----END PGP SIGNATURE-----

Merge tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull LSM updates from Paul Moore:
 "Seven patches for the LSM layer and we've got a mix of trivial and
  significant patches. Highlights below, starting with the smaller bits
  first so they don't get lost in the discussion of the larger items:

   - Remove some redundant NULL pointer checks in the common LSM audit
     code.

   - Ratelimit the lockdown LSM's access denial messages.

     With this change there is a chance that the last visible lockdown
     message on the console is outdated/old, but it does help preserve
     the initial series of lockdown denials that started the denial
     message flood and my gut feeling is that these might be the more
     valuable messages.

   - Open userfaultfds as readonly instead of read/write.

     While this code obviously lives outside the LSM, it does have a
     noticeable impact on the LSMs with Ondrej explaining the situation
     in the commit description. It is worth noting that this patch
     languished on the VFS list for over a year without any comments
     (objections or otherwise) so I took the liberty of pulling it into
     the LSM tree after giving fair notice. It has been in linux-next
     since the end of August without any noticeable problems.

   - Add a LSM hook for user namespace creation, with implementations
     for both the BPF LSM and SELinux.

     Even though the changes are fairly small, this is the bulk of the
     diffstat as we are also including BPF LSM selftests for the new
     hook.

     It's also the most contentious of the changes in this pull request
     with Eric Biederman NACK'ing the LSM hook multiple times during its
     development and discussion upstream. While I've never taken NACK's
     lightly, I'm sending these patches to you because it is my belief
     that they are of good quality, satisfy a long-standing need of
     users and distros, and are in keeping with the existing nature of
     the LSM layer and the Linux Kernel as a whole.

     The patches in implement a LSM hook for user namespace creation
     that allows for a granular approach, configurable at runtime, which
     enables both monitoring and control of user namespaces. The general
     consensus has been that this is far preferable to the other
     solutions that have been adopted downstream including outright
     removal from the kernel, disabling via system wide sysctls, or
     various other out-of-tree mechanisms that users have been forced to
     adopt since we haven't been able to provide them an upstream
     solution for their requests. Eric has been steadfast in his
     objections to this LSM hook, explaining that any restrictions on
     the user namespace could have significant impact on userspace.
     While there is the possibility of impacting userspace, it is
     important to note that this solution only impacts userspace when it
     is requested based on the runtime configuration supplied by the
     distro/admin/user. Frederick (the pathset author), the LSM/security
     community, and myself have tried to work with Eric during
     development of this patchset to find a mutually acceptable
     solution, but Eric's approach and unwillingness to engage in a
     meaningful way have made this impossible. I have CC'd Eric directly
     on this pull request so he has a chance to provide his side of the
     story; there have been no objections outside of Eric's"

* tag 'lsm-pr-20221003' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  lockdown: ratelimit denial messages
  userfaultfd: open userfaultfds with O_RDONLY
  selinux: Implement userns_create hook
  selftests/bpf: Add tests verifying bpf lsm userns_create hook
  bpf-lsm: Make bpf_lsm_userns_create() sleepable
  security, lsm: Introduce security_create_user_ns()
  lsm: clean up redundant NULL pointer check
2022-10-03 17:51:52 -07:00
Jakub Kicinski e52f7c1ddf Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in the left-over fixes before the net-next pull-request.

Conflicts:

drivers/net/ethernet/mediatek/mtk_ppe.c
  ae3ed15da5 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear")
  9d8cb4c096 ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc")
https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/

kernel/bpf/helpers.c
  8addbfc7b3 ("bpf: Gate dynptr API behind CAP_BPF")
  5679ff2f13 ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF")
  8a67f2de9b ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03 17:44:18 -07:00
Alexander Potapenko a6a7aaba7f bpf: kmsan: initialize BPF registers with zeroes
When executing BPF programs, certain registers may get passed
uninitialized to helper functions.  E.g.  when performing a JMP_CALL,
registers BPF_R1-BPF_R5 are always passed to the helper, no matter how
many of them are actually used.

Passing uninitialized values as function parameters is technically
undefined behavior, so we work around it by always initializing the
registers.

Link: https://lkml.kernel.org/r/20220915150417.722975-42-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-10-03 14:03:25 -07:00
Martin KaFai Lau 64696c40d0 bpf: Add __bpf_prog_{enter,exit}_struct_ops for struct_ops trampoline
The struct_ops prog is to allow using bpf to implement the functions in
a struct (eg. kernel module).  The current usage is to implement the
tcp_congestion.  The kernel does not call the tcp-cc's ops (ie.
the bpf prog) in a recursive way.

The struct_ops is sharing the tracing-trampoline's enter/exit
function which tracks prog->active to avoid recursion.  It is
needed for tracing prog.  However, it turns out the struct_ops
bpf prog will hit this prog->active and unnecessarily skipped
running the struct_ops prog.  eg.  The '.ssthresh' may run in_task()
and then interrupted by softirq that runs the same '.ssthresh'.
Skip running the '.ssthresh' will end up returning random value
to the caller.

The patch adds __bpf_prog_{enter,exit}_struct_ops for the
struct_ops trampoline.  They do not track the prog->active
to detect recursion.

One exception is when the tcp_congestion's '.init' ops is doing
bpf_setsockopt(TCP_CONGESTION) and then recurs to the same
'.init' ops.  This will be addressed in the following patches.

Fixes: ca06f55b90 ("bpf: Add per-program recursion prevention mechanism")
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20220929070407.965581-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-09-29 09:25:47 -07:00
Peter Zijlstra a1ebcd5943 Linux 6.0-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmMwwY4eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGdlwH/0ESzdb6F9zYWwHR
 E08har56/IfwjOsn1y+JuHibpwUjzskLzdwIfI5zshSZAQTj5/UyC0P7G/wcYh/Z
 INh1uHGazmDUkx4O3lwuWLR+mmeUxZRWdq4NTwYDRNPMSiPInVxz+cZJ7y0aPr2e
 wii7kMFRHgXmX5DMDEwuHzehsJF7vZrp8zBu2DqzVUGnbwD50nPbyMM3H4g9mute
 fAEpDG0X3+smqMaKL+2rK0W/Av/87r3U8ZAztBem3nsCJ9jT7hqMO1ICcKmFMviA
 DTERRMwWjPq+mBPE2CiuhdaXvNZBW85Ds81mSddS6MsO6+Tvuzfzik/zSLQJxlBi
 vIqYphY=
 =NqG+
 -----END PGP SIGNATURE-----

Merge branch 'v6.0-rc7'

Merge upstream to get RAPTORLAKE_S

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
2022-09-29 12:20:50 +02:00
Kui-Feng Lee 2c4fe44fb0 bpf: Handle show_fdinfo for the parameterized task BPF iterators
Show information of iterators in the respective files under
/proc/<pid>/fdinfo/.

For example, for a task file iterator with 1723 as the value of tid
parameter, its fdinfo would look like the following lines.

    pos:    0
    flags:  02000000
    mnt_id: 14
    ino:    38
    link_type:      iter
    link_id:        51
    prog_tag:       a590ac96db22b825
    prog_id:        299
    target_name:    task_file
    task_type:      TID
    tid: 1723

This patch add the last three fields.  task_type is the type of the
task parameter.  TID means the iterator visit only the thread
specified by tid.  The value of tid in the above example is 1723.  For
the case of PID task_type, it means the iterator visits only threads
of a process and will show the pid value of the process instead of a
tid.

Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/bpf/20220926184957.208194-4-kuifeng@fb.com
2022-09-28 16:30:00 -07:00
Kui-Feng Lee 21fb6f2aa3 bpf: Handle bpf_link_info for the parameterized task BPF iterators.
Add new fields to bpf_link_info that users can query it through
bpf_obj_get_info_by_fd().

Signed-off-by: Kui-Feng Lee <kuifeng@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/bpf/20220926184957.208194-3-kuifeng@fb.com
2022-09-28 16:29:55 -07:00