A skcipher_request object is made up of struct skcipher_request
followed by a variable-sized trailer. The allocation of the
skcipher_request and IV in crypt_iv_eboiv_gen is missing the
memory for struct skcipher_request. Fix it by adding it to
reqsize.
Fixes: e3023094df ("dm crypt: Avoid using MAX_CIPHER_BLOCKSIZE")
Cc: <stable@vger.kernel.org> #6.5+
Reported-by: Tatu Heikkilä <tatu.heikkila@gmail.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Now there are no readers of shrinker_rwsem, so we can simply replace it
with mutex lock.
[akpm@linux-foundation.org: update the fix to alloc_shrinker_info()]
Link: https://lkml.kernel.org/r/20230911094444.68966-46-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the md-bcache shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct cache_set.
Link: https://lkml.kernel.org/r/20230911094444.68966-27-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Coly Li <colyli@suse.de>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the md-raid5 shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct r5conf.
Link: https://lkml.kernel.org/r/20230911094444.68966-26-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Song Liu <song@kernel.org>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the dm-zoned-meta shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct dmz_metadata.
Link: https://lkml.kernel.org/r/20230911094444.68966-25-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the dm-bufio shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct dm_bufio_client.
Link: https://lkml.kernel.org/r/20230911094444.68966-24-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When raid5_get_active_stripe is called with a ctx containing a stripe_head in
its batch_last pointer, it can cause a deadlock if the task sleeps waiting on
another stripe_head to become available. The stripe_head held by batch_last
can be blocking the advancement of other stripe_heads, leading to no
stripe_heads being released so raid5_get_active_stripe waits forever.
Like with the quiesce state handling earlier in the function, batch_last
needs to be released by raid5_get_active_stripe before it waits for another
stripe_head.
Fixes: 3312e6c887 ("md/raid5: Keep a reference to last stripe_head for batch")
Cc: stable@vger.kernel.org # v6.0+
Signed-off-by: David Jeffery <djeffery@redhat.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20231002183422.13047-1-djeffery@redhat.com
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct dm_bio_prison.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200407.never.611-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct dm_stat.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200400.never.585-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct stripe_c.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200352.never.118-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct crypt_config.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200344.never.272-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct raid_set.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: dm-devel@redhat.com
Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20230915200335.never.098-kees@kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
We don't need to lock device to reject not supported request
in array_state_store(). No functional changes intended.
There are differences between ioctl and sysfs handling during stopping.
With this change, it will be easier to add additional steps which needs
to be completed before mddev_lock() is taken.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230928125517.12356-1-mariusz.tkaczyk@linux.intel.com
Before this patch, the implementation is hacky and hard to understand:
1) md_seq_start set pos to 1;
2) md_seq_show found pos is 1, then print Personalities;
3) md_seq_next found pos is 1, then it update pos to the first mddev;
4) md_seq_show found pos is not 1 or 2, show mddev;
5) md_seq_next found pos is not 1 or 2, update pos to next mddev;
6) loop 4-5 until the last mddev, then md_seq_next update pos to 2;
7) md_seq_show found pos is 2, then print unused devices;
8) md_seq_next found pos is 2, stop;
This patch remove the magic value and use seq_list_start/next/stop()
directly, and move printing "Personalities" to md_seq_start(),
"unsed devices" to md_seq_stop():
1) md_seq_start print Personalities, and then set pos to first mddev;
2) md_seq_show show mddev;
3) md_seq_next update pos to next mddev;
4) loop 2-3 until the last mddev;
5) md_seq_stop print unsed devices;
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230927061241.1552837-3-yukuai1@huaweicloud.com
`strncpy` is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
There are three such strncpy uses that this patch addresses:
The respective destination buffers are:
1) mddev->clevel
2) clevel
3) mddev->metadata_type
We expect mddev->clevel to be NUL-terminated due to its use with format
strings:
| ret = sprintf(page, "%s\n", mddev->clevel);
Furthermore, we can see that mddev->clevel is not expected to be
NUL-padded as `md_clean()` merely set's its first byte to NULL -- not
the entire buffer:
| static void md_clean(struct mddev *mddev)
| {
| mddev->array_sectors = 0;
| mddev->external_size = 0;
| ...
| mddev->level = LEVEL_NONE;
| mddev->clevel[0] = 0;
| ...
A suitable replacement for this instance is `memcpy` as we know the
number of bytes to copy and perform manual NUL-termination at a
specified offset. This really decays to just a byte copy from one buffer
to another. `strscpy` is also a considerable replacement but using
`slen` as the length argument would result in truncation of the last
byte unless something like `slen + 1` was provided which isn't the most
idiomatic strscpy usage.
For the next case, the destination buffer `clevel` is expected to be
NUL-terminated based on its usage within kstrtol() which expects
NUL-terminated strings. Note that, in context, this code removes a
trailing newline which is seemingly not required as kstrtol() can handle
trailing newlines implicitly. However, there exists further usage of
clevel (or buf) that would also like to have the newline removed. All in
all, with similar reasoning to the first case, let's just use memcpy as
this is just a byte copy and NUL-termination is handled manually.
The third and final case concerning `mddev->metadata_type` is more or
less the same as the other two. We expect that it be NUL-terminated
based on its usage with seq_printf:
| seq_printf(seq, " super external:%s",
| mddev->metadata_type);
... and we can surmise that NUL-padding isn't required either due to how
it is handled in md_clean():
| static void md_clean(struct mddev *mddev)
| {
| ...
| mddev->metadata_type[0] = 0;
| ...
So really, all these instances have precisely calculated lengths and
purposeful NUL-termination so we can just use memcpy to remove ambiguity
surrounding strncpy.
Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230925-strncpy-drivers-md-md-c-v1-1-2b0093b89c2b@google.com
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time checking via CONFIG_UBSAN_BOUNDS
(for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
As found with Coccinelle[1], add __counted_by for struct linear_conf.
Additionally, since the element count member must be set before accessing
the annotated flexible array member, move its initialization earlier.
[1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci
Cc: Song Liu <song@kernel.org>
Cc: linux-raid@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230915200328.never.064-kees@kernel.org
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to remove such checking.
This will also allow the array to be suspended even before the array
is ran.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-8-yukuai1@huaweicloud.com
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to remove such checking.
This will also allow the array to be suspended even before the array
is ran.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-7-yukuai1@huaweicloud.com
Now that mddev_suspend() doean't rely on 'mddev->pers' to be set, it's
safe to call mddev_suspend() earlier.
This will also be helper to refactor mddev_suspend() later.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-6-yukuai1@huaweicloud.com
After commit 4d27e927344a ("md: don't quiesce in mddev_suspend()"),
there is no need to check 'pers->quiesce' before calling
mddev_suspend().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-5-yukuai1@huaweicloud.com
'active_io' used to be initialized while the array is running, and
'mddev->pers' is set while the array is running as well. Hence caller
must hold 'reconfig_mutex' and guarantee 'mddev->pers' is set before
calling mddev_suspend().
Now that 'active_io' is initialized when mddev is allocated, such
restriction doesn't exist anymore. In the meantime, follow up patches
will refactor mddev_suspend(), hence add checking for 'mddev->pers' to
prevent null-ptr-deref.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-4-yukuai1@huaweicloud.com
Currently 'writes_pending' is initialized in pers->run for raid1/5/10,
and it's freed while deleing mddev, instead of pers->free. pers->run can
be called multiple times before mddev is deleted, and a helper
mddev_init_writes_pending() is used to prevent 'writes_pending' to be
initialized multiple times, this usage is safe but a litter weird.
On the other hand, 'writes_pending' is only initialized for raid1/5/10,
however, it's used in common layer, for example:
array_state_store
set_in_sync
if (!mddev->in_sync) -> in_sync is used for all levels
// access writes_pending
There might be some implicit dependency that I don't recognized to make
sure 'writes_pending' can only be accessed for raid1/5/10, but there are
no comments about that.
By the way, it make sense to initialize 'writes_pending' in common layer
because there are already three levels use it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-3-yukuai1@huaweicloud.com
'active_io' is used for mddev_suspend() and it's initialized in
md_run(), this restrict that 'reconfig_mutex' must be held and
"mddev->pers" must be set before calling mddev_suspend().
Initialize 'active_io' early so that mddev_suspend() is safe to call
once mddev is allocated, this will be helpful to refactor
mddev_suspend() in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825030956.1527023-2-yukuai1@huaweicloud.com
Before this patch, for read-only array:
md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, then it will
call remove_and_add_spares() directly to try to remove and add rdevs
from array.
After this patch:
1) md_check_recovery() check that 'MD_RECOVERY_NEEDED' is set, and the
worker 'sync_work' is not pending, and there are rdevs can be added
or removed, then it will queue new work md_start_sync();
2) md_start_sync() will call remove_and_add_spares() and exist;
This change make sure that array reconfiguration is independent from
daemon, and it'll be much easier to synchronize it with io, consier
that io may rely on daemon thread to be done.
Also fix a problem that 'pers->spars_active' is called after
remove_and_add_spares(), which order is wrong, because spares must
active first, and then remove_and_add_spares() can add spares to the
array, like what read-write case does:
1) daemon set 'MD_RECOVERY_RUNNING', register new sync thread to do
recovery;
2) recovery is done, md_do_sync() set 'MD_RECOVERY_DONE' before return;
3) daemon call 'pers->spars_active', and clear 'MD_RECOVERY_RUNNING';
4) in the next round of daemon, call remove_and_add_spares() to add
spares to the array.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-8-yukuai1@huaweicloud.com
There are no functional changes, just to make the code simpler and
prepare to delay remove_and_add_spares() to md_start_sync().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-6-yukuai1@huaweicloud.com
Before this patch, for read-write array:
1) md_check_recover() found that something need to be done, and it'll
try to grab 'reconfig_mutex'. The case that md_check_recover() need
to do something:
- array is not suspend;
- super_block need to be updated;
- 'MD_RECOVERY_NEEDED' or 'MD_RECOVERY_DONE' is set;
- unusual case related to safemode;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
md_check_recover() will try to choose a sync action, and then queue a
work md_start_sync().
3) md_start_sync() register sync_thread;
After this patch,
1) is the same;
2) if 'MD_RECOVERY_RUNNING' is not set, and 'MD_RECOVERY_NEEDED' is set,
queue a work md_start_sync() directly;
3) md_start_sync() will try to choose a sync action, and then register
sync_thread();
Because 'MD_RECOVERY_RUNNING' is cleared when sync_thread is done, 2)
and 3) and md_do_sync() is always ran in serial and they can never
concurrent, this change should not introduce any behavior change for now.
Also fix a problem that md_start_sync() can clear 'MD_RECOVERY_RUNNING'
without protection in error path, which might affect the logical in
md_check_recovery().
The advantage to change this is that array reconfiguration is
independent from daemon now, and it'll be much easier to synchronize it
with io, consider that io may rely on daemon thread to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-4-yukuai1@huaweicloud.com
There are no functional changes, on the one hand make the code cleaner,
on the other hand prevent following checkpatch error in the next patch to
delay choosing sync action to md_start_sync().
ERROR: do not use assignment in if condition
+ } else if ((spares = remove_and_add_spares(mddev, NULL))) {
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-3-yukuai1@huaweicloud.com
It's a little weird to borrow 'del_work' for md_start_sync(), declare
a new work_struct 'sync_work' for md_start_sync().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825031622.1530464-2-yukuai1@huaweicloud.com
Commit 4dba12881f ("dm zoned: support arbitrary number of devices")
made the pointers to additional zoned devices to be stored in a
dynamically allocated dmz->ddev array. However, this array is not freed.
Rename dmz_put_zoned_device to dmz_put_zoned_devices and fix it to
free the dmz->ddev array when cleaning up zoned device information.
Remove NULL assignment for all dmz->ddev elements and just free the
dmz->ddev array instead.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 4dba12881f ("dm zoned: support arbitrary number of devices")
Cc: stable@vger.kernel.org
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
table's list of devices that is managed using dm_{get,put}_device.
- Revert DM core's half-baked RCU optimization if IO submitter has set
REQ_NOWAIT. Can be revisited, and properly justified, after
comprehensively auditing all of DM to also pass GFP_NOWAIT for any
allocations if REQ_NOWAIT used.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmUEt7wACgkQxSPxCi2d
A1prNwf/RB4EyKiSx7XS3ysM6mh/BPGO5FNjWwHebkrSFzAkEowo4i0cY9lRD0N4
x9Wbd5bcV8HarH/fiyffQxgdfXspAIrMt8z5hRnfElkBLzg6hHixxg/3sFCwg+U3
LG6AZFNLil7VmDeca9Pd8MCyXoy1u4ErWjkz3fU8pzzT+NDwRZPZhUMd/MFCWag6
q22S8KMXkYKiAHqKauF52CeDH77XsO66G70t/AElemZ66PpyKpasg2p99RCuHgTg
7jNuMTM6qXYWSWw8OswVXCPZEVfCp4zTFv1ebu9bagfDKR4ppNxwzyz7/CMkir14
4uKKzQ/cy8QND6OR/05zKh4U3ctqyA==
=rVpu
-----END PGP SIGNATURE-----
Merge tag 'for-6.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix DM core retrieve_deps() UAF race due to missing locking of a DM
table's list of devices that is managed using dm_{get,put}_device.
- Revert DM core's half-baked RCU optimization if IO submitter has set
REQ_NOWAIT. Can be revisited, and properly justified, after
comprehensively auditing all of DM to also pass GFP_NOWAIT for any
allocations if REQ_NOWAIT used.
* tag 'for-6.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: don't attempt to queue IO under RCU protection
dm: fix a race condition in retrieve_deps
dm looks up the table for IO based on the request type, with an
assumption that if the request is marked REQ_NOWAIT, it's fine to
attempt to submit that IO while under RCU read lock protection. This
is not OK, as REQ_NOWAIT just means that we should not be sleeping
waiting on other IO, it does not mean that we can't potentially
schedule.
A simple test case demonstrates this quite nicely:
int main(int argc, char *argv[])
{
struct iovec iov;
int fd;
fd = open("/dev/dm-0", O_RDONLY | O_DIRECT);
posix_memalign(&iov.iov_base, 4096, 4096);
iov.iov_len = 4096;
preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
return 0;
}
which will instantly spew:
BUG: sleeping function called from invalid context at include/linux/sched/mm.h:306
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5580, name: dm-nowait
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
INFO: lockdep is turned off.
CPU: 7 PID: 5580 Comm: dm-nowait Not tainted 6.6.0-rc1-g39956d2dcd81 #132
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x11d/0x1b0
__might_resched+0x3c3/0x5e0
? preempt_count_sub+0x150/0x150
mempool_alloc+0x1e2/0x390
? mempool_resize+0x7d0/0x7d0
? lock_sync+0x190/0x190
? lock_release+0x4b7/0x670
? internal_get_user_pages_fast+0x868/0x2d40
bio_alloc_bioset+0x417/0x8c0
? bvec_alloc+0x200/0x200
? internal_get_user_pages_fast+0xb8c/0x2d40
bio_alloc_clone+0x53/0x100
dm_submit_bio+0x27f/0x1a20
? lock_release+0x4b7/0x670
? blk_try_enter_queue+0x1a0/0x4d0
? dm_dax_direct_access+0x260/0x260
? rcu_is_watching+0x12/0xb0
? blk_try_enter_queue+0x1cc/0x4d0
__submit_bio+0x239/0x310
? __bio_queue_enter+0x700/0x700
? kvm_clock_get_cycles+0x40/0x60
? ktime_get+0x285/0x470
submit_bio_noacct_nocheck+0x4d9/0xb80
? should_fail_request+0x80/0x80
? preempt_count_sub+0x150/0x150
? lock_release+0x4b7/0x670
? __bio_add_page+0x143/0x2d0
? iov_iter_revert+0x27/0x360
submit_bio_noacct+0x53e/0x1b30
submit_bio_wait+0x10a/0x230
? submit_bio_wait_endio+0x40/0x40
__blkdev_direct_IO_simple+0x4f8/0x780
? blkdev_bio_end_io+0x4c0/0x4c0
? stack_trace_save+0x90/0xc0
? __bio_clone+0x3c0/0x3c0
? lock_release+0x4b7/0x670
? lock_sync+0x190/0x190
? atime_needs_update+0x3bf/0x7e0
? timestamp_truncate+0x21b/0x2d0
? inode_owner_or_capable+0x240/0x240
blkdev_direct_IO.part.0+0x84a/0x1810
? rcu_is_watching+0x12/0xb0
? lock_release+0x4b7/0x670
? blkdev_read_iter+0x40d/0x530
? reacquire_held_locks+0x4e0/0x4e0
? __blkdev_direct_IO_simple+0x780/0x780
? rcu_is_watching+0x12/0xb0
? __mark_inode_dirty+0x297/0xd50
? preempt_count_add+0x72/0x140
blkdev_read_iter+0x2a4/0x530
do_iter_readv_writev+0x2f2/0x3c0
? generic_copy_file_range+0x1d0/0x1d0
? fsnotify_perm.part.0+0x25d/0x630
? security_file_permission+0xd8/0x100
do_iter_read+0x31b/0x880
? import_iovec+0x10b/0x140
vfs_readv+0x12d/0x1a0
? vfs_iter_read+0xb0/0xb0
? rcu_is_watching+0x12/0xb0
? rcu_is_watching+0x12/0xb0
? lock_release+0x4b7/0x670
do_preadv+0x1b3/0x260
? do_readv+0x370/0x370
__x64_sys_preadv2+0xef/0x150
do_syscall_64+0x39/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f5af41ad806
Code: 41 54 41 89 fc 55 44 89 c5 53 48 89 cb 48 83 ec 18 80 3d e4 dd 0d 00 00 74 7a 45 89 c1 49 89 ca 45 31 c0 b8 47 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 be 00 00 00 48 85 c0 79 4a 48 8b 0d da 55
RSP: 002b:00007ffd3145c7f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000147
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5af41ad806
RDX: 0000000000000001 RSI: 00007ffd3145c850 RDI: 0000000000000003
RBP: 0000000000000008 R08: 0000000000000000 R09: 0000000000000008
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
R13: 00007ffd3145c850 R14: 000055f5f0431dd8 R15: 0000000000000001
</TASK>
where in fact it is dm itself that attempts to allocate a bio clone with
GFP_NOIO under the rcu read lock, regardless of the request type.
Fix this by getting rid of the special casing for REQ_NOWAIT, and just
use the normal SRCU protected table lookup. Get rid of the bio based
table locking helpers at the same time, as they are now unused.
Cc: stable@vger.kernel.org
Fixes: 563a225c9f ("dm: introduce dm_{get,put}_live_table_bio called from dm_submit_bio")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
If there are multiple arrays in system and one mddevice is marked
with MD_DELETED and md_seq_next() is called in the middle of removal
then it _get()s proper device but it may _put() deleted one. As a result,
active counter may never be zeroed for mddevice and it cannot
be removed.
Put the device which has been _get with previous md_seq_next() call.
Cc: stable@vger.kernel.org
Fixes: 12a6caf273 ("md: only delete entries from all_mddevs when the disk is freed")
Reported-by: AceLan Kao <acelan@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217798
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230914152416.10819-1-mariusz.tkaczyk@linux.intel.com
There's a race condition in the multipath target when retrieve_deps
races with multipath_message calling dm_get_device and dm_put_device.
retrieve_deps walks the list of open devices without holding any lock
but multipath may add or remove devices to the list while it is
running. The end result may be memory corruption or use-after-free
memory access.
See this description of a UAF with multipath_message():
https://listman.redhat.com/archives/dm-devel/2022-October/052373.html
Fix this bug by introducing a new rw semaphore "devices_lock". We grab
devices_lock for read in retrieve_deps and we grab it for write in
dm_get_device and dm_put_device.
Reported-by: Luo Meng <luomeng12@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Tested-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There is a compile error when this commit is added:
md: raid1: fix potential OOB in raid1_remove_disk()
drivers/md/raid1.c: In function 'raid1_remove_disk':
drivers/md/raid1.c:1844:9: error: ISO C90 forbids mixed declarations
and code [-Werror=declaration-after-statement]
1844 | struct raid1_info *p = conf->mirrors + number;
| ^~~~~~
That's because the new code was inserted before the struct.
The change is move the struct command above this commit.
Fixes: 8b0472b50b ("md: raid1: fix potential OOB in raid1_remove_disk()")
Signed-off-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/46d929d0-2aab-4cf2-b2bf-338963e8ba5a@redhat.com
Commit a1d7671910 ("md: use mddev->external to select holder in
export_rdev()") fix the problem that 'claim_rdev' is used for
blkdev_get_by_dev() while 'rdev' is used for blkdev_put().
However, if mddev->external is changed from 0 to 1, then 'rdev' is used
for blkdev_get_by_dev() while 'claim_rdev' is used for blkdev_put(). And
this problem can be reporduced reliably by following:
New file: mdadm/tests/23rdev-lifetime
devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2
clean_up_test() {
pill -9 $pid
echo clear > /sys/block/md0/md/array_state
}
trap 'clean_up_test' EXIT
add_by_sysfs() {
while true; do
echo $devt > /sys/block/md0/md/new_dev
done
}
remove_by_sysfs(){
while true; do
echo remove > /sys/block/md0/md/dev-${devname}/state
done
}
echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"
add_by_sysfs &
pid="$pid $!"
remove_by_sysfs &
pid="$pid $!"
sleep $runtime
exit 0
Test cmd:
./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime
Test result:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 960 at block/bdev.c:618 blkdev_put+0x27c/0x330
Modules linked in: multipath md_mod loop
CPU: 0 PID: 960 Comm: test Not tainted 6.5.0-rc2-00121-g01e55c376936-dirty #50
RIP: 0010:blkdev_put+0x27c/0x330
Call Trace:
<TASK>
export_rdev.isra.23+0x50/0xa0 [md_mod]
mddev_unlock+0x19d/0x300 [md_mod]
rdev_attr_store+0xec/0x190 [md_mod]
sysfs_kf_write+0x52/0x70
kernfs_fop_write_iter+0x19a/0x2a0
vfs_write+0x3b5/0x770
ksys_write+0x74/0x150
__x64_sys_write+0x22/0x30
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fix the problem by recording if 'rdev' is used as holder.
Fixes: a1d7671910 ("md: use mddev->external to select holder in export_rdev()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825025532.1523008-3-yukuai1@huaweicloud.com
Except for initial reference, mddev->kobject is referenced by
rdev->kobject, and if the last rdev is freed, there is no guarantee that
mddev is still valid. Hence mddev should not be used anymore after
export_rdev().
This problem can be triggered by following test for mdadm at very
low rate:
New file: mdadm/tests/23rdev-lifetime
devname=${dev0##*/}
devt=`cat /sys/block/$devname/dev`
pid=""
runtime=2
clean_up_test() {
pill -9 $pid
echo clear > /sys/block/md0/md/array_state
}
trap 'clean_up_test' EXIT
add_by_sysfs() {
while true; do
echo $devt > /sys/block/md0/md/new_dev
done
}
remove_by_sysfs(){
while true; do
echo remove > /sys/block/md0/md/dev-${devname}/state
done
}
echo md0 > /sys/module/md_mod/parameters/new_array || die "create md0 failed"
add_by_sysfs &
pid="$pid $!"
remove_by_sysfs &
pid="$pid $!"
sleep $runtime
exit 0
Test cmd:
./test --save-logs --logdir=/tmp/ --keep-going --dev=loop --tests=23rdev-lifetime
Test result:
general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bcb: 0000 [#4] PREEMPT SMP
CPU: 0 PID: 1292 Comm: test Tainted: G D W 6.5.0-rc2-00121-g01e55c376936 #562
RIP: 0010:md_wakeup_thread+0x9e/0x320 [md_mod]
Call Trace:
<TASK>
mddev_unlock+0x1b6/0x310 [md_mod]
rdev_attr_store+0xec/0x190 [md_mod]
sysfs_kf_write+0x52/0x70
kernfs_fop_write_iter+0x19a/0x2a0
vfs_write+0x3b5/0x770
ksys_write+0x74/0x150
__x64_sys_write+0x22/0x30
do_syscall_64+0x40/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Fix this problem by don't dereference mddev after export_rdev().
Fixes: 3ce94ce5d0 ("md: fix duplicate filename for rdev")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230825025532.1523008-2-yukuai1@huaweicloud.com
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ
tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa
lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE
pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ
7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v
SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l
l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu
krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK
sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y
tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA
AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs
d0Y/zCZf1A==
=p1bd
-----END PGP SIGNATURE-----
Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:
"Pretty quiet round for this release. This contains:
- Add support for zoned storage to ublk (Andreas, Ming)
- Series improving performance for drivers that mark themselves as
needing a blocking context for issue (Bart)
- Cleanup the flush logic (Chengming)
- sed opal keyring support (Greg)
- Fixes and improvements to the integrity support (Jinyoung)
- Add some exports for bcachefs that we can hopefully delete again in
the future (Kent)
- deadline throttling fix (Zhiguo)
- Series allowing building the kernel without buffer_head support
(Christoph)
- Sanitize the bio page adding flow (Christoph)
- Write back cache fixes (Christoph)
- MD updates via Song:
- Fix perf regression for raid0 large sequential writes (Jan)
- Fix split bio iostat for raid0 (David)
- Various raid1 fixes (Heinz, Xueshi)
- raid6test build fixes (WANG)
- Deprecate bitmap file support (Christoph)
- Fix deadlock with md sync thread (Yu)
- Refactor md io accounting (Yu)
- Various non-urgent fixes (Li, Yu, Jack)
- Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li,
Ming, Nitesh, Ruan, Tejun, Thomas, Xu)"
* tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits)
block: use strscpy() to instead of strncpy()
block: sed-opal: keyring support for SED keys
block: sed-opal: Implement IOC_OPAL_REVERT_LSP
block: sed-opal: Implement IOC_OPAL_DISCOVERY
blk-mq: prealloc tags when increase tagset nr_hw_queues
blk-mq: delete redundant tagset map update when fallback
blk-mq: fix tags leak when shrink nr_hw_queues
ublk: zoned: support REQ_OP_ZONE_RESET_ALL
md: raid0: account for split bio in iostat accounting
md/raid0: Fix performance regression for large sequential writes
md/raid0: Factor out helper for mapping and submitting a bio
md raid1: allow writebehind to work on any leg device set WriteMostly
md/raid1: hold the barrier until handle_read_error() finishes
md/raid1: free the r1bio before waiting for blocked rdev
md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
drivers/rnbd: restore sysfs interface to rnbd-client
md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()
raid6: test: only check for Altivec if building on powerpc hosts
raid6: test: make sure all intermediate and artifact files are .gitignored
...
When a bio is split by md raid0, the newly created bio will not be tracked
by md for I/O accounting. Only the portion of I/O still assigned to the
original bio which was reduced by the split will be accounted for. This
results in md iostat data sometimes showing I/O values far below the actual
amount of data being sent through md.
md_account_bio() needs to be called for all bio generated by the bio split.
A simple example of the issue was generated using a raid0 device on partitions
to the same device. Since all raid0 I/O then goes to one device, it makes it
easy to see a gap between the md device and its sd storage. Reading an lvm
device on top of the md device, the iostat output (some 0 columns and extra
devices removed to make the data more compact) was:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
md2 1364.00 411496.00 0.00 0.00 411496
sde 1734.00 646144.00 0.00 0.00 646144
md2 1699.00 510680.00 0.00 0.00 510680
sde 2155.00 802784.00 0.00 0.00 802784
md2 803.00 241480.00 0.00 0.00 241480
sde 1016.00 377888.00 0.00 0.00 377888
md2 0.00 0.00 0.00 0.00 0
sde 0.00 0.00 0.00 0.00 0
I/O was generated doing large direct I/O reads (12M) with dd to a linear
lvm volume on top of the 4 leg raid0 device.
The md2 reads were showing as roughly 2/3 of the reads to the sde device
containing all of md2's raid partitions. The sum of reads to sde was
1826816 kB, which was the expected amount as it was the amount read by
dd. With the patch, the total reads from md will match the reads from
sde and be consistent with the amount of I/O generated.
Fixes: 10764815ff ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
Commit f00d7c85be ("md/raid0: fix up bio splitting.") among other
things changed how bio that needs to be split is submitted. Before this
commit, we have split the bio, mapped and submitted each part. After
this commit, we map only the first part of the split bio and submit the
second part unmapped. Due to bio sorting in __submit_bio_noacct() this
results in the following request ordering:
9,0 18 1181 0.525037895 15995 Q WS 1479315464 + 63392
Split off chunk-sized (1024 sectors) request:
9,0 18 1182 0.629019647 15995 X WS 1479315464 / 1479316488
Request is unaligned to the chunk so it's split in
raid0_make_request(). This is the first part mapped and punted to
bio_list:
8,0 18 7053 0.629020455 15995 A WS 739921928 + 1016 <- (9,0) 1479315464
Now raid0_make_request() returns, second part is postponed on
bio_list. __submit_bio_noacct() resorts the bio_list, mapped request
is submitted to the underlying device:
8,0 18 7054 0.629022782 15995 G WS 739921928 + 1016
Now we take another request from the bio_list which is the remainder
of the original huge request. Split off another chunk-sized bit from
it and the situation repeats:
9,0 18 1183 0.629024499 15995 X WS 1479316488 / 1479317512
8,16 18 6998 0.629025110 15995 A WS 739921928 + 1016 <- (9,0) 1479316488
8,16 18 6999 0.629026728 15995 G WS 739921928 + 1016
...
9,0 18 1184 0.629032940 15995 X WS 1479317512 / 1479318536 [libnetacq-write]
8,0 18 7059 0.629033294 15995 A WS 739922952 + 1016 <- (9,0) 1479317512
8,0 18 7060 0.629033902 15995 G WS 739922952 + 1016
...
This repeats until we consume the whole original huge request. Now we
finally get to processing the second parts of the split off requests
(in reverse order):
8,16 18 7181 0.629161384 15995 A WS 739952640 + 8 <- (9,0) 1479377920
8,0 18 7239 0.629162140 15995 A WS 739952640 + 8 <- (9,0) 1479376896
8,16 18 7186 0.629163881 15995 A WS 739951616 + 8 <- (9,0) 1479375872
8,0 18 7242 0.629164421 15995 A WS 739951616 + 8 <- (9,0) 1479374848
...
I guess it is obvious that this IO pattern is extremely inefficient way
to perform sequential IO. It also makes bio_list to grow to rather long
lengths.
Change raid0_make_request() to map both parts of the split bio. Since we
know we are provided with at most chunk-sized bios, we will always need
to split the incoming bio at most once.
Fixes: f00d7c85be ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
Factor out helper function for mapping and submitting a bio out of
raid0_make_request(). We will use it later for submitting both parts of
a split bio.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
handle_read_error() will call allow_barrier() to match the former barrier
raising. However, it should put the allow_barrier() at the end to avoid a
concurrent raid reshape.
Fixes: 689389a06c ("md/raid1: simplify handle_read_error().")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
Raid1 reshape will change mempool and r1conf::raid_disks which are
needed to free r1bio. allow_barrier() make a concurrent raid1_reshape()
possible. So, free the in-flight r1bio before waiting blocked rdev.
Fixes: 6bfe0b4990 ("md: support blocking writes to an array on device failure")
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
After allow_barrier, a concurrent raid1_reshape() will replace old mempool
and r1conf::raid_disks. Move allow_barrier() to the end of raid_end_bio_io(),
so that r1bio can be freed safely.
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com
Signed-off-by: Song Liu <song@kernel.org>
r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is
empty, and then submit 'flush_bio', however, r5l_log_flush_endio()
is clearing the list first and then clear the bio, which will cause
null-ptr-deref:
T1: submit flush io
raid5d
handle_active_stripes
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
submit_bio
// io1
T2: io1 is done
r5l_log_flush_endio
list_splice_tail_init
// clear the list
T3: submit new flush io
...
r5l_flush_stripe_to_raid
// list is empty
// add 'io_end_ios' to the list
bio_init
bio_uninit
// clear bio->bi_blkg
submit_bio
// null-ptr-deref
Fix this problem by clearing bio before clearing the list in
r5l_log_flush_endio().
Fixes: 0dd00cba99 ("raid5-cache: fully initialize flush_bio when needed")
Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org>
Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Commit ba9d9f1a707f ("Revert "md: unlock mddev before reap sync_thread in
action_store"") removed the scenario of calling md_unregister_thread()
without holding mddev->reconfig_mutex, so add a lock holding check before
acquiring mddev->sync_thread by passing mdev to md_unregister_thread().
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230803071711.2546560-1-lilingfeng@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
After commit b39f35ebe8 ("md: don't quiesce in mddev_suspend()"),
'conf->barrier' will be leaked in the case that raid10 takeover raid0:
level_store
pers->takeover -> raid10_takeover
raid10_takeover_raid0
WRITE_ONCE(conf->barrier, 1)
mddev_suspend
// still raid0
mddev->pers = pers
// switch to raid10
mddev_resume
// resume without suspend
After the above commit, mddev_resume() will not decrease 'conf->barrier'
that is set in raid10_takeover_raid0().
Fix this problem by not setting 'conf->barrier' in raid10_takeover_raid0().
By the way, this problem is found while I'm trying to make
mddev_suspend/resume() to be independent from raid personalities. raid10
is the only personality to use reference count in the quiesce() callback
and this problem is only related to raid10.
Fixes: b39f35ebe8 ("md: don't quiesce in mddev_suspend()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Link: https://lore.kernel.org/r/20230731022800.1424902-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
If rddev->raid_disk is greater than mddev->raid_disks, there will be
an out-of-bounds in raid1_remove_disk(). We have already found
similar reports as follows:
1) commit d17f744e88 ("md-raid10: fix KASAN warning")
2) commit 1ebc2cec0b ("dm raid: fix KASAN warning in raid5_remove_disk")
Fix this bug by checking whether the "number" variable is
valid.
Signed-off-by: Zhang Shurong <zhang_shurong@foxmail.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/tencent_0D24426FAC6A21B69AC0C03CE4143A508F09@qq.com
Signed-off-by: Song Liu <song@kernel.org>
Commit b13015af94 ("md/raid5-cache: Clear conf->log after finishing
work") introduce a new problem:
// caller hold reconfig_mutex
r5l_exit_log
flush_work(&log->disable_writeback_work)
r5c_disable_writeback_async
wait_event
/*
* conf->log is not NULL, and mddev_trylock()
* will fail, wait_event() can never pass.
*/
conf->log = NULL
Fix this problem by setting 'config->log' to NULL before wake_up() as it
used to be, so that wait_event() from r5c_disable_writeback_async() can
exist. In the meantime, move forward md_unregister_thread() so that
null-ptr-deref this commit fixed can still be fixed.
Fixes: b13015af94 ("md/raid5-cache: Clear conf->log after finishing work")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Previously, the bip's bi_size has been set before an integrity pages
were added. If a problem occurs in the process of adding pages for
bip, the bi_size mismatch problem must be dealt with.
When the page is successfully added to bvec, the bi_size is updated.
The parts affected by the change were also contained in this commit.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jinyoung Choi <j-young.choi@samsung.com>
Tested-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20230803024956epcms2p38186a17392706650c582d38ef3dbcd32@epcms2p3
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a new config option that controls building the buffer_head code, and
select it from all file systems and stacking drivers that need it.
For the block device nodes and alternative iomap based buffered I/O path
is provided when buffer_head support is not enabled, and iomap needs a
a small tweak to define the IOMAP_F_BUFFER_HEAD flag to 0 to not call
into the buffer_head code when it doesn't exist.
Otherwise this is just Kconfig and ifdef changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20230801172201.1923299-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Several reasons why 'reconfig_mutex' should be held:
1) rdev_for_each() is not safe to be called without the lock, because
rdev can be removed concurrently.
2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should not
be called concurrently.
3) mddev_suspend() from mddev_destroy/create_serial_pool() should be
protected by the lock.
Fixes: 10c92fca63 ("md-bitmap: create and destroy wb_info_pool with the change of backlog")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Local variable is definied first in the beginning of backlog_store(),
there is no need to define it again.
Fixes: 8c13ab115b ("md/bitmap: don't set max_write_behind if there is no write mostly device")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Commit 2ae6aaf769 ("md/raid10: fix io loss while replacement replace
rdev") reads replacement first to prevent io loss. However, there are same
issue in wait_blocked_dev() and raid10_handle_discard(), too. Fix it by
using dereference_rdev_and_rrdev() to get devices.
Fixes: d30588b273 ("md/raid10: improve raid10 discard request")
Fixes: f2e7e269a7 ("md/raid10: pull the code that wait for blocked dev into one function")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Factor out a helper to get 'rdev' and 'replacement' from config->mirrors.
Just to make code cleaner and prepare to fix the bug of io loss while
'replacement' replace 'rdev'.
There is no functional change.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
After commit 4ca40c2ce0 ("md/raid10: Allow replacement device to be
replace old drive."), 'rdev' and 'replacement' could appear to be
identical. There are already checks for that in wait_blocked_dev() and
raid10_write_request(). Add check for raid10_handle_discard() now.
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-2-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
wake_up is called unconditionally in a few paths such as make_request(),
which cause lock contention under high concurrency workload like below
raid1_end_write_request
wake_up
__wake_up_common_lock
spin_lock_irqsave
Improve performance by only call wake_up() if waitqueue is not empty
Fio test script:
[global]
name=random reads and writes
ioengine=libaio
direct=1
readwrite=randrw
rwmixread=70
iodepth=64
buffered=0
filename=/dev/md0
size=1G
runtime=30
time_based
randrepeat=0
norandommap
refill_buffers
ramp_time=10
bs=4k
numjobs=400
group_reporting=1
[job1]
Test result with 2 ramdisk in raid1 on a Intel Broadwell 56 cores server.
Before this patch With this patch
READ BW=4621MB/s BW=7337MB/s
WRITE BW=1980MB/s BW=3144MB/s
The patch is inspired by Yu Kuai's change for raid10:
https://lore.kernel.org/r/20230621105728.1268542-1-yukuai1@huaweicloud.com
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230705113227.148494-1-jinpu.wang@ionos.com
Signed-off-by: Song Liu <song@kernel.org>
memalloc_noio_save() is called for the first mddev_suspend(), and
repeated mddev_suspend() only increase 'suspended'. However,
memalloc_noio_restore() is also called for the first mddev_resume(),
which means that memory reclaim will be enabled before the last
mddev_resume() is called, while the array is still suspended.
Fix this problem by restore 'noio_flag' for the last mddev_resume().
Fixes: 78f57ef9d5 ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Some levels doesn't implement "pers->quiesce", for example
raid0_quiesce() is empty, and now that all levels will drop 'active_io'
until io is done, wait for 'active_io' to be 0 is enough to make sure all
normal io is done, and percpu_ref_kill() for 'active_io' will make sure
no new normal io can be dispatched. There is no need to call
"pers->quiesce" anymore from mddev_suspend().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
In fix_read_error(), 'success' will be checked immediately after assigning
it, if it is set to 1 then the loop will break. Checking it again in
condition of loop is redundant. Clean it up.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
We dereference r10_bio->read_slot too many times in fix_read_error().
Optimize it by using a variable to store read_slot.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230623173236.2513554-2-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
New disk should be added to "removed" position first instead of to be a
replacement. Commit 6090368abc ("md/raid10: prioritize adding disk to
'removed' mirror") has fixed this issue for raid10. Fix it for raid1 now.
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230627014332.3810102-1-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-9-yukuai1@huaweicloud.com
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-8-yukuai1@huaweicloud.com
use md_account_bio() to enable io accounting, also make sure
mddev_suspend() will wait for all io to be done.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-7-yukuai1@huaweicloud.com
Make sure that 'active_io' will represent inflight io instead of io that
is dispatching, and io accounting from all levels will be consistent.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-6-yukuai1@huaweicloud.com
Two problems can be fixed this way:
1) 'active_io' will represent inflight io instead of io that is
dispatching.
2) If io accounting is enabled or disabled while io is still inflight,
bio_start_io_acct() and bio_end_io_acct() is not balanced and io
inflight counter will be leaked.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-5-yukuai1@huaweicloud.com
Io will only be accounted as done from raid5_align_endio() if the io
succeeded, and io inflight counter will be leaked if such io failed.
Fix this problem by switching to use md_account_bio() for io accounting.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-4-yukuai1@huaweicloud.com
Currently, 'active_io' is grabbed before make_reqeust() is called, and
it's dropped immediately make_reqeust() returns. Hence 'active_io'
actually means io is dispatching, not io is inflight.
For raid0 and raid456 that io accounting is enabled, 'active_io' will
also be grabbed when bio is cloned for io accounting, and this 'active_io'
is dropped until io is done.
Always clone new bio so that 'active_io' will mean that io is inflight,
raid1 and raid10 will switch to use this method in later patches.
Now that bio will be cloned even if io accounting is disabled, also
rename related structure from '*_acct_*' to '*_clone_*'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-3-yukuai1@huaweicloud.com
'io_acct_set' is only used for raid0 and raid456, prepare to use it for
raid1 and raid10, so that io accounting from different levels can be
consistent.
By the way, follow up patches will also use this io clone mechanism to
make sure 'active_io' represents in flight io, not io that is dispatching,
so that mddev_suspend will wait for io to be done as designed.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230621165110.1498313-2-yukuai1@huaweicloud.com
The support for bitmaps on files is a very bad idea abusing various kernel
APIs, and fundamentally requires the file to not be on the actual array
without a way to check that this is actually the case. Add a deprecation
warning to see if we might be able to eventually drop it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-12-hch@lst.de
The support for write intent bitmaps in files on an external files in md
is a hot mess that abuses ->bmap to map file offsets into physical device
objects, and also abuses buffer_heads in a creative way.
Make this code optional so that MD can be built into future kernels
without buffer_head support, and so that we can eventually deprecate it.
Note this does not affect the internal bitmap support, which has none of
the problems.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-11-hch@lst.de
The md driver allocates pages for storing the bitmap file data, which
are not page cache pages, and then stores the page granularity file
offset in page->index, which is a field that isn't really valid except
for page cache pages.
Use a separate index for the superblock, and use the scheme used at
read size to recalculate the index for the bitmap pages instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-10-hch@lst.de
Diretly apply mddev->bitmap_info.offset to the sector number to read
instead of doing that in both callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-9-hch@lst.de
Convert read_sb_page to the normal kernel coding style, calculate the
target sector only once, and add a local iosize variable to make the call
to sync_page_io more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-8-hch@lst.de
Split the confusing loop in md_bitmap_init_from_disk that iterates over
all chunks but also needs to read and map the pages into three separate
loops: one that iterates over the pages to read them, a second optional
one to iterate over the pages to mark them invalid if the bitmaps are
out of date, and a final one that actually iterates over the chunks.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306160552.smw0qbmb-lkp@intel.com/
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-7-hch@lst.de
Make the difference to read_sb_page clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-6-hch@lst.de
Split the file write code out of write_page into a separate helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-5-hch@lst.de
Don't bother allocating an extra buffer in the I/O failure handler and
instead use the printk built-in format to print the last 4 path name
components.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-4-hch@lst.de
Just a small tidyup to prepare for bigger changes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-3-hch@lst.de
Set BITMAP_WRITE_ERROR directly in write_sb_page instead of propagating
the error to the caller and setting it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230615064840.629492-2-hch@lst.de
For md_check_recovery():
1) if 'MD_RECOVERY_RUNING' is not set, register new sync_thread.
2) if 'MD_RECOVERY_RUNING' is set:
a) if 'MD_RECOVERY_DONE' is not set, don't do anything, wait for
md_do_sync() to be done.
b) if 'MD_RECOVERY_DONE' is set, unregister sync_thread. Current code
expects that sync_thread is not NULL, otherwise new sync_thread will
be registered, which will corrupt the array.
Make sure md_check_recovery() won't register new sync_thread if
'MD_RECOVERY_RUNING' is still set, and a new WARN_ON_ONCE() is added for
the above corruption,
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-7-yukuai1@huaweicloud.com
md_reap_sync_thread() is just replaced with wait_event(resync_wait, ...)
from action_store(), just make sure action_store() will still wait for
everything to be done in md_reap_sync_thread().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewd-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-6-yukuai1@huaweicloud.com
Our test found a following deadlock in raid10:
1) Issue a normal write, and such write failed:
raid10_end_write_request
set_bit(R10BIO_WriteError, &r10_bio->state)
one_write_done
reschedule_retry
// later from md thread
raid10d
handle_write_completed
list_add(&r10_bio->retry_list, &conf->bio_end_io_list)
// later from md thread
raid10d
if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
list_move(conf->bio_end_io_list.prev, &tmp)
r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
raid_end_bio_io(r10_bio)
Dependency chain 1: normal io is waiting for updating superblock
2) Trigger a recovery:
raid10_sync_request
raise_barrier
Dependency chain 2: sync thread is waiting for normal io
3) echo idle/frozen to sync_action:
action_store
mddev_lock
md_unregister_thread
kthread_stop
Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread
4) md thread can't update superblock:
raid10d
md_check_recovery
if (mddev_trylock(mddev))
md_update_sb
Dependency chain 4: update superblock is waiting for 'reconfig_mutex'
Hence cyclic dependency exist, in order to fix the problem, we must
break one of them. Dependency 1 and 2 can't be broken because they are
foundation design. Dependency 4 may be possible if it can be guaranteed
that no io can be inflight, however, this requires a new mechanism which
seems complex. Dependency 3 is a good choice, because idle/frozen only
requires sync thread to finish, which can be done asynchronously that is
already implemented, and 'reconfig_mutex' is not needed anymore.
This patch switch 'idle' and 'frozen' to wait sync thread to be done
asynchronously, and this patch also add a sequence counter to record how
many times sync thread is done, so that 'idle' won't keep waiting on new
started sync thread.
Noted that raid456 has similiar deadlock([1]), and it's verified[2] this
deadlock can be fixed by this patch as well.
[1] https://lore.kernel.org/linux-raid/5ed54ffc-ce82-bf66-4eff-390cb23bc1ac@molgen.mpg.de/T/#t
[2] https://lore.kernel.org/linux-raid/e9067438-d713-f5f3-0d3d-9e6b0e9efa0e@huaweicloud.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-5-yukuai1@huaweicloud.com
Currently, for idle and frozen, action_store will hold 'reconfig_mutex'
and call md_reap_sync_thread() to stop sync thread, however, this will
cause deadlock (explained in the next patch). In order to fix the
problem, following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() does.
Consider that action_store() will set/clear 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems, for example,
frozen just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clear 'MD_RECOVERY_FROZEN' and new sync thread is started, which
might starve in progress frozen. A mutex is added to synchronize idle
and frozen from action_store().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-4-yukuai1@huaweicloud.com
Prepare to handle 'idle' and 'frozen' differently to fix a deadlock, there
are no functional changes except that MD_RECOVERY_RUNNING is checked
again after 'reconfig_mutex' is held.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-3-yukuai1@huaweicloud.com
This reverts commit 9dfbdafda3.
Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:
list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
__list_add_valid+0xfc/0x140
insert_work+0x78/0x1a0
__queue_work+0x500/0xcf4
queue_work_on+0xe8/0x12c
md_check_recovery+0xa34/0xf30
raid10d+0xb8/0x900 [raid10]
md_thread+0x16c/0x2cc
kthread+0x1a4/0x1ec
ret_from_fork+0x10/0x18
This is because work is requeued while it's still inside workqueue:
t1: t2:
action_store
mddev_lock
if (mddev->sync_thread)
mddev_unlock
md_unregister_thread
// first sync_thread is done
md_check_recovery
mddev_try_lock
/*
* once MD_RECOVERY_DONE is set, new sync_thread
* can start.
*/
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
INIT_WORK(&mddev->del_work, md_start_sync)
queue_work(md_misc_wq, &mddev->del_work)
test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
// set pending bit
insert_work
list_add_tail
mddev_unlock
mddev_lock_nointr
md_reap_sync_thread
// MD_RECOVERY_RUNNING is cleared
mddev_unlock
t3:
// before queued work started from t2
md_check_recovery
// MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
INIT_WORK(&mddev->del_work, md_start_sync)
work->data = 0
// work pending bit is cleared
queue_work(md_misc_wq, &mddev->del_work)
insert_work
list_add_tail
// list is corrupted
The above commit is reverted to fix the problem, the deadlock this
commit tries to fix will be fixed in following patches.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529132037.2124527-2-yukuai1@huaweicloud.com
When using the cleaner policy to decommission the cache, there is
never any writeback started from the cache as it is constantly delayed
due to normal I/O keeping the device busy. Meaning @idle=false was
always being passed to clean_target_met()
Fix this by adding a specific 'cleaner' flag that is set when the
cleaner policy is configured. This flag serves to always allow the
cleaner's writeback work to be queued until the cache is
decommissioned (even if the cache isn't idle).
Reported-by: David Jeffery <djeffery@redhat.com>
Fixes: b29d4986d0 ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
__md_stop_writes() and __md_stop() will modify many fields that are
protected by 'reconfig_mutex', and all the callers will grab
'reconfig_mutex' except for md_stop().
Also, update md_stop() to make certain 'reconfig_mutex' is held using
lockdep_assert_held().
Fixes: 9d09e663d5 ("dm: raid456 basic support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
There are four equivalent goto tags in raid_ctr(), clean them up to
use just one.
There is no functional change and this is preparation to fix
raid_ctr()'s unprotected md_stop().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
In the error paths 'bad_stripe_cache' and 'bad_check_reshape',
'reconfig_mutex' is still held after raid_ctr() returns.
Fixes: 9dbd1aa3a8 ("dm raid: add reshaping support to the target")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
If the statement "recalc_tags = kvmalloc(recalc_tags_size, GFP_NOIO);"
fails, we call "vfree(recalc_buffer)" and we jump to the label "oom".
If the condition "recalc_sectors >= 1U << ic->sb->log2_sectors_per_block"
is false, we jump to the label "free_ret" and call "vfree(recalc_buffer)"
again, on an already released memory block.
Fix the bug by setting "recalc_buffer = NULL" after freeing it.
Fixes: da8b4fc1f6 ("dm integrity: only allocate recalculate buffer when needed")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmSjJ2IQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsQMEACQiUBw81tXvetYhz3P/4KrrjvUobgqMU0w
jtrxqMgPee9FbqCShpj76c+La5wu23DnlrCXoHZxFQuiQLnsX5xFV66NYVi+W1CN
k5MHP7f2e9V0T7qJ9UoHFRV1k22LF4X6T8njEZimxsm/uXfpav/knkhI7nUDnB1K
wxlu9akD2Bo/X9O2NTS+X6qjoawZ6rDWN15THMXlC45VzJPLmIcs07Ev+mvw21KE
XqasoZrxEO0S8dWxmJgJGqnRIOQptTS5U+0OPBZT8H220Qp/1q0pQHPw6iLXNrkc
w1a2W1Bge012gjJt7gCMkdDnZb76sKiyGuMbFME7DoRbLCQeaOtoSfmg7NoRI2gp
74TCSr7dPWZUVUy5Tmsy0DCv0552vIbnlQ69W6Xwx8YkplM3FPiMpWrQ5JWEHdvv
Zl84mLP6Yyo54JVuk9zi8q/2L0HfyfMDj4UM/mNs8hwmcUSbPO2TKdIWDaq8xPuS
Ed+D+kg6XFux8tLnCSDLNbaD5JE+ak9gTVhNdRa/zFE04o/OeidscKEqRSYTkdXL
2p34qtw5kEQocO4Pa3eUGO6KJCDTR36Rms5p6ZFybL4O2oZYrAbRi1TGDxaG2Hag
GCr2vaFbmz1zbGuMpFhLha5B7HeDLs+PHOn+B1iUNjEr9RC0EOHV7moJKqjxlnCh
4mBkK/Nlyg==
=kSeX
-----END PGP SIGNATURE-----
Merge tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
"Mostly items that came in a bit late for the initial pull request,
wanted to make sure they had the appropriate amount of linux-next soak
before going upstream.
Outside of stragglers, just generic fixes for either merge window
items, or longer standing bugs"
* tag 'block-6.5-2023-07-03' of git://git.kernel.dk/linux: (25 commits)
md/raid0: add discard support for the 'original' layout
nvme: disable controller on reset state failure
nvme: sync timeout work on failed reset
nvme: ensure unquiesce on teardown
cdrom/gdrom: Fix build error
nvme: improved uring polling
block: add request polling helper
nvme-mpath: fix I/O failure with EAGAIN when failing over I/O
nvme: host: fix command name spelling
blk-sysfs: add a new attr_group for blk_mq
blk-iocost: move wbt_enable/disable_default() out of spinlock
blk-wbt: cleanup rwb_enabled() and wbt_disabled()
blk-wbt: remove dead code to handle wbt enable/disable with io inflight
blk-wbt: don't create wbt sysfs entry if CONFIG_BLK_WBT is disabled
blk-mq: fix two misuses on RQF_USE_SCHED
blk-throttle: Fix io statistics for cgroup v1
bcache: Fix bcache device claiming
bcache: Alloc holder object before async registration
raid10: avoid spin_lock from fastpath from raid10_unplug()
md: fix 'delete_mutex' deadlock
...
API:
- Add linear akcipher/sig API.
- Add tfm cloning (hmac, cmac).
- Add statesize to crypto_ahash.
Algorithms:
- Allow only odd e and restrict value in FIPS mode for RSA.
- Replace LFSR with SHA3-256 in jitter.
- Add interface for gathering of raw entropy in jitter.
Drivers:
- Fix race on data_avail and actual data in hwrng/virtio.
- Add hash and HMAC support in starfive.
- Add RSA algo support in starfive.
- Add support for PCI device 0x156E in ccp.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmSdECcACgkQxycdCkmx
i6dW3g//a4DR6aaqYF8pU4svAzO56a0Plx3DVHUiJ4ygRB7xOzrQqXjCren6wY2a
LFuetwxebAhIAPsC79vI+3j8VAIlU9cNVqOxBIJHGY7wFO4m1AjqBjlealzqLrth
+nEIeUibqLeRw7imOO4adzSsKuSQgyU5rPtKWfrGqqI3RhuMgfWroCtmJ82jmq5l
uMZgB+aGGkzyXztxubHRPeJ3nOFEzo95SscpJ43lOjMcURRBhEa+20jXDhUGwpI7
9ycFV31AW+tfkIprAcliiIzZuwIbzlCkte6AxjAVsN100T/wh9JS1Y+uf1P0oZ9y
AUQQKyc8/QpSkzHZPTncat5P6zta28r8Q5neCvEEEGGuOE8Oc6kb0Os+RE5ANMU4
2A/zrKGOMIWeEWwXGc51xT3gxyl/Rn5wLw1pW7Lm4d5osGT9jiVXx/g66hKLpagJ
jegI6CqgvUajkRNi7JPVnSAauu0Ay8O6pU37/8gLOXNGVZBqONpRimk9qB05LNSF
QYzM2sgYv1tQEmjnG8jLhF5Z8brnqYTv2TZwBX43W10EDQNqUYUDff9Flean5xCb
+2mxJc81rgtUffnMXyYvQwKLhVKoLpeLR6Ts455S5aP06WAfoyEJyYTA/LHG24GX
H2HdS9g5y/K15k9yygMWaXgAx7O7MjM9gEa2VQakhnByj/eQM0s=
=rOLu
-----END PGP SIGNATURE-----
Merge tag 'v6.5-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
"API:
- Add linear akcipher/sig API
- Add tfm cloning (hmac, cmac)
- Add statesize to crypto_ahash
Algorithms:
- Allow only odd e and restrict value in FIPS mode for RSA
- Replace LFSR with SHA3-256 in jitter
- Add interface for gathering of raw entropy in jitter
Drivers:
- Fix race on data_avail and actual data in hwrng/virtio
- Add hash and HMAC support in starfive
- Add RSA algo support in starfive
- Add support for PCI device 0x156E in ccp"
* tag 'v6.5-p1' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (85 commits)
crypto: akcipher - Do not copy dst if it is NULL
crypto: sig - Fix verify call
crypto: akcipher - Set request tfm on sync path
crypto: sm2 - Provide sm2_compute_z_digest when sm2 is disabled
hwrng: imx-rngc - switch to DEFINE_SIMPLE_DEV_PM_OPS
hwrng: st - keep clock enabled while hwrng is registered
hwrng: st - support compile-testing
hwrng: imx-rngc - fix the timeout for init and self check
KEYS: asymmetric: Use new crypto interface without scatterlists
KEYS: asymmetric: Move sm2 code into x509_public_key
KEYS: Add forward declaration in asymmetric-parser.h
crypto: sig - Add interface for sign/verify
crypto: akcipher - Add sync interface without SG lists
crypto: cipher - On clone do crypto_mod_get()
crypto: api - Add __crypto_alloc_tfmgfp
crypto: api - Remove crypto_init_ops()
crypto: rsa - allow only odd e and restrict value in FIPS mode
crypto: geniv - Split geniv out of AEAD Kconfig option
crypto: algboss - Add missing dependency on RNG2
crypto: starfive - Add RSA algo support
...
We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in 'raid0_handle_discard()' assumes the 'alternate'
layout. We've seen this corruption using ext4 but other filesystems are
likely susceptible as well.
More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
layout in such a way the layout matches what the 'alternate' layout
would have produced. Thus, not all raid0 devices with the 'original'
layout, different size disks and discard enabled will encounter this
corruption.
The 3.14 kernel inadvertently changed the raid0 disk layout for different
size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
same raid0 array could corrupt data. This lead to the creation of the
'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
(to match the post 3.14 layout) in the 5.4 kernel time frame and an option
to tell the kernel which layout to use (since it couldn't be autodetected).
However, when the 'original' layout was added back to 5.4 discard support
for the 'original' layout was not added leading this issue.
I've been able to reliably reproduce the corruption with the following
test case:
1. create raid0 array with different size disks using original layout
2. mkfs
3. mount -o discard
4. create lots of files
5. remove 1/2 the files
6. fstrim -a (or just the mount point for the raid0 array)
7. umount
8. fsck -fn /dev/md0 (spews all sorts of corruptions)
Let's fix this by adding proper discard support to the 'original' layout.
The fix 'maps' the 'original' layout disks to the order in which they are
read/written such that we can compare the disks in the same way that the
current 'alternate' layout does. A 'disk_shift' field is added to
'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard() but by adding this field, we save some computation
in the discard path.
Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always read/writing them using
the 'alternate' layout. However, that is seen as a more substantial change,
and we are attempting the least invasive fix at this time to remedy the
corruption.
I've verified the change using the reproducer mentioned above. Typically,
the corruption is seen after less than 3 iterations, while the patch has
run 500+ iterations.
Cc: NeilBrown <neilb@suse.de>
Cc: Song Liu <song@kernel.org>
Fixes: c84a1372df ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
- Fix DM crypt target's crypt_ctr_cipher_new return value on invalid
AEAD cipher.
- Fix DM flakey testing target's write bio corruption feature to
corrupt the data of a cloned bio instead of the original.
- Add random_read_corrupt and random_write_corrupt features to DM
flakey target.
- Fix ABBA deadlock in DM thin metadata by resetting associated bufio
client rather than destroying and recreating it.
- A couple other small DM thinp cleanups.
- Update DM core to support disabling block core IO stats accounting
and optimize away code that isn't needed if stats are disabled.
- Other small DM core cleanups.
- Improve DM integrity target to not require so much memory on 32 bit
systems. Also only allocate the recalculate buffer as needed (and
increasingly reduce its size on allocation failure).
- Update DM integrity to use %*ph for printing hexdump of a small
buffer. Also update DM integrity documentation.
- Various DM core ioctl interface hardening. Now more careful about
alignment of structures and processing of input passed to the kernel
from userspace. Also disallow the creation of DM devices named
"control", "." or ".."
- Eliminate GFP_NOIO workarounds for __vmalloc and kvmalloc in DM
core's ioctl and bufio code.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmScyjAACgkQxSPxCi2d
A1pilwgAucNIAB6uN4ke4WZrMVxSFntUkqDTkCs2Ycw+W4Tf1Mtrj/4WeBzFdJaA
oZMK04LUGaFFXn+halsCDzB354yT9C7V/KfXW8pCM1c9BRz4e8272i2HSN4WwD5n
BU4gVaOV5BwxynfF3Z5siRraad1AwmdoRGGsqzVRESAKaObXU//1tnO42UhxRVhn
nzFqhIm0xRcLAd8xIBlMsZQGIloicdDP9wZdWzTEDspQiwR2dFRmH9bUF8OmsS+h
KwhtDty7aZO+4gJ1ccBImijzQCmbAo7dmFhDfoLXaA5Jt6UwTXMeBHm4aUPMnvQe
NVXoRZJodDemwM642Q/Tx1SpsX6QmA==
=4R7u
-----END PGP SIGNATURE-----
Merge tag 'for-6.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Update DM crypt to allocate compound pages if possible
- Fix DM crypt target's crypt_ctr_cipher_new return value on invalid
AEAD cipher
- Fix DM flakey testing target's write bio corruption feature to
corrupt the data of a cloned bio instead of the original
- Add random_read_corrupt and random_write_corrupt features to DM
flakey target
- Fix ABBA deadlock in DM thin metadata by resetting associated bufio
client rather than destroying and recreating it
- A couple other small DM thinp cleanups
- Update DM core to support disabling block core IO stats accounting
and optimize away code that isn't needed if stats are disabled
- Other small DM core cleanups
- Improve DM integrity target to not require so much memory on 32 bit
systems. Also only allocate the recalculate buffer as needed (and
increasingly reduce its size on allocation failure)
- Update DM integrity to use %*ph for printing hexdump of a small
buffer. Also update DM integrity documentation
- Various DM core ioctl interface hardening. Now more careful about
alignment of structures and processing of input passed to the kernel
from userspace.
Also disallow the creation of DM devices named "control", "." or ".."
- Eliminate GFP_NOIO workarounds for __vmalloc and kvmalloc in DM
core's ioctl and bufio code
* tag 'for-6.5/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (28 commits)
dm: get rid of GFP_NOIO workarounds for __vmalloc and kvmalloc
dm integrity: scale down the recalculate buffer if memory allocation fails
dm integrity: only allocate recalculate buffer when needed
dm integrity: reduce vmalloc space footprint on 32-bit architectures
dm ioctl: Refuse to create device named "." or ".."
dm ioctl: Refuse to create device named "control"
dm ioctl: Avoid double-fetch of version
dm ioctl: structs and parameter strings must not overlap
dm ioctl: Avoid pointer arithmetic overflow
dm ioctl: Check dm_target_spec is sufficiently aligned
Documentation: dm-integrity: Document an example of how the tunables relate.
Documentation: dm-integrity: Document default values.
Documentation: dm-integrity: Document the meaning of "buffer".
Documentation: dm-integrity: Fix minor grammatical error.
dm integrity: Use %*ph for printing hexdump of a small buffer
dm thin: disable discards for thin-pool if no_discard_passdown
dm: remove stale/redundant dm_internal_{suspend,resume} prototypes in dm.h
dm: skip dm-stats work in alloc_io() unless needed
dm: avoid needless dm_io access if all IO accounting is disabled
dm: support turning off block-core's io stats accounting
...