Commit Graph

1019 Commits

Author SHA1 Message Date
Christoph Hellwig c13f0fbc4c nvme: don't call revalidate_disk from nvme_set_queue_dying
In nvme_set_queue_dying we really just want to ensure the disk and bdev
sizes are set to zero.  Going through revalidate_disk leads to a somewhat
arcance and complex callchain relying on special behavior in a few
places.  Instead just lift the set_capacity directly to
nvme_set_queue_dying, and rename and move the nvme_mpath_update_disk_size
helper so that we can use it in nvme_set_queue_dying to propagate the
size to the bdev without detours.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Jens Axboe a98278ecfb Merge branch 'block-5.9' into for-5.10/block
* block-5.9:
  blk-stat: make q->stats->lock irqsafe
  blk-iocost: ioc_pd_free() shouldn't assume irq disabled
  block: fix locking in bdev_del_partition
  block: release disk reference in hd_struct_free_work
  block: ensure bdi->io_pages is always initialized
  nvme-pci: cancel nvme device request before disabling
  nvme: only use power of two io boundaries
  nvme: fix controller instance leak
  nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
  nvme: Fix NULL dereference for pci nvme controllers
  nvme-rdma: fix reset hang if controller died in the middle of a reset
  nvme-rdma: fix timeout handler
  nvme-rdma: serialize controller teardown sequences
  nvme-tcp: fix reset hang if controller died in the middle of a reset
  nvme-tcp: fix timeout handler
  nvme-tcp: serialize controller teardown sequences
  nvme: have nvme_wait_freeze_timeout return if it timed out
  nvme-fabrics: don't check state NVME_CTRL_NEW for request acceptance
  nvmet-tcp: Fix NULL dereference when a connect data comes in h2cdata pdu
2020-09-01 16:49:20 -06:00
Keith Busch e83d776f9f nvme: only use power of two io boundaries
The kernel requires a power of two for boundaries because that's the
only way it can efficiently split commands that cross them. A
controller, however, may report a non-power of two boundary.

The driver had been rounding the controller's value to one the kernel
can use, but splitting on the wrong boundary provides no benefit on the
device side, and incurs additional submission overhead from non-optimal
splits.

Don't provide any boundary hint if the controller's value can't be used
and log a warning when first scanning a disk's unreported IO boundary.
Since the chunk sector logic has grown, move it to a separate function.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 16:43:57 -07:00
Keith Busch 192f6c29bb nvme: fix controller instance leak
If the driver has to unbind from the controller for an early failure
before the subsystem has been set up, there won't be a subsystem holding
the controller's instance, so the controller needs to free its own
instance in this case.

Fixes: 733e4b69d5 ("nvme: Assign subsys instance from first ctrl")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 16:43:57 -07:00
Sagi Grimberg 7cd49f7576 nvme: Fix NULL dereference for pci nvme controllers
PCIe controllers do not have fabric opts, verify they exist before
showing ctrl_loss_tmo or reconnect_delay attributes.

Fixes: 764075fdcb ("nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs")
Reported-by: Tobias Markus <tobias@markus-regensburg.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 16:43:57 -07:00
Sagi Grimberg 7cf0d7c0f3 nvme: have nvme_wait_freeze_timeout return if it timed out
Users can detect if the wait has completed or not and take appropriate
actions based on this information (e.g. weather to continue
initialization or rather fail and schedule another initialization
attempt).

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2020-08-28 16:43:56 -07:00
Linus Torvalds c41c3ec4a2 io_uring-5.9-2020-08-23
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl9CwtMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsehEAC4ReB53LLbZxqcmoA2RNs9yz1I4DM2PU6z
 C+NSGGEnAFHQAhLbfCAzxbtQa6x/m64zoLd+8zHZNAeanJXarszcgSuqhXQFlEfX
 7Jz/vdXGdu7Q4zgkLuO3FxleDoPoUC5qOSFHWYtMu6KvHLOkmc9DvdSUsFMDSThX
 6RsoaQY2gDOD/pwtm8Cqmy89nLZdFoyxadXyk/lzxLodjeRZOwoVc+YM8YWmrXZ0
 mKEEuO4uBWxUUmoyAwUABNqWWAkwTDEhrYCiiG81DkAa1Cu0mRXodN0xycr72cLZ
 Ik2OlnTLCE6B0UXsBu2c0+qXGArWsvDyhEEkwF+O+Ump4IBIr72EmgZb+o2nnkXo
 Uu4X/r0qeQ6XD+vBTHcE6oPUjJhV6uEXXon5aesE+vh277ILmHgMyjJKaSiJcY/E
 efM5SuPRq2kuROKWLKiLJnpuJ/9ZTU/4nk4k1pOlWWOVGLHien0sWBBzQ+iWr6mm
 eRl5EkI3JoahqIrNFz0+qF3DwKPVfu+B02/EzA8OXoYHIRV9KMS5eWX5hK12aZ3i
 4AT3xuAanfcNs4qBAScOfHQxQu9U5Z7Mu4JQJ58xdsJd+UWBnbznUmSLob9KKk+c
 X8AvAcYhb684F87VCmaCzDlIPMb46OYxLBgI6sz7L0xdc7i8TCeeEDbQCN1HixZ3
 SNtKzalNXA==
 =fAwK
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Sagi:
       - nvme completion rework from Christoph and Chao that mostly came
         from a bit of divergence of how we classify errors related to
         pathing/retry etc.
       - nvmet passthru fixes from Chaitanya
       - minor nvmet fixes from Amit and I
       - mpath round-robin path selection fix from Martin
       - ignore noiob for zoned devices from Keith
       - minor nvme-fc fix from Tianjia"

 - BFQ cgroup leak fix (Dmitry)

 - block layer MAINTAINERS addition (Geert)

 - fix null_blk FUA checking (Hou)

 - get_max_io_size() size fix (Keith)

 - fix block page_is_mergeable() for compound pages (Matthew)

 - discard granularity fixes (Ming)

 - IO scheduler ordering fix (Ming)

 - misc fixes

* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
  null_blk: fix passing of REQ_FUA flag in null_handle_rq
  nvmet: Disable keep-alive timer when kato is cleared to 0h
  nvme: redirect commands on dying queue
  nvme: just check the status code type in nvme_is_path_error
  nvme: refactor command completion
  nvme: rename and document nvme_end_request
  nvme: skip noiob for zoned devices
  nvme-pci: fix PRP pool size
  nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
  nvme: Use spin_lock_irq() when taking the ctrl->lock
  nvmet: call blk_mq_free_request() directly
  nvmet: fix oops in pt cmd execution
  nvmet: add ns tear down label for pt-cmd handling
  nvme: multipath: round-robin: eliminate "fallback" variable
  nvme: multipath: round-robin: fix single non-optimized path case
  nvme-fc: Fix wrong return value in __nvme_fc_init_request()
  nvmet-passthru: Reject commands with non-sgl flags set
  nvmet: fix a memory leak
  blkcg: fix memleak for iolatency
  MAINTAINERS: Add missing header files to BLOCK LAYER section
  ...
2020-08-24 11:53:15 -07:00
Gustavo A. R. Silva df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Chao Leng 5eac5f3342 nvme: redirect commands on dying queue
If a command send through nvme-multipath failed on a dying queue, resend it
on another path.

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: rebased on top of the completion refactoring]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:14:28 -06:00
Christoph Hellwig 5ddaabe8ed nvme: refactor command completion
Lift all the code to decide the dispostition of a completed command
from nvme_complete_rq and nvme_failover_req into a new helper, which
returns an emum of the potential actions.  nvme_complete_rq then
just switches on those and calls the proper helper for the action.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:14:28 -06:00
Keith Busch c41ad98beb nvme: skip noiob for zoned devices
Zoned block devices reuse the chunk_sectors queue limit to define zone
boundaries. If a such a device happens to also report an optimal
boundary, do not use that to define the chunk_sectors as that may
intermittently interfere with io splitting and zone size queries.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:14:27 -06:00
Logan Gunthorpe ecbcdf0c81 nvme: Use spin_lock_irq() when taking the ctrl->lock
When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
dead lock. The new spin_lock() calls recently added produce the
following lockdep warning when running the blktest nvme/003:

    ================================
    WARNING: inconsistent lock state
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
    ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
    {SOFTIRQ-ON-W} state was registered at:
      lock_acquire+0x164/0x500
      _raw_spin_lock+0x28/0x40
      nvme_get_effects_log+0x37/0x1c0
      nvme_init_identify+0x9e4/0x14f0
      nvme_reset_work+0xadd/0x2360
      process_one_work+0x66b/0xb70
      worker_thread+0x6e/0x6c0
      kthread+0x1e7/0x210
      ret_from_fork+0x22/0x30
    irq event stamp: 1449221
    hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
    hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
    softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
    softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&ctrl->lock);
      <Interrupt>
        lock(&ctrl->lock);

     *** DEADLOCK ***

    no locks held by ksoftirqd/2/22.

    stack backtrace:
    CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
    Call Trace:
     dump_stack+0xc8/0x11a
     print_usage_bug.cold.63+0x235/0x23e
     mark_lock+0xa9c/0xcf0
     __lock_acquire+0xd9a/0x2b50
     lock_acquire+0x164/0x500
     _raw_spin_lock_irqsave+0x40/0x60
     nvme_keep_alive_end_io+0x50/0xc0
     blk_mq_end_request+0x158/0x210
     nvme_complete_rq+0x146/0x500
     nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
     blk_done_softirq+0x187/0x1e0
     __do_softirq+0x118/0x595
     run_ksoftirqd+0x35/0x50
     smpboot_thread_fn+0x1d3/0x310
     kthread+0x1e7/0x210
     ret_from_fork+0x22/0x30

Fixes: be93e87e78 ("nvme: support for multiple Command Sets Supported and Effects log pages")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-21 17:14:27 -06:00
Linus Torvalds 060a72a268 for-5.9/block-merge-20200804
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8pjAEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphuHEAC5hNAi99HLktAQ7qZy4cBqnGNKCrguFszq
 Kxiecp3Nrb9EnAPNWYG+QMO0kD9z8quML85beBaJNxN1PlOk9pawqFAd4ziFncFI
 ruZwIMP+oH/0OmPUA2a4ymrqu+rpyFvfsDL2RKJ9dirAt9fuv9W0RZM5g7Oz83Xi
 cNdPRn0tOhK0DTPxL4M1/NR2OutSgvKDfA5Et3IrDFl7+bJAEFqmSO8wOSdZtvFp
 KcR4O/DXnr5Wl6cPvzlvooQze8vGGJkXAyIKaC9cuBm/nlzMCBGG8kE0v3kRJ8Sc
 uSSFkC+P+OlktY4JwXN+mCacDUdVBiiL/uUs1zel6HmociBgh67mgyJ6AfQtGZry
 yVl9mj44qWZjAzCODv5KnuxlH+gBacdmjcQqwxsZ2P477gfNkxmBXgHeWdfzO9A/
 zTUXaBDXg3VdYxQfD8zTWPkCwXYp+YG3SRb9pfrIWIiYuz2UECZTvl/8Upnacz2B
 POTf+6vcNDlILCtboVE0mKEYR0ckxqrbs0NQloQdmVOfXNyhLml9OrXmwJIffVtE
 pZ9g428c5bm44lIOiB2eW+QPsXo0s8GxqIrMtxzKsJ3WgFefwLiVDLJBqEt78jRJ
 RvpGUxrMLgWFubowH8yDmWV+Fp0NpqcqF+GU45z8nGC3OTS+i0ZvUFYgLM6a2uOf
 sv4bzDPDBg==
 =uMth
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block

Pull block stacking updates from Jens Axboe:
 "The stacking related fixes depended on both the core block and drivers
  branches, so here's a topic branch with that change.

  Outside of that, a late fix from Johannes for zone revalidation"

* tag 'for-5.9/block-merge-20200804' of git://git.kernel.dk/linux-block:
  block: don't do revalidate zones on invalid devices
  block: remove blk_queue_stack_limits
  block: remove bdev_stack_limits
  block: inherit the zoned characteristics in blk_stack_limits
2020-08-05 11:12:34 -07:00
Linus Torvalds e0fc99e21e for-5.9/drivers-20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8od3oQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppkpD/9D+XqD9qYcYTj+ShVCc5+3RtMG5ZiAAX0y
 l4QXomentn/1Y0UYXFGJH7JLZWrKYT0QiktLtfpe5pmTqRUkckTIyJQlsHb+K6Dz
 lFjtywRK9pcFYgiWIUg80wlJKrTa8QdnrlS/Esn4YITKGRbgMIdFvq2jymXC+1ho
 RgodlgzcBUREgHSLo0H3cqEKA53fQiJhKC6CbFrFdrkpf2yUpcTfEDtpSwuIuPj3
 2AUed1qXUtNjdHciCn3N37OuHqXKAA9noXAWfg9Gx/5zfGUNX9QJvlsny1AopgS0
 jJvPSDVAhu/qRLHW6q/ZOT0JAlHegguuTAOtgMh2cMpAS5sumCAtltxVcI7Qnx41
 HalMpTefXsVoBo0gfjqldnIPt34ZNj5aH5GYaH/wPpSg6VkTVBJK8GuQDBvg27qT
 w+U/T6EzuqniWXh/P3COhfrMCR9ueUOY1qWCRwzomlpeIfBhCzidt2wUqIxX1TOA
 Q0Ltf0eERDevsZbE+tIm+VAAg98kHehcS2t8lfFYFO6/PKu2iJpJt/HtJbZNBE+W
 rm96E4qXRiy1UuL7D9vBkaWsbnosuNHgGQXx57GlokQU+2IGBmOxV52XHiSxxpXd
 AS1ZTd56ItmID8VaU09Pbf7ZFbiCgdEAxIbUFzaCuvo+lxryHFphIUARNi/zPnNT
 UC2OzunCqA==
 =oADH
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - NVMe:
      - ZNS support (Aravind, Keith, Matias, Niklas)
      - Misc cleanups, optimizations, fixes (Baolin, Chaitanya, David,
        Dongli, Max, Sagi)

 - null_blk zone capacity support (Aravind)

 - MD:
      - raid5/6 fixes (ChangSyun)
      - Warning fixes (Damien)
      - raid5 stripe fixes (Guoqing, Song, Yufen)
      - sysfs deadlock fix (Junxiao)
      - raid10 deadlock fix (Vitaly)

 - struct_size conversions (Gustavo)

 - Set of bcache updates/fixes (Coly)

* tag 'for-5.9/drivers-20200803' of git://git.kernel.dk/linux-block: (117 commits)
  md/raid5: Allow degraded raid6 to do rmw
  md/raid5: Fix Force reconstruct-write io stuck in degraded raid5
  raid5: don't duplicate code for different paths in handle_stripe
  raid5-cache: hold spinlock instead of mutex in r5c_journal_mode_show
  md: print errno in super_written
  md/raid5: remove the redundant setting of STRIPE_HANDLE
  md: register new md sysfs file 'uuid' read-only
  md: fix max sectors calculation for super 1.0
  nvme-loop: remove extra variable in create ctrl
  nvme-loop: set ctrl state connecting after init
  nvme-multipath: do not fall back to __nvme_find_path() for non-optimized paths
  nvme-multipath: fix logic for non-optimized paths
  nvme-rdma: fix controller reset hang during traffic
  nvme-tcp: fix controller reset hang during traffic
  nvmet: introduce the passthru Kconfig option
  nvmet: introduce the passthru configfs interface
  nvmet: Add passthru enable/disable helpers
  nvmet: add passthru code to process commands
  nvme: export nvme_find_get_ns() and nvme_put_ns()
  nvme: introduce nvme_ctrl_get_by_path()
  ...
2020-08-05 10:51:40 -07:00
Linus Torvalds 382625d0d4 for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
 yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
 FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
 x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
 b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
 8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
 SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
 yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
 GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
 ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
 +BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
 RzAUdZFbWw==
 =abJG
 -----END PGP SIGNATURE-----

Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:
 "Good amount of cleanups and tech debt removals in here, and as a
  result, the diffstat shows a nice net reduction in code.

   - Softirq completion cleanups (Christoph)

   - Stop using ->queuedata (Christoph)

   - Cleanup bd claiming (Christoph)

   - Use check_events, moving away from the legacy media change
     (Christoph)

   - Use inode i_blkbits consistently (Christoph)

   - Remove old unused writeback congestion bits (Christoph)

   - Cleanup/unify submission path (Christoph)

   - Use bio_uninit consistently, instead of bio_disassociate_blkg
     (Christoph)

   - sbitmap cleared bits handling (John)

   - Request merging blktrace event addition (Jan)

   - sysfs add/remove race fixes (Luis)

   - blk-mq tag fixes/optimizations (Ming)

   - Duplicate words in comments (Randy)

   - Flush deferral cleanup (Yufen)

   - IO context locking/retry fixes (John)

   - struct_size() usage (Gustavo)

   - blk-iocost fixes (Chengming)

   - blk-cgroup IO stats fixes (Boris)

   - Various little fixes"

* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
  block: blk-timeout: delete duplicated word
  block: blk-mq-sched: delete duplicated word
  block: blk-mq: delete duplicated word
  block: genhd: delete duplicated words
  block: elevator: delete duplicated word and fix typos
  block: bio: delete duplicated words
  block: bfq-iosched: fix duplicated word
  iocost_monitor: start from the oldest usage index
  iocost: Fix check condition of iocg abs_vdebt
  block: Remove callback typedefs for blk_mq_ops
  block: Use non _rcu version of list functions for tag_set_list
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  ...
2020-08-03 11:57:03 -07:00
Christoph Hellwig 5bedd3afee nvme: add a Identify Namespace Identification Descriptor list quirk
Add a quirk for a device that does not support the Identify Namespace
Identification Descriptor list despite claiming 1.3 compliance.

Fixes: ea43d9709f ("nvme: fix identify error status silent ignore")
Reported-by: Ingo Brunberg <ingo_brunberg@web.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ingo Brunberg <ingo_brunberg@web.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2020-07-29 08:05:44 +02:00
Logan Gunthorpe 24493b8b85 nvme: export nvme_find_get_ns() and nvme_put_ns()
nvme_find_get_ns() and nvme_put_ns() are required by the target passthru
code and are exported under the NVME_TARGET_PASSTHRU namespace.

Based-on-a-patch-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:21 +02:00
Logan Gunthorpe f783f444ce nvme: introduce nvme_ctrl_get_by_path()
nvme_ctrl_get_by_path() is analogous to blkdev_get_by_path() except it
gets a struct nvme_ctrl from the path to its char dev (/dev/nvme0).
It makes use of filp_open() to open the file and uses the private
data to obtain a pointer to the struct nvme_ctrl. If the fops of the
file do not match, -EINVAL is returned.

The purpose of this function is to support NVMe-OF target passthru
and is exported under the NVME_TARGET_PASSTHRU namespace.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:21 +02:00
Logan Gunthorpe 17365ae697 nvme: introduce nvme_execute_passthru_rq to call nvme_passthru_[start|end]()
Introduce a new nvme_execute_passthru_rq() helper which calls
nvme_passthru_[start|end]() around blk_execute_rq(). This ensures
all passthru calls (including nvme_submit_io()) will be wrapped
appropriately.

nvme_execute_passthru_rq() will also be useful for the nvmet passthru
code and is exported in the NVME_TARGET_PASSTHRU namespace.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:21 +02:00
Logan Gunthorpe df21b6b193 nvme: create helper function to obtain command effects
Separate the code to obtain command effects from the code
to start a passthru request and move the nvme_passthru_start() and
nvme_passthru_end() functions up above nvme_submit_user_cmd() in order
that they may be used in a new helper a subsequent patch.

The new helper function will be necessary for nvmet passthru
code to determine if we need to change out of interrupt context
to handle the effects. It is exported in the NVME_TARGET_PASSTHRU
namespace.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:20 +02:00
Logan Gunthorpe 2bf5d3bbff nvme: clear any SGL flags in passthru commands
The host driver should decide whether to use SGLs or PRPs and they
currently assume the flags are cleared after the call to
nvme_setup_cmd(). However, passed-through commands may erroneously
set these bits; so clear them for all cases.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:20 +02:00
Sagi Grimberg ecca390e80 nvme: fix deadlock in disconnect during scan_work and/or ana_work
A deadlock happens in the following scenario with multipath:
1) scan_work(nvme0) detects a new nsid while nvme0
    is an optimized path to it, path nvme1 happens to be
    inaccessible.

2) Before scan_work is complete nvme0 disconnect is initiated
    nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING

3) scan_work(1) attempts to submit IO,
    but nvme_path_is_optimized() observes nvme0 is not LIVE.
    Since nvme1 is a possible path IO is requeued and scan_work hangs.

--
Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  efi_partition+0x1e6/0x708
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
--

4) Delete also hangs in flush_work(ctrl->scan_work)
    from nvme_remove_namespaces().

Similiarly a deadlock with ana_work may happen: if ana_work has started
and calls nvme_mpath_set_live and device_add_disk, it will
trigger I/O. When we trigger disconnect I/O will block because
our accessible (optimized) path is disconnecting, but the alternate
path is inaccessible, so I/O blocks. Then disconnect tries to flush
the ana_work and hangs.

[  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
[  605.552087] Call Trace:
[  605.552683]  __schedule+0x2b9/0x6c0
[  605.553507]  schedule+0x42/0xb0
[  605.554201]  io_schedule+0x16/0x40
[  605.555012]  do_read_cache_page+0x438/0x830
[  605.556925]  read_cache_page+0x12/0x20
[  605.557757]  read_dev_sector+0x27/0xc0
[  605.558587]  amiga_partition+0x4d/0x4c5
[  605.561278]  check_partition+0x154/0x244
[  605.562138]  rescan_partitions+0xae/0x280
[  605.563076]  __blkdev_get+0x40f/0x560
[  605.563830]  blkdev_get+0x3d/0x140
[  605.564500]  __device_add_disk+0x388/0x480
[  605.565316]  device_add_disk+0x13/0x20
[  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
[  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
[  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
[  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
[  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
[  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
[  605.573330]  process_one_work+0x1db/0x380
[  605.574144]  worker_thread+0x4d/0x400
[  605.574896]  kthread+0x104/0x140
[  605.577205]  ret_from_fork+0x35/0x40
[  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
[  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
[  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  605.582320] nvme            D    0 14044  14043 0x00000000
[  605.583424] Call Trace:
[  605.583935]  __schedule+0x2b9/0x6c0
[  605.584625]  schedule+0x42/0xb0
[  605.585290]  schedule_timeout+0x203/0x2f0
[  605.588493]  wait_for_completion+0xb1/0x120
[  605.590066]  __flush_work+0x123/0x1d0
[  605.591758]  __cancel_work_timer+0x10e/0x190
[  605.593542]  cancel_work_sync+0x10/0x20
[  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
[  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
[  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
[  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
[  605.598320]  dev_attr_store+0x17/0x30

Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
indicate the phase of controller deletion where I/O cannot be allowed
to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
be issued to the bottom device, and only after we flush the ana_work
and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
from re-firing by aborting early if we are not LIVE, so we should be safe
here.

In addition, change the transport drivers to follow the updated state
machine.

Fixes: 0d0b660f21 ("nvme: add ANA support")
Reported-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:19 +02:00
Chaitanya Kulkarni 6c3c05b087 nvme-core: replace ctrl page size with a macro
Saving the nvme controller's page size was from a time when the driver
tried to use different sized pages, but this value is always set to
a constant, and has been this way for some time. Remove the 'page_size'
field and replace its usage with the constant value.

This also lets the compiler make some micro-optimizations in the io
path, and that's always a good thing.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:18 +02:00
Baolin Wang 5887450b69 nvme: remove redundant validation in nvme_start_ctrl()
We've already validated the 'kato' in nvme_start_keep_alive(), thus no
need to validate it again in nvme_start_ctrl(). Remove it.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:18 +02:00
Dan Carpenter eca9e8271e nvme: remove an unnecessary condition
"v" is an unsigned int so it can't be more than UINT_MAX.  Removing this
check makes it easier to preserve the error code as well.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-29 07:45:18 +02:00
Christoph Hellwig b9b1a5d715 block: remove blk_queue_stack_limits
This function is just a tiny wrapper around blk_stack_limits.  Open code
it int the two callers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-20 15:38:52 -06:00
Jens Axboe 4f43d64807 Merge branch 'for-5.9/drivers' into for-5.9/block-merge
* for-5.9/drivers: (38 commits)
  block: add max_active_zones to blk-sysfs
  block: add max_open_zones to blk-sysfs
  s390/dasd: Use struct_size() helper
  s390/dasd: fix inability to use DASD with DIAG driver
  md-cluster: fix wild pointer of unlock_all_bitmaps()
  md/raid5-cache: clear MD_SB_CHANGE_PENDING before flushing stripes
  md: fix deadlock causing by sysfs_notify
  md: improve io stats accounting
  md: raid0/linear: fix dereference before null check on pointer mddev
  rsxx: switch from 'pci_free_consistent()' to 'dma_free_coherent()'
  nvme: remove ns->disk checks
  nvme-pci: use standard block status symbolic names
  nvme-pci: use the consistent return type of nvme_pci_iod_alloc_size()
  nvme-pci: add a blank line after declarations
  nvme-pci: fix some comments issues
  nvme-pci: remove redundant segment validation
  nvme: document quirked Intel models
  nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
  nvme: support for zoned namespaces
  nvme: support for multiple Command Sets Supported and Effects log pages
  ...
2020-07-20 15:38:27 -06:00
Jens Axboe 9caaa66c91 Merge branch 'for-5.9/block' into for-5.9/block-merge
* for-5.9/block: (124 commits)
  blk-cgroup: show global disk stats in root cgroup io.stat
  blk-cgroup: make iostat functions visible to stat printing
  block: improve discard bio alignment in __blkdev_issue_discard()
  block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
  block: defer flush request no matter whether we have elevator
  block: make blk_timeout_init() static
  block: remove retry loop in ioc_release_fn()
  block: remove unnecessary ioc nested locking
  block: integrate bd_start_claiming into __blkdev_get
  block: use bd_prepare_to_claim directly in the loop driver
  block: refactor bd_start_claiming
  block: simplify the restart case in __blkdev_get
  Revert "blk-rq-qos: remove redundant finish_wait to rq_qos_wait."
  block: always remove partitions from blk_drop_partitions()
  block: relax jiffies rounding for timeouts
  blk-mq: remove redundant validation in __blk_mq_end_request()
  blk-mq: Remove unnecessary local variable
  writeback: remove bdi->congested_fn
  writeback: remove struct bdi_writeback_congested
  writeback: remove {set,clear}_wb_congested
  ...
2020-07-20 15:38:23 -06:00
Anthony Iliopoulos 05b29021fb nvme: explicitly update mpath disk capacity on revalidation
Commit 3b4b19721e ("nvme: fix possible deadlock when I/O is
blocked") reverted multipath head disk revalidation due to deadlocks
caused by holding the bd_mutex during revalidate.

Updating the multipath disk blockdev size is still required though for
userspace to be able to observe any resizing while the device is
mounted. Directly update the bdev inode size to avoid unnecessarily
holding the bdev->bd_mutex.

Fixes: 3b4b19721e ("nvme: fix possible deadlock when I/O is
blocked")

Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-16 16:40:27 +02:00
Christoph Hellwig 3913f4f3a6 nvme: remove ns->disk checks
By the time a namespace is added to ctrl->namespaces list and thus
can be looked up ns->disk has been assigned, and it it never cleared.

Remove all the checks for ns->disk being NULL.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2020-07-08 19:15:20 +02:00
Sagi Grimberg 764075fdcb nvme: expose reconnect_delay and ctrl_loss_tmo via sysfs
This is useful information, and moreover its it's useful to
be able to alter these parameters per controller after it
has been established.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:20 +02:00
Keith Busch 240e6ee272 nvme: support for zoned namespaces
Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined
in NVM Express TP4053. Zoned namespaces are discovered based on their
Command Set Identifier reported in the namespaces Namespace
Identification Descriptor list. A successfully discovered Zoned
Namespace will be registered with the block layer as a host managed
zoned block device with Zone Append command support. A namespace that
does not support append is not supported by the driver.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Ajay Joshi <ajay.joshi@wdc.com>
Signed-off-by: Aravind Ramesh <aravind.ramesh@wdc.com>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <keith.busch@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:20 +02:00
Keith Busch be93e87e78 nvme: support for multiple Command Sets Supported and Effects log pages
The Commands Supported and Effects log page was extended with a CSI
field that enables the host to query the log page for each command set
supported. Retrieve this log page for each command set that an attached
namespace supports, and save a pointer to that log in the namespace head.

Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Keith Busch <keith.busch@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:20 +02:00
Niklas Cassel 71010c3094 nvme: implement multiple I/O Command Set support
Implements support for multiple I/O Command Sets.  NVMe TP 4056
introduces a method to enumerate multiple command sets per namespace. If
the command set is exposed, this method for enumeration will be used
instead of the traditional method that uses the CC.CSS register command
set register for command set identification.

For namespaces where the Command Set Identifier is not supported or
recognized, the specific namespace will not be created.

Reviewed-by: Javier González <javier.gonz@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Matias Bjørling <matias.bjorling@wdc.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:19 +02:00
Baolin Wang f5af577d55 nvme: use USEC_PER_SEC instead of magic numbers
Use USEC_PER_SEC instead of magic numbers to make code more readable.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:19 +02:00
Chaitanya Kulkarni d4047cf994 nvme-core: use u16 type for ctrl->sqsize
In nvme_init_identify() when calculating submission queue size use u16
instead of int type in the min_t() since target variable ctrl->sqsize is
of type u16.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:15 +02:00
Chaitanya Kulkarni a87835e9ec nvme-core: use u16 type for directives
In nvme_configure_directives() when calculating number of streams use
u16 instead of unsigned type in the min_t() since target variable
ctrl->nr_streams is of type u16.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-08 16:16:15 +02:00
Jens Axboe 482c6b614a Linux 5.8-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl8CYDYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGcQkH/2vOsPf79yWtsc7x
 hd2LpCPfrm7T1xlQcYcXbEbyRI8sqPmguixO8pRI1ePl2lBZ7KurfyeYgYZNGpFU
 t74Ph6A6dSWoCgO68Genm/SQuK8ic6o9n1Vr8tDsGDp5KlHWNaweq4JwHrsPmO1T
 cI0PR/ClAhLG8cQZ4x988Es5HTNGY17XK27e+M/zKYxSMGY2NRdJBGQIq964i5Q8
 2d9G0rtVCaVDzgjrLwaFm6RBu21Il7HV6KsBsacyTFiL1ywx2vnUHzeZQyvuJSOQ
 4YpLo9v4tBP10WHC50LRStZyO0qRwPVd/Yl7fL4R/CKsJT9H4uiwasVoEBVSL/k6
 CUn3JL0=
 =P/Vx
 -----END PGP SIGNATURE-----

Merge tag 'v5.8-rc4' into for-5.9/drivers

Merge in 5.8-rc4 for-5.9/block to setup for-5.9/drivers, to provide
a clean base and making the life for the NVMe changes easier.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v5.8-rc4': (732 commits)
  Linux 5.8-rc4
  x86/ldt: use "pr_info_once()" instead of open-coding it badly
  MIPS: Do not use smp_processor_id() in preemptible code
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence for DSPen
  .gitignore: Do not track `defconfig` from `make savedefconfig`
  io_uring: fix regression with always ignoring signals in io_cqring_wait()
  x86/ldt: Disable 16-bit segments on Xen PV
  x86/entry/32: Fix #MC and #DB wiring on x86_32
  x86/entry/xen: Route #DB correctly on Xen PV
  x86/entry, selftests: Further improve user entry sanity checks
  x86/entry/compat: Clear RAX high bits on Xen PV SYSENTER
  i2c: mlxcpld: check correct size of maximum RECV_LEN packet
  i2c: add Kconfig help text for slave mode
  i2c: slave-eeprom: update documentation
  i2c: eg20t: Load module automatically if ID matches
  i2c: designware: platdrv: Set class based on DMI
  i2c: algo-pca: Add 0x78 as SCL stuck low status for PCA9665
  mm/page_alloc: fix documentation error
  vmalloc: fix the owner argument for the new __vmalloc_node_range callers
  mm/cma.c: use exact_nid true to fix possible per-numa cma leak
  ...
2020-07-08 08:02:13 -06:00
Sagi Grimberg ea43d9709f nvme: fix identify error status silent ignore
Commit 59c7c3caaa intended to only silently ignore non retry-able
errors (DNR bit set) such that we can still identify misbehaving
controllers, and in the other hand propagate retry-able errors (DNR bit
cleared) so we don't wrongly abandon a namespace just because it happens
to be temporarily inaccessible.

The goal remains the same as the original commit where this was
introduced but unfortunately had the logic backwards.

Fixes: 59c7c3caaa ("nvme: fix possible hang when ns scanning fails during error recovery")
Reported-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-07-02 10:38:00 +02:00
Christoph Hellwig c62b37d96b block: move ->make_request_fn to struct block_device_operations
The make_request_fn is a little weird in that it sits directly in
struct request_queue instead of an operation vector.  Replace it with
a block_device_operations method called submit_bio (which describes much
better what it does).  Also remove the request_queue argument to it, as
the queue can be derived pretty trivially from the bio.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-01 07:27:24 -06:00
Sagi Grimberg 3b4b19721e nvme: fix possible deadlock when I/O is blocked
Revert fab7772bfb ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")

When adding a new namespace to the head disk (via nvme_mpath_set_live)
we will see partition scan which triggers I/O on the mpath device node.
This process will usually be triggered from the scan_work which holds
the scan_lock. If I/O blocks (if we got ana change currently have only
available paths but none are accessible) this can deadlock on the head
disk bd_mutex as both partition scan I/O takes it, and head disk revalidation
takes it to check for resize (also triggered from scan_work on a different
path). See trace [1].

The mpath disk revalidation was originally added to detect online disk
size change, but this is no longer needed since commit cb224c3af4
("nvme: Convert to use set_capacity_revalidate_and_notify") which already
updates resize info without unnecessarily revalidating the disk (the
mpath disk doesn't even implement .revalidate_disk fop).

[1]:
--
kernel: INFO: task kworker/u65:9:494 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:9   D    0   494      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  schedule_preempt_disabled+0xe/0x10
kernel:  __mutex_lock.isra.0+0x182/0x4f0
kernel:  __mutex_lock_slowpath+0x13/0x20
kernel:  mutex_lock+0x2e/0x40
kernel:  revalidate_disk+0x63/0xa0
kernel:  __nvme_revalidate_disk+0xfe/0x110 [nvme_core]
kernel:  nvme_revalidate_disk+0xa4/0x160 [nvme_core]
kernel:  ? evict+0x14c/0x1b0
kernel:  revalidate_disk+0x2b/0xa0
kernel:  nvme_validate_ns+0x49/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  ? __nvme_submit_sync_cmd+0xbe/0x1e0 [nvme_core]
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
...
kernel: INFO: task kworker/u65:1:2630 blocked for more than 241 seconds.
kernel:       Tainted: G           OE     5.3.5-050305-generic #201910071830
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: kworker/u65:1   D    0  2630      2 0x80004000
kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core]
kernel: Call Trace:
kernel:  __schedule+0x2b9/0x6c0
kernel:  schedule+0x42/0xb0
kernel:  io_schedule+0x16/0x40
kernel:  do_read_cache_page+0x438/0x830
kernel:  ? __switch_to_asm+0x34/0x70
kernel:  ? file_fdatawait_range+0x30/0x30
kernel:  read_cache_page+0x12/0x20
kernel:  read_dev_sector+0x27/0xc0
kernel:  read_lba+0xc1/0x220
kernel:  ? kmem_cache_alloc_trace+0x19c/0x230
kernel:  efi_partition+0x1e6/0x708
kernel:  ? vsnprintf+0x39e/0x4e0
kernel:  ? snprintf+0x49/0x60
kernel:  check_partition+0x154/0x244
kernel:  rescan_partitions+0xae/0x280
kernel:  __blkdev_get+0x40f/0x560
kernel:  blkdev_get+0x3d/0x140
kernel:  __device_add_disk+0x388/0x480
kernel:  device_add_disk+0x13/0x20
kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
kernel:  ? nvme_update_ns_ana_state+0x60/0x60 [nvme_core]
kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
kernel:  ? blk_mq_free_request+0xd2/0x100
kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
kernel:  process_one_work+0x1db/0x380
kernel:  worker_thread+0x249/0x400
kernel:  kthread+0x104/0x140
kernel:  ? process_one_work+0x380/0x380
kernel:  ? kthread_park+0x80/0x80
kernel:  ret_from_fork+0x1f/0x40
--

Fixes: fab7772bfb ("nvme-multipath: revalidate nvme_ns_head gendisk
in nvme_validate_ns")
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-24 18:41:20 +02:00
Max Gurtovoy 4fea243ebc nvme: set initial value for controller's numa node
Initialize the node to NUMA_NO_NODE value. Transports that are aware of
numa node affinity can override it (e.g. RDMA transport set the affinity
according to the RDMA HCA).

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-06-24 18:37:08 +02:00
Christoph Hellwig 15f73f5b3e blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Niklas Cassel 108a58585b nvme: do not call del_gendisk() on a disk that was never added
device_add_disk() is negated by del_gendisk().
alloc_disk_node() is negated by put_disk().

In nvme_alloc_ns(), device_add_disk() is one of the last things being
called in the success case, and only void functions are being called
after this. Therefore this call should not be negated in the error path.

The superfluous call to del_gendisk() leads to the following prints:
[    7.839975] kobject: '(null)' (000000001ff73734): is not initialized, yet kobject_put() is being called.
[    7.840865] WARNING: CPU: 2 PID: 361 at lib/kobject.c:736 kobject_put+0x70/0x120

Fixes: 33cfdc2aa6 ("nvme: enforce extended LBA format for fabrics metadata")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-11 09:10:05 -06:00
Linus Torvalds bce159d734 for-5.8/drivers-2020-06-01
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl7VPc4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgQkEACnQlzWOfNQMz1AzgUAv/S8IYDJCLrkbjLZ
 JK4pJv8Hjhss/7sS+fd8kyKe9VtaZz2IjmrXcC66RMMwtpx4iHnkRffoNAgEdGOl
 /M5TCZGhs+F/mp3Lc0WdR5DFHkM6yy2Tkk9wCFLreB4bW67janAWnd7nbU4INqJj
 +WqIgpzNMc/kfUhpBYTeQLORhL4e2TG9ADTi/zeUITlpnEsA65LOgXKEpeIFYnSX
 KTl4GIZ9tjazG3Y1Eva7DYHDIErNNAtX67KBqf+WBgMV98eB0O6xIPN1WlmhDTqj
 FGMLkb8msH1HHntvxDAuc4/ortnUy8vPI4o6zKP89HJJNjIM5p5eHEuVF5JnBw42
 Rtu9Om6JqWx51nhAhJNBj9bUStYbhEl0vVQCwbkfPbDJhzTy3RR8z709q9+ZwOrL
 xbp4aJBzqrzscjBEiSQbNCf2PyuOAdU0r1x81UN81ZN41d5qUcumcinjw4Y7vru8
 z5zMlo1Iy/AWQYyu7jgHmnpI7ZyA/1Qclo5dV7aa72bLFaJa35e7QxgfQOFBA5dY
 UZl6QPJRlnB80uGRzD5jCh2O2sQ3XZqYnpaKsUAka1GgbceCp9IC4A5mfZvpACsh
 Xk8VXjlhvY/iPJsKLqrh4Oedg4Dj5M3PLL9C3MDfYeIP2qgXpbnk87UV1TPNSpY0
 QcTxsXXXIw==
 =H+/Z
 -----END PGP SIGNATURE-----

Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, icluding the
          nvme part of the lpfc driver"

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Keith Busch 3382a567ef nvme: force complete cancelled requests
Use blk_mq_foce_complete_rq() to bypass fake timeout error injection so
that request reclaim may proceed.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29 10:21:59 -06:00
Keith Busch 3b2a1ebceb nvme: set dma alignment to qword
The default dma alignment mask is 511, which is much larger than any nvme
controller requires. NVMe controllers accept qword aligned DMA addresses,
so set the request_queue constraints to that. This can help avoid bounce
buffers on user passthrough commands.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:40 +02:00
Max Gurtovoy 33cfdc2aa6 nvme: enforce extended LBA format for fabrics metadata
An extended LBA is a larger LBA that is created when metadata associated
with the LBA is transferred contiguously with the LBA data (AKA
interleaved). The metadata may be either transferred as part of the LBA
(creating an extended LBA) or it may be transferred as a separate
contiguous buffer of data. According to the NVMeoF spec, a fabrics ctrl
supports only an Extended LBA format. Fail revalidation in case we have a
spec violation. Also add a flag that will imply on capable transports and
controllers as part of a preparation for allowing end-to-end protection
information for fabric controllers.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:39 +02:00
Max Gurtovoy 9509335039 nvme: introduce max_integrity_segments ctrl attribute
This patch doesn't change any logic, and is needed as a preparation
for adding PI support for fabrics drivers that will use an extended
LBA format for metadata and will support more than 1 integrity segment.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:39 +02:00
James Smart 4d2ce68835 nvme: make nvme_ns_has_pi accessible to transports
Move the nvme_ns_has_pi() inline from core.c to the nvme.h header.
This allows use by the transports.

Signed-off-by: James Smart <jsmart2021@gmail.com>
[maxg: added a comment for nvme_ns_has_pi()]
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:38 +02:00
Max Gurtovoy b29f84857a nvme: introduce NVME_NS_METADATA_SUPPORTED flag
This is a preparation for adding support for metadata in fabric
controllers. New flag will imply that NVMe namespace supports getting
metadata that was originally generated by host's block layer.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:38 +02:00
Max Gurtovoy ffc89b1d3c nvme: introduce namespace features flag
Replace the specific ext boolean (that implies on extended LBA format)
with a feature in the new namespace features flag. This is a preparation
for adding more namespace features (such as metadata specific features).

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:38 +02:00
Damien Le Moal 68ab60ca2d nvme: fix io_opt limit setting
Currently, a namespace io_opt queue limit is set by default to the
physical sector size of the namespace and to the the write optimal
size (NOWS) when the namespace reports optimal IO sizes. This causes
problems with block limits stacking in blk_stack_limits() when a
namespace block device is combined with an HDD which generally do not
report any optimal transfer size (io_opt limit is 0). The code:

/* Optimal I/O a multiple of the physical block size? */
if (t->io_opt & (t->physical_block_size - 1)) {
	t->io_opt = 0;
	t->misaligned = 1;
	ret = -1;
}

in blk_stack_limits() results in an error return for this function when
the combined devices have different but compatible physical sector
sizes (e.g. 512B sector SSD with 4KB sector disks).

Fix this by not setting the optimal IO size queue limit if the namespace
does not report an optimal write size value.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:37 +02:00
Wu Bo 84e4c204b6 nvme: disable streams when get stream params failed
Disable streams again if getting the stream params fails.

Signed-off-by: Wu Bo <wubo40@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-05-27 07:12:37 +02:00
Keith Busch 92decf118f nvme: define constants for identification values
Improve code readability by defining the specification's constants that
the driver is using when decoding identification payloads.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
Keith Busch b04df85d9a nvme: flush scan work on passthrough commands
If a passthrough command causes the namespace inventory or capabilities
to change, flush the scan work that handles these changes so the driver
synchronizes with the user command's effects before returning the result
to user space.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
Christoph Hellwig 6623c5b3df nvme: clean up error handling in nvme_init_ns_head
Use a common label for putting the nshead if needed and only convert
nvme status codes for the one case where it actually is needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:36 -06:00
Keith Busch 31fdad7be1 nvme: consolodate io settings
The stream parameters indicating optimal io settings were just getting
overwritten later. Rearrange the settings so the streams parameters can
be preserved if provided.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch bc1af009a8 nvme: revalidate namespace stream parameters
The stream parameters are based on the currently formatted logical block
size. Recheck these parameters on namespace revalidation so the
registered constraints will be accurate.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch 38adf94e16 nvme: consolidate chunk_sectors settings
Move the quirked chunk_sectors setting to the same location as noiob so
one place registers this setting. And since the noiob value is only used
locally, remove the member from struct nvme_ns.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch b2b2de7c5a nvme: revalidate after verifying identifiers
If the namespace identifiers have changed, skip updating the disk
information, as that will register parameters from a mismatched
namespace.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch b2ce4d9069 nvme-multipath: set bdi capabilities once
The queues' backing device info capabilities don't change with each
namespace revalidation. Set it only when each path's request_queue
is initially added to a multipath queue.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch 0c284db7f2 nvme: check namespace head shared property
Reject a new shared namespace if a duplicate unshared namespace exists.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch 9ad1927a3b nvme: always search for namespace head
Even if a namespace reports it is not capable of sharing, search the
subsystem for a matching namespace head. If found, the driver should
reject that namespace since it's coming from an invalid configuration.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch ac262508da nvme: release namespace head reference on error
If a namespace identification does not match the subsystem's head for
that NSID, release the reference that was taken when the matching head
was initially found.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch d567572906 nvme: unlink head after removing last namespace
The driver had been unlinking the namespace head from the subsystem's
list only after the last reference was released, and outside of the
list's subsys->lock protection.

There is no reason to track an empty head, so unlink the entry from the
subsystem's list when the last namespace using that head is removed and
with the mutex lock protecting the list update. The next namespace to
attach reusing the previous NSID will allocate a new head rather than
find the old head with mismatched identifiers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Christoph Hellwig aec459b484 nvme: remove the magic 1024 constant in nvme_scan_ns_list
Replace it with a value derived from the identify data and nsid sizes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Christoph Hellwig 4005f28d25 nvme: avoid an Identify Controller command for each namespace scan
The namespace lists are 0-terminated, so we don't really need the NN value
execept for the legacy sequential scan.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Christoph Hellwig 4450ba3bbb nvme: factor out a nvme_ns_remove_by_nsid helper
Factor out a piece of deeply indented and logicaly separate code
from nvme_scan_ns_list into a new helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Christoph Hellwig 25dcaa9292 nvme: clean up nvme_scan_work
Move the check for the supported CNS value into nvme_scan_ns_list, and
limit the life time of the identify controller allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Christoph Hellwig b9a5c3d4c3 nvme: refine the Qemu Identify CNS quirk
Add a helper to check if we can use Identify CNS values > 1, and refine
the Qemu quirk to not apply to reported versions larger than 1.1, as the
Qemu implementation had been fixed by then.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch 03f8cebc12 nvme: remove unused parameter
nvme_alloc_ns_head() doesn't use the 'struct nvme_id_ns' parameter.
Remove it, and update caller accordingly.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:35 -06:00
Keith Busch 71fb90eb71 nvme: provide num dword helper
Various nvme commands use a zeroes based number of dwords field. Create
a helper function to convert byte lengths to this format.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:18:34 -06:00
Sagi Grimberg 59c7c3caaa nvme: fix possible hang when ns scanning fails during error recovery
When the controller is reconnecting, the host fails I/O and admin
commands as the host cannot reach the controller. ns scanning may
revalidate namespaces during that period and it is wrong to remove
namespaces due to these failures as we may hang (see 205da24343).

One command that may fail is nvme_identify_ns_descs. Since we return
success due to having ns identify descriptor list optional, we continue
to compare ns identifiers in nvme_revalidate_disk, obviously fail and
return -ENODEV to nvme_validate_ns, which will remove the namespace.

Exactly what we don't want to happen.

Fixes: 22802bf742 ("nvme: Namepace identification descriptor list is optional")
Tested-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-09 16:07:58 -06:00
Niklas Cassel 132be62387 nvme: prevent double free in nvme_alloc_ns() error handling
When jumping to the out_put_disk label, we will call put_disk(), which will
trigger a call to disk_release(), which calls blk_put_queue().

Later in the cleanup code, we do blk_cleanup_queue(), which will also call
blk_put_queue().

Putting the queue twice is incorrect, and will generate a KASAN splat.

Set the disk->queue pointer to NULL, before calling put_disk(), so that the
first call to blk_put_queue() will not free the queue.

The second call to blk_put_queue() uses another pointer to the same queue,
so this call will still free the queue.

Fixes: 85136c0102 ("lightnvm: simplify geometry enumeration")
Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-27 17:08:06 +02:00
Sagi Grimberg 74e4d20e2f nvme: inherit stable pages constraint in the mpath stack device
If the backing device require stable pages, we need to set it on the
stack mpath device as well. This applies to rdma/fc transports when
doing data integrity and tcp transport calculating digests.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-04-02 10:51:46 +02:00
Nick Bowler c95b708d5f nvme: fix compat address handling in several ioctls
On a 32-bit kernel, the upper bits of userspace addresses passed via
various ioctls are silently ignored by the nvme driver.

However on a 64-bit kernel running a compat task, these upper bits are
not ignored and are in fact required to be zero for the ioctls to work.

Unfortunately, this difference matters.  32-bit smartctl submits the
NVME_IOCTL_ADMIN_CMD ioctl with garbage in these upper bits because it
seems the pointer value it puts into the nvme_passthru_cmd structure is
sign extended.  This works fine on 32-bit kernels but fails on a 64-bit
one because (at least on my setup) the addresses smartctl uses are
consistently above 2G.  For example:

  # smartctl -x /dev/nvme0n1
  smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.5.11] (local build)
  Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

  Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Bad address

Since changing 32-bit kernels to actually check all of the submitted
address bits now would break existing userspace, this patch fixes the
compat problem by explicitly zeroing the upper bits in the compat case.
This enables 32-bit smartctl to work on a 64-bit kernel.

Signed-off-by: Nick Bowler <nbowler@draconx.ca>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-03-31 16:19:11 +02:00
Linus Torvalds 1592614838 for-5.7/drivers-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJDYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplhMD/95jd4nlVetHAo54z+Zk2ExE13+yDamRKyh
 vc7t2tz1reqFOimtVr5aVuTXCTgOx4CpiIox5qcn6qAExN4JtCChOBRGize/0u8S
 ckxnhHbN2C0rfnGldvrYYeNRonFI+7QKimnurWUSYYGN0xqbo21BxJ7dFaohMseo
 q4K8sIW0ctE6AOlw28Jerkg614s2NDGZ7q1laheXnYHn5c9f1m0NaKN/jyTGgr0X
 TLBiLbX2yRrAuvpctBj6Fna6YN7Vdd9jsf2Bt6ipUI1XgHQoVUGMxQNhWPyjsbSv
 GzRQUNAfVcasLzCP/Mj/47144OkUtDDpn2mjeXDaFljLDGFULD+jp/SsOmLCxkPC
 gI7G2yfBvF96/SOyT0JXrLyMcBd1R2vRoASbc5tPu82mZhx7YJZH5WYtOB9h2gra
 RTYo3xcm0EoN6yeMaH+xOuXxTWWInIrgKPONW4H8s7hxEiMt5oFNVBI7vqPr4LVp
 tpfxiKZDavKOofKXogNV4W7mSMP/Ir5Q9Ha4g5SXHBGp0z/PHmnQ0xDGNq0KDnU4
 eNO0UYCFNCNa+0AOhpNxaVuVm9LjrgvyXRjePgOZQ4akhohwHO6DLrHK1f8Hb1vD
 8Ih6uR+F5zZlKsouWro8HLGYm5w40Wq9tbCI8QbPYH6nkGoDmzpPv9jbAeWgJU5c
 KqP/5TBSLA==
 =Bs4E
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:

 - floppy driver cleanup series from Willy

 - NVMe updates and fixes (Various)

 - null_blk trace improvements (Chaitanya)

 - bcache fixes (Coly)

 - md fixes (via Song)

 - loop block size change optimizations (Martijn)

 - scnprintf() use (Takashi)

* tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
  null_blk: add trace in null_blk_zoned.c
  null_blk: add tracepoint helpers for zoned mode
  block: add a zone condition debug helper
  nvme: cleanup namespace identifier reporting in nvme_init_ns_head
  nvme: rename __nvme_find_ns_head to nvme_find_ns_head
  nvme: refactor nvme_identify_ns_descs error handling
  nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
  nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
  nvme: Fix controller creation races with teardown flow
  nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
  nvme: Fix ctrl use-after-free during sysfs deletion
  nvme-pci: Re-order nvme_pci_free_ctrl
  nvme: Remove unused return code from nvme_delete_ctrl_sync
  nvme: Use nvme_state_terminal helper
  nvme: release ida resources
  nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
  nvmet-tcp: optimize tcp stack TX when data digest is used
  nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
  nvme-multipath: do not reset on unknown status
  nvmet-rdma: allocate RW ctxs according to mdts
  ...
2020-03-30 11:43:51 -07:00
Christoph Hellwig 43fcd9e1ea nvme: cleanup namespace identifier reporting in nvme_init_ns_head
Lift the common namespace identifier reporting between the shared
namespace and new nshead cases into common code.  This also means
one less lock is held while doing I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Christoph Hellwig 026d2ef752 nvme: rename __nvme_find_ns_head to nvme_find_ns_head
There is no non __-prefixed version, so make the name a little more
readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Christoph Hellwig fb314eb0cb nvme: refactor nvme_identify_ns_descs error handling
Move the handling of an error into the function from the caller, and
only do it for an actual error on the admin command itself, not the
command parsing, as that should be enough to deal with devices claiming
a bogus version compliance.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin ce1518139e nvme: Fix controller creation races with teardown flow
Calling nvme_sysfs_delete() when the controller is in the middle of
creation may cause several bugs. If the controller is in NEW state we
remove delete_controller file and don't delete the controller. The user
will not be able to use nvme disconnect command on that controller again,
although the controller may be active. Other bugs may happen if the
controller is in the middle of create_ctrl callback and
nvme_do_delete_ctrl() starts. For example, freeing I/O tagset at
nvme_do_delete_ctrl() before it was allocated at create_ctrl callback.

To fix all those races don't allow the user to delete the controller
before it was fully created.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin 726612b6b8 nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
Put the ctrl reference count at nvme_uninit_ctrl as opposed to
nvme_init_ctrl which takes it. This decrease the reference count at the
core layer instead of decreasing it on each transport separately.
Also move the call of nvme_uninit_ctrl at PCI driver after calling to
nvme_release_prp_pools and nvme_dev_unmap, in order to put the reference
count after using the dev. This is safe because those functions use
nvme_dev which is freed only later at nvme_pci_free_ctrl.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin b780d7415a nvme: Fix ctrl use-after-free during sysfs deletion
In case nvme_sysfs_delete() is called by the user before taking the ctrl
reference count, the ctrl may be freed during the creation and cause the
bug. Take the reference as soon as the controller is externally visible,
which is done by cdev_device_add() in nvme_init_ctrl(). Also take the
reference count at the core layer instead of taking it on each transport
separately.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:56 +09:00
Israel Rukshin 6721c18a06 nvme: Remove unused return code from nvme_delete_ctrl_sync
The return code of nvme_delete_ctrl_sync is never used, so change it to
void.

Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
Israel Rukshin e7c43feae2 nvme: Use nvme_state_terminal helper
Improve code readability.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
Max Gurtovoy f41cfd5d0a nvme: release ida resources
ida instances allocate some internal memory in addition to the base
'struct ida'. Use ida_destroy() to release that memory at module_exit().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
masahiro31.yamada@kioxia.com c225b61031 nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
Currently 32 bit application gets ENOTTY when it calls
compat_ioctl with NVME_IOCTL_SUBMIT_IO in 64 bit kernel.

The cause is that the results of sizeof(struct nvme_user_io),
which is used to define NVME_IOCTL_SUBMIT_IO,
are not same between 32 bit compiler and 64 bit compiler.

* 32 bit: the result of sizeof nvme_user_io is 44.
* 64 bit: the result of sizeof nvme_user_io is 48.

64 bit compiler seems to add 32 bit padding for multiple of 8 bytes.

This patch adds a compat_ioctl handler.
The handler replaces NVME_IOCTL_SUBMIT_IO32 with NVME_IOCTL_SUBMIT_IO
in case 32 bit application calls compat_ioctl for submit in 64 bit kernel.
Then, it calls nvme_ioctl as usual.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Masahiro Yamada (KIOXIA) <masahiro31.yamada@kioxia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
John Meneghini 764e933209 nvme-multipath: do not reset on unknown status
The nvme multipath error handling defaults to controller reset if the
error is unknown. There are, however, no existing nvme status codes that
indicate a reset should be used, and resetting causes unnecessary
disruption to the rest of IO.

Change nvme's error handling to first check if failover should happen.
If not, let the normal error handling take over rather than reset the
controller.

Based-on-a-patch-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Meneghini <johnm@netapp.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:51:55 +09:00
Josh Triplett 3e98c2443f nvme: Check for readiness more quickly, to speed up boot time
After initialization, nvme_wait_ready checks for readiness every 100ms,
even though the drive may be ready far sooner than that. This delays
system boot by hundreds of milliseconds. Reduce the delay, checking for
readiness every millisecond instead.

Boot-time tests on an AWS c5.12xlarge:

Before:
[    0.546936] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[    0.764178] nvme nvme0: 2/0/0 default/read/poll queues
[    0.768424]  nvme0n1: p1
[    0.774132] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[    0.774146] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[    0.788141] Run /sbin/init as init process

After:
[    0.537088] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[    0.543457] nvme nvme0: 2/0/0 default/read/poll queues
[    0.548473]  nvme0n1: p1
[    0.554339] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[    0.554344] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[    0.567931] Run /sbin/init as init process

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:47:03 +09:00
Rupesh Girase 94d2e705b6 nvme: log additional message for controller status
Log the controller status to know more about issue if it
lies within kernel nvme subsytem or controller is unhealthy.

Signed-off-by: Rupesh Girase <rgirase@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulakrni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:47:03 +09:00
Chaitanya Kulkarni ad95a613ea nvme: code cleanup nvme_identify_ns_desc()
The function nvme_identify_ns_desc() has 3 levels of nesting which make
error message to exceeded > 80 char per line which is not aligned with
the kernel code standards and rest of the NVMe subsystem code.

Add a helper function to move the processing of the log when the
command is successful by reducing the nesting and keeping the
code < 80 char per line.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:45:25 +09:00
Sagi Grimberg 45fb19f766 nvme: expose hostid via sysfs for fabrics controllers
We allow userspace to connect with a custom hostid which is useful for
certain use-cases. However there is is no way to tell what is the hostid
used to connect to a given controller.

Expose this so userspace can correlate controllers based on hostid.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:45:12 +09:00
Sagi Grimberg 76171c6cdf nvme: expose hostnqn via sysfs for fabrics controllers
We allow userspace to connect with a custom hostnqn which is useful for
certain use-cases. However there is no way to tell what is the hostnqn
used to connect to a given controller.

Expose this so userspace can correlate controllers based on hostnqn.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-26 04:44:35 +09:00
Balbir Singh cb224c3af4 nvme: Convert to use set_capacity_revalidate_and_notify
block/genhd provides set_capacity_revalidate_and_notify() for
sending RESIZE notifications via uevents. This notification is
newly added to NVME devices

Signed-off-by: Balbir Singh <sblbir@amazon.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-18 15:13:21 -06:00
Edmund Nadolski adce7e9856 nvme: remove unused return code from nvme_alloc_ns
The return code of nvme_alloc_ns is never used, so change it
to void.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-03-04 09:09:08 -08:00
Keith Busch 15755854d5 nvme: Fix uninitialized-variable warning
gcc may detect a false positive on nvme using an unintialized variable
if setting features fails. Since this is not a fast path, explicitly
initialize this variable to suppress the warning.

Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2020-02-20 01:40:57 +09:00
Yi Zhang f25372ffc3 nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
nvme fw-activate operation will get bellow warning log,
fix it by update the parameter order

[  113.231513] nvme nvme0: Get FW SLOT INFO log error

Fixes: 0e98719b0e ("nvme: simplify the API for getting log pages")
Reported-by: Sujith Pandel <sujith_pandel@dell.com>
Reviewed-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-14 10:12:04 -07:00
Nigel Kirkland 97b2512ad0 nvme: prevent warning triggered by nvme_stop_keep_alive
Delayed keep alive work is queued on system workqueue and may be cancelled
via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.

Check_flush_dependency detects mismatched attributes between the work-queue
context used to cancel the keep alive work and system-wq. Specifically
system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
to cancel keep alive work have WQ_MEM_RECLAIM flag.

Example warning:

  workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

However this creates a secondary concern where work and a request to cancel
that work may be in the same work queue - namely err_work in the rdma and
tcp transports, which will want to flush/cancel the keep alive work which
will now be on nvme_wq.

After reviewing the transports, it looks like err_work can be moved to
nvme_reset_wq. In fact that aligns them better with transition into
RESETTING and performing related reset work in nvme_reset_wq.

Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.

Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-14 10:12:04 -07:00
Keith Busch 35038bffa8 nvme: Translate more status codes to blk_status_t
Decode interrupted command and not ready namespace nvme status codes to
BLK_STS_TARGET. These are not generic IO errors and should use a non-path
specific error so that it can use the non-failover retry path.

Reported-by: John Meneghini <John.Meneghini@netapp.com>
Cc: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-01-10 08:55:50 -07:00
Linus Torvalds f1fcd7786e for-linus-20191212
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3y54EQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqJuD/93LZmzS5UEWrNLkRaAaCyAy40MPxuXRZEp
 42yk7cvAT4OcCr+W6nkAgG6IHGRXOz8QvOzt0P5/HfugpNlB2oz5a/6+TiTtcZTt
 YNt0Z4yuBMU5SXIIxc3lUMcJGxslzOr+L+9ZXD4u5UqIdG1fSrECAexSCrlmmTwu
 Fx02TakDc/bbUYDfLAQD1+/Z066rp1ZWDkjXqA4kUvbFzt8F7qEOc1Evq47SuR7d
 Iw0bM3LVASXwTq2lRc1bFFL2glku6wwkccjwdyjSrQmK4+8LhF396fQGtXuj0Mrs
 OzuWhaOoGhan7dpj1D8e4tqugflQy9rv9bcy6Z9PjBY+VauuFdgPr3iFcwPaPbXm
 17ir4y7xJJxXlhZl/Bn06KIB2h+nLWDIaundFys5JnMmTiZvWIgSJ6Q3gWtMxgfH
 zWZLMw/UtRAmjHhLqvGsMaBTfgKX5ATpMbfGeZeXheVtVaOgGTunXunT56o7oRHB
 q4XWZqbydsYyHBUhgSzhBr03i67wbotxtebqg9VZ0UD8XM4iM8Kor/DleK03oUqD
 DsltKF66NAGNeOcV3TNzJuXHyF6S/vZdO7JdFHY29+pdljoTj5GB88+W9CbhwQRe
 WiKVpq7sAe/bh0wtqrD+QCByjSNSVU62kVgRhfqms47804j/vNqNvOKaC5UWTd0I
 2LG4jfSbeg==
 =hmxJ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191212' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - stable fix for the bi_size overflow. Not a corruption issue, but a
   case wher we could merge but disallowed (Andreas)

 - NVMe pull request via Keith, with various fixes.

 - MD pull request from Song.

 - Merge window regression fix for the rq passthrough stats (Logan)

 - Remove unused blkcg_drain_queue() function (Guoqing)

* tag 'for-linus-20191212' of git://git.kernel.dk/linux-block:
  blk-cgroup: remove blkcg_drain_queue
  block: fix NULL pointer dereference in account statistics with IDE
  md: make sure desc_nr less than MD_SB_DISKS
  md: raid1: check rdev before reference in raid1_sync_request func
  raid5: need to set STRIPE_HANDLE for batch head
  block: fix "check bi_size overflow before merge"
  nvme/pci: Fix read queue count
  nvme/pci Limit write queue sizes to possible cpus
  nvme/pci: Fix write and poll queue types
  nvme/pci: Remove last_cq_head
  nvme: Namepace identification descriptor list is optional
  nvme-fc: fix double-free scenarios on hw queues
  nvme: else following return is not needed
  nvme: add error message on mismatching controller ids
  nvme_fc: add module to ops template to allow module references
  nvmet-loop: Avoid preallocating big SGL for data
  nvme-fc: Avoid preallocating big SGL for data
  nvme-rdma: Avoid preallocating big SGL for data
2019-12-13 14:27:19 -08:00
Jens Axboe dc3ecfc981 Merge branch 'nvme/for-5.5' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Keith

* 'nvme/for-5.5' of git://git.infradead.org/nvme:
  nvme/pci: Fix read queue count
  nvme/pci Limit write queue sizes to possible cpus
  nvme/pci: Fix write and poll queue types
  nvme/pci: Remove last_cq_head
  nvme: Namepace identification descriptor list is optional
  nvme-fc: fix double-free scenarios on hw queues
  nvme: else following return is not needed
  nvme: add error message on mismatching controller ids
  nvme_fc: add module to ops template to allow module references
  nvmet-loop: Avoid preallocating big SGL for data
  nvme-fc: Avoid preallocating big SGL for data
  nvme-rdma: Avoid preallocating big SGL for data
2019-12-06 17:27:56 -07:00
Linus Torvalds c3bed3b20e pci-v5.5-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAl3leXUUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vyY3g/9FAVVdPEaadNtAhQ/zIxcjozDovKq
 0q7yOA3aTBTUoNEinm88an6p0dcC4gNKtGukXmzVH2Hhxm9kLRdtpZGYY00tpLUB
 9rI7XsgwwHa+hLwsHbIs507sKGFGy5FLr0ChTTGLDEMppnEvjA2hZooYmcB/OgrC
 LlFcwbNKGOk/Si9u2bF2nLO0JDoVHnwzpF99saew/nqc7Lfj9e9IPZFom+VjPBUh
 AOvRp2H7uBN+WQlpLeFeMDDoeXh34lX0kYqIV/cVkXVnknDGYKV2CBTg2aeX7jd0
 QiPHZh6zlW8zNQgaCZRiBAbatVEOnRMRJ++yiqB8hBYp1LMXm6kJ01YSQpXkugoY
 Vp9dtzzTARWV/XkKwD4brw9ZEmIDnO+Ed2x2VbUkPJVcXAvzSQWAx82IU0Iuqmcb
 9qr6U2Zf/Xk5aFlGPYVH8QOG+QqzIbZNRQ7NlhDlITyW4P6QPu0mw374yYP2wDGL
 sP5YSS3YGa0sQcEgDtVnd4z+WTZI4AwXLPaeaLkDhdfHp2FsERUY4TrPs33J99xw
 og4EyokVFzjYzlnBPU6WWn7LL+jj5ccXkL3MA4DR4FJOnNGHh7NXfQUH56rrgsq7
 F9/8shL5DuTbQkde1uSyUG9Iq/RigVLlV5DQavFm3dSXvZi0E16t5alC5URNTzk7
 at8Bogn53QhlmYc=
 =uUXw
 -----END PGP SIGNATURE-----

Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "Enumeration:

   - Warn if a host bridge has no NUMA info (Yunsheng Lin)

   - Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
     Efremov)

  Resource management:

   - Fix boot-time Embedded Controller GPE storm caused by incorrect
     resource assignment after ACPI Bus Check Notification (Mika
     Westerberg)

   - Protect pci_reassign_bridge_resources() against concurrent
     addition/removal (Benjamin Herrenschmidt)

   - Fix bridge dma_ranges resource list cleanup (Rob Herring)

   - Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
     the MMIO and prefetchable MMIO window sizes of hotplug bridges
     independently (Nicholas Johnson)

   - Fix MMIO/MMIO_PREF window assignment that assigned more space than
     desired (Nicholas Johnson)

   - Only enforce bus numbers from bridge EA if the bridge has EA
     devices downstream (Subbaraya Sundeep)

   - Consolidate DT "dma-ranges" parsing and convert all host drivers to
     use shared parsing (Rob Herring)

  Error reporting:

   - Restore AER capability after resume (Mayurkumar Patel)

   - Add PoisonTLPBlocked AER counter (Rajat Jain)

   - Use for_each_set_bit() to simplify AER code (Andy Shevchenko)

   - Fix AER kernel-doc (Andy Shevchenko)

   - Add "pcie_ports=dpc-native" parameter to allow native use of DPC
     even if platform didn't grant control over AER (Olof Johansson)

  Hotplug:

   - Avoid returning prematurely from sysfs requests to enable or
     disable a PCIe hotplug slot (Lukas Wunner)

   - Don't disable interrupts twice when suspending hotplug ports (Mika
     Westerberg)

   - Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
     Westerberg)

  Power management:

   - Remove unnecessary ASPM locking (Bjorn Helgaas)

   - Add support for disabling L1 PM Substates (Heiner Kallweit)

   - Allow re-enabling Clock PM after it has been disabled (Heiner
     Kallweit)

   - Add sysfs attributes for controlling ASPM link states (Heiner
     Kallweit)

   - Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
     sysfs files (Heiner Kallweit)

   - Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
     USB 2.0 or 1.1 connect events (Kai-Heng Feng)

   - Move power state check out of pci_msi_supported() (Bjorn Helgaas)

   - Fix incorrect MSI-X masking on resume and revert related nvme quirk
     for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)

   - Always return devices to D0 when thawing to fix hibernation with
     drivers like mlx4 that used legacy power management (previously we
     only did it for drivers with new power management ops) (Dexuan Cui)

   - Clear PCIe PME Status even for legacy power management (Bjorn
     Helgaas)

   - Fix PCI PM documentation errors (Bjorn Helgaas)

   - Use dev_printk() for more power management messages (Bjorn Helgaas)

   - Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)

   - Convert xen-platform from legacy to generic power management (Bjorn
     Helgaas)

   - Removed unused .resume_early() and .suspend_late() legacy power
     management hooks (Bjorn Helgaas)

   - Rearrange power management code for clarity (Rafael J. Wysocki)

   - Decode power states more clearly ("4" or "D4" really refers to
     "D3cold") (Bjorn Helgaas)

   - Notice when reading PM Control register returns an error (~0)
     instead of interpreting it as being in D3hot (Bjorn Helgaas)

   - Add missing link delays required by the PCIe spec (Mika Westerberg)

  Virtualization:

   - Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
     Helgaas)

   - Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
     previously didn't recognize that) (Kuppuswamy Sathyanarayanan)

   - Allow VFs to use PASID (the PF PASID capability is shared by the
     VFs, but the code previously didn't recognize that) (Kuppuswamy
     Sathyanarayanan)

   - Disconnect PF and VF ATS enablement, since ATS in PFs and
     associated VFs can be enabled independently (Kuppuswamy
     Sathyanarayanan)

   - Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)

   - Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)

   - Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
     Wilczynski)

   - Remove unused PRI and PASID stubs (Bjorn Helgaas)

   - Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
     interfaces that are only used by built-in IOMMU drivers (Bjorn
     Helgaas)

   - Hide PRI and PASID state restoration functions used only inside the
     PCI core (Bjorn Helgaas)

   - Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)

   - Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)

   - Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
     Cherian)

   - Fix the UPDCR register address in the Intel ACS quirk (Steffen
     Liebergeld)

   - Unify ACS quirk implementations (Bjorn Helgaas)

  Amlogic Meson host bridge driver:

   - Fix meson PERST# GPIO polarity problem (Remi Pommarel)

   - Add DT bindings for Amlogic Meson G12A (Neil Armstrong)

   - Fix meson clock names to match DT bindings (Neil Armstrong)

   - Add meson support for Amlogic G12A SoC with separate shared PHY
     (Neil Armstrong)

   - Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
     combo PHY (Neil Armstrong)

   - Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)

   - Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
     (Neil Armstrong)

  Broadcom iProc host bridge driver:

   - Invalidate iProc PAXB address mapping before programming it
     (Abhishek Shah)

   - Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)

  Cadence host bridge driver:

   - Refactor Cadence PCIe host controller to use as a library for both
     host and endpoint (Tom Joseph)

  Freescale Layerscape host bridge driver:

   - Add layerscape LS1028a support (Xiaowei Bao)

  Intel VMD host bridge driver:

   - Add VMD bus 224-255 restriction decode (Jon Derrick)

   - Add VMD 8086:9A0B device ID (Jon Derrick)

   - Remove Keith from VMD maintainer list (Keith Busch)

  Marvell ARMADA 3700 / Aardvark host bridge driver:

   - Use LTSSM state to build link training flag since Aardvark doesn't
     implement the Link Training bit (Remi Pommarel)

   - Delay before training Aardvark link in case PERST# was asserted
     before the driver probe (Remi Pommarel)

   - Fix Aardvark issues with Root Control reads and writes (Remi
     Pommarel)

   - Don't rely on jiffies in Aardvark config access path since
     interrupts may be disabled (Remi Pommarel)

   - Fix Aardvark big-endian support (Grzegorz Jaszczyk)

  Marvell ARMADA 370 / XP host bridge driver:

   - Make mvebu_pci_bridge_emul_ops static (Ben Dooks)

  Microsoft Hyper-V host bridge driver:

   - Add hibernation support for Hyper-V virtual PCI devices (Dexuan
     Cui)

   - Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
     Cui)

   - Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)

  Mobiveil host bridge driver:

   - Change mobiveil csr_read()/write() function names that conflict
     with riscv arch functions (Kefeng Wang)

  NVIDIA Tegra host bridge driver:

   - Fix Tegra CLKREQ dependency programming (Vidya Sagar)

  Renesas R-Car host bridge driver:

   - Remove unnecessary header include from rcar (Andrew Murray)

   - Tighten register index checking for rcar inbound range programming
     (Marek Vasut)

   - Fix rcar inbound range alignment calculation to improve packing of
     multiple entries (Marek Vasut)

   - Update rcar MACCTLR setting to match documentation (Yoshihiro
     Shimoda)

   - Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
     (Yoshihiro Shimoda)

   - Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
     Horman)

  Rockchip host bridge driver:

   - Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
     Murphy)

  Socionext UniPhier host bridge driver:

   - Set uniphier to host (RC) mode always (Kunihiko Hayashi)

  Endpoint drivers:

   - Fix endpoint driver sign extension problem when shifting page
     number to phys_addr_t (Alan Mikhak)

  Misc:

   - Add NumaChip SPDX header (Krzysztof Wilczynski)

   - Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)

   - Remove unused includes (Krzysztof Wilczynski)

   - Removed unused sysfs attribute groups (Ben Dooks)

   - Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)

   - Add PCIe Link Control 2 register field definitions to replace magic
     numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)

   - Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
     Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)

   - Use pcie_capability_read_word() instead of pci_read_config_word()
     in AMDGPU and Radeon CIK/SI (Frederick Lawler)

   - Remove unused pci_irq_get_node() Greg Kroah-Hartman)

   - Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
     (Palmer Dabbelt, Michal Simek)

   - Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)

   - Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
     Helgaas)

   - Fix bridge emulation big-endian support (Grzegorz Jaszczyk)

   - Fix dwc find_next_bit() usage (Niklas Cassel)

   - Fix pcitest.c fd leak (Hewenliang)

   - Fix typos and comments (Bjorn Helgaas)

   - Fix Kconfig whitespace errors (Krzysztof Kozlowski)"

* tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
  PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
  asm-generic: Make msi.h a mandatory include/asm header
  Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
  PCI/MSI: Fix incorrect MSI-X masking on resume
  PCI/MSI: Move power state check out of pci_msi_supported()
  PCI/MSI: Remove unused pci_irq_get_node()
  PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
  PCI: hv: Change pci_protocol_version to per-hbus
  PCI: hv: Add hibernation support
  PCI: hv: Reorganize the code in preparation of hibernation
  MAINTAINERS: Remove Keith from VMD maintainer
  PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
  PCI/ASPM: Add sysfs attributes for controlling ASPM link states
  PCI: Fix indentation
  drm/radeon: Prefer pcie_capability_read_word()
  drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
  drm/radeon: Correct Transmit Margin masks
  drm/amdgpu: Prefer pcie_capability_read_word()
  PCI: uniphier: Set mode register to host mode
  drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
  ...
2019-12-03 13:58:22 -08:00
Keith Busch 22802bf742 nvme: Namepace identification descriptor list is optional
Despite NVM Express specification 1.3 requires a controller claiming to
be 1.3 or higher implement Identify CNS 03h (Namespace Identification
Descriptor list), the driver doesn't really need this identification in
order to use a namespace. The code had already documented in comments
that we're not to consider an error to this command.

Return success if the controller provided any response to an
namespace identification descriptors command.

Fixes: 538af88ea7 ("nvme: make nvme_report_ns_ids propagate error back")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
Reported-by: Ingo Brunberg <ingo_brunberg@web.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: stable@vger.kernel.org # 5.4+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-12-03 06:10:00 +09:00
Linus Torvalds 0da522107e compat_ioctl: remove most of fs/compat_ioctl.c
As part of the cleanup of some remaining y2038 issues, I came to
 fs/compat_ioctl.c, which still has a couple of commands that need support
 for time64_t.
 
 In completely unrelated work, I spent time on cleaning up parts of this
 file in the past, moving things out into drivers instead.
 
 After Al Viro reviewed an earlier version of this series and did a lot
 more of that cleanup, I decided to try to completely eliminate the rest
 of it and move it all into drivers.
 
 This series incorporates some of Al's work and many patches of my own,
 but in the end stops short of actually removing the last part, which is
 the scsi ioctl handlers. I have patches for those as well, but they need
 more testing or possibly a rewrite.
 
 Signed-off-by: Arnd Bergmann <arnd@arndb.de>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJdsHCdAAoJEJpsee/mABjZtYkP/1JGl3jFv3Iq/5BCdPkaePP1
 RtMJRNfURgK3GeuHUui330PvVjI/pLWXU/VXMK2MPTASpJLzYz3uCaZrpVWEMpDZ
 +ImzGmgJkITlW1uWU3zOcQhOxTyb1hCZ0Ci+2xn9QAmyOL7prXoXCXDWv3h6iyiF
 lwG+nW+HNtyx41YG+9bRfKNoG0ZJ+nkJ70BV6u0acQHXWn7Xuupa9YUmBL87hxAL
 6dlJfLTJg6q8QSv/Q6LxslfWk2Ti8OOJZOwtFM5R8Bgl0iUcvshiRCKfv/3t9jXD
 dJNvF1uq8z+gracWK49Qsfq5dnZ2ZxHFUo9u0NjbCrxNvWH/sdvhbaUBuJI75seH
 VIznCkdxFhrqitJJ8KmxANxG08u+9zSKjSlxG2SmlA4qFx/AoStoHwQXcogJscNb
 YIXYKmWBvwPzYu09QFAXdHFPmZvp/3HhMWU6o92lvDhsDwzkSGt3XKhCJea4DCaT
 m+oCcoACqSWhMwdbJOEFofSub4bY43s5iaYuKes+c8O261/Dwg6v/pgIVez9mxXm
 TBnvCsotq5m8wbwzv99eFqGeJH8zpDHrXxEtRR5KQqMqjLq/OQVaEzmpHZTEuK7n
 e/V/PAKo2/V63g4k6GApQXDxnjwT+m0aWToWoeEzPYXS6KmtWC91r4bWtslu3rdl
 bN65armTm7bFFR32Avnu
 =lgCl
 -----END PGP SIGNATURE-----

Merge tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull removal of most of fs/compat_ioctl.c from Arnd Bergmann:
 "As part of the cleanup of some remaining y2038 issues, I came to
  fs/compat_ioctl.c, which still has a couple of commands that need
  support for time64_t.

  In completely unrelated work, I spent time on cleaning up parts of
  this file in the past, moving things out into drivers instead.

  After Al Viro reviewed an earlier version of this series and did a lot
  more of that cleanup, I decided to try to completely eliminate the
  rest of it and move it all into drivers.

  This series incorporates some of Al's work and many patches of my own,
  but in the end stops short of actually removing the last part, which
  is the scsi ioctl handlers. I have patches for those as well, but they
  need more testing or possibly a rewrite"

* tag 'compat-ioctl-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (42 commits)
  scsi: sd: enable compat ioctls for sed-opal
  pktcdvd: add compat_ioctl handler
  compat_ioctl: move SG_GET_REQUEST_TABLE handling
  compat_ioctl: ppp: move simple commands into ppp_generic.c
  compat_ioctl: handle PPPIOCGIDLE for 64-bit time_t
  compat_ioctl: move PPPIOCSCOMPRESS to ppp_generic
  compat_ioctl: unify copy-in of ppp filters
  tty: handle compat PPP ioctls
  compat_ioctl: move SIOCOUTQ out of compat_ioctl.c
  compat_ioctl: handle SIOCOUTQNSD
  af_unix: add compat_ioctl support
  compat_ioctl: reimplement SG_IO handling
  compat_ioctl: move WDIOC handling into wdt drivers
  fs: compat_ioctl: move FITRIM emulation into file systems
  gfs2: add compat_ioctl support
  compat_ioctl: remove unused convert_in_user macro
  compat_ioctl: remove last RAID handling code
  compat_ioctl: remove /dev/raw ioctl translation
  compat_ioctl: remove PCI ioctl translation
  compat_ioctl: remove joystick ioctl translation
  ...
2019-12-01 13:46:15 -08:00
Jian-Hong Pan 655e7aee1f Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
Since e045fa29e8 ("PCI/MSI: Fix incorrect MSI-X masking on resume") is
merged, we can revert the previous quirk now.

This reverts commit 19ea025e1d.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Fixes: 19ea025e1d ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
2019-11-26 13:13:14 -06:00
James Smart a8157ff360 nvme: add error message on mismatching controller ids
We've seen a few devices that return different controller id's to
the Fabric Connect command vs the Identify(controller) command. It's
currently hard to identify this failure by existing error messages. It
comes across as a (re)connect attempt in the transport that fails with
a -22 (-EINVAL) status. The issue is compounded by older kernels not
having the controller id check or had the identify command overwrite the
fabrics controller id value before it checked. Both resulted in cases
where the devices appeared fine until more recent kernels.

Clarify the reject by adding an error message on controller id mismatches.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-27 02:48:33 +09:00
Linus Torvalds 323264eefb for-5.5/drivers-post-20191122
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3X/7gQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgponxD/9mb/H9LD6/flEqoPv7n1dv7Y8Oe+AKpGLb
 a2Jh8ycpU6WtzZdlMYbQkxAqgJCupLTlih3WY3NuI1fwSsxwyMziQEnVnKgPNf7s
 PLWt+Qo5ryooyVkPi4KCHKjx2CFDUL4B1BqtSLm9n3eN72FRa9HCsCiMugjCbV+K
 aF7snzZ0ss+m7SnKIpdtXJjcdIFC2hXwCAGWAOv1vOPwhqQZFBsxjHKnEJtumTp2
 +wzLBPItLBbzHtAyopbiNlfsuHL9CF9L1QFaCTZE7N6eVWnlYpIhMCMrfEJ/e3jK
 27ct3PWAa4Qr2S/85//AMOyL9mxK96g9FjuGjxnKQZ+qWh89RPLq+oGs1zWSv2O0
 08BynWvxYHkoOR2+baPx89SHrlwyN0HCsvFBxVKrMsVHpy81sIYLFlytYaQUMx2/
 RkgkIxiAo5R5vYHPZYX8cU4c1rASjG0tYD9OA6e78MJFIBfkTl70XHHVX2Tgf5by
 fuwon/g+iVvN94QHb81ulZcA0hXRz8jM2RNIAPhqfoJXX6wNzD5MooNxTs9m88IP
 6/HaM1l6AJUtOMNA1aZbKAq+ARYuIA0/qHoSS0UVHoG8D4YkYaFpvYsAaA+TBzeO
 J8IcwSu6eVR1NvgJ9b4cwXFWSf75a/o4UeDP/1fYcQU5Gn/KQEdQVFSMvND1nnHW
 hUTP5AKFcQ==
 =oyE3
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block

Pull additional block driver updates from Jens Axboe:
 "Here's another block driver update, done to avoid conflicts with the
  zoned changes coming next.

  This contains:

   - Prepare SCSI sd for zone open/close/finish support

   - Small NVMe pull request
        - hwmon support (Akinobu)
        - add new co-maintainer (Christoph)
        - work-around for a discard issue on non-conformant drives
          (Eduard)

   - Small nbd leak fix"

* tag 'for-5.5/drivers-post-20191122' of git://git.kernel.dk/linux-block:
  nbd: prevent memory leak
  nvme: hwmon: add quirk to avoid changing temperature threshold
  nvme: hwmon: provide temperature min and max values for each sensor
  nvmet: add another maintainer
  nvme: Discard workaround for non-conformant devices
  nvme: Add hardware monitoring support
  scsi: sd_zbc: add zone open, close, and finish support
2019-11-25 11:18:03 -08:00
Linus Torvalds 2d53943090 for-5.5/drivers-20191121
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3WyEAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplgbD/4jNeqT0q2IkNcUUEWkZWsBOlfi0SiclS5v
 X8JY1IxTlL0kaBWm83mw06JewucQ97Fh7xblPE8/iDHJqpgEX4vvSQY1b8hcDulZ
 YOKUnLkFU22nICeT04/8x/+f8gqD5KOlGxkgEvUKViQW15oc0oNu4St/yFM1QEN0
 qNMzpcfFXV9lYsOPl0y3pKdP+qbfcpeSmaFD9Z65gxN6rJy1WR8rtUGXy2luoiEc
 dh15IL9AGN/r8VTo8yRpD9PStiuJqpALIR8OHJSHPj+s0pQ6twk4aehcnYseAMbH
 zSDpa9AJrfqlnh8tUfKYLWi/PM7pMH0F01rAiQv47j/C0+QhbiOU/uTFTzUW5hQ1
 eK6XzJ0slxwnDsHLKf+xJmCj0Oyk0jDimNQr/2MNsuhmr29V5lfvBNflub8eOLyZ
 ie2Eulv+z6pYBSJx6kqm0X3vhXOy4wgU+X8LzvfcP9iAjgU1rfzxUWxLEj+KfJS2
 Nl+ERV9nafoPpoKpNR7zWRBUulp1qZJzo/U9JaUKiI5cWkIH1hhHmU2++xMeyJpb
 XHoDFNTGv6z/eef65eSveFD7F274TSi16K56Obk+4KWaSrIR0d6VwUA7FDmJbSI+
 Jqk1OFdaRGsQ5OcVxF1Qo4WChn0FvhcD0c+yL0N19WZ01QeYsb3hlA+MUPDtGQ04
 U79MPfu7iA==
 =i0jf
 -----END PGP SIGNATURE-----

Merge tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the main block driver updates for 5.5. Nothing major in here,
  mostly just fixes. This contains:

   - a set of bcache changes via Coly

   - MD changes from Song

   - loop unmap write-zeroes fix (Darrick)

   - spelling fixes (Geert)

   - zoned additions cleanups to null_blk/dm (Ajay)

   - allow null_blk online submit queue changes (Bart)

   - NVMe changes via Keith, nothing major here either"

* tag 'for-5.5/drivers-20191121' of git://git.kernel.dk/linux-block: (56 commits)
  Revert "bcache: fix fifo index swapping condition in journal_pin_cmp()"
  drivers/md/raid5-ppl.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  drivers/md/raid5.c: use the new spelling of RWH_WRITE_LIFE_NOT_SET
  bcache: don't export symbols
  bcache: remove the extra cflags for request.o
  bcache: at least try to shrink 1 node in bch_mca_scan()
  bcache: add idle_max_writeback_rate sysfs interface
  bcache: add code comments in bch_btree_leaf_dirty()
  bcache: fix deadlock in bcache_allocator
  bcache: add code comment bch_keylist_pop() and bch_keylist_pop_front()
  bcache: deleted code comments for dead code in bch_data_insert_keys()
  bcache: add more accurate error messages in read_super()
  bcache: fix static checker warning in bcache_device_free()
  bcache: fix a lost wake-up problem caused by mca_cannibalize_lock
  bcache: fix fifo index swapping condition in journal_pin_cmp()
  md/raid10: prevent access of uninitialized resync_pages offset
  md: avoid invalid memory access for array sb->dev_roles
  md/raid1: avoid soft lockup under high load
  null_blk: add zone open, close, and finish support
  dm: add zone open, close and finish support
  ...
2019-11-25 11:15:41 -08:00
Eduard Hasenleithner 530436c45e nvme: Discard workaround for non-conformant devices
Users observe IOMMU related errors when performing discard on nvme from
non-compliant nvme devices reading beyond the end of the DMA mapped
ranges to discard.

Two different variants of this behavior have been observed: SM22XX
controllers round up the read size to a multiple of 512 bytes, and Phison
E12 unconditionally reads the maximum discard size allowed by the spec
(256 segments or 4kB).

Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
so the driver DMA maps a memory range that will always succeed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at>
[changelog, use existing define, kernel coding style]
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-13 06:37:38 +09:00
Guenter Roeck 400b6a7b13 nvme: Add hardware monitoring support
nvme devices report temperature information in the controller information
(for limits) and in the smart log. Currently, the only means to retrieve
this information is the nvme command line interface, which requires
super-user privileges.

At the same time, it would be desirable to be able to use NVMe temperature
information for thermal control.

This patch adds support to read NVMe temperatures from the kernel using the
hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
can use this information to set thermal policies, and userspace can access
it using libsensors and/or the "sensors" command.

Example output from the "sensors" command:

nvme0-pci-0100
Adapter: PCI adapter
Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
Sensor 1:     +39.0°C
Sensor 2:     +41.0°C

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-11-12 01:57:35 +09:00
Damien Le Moal e08f2ae850 nvme: Introduce nvme_lba_to_sect()
Introduce the new helper function nvme_lba_to_sect() to convert a device
logical block number to a 512B sector number. Use this new helper in
obvious places, cleaning up the code.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:41 -07:00
Damien Le Moal 314d48dd22 nvme: Cleanup and rename nvme_block_nr()
Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT
instead of its hard coded value 9. Also add a comment to decribe this
helper.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:41 -07:00
Max Gurtovoy 16686f3a6c nvme: move common call to nvme_cleanup_cmd to core layer
nvme_cleanup_cmd should be called for each call to nvme_setup_cmd
(symmetrical functions). Move the call for nvme_cleanup_cmd to the common
core layer and call it during nvme_complete_rq for the good flow. For
error flow, each transport will call nvme_cleanup_cmd independently. Also
take care of a special case of path failure, where we call
nvme_complete_rq without doing nvme_setup_cmd.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:41 -07:00
Max Gurtovoy 2dc3947b53 nvme: introduce "Command Aborted By host" status code
Fix the status code of canceled requests initiated by the host according
to TP4028 (Status Code 0x371):
"Command Aborted By host: The command was aborted as a result of host
action (e.g., the host disconnected the Fabric connection)."

Also in a multipath environment, unless otherwise specified, errors of
this type (path related) should be retried using a different path, if
one is available.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-04 10:56:41 -07:00
Arnd Bergmann 1832f2d8ff compat_ioctl: move more drivers to compat_ptr_ioctl
The .ioctl and .compat_ioctl file operations have the same prototype so
they can both point to the same function, which works great almost all
the time when all the commands are compatible.

One exception is the s390 architecture, where a compat pointer is only
31 bit wide, and converting it into a 64-bit pointer requires calling
compat_ptr(). Most drivers here will never run in s390, but since we now
have a generic helper for it, it's easy enough to use it consistently.

I double-checked all these drivers to ensure that all ioctl arguments
are used as pointers or are ignored, but are not interpreted as integer
values.

Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Darren Hart (VMware) <dvhart@infradead.org>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-10-23 17:23:44 +02:00
Keith Busch c1ac9a4b07 nvme: Wait for reset state when required
Prevent simultaneous controller disabling/enabling tasks from interfering
with each other through a function to wait until the task successfully
transitioned the controller to the RESETTING state. This ensures disabling
the controller will not be interrupted by another reset path, otherwise
a concurrent reset may leave the controller in the wrong state.

Tested-by: Edmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14 23:22:00 +09:00
Keith Busch 4c75f87785 nvme: Prevent resets during paused controller state
A paused controller is doing critical internal activation work in the
background. Prevent subsequent controller resets from occurring during
this period by setting the controller state to RESETTING first. A helper
function, nvme_try_sched_reset_work(), is introduced for these paths so
they may continue with scheduling the reset_work after they've completed
their uninterruptible critical section.

Tested-by: Edmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14 23:21:54 +09:00
Keith Busch 5d02a5c1d6 nvme: Remove ADMIN_ONLY state
The admin only state was intended to fence off actions that don't
apply to a non-IO capable controller. The only actual user of this is
the scan_work, and pci was the only transport to ever set this state.
The consequence of having this state is placing an additional burden on
every other action that applies to both live and admin only controllers.

Remove the admin only state and place the admin only burden on the only
place that actually cares: scan_work.

This also prepares to make it easier to temporarily pause a LIVE state
so that we don't need to remember which state the controller had been in
prior to the pause.

Tested-by: Edmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
2019-10-14 23:21:44 +09:00
Sagi Grimberg 6abff1b9f7 nvme: fix possible deadlock when nvme_update_formats fails
nvme_update_formats may fail to revalidate the namespace and
attempt to remove the namespace. This may lead to a deadlock
as nvme_ns_remove will attempt to acquire the subsystem lock
which is already acquired by the passthru command with effects.

Move the invalid namepsace removal to after the passthru command
releases the subsystem lock.

Reported-by: Judy Brock <judy.brock@samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-10-04 17:10:12 -07:00
James Smart 2b1ff255d2 nvme: Add ctrl attributes for queue_count and sqsize
Current controller interrogation requires a lot of guesswork
on how many io queues were created and what the io sq size is.
The numbers are dependent upon core/fabric defaults, connect
arguments, and target responses.

Add sysfs attributes for queue_count and sqsize.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25 13:01:44 -07:00
Marta Rybczynska 65e68edce0 nvme: allow 64-bit results in passthru commands
It is not possible to get 64-bit results from the passthru commands,
what prevents from getting for the Capabilities (CAP) property value.

As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).

This issue has been already discussed [2], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.html

Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25 12:53:27 -07:00
Jian-Hong Pan 19ea025e1d nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
Kingston NVME SSD with firmware version E8FK11.T has no interrupt after
resume with actions related to suspend to idle. This patch applied
NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue.

Fixes: d916b1be94 ("nvme-pci: use host managed power state for suspend")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25 12:53:14 -07:00
Dan Carpenter bc4f6e06a9 nvme: fix an error code in nvme_init_subsystem()
"ret" should be a negative error code here, but it's either success or
possibly uninitialized.

Fixes: 32fd90c407 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-25 12:53:14 -07:00
Balbir Singh b224726de5 nvme-pci: Fix a race in controller removal
User space programs like udevd may try to read to partitions at the
same time the driver detects a namespace is unusable, and may deadlock
if revalidate_disk() is called while such a process is waiting to
enter the frozen queue. On detecting a dead namespace, move the disk
revalidate after unblocking dispatchers that may be holding bd_butex.

changelog Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Balbir Singh <sblbir@amzn.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-23 14:00:11 -07:00
Max Gurtovoy 54d4e6ab91 block: centralize PI remapping logic to the block layer
Currently t10_pi_prepare/t10_pi_complete functions are called during the
NVMe and SCSi layers command preparetion/completion, but their actual
place should be the block layer since T10-PI is a general data integrity
feature that is used by block storage protocols. Introduce .prepare_fn
and .complete_fn callbacks within the integrity profile that each type
can implement according to its needs.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Suggested-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>

Fixed to not call queue integrity functions if BLK_DEV_INTEGRITY
isn't defined in the config.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-17 20:03:49 -06:00
Linus Torvalds 7ad67ca553 for-5.4/block-2019-09-16
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl1/no0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmo9EACFXMbdNmEEUMyRSdOkVLlr7ZlTyQi1tLpB
 YESDPxdBfybzpi0qa8JSaysGIfvSkSjmSAqBqrWPmASOSOL6CK4bbA4fTYbgPplk
 XeHUdgGiG34oCQUn8Xil5reYaTm7I6LQWnWTpVa5fIhAyUYaGJL+987ykoGmpQmB
 Dvf3YSc+8H0RTp9PCMVd6UCGPkZbVlLImGad3PF5ULvTEaE4RCXC2aiAgh0p1l5A
 J2CkRZ+/mio3zN2O4YN7VdPGfr1Wo1iZ834xbIGLegv1miHXagFk7jwTcC7zIt5t
 oSnJnqIg3iCe7SpWt4Bkzw/zy/2UqaspifbCMgw8vychlViVRUHFO5h85Yboo7kQ
 OMLEQPcwjm6dTHv5h1iXF9LW1O7NoiYmmgvApU9uOo1HUrl1X7PZ3JEfUsVHxkOO
 T4D5igf0Krsl1eAbiwEUQzy7vFZ8PlRHqrHgK+fkyotzHu1BJR7OQkYygEfGFOB/
 EfMxplGDpmibYGuWCwDX2bPAmLV3SPUQENReHrfPJRDt5TD1UkFpVGv/PLLhbr0p
 cLYI78DKpDSigBpVMmwq5nTYpnex33eyDTTA8C0sakcsdzdmU5qv30y3wm4nTiep
 f6gZo6IMXwRg/rCgVVrd9SKQAr/8wEzVlsDW3qyi2pVT8sHIgm0tFv7paihXGdDV
 xsKgmTrQQQ==
 =Qt+h
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Two NVMe pull requests:
     - ana log parse fix from Anton
     - nvme quirks support for Apple devices from Ben
     - fix missing bio completion tracing for multipath stack devices
       from Hannes and Mikhail
     - IP TOS settings for nvme rdma and tcp transports from Israel
     - rq_dma_dir cleanups from Israel
     - tracing for Get LBA Status command from Minwoo
     - Some nvme-tcp cleanups from Minwoo, Potnuri and Myself
     - Some consolidation between the fabrics transports for handling
       the CAP register
     - reset race with ns scanning fix for fabrics (move fabrics
       commands to a dedicated request queue with a different lifetime
       from the admin request queue)."
     - controller reset and namespace scan races fixes
     - nvme discovery log change uevent support
     - naming improvements from Keith
     - multiple discovery controllers reject fix from James
     - some regular cleanups from various people

 - Series fixing (and re-fixing) null_blk debug printing and nr_devices
   checks (André)

 - A few pull requests from Song, with fixes from Andy, Guoqing,
   Guilherme, Neil, Nigel, and Yufen.

 - REQ_OP_ZONE_RESET_ALL support (Chaitanya)

 - Bio merge handling unification (Christoph)

 - Pick default elevator correctly for devices with special needs
   (Damien)

 - Block stats fixes (Hou)

 - Timeout and support devices nbd fixes (Mike)

 - Series fixing races around elevator switching and device add/remove
   (Ming)

 - sed-opal cleanups (Revanth)

 - Per device weight support for BFQ (Fam)

 - Support for blk-iocost, a new model that can properly account cost of
   IO workloads. (Tejun)

 - blk-cgroup writeback fixes (Tejun)

 - paride queue init fixes (zhengbin)

 - blk_set_runtime_active() cleanup (Stanley)

 - Block segment mapping optimizations (Bart)

 - lightnvm fixes (Hans/Minwoo/YueHaibing)

 - Various little fixes and cleanups

* tag 'for-5.4/block-2019-09-16' of git://git.kernel.dk/linux-block: (186 commits)
  null_blk: format pr_* logs with pr_fmt
  null_blk: match the type of parameter nr_devices
  null_blk: do not fail the module load with zero devices
  block: also check RQF_STATS in blk_mq_need_time_stamp()
  block: make rq sector size accessible for block stats
  bfq: Fix bfq linkage error
  raid5: use bio_end_sector in r5_next_bio
  raid5: remove STRIPE_OPS_REQ_PENDING
  md: add feature flag MD_FEATURE_RAID0_LAYOUT
  md/raid0: avoid RAID0 data corruption due to layout confusion.
  raid5: don't set STRIPE_HANDLE to stripe which is in batch list
  raid5: don't increment read_errors on EILSEQ return
  nvmet: fix a wrong error status returned in error log page
  nvme: send discovery log page change events to userspace
  nvme: add uevent variables for controller devices
  nvme: enable aen regardless of the presence of I/O queues
  nvme-fabrics: allow discovery subsystems accept a kato
  nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
  nvme: Remove redundant assignment of cq vector
  nvme: Assign subsys instance from first ctrl
  ...
2019-09-17 16:57:47 -07:00
Sagi Grimberg 85f8a4351d nvme: send discovery log page change events to userspace
If the controller supports discovery log page change events,
we want to enable it. When we see a discovery log change event
we will send it up to userspace and expect it to handle it.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:46 -07:00
Sagi Grimberg a42f42e5bb nvme: add uevent variables for controller devices
When we send uevents to userspace, add controller specific
environment variables to uniquly identify the controller beyond
its device name.

This will be useful to address discovery log change events by
actually verifying that the discovery controller is indeed the
same as the device that generated the event.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:46 -07:00
Sagi Grimberg 93da40239b nvme: enable aen regardless of the presence of I/O queues
AENs in general are not related to the presence of I/O queues,
so enable them regardless. Note that the only exception is that
discovery controller will not support any of the requested AENs
and nvme_enable_aen will respect that and return, so it is still
safe to enable regardless.

Note it is safe to enable AENs even before the initial namespace
scanning as we have the scan operation in a workqueue context.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:46 -07:00
Keith Busch 733e4b69d5 nvme: Assign subsys instance from first ctrl
The namespace disk names must be unique for the lifetime of the
subsystem. This was accomplished by using their parent subsystems'
instances which were allocated independently from the controllers
connected to that subsystem. This allowed name prefixes assigned to
namespaces to match a controller from an unrelated subsystem, and has
created confusion among users examining device nodes.

Ensure a namespace's subsystem instance never clashes with a controller
instance of another subsystem by transferring the instance ownership
to the parent subsystem from the first controller discovered in that
subsystem.

Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:46 -07:00
Edmund Nadolski 03894b7a89 nvme: include admin_q sync with nvme_sync_queues
nvme_sync_queues currently syncs all namespace queues, but should
also sync the admin queue, if present.

Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
James Smart c26aa57202 nvme: Treat discovery subsystems as unique subsystems
Current code matches subnqn and collapses all controllers to the
same subnqn to a single subsystem structure. This is good for
recognizing multiple controllers for the same subsystem. But with
the well-known discovery subnqn, the subsystems aren't truly the
same subsystem. As such, subsystem specific rules, such as no
overlap of controller id, do not apply. With today's behavior, the
check for overlap of controller id can fail, preventing the new
discovery controller from being created.

When searching for like subsystem nqn, exclude the discovery nqn
from matching. This will result in each discovery controller being
attached to a unique subsystem structure.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg 205da24343 nvme: fix ns removal hang when failing to revalidate due to a transient error
If a controller reset is racing with a namespace revalidation, the
revalidation (admin) I/O will surely fail, but we should not remove the
namespace as we will execute the I/O when the controller is back up.
Same for spurious allocation errors (return -ENOMEM).

Fix this by checking the specific error code in nvme_revalidate_disk and
if it is a transient error (for example non DNR nvme statuses or
a negative ENOMEM as allocation failure), do not remove the namespace as
it will either recover when the controller is back up and schedule
a subsequent scan, or the controller is going away and the namespaces
will be removed anyways.

This fixes a hang namespace scanning racing with a controller reset and
also sporious I/O errors in path failover coditions where the
controller reset is racing with the namespace scan work with multipath
enabled.

Reported-by: Hannes Reinecke  <hare@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg 538af88ea7 nvme: make nvme_report_ns_ids propagate error back
Make the callers check the return status and propagate
back accordingly (casting to errno from a positive nvme status).
Also print the return status in nvme_report_ns_ids.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg 331813f687 nvme: make nvme_identify_ns propagate errors back
right now callers of nvme_identify_ns only know that it failed,
but don't know why. Make nvme_identify_ns propagate the error back.
Because nvme_submit_sync_cmd may return a positive status code, we
make nvme_identify_ns receive the id by reference and return that
status up the call chain, but make sure not to leak positive nvme
status codes to the upper layers.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg 2f9c173647 nvme: pass status to nvme_error_status
No need for the full blown request structure.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Sagi Grimberg 1c0d12c0b1 nvme: fail cancelled commands with NVME_SC_HOST_PATH_ERROR
NVME_SC_ABORT_REQ means that the request was aborted due to
an abort command received. In our case, this is a transport
cancellation, so host pathing error is much more appropriate.

Also, convert NVME_SC_HOST_PATH_ERROR to BLK_STS_TRANSPORT for
such that callers can understand that the status is a transport
related error. This will be used by the ns scanning code to
understand if it got an error from the controller or that the
controller happens to be unreachable by the transport.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-09-12 08:50:45 -07:00
Hannes Reinecke 35fe0d12c8 nvme: trace bio completion
When native multipathing is enabled we cannot enable blktrace for
the underlying paths, so any completion is never traced.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[fixed-up by Mikhail for non-multipath-build]
Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:02 -07:00
Sagi Grimberg b5b0504878 nvme: don't pass cap to nvme_disable_ctrl
All seem to call it with ctrl->cap so no need to pass it
at all.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Sagi Grimberg c0f2f45be2 nvme: move sqsize setting to the core
nvme_enable_ctrl reads the cap register right after, so
no need to do that locally in the transport driver. Have
sqsize setting in nvme_init_identify.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Sagi Grimberg 4fba445828 nvme: have nvme_init_identify set ctrl->cap
No need to use a stack cap variable.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-08-29 12:55:00 -07:00
Mario Limonciello cb32de1b7e nvme: Add quirk for LiteON CL1 devices running FW 22301111
One of the components in LiteON CL1 device has limitations that
can be encountered based upon boundary race conditions using the
nvme bus specific suspend to idle flow.

When this situation occurs the drive doesn't resume properly from
suspend-to-idle.

LiteON has confirmed this problem and fixed in the next firmware
version.  As this firmware is already in the field, avoid running
nvme specific suspend to idle flow.

Fixes: d916b1be94 ("nvme-pci: use host managed power state for suspend")
Link: http://lists.infradead.org/pipermail/linux-nvme/2019-July/thread.html
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Charles Hyde <charles.hyde@dellteam.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-20 11:02:10 -06:00
Guilherme G. Piccoli a89fcca818 nvme: Fix cntlid validation when not using NVMEoF
Commit 1b1031ca63 ("nvme: validate cntlid during controller initialisation")
introduced a validation for controllers with duplicate cntlid that runs
on nvme_init_subsystem(). The problem is that the validation relies on
ctrl->cntlid, and this value is assigned (from id_ctrl value) after the
call for nvme_init_subsystem() in nvme_init_identify() for non-fabrics
scenario. That leads to ctrl->cntlid always being 0 in case we have a
physical set of controllers in the same subsystem.

This patch fixes that by loading the discovered cntlid id_ctrl value into
ctrl->cntlid before the subsystem initialization, only for the non-fabrics
case. The patch was tested with emulated nvme devices (qemu) having two
controllers in a single subsystem. Without the patch, we couldn't make
it work failing in the duplicate check; when running with the patch, we
could see the subsystem holding both controllers.

For the fabrics case we see ctrl->cntlid has a more intricate relation
with the admin connect, so we didn't change that.

Fixes: 1b1031ca63 ("nvme: validate cntlid during controller initialisation")
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-20 11:02:10 -06:00
Ming Lei a87ccce0b5 blk-mq: remove blk_mq_complete_request_sync
blk_mq_tagset_wait_completed_request() has been applied for waiting
for completed request's fn, so not necessary to use
blk_mq_complete_request_sync() any more.

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:41:29 -06:00
Ming Lei 78ca407247 nvme: don't abort completed request in nvme_cancel_request
Before aborting in-flight requests, all IO queues and their interrupts
have been shutdown. However, request's completion function may not be
done yet because it can be scheduled to run via IPI.

So don't abort one request if it is marked as completed, otherwise
we may abort one normal completed request.

Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-08-04 21:41:29 -06:00
Sagi Grimberg 0157ec8dad nvme: fix controller removal race with scan work
With multipath enabled, nvme_scan_work() can read from the device
(through nvme_mpath_add_disk()) and hang [1]. However, with fabrics,
once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang
(see nvmf_check_ready()) and the mpath stack device make_request
will block if head->list is not empty. However, when the head->list
consistst of only DELETING/DEAD controllers, we should actually not
block, but rather fail immediately.

In addition, before we go ahead and remove the namespaces, make sure
to clear the current path and kick the requeue list so that the
request will fast fail upon requeuing.

[1]:
--
  INFO: task kworker/u4:3:166 blocked for more than 120 seconds.
        Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u4:3    D    0   166      2 0x80004000
  Workqueue: nvme-wq nvme_scan_work
  Call Trace:
   __schedule+0x851/0x1400
   schedule+0x99/0x210
   io_schedule+0x21/0x70
   do_read_cache_page+0xa57/0x1330
   read_cache_page+0x4a/0x70
   read_dev_sector+0xbf/0x380
   amiga_partition+0xc4/0x1230
   check_partition+0x30f/0x630
   rescan_partitions+0x19a/0x980
   __blkdev_get+0x85a/0x12f0
   blkdev_get+0x2a5/0x790
   __device_add_disk+0xe25/0x1250
   device_add_disk+0x13/0x20
   nvme_mpath_set_live+0x172/0x2b0
   nvme_update_ns_ana_state+0x130/0x180
   nvme_set_ns_ana_state+0x9a/0xb0
   nvme_parse_ana_log+0x1c3/0x4a0
   nvme_mpath_add_disk+0x157/0x290
   nvme_validate_ns+0x1017/0x1bd0
   nvme_scan_work+0x44d/0x6a0
   process_one_work+0x7d7/0x1240
   worker_thread+0x8e/0xff0
   kthread+0x2c3/0x3b0
   ret_from_fork+0x35/0x40

   INFO: task kworker/u4:1:1034 blocked for more than 120 seconds.
        Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u4:1    D    0  1034      2 0x80004000
  Workqueue: nvme-delete-wq nvme_delete_ctrl_work
  Call Trace:
   __schedule+0x851/0x1400
   schedule+0x99/0x210
   schedule_timeout+0x390/0x830
   wait_for_completion+0x1a7/0x310
   __flush_work+0x241/0x5d0
   flush_work+0x10/0x20
   nvme_remove_namespaces+0x85/0x3d0
   nvme_do_delete_ctrl+0xb4/0x1e0
   nvme_delete_ctrl_work+0x15/0x20
   process_one_work+0x7d7/0x1240
   worker_thread+0x8e/0xff0
   kthread+0x2c3/0x3b0
   ret_from_fork+0x35/0x40
--

Reported-by: Logan Gunthorpe <logang@deltatee.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-07-31 18:01:56 -07:00
Sagi Grimberg b9156daeb1 nvme: fix a possible deadlock when passthru commands sent to a multipath device
When the user issues a command with side effects, we will end up freezing
the namespace request queue when updating disk info (and the same for
the corresponding mpath disk node).

However, we are not freezing the mpath node request queue,
which means that mpath I/O can still come in and block on blk_queue_enter
(called from nvme_ns_head_make_request -> direct_make_request).

This is a deadlock, because blk_queue_enter will block until the inner
namespace request queue is unfroze, but that process is blocked because
the namespace revalidation is trying to update the mpath disk info
and freeze its request queue (which will never complete because
of the I/O that is blocked on blk_queue_enter).

Fix this by freezing all the subsystem nsheads request queues before
executing the passthru command. Given that these commands are infrequent
we should not worry about this temporary I/O freeze to keep things sane.

Here is the matching hang traces:
--
[ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds.
[ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.487274] systemd-udevd D 0 17994 1 0x00000000
[ 374.493407] Call Trace:
[ 374.496145] __schedule+0x2ef/0x620
[ 374.500047] schedule+0x38/0xa0
[ 374.503569] blk_queue_enter+0x139/0x220
[ 374.507959] ? remove_wait_queue+0x60/0x60
[ 374.512540] direct_make_request+0x60/0x130
[ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core]
[ 374.523740] ? generic_make_request_checks+0x307/0x6f0
[ 374.529484] generic_make_request+0x10d/0x2e0
[ 374.534356] submit_bio+0x75/0x140
[ 374.538163] ? guard_bio_eod+0x32/0xe0
[ 374.542361] submit_bh_wbc+0x171/0x1b0
[ 374.546553] block_read_full_page+0x1ed/0x330
[ 374.551426] ? check_disk_change+0x70/0x70
[ 374.556008] ? scan_shadow_nodes+0x30/0x30
[ 374.560588] blkdev_readpage+0x18/0x20
[ 374.564783] do_read_cache_page+0x301/0x860
[ 374.569463] ? blkdev_writepages+0x10/0x10
[ 374.574037] ? prep_new_page+0x88/0x130
[ 374.578329] ? get_page_from_freelist+0xa2f/0x1280
[ 374.583688] ? __alloc_pages_nodemask+0x179/0x320
[ 374.588947] read_cache_page+0x12/0x20
[ 374.593142] read_dev_sector+0x2d/0xd0
[ 374.597337] read_lba+0x104/0x1f0
[ 374.601046] find_valid_gpt+0xfa/0x720
[ 374.605243] ? string_nocheck+0x58/0x70
[ 374.609534] ? find_valid_gpt+0x720/0x720
[ 374.614016] efi_partition+0x89/0x430
[ 374.618113] ? string+0x48/0x60
[ 374.621632] ? snprintf+0x49/0x70
[ 374.625339] ? find_valid_gpt+0x720/0x720
[ 374.629828] check_partition+0x116/0x210
[ 374.634214] rescan_partitions+0xb6/0x360
[ 374.638699] __blkdev_reread_part+0x64/0x70
[ 374.643377] blkdev_reread_part+0x23/0x40
[ 374.647860] blkdev_ioctl+0x48c/0x990
[ 374.651956] block_ioctl+0x41/0x50
[ 374.655766] do_vfs_ioctl+0xa7/0x600
[ 374.659766] ? locks_lock_inode_wait+0xb1/0x150
[ 374.664832] ksys_ioctl+0x67/0x90
[ 374.668539] __x64_sys_ioctl+0x1a/0x20
[ 374.672732] do_syscall_64+0x5a/0x1c0
[ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9

[ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds.
[ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42
[ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 374.760170] nvmeadm D 0 49141 36333 0x00004080
[ 374.766301] Call Trace:
[ 374.769038] __schedule+0x2ef/0x620
[ 374.772939] schedule+0x38/0xa0
[ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100
[ 374.781614] ? remove_wait_queue+0x60/0x60
[ 374.786192] blk_mq_freeze_queue+0x1a/0x20
[ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core]
[ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core]
[ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core]
[ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core]
[ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core]
[ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core]
[ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core]
[ 374.833114] do_vfs_ioctl+0xa7/0x600
[ 374.837117] ? __audit_syscall_entry+0xdd/0x130
[ 374.842184] ksys_ioctl+0x67/0x90
[ 374.845891] __x64_sys_ioctl+0x1a/0x20
[ 374.850082] do_syscall_64+0x5a/0x1c0
[ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9
--

Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Tested-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-07-31 17:59:01 -07:00
Logan Gunthorpe 8c36e66fb4 nvme-core: Fix extra device_put() call on error path
In the error path for nvme_init_subsystem(), nvme_put_subsystem()
will call device_put(), but it will get called again after the
mutex_unlock().

The device_put() only needs to be called if device_add() fails.

This bug caused a KASAN use-after-free error when adding and
removing subsytems in a loop:

  BUG: KASAN: use-after-free in device_del+0x8d9/0x9a0
  Read of size 8 at addr ffff8883cdaf7120 by task multipathd/329

  CPU: 0 PID: 329 Comm: multipathd Not tainted 5.2.0-rc6-vmlocalyes-00019-g70a2b39005fd-dirty #314
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
  Call Trace:
   dump_stack+0x7b/0xb5
   print_address_description+0x6f/0x280
   ? device_del+0x8d9/0x9a0
   __kasan_report+0x148/0x199
   ? device_del+0x8d9/0x9a0
   ? class_release+0x100/0x130
   ? device_del+0x8d9/0x9a0
   kasan_report+0x12/0x20
   __asan_report_load8_noabort+0x14/0x20
   device_del+0x8d9/0x9a0
   ? device_platform_notify+0x70/0x70
   nvme_destroy_subsystem+0xf9/0x150
   nvme_free_ctrl+0x280/0x3a0
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   nvme_free_ns+0xc4/0x100
   nvme_release+0xb3/0xe0
   __blkdev_put+0x549/0x6e0
   ? kasan_check_write+0x14/0x20
   ? bd_set_size+0xb0/0xb0
   ? kasan_check_write+0x14/0x20
   ? mutex_lock+0x8f/0xe0
   ? __mutex_lock_slowpath+0x20/0x20
   ? locks_remove_file+0x239/0x370
   blkdev_put+0x72/0x2c0
   blkdev_close+0x8d/0xd0
   __fput+0x256/0x770
   ? _raw_read_lock_irq+0x40/0x40
   ____fput+0xe/0x10
   task_work_run+0x10c/0x180
   ? filp_close+0xf7/0x140
   exit_to_usermode_loop+0x151/0x170
   do_syscall_64+0x240/0x2e0
   ? prepare_exit_to_usermode+0xd5/0x190
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7f5a79af05d7
  Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f c3 66 0f 1f 44 00 00 53 89 fb 48 83 ec 10 e8 c4 fb ff ff 89 df 89 c2 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2b 89 d7 89 44 24 0c e8 06 fc ff ff 8b 44 24
  RSP: 002b:00007f5a7799c810 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
  RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007f5a79af05d7
  RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000008
  RBP: 00007f5a58000f98 R08: 0000000000000002 R09: 00007f5a7935ee80
  R10: 0000000000000000 R11: 0000000000000293 R12: 000055e432447240
  R13: 0000000000000000 R14: 0000000000000001 R15: 000055e4324a9cf0

  Allocated by task 1236:
   save_stack+0x21/0x80
   __kasan_kmalloc.constprop.6+0xab/0xe0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x102/0x210
   nvme_init_identify+0x13c3/0x3820
   nvme_loop_configure_admin_queue+0x4fa/0x5e0
   nvme_loop_create_ctrl+0x469/0xf40
   nvmf_dev_write+0x19a3/0x21ab
   __vfs_write+0x66/0x120
   vfs_write+0x154/0x490
   ksys_write+0x104/0x240
   __x64_sys_write+0x73/0xb0
   do_syscall_64+0xa5/0x2e0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 329:
   save_stack+0x21/0x80
   __kasan_slab_free+0x129/0x190
   kasan_slab_free+0xe/0x10
   kfree+0xa7/0x200
   nvme_release_subsystem+0x49/0x60
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   klist_class_dev_put+0x31/0x40
   klist_put+0x8f/0xf0
   klist_del+0xe/0x10
   device_del+0x3a7/0x9a0
   nvme_destroy_subsystem+0xf9/0x150
   nvme_free_ctrl+0x280/0x3a0
   device_release+0x72/0x1d0
   kobject_put+0x144/0x410
   put_device+0x13/0x20
   nvme_free_ns+0xc4/0x100
   nvme_release+0xb3/0xe0
   __blkdev_put+0x549/0x6e0
   blkdev_put+0x72/0x2c0
   blkdev_close+0x8d/0xd0
   __fput+0x256/0x770
   ____fput+0xe/0x10
   task_work_run+0x10c/0x180
   exit_to_usermode_loop+0x151/0x170
   do_syscall_64+0x240/0x2e0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 32fd90c407 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by : Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-07-31 17:57:24 -07:00
Anthony Iliopoulos fab7772bfb nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns
When CONFIG_NVME_MULTIPATH is set, only the hidden gendisk associated
with the per-controller ns is run through revalidate_disk when a
rescan is triggered, while the visible blockdev never gets its size
(bdev->bd_inode->i_size) updated to reflect any capacity changes that
may have occurred.

This prevents online resizing of nvme block devices and in extension of
any filesystems atop that will are unable to expand while mounted, as
userspace relies on the blockdev size for obtaining the disk capacity
(via BLKGETSIZE/64 ioctls).

Fix this by explicitly revalidating the actual namespace gendisk in
addition to the per-controller gendisk, when multipath is enabled.

Signed-off-by: Anthony Iliopoulos <ailiopoulos@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-30 13:29:20 +03:00
Logan Gunthorpe e654dfd38c nvme: fix memory leak caused by incorrect subsystem free
When freeing the subsystem after finding another match with
__nvme_find_get_subsystem(), use put_device() instead of
__nvme_release_subsystem() which calls kfree() directly.

Per the documentation, put_device() should always be used
after device_initialization() is called. Otherwise, leaks
like the one below which was detected by kmemleak may occur.

Once the call of __nvme_release_subsystem() is removed it no
longer makes sense to keep the helper, so fold it back
into nvme_release_subsystem().

unreferenced object 0xffff8883d12bfbc0 (size 16):
  comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s)
  hex dump (first 16 bytes):
    6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff  nvme-subsys2....
  backtrace:
    [<000000007d8fc208>] __kmalloc_track_caller+0x16d/0x2a0
    [<0000000081169e5f>] kvasprintf+0xad/0x130
    [<0000000025626f25>] kvasprintf_const+0x47/0x120
    [<00000000fa66ad36>] kobject_set_name_vargs+0x44/0x120
    [<000000004881f8b3>] dev_set_name+0x98/0xc0
    [<000000007124dae3>] nvme_init_identify+0x1995/0x38e0
    [<000000009315020a>] nvme_loop_configure_admin_queue+0x4fa/0x5e0
    [<000000001a63e766>] nvme_loop_create_ctrl+0x489/0xf80
    [<00000000a46ecc23>] nvmf_dev_write+0x1a12/0x2220
    [<000000002259b3d5>] __vfs_write+0x66/0x120
    [<000000002f6df81e>] vfs_write+0x154/0x490
    [<000000007e8cfc19>] ksys_write+0x10a/0x240
    [<00000000ff5c7b85>] __x64_sys_write+0x73/0xb0
    [<00000000fee6d692>] do_syscall_64+0xaa/0x470
    [<00000000997e1ede>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: ab9e00cc72 ("nvme: track subsystems")
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-23 17:46:43 +02:00
Minwoo Im 7d30c81b80 nvme: fix NULL deref for fabrics options
git://git.infradead.org/nvme.git nvme-5.3 branch now causes the
following NULL deref oops.  Check the ctrl->opts first before the deref.

[   16.337581] BUG: kernel NULL pointer dereference, address: 0000000000000056
[   16.338551] #PF: supervisor read access in kernel mode
[   16.338551] #PF: error_code(0x0000) - not-present page
[   16.338551] PGD 0 P4D 0
[   16.338551] Oops: 0000 [#1] SMP PTI
[   16.338551] CPU: 2 PID: 1035 Comm: kworker/u16:5 Not tainted 5.2.0-rc6+ #1
[   16.338551] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
[   16.338551] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
[   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
[   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
[   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
[   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
[   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
[   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
[   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
[   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
[   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
[   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   16.338551] Call Trace:
[   16.338551]  nvme_scan_work+0x2c0/0x340 [nvme_core]
[   16.338551]  ? __switch_to_asm+0x40/0x70
[   16.338551]  ? _raw_spin_unlock_irqrestore+0x18/0x30
[   16.338551]  ? try_to_wake_up+0x408/0x450
[   16.338551]  process_one_work+0x20b/0x3e0
[   16.338551]  worker_thread+0x1f9/0x3d0
[   16.338551]  ? cancel_delayed_work+0xa0/0xa0
[   16.338551]  kthread+0x117/0x120
[   16.338551]  ? kthread_stop+0xf0/0xf0
[   16.338551]  ret_from_fork+0x3a/0x50
[   16.338551] Modules linked in: nvme nvme_core
[   16.338551] CR2: 0000000000000056
[   16.338551] ---[ end trace b9bf761a93e62d84 ]---
[   16.338551] RIP: 0010:nvme_validate_ns+0xc9/0x7e0 [nvme_core]
[   16.338551] Code: c0 49 89 c5 0f 84 00 07 00 00 48 8b 7b 58 e8 be 48 39 c1 48 3d 00 f0 ff ff 49 89 45 18 0f 87 a4 06 00 00 48 8b 93 70 0a 00 00 <80> 7a 56 00 74 0c 48 8b 40 68 83 48 3c 08 49 8b 45 18 48 89 c6 bf
[   16.338551] RSP: 0018:ffffc900024c7d10 EFLAGS: 00010283
[   16.338551] RAX: ffff888135a30720 RBX: ffff88813a4fd1f8 RCX: 0000000000000007
[   16.338551] RDX: 0000000000000000 RSI: ffffffff8256dd38 RDI: ffff888135a30720
[   16.338551] RBP: 0000000000000001 R08: 0000000000000007 R09: ffff88813aa6a840
[   16.338551] R10: 0000000000000001 R11: 000000000002d060 R12: ffff88813a4fd1f8
[   16.338551] R13: ffff88813a77f800 R14: ffff88813aa35180 R15: 0000000000000001
[   16.338551] FS:  0000000000000000(0000) GS:ffff88813ba80000(0000) knlGS:0000000000000000
[   16.338551] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   16.338551] CR2: 0000000000000056 CR3: 000000000240a002 CR4: 0000000000360ee0
[   16.338551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   16.338551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 958f2a0f81 ("nvme-tcp: set the STABLE_WRITES flag when data digests are enabled")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-07-11 16:02:26 -06:00
Sagi Grimberg 420dc733f9 nvme: fix regression upon hot device removal and insertion
When we validate the new controller id, we want to skip
controllers that are either deleting or dead. Fix the check
to do that and not on the newly added controller.

Fixes: 1b1031ca63 ("nvme: validate cntlid during controller initialisation")
Reported-by: Jon Derrick <jonathan.derrick@intel.com>
Tested-by: Jon Derrick <jonathan.derrick@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-10 09:36:16 -07:00
Mikhail Skorzhinskii 958f2a0f81 nvme-tcp: set the STABLE_WRITES flag when data digests are enabled
There was a few false alarms sighted on target side about wrong data
digest while performing high throughput load to XFS filesystem shared
through NVMoF TCP.

This flag tells the rest of the kernel to ensure that the data buffer
does not change while the write is in flight.  It incurs a performance
penalty, so only enable it when it is actually needed, i.e. when we are
calculating data digests.

Although even with this change in place, ext2 users can steel experience
false positives, as ext2 is not respecting this flag. This may be apply
to vfat as well.

Signed-off-by: Mikhail Skorzhinskii <mskorzhinskiy@solarflare.com>
Signed-off-by: Mike Playle <mplayle@solarflare.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09 14:18:09 -07:00
Bart Van Assche 81adb86334 nvme: set physical block size and optimal I/O size
>From the NVMe 1.4 spec:

NSFEAT bit 4 if set to 1: indicates that the fields NPWG, NPWA, NPDG, NPDA,
and NOWS are defined for this namespace and should be used by the host for
I/O optimization;
[ ... ]
Namespace Preferred Write Granularity (NPWG): This field indicates the
smallest recommended write granularity in logical blocks for this namespace.
This is a 0's based value. The size indicated should be less than or equal
to Maximum Data Transfer Size (MDTS) that is specified in units of minimum
memory page size. The value of this field may change if the namespace is
reformatted. The size should be a multiple of Namespace Preferred Write
Alignment (NPWA). Refer to section 8.25 for how this field is utilized to
improve performance and endurance.
[ ... ]
Each Write, Write Uncorrectable, or Write Zeroes commands should address a
multiple of Namespace Preferred Write Granularity (NPWG) (refer to Figure
245) and Stream Write Size (SWS) (refer to Figure 515) logical blocks (as
expressed in the NLB field), and the SLBA field of the command should be
aligned to Namespace Preferred Write Alignment (NPWA) (refer to Figure 245)
for best performance.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-07-09 14:15:37 -07:00
Akinobu Mita f79d5fda4e nvme: enable to inject errors into admin commands
This enables to inject errors into the commands submitted to the admin
queue.

It is useful to test error handling in the controller initialization.

	# echo 100 > /sys/kernel/debug/nvme0/fault_inject/probability
	# echo 1 > /sys/kernel/debug/nvme0/fault_inject/times
	# echo 10 > /sys/kernel/debug/nvme0/fault_inject/space
	# nvme reset /dev/nvme0
	# dmesg
	...
	nvme nvme0: Could not set queue count (16385)
	nvme nvme0: IO queues not created

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:50 +02:00
Akinobu Mita a3646451ed nvme: prepare for fault injection into admin commands
Currenlty fault injection support for nvme only enables to inject errors
into the commands submitted to I/O queues.

In preparation for fault injection into the admin commands, this makes
the helper functions independent of struct nvme_ns.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:15:50 +02:00
Keith Busch 1a87ee657c nvme: export get and set features
Future use intends to make use of both, so export these functions. And
since their implementation is identical except for the opcode, provide a
new function that implement both.

[akinobu.mita@gmail.com>: fix line over 80 characters]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Anton Eidelman 2181e45561 nvme: fix possible io failures when removing multipathed ns
When a shared namespace is removed, we call blk_cleanup_queue()
when the device can still be accessed as the current path and this can
result in submission to a dying queue. Hence, direct_make_request()
called by our mpath device may fail (propagating the failure to userspace).
Instead, we want to failover this I/O to a different path if one exists.
Thus, before we cleanup the request queue, we make sure that the device is
cleared from the current path nor it can be selected again as such.

Fix this by:
- clear the ns from the head->list and synchronize rcu to make sure there is
  no concurrent path search that restores it as the current path
- clear the mpath current path in order to trigger a subsequent path search
  and sync srcu to wait for any ongoing request submissions
- safely continue to namespace removal and blk_cleanup_queue

Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-06-21 11:08:38 +02:00
Jaesoo Lee c8e8c77b3b nvme: Fix u32 overflow in the number of namespace list calculation
The Number of Namespaces (nn) field in the identify controller data structure is
defined as u32 and the maximum allowed value in NVMe specification is
0xFFFFFFFEUL. This change fixes the possible overflow of the DIV_ROUND_UP()
operation used in nvme_scan_ns_list() by casting the nn to u64.

Signed-off-by: Jaesoo Lee <jalee@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2019-06-06 09:53:07 -07:00
Laine Walker-Avina 2d466c7a57 nvme: copy MTFA field from identify controller
We use the controller's reported maximum firmware activation time as our
timeout before resetting a controller for a failed activation notice,
but this value was never being read so we could only use the default
timeout. Copy the Identify Controller MTFA field to the corresponding
nvme_ctrl's mtfa field.

Fixes: b6dccf7fae (“nvme: add support for FW activation without reset”).
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Minwoo Im <minwoo.im@samsung.com>
Signed-off-by: Laine Walker-Avina <laine.walker-avina@intel.com>
[changelog, fix endian]
Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-21 09:01:37 -06:00
Yufen Yu 510a405d94 nvme: fix memory leak for power latency tolerance
Unconditionally hide device pm latency tolerance when uninitializing
the controller to ensure all qos resources are released so that we're
not leaking this memory. This is safe to call if none were allocated in
the first place, or were previously freed.

Fixes: c5552fde102fc("nvme: Enable autonomous power state transitions")
Suggested-by: Keith Busch <keith.busch@intel.com>
Tested-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17 11:08:09 -06:00
Christoph Hellwig 5fb4aac756 nvme: release namespace SRCU protection before performing controller ioctls
Holding the SRCU critical section protecting the namespace list can
cause deadlocks when using the per-namespace admin passthrough ioctl to
delete as namespace.  Release it earlier when performing per-controller
ioctls to avoid that.

Reported-by: Kenneth Heitke <kenneth.heitke@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-17 11:07:11 -06:00
Christoph Hellwig 90ec611adc nvme: merge nvme_ns_ioctl into nvme_ioctl
Merge the two functions to make future changes a little easier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17 11:07:10 -06:00
Christoph Hellwig 3f98bcc58c nvme: remove the ifdef around nvme_nvm_ioctl
We already have a proper stub if lightnvm is not enabled, so don't bother
with the ifdef.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17 11:07:08 -06:00
Christoph Hellwig 100c815cbd nvme: fix srcu locking on error return in nvme_get_ns_from_disk
If we can't get a namespace don't leak the SRCU lock.  nvme_ioctl was
working around this, but nvme_pr_command wasn't handling this properly.
Just do what callers would usually expect.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-17 11:06:59 -06:00
Keith Busch 6fa0321a96 nvme: Fix known effects
We're trying to append known effects to the ones reported in the
controller's log. The original patch accomplished this, but something
went wrong when patch was merged causing the effects log to override
the known effects.

Link: http://lists.infradead.org/pipermail/linux-nvme/2019-May/023710.html
Fixes: f4524cc456 ("nvme-pci: add known admin effects to augument admin effects log page")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17 11:05:35 -06:00
Keith Busch d6135c3a1e nvme-pci: Sync queues on reset
A controller with multiple namespaces may have multiple request_queues with
their own timeout work. If a controller fails with IO outstanding to
diffent namespaces, each request queue may attempt to handle it, so
ensure there is no previously scheduled timeout work executing prior to
starting controller initialization by synchronizing with each queue.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2019-05-17 11:04:34 -06:00
Linus Torvalds 1718de78e6 for-5.2/block-post-20190516
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzd7PYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpggWD/46Hmn6FuiXQ30HTJd9WKtJzenAAIdUpjq8
 +U985q7vvcqIUotMcG9VUOlCaxk79D5XbptInzLo5CRSn9vMv0sXmAHIFkoj201K
 gW3sHqajnWFFj60Eq5IVdHBZekvD8+bBZMvnX+S53QHOfwY+D1Nx/CtjkxNeq+48
 98kMA/Q1d87Ied6oMW6Nyc7UEN3SanTnntYRIeSrXOJPiwxVWT6SsPUC01VZcwrt
 NSt6IVoW2vFgU0sg8VetzCSfJyTzI0YytjTj/WKGQzuBiKFAvChWrrYZiZ/Z4587
 6W4SFR94nYkW5U1BKgrMp64KUEn20m+jk0IHRYApsFwutSBHJCeB9m2sddxur/GQ
 G/IyXZxv5jKFNBhUEiSedfml9OF+nBbwJGJCKF64Wnybk/gqFgxM1gzyw4fMAXr+
 qYQdETv02W0rDqUG9i3/CaXlN4Lf1IvLR8al4ao0LfDJ0TSXw+UviNsuHEHAv8ey
 sioREF8JacSj1q42TsRGckn3k4HVmaGyFwI3ceLT5bRq8VAhJ+cp7WqML1lUEmY0
 2iIz+PKPDSyigqrh1wvo8ZqhqHifo+0TbRkCOCi5j+PRX6GiYlrvShGevZXEZPqC
 lOFNDgCH3VBTvrcx3j05jJK1qvL4QWAwb/rDUsHZVbsnSVTEHxs/3BsIFQNZpE9/
 AoXCH/ye0Q==
 =ZKv1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block

Pull more block updates from Jens Axboe:
 "This is mainly some late lightnvm changes that came in just before the
  merge window, as well as fixes that have been queued up since the
  initial pull request was frozen.

  This contains:

   - lightnvm changes, fixing race conditions, improving memory
     utilization, and improving pblk compatability (Chansol, Igor,
     Marcin)

   - NVMe pull request with minor fixes all over the map (via Christoph)

   - remove redundant error print in sata_rcar (Geert)

   - struct_size() cleanup (Jackie)

   - dasd CONFIG_LBADF warning fix (Ming)

   - brd cond_resched() improvement (Mikulas)"

* tag 'for-5.2/block-post-20190516' of git://git.kernel.dk/linux-block: (41 commits)
  block/bio-integrity: use struct_size() in kmalloc()
  nvme: validate cntlid during controller initialisation
  nvme: change locking for the per-subsystem controller list
  nvme: trace all async notice events
  nvme: fix typos in nvme status code values
  nvme-fabrics: remove unused argument
  nvme-multipath: avoid crash on invalid subsystem cntlid enumeration
  nvme-fc: use separate work queue to avoid warning
  nvme-rdma: remove redundant reference between ib_device and tagset
  nvme-pci: mark expected switch fall-through
  nvme-pci: add known admin effects to augument admin effects log page
  nvme-pci: init shadow doorbell after each reset
  brd: add cond_resched to brd_free_pages
  sata_rcar: Remove ata_host_alloc() error printing
  s390/dasd: fix build warning in dasd_eckd_build_cp_raw
  lightnvm: pblk: use nvm_rq_to_ppa_list()
  lightnvm: pblk: simplify partial read path
  lightnvm: do not remove instance under global lock
  lightnvm: track inflight target creations
  lightnvm: pblk: recover only written metadata
  ...
2019-05-16 19:08:15 -07:00
Christoph Hellwig 1b1031ca63 nvme: validate cntlid during controller initialisation
The CNTLID value is required to be unique, and we do rely on this
for correct operation. So reject any controller for which a non-unique
CNTLID has been detected.

Based on a patch from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2019-05-14 17:19:50 +02:00
Christoph Hellwig 32fd90c407 nvme: change locking for the per-subsystem controller list
Life becomes a lot simpler if we just use the global
nvme_subsystems_lock to protect this list.  Given that it is only
accessed during controller probing and removal that isn't a scalability
problem either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2019-05-14 17:19:50 +02:00
Chaitanya Kulkarni 521cfb8e5a nvme: trace all async notice events
This patch removes the tracing of the NVMe Async events out of the
switch so that it can trace all the events including the ones which
are not handled in the nvme_handle_aen_notice(). The events which
are not handled in the nvme_handle_aen_notice() such as
NVME_AER_NOTICE_DISC_CHANGED corresponding event identifier needs
to be added in the drivers/nvme/host/trace.h so that it can stringify
the AER .

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-14 17:19:47 +02:00
Maxim Levitsky f4524cc456 nvme-pci: add known admin effects to augument admin effects log page
Add known admin effects even if hardware has known admin effects page,
since hardware can't be ever trusted to report sane values.
(on my Intel DC P3700, it reports no side effects for namespace format)

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-05-13 16:00:03 +02:00
Linus Torvalds 67a2422239 for-5.2/block-20190507
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzR0AAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpo0MD/47D1kBK9rGzkAwIz1Jkh1Qy/ITVaDJzmHJ
 UP5uncQsgKFLKMR1LbRcrWtmk2MwFDNULGbteHFeCYE1ypCrTgpWSp5+SJluKd1Q
 hma9krLSAXO9QiSaZ4jafshXFIZxz6IjakOW8c9LrT80Ze47yh7AxiLwDafcp/Jj
 x6NW790qB7ENDtfarDkZk14NCS8HGLRHO5B21LB+hT0Kfbh0XZaLzJdj7Mck1wPA
 VT8hL9mPuA++AjF7Ra4kUjwSakgmajTa3nS2fpkwTYdztQfas7x5Jiv7FWxrrelb
 qbabkNkWKepcHAPEiZR7o53TyfCucGeSK/jG+dsJ9KhNp26kl1ci3frl5T6PfVMP
 SPPDjsKIHs+dqFrU9y5rSGhLJqewTs96hHthnLGxyF67+5sRb5+YIy+dcqgiyc/b
 TUVyjCD6r0cO2q4v9VhwnhOyeBUA9Rwbu8nl7JV5Q45uG7qI4BC39l1jfubMNDPO
 GLNGUUzb6ER7z6lYINjRSF2Jhejsx8SR9P7jhpb1Q7k/VvDDxO1T4FpwvqWFz9+s
 Gn+s6//+cA6LL+42eZkQjvwF2CUNE7TaVT8zdb+s5HP1RQkZToqUnsQCGeRTrFni
 RqWXfW9o9+awYRp431417oMdX/LvLGq9+ZtifRk9DqDcowXevTaf0W2RpplWSuiX
 RcCuPeLAVg==
 =Ot0g
 -----END PGP SIGNATURE-----

Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Nothing major in this series, just fixes and improvements all over the
  map. This contains:

   - Series of fixes for sed-opal (David, Jonas)

   - Fixes and performance tweaks for BFQ (via Paolo)

   - Set of fixes for bcache (via Coly)

   - Set of fixes for md (via Song)

   - Enabling multi-page for passthrough requests (Ming)

   - Queue release fix series (Ming)

   - Device notification improvements (Martin)

   - Propagate underlying device rotational status in loop (Holger)

   - Removal of mtip32xx trim support, which has been disabled for years
     (Christoph)

   - Improvement and cleanup of nvme command handling (Christoph)

   - Add block SPDX tags (Christoph)

   - Cleanup/hardening of bio/bvec iteration (Christoph)

   - A few NVMe pull requests (Christoph)

   - Removal of CONFIG_LBDAF (Christoph)

   - Various little fixes here and there"

* tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
  block: fix mismerge in bvec_advance
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
  blk-mq: always free hctx after request queue is freed
  blk-mq: split blk_mq_alloc_and_init_hctx into two parts
  blk-mq: free hw queue's resource in hctx's release handler
  blk-mq: move cancel of requeue_work into blk_mq_release
  blk-mq: grab .q_usage_counter when queuing request from plug code path
  block: fix function name in comment
  nvmet: protect discovery change log event list iteration
  nvme: mark nvme_core_init and nvme_core_exit static
  nvme: move command size checks to the core
  nvme-fabrics: check more command sizes
  nvme-pci: check more command sizes
  nvme-pci: remove an unneeded variable initialization
  nvme-pci: unquiesce admin queue on shutdown
  nvme-pci: shutdown on timeout during deletion
  nvme-pci: fix psdt field for single segment sgls
  nvme-multipath: don't print ANA group state by default
  nvme-multipath: split bios with the ns_head bio_set before submitting
  ...
2019-05-07 18:14:36 -07:00
Christoph Hellwig 893a74b7a7 nvme: mark nvme_core_init and nvme_core_exit static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-01 09:18:47 -04:00
Christoph Hellwig 811015409f nvme: move command size checks to the core
Most command aren't PCIe specific, so move the size checking for them
to core.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
2019-05-01 09:18:36 -04:00
Sagi Grimberg 01fa017484 nvme: set 0 capacity if namespace block size exceeds PAGE_SIZE
If our target exposed a namespace with a block size that is greater
than PAGE_SIZE, set 0 capacity on the namespace as we do not support it.

This issue encountered when the nvmet namespace was backed by a tempfile.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-25 16:51:42 +02:00
Jens Axboe 5c61ee2cd5 Linux 5.1-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAly8rGYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGmZMH/1IRB0E1Qmzz8yzw
 wj79UuRGYPqxDDSWW+wNc8sU4Ic7iYirn9APHAztCdQqsjmzU/OVLfSa3JhdBe5w
 THo7pbGKBqEDcWnKfNk/21jXFNLZ1vr9BoQv2DGU2MMhHAyo/NZbalo2YVtpQPmM
 OCRth5n+LzvH7rGrX7RYgWu24G9l3NMfgtaDAXBNXesCGFAjVRrdkU5CBAaabvtU
 4GWh/nnutndOOLdByL3x+VZ3H3fIBnbNjcIGCglvvqzk7h3hrfGEl4UCULldTxcM
 IFsfMUhSw1ENy7F6DHGbKIG90cdCJcrQ8J/ziEzjj/KLGALluutfFhVvr6YCM2J6
 2RgU8CY=
 =CfY1
 -----END PGP SIGNATURE-----

Merge tag 'v5.1-rc6' into for-5.2/block

Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
comment, and is trivial. The other one is a conflict due to a later fix
in the bio multi-page work, and needs a bit more care.

* tag 'v5.1-rc6': (770 commits)
  Linux 5.1-rc6
  block: make sure that bvec length can't be overflow
  block: kill all_q_node in request_queue
  x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
  coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
  mm/kmemleak.c: fix unused-function warning
  init: initialize jump labels before command line option parsing
  kernel/watchdog_hld.c: hard lockup message should end with a newline
  kcov: improve CONFIG_ARCH_HAS_KCOV help text
  mm: fix inactive list balancing between NUMA nodes and cgroups
  mm/hotplug: treat CMA pages as unmovable
  proc: fixup proc-pid-vm test
  proc: fix map_files test on F29
  mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
  mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
  mm: swapoff: shmem_unuse() stop eviction without igrab()
  mm: swapoff: take notice of completion sooner
  mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
  mm: swapoff: shmem_find_swap_entries() filter out other types
  slab: store tagged freelist for off-slab slabmgmt
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-22 09:47:36 -06:00
Ingo Molnar 94e4dcc75a Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU and LKMM commits from Paul E. McKenney:

 - An LKMM commit adding support for synchronize_srcu_expedited()
 - A couple of straggling RCU flavor consolidation updates
 - Documentation updates.
 - Miscellaneous fixes
 - SRCU updates
 - RCU CPU stall-warning updates
 - Torture-test updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-04-18 14:42:24 +02:00
Ming Lei eb3afb75b5 nvme: cancel request synchronously
nvme_cancel_request() is used in error handler, and it is always
reliable to cancel request synchronously, and avoids possible race
in which request may be completed after real hw queue is destroyed.

One issue is reported by our customer on NVMe RDMA, in which freed ib
queue pair may be used in nvme_rdma_complete_rq().

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: James Smart <james.smart@broadcom.com>
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-10 09:57:35 -06:00
Kenneth Heitke d0de579c04 nvme: log the error status on Identify Namespace failure
Identify Namespace failures are logged as a warning but there is not
an indication of the cause for the failure. Update the log message to
include the error status.

Signed-off-by: Kenneth Heitke <kenneth.heitke@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05 08:07:58 +02:00
Max Gurtovoy 43e2d08d07 nvme: avoid double dereference to convert le to cpu
Use le16_to_cpu instead of le16_to_cpup and le64_to_cpu instead of
le64_to_cpup. This will also align the code to nvme-core driver
convention.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-04-05 08:07:56 +02:00
Paul E. McKenney f5ad399149 srcu: Remove cleanup_srcu_struct_quiesced()
The cleanup_srcu_struct_quiesced() function was added because NVME
used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that
NVME workqueues waiting on SRCU workqueues could result in deadlocks
during low-memory conditions.  However, SRCU now also has WQ_MEM_RECLAIM
workqueues, so there is no longer a potential for deadlock.  Furthermore,
it turns out to be extremely hard to use cleanup_srcu_struct_quiesced()
correctly due to the fact that SRCU callback invocation accesses the
srcu_struct structure's per-CPU data area just after callbacks are
invoked.  Therefore, the usual practice of using srcu_barrier() to wait
for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced()
fails because SRCU's callback-invocation workqueue handler might be
delayed, which can result in cleanup_srcu_struct_quiesced() being invoked
(and thus freeing the per-CPU data) before the SRCU's callback-invocation
workqueue handler is finished using that per-CPU data.  Nor is this a
theoretical problem: KASAN emitted use-after-free warnings because of
this problem on actual runs.

In short, NVME can now safely invoke cleanup_srcu_struct(), which
avoids the use-after-free scenario.  And cleanup_srcu_struct_quiesced()
is quite difficult to use safely.  This commit therefore removes
cleanup_srcu_struct_quiesced(), switching its sole user back to
cleanup_srcu_struct().  This effectively reverts the following pair
of commits:

f7194ac32c ("srcu: Add cleanup_srcu_struct_quiesced()")
4317228ad9 ("nvme: Avoid flush dependency in delete controller flow")

Reported-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Bart Van Assche <bvanassche@acm.org>
2019-03-26 14:39:24 -07:00
Christoph Hellwig 9f0916ab93 nvme: add proper write zeroes setup for the multipath device
Add a gendisk argument to nvme_config_write_zeroes so that the call to
nvme_update_disk_info for the multipath device node updates the
proper request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:57:34 -06:00
Christoph Hellwig 2631857160 nvme: add proper discard setup for the multipath device
Add a gendisk argument to nvme_config_discard so that the call to
nvme_update_disk_info for the multipath device node updates the
proper request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:57:34 -06:00
Christoph Hellwig b1aafb35b4 nvme: remove nvme_ns_config_oncs
Just opencode the two function calls in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:57:34 -06:00
Christoph Hellwig 7b210e4ed5 nvme: disable Write Zeroes for qemu controllers
Qemu started out with a broken implementation of Write Zeroes written
by yours truly.  Disable Write Zeroes on qemu for now, eventually
we need to go back and make all the qemu quirks version specific,
but that is left for another time.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:57:34 -06:00
Yufen Yu 01fc08ff1f nvme: update comment to make the code easier to read
After commit a686ed75c0 ("nvme: introduce a helper function for
controller deletion), nvme_delete_ctrl_sync no longer use flush_work.
Update comment, accordingly.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:05:39 -06:00
Sagi Grimberg a63b83700b nvme: put ns_head ref if namespace fails allocation
In case nvme_alloc_ns fails after we initialize ns_head but before we
add the ns to the controller namespaces list we need to explicitly put
the ns_head reference because when we teardown the controller we
won't find it, causing us to leak a dangling subsystem eventually.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:05:39 -06:00
Keith Busch 415df90b43 nvme: don't warn on block content change effects
A write or flush IO passthrough command is expected to change the
logical block content, so don't warn on these as no additional handling
is necessary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-03-13 12:05:39 -06:00
Christoph Hellwig bc50ad7501 nvme: convert to SPDX identifiers
Update license to use SPDX-License-Identifier instead of verbose license
text.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2019-02-20 07:22:28 -07:00
Hannes Reinecke ab4ab09cbd nvme: return error from nvme_alloc_ns()
nvme_alloc_ns() might fail, so we should be returning an error code.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:21:00 -07:00
Bart Van Assche b9c77583b0 nvme: avoid that deleting a controller triggers a circular locking complaint
Rework nvme_delete_ctrl_sync() such that it does not have to wait for
queued work. This patch avoids that test nvme/008 triggers the following
complaint:

WARNING: possible circular locking dependency detected
5.0.0-rc6-dbg+ #10 Not tainted
------------------------------------------------------
nvme/7918 is trying to acquire lock:
000000009a1a7b69 ((work_completion)(&ctrl->delete_work)){+.+.}, at: __flush_work+0x379/0x410

but task is already holding lock:
00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (kn->count#389){++++}:
       lock_acquire+0xc5/0x1e0
       __kernfs_remove+0x42a/0x4a0
       kernfs_remove_by_name_ns+0x45/0x90
       remove_files.isra.1+0x3a/0x90
       sysfs_remove_group+0x5c/0xc0
       sysfs_remove_groups+0x39/0x60
       device_remove_attrs+0x68/0xb0
       device_del+0x24d/0x570
       cdev_device_del+0x1a/0x50
       nvme_delete_ctrl_work+0xbd/0xe0
       process_one_work+0x4f1/0xa40
       worker_thread+0x67/0x5b0
       kthread+0x1cf/0x1f0
       ret_from_fork+0x24/0x30

-> #0 ((work_completion)(&ctrl->delete_work)){+.+.}:
       __lock_acquire+0x1323/0x17b0
       lock_acquire+0xc5/0x1e0
       __flush_work+0x399/0x410
       flush_work+0x10/0x20
       nvme_delete_ctrl_sync+0x65/0x70
       nvme_sysfs_delete+0x4f/0x60
       dev_attr_store+0x3e/0x50
       sysfs_kf_write+0x87/0xa0
       kernfs_fop_write+0x186/0x240
       __vfs_write+0xd7/0x430
       vfs_write+0xfa/0x260
       ksys_write+0xab/0x130
       __x64_sys_write+0x43/0x50
       do_syscall_64+0x71/0x210
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(kn->count#389);
                               lock((work_completion)(&ctrl->delete_work));
                               lock(kn->count#389);
  lock((work_completion)(&ctrl->delete_work));

 *** DEADLOCK ***

3 locks held by nvme/7918:
 #0: 00000000e2223b44 (sb_writers#6){.+.+}, at: vfs_write+0x1eb/0x260
 #1: 000000003404976f (&of->mutex){+.+.}, at: kernfs_fop_write+0x128/0x240
 #2: 00000000ef5a45b4 (kn->count#389){++++}, at: kernfs_remove_self+0x196/0x210

stack backtrace:
CPU: 4 PID: 7918 Comm: nvme Not tainted 5.0.0-rc6-dbg+ #10
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_circular_bug.isra.36.cold.54+0x173/0x1d5
 check_prev_add.constprop.45+0x996/0x1110
 __lock_acquire+0x1323/0x17b0
 lock_acquire+0xc5/0x1e0
 __flush_work+0x399/0x410
 flush_work+0x10/0x20
 nvme_delete_ctrl_sync+0x65/0x70
 nvme_sysfs_delete+0x4f/0x60
 dev_attr_store+0x3e/0x50
 sysfs_kf_write+0x87/0xa0
 kernfs_fop_write+0x186/0x240
 __vfs_write+0xd7/0x430
 vfs_write+0xfa/0x260
 ksys_write+0xab/0x130
 __x64_sys_write+0x43/0x50
 do_syscall_64+0x71/0x210
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:19:42 -07:00
Bart Van Assche a686ed75c0 nvme: introduce a helper function for controller deletion
This patch does not change any functionality but makes the next patch
in this series easier to read.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:21 -07:00
Bart Van Assche d84c4b024a nvme: unexport nvme_delete_ctrl_sync()
Since nvme_delete_ctrl_sync() is not called from any other kernel module,
unexport it.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:18:15 -07:00
Hannes Reinecke 75c10e7327 nvme-multipath: round-robin I/O policy
Implement a simple round-robin I/O policy for multipathing.  Path
selection is done in two rounds, first iterating across all optimized
paths, and if that doesn't return any valid paths, iterate over all
optimized and non-optimized paths.  If no paths are found, use the
existing algorithm.  Also add a sysfs attribute 'iopolicy' to switch
between the current NUMA-aware I/O policy and the 'round-robin' I/O
policy.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-20 07:17:49 -07:00
Jens Axboe 6fb845f0e7 Linux 5.0-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFRBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlxgqNUeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwsoH+OVXu0NQofwTvVru
 8lgF3BSDG2mhf7mxbBBlBizGVy9jnjRNGCFMC+Jq8IwiFLwprja/G27kaDTkpuF1
 PHC3yfjKvjTeUP5aNdHlmxv6j1sSJfZl0y46DQal4UeTG/Giq8TFTi+Tbz7Wb/WV
 yCx4Lr8okAwTuNhnL8ojUCVIpd3c8QsyR9v6nEQ14Mj+MvEbokyTkMJV0bzOrM38
 JOB+/X1XY4JPZ6o3MoXrBca3bxbAJzMneq+9CWw1U5eiIG3msg4a+Ua3++RQMDNr
 8BP0yCZ6wo32S8uu0PI6HrZaBnLYi5g9Wh7Q7yc0mn1Uh1zWFykA6TtqK90agJeR
 A6Ktjw==
 =scY4
 -----END PGP SIGNATURE-----

Merge tag 'v5.0-rc6' into for-5.1/block

Pull in 5.0-rc6 to avoid a dumb merge conflict with fs/iomap.c.
This is needed since io_uring is now based on the block branch,
to avoid a conflict between the multi-page bvecs and the bits
of io_uring that touch the core block parts.

* tag 'v5.0-rc6': (525 commits)
  Linux 5.0-rc6
  x86/mm: Make set_pmd_at() paravirt aware
  MAINTAINERS: Update the ocores i2c bus driver maintainer, etc
  blk-mq: remove duplicated definition of blk_mq_freeze_queue
  Blk-iolatency: warn on negative inflight IO counter
  blk-iolatency: fix IO hang due to negative inflight counter
  MAINTAINERS: unify reference to xen-devel list
  x86/mm/cpa: Fix set_mce_nospec()
  futex: Handle early deadlock return correctly
  futex: Fix barrier comment
  net: dsa: b53: Fix for failure when irq is not defined in dt
  blktrace: Show requests without sector
  mips: cm: reprime error cause
  mips: loongson64: remove unreachable(), fix loongson_poweroff().
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  KVM: nVMX: unconditionally cancel preemption timer in free_nested (CVE-2019-7221)
  KVM: x86: work around leak of uninitialized stack contents (CVE-2019-7222)
  kvm: fix kvm_ioctl_create_device() reference counting (CVE-2019-6974)
  signal: Better detection of synchronous signals
  ...
2019-02-15 08:43:59 -07:00
Keith Busch e7ad43c3ed nvme: lock NS list changes while handling command effects
If a controller supports the NS Change Notification, the namespace
scan_work is automatically triggered after attaching a new namespace.

Occasionally the namespace scan_work may append the new namespace to the
list before the admin command effects handling is completed. The effects
handling unfreezes namespaces, but if it unfreezes the newly attached
namespace, its request_queue freeze depth will be off and we'll hit the
warning in blk_mq_unfreeze_queue().

On the next namespace add, we will fail to freeze that queue due to the
previous bad accounting and deadlock waiting for frozen.

Fix that by preventing scan work from altering the namespace list while
command effects handling needs to pair freeze with unfreeze.

Reported-by: Wen Xiong <wenxiong@us.ibm.com>
Tested-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-06 16:35:33 +01:00
Sagi Grimberg 794a4cb3d2 nvme: remove the .stop_ctrl callout
It is used now just to flush error recovery and reconnect work items in
the RDMA and TCP transports, which can simply be moved to the
corresponding teardown routines.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04 15:41:25 +01:00
Chaitanya Kulkarni 6e02318eae nvme: add support for the Write Zeroes command
Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
device, if the device supports an optional command bit set for write
zeroes. Add support to setup write zeroes command. Set maximum possible
write zeroes sectors in one write zeroes command according to
nvme write zeroes command definition.

This patch was posted as a part of block-write-zeroes support
implementation (https://patchwork.kernel.org/patch/9454859/),
but did not make into mainline kernel as it got reverted due to
failure on the Linus's machine.

In this patch in order to be more cautious, we use NVMe controller's
maximum hardware sector size which is calculated based on the
controller's MDTS (Maximum Data Transfer Size) field to calculate
the maximum sectors for the write zeroes request.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
[folded a fix from Keith Busch to properly respect
 NVME_QUIRK_DEALLOCATE_ZEROES]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-02-04 15:41:25 +01:00
Andrey Smirnov b8a38ea64d nvme: don't initlialize ctrl->cntlid twice
ctrl->cntlid will already be initialized from id->cntlid for
non-NVME_F_FABRICS controllers few lines below. For NVME_F_FABRICS
controllers this field should already be initialized, otherwise the
check

	if (ctrl->cntlid != le16_to_cpu(id->cntlid))

below will always be a no-op.

Signed-off-by: Andrey Smirnov <andrew.smirnov@gmail.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:08 -05:00
James Dingwall 6299358d19 nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN
If a device provides an NQN it is expected to be globally unique.
Unfortunately some firmware revisions for Intel 760p/Pro 7600p devices did
not satisfy this requirement.  In these circumstances if a system has >1
affected device then only one device is enabled.  If this quirk is enabled
then the device supplied subnqn is ignored and we fallback to generating
one as if the field was empty.  In this case we also suppress the version
check so we don't print a warning when the quirk is enabled.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: James Dingwall <james@dingwall.me.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:08 -05:00
Keith Busch 3da584f571 nvme: pad fake subsys NQN vid and ssvid with zeros
We need to preserve the leading zeros in the vid and ssvid when generating
a unique NQN. Truncating these may lead to naming collisions.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-09 13:47:06 -05:00
Sagi Grimberg 6287b51c77 nvme-core: optionally poll sync commands
Pass poll bool to indicate that we need it to poll. This prepares us for
polling support in nvmf since connect is an I/O that will be queued
and has to be polled in order to complete. If poll is passed,
we call nvme_execute_rq_polled which sends the requests and polls
for its completion.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-18 17:50:48 +01:00
Sagi Grimberg 092ff05200 nvme: fix kernel paging oops
free the controller discard_page correctly.

Fixes: cb5b7262b0 ("nvme: provide fallback for discard alloc failure")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-13 13:58:23 -07:00
Chaitanya Kulkarni b7c8f3663d nvme: remove nvme_common command cdw10 array
This is a preparation patch which removes the nvme common command cdw10
array and replace with individual fields. This is needed for the nvmet
error log page implementation make is error log page entry offset
assignment easier.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-13 09:59:01 +01:00
Jens Axboe cb5b7262b0 nvme: provide fallback for discard alloc failure
When boxes are run near (or to) OOM, we have a problem with the discard
page allocation in nvme. If we fail allocating the special page, we
return busy, and it'll get retried. But since ordering is honored for
dispatch requests, we can keep retrying this same IO and failing. Behind
that IO could be requests that want to free memory, but they never get
the chance.

Allocate a fixed discard page per controller for a safe fallback, and use
that if the initial allocation fails.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-13 09:59:00 +01:00
Chengguang Xu 8eb5d89f48 nvme: add __exit annotation
Add __exit annotation to cleanup helper which is only
called once in the module.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-13 09:58:59 +01:00
Matias Bjørling 85136c0102 lightnvm: simplify geometry enumeration
Currently the geometry of an OCSSD is enumerated using a two step
approach:

First, nvm_register is called, the OCSSD identify command is issued,
and second the geometry sos and csecs values are read either from the
OCSSD identify if it is a 1.2 drive, or from the NVMe namespace data
structure if it is a 2.0 device.

This patch recombines it into a single step, such that nvm_register can
use the csecs and sos fields independent of which version is used. This
enables one to dynamically size the lightnvm subsystem dma pool.

Reviewed-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-11 12:22:34 -07:00
Jens Axboe 96f774106e Linux 4.20-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwNpb0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGwGwH/00UHnXfxww3ixxz
 zwTVDzptA6SPm6s84yJOWatM5fXhPiAltZaHSYF9lzRzNU71NCq7Frhq3fQUIXKM
 OxqDn9nfSTWcjWTk2q5n2keyRV/KIn67YX7UgqFc1bO/mqtVjEgNWaMyblhI+e9E
 giu1ZXayHr43jK1cDOmGExZubXUq7Vsc9TOlrd+d2SwIqeEP7TCMrPhnHDwCNvX2
 UU5dtANpVzGtHaBcr37wJj+L8kODCc0f+PQ3g2ar5jTHst5SLlHp2u0AMRnUmgdi
 VkGx+mu/uk8mtwUqMIMqhplklVoqK6LTeLqsY5Xt32SKruw9UqyJGdphLjW2QP/g
 MkmA1lI=
 =7kaD
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc6' into for-4.21/block

Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
two corruption fixes. We're going to be overhauling the direct dispatch
path, and we need to do that on top of the changes we made for that
in mainline.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-09 17:45:40 -07:00
Keith Busch 49cd84b6f8 nvme: implement Enhanced Command Retry
A controller may have an internal state that is not able to successfully
process commands for a short duration. In such states, an immediate
command requeue is expected to fail. The driver may exceed its max
retry count, which permanently ends the command in failure when the same
command would succeed after waiting for the controller to be ready.

NVMe ratified TP 4033 provides a delay hint in the completion status
code for failed commands. Implement the retry delay based on the command
completion status and the controller's requested delay.

Note that requeued commands are handled per request_queue, not per
individual request. If multiple commands fail, the controller should
consistently report the desired delay time for retryable commands in
all CQEs, otherwise the requeue list may be kicked too soon.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:58 -07:00
Israel Rukshin 5c4072ad1c nvme: Remove unused forward declaration
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:57 -07:00
Sagi Grimberg 6e3ca03ee9 nvme: support traffic based keep-alive
If the controller supports traffic based keep alive, we restart the keep
alive timer if any admin or io commands was completed during the kato
period.  This prevents a possible starvation of keep alive commands in
the presence of heavy traffic as in such case, we already have a health
indication from the host perspective.

Only set a comp_seen indicator in case the controller supports keep
alive to minimize the overhead for pci controllers.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Sagi Grimberg 3e53ba38a9 nvme: cache controller attributes
We get the controller attributes in identify, cache them as we'll need
them for traffic based keep alive support.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Hannes Reinecke 103e515efa nvme: add a numa_node field to struct nvme_ctrl
Instead of directly poking into the struct device add a new numa_node
field to struct nvme_ctrl.  This allows fabrics drivers where ctrl->dev
is a virtual device to support NUMA affinity as well.

Also expose the field as a sysfs attribute, and populate it for the
RDMA and FC transports.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
Chaitanya Kulkarni 1190203555 nvme: consolidate memset calls in the nvme_setup_cmd path
In function nvme_setup_cmd() we call command specific setup function
for flush, rw, and discard. Instead of calling memset in each function
lets call it once in the parent function.

This is purely code cleanup patch and it does not change any existing
functionality.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07 22:26:55 -07:00
James Smart 86880d6461 nvme: validate controller state before rescheduling keep alive
Delete operations are seeing NULL pointer references in call_timer_fn.
Tracking these back, the timer appears to be the keep alive timer.

nvme_keep_alive_work() which is tied to the timer that is cancelled
by nvme_stop_keep_alive(), simply starts the keep alive io but doesn't
wait for it's completion. So nvme_stop_keep_alive() only stops a timer
when it's pending. When a keep alive is in flight, there is no timer
running and the nvme_stop_keep_alive() will have no affect on the keep
alive io. Thus, if the io completes successfully, the keep alive timer
will be rescheduled.   In the failure case, delete is called, the
controller state is changed, the nvme_stop_keep_alive() is called while
the io is outstanding, and the delete path continues on. The keep
alive happens to successfully complete before the delete paths mark it
as aborted as part of the queue termination, so the timer is restarted.
The delete paths then tear down the controller, and later on the timer
code fires and the timer entry is now corrupt.

Fix by validating the controller state before rescheduling the keep
alive. Testing with the fix has confirmed the condition above was hit.

Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-12-07 07:11:11 -08:00
Jens Axboe 89d04ec349 Linux 4.20-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlwEZdIeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGAlQH/19oax2Za3IPqF4X
 DM3lal5M6zlUVkoYstqzpbR3MqUwgEnMfvoeMDC6mI9N4/+r2LkV7cRR8HzqQCCS
 jDfD69IzRGb52VSeJmbOrkxBWsR1Nn0t4Z3rEeLPxwaOoNpRc8H973MbAQ2FKMpY
 S4Y3jIK1dNiRRxdh52NupVkQF+djAUwkBuVk/rrvRJmTDij4la03cuCDAO+Di9lt
 GHlVvygKw2SJhDR+z3ArwZNmE0ceCcE6+W7zPHzj2KeWuKrZg22kfUD454f2YEIw
 FG0hu9qecgtpYCkLSm2vr4jQzmpsDoyq3ZfwhjGrP4qtvPC3Db3vL3dbQnkzUcJu
 JtwhVCE=
 =O1q1
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc5' into for-4.21/block

Pull in v4.20-rc5, solving a conflict we'll otherwise get in aio.c and
also getting the merge fix that went into mainline that users are
hitting testing for-4.21/block and/or for-next.

* tag 'v4.20-rc5': (664 commits)
  Linux 4.20-rc5
  PCI: Fix incorrect value returned from pcie_get_speed_cap()
  MAINTAINERS: Update linux-mips mailing list address
  ocfs2: fix potential use after free
  mm/khugepaged: fix the xas_create_range() error path
  mm/khugepaged: collapse_shmem() do not crash on Compound
  mm/khugepaged: collapse_shmem() without freezing new_page
  mm/khugepaged: minor reorderings in collapse_shmem()
  mm/khugepaged: collapse_shmem() remember to clear holes
  mm/khugepaged: fix crashes due to misaccounted holes
  mm/khugepaged: collapse_shmem() stop if punched or truncated
  mm/huge_memory: fix lockdep complaint on 32-bit i_size_read()
  mm/huge_memory: splitting set mapping+index before unfreeze
  mm/huge_memory: rename freeze_page() to unmap_page()
  initramfs: clean old path before creating a hardlink
  kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace
  psi: make disabling/enabling easier for vendor kernels
  proc: fixup map_files test on arm
  debugobjects: avoid recursive calls with kmemleak
  userfaultfd: shmem: UFFDIO_COPY: set the page dirty if VM_WRITE is not set
  ...
2018-12-04 09:38:05 -07:00
Sagi Grimberg f6c8e432cb nvme: flush namespace scanning work just before removing namespaces
nvme_stop_ctrl can be called also for reset flow and there is no need to
flush the scan_work as namespaces are not being removed. This can cause
deadlock in rdma, fc and loop drivers since nvme_stop_ctrl barriers
before controller teardown (and specifically I/O cancellation of the
scan_work itself) takes place, but the scan_work will be blocked anyways
so there is no need to flush it.

Instead, move scan_work flush to nvme_remove_namespaces() where it really
needs to flush.

Reported-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed by: James Smart <jsmart2021@gmail.com>
Tested-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-11-30 17:23:23 +01:00
Igor Konopko 751a0cc0cd nvme-pci: fix surprise removal
When a PCIe NVMe device is not present, nvme_dev_remove_admin() calls
blk_cleanup_queue() on the admin queue, which frees the hctx for that
queue.  Moments later, on the same path nvme_kill_queues() calls
blk_mq_unquiesce_queue() on admin queue and tries to access hctx of it,
which leads to following OOPS:

Oops: 0000 [#1] SMP PTI
RIP: 0010:sbitmap_any_bit_set+0xb/0x40
Call Trace:
 blk_mq_run_hw_queue+0xd5/0x150
 blk_mq_run_hw_queues+0x3a/0x50
 nvme_kill_queues+0x26/0x50
 nvme_remove_namespaces+0xb2/0xc0
 nvme_remove+0x60/0x140
 pci_device_remove+0x3b/0xb0

Fixes: cb4bfda62a ("nvme-pci: fix hot removal during error handling")
Signed-off-by: Igor Konopko <igor.j.konopko@intel.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-11-27 18:12:26 +01:00
Keith Busch d6a2b9535d nvme: Free ctrl device name on init failure
Free the kobject name that was allocated for the controller device on
failure rather than its parent.

Fixes: d22524a478 ("nvme: switch controller refcounting to use struct device")
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-11-27 08:35:15 +01:00
Jens Axboe a78b03bc73 Linux 4.20-rc3
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlvx2sAeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGycgIAIuxobwt0RRKa0zO
 ROS+34JGoC2yU2P9VdEGWdtxS6ANMVQgKPBhWL6s+xR89Kd+V4xSdJLD1pNTxxqP
 0DCva0np1/Q4juH+JbU50v/lykoLgteZ0P0LBRGf1y8p3WiLPv45IbnNsMDNYhB2
 7a8rOmZYakRY9CPznRDw3X8cJt3sddKgFJHIOGz1OQJVWtCD0KPGcJmQNsbDSagY
 Zx6Z5BKSIdjRqaAdN5gDa1Pft3WQo7TpaQGl80lSsgr5LcjmscXA3sClOCy+25Mo
 FZLx0PcwP+Efq8RTGzNK51WSOMa6d37hvjDqUAdQBOR0KbyjRyXQwyQVw/MGbPJs
 7J3Pzm0=
 =56Mt
 -----END PGP SIGNATURE-----

Merge tag 'v4.20-rc3' into for-4.21/block

Merge in -rc3 to resolve a few conflicts, but also to get a few
important fixes that have gone into mainline since the block
4.21 branch was forked off (most notably the SCSI queue issue,
which is both a conflict AND needed fix).

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-18 15:46:03 -07:00
Sagi Grimberg 8f676b8508 nvme: make sure ns head inherits underlying device limits
Whenever we update ns_head info, we need to make sure it is still
compatible with all underlying backing devices because although nvme
multipath doesn't have any explicit use of these limits, other devices
can still be stacked on top of it which may rely on the underlying limits.
Start with unlimited stacking limits, and every info update iterate over
siblings and adjust queue limits.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-09 06:14:47 -07:00
Jens Axboe 7baa85727d blk-mq-tag: change busy_iter_fn to return whether to continue or not
We have this functionality in sbitmap, but we don't export it in
blk-mq for users of the tags busy iteration. This can be useful
for stopping the iteration, if the caller doesn't need to find
more requests.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-11-08 10:24:07 -07:00
Linus Torvalds bd6bf7c104 pci-v4.20-changes
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlvPV7IUHGJoZWxnYWFz
 QGdvb2dsZS5jb20ACgkQWYigwDrT+vyaUg//WnCaRIu2oKOp8c/bplZJDW5eT10d
 oYAN9qeyptU9RYrg4KBNbZL9UKGFTk3AoN5AUjrk8njxc/dY2ra/79esOvZyyYQy
 qLXBvrXKg3yZnlNlnyBneGSnUVwv/kl2hZS+kmYby2YOa8AH/mhU0FIFvsnfRK2I
 XvwABFm2ZYvXCqh3e5HXaHhOsR88NQ9In0AXVC7zHGqv1r/bMVn2YzPZHL/zzMrF
 mS79tdBTH+shSvchH9zvfgIs+UEKvvjEJsG2liwMkcQaV41i5dZjSKTdJ3EaD/Y2
 BreLxXRnRYGUkBqfcon16Yx+P6VCefDRLa+RhwYO3dxFF2N4ZpblbkIdBATwKLjL
 npiGc6R8yFjTmZU0/7olMyMCm7igIBmDvWPcsKEE8R4PezwoQv6YKHBMwEaflIbl
 Rv4IUqjJzmQPaA0KkRoAVgAKHxldaNqno/6G1FR2gwz+fr68p5WSYFlQ3axhvTjc
 bBMJpB/fbp9WmpGJieTt6iMOI6V1pnCVjibM5ZON59WCFfytHGGpbYW05gtZEod4
 d/3yRuU53JRSj3jQAQuF1B6qYhyxvv5YEtAQqIFeHaPZ67nL6agw09hE+TlXjWbE
 rTQRShflQ+ydnzIfKicFgy6/53D5hq7iH2l7HwJVXbXRQ104T5DB/XHUUTr+UWQn
 /Nkhov32/n6GjxQ=
 =58I4
 -----END PGP SIGNATURE-----

Merge tag 'pci-v4.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:

 - Fix ASPM link_state teardown on removal (Lukas Wunner)

 - Fix misleading _OSC ASPM message (Sinan Kaya)

 - Make _OSC optional for PCI (Sinan Kaya)

 - Don't initialize ASPM link state when ACPI_FADT_NO_ASPM is set
   (Patrick Talbert)

 - Remove x86 and arm64 node-local allocation for host bridge structures
   (Punit Agrawal)

 - Pay attention to device-specific _PXM node values (Jonathan Cameron)

 - Support new Immediate Readiness bit (Felipe Balbi)

 - Differentiate between pciehp surprise and safe removal (Lukas Wunner)

 - Remove unnecessary pciehp includes (Lukas Wunner)

 - Drop pciehp hotplug_slot_ops wrappers (Lukas Wunner)

 - Tolerate PCIe Slot Presence Detect being hardwired to zero to
   workaround broken hardware, e.g., the Wilocity switch/wireless device
   (Lukas Wunner)

 - Unify pciehp controller & slot structs (Lukas Wunner)

 - Constify hotplug_slot_ops (Lukas Wunner)

 - Drop hotplug_slot_info (Lukas Wunner)

 - Embed hotplug_slot struct into users instead of allocating it
   separately (Lukas Wunner)

 - Initialize PCIe port service drivers directly instead of relying on
   initcall ordering (Keith Busch)

 - Restore PCI config state after a slot reset (Keith Busch)

 - Save/restore DPC config state along with other PCI config state
   (Keith Busch)

 - Reference count devices during AER handling to avoid race issue with
   concurrent hot removal (Keith Busch)

 - If an Upstream Port reports ERR_FATAL, don't try to read the Port's
   config space because it is probably unreachable (Keith Busch)

 - During error handling, use slot-specific reset instead of secondary
   bus reset to avoid link up/down issues on hotplug ports (Keith Busch)

 - Restore previous AER/DPC handling that does not remove and
   re-enumerate devices on ERR_FATAL (Keith Busch)

 - Notify all drivers that may be affected by error recovery resets
   (Keith Busch)

 - Always generate error recovery uevents, even if a driver doesn't have
   error callbacks (Keith Busch)

 - Make PCIe link active reporting detection generic (Keith Busch)

 - Support D3cold in PCIe hierarchies during system sleep and runtime,
   including hotplug and Thunderbolt ports (Mika Westerberg)

 - Handle hpmemsize/hpiosize kernel parameters uniformly, whether slots
   are empty or occupied (Jon Derrick)

 - Remove duplicated include from pci/pcie/err.c and unused variable
   from cpqphp (YueHaibing)

 - Remove driver pci_cleanup_aer_uncorrect_error_status() calls (Oza
   Pawandeep)

 - Uninline PCI bus accessors for better ftracing (Keith Busch)

 - Remove unused AER Root Port .error_resume method (Keith Busch)

 - Use kfifo in AER instead of a local version (Keith Busch)

 - Use threaded IRQ in AER bottom half (Keith Busch)

 - Use managed resources in AER core (Keith Busch)

 - Reuse pcie_port_find_device() for AER injection (Keith Busch)

 - Abstract AER interrupt handling to disconnect error injection (Keith
   Busch)

 - Refactor AER injection callbacks to simplify future improvments
   (Keith Busch)

 - Remove unused Netronome NFP32xx Device IDs (Jakub Kicinski)

 - Use bitmap_zalloc() for dma_alias_mask (Andy Shevchenko)

 - Add switch fall-through annotations (Gustavo A. R. Silva)

 - Remove unused Switchtec quirk variable (Joshua Abraham)

 - Fix pci.c kernel-doc warning (Randy Dunlap)

 - Remove trivial PCI wrappers for DMA APIs (Christoph Hellwig)

 - Add Intel GPU device IDs to spurious interrupt quirk (Bin Meng)

 - Run Switchtec DMA aliasing quirk only on NTB endpoints to avoid
   useless dmesg errors (Logan Gunthorpe)

 - Update Switchtec NTB documentation (Wesley Yung)

 - Remove redundant "default n" from Kconfig (Bartlomiej Zolnierkiewicz)

 - Avoid panic when drivers enable MSI/MSI-X twice (Tonghao Zhang)

 - Add PCI support for peer-to-peer DMA (Logan Gunthorpe)

 - Add sysfs group for PCI peer-to-peer memory statistics (Logan
   Gunthorpe)

 - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan
   Gunthorpe)

 - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan
   Gunthorpe)

 - Add PCI peer-to-peer DMA driver writer's documentation (Logan
   Gunthorpe)

 - Add block layer flag to indicate driver support for PCI peer-to-peer
   DMA (Logan Gunthorpe)

 - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P
   memory (Logan Gunthorpe)

 - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan
   Gunthorpe)

 - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan
   Gunthorpe)

 - Use PCI peer-to-peer memory in nvme (Stephen Bates, Steve Wise,
   Christoph Hellwig, Logan Gunthorpe)

 - Cache VF config space size to optimize enumeration of many VFs
   (KarimAllah Ahmed)

 - Remove unnecessary <linux/pci-ats.h> include (Bjorn Helgaas)

 - Fix VMD AERSID quirk Device ID matching (Jon Derrick)

 - Fix Cadence PHY handling during probe (Alan Douglas)

 - Signal Cadence Endpoint interrupts via AXI region 0 instead of last
   region (Alan Douglas)

 - Write Cadence Endpoint MSI interrupts with 32 bits of data (Alan
   Douglas)

 - Remove redundant controller tests for "device_type == pci" (Rob
   Herring)

 - Document R-Car E3 (R8A77990) bindings (Tho Vu)

 - Add device tree support for R-Car r8a7744 (Biju Das)

 - Drop unused mvebu PCIe capability code (Thomas Petazzoni)

 - Add shared PCI bridge emulation code (Thomas Petazzoni)

 - Convert mvebu to use shared PCI bridge emulation (Thomas Petazzoni)

 - Add aardvark Root Port emulation (Thomas Petazzoni)

 - Support 100MHz/200MHz refclocks for i.MX6 (Lucas Stach)

 - Add initial power management for i.MX7 (Leonard Crestez)

 - Add PME_Turn_Off support for i.MX7 (Leonard Crestez)

 - Fix qcom runtime power management error handling (Bjorn Andersson)

 - Update TI dra7xx unaligned access errata workaround for host mode as
   well as endpoint mode (Vignesh R)

 - Fix kirin section mismatch warning (Nathan Chancellor)

 - Remove iproc PAXC slot check to allow VF support (Jitendra Bhivare)

 - Quirk Keystone K2G to limit MRRS to 256 (Kishon Vijay Abraham I)

 - Update Keystone to use MRRS quirk for host bridge instead of open
   coding (Kishon Vijay Abraham I)

 - Refactor Keystone link establishment (Kishon Vijay Abraham I)

 - Simplify and speed up Keystone link training (Kishon Vijay Abraham I)

 - Remove unused Keystone host_init argument (Kishon Vijay Abraham I)

 - Merge Keystone driver files into one (Kishon Vijay Abraham I)

 - Remove redundant Keystone platform_set_drvdata() (Kishon Vijay
   Abraham I)

 - Rename Keystone functions for uniformity (Kishon Vijay Abraham I)

 - Add Keystone device control module DT binding (Kishon Vijay Abraham
   I)

 - Use SYSCON API to get Keystone control module device IDs (Kishon
   Vijay Abraham I)

 - Clean up Keystone PHY handling (Kishon Vijay Abraham I)

 - Use runtime PM APIs to enable Keystone clock (Kishon Vijay Abraham I)

 - Clean up Keystone config space access checks (Kishon Vijay Abraham I)

 - Get Keystone outbound window count from DT (Kishon Vijay Abraham I)

 - Clean up Keystone outbound window configuration (Kishon Vijay Abraham
   I)

 - Clean up Keystone DBI setup (Kishon Vijay Abraham I)

 - Clean up Keystone ks_pcie_link_up() (Kishon Vijay Abraham I)

 - Fix Keystone IRQ status checking (Kishon Vijay Abraham I)

 - Add debug messages for all Keystone errors (Kishon Vijay Abraham I)

 - Clean up Keystone includes and macros (Kishon Vijay Abraham I)

 - Fix Mediatek unchecked return value from devm_pci_remap_iospace()
   (Gustavo A. R. Silva)

 - Fix Mediatek endpoint/port matching logic (Honghui Zhang)

 - Change Mediatek Root Port Class Code to PCI_CLASS_BRIDGE_PCI (Honghui
   Zhang)

 - Remove redundant Mediatek PM domain check (Honghui Zhang)

 - Convert Mediatek to pci_host_probe() (Honghui Zhang)

 - Fix Mediatek MSI enablement (Honghui Zhang)

 - Add Mediatek system PM support for MT2712 and MT7622 (Honghui Zhang)

 - Add Mediatek loadable module support (Honghui Zhang)

 - Detach VMD resources after stopping root bus to prevent orphan
   resources (Jon Derrick)

 - Convert pcitest build process to that used by other tools (iio, perf,
   etc) (Gustavo Pimentel)

* tag 'pci-v4.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (140 commits)
  PCI/AER: Refactor error injection fallbacks
  PCI/AER: Abstract AER interrupt handling
  PCI/AER: Reuse existing pcie_port_find_device() interface
  PCI/AER: Use managed resource allocations
  PCI: pcie: Remove redundant 'default n' from Kconfig
  PCI: aardvark: Implement emulated root PCI bridge config space
  PCI: mvebu: Convert to PCI emulated bridge config space
  PCI: mvebu: Drop unused PCI express capability code
  PCI: Introduce PCI bridge emulated config space common logic
  PCI: vmd: Detach resources after stopping root bus
  nvmet: Optionally use PCI P2P memory
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
  block: Add PCI P2P flag for request queue
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
  PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
  ...
2018-10-25 06:50:48 -07:00
Linus Torvalds 6ab9e09238 for-4.20/block-20181021
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlvNQKgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+8D/9Iy6YIeoPwN10gYsqIh0P2fS3wKzL3kiww
 3vFsWO78PzgLxUlNmB7teLtNFc/R5mi8becZmAdvs9za5YFZk56o3Ifv1x+e+z00
 VY1/gxhiJD6suLeJ6lECnERGDaiWOZVRMo2TE17vxYGW6GGaa0Ts6PUUXmpla1u5
 WKctgt0Qv9WVNyiIdLdeHqzKJwsSSwNTt8fK7eFhy3x8e0CwJr+GtXckbbW3LFkY
 lug0npsTli3EmEPMovZhd25SjZmTk5GTM+ADZQ7Tnv5KXoDWB9jn6TcCSAi3G+5d
 5WUVwfnDyYJiH8qvlg5tRJ690muIy3xMOmpr7QBQ0YnR/LQ3EW+1CVfqD+qimgLH
 TXzlREXQpBP3YlxSDS5nddz4o5z84GZmC9B/43ujPaZKIQ6eBXYdkmQH7tPtSugm
 C6VGomR5tHotjxIiAsexh/5hAus+wW8bObKGTPTyINT0ub3XNclwCKLh26CgI9ie
 WvbS9g3j/KPvu/7s6weZpgD+cks0YdWe/XdXXxiHwsGI9h3J2aJna5RQt1rKWDm5
 wGCgbc/B8eSwiWx+GXlqdB9/Dy/bGXOnSTDnKpEVl1f5zNjeLwUKXbjvkMefWs4m
 jEIcquuDETORY+ZYEfa5YbmS4Lhskr0kzMVTVkZ++81tAWpSCU9Xh3IHrR8TNpt+
 J0oh0FHBDg==
 =LRTT
 -----END PGP SIGNATURE-----

Merge tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "This is the main pull request for block changes for 4.20. This
  contains:

   - Series enabling runtime PM for blk-mq (Bart).

   - Two pull requests from Christoph for NVMe, with items such as;
      - Better AEN tracking
      - Multipath improvements
      - RDMA fixes
      - Rework of FC for target removal
      - Fixes for issues identified by static checkers
      - Fabric cleanups, as prep for TCP transport
      - Various cleanups and bug fixes

   - Block merging cleanups (Christoph)

   - Conversion of drivers to generic DMA mapping API (Christoph)

   - Series fixing ref count issues with blkcg (Dennis)

   - Series improving BFQ heuristics (Paolo, et al)

   - Series improving heuristics for the Kyber IO scheduler (Omar)

   - Removal of dangerous bio_rewind_iter() API (Ming)

   - Apply single queue IPI redirection logic to blk-mq (Ming)

   - Set of fixes and improvements for bcache (Coly et al)

   - Series closing a hotplug race with sysfs group attributes (Hannes)

   - Set of patches for lightnvm:
      - pblk trace support (Hans)
      - SPDX license header update (Javier)
      - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
        specs behind a common core interface. (Javier, Matias)
      - Enable pblk to use a common interface to retrieve chunk metadata
        (Matias)
      - Bug fixes (Various)

   - Set of fixes and updates to the blk IO latency target (Josef)

   - blk-mq queue number updates fixes (Jianchao)

   - Convert a bunch of drivers from the old legacy IO interface to
     blk-mq. This will conclude with the removal of the legacy IO
     interface itself in 4.21, with the rest of the drivers (me, Omar)

   - Removal of the DAC960 driver. The SCSI tree will introduce two
     replacement drivers for this (Hannes)"

* tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
  block: setup bounce bio_sets properly
  blkcg: reassociate bios when make_request() is called recursively
  blkcg: fix edge case for blk_get_rl() under memory pressure
  nvme-fabrics: move controller options matching to fabrics
  nvme-rdma: always have a valid trsvcid
  mtip32xx: fully switch to the generic DMA API
  rsxx: switch to the generic DMA API
  umem: switch to the generic DMA API
  sx8: switch to the generic DMA API
  sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
  skd: switch to the generic DMA API
  ubd: remove use of blk_rq_map_sg
  nvme-pci: remove duplicate check
  drivers/block: Remove DAC960 driver
  nvme-pci: fix hot removal during error handling
  nvmet-fcloop: suppress a compiler warning
  nvme-core: make implicit seed truncation explicit
  nvmet-fc: fix kernel-doc headers
  nvme-fc: rework the request initialization code
  nvme-fc: introduce struct nvme_fcp_op_w_sgl
  ...
2018-10-22 17:46:08 +01:00
Logan Gunthorpe e0596ab290 nvme-pci: Add support for P2P memory in requests
For P2P requests, we must use the pci_p2pmem_map_sg() function instead of
the dma_map_sg functions.

With that, we can then indicate PCI_P2P support in the request queue.  For
this, we create an NVME_F_PCI_P2P flag which tells the core to set
QUEUE_FLAG_PCI_P2P in the request queue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2018-10-17 12:18:21 -05:00
Bart Van Assche 202359c007 nvme-core: make implicit seed truncation explicit
The nvme_user_io.slba field is 64 bits wide. That value is copied into the
32-bit bio_integrity_payload.bip_iter.bi_sector field. Make that truncation
explicit to avoid that Coverity complains about implicit truncation. See
also Coverity ID 1056486 on http://scan.coverity.com/projects/linux.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:30 +02:00
Bart Van Assche bb2a1d4e80 nvme-core: rework a NQN copying operation
Although it is easy to see that the code in nvme_init_subnqn() guarantees that
the subsys->nqn string is '\0'-terminated, apparently Coverity is not smart
enough to see this. Make it easier for Coverity to analyze this code by changing
the strncpy() call into a strlcpy() call. This patch does not change the
behavior of the code but fixes Coveritiy ID 1423720.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:25 +02:00
Bart Van Assche eb090c4c94 nvme-core: declare local symbols static
This patch avoids that sparse complains about missing declarations.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-17 08:58:25 +02:00
Keith Busch 48f78be332 nvme: remove ns sibling before clearing path
The code had been clearing a namespace being deleted as the current
path while that namespace was still in the path siblings list. It is
possible a new IO could set that namespace back to the current path
since it appeared to be an eligable path to select, which may result in
a use-after-free error.

This patch ensures a namespace being removed is not eligable to be reset
as a current path prior to clearing it as the current path.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-08 11:53:42 +02:00
Christoph Hellwig f333444708 nvme: take node locality into account when selecting a path
Make current_path an array with an entry for every possible node, and
cache the best path on a per-node basis.  Take the node distance into
account when selecting it.  This is primarily useful for dual-ported PCIe
devices which are connected to PCIe root ports on different sockets.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2018-10-01 14:16:14 -07:00
Chaitanya Kulkarni 09bd1ff4b1 nvme-core: add async event trace helper
This patch adds a new event for nvme async event notification.
We print the async event in the decoded format when we recognize
the event otherwise we just dump the result.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:12 -07:00
Milan P. Gandhi 53b3a66163 nvme: fix typo in nvme_identify_ns_descs
Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-10-01 14:16:08 -07:00
Hannes Reinecke 33b14f67a4 nvme: register ns_id attributes as default sysfs groups
We should be registering the ns_id attribute as default sysfs
attribute groups, otherwise we have a race condition between
the uevent and the attributes appearing in sysfs.

Suggested-by: Bart van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 08:30:29 -06:00
Hannes Reinecke fef912bf86 block: genhd: add 'groups' argument to device_add_disk
Update device_add_disk() to take an 'groups' argument so that
individual drivers can register a device with additional sysfs
attributes.
This avoids race condition the driver would otherwise have if these
groups were to be created with sysfs_add_groups().

Signed-off-by: Martin Wilck <martin.wilck@suse.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-28 08:30:28 -06:00
Chaitanya Kulkarni 1293477f4f nvme: set gendisk read only based on nsattr
NVMe 1.3 TP 4005 introduces new filed (NSATTR). This field indicates
whether given namespace is write protected or not. This patch sets the
gendisk associated with the namespace to read only based on the identify
namespace nsattr field.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-08-08 11:55:49 +02:00
Jens Axboe f87b0f0dfa Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block2
Pull NVMe changes from Christoph:

"This contains the support for TP4004, Asymmetric Namespace Access,
 which makes NVMe multipathing usable in practice."

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: use Retain Async Event bit to clear AEN
  nvmet: support configuring ANA groups
  nvmet: add minimal ANA support
  nvmet: track and limit the number of namespaces per subsystem
  nvmet: keep a port pointer in nvmet_ctrl
  nvme: add ANA support
  nvme: remove nvme_req_needs_failover
  nvme: simplify the API for getting log pages
  nvme.h: add ANA definitions
  nvme.h: add support for the log specific field

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:34:09 -06:00
Jens Axboe 05b9ba4b55 Linux 4.18-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltU8z0eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG5X8H/2fJr7m3k242+t76
 sitwvx1eoPqTgryW59dRKm9IuXAGA+AjauvHzaz1QxomeQa50JghGWefD0eiJfkA
 1AphQ/24EOiAbbVk084dAI/C2p122dE4D5Fy7CrfLnuouyrbFaZI5STbnrRct7sR
 9deeYW0GDHO1Uenp4WDCj0baaqJqaevZ+7GG09DnWpya2nQtSkGBjqn6GpYmrfOU
 mqFuxAX8mEOW6cwK16y/vYtnVjuuMAiZ63/OJ8AQ6d6ArGLwAsdn7f8Fn4I4tEr2
 L0d3CRLUyegms4++Dmlu05k64buQu46WlPhjCZc5/Ts4kjrNxBuHejj2/jeSnUSt
 vJJlibI=
 =42a5
 -----END PGP SIGNATURE-----

Merge tag 'v4.18-rc6' into for-4.19/block2

Pull in 4.18-rc6 to get the NVMe core AEN change to avoid a
merge conflict down the line.

Signed-of-by: Jens Axboe <axboe@kernel.dk>
2018-08-05 19:32:09 -06:00
Max Gurtovoy f7f1fc363a nvme: use blk API to remap ref tags for IOs with metadata
Also moved the logic of the remapping to the nvme core driver instead
of implementing it in the nvme pci driver. This way all the other nvme
transport drivers will benefit from it (in case they'll implement metadata
support).

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:27:04 -06:00
Max Gurtovoy ddd0bc7569 block: move ref_tag calculation func to the block layer
Currently this function is implemented in the scsi layer, but it's
actual place should be the block layer since T10-PI is a general
data integrity feature that is used in the nvme protocol as well.

Suggested-by: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-30 08:27:01 -06:00
Christoph Hellwig 0d0b660f21 nvme: add ANA support
Add support for Asynchronous Namespace Access as specified in NVMe 1.3
TP 4004.  With ANA each namespace attached to a controller belongs to an
ANA group that describes the characteristics of accessing the namespaces
through this controller.  In the optimized and non-optimized states
namespaces can be accessed regularly, although in a multi-pathing
environment we should always prefer to access a namespace through a
controller where an optimized relationship exists.  Namespaces in
Inaccessible, Permanent-Loss or Change state for a given controller
should not be accessed.

The states are updated through reading the ANA log page, which is read
once during controller initialization, whenever the ANA change notice
AEN is received, or when one of the ANA specific status codes that
signal a state change is received on a command.

The ANA state is kept in the nvme_ns structure, which makes the checks in
the fast path very simple.  Updating the ANA state when reading the log
page is also very simple, the only downside is that finding the initial
ANA state when scanning for namespaces is a bit cumbersome.

The gendisk for a ns_head is only registered once a live path for it
exists.  Without that the kernel would hang during partition scanning.

Includes fixes and improvements from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:08 +02:00
Christoph Hellwig 8decf5d5b9 nvme: remove nvme_req_needs_failover
Now that we just call out to blk_path_error there isn't really any good
reason to not merge it into the only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:05 +02:00
Christoph Hellwig 0e98719b0e nvme: simplify the API for getting log pages
Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes
a plain nsid instead of the nvme_ns pointer.  Also add support for the
log specific field while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:01 +02:00
Keith Busch 5d87eb94d9 nvme: use hw qid in trace events
We can not match a command to its completion based on the command
id alone. We need the submitting queue identifier to pair with the
completion, so this patch adds that to the trace buffer.

This patch is also collapsing the admin and IO submission traces into a
single one so we don't need to duplicate this and creating unnecessary
code branches: we know if the command is an admin vs IO based on the qid.

And since we're here, the patch fixes code formatting in the area.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: move the qid helper to nvme.h and made it an inline function]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:19 +02:00
James Smart 230f1f9e04 nvme: move init of keep_alive work item to controller initialization
Currently, the code initializes the keep alive work item whenever
nvme_start_keep_alive() is called. However, this routine is called
several times while reconnecting, etc. Although it's hoped that keep
alive is always disabled and not scheduled when start is called,
re-initing if it were scheduled or completing can have very bad
side effects. There's no need for re-initialization.

Move the keep_alive work item and cmd struct initialization to
controller init.

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:13 +02:00
Roland Dreier 9b38276813 nvme: fix handling of metadata_len for NVME_IOCTL_IO_CMD
The old code in nvme_user_cmd() passed the userspace virtual address
from nvme_passthru_cmd.metadata as the length of the metadata buffer
as well as the address to nvme_submit_user_cmd().

Fixes: 63263d60 ("nvme: Use metadata for passthrough commands")
Signed-off-by: Roland Dreier <roland@purestorage.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-20 07:43:59 -07:00
Weiping Zhang fa441b71aa nvme: don't enable AEN if not supported
Avoid excuting set_feature command if there is no supported bit in
Optional Asynchronous Events Supported (OAES).

Fixes: c0561f82 ("nvme: submit AEN event configuration on startup")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Weiping Zhang <zhangweiping@didichuxing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-17 07:30:27 -07:00
Scott Bauer cf39a6bc34 nvme: ensure forward progress during Admin passthru
If the controller supports effects and goes down during the passthru admin
command we will deadlock during namespace revalidation.

[  363.488275] INFO: task kworker/u16:5:231 blocked for more than 120 seconds.
[  363.488290]       Not tainted 4.17.0+ #2
[  363.488296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.488303] kworker/u16:5   D    0   231      2 0x80000000
[  363.488331] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  363.488338] Call Trace:
[  363.488385]  schedule+0x75/0x190
[  363.488396]  rwsem_down_read_failed+0x1c3/0x2f0
[  363.488481]  call_rwsem_down_read_failed+0x14/0x30
[  363.488504]  down_read+0x1d/0x80
[  363.488523]  nvme_stop_queues+0x1e/0xa0 [nvme_core]
[  363.488536]  nvme_dev_disable+0xae4/0x1620 [nvme]
[  363.488614]  nvme_reset_work+0xd1e/0x49d9 [nvme]
[  363.488911]  process_one_work+0x81a/0x1400
[  363.488934]  worker_thread+0x87/0xe80
[  363.488955]  kthread+0x2db/0x390
[  363.488977]  ret_from_fork+0x35/0x40

Fixes: 84fef62d13 ("nvme: check admin passthru command effects")
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Keith Busch <keith.busch@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-17 05:46:19 -07:00
Jens Axboe 943e942e62 nvme-pci: limit max IO size and segments to avoid high order allocations
nvme requires an sg table allocation for each request. If the request
is large, then the allocation can become quite large. For instance,
with our default software settings of 1280KB IO size, we'll need
10248 bytes of sg table. That turns into a 2nd order allocation,
which we can't always guarantee. If we fail the allocation, blk-mq
will retry it later. But there's no guarantee that we'll EVER be
able to allocate that much contigious memory.

Limit the IO size such that we never need more than a single page
of memory. That's a lot faster and more reliable. Then back that
allocation with a mempool, so that we know we'll always be able
to succeed the allocation at some point.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-21 18:59:46 +02:00
Jens Axboe 95c7c09f4c Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Christoph:

"Fix various little regressions introduced in this merge window, plus
 a rework of the fibre channel connect and reconnect path to share the
 code instead of having separate sets of bugs. Last but not least a
 trivial trace point addition from Hannes."

* 'nvme-4.18' of git://git.infradead.org/nvme:
  nvme-fabrics: fix and refine state checks in __nvmf_check_ready
  nvme-fabrics: handle the admin-only case properly in nvmf_check_ready
  nvme-fabrics: refactor queue ready check
  blk-mq: remove blk_mq_tagset_iter
  nvme: remove nvme_reinit_tagset
  nvme-fc: fix nulling of queue data on reconnect
  nvme-fc: remove reinit_request routine
  nvme-fc: change controllers first connect to use reconnect path
  nvme: don't rely on the changed namespace list log
  nvmet: free smart-log buffer after use
  nvme-rdma: fix error flow during mapping request data
  nvme: add bio remapping tracepoint
  nvme: fix NULL pointer dereference in nvme_init_subsystem
2018-06-15 08:11:05 -06:00
Christoph Hellwig 14dfa400f9 nvme: remove nvme_reinit_tagset
Unused now that all transports stopped using it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
2018-06-14 17:01:27 +02:00
Christoph Hellwig f493af37ab nvme: don't rely on the changed namespace list log
Don't optimize our namespace rescan based on the changed namespace list
log page as userspace might have changed the content through reading
it.

Suggested-by: Keith Busch <keith.busch@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@linux.intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2018-06-13 09:24:34 +02:00
Israel Rukshin 16001c1072 nvme: fix NULL pointer dereference in nvme_init_subsystem
When using nvme-pci driver the nvmf_ctrl_options is NULL.
There is no need to check for discovery_nqn flag at non-fabrics controller.

Fixes: 181303d0 ("nvme-fabrics: allow duplicate connections to the discovery controller")
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-06-11 16:17:41 +02:00
Linus Torvalds a3818841bd for-linus-20180608
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlsa4sQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqPNEADbZby01Q6i+dTZYIosz5+/gq8gSkCcpQ/T
 krK/f2MlwD7Rdog1BnGNNP5XOqK8pKIGdARL1FQKpViii6xGIoOc2F4VK+vO44yR
 LI+BeeOM6rWNOAoBO4CqeZz/Fv5IYi7KURWogYhZMqrxBqT2OeD9MMowm5NulBix
 YZ2ttFWiTJScJttJCDPE6cu9EjHDeK63Nr7+UU80k3atU4eUpUp1mRFGmtaYWulq
 l3KaENCwm00WCVqM4i/gVWr2AkgTZqAAyeCx7IrPsrQrCMEhxEpMnU52e2kXSxhM
 Qx6FLNEOjzARuBDurtfJE74usQcW2xDLzT8fh2UStnPpt6S/JX6f9GMBVk0G7I8B
 8COF4DF+bzdbhhz2SiZaTFOmDML5H1iQ8t6lTTms0Bnq29mE3E4QFom8lO+2BxN3
 g6PFhvYaOkhTVtV5BPXpXs9xZBLHrv5G/JopXsZh0RF1kpiova+nfA1K2uJPFpJ0
 NcHuMZKmIG3uBqY3fj5Ul+zuVhZ/1v8B69zWoSWafLrk+VRdcEAniuY2E6SsQFP5
 gV4GNja85S53DnlIVwEUXPYMiY6opiwP53yMNMvkB/FdzaQB5Ehdif2fhZu64QmE
 TtqbHtAuV0VZ3z4GrJ3XNbV6Np4wMOhYls4lTkZsnqNNO2sw/eoTYcmwxLDEYOQw
 uQ9rhZh4IQ==
 =N3BP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20180608' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:
 "A few fixes for this merge window, where some of them should go in
  sooner rather than later, hence a new pull this week. This pull
  request contains:

   - Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
     changes, but also teardown/removal and misc changes (Christop/Dan/
     Johannes/Sagi/Steve).

   - Two lightnvm fixes for issues that showed up in this window
     (Colin/Wei).

   - Failfast/driver flags inheritance for flush requests (Hannes).

   - The md device put sanitization and fix (Kent).

   - dm bio_set inheritance fix (me).

   - nbd discard granularity fix (Josef).

   - nbd consistency in command printing (Kevin).

   - Loop recursion validation fix (Ted).

   - Partition overlap check (Wang)"

[ .. and now my build is warning-free again thanks to the md fix  - Linus ]

* tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
  nvme: cleanup double shift issue
  nvme-pci: make CMB SQ mod-param read-only
  nvme-pci: unquiesce dead controller queues
  nvme-pci: remove HMB teardown on reset
  nvme-pci: queue creation fixes
  nvme-pci: remove unnecessary completion doorbell check
  nvme-pci: remove unnecessary nested locking
  nvmet: filter newlines from user input
  nvme-rdma: correctly check for target keyed sgl support
  nvme: don't hold nvmf_transports_rwsem for more than transport lookups
  nvmet: return all zeroed buffer when we can't find an active namespace
  md: Unify mddev destruction paths
  dm: use bioset_init_from_src() to copy bio_set
  block: add bioset_init_from_src() helper
  block: always set partition number to '0' in blk_partition_remap()
  block: pass failfast and driver-specific flags to flush requests
  nbd: set discard_alignment to the granularity
  nbd: Consistently use request pointer in debug messages.
  block: add verifier for cmdline partition
  lightnvm: pblk: fix resource leak of invalid_bitmap
  ...
2018-06-08 13:36:19 -07:00
Dan Carpenter 77016199f1 nvme: cleanup double shift issue
The problem here is that set_bit() and test_bit() take a bit number so
we should be passing 0 but instead we're passing (1 << 0) which leads to
a double shift.  It doesn't cause a runtime bug in the current code
because it's done consistently and we only set that one bit.

I decided to just re-use NVME_AER_NOTICE_NS_CHANGED instead of
introducing a new define for this.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-06-08 12:51:11 -06:00
Linus Torvalds 4057adafb3 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:

 - updates to the handling of expedited grace periods

 - updates to reduce lock contention in the rcu_node combining tree

   [ These are in preparation for the consolidation of RCU-bh,
     RCU-preempt, and RCU-sched into a single flavor, which was
     requested by Linus in response to a security flaw whose root cause
     included confusion between the multiple flavors of RCU ]

 - torture-test updates that save their users some time and effort

 - miscellaneous fixes

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  rcu/x86: Provide early rcu_cpu_starting() callback
  torture: Make kvm-find-errors.sh find build warnings
  rcutorture: Abbreviate kvm.sh summary lines
  rcutorture: Print end-of-test state in kvm.sh summary
  rcutorture: Print end-of-test state
  torture: Fold parse-torture.sh into parse-console.sh
  torture: Add a script to edit output from failed runs
  rcu: Update list of rcu_future_grace_period() trace events
  rcu: Drop early GP request check from rcu_gp_kthread()
  rcu: Simplify and inline cpu_needs_another_gp()
  rcu: The rcu_gp_cleanup() function does not need cpu_needs_another_gp()
  rcu: Make rcu_start_this_gp() check for out-of-range requests
  rcu: Add funnel locking to rcu_start_this_gp()
  rcu: Make rcu_start_future_gp() caller select grace period
  rcu: Inline rcu_start_gp_advanced() into rcu_start_future_gp()
  rcu: Clear request other than RCU_GP_FLAG_INIT at GP end
  rcu: Cleanup, don't put ->completed into an int
  rcu: Switch __rcu_process_callbacks() to rcu_accelerate_cbs()
  rcu: Avoid __call_rcu_core() root rcu_node ->lock acquisition
  rcu: Make rcu_migrate_callbacks wake GP kthread when needed
  ...
2018-06-04 15:54:04 -07:00
Linus Torvalds f459c34538 for-4.18/block-20180603
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJbFIrHAAoJEPfTWPspceCm2+kQAKo7o7HL30aRxJYu+gYafkuW
 PV47zr3e4vhMDEzDaMsh1+V7I7bm3uS+NZu6cFbcV+N9KXFpeb4V4Hvvm5cs+OC3
 WCOBi4eC1h4qnDQ3ZyySrCMN+KHYJ16pZqddEjqw+fhVudx8i+F+jz3Y4ZMDDc3q
 pArKZvjKh2wEuYXUMFTjaXY46IgPt+er94OwvrhyHk+4AcA+Q/oqSfSdDahUC8jb
 BVR3FV4I3NOHUaru0RbrUko13sVZSboWPCIFrlTDz8xXcJOnVHzdVS1WLFDXLHnB
 O8q9cADCfa4K08kz68RxykcJiNxNvz5ChDaG0KloCFO+q1tzYRoXLsfaxyuUDg57
 Zd93OFZC6hAzXdhclDFIuPET9OQIjDzwphodfKKmDsm3wtyOtydpA0o7JUEongp0
 O1gQsEfYOXmQsXlo8Ot+Z7Ne/HvtGZ91JahUa/59edxQbcKaMrktoyQsQ/d1nOEL
 4kXID18wPcFHWRQHYXyVuw6kbpRtQnh/U2m1eenSZ7tVQHwoe6mF3cfSf5MMseak
 k8nAnmsfEvOL4Ar9ftg61GOrImaQlidxOC2A8fmY5r0Sq/ZldvIFIZizsdTTCcni
 8SOTxcQowyqPf5NvMNQ8cKqqCJap3ppj4m7anZNhbypDIF2TmOWsEcXcMDn4y9on
 fax14DPLo59gBRiPCn5f
 =nga/
 -----END PGP SIGNATURE-----

Merge tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - clean up how we pass around gfp_t and
   blk_mq_req_flags_t (Christoph)

 - prepare us to defer scheduler attach (Christoph)

 - clean up drivers handling of bounce buffers (Christoph)

 - fix timeout handling corner cases (Christoph/Bart/Keith)

 - bcache fixes (Coly)

 - prep work for bcachefs and some block layer optimizations (Kent).

 - convert users of bio_sets to using embedded structs (Kent).

 - fixes for the BFQ io scheduler (Paolo/Davide/Filippo)

 - lightnvm fixes and improvements (Matias, with contributions from Hans
   and Javier)

 - adding discard throttling to blk-wbt (me)

 - sbitmap blk-mq-tag handling (me/Omar/Ming).

 - remove the sparc jsflash block driver, acked by DaveM.

 - Kyber scheduler improvement from Jianchao, making it more friendly
   wrt merging.

 - conversion of symbolic proc permissions to octal, from Joe Perches.
   Previously the block parts were a mix of both.

 - nbd fixes (Josef and Kevin Vigor)

 - unify how we handle the various kinds of timestamps that the block
   core and utility code uses (Omar)

 - three NVMe pull requests from Keith and Christoph, bringing AEN to
   feature completeness, file backed namespaces, cq/sq lock split, and
   various fixes

 - various little fixes and improvements all over the map

* tag 'for-4.18/block-20180603' of git://git.kernel.dk/linux-block: (196 commits)
  blk-mq: update nr_requests when switching to 'none' scheduler
  block: don't use blocking queue entered for recursive bio submits
  dm-crypt: fix warning in shutdown path
  lightnvm: pblk: take bitmap alloc. out of critical section
  lightnvm: pblk: kick writer on new flush points
  lightnvm: pblk: only try to recover lines with written smeta
  lightnvm: pblk: remove unnecessary bio_get/put
  lightnvm: pblk: add possibility to set write buffer size manually
  lightnvm: fix partial read error path
  lightnvm: proper error handling for pblk_bio_add_pages
  lightnvm: pblk: fix smeta write error path
  lightnvm: pblk: garbage collect lines with failed writes
  lightnvm: pblk: rework write error recovery path
  lightnvm: pblk: remove dead function
  lightnvm: pass flag on graceful teardown to targets
  lightnvm: pblk: check for chunk size before allocating it
  lightnvm: pblk: remove unnecessary argument
  lightnvm: pblk: remove unnecessary indirection
  lightnvm: pblk: return NVM_ error on failed submission
  lightnvm: pblk: warn in case of corrupted write buffer
  ...
2018-06-04 07:58:06 -07:00
Jens Axboe 84e92c131a Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
Pull NVMe changes from Christoph:

"Below is another set of NVMe updates for 4.18.  Besides the usual bug
 fixes this includes more feature completness in terms of AEN and log
 page handling on the target."

* 'nvme-4.18' of git://git.infradead.org/nvme:
  nvme: use the changed namespaces list log to clear ns data changed AENs
  nvme: mark nvme_queue_scan static
  nvme: submit AEN event configuration on startup
  nvmet: mask pending AENs
  nvmet: add AEN configuration support
  nvmet: implement the changed namespaces log
  nvmet: split log page implementation
  nvmet: add a new nvmet_zero_sgl helper
  nvme.h: add AEN configuration symbols
  nvme.h: add the changed namespace list log
  nvme.h: untangle AEN notice definitions
  nvmet: fix error return code in nvmet_file_ns_enable()
  nvmet: fix a typo in nvmet_file_ns_enable()
  nvme-fabrics: allow internal passthrough command on deleting controllers
  nvme-loop: add support for multiple ports
  nvme-pci: simplify __nvme_submit_cmd
  nvme-pci: Rate limit the nvme timeout warnings
  nvme: allow duplicate controller if prior controller being deleted
2018-06-01 07:39:48 -06:00
Christoph Hellwig 30d90964e7 nvme: use the changed namespaces list log to clear ns data changed AENs
Per section 5.2 we need to issue the corresponding log page to clear an
AEN, so for a namespace data changed AEN we need to read the changed
namespace list log.  And once we read that log anyway we might as well
use it to optimize the rescan.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-06-01 14:37:35 +02:00
Christoph Hellwig 50e8d8eeed nvme: mark nvme_queue_scan static
And move it toward the top of the file to avoid a forward declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-06-01 14:37:35 +02:00
Hannes Reinecke c0561f82a7 nvme: submit AEN event configuration on startup
We should register for AEN events; some law-abiding targets might
not be sending us AENs otherwise.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[hch: slight cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2018-06-01 14:37:35 +02:00
Christoph Hellwig 868c2392a7 nvme.h: untangle AEN notice definitions
Stop including the event type in the definitions for the notice type.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-05-31 18:46:48 +02:00
Christoph Hellwig d250bf4e77 blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
We already check for started commands in all callbacks, but we should
also protect against already completed commands.  Do this by taking
the checks to common code.

Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-30 11:31:34 -06:00
Max Gurtovoy c97f414c54 nvme: fix extended data LBA supported setting
This value depands on the metadata support value, so reorder the
initialization to fit.

Fixes: b5be3b392 ("nvme: always unregister the integrity profile in __nvme_revalidate_disk")
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
2018-05-29 20:29:16 +02:00
Hannes Reinecke 75c8b19a23 nvme: fixup memory leak in nvme_init_identify()
If nvme_get_effects_log() failed the 'id' buffer from the previous
nvme_identify_ctrl() call will never be freed.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Hannes Reinecke 181303d035 nvme-fabrics: allow duplicate connections to the discovery controller
The whole point of the discovery controller is that it can accept
multiple connections. Additionally the cmic field is not even defined for
the discovery controller identify page.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-25 16:50:12 +02:00
Ivan Bornyakov e9a9853c23 nvme: host: core: fix precedence of ternary operator
Ternary operator have lower precedence then bitwise or, so 'cdw10' was
calculated wrong.

Signed-off-by: Ivan Bornyakov <brnkv.i1@gmail.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-23 09:11:37 -06:00
Jens Axboe 81b1dab458 Merge branch 'nvme-4.18' of git://git.infradead.org/nvme into for-4.18/block
Pull NVMe changes from Keith:

"This is just the first nvme pull request for 4.18. There are several
fabrics and target patches that I missed, so there will be more to
come."

* 'nvme-4.18' of git://git.infradead.org/nvme:
  nvme-pci: drop IRQ disabling on submission queue lock
  nvme-pci: split the nvme queue lock into submission and completion locks
  nvme-pci: handle completions outside of the queue lock
  nvme-pci: move ->cq_vector == -1 check outside of ->q_lock
  nvme-pci: remove cq check after submission
  nvme-pci: simplify nvme_cqe_valid
  nvme: mark the result argument to nvme_complete_async_event volatile
  nvme/pci: Sync controller reset for AER slot_reset
  nvme/pci: Hold controller reference during async probe
  nvme: only reconfigure discard if necessary
  nvme/pci: Use async_schedule for initial reset work
  nvme: lightnvm: add granby support
  NVMe: Add Quirk Delay before CHK RDY for Seagate Nytro Flash Storage
  nvme: change order of qid and cmdid in completion trace
  nvme: fc: provide a descriptive error
2018-05-21 08:33:37 -06:00
Christoph Hellwig 287a63ebbe nvme: mark the result argument to nvme_complete_async_event volatile
We'll need that in the PCIe driver soon as we'll read it straight off the
CQ.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-18 14:41:35 -06:00
Ingo Molnar 13a553199f Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
- Updates to the handling of expedited grace periods, perhaps most
   notably parallelizing their initialization.  Other changes
   include fixes from Boqun Feng.

 - Miscellaneous fixes.  These include an nvme fix from Nitzan Carmi
   that I am carrying because it depends on a new SRCU function
   cleanup_srcu_struct_quiesced().  This branch also includes fixes
   from Byungchul Park and Yury Norov.

 - Updates to reduce lock contention in the rcu_node combining tree.
   These are in preparation for the consolidation of RCU-bh,
   RCU-preempt, and RCU-sched into a single flavor, which was
   requested by Linus Torvalds in response to a security flaw
   whose root cause included confusion between the multiple flavors
   of RCU.

 - Torture-test updates that save their users some time and effort.

Conflicts:
	drivers/nvme/host/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-05-16 09:34:51 +02:00
Nitzan Carmi 4317228ad9 nvme: Avoid flush dependency in delete controller flow
The nvme_delete_ctrl() function queues a work item on a MEM_RECLAIM
queue (nvme_delete_wq), which eventually calls cleanup_srcu_struct(),
which in turn flushes a delayed work from an !MEM_RECLAIM queue. This
is unsafe as we might trigger deadlocks under severe memory pressure.

Since we don't ever invoke call_srcu(), it is safe to use the shiny new
_quiesced() version of srcu cleanup, thus avoiding that flush dependency.
This commit makes that change.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
2018-05-15 10:28:03 -07:00
Charles Machalow 4e50d9ebae nvme: Fix sync controller reset return
If a controller reset is requested while the device has no namespaces,
we were incorrectly returning ENETRESET. This patch adds the check for
ADMIN_ONLY controller state to indicate a successful reset.

Fixes: 8000d1fdb0  ("nvme-rdma: fix sysfs invoked reset_ctrl error flow ")
Cc: <stable@vger.kernel.org>
Signed-off-by: Charles Machalow <charles.machalow@intel.com>
[changelog]
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-11 10:51:45 -06:00
Jianchao Wang 12d9f07022 nvme: fix use-after-free in nvme_free_ns_head
Currently only nvme_ctrl will take a reference counter of
nvme_subsystem, nvme_ns_head also needs it. Otherwise
nvme_free_ns_head will access the nvme_subsystem.ns_ida
which has been freed by __nvme_release_subsystem after all the
reference of nvme_subsystem have been released by nvme_free_ctrl.
This could cause memory corruption.

 BUG: KASAN: use-after-free in radix_tree_next_chunk+0x9f/0x4b0
 Read of size 8 at addr ffff88036494d2e8 by task fio/1815

 CPU: 1 PID: 1815 Comm: fio Kdump: loaded Tainted: G        W         4.17.0-rc1+ #18
 Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
 Call Trace:
  dump_stack+0x91/0xeb
  print_address_description+0x6b/0x290
  kasan_report+0x261/0x360
  radix_tree_next_chunk+0x9f/0x4b0
  ida_remove+0x8b/0x180
  ida_simple_remove+0x26/0x40
  nvme_free_ns_head+0x58/0xc0
  __blkdev_put+0x30a/0x3a0
  blkdev_close+0x44/0x50
  __fput+0x184/0x380
  task_work_run+0xaf/0xe0
  do_exit+0x501/0x1440
  do_group_exit+0x89/0x140
  __x64_sys_exit_group+0x28/0x30
  do_syscall_64+0x72/0x230

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-07 08:25:08 -06:00
Keith Busch a785dbccd9 nvme/multipath: Fix multipath disabled naming collisions
When CONFIG_NVME_MULTIPATH is set, but we're not using nvme to multipath,
namespaces with multiple paths were not creating unique names due to
reusing the same instance number from the namespace's head.

This patch fixes this by falling back to the non-multipath naming method
when the parameter disabled using multipath.

Reported-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 09:37:50 -06:00
Keith Busch f31a21103c nvme: Set integrity flag for user passthrough commands
If the command a separate metadata buffer attached, the request needs
to have the integrity flag set so the driver knows to map it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-03 09:37:50 -06:00
Jens Axboe 3831761eb8 nvme: only reconfigure discard if necessary
Currently nvme reconfigures discard for every disk revalidation. This
is problematic because any O_WRONLY or O_RDWR open will trigger a
partition scan through udev/systemd, and we will reconfigure discard.
This blows away any user settings, like discard_max_bytes.

Only re-configure the user settable settings if we need to.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
[removed redundant queue flag setting]
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-05-02 11:16:58 -06:00
James Smart bb06ec3145 nvme: expand nvmf_check_if_ready checks
The nvmf_check_if_ready() checks that were added are very simplistic.
As such, the routine allows a lot of cases to fail ios during windows
of reset or re-connection. In cases where there are not multi-path
options present, the error goes back to the callee - the filesystem
or application. Not good.

The common routine was rewritten and calling syntax slightly expanded
so that per-transport is_ready routines don't need to be present.
The transports now call the routine directly. The routine is now a
fabrics routine rather than an inline function.

The routine now looks at controller state to decide the action to
take. Some states mandate io failure. Others define the condition where
a command can be accepted.  When the decision is unclear, a generic
queue-or-reject check is made to look for failfast or multipath ios and
only fails the io if it is so marked. Otherwise, the io will be queued
and wait for the controller state to resolve.

Admin commands issued via ioctl share a live admin queue with commands
from the transport for controller init. The ioctls could be intermixed
with the initialization commands. It's possible for the ioctl cmd to
be issued prior to the controller being enabled. To block this, the
ioctl admin commands need to be distinguished from admin commands used
for controller init. Added a USERCMD nvme_req(req)->rq_flags bit to
reflect this division and set it on ioctls requests.  As the
nvmf_check_if_ready() routine is called prior to nvme_setup_cmd(),
ensure that commands allocated by the ioctl path (actually anything
in core.c) preps the nvme_req(req) before starting the io. This will
preserve the USERCMD flag during execution and/or retry.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.e>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Keith Busch 62843c2e42 nvme: Use admin command effects for admin commands
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Max Gurtovoy fd92c77f58 nvme: check return value of init_srcu_struct function
Also add error flow in case srcu initialization function fails.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Johannes Thumshirn 00b683dbab nvme: unexport nvme_start_keep_alive
nvme_start_keep_alive() isn't used outside core.c so unexport it and
make it static.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Matias Bjørling 7ec6074ff0 nvme: enforce 64bit offset for nvme_get_log_ext fn
Compiling on 32 bits system produces a warning for the shift width
when shifting 32 bit integer with 64bit integer.

Make sure that offset always is 64bit, and use macros for retrieving
lower and upper bits of the offset.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-04-12 09:58:27 -06:00
Linus Torvalds 3526dd0c78 for-4.17/block-20180402
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
 +Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
 X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
 I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
 h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
 Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
 zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
 NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
 UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
 EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
 mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
 5J1axMgVH5LAsIEcEQVq
 =RVhK
 -----END PGP SIGNATURE-----

Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "It's a pretty quiet round this time, which is nice. This contains:

   - series from Bart, cleaning up the way we set/test/clear atomic
     queue flags.

   - series from Bart, fixing races between gendisk and queue
     registration and removal.

   - set of bcache fixes and improvements from various folks, by way of
     Michael Lyle.

   - set of lightnvm updates from Matias, most of it being the 1.2 to
     2.0 transition.

   - removal of unused DIO flags from Nikolay.

   - blk-mq/sbitmap memory ordering fixes from Omar.

   - divide-by-zero fix for BFQ from Paolo.

   - minor documentation patches from Randy.

   - timeout fix from Tejun.

   - Alpha "can't write a char atomically" fix from Mikulas.

   - set of NVMe fixes by way of Keith.

   - bsg and bsg-lib improvements from Christoph.

   - a few sed-opal fixes from Jonas.

   - cdrom check-disk-change deadlock fix from Maurizio.

   - various little fixes, comment fixes, etc from various folks"

* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
  blk-mq: Directly schedule q->timeout_work when aborting a request
  blktrace: fix comment in blktrace_api.h
  lightnvm: remove function name in strings
  lightnvm: pblk: remove some unnecessary NULL checks
  lightnvm: pblk: don't recover unwritten lines
  lightnvm: pblk: implement 2.0 support
  lightnvm: pblk: implement get log report chunk
  lightnvm: pblk: rename ppaf* to addrf*
  lightnvm: pblk: check for supported version
  lightnvm: implement get log report chunk helpers
  lightnvm: make address conversions depend on generic device
  lightnvm: add support for 2.0 address format
  lightnvm: normalize geometry nomenclature
  lightnvm: complete geo structure with maxoc*
  lightnvm: add shorten OCSSD version in geo
  lightnvm: add minor version to generic geometry
  lightnvm: simplify geometry structure
  lightnvm: pblk: refactor init/exit sequences
  lightnvm: Avoid validation of default op value
  lightnvm: centralize permission check for lightnvm ioctl
  ...
2018-04-05 14:27:02 -07:00
Javier González a294c19945 lightnvm: implement get log report chunk helpers
The 2.0 spec provides a report chunk log page that can be retrieved
using the stangard nvme get log page. This replaces the dedicated
get/put bad block table in 1.2.

This patch implements the helper functions to allow targets retrieve the
chunk metadata using get log page. It makes nvme_get_log_ext available
outside of nvme core so that we can use it form lightnvm.

Signed-off-by: Javier González <javier@cnexlabs.com>
Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling 96257a8a7f nvme: lightnvm: add late setup of block size and metadata
The nvme driver sets up the size of the nvme namespace in two steps.
First it initializes the device with standard logical block and
metadata sizes, and then sets the correct logical block and metadata
size. Due to the OCSSD 2.0 specification relies on the namespace to
expose these sizes for correct initialization, let it be updated
appropriately on the LightNVM side as well.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29 17:29:09 -06:00
Matias Bjørling d558fb51ad nvme: make nvme_get_log_ext non-static
Enable the lightnvm integration to use the nvme_get_log_ext()
function.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Nitzan Carmi b435ecea2a nvme: Add .stop_ctrl to nvme ctrl ops
For consistancy reasons, any fabric-specific works
(e.g error recovery/reconnect) should be canceled in
nvme_stop_ctrl, as for all other NVMe pending works
(e.g. scan, keep alive).

The patch aims to simplify the logic of the code, as
we now only rely on a vague demand from any fabric
to flush its private workqueues at the beginning of
.delete_ctrl op.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Keith Busch 2079699c10 nvme: Skip checking heads without namespaces
If a task is holding a reference to a namespace on a removed controller,
the head will not be released. If the same controller is added again
later, its namespaces may not be successfully added. Instead, the user
will see kernel message "Duplicate IDs for nsid <X>".

This patch fixes that by skipping heads that don't have namespaces when
considering if a new namespace is safe to add.

Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Max Gurtovoy 77d0612da0 nvme: centralize ctrl removal prints
nvme_delete_ctrl can be called from various contexts in parallel,
and cause duplicated information prints, even though the specific
context doesn't perform the actual removal. Instead, print the
information when the actual removal occurs.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Matias Bjørling 70da6094a6 nvme: implement log page low/high offset and dwords
NVMe 1.2.1 extends the get log page interface to include 64 bit
offset and increases the number of dwords to 32 bits. Implement
for future use.

Signed-off-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jianchao Wang 765cc031cd nvme: change namespaces_mutext to namespaces_rwsem
namespaces_mutext is used to synchronize the operations on ctrl
namespaces list. Most of the time, it is a read operation.

On the other hand, there are many interfaces in nvme core that
need this lock, such as nvme_wait_freeze, and even more interfaces
will be added. If we use mutex here, circular dependency could be
introduced easily. For example:
context A                  context B
nvme_xxx                   nvme_xxx
hold namespaces_mutext     require namespaces_mutext
sync context B

So it is better to change it from mutex to rwsem.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Jianchao Wang 6f8e0d787e nvme: fix the dangerous reference of namespaces list
nvme_remove_namespaces and nvme_remove_invalid_namespaces reference
the ctrl->namespaces list w/o holding namespaces_mutext. It is ok
to invoke nvme_ns_remove there, but what if there is others.

To be safer, reference the ctrl->namespaces list under
namespaces_mutext.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Thomas Tai b9e03857f2 nvme: Add fault injection feature
Linux's fault injection framework provides a systematic way to support
error injection via debugfs in the /sys/kernel/debug directory. This
patch uses the framework to add error injection to NVMe driver. The
fault injection source code is stored in a separate file and only linked
if CONFIG_FAULT_INJECTION_DEBUG_FS kernel config is selected.

Once the error injection is enabled, NVME_SC_INVALID_OPCODE with no
retry will be injected into the nvme_end_request. Users can change
the default status code and no retry flag via debufs. Following example
shows how to enable and inject an error. For more examples, refer to
Documentation/fault-injection/nvme-fault-injection.txt

How to enable nvme fault injection:

First, enable CONFIG_FAULT_INJECTION_DEBUG_FS kernel config,
recompile the kernel. After booting up the kernel, do the
following.

How to inject an error:

mount /dev/nvme0n1 /mnt
echo 1 > /sys/kernel/debug/nvme0n1/fault_inject/times
echo 100 > /sys/kernel/debug/nvme0n1/fault_inject/probability
cp a.file /mnt

Expected Result:

cp: cannot stat ‘/mnt/a.file’: Input/output error

Message from dmesg:

FAULT_INJECTION: forcing a failure.
name fault_inject, interval 1, probability 100, space 0, times 1
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.15.0-rc8+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox,
BIOS VirtualBox 12/01/2006
Call Trace:
  <IRQ>
  dump_stack+0x5c/0x7d
  should_fail+0x148/0x170
  nvme_should_fail+0x2f/0x50 [nvme_core]
  nvme_process_cq+0xe7/0x1d0 [nvme]
  nvme_irq+0x1e/0x40 [nvme]
  __handle_irq_event_percpu+0x3a/0x190
  handle_irq_event_percpu+0x30/0x70
  handle_irq_event+0x36/0x60
  handle_fasteoi_irq+0x78/0x120
  handle_irq+0xa7/0x130
  ? tick_irq_enter+0xa8/0xc0
  do_IRQ+0x43/0xc0
  common_interrupt+0xa2/0xa2
  </IRQ>
RIP: 0010:native_safe_halt+0x2/0x10
RSP: 0018:ffffffff82003e90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
RAX: ffffffff817a10c0 RBX: ffffffff82012480 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 000000008e38ce64 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff82012480
R13: ffffffff82012480 R14: 0000000000000000 R15: 0000000000000000
  ? __sched_text_end+0x4/0x4
  default_idle+0x18/0xf0
  do_idle+0x150/0x1d0
  cpu_startup_entry+0x6f/0x80
  start_kernel+0x4c4/0x4e4
  ? set_init_arg+0x55/0x55
  secondary_startup_64+0xa5/0xb0
  print_req_error: I/O error, dev nvme0n1, sector 9240
EXT4-fs error (device nvme0n1): ext4_find_entry:1436:
inode #2: comm cp: reading directory lblock 0

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
Signed-off-by: Karl Volz <karl.volz@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Minwoo Im 42595eb7d0 nvme: use define instead of magic value for identify size
NVME_IDENTIFY_DATA_SIZE was added to linux/nvme.h by following commit.
  commit 0add5e8e58 ("nvmet: use NVME_IDENTIFY_DATA_SIZE")

Make it use NVME_IDENTIFY_DATA_SIZE define instead of magic value
0x1000 in case of identify data size.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26 08:53:43 -06:00
Bart Van Assche 8b904b5b6b block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()
This patch has been generated as follows:

for verb in set_unlocked clear_unlocked set clear; do
  replace-in-files queue_flag_${verb} blk_queue_flag_${verb%_unlocked} \
    $(git grep -lw queue_flag_${verb} drivers block/bsg*)
done

Except for protecting all queue flag changes with the queue lock
this patch does not change any functionality.

Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-08 14:13:48 -07:00
Christoph Hellwig 8a30ecc6e0 Revert "nvme: create 'slaves' and 'holders' entries for hidden controllers"
This reverts commit e9a48034d7.

The slaves and holders link for the hidden gendisks confuse lsblk so that
it errors out on, or doesn't report the nvme multipath devices.  Given
that we don't need holder relationships for something that can't even be
directly accessed we should just stop creating those links.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Potnuri Bharat Teja <bharat@chelsio.com>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-03-07 03:22:28 -07:00
Jens Axboe 468f098734 Merge branch 'for-jens' of git://git.infradead.org/nvme into for-linus
Pull NVMe fixes from Keith for 4.16-rc.

* 'for-jens' of git://git.infradead.org/nvme:
  nvmet: fix PSDT field check in command format
  nvme-multipath: fix sysfs dangerously created links
  nvme-pci: Fix nvme queue cleanup if IRQ setup fails
  nvmet-loop: use blk_rq_payload_bytes for sgl selection
  nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes
  nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
2018-02-28 12:18:58 -07:00
Baegjae Sung 9bd82b1a44 nvme-multipath: fix sysfs dangerously created links
If multipathing is enabled, each NVMe subsystem creates a head
namespace (e.g., nvme0n1) and multiple private namespaces
(e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
private namespaces, links of head namespace are used, so the
namespace creation order must be followed (e.g., nvme0n1 ->
nvme0c1n1). If the order is not followed, links of sysfs will be
incomplete or kernel panic will occur.

The kernel panic was:
  kernel BUG at fs/sysfs/symlink.c:27!
  Call Trace:
    nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
    nvme_validate_ns+0x5c2/0x850 [nvme_core]
    nvme_scan_work+0x1af/0x2d0 [nvme_core]

Correct order
Context A     Context B
nvme0n1
nvme0c0n1     nvme0c1n1

Incorrect order
Context A     Context B
              nvme0c1n1
nvme0n1
nvme0c0n1

The nvme_mpath_add_disk (for creating head namespace) is called
just before the nvme_mpath_add_disk_links (for creating private
namespaces). In nvme_mpath_add_disk, the first context acquires
the lock of subsystem and creates a head namespace, and other
contexts do nothing by checking GENHD_FL_UP of a head namespace
after waiting to acquire the lock. We verified the code with or
without multipathing using three vendors of dual-port NVMe SSDs.

Signed-off-by: Baegjae Sung <baegjae@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
2018-02-28 02:46:48 -07:00
Nitzan Carmi 8000d1fdb0 nvme-rdma: fix sysfs invoked reset_ctrl error flow
When reset_controller that is invoked by sysfs fails,
it enters an error flow which practically removes the
nvme ctrl entirely (similar to delete_ctrl flow). It
causes the system to hang, since a sysfs attribute cannot
be unregistered by one of its own methods.

This can be fixed by calling delete_ctrl as a work rather
than sequential code. In addition, it should give the ctrl
a chance to recover using reconnection mechanism (consistant
with FC reset_ctrl error flow). Also, while we're here, return
suitable errno in case the reset ended with non live ctrl.

Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-14 15:44:22 +02:00
Jianchao Wang 3fd176b754 nvme: fix the deadlock in nvme_update_formats
nvme_update_formats will invoke nvme_ns_remove under namespaces_mutext.
The will cause deadlock because nvme_ns_remove will also require
the namespaces_mutext. Fix it by getting the ns entries which should
be removed under namespaces_mutext and invoke nvme_ns_remove out of
namespaces_mutext.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-13 17:09:50 -07:00
Roland Dreier 0a34e4668c nvme: Don't use a stack buffer for keep-alive command
In nvme_keep_alive() we pass a request with a pointer to an NVMe command on
the stack into blk_execute_rq_nowait().  However, the block layer doesn't
guarantee that the request is fully queued before blk_execute_rq_nowait()
returns.  If not, and the request is queued after nvme_keep_alive() returns,
then we'll end up using stack memory that might have been overwritten to
form the NVMe command we pass to hardware.

Fix this by keeping a special command struct in the nvme_ctrl struct right
next to the delayed work struct used for keep-alives.

Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-12 22:18:14 +02:00
Keith Busch 8cb6af7b3a nvme: Fix discard buffer overrun
This patch checks the discard range array bounds before setting it in
case the driver gets a badly formed request.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08 18:35:55 +02:00
Max Gurtovoy 3096a739d2 nvme: delete NVME_CTRL_LIVE --> NVME_CTRL_CONNECTING transition
There is no logical reason to move from live state to connecting
state. In case of initial connection establishment, the transition
should be NVME_CTRL_NEW --> NVME_CTRL_CONNECTING --> NVME_CTRL_LIVE.
In case of error recovery or reset, the transition should be
NVME_CTRL_LIVE --> NVME_CTRL_RESETTING --> NVME_CTRL_CONNECTING -->
NVME_CTRL_LIVE.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08 18:35:54 +02:00
Max Gurtovoy b754a32c66 nvme-rdma: use NVME_CTRL_CONNECTING state to mark init process
In order to avoid concurrent error recovery during initialization
process (allowed by the NVME_CTRL_NEW --> NVME_CTRL_RESETTING transition)
we must mark the ctrl as CONNECTING before initial connection
establisment.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08 18:35:53 +02:00
Max Gurtovoy ad6a0a52e6 nvme: rename NVME_CTRL_RECONNECTING state to NVME_CTRL_CONNECTING
In pci transport, this state is used to mark the initialization
process. This should be also used in other transports as well.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2018-02-08 18:35:53 +02:00
Linus Torvalds 47fcc0360c Driver Core updates for 4.16-rc1
Here is the set of "big" driver core patches for 4.16-rc1.
 
 The majority of the work here is in the firmware subsystem, with reworks
 to try to attempt to make the code easier to handle in the long run, but
 no functional change.  There's also some tree-wide sysfs attribute
 fixups with lots of acks from the various subsystem maintainers, as well
 as a handful of other normal fixes and changes.
 
 And finally, some license cleanups for the driver core and sysfs code.
 
 All have been in linux-next for a while with no reported issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWnLvPw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynNzACgkzjPoBytJWbpWFt6SR6L33/u4kEAnRFvVCGL
 s6ygQPQhZIjKk2Lxa2hC
 =Zihy
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of "big" driver core patches for 4.16-rc1.

  The majority of the work here is in the firmware subsystem, with
  reworks to try to attempt to make the code easier to handle in the
  long run, but no functional change. There's also some tree-wide sysfs
  attribute fixups with lots of acks from the various subsystem
  maintainers, as well as a handful of other normal fixes and changes.

  And finally, some license cleanups for the driver core and sysfs code.

  All have been in linux-next for a while with no reported issues"

* tag 'driver-core-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (48 commits)
  device property: Define type of PROPERTY_ENRTY_*() macros
  device property: Reuse property_entry_free_data()
  device property: Move property_entry_free_data() upper
  firmware: Fix up docs referring to FIRMWARE_IN_KERNEL
  firmware: Drop FIRMWARE_IN_KERNEL Kconfig option
  USB: serial: keyspan: Drop firmware Kconfig options
  sysfs: remove DEBUG defines
  sysfs: use SPDX identifiers
  drivers: base: add coredump driver ops
  sysfs: add attribute specification for /sysfs/devices/.../coredump
  test_firmware: fix missing unlock on error in config_num_requests_store()
  test_firmware: make local symbol test_fw_config static
  sysfs: turn WARN() into pr_warn()
  firmware: Fix a typo in fallback-mechanisms.rst
  treewide: Use DEVICE_ATTR_WO
  treewide: Use DEVICE_ATTR_RO
  treewide: Use DEVICE_ATTR_RW
  sysfs.h: Use octal permissions
  component: add debugfs support
  bus: simple-pm-bus: convert bool SIMPLE_PM_BUS to tristate
  ...
2018-02-01 10:00:28 -08:00
Linus Torvalds 0a4b6e2f80 Merge branch 'for-4.16/block' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "This is the main pull request for block IO related changes for the
  4.16 kernel. Nothing major in this pull request, but a good amount of
  improvements and fixes all over the map. This contains:

   - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
     Paolo.

   - Support for SMR zones for deadline and mq-deadline from Damien and
     Christoph.

   - Set of fixes for bcache by way of Michael Lyle, including fixes
     from himself, Kent, Rui, Tang, and Coly.

   - Series from Matias for lightnvm with fixes from Hans Holmberg,
     Javier, and Matias. Mostly centered around pblk, and the removing
     rrpc 1.2 in preparation for supporting 2.0.

   - A couple of NVMe pull requests from Christoph. Nothing major in
     here, just fixes and cleanups, and support for command tracing from
     Johannes.

   - Support for blk-throttle for tracking reads and writes separately.
     From Joseph Qi. A few cleanups/fixes also for blk-throttle from
     Weiping.

   - Series from Mike Snitzer that enables dm to register its queue more
     logically, something that's alwways been problematic on dm since
     it's a stacked device.

   - Series from Ming cleaning up some of the bio accessor use, in
     preparation for supporting multipage bvecs.

   - Various fixes from Ming closing up holes around queue mapping and
     quiescing.

   - BSD partition fix from Richard Narron, fixing a problem where we
     can't mount newer (10/11) FreeBSD partitions.

   - Series from Tejun reworking blk-mq timeout handling. The previous
     scheme relied on atomic bits, but it had races where we would think
     a request had timed out if it to reused at the wrong time.

   - null_blk now supports faking timeouts, to enable us to better
     exercise and test that functionality separately. From me.

   - Kill the separate atomic poll bit in the request struct. After
     this, we don't use the atomic bits on blk-mq anymore at all. From
     me.

   - sgl_alloc/free helpers from Bart.

   - Heavily contended tag case scalability improvement from me.

   - Various little fixes and cleanups from Arnd, Bart, Corentin,
     Douglas, Eryu, Goldwyn, and myself"

* 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
  block: remove smart1,2.h
  nvme: add tracepoint for nvme_complete_rq
  nvme: add tracepoint for nvme_setup_cmd
  nvme-pci: introduce RECONNECTING state to mark initializing procedure
  nvme-rdma: remove redundant boolean for inline_data
  nvme: don't free uuid pointer before printing it
  nvme-pci: Suspend queues after deleting them
  bsg: use pr_debug instead of hand crafted macros
  blk-mq-debugfs: don't allow write on attributes with seq_operations set
  nvme-pci: Fix queue double allocations
  block: Set BIO_TRACE_COMPLETION on new bio during split
  blk-throttle: use queue_is_rq_based
  block: Remove kblockd_schedule_delayed_work{,_on}()
  blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
  blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
  lib/scatterlist: Fix chaining support in sgl_alloc_order()
  blk-throttle: track read and write request individually
  block: add bdev_read_only() checks to common helpers
  block: fail op_is_write() requests to read-only partitions
  blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
  ...
2018-01-29 11:51:49 -08:00
Johannes Thumshirn ca5554a696 nvme: add tracepoint for nvme_complete_rq
Add a tracepoint in nvme_complete_rq() for completions of NVMe commands. An
expmale output of the trace-point is as follows:

<idle>-0     [001] d.h.     3.505266: nvme_complete_rq: cmdid=989, qid=1, res=0, retries=0, flags=0x0, status=0

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-26 12:34:40 +01:00
Johannes Thumshirn 3d030e41d9 nvme: add tracepoint for nvme_setup_cmd
Add tracepoints for nvme_setup_cmd() for tracing admin and/or nvm commands.

Examples of the two tracepoints are as follows for trace_nvme_setup_admin_cmd():
kworker/u8:0-5     [003] ....     2.998792: nvme_setup_admin_cmd: cmdid=14, flags=0x0, meta=0x0, cmd=(nvme_admin_create_cq cqid=1, qsize=1023, cq_flags=0x3, irq_vector=0)

and trace_nvme_setup_nvm_cmd():
dd-205   [001] ....     3.503929: nvme_setup_nvm_cmd: qid=1, nsid=1, cmdid=989, flags=0x0, meta=0x0, cmd=(nvme_cmd_read slba=4096, len=2047, ctrl=0x0, dsmgmt=0, reftag=0)

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-26 12:34:40 +01:00
Jianchao Wang ad70062cdb nvme-pci: introduce RECONNECTING state to mark initializing procedure
After Sagi's commit (nvme-rdma: fix concurrent reset and reconnect),
both nvme-fc/rdma have following pattern:
RESETTING    - quiesce blk-mq queues, teardown and delete queues/
               connections, clear out outstanding IO requests...
RECONNECTING - establish new queues/connections and some other
               initializing things.
Introduce RECONNECTING to nvme-pci transport to do the same mark.
Then we get a coherent state definition among nvme pci/rdma/fc
transports.

Suggested-by: James Smart <james.smart@broadcom.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-26 08:12:04 +01:00
Roy Shterman b227c59b9b nvme: host delete_work and reset_work on separate workqueues
We need to ensure that delete_work will be hosted on a different
workqueue than all the works we flush or cancel from it.
Otherwise we may hit a circular dependency warning [1].

Also, given that delete_work flushes reset_work, host reset_work
on nvme_reset_wq and delete_work on nvme_delete_wq. In addition,
fix the flushing in the individual drivers to flush nvme_delete_wq
when draining queued deletes.

[1]:
[  178.491942] =============================================
[  178.492718] [ INFO: possible recursive locking detected ]
[  178.493495] 4.9.0-rc4-c844263313a8-lb #3 Tainted: G           OE
[  178.494382] ---------------------------------------------
[  178.495160] kworker/5:1/135 is trying to acquire lock:
[  178.495894]  (
[  178.496120] "nvme-wq"
[  178.496471] ){++++.+}
[  178.496599] , at:
[  178.496921] [<ffffffffa70ac206>] flush_work+0x1a6/0x2d0
[  178.497670]
               but task is already holding lock:
[  178.498499]  (
[  178.498724] "nvme-wq"
[  178.499074] ){++++.+}
[  178.499202] , at:
[  178.499520] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.500343]
               other info that might help us debug this:
[  178.501269]  Possible unsafe locking scenario:

[  178.502113]        CPU0
[  178.502472]        ----
[  178.502829]   lock(
[  178.503115] "nvme-wq"
[  178.503467] );
[  178.503716]   lock(
[  178.504001] "nvme-wq"
[  178.504353] );
[  178.504601]
                *** DEADLOCK ***

[  178.505441]  May be due to missing lock nesting notation

[  178.506453] 2 locks held by kworker/5:1/135:
[  178.507068]  #0:
[  178.507330]  (
[  178.507598] "nvme-wq"
[  178.507726] ){++++.+}
[  178.508079] , at:
[  178.508173] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.509004]  #1:
[  178.509265]  (
[  178.509532] (&ctrl->delete_work)
[  178.509795] ){+.+.+.}
[  178.510145] , at:
[  178.510239] [<ffffffffa70ad6c2>] process_one_work+0x162/0x6a0
[  178.511070]
               stack backtrace:
:
[  178.511693] CPU: 5 PID: 135 Comm: kworker/5:1 Tainted: G           OE   4.9.0-rc4-c844263313a8-lb #3
[  178.512974] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
[  178.514247] Workqueue: nvme-wq nvme_del_ctrl_work [nvme_tcp]
[  178.515071]  ffffc2668175bae0 ffffffffa7450823 ffffffffa88abd80 ffffffffa88abd80
[  178.516195]  ffffc2668175bb98 ffffffffa70eb012 ffffffffa8d8d90d ffff9c472e9ea700
[  178.517318]  ffff9c472e9ea700 ffff9c4700000000 ffff9c4700007200 ab83be61bec0d50e
[  178.518443] Call Trace:
[  178.518807]  [<ffffffffa7450823>] dump_stack+0x85/0xc2
[  178.519542]  [<ffffffffa70eb012>] __lock_acquire+0x17d2/0x18f0
[  178.520377]  [<ffffffffa75839a7>] ? serial8250_console_putchar+0x27/0x30
[  178.521330]  [<ffffffffa7583980>] ? wait_for_xmitr+0xa0/0xa0
[  178.522174]  [<ffffffffa70ac1eb>] ? flush_work+0x18b/0x2d0
[  178.522975]  [<ffffffffa70eb7cb>] lock_acquire+0x11b/0x220
[  178.523753]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.524535]  [<ffffffffa70ac229>] flush_work+0x1c9/0x2d0
[  178.525291]  [<ffffffffa70ac206>] ? flush_work+0x1a6/0x2d0
[  178.526077]  [<ffffffffa70a9cf0>] ? flush_workqueue_prep_pwqs+0x220/0x220
[  178.527040]  [<ffffffffa70ae7cf>] __cancel_work_timer+0x10f/0x1d0
[  178.527907]  [<ffffffffa70fecb9>] ? vprintk_default+0x29/0x40
[  178.528726]  [<ffffffffa71cb507>] ? printk+0x48/0x50
[  178.529434]  [<ffffffffa70ae8c3>] cancel_delayed_work_sync+0x13/0x20
[  178.530381]  [<ffffffffc042100b>] nvme_stop_ctrl+0x5b/0x70 [nvme_core]
[  178.531314]  [<ffffffffc0403dcc>] nvme_del_ctrl_work+0x2c/0x50 [nvme_tcp]
[  178.532271]  [<ffffffffa70ad741>] process_one_work+0x1e1/0x6a0
[  178.533101]  [<ffffffffa70ad6c2>] ? process_one_work+0x162/0x6a0
[  178.533954]  [<ffffffffa70adc4e>] worker_thread+0x4e/0x490
[  178.534735]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.535588]  [<ffffffffa70adc00>] ? process_one_work+0x6a0/0x6a0
[  178.536441]  [<ffffffffa70b48cf>] kthread+0xff/0x120
[  178.537149]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538094]  [<ffffffffa70b47d0>] ? kthread_park+0x60/0x60
[  178.538900]  [<ffffffffa78e332a>] ret_from_fork+0x2a/0x40

Signed-off-by: Roy Shterman <roys@lightbitslabs.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15 17:09:30 +01:00
Sagi Grimberg 79c48ccf2f nvme-pci: serialize pci resets
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-15 16:20:10 +01:00
Keith Busch 908e45643d nvme/multipath: Consult blk_status_t for failover
This removes nvme multipath's specific status decoding to see if failover
is needed, using the generic blk_status_t that was decoded earlier. This
abstraction from the raw NVMe status means all status decoding exists
in one place.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 10:52:14 -07:00
Keith Busch e96fef2c3f nvme: Add more command status translation
This adds more NVMe status code translations to blk_status_t values,
and captures all the current status codes NVMe multipath uses.

Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-10 10:52:12 -07:00
Joe Perches c828a89203 treewide: Use DEVICE_ATTR_RO
Convert DEVICE_ATTR uses to DEVICE_ATTR_RO where possible.

Done with perl script:

$ git grep -w --name-only DEVICE_ATTR | \
  xargs perl -i -e 'local $/; while (<>) { s/\bDEVICE_ATTR\s*\(\s*(\w+)\s*,\s*\(?(?:\s*S_IRUGO\s*|\s*0444\s*)\)?\s*,\s*\1_show\s*,\s*NULL\s*\)/DEVICE_ATTR_RO(\1)/g; print;}'

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Robert Jarzmik <robert.jarzmik@free.fr>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Zhang Rui <rui.zhang@intel.com>
Acked-by: Harald Freudenberger <freude@linux.vnet.ibm.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-09 16:34:34 +01:00
Israel Rukshin b837b28394 nvme: fix subsystem multiple controllers support check
There is a problem when another module (e.g. nvmet) takes a reference on
the nvme block device and the physical nvme drive is removed.  In that
case nvme_free_ctrl() will not be called and the controller state will be
"deleting" or "dead" unless nvmet module releases the block device.
Later on, the same nvme drive probes back and nvme_init_subsystem() will
be called and fail due to duplicate subnqn (if the nvme device doesn't
support subsystem with multiple controllers). This will cause a probe
failure.  This commit changes the check of multiple controllers support
at nvme_init_subsystem() by not counting all the controllers at "dead" or
"deleting" state (this is safe because controllers at this state will
never be active again).

Fixes: ab9e00cc72 ("nvme: track subsystems")
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 16:57:00 +01:00
Nitzan Carmi 85088c4a0f nvme: take refcount on transport module
The block device is backed by the transport so we must ensure that the
transport driver will not be removed until all references are released.
Otherwise, we might end up referencing freed memory.

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 16:56:57 +01:00
Jianchao Wang 2b1b7e784a nvme-pci: fix NULL pointer reference in nvme_alloc_ns
When the io queues setup or tagset allocation failed, ctrl.tagset is
NULL.  But the scan work will still be queued and executed, then panic
comes up due to NULL pointer reference of ctrl.tagset.

To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to inidcate only
admin queue is live. When non io queues or tagset allocation failed, ctrl
enters into this state, scan work will not be started.  But async event
work and nvme dev ioctl will be still available.  This will be helpful to
do further investigation and recovery.

Suggested-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:02:13 +01:00
Max Gurtovoy 1a3838d732 nvme: modify the debug level for setting shutdown timeout
When an NVMe controller reports RTD3 Entry Latency larger than the value
of shutdown_timeout module parameter, we update the shutdown_timeout
accordingly to honor RTD3 Entry Latency. Use an informational debug level
instead of a warning level for it.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-01-08 11:02:00 +01:00
Sagi Grimberg 479a322fb7 nvme-mpath: fix last path removal during traffic
In case our last path is removed during traffic, we can end up requeueing
the bio(s) but never schedule the actual requeue work as upper layers
still have open handles on the mpath device node.

Fix this by scheduling requeue work if the namespace being removed is
the last path in the ns_head path list.

Fixes: 32acab3181 ("nvme: implement multipath access to nvme subsystems")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29 10:32:58 +01:00
Jeff Lien cee160fd34 nvme: fix sector units when going between formats
If you format a device with a 4k sector size back to 512 bytes, the queue
limit values for physical block size and minimum IO size were not getting
updated; only the logical block size was being updated.  This patch adds
code to update the physical block and IO minimum sizes.

Signed-off-by: Jeff Lien <jeff.lien@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-29 10:31:05 +01:00
Keith Busch 654b4a4acd nvme: setup streams after initializing namespace head
Fixes a NULL pointer dereference.

Reported-by: Arnav Dawn <a.dawn@samsung.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15 15:18:07 +01:00
Keith Busch 249159c5f1 nvme: check hw sectors before setting chunk sectors
Some devices with IDs matching the "stripe" quirk don't actually have
this quirk, and don't have an MDTS value. When MDTS is not set, the
driver sets the max sectors to UINT_MAX, which is not a power of 2,
hitting a BUG_ON from blk_queue_chunk_sectors. This patch skips setting
chunk sectors for such devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15 15:18:07 +01:00
Ming Lei bd9f5d6576 nvme: call blk_integrity_unregister after queue is cleaned up
During IO complete path, bio_integrity_advance() is often called, and
blk_get_integrity() is called in this function. But in
blk_integrity_unregister, the buffer pointed by queue->integrity
is cleared, and blk_integrity->profile becomes NULL, then blk_get_integrity
returns NULL, and causes kernel oops[1] finally.

This patch fixes this issue by calling blk_integrity_unregister() after
blk_cleanup_queue().

[1] kernel oops log
[  122.068007] BUG: unable to handle kernel NULL pointer dereference at 000000000000000a
[  122.076760] IP: bio_integrity_advance+0x3d/0xf0
[  122.081815] PGD 0 P4D 0
[  122.084641] Oops: 0000 [#1] SMP
[  122.088142] Modules linked in: sunrpc ipmi_ssif intel_rapl vfat fat x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mei_me ipmi_si crct10dif_pclmul crc32_pclmul sg mei ghash_clmulni_intel mxm_wmi ipmi_devintf iTCO_wdt intel_cstate intel_uncore pcspkr intel_rapl_perf iTCO_vendor_support dcdbas ipmi_msghandler lpc_ich acpi_power_meter shpchp wmi dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel ahci nvme tg3 libahci nvme_core i2c_core libata ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
[  122.149577] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-11.el7a.x86_64 #1
[  122.157635] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  122.166179] task: ffff8802ff1e8000 task.stack: ffffc90000130000
[  122.172785] RIP: 0010:bio_integrity_advance+0x3d/0xf0
[  122.178419] RSP: 0018:ffff88047fc03d70 EFLAGS: 00010006
[  122.184248] RAX: ffff880473b08000 RBX: ffff880458c71a80 RCX: ffff880473b08248
[  122.192209] RDX: 0000000000000000 RSI: 000000000000003c RDI: ffffc900038d7ba0
[  122.200171] RBP: ffff88047fc03d78 R08: 0000000000000001 R09: ffffffffa01a78b5
[  122.208132] R10: ffff88047fc1eda0 R11: ffff880458c71ad0 R12: 0000000000007800
[  122.216094] R13: 0000000000000000 R14: 0000000000007800 R15: ffff880473a39b40
[  122.224056] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  122.233083] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  122.239494] CR2: 000000000000000a CR3: 0000000001c09002 CR4: 00000000001606e0
[  122.247455] Call Trace:
[  122.250183]  <IRQ>
[  122.252429]  bio_advance+0x28/0xf0
[  122.256217]  blk_update_request+0xa1/0x310
[  122.260778]  blk_mq_end_request+0x1e/0x70
[  122.265256]  nvme_complete_rq+0x1c/0xd0 [nvme_core]
[  122.270699]  nvme_pci_complete_rq+0x85/0x130 [nvme]
[  122.276140]  __blk_mq_complete_request+0x8d/0x140
[  122.281387]  blk_mq_complete_request+0x16/0x20
[  122.286345]  nvme_process_cq+0xdd/0x1c0 [nvme]
[  122.291301]  nvme_irq+0x23/0x50 [nvme]
[  122.295485]  __handle_irq_event_percpu+0x3c/0x190
[  122.300725]  handle_irq_event_percpu+0x32/0x80
[  122.305683]  handle_irq_event+0x3b/0x60
[  122.309964]  handle_edge_irq+0x8f/0x190
[  122.314247]  handle_irq+0xab/0x120
[  122.318043]  do_IRQ+0x48/0xd0
[  122.321355]  common_interrupt+0x9d/0x9d
[  122.325625]  </IRQ>
[  122.327967] RIP: 0010:cpuidle_enter_state+0xe9/0x280
[  122.333504] RSP: 0018:ffffc90000133e68 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff35
[  122.341952] RAX: ffff88047fc1b900 RBX: ffff88047fc24400 RCX: 000000000000001f
[  122.349913] RDX: 0000000000000000 RSI: fffffcf2e6007295 RDI: 0000000000000000
[  122.357874] RBP: ffffc90000133ea0 R08: 000000000000062e R09: 0000000000000253
[  122.365836] R10: 0000000000000225 R11: 0000000000000018 R12: 0000000000000002
[  122.373797] R13: 0000000000000001 R14: ffff88047fc24400 R15: 0000001c6bd1d263
[  122.381762]  ? cpuidle_enter_state+0xc5/0x280
[  122.386623]  cpuidle_enter+0x17/0x20
[  122.390611]  call_cpuidle+0x23/0x40
[  122.394501]  do_idle+0x17e/0x1f0
[  122.398101]  cpu_startup_entry+0x73/0x80
[  122.402478]  start_secondary+0x178/0x1c0
[  122.406854]  secondary_startup_64+0xa5/0xa5
[  122.411520] Code: 48 8b 5f 68 48 8b 47 08 31 d2 4c 8b 5b 48 48 8b 80 d0 03 00 00 48 83 b8 48 02 00 00 00 48 8d 88 48 02 00 00 48 0f 45 d1 c1 ee 09 <0f> b6 4a 0a 0f b6 52 09 89 f0 48 01 73 08 83 e9 09 d3 e8 0f af
[  122.432604] RIP: bio_integrity_advance+0x3d/0xf0 RSP: ffff88047fc03d70
[  122.439888] CR2: 000000000000000a

Reported-by: Zhang Yi <yizhan@redhat.com>
Tested-by: Zhang Yi <yizhan@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15 15:18:07 +01:00
David Disseldorp b224f6134d nvme: set discard_alignment to zero
Similar to 7c08428979 ("rbd: set discard_alignment to zero"), NVMe
devices are currently incorrectly initialised with the block queue
discard_alignment set to the NVMe stream alignment.

As per Documentation/ABI/testing/sysfs-block:
  The discard_alignment parameter indicates how many bytes the beginning
  of the device is offset from the internal allocation unit's natural
  alignment.

Correcting the discard_alignment parameter to zero has no effect on how
discard requests are propagated through the block layer - @alignment in
__blkdev_issue_discard() remains zero. However, it does fix other
consumers, such as LIO's Block Limits VPD response.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-12-15 15:13:32 +01:00
Keith Busch 9941a862cc nvme: Suppress static analyis warning
The ns->head is always valid, so we don't need to check for NULL.

Reported-by: Dan Carpenter <dan.caprenter@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20 08:38:11 +01:00
Keith Busch b0d61d586f nvme: Fix NULL dereference on reservation request
This fixes using the NULL 'head' before getting the reference. It is
however possible the head will always be NULL, so this patch uses the
struct nvme_ns to get the ns_id field.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-20 08:38:11 +01:00
Linus Torvalds e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the whole on blk/scsi-mq queue
     quescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tupple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00
Martin Wilck a04b5de505 nvme: fix visibility of "uuid" ns attribute
"uuid" must be invisible if both ns->uuid and ns->nguid are unset,
not if either one is.

Fixes: d934f9848a "nvme: provide UUID value to userspace"
Signed-off-by: Martin Wilck <mwilck@suse.com>
[hch: rebased to the nvme-4.15 tree to help resolving a conflict]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-11 15:38:21 -07:00
Hannes Reinecke 1e496938b6 nvme: expose subsys attribute to sysfs
We should be exposing the subsystem attributes like 'model' and
'subsysnqn' to sysfs to allow for easier identification of the
subsystem.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Hannes Reinecke e9a48034d7 nvme: create 'slaves' and 'holders' entries for hidden controllers
When creating nvme multipath devices we should populate the 'slaves' and
'holders' directorys properly to aid userspace topology detection.

Signed-off-by: Hannes Reinecke <hare@suse.com>
[hch: split from a larger patch, compile fix for CONFIG_NVME_MULTIPATH=n]
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 5b85b826b8 nvme: also expose the namespace identification sysfs files for mpath nodes
We do this by adding a helper that returns the ns_head for a device that
can belong to either the per-controller or per-subsystem block device
nodes, and otherwise reuse all the existing code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 32acab3181 nvme: implement multipath access to nvme subsystems
This patch adds native multipath support to the nvme driver.  For each
namespace we create only single block device node, which can be used
to access that namespace through any of the controllers that refer to it.
The gendisk for each controllers path to the name space still exists
inside the kernel, but is hidden from userspace.  The character device
nodes are still available on a per-controller basis.  A new link from
the sysfs directory for the subsystem allows to find all controllers
for a given subsystem.

Currently we will always send I/O to the first available path, this will
be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closes to the submitting NUMA code, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is not plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear,
any delay to keep it alive needs to be implemented at the controller
level.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig ed754e5dee nvme: track shared namespaces
Introduce a new struct nvme_ns_head that holds information about an actual
namespace, unlike struct nvme_ns, which only holds the per-controller
namespace information.  For private namespaces there is a 1:1 relation of
the two, but for shared namespaces this lets us discover all the paths to
it.  For now only the identifiers are moved to the new structure, but most
of the information in struct nvme_ns should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 002fab0404 nvme: introduce a nvme_ns_ids structure
This allows us to manage the various uniqueue namespace identifiers
together instead needing various variables and arguments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig ab9e00cc72 nvme: track subsystems
This adds a new nvme_subsystem structure so that we can track multiple
controllers that belong to a single subsystem.  For now we only use it
to store the NQN, and to check that we don't have duplicate NQNs unless
the involved subsystems support multiple controllers.

Includes code originally from Hannes Reinecke to expose the subsystems
in sysfs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Bart Van Assche 9a95e4ef70 block, nvme: Introduce blk_mq_req_flags_t
Several block layer and NVMe core functions accept a combination
of BLK_MQ_REQ_* flags through the 'flags' argument but there is
no verification at compile time whether the right type of block
layer flags is passed. Make it possible for sparse to verify this.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Cc: linux-nvme@lists.infradead.org
Cc: Christoph Hellwig <hch@lst.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch e3d7874dcf nvme: send uevent for some asynchronous events
This will give udev a chance to observe and handle asynchronous event
notifications and clear the log to unmask future events of the same type.
The driver will create a change uevent of the asyncronuos event result
before submitting the next AEN request to the device if a completed AEN
event is of type error, smart, command set or vendor specific,

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch d99ca609a1 nvme: unexport starting async event work
Async event work is for core use only and should not be called directly
from drivers.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch ad22c355b7 nvme: remove handling of multiple AEN requests
The driver can handle tracking only one AEN request, so this patch
removes handling for multiple ones.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Smart  <james.smart@broadcom.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch 38dabe210f nvme: centralize AEN defines
All the transports were unnecessarilly duplicating the AEN request
accounting. This patch defines everything in one place.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Javier González ab083b11f6 nvme: fix eui_show() print format
Fix print formatting, but keep the original output to prevent user
breakage as suggested by Joe Perches.

Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Javier González a47619b5c6 nvme: compare NQN string with right size
Copy subnqns using NVMF_NQN_SIZE as it is < 256

Signed-off-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch 84fef62d13 nvme: check admin passthru command effects
The NVMe standard provides a command effects log page so the host may
be aware of special requirements it may need to do for a particular
command. For example, the command may need to run with IO quiesced to
prevent timeouts or undefined behavior, or it may change the logical block
formats that determine how the host needs to construct future commands.

This patch saves the nvme command effects log page if the controller
supports it, and performs appropriate actions before and after an admin
passthrough command is completed. If the controller does not support the
command effects log page, the driver will define the effects for known
opcodes. The nvme format and santize are the only commands in this patch
with known effects.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Keith Busch c627c487ec nvme: factor get log into a helper
And fix the warning on a successful firmware log.

Reviewed-by: Javier González <javier@cnexlabs.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 715ea9e09d nvme: fix and clarify the check for missing metadata
Update the check in nvme_setup_rw for missing metadata so that it is
together with the other metadata handling, does not contain impossible
to reach conditions and warns if we get an impossible requests for
a (non-PI) metadata-enabled namespace when CONFIG_BLK_DEV_INTEGRITY
is not set.

Also add a little helper that checks if a given metadata configuration
contains protection information

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Javier González <jg@lightnvm.io>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 24b0b58c5b nvme: split __nvme_revalidate_disk
Split out the code that applies the calculate value to a given disk/queue
into new helper that can be reused by the multipath code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 6e78f21ae4 nvme: set the chunk size before freezing the queue
We don't need a frozen queue to update the chunk_size, which just is a
hint, and moving it a little earlier will allow for some better code
reuse with the multipath code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 30e5e929c7 nvme: don't pass struct nvme_ns to nvme_config_discard
To allow reusing this function for the multipath node.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig 39b7baa410 nvme: don't pass struct nvme_ns to nvme_init_integrity
To allow reusing this function for the multipath node.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig b5be3b3929 nvme: always unregister the integrity profile in __nvme_revalidate_disk
This is safe because the queue is always frozen when we revalidate, and
it simplifies both the existing code as well as the multipath
implementation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Christoph Hellwig e54b064cb2 nvme: move the dying queue check from cancel to completion
With multipath we don't want a hard DNR bit on a request that is cancelled
by a controller reset, but instead want to be able to retry it on another
patch.  To archive this don't always set the DNR bit when the queue is
dying in nvme_cancel_request, but defer that decision to
nvme_req_needs_retry.  Note that it applies to any command there and not
just cancelled commands, but one the queue is dying that is the right
thing to do anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-10 19:53:25 -07:00
Minwoo Im a806c6c81e nvme: comment typo fixed in clearing AER
a tiny typo fixed in a comment.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-03 10:02:20 +03:00
James Smart 3cec7f9de4 nvme: allow controller RESETTING to RECONNECTING transition
Transport will typically transition from LIVE to RESETTING when initially
performing a reset or recovering from an error.  Adding this transition
allows a transport to transition to RECONNECTING when it checks/waits for
connectivity then creates new transport connections and reinits the
controller.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:34:40 +01:00
Sagi Grimberg 4054637c9b nvme: flush reset_work before safely continuing with delete operation
Prevent racing controller reset and delete flows. reset_work must not
ever self-requeue so flushing it suffices.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:17 +01:00
Christoph Hellwig 6cd53d14aa nvme: consolidate common code from ->reset_work
No change in behavior except that the FC code cancels two work items a
little later now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:06 +01:00
Christoph Hellwig c5017e8570 nvme: move controller deletion to common code
Move the ->delete_work and the associated helpers to common code instead
of duplicating them in every driver.  This also adds the missing reference
get/put for the loop driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-11-01 16:28:04 +01:00
Keith Busch 5e0fab57fb nvme: Fix setting logical block format when revalidating
Revalidating the disk needs to set the logical block format and capacity,
otherwise it can't figure out if the users modified anything about
the namespace.

Fixes: cdbff4f26b ("nvme: remove nvme_revalidate_ns")

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-10-30 08:22:41 -06:00
Christoph Hellwig 999ada2871 nvme: check for a live controller in nvme_dev_open
This is a much more sensible check than just the admin queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@rimbeg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-10-27 09:05:22 +03:00
Christoph Hellwig a6a5149b10 nvme: get rid of nvme_ctrl_list
Use the core chrdev code to set up the link between the character device
and the nvme controller.  This allows us to get rid of the global list
of all controllers, and also ensures that we have both a reference to
the controller and the transport module before the open method of the
character device is called.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sgi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2017-10-27 09:04:53 +03:00
Christoph Hellwig d22524a478 nvme: switch controller refcounting to use struct device
Instead of allocating a separate struct device for the character device
handle embedd it into struct nvme_ctrl and use it for the main controller
refcounting.  This removes double refcounting and gets us an automatic
reference for the character device operations.  We keep ctrl->device as a
pointer for now to avoid chaning printks all over, but in the future we
could look into message printing helpers that take a controller structure
similar to what other subsystems do.

Note the delete_ctrl operation always already has a reference (either
through sysfs due this change, or because every open file on the
/dev/nvme-fabrics node has a refernece) when it is entered now, so we
don't need to do the unless_zero variant there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
2017-10-27 09:04:07 +03:00
Christoph Hellwig c6424a90da nvme: simplify nvme_open
Now that we are protected against lookup vs free races for the namespace
by using kref_get_unless_zero we don't need the hack of NULLing out the
disk private data during removal.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-10-27 09:03:31 +03:00
Christoph Hellwig 2dd4122854 nvme: use kref_get_unless_zero in nvme_find_get_ns
For kref_get_unless_zero to protect against lookup vs free races we need
to use it in all places where we aren't guaranteed to already hold a
reference.  There is no such guarantee in nvme_find_get_ns, so switch to
kref_get_unless_zero in this function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-10-27 09:02:58 +03:00
Christoph Hellwig 9843f685ae nvme: use ida_simple_{get,remove} for the controller instance
Switch to the ida_simple_* helpers instead of opencoding them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-10-20 12:16:57 +02:00
Sagi Grimberg 31b8446079 nvme: introduce nvme_reinit_tagset
Move blk_mq_reinit_tagset from blk-mq to nvme core
as the only user of it. Current transports that use
it (rdma, fc) simply implement .reinit_request op.

This patch does not change any functionality.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-18 19:27:48 +02:00
Christoph Hellwig 761f2e1ed8 nvme: simplify compat_ioctl handling
We can just use our normal ioctl handler for the compat case and remove
the boilerplate code for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
2017-10-16 14:54:10 +02:00
Marc Olson 8ae4e4477d nvme: update timeout module parameter type
The underlying blk_mq_tag_set, and request timeout parameters support an
unsigned int. Extend the size of the nvme module parameters for io and
admin commands to match.

Signed-off-by: Marc Olson <marcolso@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-04 09:43:09 +02:00
Martin Wilck 007a61ae2f nvme: fix visibility of "uuid" ns attribute
"uuid" must be invisible if both ns->uuid and ns->nguid are unset,
not if either one is.

Fixes: d934f9848a "nvme: provide UUID value to userspace"
Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-10-01 09:58:04 +02:00
Sagi Grimberg 1a40d97288 nvme-core: Use nvme_wq to queue async events and fw activation
async_event_work might race as it is executed from two different
workqueues at the moment.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 12:42:11 -06:00
James Smart 0951338d96 nvme: allow timed-out ios to retry
Currently the nvme_req_needs_retry() applies several checks to see if
a retry is allowed. On of those is whether the current time has exceeded
the start time of the io plus the timeout length. This check, if an io
times out, means there is never a retry allowed for the io. Which means
applications see the io failure.

Remove this check and allow the io to timeout, like it does on other
protocols, and retries to be made.

On the FC transport, a frame can be lost for an individual io, and there
may be no other errors that escalate for the connection/association.
The io will timeout, which causes the transport to escalate into creating
a new association, but the io that timed out, due to this retry logic, has
already failed back to the application and things are hosed.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
James Smart cd48282cc7 nvme: stop aer posting if controller state not live
If an nvme async_event command completes, in most cases, a new
async event is posted. However, if the controller enters a
resetting or reconnecting state, there is nothing to block the
scheduled work element from posting the async event again. Nor are
there calls from the transport to stop async events when an
association dies.

In the case of FC, where the association is torn down, the aer must
be aborted on the FC link and completes through the normal job
completion path. Thus the terminated async event ends up being
rescheduled even though the controller isn't in a valid state for
the aer, and the reposting gets the transport into a partially torn
down data structure.

It's possible to hit the scenario on rdma, although much less likely
due to an aer completing right as the association is terminated and
as the association teardown reclaims the blk requests via
nvme_cancel_request() so its immediate, not a link-related action
like on FC.

Fix by putting controller state checks in both the async event
completion routine where it schedules the async event and in the
async event work routine before it calls into the transport. It's
effectively a "stop_async_events()" behavior.  The transport, when
it creates a new association with the subsystem will transition
the state back to live and is already restarting the async event
posting.

Signed-off-by: James Smart <james.smart@broadcom.com>
[hch: remove taking a lock over reading the controller state]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-09-25 08:56:05 -06:00
Christoph Hellwig 044a9df1a7 nvme-pci: implement the HMB entry number and size limitations
Adds support for the new Host Memory Buffer Minimum Descriptor Entry Size
and Host Memory Maximum Descriptors Entries field that were added in
TP 4002 HMB Enhancements.  These allow the controller to advertise
limits for the usual number of segments in the host memory buffer, as
well as a minimum usable per-segment size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-09-11 12:29:40 -04:00
Christoph Hellwig 608cc4b14a nvme: fix lightnvm check
nvme_nvm_ns_supported assumes every device is a pci_dev, which leads to
reading an incorrect field, or possible even a dereference of unallocated
memory for fabrics controllers.

Fix this by introducing a quirk for lighnvm capable devices instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-09-11 12:29:36 -04:00
Keith Busch 63263d60e0 nvme: Use metadata for passthrough commands
The ioctls' struct allows the user to provide a metadata address and
length for a passthrough command. This patch uses these values that were
previously ignored and deletes the now unused wrapper function.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-30 14:59:35 +02:00
Keith Busch 485783ca63 nvme: Make nvme user functions static
These functions are used only locally in the nvme core.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-30 14:59:34 +02:00
Christoph Hellwig 1cad65620f nvme: factor metadata handling out of __nvme_submit_user_cmd
Keep the metadata code in a separate helper instead of making the
main function more complicated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-30 14:59:30 +02:00
Christoph Hellwig 1d5df6af8c nvme: don't blindly overwrite identifiers on disk revalidate
Instead validate that these identifiers do not change, as that is
prohibited by the specification.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-08-29 10:23:04 +02:00
Christoph Hellwig cdbff4f26b nvme: remove nvme_revalidate_ns
The function is used in two places, and the shared code for those will
diverge later in this series.

Instead factor out a new helper to get the ids for a namespace, simplify
the calling conventions for nvme_identify_ns and just open code the
sequence.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-08-29 10:22:39 +02:00
Christoph Hellwig 0a72bbba49 nvme: allow calling nvme_change_ctrl_state from irq context
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-08-29 10:22:23 +02:00
Christoph Hellwig a751da3394 nvme: report more detailed status codes to the block layer
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-08-29 10:22:18 +02:00
Martin K. Petersen 07fbd32a6b nvme: honor RTD3 Entry Latency for shutdowns
If an NVMe controller reports RTD3 Entry Latency larger than
shutdown_timeout, up to a maximum of 60 seconds, use that value to set
the shutdown timer. Otherwise fall back to the module parameter which
defaults to 5 seconds.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
[hch: removed do_div, made transition time local scope]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-28 23:00:44 +03:00
Max Gurtovoy 60b43f627a nvme: rename AMS symbolic constants to fit specification
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-28 23:00:39 +03:00
Sagi Grimberg caaa15c509 nvme: fix identify namespace logging
Use ctrl->device and lose the func name.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-28 23:00:38 +03:00
Jon Derrick dbf86b3900 nvme: add support for NVMe 1.3 Timestamp Feature
NVME's Timestamp feature allows controllers to be aware of the epoch
time in milliseconds. This patch adds the set features hook for various
transports through the identify path, so that resets and resumes can
update the controller as necessary.

Signed-off-by: Jon Derrick <jonathan.derrick@intel.com>
[hch: rebased on top of nvme-4.13 error handling changes,
      changed nvme_configure_timestamp to return the status]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-28 21:38:18 +02:00
Arnav Dawn 62346eaeb2 nvme: define NVME_NSID_ALL
Define the constant "0xffffffff" (used as nsid for all namespaces)
as NVME_NSID_ALL.

Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-08-28 21:38:17 +02:00
Arnav Dawn b6dccf7fae nvme: add support for FW activation without reset
This patch adds support for handling Fw activation without reset
On completion of FW-activation-starting AER, all queues are
paused till CSTS.PP is cleared or timed out (exceeds max time for
fw activtion MTFA). If device fails to clear CSTS.PP within MTFA,
driver issues reset controller.

Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-08-28 21:38:17 +02:00
Jens Axboe cd996fb47c Linux 4.13-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZo2HiAAoJEHm+PkMAQRiG3OcIAJqSeVK2uQ/QhmqFN1ExYay4
 bdTjSTtSk7GH6PxI2C0cqfZvsxOUU7ICDHG8bYM1LA0S0SxfOtoFhHGKc/BcFLX8
 MiKJWlF51ZbX0mkIEpKF+C8pRrXPgSqtk3N450/k2BzG9qCZSM93A2NCOB7v9T9w
 XOBUIYHqfTS2tdmCinjwu8Ls+w8oPOGH1gLjxZyGnBlg4lTqHMcUufmHeVEAh11d
 giGByqqqXH69kGD1HNC7H6quzXN9rz4n0gEwEG0mIhfkJ98b+ESSWwSEXXypOAQD
 QT5/6+2YizXf5DPCqR46xasQCPjRsS6Sv0cF2cntW2PEAb4jBjhx5gTFlJcoOC8=
 =efWJ
 -----END PGP SIGNATURE-----

Merge tag 'v4.13-rc7' into for-4.14/block-postmerge

Linux 4.13-rc7

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28 13:00:44 -06:00
Christoph Hellwig 74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Kwan (Hingkwan) Huen-SSI a082b42628 nvme: fix directive command numd calculation
The numd field of directive receive command takes number of dwords to
transfer. This fix has the correct calculation for numd.

Signed-off-by: Kwan (Hingkwan) Huen-SSI <kwan.huen@samsung.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-10 19:53:44 +02:00
Keith Busch 634b832590 nvme: fix nvme reset command timeout handling
We need to return an error if a timeout occurs on any NVMe command during
initialization. Without this, the nvme reset work will be stuck. A timeout
will have a negative error code, meaning we need to stop initializing
the controller. All postitive returns mean the controller is still usable.

bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196325

Signed-off-by: Keith Busch <keith.busch@intel.com>
Cc: Martin Peres <martin.peres@intel.com>
[jth consolidated cleanup path ]
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-10 19:53:38 +02:00
Martin Wilck 758f373558 nvme: strip trailing 0-bytes in wwid_show
Some broken controllers (such as earlier Linux targets) pad model or
serial fields with 0-bytes rather than spaces. The NVMe spec disallows
0 bytes in "ASCII" fields.  Thus strip trailing 0-bytes, too. Also make
sure that we get no underflow for pathological input.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-08-10 10:43:31 +02:00
Scott Bauer 7dd1ab163c nvme: validate admin queue before unquiesce
With a misbehaving controller it's possible we'll never
enter the live state and create an admin queue. When we
fail out of reset work it's possible we failed out early
enough without setting up the admin queue. We tear down
queues after a failed reset, but needed to do some more
sanitization.

Fixes 443bd90f2cca: "nvme: host: unquiesce queue in nvme_kill_queues()"

[  189.650995] nvme nvme1: pci function 0000:0b:00.0
[  317.680055] nvme nvme0: Device not ready; aborting reset
[  317.680183] nvme nvme0: Removing after probe failure status: -19
[  317.681258] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  317.681397] general protection fault: 0000 [#1] SMP KASAN
[  317.682984] CPU: 3 PID: 477 Comm: kworker/3:2 Not tainted 4.13.0-rc1+ #5
[  317.683112] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F5 03/07/2016
[  317.683284] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[  317.683398] task: ffff8803b0990000 task.stack: ffff8803c2ef0000
[  317.683516] RIP: 0010:blk_mq_unquiesce_queue+0x2b/0xa0
[  317.683614] RSP: 0018:ffff8803c2ef7d40 EFLAGS: 00010282
[  317.683716] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 1ffff1006fbdcde3
[  317.683847] RDX: 0000000000000038 RSI: 1ffff1006f5a9245 RDI: 0000000000000000
[  317.683978] RBP: ffff8803c2ef7d58 R08: 1ffff1007bcdc974 R09: 0000000000000000
[  317.684108] R10: 1ffff1007bcdc975 R11: 0000000000000000 R12: 00000000000001c0
[  317.684239] R13: ffff88037ad49228 R14: ffff88037ad492d0 R15: ffff88037ad492e0
[  317.684371] FS:  0000000000000000(0000) GS:ffff8803de6c0000(0000) knlGS:0000000000000000
[  317.684519] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  317.684627] CR2: 0000002d1860c000 CR3: 000000045b40d000 CR4: 00000000003406e0
[  317.684758] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  317.684888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  317.685018] Call Trace:
[  317.685084]  nvme_kill_queues+0x4d/0x170 [nvme_core]
[  317.685185]  nvme_remove_dead_ctrl_work+0x3a/0x90 [nvme]
[  317.685289]  process_one_work+0x771/0x1170
[  317.685372]  worker_thread+0xde/0x11e0
[  317.685452]  ? pci_mmcfg_check_reserved+0x110/0x110
[  317.685550]  kthread+0x2d3/0x3d0
[  317.685617]  ? process_one_work+0x1170/0x1170
[  317.685704]  ? kthread_create_on_node+0xc0/0xc0
[  317.685785]  ret_from_fork+0x25/0x30
[  317.685798] Code: 0f 1f 44 00 00 55 48 b8 00 00 00 00 00 fc ff df 48 89 e5 41 54 4c 8d a7 c0 01 00 00 53 48 89 fb 4c 89 e2 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 75 50 48 8b bb c0 01 00 00 e8 33 8a f9 00 0f ba b3
[  317.685872] RIP: blk_mq_unquiesce_queue+0x2b/0xa0 RSP: ffff8803c2ef7d40
[  317.685908] ---[ end trace a3f8704150b1e8b4 ]---

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-26 17:41:41 +02:00
Johannes Thumshirn 6484f5d16f nvme: also provide a UUID in the WWID sysfs attribute
The WWID sysfs attribute can provide multiple means of a World Wide ID
for a NVMe device. It can either be a NGUID, a EUI-64 or a concatenation
of VID, Serial Number, Model and the Namespace ID in this order of
preference.

If the target also sends us a UUID use the UUID for identification and
give it the highest priority.

This eases generation of /dev/disk/by-* symlinks.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-25 17:58:22 +02:00
Christoph Hellwig dc1a0afbac nvme: fix byte swapping in the streams code
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-20 08:41:56 -06:00
Sagi Grimberg d09f2b45f3 nvme: split nvme_uninit_ctrl into stop and uninit
Usually before we teardown the controller we want to:
1. complete/cancel any ctrl inflight works
2. remove ctrl namespaces (only for removal though, resets
   shouldn't remove any namespaces).

but we do not want to destroy the controller device as
we might use it for logging during the teardown stage.

This patch adds nvme_start_ctrl() which queues inflight
controller works (aen, ns scan, queue start and keep-alive
if kato is set) and nvme_stop_ctrl() which cancels the works
namespace removal is left to the callers to handle.

Move nvme_uninit_ctrl after we are done with the
controller device.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-06 09:49:42 +03:00
Sagi Grimberg 8d7b8fafad nvme: kick requeue list when requeueing a request instead of when starting the queues
When we requeue a request, we can always insert the request
back to the scheduler instead of doing it when restarting
the queues and kicking the requeue work, so get rid of
the requeue kick in nvme (core and drivers).

Also, now there is no need start hw queues in nvme_kill_queues
We don't stop the hw queues anymore, so no need to
start them.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-07-06 09:48:59 +03:00
Christoph Hellwig 49d3d50b0d nvme: simplify nvme_dev_attrs_are_visible
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 08:14:13 -06:00
Christoph Hellwig 180de00700 nvme: read the subsystem NQN from Identify Controller
NVMe 1.2.1 or later requires controllers to provide a subsystem NQN in the
Identify controller data structures.  Use this NQN for the subsysnqn
sysfs attribute by storing it in the nvme_ctrl structure after verifying
it.  For older controllers we generate a "fake" NQN per non-normative
text in the NVMe 1.3 spec.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 08:14:13 -06:00
Kai-Heng Feng 76a5af8417 nvme: explicitly disable APST on quirked devices
A user reports APST is enabled, even when the NVMe is quirked or with
option "default_ps_max_latency_us=0".

The current logic will not set APST if the device is quirked. But the
NVMe in question will enable APST automatically.

Separate the logic "apst is supported" and "to enable apst", so we can
use the latter one to explicitly disable APST at initialiaztion.

BugLink: https://bugs.launchpad.net/bugs/1699004
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 08:14:13 -06:00
Keith Busch 3f7f25a910 nvme: Remove SCSI translations
The SCSI-to-NVMe translations were added to assist storage applications
utilizing SG_IO transitioning to NVMe. It was always recommended,
however, to use native NVMe for device management as too much is lost
in translation and the maintenance burden in keeping this kludgey
layer around has been neglected such that much of the translations are
completely broken.

This patch removes SG_IO handling from NVMe to avoid any confusion
regarding maintenance support for this interface. The config option for
NVMe SCSI emulation has been disabled by default since 4.5. The driver
has supported native nvme user commands since the beginning, and native
tooling is publicly available for use or as reference for anyone writing
their own tools, so there's no excuse for hanging onto a broken crutch.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-28 08:14:13 -06:00
Jens Axboe f5d1184062 nvme: add support for streams and directives
This adds support for Directives in NVMe, particular for the Streams
directive. Support for Directives is a new feature in NVMe 1.3. It
allows a user to pass in information about where to store the data, so
that it the device can do so most effiently. If an application is
managing and writing data with different life times, mixing differently
retentioned data onto the same locations on flash can cause write
amplification to grow. This, in turn, will reduce performance and life
time of the device.

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-27 12:05:56 -06:00
Ming Lei 443bd90f2c nvme: host: unquiesce queue in nvme_kill_queues()
When nvme_kill_queues() is run, queues may be in
quiesced state, so we forcibly unquiesce queues to avoid
blocking dispatch, and I/O hang can be avoided in
remove path.

Peviously we use blk_mq_start_stopped_hw_queues() as
counterpart of blk_mq_quiesce_queue(), now we have
introduced blk_mq_unquiesce_queue(), so use it explicitly.

Cc: linux-nvme@lists.infradead.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 20:52:58 -06:00
Ming Lei f660174e8b blk-mq: use the introduced blk_mq_unquiesce_queue()
blk_mq_unquiesce_queue() is used for unquiescing the
queue explicitly, so replace blk_mq_start_stopped_hw_queues()
with it.

For the scsi part, this patch takes Bart's suggestion to
switch to block quiesce/unquiesce API completely.

Cc: linux-nvme@lists.infradead.org
Cc: linux-scsi@vger.kernel.org
Cc: dm-devel@redhat.com
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-18 14:20:04 -06:00
Scott Bauer 6b8190d61a nvme: implement NS Optimal IO Boundary from 1.3 Spec
The NVMe 1.3 spec introduces Namespace Optimal IO Boundaries (NOIOB),
which standardizes the stripe mechanism we currently have quirks for.
This patch implements the necessary logic to handle this new feature.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-16 08:25:54 +02:00
Sagi Grimberg 8fa611213d nvme: don't hard code size of struct t10_pi_tuple
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 15:50:18 +02:00
Christoph Hellwig 39bdc5901f nvme: no need to wait for the reset when keepalive fails
We don't need to wait for the reset from the delayed work item that
is kicked off when we don't get a keepalive.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-06-15 15:48:45 +02:00
Christoph Hellwig d86c4d8ef3 nvme: move reset workqueue handling to common code
This moves the nvme_reset function from the PCIe driver to common code,
renaming it to nvme_reset_ctrl in the process.  Additionally a new
helper nvme_reset_ctrl_sync is added for the case where we want to
wait for the reset.  To facilitate that the reset_work work structure is
move to the common nvme_ctrl structure and the ->reset_ctrl method is
removed.  For now the drivers initialize the reset_work with their own
callback, but longer term we should move to callouts for specific
parts of the reset process and move even more code to the core.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
2017-06-15 15:48:34 +02:00
Christoph Hellwig ebe6d874cd nvme: move protection information check into nvme_setup_rw
It only applies to read/write commands, and this way non-PCIe drivers
get the check as well instead of having to duplicate it when adding
metadata support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:30 +02:00
Christoph Hellwig b3b1b0b01d nvme: mark shutdown_timeout static
And open code the SHUTDOWN_TIMEOUT macro.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:29 +02:00
Johannes Thumshirn f0425db00c nvme: use ctrl->device consistently for logging
Change the few left over users of ctrl->dev over to using ctrl->device
for logging purposes, so we consistently use the same device.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:24 +02:00
Johannes Thumshirn d934f9848a nvme: provide UUID value to userspace
Now that we have a way for getting the UUID from a target, provide it
to userspace as well.

Unfortunately there is already a sysfs attribute called UUID which is
a misnomer as it holds the NGUID value. So instead of creating yet
another wrong name, create a new 'nguid' sysfs attribute for the
NGUID. For the UUID attribute add a check wheter the namespace has a
UUID assigned to it and return this or return the NGUID to maintain
backwards compatibility. This should give userspace a chance to catch
up.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@rimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:20 +02:00
Johannes Thumshirn 3b22ba2682 nvme: get list of namespace descriptors
If a target identifies itself as NVMe 1.3 compliant, try to get the
list of Namespace Identification Descriptors and populate the UUID,
NGUID and EUI64 fileds in the NVMe namespace structure with these
values.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:18 +02:00
Johannes Thumshirn 90985b84c4 nvme: rename uuid to nguid in nvme_ns
The uuid field in the nvme_ns structure represents the nguid field
from the identify namespace command. And as NVMe 1.3 introduced an
UUID in the NVMe Namespace Identification Descriptor this will
collide.

So rename the uuid to nguid to prevent any further
confusion. Unfortunately we export the nguid to sysfs in the uuid
sysfs attribute, but this can't be changed anymore without possibly
breaking existing userspace.

Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:30:17 +02:00
Sagi Grimberg c669ccdc50 nvme: queue ns scanning and async request from nvme_wq
To suppress the warning triggered by nvme_uninit_ctrl:
kernel: [   50.350439] nvme nvme0: rescanning
kernel: [   50.363351] ------------[ cut here]------------
kernel: [   50.363396] WARNING: CPU: 1 PID: 37 at kernel/workqueue.c:2423 check_flush_dependency+0x11f/0x130
kernel: [   50.363409] workqueue: WQ_MEM_RECLAIM
nvme-wq:nvme_del_ctrl_work [nvme_core] is flushing !WQ_MEM_RECLAIM events:nvme_scan_work [nvme_core]

This was triggered with nvme-loop, but can happen with rdma/pci as well afaict.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:29:48 +02:00
Sagi Grimberg 9a6327d2f2 nvme: Move transports to use nvme-core workqueue
Instead of each transport using it's own workqueue, export
a single nvme-core workqueue and use that instead.

In the future, this will help us moving towards some unification
if controller setup/teardown flows.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:29:43 +02:00
Sagi Grimberg c58bd1bf4d nvme: Don't allow to reset a reconnecting controller
The reset operation is guaranteed to fail for all scenarios
but the esoteric case where in the last reconnect attempt
concurrent with the reset we happen to successfully reconnect.

We just deny initiating a reset if we are reconnecting.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-15 14:28:22 +02:00
Christoph Hellwig fe6d53c9c0 nvme: save hmpre and hmmin in struct nvme_ctrl
We'll need the later for the HMB support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2017-06-13 11:45:35 +02:00
Jens Axboe 8f66439eec Linux 4.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJZPdbLAAoJEHm+PkMAQRiGx4wH/1nCjfnl6fE8oJ24/1gEAOUh
 biFdqJkYZmlLYHVtYfLm4Ueg4adJdg0wx6qM/4RaAzmQVvLfDV34bc1qBf1+P95G
 kVF+osWyXrZo5cTwkwapHW/KNu4VJwAx2D1wrlxKDVG5AOrULH1pYOYGOpApEkZU
 4N+q5+M0ce0GJpqtUZX+UnI33ygjdDbBxXoFKsr24B7eA0ouGbAJ7dC88WcaETL+
 2/7tT01SvDMo0jBSV0WIqlgXwZ5gp3yPGnklC3F4159Yze6VFrzHMKS/UpPF8o8E
 W9EbuzwxsKyXUifX2GY348L1f+47glen/1sedbuKnFhP6E9aqUQQJXvEO7ueQl4=
 =m2Gx
 -----END PGP SIGNATURE-----

Merge tag 'v4.12-rc5' into for-4.13/block

We've already got a few conflicts and upcoming work depends on some of the
changes that have gone into mainline as regression fixes for this series.

Pull in 4.12-rc5 to resolve these conflicts and make it easier on down stream
trees to continue working on 4.13 changes.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-12 08:30:13 -06:00
Christoph Hellwig fc17b6534e blk-mq: switch ->queue_rq return value to blk_status_t
Use the same values for use for request completion errors as the return
value from ->queue_rq.  BLK_STS_RESOURCE is special cased to cause
a requeue, and all the others are completed as-is.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Christoph Hellwig 2a842acab1 block: introduce new block status code type
Currently we use nornal Linux errno values in the block layer, and while
we accept any error a few have overloaded magic meanings.  This patch
instead introduces a new  blk_status_t value that holds block layer specific
status codes and explicitly explains their meaning.  Helpers to convert from
and to the previous special meanings are provided for now, but I suspect
we want to get rid of them in the long run - those drivers that have a
errno input (e.g. networking) usually get errnos that don't know about
the special block layer overloads, and similarly returning them to userspace
will usually return somethings that strictly speaking isn't correct
for file system operations, but that's left as an exercise for later.

For now the set of errors is a very limited set that closely corresponds
to the previous overloaded errno values, but there is some low hanging
fruite to improve it.

blk_status_t (ab)uses the sparse __bitwise annotations to allow for sparse
typechecking, so that we can easily catch places passing the wrong values.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Kai-Heng Feng 9947d6a09c nvme: relax APST default max latency to 100ms
Christoph Hellwig suggests we should to make APST work out of the box.
Hence relax the the default max latency to make them able to enter
deepest power state on default.

Here are id-ctrl excerpts from two high latency NVMes:

vid     : 0x14a4
ssvid   : 0x1b4b
mn      : CX2-GB1024-Q11 NVMe LITEON 1024GB
ps    3 : mp:0.1000W non-operational enlat:5000 exlat:5000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0100W non-operational enlat:50000 exlat:100000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

vid     : 0x15b7
ssvid   : 0x1b4b
mn      : A400 NVMe SanDisk 512GB
ps    3 : mp:0.0500W non-operational enlat:51000 exlat:10000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    4 : mp:0.0055W non-operational enlat:1000000 exlat:100000 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-07 11:08:55 +02:00
Kai-Heng Feng da87591bea nvme: only consider exit latency when choosing useful non-op power states
When a NVMe is in non-op states, the latency is exlat.
The latency will be enlat + exlat only when the NVMe tries to transit
from operational state right atfer it begins to transit to
non-operational state, which should be a rare case.

Therefore, as Andy Lutomirski suggests, use exlat only when deciding power
states to trainsit to.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-07 11:08:54 +02:00
Ming Lei 82654b6b8e nvme: fix hang in remove path
We need to start admin queues too in nvme_kill_queues()
for avoiding hang in remove path[1].

This patch is very similar with 806f026f9b901eaf(nvme: use
blk_mq_start_hw_queues() in nvme_kill_queues()).

[1] hang stack trace
[<ffffffff813c9716>] blk_execute_rq+0x56/0x80
[<ffffffff815cb6e9>] __nvme_submit_sync_cmd+0x89/0xf0
[<ffffffff815ce7be>] nvme_set_features+0x5e/0x90
[<ffffffff815ce9f6>] nvme_configure_apst+0x166/0x200
[<ffffffff815cef45>] nvme_set_latency_tolerance+0x35/0x50
[<ffffffff8157bd11>] apply_constraint+0xb1/0xc0
[<ffffffff8157cbb4>] dev_pm_qos_constraints_destroy+0xf4/0x1f0
[<ffffffff8157b44a>] dpm_sysfs_remove+0x2a/0x60
[<ffffffff8156d951>] device_del+0x101/0x320
[<ffffffff8156db8a>] device_unregister+0x1a/0x60
[<ffffffff8156dc4c>] device_destroy+0x3c/0x50
[<ffffffff815cd295>] nvme_uninit_ctrl+0x45/0xa0
[<ffffffff815d4858>] nvme_remove+0x78/0x110
[<ffffffff81452b69>] pci_device_remove+0x39/0xb0
[<ffffffff81572935>] device_release_driver_internal+0x155/0x210
[<ffffffff81572a02>] device_release_driver+0x12/0x20
[<ffffffff815d36fb>] nvme_remove_dead_ctrl_work+0x6b/0x70
[<ffffffff810bf3bc>] process_one_work+0x18c/0x3a0
[<ffffffff810bf61e>] worker_thread+0x4e/0x3b0
[<ffffffff810c5ac9>] kthread+0x109/0x140
[<ffffffff8185800c>] ret_from_fork+0x2c/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

Fixes: c5552fde102fc("nvme: Enable autonomous power state transitions")
Reported-by: Rakesh Pandit <rakesh@tuxera.com>
Tested-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-07 11:08:49 +02:00
Christoph Hellwig c81bfba998 nvme: only setup block integrity if supported by the driver
Currently only the PCIe driver supports metadata, so we should not claim
integrity support for the other drivers.  This prevents nasty crashes
with targets that advertise metadata support on fabrics.

Also use the opportunity to factor out some code into a separate helper
that isn't even compiled if CONFIG_BLK_DEV_INTEGRITY is disabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-05-26 09:54:23 +03:00
Christoph Hellwig d3d5b87ddd nvme: replace is_flags field in nvme_ctrl_ops with a flags field
So that we can have more flags for transport-specific behavior.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
2017-05-26 09:54:16 +03:00
Ming Lei 986f75c876 nvme: avoid to use blk_mq_abort_requeue_list()
NVMe may add request into requeue list simply and not kick off the
requeue if hw queues are stopped. Then blk_mq_abort_requeue_list()
is called in both nvme_kill_queues() and nvme_ns_remove() for
dealing with this issue.

Unfortunately blk_mq_abort_requeue_list() is absolutely a
race maker, for example, one request may be requeued during
the aborting. So this patch just calls blk_mq_kick_requeue_list() in
nvme_kill_queues() to handle this issue like what nvme_start_queues()
does. Now all requests in requeue list when queues are stopped will be
handled by blk_mq_kick_requeue_list() when queues are restarted, either
in nvme_start_queues() or in nvme_kill_queues().

Cc: stable@vger.kernel.org
Reported-by: Zhang Yi <yizhan@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-22 20:50:10 +02:00
Ming Lei 806f026f9b nvme: use blk_mq_start_hw_queues() in nvme_kill_queues()
Inside nvme_kill_queues(), we have to start hw queues for
draining requests in sw queues, .dispatch list and requeue list,
so use blk_mq_start_hw_queues() instead of blk_mq_start_stopped_hw_queues()
which only run queues if queues are stopped, but the queues may have
been started already, for example nvme_start_queues() is called in reset work
function.

blk_mq_start_hw_queues() run hw queues in current context, instead
of running asynchronously like before. Given nvme_kill_queues() is
run from either remove context or reset worker context, both are fine
to run hw queue directly. And the mutex of namespaces_mutex isn't a
problem too becasue nvme_start_freeze() runs hw queue in this way
already.

Cc: stable@vger.kernel.org
Reported-by: Zhang Yi <yizhan@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-05-22 20:50:09 +02:00
Andy Lutomirski c35e30b472 nvme: Add nvme_core.force_apst to ignore the NO_APST quirk
We're probably going to be stuck quirking APST off on an over-broad
range of devices for 4.11.  Let's make it easy to override the quirk
for testing.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24 22:03:46 -06:00
Andy Lutomirski fb0dc3993b nvme: Display raw APST configuration via DYNAMIC_DEBUG
Debugging APST is currently a bit of a pain.  This gives optional
simple log messages that describe the APST state.

The easiest way to use this is probably with the nvme_core.dyndbg=+p
module parameter.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24 22:03:46 -06:00
Andy Lutomirski 76e4ad09a3 nvme: Fix APST comment
There was a typo in the description of the timeout heuristic.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-24 22:03:46 -06:00
Jens Axboe d9fd363a6c Merge branch 'master' into for-4.12/post-merge 2017-04-24 22:03:14 -06:00
Junxiong Guan e02ab02304 nvme: let dm-mpath distinguish nvme error codes
Currently most IOs which return the nvme error codes are retried on
the other path if those IOs returns EIO from NVMe driver. This
patch let Multipath distinguish nvme media error codes and some
generic or cmd-specific nvme error codes so that multipath will
not retry those kinds of IO, to save bandwidth.

Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-04-21 16:41:56 +02:00
Andy Lutomirski be56945c4e nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA"
There's a report that it malfunctions with APST on.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 14:42:10 -06:00
Andy Lutomirski ff5350a86b nvme: Adjust the Samsung APST quirk
I got a couple more reports: the Samsung APST issues appears to
affect multiple 950-series devices in Dell XPS 15 9550 and Precision
5510 laptops.  Change the quirk: rather than blacklisting the
firmware on the first problematic SSD that was reported, disable
APST on all 144d:a802 devices if they're installed in the two
affected Dell models.  While we're at it, disable only the deepest
sleep state instead of all of them -- the reporters say that this is
sufficient to fix the problem.

(I have a device that appears to be entirely identical to one of the
affected devices, but I have a different Dell laptop, so it's not
the case that all Samsung devices with firmware BXW75D0Q are broken
under all circumstances.)

Samsung engineers have an affected system, and hopefully they'll
give us a better workaround some time soon.  In the mean time, this
should minimize regressions.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 14:42:09 -06:00
Christoph Hellwig 08e0029aa2 blk-mq: remove the error argument to blk_mq_complete_request
Now that all drivers that call blk_mq_complete_requests have a
->complete callback we can remove the direct call to blk_mq_end_request,
as well as the error argument to blk_mq_complete_request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig 65ba6b54e7 nvme: make nvme_error_status private
Currently it's used by the lighnvm passthrough ioctl, but we'd like to make
it private in preparation of block layer specific error code.  Lighnvm already
returns the real NVMe status anyway, so I think we can just limit it to
returning -EIO for any status set.

This will need a careful audit from the lightnvm folks, though.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig 27fa9bc545 nvme: split nvme status from block req->errors
We want our own clearly defined error field for NVMe passthrough commands,
and the request errors field is going away in its current form.

Just store the status and result field in the nvme_request field from
hardirq completion context (using a new helper) and then generate a
Linux errno for the block layer only when we actually need it.

Because we can't overload the status value with a negative error code
for cancelled command we now have a flags filed in struct nvme_request
that contains a bit for this condition.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20 12:16:10 -06:00
Christoph Hellwig e850fd16f7 nvme: implement REQ_OP_WRITE_ZEROES
But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing.  Also rename the quirk flag to be a bit
more self-explanatory.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-08 11:25:38 -06:00
Jens Axboe 65f619d253 Merge branch 'for-linus' into for-4.12/block
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.

Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-07 12:45:20 -06:00
Christoph Hellwig 44e44b29fb nvme: move the retries count to struct nvme_request
The way NVMe uses this field is entirely different from the older
SCSI/BLOCK_PC usage, so move it into struct nvme_request.

Also reduce the size of the file to a unsigned char so that we leave
space for additional smaller fields that will appear soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig 83f3aeb386 nvme: mark nvme_max_retries static
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig f6324b1bb7 nvme: cleanup nvme_req_needs_retry
Don't pass the status explicitly but derive it from the requeust,
and unwind the complex condition to be more readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig 987f699a8f nvme: move ->retries setup to nvme_setup_cmd
->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command.  We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-05 12:05:08 -06:00
Christoph Hellwig 77f02a7acd nvme: factor request completion code into a common helper
This avoids duplicating the logic four times, and it also allows to keep
some helpers static in core.c or just opencode them.

Note that this loses printing the aborted status on completions in the
PCI driver as that uses a data structure not available any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-04 09:48:23 -06:00
Christoph Hellwig f1dd03a84d nvme: add missing byte swap in nvme_setup_discard
Fixes: b35ba01e ("nvme: support ranged discard requests")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2017-04-02 10:24:15 +03:00
Ming Lei 1671d522cd block: rename blk_mq_freeze_queue_start()
As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().

So rename it and we can use this function in both
paths.

Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-29 08:03:42 -06:00
Keith Busch 302ad8cc09 nvme: Complete all stuck requests
If the nvme driver is shutting down its controller, the drievr will not
start the queues up again, preventing blk-mq's hot CPU notifier from
making forward progress.

To fix that, this patch starts a request_queue freeze when the driver
resets a controller so no new requests may enter. The driver will wait
for frozen after IO queues are restarted to ensure the queue reference
can be reinitialized when nvme requests to unfreeze the queues.

If the driver is doing a safe shutdown, the driver will wait for the
controller to successfully complete all inflight requests so that we
don't unnecessarily fail them. Once the controller has been disabled,
the queues will be restarted to force remaining entered requests to end
in failure so that blk-mq's hot cpu notifier may progress.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-03-02 08:56:59 -07:00
Keith Busch f33447b90e nvme/core: Fix race kicking freed request_queue
If a namespace has already been marked dead, we don't want to kick the
request_queue again since we may have just freed it from another thread.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Andy Lutomirski c5552fde10 nvme: Enable autonomous power state transitions
NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each successive
non-operational state will be entered after an idle time equal to 100%
of the total latency (entry plus exit) associated with that state.
The maximum acceptable latency is controlled using dev_pm_qos
(e.g. power/pm_qos_latency_tolerance_us in sysfs); non-operational
states with total latency greater than this value will not be used.
As a special case, setting the latency tolerance to 0 will disable
APST entirely.  On hardware without APST support, the sysfs file will
not be exposed.

The latency tolerance for newly-probed devices is set by the module
parameter nvme_core.default_ps_max_latency_us.

In theory, the device can expose "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

This feature is quirked off on a known-buggy Samsung device.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Andy Lutomirski bd4da3abaa nvme: Add a quirk mechanism that uses identify_ctrl
Currently, all NVMe quirks are based on PCI IDs.  Add a mechanism to
define quirks based on identify_ctrl's vendor id, model number,
and/or firmware revision.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Parav Pandit 986994a275 nvme: Use CNS as 8-bit field and avoid endianness conversion
This patch defines CNS field as 8-bit field and avoids cpu_to/from_le
conversions.
Also initialize nvme_command cns value explicitly to NVME_ID_CNS_NS
for readability (don't rely on the fact that NVME_ID_CNS_NS = 0).

Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Max Gurtovoy 778f067c18 nvme: add semicolon in nvme_command setting
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Sagi Grimberg 8432bdb290 nvme: Make controller state visible via sysfs
Easier for debugging and testing state machine
transitions.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-22 13:34:00 -07:00
Jens Axboe 818551e2b2 Merge branch 'for-4.11/next' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:08:19 -07:00
Jens Axboe 6010720da8 Merge branch 'for-4.11/block' into for-4.11/linus-merge
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 14:06:45 -07:00
Scott Bauer 8a9ae52328 nvme: Check for Security send/recv support before issuing commands.
We need to verify that the controller supports the security
commands before actually trying to issue them.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
[hch: moved the check so that we don't call into the OPAL code if not
      supported]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:49 -07:00
Christoph Hellwig 4f1244c829 block/sed-opal: allocate struct opal_dev dynamically
Insted of bloating the containing structure with it all the time this
allocates struct opal_dev dynamically.  Additionally this allows moving
the definition of struct opal_dev into sed-opal.c.  For this a new
private data field is added to it that is passed to the send/receive
callback.  After that a lot of internals can be made private as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Scott Bauer <scott.bauer@intel.com>
Reviewed-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-17 12:41:47 -07:00
Scott Bauer e225c20eb0 Move stack parameters for sed_ioctl to prevent oversized stack with CONFIG_KASAN
When CONFIG_KASAN is enabled, compilation fails:

block/sed-opal.c: In function 'sed_ioctl':
block/sed-opal.c:2447:1: error: the frame size of 2256 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

Moved all the ioctl structures off the stack and dynamically allocate
using _IOC_SIZE()

Fixes: 455a7b238c ("block: Add Sed-opal library")

Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-14 19:47:18 -07:00
Christoph Hellwig b35ba01ea6 nvme: support ranged discard requests
NVMe supports up to 256 ranges per DSM command, so wire up support
for ranged discards up to that limit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-08 13:43:10 -07:00
Scott Bauer a98e58e54f nvme: Add Support for Opal: Unlock from S3 & Opal Allocation/Ioctls
This patch implements the necessary logic to unlock an Opal
enabled device coming back from an S3.

The patch also implements the SED/Opal allocation necessary to support
the opal ioctls.

Signed-off-by: Scott Bauer <scott.bauer@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-02-06 09:44:21 -07:00
Christoph Hellwig aebf526b53 block: fold cmd_type into the REQ_OP_ space
Instead of keeping two levels of indirection for requests types, fold it
all into the operations.  The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, althought it has to add two for each so that we
can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 14:00:44 -07:00
Matias Bjørling 84d4add793 lightnvm: add ioctls for vector I/Os
Enable user-space to issue vector I/O commands through ioctls. To issue
a vector I/O, the ppa list with addresses is also required and must be
mapped for the controller to access.

For each ioctl, the result and status bits are returned as well, such
that user-space can retrieve the open-channel SSD completion bits.

The implementation covers the traditional use-cases of bad block
management, and vectored read/write/erase.

Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Metadata implementation, test, and fixes.
Signed-off-by: Simon A.F. Lund <slund@cnexlabs.com>
Signed-off-by: Matias Bjørling <matias@cnexlabs.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-01-31 08:32:13 -07:00
Guilherme G. Piccoli b5a10c5f75 nvme: apply DELAY_BEFORE_CHK_RDY quirk at probe time too
Commit 54adc01055 ("nvme/quirk: Add a delay before checking for adapter
readiness") introduced a quirk to adapters that cannot read the bit
NVME_CSTS_RDY right after register NVME_REG_CC is set; these adapters
need a delay or else the action of reading the bit NVME_CSTS_RDY could
somehow corrupt adapter's registers state and it never recovers.

When this quirk was added, we checked ctrl->tagset in order to avoid
quirking in probe time, supposing we would never require such delay
during probe. Well, it was too optimistic; we in fact need this quirk
at probe time in some cases, like after a kexec.

In some experiments, after abnormal shutdown of machine (aka power cord
unplug), we booted into our bootloader in Power, which is a Linux kernel,
and kexec'ed into another distro. If this kexec is too quick, we end up
reaching the probe of NVMe adapter in that distro when adapter is in
bad state (not fully initialized on our bootloader). What happens next
is that nvme_wait_ready() is unable to complete, except if the quirk is
enabled.

So, this patch removes the original ctrl->tagset verification in order
to enable the quirk even on probe time.

Fixes: 54adc01055 ("nvme/quirk: Add a delay before checking for adapter readiness")
Reported-by: Andrew Byrne <byrneadw@ie.ibm.com>
Reported-by: Jaime A. H. Gomez <jahgomez@mx1.ibm.com>
Reported-by: Zachary D. Myers <zdmyers@us.ibm.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Acked-by: Jeffrey Lien <Jeff.Lien@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-01-11 17:21:35 +01:00
Keith Busch e6282aef7b nvme: simplify stripe quirk
Some OEMs believe they own the Identify Controller vendor specific
region and will repurpose it with their own values. While not common,
we can't rely on the PCI VID:DID to tell use how to decode the field
we reserved for this as the stripe size so we need to do something else
for the list of devices using this quirk.

The field was supposed to allow flexibility on the device's back-end
striping, but it turned out that never materialized; the chunk is always
the same as MDTS in the products subscribing to this quirk, so this
patch removes the stripe_size field and sets the chunk to the max hw
transfer size for the devices using this quirk.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-12-21 11:33:06 +01:00
Linus Torvalds cdb98c2698 Revert "nvme: add support for the Write Zeroes command"
This reverts commit 6d31e3ba23.

This causes bootup problems for me both on my laptop and my desktop.
What they have in common is that they have NVMe disks with dm-crypt, but
it's not the same controller, so it's not controller-specific.

Jens does not see it on his machine (also NVMe), so it's presumably
something that triggers just on bootup.  Possibly related to dm-crypt
and the fact that I mark my luks volume with "allow-discards" in
/etc/crypttab.

It's 100% repeatable for me, which made it fairly straightforward to
bisect the problem to this commit. Small mercies.

So we don't know what the reason is yet, but the revert is needed to get
things going again.

Acked-by: Jens Axboe <axboe@fb.com>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13 19:53:37 -08:00
Linus Torvalds 36869cb93d Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main block pull request this series. Contrary to previous
  release, I've kept the core and driver changes in the same branch. We
  always ended up having dependencies between the two for obvious
  reasons, so makes more sense to keep them together. That said, I'll
  probably try and keep more topical branches going forward, especially
  for cycles that end up being as busy as this one.

  The major parts of this pull request is:

   - Improved support for O_DIRECT on block devices, with a small
     private implementation instead of using the pig that is
     fs/direct-io.c. From Christoph.

   - Request completion tracking in a scalable fashion. This is utilized
     by two components in this pull, the new hybrid polling and the
     writeback queue throttling code.

   - Improved support for polling with O_DIRECT, adding a hybrid mode
     that combines pure polling with an initial sleep. From me.

   - Support for automatic throttling of writeback queues on the block
     side. This uses feedback from the device completion latencies to
     scale the queue on the block side up or down. From me.

   - Support from SMR drives in the block layer and for SD. From Hannes
     and Shaun.

   - Multi-connection support for nbd. From Josef.

   - Cleanup of request and bio flags, so we have a clear split between
     which are bio (or rq) private, and which ones are shared. From
     Christoph.

   - A set of patches from Bart, that improve how we handle queue
     stopping and starting in blk-mq.

   - Support for WRITE_ZEROES from Chaitanya.

   - Lightnvm updates from Javier/Matias.

   - Supoort for FC for the nvme-over-fabrics code. From James Smart.

   - A bunch of fixes from a whole slew of people, too many to name
     here"

* 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits)
  blk-stat: fix a few cases of missing batch flushing
  blk-flush: run the queue when inserting blk-mq flush
  elevator: make the rqhash helpers exported
  blk-mq: abstract out blk_mq_dispatch_rq_list() helper
  blk-mq: add blk_mq_start_stopped_hw_queue()
  block: improve handling of the magic discard payload
  blk-wbt: don't throttle discard or write zeroes
  nbd: use dev_err_ratelimited in io path
  nbd: reset the setup task for NBD_CLEAR_SOCK
  nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME
  nvme-fabrics: Add target support for FC transport
  nvme-fabrics: Add host support for FC transport
  nvme-fabrics: Add FC transport LLDD api definitions
  nvme-fabrics: Add FC transport FC-NVME definitions
  nvme-fabrics: Add FC transport error codes to nvme.h
  Add type 0x28 NVME type code to scsi fc headers
  nvme-fabrics: patch target code in prep for FC transport support
  nvme-fabrics: set sqe.command_id in core not transports
  parser: add u64 number parser
  nvme-rdma: align to generic ib_event logging helper
  ...
2016-12-13 10:19:16 -08:00
Christoph Hellwig f9d03f96b9 block: improve handling of the magic discard payload
Instead of allocating a single unused biovec for discard requests, send
them down without any payload.  Instead we allow the driver to add a
"special" payload using a biovec embedded into struct request (unioned
over other fields never used while in the driver), and overloading
the number of segments for this case.

This has a couple of advantages:

 - we don't have to allocate the bio_vec
 - the amount of special casing for discard requests in the block
   layer is significantly reduced
 - using this same scheme for other request types is trivial,
   which will be important for implementing the new WRITE_ZEROES
   op on devices where it actually requires a payload (e.g. SCSI)
 - we can get rid of playing games with the request length, as
   we'll never touch it and completions will work just fine
 - it will allow us to support ranged discard operations in the
   future by merging non-contiguous discard bios into a single
   request
 - last but not least it removes a lot of code

This patch is the common base for my WIP series for ranges discards and to
remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
so it would be good to get it in quickly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-09 08:30:51 -07:00
James Smart 721b3917c4 nvme-fabrics: set sqe.command_id in core not transports
Currently, core.c sets command_id only on rd/wr commands, leaving it to
the transport to set it again to ensure the request had a command id.

Move location of set in core so applies to all commands.
Remove transport sets.

Signed-off-by: James Smart <james.smart@broadcom.com>
Reviewed-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
2016-12-06 10:17:03 +02:00
Chaitanya Kulkarni 6d31e3ba23 nvme: add support for the Write Zeroes command
Allow write zeroes operations (REQ_OP_WRITE_ZEROES) on the block
device, if the device supports optional command bit set for write
zeroes. Add support to setup write zeroes command. Set maximum possible
write zeroes sectors in one write zeroes command according to
nvme write zeroes command definition.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@hgst.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-12-01 07:58:40 -07:00
Matias Bjørling 3dc87dd048 nvme: lightnvm: attach lightnvm sysfs to nvme block device
Previously, LBA read and write were not supported in the lightnvm
specification. Now that it supports it, lets use the traditional
NVMe gendisk, and attach the lightnvm sysfs geometry export.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-29 12:12:51 -07:00
Omar Sandoval bac0000af5 nvme: untangle 0 and BLK_MQ_RQ_QUEUE_OK
Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
specific values. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-15 12:50:11 -07:00
Christoph Hellwig 7bf58533a0 nvme: don't pass the full CQE to nvme_complete_async_event
We only need the status and result fields, and passing them explicitly
makes life a lot easier for the Fibre Channel transport which doesn't
have a full CQE for the fast path case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:26 -07:00
Christoph Hellwig d49187e97e nvme: introduce struct nvme_request
This adds a shared per-request structure for all NVMe I/O.  This structure
is embedded as the first member in all NVMe transport drivers request
private data and allows to implement common functionality between the
drivers.

The first use is to replace the current abuse of the SCSI command
passthrough fields in struct request for the NVMe command passthrough,
but it will grow a field more fields to allow implementing things
like common abort handlers in the future.

The passthrough commands are handled by having a pointer to the SQE
(struct nvme_command) in struct nvme_request, and the union of the
possible result fields, which had to be turned from an anonymous
into a named union for that purpose.  This avoids having to pass
a reference to a full CQE around and thus makes checking the result
a lot more lightweight.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-10 10:06:24 -07:00
Bart Van Assche a6eaa8849f nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
that became superfluous because of this change. Change
blk_queue_stopped() tests into blk_mq_queue_stopped().

This patch fixes a race condition: using queue_flag_clear_unlocked()
is not safe if any other function that manipulates the queue flags
can be called concurrently, e.g. blk_cleanup_queue().

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 3174dd33fa nvme: Fix a race condition related to stopping queues
Avoid that nvme_queue_rq() is still running when nvme_stop_queues()
returns.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 2b053aca76 blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Bart Van Assche 9b7dd572cc blk-mq: Remove blk_mq_cancel_requeue_work()
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-02 12:50:19 -06:00
Christoph Hellwig fa60682677 nvme: use symbolic constants for CNS values
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Gabriel Krisman Bertazi 8ef2074d28 nvme: Add tertiary number to NVME_VS
NVMe 1.2.1 specification adds a tertiary element to the version number.
This updates the macro and its callers to include the final number and
fixup a single place in nvmet where the version was generated manually.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-19 11:36:22 -06:00
Keith Busch 0df1e4f5e0 nvme: Stop probing a removed device
There is no reason the nvme controller can ever return all 1's from
reading the CSTS register. This patch returns an error if we observe
that status. Without this, we may incorrectly proceed with controller
initialization and unnecessarilly rely on error handling to clean this.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-10-12 08:40:13 -06:00
Linus Torvalds 12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Andy Lutomirski 1a6fe74dfd nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are no currently callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-24 10:56:26 -06:00
Simon A. F. Lund 40267efddc lightnvm: expose device geometry through sysfs
For a host to access an Open-Channel SSD, it has to know its geometry,
so that it writes and reads at the appropriate device bounds.

Currently, the geometry information is kept within the kernel, and not
exported to user-space for consumption. This patch exposes the
configuration through sysfs and enables user-space libraries, such as
liblightnvm, to use the sysfs implementation to get the geometry of an
Open-Channel SSD.

The sysfs entries are stored within the device hierarchy, and can be
found using the "lightnvm" device type.

An example configuration looks like this:

/sys/class/nvme/
└── nvme0n1
   ├── capabilities: 3
   ├── device_mode: 1
   ├── erase_max: 1000000
   ├── erase_typ: 1000000
   ├── flash_media_type: 0
   ├── media_capabilities: 0x00000001
   ├── media_type: 0
   ├── multiplane: 0x00010101
   ├── num_blocks: 1022
   ├── num_channels: 1
   ├── num_luns: 4
   ├── num_pages: 64
   ├── num_planes: 1
   ├── page_size: 4096
   ├── prog_max: 100000
   ├── prog_typ: 100000
   ├── read_max: 10000
   ├── read_typ: 10000
   ├── sector_oob_size: 0
   ├── sector_size: 4096
   ├── media_manager: gennvm
   ├── ppa_format: 0x380830082808001010102008
   ├── vendor_opcode: 0
   ├── max_phys_secs: 64
   └── version: 1

Signed-off-by: Simon A. F. Lund <slund@cnexlabs.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:57:31 -06:00
Matias Bjørling b0b4e09c1a lightnvm: control life of nvm_dev in driver
LightNVM compatible device drivers does not have a method to expose
LightNVM specific sysfs entries.

To enable LightNVM sysfs entries to be exposed, lightnvm device
drivers require a struct device to attach it to. To allow both the
actual device driver and lightnvm sysfs entries to coexist, the device
driver tracks the lifetime of the nvm_dev structure.

This patch refactors NVMe and null_blk to handle the lifetime of struct
nvm_dev, which eliminates the need for struct gendisk when a lightnvm
compatible device is provided.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:56:18 -06:00
Matias Bjørling ac81bfa986 nvme: refactor namespaces to support non-gendisk devices
With LightNVM enabled namespaces, the gendisk structure is not exposed
to the user. This prevents LightNVM users from accessing the NVMe device
driver specific sysfs entries, and LightNVM namespace geometry.

Refactor the revalidation process, so that a namespace, instead of a
gendisk, is revalidated. This later allows patches to wire up the
sysfs entries up to a non-gendisk namespace.

Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-21 07:56:12 -06:00
Christoph Hellwig b5af7f2ff0 nvme: remove the post_scan callout
No need now that we don't have to reverse engineer the irq affinity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-09-15 08:42:03 -06:00
Andy Lutomirski 9b47f77a68 nvme: Fix nvme_get/set_features() with a NULL result pointer
nvme_set_features() callers seem to expect that passing NULL as the
result pointer is acceptable.  Teach nvme_set_features() not to try to
write to the NULL address.

For symmetry, make the same change to nvme_get_features(), despite the
fact that all current callers pass a valid result pointer.

I assume that this bug hasn't been reported in practice because
the callers that pass NULL are all in the SCSI translation layer
and no one uses the relevant operations.

Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-24 08:11:10 -06:00
Gabriel Krisman Bertazi f6b6a28e2d nvme: Prevent controller state invalid transition
Acquiring the nvme_ctrl lock before reading ctrl->state in
nvme_change_ctrl_state() should prevent a theoretical invalid state
transition, in the event of two threads racing inside that function.

I haven't been able to observe this happening with the current code, and
the current state machine seems to be simple enough to not be
affected by these invalid transitions, but future modifications could
make it more likely to happen.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Reviewed-by: Sagi Grimberg <sag@grimberg.me>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-15 09:46:46 -06:00
Linus Torvalds 3fc9d69093 Merge branch 'for-4.8/drivers' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "This branch also contains core changes.  I've come to the conclusion
  that from 4.9 and forward, I'll be doing just a single branch.  We
  often have dependencies between core and drivers, and it's hard to
  always split them up appropriately without pulling core into drivers
  when that happens.

  That said, this contains:

   - separate secure erase type for the core block layer, from
     Christoph.

   - set of discard fixes, from Christoph.

   - bio shrinking fixes from Christoph, as a followup up to the
     op/flags change in the core branch.

   - map and append request fixes from Christoph.

   - NVMeF (NVMe over Fabrics) code from Christoph.  This is pretty
     exciting!

   - nvme-loop fixes from Arnd.

   - removal of ->driverfs_dev from Dan, after providing a
     device_add_disk() helper.

   - bcache fixes from Bhaktipriya and Yijing.

   - cdrom subchannel read fix from Vchannaiah.

   - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

   - set of drbd updates and fixes from Fabian, Lars, and Philipp.

   - mg_disk error path fix from Bart.

   - user notification for failed device add for loop, from Minfei.

   - NVMe in general:
        + NVMe delay quirk from Guilherme.
        + SR-IOV support and command retry limits from Keith.
        + fix for memory-less NUMA node from Masayoshi.
        + use UINT_MAX for discard sectors, from Minfei.
        + cancel IO fixes from Ming.
        + don't allocate unused major, from Neil.
        + error code fixup from Dan.
        + use constants for PSDT/FUSE from James.
        + variable init fix from Jay.
        + fabrics fixes from Ming, Sagi, and Wei.
        + various fixes"

* 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
  nvme/pci: Provide SR-IOV support
  nvme: initialize variable before logical OR'ing it
  block: unexport various bio mapping helpers
  scsi/osd: open code blk_make_request
  target: stop using blk_make_request
  block: simplify and export blk_rq_append_bio
  block: ensure bios return from blk_get_request are properly initialized
  virtio_blk: use blk_rq_map_kern
  memstick: don't allow REQ_TYPE_BLOCK_PC requests
  block: shrink bio size again
  block: simplify and cleanup bvec pool handling
  block: get rid of bio_rw and READA
  block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
  block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
  NVMe: don't allocate unused nvme_major
  nvme: avoid crashes when node 0 is memoryless node.
  nvme: Limit command retries
  loop: Make user notify for adding loop device failed
  nvme-loop: fix nvme-loop Kconfig dependencies
  nvmet: fix return value check in nvmet_subsys_alloc()
  ...
2016-07-26 15:37:51 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Jay Freyensee fa9a89fc66 nvme: initialize variable before logical OR'ing it
It is typically not good coding or secure coding practice
to logical OR a variable without an initialization value first.
Here on this line:

integrity.flags |= BLK_INTEGRITY_DEVICE_CAPABLE;

BLK_INTEGRITY_DEVICE_CAPABLE is being OR'ed to a member variable
never set to an initial value. This patch fixes that.

Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Ming Lin <ming.l@samsung.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 21:26:16 -06:00
Christoph Hellwig 0c4de0f33b block: ensure bios return from blk_get_request are properly initialized
blk_get_request is used for BLOCK_PC and similar passthrough requests.
Currently we always need to call blk_rq_set_block_pc or an open coded
version of it to allow appending bios using the request mapping helpers
later on, which is a somewhat awkward API.  Instead move the
initialization part of blk_rq_set_block_pc into blk_get_request, so that
we always have a safe to use request.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-20 17:38:30 -06:00
NeilBrown b09dcf585d NVMe: don't allocate unused nvme_major
When alloc_disk(0) is used, the ->major number is ignored.  All device
numbers are allocated with a major of BLOCK_EXT_MAJOR.

So remove all references to nvme_major.

[akpm@linux-foundation.org: one unregister_blkdev() was missed]
Link: http://lkml.kernel.org/r/20160602064318.4403.93301.stgit@noble
Signed-off-by: NeilBrown <neilb@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Maxim Levitsky <maximlevitsky@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:52:56 -07:00
Keith Busch 32f0c4afb4 nvme: Remove RCU namespace protection
We can't sleep with RCU read lock held, but we need to do potentially
blocking stuff to namespace queues when iterating the list. This patch
removes the RCU locking and holds a mutex instead.

To prevent deadlocks, this patch removes holding the mutex during
namespace scanning and removal. The unlocked namespace scanning is made
safe by holding a reference to the namespace being scanned.

List iteration that does IO has to be unlocked to allow error recovery.
The caller must ensure the list can not be manipulated during such an
event, so this patch adds a comment explaining this requirement to the
only function that iterates an unlocked list. All callers currently
meet this requirement, so no further changes required.

List iterations that do not do IO can safely use the lock since it couldn't
block recovery from missing forced IO completions.

Reported-by: Ming Lin <mlin at kernel.org>
[fixes 0bf77e9 nvme: switch to RCU freeing the namespace]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-14 08:48:08 -07:00
Keith Busch f80ec966c1 nvme: Limit command retries
Many controller implementations will return errors to commands that will
not succeed, but without the DNR bit set. The driver previously retried
these commands an unlimited number of times until the command timeout
has exceeded, which takes an unnecessarilly long period of time.

This patch limits the number of retries a command can have, defaulting
to 5, but is user tunable at load or runtime.

The struct request's 'retries' field is used to track the number of
retries attempted. This is in contrast with scsi's use of this field,
which indicates how many retries are allowed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 16:20:31 -07:00
Guilherme G. Piccoli 54adc01055 nvme/quirk: Add a delay before checking for adapter readiness
When disabling the controller, the specification says the register
NVME_REG_CC should be written and then driver needs to wait the
adapter to be ready, which is checked by reading another register
bit (NVME_CSTS_RDY). There's a timeout validation in this checking,
so in case this timeout is reached the driver gives up and removes
the adapter from the system.

After a firmware activation procedure, the PCI_DEVICE(0x1c58, 0x0003)
(HGST adapter) end up being removed if we issue a reset_controller,
because driver keeps verifying the NVME_REG_CSTS until the timeout is
reached. This patch adds a necessary quirk for this adapter, by
introducing a delay before nvme_wait_ready(), so the reset procedure
is able to be completed. This quirk is needed because just increasing
the timeout is not enough in case of this adapter - the driver must
wait before start reading NVME_REG_CSTS register on this specific
device.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-12 08:23:00 -07:00
Jens Axboe 41d512e51b Merge branch 'for-4.8/block' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm into for-4.8/drivers
Dan writes:

"The removal of ->driverfs_dev in favor of just passing the parent
device in as a parameter to add_disk().  See below, it has received a
"Reviewed-by" from Christoph, Bart, and Johannes.

It is also a pre-requisite for Fam Zheng's work to cleanup gendisk
uevents vs attribute visibility [1].  We would extend device_add_disk()
to take an attribute_group list.

This is based off a branch of block.git/for-4.8/drivers and has
received a positive build success notification from the kbuild robot
across several configs.

[1]: "gendisk: Generate uevent after attribute available"
http://marc.info/?l=linux-virtualization&m=146725201522201&w=2"
2016-07-08 16:04:11 -06:00
Christoph Hellwig def61eca96 nvme: add new reconnecting controller state
The nvme fabric (RDMA, FC, etc...) can introduce port, link or node
failures that may require a reconnect to re-establish the connection.

Add a new reconnecting state that will initially be used by the RDMA
driver.

Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-08 08:38:49 -06:00
Sagi Grimberg 038bd4cb67 nvme: add keep-alive support
Periodic keep-alive is a mandatory feature in NVMe over Fabrics, and
optional in NVMe 1.2.1 for PCIe.  This patch adds periodic keep-alive
sent from the host to verify that the controller is still responsive
and vice-versa.  The keep-alive timeout is user-defined (with
keep_alive_tmo connection parameter) and defaults to 5 seconds.

In order to avoid a race condition where the host sends a keep-alive
competing with the target side keep-alive timeout expiration, the host
adds a grace period of 10 seconds when publishing the keep-alive timeout
to the target.

In case a keep-alive failed (or timed out), a transport specific error
recovery kicks in.

For now only NVMe over Fabrics is wired up to support keep alive, but
we can add PCIe support easily once controllers actually supporting it
become available.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Steve Wise <swise@chelsio.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:20 -06:00
Christoph Hellwig 07bfcd09a2 nvme-fabrics: add a generic NVMe over Fabrics library
The NVMe over Fabrics library provides an interface for both transports
and the nvme core to handle fabrics specific commands and attributes
independent of the underlying transport.

In addition, the fabrics library adds a misc device interface that allow
actually creating a fabrics controller, as we can't just autodiscover
it like in the PCI case.  The nvme-cli utility has been enhanced to use
this interface to support fabric connect and discovery.

Signed-off-by: Armen Baloyan <armenx.baloyan@intel.com>,
Signed-off-by: Jay Freyensee <james.p.freyensee@intel.com>,
Signed-off-by: Ming Lin <ming.l@ssi.samsung.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-07-05 11:28:16 -06:00