summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-03-30blktrace: fix comment in blktrace_api.hSouvik Banerjee
The `__u64 time` field of the blk_io_trace struct refers to the time in nanoseconds, not in microseconds. It is set in __blk_add_trace, which does the following: t->time = ktime_to_ns(ktime_get()); ktime_to_ns returns ktime_t in nanoseconds, not microseconds. Signed-off-by: Souvik Banerjee <souvik1997@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove function name in stringsMatias Bjørling
For the sysfs functions, the function names are embedded into their error strings. If the function name later changes, the string may not be updated accordingly. Update the strings to use __func__ to avoid this. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: remove some unnecessary NULL checksDan Carpenter
Smatch complains that flush_workqueue() dereferences the work queue pointer but then we check if it's NULL on the next line when it's too late. These NULL checks can be removed because the module won't load if we can't allocate the work queues. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: don't recover unwritten linesHans Holmberg
If the line has not been written to, we should not try to recover any data from it, so check the state of the chunks in the line before attempting to read smeta. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: implement 2.0 supportJavier González
Implement 2.0 support in pblk. This includes the address formatting and mapping paths, as well as the sysfs entries for them. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: implement get log report chunkJavier González
In preparation of pblk supporting 2.0, implement the get log report chunk in pblk. Also, define the chunk states as given in the 2.0 spec. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: rename ppaf* to addrf*Javier González
In preparation for 2.0 support in pblk, rename variables referring to the address format to addrf and reserve ppaf for the 1.2 path. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: check for supported versionJavier González
At this point, only 1.2 spec is supported, thus check for it. Also, since device-side L2P is only supported in the 1.2 spec, make sure to only check its value under 1.2. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: implement get log report chunk helpersJavier González
The 2.0 spec provides a report chunk log page that can be retrieved using the stangard nvme get log page. This replaces the dedicated get/put bad block table in 1.2. This patch implements the helper functions to allow targets retrieve the chunk metadata using get log page. It makes nvme_get_log_ext available outside of nvme core so that we can use it form lightnvm. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: make address conversions depend on generic deviceJavier González
On address conversions, use the generic device, instead of the target device. This allows to use conversions outside of the target's realm. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add support for 2.0 address formatJavier González
Add support for 2.0 address format. Also, align address bits for 1.2 and 2.0 to be able to operate on channel and luns without requiring a format conversion. Use a generic address format for this purpose. Also, convert the generic operations to the generic format in pblk. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: normalize geometry nomenclatureJavier González
Normalize nomenclature for naming channels, luns, chunks, planes and sectors as well as derivations in order to improve readability. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: complete geo structure with maxoc*Javier González
Complete the generic geometry structure with the maxoc and maxocpu felds, present in the 2.0 spec. Also, expose them through sysfs. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add shorten OCSSD version in geoJavier González
Create a shorten version to use in the generic geometry. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add minor version to generic geometryJavier González
Separate the version between major and minor on the generic geometry and represent it through sysfs in the 2.0 path. The 1.2 path only shows the major version to preserve the existing user space interface. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: simplify geometry structureJavier González
Currently, the device geometry is stored redundantly in the nvm_id and nvm_geo structures at a device level. Moreover, when instantiating targets on a specific number of LUNs, these structures are replicated and manually modified to fit the instance channel and LUN partitioning. Instead, create a generic geometry around nvm_geo, which can be used by (i) the underlying device to describe the geometry of the whole device, and (ii) instances to describe their geometry independently. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: refactor init/exit sequencesJavier González
Refactor init and exit sequences to eliminate dependencies among init modules and improve readability. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: Avoid validation of default op valueHeiner Litz
Fixes: 38401d231de65 ("lightnvm: set target over-provision on create ioctl") Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: centralize permission check for lightnvm ioctlJohannes Thumshirn
Currently all functions for handling the lightnvm core ioctl commands do a check for CAP_SYS_ADMIN. Change this to fail early in nvm_ctl_ioctl(), so we don't have to duplicate the permission checks all over. Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: fix bad block initializationHeiner Litz
fix reading bad block device information to correctly setup the per line blk_bitmap during lightnvm initialization Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29nvme: lightnvm: add late setup of block size and metadataMatias Bjørling
The nvme driver sets up the size of the nvme namespace in two steps. First it initializes the device with standard logical block and metadata sizes, and then sets the correct logical block and metadata size. Due to the OCSSD 2.0 specification relies on the namespace to expose these sizes for correct initialization, let it be updated appropriately on the LightNVM side as well. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove nvm_dev_ops->max_phys_sectMatias Bjørling
The value of max_phys_sect is always static. Instead of defining it in the nvm_dev_ops structure, declare it as a global value. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove max_rq_sizeMatias Bjørling
The field is no longer used. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: add 2.0 geometry identificationMatias Bjørling
Implement the geometry data structures for 2.0 and enable a drive to be identified as one, including exposing the appropriate 2.0 sysfs entries. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: flatten nvm_id_group into nvm_idMatias Bjørling
There are no groups in the 2.0 specification, make sure that the nvm_id structure is flattened before 2.0 data structures are added. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: make 1.2 data structures explicitMatias Bjørling
Make the 1.2 data structures explicit, so it will be easy to identify the 2.0 data structures. Also fix the order of which the nvme_nvm_* are declared, such that they follow the nvme_nvm_command order. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: refactor bad block identificationJavier González
In preparation for the OCSSD 2.0 spec. bad block identification, refactor the current code to generalize bad block get/set functions and structures. Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: prevent race in pblk_rb_flush_point_setHans Holmberg
Make sure that we are not advancing the sync pointer while we're adding bios to the write buffer entry completion list. This race condition results in bios not completing and was identified by a hang when running xfstest generic/113. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: allow allocation of new lines during shutdownHans Holmberg
When shutting down pblk the write buffer is flushed and if the current line can't fit the data in the write buffer we need to allocate a new line, so remove the check that prevents this. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: delete writer kick timer before stopping threadHans Holmberg
Unless we delete the timer that wakes up the write thread before we stop the thread we risk re-starting the thread, so delete the timer first. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: add padding distribution sysfs attributeHans Holmberg
When pblk receives a sync, all data up to that point in the write buffer must be comitted to persistent storage, and as flash memory comes with a minimal write size there is a significant cost involved both in terms of time for completing the sync and in terms of write amplification padded sectors for filling up to the minimal write size. In order to get a better understanding of the costs involved for syncs, Add a sysfs attribute to pblk: padded_dist, showing a normalized distribution of sectors padded. In order to facilitate measurements of specific workloads during the lifetime of the pblk instance, the distribution can be reset by writing 0 to the attribute. Do this by introducing counters for each possible padding: {0..(minimal write size - 1)} and calculate the normalized distribution when showing the attribute. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Rearranged total_buckets statement in pblk_sysfs_get_padding_dist Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove multiple groups in 1.2 data structureMatias Bjørling
Only one id group from the 1.2 specification is supported. Make sure that only the first group is accessible. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove mlc pairs structureMatias Bjørling
The known implementations of the 1.2 specification, and upcoming 2.0 implementation all expose a sequential list of pages to write. Remove the data structure, as it is no longer needed. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: export write amplification counters to sysfsHans Holmberg
In a SSD, write amplification, WA, is defined as the average number of page writes per user page write. Write amplification negatively affects write performance and decreases the lifetime of the disk, so it's a useful metric to add to sysfs. In plkb's case, the number of writes per user sector is the sum of: (1) number of user writes (2) number of sectors written by the garbage collector (3) number of sectors padded (i.e. due to syncs) This patch adds persistent counters for 1-3 and two sysfs attributes to export these along with WA calculated with five decimals: write_amp_mileage: the accumulated write amplification stats for the lifetime of the pblk instance write_amp_trip: resetable stats to facilitate delta measurements, values reset at creation and if 0 is written to the attribute. 64-bit counters are used as a 32 bit counter would wrap around already after about 17 TB worth of user data. It will take a long long time before the 64 bit sector counters wrap around. The counters are stored after the bad block bitmap in the first emeta sector of each written line. There is plenty of space in the first emeta sector, so we don't need to bump the major version of the line data format. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: check data lines version on recoveryHans Holmberg
As a preparation for future bumps of data line persistent storage versions, we need to start checking the emeta line version during recovery. Also slit up the current emeta/smeta version into two bytes (major,minor). Recovering lines with the same major number as the current pblk data line version must succeed. This means that any changes in the persistent format must be: (1) Backward compatible: if we switch back to and older kernel, recovery of lines stored with major == current_major and minor > current_minor must succeed. (2) Forward compatible: switching to a newer kernel, recovery of lines stored with major=current_major and minor < minor must handle the data format differences gracefully(i.e. initialize new data structures to default values). If we detect lines that have a different major number than the current we must abort recovery. The user must manually migrate the data in this case. Previously the version stored in the emeta header was copied from smeta, which has version 1, so we need to set the minor version to 1. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: pblk: handle bad sectors in the emeta area correctlyHans Holmberg
Unless we check if there are bad sectors in the entire emeta-area we risk ending up with valid bitmap / available sector count inconsistency. This results in lines with a bad chunk at the last LUN marked as bad, so go through the whole emeta area and mark up the invalid sectors. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm: remove chnl_offset in nvme_nvm_identityMatias Bjørling
The identity structure is initialized to zero in the beginning of the nvme_nvm_identity function. The chnl_offset is separately set to zero. Since both the variable and assignment is never changed, remove them. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-29lightnvm/pblk-gc: Delete an error message for a failed memory allocation in ↵Markus Elfring
pblk_gc_line_prepare_ws() Omit an extra message for a memory allocation failure in this function. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Reviewed-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27blk-mq: Allow PCI vector offset for mapping queuesKeith Busch
The PCI interrupt vectors intended to be associated with a queue may not start at 0; a driver may allocate pre_vectors for special use. This patch adds an offset parameter so blk-mq may find the intended affinity mask and updates all drivers using this API accordingly. Cc: Don Brace <don.brace@microsemi.com> Cc: <qla2xxx-upstream@qlogic.com> Cc: <linux-scsi@vger.kernel.org> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27loop: use killable lock in ioctlsOmar Sandoval
Even after the previous patch to drop lo_ctl_mutex while calling vfs_getattr(), there are other cases where we can end up sleeping for a long time while holding lo_ctl_mutex. Let's avoid the uninterruptible sleep from the ioctls. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-27loop: don't call into filesystem while holding lo_ctl_mutexOmar Sandoval
We hit an issue where a loop device on NFS was stuck in loop_get_status() doing vfs_getattr() after the NFS server died, which caused a pile-up of uninterruptible processes waiting on lo_ctl_mutex. There's no reason to hold this lock while we wait on the filesystem; let's drop it so that other processes can do their thing. We need to grab a reference on lo_backing_file while we use it, and we can get rid of the check on lo_device, which has been unnecessary since commit a34c0ae9ebd6 ("[PATCH] loop: remove the bio remapping capability") in the linux-history tree. Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26block, bfq: lower-bound the estimated peak rate to 1Paolo Valente
If a storage device handled by BFQ happens to be slower than 7.5 KB/s for a certain amount of time (in the order of a second), then the estimated peak rate of the device, maintained in BFQ, becomes equal to 0. The reason is the limited precision with which the rate is represented (details on the range of representable values in the comments introduced by this commit). This leads to a division-by-zero error where the estimated peak rate is used as divisor. Such a type of failure has been reported in [1]. This commit addresses this issue by: 1. Lower-bounding the estimated peak rate to 1 2. Adding and improving comments on the range of rates representable [1] https://www.spinics.net/lists/kernel/msg2739205.html Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: make nvme_get_log_ext non-staticMatias Bjørling
Enable the lightnvm integration to use the nvme_get_log_ext() function. Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvmet: constify struct nvmet_fabrics_opsChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvmet: refactor configfs transport type handlingChristoph Hellwig
Have a common table of mappings from numerical transport ids to names, and zero the transport specific area in common code in nvmet_addr_trtype_store. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvmet: move device_uuid configfs attr definition to suitable placeMax Gurtovoy
Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: Add .stop_ctrl to nvme ctrl opsNitzan Carmi
For consistancy reasons, any fabric-specific works (e.g error recovery/reconnect) should be canceled in nvme_stop_ctrl, as for all other NVMe pending works (e.g. scan, keep alive). The patch aims to simplify the logic of the code, as we now only rely on a vague demand from any fabric to flush its private workqueues at the beginning of .delete_ctrl op. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme-rdma: Allow DELETING state change failure in error_recoveryNitzan Carmi
While error recovery is ongoing, it is OK to move ctrl to DELETING state (from concurrent delete_work). Thus we don't need a warning for that case. Signed-off-by: Nitzan Carmi <nitzanc@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme: Skip checking heads without namespacesKeith Busch
If a task is holding a reference to a namespace on a removed controller, the head will not be released. If the same controller is added again later, its namespaces may not be successfully added. Instead, the user will see kernel message "Duplicate IDs for nsid <X>". This patch fixes that by skipping heads that don't have namespaces when considering if a new namespace is safe to add. Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com> Cc: stable@vger.kernel.org Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26nvme-rdma: Don't flush delete_wq by default during remove_oneMax Gurtovoy
The .remove_one function is called for any ib_device removal. In case the removed device has no reference in our driver, there is no need to flush the work queue. Reviewed-by: Israel Rukshin <israelr@mellanox.com> Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>