aboutsummaryrefslogtreecommitdiff
path: root/drivers/infiniband
AgeCommit message (Collapse)Author
2013-11-08IB/srp: Remove target from list before freeing Scsi_Host structureVu Pham
Remove an SRP target from the SRP target list before invoking the last scsi_host_put() call. This change is necessary because that last put frees the memory that holds the srp_target_port structure. This patch prevents the following kernel oops: RIP: 0010:[<ffffffff810b00d0>] __lock_acquire+0x500/0x1570 Call Trace: [<ffffffff810b11e4>] lock_acquire+0xa4/0x120 [<ffffffff81531206>] _spin_lock+0x36/0x70 [<ffffffffa01b6d8f>] srp_remove_work+0xef/0x180 [ib_srp] [<ffffffff8109125c>] worker_thread+0x21c/0x3d0 [<ffffffff81096e86>] kthread+0x96/0xa0 [<ffffffff8100c20a>] child_rip+0xa/0x20 Signed-off-by: Vu Pham <vuhuong@mellanox.com> [ bvanassche - Modified path description and CC'ed stable. ] Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: <stable@vger.kernel.org> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Add change_queue_depth and change_queue_type supportJack Wang
Currently, it's not possible to change queue depth for a device behind SRP host. Sometimes, we need to adjust queue_depth for performance reason (eg storage busy, we need lower queue_depth to avoid running into SCSI error handler), so this patch add support for SRP driver. Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Tested-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Make queue size configurableBart Van Assche
Certain storage configurations, e.g. a sufficiently large array of hard disks in a RAID configuration, need a queue depth above 64 to achieve optimal performance. Hence make the queue depth configurable. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Tested-by: Jack Wang <xjtuwjp@gmail.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Introduce srp_alloc_req_data()Bart Van Assche
This patch does not change any functionality. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Cc: Roland Dreier <roland@purestorage.com> Cc: Vu Pham <vu@mellanox.com> Cc: Sebastian Riemer <sebastian.riemer@profitbricks.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Export sgid to sysfsBart Van Assche
On an initiator system with multiple IB ports it is not yet possible to figure out what the originating port of an SRP connection is. Hence make the source GID available in sysfs. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Add periodic reconnect functionalityBart Van Assche
After a transport layer occurred, periodically try to reconnect to the target until the dev_loss timer expires. Protect the callback functions that can be invoked from inside the SCSI EH against concurrent invocation with srp_reconnect_rport() via the rport mutex. Change the default dev_loss_tmo from 60s into 600s to give the reconnect mechanism a chance to kick in. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08scsi_transport_srp: Add periodic reconnect supportBart Van Assche
Add support for periodically reconnecting to an SRP target until the dev_loss timer expires. After the tenth reconnection attempt, gradually slow down subsequent reconnect attempts. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Start timers if a transport layer error occursBart Van Assche
Start the reconnect timer, fast_io_fail timer and dev_loss timers if a transport layer error occurs. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Use SRP transport layer error recoveryBart Van Assche
Enable fast_io_fail_tmo and dev_loss_tmo functionality for the IB SRP initiator. Add kernel module parameters that allow to specify default values for these parameters. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Keep rport as long as the IB transport layerBart Van Assche
Keep the rport data structure around after srp_remove_host() has finished until cleanup of the IB transport layer has finished completely. This is necessary because later patches use the rport pointer inside the queuecommand callback. Without this patch accessing the rport from inside a queuecommand callback is racy because srp_remove_host() must be invoked before scsi_remove_host() and because the queuecommand callback could get invoked after srp_remove_host() has finished. In other words, without this patch the queuecommand callback can get invoked after the rport data structure has been freed. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/srp: Make transport layer retry count configurableVu Pham
Allow the InfiniBand RC retry count to be configured by the user as an option in the target login string. Reducing this retry count allows to reduce the path failover time. Signed-off-by: Vu Pham <vu@mellanox.com> [ bvanassche: Rewrote patch description / changed default retry count ] Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: David Dillow <dillowda@ornl.gov> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/qib: Fix txselect regressionMike Marciniszyn
Commit 7fac33014f54("IB/qib: checkpatch fixes") was overzealous in removing a simple_strtoul for a parse routine, setup_txselect(). That routine is required to handle a multi-value string. Unwind that aspect of the fix. Cc: <stable@vger.kernel.org> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/qib: Fix checkpatch __packed warningsMike Marciniszyn
Convert __attribute__ ((packed)) to __packed. Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/qib: Convert qib_user_sdma_pin_pages() to use get_user_pages_fast()Jan Kara
qib_user_sdma_queue_pkts() gets called with mmap_sem held for writing. Except for get_user_pages() deep down in qib_user_sdma_pin_pages() we don't seem to need mmap_sem at all. Even more interestingly the function qib_user_sdma_queue_pkts() (and also qib_user_sdma_coalesce() called somewhat later) call copy_from_user() which can hit a page fault and we deadlock on trying to get mmap_sem when handling that fault. So just make qib_user_sdma_pin_pages() use get_user_pages_fast() and leave mmap_sem locking for mm. This deadlock has actually been observed in the wild when the node is under memory pressure. Cc: <stable@vger.kernel.org> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/ipath: Convert ipath_user_sdma_pin_pages() to use get_user_pages_fast()Jan Kara
ipath_user_sdma_queue_pkts() gets called with mmap_sem held for writing. Except for get_user_pages() deep down in ipath_user_sdma_pin_pages() we don't seem to need mmap_sem at all. Even more interestingly the function ipath_user_sdma_queue_pkts() (and also ipath_user_sdma_coalesce() called somewhat later) call copy_from_user() which can hit a page fault and we deadlock on trying to get mmap_sem when handling that fault. So just make ipath_user_sdma_pin_pages() use get_user_pages_fast() and leave mmap_sem locking for mm. This deadlock has actually been observed in the wild when the node is under memory pressure. Cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> [ Merged in fix for call to get_user_pages_fast from Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>. - Roland ] Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08RDMA/ocrdma: Remove redundant check in ocrdma_build_fr()Naresh Gottumukkala
Remove the redundant check of comparing if a 32-bit value is greater than 0xffffffffULL. Reported by Dan Carpenter. Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08RDMA/ocrdma: Fix a crash in rmmodNaresh Gottumukkala
1) ocrdma_remove_free() is called from a call_rcu callback funtion context, which can be a bottom-half context. So the code in ocrdma_remove_free should not sleep. But ocrdma_cleanup_hw() can sleep, So move it ocrdma_remove() instead of ocrdma_remove_free. 2) Fix a couple of kbuild test robot warnings. Signed-off-by: Naresh Gottumukkala <bgottumukkala@emulex.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08RDMA/ocrdma: Silence an integer underflow warningDan Carpenter
We recently added a cap on "max_wqe_allocated" in 43a6b4025c ('RDMA/ocrdma: Create IRD queue fix'). My static checker complains that the cap has a problem because it casts large values to negative. "attrs->cap.max_send_wr" is a u32. It comes from the user, but it's capped in ocrdma_check_qp_params() so it can't wrap here. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08mlx5: Use enum to indicate adapter page sizeEli Cohen
The Connect-IB adapter has an inherent page size which equals 4K. Define an new enum that equals the page shift and use it instead of using the value 12 throughout the code. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Update opt param mask for RTS2RTSEli Cohen
RTS to RTS transition should allow update of alternate path. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Remove "Always false" comparisonEli Cohen
mlx5_cur and mlx5_new cannot have negative values so remove the redundant condition. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Remove dead code in mr.cEli Cohen
In mlx5_mr_cache_init() the size variable is not used so remove it to avoid compiler warnings when running with make W=1. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08mlx5: Support communicating arbitrary host page size to firmwareEli Cohen
Connect-IB firmware requires 4K pages to be communicated with the driver. This patch breaks larger pages to 4K units to enable support for architectures utilizing larger page size, such as PowerPC. This patch also fixes several places that referred to PAGE_SHIFT instead of explicit 12 which is the inherent page shift on Connect-IB. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Fix srq free in destroy qpMoshe Lazer
On destroy QP the driver walks over the relevant CQ and removes CQEs reported for the destroyed QP. It also frees the related SRQ entry without checking that this is actually an SRQ-related CQE. In case of a CQ used for both send and receive QP, we could free SRQ entries for send CQEs. This patch resolves this issue by verifying that this is a SRQ related CQE by checking the SRQ number in the CQE is not zero. Signed-off-by: Moshe Lazer <moshel@mellanox.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Simplify mlx5_ib_destroy_srqEli Cohen
Make use of destroy_srq_kernel() to clear SRQ resouces. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Fix overflow check in IB_WR_FAST_REG_MREli Cohen
Make sure not to overflow when reading the page list from struct ib_fast_reg_page_list. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Multithreaded create MREli Cohen
Use asynchronous commands to execute up to eight concurrent create MR commands. This is to fill memory caches faster so we keep consuming from there. Also, increase timeout for shrinking caches to five minutes. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/mlx5: Fix check of number of entries in create CQEli Cohen
Verify that the value is non negative before rounding up to power of 2. Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/netlink: Remove superfluous RDMA_NL_GET_OP() maskingMathias Krause
'op' is the already RDMA_NL_GET_OP() masked 'type'. No need to mask it again. Signed-off-by: Mathias Krause <minipli@googlemail.com> Reviewed-by: Yann Droneaud <ydroneaud@opteya.com> Acked-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/core: Pass imm_data from ib_uverbs_send_wr to ib_send_wr correctlyLatchesar Ionkov
Currently, we don't copy the immediate data from the userspace struct to the kernel one when UD messages are being sent. This patch makes sure that the immediate data is set correctly. Signed-off-by: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: lower NAPI weightMichal Schmidt
Since commit 82dc3c63c692 ("net: introduce NAPI_POLL_WEIGHT") netif_napi_add() produces an error message if a NAPI poll weight greater than 64 is requested. Use the standard NAPI weight. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Start multicast join process only on active portsErez Shitrit
The driver starts the mcast_join task whenever the netdev interface is UP without relation to the underlying IB port state. Until the port state is ACTIVE all the join requests are irrelevant, and the IB core returns -EINVAL. So the user will see errors such as: "multicast join failed for ff12:401b:... , status -22". Instead, have ipoib_mcast_join_task() return when the port is not active. It will be called again when the port state is changed and the low-level driver triggers the IB_EVENT_PORT_ACTIVE event or the IB_EVENT_CLIENT_REREGISTER event. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Add path query flushing in ipoib_ib_dev_cleanupErez Shitrit
The path_rec_completion() callback may be invoked asynchronously even at the middle of "driver uninit" process. This can lead to scheduling a task that tries to touch members of the priv object that are no longer valid. For example the function cm_create_tx_qp can attempt to create qp with no valid priv->pd object. The following crash is one of the results: RIP: 0010:[<ffffffffa021bb47>] [<ffffffffa021bb47>] ipoib_cm_create_tx_qp+0x57/0x90 [ib_ipoib] Process ipoib (pid: 5916, threadinfo ffff8803786e4000, task ffff8804150e1500) Stack: Call Trace: [<ffffffff81309ef0>] ? get_random_bytes+0x20/0x30 [<ffffffffa021be2a>] ipoib_cm_tx_init+0xca/0x340 [ib_ipoib] [<ffffffffa021f765>] ipoib_cm_tx_start+0x215/0x3f0 [ib_ipoib] [<ffffffffa021f550>] ? ipoib_cm_tx_start+0x0/0x3f0 [ib_ipoib] [<ffffffff8108b2b0>] worker_thread+0x170/0x2a0 [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40 [<ffffffff8108b140>] ? worker_thread+0x0/0x2a0 [<ffffffff81090886>] kthread+0x96/0xa0 [<ffffffff8100c14a>] child_rip+0xa/0x20 [<ffffffff810907f0>] ? kthread+0x0/0xa0 [<ffffffff8100c140>] ? child_rip+0x0/0x20 Fix that by flushing all pending path queries at this point. Signed-off-by: Alex Markuze <markuze@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Fix usage of uninitialized multicast objectsErez Shitrit
The driver should avoid calling ib_sa_free_multicast on the mcast->mc object until it finishes its initialization state. Otherwise we can crash when ipoib_mcast_dev_flush() attempts to use the uninitialized multicast object. Instead, only call wait_for_completion() for multicast entries that started the join process, meaning that ib_sa_join_multicast() finished. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Avoid flushing the driver workqueue on dev_downErez Shitrit
The driver should not flush the whole workqueue when only one work (the pkey poll one) needs to be cancelled. Use cancel_delayed_work_sync() instead. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Fix deadlock between dev_change_flags() and __ipoib_dev_flush()Erez Shitrit
When ipoib interface is going down it takes all of its children with it, under mutex. For each child, dev_change_flags() is called. That function calls ipoib_stop() via the ndo, and causes flush of the workqueue. Sometimes in the workqueue an __ipoib_dev_flush work() is waiting and when invoked tries to get the same mutex, which leads to a deadlock, as seen below. The solution is to switch to rw-sem instead of mutex. The deadlock: [11028.165303] [<ffffffff812b0977>] ? vgacon_scroll+0x107/0x2e0 [11028.171844] [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0 [11028.178465] [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80 [11028.185962] [<ffffffff814ea743>] wait_for_common+0x123/0x180 [11028.192491] [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20 [11028.199504] [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20 [11028.206224] [<ffffffff8108b4f1>] flush_cpu_workqueue+0x61/0x90 [11028.212948] [<ffffffff8108b5a0>] ? wq_barrier_func+0x0/0x20 [11028.219375] [<ffffffff8108bfc4>] flush_workqueue+0x54/0x80 [11028.225712] [<ffffffffa05a0576>] ipoib_mcast_stop_thread+0x66/0x90 [ib_ipoib] [11028.233988] [<ffffffffa059ccea>] ipoib_ib_dev_down+0x6a/0x100 [ib_ipoib] [11028.241678] [<ffffffffa059849a>] ipoib_stop+0x8a/0x140 [ib_ipoib] [11028.248692] [<ffffffff8142adf1>] dev_close+0x71/0xc0 [11028.254447] [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0 [11028.261062] [<ffffffffa059851b>] ipoib_stop+0x10b/0x140 [ib_ipoib] [11028.268172] [<ffffffff8142adf1>] dev_close+0x71/0xc0 [11028.273922] [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0 [11028.280452] [<ffffffff8148f20b>] devinet_ioctl+0x5eb/0x6a0 [11028.286786] [<ffffffff814903b8>] inet_ioctl+0x88/0xa0 [11028.292633] [<ffffffff8141591a>] sock_ioctl+0x7a/0x280 [11028.298576] [<ffffffff81189012>] vfs_ioctl+0x22/0xa0 [11028.304326] [<ffffffff81140540>] ? unmap_region+0x110/0x130 [11028.310756] [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580 [11028.316897] [<ffffffff81189731>] sys_ioctl+0x81/0xa0 and 11028.017533] [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80 [11028.025030] [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20 [11028.031945] [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180 [11028.039053] [<ffffffff814eb14b>] mutex_lock+0x2b/0x50 [11028.044910] [<ffffffffa059f7e7>] __ipoib_ib_dev_flush+0x37/0x210 [ib_ipoib] [11028.052894] [<ffffffffa059fa00>] ? ipoib_ib_dev_flush_light+0x0/0x20 [ib_ipoib] [11028.061363] [<ffffffffa059fa17>] ipoib_ib_dev_flush_light+0x17/0x20 [ib_ipoib] [11028.069738] [<ffffffff8108b120>] worker_thread+0x170/0x2a0 [11028.076068] [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40 [11028.083374] [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0 [11028.089709] [<ffffffff81090626>] kthread+0x96/0xa0 [11028.095266] [<ffffffff8100c0ca>] child_rip+0xa/0x20 [11028.100921] [<ffffffff81090590>] ? kthread+0x0/0xa0 [11028.106573] [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 [11028.112423] INFO: task ifconfig:23640 blocked for more than 120 seconds. Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Change CM skb memory allocation to be non-atomic during initTal Alon
Change CM skb memory allocation to use GFP_KERNEL when possible. During device init there's no need to use GFP_ATOMIC when allocating memory for the CM skbs -- use GFP_KERNEL instead. Signed-off-by: Tal Alon <talal@mellanox.com> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IPoIB: Fix crash in dev_open error flowErez Shitrit
If napi has never been enabled when calling ipoib_ib_dev_stop, a kernel crash occurs, because the verbs layer completion handler (ipoib_ib_completion) calls napi_schedule unconditionally. If the napi structure passed in the napi_schedule call has not been initialized, napi will crash. The cleanest solution is to simply enable napi before calling ipoib_ib_dev_stop in the dev_open error flow. (dev_stop then immediately disables napi). Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/cxgb4: Fix formatting of physical addressBen Hutchings
Physical addresses may be wider than virtual addresses (e.g. on i386 with PAE) and must not be formatted with %p. Compile-tested only. Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/cma: Check for GID on listening device firstDoug Ledford
As a simple optimization that should speed up the vast majority of connect attemps on IB devices, when we are searching for the GID of an incoming connection in the cached GID lists of devices, search the device that received the incoming connection request first. If we don't find it there, then move on to other devices. This reduces the time to perform 10,000 connections considerably. Prior to this patch, a bad run of cmtime would look like this: connect : 12399.26 12351.10 8609.00 1239.93 With this patch, it looks more like this: connect : 5864.86 5799.80 8876.00 586.49 Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-08IB/cma: Use cached gidsDoug Ledford
The cma_acquire_dev function was changed by commit 3c86aa70bf67 ("RDMA/cm: Add RDMA CM support for IBoE devices") to use find_gid_port() because multiport devices might have either IB or IBoE formatted gids. The old function assumed that all ports on the same device used the same GID format. However, when it was changed to use find_gid_port(), we inadvertently lost usage of the GID cache. This turned out to be a very costly change. In our testing, each iteration through each index of the GID table takes roughly 35us. When you have multiple devices in a system, and the GID you are looking for is on one of the later devices, the code loops through all of the GID indexes on all of the early devices before it finally succeeds on the target device. This pathological search behavior combined with 35us per GID table index retrieval results in results such as the following from the cmtime application that's part of the latest librdmacm git repo: ib1: step total ms max ms min us us / conn create id : 29.42 0.04 1.00 2.94 bind addr : 186705.66 19.00 18556.00 18670.57 resolve addr : 41.93 9.68 619.00 4.19 resolve route: 486.93 0.48 101.00 48.69 create qp : 4021.95 6.18 330.00 402.20 connect : 68350.39 68588.17 24632.00 6835.04 disconnect : 1460.43 252.65-1862269.00 146.04 destroy : 41.16 0.04 2.00 4.12 ib0: step total ms max ms min us us / conn create id : 28.61 0.68 1.00 2.86 bind addr : 2178.86 2.95 201.00 217.89 resolve addr : 51.26 16.85 845.00 5.13 resolve route: 620.08 0.43 92.00 62.01 create qp : 3344.40 6.36 273.00 334.44 connect : 6435.99 6368.53 7844.00 643.60 disconnect : 5095.38 321.90 757.00 509.54 destroy : 37.13 0.02 2.00 3.71 Clearly, both the bind address and connect operations suffer a huge penalty for being anything other than the default GID on the first port in the system. After applying this patch, the numbers now look like this: ib1: step total ms max ms min us us / conn create id : 30.15 0.03 1.00 3.01 bind addr : 80.27 0.04 7.00 8.03 resolve addr : 43.02 13.53 589.00 4.30 resolve route: 482.90 0.45 100.00 48.29 create qp : 3986.55 5.80 330.00 398.66 connect : 7141.53 7051.29 5005.00 714.15 disconnect : 5038.85 193.63 918.00 503.88 destroy : 37.02 0.04 2.00 3.70 ib0: step total ms max ms min us us / conn create id : 34.27 0.05 1.00 3.43 bind addr : 26.45 0.04 1.00 2.64 resolve addr : 38.25 10.54 760.00 3.82 resolve route: 604.79 0.43 97.00 60.48 create qp : 3314.95 6.34 273.00 331.49 connect : 12399.26 12351.10 8609.00 1239.93 disconnect : 5096.76 270.72 1015.00 509.68 destroy : 37.10 0.03 2.00 3.71 It's worth noting that we still suffer a bit of a penalty on connect to the wrong device, but the penalty is much less than it used to be. Follow on patches deal with this penalty. Many thanks to Neil Horman for helping to track the source of slow function that allowed us to track down the fact that the original patch I mentioned above backed out cache usage and identify just how much that impacted the system. Signed-off-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Roland Dreier <roland@purestorage.com>
2013-11-07net/mlx4_core: Initialize all mailbox buffers to zero before useJack Morgenstein
To guarantee that all unused fields in all FW commands for both inboxes and outboxes are zeroed out, initialize the mailbox buffer to all zeroes. This is especially important for SRIOV comm-channel virtual commands (such as QUERY_FUNC_CAP), where if new fields are added to support new features, the driver can depend on older kernels passing zeroes in these fields. In addition to zeroing out the mailbox buffer at allocation time, all (now unnecessary) calls to memset by the callers of mlx4_alloc_cmd_mailbox() are removed. Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07RDMA/cma: Set IBoE SL (user-priority) by egress map when using vlansEyal Perry
On top of commit 366cddb40 "IB/rdma_cm: TOS <=> UP mapping for IBoE", add support for case vlan egress map is used. When the IBoE session is being set over a vlan, inherit the socket priority to vlan priority mapping which was configured for the vlan device egress map. Signed-off-by: Eyal Perry <eyalpe@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07target: Drop left-over se_lun->lun_cmd_list shutdown codeNicholas Bellinger
Now with percpu refcounting for se_lun in place, go ahead and drop the legacy per se_cmd accounting for se_lun shutdown. This includes __transport_clear_lun_from_sessions(), the associated transport_lun_wait_for_tasks() logic, along with a handful of now unused se_cmd structure members and ->transport_state bits. Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-11-06ib_isert: Add support for completion interrupt coalescingNicholas Bellinger
This patch adds support for completion interrupt coalescing that allows only every ISERT_COMP_BATCH_COUNT (8) to set IB_SEND_SIGNALED, thus avoiding completion interrupts for every posted iser_tx_desc. The batch processing is done using a per isert_conn llist that once IB_SEND_SIGNALED has been set is saved to tx_desc->comp_llnode_batch, and completion processing of previously posted iser_tx_descs is done in a single shot from within isert_send_completion() code. Note this is only done for response PDUs from ISCSI_OP_SCSI_CMD, and all other control type of PDU responses will force an implicit batch drain to occur. Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Roland Dreier <roland@purestorage.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-11-04mlx4: Structures and init/teardown for VF resource quotasJack Morgenstein
This is step #1 for implementing SRIOV resource quotas for VFs. Quotas are implemented per resource type for VFs and the PF, to prevent any entity from simply grabbing all the resources for itself and leaving the other entities unable to obtain such resources. Resources which are allocated using quotas: QPs, CQs, SRQs, MPTs, MTTs, MAC, VLAN, and Counters. The quota system works as follows: Each entity (VF or PF) is given a max number of a given resource (its quota), and a guaranteed minimum number for each resource (starvation prevention). For QPs, CQs, SRQs, MPTs and MTTs: 50% of the available quantity for the resource is divided equally among the PF and all the active VFs (i.e., the number of VFs in the mlx4_core module parameter "num_vfs"). This 50% represents the "guaranteed minimum" pool. The other 50% is the "free pool", allocated on a first-come-first-serve basis. For each VF/PF, resources are first allocated from its "guaranteed-minimum" pool. When that pool is exhausted, the driver attempts to allocate from the resource "free-pool". The quota (i.e., max) for the VFs and the PF is: The free-pool amount (50% of the real max) + the guaranteed minimum For MACs: Guarantee 2 MACs per VF/PF per port. As a result, since we have only 128 MACs per port, reduce the allowable number of VFs from 64 to 63. Any remaining MACs are put into a free pool. For VLANs: For the PF, the per-port quota is 128 and guarantee is 64 (to allow the PF to register at least a VLAN per VF in VST mode). For the VFs, the per-port quota is 64 and the guarantee is 0. We assume that VGT VFs are trusted not to abuse the VLAN resource. For Counters: For all functions (PF and VFs), the quota is 128 and the guarantee is 0. In this patch, we define the needed structures, which are added to the resource-tracker struct. In addition, we do initialization for the resource quota, and adjust the query_device response to use quotas rather than resource maxima. As part of the implementation, we introduce a new field in mlx4_dev: quotas. This field holds the resource quotas used to report maxima to the upper layers (ib_core, via query_device). The HCA maxima of these values are passed to the VFs (via QUERY_HCA) so that they may continue to use these in handling QPs, CQs, SRQs and MPTs. Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pendingLinus Torvalds
Pull SCSI target fixes from Nicholas Bellinger: "Here are the outstanding target pending fixes for v3.12-rc7. This includes a number of EXTENDED_COPY related fixes as a result of Thomas and Doug's continuing testing and feedback. Also included is an important vhost/scsi fix that addresses a long standing issue where the 'write' parameter for get_user_pages_fast() was incorrectly set for virtio-scsi WRITEs -> DMA_TO_DEVICE, and not for virtio-scsi READs -> DMA_FROM_DEVICE. This resulted in random userspace segfaults and other unpleasantness on KVM host, and unfortunately has been an issue since the initial merge of vhost/scsi in v3.6. This patch is CC'ed to stable, along with two other less critical items" * git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending: vhost/scsi: Fix incorrect usage of get_user_pages_fast write parameter target/pscsi: fix return value check target: Fail XCOPY for non matching source + destination block_size target: Generate failure for XCOPY I/O with non-zero scsi_status target: Add missing XCOPY I/O operation sense_buffer iser-target: check device before dereferencing its variable target: Return an error for WRITE SAME with ANCHOR==1 target: Fix assignment of LUN in tracepoints target: Reject EXTENDED_COPY when emulate_3pc is disabled target: Allow non zero ListID in EXTENDED_COPY parameter list target: Make target_do_xcopy failures return INVALID_PARAMETER_LIST
2013-10-23iser-target: check device before dereferencing its variableVu Pham
This patch changes isert_connect_release() to correctly check for the existence struct isert_device *device before checking for isert_device->use_frwr. Signed-off-by: Vu Pham <vu@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
2013-10-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>