summaryrefslogtreecommitdiff
path: root/net/tipc/socket.h
AgeCommit message (Collapse)Author
2020-10-30tipc: add stricter control of reserved service typesJon Maloy
TIPC reserves 64 service types for current and future internal use. Therefore, the bind() function is meant to block regular user sockets from being bound to these values, while it should let through such bindings from internal users. However, since we at the design moment saw no way to distinguish between regular and internal users the filter function ended up with allowing all bindings of the reserved types which were really in use ([0,1]), and block all the rest ([2,63]). This is risky, since a regular user may bind to the service type representing the topology server (TIPC_TOP_SRV == 1) or the one used for indicating neighboring node status (TIPC_CFG_SRV == 0), and wreak havoc for users of those services, i.e., most users. The reality is however that TIPC_CFG_SRV never is bound through the bind() function, since it doesn't represent a regular socket, and TIPC_TOP_SRV can also be made to bypass the checks in tipc_bind() by introducing a different entry function, tipc_sk_bind(). It should be noted that although this is a change of the API semantics, there is no risk we will break any currently working applications by doing this. Any application trying to bind to the values in question would be badly broken from the outset, so there is no chance we would find any such applications in real-world production systems. v2: Added warning printout when a user is blocked from binding, as suggested by Jakub Kicinski Acked-by: Yung Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Link: https://lore.kernel.org/r/20201030012938.489557-1-jmaloy@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-28tipc: call tsk_set_importance from tipc_topsrv_create_listenerChristoph Hellwig
Avoid using kernel_setsockopt for the TIPC_IMPORTANCE option when we can just use the internal helper. The only change needed is to pass a struct sock instead of tipc_sock, which is private to socket.c Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19tipc: add trace_events for tipc socketTuong Lien
The commit adds the new trace_events for TIPC socket object: trace_tipc_sk_create() trace_tipc_sk_poll() trace_tipc_sk_sendmsg() trace_tipc_sk_sendmcast() trace_tipc_sk_sendstream() trace_tipc_sk_filter_rcv() trace_tipc_sk_advance_rx() trace_tipc_sk_rej_msg() trace_tipc_sk_drop_msg() trace_tipc_sk_release() trace_tipc_sk_shutdown() trace_tipc_sk_overlimit1() trace_tipc_sk_overlimit2() Also, enables the traces for the following cases: - When user creates a TIPC socket; - When user calls poll() on TIPC socket; - When user sends a dgram/mcast/stream message. - When a message is put into the socket 'sk_receive_queue'; - When a message is released from the socket 'sk_receive_queue'; - When a message is rejected (e.g. due to no port, invalid, etc.); - When a message is dropped (e.g. due to wrong message type); - When socket is released; - When socket is shutdown; - When socket rcvq's allocation is overlimit (> 90%); - When socket rcvq + bklq's allocation is overlimit (> 90%); - When the 'TIPC_ERR_OVERLOAD/2' issue happens; Note: a) All the socket traces are designed to be able to trace on a specific socket by either using the 'event filtering' feature on a known socket 'portid' value or the sysctl file: /proc/sys/net/tipc/sk_filter The file determines a 'tuple' for what socket should be traced: (portid, sock type, name type, name lower, name upper) where: + 'portid' is the socket portid generated at socket creating, can be found in the trace outputs or the 'tipc socket list' command printouts; + 'sock type' is the socket type (1 = SOCK_TREAM, ...); + 'name type', 'name lower' and 'name upper' are the service name being connected to or published by the socket. Value '0' means 'ANY', the default tuple value is (0, 0, 0, 0, 0) i.e. the traces happen for every sockets with no filter. b) The 'tipc_sk_overlimit1/2' event is also a conditional trace_event which happens when the socket receive queue (and backlog queue) is about to be overloaded, when the queue allocation is > 90%. Then, when the trace is enabled, the last skbs leading to the TIPC_ERR_OVERLOAD/2 issue can be traced. The trace event is designed as an 'upper watermark' notification that the other traces (e.g. 'tipc_sk_advance_rx' vs 'tipc_sk_filter_rcv') or actions can be triggerred in the meanwhile to see what is going on with the socket queue. In addition, the 'trace_tipc_sk_dump()' is also placed at the 'TIPC_ERR_OVERLOAD/2' case, so the socket and last skb can be dumped for post-analysis. Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19tipc: enable tracepoints in tipcTuong Lien
As for the sake of debugging/tracing, the commit enables tracepoints in TIPC along with some general trace_events as shown below. It also defines some 'tipc_*_dump()' functions that allow to dump TIPC object data whenever needed, that is, for general debug purposes, ie. not just for the trace_events. The following trace_events are now available: - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data, e.g. message type, user, droppable, skb truesize, cloned skb, etc. - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or queues, e.g. TIPC link transmq, socket receive queue, etc. - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g. sk state, sk type, connection type, rmem_alloc, socket queues, etc. - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g. link state, silent_intv_cnt, gap, bc_gap, link queues, etc. - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g. node state, active links, capabilities, link entries, etc. How to use: Put the trace functions at any places where we want to dump TIPC data or events. Note: a) The dump functions will generate raw data only, that is, to offload the trace event's processing, it can require a tool or script to parse the data but this should be simple. b) The trace_tipc_*_dump() should be reserved for a failure cases only (e.g. the retransmission failure case) or where we do not expect to happen too often, then we can consider enabling these events by default since they will almost not take any effects under normal conditions, but once the rare condition or failure occurs, we get the dumped data fully for post-analysis. For other trace purposes, we can reuse these trace classes as template but different events. c) A trace_event is only effective when we enable it. To enable the TIPC trace_events, echo 1 to 'enable' files in the events/tipc/ directory in the 'debugfs' file system. Normally, they are located at: /sys/kernel/debug/tracing/events/tipc/ For example: To enable the tipc_link_dump event: echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable To enable all the TIPC trace_events: echo 1 > /sys/kernel/debug/tracing/events/tipc/enable To collect the trace data: cat trace or cat trace_pipe > /trace.out & To disable all the TIPC trace_events: echo 0 > /sys/kernel/debug/tracing/events/tipc/enable To clear the trace buffer: echo > trace d) Like the other trace_events, the feature like 'filter' or 'trigger' is also usable for the tipc trace_events. For more details, have a look at: Documentation/trace/ftrace.txt MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc Acked-by: Ying Xue <ying.xue@windriver.com> Tested-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-06tipc: call start and done ops directly in __tipc_nl_compat_dumpit()Cong Wang
__tipc_nl_compat_dumpit() uses a netlink_callback on stack, so the only way to align it with other ->dumpit() call path is calling tipc_dump_start() and tipc_dump_done() directly inside it. Otherwise ->dumpit() would always get NULL from cb->args[]. But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve net pointer, the cb->skb here doesn't set skb->sk, the net pointer is saved in msg->net instead, so introduce a helper function __tipc_dump_start() to pass in msg->net. Ying pointed out cb->args[0...3] are already used by other callbacks on this call path, so we can't use cb->args[0] any more, use cb->args[4] instead. Fixes: 9a07efa9aea2 ("tipc: switch to rhashtable iterator") Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29tipc: switch to rhashtable iteratorCong Wang
syzbot reported a use-after-free in tipc_group_fill_sock_diag(), where tipc_group_fill_sock_diag() still reads tsk->group meanwhile tipc_group_delete() just deletes it in tipc_release(). tipc_nl_sk_walk() aims to lock this sock when walking each sock in the hash table to close race conditions with sock changes like this one, by acquiring tsk->sk.sk_lock.slock spinlock, unfortunately this doesn't work at all. All non-BH call path should take lock_sock() instead to make it work. tipc_nl_sk_walk() brutally iterates with raw rht_for_each_entry_rcu() where RCU read lock is required, this is the reason why lock_sock() can't be taken on this path. This could be resolved by switching to rhashtable iterator API's, where taking a sleepable lock is possible. Also, the iterator API's are friendly for restartable calls like diag dump, the last position is remembered behind the scence, all we need to do here is saving the iterator into cb->args[]. I tested this with parallel tipc diag dump and thousands of tipc socket creation and release, no crash or memory leak. Reported-by: syzbot+b9c8f3ab2994b7cd1625@syzkaller.appspotmail.com Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-04-08tipc: use the right skb in tipc_sk_fill_sock_diag()Cong Wang
Commit 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag") tried to fix the crash but failed, the crash is still 100% reproducible with it. In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not correct to retrieve its NETLINK_CB(), instead, like other protocol diag, we should use NETLINK_CB(cb->skb).sk here. Reported-by: <syzbot+326e587eff1074657718@syzkaller.appspotmail.com> Fixes: 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag") Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC) Cc: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22tipc: implement socket diagnostics for AF_TIPCGhantaKrishnamurthy MohanKrishna
This commit adds socket diagnostics capability for AF_TIPC in netlink family NETLINK_SOCK_DIAG in a new kernel module (diag.ko). The following are key design considerations: - config TIPC_DIAG has default y, like INET_DIAG. - only requests with flag NLM_F_DUMP is supported (dump all). - tipc_sock_diag_req message is introduced to send filter parameters. - the response attributes are of TLV, some nested. To avoid exposing data structures between diag and tipc modules and avoid code duplication, the following additions are required: - export tipc_nl_sk_walk function to reuse socket iterator. - export tipc_sk_fill_sock_diag to fill the tipc diag attributes. - create a sock_diag response message in __tipc_add_sock_diag defined in diag.c and use the above exported tipc_sk_fill_sock_diag to fill response. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-03tipc: redesign connection-level flow controlJon Paul Maloy
There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-07-26tipc: clean up socket layer message receptionJon Paul Maloy
When a message is received in a socket, one of the call chains tipc_sk_rcv()->tipc_sk_enqueue()->filter_rcv()(->tipc_sk_proto_rcv()) or tipc_sk_backlog_rcv()->filter_rcv()(->tipc_sk_proto_rcv()) are followed. At each of these levels we may encounter situations where the message may need to be rejected, or a new message produced for transfer back to the sender. Despite recent improvements, the current code for doing this is perceived as awkward and hard to follow. Leveraging the two previous commits in this series, we now introduce a more uniform handling of such situations. We let each of the functions in the chain itself produce/reverse the message to be returned to the sender, but also perform the actual forwarding. This simplifies the necessary logics within each function. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-17tipc: fix netns refcnt leakYing Xue
When the TIPC module is loaded, we launch a topology server in kernel space, which in its turn is creating TIPC sockets for communication with topology server users. Because both the socket's creator and provider reside in the same module, it is necessary that the TIPC module's reference count remains zero after the server is started and the socket created; otherwise it becomes impossible to perform "rmmod" even on an idle module. Currently, we achieve this by defining a separate "tipc_proto_kern" protocol struct, that is used only for kernel space socket allocations. This structure has the "owner" field set to NULL, which restricts the module reference count from being be bumped when sk_alloc() for local sockets is called. Furthermore, we have defined three kernel-specific functions, tipc_sock_create_local(), tipc_sock_release_local() and tipc_sock_accept_local(), to avoid the module counter being modified when module local sockets are created or deleted. This has worked well until we introduced name space support. However, after name space support was introduced, we have observed that a reference count leak occurs, because the netns counter is not decremented in tipc_sock_delete_local(). This commit remedies this problem. But instead of just modifying tipc_sock_delete_local(), we eliminate the whole parallel socket handling infrastructure, and start using the regular sk_create_kern(), kernel_accept() and sk_release_kernel() calls. Since those functions manipulate the module counter, we must now compensate for that by explicitly decrementing the counter after module local sockets are created, and increment it just before calling sk_release_kernel(). Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace") Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reported-by: Cong Wang <cwang@twopensource.com> Tested-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-09tipc: convert legacy nl socket dump to nl compatRichard Alpe
Convert socket (port) listing to compat dumpit call. If a socket (port) has publications a second dumpit call is issued to collect them and format then into the legacy buffer before continuing to process the sockets (ports). Command converted in this patch: TIPC_CMD_SHOW_PORTS Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: eliminate race condition at multicast receptionJon Paul Maloy
In a previous commit in this series we resolved a race problem during unicast message reception. Here, we resolve the same problem at multicast reception. We apply the same technique: an input queue serializing the delivery of arriving buffers. The main difference is that here we do it in two steps. First, the broadcast link feeds arriving buffers into the tail of an arrival queue, which head is consumed at the socket level, and where destination lookup is performed. Second, if the lookup is successful, the resulting buffer clones are fed into a second queue, the input queue. This queue is consumed at reception in the socket just like in the unicast case. Both queues are protected by the same lock, -the one of the input queue. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: simplify socket multicast receptionJon Paul Maloy
The structure 'tipc_port_list' is used to collect port numbers representing multicast destination socket on a receiving node. The list is not based on a standard linked list, and is in reality optimized for the uncommon case that there are more than one multicast destinations per node. This makes the list handling unecessarily complex, and as a consequence, even the socket multicast reception becomes more complex. In this commit, we replace 'tipc_port_list' with a new 'struct tipc_plist', which is based on a standard list. We give the new list stack (push/pop) semantics, someting that simplifies the implementation of the function tipc_sk_mcast_rcv(). Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05tipc: resolve race problem at unicast message receptionJon Paul Maloy
TIPC handles message cardinality and sequencing at the link layer, before passing messages upwards to the destination sockets. During the upcall from link to socket no locks are held. It is therefore possible, and we see it happen occasionally, that messages arriving in different threads and delivered in sequence still bypass each other before they reach the destination socket. This must not happen, since it violates the sequentiality guarantee. We solve this by adding a new input buffer queue to the link structure. Arriving messages are added safely to the tail of that queue by the link, while the head of the queue is consumed, also safely, by the receiving socket. Sequentiality is secured per socket by only allowing buffers to be dequeued inside the socket lock. Since there may be multiple simultaneous readers of the queue, we use a 'filter' parameter to reduce the risk that they peek the same buffer from the queue, hence also reducing the risk of contention on the receiving socket locks. This solves the sequentiality problem, and seems to cause no measurable performance degradation. A nice side effect of this change is that lock handling in the functions tipc_rcv() and tipc_bcast_rcv() now becomes uniform, something that will enable future simplifications of those functions. Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12tipc: make subscriber server support net namespaceYing Xue
TIPC establishes one subscriber server which allows users to subscribe their interesting name service status. After tipc supports namespace, one dedicated tipc stack instance is created for each namespace, and each instance can be deemed as one independent TIPC node. As a result, subscriber server must be built for each namespace. Signed-off-by: Ying Xue <ying.xue@windriver.com> Tested-by: Tero Aho <Tero.Aho@coriant.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12tipc: make tipc socket support net namespaceYing Xue
Now tipc socket table is statically allocated as a global variable. Through it, we can look up one socket instance with port ID, insert a new socket instance to the table, and delete a socket from the table. But when tipc supports net namespace, each namespace must own its specific socket table. So the global variable of socket table must be redefined in tipc_net structure. As a concequence, a new socket table will be allocated when a new namespace is created, and a socket table will be deallocated when namespace is destroyed. Signed-off-by: Ying Xue <ying.xue@windriver.com> Tested-by: Tero Aho <Tero.Aho@coriant.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12tipc: make tipc node table aware of net namespaceYing Xue
Global variables associated with node table are below: - node table list (node_htable) - node hash table list (tipc_node_list) - node table lock (node_list_lock) - node number counter (tipc_num_nodes) - node link number counter (tipc_num_links) To make node table support namespace, above global variables must be moved to tipc_net structure in order to keep secret for different namespaces. As a consequence, these variables are allocated and initialized when namespace is created, and deallocated when namespace is destroyed. After the change, functions associated with these variables have to utilize a namespace pointer to access them. So adding namespace pointer as a parameter of these functions is the major change made in the commit. Signed-off-by: Ying Xue <ying.xue@windriver.com> Tested-by: Tero Aho <Tero.Aho@coriant.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-12tipc: cleanup core.c and core.h filesYing Xue
Only the works of initializing and shutting down tipc module are done in core.h and core.c files, so all stuffs which are not closely associated with the two tasks should be moved to appropriate places. Signed-off-by: Ying Xue <ying.xue@windriver.com> Tested-by: Tero Aho <Tero.Aho@coriant.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-08tipc: convert tipc reference table to use generic rhashtableYing Xue
As tipc reference table is statically allocated, its memory size requested on stack initialization stage is quite big even if the maximum port number is just restricted to 8191 currently, however, the number already becomes insufficient in practice. But if the maximum ports is allowed to its theory value - 2^32, its consumed memory size will reach a ridiculously unacceptable value. Apart from this, heavy tipc users spend a considerable amount of time in tipc_sk_get() due to the read-lock on ref_table_lock. If tipc reference table is converted with generic rhashtable, above mentioned both disadvantages would be resolved respectively: making use of the new resizable hash table can avoid locking on the lookup; smaller memory size is required at initial stage, for example, 256 hash bucket slots are requested at the beginning phase instead of allocating the entire 8191 slots in old mode. The hash table will grow if entries exceeds 75% of table size up to a total table size of 1M, and it will automatically shrink if usage falls below 30%, but the minimum table size is allowed down to 256. Also converts ref_table_lock to a separate mutex to protect hash table mutations on write side. Lastly defers the release of the socket reference using call_rcu() to allow using an RCU read-side protected call to rhashtable_lookup(). Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Erik Hugne <erik.hugne@ericsson.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21tipc: add publication dump to new netlink apiRichard Alpe
Add TIPC_NL_PUBL_GET command to the new tipc netlink API. This command supports dumping of all publications for a specific socket. Netlink logical layout of request message: -> socket -> reference Netlink logical layout of response message: -> publication -> type -> lower -> upper Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-21tipc: add sock dump to new netlink apiRichard Alpe
Add TIPC_NL_SOCK_GET command to the new tipc netlink API. This command supports dumping of all available sockets with their associated connection or publication(s). It could be extended to reply with a single socket if the NLM_F_DUMP isn't set. The information about a socket includes reference, address, connection information / publication information. Netlink logical layout of response message: -> socket -> reference -> address [ -> connection -> node -> socket [ -> connected flag -> type -> instance ] ] [ -> publication flag ] Signed-off-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23tipc: merge struct tipc_port into struct tipc_sockJon Paul Maloy
We complete the merging of the port and socket layer by aggregating the fields of struct tipc_port directly into struct tipc_sock, and moving the combined structure into socket.c. We also move all functions and macros that are not any longer exposed to the rest of the stack into socket.c, and rename them accordingly. Despite the size of this commit, there are no functional changes. We have only made such changes that are necessary due of the removal of struct tipc_port. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23tipc: remove files ref.h and ref.cJon Paul Maloy
The reference table is now 'socket aware' instead of being generic, and has in reality become a socket internal table. In order to be able to minimize the API exposed by the socket layer towards the rest of the stack, we now move the reference table definitions and functions into the file socket.c, and rename the functions accordingly. There are no functional changes in this commit. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23tipc: remove include file port.hJon Paul Maloy
We move the inline functions in the file port.h to socket.c, and modify their names accordingly. We move struct tipc_port and some macros to socket.h. Finally, we remove the file port.h. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23tipc: use registry when scanning socketsJon Paul Maloy
The functions tipc_port_get_ports() and tipc_port_reinit() scan over all sockets/ports to access each of them. This is done by using a dedicated linked list, 'tipc_socks' where all sockets are members. The list is in turn protected by a spinlock, 'port_list_lock', while each socket is locked by using port_lock at the moment of access. In order to reduce complexity and risk of deadlock, we want to get rid of the linked list and the accompanying spinlock. This is what we do in this commit. Instead of the linked list, we use the port registry to scan across the sockets. We also add usage of bh_lock_sock() inside the scope of port_lock in both functions, as a preparation for the complete removal of port_lock. Finally, we move the functions from port.c to socket.c, and rename them to tipc_sk_sock_show() and tipc_sk_reinit() repectively. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-23tipc: use pseudo message to wake up sockets after link congestionJon Paul Maloy
The current link implementation keeps a linked list of blocked ports/ sockets that is populated when there is link congestion. The purpose of this is to let the link know which users to wake up when the congestion abates. This adds unnecessary complexity to the data structure and the code, since it forces us to involve the link each time we want to delete a socket. It also forces us to grab the spinlock port_lock within the scope of node_lock. We want to get rid of this direct dependence, as well as the deadlock hazard resulting from the usage of port_lock. In this commit, we instead let the link keep list of a "wakeup" pseudo messages for use in such situations. Those messages are sent to the pending sockets via the ordinary message reception path, and wake up the socket's owner when they are received. This enables us to get rid of the 'waiting_ports' linked lists in struct tipc_port that manifest this direct reference. As a consequence, we can eliminate another BH entry into the socket, and hence the need to grab port_lock. This is a further step in our effort to remove port_lock altogether. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16tipc: add new functions for multicast and broadcast distributionJon Paul Maloy
We add a new broadcast link transmit function in bclink.c and a new receive function in socket.c. The purpose is to move the branching between external and internal destination down to the link layer, just as we have done with unicast in earlier commits. We also make use of the new link-independent fragmentation support that was introduced in an earlier commit series. This gives a shorter and simpler code path, and makes it possible to obtain copy-free buffer delivery to all node local destination sockets. The new transmission code is added in parallel with the existing one, and will be used by the socket multicast send function in the next commit in this series. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tipc: simplify connection congestion handlingJon Paul Maloy
As a consequence of the recently introduced serialized access to the socket in commit 8d94168a761819d10252bab1f8de6d7b202c3baa ("tipc: same receive code path for connection protocol and data messages") we can make a number of simplifications in the detection and handling of connection congestion situations. - We don't need to keep two counters, one for sent messages and one for acked messages. There is no longer any risk for races between acknowledge messages arriving in BH and data message sending running in user context. So we merge this into one counter, 'sent_unacked', which is incremented at sending and subtracted from at acknowledge reception. - We don't need to set the 'congested' field in tipc_port to true before we sent the message, and clear it when sending is successful. (As a matter of fact, it was never necessary; the field was set in link_schedule_port() before any wakeup could arrive anyway.) - We keep the conditions for link congestion and connection connection congestion separated. There would otherwise be a risk that an arriving acknowledge message may wake up a user sleeping because of link congestion. - We can simplify reception of acknowledge messages. We also make some cosmetic/structural changes: - We rename the 'congested' field to the more correct 'link_cong´. - We rename 'conn_unacked' to 'rcv_unacked' - We move the above mentioned fields from struct tipc_port to struct tipc_sock. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tipc: clean up connection protocol reception functionJon Paul Maloy
We simplify the code for receiving connection probes, leveraging the recently introduced tipc_msg_reverse() function. We also stick to the principle of sending a possible response message directly from the calling (tipc_sk_rcv or backlog_rcv) functions, hence making the call chain shallower and easier to follow. We make one small protocol change here, allowed according to the spec. If a protocol message arrives from a remote socket that is not the one we are connected to, we are currently generating a connection abort message and send it to the source. This behavior is unnecessary, and might even be a security risk, so instead we now choose to only ignore the message. The consequnce for the sender is that he will need longer time to discover his mistake (until the next timeout), but this is an extreme corner case, and may happen anyway under other circumstances, so we deem this change acceptable. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-14tipc: merge port message reception into socket reception functionJon Paul Maloy
In order to reduce complexity and save a call level during message reception at port/socket level, we remove the function tipc_port_rcv() and merge its functionality into tipc_sk_rcv(). Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-14tipc: compensate for double accounting in socket rcv bufferJon Paul Maloy
The function net/core/sock.c::__release_sock() runs a tight loop to move buffers from the socket backlog queue to the receive queue. As a security measure, sk_backlog.len of the receiving socket is not set to zero until after the loop is finished, i.e., until the whole backlog queue has been transferred to the receive queue. During this transfer, the data that has already been moved is counted both in the backlog queue and the receive queue, hence giving an incorrect picture of the available queue space for new arriving buffers. This leads to unnecessary rejection of buffers by sk_add_backlog(), which in TIPC leads to unnecessarily broken connections. In this commit, we compensate for this double accounting by adding a counter that keeps track of it. The function socket.c::backlog_rcv() receives buffers one by one from __release_sock(), and adds them to the socket receive queue. If the transfer is successful, it increases a new atomic counter 'tipc_sock::dupl_rcvcnt' with 'truesize' of the transferred buffer. If a new buffer arrives during this transfer and finds the socket busy (owned), we attempt to add it to the backlog. However, when sk_add_backlog() is called, we adjust the 'limit' parameter with the value of the new counter, so that the risk of inadvertent rejection is eliminated. It should be noted that this change does not invalidate the original purpose of zeroing 'sk_backlog.len' after the full transfer. We set an upper limit for dupl_rcvcnt, so that if a 'wild' sender (i.e., one that doesn't respect the send window) keeps pumping in buffers to sk_add_backlog(), he will eventually reach an upper limit, (2 x TIPC_CONN_OVERLOAD_LIMIT). After that, no messages can be added to the backlog, and the connection will be broken. Ordinary, well- behaved senders will never reach this buffer limit at all. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12tipc: align usage of variable names and macros in socketJon Paul Maloy
The practice of naming variables in TIPC is inconistent, sometimes even within the same file. In this commit we align variable names and declarations within socket.c, and function and macro names within socket.h. We also reduce the number of conversion macros to two, in order to make usage less obsure. These changes are purely cosmetic. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12tipc: eliminate upcall function pointers between port and socketJon Paul Maloy
Due to the original one-to-many relation between port and user API layers, upcalls to the API have been performed via function pointers, installed in struct tipc_port at creation. Since this relation now always is one-to-one, we can instead use ordinary function calls. We remove the function pointers 'dispatcher' and ´wakeup' from struct tipc_port, and replace them with calls to the renamed functions tipc_sk_rcv() and tipc_sk_wakeup(). At the same time we change the name and signature of the functions tipc_createport() and tipc_deleteport() to reflect their new role as mere initialization/destruction functions. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12tipc: aggregate port structure into socket structureJon Paul Maloy
After the removal of the tipc native API the relation between a tipc_port and its API types is strictly one-to-one, i.e, the latter can now only be a socket API. There is therefore no need to allocate struct tipc_port and struct sock independently. In this commit, we aggregate struct tipc_port into struct tipc_sock, hence saving both CPU cycles and structure complexity. There are no functional changes in this commit, except for the elimination of the separate allocation/freeing of tipc_port. All other changes are just adaptatons to the new data structure. This commit also opens up for further code simplifications and code volume reduction, something we will do in later commits. Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>