|
|
|
|
Real Time Performance Guarantees on Switched Networks
Srinidhi Varadarajan srinidhi@cs.sunysb.edu March 21st, 1997
Abstract With an explosion in the growth of multimedia, real time traffic sources earlier restricted to special purpose domains have now proliferated into Local Area Networks. Real Time applications such as multimedia and videoconferencing need Quality of Service (QoS) guarantees from the network in terms of bandwidth, delay and jitter. Apart from QoS guarantees, these applications are also data intensive requiring an ever increasing bandwidth to support them. Switched networking provides an attractive alternative with a scalable solution to the bandwidth bottleneck. Furthermore, advances in cell networking have led to technologically viable ATM switches which can provide service guarantees. However, they are currently too expensive to see deployment in the LAN market which is dominated by Ethernet. In this paper we survey guarantee mechanisms and their implementations on switched networks from a computer systems engineer perspective. We finally present our solution to provide bandwidth guarantees on Ethernet, the EtheReal switch, and contrast it with guarantee schemes on other switched networks. With increasing requirement for QoS support in the existing Ethernet market we believe that EtheReal provides an attractive solution to the problem. 1. Introduction The past few years have seen an explosion in the growth of multimedia applications, placing stringent requirements on the network in terms of packet delay, packet loss and delay variance (jitter). These requirements cannot be satisfied by current networks that only support best effort delivery. Hence, any effort to integrate real-time traffic with existing best-effort delivery traffic necessitates the introduction of a Quality of Service contract model through which the network enters into a contract with an application to provide a certain quality of service. Most of the local area networks are based on a shared media architecture. As traffic requirements increase, they quickly become congested. With no inherent scalability, they rely on periodic technological advances to bring in increased bandwidth which usually results in a tear-down of the existing network to upgrade to a new one. Huge existing investments into networks make this an expensive and cumbersome solution, with upgrades lasting over years. Switched networks provide a scalable alternative to this bandwidth bottleneck. They use a network switch to switch data between sub-networks connected to them. Each switch has a fixed switching capacity (in terms of bandwidth), and as traffic requirements increase, more switches can be added to increase the available bandwidth. Switched networks can broadly be classified into (i) Cell based networks and (ii) Packet networks on the basis of their unit of transmission. Cell based networks use fixed length cells as their transmission element leading to predictable cell transfer delays. To reduce the overhead of partially filled cells, the size of the cell is small. Higher layer protocols like IP and TCP require a large cell size to reduce the protocol header to payload ratio. This has lead to the design of adaptation layers to merge the two technologies thus adding more overhead. On the other hand, packet networks use relatively large variable length packets, bounded by a fixed maximum data transfer size, as their transmission element. Due to their variable length, the only partially filled packets in packet networks are trailer packets occurring at the end of a large transmission that is broken into maximum transfer unit sized packets. Their relatively large packet size also reduces the higher layer protocol overhead. Inspite of its high overhead, fixed data cell size networks have several desirable features. Unlike packet networks, fixed data unit size in cell based networks leads to a more predictable transmission delay variance. Coupled with a predictable transmission delay, cell based networks offer distinct advantages towards the implementation of Quality of Service guarantees.. The current generation of cell based networks such as Asynchronous Transfer Mode (ATM) networks support several classes of Quality of Service guarantees. A number of surveys [1,2,3] have been conducted on QoS issues in local area and wide area networks. However, the focus has been on mechanisms involved in delivering QoS guarantees like congestion control, flow control, error control, policing, guarantee mechanisms etc. In this paper we provide an overview of QoS standards for cell based ATM networks with the focus being on a survey of their commercial implementations. The survey is conducted from a systems design perspective and aims to provide an overview of the state of commercial ATM switch implementations with respect to ATM standards. Owing primarily to its high cost, the current deployment of ATM favors its usage as a backbone medium, rather than as a workgroup solution, whereas real time traffic such as video-conferencing is more prevalent in local area to enterprise-wide networks serviced mostly by Ethernet based networks. Hence, support for QoS would require either changing the local area network infrastructure to ATM, or a mechanism to deliver QoS guarantees over Ethernet. Changing the local area network infrastructure from Ethernet to the radically different ATM is not expected in the near future due the complexity of the conversion and the high costs involved. Thus a real time guarantee mechanism over Ethernet provides feasible solution to provide QoS guarantees in the intranet environment. In this paper, we propose a framework for a real time Ethernet switch called EtheReal which provides QoS guarantees in a switched Ethernet environment. We also present an implementation outline using off-the-shelf hardware. Unlike other commercial/research real time Ethernet implementations [4,5,23,26], which require either changing the network hardware [4,5] or the NIC driver[23,26], our solution works at the user level. No modifications are needed in the underlying operating system to support our scheme. In our scheme, multiplatform real time Ethernet support merely requires porting of the EtheReal software libraries. The rest of the paper is organized as follows. Section 2 provides an overview of QoS standards in ATM networks. Section 3 talks about QoS issues in ATM networks. In Section 4 we discuss QoS implementation details in commercial switches. In Section 5 we propose a framework for QoS guarantees on switched Ethernet. In Section 6 we conclude this paper with a summary of major findings from this research.
2. Overview of Quality of Service on ATM Networks
2.1 Fundamentals of ATM Network Operation In an ATM network, hosts communicate with each other using fixed data size transmission elements known as cells. An ATM network consists of ATM switches connected to host interfaces and other ATM switches by point-to-point links. ATM networks support two kinds of network interfaces : (i) User - Network Interface (UNI) and (ii) Network - Network Interface (NNI). UNI refers to a link that uses the UNI signalling protocol used to communicate between a host and a switch. NNI is a term for the link used between ATM switches talking on the NNI signalling protocol. ATM networks are inherently connection oriented in nature i.e., they need to setup a connection before data transfer can begin. These connections are referred to by two numbers viz., the Virtual Path ID (VPI) and the Virtual Channel ID (VCI). Virtual paths are a bundle of traffic consisting of multiple virtual channels. All virtual channels on a virtual path are switched along the same route on the basis of their common VPI. VPI and VCI however are only locally significant between a host and a switch or a switch and a switch. They are remapped at each switch along the route. Unlike current network interface technologies, this scheme does away with globally unique identifiers. ATM switches reserve VPI 0 for transfer of control messages. Fundamentally, ATM switch operation is very simple. The switches receive data at their input ports on a certain VPI/VCI. They have a local translation table that contains the destination port for the incoming VPI/VCI and a new remapped VPI/VCI pair. They then transmit the data on the destination output port with the new VPI/VCI. The simplicity of this scheme is due to the fact that the local translation tables are maintained by external sources. Based on the mechanism to setup the translation tables, there are two basic types of ATM connections :
To setup a connection, the source host initiates an ATM signaling sequence on VPI = 0 VCI = 5, well-known virtual channel. The connection request is routed through the network and if the requested QoS constraints (explained below) are met, a connection is set up. The fundamental types of ATM connections are (i) Point-to-Point connections which connect two ATM hosts or switches and (ii) Point-to-Multipoint connections which connect one sender to multiple receivers i.e. the source node (root) can send messages to multiple receiving nodes (leaves). This transfer is unidirectional, the leaves cannot send data back to the root.
Figure 1 : Layout of an ATM network. Hosts communicate on point-to-point links with the switch using UNI signalling. Inter-switch communication uses NNI signalling.
Figure 1 shows the layout of an ATM network with hosts A and B connected to switch M and hosts C and D connected to switch N. A connection request origination from host B to host C would travel through the switches as shown by the dark dashed lines. VPI/VCI remapping would occur at each switch along the path. Since the QoS constraints guaranteed by the network have to be adhered to, further data transactions on this connection always follow the same route. Owing to its high cost, ATM switching has been employed primarily as a backbone networking technology, with the rest of the network relying on shared media LAN technologies. The vast majority of networks today are packet based. Combining the two technologies in a homogenous network requires an ATM Adaptation Layer (AAL) to segment packets to cells at the source and reassemble cells to packets at the destination. The fundamental purpose of the AAL's is to efficiently package higher level information such as datagrams, multimedia traffic etc., onto an appropriate connection at the ATM layer so that the information can be reassembled back at the receiver. Based on the framing trade-off, ATM networks offer the following adaptation layers [7,8,9] :
Choice of the ATM adaptation layer then depends on the application, with the two prominent ones being AAL 1 and 5. Most of the work on IP over ATM use AAL 5 as the adaptation layer. It seems likely that AAL 5 will emerge as the only adaptation layer in the future.
2.2 Quality of Service over ATM Networks One of the main advantages of ATM networks is its ability to guarantee its service quality. ATM networks carry both real time and best effort delivery traffic and are expected to deliver independent Quality Of Service contracts. To achieve this, applications describe their traffic parameters to the network and request a QoS guarantee from it. The network attempting to provide a Quality of Service guarantee needs to :
The rest of the section discusses the above phases in the delivery of a QoS guarantee. 2.2.1 The ATM Service Contract In ATM networks, traffic is categorized into service classes on the basis of its traffic pattern and the guarantees it expects from the network. An application needs to classify its traffic into one of the available service classes before requesting a QoS guarantee. The service classes defined for ATM [9] are : 1. Guaranteed Traffic Classes a) Constant Bit Rate (CBR) Traffic b) Variable Bit Rate (VBR) Traffic i) Real Time Variable Bit Rate (RT-VBR) Traffic ii) Non Real Time Variable Bit Rate (NRT-VBR) Traffic 2. Best Effort Traffic Classes a) Available Bit Rate (ABR) Traffic
Figure 2 : Traffic patterns in the ATM service classes The CBR service class is used by connections that require a fixed amount of bandwidth. This is specified by the Peak Cell Rate (PCR) demand from the application. The application can then transmit at or below the requested rate which is assumed to be constant by the network throughout the duration of the connection. This category is intended for real time traffic requiring bounded cell transfer latency and latency variation [10]. CBR traffic is typically generated by voice, video and circuit emulation traffic. Cell delay variation is controlled by dropping cells which exceed the negotiated bounds with the network assuming that these non-confirming cells are no longer of any use. The traffic parameter specified for this class is its PCR. The RT-VBR service class is meant for applications which transmit a bursty traffic load, but require tight bounds on delay and delay variation. It is typically generated by video traffic. Sources indicate their average Sustained Cell Rate (SCR) and their peak rate. The network gives a hard guarantee on SCR and a statistical guarantee for PCR. As in the case of CBR, cells delayed beyond the delay or delay variation bounds may be dropped. Traffic parameters specified for this class are PCR, SCR and its Maximum Burst Size (MBS) which defines the maximum number of contiguous cells that can be transmitted. The NRT-VBR service class is intended for applications that generate a bursty traffic load, but do not have any constraints on cell delay and delay variation. Similar to RT-VBR, NRT-VBR is guaranteed its SCR, with PCR being a statistical guarantee. The network attempts to provide low latency low delay variation service for all cells belonging to this traffic class. Both RT-VBR and NRT-VBR classes support statistical multiplexing of connections. Traffic parameters specified for this class are PCR, SCR and MBS.
Table 1 : ATM Traffic Class Attributes and Guarantees The ABR service class is meant for traffic sources which can change their transfer rate based on an indication from the network on the current bandwidth availability. This allows them to exploit changes in availability of network resources after connection setup. There are a wide variety of applications that need a minimum and maximum data transfer rate, but do not have a consistent average data transfer rate. ABR is tailored towards this category of applications. Sources specify a Minimum and Maximum (Peak) Cell rate (MCR & PCR). The network provides a varying bandwidth between the specified MCR and PCR based on sharing of its internal resources between multiple ABR connections. ABR is not meant to support real-time applications. Since ABR connections are throttled at source to control their data rate, a flow control mechanism is specified as a feedback scheme to indicate the available data rate back to the source. The ATM Forum specifications define a closed loop feedback scheme which uses a rate based flow control mechanism. Traffic parameters specified for this class are PCR and MCR. The UBR traffic class is intended for non critical "best effort delivery" traffic sources which do not have tightly constrained delay and delay variation. UBR sources are expected to transmit in non continuous cell bursts. This service supports statistical multiplexing of sources. UBR sources specify their PCR, but the network does not take it into account, i.e. the network will not refuse a connection if it does not have the bandwidth to sustain the specified PCR. UBR sources can transmit at, below or above the specified PCR. The network does not provide any guarantees. The traffic parameter specified for a source in this class is its PCR. Table 1 contrasts the traffic classes by their parameters they specify and the guarantees given to them. 2.2.2 Connection Setup To establish an ATM connection, an ATM host initiates a connection request, requesting a route through the network leading to the target host. This request leads to a negotiation between the host and the network, wherein the host specifies its desired QoS parameters. These QoS parameters can be broken down into (i) Non-additive link attributes (ii) Additive link metrics. Parameters like bandwidth and delay variance are absolute parameters which are known at each switch along the route. Thus each switch can independently give a guarantee on these parameters. Parameters like total cell transfer delay are additive in nature with the summation of the values at each intermediate switch yielding the final value. When a connection is initiated, the switch executes an admission control function on the absolute parameters. The admission control function decides if the switch has enough resources to accept the connection. On accepting the connection the switch attempts to find a path that can satisfy the additive parameters. If any switch discovers that such a route cannot be found , or it does not have the resources to accept the connection, it initiates a crankback mechanism, wherein it traces the route back to the connection initiating host (source) and tears down each link in the path. If a route is found through the network meeting the requested QoS parameters, a connection is set up and the network enters into a QoS contract with the host. The network guarantees to meet the QoS, if the host agrees not to send more traffic than was negotiated in the connection establishment procedure. Ideally, this scenario would work if all applications stuck to the traffic parameters they negotiated. This requires that applications be able to predict, their exact requirements, something that is not always possible. To ensure that applications stick to their negotiated contract, the network has to police the contract, dropping cells that do not confirm to the contract.
2.2.3 Traffic Management After a connection has been set up, the network needs to manage its incoming traffic in order to deliver the negotiated guarantee. This forms the basis for traffic management. The fundamental functions of traffic management are (i) Traffic Shaping (ii) Traffic Policing and (iii) Congestion Control. Traffic shaping is performed at the periphery of the network, where hosts connect to the network, typically at network adapters, hubs, bridges and routers. It ensures that traffic originating from the host matches the contract negotiated with the network. Traffic policing is a management function performed by the ATM network switches and ensures that traffic parameters are within the negotiated contract. Ideally, traffic shaping at the hosts would ensure that the traffic meets the negotiated contract and switches would not have to do any policing, as all sources would appear as constant rate flows. However traffic shaping at the host is performed at the AAL. Both the choice as well as the presence of an AAL at a host is optional with no mechanism to verify its presence. Since switches cannot rely on the presence of an AAL, they would have to do their own policing to prevent malicious sources from disrupting other guaranteed traffic at the switch. To police non confirming cells, ATM switches set a bit in the cell header called the Cell Loss Priority (CLP) bit, which effectively reduces their priority. These cells are dropped only when the switch does not have enough resources to transfer them. Policing is implemented in modern ATM switches on a connection by connection basis, by a buffering technique called the leaky bucket algorithm. In this mechanism, traffic flows into the buffers (bucket) at a variable rate from the network, but leaves the buffers at a constant rate which is the negotiated rate between the host and the network. The size of the buffer is determined by the negotiated contract. To ensure fair service to best effort delivery traffic (UBR and ABR), ATM switches implement congestion control, since these service classes are not given any performance guarantees and are more likely to run into arbitrary congestion based on the current network load. Congestion control schemes can be implemented on a (i) Link-by-Link Basis and (ii) End-to-End basis. Link-by-Link flow control schemes provide control on a per VC basis. These schemes use forms of back-pressure (jamming) algorithms to indicate back to the sender that there is a congestion further along in the network and that it should reduce its traffic to help ameliorate the congestion. Since this scheme works on a per VC basis, the network can isolate the congestion down to the over-active ports and reduce their traffic without affecting the uncongested ports. The downside of this scheme is that it requires precise control which is currently both complicated and expensive. The back-pressure algorithms in current use are (i) Rate Based Flow Control and (ii) Credit-based Flow Control. In the rate based flow control scheme, downstream switches periodically indicate the rate at which upstream switches can communicate to them to avoid congestion. Congested downstream ports reduce the available rate to upstream switches. Credit based flow control schemes control traffic flow by transmitting the remaining buffer space (credits) [13] at the downstream switch back to the upstream switches. End-to-End flow control schemes provide flow control at the periphery of the network. The network itself does not provide any other flow control mechanism, so this flow control scheme calls for a large number of buffers at the periphery to handle traffic bursts. It does not require any back-pressure intelligence in the switch, and since the periphery of the network currently connects to existing LAN's like Ethernet which do not have any form of flow control (CSMA/CD is a primitive form of congestion control), the performance penalty is not very great. The benefit of this scheme is the simplicity of the hardware implementation which justifies the extra buffer overhead and the moderate loss observed due to inability of the scheme to quickly react to congestion. As LAN's start moving towards ATM, this scheme would have to be dropped in favor of the more precise Link-by-Link flow control. At the periphery of the network, ATM adaptation layers connect to existing local area network technologies, breaking networks packets into cells. ATM switches operate at the cell level without knowledge of the packet structure of the cell. When traffic sources exceed their negotiated values and the combined traffic load is more than the switch (with its buffering) can handle, switches resort to dropping cells in order to maintain the invariant Input Bandwidth = Output Bandwidth. In this scenario, if ATM switches drop cells without any knowledge of the adaptation layers operating at the periphery, the cells dropped could be from different packets. End hosts receiving a packet with a dropped cell cannot reassemble the packet and would request retransmission, a high overhead exercise. ATM networks define a term called goodput which refers to the number of network packets successfully transmitted through the network. To improve goodput (as opposed to cell drop rate), switches use intelligent packet discard mechanisms where they look into segmentation and reassembly header information maintained by the ATM adaptation layer in each cell. Under congestion, switches then use this information to drop cells from the same packet to reduce end-to-end packet retransmission. 2.3 Broadcast/Multicast Unlike shared medium based LAN's, ATM networks do not have an analog for broadcast. They rely on switches to be able to be able to perform cell replication. An obvious analog to the broadcast mechanism would be a multipoint to multipoint link. However, the design of the AAL 5 protocol does not include a sequence ID. Hence all cells from a source to a destination would have to be received consecutively for them to be processed correctly. In a point-to-multipoint connection, if leaves start transmitting cells, other leaves would receive cells from the root interleaved with cells from transmitting leaf nodes. This interleaved transmission would disrupt the sequenced flow outlined. The receiving leaves would not be able to reassemble the cells into packets. Despite this deficiency in AAL 5 (AAL 3/4 do not have this problem), it is a low overhead protocol designed to supplant the more complex AAL 3/4 protocols and is in wide use. The following solutions have been proposed to solve the problem :
Figure 3 : Multipoint to Multipoint connection in the VP Multicasting scheme. The dark lines show hosts A, B, C and D connected in the multicast group using VP=1. Each host communicates on a different VC
Figure 4 : Multipoint to Multipoint connection in the overlaid point to multipoint link scheme. Each host opens a point to multipoint link to every other host in the multicast group with itself as the root and the other hosts as leaves.
Each of the schemes above has its own strengths and weaknesses with no solution dominating over any other. The VP-Muticasting scheme requires a protocol to uniquely allocate VCI's to the participating nodes. No such protocol is in place currently. Also it is not know if segmentation and reassembly layers can handle this scheme. The Multicast Server scheme is a scalable solution since it requires only 2 connections from every host. However it presents a single point to failure in the server. With large multicast groups, the server could quickly become a bottleneck. The Overlaid Point-to-Multipoint Links scheme requires that each node maintain N connections for an N node multicast group. It also requires admission protocols to admit a new host into the group. Each host has to be informed of the new entry and all point-to-multipoint links have to be updated. The new entrant has to be given to name of each host in the group to open its point-to multipoint link. Currently, proprietary protocols perform the above actions. The Multicast Server and Overlaid Point-to-Multipoint Link schemes are being used at the higher protocol layers. The ATM Forum is looking into proposals for multipoint-to-multipoint communication.
3. QoS Issues in ATM Networks 3.1 Connection Setup In Section 0 we had discussed the connection setup procedure in ATM networks. On receiving a connection setup, a generic framework for Quality of Service implementations would need to perform (i) Admission Control (ii) Routing and (iii) Resource reservation in order to support a mechanism for guarantee delivery. This section deals with the issues involved in the connection setup procedure and the mechanisms needed to solve these issues. 3.1.1 Admission Control When a connection setup call is sent by a host to a switch, the switch has to decide if it can accept the connection. This decision is made by the admission control mechanism and based on two criteria :
The decision on resource availability is made by keeping a running total of resources already committed to existing connections. The running total is queried to determine if the resources required by the new connection would violate a guarantee given to some other connection. The connection parameters are then checked for compliance with policy restrictions. If the above criteria are both satisfied, the connection is accepted. 3.1.2 Routing To setup a connection, we need a mechanism to find a route to the target host specified by the connection setup call. This requires a routing mechanism at the switch to determine a route that would fit the QoS needs of the connection. Since a switch can a only give a guarantee on its own behavior, the routing algorithm also needs a mechanism to query the next switch along a prospective route to determine if it has the resources to support the connection. The routing algorithm can maintain a routing table at the switch which consists of tuples of the format <incoming VPI/VCI, outgoing VPI/VCI, output port>. Using this information the switch can then forward cells based on the incoming VPI/VCI field in the ATM header. As it stands, the ATM Forum has not standardized any routing mechanisms, allowing each vendor to use his own algorithm. 3.1.3 Resource Allocation In order to deliver a QoS guarantee, ATM switches need to allocate resources after accepting a connection. When a connection setup is initiated from a host, the admission control mechanism queries its running total of allocated resources to decide if it has sufficient resources to support the new connection. If the connection is accepted, resources are then reserved within the switch to meet the requested (and granted) service guarantee. Within the switch, resource allocation has to occur at (i) Input Buffers (ii) Switching fabric (iii) Shared memory (iv) Output Buffers and (v) Output link. For input buffered switches, the input buffer space can be partitioned, either statically or dynamically, to provide independent bandwidth to CBR, VBR and UBR/ABR traffic. Based on the traffic service class of the connection, it is allocated bandwidth from one the above input buffer partitions. Shared Memory and Shared Medium switches typically have 2 input buffers organized in the classic double buffer scheme. They rely on the speed of their switching fabric to transfer data out of their input buffers, using them only as temporary storage for the switching fabric. Thus they do not require reservation on the input buffers. Switching fabrics can be divided into (i) Shared medium (bus) based fabrics (ii) Multi-bus fabrics (iii) Cross-bar fabrics and (iv) Banyan networks. To perform resource allocation, shared medium and multi-bus fabrics operate in slotted mode, i.e. they divide their bandwidth into slots with each slot corresponding to a cell transfer. Resource reservation in this context translates to reserving slots on the switching fabric corresponding to the traffic requirement of the connection. Crossbar networks can be considered an extreme case of multi-bus fabrics, using slotted mode operation on each link between an input and output. To ensure that a guarantee can be met in Banyan networks, bandwidth reservation has to be performed on every 2x2 switch along a connection path. This requires that each 2x2 switch be independently capable of performing bandwidth reservation apart from switching, complicating hardware requirements. Banyan networks have not been used in commercial switch implementations. Resource allocation in shared memory and output buffers can be performed in the following ways :
Resource reservation on the output link is required to ensure that multiple VCs headed for the same output link do not exceed the capacity of the link. Since the output link bandwidth is constant, switches keep track of the bandwidth allocated to outgoing connections on each link. This information is also used to perform minimum delay/minimum queue based routing decisions.
3.2 Traffic Management Once a connection has been setup, traffic on the connection has to be managed in order to deliver the guarantee and ensure that connections stay within their contracted parameters. The basic functions performed by traffic management are (i) Shaping (ii) Policing and (iii) Scheduling cells for delivery based on their guarantees.
Traffic shaping is performed at the periphery of the network to ensure that traffic stays within the negotiated contract. To this end, all traffic sources from a host are multiplexed, and shaped using the Generic Cell Rate Algorithm specified by the ATM Forum. The Generic Cell Rate Algorithm specifies the logical framework for cell scheduling. Any practical algorithm which can be shown to be equivalent can be used in its place. The GCRA defines the relationship between PCR and CDV as well as between SCR and burst tolerance. For each cell arrival, the GCRA determines if it conforms to the traffic contract of the connection. Thus it can be viewed as a formal definition of the conformance contract between the network and the host.
Figure 5 : Virtual Scheduling Algorithm of the GCRA The GCRA depends on 2 parameters the time increment I and the arrival time limit L. The notation GCRA(I,L) is used to indicate an instance of the GCRA with increment set to I and Limit set to L. The virtual scheduling algorithm used in the GCRA is shown in Figure 5. The scheduling process updates a Theoretical Arrival Time (TAT), which is the nominal arrival period of the cell assuming a constant rate flow. If the actual arrival time of the cell is not too early, relative to the TAT, i.e. it arrives at or after TAT - L, it is a confirming cell, else it is non conforming. In the virtual scheduling algorithm flowchart of Figure 5, at the arrival of the first cell ta(1), the TAT is initialized to the current time i.e. ta(1). For all subsequent cells, if the arrival time of the kth cell is after the current value of TAT, then it is a confirming cell and the TAT is updated to TAT + I. Also if the cell arrival time is between TAT - L and TAT then it is a confirming cell and the TAT is updated to TAT + I. Finally, if the cell arrival is before TAT - L ( TAT > ta(k) + L), it is marked as non-conforming and the TAT is not updated. As mentioned above any algorithm that can be shown to mark at least the same cells as confirming as the GCRA is considered equivalent to it. Commercial implementations use the leaky bucket algorithm which has been shown to be equivalent to the virtual scheduling algorithm [11].
3.2.2 Traffic Policing Traffic policing is a management function that is performed at the switch to ensure that traffic from connections stays within its negotiated contract. In section 0 we had mentioned that the GCRA represents the conformance contract for the ATM traffic classes. Thus, traffic policing can use the GCRA to check for conformance. However, the GCRA does not define the behavior for non confirming cells. A non conforming cell may be dropped or transferred if sufficient resources exist within the switch to handle it. It is not desirable to drop every non conforming cell, since even otherwise well behaved sources may have a few non conforming cells. For this reason, the traffic policing algorithm has not been standardized by the ATM Forum. Any algorithm may be used for policing, as long as it does not affect the guarantee given to a connection. Commercial implementations use the leaky bucket algorithm to police traffic.
3.2.3 Scheduling In ATM networks, some of the guaranteed traffic classes are given guarantees on the maximum cell delay and cell delay variation. Since we can have multiple real guaranteed traffic sources at each host and multiple connections through a switch, we need a mechanism at the hosts and the switch to schedule cells for delivery from multiple sources, taking into account the stringency of their guarantees. An obvious solution to this problem would be to use deadline based scheduling. Deadline based scheduling involves maintaining deadline metadata on a per connection basis and determining a scheduling (servicing) order for connections. The scheduling order can be computed statically and recomputed when a new connection is setup. While this scheduling policy is possible at hosts, switches may find the compute overhead too high. In the ATM Forum specifications revision 3.1, there are no numerical guarantees given for delay and delay variation, instead theres a concept of delay and delay variation classes. Each class has a bound on the maximum cell delay and delay variation. This class mechanism lends itself easily to prioritization. The greater the stringency of a deadline, the greater the priority. A weighted round robin scheduler at the host is sufficient to ensure that deadlines are met. A similar weighted round robin scheduler is also needed at the switch to ensure that deadlines for traffic from different ports are met. Current commercial implementations use the weighted round robin scheme for deadline based scheduling owing to its simplicity and ease of implementation. The ATM Forum has put out a draft version of its new ATM specification, revision 4.0. In this revision, ATM switches are expected to give numerical commitments on cell delay and delay variation. Currently, ATM manufacturers propose to continue using the above class mechanism and map the classes onto numerical values for maximum cell delay and variation, this solution would only permit a range of discrete values. A deadline based scheduler would be required to truly support traffic classes in the new standard.
3.3 Guarantee Mechanisms In terms of guarantees offered by ATM switches, a first assumption would be that these guarantees are hard in nature. As mentioned above, guarantees necessitate resource allocation, with hard guarantees requiring allocation based on worst case assumptions. This might not always be the best solution. In some cases worst case resource allocation is not possible without significant increase in hardware complexity. For instance, VBR traffic is currently given a guarantee only on its sustained rate, with its peak rate being best effort delivery with a priority over ABR/UBR traffic. To deliver a hard guarantee to VBR would require resource allocation based on the peak rate as opposed to its sustained rate, essentially equating it to CBR traffic with a higher traffic load variance. This solution is inherently pessimistic in nature. Initial VBR connections would take up all the bandwidth in the VBR partition and the switch would deny further VBR connections, statistically even when it has buffer space. A judicious mix of statistical guarantees based on proffered load on a per connection basis, and hard guarantees on constant rate sources constitutes a balanced solution to the problem.
4. Commercial Implementations In this section we survey commercial implementations of workgroup/campus wide network switches offering QoS guarantees from Fore Systems, CISCO and 3Com networking . The focus here is on implementation details of (i) Architecture (ii) Traffic Management (iii) Cell Scheduling and (iv) Congestion Control
4.1 Fore Systems The FORE systems ForeRunner series of ATM switches use a hierarchical shared memory architecture with a Time Division Multiplexing (TDM) based switching fabric [16]. The TDM switching fabric is slotted to allow resource reservation to occur on the fabric, a necessity for a real time scheduler to ensure that the negotiated traffic guarantees are met. The slotted bandwidth of the TDM switching fabric is enough to meet peak transmission rates at all input ports, making the switch non blocking.
Figure 6 : Fores ForeRunner switch Architecture
In this architecture cells are processed in three steps. At the input cells enter an ingress port where they go through a stage of VPI/VCI translation and are checked for compliance with the traffic contract. The cells are then distributed using the TDM switching fabric to multiple shared memory output port clusters managed by the output port controller. Here they go through another stage of VPI/VCI translation and switching finally directing them to a particular port in the cluster. Unlike traditional shared memory architectures, these switches are based on distributed shared memory (DSM) to prevent a single point of failure and provide a upgrade path for scalability. Scalability in terms of number of switched ports can be achieved by adding more output port controllers. This requires an increase in switching capacity of the internal TDM fabric. In the ForeRunner series the internal TDM switching fabric is scalable with a switching capacity of up to 10Gbps. 4.1.2 Traffic Management When a connection is initiated from the host, the switch looks into its internal resources committed to existing connections. If the resources required by the new connection are available, the connection is accepted and the appropriate resources are reserved. To this end, The ForeRunner series requires resource reservation on (i) The TDM switching fabric and (ii) Shared (Buffer) memory. Reservation on the switching fabric is simple owing to its slotted nature. For each accepted connection, the appropriate bandwidth in terms of cell cycles is reserved on the fabric. The switch uses an elastic (dynamic) buffer scheme for buffer memory allocation [18]. Instead of reserving static buffers on a per traffic class basis i.e. partitioning the memory space based on traffic classes, it treats the shared memory as one unit with memory allocation performed on-demand. This gives a better utilization compared to static memory space partitioning which could potentially deny a request even when buffer space exists as shown in Figure 7. The remaining buffer space after resource reservation for the guaranteed traffic classes is given to ABR/UBR traffic.
Figure 7 : Differences in memory allocation between dynamically allocated buffers and static partitions. Traffic Policing is done at the input by the leaky bucket algorithm to ensure that the traffic flow seen inside the switch is the contracted traffic [17]. The only guarantee given to CBR traffic it is its PCR. Thus a single leaky bucket is sufficient to police traffic for it. On the other hand VBR traffic is guaranteed its sustained bandwidth and a higher priority best effort delivery for its peak bandwidth. This requires a leaky bucket to police its sustained bandwidth, and another to police its peak bandwidth over an arbitrary time period. Policing is performed by setting the CLP bit, and letting the switch decide if it has the resources to transfer the cell. 4.1.3 Scheduling In order to meet cell delay and delay variation deadlines ATM switches require real time schedulers to switch data across the switching fabric. Typically ATM switches use a FIFO queue at the input, with the scheduler pulling data out of the queue in the order that it arrived to maintain the "no cell reordering" invariant. The ForeRunner series of switches use Per VC queuing allowing more sophisticated scheduling schemes [18]. FIFO queuing has the disadvantage that CDV is not bounded as it depends on the number of cells currently ahead in the queue. Since simple FIFO queues only allow inserts to the tail and removals from the head, cell prioritization is not possible, resulting in cells from the ABR/UBR traffic classes disrupting CDVs of the guaranteed classes. CDV is deterministic in Per VC queuing since each queue consists of cells from the same connection. The real time scheduler operates on a weighted round robin scheme transferring higher priority cells first, with cells at the same priority level transferred on a round robin basis. This scheme ensures that deadlines for guaranteed traffic are met before best effort delivery traffic is given its share. 4.1.4 Congestion Control Since LAN packets are comprised of a large number of individual cells, the aim is to maximize the number of complete packets transmitted, i.e. maximization of goodput. This means that packets with incomplete cell counts should be immediately discarded, else they would consume network resources such as buffer space and bandwidth. This scenario becomes even more critical when the switch faces congestion, where these useless cells deny resources to good cells. The ForeRunner series of switches implement a complementary set of schemes called Early Packet Discard (EPD) and Partial Packet Discard (PPD) [18] to handle congestion. These schemes specifically improve AAL5 goodput (since they recognize cells based on the AAL5 header), which comprise most of the current ATM traffic. In Early Packet Discard, if the total number of cells at the output buffers exceeds a user settable threshold, the EPD mechanism prevents new packets from entering the rapidly filling buffer where if would mostly incur lost cells due to congestion. EPD thus reserves the remaining buffer space to maximize the chance of getting out currently complete packets within the switch. Once the number of cells falls below the threshold, new packets are allowed into the switch. In spite of the EPD mechanism, if output buffers start overflowing (This can occur if the EPD threshold is set too high, and the tails of the packets already in the switch require more than the remaining buffer space), the PPD scheme discards all the remaining cells in the tail of a packet inside the switch. This scheme clusters the dropped cells into the fewest lost tails in an effort to maintain goodput. This scheme only operates in conjunction EPD and attempts to reduce congestion levels to below the EPD threshold. The last cell in an AAL5 packet is permitted through the network to allow the AAL reassembly layer at the destination to recognize the end of a packet. The ForeRunner switch also implements a scheme called Explicit Forward Congestion Indication (EFCI), wherein the Congestion Indication (CI) bit in the cell header is set to indicate that the cell has traveled through intermediate switches facing congestion. The end systems may react to EFCI as a part of the feedback flow mechanism to reduce their traffic rate.
4.2 CISCO 4.2.1 Architecture The CISCO LightStream 1010 ATM switching fabric uses a shared memory architecture with the memory organized as a monolithic shared memory [20]. Cells arrive at the input ports, undergo a stage of VPI/VCI translation and are fed directly into the shared memory. The output ports are given pointers to the locations of cells in shared memory destined for them. In the final phase ports then collect the cells destined for them from shared memory. Relative to an output buffered TDM bus based architecture, shared memory architectures offer the advantage of better hardware utilization, and because of statistical sharing, they have a higher buffering efficiency. The LightStream series also implements a multicast engine where multicast cells are not duplicated. The multicast engine maintains pointers to the cells in shared memory, and in the case of multicast cells, all output ports in the multicast group point to them. The cell is removed from memory after all the ports in the multicast group have transmitted it.
4.2.2 Traffic Management The LightStream series of switches execute a local Connection Admission Control (CAC) function at the connection initiation phase to determine if can support the connection . This decision is based on two sets of criteria [20] :
A generic CAC function is needed in the path selection process to choose a path that can best meet the requested QoS level. The RCAC algorithm provides a set of user configurable parameters to specify the minimum and maximum bandwidth as a percentage share of the link bandwidth for CBR and VBR connections. The LightStream series also implement Multi Link Trunk Groups (MLTG) where a number of ports can be grouped together as a trunk offering higher bandwidth and/or better service. In this scenario, the switch offers two modes of operation for CBR and VBR traffic.
ABR and UBR traffic are also offered two modes of operation
The buffer management strategy on the LightStream series in novel, in that it allows a tradeoff between the statistical advantages of buffer sharing and per port fairness. Fairness is a necessary criterion since it prevents one port (or a group of ports) from hogging up resources, for e.g. a group of ports might start sending in bursty ABR traffic at high traffic rates to a congested port. This could starve ABR connections on the other ports, or in the worst case reject new ABR connections. Fairness requires that each port have a dedicated queue, but this goes against the statistical sharing advantages of shared memory. The LightStream series of switches define maximum buffer limits for the four delay service classes (CBR, VBR, ABR, UBR) with the default being the total available cell memory (64Kcells in this case). They define a term called the Over Subscription Factor (OSF) which defines the tradeoff between sharing and fairness. OSF is defined as OSF = where i is the service class between 1..4, j is the number of ports between 1..32 and L represents the length of the queue. Assuming the default values for maximum total queue size per port is the same for all ports, i.e. for any j (j=1, 2, 32), the equation for the queue size per port is given by
Varying the value of OSF changes the per port queue size. OSF can take any value between 1..32. A value of 1 would lead to a per port queue size of 2K, which has maximum fairness, but minimum sharing since 32 ports x 2K per port = 64K which is the total buffer size. So there is no sharing in this case. At the other extreme, for OSF = 32, we have a maximum per port queue of 64K. In this case it behaves as a true shared memory buffer with maximum sharing and no fairness. Intermediate values of OSF yeild varying trade-offs between sharing and fairness. Policing in the LightStream series of switches is done using the Dual Mode Single Leaky bucket algorithm. In the case of CBR traffic, a single leaky bucket is used to police its PCR, whereas for VBR, two buckets are used to police its SCR and PCR. The switch provides a threshold based discard mechanism, wherein the maximum of buffers that can be shared by CLP = 0 and CLP = 1 cells is specified. When the threshold is reached, only CLP = 0 cells are permitted into the system. 4.2.3 Scheduling The LightStream series of switches use a modified form of per port FIFO queuing combined with a real time scheduler to deliver the negotiated QoS guarantee [20]. At each input there is an independent queue for each service class as shown in Figure 8.
Figure 8 : Per Port Queuing structure in the LightStream ATM switch The queues are prioritized according to their service class. This reduces delay for the guaranteed traffic classes. This scheme is simpler in terms of implementation as compared to the Per VC queuing scheme. It could potentially suffer from a higher jitter due to the non determinism introduced by the presence of a random number of cells in the queue ahead of it. It however should not affect the QoS guarantee since the leaky bucket would maintain the CDV within the negotiated values. The scheduler operates on a weighted (prioritized) round robin basis, servicing higher priority cells first and operating in round robin mode for cells of the same priority. 4.2.4 Congestion Control The LightStream series of switches is implement the complementary EPD/TPD schemes to maximize goodput under congestion [19, 20]. Under EPD, when the switch buffer queues reach a user settable threshold, the switch starts dropping all cells belonging to a new packet. Cells belonging to packets already in the switch are let in. When congestion levels fall below the threshold, the switch starts admitting new packets. If, the cells belonging to existing packets cause further congestion, TPD is activated and it drops all cells coming into the switch belonging to the tail of a packet. The last cell of a packet is not dropped to allow the AAL5 reassembly layer at the destination to recognize the end-of-packet information encoded in the last cell. These schemes act together to reduce congestion. The EPD/TPD mechanism in this switch is similar to the mechanism in the ForeRunner switch, with the primary difference being that buffering in the LighStream is implemented on a per port basis using shared memory. Hence this switch performs per port EPD/TPD. The LightStream series also implement the EFCI signaling scheme. End Stations may react to the EFCI as a feedback mechanism which indicates congestion, and reduce their traffic. The switch implements EFCI as a user settable threshold, with EFCI flag set when buffer occupancy is beyond the threshold. The switch also implements rate based flow control for ABR traffic. In this scheme, end systems periodically use resource management (RM) cells to probe the network for information regarding bandwidth availability, congestion state and impending congestion. ABR connections intermix RM cells with their usual traffic. These cells are received at the destination and returned back to the source in a closed loop to indicate if any intermediate switches have experienced congestion. The LightStream switch uses the following two schemes to indicate and control congestion at its queuing points :
ABR traffic sources can use this information to increase or decrease their transmission rate in response to the available bandwidth. It should be noted, that this estimation is only an approximation (that can be arbitrarily close to the real available bandwidth) of the state when the RM cell was sent.
In this section we survey a real time Ethernet switch from 3Com networking that supports multiple QoS classes, with the focus being on implementation details. The real time Ethernet switch from 3Com networking called PACE (Priority Access Control Enabled) provides a connection oriented deterministic Ethernet protocol [22,23]. Traffic prioritization is done to provide multiple QoS service classes. The basic network architecture consists of end stations connected to PACE switches via 10Mbps point-to-point links. Considering the current costs of 10Mbps Ethernet, this cost is justified in return for the dedicated bandwidth now available to each user. The switch also has a higher bandwidth uplink channel to connect to a backbone network. The network architecture is shown in Figure 9. Jitter in Ethernet networks is caused largely by the capture effect [23]. In the PACE solution, since the network switch is connected by point-to-point links to the end-stations, it controls the
Figure 9 : Network Architecture using the 3Com Real Time Ethernet Switch. Hosts are connected to switches on point to point links. The switches are connected using a high bandwidth backbone. traffic flow resulting in the elimination of monopolization caused by the capture effect. The solution does not require any modification to host NICs. It should be noted however, that the PACE solution for jitter works only for point-to-point connections. Solutions for Ethernet segments would require modification to the Ethernet NICs at the end-stations. It is not entirely clear that these solutions would any fairer than the current solution which attempts to maximize bandwidth availability. To support data prioritization, PACE uses multiple QoS classes, with a different priority for each class at the switch. It uses a clever mechanism to identify higher priority packets. Since the NIC driver at each end station requiring QoS (called Classes of Service or CoS by 3Com) support has to be changed, 3Com has written an NIC driver that snoops on specified socket ranges used by higher priority traffic like video conferencing and marks them, so they can be identified by the switch. This scheme allows for prioritization for any kind of traffic, and based on need, ABR traffic may be given higher priority than say CBR video conferencing. The downside of the scheme is that QoS support requires a driver change, with drivers only available for the 3Com family of Ethernet adapters. Unless the above solution is standardized, QoS support based on it requires an NIC change, inherently limiting the solution. 5. Our Approach : The EtheReal Switch With an installed base of 50 million NICs, [23] Ethernet is the most prevalent MAC protocol. With current advances in technology, Ethernet can now support data transmission speeds of up to 100Mbps. Work on Gigabit Ethernet is already underway and protocols are expected to be standardized within 18 months. This increase in bandwidth has spawned several new traffic sources like multimedia which require guarantees from the network. As it currently stands, Ethernet does not support any form of QoS. On the other hand switched network technologies like ATM which inherently support QoS guarantees are too expensive to be used in Local Area Networks/Workgroups which see most of the multimedia traffic. Any attempt to bring in real time traffic to Ethernet based Local Area Networks would require a mechanism to support QoS. As mentioned in Section 0, any addition to the Ethernet MAC layer to support QoS guarantees would require its standardization and changing the driver associated with the NIC. This severely limits the scope of the solution. In this section we propose a scheme to implement Quality of Service Guarantees on top of the existing Ethernet MAC protocol without changing it. The proposed scheme works at the application level. This section is organized as follows. In section 0 we talk about the issues involved in providing QoS guarantees over Ethernet. Section 0 deals with the hardware architecture of the EtheReal switch. In section 0 we discuss the Quality of Service support under EtheReal. Section 0 deals with the software architecture of EtheReal. We conclude the section with a discussion on the limitations of our approach.
The Ethernet protocol was designed to provide connectionless unreliable data delivery. This poses problems in the delivery of QoS guarantees. The primary problems are listed below :
An implementation of real time guarantees would require that the above issues be addressed. Effects like the exponential backoff mechanism can only be ameliorated and not be removed without change to the Ethernet protocol. In the subsequent sections we discuss our architecture that deals with these issues.
The EtheReal switching architecture consists of end hosts connected to EtheReal switches on 10Mbps point to point links. The switches are interconnected using 100Mbps inter-switch point to point links. The network architecture is shown in Figure 10.
Figure 10 : EtheReal Network Architecture The proposed switch prototype uses off-the-shelf hardware for simplicity. It is based on a Intel Pentium computer with a PCI bus. The prototype consists of two 4 port 10/100 Mbps Ethernet cards with an onboard processor on each card and 1MB of buffer memory. The cards have a local bus for transferring data between ports on the same card. On receiving a packet, the onboard processor schedules it for delivery. If the target port is on the same card, it transfers data on the local bus without generating any traffic on the PCI bus. Transfers between ports residing on different cards requires a data transaction on the PCI bus. Pentium based computers support up to 4 PCI slots. In the worst case we have four, 4 port Ethernet cards with 15 Ethernet ports operating in 10Mbps mode and 1 operating in 100Mbps mode. If data transactions on all ports require a transaction on the PCI bus, this leads to a worst case PCI traffic of 15 x 10Mbps + 100 Mbps = 250 Mbps. This is below the sustained PCI bandwidth of 33MBps (= 33 x 8 = 264 Mbps) and well below the PCI burst data transfer rate of 133MBps. The switch runs a stripped down version of the Linux Operating System with the TCP/IP protocol stack.
5.3 Quality of Service Support The primary design goal of EtheReal is to support real time traffic along with current best effort delivery traffic, delivering independent guarantees to both of them. To achieve this, applications with real time requirements indicate their traffic parameters and request a guarantee from the switch. As mentioned in section 0, this process involves (i) Defining the service contract (ii) Setting up a connection that meets the contract and (iii) Delivering the contracted service. The rest of this section discusses the above phases in EtheReal. 5.3.1 The EtheReal Traffic Contract In order to support real time traffic, we need a mechanism to prioritize data. This is done by classifying traffic into service classes based on expected traffic patterns. Each service class has a data priority level and associated guarantees. Applications that need real time guarantees first need to classify their traffic into one of the available service classes based on their expected traffic behavior before requesting a QoS guarantee from the network. The EtheReal switch supports the following two service classes :
Since the primary focus of this architecture is to support real time applications we support only the essential traffic classes. Video and audio traffic sources are either constant or variable bit rate - both of which can be supported by the RT-VBR service class. The connectionless UBR class is provided to support best effort delivery based applications. 5.3.2 Connection Setup In order to deliver Quality of Service guarantees, we need a connection oriented network. The Ethernet MAC protocol is a connectionless shared medium protocol. In this scenario, connection setup involves (i) a connection setup mechanism (ii) admission control (iii) resource reservation and (iii) routing. 5.3.2.1 Connection Setup Mechanism Since the constraint on the problem is that we cannot change the network driver, the difficulty lies in generating a connection ID to distinguish connection oriented RT-VBR traffic from connectionless UBR traffic originating from the same host. This problem is solved in a novel way. Applications requiring real time guarantees talk to a Real Time Communication daemon (RTCD) on the local host. This daemon is responsible for connection setup and teardown. The RTCD initiates a connection request to the target host by contacting the switch on UDP, indicating the target IP address and the desired QoS parameters. Based on this information, the switch reserves resources within itself and determines a route to the target host. The daemon then generates a proxy Ethernet address of the format FF FF FF FF X X (Typical Ethernet addresses are 6 bytes long) where X X is a 2 byte connection ID. This address is given to the switch as the connection ID. The daemon also generates a proxy IP address of the form 0.0.X.X (where X.X is the 2 byte connection ID) and associates the IP address with the proxy Ethernet address using ARP.
Figure 11 : Connection Setup Procedure in the EtheReal Architecture. Host A initiates a connection to Host B through the Real Time Communication daemon. The daemon at host A creates a Proxy IP address and a Proxy Ethernet address as a connection ID. Connection ID remapping occurs at each switch along the route.
To transfer real time data, we open a socket to the proxy IP address. Since this has been associated with a corresponding proxy Ethernet address, the MAC layer would generate a packet with the proxy Ethernet address as the destination address. This is picked up by the switch which recognizes the connection ID embedded in the proxy Ethernet address. The switch then routes the packet to its next hop. The last switch along the route then substitutes the proxy Ethernet address with the Ethernet address of the destination and delivers the packet to the target host. The problem here is that the IP header carried in the Ethernet packet contains the proxy IP address generated at the source host as the destination address. This will not be recognized by IP at the target host. To rectify this, the last switch along the route also replaces the proxy IP address with the real IP address. On receiving a connection request, the EtheReal switch attempts to find a route to the target host. There are two possibilities here. (i) The switch is connected directly to the target host or (ii) The switch is connected to the target via a series of switches. The first case is simple, the switch delivers any packet it receives directly to the target host after performing the IP address translation outlined in the previous paragraph. In the second case, we perform connection ID remapping at each intermediate switch. With this scheme we do not need globally unique connection IDs since the tuple <src Ethernet address, Connection ID> would be unique for any point to point link in the network. The switch initiates a connection request on UDP to the next switch along the route and generates a remapped connection ID which is delivered to the next switch along the route. Since we are allowed to modify the network level drivers at the switch, we do not require ARP to stuff the connection ID into the Ethernet packet. Figure 11 shows an example of this translation scheme. The routing table consists of tuples of the format <Source Ethernet address, Source Connection ID, Output Port, Translated Connection ID, Target IP address, QoS parameters>
When a connection setup is initiated from a host, the switch needs a mechanism to determine if it has sufficient resources to satisfy the requested QoS. This is done by an admission control mechanism which keeps track of the shared resources within the switch. The shared resources are input link bandwidth and output link bandwidth. The admission control mechanism uses the following algorithm :
When a connection setup is accepted at a switch by the admission control function, we need to reserve resources in order to deliver the promised QoS. Resource reservation within the switch requires (i) Bandwidth reservation on the input and output ports (ii) Time slot reservation in the switching fabric and (iii) Buffer reservation. As mentioned in section 0, bandwidth reservation on the input and output ports is performed by keeping a running total of the allocated bandwidth. When a connection is accepted, the running total is incremented to indicate the bandwidth allocated to the new connection. Inside the switch, EtheReal operates in a Time Division Multiplexed mode. Each slot on the TDM corresponds to transferring one packet from an input port to an output port. Each input link is allocated time slots on the TDM fabric corresponding to the allocated bandwidth on the link. Buffer reservation would be required at (i) Input Ports (Input Buffers) to buffer cells before they are transferred to the corresponding output ports and (ii) Output Ports to buffer cells in case more than one input port has packets destined for it simultaneously. As shown in section 0, the EtheReal switch has sufficient switching bandwidth to handle all inputs streaming data at peak rate. Thus a fixed number of input buffers can be allocated to each input port to buffer packets. Prior to resource reservation, admission control verifies that sufficient bandwidth exists at each output port to handle all routes passing through it. However independent sources might simultaneously send packets to the same output port. This requires buffer reservation at the output port. At the worst case all the input ports may send data simultaneously, so the maximum number of output buffers required is equal to the number of input ports supported by the switch. While the above mechanism for output buffer reservation is correct for real time connections, multiple connectionless non real time traffic sources may also send packets to the same output port. Since admission control is not performed on non real time traffic, we have no means of knowing if sufficient bandwidth exists at an output port to transfer all non real time packets destined for it, without disrupting guarantees to real time traffic. To solve this, we use two buffer queues at each input port and output port, one for real time traffic and the other for non real time traffic. Packets in the non real time queues are scheduled for delivery after the real time queues have been serviced. Since admission control has verified that real time sources have sufficient bandwidth along their route, congestion would only be experienced by non real time sources. In this scenario, congestion translates to overflow of the fixed size non real time buffer queues. At the input, this overflow would occur if the data rate of non real time sources is greater than their allowed data rate. At the output buffer an overflow would occur if the output port does not have sufficient bandwidth to transfer non real time packets after meeting the guarantees for real time sources. On an overflow, packets in the non real time queue would be overwritten and effectively dropped. Experimental measurements need to be made to determine the extent of the effect of this congestion control mechanism on non real traffic sources.
5.3.2.4 Routing When a connection setup call is received, we need a routing mechanism to determine a route to the target machine. To do this, each switch in the network needs to maintain a routing table which contains the route to every host in the network. Initially the routing table at each switch is empty. The switches use a generic flooding algorithm which works in two phases. In phase 1 each switch accesses the point to point links connected to it and determines the Ethernet address of the connected hosts. After this phase every switch in the network has a routing table containing the hosts directly connected to it. In phase 2, each switch passes this information to every other switch in the network. At the end of this phase, every routing table in the network contains a list of routes to every host in the network. This routing table setup mechanism occurs during the startup phase of the network. If a host is added to the network after the startup phase, the switch sees a new source Ethernet address on the first transmission from the host. It inserts this Ethernet address into the routing table and informs the other switches in the network of this new addition. This mechanism ensures that hosts can be added to the network (or network cards on existing hosts changed) without having to re-initialize (reboot) the network. In the process of routing, there may be multiple routes to the target, all of which are potentially capable of satisfying the requested QoS. Since switches can only grant guarantees on local resources, the connection setup call would have to go through every switch along the route to determine if sufficient resources exist to establish a connection. To determine a route, an upstream switch selects a route from the available routes and passes the connection request to the next switch. The next switch would then pass it to further downstream switches along the route till the last switch in the route is reached. If all the switches along the route have accepted the connection, the connection is established. If a switch along the route does not have the resources to deliver the requested QoS, we perform an exhaustive route search. To do this we backtrack to the nearest successful upstream switch and try a different route to the target. This continues till either a route is found that has the resources to satisfy the QoS, or no such route can be found. This mechanism works on the premise that connection setup delays are better than no connection at all.
5.3.3 Traffic Management After a connection has been setup, the switch needs to manage its traffic in order to deliver the promised guarantee. This forms the basis for Traffic Management. In the EtheReal architecture, traffic management is a distributed function performed both by the switches as well as hosts connected to the switches. It involves (i) Traffic Shaping (ii) Traffic Policing and (iii) Scheduling Packets for delivery 5.3.3.1 Traffic Shaping and Policing Traffic Shaping is required to generate a constant rate flow of traffic. Policing is then performed on the constant rate flow to ensure that traffic sources are within their negotiated values. In the EtheReal environment, both shaping and policing are done at the host end to offload compute load from the switch and provide a path for scalability. At each host, buffers are allocated to individual real time sources. The size of buffer is determined by the bandwidth guarantee of the source and the maximum burst size specified by the source. Data flows into these buffers at a variable rate from the sources and is picked up by the scheduler running at each host at a constant rate. If traffic sources exceed their contract, buffers associated with them overflow. The overflow causes the excess packets to be dropped. This mechanism does both shaping and policing at each host. Thus the switch sees constant rate flows from each host connected to it.
5.3.3.2 Scheduling In the EtheReal architecture, scheduling is performed at both the hosts and the switch. Based on the guarantees given, the scheduling mechanism at the hosts schedules packets from multiple real time sources for delivery. At the switch the packet scheduler schedules packets from the input ports for delivery to their corresponding output ports. As mentioned in section 0, each real time source at a host is allocated buffers based on (i) Guaranteed Bandwidth and (ii) Maximum Burst Size in a predefined interval. The scheduler at each host operates with a time resolution of 10ms. Once every 10ms, the scheduler tells all the real time sources to send data. The actual data sent by each source corresponds to the reserved bandwidth interpolated to a 10ms interval. Figure 12 shows an example of two real time sources at Host A. Source 1 has reserved 3Mbps with a maximum burst size of 40Kbits/10ms and Source 2 has reserved 2Mbps with a maximum burst size of 25Kbits/10ms. Thus after shaping, the constant rate flow from Source 1 would be 30Kbits/10ms and the rate from Source 2 would be 20Kbits/10ms. Thus the buffer size for Source 1 would be 40 Kbits and that for Source 2 would be 35 Kbits. Once every 10ms, each real time source gets a signal to transfer data. It transmits data assuming a constant rate flow of the peak traffic rate. Using the above example, every 10 ms, Source 1 would transmit 30 Kbits and source 2 would transmit 20 Kbits.
Figure 12 : Operation of the scheduler at a host. There are two real time sources 1 and 2. The scheduler determines the transmission size and transmission interval of each source.
The process of transferring a packet from an input port to an output port at the EtheReal switch involves (i) receiving the packet at an input port (ii) scheduling it for delivery to the output port buffer and (iii) transferring it from the output port buffer to the output port. The process of receiving a packet at an input port is performed by the onboard CPU on the Ethernet card. This CPU can work either in polling mode or can be interrupt driven. Since the onboard CPU is designed for microcontroller based applications, it has a low interrupt latency overhead. Experimental results would be required to determine the performance trade-off between polling and an interrupt driven approach. The onboard CPU identifies the packet to be either real time or a non real time and places it in the appropriate queue. The scheduling algorithm works on the main CPU, in our case it would be the main Pentium processor of the machine. The main CPU works at a time resolution of 10ms. The bandwidth at each input port is limited to 10Mbps since that is the maximum allowed transmission rate. The breakup of allocated bandwidth for real time and non real time data for each port is obtained by querying the admission control metadata. This breakup is dynamic. The bandwidth available after satisfying all real time connections is given to non real time connections. The buffers allocated to real time and non real time traffic are also dynamic and change as per the current bandwidth allocation breakup. The advantage of this scheme lies in the automatic policing performed by it. If non real time sources exceed their allocated bandwidth, they will dropped at the input buffers. The scheduler design is also simplified since it uses to current bandwidth breakup to allocate time slots on the switching fabric. To prevent real time connections from taking up all the bandwidth and starving non real time connections, we have a configurable maximum real time bandwidth limit on a per port basis at the switch. If the total bandwidth of all real time connections at an input port reaches this limit, further real time connections will be refused. The scheduler works in a time slotted, weighted round robin mode. Real time traffic is given higher priority. Each input port is allocated slots in the scheduler for real time and non real time traffic based on the bandwidth breakup discussed in the previous paragraph. Figure 13 shows an example with 2 ports. Real time traffic at port 1 takes up 6Mbps with the other 4Mbps allocated to non real time sources. At port 2 real time traffic takes up 8Mbps with 2Mbps allocated to non real time traffic. At each interval of the scheduler (10ms) port 1 would have 60Kbits of real time traffic and 40Kbits of non real time traffic. Port 2 would have 80Kbits of real time traffic and 20Kbits of non real time traffic. The scheduler executes the following algorithm during every 10ms interval :
Figure 13 : Scheduling in the EtheReal switch. The scheduler schedules and transfers data from sources 1 and 2 to their respective outputs. Packets are received by the onboard CPU on each 4 port Ethernet card and scheduled for delivery by the main Pentium CPU.
The above algorithm tries to minimize the delay variance seen by the real time connections at the input ports. To do this it transfers packets from the real time queues in a round robin fashion. On the other hand, non real time connections do not require this minimization. Hence we burst data from each of the non real time queues, emptying one first, before transferring data from the next. The final phase of scheduling at the switch involves transmitting the packets from the output buffer queues to the output port. This phase uses a feature of the 4 port Ethernet card used in EtheReal. The Ethernet card supports a circular transmit queue of pointers to buffers containing packets for transmission. When the scheduler has transferred a packet from the input queue to the output queue in steps 2 and 5 of the above algorithm, it updates the pointers in the circular queue of the Ethernet card. The Ethernet card then uses these pointers to transmit data out on the cable. One of the significant differences between our approach and that of QoS implementations on ATM is that we perform policing only at the host end. This offloads shaping compute delays from the switch. At the host, we need a mechanism to multiplex and shape traffic flows from multiple real time traffic sources onto a single flow. In the preceding sections we had mentioned a Real Time Communication Daemon (RTCD) which is primarily responsible for connection setup and teardown. We also discussed the host end buffering mechanism responsible for multiplexing, shaping and policing packets from a host. One place to implement this mechanism would be inside the RTCD, resulting in a centralized scheme. In this scheme data transfer would have to occur using sockets which are opened between the RTCD at the transmitting end and the application at the receiving end. To transfer data, each application gives a copy of the data to the RTCD using some form of IPC. The RTCD then passes another copy of the data to IP for transmission. It is apparent that data copying between the various layers introduces an overhead which could be minimized. An alternative to the centralized RTCD approach is to write a real time socket library. The library would present an API with calls to open a real time connection based on the required QoS, transmit/receive data and teardown a connection. The data transmission call in the library would be responsible for shaping, policing and scheduling the traffic. In this case data copying between layers is less than in the RTCD approach. The application gives a pointer to data to the library which then passes a copy of it to IP for transmission. Inspite of its lower overhead, there are several problems with this scheme.
Figure 14 : Software architecture of the EtheReal implementation at each host
Thus each approach has its own strengths and weaknesses. Our solution to this problem is to use both approaches, combining the centralized scheduling advantages of the daemon mechanism with the lower data copy overhead of the library scheme. In this scheme, the RTCD is used for connection setup and teardown. It also generates a timer once every 10ms to inform all real time sources to transmit data. To create a robust connection teardown mechanism, all applications which require real time guarantees have to register themselves with the RTCD. The registration involves opening a pipe to the RTCD which remains open throughout the lifetime of the application. If the application is killed, the RTCD would receive a signal indicating that the pipe corresponding to an application is no longer valid. This would indicate that the application is dead and the connection can be torn down. Once a connection has been setup by the RTCD, data transfer on the open connection is done using a library. On a successful connection setup, buffers are allocated by the library to perform shaping and policing on the traffic flow from the application. The library also registers a signal handler to receive the scheduler timer tick from the RTCD. The scheduler is implemented in the data transmit/receive call of the library. On receiving the timer, each application transmits data corresponding to the guarantee it was given. As can be seen, this scheme reduces data copy overhead while maintaining the advantages of a centralized scheduler.
The EtheReal approach guarantees the negotiated QoS contract at the switch. However, guarantees have to be given on an end to end transaction, i.e. from the originating source station to the receiving end station. Since our approach works at the user level, end to end guarantees pose two problems :
The problem of non deterministic collisions is solved by using full duplex transmission. On Ethernet segments, the Ethernet MAC protocol supports half duplex transmission, i.e. it is either sending or receiving at any instant in time. As opposed to half duplex transmission, on point to point links, Ethernet supports full duplex transmission, wherein it can both transmit and receive at the same time. This is done by segmenting the cable bandwidth into two bands, with each host transmitting in one band and receiving in the other. For 10Mbps Ethernet, each band supports 10Mbps, making the effective full duplex bandwidth 20Mbps. This scheme prevents collisions and the associated non-determinism since the stations transmitting simultaneously do so on different bands. The second problem arises due to the need for a low-level mechanism for prioritization of data. One solution to this is to let IP do the prioritization. The IP header contains a 3 bit Type of Service (TOS) field which allows 7 precedence levels. Real Time sources can set this field to the highest priority, thus ensuring that they do not get starved. The trouble with this solution is that the TOS field is ignored on most hosts and routers since the IP layer does not implement the functionality to support precedence levels [24]. We could change the IP layer to support precedence classes. This would be a general MAC independent solution to support prioritization. However, this goes againt our original constraint that the solution has to run at the user level requiring no changes in the Operating System/Network Layers. To solve the buffer starvation problem, we note that most general purpose non real time sources are TCP based, with NFS being the prominent exception. Our solution activates TCPs congestion control to keep TCP bandwidth within the limits allocated to non real time sources. When non real time sources exceed their allocated bandwidth, the switch drops those packets. When a TCP packet is dropped, TCP would not receive an acknowledgment for it and invoke its congestion control algorithm. TCP then reduces its transmission rate, repeating it if necessary, till it finds no dropped packets. Thus the switch gets to control the transfer rate of TCP sources at the source host. Since TCP implements congestion control at the transport layer, this solves the buffer starvation problem at IP by controlling the data rate from TCP to IP. This solution still does not solve the problem of UDP sources. Since UDP is a connectionless unreliable transmission protocol, it does not implement congestion control. Thus a rate control mechanism at the switch would not work for UDP sources. Typically NFS is the primary UDP traffic generator. NFS requests from a host are asymmetric in nature with the request from a host being smaller than the response from the NFS server. It is hoped that these small requests would not unduly disrupt real time communication. Furthermore NFS is moving towards a stateless TCP implementation to avoid the address translation overhead associated with each packet in UDP. TCP has no such overhead since it is a connection oriented protocol. Thus end to end guarantees in the EtheReal switching environment have to be statistical in nature, since there is no way of guaranteeing that UDP sources would not cause buffer starvation. While this represents a limitation, we believe that it may not be a significant one since such misbehaving UDP sources can be controlled by the user.
6. Conclusions With the advent of real time traffic sources into Local Area Networks, applications such as multimedia and videoconferencing require Quality of Service guarantees from the network. Furthermore, with increasing bandwidth demand by applications, scalable solutions are needed for LANs. To this end, switched networking provides an attractive alternative with a scalable solution to the bandwidth bottleneck. Advances in cell networking have led to technologically viable ATM switches which can provide service guarantees. However, they are currently too expensive to see deployment in the LAN market which is dominated by Ethernet. In this paper we survey guarantee mechanisms and their commercial implementations on switched networks from a computer systems engineer perspective. We finally present our solution to provide bandwidth guarantees on Ethernet from the user level, and contrast it with guarantee schemes on other switched networks. We also describe the architectural details of our scheme and its limitations. With increasing requirement for QoS support in the existing Ethernet market we believe that EtheReal provides an attractive solution to the problem.
References [1] : "Survey of Congestion Control Techniques For an ATM Network", Duke P. Hong; Tatsuya Suda; Jaime Jungok Bae; May 1991; University of Californaia, Irvine; Document ICS-TR-91-48 [2] : "Survey of Traffic Control Schemes and Error Control Schemes for ATM Networks", Jaime Jungok Bae; Tatsuya Suda; May 1991; University of Californaia, Irvine; Document ICS-TR-91-46 [3] : "On Defining, Computing and Guaranteeing Statistical Quality-Of-Service In High-Speed Networks"; Nagarajan R.; Kurose J.F.; Towsley D.; July 1990; University of Massachusetts, Amherst; Document UM-CS-1990-123 [4] : "100VG-AnyLAN : A Technical Overview", White Paper by Hewlett Packard [5] : "Isochronous Ethernet - The Multimedia LAN for Real Time Desktop Connectivity", White paper by National Semiconductors [6] : "Benchmarking Terminology for LAN Switching Devices", IETF Document [7] : "Gigabit Networking", Partridge Craig, Addison Wesley pp. 73-82 [8] : "ATM User Network Interface Specification (v3.1)", ATM Forum [9] : "ATM User Network Interface Specification (v4.0)", ATM Forum [10] : "ATM Service Categories: The Benefits to the User", White Paper by ATM Forum [11] : "ATM Layer Specification", ATM UNI specification (v3.1) pp. 77-79 [12] : "Traffic Management 4.0", ATM Forum [13] : "Credit-Based Flow Control for ATM Networks: Credit Update Protocol, Adaptive Credit Allocation, and Statistical Multiplexing", Kung H.T.; Blackwell T., Proceedings of the ACM SIGCOMM '94 Conference, August 1994, pp 101-114 [14] : "ForeThought and the ATM Internetwork Version 2.0", White Paper by Fore Systems [15] : "ForeThought and the ATM Internetwork, An Implementation Overview Version 1.0", White Paper by Fore Systems. [16] : "ForeRunner ATM Switch Architecture Version 1.0", White Paper by Fore Systems [17] : "Traffic Management Version and Congestion Control Version 1.0", White Paper by Fore Systems. [18] : "ForeThough Bandwidth Management Version 1.0", White Paper by Fore Systems. [19] : "LightStream ATM Switches and Cisco IOS ATM Software", White Paper by Cisco [20] : "LightStream 1010 Switch Architecture and Traffic Management", Arthur Lin, White Paper by Cisco [21] : "ATM Internetworking", May 1995, White Paper by Cisco [22] : "Dynamic Access Features : Redefining the role of the Network Interface Card", White Paper by Network Solutions Center, 3Com Corporation. [23] : "PACE Brochure", White Paper by Network Solutions Center, 3Ccom Corporation [24] : "Internetworking with TCP/IP Volume I, Principles, Protocols, and Architecture", Douglas E. Comer, Prentice Hall Publishers, pp. 92-94 [25] : "Computer Networks", Andrew S. Tanenbaum, Prentice Hall Publishers [26] : "The Design, Implementation and Evaluation of RETHER : A Real-Time ethernet Protocol", Chitra Venkatramani, Ph.D. Dissertation, 1996, State University of New York, Stony Brook [27] : "IBMs ATM switching technology" White Paper by IBM Corporation [28] : "High Speed Switch Scheduling for Local Area Networks" Thomas Anderson, Susan Owicki, James Saxe, Charles Thacker, Report 99, DEC Systems Research Center [29] : "Myrinet: a gigabit-per-second local area network" Boden, N.J.; Cohen, D.; Felderman, R.E.; Kulawik, A.E.; Seitz, C.L.; Seizovic, J.N.; Wen-King Su, IEEE Micro (Feb. 1995) vol.15, no.1, p. 29-36. [30] : "Measured capacity of an Ethernet: myths and reality" Boggs, D.R.; Mogul, J.C.; Kent, C.A., Computer Communication Review (Aug. 1988) vol.18, no.4, p. 222-34. [31] : "A Flexible Shared-Buffer Switch for ATM at Gb/s Rates" Denzel, W.; Engbersen, A.; Iliadis, I.; Computer Networks & ISDN Systems 27(1995), pp. 611-624. [32] : "Virtual-Channel Flow Control" Dally W.; Proc. of 17th Intl. Symp. on Comp. Arch., ACM SIGARCH vol. 18, no. 2, May 1990 pp.60-68 [33] : "Multicast in the Asynchronous Transfer Mode environment" Doar J.M.S.; Technical report 245, University of Cambridge Computer Laboratory, January 1992 Ph.D. dissertation [34] : "Round Robin Scheduling for Fair Flow Control" Hahne E.; Gallagher R., Proc. IEEE Intl. Conf. on Comm. June 1986, pp 103-107 [35] : "Weighted round-robin cell multiplexing in a general-purpose ATM switch chip" Katevenis, M.; Sidiropoulos, S.; Courcoubetis, C., IEEE Journal on Selected Areas in Communications (Oct. 1991) vol.9, no.8, p. 1265-79. [36] : "VC-level Flow Control and Shared Buffering in the Telegraphos Switch" Katevenis, M; Vatsolaki, P.; Efthymiou, A.; Stratakis, M., IEEE Hot Interconnect III Symposium Proceedings, Stanford, CA, 10-12 August 1995. [37] : "Fairisle: An ATM Network for the Local Area" Leslie I.; McAuley D., Computer Architectures and Protocols, volume 21(4) of Computer Communication Review, ACM Press, September 1991, pp. 327-336. [38] : "Theory of periodic contention and its application to periodic switching" Li S.Y. Proceedings of INFOCOM 1988, pp. 320-325 [39] : "A Quantitative Comparison of Scheduling Algorithms for Input-Queued Switches." Nick McKeown; Thomas E. Anderson; {\em submitted for publication} [40] : "Large scale ATM multistage switching network with shared buffer memory switches" Sakurai Y., Ido N., Gohara S., Endo N. ; IEEE communications Magazine, vol. 29, no. 1, Jan 1991, pp. 90-96 [41] : "Throttled-buffer asynchronous switch for ATM" Schultz K.J., Gulak P.G.; IRICE Trans. Comm. vol. E77-B, no. 3, March 1994, pp-351-358 [42] : "The Knockout Switch: A Simple, Modular Architecture for High-Performance Packet Switching" Yu-Shuan-Yeh, Michael Hluchyj G. , Acampora S., International Switching Symposium, IEEE Journal on Selected Areas in Communication, vol sac-5, no. 8, October 1987 pp. 16-35 [43] : "A scalable nonblocking shared multibuffer ATM switch with a new concept of searchable queue" Yamanaka H., Kondoh H., Saito H., Tsuzuki M., Sasaki Y., Kohama S., Yamada H., Matsuda Y., Oshima K.; Proc. XV International Switching Symposium (ISS '95) vol. 1, April 1995 pp.278-282.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||