Skip to main content

Reliability

In computer networking, a reliable protocol provides reliability properties with respect to the delivery of data to the intended recipient(s), as opposed to an unreliable protocol, which does not provide notifications to the sender as to the delivery of transmitted data. The term "reliable" is a synonym for assured, which is the term used by the ITU and ATM Forum in the context of the ATM Service-Specific Coordination Function, for example for transparent assured delivery with AAL5.

Reliable protocols typically incur more overhead than unreliable protocols, and as a result, function more slowly and with less scalability. This often is not an issue for unicast protocols, but it may become a problem for reliable multicast protocols.

TCP, the main protocol used on the Internet, is a reliable unicast protocol. UDP, often used in computer games or in other situations where speed is an issue and the loss of a little data is not as important because of the transitory nature of the data, is an unreliable protocol.

Often, a reliable unicast protocol is also connection-oriented. For example, TCP is connection-oriented, with the virtual-circuit ID consisting of source and destination IP addresses and port numbers. Some unreliable protocols are connection-oriented as well. These include ATM and frame relay. There are also reliable connectionless protocols, such as AX.25 when it passes data in I-frames. But this combination occurs rarely: reliable-connectionless is uncommon in commercial and academic networks.

History

When the ARPANET pioneered packet switching, it provided a reliable packet delivery procedure to its connected hosts via its 1822 interface. A host computer simply arranged the data in the correct packet format, inserted the address of the destination host computer, and sent the message across the interface to its connected Interface Message Processor. Once the message was delivered to the destination host, an acknowledgement was delivered to the sending host. If the network could not deliver the message, it would send an error message back to the sending host.

Meanwhile, the developers of CYCLADES and of ALOHAnet demonstrated that it was possible to build an effective computer network without providing reliable packet transmission. This lesson was later embraced by the designers of Ethernet.

If a network does not guarantee packet delivery, then it becomes the host's responsibility to provide reliability by detecting and retransmitting lost packets. Subsequent experience on the ARPANET indicated that the network itself could not reliably detect all packet delivery failures, and this pushed responsibility for error detection onto the sending host in any case. This led to the development of the end-to-end principle, which is one of the Internet's fundamental design assumptions.

Reliability properties

A reliable service is one that notifies the user if delivery fails, while an "unreliable" one does not notify the user if delivery fails. For example, IP provides an unreliable service. Together, TCP and IP provide a reliable service, whereas UDP and IP provide an unreliable one. All these protocols use packets, but UDP packets are generally called datagrams.

In the context of distributed protocols, reliability properties specify the guarantees that the protocol provides with respect to the delivery of messages to the intended recipient(s).

An example of a reliability property for a unicast protocol is "at least once", i.e. at least one copy of the message is guaranteed to be delivered to the recipient.

Reliability properties for multicast protocols can be expressed on a per-recipient basis (simple reliability properties), or they may relate the fact of delivery or the order of delivery among the different recipients (strong reliability properties).

In the context of multicast protocols, strong reliability properties express the guarantees that the protocol provides with respect to the delivery of messages to different recipients.

An example of a strong reliability property is last copy recall, meaning that as long as at least a single copy of a message remains available at any of the recipients, every other recipient that does not fail eventually also receives a copy. Strong reliability properties such as this one typically require that messages are retransmitted or forwarded among the recipients.

An example of a reliability property stronger than last copy recall is atomicity. The property states that if at least a single copy of a message has been delivered to a recipient, all other recipients will eventually receive a copy of the message. In other words, each message is always delivered to either all or none of the recipients.

One of the most complex strong reliability properties is virtual synchrony.

Strong reliability properties are offered by group communication systems (GCS) such as IS-IS, Appia framework, Spread, JGroups or QuickSilver Scalable Multicast. The QuickSilver Properties Framework is a flexible platform that allows strong reliability properties to be expressed in a purely declarative manner, using a simple rule-based language, and automatically translated into a hierarchical protocol.

Reliable delivery in real-time systems

There is, however, a problem with the definition of reliability as "delivery or notification of failure" in real-time computing. In such systems, failure to deliver the real-time data will adversely affect the performance of the systems, and some systems, e.g. safety-critical, safety-involved, and some secure mission-critical systems, must be proved to perform at some specified minimum level. This, in turn, requires that there be a specified minimum reliability for the delivery of the critical data. Hence, the notification of failure does not negate or ameliorate delivery failure by the real-time system's transport layer.

In hard and firm real-time systems the data has to be delivered within a deadline, i.e. data that is delivered late is valueless. In hard real-time systems all data must be delivered within its deadline or it is considered a system failure. In firm real-time systems, there is some acceptable probability that data will be not be delivered or may be delivered late – these being equivalent.

There are a number of protocols that are capable of meeting real-time requirements for reliable delivery and timeliness, at least for firm real-time systems (due to the inevitable and unavoidable losses from, e.g., the physical layer bit error rates):

MIL-STD-1553B and STANAG 3910 are well known examples of such timely and reliable protocols for avioninc data buses. MIL-1553 uses a 1 Mbit/s shared media for the transmission of data and the control of these transmissions, and is widely used in federated military avionics systems (in which "Each system has its own computers performing its own functions"). It uses a Bus Controller (BC) to command the connected Remote Terminals (RTs) to receive or transmit this data. The BC can therefore ensure that there will be no congestion, and transfers are always timely. The MIL-1553 protocol also allows for automatic retries that can still ensure timely delivery and increase the reliability above that of the physical layer. STANAG 3910, also known as EFABus in its use on the Eurofighter Typhoon, is, in effect, a version of MIL-1553 augmented with a 20 Mbit/s shared media bus for data transfers, retaining the 1 Mbit/s shared media bus for control purposes.

The Asynchronous Transfer Mode (ATM), the Avionics Full-Duplex Switched Ethernet (AFDX), and Time Triggered Ethernet (TTEthernet) are examples of packet switched networks protocols where the timeliness and reliability of data transfers can be proved. AFDX and TTEthernet are also based on IEEE 802.3 Ethernet, though not entirely compatible with it.

ATM uses connection oriented virtual channels (VCs), which have fully deterministic paths through the network, and usage and network parameter control (UPC/NPC), to limit the traffic on each VC separately. This allows the usage of the shared resources (switch buffers) in the network to be calculated from the parameters of the traffic to be carried, in advance, i.e. at system design time. These can then be compared with the capacities of these resources to show that, given the constraints on the routes and the bandwidths of these connections, the resource used for these transfers will never be over-subscribed. These transfers will therefore never be affected by congestion, and there will be no losses due to this effect. Then, from the predicted maximum usages of the switch buffers, the maximum delay through the network can also be predicted. However, for the reliability and timeliness to be proved, and for the proofs to be tolerant of faults in and malicious actions by the equipment connected to the network, the calculations of these resource usages cannot be based on any parameters that are not actively enforced by the network, i.e. they cannot be based on what the sources of the traffic are expected to do or on statistical analyses of the traffic characteristics (see network calculus).

AFDX uses frequency domain traffic policing or bandwidth allocation, that allows the traffic on each virtual link (VL) to be limited so that the requirements for shared resources can be predicted and congestion prevented so it can be proved not to affect the critical data. However, the techniques for predicting the resource requirements and proving that congestion is prevented are not part of the AFDX standard.

TTEthernet provides the lowest possible latency in transferring data across such a network by using time domain control methods – each time triggered transfer is scheduled at a specific time, so that contention for shared resources is entirely controlled and thus the possibility of congestion is eliminated. The switches in the network enforce this timing to provide tolerance of faults in, and malicious actions on the part of, the other connected equipments. However, "synchronized local clocks are the fundamental prerequisite for time-triggered communication". This is because the sources of critical data will have to have the same view of time as the switch, in order that they can transmit at the correct time and the switch will see this as correct. This also requires that the sequence with which a critical transfer is scheduled has to be predictable to both source and switch. This, in turn, will limit the transmission schedule to a highly deterministic one, e.g. the cyclic executive.

However, low latency in transferring data over the bus or network does not necessarily translate into low transport delays between the application processes that source and sink this data. This is especially true where the transfers over the bus or network are cyclically scheduled (as is commonly the case with MIL-STD-1553B and STANAG 3910, and necessarily so with AFDX and TTEthernet) but the application processes are asynchronous, e.g. pre-emptively scheduled, or only plesiosynchronous with this schedule. In which case, the maximum delay and jitter will be twice the update rate for the cyclic transfer (transfers wait up to the update interval between release and transmission and again wait up to the update interval between delivery and use).

With both AFDX and TTEthernet, there are additional functions required of the interfaces to the network for the transmission of critical data, etc., that make it difficult to use standard Ethernet interfaces, e.g. AFDX's Bandwidth Allocation Gap control, and TTEthernet's requirement for very close synchronization of the sources of time triggered data. Other methods for control of the traffic in the network that would allow the use of such standard IEEE 802.3 network interfaces is a subject of current research.

Source: Wikipedia, Google