Data Ambiguity and Discontinuity

Peter Ladkin

Data Ambiguity

A data ambiguity (DA) is a case in which two separate pieces of data in a computer system have the same representation. For example, suppose there are two people named Frederick James Smith and Frederick John Smith. Suppose that a database system contains records indexed by name, and that names are represented as Lastname;Firstname;Middle-Initial. Then both people will be represented as Smith;Frederick;J and there is data ambiguity.
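The name example can be sketched in a few lines of Python. This is only an illustration: the function name_key and the dictionary records are hypothetical names chosen here, and the key format is simply the Lastname;Firstname;Middle-Initial scheme described above.

```python
def name_key(last, first, middle):
    """Build a record key of the form Lastname;Firstname;Middle-Initial."""
    return f"{last};{first};{middle[0]}"

# Hypothetical record store indexed by the name key.
records = {}
people = [
    ("Smith", "Frederick", "James"),
    ("Smith", "Frederick", "John"),
]
for last, first, middle in people:
    key = name_key(last, first, middle)
    # Both people produce the same key, so they land in the same slot.
    records.setdefault(key, []).append((last, first, middle))

for key, owners in records.items():
    print(key, "->", len(owners), "different people")
```

Two distinct people end up behind the single key Smith;Frederick;J, which is exactly the ambiguity: the system can no longer tell from the key alone which of them a record belongs to.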

There are many incidents caused by data ambiguity, usually in databases containing personal records. The on-line journal RISKS Forum Digest contains examples of just about everything unpleasant that can happen to people because of name ambiguity. A typical case might be two people as above, one of whom is an upstanding citizen and the other a habitual drunk driver who has skipped bail. The former is stopped for speeding, doing 60mph on a 55mph freeway in another state -- and is arrested for skipping bail. (Cases like this happen -- read RISKS!)

The General Physical Constraint on Data Representation

Recall that data is represented in computer systems by being encoded into bytes. Only a fixed, finite number of pieces of data can be represented by a coding scheme in a fixed number of bytes. That means that any time one has more pieces of data than a simple encoding into the available bytes can distinguish, software must use a more elaborate coding scheme to increase the size of the set of data that can be represented. Some such schemes are mathematically very clever. Since there are potentially arbitrary dates in the future, representing arbitrary dates is an example in which such a coding scheme is needed.
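The size of the limit is easy to compute. The following is a minimal sketch, with function names chosen here for illustration, that counts how many distinct values fit into a given number of bytes or decimal digits.

```python
def values_in_bytes(num_bytes):
    """Number of distinct bit patterns available in num_bytes bytes."""
    return 2 ** (8 * num_bytes)

def values_in_decimal_digits(num_digits):
    """Number of distinct values expressible in num_digits decimal digits."""
    return 10 ** num_digits

for n in (1, 2, 4):
    print(n, "byte(s):", values_in_bytes(n), "distinct values")
print("2 decimal digits:", values_in_decimal_digits(2), "distinct values")
```

One byte distinguishes 256 values and two decimal digits distinguish 100, however the values are assigned to the patterns. Any set of data larger than that cannot be represented without either ambiguity or a larger representation.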

Data Discontinuity

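Data discontinuity happens when data is both ambiguous and ordered. Dates are ordered: 12 October 1992 comes before 3 January 1993. If it is required to represent more pieces of ordered data than the coding scheme will allow, then one has a data discontinuity, which leads to the effect that a piece of data D lying outside the range of the scheme is stored as some other value D' inside the range, so that D' no longer stands in the same order to the other data as D does.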

An example. Suppose a system attempts to represent integers in 8 bits, without a `+' or `-' sign (i.e., as unsigned integers). One can only represent the integers 0 to 255 in 8 bits `unsigned', so the range of the coding scheme is 0 to 255. Consider trying to represent the integer 292. 292 is (256 + 36), which is (256 + 32 + 4), and so lies outside the range. Call the out-of-range value D; in this case D is 292, represented in 9 binary digits as 100100100. An attempt to cram D into 8 bits will likely chop off the first, or the last, digit. This yields respectively 00100100 (= 36) or 10010010 (= 146). So 292 is likely to become either 36 or 146 if an attempt is made to cram it into 8 bits. Call the value actually stored D'; D' will be, respectively, 36 or 146. We say there is a data discontinuity between 255 and 256 in this system.
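The arithmetic can be checked mechanically. The sketch below assumes two hypothetical truncation behaviours, one keeping the low-order 8 bits (dropping the first digit) and one keeping the high-order 8 bits (dropping the last digit); the function names are illustrative only. Dropping the last digit of 100100100 leaves 10010010, which is 146.

```python
def keep_low_bits(value, width=8):
    """Chop off leading digits: keep only the lowest `width` bits."""
    return value & ((1 << width) - 1)

def keep_high_bits(value, width=8):
    """Chop off trailing digits: keep only the highest `width` bits."""
    excess = max(value.bit_length() - width, 0)
    return value >> excess

d = 292
print(format(d, "b"))                    # 100100100 (9 digits)
print(format(keep_low_bits(d), "08b"))   # 00100100, i.e. 36
print(format(keep_high_bits(d), "08b"))  # 10010010, i.e. 146

# The ordering is broken: 292 > 255, but both stored values are below 255.
print(d > 255, keep_low_bits(d) > 255, keep_high_bits(d) > 255)
```

Values that fit in 8 bits pass through both functions unchanged, so nothing in the stored result reveals that an out-of-range value was ever supplied.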

Suppose that an attempt is made to represent a piece of data D which is outside the range of a data representation. We speak of an overflow of the range. A data discontinuity is present when an overflow is not detected. If the overflow is reliably detected and announced, as we have mostly come to expect for, say, data operations taking place in the hardware of a computer system, then we do not have a data discontinuity. A data discontinuity is an undetected overflow.
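The difference between a detected and an undetected overflow can be sketched as follows. The exception class RangeOverflow and the two store functions are hypothetical names introduced for this illustration; they do not correspond to any particular system.

```python
class RangeOverflow(Exception):
    """Raised when a value does not fit the representation (hypothetical)."""

def store_unchecked(value, width=8):
    # Silently keeps the low bits: an undetected overflow,
    # and therefore a data discontinuity.
    return value & ((1 << width) - 1)

def store_checked(value, width=8):
    # Detects and announces the overflow instead of storing a wrong value.
    limit = (1 << width) - 1
    if not 0 <= value <= limit:
        raise RangeOverflow(f"{value} is outside 0..{limit}")
    return value

print(store_unchecked(292))  # 36, with no indication that anything went wrong
try:
    store_checked(292)
except RangeOverflow as err:
    print("overflow detected:", err)
```

With store_checked the caller is forced to deal with the out-of-range value, so no wrong value is stored quietly. With store_unchecked the wrong value 36 simply enters the system.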

Data Discontinuity Problems (DDP)

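A system has a data discontinuity problem (DDP) if the data representation contains a data discontinuity which has effects which are semantically important to the working of the system. To classify something as a DDP, one must therefore determine that the data representation contains a data discontinuity, and that this discontinuity has semantically important effects on the working of the system.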

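The Year 2000 problem is a DDP, because it concerns systems in which years are represented using two digits, so that there is a data discontinuity between 99 and 00, and in which this discontinuity has semantically important effects.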

Note that a system can suffer from data ambiguity, by representing the year using two digits, without it having a DDP - if it handles potentially ambiguous data within its range specially (and correctly), or if the ambiguity has no meaningful consequences for the system.
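A small sketch of the two-digit year case may help. It assumes a hypothetical system that stores only the last two digits of the year, and shows one common way of handling the ambiguous values specially, a fixed "window" with an assumed pivot of 50; the function names and the pivot value are illustrative choices, not taken from any particular system.

```python
def to_two_digits(year):
    """Store only the last two digits of a four-digit year."""
    return year % 100

def naive_elapsed(start_yy, end_yy):
    """Subtract two-digit years directly, ignoring the discontinuity."""
    return end_yy - start_yy

def expand_year(yy, pivot=50):
    """Windowing (assumed pivot): yy >= pivot means 19yy, otherwise 20yy."""
    return (1900 if yy >= pivot else 2000) + yy

def windowed_elapsed(start_yy, end_yy, pivot=50):
    """Subtract after expanding both years through the window."""
    return expand_year(end_yy, pivot) - expand_year(start_yy, pivot)

start, end = to_two_digits(1965), to_two_digits(2000)
print(naive_elapsed(start, end))     # -65
print(windowed_elapsed(start, end))  # 35
```

The naive subtraction crosses the discontinuity between 99 and 00 and yields -65 instead of 35. The windowed version gives the right answer for years from 1950 to 2049 only: it moves the discontinuity to the edge of the window rather than removing it, which is acceptable only if the system never needs dates outside that window.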