Data Ambiguity and Discontinuity
Peter Ladkin
Data Ambiguity
A data ambiguity (DA) is a case in which two separate pieces of
data in a computer system have the same representation.
For example, suppose there are two people named Frederick James Smith
and Frederick John Smith. Suppose that a database system
contains records indexed by name, and that names are represented as
Lastname;Firstname;Middle-Initial.
Then both people will be represented as
Smith;Frederick;J and there is data ambiguity.
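The name example can be reproduced in a few lines. This is only a sketch: the Person type, the index_key function and the dictionary standing in for the database are hypothetical names chosen for illustration, not part of any real system.

```python
from collections import namedtuple

# Hypothetical record type for the illustration.
Person = namedtuple("Person", ["last", "first", "middle"])

def index_key(person):
    """Encode a name as Lastname;Firstname;Middle-Initial."""
    return f"{person.last};{person.first};{person.middle[0]}"

james = Person("Smith", "Frederick", "James")
john = Person("Smith", "Frederick", "John")

# A dictionary stands in for the name-indexed database.
records = {}
for person in (james, john):
    records.setdefault(index_key(person), []).append(person)

print(index_key(james))                   # Smith;Frederick;J
print(index_key(john))                    # Smith;Frederick;J
print(len(records["Smith;Frederick;J"]))  # 2
```

Two different people end up under one key, so a lookup by that key cannot tell them apart.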
There are many incidents caused by data ambiguity, usually in databases
containing personal records. The on-line journal RISKS Forum Digest
contains examples of just about everything unpleasant that can happen
to people because of name ambiguity. A typical case might be
two people as above, one of whom is an upstanding citizen and the other
a habitual drunk driver who has skipped bail. The former is stopped for
speeding, doing 60 mph on a 55 mph freeway in another state -- and is
arrested for skipping bail. (Cases like this happen -- read RISKS!)
The General Physical Constraint on Data Representation
Recall that data is represented in computer systems by being encoded
into bytes.
Only a fixed, finite number of pieces of data can be represented by a
coding scheme in a fixed number of bytes. That means that whenever there
are more pieces of data than a fixed number of bytes can distinguish,
software must use a different coding scheme (some such schemes are
mathematically very clever) to increase the size of the set of data that
can be represented. Since there are potentially arbitrary dates in the
future, representing arbitrary dates is an example in which
such a coding scheme is needed.
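The constraint is easy to make concrete: n bytes have exactly 256^n distinct bit patterns, so no scheme can distinguish more than that many pieces of data in n bytes. The sketch below also shows one very simple way around a fixed width, namely letting the number of bytes grow with the value. That variable-length encoding is an illustrative assumption, not a scheme the text prescribes, and both function names are hypothetical.

```python
def representable_values(num_bytes):
    """Number of distinct bit patterns available in a fixed number of bytes."""
    return 256 ** num_bytes

def encode_variable(n):
    """Encode a non-negative integer using as many bytes as it needs."""
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, "big")

print(representable_values(1))   # 256
print(representable_values(2))   # 65536
print(encode_variable(255))      # b'\xff'
print(encode_variable(256))      # b'\x01\x00'
```

A real variable-length format would also have to record where each value ends, for instance with a length prefix; that detail is omitted here.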
Data Discontinuity
Data discontinuity happens when data is both ambiguous and ordered.
Dates are ordered: 12 October 1992 comes before 3 January 1993.
If it is required to represent more pieces of ordered data than the coding
scheme will allow, then one has a data discontinuity:
- There is a valid piece of data D outside the range of
the coding scheme,
- which is nevertheless represented, without warning,
as some byte representation inside the range, say D',
which leads to the effect that
- D' also represents a different piece of valid data
from D,
- D' does not hold the same place in the data ordering
as D,
- decisions and conclusions concerning D will be made
by looking at the properties of D', including the ordering,
and may very well be false.
An example.
Suppose a system attempts to represent integers as 8 bits, without
a `+' or `-' sign (i.e., as an unsigned integer).
Consider trying to represent the integer 292.
One can only represent the integers 0 to 255 in 8 bits `unsigned'.
So the range of the coding scheme is 0 to 255.
292 is (256 + 36), which is (256 + 32 + 4), outside the range.
D in this case is 292, represented in 9 binary digits
as 100100100.
An attempt to cram D into
8 bits will likely chop off the first, or the last, digit. This yields
respectively
00100100 (= 36) or 10010010 (= 146). So 292 is likely to become
either 36 or 146 if an attempt is made to cram it into 8 bits.
D' will be, respectively, 36 or 146.
We say there is a data discontinuity between 255 and 256 in this
system.
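The arithmetic of this example can be checked directly. In the sketch below, masking with 0xFF models chopping off the first (most significant) digit and a right shift by one models chopping off the last digit; the variable names are illustrative only.

```python
D = 292
print(format(D, "b"))                          # 100100100

drop_first = D & 0xFF   # discard the leading digit, keep the low 8 bits
drop_last = D >> 1      # discard the trailing digit, keep the high 8 bits

print(format(drop_first, "08b"), drop_first)   # 00100100 36
print(format(drop_last, "08b"), drop_last)     # 10010010 146

# The ordering is broken as well: 292 is greater than 255,
# but either candidate for D' is less than 255.
print(D > 255, drop_first > 255, drop_last > 255)  # True False False
```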
Suppose that an attempt is made to represent a piece of data
D which is outside the range of a data representation.
We speak of an overflow of the range.
A data discontinuity is present when an overflow is not detected.
If the overflow is reliably detected and announced,
as we have mostly come to expect for, say, data operations
taking place in the hardware of a computer system,
then we do not have a data discontinuity.
A data discontinuity is an undetected overflow.
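The difference between a detected and an undetected overflow can be sketched with two ways of storing a value in 8 unsigned bits. Both functions are hypothetical illustrations: the first silently truncates, so it exhibits a data discontinuity; the second announces the overflow, so it does not.

```python
def store_unchecked(value):
    """Silently keep only the low 8 bits: the overflow goes undetected."""
    return value & 0xFF

def store_checked(value):
    """Refuse values outside 0..255: the overflow is detected and announced."""
    if not 0 <= value <= 255:
        raise OverflowError(f"{value} does not fit in 8 unsigned bits")
    return value

print(store_unchecked(292))   # 36
print(store_checked(200))     # 200
try:
    store_checked(292)
except OverflowError as err:
    print("overflow detected:", err)
```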
Data Discontinuity Problems (DDP)
A system has a data discontinuity problem (DDP)
if the data representation
contains a data discontinuity whose effects are semantically
important to the working of the system. To classify something as a
DDP, one must determine that
- there is a data discontinuity;
- the discontinuity has effects which cause the system to behave
inappropriately as a result.
The Year 2000 problem is a DDP, because it concerns systems in which
- there is (by hypothesis) a date discontinuity between 1999 and 2000
in the system;
- this discontinuity causes the system to behave inappropriately
as a result.
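A minimal sketch of the Year 2000 case, assuming a system that stores only the last two digits of the year and compares or subtracts them directly. The function names and the sample years are hypothetical.

```python
def two_digit_year(year):
    """Store only the last two digits of a four-digit year."""
    return year % 100

def is_later(yy_a, yy_b):
    """Compare stored years as the naive system would."""
    return yy_a > yy_b

# 2000 and 1900 share one representation: data ambiguity.
print(two_digit_year(2000), two_digit_year(1900))   # 0 0

# 2000 comes after 1999, but the stored values say otherwise.
print(is_later(two_digit_year(2000), two_digit_year(1999)))   # False

# A decision based on the stored values: the age in 2000 of someone born in 1965.
print(two_digit_year(2000) - two_digit_year(1965))   # -65
```

Whether this amounts to a DDP depends on the second condition: the wrong comparison or the negative age must actually cause the system to behave inappropriately.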
Note that a system can suffer from data ambiguity, for example by
representing the year using two digits, without having a DDP: this is
the case if it handles potentially ambiguous data within its range
specially (and correctly), or if the
ambiguity has no meaningful consequences for the system.
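One way a two-digit system might handle ambiguous years specially is a fixed window around a pivot. This is a sketch under stated assumptions, not a method taken from the text: the pivot value 50 and the name expand_year are hypothetical, and the approach is correct only if every year the system ever handles lies between 1950 and 2049.

```python
PIVOT = 50  # hypothetical: 00..49 mean 2000..2049, 50..99 mean 1950..1999

def expand_year(yy, pivot=PIVOT):
    """Map a two-digit year to a four-digit year using a fixed window."""
    if not 0 <= yy <= 99:
        raise ValueError(f"{yy} is not a two-digit year")
    century = 2000 if yy < pivot else 1900
    return century + yy

print(expand_year(99))                     # 1999
print(expand_year(0))                      # 2000
print(expand_year(0) > expand_year(99))    # True
```

The window only moves the discontinuity from 1999/2000 to 2049/1950; it removes the problem just for systems whose dates stay inside that range.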