Mark Davis / http://www.macchiato.com/
Durable Java | Immutables | Abstraction | Serialization | Liberté, Égalité, Fraternité | Hashing and Cloning
Design your code from the start to be durable--so it can evolve without breaking your clients' code.
Dr. Mark Davis is lead architect at IBM's Center for Java Technology, Silicon Valley, co-founder and president of the Unicode Consortium, and architect for the bulk of JDK1.1 internationalization.
James Gosling reportedly once said that "Java is like C++, but without the broken glass". Java's older brother does have more bells and whistles than Java, but also has nasty corners full of razor wire. To take just one example, none of the C++ programmers that we have interviewed over the years, no matter how experienced, has been able to show how to write an assignment operator without mistakes. And yet the assignment operator is a fundamental part of C++. (For more information, see Rich Gillam's paper in the references.)
In our experience, programmers are easily twice as productive in
Java as in C++. However, we mustn't let this lull us into a false sense of
security; Java has its own pitfalls, some of them in the
basic methods of every class. In this month's column, we will look at the
surprising issues involved in implementing equals, and
walk through the common mistakes that people make. We'll also discuss some
interesting performance optimizations.
Next column, we'll discuss the related implementations of hashCode
and clone, which are also far too easy to get wrong.
Let's start with equals. This method is called all over the
place in Java, especially in any use of collections such as Hashtable.
In particular, the equals implementation must be coordinated with hashCode,
otherwise collections will become corrupted. Incorrect implementations of equals
will cause all sorts of errors in both your code and your clients' code, often
errors that are difficult to diagnose.
A quick note: most Java programmers are familiar with the difference between ==
and equals, so we won't belabor that point. Suffice it to say that x
== y tests for object identity (do x and y
refer to the same object) while equals tests for object equality
(are the contents of x and y the same). You must
almost always use equals when x and y are
objects; you must always use == when x and y
are primitives.
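As a minimal illustration of the difference, here is a small sketch. The class IdentityVersusEquality and its method names are hypothetical, chosen only for this example.

```java
// Hypothetical demo class; not part of the original column.
class IdentityVersusEquality {
    /** Identity: do both references point at the same object? */
    static boolean sameObject(Object x, Object y) {
        return x == y;
    }

    /** Equality: do the two objects have the same contents? */
    static boolean sameContents(Object x, Object y) {
        return x.equals(y);
    }

    public static void main(String[] args) {
        String x = new String("durable");
        String y = new String("durable"); // a distinct object, same characters
        System.out.println(sameObject(x, y));   // prints "false"
        System.out.println(sameContents(x, y)); // prints "true"
    }
}
```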
The implementation of equals would seem quite simple; just compare all the fields of the object for equality. Or so you think; yet it is full of traps for the unwary. Moreover, much of the sample code given for doing this method is incorrect in one way or another!
Let's look at the equals method as defined for a class called ClassA.
The first few lines of implementation are pretty straightforward. The method
parameter must be declared as Object, not ClassA. Be careful here: the compiler
will not remind you if you forget, because equals(ClassA) compiles as an
overload that collections never call. We can then make a quick check to see if the
objects are actually identical. Since this is so fast, it is generally worth
doing.
public boolean equals(Object object) {
    if (this == object)
        return true;
Here comes the first possible pitfall. You need to remember to call
your superclass's equals method, to ensure that all of its relevant
fields are compared.
    if (!super.equals(object))
        return false;
You can avoid this if you have read access to all of your superclass's fields
(and its superclass's, etc.). However, if you do that your code is fragile;
a change in your superclasses' internal fields will cause your code to fail.
It is far better to call your superclass and let it make the decision itself. So
given the following chain of inheritance, ClassD will call ClassC
to have it check its fields, ClassC will call ClassB
to have it check its fields, etc.
Object <- ClassA <- ClassB <- ClassC <- ClassD
But here you run into an unexpected problem: you can't let ClassA
call its superclass. Object.equals only tests for identity, so after the
identity check above it would return false for every other object.
In ClassA, you have to explicitly not call the superclass. Instead, if we
are at ClassA, at the top of our food chain, we need to make
a few different checks. We have to check for null explicitly: the Java documentation
requires this.
    // an immediate subclass of Object, so...
    if (object == null)
        return false;
We then need to see if the other object we are comparing ourselves to is of the right class. This is most often done with the following type of code:
    if (!(object instanceof ClassA)) // BAD
        return false;
Unfortunately, using instanceof is almost always
wrong. Here's why. Suppose a is of type ClassA, and b
is of type ClassB, a subclass of ClassA.
In a.equals(b), the test for (b instanceof ClassA) succeeds.
In b.equals(a), the test for (a instanceof ClassB) fails!
So if we use instanceof, then a.equals(b)
can return a different result than b.equals(a).
This symmetry in equals is not just a good idea, it's the law! (It is required by the Java documentation for Object.equals.) In particular, if symmetry is not maintained then corruption problems will creep into all your use of collections.
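The asymmetry is easy to reproduce. The following is a minimal sketch with two hypothetical classes, PlainPoint and LabeledPoint (our names, not from the column), each of which tests with instanceof in its equals method.

```java
// Hypothetical classes for illustration only; the names are not from the column.
class PlainPoint {
    final int x;
    final int y;

    PlainPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public boolean equals(Object object) {
        if (!(object instanceof PlainPoint)) // the problematic test
            return false;
        PlainPoint other = (PlainPoint) object;
        return x == other.x && y == other.y;
    }

    public int hashCode() {
        return 31 * x + y;
    }
}

class LabeledPoint extends PlainPoint {
    final String label;

    LabeledPoint(int x, int y, String label) {
        super(x, y);
        this.label = label;
    }

    public boolean equals(Object object) {
        if (!(object instanceof LabeledPoint)) // the problematic test
            return false;
        return super.equals(object)
            && label.equals(((LabeledPoint) object).label);
    }
}

class SymmetryDemo {
    public static void main(String[] args) {
        PlainPoint a = new PlainPoint(1, 2);
        PlainPoint b = new LabeledPoint(1, 2, "home");
        System.out.println(a.equals(b)); // prints "true"
        System.out.println(b.equals(a)); // prints "false"
    }
}
```

A Hashtable holding both a and b could then find or miss an entry depending on which of the two objects happens to be asked.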
In the next section we'll discuss the few cases where it is OK
to use instanceof, but except for those few cases you always
want to check that the classes are precisely the same, using the following code:
    if (object.getClass() != getClass()) // GOOD
        return false;
Since we are always calling up the inheritance chain anyway, both the test
for null and the test for matching classes only need to be done
once, in the class immediately below Object.
Note: Don't make the mistake of using ClassA.class instead of getClass(); that will fail if this object happens to be a subclass that is using the inherited method.
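To see why the hard-coded class literal fails, consider this minimal sketch. All four class names are hypothetical. The subclasses add no fields and simply inherit equals.

```java
// Hypothetical classes; the names are ours, not the column's.
class HardCodedBase {
    final int id;

    HardCodedBase(int id) {
        this.id = id;
    }

    public boolean equals(Object object) {
        if (this == object) return true;
        if (object == null) return false;
        if (object.getClass() != HardCodedBase.class) return false; // BAD
        return id == ((HardCodedBase) object).id;
    }

    public int hashCode() {
        return id;
    }
}

// Inherits equals unchanged.
class HardCodedSub extends HardCodedBase {
    HardCodedSub(int id) {
        super(id);
    }
}

class DynamicBase {
    final int id;

    DynamicBase(int id) {
        this.id = id;
    }

    public boolean equals(Object object) {
        if (this == object) return true;
        if (object == null) return false;
        if (object.getClass() != getClass()) return false; // GOOD
        return id == ((DynamicBase) object).id;
    }

    public int hashCode() {
        return id;
    }
}

// Inherits equals unchanged.
class DynamicSub extends DynamicBase {
    DynamicSub(int id) {
        super(id);
    }
}

class ClassCheckDemo {
    public static void main(String[] args) {
        // Two identical subclass instances are wrongly reported as unequal.
        System.out.println(new HardCodedSub(7).equals(new HardCodedSub(7))); // prints "false"
        // With getClass() the inherited method works for the subclass too.
        System.out.println(new DynamicSub(7).equals(new DynamicSub(7)));     // prints "true"
    }
}
```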
Now that we have gotten all of the class tests out of the way, we can safely cast the other object to our type and start comparing fields. We have to remember to include all the fields in the object; and when you add new fields to an object you have to remember to add them to the list. (Scott Oaks's October '99 Java Report column shows how you can do this automatically with reflection, though you still have to handle the cases below specially.)
    ClassA other = (ClassA)object;
    if (primitive != other.primitive)
        return false;
    if (!objectField.equals(other.objectField))
        return false;
Syntactically we always have to remember to use equals with
objects and == with primitives, but other than that we just compare
one field after another. Right? Wrong. The first possible gotcha is that
some fields are cached information, typically transient fields, and need to be
omitted since they are not part of the identity of an object. We will represent
this with a commented line, just to remind ourselves that this is a deliberate
omission.
    // if (!tempField.equals(other.tempField))
    //     return false;
The second gotcha is that some fields may be null, and take a slightly more complicated test:
    if (possiblyNull == null) {
        if (other.possiblyNull != null)
            return false;
    } else {
        if (!possiblyNull.equals(other.possiblyNull))
            return false;
    }
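Later Java releases (Java 7 onward, so not available when this column was written) added java.util.Objects.equals, which packages this null-tolerant test. The sketch below compares it with the long-hand version. NullTolerantCompare and its method names are hypothetical.

```java
import java.util.Objects;

// Hypothetical helper class; not part of the original column.
class NullTolerantCompare {
    /** The long-hand test from the text, lifted into a helper. */
    static boolean manual(Object mine, Object theirs) {
        if (mine == null) {
            return theirs == null;
        }
        return mine.equals(theirs);
    }

    /** The same test using the library method (Java 7 and later). */
    static boolean library(Object mine, Object theirs) {
        return Objects.equals(mine, theirs);
    }

    public static void main(String[] args) {
        System.out.println(manual(null, null));   // prints "true"
        System.out.println(manual(null, "x"));    // prints "false"
        System.out.println(library("x", null));   // prints "false"
        System.out.println(library("x", "x"));    // prints "true"
    }
}
```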
The third gotcha, and the nastiest one, is that some objects may have incorrect implementations for equals. The most common of these are arrays. For example, the following will not correctly compare the contents of the arrays.
    if (!array.equals(other.array)) // BAD
        return false;
To fix this, you need to include whatever code is required to correctly do an equality check, such as the following.
    if (array.length != other.array.length)
        return false;
    for (int i = 0; i < array.length; ++i) {
        if (array[i] != other.array[i])
            return false;
    }
Note: Java 2 provides additional utility methods on java.util.Arrays for array comparisons, and also correctly implements equality on collections.
Note: I have seen some people use toString to work around bad equals. Don't do it except with StringBuffer. The toString method is relatively expensive and not guaranteed to contain the complete state of the object: toString just spews whatever debugging information the class designer thought worthwhile.
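A minimal sketch of the array case, using java.util.Arrays.equals from Java 2. The class ArrayCompareDemo and its helper methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of the original column.
class ArrayCompareDemo {
    /** What array.equals really does: an identity test. */
    static boolean viaEquals(int[] first, int[] second) {
        return first.equals(second);
    }

    /** Element-by-element comparison using the Java 2 utility method. */
    static boolean viaArrays(int[] first, int[] second) {
        return Arrays.equals(first, second);
    }

    public static void main(String[] args) {
        int[] first = {1, 2, 3};
        int[] second = {1, 2, 3};
        System.out.println(viaEquals(first, second)); // prints "false"
        System.out.println(viaArrays(first, second)); // prints "true"
    }
}
```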
In general, if a class does not override equals, its
implementation is just plain wrong. You'll have to supply special purpose code
for doing your own equality tests if any of your fields are classes that don't
override equals. Besides arrays, other examples of these in the JDK
include Cursor and StringBuffer, and any collections
before Java 2.
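For StringBuffer, the special-purpose code can be as small as the following sketch, which compares the text through toString (the one case where the note above allows it). The class StringBufferCompare and the method sameText are hypothetical names.

```java
// Hypothetical helper class; not part of the original column.
class StringBufferCompare {
    /** Compares two buffers by content; StringBuffer.equals only tests identity. */
    static boolean sameText(StringBuffer a, StringBuffer b) {
        if (a == b) return true;
        if (a == null || b == null) return false;
        return a.toString().equals(b.toString());
    }

    public static void main(String[] args) {
        StringBuffer a = new StringBuffer("abc");
        StringBuffer b = new StringBuffer("abc");
        System.out.println(a.equals(b));    // prints "false"
        System.out.println(sameText(a, b)); // prints "true"
    }
}
```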
Once you pass the gauntlet of all these tests, you return true
at the end. For the complete examples, see Listing 1 and
Listing 2.
So what are the circumstances where you can use instanceof
instead of checking for identical classes? The first case is where the class is final.
The implementation for String, for example, can safely use instanceof.
This is because there is no ambiguity in the class structure: there can be no
subclasses of String.
The second case is where each class up and down the hierarchy is expected to
test with instanceof. Although this approach also gives the right
answer, it has the disadvantage that in very deep inheritance chains, redundant
checks are being made. However, for JVMs that have an inefficient implementation
for getClass(), this may be the best approach.
The only other time when you really want to use instanceof is
where you are prepared to compare objects for equality across classes.
Generally, this only works well with a restricted domain like Colors
or Numbers, where all of the objects can be converted to some
common class without loss of information. For example, this could have been done
with Numbers, as follows.
abstract class NumberX {
    public boolean equals(Object object) {
        if (this == object)
            return true;
        // a direct subclass of Object, so...
        if (object == null)
            return false;
        if (!(object instanceof NumberX))
            return false;
        // convert to common class and compare
        NumberX other = (NumberX)object;
        return getBigDecimalValue().equals(
            other.getBigDecimalValue());
    }
    public abstract BigDecimal getBigDecimalValue();
    ...
}
As long as all subclasses of NumberX can be converted to BigDecimal
without loss of information, NumberX can implement a version of equals that
compares objects across classes. This has to be part of the required
semantics for this class, however, so in this instance we could not add a ComplexNumber
subclass without violating those semantics.
The subclasses don't have to override equals; just implementing getBigDecimalValue
is sufficient, as follows:
public BigDecimal getBigDecimalValue() {
    return BigDecimal.valueOf(longValue);
}
For performance, subclasses could override equals, to save the
cost of conversion in simpler cases such as when comparing Double
to Integer. (By the way, IBM has API-compatible versions of BigDecimal
and BigInteger that are much faster than the Java 2
implementations, and provide more function as well; see the references for more
information.)
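Putting the pieces together, here is a minimal runnable sketch of the cross-class comparison. NumberX follows the code above; the subclasses LongX and DoubleX, and the added hashCode, are hypothetical and only illustrate the idea. DoubleX assumes a finite value.

```java
import java.math.BigDecimal;

// NumberX follows the column; LongX and DoubleX are hypothetical subclasses.
abstract class NumberX {
    public boolean equals(Object object) {
        if (this == object) return true;
        if (object == null) return false;
        if (!(object instanceof NumberX)) return false;
        NumberX other = (NumberX) object;
        // BigDecimal.equals also compares scale, so this relies on the
        // subclasses producing the same scale for the same numeric value.
        return getBigDecimalValue().equals(other.getBigDecimalValue());
    }

    // Kept consistent with equals by hashing the common representation.
    public int hashCode() {
        return getBigDecimalValue().hashCode();
    }

    public abstract BigDecimal getBigDecimalValue();
}

class LongX extends NumberX {
    private final long longValue;

    LongX(long longValue) {
        this.longValue = longValue;
    }

    public BigDecimal getBigDecimalValue() {
        return BigDecimal.valueOf(longValue);
    }
}

class DoubleX extends NumberX {
    private final double doubleValue; // assumed finite (not NaN or infinite)

    DoubleX(double doubleValue) {
        this.doubleValue = doubleValue;
    }

    public BigDecimal getBigDecimalValue() {
        return new BigDecimal(doubleValue); // exact value of the double
    }
}

class NumberXDemo {
    public static void main(String[] args) {
        System.out.println(new LongX(2).equals(new DoubleX(2.0))); // prints "true"
        System.out.println(new DoubleX(2.0).equals(new LongX(2))); // prints "true"
        System.out.println(new LongX(2).equals(new DoubleX(2.5))); // prints "false"
    }
}
```

Because every subclass is compared through the same BigDecimal value, the instanceof test stays symmetric here.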
We will discuss hashCode next time. For now it is worth noting
that hashCode can be used to do an interesting optimization for
equality checks. If your class caches the hashCode value in a
transient field, then right after casting the object you can do a quick check on
this cached value.
    ClassA other = (ClassA)object;
    if (hash != other.hash)
        return false;
This can dramatically improve the performance on equals, skipping the expense of comparing all of your fields in typical cases, since the probability is very high that the hash value will be different if the other fields are different. The test on the hash values is not conclusive, of course; you still need to test the rest of the fields if the hash values are the same.
The downside is the extra storage required by the hash value, plus the cost of computing it the first time and recomputing it whenever the object changes. If the hash values are used a great deal in any event, then this optimization is often worth the extra effort. The hash field can also be lazy-evaluated, which helps further.
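A minimal sketch of the cached-hash optimization with lazy evaluation follows. The class CachedHashName is hypothetical; it is immutable, so the cached value never needs recomputing, and it treats 0 as "not computed yet" (an assumption of this sketch, which costs a recomputation in the rare case where the real hash is 0).

```java
// Hypothetical immutable class; not part of the original column.
class CachedHashName {
    private final String first;
    private final String last;
    private transient int hash; // 0 means "not computed yet"

    CachedHashName(String first, String last) {
        this.first = first;
        this.last = last;
    }

    public int hashCode() {
        int h = hash;
        if (h == 0) { // lazy evaluation on first use
            h = 31 * first.hashCode() + last.hashCode();
            hash = h;
        }
        return h;
    }

    public boolean equals(Object object) {
        if (this == object) return true;
        if (object == null) return false;
        if (object.getClass() != getClass()) return false;
        CachedHashName other = (CachedHashName) object;
        // Quick rejection: different hash values mean different contents.
        if (hashCode() != other.hashCode()) return false;
        // Same hash is not conclusive, so still compare the fields.
        return first.equals(other.first) && last.equals(other.last);
    }
}
```

Calling hashCode() rather than reading the field directly makes sure the lazily computed value exists on both objects before it is compared.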
Another common optimization can be useful in certain cases. Where you strictly control the creation of objects through factory methods (with a private constructor), you can ensure that you only create distinct objects when they are unequal to any other objects of that type. This is often done in a Poor Man's Enum (as discussed in a previous column). In these circumstances, the implementation of equals is trivial:
public final boolean equals(Object object) {
    return (this == object);
}
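As a sketch of how such a class might be arranged, the hypothetical Suit class below hands out exactly one instance per name through a factory method, so identity and equality coincide. The class, its of method, and the use of generics (which postdate this column) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; not part of the original column.
final class Suit {
    private static final Map<String, Suit> INSTANCES = new HashMap<String, Suit>();

    private final String name;

    // Private constructor: instances are only created by the factory method.
    private Suit(String name) {
        this.name = name;
    }

    /** Factory method: returns the one shared instance for each name. */
    static synchronized Suit of(String name) {
        Suit suit = INSTANCES.get(name);
        if (suit == null) {
            suit = new Suit(name);
            INSTANCES.put(name, suit);
        }
        return suit;
    }

    public boolean equals(Object object) {
        return (this == object);
    }

    public int hashCode() {
        return System.identityHashCode(this);
    }

    public String toString() {
        return name;
    }
}
```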
Sample code for equals commonly makes use of exceptions
instead of doing a class check. It typically looks like the following:
public boolean equals(Object object) {
    if (this == object) return true;
    try {
        // throws exception if not ClassA
        ClassA other = (ClassA)object;
        // test fields
        ...
    } catch (Exception e) {
        return false;
    }
    // passed the gauntlet
    return true;
}
This strategy avoids the test on the class. Since class mismatches are
relatively infrequent, the cost of the exception is not a problem. However,
this has the same problem as instanceof, and should only be
used where instanceof is safe, as described above.
In the next column, we'll follow up on the related implementations of hashCode
and clone, which are also far too easy to get wrong. In the
meantime, here are some comments provoked by previous columns.
Samuel Yang caught a bug in my column on Serialization; a most embarrassing bug, given my connection with Unicode!
I had said that there was a restriction on serialized strings, and that Strings greater than 64K needed to be broken up into 64K pieces in Serialization. The restriction is on the number of bytes in the UTF-8 format, not the number of chars in the String. As he says, the simplest solution is to divide any string with length greater than 21,845 chars into pieces, since one char can expand into at most 3 bytes.
For more information on UTF-8 and other forms of Unicode, see the references.
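The arithmetic is 65,535 bytes / 3 bytes per char = 21,845 chars. The following is a minimal sketch of the chunking, assuming the pieces are written with DataOutputStream.writeUTF (which enforces the 65,535-byte limit). The class Utf8Chunker, its method names, and the leading piece count are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the original column.
class Utf8Chunker {
    /** 65,535 bytes per writeUTF call, at most 3 bytes per char. */
    static final int MAX_CHARS = 65535 / 3; // 21,845

    /** Splits the text into pieces of at most MAX_CHARS chars each. */
    static List<String> split(String text) {
        List<String> pieces = new ArrayList<String>();
        for (int start = 0; start < text.length(); start += MAX_CHARS) {
            int end = Math.min(text.length(), start + MAX_CHARS);
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    /** Writes the piece count followed by each piece via writeUTF. */
    static byte[] write(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        List<String> pieces = split(text);
        out.writeInt(pieces.size());
        for (String piece : pieces) {
            out.writeUTF(piece);
        }
        out.flush();
        return bytes.toByteArray();
    }
}
```

A reader would call readInt and then readUTF that many times, appending the pieces to rebuild the original string.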
Gary Gordon wrote in with the following observation on serialization:
In JDK 1.1.8, at least, there appears to be a severe restriction in externalization, such that it will fail on input if the externalizable objects lack public no-arg constructors; an IllegalAccessException is raised. However, when I switch over to serialization, either default or with explicit readObject() and writeObject() methods, all is fine. Upon consulting the JDK 1.1.8 Serialization spec, I found this: "For Externalizable objects, the no-arg constructor for the class is run and then the readExternal method is called to restore the contents of the object."
So the behavior is expected, but not particularly helpful, especially if the class doesn't have public constructors, either because end users must instantiate it via a factory method, or because a no-arg constructor would allow the user to construct an ill-formed object. The exception is raised in a native method called allocateNewObject() in ObjectInputStream, so my gut feeling is that Sun could fix the behavior, if it isn't already fixed in JDK 1.2 (although I would be surprised if it is actually fixed).
Serialization has an interesting little security (and thread-safety) hole in it. As we discussed in the column, multiple references to the same object are handled specially. The object is serialized the first time, while any subsequent references to that object are serialized with just a special tag that points to the first object. When the stream is deserialized, all of the references are properly restored. However, a malicious programmer can use this feature to modify the serialized stream to gain access to an object's internals by manufacturing a fake reference to those internals. Once the stream is read in, the tainted objects now have internals that can be inspected or modified at will.
Sun has a nice optimization in their String.substring
implementation. If you make a substring of another string, it doesn't copy the
array of characters. Instead, it just makes a new String object
that points at the same array of characters, but has an offset into that array
to get to the right point, and its own length. This avoids an object allocation,
and needless copying of character data.
But this implementation can also bite you. Recently I was doing some analysis of some long files of data for English words. From each line in the file, I used one word out of the line for a key in a hash table. When processing a file, it would slow down to a crawl around the letter T, then soon grind to a stop.
Here's what was happening. Even though I was only keeping one word out of each line, behind my back those substrings pointed to the whole array of characters for the line. Since those whole lines were always live, they could not be flushed by the garbage collector. Gradually memory filled up until the program effectively hung. Changing the line as follows fixed the problem.
word = line.substring(s,e);
changes to:
word = new String(line.substring(s,e));
This is related to — but not the same as — the StringBuffer gotcha that we discussed in the Abstraction column.
Listing 1: ClassA
(extends Object)
public boolean equals(Object object) {
    if (this == object) return true;
    // IS a direct subclass of Object, so...
    if (object == null) return false;
    if (object.getClass() != getClass()) return false;
    // check non-transient fields
    ClassA other = (ClassA)object;
    if (primitive != other.primitive) return false;
    if (!objectField.equals(other.objectField)) return false;
    // if (!tempField.equals(other.tempField)) return false;
    // special checks for objects that may be null
    if (possiblyNull == null) {
        if (other.possiblyNull != null) return false;
    } else {
        if (!possiblyNull.equals(other.possiblyNull)) return false;
    }
    // special checks for classes without correct equals
    if (array.length != other.array.length) return false;
    for (int i = 0; i < array.length; ++i) {
        if (array[i] != other.array[i]) return false;
    }
    if (cursor.getType() != other.cursor.getType()) return false;
    // passed the gauntlet
    return true;
}
Listing 2: ClassB
(extends ClassA)
public boolean equals(Object object) {
    if (this == object) return true;
    // NOT a direct subclass of Object, so...
    if (!super.equals(object)) return false;
    // check non-transient fields
    ClassB other = (ClassB)object;
    if (integer != other.integer) return false;
    // passed the gauntlet
    return true;
}
References
Davis, Mark; "Forms of Unicode"
Davis, Mark; "Durable Java" columns; Java Report
Gillam, Richard; "The Anatomy of the Assignment Operator"
IBM's BigDecimal package
Java API Specification for Equals
Copyright (c) 1999, Mark Davis, All rights
reserved.