|
Mark Davis / http://www.macchiato.com/ |
Durable Java | Immutables | Abstraction | Serialization | Liberté, Égalité, Fraternité | Hashing and Cloning
|
[Note to the editor: here is the contents of the left-hand first-page information. Both it and the column title have changed.] Design your code from the start to be durable--so it can evolve without breaking your clients' code.
Dr. Mark Davis is lead architect at IBM's Center for Java Technology, Silicon Valley, co-founder and president of the Unicode Consortium, and architect for the bulk of JDK1.1 internationalization. [Note to the editor: end of first-page left-hand information.] |
Last column, we discussed just how easy it is to screw up when implementing equals;
this time, we'll take on hashCode and clone, which are
also far too easy to get wrong. We will first have a look at hashCode.
The carrot for writing hashCode is that whenever your objects
will be in hash-based collections such as Hashtable, HashMap,
or Set, a good implementation of hashCode will give
you an excellent payback in performance.
But that's not all — there is a stick in here, as well as a carrot!
Unfortunately, with hashCode — as with equals —
you really don't have much choice. Because hashCode is inherited
from Object, your class has it defined whether you need it or not.
As with equals, if you do nothing you will get a default
implementation, an implementation that is almost invariably wrong. So as a
general rule, you must implement both hashCode and equals,
rather than rely on Object's implementation. (The other alternative
is to throw an exception, at least warning people at runtime that equals/hashCode
is not implemented.)
Let's quickly review why there is a problem with equals. Take
the following example:
[Note to the editor: if any of the code samples are too wide for the column, let me know. I will fix them for you to prevent the code from being damaged!]
public class Currency {
private BigDecimal value;
private String ISOCurrencyCode;
public Currency (BigDecimal value, String code)...
public Currency (double value, String code)...
...
}
...
Currency money1 = new Currency(3.5, "USD");
Currency money2 = new Currency(3.5, "USD");
if (!money1.equals(money2)) {
System.out.println("Alert Alan Greenspan!");
}
Since you didn't explicitly write an equals method, anyone using
it will get the wrong answer: two objects that should have been equal will not
be!
Moreover, your hashCode implementation must be synchronized with
your equals implementation, so since you are overriding equals, you
have to override hashCode. If you don't maintain this invariant,
then hash-based collections will become corrupt.
To see why this happens, it helps to review a little bit how hash tables work. They are typically implemented with an array of buckets. Each bucket is itself an array or linked list of <key, value> pairs. To decide which keys go into which buckets, the array index for the bucket is produced by taking the hash value, modulo the size of the collection. That is,
bucketIndex = abs(hashValue % tableSize);
Within each bucket, the search for the object is sequential, and uses equals.
In the ideal case, the table size is a prime number, the hash values would be
evenly distributed, and each bucket would contain just one value. Lookup in that
case is incredibly fast: just one call to hashCode and one call to equals.
If the hash values (modulo the table size) are constant, you have the worst
case; all the objects will be in one bucket, and accessing the objects will
require a sequential search through every single item in the hash table, calling
equals for every single item. Lookup in that case will be
incredibly slooow!
Let's get back to the question of why to synchronize your hashCode
method with your equals method. When you put the key money1
(constructed from 3.5 and "USD") into the Hashtable, it
goes in (say) bucket 97. When you search for a key (again constructed from 3.5
and "USD"), it has (probably) a different hash value and you end up
looking in bucket 472. Whoops, won't find it there. If you try to delete a key,
it won't be found; if you try setting the key to a different value, you get
duplicate keys and inconsistent values.
This leads to the one strict principle that you absolutely must follow with hashCode:
Synchronization with Equality
If x.equals(y), then x.hashCode() == y.hashCode().
That is, if two objects are equal, then their hash values are equal. Not that
the reverse is not true; two objects may have the same hash value but not
be equal! So how do you do a good implementation for hashCode
? Well, the most trivial implementation would be the following:
public int hashCode() {
return 0;
}
This obeys the Synchronization with Equality principle, but will give you rotten performance if these objects go into hash-based collections, as we discussed above. So what would an ideal implementation look like? Beyond the Synchronization with Equality principle, there are three basic design goals:
Even Distribution
An ideal implementation would return integer values that are evenly distributed over the range from zero to
Integer.MAX_VALUE. That is, if I picked a million random objects, their hash values would be pretty randomly distributed over this range. In the best case, any unequal objects would have different hash values.
Fast
An ideal implementation would compute the hash value very quickly. After all, this method is going to be called every time an object is put into a hash-based collection, and every time you query whether an object is in a hash-based collection.
Simple
Since you should implement this for every object that could go into hash-based collections, you want your code to be simple to write and easy to maintain.
These goals are often at odds; you have to find a balance between them. So let's start with a typical implementation. For illustration, we will unroll the code a little bit, so that we can talk about different features as we go. We will start with a brute-force implementation, then refine it from there.
A brute-force implementation of hashCode factors in all of the
fields of the object that are relevant to equals. First, as we
discussed with equals in the last column, you grab your
superclass's hashCode.
public int hashCode() {
int result = super.hashCode();
However, just as with equals, we have to avoid a pitfall right
here. If you are inheriting directly from Object, you mustn't
do this (unless you didn't define equals!). If you did, you would
break the Synchronization with Equality principle, as we discussed. So
when inheriting directly from Object, we will just start with zero.
int result = 0;
We then factor in all of the fields of the object that are relevant to equals.
Typically, this means all of the non-transient fields. There is a fairly simple
way to factor these in that distributes them fairly evenly. For each successive
field, multiply the current result by a prime, then add in the appropriate value
for the field. A nice prime to use is 1,000,003. It may seem simpler to just add
up successive values without multiplying, or just shift the result instead of
multiplying, but these methods don't give very good distributions in general.
Fields of most primitive types, this looks like the following:
result = PRIME*result + primitiveField;
Boolean fields can just be tested:
result = PRIME*result + (flagField ? 1 : 0);
Long fields can just be converted to two ints and factored in:
result = PRIME*result + (int)(longField >>> 32); result = PRIME*result + (int)(longField & 0xFFFFFFFF);
Double fields can just be converted to a long, then handled like a long field.
long temp = Double.doubleToLongBits(doubleField); result = PRIME*result + (int)(temp >>> 32); result = PRIME*result + (int)(temp & 0xFFFFFFFF);
| Ed note: make the following a sidebar:
Surprises!
If your object has fields that could conceivably have these values,
then you must be careful. Either use (x == 0.0F ? 0 : Float.floatToIntBits(x)) (x == 0.0 ? 0L : Double.doubleToLongBits(x)) |
Non-primitive fields (except for arrays) can just use their hashCode
value.
result = PRIME*result + objectField.hashCode();
However, they take special handling if they might be null:
if (maybeNullField != null) {
result = PRIME*result
+ maybeNullField.hashCode();
}
Arrays must be handled specially: do not call their hashCode
method, since you will once again get the wrong answer. Instead, the brute force
approach walks through each element in the array, and adds in its value.
Depending on the base type of the array, you will use one of the techniques we
discussed here: either add in the value directly, convert to a number, or walk
through subarrays. For example:
for (int i = 0; i < arrayField.length; ++i) {
result = PRIME*result
+ arrayField[i].hashCode();
}
At the very end, return the result we've been accumulating:
return result; }
We are now done with the brute-force approach. You can see the results of
what we have done so far in Listing 1. You can also make
this simpler by using a few utility methods, so instead of taking different
approaches depending on the type, you can use a uniform, simpler approach. A
sample of the utility methods is shown in Listing 2,
with the simpler hashCode implementation in Listing
3.
The brute-force process works fine, and is straightforward to implement. It's
ok for simple classes with a small number of fields and only small arrays or
collections. However, its performance may leave something to be desired for
larger classes. Generally, a pretty good distribution can be achieved with a lot
less work than visiting every possible field. Fields that don't vary much
between instances can simply be skipped. For others, think about (or measure)
the cost of getting the hashCode values from the superclass, from
object fields, and from arrays, compared to the value of better distribution.
It may be sufficient to simply probe a few values in an array, for example, rather than to scan every element. The following example factors in the length of the array, then adds in values for just the indices that are powers of two. In a thousand-element array, this only takes 10 probes. Where the array values vary significantly from object to object, this may be plenty to get a good distribution.
result = PRIME*result + arrayField.length;
for (int i = 0; i < arrayField.length; i <<= 1) {
result = PRIME*result + arrayField[i].hashCode();
}
Other techniques can be used to get more probes without using every value. Other techniques can be used to get more probes without using every value. For example, the following probes the first 16 and last 16 values, but gradually skips more and more values between. It makes 130 probes in a thousand elements, and 206 probes in ten thousand.
int last = arrayField.length - 1;
int i = 0, j = last;
result = PRIME*result + last;
for (; i < j; i = 17*(i + 1) >> 4, j = last - i) {
result = PRIME*result + arrayField[i].hashCode();
result = PRIME*result + arrayField[j].hashCode();
}
if (i == j) {
result = PRIME*result + arrayField[i].hashCode();
}
However, if all of the values might be important, you must be very careful not to skip over them.
So here are the things to remember about implementing hashCode.
First, with very few exceptions you always want to do it. Second, remember that
you have to choose different approaches whether you extend Object
or not. Third, for simple classes just use the brute-force approach. Finally,
for more complex classes, you will want to pick and choose the values that you
factor into the implementation to speed it up.
So far we have let you off easy — hashCode was pretty simple
to do, as long as you remember these few points. We'll now turn to clone,
which is not quite so simple.
| Ed note: make the following a
sidebar:
Warning: Be careful when using collections if you need to
change the value of a key. Otherwise your collection becomes corrupt!
For example, suppose at sometime you have put a
Later on, you can't just change the value of the key:
Instead, you must first remove the key from the table, and then re-enter the pair after you change the value of the key, as follows:
The Java API documentation is very clear on this, but it never hurts to repeat this warning, since it leads to nasty, hard-to-find bugs. |
Implementing clone allows other programmers to use your objects as fields and to safely implement getters, setters, and clone themselves. For more details on this, see my April, 1999 column: "Immutables". As a general rule, you really should provide a clone operator for all of your classes. However, suppose you are feeling lazy, and want to get away with the absolute minimum. You do not need to provide a clone method if your superclass does not implement a public clone method, and your object falls under one of the following cases:
We will see the reasons for these conditions as we show how to implement a clone
method. You'll see, for example, that if you don't implement clone,
it can be very unpleasant for clients who have to use your class in implementing
their clone method.
The only strict principles that you must follow for clone are:
This is what is known as a safe clone. There are cases where it may make sense to provide what is called a shallow clone, especially with collection classes. Such a shallow clone only duplicates the top-level structure of the object, not the lower levels. A shallow clone is useful in many circumstances so long as programmers can somehow still implement a deep clone on top of those objects. Ideally, such classes would implement both with different names.
A safe clone is different than a deep clone, which copies the object and all of its fields recursively. A safe copy does not need to copy any field that is irrelevant to Clone Independence. For example, a String field does not need to be cloned. Because Strings are immutable, even if one String is shared by two objects, the contents of the string cannot be changed, and thus cannot break Clone Independence. Similarly, if a field cannot be altered through the class API, then it can be shared between two objects without breaking Clone Independence.
A safe clone for a class X is often required in implementing another class Y
that has a field of that type X. It is generally required where Y is immutable
(for more information, see my April 1999 JavaReport column "Immutables",
archived on http://www.macchiato.com/).
It is also often required even where Y is mutable, such as in GridBagLayout.setConstraints().
You will see it used in those cases in the Java 2D implementations. On the other
hand, a deep clone — as opposed to a safe clone — is rarely ever required.
If your class is immutable and all its subclasses are immutable, you don't
need to supply a clone method; your class is automatically safe. On
the other hand, it is trivial to supply a safe clone operator as a
convenience for your clients, as shown below. Since the class is immutable, both
of the above principles are handled trivially!
public Object clone () {
return this;
}
| Ed note: make the following a
sidebar:
Immutables As we have remarked in previous columns, it is very unfortunate that
while the Java programming model depends very heavily on immutability in
lieu of
|
We'll take a moment to point out that the clone method is very
different than the so-called copy constructor. A copy constructor is one
that takes object of its own type as a single parameter. For example, Dimension
has a copy constructor, so we can call it as follows:
Dimension dimension2 = new Dimension(dimension1);
This works out fine if the class we are working with is final.
If it is not — and the variable is actually an instance of a subclass — then
we will get the wrong answer; what is called a slice. For example,
suppose that dimension1 is actually an instance of a subclass NewDimension:
class NewDimension extends Dimension {
private int seconds;
public NewDimension(int width, int height, int seconds) {...}
...
}
If Dimension had a clone operator, then the following would produce a correct
copy, with the right subclass (NewDimension).
The copy constructor cannot! It can only produce a slice in that case,
discarding all the subclass information.
Dimension dimension2 = dimension1.clone();
Here is an example of how to correctly implement clone, with the
different cases that you may be faced with. As before, the code is unrolled
slightly to make it easier to walk through. First, make your object implement Cloneable.
Next, set up the implementation of clone with a try-catch block for CloneNotSupportedException.
This is just boilerplate code you generally have to include because of the way Object.clone()
is defined. Trust me on this one; I could tell you all the gory details, but I'd
run over my word limit!
Then, as the first line you call super.clone(). Normally that is
not a problem, but if your superclass did not implement clone, it is. You'll
have to emulate the missing clone method, if possible. (We'll talk about this in
a bit.) Here is what you have so far:
public Object clone() {
try {
MyClass result = (MyClass) super.clone();
// more code to go here (possibly)
return result;
} catch (CloneNotSupportedException e) {
return null; // never invoked
}
}
The call to super.clone() copies your superclass's fields, and
makes bitwise copies of your fields.
Because of this, you do not have to copy any primitive or immutable
fields (such as String) in the rest of your code. So if your object
doesn't contain any mutable objects, you are done! For example, for the Currency
class we defined above, the above would be sufficient. As a matter of fact, if
your superclass defines a public clone operator, and you only have primitive or
immutable fields, you don't even have to define the clone operator at all; the
superclass implementation will do the right thing.
If your object contains mutable objects (neither primitive nor immutable), you will have to add some more code just for them. These lines will go where we have the comment above. For well-mannered field objects, we just clone them.
result.insets = (Insets) insets.clone();
Ideally, that's all we would have to do. Sadly, many classes are missing the clone method, or their clone method is not safe. If the author of the class for one of your fields was a lazy bum, and did not supply you with a clone method, it starts to get painful. You will have to emulate what the clone method would have done, if possible. If you know that the field object could not be a subclass, use the copy constructor if available.
result.dimension = new Dimension(dimension);
If the field object does not have a copy constructor, see if you can construct it by extracting infomation from it and building a duplicate yourself.:
borderLayout = new BorderLayout(
borderLayout.getHgap(), borderLayout.getVgap());
If the class of the field is badly designed, you may not be able to get enough information from public getters to reconstruct a copy. In that case, you are just out of luck.
If the field could be a subclass, it really starts to get ugly. If you know it could only be one of a limited number of possible ones, then you can try to set up code to deal with that, checking for each type of class. If you cannot limit ahead of time the number of subclasses that your field might have, then you might have to try to use reflection to synthesize your own clone. This is a real pain in the ass, so just hope you don't run into this!
I'll leave this as a problem for the reader. If you feel like a challenge, send me your try; I'll print the best one in my next column.
As if that weren't annoying enough, we still have to handle arrays and collections. If these contain just immutables or primitives they're easy; we just clone them.
result.array = (int[]) array.clone(); result.vector = (Vector) vector.clone();
If some elements of the collection or array are mutable objects (e.g. neither immutable or primitive), then we have to do more work. Since collections and arrays don't have safe clones, only have shallow clones, not only do we have to clone them as above, but we have to clone all of their elements one by one.
for (int i = 0; i < vector.size(); ++i) {
result.vector.setElementAt(
((SomeType) vector.elementAt(i)).clone(), i);
}
| Ed note: make the following a
sidebar. Make sure it is AFTER the above text.
Cloneable? Inexplicably, the
Instead, we have to cast it to some explicit class (if we know it!)
or use reflection. (And reflection has much poorer performance than
calling methods directly.) The only thing the |
If the elements of the collection or array do not have clone methods then we have emulate a clone method, as we discussed above.
So here are the things to remember about implementing clone.
First, as with equals and hashCode, with few
exceptions you always want to do it. Second, you always call your superclass's clone
method. Third, you only need to worry about fields that are mutable objects.
Fourth, you generally just call their clone operators, though you are in for
some work if those operators don't exist, or if you have arrays or collections
of mutable objects. Fifth, if you ever add any fields to your class, make sure
you add appropriate code to the clone method for them! The sample code we've
been working on is in Listing 4.
Although it may seem that equals, hashCode and clone
are unnecessarily annoying to have to deal with, once you recognize their
pitfalls in the majority of cases they are pretty straightforward. And if you
are feeling overwhelmed, just keep repeating to yourself "It's much, much
worse in C++, it's much, much worse in C++,..."
| Ed note: make the following a
sidebar.
Readers Write Q: From Robert Allan Schwartz: I was very interested to read your article "Liberte, equalite, fraternite". If I understand correctly, then classes that inherit directly from Object must implement equals() differently from classes that don't inherit directly from Object. The downside is that the same "different" implementation of equals() must now be repeated on many classes. I have a suggestion. Create a subclass of Object called EqualsImplementor. EqualsImplementor is now the only place that must have a "different" implementation of equals(). Everyone who used to inherit from Object should now inherit from EqualsImplementor. What do you think? A: That is a possible approach. You would have to have Ideally, Java would not have put either Q: From Steven M Rosenthal: Your Java Report (January 2000) column on equals() was very useful, but overlooks the issue of thread-safe implementation strategies in equals() functions. Consider: public class Complex {
private double real_;
private double imag_;
...
public boolean equals(Object object) {
if (this == object) return true;
if (object == null) return false;
if (object.getClass() != getClass())
return false;
Complex cx = (Complex) object;
return cx.real_ == this.real_
&& cx.imag_ == this.imag;
}
}
This isn't thread-safe-- if either this or object are mutated during
the execution of equals(), then an inconsistent answer can be returned.
[...] Good solution:
public boolean equals(Object object) {
// boilerplate
...
Complex cx = (Complex) object;
double r = 0.0;
double i = 0.0;
synchronized(this) {
r = this.real_;
i = this.imag_;
}
synchronized(cx) {
return cx.real_ == r
&& cx.imag_ == i;
}
}
Best solution: Design class Complex to be immutable. Am I missing
something, or do the issues of thread-safety permeate even the simplest
of situations such as equals()? I'd be interested in having your
thoughts on this question.
A: In our experience, it causes both performance problems and unnecessary risk of deadlocks to try to make every method call thread-safe. It also doesn't really protect against threading problems, as the following illustrates: if (foo.equals(fii)) { // line 1
bar.resetTo(foo); // line 2
}
Even if foo.equals() and bar.resetTo() are
both thread-safe, a change in foo can occur between line 1 and line 2,
thus invalidating the code. Synchronized methods don't protect code in
collections either; if I were copying from one Vector to another, I
would have to synchronize the copying code, not just depend on
the methods being synchronized.
Synchronization has a price: at too low a level it will also cause performance problems; witness how the Java 2 collections now allow a choice of synchronization for just that reason. Synchronization also wastes cycles when an object is really intended to be used only within a single thread, such as StringBuffer. For that reason, we have found it more useful to synchronize at a higher level. Immutables are possible, but also can have severe performance impacts
depending on their intended usage scenarios. The alternative in this
example is to either not share |
Listing 1: HashCode
Sample
public int hashCode() {
int result = super.hashCode(); // PICK ONE
//int result = 0; // PICK ONE
result = PRIME*result + primitiveField;
result = PRIME*result + (flagField ? 1 : 0);
result = PRIME*result + (int)(longField >>> 32);
result = PRIME*result + (int)(longField & 0xFFFFFFFF);
long temp = Double.doubleToLongBits(doubleField);
result = PRIME*result + (int)(temp >>> 32);
result = PRIME*result + (int)(temp & 0xFFFFFFFF);
result = PRIME*result + objectField.hashCode();
if (maybeNullField != null) {
result = PRIME*result + maybeNullField.hashCode();
}
for (int i = 0; i < arrayField.length; ++i) {
result = PRIME*result + arrayField[i].hashCode();
}
return result;
}
|
Listing 2: HashCode
Utilities
final class Utility {
static final int hashCode(int source, boolean x) {
return PRIME*source + (x ? 1 : 0);
}
static final int hashCode(int source, int x) {
return PRIME*source + x;
}
static final int hashCode(int source, long x) {
return PRIME*source + (int)(PRIME*(x >>> 32)
+ (x & 0xFFFFFFFF));
}
static final int hashCode(int source, float x) {
return hashCode(source,
x == 0.0F ? 0 : Float.floatToIntBits(x));
}
static final int hashCode(int source, double x) {
return hashCode(source,
x == 0.0 ? 0L : Double.doubleToLongBits(x));
}
static final int hashCode(int source, Object x) {
return hashCode(source,
x == null ? 0 : x.hashCode();
}
static final int hashCode(int source, Object[] x) {
for (int i = 0; i < x.length; ++i) {
source = hashCode(source, x[i]);
}
return source;
}
// repeat for arrays of primitives,
// arrays of arrays,... as required
}
|
Listing 3: Simplified
HashCode Sample
public int hashCode2() {
int result = super.hashCode(); // PICK ONE
//int result = 0; // PICK ONE
result = Utility.hashCode(result,primitiveField);
result = Utility.hashCode(result,flagField);
result = Utility.hashCode(result,longField);
result = Utility.hashCode(result,doubleField);
result = Utility.hashCode(result,objectField);
result = Utility.hashCode(result,maybeNullField);
result = Utility.hashCode(result,arrayField);
return result;
}
|
Listing 4: Clone
Sample
public Object clone() {
try {
MyClass result = (MyClass) super.clone();
result.insets = (Insets) insets.clone();
// result.dimension = (Dimension) dimension.clone();
result.dimension = new Dimension(dimension); // SLICE?
borderLayout = new BorderLayout(
borderLayout.getHgap(), borderLayout.getVgap());
result.array = (int[])array.clone();
result.vector = (Vector) vector.clone();
for (int i = 0; i < vector.size(); ++i) {
result.vector.setElementAt(
((SomeType) vector.elementAt(i)).clone(), i);
}
return result;
} catch (CloneNotSupportedException e) {
return null; // never invoked
}
}
|
Thanks to Dwight Duego, Arnaud Le Hors, Pat O'Connor, Richard Gillam, Josh Bloch, and John Raley for feedback on drafts of this article.