Durable Java

Mark Davis / http://www.macchiato.com/


Hashing and Cloning

Durable Java | Immutables | Abstraction | Serialization | Liberté, Égalité, Fraternité | Hashing and Cloning

[Note to the editor: here is the contents of the left-hand first-page information. Both it and the column title have changed.]

Design your code from the start to be durable--so it can evolve without breaking your clients' code.

Dr. Mark Davis is lead architect at IBM's Center for Java Technology, Silicon Valley, co-founder and president of the Unicode Consortium, and architect for the bulk of JDK1.1 internationalization.

[Note to the editor: end of first-page left-hand information.]

Last column, we discussed just how easy it is to screw up when implementing equals; this time, we'll take on hashCode and clone, which are also far too easy to get wrong. We will first have a look at hashCode.

Why Care?

The carrot for writing hashCode is that whenever your objects will be in hash-based collections such as Hashtable, HashMap, or Set, a good implementation of hashCode will give you an excellent payback in performance.

But that's not all — there is a stick in here, as well as a carrot! Unfortunately, with hashCode — as with equals — you really don't have much choice. Because hashCode is inherited from Object, your class has it defined whether you need it or not. As with equals, if you do nothing you will get a default implementation, an implementation that is almost invariably wrong. So as a general rule, you must implement both hashCode and equals, rather than rely on Object's implementation. (The other alternative is to throw an exception, at least warning people at runtime that equals/hashCode is not implemented.)

Let's quickly review why there is a problem with equals. Take the following example:

[Note to the editor: if any of the code samples are too wide for the column, let me know. I will fix them for you to prevent the code from being damaged!]

public class Currency {
  private BigDecimal value;
  private String ISOCurrencyCode;
  public Currency (BigDecimal value, String code)...
  public Currency (double value, String code)...
  ...
}
...

Currency money1 = new Currency(3.5, "USD");
Currency money2 = new Currency(3.5, "USD");
if (!money1.equals(money2)) {
    System.out.println("Alert Alan Greenspan!");
}

Since you didn't explicitly write an equals method, anyone using it will get the wrong answer: two objects that should have been equal will not be!

Moreover, your hashCode implementation must be synchronized with your equals implementation, so since you are overriding equals, you have to override hashCode. If you don't maintain this invariant, then hash-based collections will become corrupt.

To see why this happens, it helps to review a little bit how hash tables work. They are typically implemented with an array of buckets. Each bucket is itself an array or linked list of <key, value> pairs. To decide which keys go into which buckets, the array index for the bucket is produced by taking the hash value, modulo the size of the collection. That is,

bucketIndex = abs(hashValue % tableSize);

Within each bucket, the search for the object is sequential, and uses equals. In the ideal case, the table size is a prime number, the hash values would be evenly distributed, and each bucket would contain just one value. Lookup in that case is incredibly fast: just one call to hashCode and one call to equals. If the hash values (modulo the table size) are constant, you have the worst case; all the objects will be in one bucket, and accessing the objects will require a sequential search through every single item in the hash table, calling equals for every single item. Lookup in that case will be incredibly slooow!

Let's get back to the question of why to synchronize your hashCode method with your equals method. When you put the key money1 (constructed from 3.5 and "USD") into the Hashtable, it goes in (say) bucket 97. When you search for a key (again constructed from 3.5 and "USD"), it has (probably) a different hash value and you end up looking in bucket 472. Whoops, won't find it there. If you try to delete a key, it won't be found; if you try setting the key to a different value, you get duplicate keys and inconsistent values.

Requirements

This leads to the one strict principle that you absolutely must follow with hashCode:

Synchronization with Equality 

If x.equals(y), then x.hashCode() == y.hashCode().

That is, if two objects are equal, then their hash values are equal. Not that the reverse is not true; two objects may have the same hash value but not be equal! So how do you do a good implementation for hashCode ? Well, the most trivial implementation would be the following:

public int hashCode() {
  return 0;
}

This obeys the Synchronization with Equality principle, but will give you rotten performance if these objects go into hash-based collections, as we discussed above. So what would an ideal implementation look like? Beyond the Synchronization with Equality principle, there are three basic design goals:

Even Distribution 

An ideal implementation would return integer values that are evenly distributed over the range from zero to Integer.MAX_VALUE. That is, if I picked a million random objects, their hash values would be pretty randomly distributed over this range. In the best case, any unequal objects would have different hash values.

Fast

An ideal implementation would compute the hash value very quickly. After all, this method is going to be called every time an object is put into a hash-based collection, and every time you query whether an object is in a hash-based collection.

Simple

Since you should implement this for every object that could go into hash-based collections, you want your code to be simple to write and easy to maintain.

These goals are often at odds; you have to find a balance between them. So let's start with a typical implementation. For illustration, we will unroll the code a little bit, so that we can talk about different features as we go. We will start with a brute-force implementation, then refine it from there.

The Brute Force Approach

A brute-force implementation of hashCode factors in all of the fields of the object that are relevant to equals. First, as we discussed with equals in the last column, you grab your superclass's hashCode.

public int hashCode() {
  int result = super.hashCode();

However, just as with equals, we have to avoid a pitfall right here. If you are inheriting directly from Object, you mustn't do this (unless you didn't define equals!). If you did, you would break the Synchronization with Equality principle, as we discussed. So when inheriting directly from Object, we will just start with zero.

  int result = 0;

We then factor in all of the fields of the object that are relevant to equals. Typically, this means all of the non-transient fields. There is a fairly simple way to factor these in that distributes them fairly evenly. For each successive field, multiply the current result by a prime, then add in the appropriate value for the field. A nice prime to use is 1,000,003. It may seem simpler to just add up successive values without multiplying, or just shift the result instead of multiplying, but these methods don't give very good distributions in general.

Fields of most primitive types, this looks like the following:

  result = PRIME*result + primitiveField;

Boolean fields can just be tested:

  result = PRIME*result + (flagField ? 1 : 0);

Long fields can just be converted to two ints and factored in:

  result = PRIME*result + (int)(longField >>> 32);
  result = PRIME*result + (int)(longField & 0xFFFFFFFF);

Double fields can just be converted to a long, then handled like a long field.

  long temp = Double.doubleToLongBits(doubleField);
  result = PRIME*result + (int)(temp >>> 32);
  result = PRIME*result + (int)(temp & 0xFFFFFFFF);
Ed note: make the following a sidebar:

Surprises!

Floats and doubles lay a subtle trap for equality and hashing. For brevity, we'll let bits(x) be short for Double.doubleToLongBits(x) or Float.floatToIntBits(x), depending on whether x is a double or a float. Normally, x == y if and only if bits(x) == bits(y). This rule breaks in two particular cases, with negative zero and NaN values such as 0.0/0.0:

x y x == y bits(x) == bits(y)
0.0 -0.0 YES NO
NaN NaN NO YES

If your object has fields that could conceivably have these values, then you must be careful. Either use  bits(x) == bits(y) in your equals, or in your hashCode use one of the following according to the type of x.

(x == 0.0F ? 0 : Float.floatToIntBits(x))
(x == 0.0 ? 0L : Double.doubleToLongBits(x))

Non-primitive fields (except for arrays) can just use their hashCode value.

  result = PRIME*result
  + objectField.hashCode();

However, they take special handling if they might be null:

  if (maybeNullField != null) {
    result = PRIME*result
    + maybeNullField.hashCode();
  }

Arrays must be handled specially: do not call their hashCode method, since you will once again get the wrong answer. Instead, the brute force approach walks through each element in the array, and adds in its value. Depending on the base type of the array, you will use one of the techniques we discussed here: either add in the value directly, convert to a number, or walk through subarrays. For example:

  for (int i = 0; i < arrayField.length; ++i) {
    result = PRIME*result
    + arrayField[i].hashCode();
  }

At the very end, return the result we've been accumulating:

  return result;
}

We are now done with the brute-force approach. You can see the results of what we have done so far in Listing 1. You can also make this simpler by using a few utility methods, so instead of taking different approaches depending on the type, you can use a uniform, simpler approach. A sample of the utility methods is shown in Listing 2, with the simpler hashCode implementation in Listing 3.

The Kinder, Gentler Approach

The brute-force process works fine, and is straightforward to implement. It's ok for simple classes with a small number of fields and only small arrays or collections. However, its performance may leave something to be desired for larger classes. Generally, a pretty good distribution can be achieved with a lot less work than visiting every possible field. Fields that don't vary much between instances can simply be skipped. For others, think about (or measure) the cost of getting the hashCode values from the superclass, from object fields, and from arrays, compared to the value of better distribution.

It may be sufficient to simply probe a few values in an array, for example, rather than to scan every element. The following example factors in the length of the array, then adds in values for just the indices that are powers of two. In a thousand-element array, this only takes 10 probes. Where the array values vary significantly from object to object, this may be plenty to get a good distribution.

  result = PRIME*result + arrayField.length;
  for (int i = 0; i < arrayField.length; i <<= 1) {
    result = PRIME*result + arrayField[i].hashCode();
  }

Other techniques can be used to get more probes without using every value. Other techniques can be used to get more probes without using every value. For example, the following probes the first 16 and last 16 values, but gradually skips more and more values between. It makes 130 probes in a thousand elements, and 206 probes in ten thousand.

  int last = arrayField.length - 1;
  int i = 0, j = last;
  result = PRIME*result + last;
  for (; i < j; i = 17*(i + 1) >> 4, j = last - i) {
    result = PRIME*result + arrayField[i].hashCode();
    result = PRIME*result + arrayField[j].hashCode();
  }
  if (i == j) {
    result = PRIME*result + arrayField[i].hashCode();
  }

However, if all of the values might be important, you must be very careful not to skip over them.

So here are the things to remember about implementing hashCode. First, with very few exceptions you always want to do it. Second, remember that you have to choose different approaches whether you extend Object or not. Third, for simple classes just use the brute-force approach. Finally, for more complex classes, you will want to pick and choose the values that you factor into the implementation to speed it up.

So far we have let you off easy — hashCode was pretty simple to do, as long as you remember these few points. We'll now turn to clone, which is not quite so simple.

Ed note: make the following a sidebar:

Warning: Be careful when using collections if you need to change the value of a key. Otherwise your collection becomes corrupt! For example, suppose at sometime you have put a Point into a hash table:

myHashtable.put(myPoint, someValue);

Later on, you can't just change the value of the key:

myPoint.move(1,2); // DANGER, WILL ROBINSON!

Instead, you must first remove the key from the table, and then re-enter the pair after you change the value of the key, as follows:

Object oldValue = myHashtable.get(myPoint);
myHashtable.remove(myPoint);
myPoint.move(1,2);
myHashtable.put(myPoint, oldValue);

The Java API documentation is very clear on this, but it never hurts to repeat this warning, since it leads to nasty, hard-to-find bugs.

Cloning

Implementing clone allows other programmers to use your objects as fields and to safely implement getters, setters, and clone themselves. For more details on this, see my April, 1999 column: "Immutables". As a general rule, you really should provide a clone operator for all of your classes. However, suppose you are feeling lazy, and want to get away with the absolute minimum. You do not need to provide a clone method if your superclass does not implement a public clone method, and your object falls under one of the following cases:

We will see the reasons for these conditions as we show how to implement a clone method. You'll see, for example, that if you don't implement clone, it can be very unpleasant for clients who have to use your class in implementing their clone method.

Requirements

The only strict principles that you must follow for clone are:

Clone Equality
If we set y = clone(x), then x.equals(y).
 
Clone Independence
If we set y = clone(x), then no method on y can cause the value of x to be modified.

This is what is known as a safe clone. There are cases where it may make sense to provide what is called a shallow clone, especially with collection classes. Such a shallow clone only duplicates the top-level structure of the object, not the lower levels. A shallow clone is useful in many circumstances so long as programmers can somehow still implement a deep clone on top of those objects. Ideally, such classes would implement both with different names.

A safe clone is different than a deep clone, which copies the object and all of its fields recursively. A safe copy does not need to copy any field that is irrelevant to Clone Independence. For example, a String field does not need to be cloned. Because Strings are immutable, even if one String is shared by two objects, the contents of the string cannot be changed, and thus cannot break Clone Independence. Similarly, if a field cannot be altered through the class API, then it can be shared between two objects without breaking Clone Independence.

A safe clone for a class X is often required in implementing another class Y that has a field of that type X. It is generally required where Y is immutable (for more information, see my April 1999 JavaReport column "Immutables", archived on http://www.macchiato.com/). It is also often required even where Y is mutable, such as in GridBagLayout.setConstraints().  You will see it used in those cases in the Java 2D implementations. On the other hand, a deep clone — as opposed to a safe clone — is rarely ever required.

If your class is immutable and all its subclasses are immutable, you don't need to supply a clone method; your class is automatically safe. On the other hand, it is trivial to supply a safe clone operator as a convenience for your clients, as shown below. Since the class is immutable, both of the above principles are handled trivially!

public Object clone () {
  return this;
}
Ed note: make the following a sidebar:

Immutables

As we have remarked in previous columns, it is very unfortunate that while the Java programming model depends very heavily on immutability in lieu of const, the Java language provides no support for it:

  • You can't determine programmatically whether an object is immutable or not.
  • You can't insure that subclasses of an immutable class are themselves immutable.

Clone vs. Copy Constructor

We'll take a moment to point out that the clone method is very different than the so-called copy constructor. A copy constructor is one that takes object of its own type as a single parameter. For example, Dimension has a copy constructor, so we can call it as follows:

Dimension dimension2 = new Dimension(dimension1);

This works out fine if the class we are working with is final. If it is not — and the variable is actually an instance of a subclass — then we will get the wrong answer; what is called a slice. For example, suppose that dimension1 is actually an instance of a subclass NewDimension:

class NewDimension extends Dimension {
  private int seconds;
  public NewDimension(int width, int height, int seconds) {...}
  ...
}

If Dimension had a clone operator, then the following would produce a correct copy, with the right subclass (NewDimension). The copy constructor cannot! It can only produce a slice in that case, discarding all the subclass information.

Dimension dimension2 = dimension1.clone();

Implementing Clone

Here is an example of how to correctly implement clone, with the different cases that you may be faced with. As before, the code is unrolled slightly to make it easier to walk through. First, make your object implement Cloneable. Next, set up the implementation of clone with a try-catch block for CloneNotSupportedException. This is just boilerplate code you generally have to include because of the way Object.clone() is defined. Trust me on this one; I could tell you all the gory details, but I'd run over my word limit!

Then, as the first line you call super.clone(). Normally that is not a problem, but if your superclass did not implement clone, it is. You'll have to emulate the missing clone method, if possible. (We'll talk about this in a bit.) Here is what you have so far:

public Object clone() {
  try {
    MyClass result = (MyClass) super.clone();
    // more code to go here (possibly)
    return result;
  } catch (CloneNotSupportedException e) {
    return null; // never invoked
  }
}

The call to super.clone() copies your superclass's fields, and makes bitwise copies of your fields. Because of this, you do not have to copy any primitive or immutable fields (such as String) in the rest of your code. So if your object doesn't contain any mutable objects, you are done! For example, for the Currency class we defined above, the above would be sufficient. As a matter of fact, if your superclass defines a public clone operator, and you only have primitive or immutable fields, you don't even have to define the clone operator at all; the superclass implementation will do the right thing.

If your object contains mutable objects (neither primitive nor immutable), you will have to add some more code just for them. These lines will go where we have the comment above. For well-mannered field objects, we just clone them.

 result.insets = (Insets) insets.clone();

Ideally, that's all we would have to do. Sadly, many classes are missing the clone method, or their clone method is not safe. If the author of the class for one of your fields was a lazy bum, and did not supply you with a clone method, it starts to get painful. You will have to emulate what the clone method would have done, if possible. If you know that the field object could not be a subclass, use the copy constructor if available.

 result.dimension = new Dimension(dimension);

If the field object does not have a copy constructor, see if you can construct it by extracting infomation from it and building a duplicate yourself.:

  borderLayout = new BorderLayout(
      borderLayout.getHgap(), borderLayout.getVgap());

If the class of the field is badly designed, you may not be able to get enough information from public getters to reconstruct a copy. In that case, you are just out of luck.

If the field could be a subclass, it really starts to get ugly. If you know it could only be one of a limited number of possible ones, then you can try to set up code to deal with that, checking for each type of class. If you cannot limit ahead of time the number of subclasses that your field might have, then you might have to try to use reflection to synthesize your own clone. This is a real pain in the ass, so just hope you don't run into this!

I'll leave this as a problem for the reader. If you feel like a challenge, send me your try; I'll print the best one in my next column.

As if that weren't annoying enough, we still have to handle arrays and collections. If these contain just immutables or primitives they're easy; we just clone them.

  result.array = (int[]) array.clone();
  result.vector = (Vector) vector.clone();

If some elements of the collection or array are mutable objects (e.g. neither immutable or primitive), then we have to do more work. Since collections and arrays don't have safe clones, only have shallow clones, not only do we have to clone them as above, but we have to clone all of their elements one by one.

  for (int i = 0; i < vector.size(); ++i) {
    result.vector.setElementAt(
      ((SomeType) vector.elementAt(i)).clone(), i);
  }
Ed note: make the following a sidebar. Make sure it is AFTER the above text.

Cloneable?

Inexplicably, the Cloneable interface does not make the clone method public. So while we would like to be able to write the following, we can't.

result.myVector.setElementAt(((Cloneable) myVector.elementAt()).clone(), i);

Instead, we have to cast it to some explicit class (if we know it!) or use reflection. (And reflection has much poorer performance than calling methods directly.) The only thing the Cloneable interface does do is enable access to Object.clone().

If the elements of the collection or array do not have clone methods then we have emulate a clone method, as we discussed above.

So here are the things to remember about implementing clone. First, as with equals and hashCode, with few exceptions you always want to do it. Second, you always call your superclass's clone method. Third, you only need to worry about fields that are mutable objects. Fourth, you generally just call their clone operators, though you are in for some work if those operators don't exist, or if you have arrays or collections of mutable objects. Fifth, if you ever add any fields to your class, make sure you add appropriate code to the clone method for them! The sample code we've been working on is in Listing 4.

Although it may seem that equals, hashCode and clone are unnecessarily annoying to have to deal with, once you recognize their pitfalls in the majority of cases they are pretty straightforward. And if you are feeling overwhelmed, just keep repeating to yourself "It's much, much worse in C++, it's much, much worse in C++,..."

Ed note: make the following a sidebar.

Readers Write

Q: From Robert Allan Schwartz: I was very interested to read your article "Liberte, equalite, fraternite". If I understand correctly, then classes that inherit directly from Object must implement equals() differently from classes that don't inherit directly from Object.

The downside is that the same "different" implementation of equals() must now be repeated on many classes. I have a suggestion. Create a subclass of Object called EqualsImplementor. EqualsImplementor is now the only place that must have a "different" implementation of equals(). Everyone who used to inherit from Object should now inherit from EqualsImplementor. What do you think?

A: That is a possible approach. You would have to have EqualsImplementor.equals() always return true, and EqualsImplementor.hashCode() always return a constant (such as zero). You also have to make it abstract, or hide the constructor. If you did that, then at the expense of always having an extra object in your heirarchies, you could always call the superclass's equal and hashCode in your equals code. You still have to override equals in the subclasses, but your code is slightly more uniform. I don't know that it is worth it.

Ideally, Java would not have put either equals or hashCode on Object; they would have been in separate interfaces, as Comparable is in Java 2. But that's water under the bridge.


Q: From Steven M Rosenthal: Your Java Report (January 2000) column on equals() was very useful, but overlooks the issue of thread-safe implementation strategies in equals() functions. Consider:

public class Complex {
  private double real_;
  private double imag_;
  ...
  public boolean equals(Object object) {
    if (this == object) return true;
    if (object == null) return false;
    if (object.getClass() != getClass())
      return false;

    Complex cx = (Complex) object;
    return cx.real_ == this.real_
        && cx.imag_ == this.imag;
  }
}
This isn't thread-safe-- if either this or object are mutated during the execution of equals(), then an inconsistent answer can be returned. [...] Good solution:
 public boolean equals(Object object) {
    // boilerplate
    ...
    Complex cx = (Complex) object;
    double r = 0.0;
    double i = 0.0;
    synchronized(this) {
      r = this.real_;
      i = this.imag_;
    }
    synchronized(cx) {
      return cx.real_ == r
          && cx.imag_ == i;
    }
  }
Best solution: Design class Complex to be immutable. Am I missing something, or do the issues of thread-safety permeate even the simplest of situations such as equals()? I'd be interested in having your thoughts on this question.

A: In our experience, it causes both performance problems and unnecessary risk of deadlocks to try to make every method call thread-safe. It also doesn't really protect against threading problems, as the following illustrates:

if (foo.equals(fii)) { // line 1
  bar.resetTo(foo);    // line 2
}
Even if foo.equals() and bar.resetTo() are both thread-safe, a change in foo can occur between line 1 and line 2, thus invalidating the code. Synchronized methods don't protect code in collections either; if I were copying from one Vector to another, I would have to synchronize the copying code, not just depend on the methods being synchronized.

Synchronization has a price: at too low a level it will also cause performance problems; witness how the Java 2 collections now allow a choice of synchronization for just that reason. Synchronization also wastes cycles when an object is really intended to be used only within a single thread, such as StringBuffer. For that reason, we have found it more useful to synchronize at a higher level.

Immutables are possible, but also can have severe performance impacts depending on their intended usage scenarios. The alternative in this example is to either not share Complex objects among several threads, or to synchronize the code that actually uses them.


Listing 1: HashCode Sample
public int hashCode() {
  int result = super.hashCode(); // PICK ONE
  //int result = 0;              // PICK ONE
  result = PRIME*result + primitiveField;
  result = PRIME*result + (flagField ? 1 : 0);
  result = PRIME*result + (int)(longField >>> 32);
  result = PRIME*result + (int)(longField & 0xFFFFFFFF);
  
  long temp = Double.doubleToLongBits(doubleField);
  result = PRIME*result + (int)(temp >>> 32);
  result = PRIME*result + (int)(temp & 0xFFFFFFFF);

  result = PRIME*result + objectField.hashCode();
  if (maybeNullField != null) {
    result = PRIME*result + maybeNullField.hashCode();
  }
  for (int i = 0; i < arrayField.length; ++i) {
    result = PRIME*result + arrayField[i].hashCode();
  }
  return result;
}

Listing 2: HashCode Utilities
final class Utility {

 static final int hashCode(int source, boolean x) {
  return PRIME*source + (x ? 1 : 0);
 }
 
 static final int hashCode(int source, int x) {
  return PRIME*source + x;
 }
 
 static final int hashCode(int source, long x) {
  return PRIME*source + (int)(PRIME*(x >>> 32)
    + (x & 0xFFFFFFFF));
 }
 
 static final int hashCode(int source, float x) {
  return hashCode(source, 
   x == 0.0F ? 0 : Float.floatToIntBits(x));
 }
 
 static final int hashCode(int source, double x) {
  return hashCode(source,
   x == 0.0 ? 0L : Double.doubleToLongBits(x));
 }
 
 static final int hashCode(int source, Object x) {
  return hashCode(source, 
   x == null ? 0 : x.hashCode();
 }

 static final int hashCode(int source, Object[] x) {
  for (int i = 0; i < x.length; ++i) {
    source = hashCode(source, x[i]);
  }
  return source;
 }
 // repeat for arrays of primitives,
 // arrays of arrays,... as required
}

Listing 3: Simplified HashCode Sample
public int hashCode2() {
  int result = super.hashCode(); // PICK ONE
  //int result = 0;              // PICK ONE
  result = Utility.hashCode(result,primitiveField);
  result = Utility.hashCode(result,flagField);
  result = Utility.hashCode(result,longField);
  result = Utility.hashCode(result,doubleField);
  result = Utility.hashCode(result,objectField);
  result = Utility.hashCode(result,maybeNullField);
  result = Utility.hashCode(result,arrayField);
  return result;
}

Listing 4: Clone Sample
public Object clone() {
  try {
    MyClass result = (MyClass) super.clone();
    result.insets = (Insets) insets.clone();
    // result.dimension = (Dimension) dimension.clone();
    result.dimension = new Dimension(dimension); // SLICE?
    borderLayout = new BorderLayout(
      borderLayout.getHgap(), borderLayout.getVgap());
    result.array = (int[])array.clone();
    result.vector = (Vector) vector.clone();
    for (int i = 0; i < vector.size(); ++i) {
      result.vector.setElementAt(
        ((SomeType) vector.elementAt(i)).clone(), i);
    }
    return result;
  } catch (CloneNotSupportedException e) {
    return null; // never invoked
  }
}

Thanks to Dwight Duego, Arnaud Le Hors, Pat O'Connor, Richard Gillam, Josh Bloch, and John Raley for feedback on drafts of this article.

Copyright (c) 1999, Mark Davis, All rights reserved.