If you’re using Hive, sooner or later you’ll need to create user defined functions (UDFs). Chances are such a function will use code that depends on the Guava library. And it is not that unlikely that the required Guava version will be newer than Hive’s. Then you’re running into trouble. Hopefully this article saves you much of the pain I had to suffer to make it work. Let’s make a simple UDF and fix it using both Maven and Gradle.
Example UDF - top private domain
As a concrete example, take a UDF that, given an internet host name, computes the top private domain. E.g. for the host name www.google.co.uk:
- co.uk is a public domain (suffix),
- www.google.co.uk and google.co.uk are private domains and
- google.co.uk is the highest (top) private domain in this case.
InternetDomainName.topPrivateDomain() already does this job. We need the latest Guava in order to have an up-to-date trie representing public domains. However, there’s a difference between the ancient Guava 11 and the latest Guava 18:
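A minimal sketch of the call in question, assuming Guava 18’s API:

```java
import com.google.common.net.InternetDomainName;

// Guava 18: toString() yields the plain domain name, e.g. "google.co.uk"
String top = InternetDomainName.from("www.google.co.uk").topPrivateDomain().toString();
// On the ancient Guava 11 bundled with Hadoop/Hive the same expression may not render
// the same string, and its embedded public suffix data is badly outdated.
```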
The whole UDF class looks like this:
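Here is a minimal sketch of such a class, assuming the classic org.apache.hadoop.hive.ql.exec.UDF API; the package, class name and null handling are illustrative:

```java
package com.example.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import com.google.common.net.InternetDomainName;

@Description(name = "top_private_domain",
    value = "_FUNC_(host) - returns the top private domain of the given host name")
public class TopPrivateDomainUDF extends UDF {

  public Text evaluate(Text host) {
    if (host == null) {
      return null;
    }
    try {
      InternetDomainName domain = InternetDomainName.from(host.toString());
      if (!domain.isUnderPublicSuffix()) {
        return null; // e.g. "localhost" or a bare public suffix such as "co.uk"
      }
      return new Text(domain.topPrivateDomain().toString());
    } catch (IllegalArgumentException e) {
      return null; // not a syntactically valid domain name
    }
  }
}
```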
In this case the incompatibility is just in the return value, but in other cases there might be runtime errors due to added/removed/changed classes, methods, etc.
Conflicts in library versions
So what’s the problem with Guava and Hive? Hadoop and Hive are in general packaged with an ancient or outdated version of Guava. E.g. Hive 0.14 available in CDH5 has Guava 11 and the current Hive 1.2 has Guava 15, while the most recent Guava version is 18. The bigger the version difference, the higher the chance of incompatibilities. And the problem is that more than one version of a library cannot be easily loaded within a JVM (at least in Hive).
However, there’s a trick. If there is a conflict between classes in the same package(s) (in this case com.google.thirdparty), why not rename the package(s)?
There’s one more catch with Hive UDFs. For some strange reason, when registering the UDF Hive sees only classes from the JAR that contains the UDF class and not from other specified JARs. This, along with renaming the packages, leads to the need for a fat JAR, i.e. a JAR containing all the necessary dependencies of the UDF class (or at least those that are shaded or not seen by Hive). Blindly packaging all transitive dependencies might result in a big bloated JAR, so we’d also like to minimize it to just the classes that are actually used by our UDF.
The overall plan is thus to:
- package all the dependencies of the UDF into a fat JAR
  - to make Hive see the other classes
- rename (shade) the Guava packages
  - both the packages of the Guava classes and their imports in our code
  - to prevent a conflict with Hive’s own Guava
- (optional) minimize the resulting fat JAR
  - remove unnecessary dependencies or even unused classes
We’ll explore how to do it using Maven’s maven-shade-plugin and Gradle’s Shadow plugin.
Maven - maven-shade-plugin
If you’re using Maven for building, there’s fortunately a plugin called maven-shade-plugin which can do both: create a fat JAR and rename packages.
In pom.xml, in the plugin’s configuration, just specify via the <relocation> tags which package(s) should be renamed to what. The old package name goes into <pattern> and the new one into <shadedPattern>:
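A minimal sketch of such a configuration; the plugin version and the shaded.* package prefix are illustrative, pick whatever prefix suits you:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
          <relocation>
            <pattern>com.google.thirdparty</pattern>
            <shadedPattern>shaded.com.google.thirdparty</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```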
In order to minimize the JAR, enable it via the following. Beware that this might remove classes that are used only via reflection. In such a case explicitly include the class via configuration.filters.filter.include; please find the details in the plugin’s docs.
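The switch is the plugin’s minimizeJar option inside the same <configuration> block (a minimal sketch):

```xml
<configuration>
  <minimizeJar>true</minimizeJar>
</configuration>
```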
By default the plugin produces the shaded JAR under the same name as the original and adds the prefix original- to the latter. If we want the opposite behaviour (mark the shaded JAR, e.g. with a -jar-with-dependencies suffix), we can do this via:
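One way to do this is to attach the shaded JAR as a secondary artifact with a classifier (a sketch; the classifier name is up to you):

```xml
<configuration>
  <shadedArtifactAttached>true</shadedArtifactAttached>
  <shadedClassifierName>jar-with-dependencies</shadedClassifierName>
</configuration>
```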
Then just build the JAR with $ mvn package and copy it to the machine(s) running Hive.
Gradle - shadow
The Gradle Shadow plugin is inspired by maven-shade-plugin. The basic configuration might look like this:
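A minimal sketch of a build.gradle; the Shadow plugin version, the Guava version and the shaded package prefix are illustrative:

```groovy
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '1.2.2'
}

repositories {
    mavenCentral()
}

dependencies {
    compile 'com.google.guava:guava:18.0'
}

shadowJar {
    relocate 'com.google.common', 'shaded.com.google.common'
    relocate 'com.google.thirdparty', 'shaded.com.google.thirdparty'
}
```

The fat JAR is then built with gradle shadowJar.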
Registering the UDF
Within Hive register the UDF like this:
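For example (the JAR path, function name and class name are illustrative and match the sketch above):

```sql
ADD JAR /path/to/hive-udf-1.0-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION top_private_domain AS 'com.example.udf.TopPrivateDomainUDF';
```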
The first command tells Hive to copy the JAR into the distributed cache and put it on the classpath, so that it is visible from all MapReduce tasks. The second one registers the UDF class under the given function name.
In case you just need to modify the classpath and give your JAR higher priority (without renaming packages), try to fiddle with these options:
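Commonly used knobs of this kind (listed as examples, not necessarily an exhaustive set) are the auxiliary JARs path and the user-classpath-precedence switches:

```
hive.aux.jars.path=/path/to/extra/jars     # or the HIVE_AUX_JARS_PATH environment variable
mapreduce.job.user.classpath.first=true    # user JARs win over Hadoop's own in MR tasks
HADOOP_USER_CLASSPATH_FIRST=true           # the analogous switch for the client-side scripts
```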