There are several notable differences between the two APIs:
• The new API favors abstract classes over interfaces, since these are easier to evolve.
This means that you can add a method (with a default implementation) to an
abstract class without breaking old implementations of the class. For example,
the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package (and subpackages).
The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The new Context, for example, essen-
tially unifies the role of the JobConf, the OutputCollector, and the Reporter from
the old API.
• In both APIs, key-value record pairs are pushed to the mapper and reducer, but in
addition, the new API allows both mappers and reducers to control the execution
flow by overriding the run() method. For example, records can be processed in
batches, or the execution can be terminated before all the records have been pro-
cessed. In the old API this is possible for mappers by writing a MapRunnable, but no
equivalent exists for reducers.
• Job control is performed through the Job class in the new API, rather than the old
JobClient, which no longer exists in the new API.
• Configuration has been unified. The old API has a special JobConf object for job
configuration, which is an extension of Hadoop’s vanilla Configuration object
(used for configuring daemons; see “The Configuration API” on page 144). In the
new API, job configuration is done through a Configuration, possibly via some of
the helper methods on Job.
• Output files are named slightly differently: in the old API both map and reduce
outputs are named part-nnnnn, whereas in the new API map outputs are named
part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer
designating the part number, starting from zero).
• User-overridable methods in the new API are declared to throw
java.lang.InterruptedException. This means that you can write your code to be
responsive to interrupts so that the framework can gracefully cancel long-running
operations if it needs to.
• In the new API, the reduce() method passes values as a java.lang.Iterable, rather
than a java.lang.Iterator (as the old API does). This change makes it easier to
iterate over the values using Java’s for-each loop construct:
for (VALUEIN value : values) { ... }
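The difference between the two iteration styles can be illustrated with plain Java, independent of Hadoop. This is a minimal sketch (the method and class names here are illustrative, not Hadoop's actual types): an Iterable parameter works directly with for-each, whereas an Iterator must be driven by hand with hasNext() and next(), as old-API reducers had to do:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IterableDemo {

    // Old-API style: the values arrive as an Iterator,
    // which must be advanced explicitly.
    static int sumWithIterator(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: the values arrive as an Iterable,
    // so the for-each construct can be used directly.
    static int sumWithIterable(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3);
        System.out.println(sumWithIterator(values.iterator())); // prints 6
        System.out.println(sumWithIterable(values));            // prints 6
    }
}
```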
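The evolvability argument from the first bullet can also be sketched in plain Java. The class names below are hypothetical stand-ins, not Hadoop's real Mapper: a method with a default implementation is added to an abstract class in a later release, and a pre-existing subclass that knows nothing about it still compiles and runs (with an interface, every implementor would have had to change):

```java
// A library-style abstract base class. A new method with a default
// implementation can be added later without breaking subclasses.
abstract class Mapper {
    abstract void map(String key, String value);

    // Hypothetically added in a later release; existing subclasses
    // still compile because they inherit this default implementation.
    void setup() {
        // no-op by default
    }
}

// Written against the original release, before setup() existed.
class WordMapper extends Mapper {
    int calls = 0;

    @Override
    void map(String key, String value) {
        calls++;
    }
}

public class AbstractClassDemo {
    public static void main(String[] args) {
        WordMapper m = new WordMapper();
        m.setup();           // inherited default, no code change needed
        m.map("k", "v");
        System.out.println(m.calls); // prints 1
    }
}
```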