Three problems that a framework like Hadoop solves (or why go with Hadoop):
1. First, dividing the work into equal-size pieces isn't always easy or obvious. In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.
2. Second, combining the results from independent processes may need further processing. In this case, the result for each year is independent of other years and may be combined by concatenating all the results and sorting by year. If using the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We'll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year (a small sketch of this chunk-and-combine step follows the list).
3. Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that's it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling into the category of coordination and reliability. Who runs the overall job? How do we deal with failed processes?
So, although it’s feasible to parallelize the processing, in practice it’s messy. Using a framework like Hadoop to take care of these issues is a great help.
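To make the first two points concrete, here is a minimal, single-machine sketch (in Python, not Hadoop) of the fixed-size-chunk approach: each chunk is processed independently to get a per-year maximum temperature, and the per-chunk maxima are then merged. The file name, the tab-separated year/temperature record format, the chunk size, and the helper names are all illustrative assumptions, not part of the original example.

import os
from collections import defaultdict
from multiprocessing import Pool

CHUNK_SIZE = 64 * 1024 * 1024  # an arbitrary fixed chunk size (64 MB)

def split_into_chunks(path, chunk_size=CHUNK_SIZE):
    # Yield (start, length) byte ranges of roughly fixed size. A real
    # splitter would align ranges to record boundaries; that detail is
    # omitted to keep the sketch short.
    size = os.path.getsize(path)
    for start in range(0, size, chunk_size):
        yield start, min(chunk_size, size - start)

def max_temps_for_chunk(args):
    # Return {year: max_temperature} for one byte range of the file.
    path, start, length = args
    maxima = defaultdict(lambda: float("-inf"))
    with open(path, "rb") as f:
        f.seek(start)
        text = f.read(length).decode("utf-8", errors="replace")
    for line in text.splitlines():
        try:
            year, temp = line.split("\t")
            maxima[year] = max(maxima[year], int(temp))
        except ValueError:
            continue  # skip partial or malformed lines at chunk edges
    return dict(maxima)

def combine(per_chunk_results):
    # The answer for a year that spans several chunks is the highest of
    # its per-chunk maximums.
    combined = {}
    for result in per_chunk_results:
        for year, temp in result.items():
            combined[year] = max(combined.get(year, temp), temp)
    return combined

if __name__ == "__main__":
    path = "temperatures.txt"  # hypothetical tab-separated year/temperature file
    chunks = [(path, start, length) for start, length in split_into_chunks(path)]
    with Pool() as pool:  # one OS process per worker, chunks shared among them
        results = pool.map(max_temps_for_chunk, chunks)
    for year, temp in sorted(combine(results).items()):
        print(year, temp)

Even this toy version has to worry about records that straddle chunk boundaries and about merging partial results, and it still runs on a single machine. Scaling it out adds the coordination and failure-handling problems from point 3, which is exactly the bookkeeping a framework like Hadoop takes care of.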