MapReduce and Distributed Filesystems
MapReduce is a "batch" processing model: it takes a fixed set of input and produces a fixed output.
Join Strategies in MapReduce
How do you join two large datasets (e.g., Users and Clicks)?
- Reduce-Side Join (Sort-Merge):
- Mappers extract the join key.
- Shuffle sorts and groups by that key.
- Reducer gets all records for that key (e.g., the User record and all their Clicks) and merges them.
- Map-Side Joins:
- Broadcast Hash Join: If one table is small (fits in memory), every mapper loads it into a hash map. Fast, no shuffle!
- Partitioned Hash Join: If both tables are partitioned by the same key, a mapper only needs to load the corresponding partition of the other table.
Handling Skew (Hot Keys)
If one key has millions of records (e.g., a "Celebrity" user), a single reducer will be overloaded. Solution: The "linchpin" strategy. Send the hot key records to a random set of reducers, and replicate the other side of the join to all those reducers.
Knowledge Check
Which join strategy is best if one of the tables fits entirely in memory?
Sort-Merge Join
Broadcast Hash Join
Remote Lookup Join