- make all your data immutable
- index snapshots of your data using "purely functional" batch computing (e.g. MapReduce)
- index the realtime data arriving after the last snapshot using an incremental computing system
- for querying, merge the index of the last batch job with that of the incremental system
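
The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Marz's implementation: the `Event` type, the per-key counting "index", and the names `batch_index`, `RealtimeIndex`, and `query` are all hypothetical, and a plain function stands in for a MapReduce job.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """An immutable fact (hypothetical schema): something happened at a time."""
    timestamp: int
    key: str


def batch_index(events: Iterable[Event], snapshot_time: int) -> Counter:
    """Batch layer: a pure function of the snapshot, recomputed from scratch."""
    return Counter(e.key for e in events if e.timestamp <= snapshot_time)


class RealtimeIndex:
    """Realtime layer: incrementally indexes only events after the snapshot."""

    def __init__(self, snapshot_time: int) -> None:
        self.snapshot_time = snapshot_time
        self.counts: Counter = Counter()

    def update(self, event: Event) -> None:
        if event.timestamp > self.snapshot_time:
            self.counts[event.key] += 1


def query(key: str, batch_view: Counter, realtime_view: RealtimeIndex) -> int:
    """Query: merge the batch index with the incremental index."""
    return batch_view[key] + realtime_view.counts[key]


# The log is a tuple of immutable events; the last snapshot was taken at time 3.
log = (Event(1, "a"), Event(2, "b"), Event(3, "a"), Event(5, "a"))
batch_view = batch_index(log, snapshot_time=3)

realtime_view = RealtimeIndex(snapshot_time=3)
for event in log:
    realtime_view.update(event)

total_a = query("a", batch_view, realtime_view)
total_b = query("b", batch_view, realtime_view)
```

Here the merge is a simple sum because counts are additive; other kinds of index would need their own merge function.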
As Marz explains, incremental computing is far more complex than batch computing. Confining it to the last few hours of data removes most of that complexity burden:
Since the realtime layer only compensates for the last few hours of data, everything the realtime layer computes is eventually overridden by the batch layer. So if you make a mistake or something goes wrong in the realtime layer, the batch layer will correct it. All that complexity is transient.
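
To make the "transient" point concrete, here is a toy Python sketch in which the realtime layer has a deliberate bug. Everything in it is an assumption made for illustration: the event log is a tuple of keys, the snapshot is a position in that log, and `buggy_realtime_view` is a hypothetical layer that double-counts.

```python
from collections import Counter

# Immutable log of events, identified only by key (hypothetical data).
EVENTS = ("a", "a", "b", "a")


def batch_view(log: tuple, snapshot: int) -> Counter:
    """Batch layer: recompute the counts for log[:snapshot] from scratch."""
    return Counter(log[:snapshot])


def buggy_realtime_view(log: tuple, snapshot: int) -> Counter:
    """Realtime layer with a deliberate bug: every event is counted twice."""
    view: Counter = Counter()
    for key in log[snapshot:]:
        view[key] += 2  # bug: should be += 1
    return view


def query(key: str, log: tuple, snapshot: int) -> int:
    """Merge the batch view with the (buggy) realtime view."""
    return batch_view(log, snapshot)[key] + buggy_realtime_view(log, snapshot)[key]


# The batch job has only covered the first two events: the bug is visible.
before = query("a", EVENTS, snapshot=2)

# The next batch job covers the whole log: the realtime result is discarded.
after = query("a", EVENTS, snapshot=4)
```

The true count for `"a"` is 3. While the realtime layer still covers the tail of the log, the query returns 4; once the batch layer catches up it returns 3, with no repair of the realtime layer's state needed.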