when you’re dealing with a couple of systems at most, the chance of one crashing due to hardware faults, software bugs, or internal issues is very low

but when you have scaled it up to 100s or 1000s of systems talking to each other

the probability that something, somewhere fails goes up by a couple of orders of magnitude

network cables failing, a component burning out, the power going out, etc
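
a rough sketch of the arithmetic behind this: assume every machine fails independently with the same small probability over some time window. both the independence and the 0.1% figure below are assumptions for illustration, not measurements, and `p_any_failure` is just a made-up helper name. the chance that at least one machine fails is then 1 - (1 - p)^n:

```python
def p_any_failure(p_single: float, n: int) -> float:
    """Probability that at least one of n machines fails,
    assuming independent failures with the same probability p_single."""
    return 1.0 - (1.0 - p_single) ** n


# assumed per-machine failure probability of 0.1% in the time window
P_SINGLE = 0.001

for n in (2, 100, 1000):
    print(f"{n:>5} machines -> {p_any_failure(P_SINGLE, n):.4f}")
```

under these assumptions the chance of seeing some failure goes from roughly 0.2% with 2 machines to roughly 10% with 100 and roughly 63% with 1000. real failures are often correlated (shared power, shared network), so treat this as intuition only.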

so in distributed systems, fault tolerance has to be kept in mind during design, and the system should be built on that basis, instead of trying to optimize the individual players in the system