MIT Distributed Systems : Lecture 1

May 15, 2025

Lecture 1

" If you can solve it without a distributed system then you should. Try everything else first, and only then try distributed systems, because they can get difficult. "

Why people use Distributed Systems
  • parallelism
  • fault tolerance (more machines available, so one failure need not take the service down)
  • security and isolation (code you do not trust, or that may have bugs, can be split off to run on a different machine rather than on the main system)

Challenges
  • you have multiple pieces and networks, so you can face very different kinds of unexpected errors, like partial failures, which are much more difficult to deal with than errors on a single machine.
  • careful design is needed for the system to give you the performance you expect; a distributed system is not magic that will improve performance 1000x.

Raft for fault tolerance (ofc -> out of context)

we will talk about systems that provide these 3 types of infrastructure:

  • storage
  • computation systems (like MapReduce)
  • communication

" The dream is to build systems that look like non-distributed systems to the user but are actually vast, high-performance, fault-tolerant systems underneath. "

Implementation

for implementation we use

  • RPC - remote procedure calls, whose goal is to mask the fact that we are communicating over an unreliable network
  • Threads - allow us to harness multicore computers and to structure concurrent operations
  • Concurrency control, like locks
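
A minimal sketch of the threads-plus-locks idea, written in Python purely for illustration (the function name `run_counter` and its parameters are made up for this example, not from the lecture). Several threads increment one shared counter, and a lock keeps each read-modify-write step from interleaving with another thread's. Note that in CPython this shows concurrency control rather than a real multicore speedup.

```python
import threading


def run_counter(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # The lock makes the read-modify-write of `counter` atomic
            # with respect to the other worker threads.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before reading the result
    return counter


if __name__ == "__main__":
    print(run_counter())
```

Without the lock, two threads could read the same old value and both write back old + 1, losing an update. That is the same kind of race that shows up between machines in a distributed system.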

Performance

the usual goal of a distributed system is scalability, e.g. 2x computers -> 2x throughput. The best system designs are the ones that scale just by adding more of the same resources (money solves the problem instead of developers).

Fault Tolerance

distributed systems turn faults such as a failing network, or almost any other machine problem, from something rare on a single server into something very common and constant: with lots of machines, something will almost certainly fail. All the best :)

  • Availability : Is the system up and running and able to respond to requests. This is a measure of the system's uptime and responsiveness.
  • Recoverability : if something goes wrong the system stops working and waits for repair, but once the repair is done it carries on as if nothing ever happened.
  • NV storage : non-volatile storage, a kind of storage that retains data even when the computer is powered off. This is different from volatile memory, which requires a constant power supply to retain data. [expensive]
  • Replication : stores copies of the data on multiple machines. A too-common problem is that the replicas drift out of sync and no longer hold the same data.
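
To see why failures go from rare to constant as the number of machines grows, here is a small back-of-the-envelope sketch. It assumes each machine fails independently with the same probability in some time window. The 0.001 figure is an arbitrary illustration, not a number from the lecture, and `prob_any_failure` is a hypothetical helper name.

```python
def prob_any_failure(p_single: float, n_machines: int) -> float:
    """Probability that at least one of n independent machines fails.

    Uses 1 - (1 - p)^n, which assumes failures are independent and
    every machine has the same failure probability p.
    """
    return 1.0 - (1.0 - p_single) ** n_machines


if __name__ == "__main__":
    # Illustrative per-machine failure probability (an assumption).
    p = 0.001
    for n in (1, 10, 100, 1000):
        print(n, round(prob_any_failure(p, n), 4))
```

Under these assumptions a fault that is a one-in-a-thousand event on a single server becomes more likely than not once there are around a thousand servers.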

Consistency

let's say we have a multi-server key-value store with operations like get and put. While these operations sound simple on a single server, they can run into consistency issues in a distributed system with multiple servers. For example:

we have 2 servers, the second being a replica of the first and holding exactly the same data. Let's say the key-value store has just 1 value, (1, 20), in both tables

Key    Value
1      20
2      21

now suppose a client sends an update request: it first updates the value on server 1 to (1, 21), but before it can update server 2, the connection or a server crashes, so server 2 still holds (1, 20). Hence a consistency issue: when a user later sends a get request, sometimes they get (1, 21) and sometimes (1, 20). BAD
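
The scenario above can be sketched in a few lines of Python. This is a toy in-memory model, not a real replicated store: `ReplicatedStore`, the `crash_after_first` flag, and the `replica` argument are all hypothetical names invented for illustration.

```python
class ReplicatedStore:
    """Two in-memory replicas with a naive, non-atomic put (toy model)."""

    def __init__(self) -> None:
        # Both replicas start with the same single entry, key 1 -> 20.
        self.replicas = [{1: 20}, {1: 20}]

    def put(self, key: int, value: int, crash_after_first: bool = False) -> None:
        self.replicas[0][key] = value
        if crash_after_first:
            # Simulate the update being cut off before replica 2 sees it.
            return
        self.replicas[1][key] = value

    def get(self, key: int, replica: int) -> int:
        # A real client would not choose the replica; here it is explicit
        # so the divergence is easy to observe.
        return self.replicas[replica][key]


if __name__ == "__main__":
    store = ReplicatedStore()
    store.put(1, 21, crash_after_first=True)
    print(store.get(1, replica=0))  # 21
    print(store.get(1, replica=1))  # 20
```

Which value a reader sees depends only on which replica happens to answer, which is exactly the inconsistency described.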

  • we can get strong consistency here by assigning a reader that reads all the copies and always makes sure every copy has the most recent change, BUT THAT IS EXPENSIVE
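
One way to picture such a reader, as a rough sketch only: it assumes each value is stored with a version number so the reader can tell which copy is newest. The notes do not say how "most recent" is decided, so the versioning and the helper name `read_latest` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# (version, value); the version number is an assumption of this sketch.
Versioned = Tuple[int, int]


def read_latest(replicas: List[Dict[int, Versioned]], key: int) -> int:
    """Read `key` from every replica, return the newest value,
    and overwrite any stale copies with it.

    Raises ValueError if no replica holds the key.
    """
    # Tuples compare by version first, so max() picks the newest copy.
    newest = max(r[key] for r in replicas if key in r)
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # repair the stale or missing copy
    return newest[1]


if __name__ == "__main__":
    server1 = {1: (2, 21)}  # saw the update
    server2 = {1: (1, 20)}  # missed the update
    print(read_latest([server1, server2], 1))
    print(server2[1])
```

The cost is visible in the loop: every single read has to contact every replica, and possibly write to some of them, which is why this approach is expensive.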

Outro

These are my raw notes from MIT 6.824 Distributed Systems. If you found this helpful or want to discuss distributed systems further, feel free to reach out on Twitter / X or LinkedIn.