How Does Git Find Changes? (The Diff Algorithm)
git diff is a command every developer uses daily.
Isn't it amazing how it instantly finds exactly which lines were added or deleted among thousands of lines of code?
Is it simply comparing two files line by line? If so, inserting a single line in the middle would mark all subsequent lines as "changed."
Git uses a smart method called the Myers Diff Algorithm to find the "minimal set of changes."
1. Edit Distance
The problem of calculating the difference between two strings (or files) boils down to finding the "minimum number of edits required to transform A into B." This is called Edit Distance.
For example, to change ABC to ABD:
- Delete
C - Add
D(Edit Distance: 2)
Git performs this calculation not character by character, but line by line.
2. Longest Common Subsequence (LCS)
The core of Myers algorithm is finding the Longest Common Subsequence (LCS) between two files.
Once the LCS is found, the rest is easy.
- Parts of A not in the LCS are Deleted (-).
- Parts of B not in the LCS are Added (+).
For example:
- File A:
1 2 3 4 - File B:
1 3 5 4
Here, the LCS is 1 3 4.
Therefore, we can deduce that 2 was deleted and 5 was added.
3. The Magic of Myers Algorithm
Published by Eugene W. Myers in 1986, this algorithm transforms the LCS problem into a graph traversal problem (finding the shortest path) to solve it very efficiently.
Git uses this algorithm by default, but also offers other algorithms like histogram or patience as options.
Conclusion
Behind the + and - signs we take for granted lies an elegant mathematical algorithm proven over 30 years.
If you find
git diffoutput confusing or messy, trygit diff --algorithm=histogram. It often produces results that better reflect the code's structure.
Explore Related Tools
Try these free developer tools from Pockit