Back

How Does Git Find Changes? (The Diff Algorithm)

git diff is a command every developer uses daily.

Isn't it amazing how it instantly finds exactly which lines were added or deleted among thousands of lines of code?

Is it simply comparing two files line by line? If so, inserting a single line in the middle would mark all subsequent lines as "changed."

Git uses a smart method called the Myers Diff Algorithm to find the "minimal set of changes."

1. Edit Distance

The problem of calculating the difference between two strings (or files) boils down to finding the "minimum number of edits required to transform A into B." This is called Edit Distance.

For example, to change ABC to ABD:

  1. Delete C
  2. Add D (Edit Distance: 2)

Git performs this calculation not character by character, but line by line.

2. Longest Common Subsequence (LCS)

The core of Myers algorithm is finding the Longest Common Subsequence (LCS) between two files.

Once the LCS is found, the rest is easy.

  • Parts of A not in the LCS are Deleted (-).
  • Parts of B not in the LCS are Added (+).

For example:

  • File A: 1 2 3 4
  • File B: 1 3 5 4

Here, the LCS is 1 3 4.
Therefore, we can deduce that 2 was deleted and 5 was added.

3. The Magic of Myers Algorithm

Published by Eugene W. Myers in 1986, this algorithm transforms the LCS problem into a graph traversal problem (finding the shortest path) to solve it very efficiently.

Git uses this algorithm by default, but also offers other algorithms like histogram or patience as options.

Conclusion

Behind the + and - signs we take for granted lies an elegant mathematical algorithm proven over 30 years.

If you find git diff output confusing or messy, try git diff --algorithm=histogram. It often produces results that better reflect the code's structure.

TechGitAlgorithmDiff

Explore Related Tools

Try these free developer tools from Pockit