Comparing files is something developers do every once in a while. For example, comparing configuration files to see what is different in the other environment or compare programming files to see what has changed in the source code. Implementations of text comparison algorithms are therefore widespread and used in several fields. For instance, in blogs and content managements systems, one might need to know what was altered in an update of a text (in cms like systems) or a programmer in a team would like to see what changed in the source code (svn). Also a lot of (combined) search, spell checking, speech recognition and plagiarism detection software compare texts (strings) in a certain way. This article covers the Levenshtein distance algorithm and how to use it to indicate alterations to texts.
The are several ways to compare texts and find differences and similarity scores. For this article the similarity scores are not relevant because these scores are just numbers. We are interested in what is added, deleted or substituted in the transformation from text A to text B. In other words, we would like to mark the minimal number of primitive operations needed to transform A to B. To do this we’ll need the basics from the classic computer science problem: the longest common subsequence problem. The technique described hereunder which is derived from this problem is the Levenshtein distance algorithm. The algorithm was developed by Vladimir Levenshtein to replace the Hamming distance. The result of Levenshtein’s algorithm is exactly the minimal number of operations, but you can use the unpolished result of this algorithm to determine what parts of text were added, deleted or substituted. Originally this is done per character, but with a little tweak this can be changed to a per word, line or paragraph level function. When you’ve read this article you will know how this works (and the demo, which is referred to at the end). This demo takes two texts as input and outputs what was added, deleted and replaced.
For this article and the demo I used search results for some inspiration. A nice explanation already there is this description of the Levenshtein algorithm, as well as the Wiki page on it. For this article, let’s use two sample lines: “The brown dog jumped away from the sprinkler” and “The dog ran towards the green sprinkler”. Now we want to know with words were added, deleted or replaced in the transition from the first sentence to the second. To do this, let’s take a closer look on how the iterative process of the Levenshtein algorithm is executed.
- The first step is to contruct a matrix of n+1 by m+1, where n is the number of words in the first line and m the number of words in the second line.
- Secondly, fill the first row and column with (from top left to bottom or right) zero to respectively m and n.
- Now for each n, evaluate each m. If the evaluated word of n matches m, the cost is 0, otherwise it’s 2.
- Fill out cell (n, m) having (n, m) is the minumum of:
- The value of the cell above + 1
- The left neighbour cell value + 1
- The above left cell value + cost
For the Levenshtein distance it stops right here, when the iteration is completed. The distance is the value in the lower right cell. Here we are not interested in the Levenshtein distance itself, but in the matrix we’ve just constructed. The lowest cost route from bottom right to top left reveals information on what words have been added, deleted and/or substituted. To show how, we need to costruct the matrix with the algorithm above.
For the first cell (*), the words “The” versus “The” are equal, so the cost is 0. Now the minumum of the cell above + 1 (2), the cell to the left + 1 (2) and the above left + cost (0), is the latter one.
The cell underneath it (**) has cost 1 (“The” versus “dog”) and gets a value equal to the minimum of the cell above + 1 (1), the cell to the left + 1 (3) and the above left + cost (2), which is 1.
Continue to fill this out and the table will look like this:
Now we need to find the lowest cost path from the bottom right to the zero in the top left. To do this simply jump to the cell with the lowest value adjacent to the current cell (to left, above or diagonal). Jumping diagonal is only allowed if the words are the same (column and row). If two or more have the same (lower) value, the priority of choosing a route is to try diagonal first, then either left or above. So, from the bottom right 5 we start the route to the diagonally adjacent 5 (because ’sprinkler’ equals ’sprinkler’). From the 5 the next step would be the lower 4 above it, then the diagonally adjacent 4, etc… The route table will look like this:
After the route is calculated, every step in it tells something about the operations needed (from top left to bottom right).
- Every diagonal step (not increasing the score) tells us nothing happened. E.g. the first step from 0 to 0 tells us “The” from the first line stays “The” in the second line.
- Every horizontal step means a word is deleted. E.g. the step from 0 to 1 tells us “brown” was deleted at this point.
- Every vertical step means a word is added. E.g. the step from 4 to 5 tells us “green” was added at this point.
- Every diagonal step having the score increased means a word is substituted (added and deleted) at this point. E.g. the step from 2 to 3 tells “away” is substituted by “ran”. This is an illegal opperation in the detection of addition and deletion of words.
This way a text indicating the operations can be constructed:
The brown dog jumped ranaway towardsfrom the green sprinkler.
Red indicates deletion, green for insertion and a red and green pair indicates substitution.
Of course some optimalizations can be performed. The above for instance does give a good indication of what happened to the text. Imagine a larger text than just these lines and the relevance of changes are marked this way will become more obvious. But because only the primitive operations are detected at the word level, word groups are not taken into account. In this example for instance, the algorithm would be better if it marked “jumped away from” as replaced by “ran towards” instead of each seperate word as it does now:
The brown dog ran towardsjumped away from the green sprinkler.
This operation is not that hard to implement, simply replace subsequent differing operations by substitute operations.
And about the featured picture on top: there are 12 differences to spot.