Diff
diff is a shell command that compares the content of files and reports differences. The term diff is also used to identify the output of the command and is used as a verb for running the command. To diff files, one runs diff to create a diff.
Typically, the command is used to compare text files, but it does support comparing binary files. If one of the input files contains non-textual data, then the command defaults to brief mode, in which it reports only a summary indication of whether the files differ. With the --text option (as in GNU diff), it always reports line-based differences, but the output may be difficult to understand since binary data is generally not structured in lines like text is.
Although the command is primarily used ad hoc to analyze changes between two files, a special use is creating a patch file for the patch command, which was specifically designed to use a diff output report as a patch file. POSIX standardized the diff and patch commands, including their shared file format.
History
The original diff utility was developed in the early 1970s for the Unix operating system, at Bell Labs in Murray Hill, New Jersey. It was part of the 5th Edition of Unix released in 1974 and was written by Douglas McIlroy and James Hunt. The research was published in a 1976 paper that McIlroy co-wrote with James W. Hunt, who had developed an initial prototype of diff. The algorithm this paper described became known as the Hunt–Szymanski algorithm.
McIlroy's work was preceded and influenced by Steve Johnson's comparison program on GECOS and Mike Lesk's proof program. proof also originated on Unix and, like diff, produced line-by-line changes and even used angle brackets for presenting line insertions and deletions in the program's output. The heuristics used in these early applications were, however, deemed unreliable. The potential usefulness of a diff tool provoked McIlroy into researching and designing a more robust tool that could be used in a variety of tasks but still perform well within the processing and size limitations of the PDP-11's hardware. His approach to the problem resulted from collaboration with individuals at Bell Labs including Alfred Aho, Elliot Pinson, Jeffrey Ullman, and Harold S. Stone.
In the context of Unix, the use of the ed line editor provided diff with the natural ability to create machine-usable "edit scripts". These edit scripts, when saved to a file, can be used by ed, along with the original file, to reconstitute the modified file in its entirety. This greatly reduced the secondary storage necessary to maintain multiple versions of a file. McIlroy considered writing a post-processor for diff where a variety of output formats could be designed and implemented, but he found it more frugal and simpler to have diff be responsible for generating the syntax and reverse-order input accepted by the ed command.
In 1984, Larry Wall created the patch utility for patching text files. It uses the output from diff, plus the diff input file with the content before changes, to create a file with the content after changes.
X/Open Portability Guide issue 2 of 1987 includes diff. Context mode was added in POSIX.1-2001. Unified mode was added in POSIX.1-2008.
In diff's early years, common uses included comparing changes in the source of software code and markup for technical documents, verifying program debugging output, comparing filesystem listings and analyzing computer assembly code. The output targeted at ed was motivated by the goal of compressing a sequence of modifications made to a file. The Source Code Control System and its ability to archive revisions emerged in the late 1970s as a consequence of storing edit scripts from diff.
Algorithm
Unlike edit distance notions used for other purposes, diff is line-oriented rather than character-oriented, but it is like Levenshtein distance in that it tries to determine the smallest set of deletions and insertions to create one file from the other.
The operation of diff is based on solving the longest common subsequence problem. In this problem, we are given two sequences of items:
a b c d f g h j q z
a b c d e f g i j k r x y z
and we want to find a longest sequence of items that is present in both original sequences in the same order. That is, we want to find a new sequence which can be obtained from the first original sequence by deleting some items, and from the second original sequence by deleting other items. We also want this sequence to be as long as possible. In this case it is
a b c d f g j z
From a longest common subsequence it is only a small step to get diff-like output: if an item is absent in the subsequence but present in the first original sequence, it must have been deleted. If it is absent in the subsequence but present in the second original sequence, it must have been inserted.
e h i q k r x y
+ - + - + + + +
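The step from a longest common subsequence to insertions and deletions can be sketched in Python. This is a minimal illustration that uses the textbook dynamic-programming table, not the algorithm from the 1976 paper and not the code of any real diff implementation; the function names lcs_table and lcs_diff are hypothetical.

```python
def lcs_table(a, b):
    """Return a table where table[i][j] is the LCS length of a[i:] and b[j:]."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def lcs_diff(a, b):
    """Return (marker, item) pairs: ' ' for common, '-' deleted, '+' inserted."""
    table = lcs_table(a, b)
    i = j = 0
    ops = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Item belongs to the common subsequence.
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            # Dropping a[i] loses nothing, so report it as a deletion.
            ops.append(("-", a[i]))
            i += 1
        else:
            # Otherwise b[j] is an insertion.
            ops.append(("+", b[j]))
            j += 1
    # Whatever remains in either sequence is purely deleted or inserted.
    ops.extend(("-", item) for item in a[i:])
    ops.extend(("+", item) for item in b[j:])
    return ops
```

Applied to the two sequences above (split on spaces), the pairs marked with a space spell out the common subsequence, and the remaining pairs give the items and their + or - markers. For files, the items would be whole lines rather than letters.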
Use
The diff command accepts two arguments like: diff original new. Commonly, the arguments each identify normal files, but if the two arguments identify directories, then the command compares corresponding files in the directories. With the -r option, it recursively descends matching subdirectories to compare files with corresponding relative paths.
Default output format
The example below shows the original and new file content as well as the resulting diff output in the default format. By default, diff outputs plain text, but GNU diff can use color highlighting when the --color option is used.
original:
This part of the
document has stayed the
same from version to
version. It shouldn't
be shown if it doesn't
change. Otherwise, that
would not be helping to
compress the size of the
changes.

This paragraph contains
text that is outdated.
It will be deleted in the
near future.

It is important to spell
check this dokument. On
the other hand, a
misspelled word isn't
the end of the world.
Nothing in the rest of
this paragraph needs to
be changed. Things can
be added after it.
new:
This is an important
notice! It should
therefore be located at
the beginning of this
document!

This part of the
document has stayed the
same from version to
version. It shouldn't
be shown if it doesn't
change. Otherwise, that
would not be helping to
compress the size of the
changes.

It is important to spell
check this document. On
the other hand, a
misspelled word isn't
the end of the world.
Nothing in the rest of
this paragraph needs to
be changed. Things can
be added after it.

This paragraph contains
important new additions
to this document.
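output:
0a1,6
> This is an important
> notice! It should
> therefore be located at
> the beginning of this
> document!
>
11,15d16
< This paragraph contains
< text that is outdated.
< It will be deleted in the
< near future.
<
17c18
< check this dokument. On
---
> check this document. On
24a26,29
>
> This paragraph contains
> important new additions
> to this document.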
In this default format, a stands for added, d for deleted and c for changed. The line number of the original file appears before the single-letter code and the line number of the new file appears after. The less-than and greater-than signs indicate which file the lines appear in: lines prefixed with a less-than sign come from the original file, and lines prefixed with a greater-than sign come from the new file. Added lines are inserted into the original file to produce the new file. Deleted lines are removed from the original file and are absent from the new file.
By default, lines common to both files are not shown. Lines that have moved are shown as added at their new location and as deleted from their old location. However, some diff tools highlight moved lines.
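The default format can be approximated with Python's standard difflib module. The sketch below is illustrative only: the name normal_diff is hypothetical, and difflib's matching strategy is not the same as diff's, so the hunks it chooses can differ from those of the real command on some inputs.

```python
import difflib


def _line_range(start, stop):
    """Format a half-open, 0-based range as a 1-based diff line range."""
    if stop - start == 1:
        return str(start + 1)
    return f"{start + 1},{stop}"


def normal_diff(old, new):
    """Return default-format diff lines for two lists of lines."""
    out = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # common lines are not shown
        if tag == "insert":
            # i1 is the number of original lines before the insertion point.
            out.append(f"{i1}a{_line_range(j1, j2)}")
        elif tag == "delete":
            # j1 is the number of new lines before the deletion point.
            out.append(f"{_line_range(i1, i2)}d{j1}")
        else:
            out.append(f"{_line_range(i1, i2)}c{_line_range(j1, j2)}")
        out.extend("< " + line for line in old[i1:i2])
        if tag == "replace":
            out.append("---")
        out.extend("> " + line for line in new[j1:j2])
    return out
```

For example, comparing the lists ["a", "b", "c"] and ["a", "x", "c", "d"] yields a change hunk headed 2c2 followed by an addition hunk headed 3a4.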
Edit script
An ed script can be generated by modern versions of diff with the -e option. The resulting edit script for this example is as follows:
24a

This paragraph contains
important new additions
to this document.
.
17c
check this document. On
.
11,15d
0a
This is an important
notice! It should
therefore be located at
the beginning of this
document!

.
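The commands in such a script run from the end of the file toward the beginning, so each command's line numbers still refer to the unmodified part of the original file. The following Python sketch shows how the a, c and d commands act on a list of lines. It is an illustration under simplifying assumptions, not a reimplementation of ed: the name apply_ed_script is hypothetical, only these three commands are handled, and a text line consisting of a single period is not supported.

```python
import re

# Matches commands such as "24a", "17c" or "11,15d".
_COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed_script(lines, script):
    """Apply the a/c/d subset of an ed script to a list of lines."""
    result = list(lines)
    pos = 0
    while pos < len(script):
        match = _COMMAND.match(script[pos])
        if match is None:
            raise ValueError(f"unsupported command: {script[pos]!r}")
        first = int(match.group(1))
        last = int(match.group(2) or first)
        command = match.group(3)
        pos += 1
        text = []
        if command in "ac":
            # Collect input text up to the terminating "." line.
            while script[pos] != ".":
                text.append(script[pos])
                pos += 1
            pos += 1  # skip the "." terminator
        if command == "a":
            result[last:last] = text        # append after the addressed line
        elif command == "c":
            result[first - 1:last] = text   # replace the addressed lines
        else:
            del result[first - 1:last]      # delete the addressed lines
    return result
```

Because the script is ordered bottom-up, no line-number adjustment is needed between commands.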
In order to transform the content of the original file into the content of the new file using ed, one appends two lines to this diff file, one line containing a w command, and one containing a q command. Here we gave the diff file the name mydiff, and the transformation happens when we run ed -s original < mydiff.
Context format
The Berkeley distribution of Unix added the context format and the ability to recurse on filesystem directory structures in 2.8 BSD, released in July 1981. The context format of diff introduced at Berkeley helped with distributing patches for source code that may have been changed minimally.
In the context format, any changed lines are shown alongside unchanged lines before and after. The inclusion of any number of unchanged lines provides a context to the patch. The context consists of lines that have not changed between the two files and serves as a reference to locate the lines' place in a modified file and find the intended location for a change to be applied, regardless of whether the line numbers still correspond. The context format offers greater readability for humans and greater reliability when applying the patch, and its output is accepted as input to the patch program. This intelligent behavior is not possible with the traditional diff output.
The number of unchanged lines shown above and below a change hunk can be defined by the user, even zero, but three lines is typically the default. If the context of unchanged lines in a hunk overlap with an adjacent hunk, then diff will avoid duplicating the unchanged lines and merge the hunks into a single hunk.
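The effect of the context size on hunk merging can be observed with Python's standard difflib module, whose context_diff function produces output in this style. The sketch below is only an illustration: the helper name context_hunks is hypothetical, and difflib is a separate implementation whose output is not guaranteed to match the diff command byte for byte.

```python
import difflib


def context_hunks(old, new, context=3):
    """Return (lines, hunk_count) for a context-style diff of two line lists."""
    lines = list(
        difflib.context_diff(
            old, new, fromfile="original", tofile="new",
            n=context, lineterm="",
        )
    )
    # Each hunk is introduced by a separator line of fifteen asterisks.
    hunk_count = sum(1 for line in lines if line == "***************")
    return lines, hunk_count
```

With ten lines in which the third and sixth are changed, three lines of context make the two changes overlap and merge into one hunk, while zero lines of context keep them as two separate hunks.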
A "!" represents a change between lines that correspond in the two files, whereas a "+" represents the addition of a line, and a "-" the removal of a line. A blank space represents an unchanged line. At the beginning of the patch is the file information, including the full path and a time stamp delimited by a tab character. At the beginning of each hunk are the line numbers that apply for the corresponding change in the files. A number range appearing between sets of three asterisks applies to the original file, while a range between sets of three dashes applies to the new file. The hunk ranges specify the starting and ending line numbers in the respective file.
The command diff -c original new produces the following output:
*** /path/to/original	timestamp
--- /path/to/new	timestamp
***************
*** 1,3 ****
--- 1,9 ----
+ This is an important
+ notice! It should
+ therefore be located at
+ the beginning of this
+ document!
+ 
  This part of the
  document has stayed the
  same from version to
***************
*** 8,20 ****
  compress the size of the
  changes.
  
- This paragraph contains
- text that is outdated.
- It will be deleted in the
- near future.
- 
  It is important to spell
! check this dokument. On
  the other hand, a
  misspelled word isn't
  the end of the world.
--- 14,21 ----
  compress the size of the
  changes.
  
  It is important to spell
! check this document. On
  the other hand, a
  misspelled word isn't
  the end of the world.
***************
*** 22,24 ****
--- 23,29 ----
  this paragraph needs to
  be changed. Things can
  be added after it.
+ 
+ This paragraph contains
+ important new additions
+ to this document.