Information about dos2unix' implementation choices.

1. Smart conversion
===================

Some dos2unix implementations automatically convert all types of line breaks,
for instance converting both DOS and Mac line breaks to Unix line breaks at
once. Others automatically detect the line break type and convert to the other
side.

Smart conversions can lead to unexpected behaviour, for instance when dos2unix
is run on a file with only Unix line breaks and the line breaks are flipped to
the other side. This dos2unix implementation does exactly what you tell it to
do. When you run 'dos2unix', only DOS line breaks are converted to Unix line
breaks. Unix line breaks stay in the file. Seen from a DOS or Unix perspective,
a Mac line break is not a line break, so Mac line breaks also stay untouched.
The same applies to mac2unix: it leaves Unix and DOS line breaks untouched.

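The conversion rules of section 1 can be sketched in a few lines of Python.
This is only an illustration of the semantics described there, not the actual
implementation: the function names 'dos_to_unix' and 'mac_to_unix' are made up
for the example, and everything else a real converter has to deal with (binary
files, encodings, error handling) is left out.

```python
import re

def dos_to_unix(data: bytes) -> bytes:
    """Convert only DOS line breaks (CR LF) to Unix line breaks (LF)."""
    # A lone LF or a lone CR is not matched, so it stays untouched.
    return data.replace(b"\r\n", b"\n")

def mac_to_unix(data: bytes) -> bytes:
    """Convert only Mac line breaks (lone CR) to Unix line breaks (LF)."""
    # The negative lookahead skips a CR that is part of a DOS CR LF pair.
    return re.sub(rb"\r(?!\n)", b"\n", data)

mixed = b"dos\r\nunix\nmac\rend"
print(dos_to_unix(mixed))  # b'dos\nunix\nmac\rend'
print(mac_to_unix(mixed))  # b'dos\r\nunix\nmac\nend'
```

Neither function guesses at the input: a file that contains none of the line
breaks asked for comes out byte for byte identical.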

2. Unix filter
==============

When a standard Unix filter, e.g. sed or tr, reads input from a file, it sends
its output to standard output by default. This implementation of dos2unix does
in-place conversion by default (overwriting the input file), which seems out of
line with that convention.

Dos2unix is not part of the Unix standard. Most Unixes have their own
implementation of dos2unix. There is a lot of variation in command names,
options, and behaviour. The SunOS version of dos2unix, after which this version
was modeled, does paired conversion by default.

This implementation of dos2unix has too much legacy to change the current
behaviour. Changing it would have more disadvantages than advantages. Most
people expect dos2unix to do in-place conversion. The majority of other open
source implementations also convert in place by default. In-place conversion
has the advantage that it is very easy to convert multiple files by using
wildcards.

This implementation of dos2unix does send its output to standard output when
the input comes from standard input, so you can use it as a filter. Note that
dos2unix/unix2dos is also used a lot on non-Unix operating systems, where the
filter idea is less well known.

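The two modes of section 2 can be illustrated with a small Python sketch. It
is written under stated assumptions and does not mirror the real code: 'run'
and 'convert' are hypothetical names, and the in-place branch simply overwrites
the file, where a careful tool would typically go through a temporary file.

```python
import io
import os
import tempfile
from typing import BinaryIO, Sequence

def convert(data: bytes) -> bytes:
    # Stand-in for the real conversion: DOS (CR LF) to Unix (LF) only.
    return data.replace(b"\r\n", b"\n")

def run(paths: Sequence[str], stdin: BinaryIO, stdout: BinaryIO) -> None:
    if not paths:
        # Filter mode: no file arguments, so read stdin and write stdout.
        stdout.write(convert(stdin.read()))
        return
    for path in paths:
        # In-place mode: every named file is overwritten with its conversion.
        with open(path, "rb") as f:
            data = f.read()
        with open(path, "wb") as f:
            f.write(convert(data))

# Filter mode, simulated with in-memory streams.
out = io.BytesIO()
run([], io.BytesIO(b"one\r\ntwo\r\n"), out)
print(out.getvalue())  # b'one\ntwo\n'

# In-place mode on a temporary file; nothing is written to the output stream.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as f:
        f.write(b"one\r\ntwo\r\n")
    run([path], io.BytesIO(), io.BytesIO())
    with open(path, "rb") as f:
        in_place_result = f.read()
print(in_place_result)  # b'one\ntwo\n'
```

Because the in-place branch loops over its arguments, a shell wildcard that
expands to many file names converts all of them in one call.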

3. Recursive conversion of files
================================

There are implementations that have built-in functionality to do recursive
conversion of all files in a directory tree.

This functionality is not needed in dos2unix. By using an external program,
like Unix 'find', you can do recursive conversion of directory trees. There is
no need to duplicate this.

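On Unix the usual way to do this is a command such as
'find . -type f -exec dos2unix {} +'. The same division of labour, with the
directory walk kept outside the converter, can be sketched in Python. The
names 'convert_tree' and 'dos_to_unix_in_place' are hypothetical, and the
converter is only a stand-in for a real one.

```python
import os
import tempfile
from typing import Callable

def convert_tree(root: str, convert_file: Callable[[str], None]) -> int:
    """Call convert_file on every file name os.walk reports below root."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            convert_file(os.path.join(dirpath, name))
            count += 1
    return count

def dos_to_unix_in_place(path: str) -> None:
    # Stand-in converter (hypothetical): rewrite CR LF as LF in one file.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))

# Build a small three-level tree and convert every file in it.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub", "deeper"))
    relative_paths = (
        "a.txt",
        os.path.join("sub", "b.txt"),
        os.path.join("sub", "deeper", "c.txt"),
    )
    for rel in relative_paths:
        with open(os.path.join(root, rel), "wb") as f:
            f.write(b"line\r\n")
    converted = convert_tree(root, dos_to_unix_in_place)
    with open(os.path.join(root, "sub", "deeper", "c.txt"), "rb") as f:
        deepest = f.read()
print(converted, deepest)  # 3 b'line\n'
```

The converter itself knows nothing about directories; which files are visited
is decided entirely by the caller.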

4. Encoding conversion
======================

Dos2unix can do several encoding conversions. First there are the conversions
of several DOS code pages to and from ISO-8859-1. These conversions are also
part of the SunOS dos2unix implementation after which this implementation has
been modeled. Although these conversions are not much used these days, they
have been added for the sake of completeness. Conversion of CP1252 was added
because it is used a lot in the Western world. It is almost identical to
ISO-8859-1. There is no intention to add other conversions to and from
ISO-8859-1.

Conversion from UTF-16 was added because the world is moving towards Unicode.
Microsoft Windows uses UTF-16 by default for Unicode. UTF-16 is part of
Windows' core design for historical reasons. Microsoft standardized on UCS-2, a
predecessor of UTF-16, at a time when UTF-8 did not yet exist. However, a lot
of Windows software is able to read UTF-8 files. In Windows, "Unicode" usually
means UTF-16. For instance, saving a file with Notepad in "Unicode" encoding
means saving it in UTF-16 encoding. When you work in PowerShell and echo some
text to a file, you get a UTF-16 encoded text file. UTF-16 is here to stay,
although many people would like to see otherwise and are dreaming of a
UTF-8-only world. The Unix/Linux world is moving towards UTF-8 encoding,
because it is backwards compatible with ASCII. Unix programs typically do not
support UTF-16.

One end of the encoding spectrum is an ASCII-only world, where the only
differences between DOS and Unix text files are line breaks. In
English-speaking regions this is a good working environment, because ASCII is
in practice sufficient for the English language. Diacritics are hardly used and
can be omitted. The other end of the spectrum is a Unicode-only world, in which
all languages of the world are supported. Dos2unix aims to support these two
ends of the spectrum: ASCII and Unicode. The Chinese GB18030 encoding is also
seen as a Unicode transformation format. UTF-32 is not supported, because it is
practically only used as an internal format. Other encoding transformations are
left to specialized programs like iconv and recode. The few conversion modes to
and from ISO-8859-1 are only there for legacy reasons.

In the ASCII days, DOS to Unix text file conversion, and vice versa, was only a
matter of converting line breaks. In the Unicode era it is not only line break
conversion, but also Unicode transformation format conversion (e.g. UTF-16 to
UTF-8), and Byte Order Mark (BOM) removal or addition.

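The three steps named above (transformation format conversion, line break
conversion, and BOM handling) can be sketched with Python's standard codecs.
This is a minimal illustration, not how dos2unix does it: the function name
'utf16_dos_to_utf8_unix' and the 'add_bom' parameter are assumptions made for
the example.

```python
import codecs

def utf16_dos_to_utf8_unix(data: bytes, add_bom: bool = False) -> bytes:
    # The "utf-16" codec reads the BOM to pick the byte order and drops it;
    # without a BOM, Python falls back to the platform's native byte order.
    text = data.decode("utf-16")
    # Line break conversion: DOS (CR LF) to Unix (LF) only.
    text = text.replace("\r\n", "\n")
    encoded = text.encode("utf-8")
    # BOM addition is optional; by default the output has no BOM.
    return codecs.BOM_UTF8 + encoded if add_bom else encoded

sample = codecs.BOM_UTF16_LE + "caf\u00e9\r\n".encode("utf-16-le")
print(utf16_dos_to_utf8_unix(sample))                # b'caf\xc3\xa9\n'
print(utf16_dos_to_utf8_unix(sample, add_bom=True))  # b'\xef\xbb\xbfcaf\xc3\xa9\n'
```

Decoding first and converting line breaks on the decoded text avoids having to
match a CR LF pair across two-byte code units.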
Conversion towards UTF-16 is not supported and there is no intention to support
it in the future. UTF-8 encoded files are well supported on Windows, so
conversion to UTF-16 is not needed. And we keep on dreaming of a UTF-8-only
world...