pydiffx.utils.text
Utilities for processing text.
Module Attributes
NEWLINE_FORMATS | A mapping of newline format types to character sequences.
BOMS | A mapping of encodings to possible BOM markers.
Functions
get_newline_for_type | Return the newline for a given type of line endings.
guess_line_endings | Return the line endings that appear to be used for text.
split_lines | Split data along newline boundaries.
strip_bom | Strip a BOM from the beginning of a string.
- pydiffx.utils.text.NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}
A mapping of newline format types to character sequences.
This contains only formats that are allowed in the line_endings= option in DiffX content sections.
- Type
dict
- pydiffx.utils.text.BOMS = {'utf-16': (b'\xfe\xff', b'\xff\xfe'), 'utf-16-be': (b'\xfe\xff',), 'utf-16-le': (b'\xff\xfe',), 'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'), 'utf-32-be': (b'\x00\x00\xfe\xff',), 'utf-32-le': (b'\xff\xfe\x00\x00',), 'utf-8': (b'\xef\xbb\xbf',)}
A mapping of encodings to possible BOM markers.
- pydiffx.utils.text.split_lines(data, newline, keep_ends=False)
Split data along newline boundaries.
This differs from str.splitlines() in that it will split across a specific newline boundary, rather than against any sequence of newline characters.
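The difference from str.splitlines() can be illustrated with a small stand-alone sketch. The helper name split_on_newline is hypothetical, and the handling of a trailing newline and of keep_ends is an assumption made for illustration. It is not pydiffx's implementation.

```python
def split_on_newline(data, newline, keep_ends=False):
    """Split ``data`` (str or bytes) on one specific newline sequence.

    Hypothetical helper for illustration only; not the pydiffx code.
    """
    parts = data.split(newline)

    # If the data ended with the newline, split() leaves an empty final item.
    # Assumption: that trailing empty item is not reported as a line.
    ended_with_newline = not parts[-1]
    if ended_with_newline:
        parts.pop()

    if keep_ends:
        last = len(parts) - 1
        # Re-attach the newline to every line that was actually terminated.
        parts = [
            part + newline if (i < last or ended_with_newline) else part
            for i, part in enumerate(parts)
        ]

    return parts


text = "a\rb\r\nc\r\n"

# str.splitlines() breaks on the lone "\r" as well as on "\r\n".
any_newline = text.splitlines()

# Splitting on "\r\n" only leaves the lone "\r" inside the first line.
dos_only = split_on_newline(text, "\r\n")
```

Here any_newline has three items while dos_only has two, because the lone carriage return is treated as ordinary content when only "\r\n" counts as a boundary.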
- pydiffx.utils.text.get_newline_for_type(line_endings, encoding=None)
Return the newline for a given type of line endings.
The resulting newline characters will be encoded into the given encoding, if specified, or as plain ASCII if not specified.
If a BOM is present in the result, it will be stripped.
- Parameters
line_endings (unicode) – The type of line endings. This will be one of LineEndings.DOS or LineEndings.UNIX.
encoding (unicode, optional) – The encoding to use for the resulting newline. If None, “ascii” will be used.
- Returns
The resulting encoded newline characters.
- Return type
- Raises
LookupError – encoding was not a valid encoding type.
ValueError – line_endings was not a valid type of line endings.
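The behavior described above (encode the newline, then drop any BOM the codec prepends) can be sketched with the standard library alone. The names encoded_newline, NEWLINES and BOM_CANDIDATES are hypothetical, and the BOM check is a simplification. This is a rough approximation under those assumptions, not the library's code.

```python
import codecs

# Same values as the documented NEWLINE_FORMATS mapping.
NEWLINES = {"dos": "\r\n", "unix": "\n"}

# Assumption: it is enough to test these BOMs in this order. The 4-byte
# UTF-32 BOMs come first because the UTF-32 LE BOM begins with the
# UTF-16 LE BOM.
BOM_CANDIDATES = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def encoded_newline(line_endings, encoding=None):
    """Return the newline for ``line_endings`` as bytes, without a BOM."""
    try:
        newline = NEWLINES[line_endings]
    except KeyError:
        raise ValueError("Unsupported line endings type: %r" % (line_endings,))

    # str.encode() raises LookupError for an unknown encoding name.
    data = newline.encode(encoding or "ascii")

    # Codecs such as "utf-16" prepend a BOM when encoding; remove it.
    for bom in BOM_CANDIDATES:
        if data.startswith(bom):
            return data[len(bom):]

    return data
```

With an endian-specific codec such as "utf-16-le" no BOM is produced in the first place, so the loop leaves the encoded bytes untouched.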
- pydiffx.utils.text.guess_line_endings(text, encoding=None)
Return the line endings that appear to be used for text.
This will check the first line of content to determine whether it appears to use DOS or UNIX line endings.
If there are no newlines, UNIX line endings are assumed.
- Parameters
- Returns
A 2-tuple of:
The guessed line endings type (as a line_endings= option value).
The line ending characters (in the same string type as text).
- Return type
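The first-line heuristic described above might look roughly like the following. The helper name guess_line_endings_sketch is hypothetical, and the sketch ignores the encoding argument by assuming str input or ASCII-compatible bytes. It is an illustration, not the pydiffx implementation.

```python
def guess_line_endings_sketch(text):
    """Guess the line endings used by ``text`` from its first line.

    Hypothetical helper; assumes str or ASCII-compatible bytes.
    Returns a (type, newline) tuple, with newline in the type of ``text``.
    """
    if isinstance(text, bytes):
        cr, lf = b"\r", b"\n"
    else:
        cr, lf = "\r", "\n"

    # Only the first line feed matters; later lines are never inspected.
    index = text.find(lf)

    # A carriage return directly before the first line feed means DOS.
    if index > 0 and text[index - 1:index] == cr:
        return "dos", cr + lf

    # A bare line feed, or no newline at all, is treated as UNIX.
    return "unix", lf
```

Because only the first line is inspected, content with mixed line endings is classified by whichever style appears first.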
- pydiffx.utils.text.strip_bom(data, encoding)
Strip a BOM from the beginning of a string.
If the encoding is one that uses a BOM, and any variant of that BOM (such as Big Endian or Little Endian) is present at the start of the data, it will be stripped.
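As a rough illustration of how a per-encoding BOM table can drive this, the sketch below copies the documented BOMS value into a local mapping. The names BOMS_SKETCH and strip_bom_sketch are hypothetical, and the sketch assumes bytes input and lowercase-normalized encoding names. The real function may differ on both points.

```python
# Local copy of the documented BOMS value, so the example is self-contained.
BOMS_SKETCH = {
    "utf-16": (b"\xfe\xff", b"\xff\xfe"),
    "utf-16-be": (b"\xfe\xff",),
    "utf-16-le": (b"\xff\xfe",),
    "utf-32": (b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00"),
    "utf-32-be": (b"\x00\x00\xfe\xff",),
    "utf-32-le": (b"\xff\xfe\x00\x00",),
    "utf-8": (b"\xef\xbb\xbf",),
}


def strip_bom_sketch(data, encoding):
    """Return ``data`` (bytes) without a leading BOM for ``encoding``.

    Hypothetical helper for illustration; not the pydiffx implementation.
    """
    # Encodings missing from the table (such as "ascii") have no BOM.
    for bom in BOMS_SKETCH.get(encoding.lower(), ()):
        if data.startswith(bom):
            return data[len(bom):]

    return data
```

Looking up BOMs per encoding, rather than testing every known BOM, avoids mistaking the first two bytes of a UTF-32 Little Endian BOM for a UTF-16 Little Endian one.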