pydiffx.utils.text
Utilities for processing text.
Module Attributes
NEWLINE_FORMATS | A mapping of newline format types to character sequences.
BOMS | A mapping of encodings to possible BOM markers.
Functions
get_newline_for_type | Return the newline for a given type of line endings.
guess_line_endings | Return the line endings that appear to be used for text.
split_lines | Split data along newline boundaries.
strip_bom | Strip a BOM from the beginning of a string.
- pydiffx.utils.text.NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}
A mapping of newline format types to character sequences.
This contains only the formats allowed in the
line_endings= option in DiffX content sections.
- Type:
dict
- pydiffx.utils.text.BOMS = {'utf-16': (b'\xfe\xff', b'\xff\xfe'), 'utf-16-be': (b'\xfe\xff',), 'utf-16-le': (b'\xff\xfe',), 'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'), 'utf-32-be': (b'\x00\x00\xfe\xff',), 'utf-32-le': (b'\xff\xfe\x00\x00',), 'utf-8': (b'\xef\xbb\xbf',)}
A mapping of encodings to possible BOM markers.
- pydiffx.utils.text.split_lines(data, newline, keep_ends=False)
Split data along newline boundaries.
This differs from str.splitlines() in that it splits only on the specific newline sequence given, rather than on any sequence of newline characters.
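The difference is easiest to see on text that mixes newline styles. The sketch below does not call pydiffx: `split_on_newline` is a hypothetical stand-in written only to illustrate splitting on one exact newline sequence, and its handling of a trailing newline and of `keep_ends` is an assumption rather than documented behavior.

```python
def split_on_newline(data, newline, keep_ends=False):
    """Hypothetical stand-in for illustration; not pydiffx's implementation."""
    parts = data.split(newline)

    # A trailing newline leaves an empty final element; drop it (assumption).
    ends_with_newline = parts[-1] == data[:0]
    if ends_with_newline:
        parts.pop()

    if keep_ends:
        # Re-attach the newline to every line that originally ended with one.
        last = len(parts) - 1
        parts = [
            part + newline if (i < last or ends_with_newline) else part
            for i, part in enumerate(parts)
        ]

    return parts


# A lone '\r' sits inside the first line; only '\n' ends a line here.
text = 'one\rstill one\ntwo\n'

any_newline = text.splitlines()                # splits on '\r' as well
exact_newline = split_on_newline(text, '\n')   # splits on '\n' only
exact_with_ends = split_on_newline(text, '\n', keep_ends=True)
```

Here `str.splitlines()` yields three lines because it treats the lone `\r` as a line boundary, while the exact split yields two.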
- pydiffx.utils.text.get_newline_for_type(line_endings, encoding=None)
Return the newline for a given type of line endings.
The resulting newline characters will be encoded into the given encoding, if specified, or as plain ASCII if not specified.
If a BOM is present in the result, it will be stripped.
- Parameters:
line_endings (unicode) – The type of line endings. This will be one of LineEndings.DOS or LineEndings.UNIX.
encoding (unicode, optional) – The encoding to use for the resulting newline. If None, “ascii” will be used.
- Returns:
The resulting encoded newline characters.
- Return type:
bytes
- Raises:
LookupError – encoding was not a valid encoding type.
ValueError – line_endings was not a valid type of line endings.
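The BOM stripping matters because some Python codecs, such as "utf-16" and "utf-32", prepend a BOM when encoding. The following is a minimal sketch of the described behavior, not the library's code: `encoded_newline` and `KNOWN_BOMS` are hypothetical names, and it assumes the accepted line ending values are the string keys of NEWLINE_FORMATS.

```python
import codecs

# Copied from the documented value of NEWLINE_FORMATS.
NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}

# UTF-32 BOMs are checked first, since the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
KNOWN_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def encoded_newline(line_endings, encoding=None):
    """Hypothetical sketch: encode a newline and drop any leading BOM."""
    try:
        newline = NEWLINE_FORMATS[line_endings]
    except KeyError:
        raise ValueError('Unsupported line endings: %r' % (line_endings,))

    # str.encode() raises LookupError for an unknown encoding name.
    data = newline.encode(encoding or 'ascii')

    for bom in KNOWN_BOMS:
        if data.startswith(bom):
            return data[len(bom):]

    return data


plain = encoded_newline('unix')                # ASCII by default
dos_utf16le = encoded_newline('dos', 'utf-16-le')
unix_utf16 = encoded_newline('unix', 'utf-16')  # BOM removed
```

The byte order of the "utf-16" result depends on the platform, so code comparing it against a fixed value would need to name the byte order explicitly (for example "utf-16-le").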
- pydiffx.utils.text.guess_line_endings(text, encoding=None)
Return the line endings that appear to be used for text.
This will check the first line of content to determine whether it uses DOS or UNIX line endings.
If there are no newlines, UNIX line endings are assumed.
- Parameters:
- Returns:
A 2-tuple of:
The guessed line endings type (as a line_endings= option value).
The line ending characters (in the same string type as text).
- Return type:
tuple
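The first-line check described above can be sketched as follows. This is an illustration only: `guess_newline_type` is a hypothetical name, it handles plain str input, and it leaves out the encoding parameter entirely.

```python
def guess_newline_type(text):
    """Hypothetical sketch: inspect only the first newline in the text."""
    index = text.find('\n')

    # A '\r' directly before the first '\n' suggests DOS line endings.
    if index > 0 and text[index - 1] == '\r':
        return 'dos', '\r\n'

    # Covers both a bare '\n' and text with no newline at all.
    return 'unix', '\n'


dos_guess = guess_newline_type('first\r\nsecond\n')
unix_guess = guess_newline_type('first\nsecond\r\n')
no_newline_guess = guess_newline_type('single line')
```

Because only the first line is inspected, text with mixed line endings is classified by whichever style appears first.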
- pydiffx.utils.text.strip_bom(data, encoding)
Strip a BOM from the beginning of a string.
If the encoding is one that uses a BOM, and any version of that BOM (such as Big Endian or Little Endian) is present at the start of the data, it will be stripped.
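As a rough illustration, the lookup can be driven by the BOMS mapping documented earlier on this page. `strip_leading_bom` is a hypothetical name, and lowercasing the encoding name before the lookup is an assumption; the library may normalize encoding names differently.

```python
# Copied from the documented value of BOMS.
BOMS = {
    'utf-16': (b'\xfe\xff', b'\xff\xfe'),
    'utf-16-be': (b'\xfe\xff',),
    'utf-16-le': (b'\xff\xfe',),
    'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'),
    'utf-32-be': (b'\x00\x00\xfe\xff',),
    'utf-32-le': (b'\xff\xfe\x00\x00',),
    'utf-8': (b'\xef\xbb\xbf',),
}


def strip_leading_bom(data, encoding):
    """Hypothetical sketch: remove one leading BOM valid for the encoding."""
    for bom in BOMS.get(encoding.lower(), ()):
        if data.startswith(bom):
            return data[len(bom):]

    # No BOM registered for this encoding, or none present in the data.
    return data


utf8_stripped = strip_leading_bom(b'\xef\xbb\xbfhello', 'utf-8')
utf16_stripped = strip_leading_bom(b'\xff\xfeh\x00', 'utf-16')
wrong_endianness = strip_leading_bom(b'\xfe\xffh', 'utf-16-le')
```

In the last call the data is returned unchanged, since the mapping lists only the Little Endian BOM for "utf-16-le".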