pydiffx.utils.text

Utilities for processing text.

Module Attributes

NEWLINE_FORMATS

A mapping of newline format types to character sequences.

BOMS

A mapping of encodings to possible BOM markers.

Functions

get_newline_for_type(line_endings[, encoding])

Return the newline for a given type of line endings.

guess_line_endings(text[, encoding])

Return the line endings that appear to be used for text.

split_lines(data, newline[, keep_ends])

Split data along newline boundaries.

strip_bom(data, encoding)

Strip a BOM from the beginning of a string.

pydiffx.utils.text.NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}

A mapping of newline format types to character sequences.

This contains only formats that are allowed in the line_endings= option in DiffX content sections.

Type:

dict

pydiffx.utils.text.BOMS = {'utf-16': (b'\xfe\xff', b'\xff\xfe'), 'utf-16-be': (b'\xfe\xff',), 'utf-16-le': (b'\xff\xfe',), 'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'), 'utf-32-be': (b'\x00\x00\xfe\xff',), 'utf-32-le': (b'\xff\xfe\x00\x00',), 'utf-8': (b'\xef\xbb\xbf',)}

A mapping of encodings to possible BOM markers.

pydiffx.utils.text.split_lines(data, newline, keep_ends=False)

Split data along newline boundaries.

This differs from str.splitlines() in that it splits only on the specified newline sequence, rather than on any sequence of newline characters.

Parameters
  • data (bytes) – The data to split.

  • newline (bytes) – The newline character(s) used to split the data into lines.

  • keep_ends (bool, optional) – Whether to keep the line endings in the resulting lines.

Returns

The split list of lines.

Return type

list of bytes
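The difference from splitlines() matters most for multi-byte encodings, where a bare newline byte can appear in the middle of an encoded character. The following is a minimal sketch using only the standard library. It is not pydiffx's implementation: the helper name split_on_newline is hypothetical, and dropping the empty item after a final newline is an assumption made for illustration.

```python
def split_on_newline(data, newline, keep_ends=False):
    """Split bytes on one exact newline sequence (illustrative helper)."""
    parts = data.split(newline)

    # The final part has no newline after it; every earlier part did.
    last = parts.pop()

    if keep_ends:
        parts = [part + newline for part in parts]

    # Assumption: an empty final part (the data ended with a newline, or
    # was empty) is not reported as a line.
    if last:
        parts.append(last)

    return parts


data = 'one\ntwo\n'.encode('utf-16-le')
newline = '\n'.encode('utf-16-le')

# Splitting on the encoded newline keeps each line decodable.
print([line.decode('utf-16-le') for line in split_on_newline(data, newline)])

# bytes.splitlines() splits on the single b'\n' byte, leaving the newline's
# trailing b'\x00' attached to the start of the next item.
print(data.splitlines())
```

Passing the encoded newline explicitly keeps the split aligned with the encoding of the data, which a generic line splitter cannot know about.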

pydiffx.utils.text.get_newline_for_type(line_endings, encoding=None)

Return the newline for a given type of line endings.

The resulting newline characters will be encoded into the given encoding, if specified, or as plain ASCII if not specified.

If a BOM is present in the result, it will be stripped.

Parameters
  • line_endings (unicode) – The type of line endings. This will be one of LineEndings.DOS or LineEndings.UNIX.

  • encoding (unicode, optional) – The encoding to use for the resulting newline. If None, “ascii” will be used.

Returns

The resulting encoded newline characters.

Return type

bytes

Raises
  • LookupError – encoding was not a valid encoding type.

  • ValueError – line_endings was not a valid type of line endings.
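The behavior described above (look up the newline, encode it, strip any BOM the encoder adds) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: encode_newline and _ENCODER_BOMS are hypothetical names, and the mapping is copied from the documented NEWLINE_FORMATS value.

```python
import codecs

# Copied from the documented NEWLINE_FORMATS value.
NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}

# BOMs that Python encoders may prepend. The UTF-32 entries come first
# because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_ENCODER_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
    codecs.BOM_UTF8,
)


def encode_newline(line_endings, encoding=None):
    """Return the encoded newline for a line endings type, without a BOM."""
    try:
        newline = NEWLINE_FORMATS[line_endings]
    except KeyError:
        raise ValueError('Unsupported line endings type: %r' % (line_endings,))

    # str.encode() raises LookupError for an unknown encoding name.
    encoded = newline.encode(encoding or 'ascii')

    # Generic codec names such as "utf-16" prepend a BOM when encoding.
    for bom in _ENCODER_BOMS:
        if encoded.startswith(bom):
            return encoded[len(bom):]

    return encoded


print(encode_newline('dos'))
print(encode_newline('dos', 'utf-16-le'))
print(encode_newline('unix', 'utf-16'))
```

The byte order of the "utf-16" and "utf-32" results depends on the platform's native byte order, so only the absence of the BOM is guaranteed by this sketch.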

pydiffx.utils.text.guess_line_endings(text, encoding=None)

Return the line endings that appear to be used for text.

This will check the first line of content and see whether it appears to use DOS or UNIX line endings.

If there are no newlines, UNIX line endings are assumed.

Parameters
  • text (bytes or unicode) – The text to guess line endings from.

  • encoding (unicode, optional) – The encoding of the text, if it’s a byte string.

Returns

A 2-tuple of:

  1. The guessed line endings type (as a line_endings= option value).

  2. The line ending characters (in the same string type as text).

Return type

tuple
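A first-line check of this kind can be sketched as follows. This is a simplified illustration rather than the library's implementation: guess_from_first_line is a hypothetical name, it handles Unicode strings only, and byte strings with an encoding are left out.

```python
# Copied from the documented NEWLINE_FORMATS value.
NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}


def guess_from_first_line(text):
    """Guess the line endings type from the first line of a string."""
    # The first "\n" marks the end of the first line, if there is one.
    index = text.find('\n')

    if index > 0 and text[index - 1] == '\r':
        line_endings = 'dos'
    else:
        # Covers both a bare "\n" and text with no newline at all.
        line_endings = 'unix'

    return line_endings, NEWLINE_FORMATS[line_endings]


print(guess_from_first_line('first\r\nsecond\r\n'))
print(guess_from_first_line('no newline here'))
```

Because only the first line is inspected, text with mixed line endings is classified by whichever ending appears first.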

pydiffx.utils.text.strip_bom(data, encoding)

Strip a BOM from the beginning of a string.

If the encoding is one that uses a BOM, and any version of that BOM (such as Big Endian or Little Endian) is present, it will be stripped.

Parameters
  • data (bytes) – The byte string to strip a BOM from.

  • encoding (unicode) – The encoding of the byte string.

Returns

The string, without any BOM markers.

Return type

bytes
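To make the lookup concrete, here is a minimal sketch built on the documented BOMS mapping. It is an assumption-laden illustration, not pydiffx's code: strip_bom_sketch is a hypothetical name, the encoding name is matched exactly with no normalization, and at most one leading BOM is removed.

```python
# Copied from the documented BOMS value.
BOMS = {
    'utf-16': (b'\xfe\xff', b'\xff\xfe'),
    'utf-16-be': (b'\xfe\xff',),
    'utf-16-le': (b'\xff\xfe',),
    'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'),
    'utf-32-be': (b'\x00\x00\xfe\xff',),
    'utf-32-le': (b'\xff\xfe\x00\x00',),
    'utf-8': (b'\xef\xbb\xbf',),
}


def strip_bom_sketch(data, encoding):
    """Return data without a leading BOM for the given encoding."""
    # Encodings missing from the mapping have no BOM to strip.
    for bom in BOMS.get(encoding, ()):
        if data.startswith(bom):
            return data[len(bom):]

    return data


print(strip_bom_sketch(b'\xef\xbb\xbfhello', 'utf-8'))
print(strip_bom_sketch(b'\xff\xfeh\x00i\x00', 'utf-16'))
```

Keying the lookup on the encoding avoids misreading data: the UTF-32 Little Endian BOM starts with the same two bytes as the UTF-16 Little Endian BOM, so the two cannot be told apart without knowing the encoding.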