pydiffx.utils.text

Utilities for processing text.

Module Attributes

NEWLINE_FORMATS

A mapping of newline format types to character sequences.

BOMS

A mapping of encodings to possible BOM markers.

Functions

get_newline_for_type(line_endings[, encoding])

Return the newline for a given type of line endings.

guess_line_endings(text[, encoding])

Return the line endings that appear to be used for text.

split_lines(data, newline[, keep_ends])

Split data along newline boundaries.

strip_bom(data, encoding)

Strip a BOM from the beginning of a string.

pydiffx.utils.text.NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}

A mapping of newline format types to character sequences.

This contains only formats that are allowed in the line_endings= option in DiffX content sections.

Type:

dict

pydiffx.utils.text.BOMS = {'utf-16': (b'\xfe\xff', b'\xff\xfe'), 'utf-16-be': (b'\xfe\xff',), 'utf-16-le': (b'\xff\xfe',), 'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'), 'utf-32-be': (b'\x00\x00\xfe\xff',), 'utf-32-le': (b'\xff\xfe\x00\x00',), 'utf-8': (b'\xef\xbb\xbf',)}

A mapping of encodings to possible BOM markers.

pydiffx.utils.text.split_lines(data, newline, keep_ends=False)

Split data along newline boundaries.

This differs from str.splitlines() in that it will split across a specific newline boundary, rather than against any sequence of newline characters.

Parameters:
  • data (bytes) – The data to split.

  • newline (bytes) – The newline character(s) used to split the data into lines.

  • keep_ends (bool, optional) – Whether to keep the line endings in the resulting lines.

Returns:

The split list of lines.

Return type:

list of bytes
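
The contrast with splitlines() can be illustrated with a minimal standalone sketch. It is not pydiffx's implementation: split_lines_sketch is a hypothetical stand-in written only for illustration, and it assumes that a trailing newline does not produce an empty final line. It also ignores character alignment in multi-byte encodings.

```python
def split_lines_sketch(data, newline, keep_ends=False):
    """Split bytes on one exact newline sequence (illustrative only)."""
    parts = data.split(newline)

    # Every part except the last one was followed by the newline sequence.
    lines = [part + newline if keep_ends else part for part in parts[:-1]]

    # The last part is whatever follows the final newline. Keep it only if
    # it is non-empty (an assumption of this sketch).
    if parts[-1]:
        lines.append(parts[-1])

    return lines


data = b'one\r\ntwo\rstill two\r\nthree\r\n'

# bytes.splitlines() treats the lone \r as a line boundary as well.
generic = data.splitlines()

# Splitting on b'\r\n' only keeps the lone \r inside the second line.
specific = split_lines_sketch(data, b'\r\n')
with_ends = split_lines_sketch(data, b'\r\n', keep_ends=True)
```

Here generic contains four lines, while specific contains three, because the lone carriage return is not the requested newline sequence.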

pydiffx.utils.text.get_newline_for_type(line_endings, encoding=None)

Return the newline for a given type of line endings.

The resulting newline characters will be encoded into the given encoding, if specified, or as plain ASCII if not specified.

If a BOM is present in the result, it will be stripped.

Parameters:
  • line_endings (unicode) – The type of line endings. This will be one of LineEndings.DOS or LineEndings.UNIX.

  • encoding (unicode, optional) – The encoding to use for the resulting newline. If None, “ascii” will be used.

Returns:

The resulting encoded newline characters.

Return type:

bytes

Raises:
  • LookupError – encoding was not a valid encoding type.

  • ValueError – line_endings was not a valid type of line endings.
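
The behavior described above (encode the newline, then drop any BOM the encoder prepends) might look roughly like the following sketch. It uses only the standard library. The names newline_for_type_sketch and _BOMS are hypothetical, and the mapping is reduced to the two encodings whose Python encoders emit a BOM, so this should not be read as the library's actual code.

```python
import codecs

# Copied from the documented NEWLINE_FORMATS value.
NEWLINE_FORMATS = {'dos': '\r\n', 'unix': '\n'}

# Hypothetical reduced mapping: encodings whose encoder prepends a BOM.
_BOMS = {
    'utf-16': (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE),
    'utf-32': (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE),
}


def newline_for_type_sketch(line_endings, encoding=None):
    """Return the encoded newline for 'dos' or 'unix' (illustrative only)."""
    try:
        newline = NEWLINE_FORMATS[line_endings]
    except KeyError:
        raise ValueError(
            'Unsupported line endings type: %r' % line_endings) from None

    encoding = encoding or 'ascii'

    # str.encode() raises LookupError for an unknown encoding name.
    encoded = newline.encode(encoding)

    # Generic 'utf-16' and 'utf-32' encoders prepend a BOM in the platform's
    # byte order. Remove it so that only the newline bytes remain.
    for bom in _BOMS.get(encoding.lower(), ()):
        if encoded.startswith(bom):
            return encoded[len(bom):]

    return encoded


ascii_dos = newline_for_type_sketch('dos')
utf16le_unix = newline_for_type_sketch('unix', 'utf-16-le')
utf16_unix = newline_for_type_sketch('unix', 'utf-16')
```

The byte order of utf16_unix depends on the platform, which is why the BOM has to be checked in both its Big Endian and Little Endian forms.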

pydiffx.utils.text.guess_line_endings(text, encoding=None)

Return the line endings that appear to be used for text.

This will check the first line of content to see whether it appears to use DOS or UNIX line endings.

If there are no newlines, UNIX line endings are assumed.

Parameters:
  • text (bytes or unicode) – The text to guess line endings from.

  • encoding (unicode, optional) – The encoding of the text, if it’s a byte string.

Returns:

A 2-tuple of:

  1. The guessed line endings type (as a line_endings= option value).

  2. The line ending characters (in the same string type as text).

Return type:

tuple
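
As a rough illustration of the first-line check, the sketch below looks for the first line feed and tests whether a carriage return precedes it. guess_line_endings_sketch is a hypothetical name, and the sketch assumes that any encoding passed in does not emit a BOM (for example utf-8 or utf-16-le). The real function may differ in these details.

```python
def guess_line_endings_sketch(text, encoding=None):
    """Guess 'dos' or 'unix' from the first line (illustrative only)."""
    is_bytes = isinstance(text, bytes)

    def _convert(newline):
        # Return the newline in the same string type as the input text.
        if is_bytes:
            return newline.encode(encoding or 'ascii')

        return newline

    crlf = _convert('\r\n')
    lf = _convert('\n')

    # Find the end of the first line, if there is one.
    index = text.find(lf)

    if index != -1 and text[:index + len(lf)].endswith(crlf):
        return 'dos', crlf

    # Either the first line ends in a bare line feed or there are no
    # newlines at all. Both cases fall back to UNIX line endings.
    return 'unix', lf


dos_guess = guess_line_endings_sketch('first\r\nsecond\n')
unix_guess = guess_line_endings_sketch(b'first\nsecond\r\n')
default_guess = guess_line_endings_sketch('no newline here')
```

Only the first line decides the result, so unix_guess reports UNIX line endings even though a later line ends in a DOS newline.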

pydiffx.utils.text.strip_bom(data, encoding)

Strip a BOM from the beginning of a string.

If the encoding is one that can contain a BOM, and any version of that BOM (such as Big Endian or Little Endian) is present, it will be stripped.

Parameters:
  • data (bytes) – The byte string to strip a BOM from.

  • encoding (unicode) – The encoding of the byte string.

Returns:

The string, without any BOM markers.

Return type:

bytes
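
A minimal sketch of this lookup-and-strip behavior follows. The BOMS mapping is copied from the value documented above. strip_bom_sketch is a hypothetical stand-in, and its handling of encoding names (lowercasing, and returning the data unchanged for encodings not in the mapping) is an assumption.

```python
# Copied from the documented pydiffx.utils.text.BOMS value.
BOMS = {
    'utf-16': (b'\xfe\xff', b'\xff\xfe'),
    'utf-16-be': (b'\xfe\xff',),
    'utf-16-le': (b'\xff\xfe',),
    'utf-32': (b'\x00\x00\xfe\xff', b'\xff\xfe\x00\x00'),
    'utf-32-be': (b'\x00\x00\xfe\xff',),
    'utf-32-le': (b'\xff\xfe\x00\x00',),
    'utf-8': (b'\xef\xbb\xbf',),
}


def strip_bom_sketch(data, encoding):
    """Strip a leading BOM for the given encoding (illustrative only)."""
    # Only the BOMs registered for this encoding are considered, so a
    # UTF-32 LE BOM is never mistaken for a UTF-16 LE one.
    for bom in BOMS.get(encoding.lower(), ()):
        if data.startswith(bom):
            return data[len(bom):]

    # No known BOM for this encoding, or none present.
    return data


utf8_stripped = strip_bom_sketch(b'\xef\xbb\xbfhello', 'utf-8')
utf16_stripped = strip_bom_sketch(b'\xff\xfeh\x00i\x00', 'utf-16')
untouched = strip_bom_sketch(b'hello', 'ascii')
```

Keying the lookup on the encoding keeps the check unambiguous, since the UTF-32 Little Endian BOM begins with the same two bytes as the UTF-16 Little Endian BOM.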