scrubadub.post_processors

PostProcessors process the detected Filth objects and can make changes to them.

They are a recent addition to scrubadub; at the moment only simple ones exist, and these alter the replacement string.

class scrubadub.post_processors.base.PostProcessor(name: Optional[str] = None)[source]

Bases: object

autoload: bool = False
index: int = 10000
__init__(name: Optional[str] = None)[source]
name: str = 'post_processor'
process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]
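
The interface above suggests that a post-processor is a class with a name and a process_filth method that takes a sequence of Filth and returns one. The following is a minimal stand-alone sketch of that pattern, not scrubadub's implementation. StandInFilth, StandInPostProcessor and BracketReplacer are hypothetical names, and the type and replacement_string attributes on the stand-in Filth are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StandInFilth:
    """Hypothetical stand-in for scrubadub.filth.base.Filth."""
    text: str
    type: str
    replacement_string: Optional[str] = None  # assumed attribute


class StandInPostProcessor:
    """Hypothetical stand-in mirroring the PostProcessor interface."""
    name: str = 'post_processor'

    def __init__(self, name: Optional[str] = None) -> None:
        # An explicit name overrides the class-level default.
        if name is not None:
            self.name = name

    def process_filth(self, filth_list: Sequence[StandInFilth]) -> Sequence[StandInFilth]:
        raise NotImplementedError('subclasses implement process_filth')


class BracketReplacer(StandInPostProcessor):
    """Example subclass: replace each filth with its bracketed type."""
    name = 'bracket_replacer'

    def process_filth(self, filth_list: Sequence[StandInFilth]) -> Sequence[StandInFilth]:
        for filth in filth_list:
            filth.replacement_string = '[' + filth.type.upper() + ']'
        return filth_list


filths = [StandInFilth('522-368-8530', 'phone'),
          StandInFilth('hernandezjenna@example.com', 'email')]
processed = BracketReplacer().process_filth(filths)
print([f.replacement_string for f in processed])  # ['[PHONE]', '[EMAIL]']
```

With the real library, the subclass would presumably derive from scrubadub.post_processors.base.PostProcessor instead of the stand-in.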
class scrubadub.post_processors.filth_replacer.FilthReplacer(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)[source]

Bases: scrubadub.post_processors.base.PostProcessor

Creates tokens that are used to replace the Filth found in the text of a document.

This can be configured to include the filth type (e.g. phone, name, email, …), a unique number for each piece of Filth, and a hash of the Filth.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_hash=True, hash_salt='example', hash_length=8),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE-7358BF44 or EMAIL-AC0B8AC3'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_count=True),
... ])
>>> scrubber.clean("Contact me at taylordaniel@example.com or hernandezjenna@example.com, "
...                "but taylordaniel@example.com is probably better.")
'Contact me at EMAIL-0 or EMAIL-1, but EMAIL-0 is probably better.'
name: str = 'filth_replacer'
autoload: bool = False
index: int = 0
typed_lookup: Dict[str, scrubadub.utils.Lookup] = {}
__init__(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)[source]

Initialise the FilthReplacer.

Parameters
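  • include_type (bool, default True) – Include the filth type (e.g. phone, name, email) in the label

  • include_count (bool, default False) – Include a number that identifies each distinct piece of Filth in the label

  • include_hash (bool, default False) – Include a hash of the Filth in the label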

  • uppercase (bool, default True) – Make the label uppercase

  • separator (Optional[str], default None) – Used to separate labels if a merged filth is being replaced

  • hash_length (Optional[int], default None) – The length of the hexadecimal hash

  • hash_salt (Optional[Union[str, bytes]], default None) – The salt used in the hashing process

classmethod reset_lookup()[source]

Reset the lookups that maintain a map of filth to a numeric ID.
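
The numbering seen with include_count=True (EMAIL-0, EMAIL-1, EMAIL-0) can be illustrated with a small lookup that hands out the next free integer to each new value and returns the same integer for a repeated value. This is a hedged sketch and not the scrubadub.utils.Lookup class. CountLookup and its methods are hypothetical names.

```python
from typing import Dict


class CountLookup:
    """Hypothetical stand-in for a lookup that maps values to numeric IDs."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def lookup(self, value: str) -> int:
        # A value seen before keeps its ID; a new value gets the next free one.
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]

    def reset(self) -> None:
        # Forget all assignments so that numbering restarts at 0.
        self._ids.clear()


emails = CountLookup()
ids = [emails.lookup(address) for address in (
    'taylordaniel@example.com',
    'hernandezjenna@example.com',
    'taylordaniel@example.com',
)]
print(ids)  # [0, 1, 0]

emails.reset()
print(emails.lookup('hernandezjenna@example.com'))  # 0
```

Since typed_lookup is declared as a dictionary keyed by string, one such lookup is presumably kept per filth type, which would explain why phone numbers and email addresses are numbered independently.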

filth_label(filth: scrubadub.filth.base.Filth) → str[source]

This function takes a filth and creates a label that can be used to replace the original text.

Parameters

filth (Filth) – The Filth for which a replacement label should be created

Returns

The replacement label that should be used for this Filth.

Return type

str
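
Judging from the example outputs above (PHONE, EMAIL-0, PHONE-7358BF44), a label appears to be the enabled parts joined by a hyphen and optionally uppercased. The sketch below reproduces that shape only. build_label is a hypothetical helper, the hyphen as joiner is inferred from the examples, and the real method may differ, for instance in how merged filth and the separator argument are handled.

```python
from typing import Optional


def build_label(filth_type: str, count: Optional[int] = None,
                hash_text: Optional[str] = None, uppercase: bool = True) -> str:
    """Join the enabled label parts with '-' (hypothetical helper)."""
    parts = [filth_type]
    if count is not None:
        parts.append(str(count))
    if hash_text is not None:
        parts.append(hash_text)
    label = '-'.join(parts)
    return label.upper() if uppercase else label


print(build_label('email'))                        # EMAIL
print(build_label('email', count=0))               # EMAIL-0
print(build_label('phone', hash_text='7358bf44'))  # PHONE-7358BF44
```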

static get_hash(text: str, salt: bytes, length: int) → str[source]

Get a hash of some text, that has been salted and truncated.

Parameters
  • text (str) – The text to be hashed

  • salt (bytes) – The salt that should be used in this hashing

  • length (int) – The length, in characters, of the hexadecimal hash

Returns

The hash of the text

Return type

str
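
A salted, truncated hash of this kind can be sketched with the standard library. The hashing scheme scrubadub uses is not stated here, so HMAC-SHA256 keyed with the salt is an assumption, and salted_hash is a hypothetical name. The output will therefore not necessarily match the tokens shown in the FilthReplacer example.

```python
import hashlib
import hmac


def salted_hash(text: str, salt: bytes, length: int) -> str:
    """Return a salted hexadecimal digest of `text`, cut to `length` characters."""
    # Assumption: HMAC-SHA256 keyed with the salt; scrubadub may use another scheme.
    digest = hmac.new(salt, text.encode('utf-8'), hashlib.sha256).hexdigest()
    return digest[:length]


token = salted_hash('hernandezjenna@example.com', b'example', 8)
print(token)  # eight lowercase hexadecimal characters
```

The same text and salt always give the same token, so repeated occurrences of one value can be linked without revealing the value itself. A shorter length makes collisions between different values more likely.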

process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]

Processes the filth to replace the original text

Parameters

filth_list (Sequence[Filth]) – The filth to be processed

Returns

The processed filths

Return type

Sequence[Filth]

class scrubadub.post_processors.prefix_suffix.PrefixSuffixReplacer(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)[source]

Bases: scrubadub.post_processors.base.PostProcessor

Add a prefix and/or suffix to the Filth’s replacement string.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='{{', suffix='}}'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at {{PHONE}} or {{EMAIL}}'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='<b>', suffix='</b>'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at <b>PHONE</b> or <b>EMAIL</b>'
name: str = 'prefix_suffix_replacer'
autoload: bool = False
index: int = 1
__init__(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)[source]
process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]

Processes the filth to add prefixes and suffixes to the replacement text

Parameters

filth_list (Sequence[Filth]) – The filth to be processed

Returns

The processed filths

Return type

Sequence[Filth]
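
Because both prefix and suffix are typed as Optional[str], either one can presumably be omitted. The sketch below shows one way to wrap a replacement string under that assumption. wrap_replacement is a hypothetical helper and not part of scrubadub.

```python
from typing import Optional


def wrap_replacement(replacement: str, prefix: Optional[str] = '{{',
                     suffix: Optional[str] = '}}') -> str:
    """Add whichever of prefix and suffix is not None (hypothetical helper)."""
    return (prefix or '') + replacement + (suffix or '')


print(wrap_replacement('PHONE'))                 # {{PHONE}}
print(wrap_replacement('EMAIL', '<b>', '</b>'))  # <b>EMAIL</b>
print(wrap_replacement('EMAIL', prefix=None))    # EMAIL}}
```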

class scrubadub.post_processors.remover.FilthRemover(name: Optional[str] = None)[source]

Bases: scrubadub.post_processors.base.PostProcessor

Removes all found filth from the original document.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthRemover(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at  or '
name: str = 'filth_remover'
autoload: bool = False
index: int = 0
process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]

Processes the filth so that it is removed from the text

Parameters

filth_list (Sequence[Filth]) – The filth to be processed

Returns

The processed filths

Return type

Sequence[Filth]

Catalogue functions

scrubadub.post_processors.register_post_processor

scrubadub.post_processors.register_post_processor(post_processor: Type[PostProcessor], autoload: Optional[bool] = None, index: Optional[int] = None) → None[source]

Register a PostProcessor for use with the Scrubber class.

You can use register_post_processor(NewPostProcessor) after your post-processor definition to automatically register it with the Scrubber class so that it can be used to process Filth.

The autoload argument sets whether a new Scrubber() instance loads this PostProcessor by default.

Parameters
  • post_processor (PostProcessor class) – The PostProcessor to register with the scrubadub post-processor configuration.

  • autoload (bool) – Whether to automatically load this PostProcessor on Scrubber initialisation.

  • index (int) – The position (index) at which this PostProcessor should be added.

scrubadub.post_processors.remove_post_processor

scrubadub.post_processors.remove_post_processor(post_processor: Union[Type[PostProcessor], str]) → None[source]

Remove an already registered post-processor.

Parameters

post_processor (Union[Type['PostProcessor'], str]) – The PostProcessor to remove from the scrubadub post-processor configuration.