scrubadub.post_processors¶

PostProcessors process the detected Filth objects and can make changes to them.

They are a recent addition to scrubadub; at the moment only simple ones exist, all of which alter the replacement string.

class scrubadub.post_processors.base.PostProcessor(name: Optional[str] = None)[source]¶

Bases: object

autoload: bool = False¶

index: int = 10000¶

__init__(name: Optional[str] = None)[source]¶

name: str = 'post_processor'¶

process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]¶
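The contract suggested by the signature above is that a post-processor receives the detected Filth objects and returns them, possibly modified. The following is a minimal sketch of that contract using stand-in classes rather than scrubadub itself; `StandInFilth`, `StandInPostProcessor`, `BracketReplacer` and the `replacement_string` attribute name are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StandInFilth:
    """Hypothetical stand-in for scrubadub.filth.base.Filth."""
    text: str
    replacement_string: Optional[str] = None  # assumed attribute name


class StandInPostProcessor:
    """Hypothetical stand-in mirroring the documented base-class attributes."""
    name: str = 'post_processor'
    autoload: bool = False
    index: int = 10000

    def __init__(self, name: Optional[str] = None):
        if name is not None:
            self.name = name

    def process_filth(self, filth_list: Sequence[StandInFilth]) -> Sequence[StandInFilth]:
        # Assumption: subclasses are expected to override this method.
        raise NotImplementedError('must be implemented in derived classes')


class BracketReplacer(StandInPostProcessor):
    """Hypothetical post-processor that wraps each replacement in brackets."""
    name = 'bracket_replacer'

    def process_filth(self, filth_list):
        for filth in filth_list:
            # Leave filth alone if no earlier step has set a replacement yet.
            if filth.replacement_string is not None:
                filth.replacement_string = '[' + filth.replacement_string + ']'
        return filth_list


filths = [StandInFilth('hernandezjenna@example.com', 'EMAIL')]
print(BracketReplacer().process_filth(filths)[0].replacement_string)  # [EMAIL]
```

With the real library, the same pattern would presumably subclass `scrubadub.post_processors.base.PostProcessor` and be passed to `Scrubber(post_processor_list=[...])` as in the examples further down.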

class scrubadub.post_processors.filth_replacer.FilthReplacer(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)[source]¶

Bases: scrubadub.post_processors.base.PostProcessor

Creates tokens that are used to replace the Filth found in the text of a document.

This can be configured to include the filth type (e.g. phone, name, email, …), a unique number for each distinct piece of Filth, and a hash of the Filth.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_hash=True, hash_salt='example', hash_length=8),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE-7358BF44 or EMAIL-AC0B8AC3'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(include_count=True),
... ])
>>> scrubber.clean("Contact me at taylordaniel@example.com or hernandezjenna@example.com, "
...                "but taylordaniel@example.com is probably better.")
'Contact me at EMAIL-0 or EMAIL-1, but EMAIL-0 is probably better.'

name: str = 'filth_replacer'¶

autoload: bool = False¶

index: int = 0¶

typed_lookup: Dict[str, scrubadub.utils.Lookup] = {}¶

__init__(include_type: bool = True, include_count: bool = False, include_hash: bool = False, uppercase: bool = True, separator: Optional[str] = None, hash_length: Optional[int] = None, hash_salt: Optional[Union[str, bytes]] = None, **kwargs)[source]¶

Initialise the FilthReplacer.

Parameters

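include_type (bool, default True) – Include the filth type (e.g. phone, email) in the label
include_count (bool, default False) – Include a number that identifies each distinct piece of Filth
include_hash (bool, default False) – Include a hash of the Filth text in the label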
uppercase (bool, default True) – Make the label uppercase
separator (Optional[str], default None) – Used to separate labels if a merged filth is being replaced
hash_length (Optional[int], default None) – The length of the hexadecimal hash
hash_salt (Optional[Union[str, bytes]], default None) – The salt used in the hashing process

classmethod reset_lookup()[source]¶

Reset the lookups that maintain a map of filth to a numeric ID.
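To illustrate what a map of filth to a numeric ID might look like, here is a minimal sketch of such a lookup. `CountLookup` is a hypothetical class written for this example, not `scrubadub.utils.Lookup`, whose interface may differ.

```python
from typing import Dict, Hashable


class CountLookup:
    """Hypothetical sketch: assigns each distinct key the next integer ID."""

    def __init__(self) -> None:
        self.table: Dict[Hashable, int] = {}

    def lookup(self, key: Hashable) -> int:
        # Reuse the ID if the key was seen before, otherwise take the next free one.
        if key not in self.table:
            self.table[key] = len(self.table)
        return self.table[key]

    def reset(self) -> None:
        # After a reset, numbering starts again from zero.
        self.table.clear()


emails = CountLookup()
ids = [emails.lookup(text) for text in (
    'taylordaniel@example.com',
    'hernandezjenna@example.com',
    'taylordaniel@example.com',
)]
print(ids)  # [0, 1, 0]

emails.reset()
print(emails.lookup('hernandezjenna@example.com'))  # 0
```

The repeated address receives the same ID, which matches the `EMAIL-0 ... EMAIL-1 ... EMAIL-0` output in the `include_count=True` example above.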

filth_label(filth: scrubadub.filth.base.Filth) → str[source]¶

This function takes a filth and creates a label that can be used to replace the original text.

Parameters: filth (Filth) – The Filth for which a replacement label should be created
Returns: The replacement label that should be used for this Filth.
Return type: str
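As a rough sketch of how such a label could be assembled from the options described above: `build_label` is a hypothetical helper, and both the `-` joiner and the order of the hash and count pieces are assumptions inferred from the example outputs, not taken from the scrubadub source.

```python
from typing import List, Optional


def build_label(filth_type: str,
                count: Optional[int] = None,
                filth_hash: Optional[str] = None,
                include_type: bool = True,
                uppercase: bool = True) -> str:
    """Hypothetical helper that joins the enabled label pieces with '-'."""
    parts: List[str] = []
    if include_type:
        parts.append(filth_type.upper() if uppercase else filth_type)
    if filth_hash is not None:
        parts.append(filth_hash)   # assumed to come before the count
    if count is not None:
        parts.append(str(count))
    return '-'.join(parts)


print(build_label('phone'))                         # PHONE
print(build_label('email', count=0))                # EMAIL-0
print(build_label('phone', filth_hash='7358BF44'))  # PHONE-7358BF44
```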

static get_hash(text: str, salt: bytes, length: int) → str[source]¶

Get a hash of some text, that has been salted and truncated.

Parameters

text (str) – The text to be hashed
salt (bytes) – The salt that should be used in this hashing
length (int) – The number of characters long that the hexadecimal hash should be

Returns

The hash of the text

Return type

str
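A salted, truncated hexadecimal hash of this kind can be sketched with the standard library. The construction below (HMAC-SHA256 keyed with the salt) is an assumption chosen for illustration; scrubadub may use a different algorithm, so the output will not necessarily match the hashes shown in the examples above. `salted_hash` is a hypothetical name.

```python
import hashlib
import hmac


def salted_hash(text: str, salt: bytes, length: int) -> str:
    """Hypothetical sketch: hex digest of HMAC-SHA256, cut to `length` characters."""
    # A SHA-256 hex digest has 64 characters, so `length` is capped at 64 here.
    digest = hmac.new(salt, text.encode('utf-8'), hashlib.sha256).hexdigest()
    return digest[:length]


first = salted_hash('522-368-8530', b'example', 8)
second = salted_hash('522-368-8530', b'example', 8)
print(len(first), first == second)  # 8 True
```

The same text and salt always yield the same token, so repeated occurrences of a value can be matched across documents without the original text being exposed.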

process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]¶

Processes the filth to replace the original text

Parameters: filth_list (Sequence[Filth]) – The filth to be processed
Returns: The processed filths
Return type: Sequence[Filth]

class scrubadub.post_processors.prefix_suffix.PrefixSuffixReplacer(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)[source]¶

Bases: scrubadub.post_processors.base.PostProcessor

Add a prefix and/or suffix to the Filth’s replacement string.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at PHONE or EMAIL'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='{{', suffix='}}'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at {{PHONE}} or {{EMAIL}}'
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthReplacer(),
...     scrubadub.post_processors.PrefixSuffixReplacer(prefix='<b>', suffix='</b>'),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at <b>PHONE</b> or <b>EMAIL</b>'

name: str = 'prefix_suffix_replacer'¶

autoload: bool = False¶

index: int = 1¶

__init__(prefix: Optional[str] = '{{', suffix: Optional[str] = '}}', name: Optional[str] = None)[source]¶

process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]¶

Processes the filth to add prefixes and suffixes to the replacement text

Parameters: filth_list (Sequence[Filth]) – The filth to be processed
Returns: The processed filths
Return type: Sequence[Filth]

class scrubadub.post_processors.remover.FilthRemover(name: Optional[str] = None)[source]¶

Bases: scrubadub.post_processors.base.PostProcessor

Removes all found filth from the original document.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(post_processor_list=[
...     scrubadub.post_processors.FilthRemover(),
... ])
>>> scrubber.clean("Contact me at 522-368-8530 or hernandezjenna@example.com")
'Contact me at  or '

name: str = 'filth_remover'¶

autoload: bool = False¶

index: int = 0¶

process_filth(filth_list: Sequence[scrubadub.filth.base.Filth]) → Sequence[scrubadub.filth.base.Filth][source]¶

Processes the filth so that it is removed from the text

Parameters: filth_list (Sequence[Filth]) – The filth to be processed
Returns: The processed filths
Return type: Sequence[Filth]

Catalogue functions¶

scrubadub.post_processors.register_post_processor¶

scrubadub.post_processors.register_post_processor(post_processor: Type[PostProcessor], autoload: Optional[bool] = None, index: Optional[int] = None) → None[source]¶

Register a PostProcessor for use with the Scrubber class.

You can use register_post_processor(NewPostProcessor) after your post-processor definition to automatically register it with the Scrubber class so that it can be used to process Filth.

The autoload argument sets whether a new Scrubber() instance should load this PostProcessor by default.

Parameters

post_processor (PostProcessor class) – The PostProcessor to register with the scrubadub post-processor configuration.
autoload (bool) – Whether to automatically load this PostProcessor on Scrubber initialisation.
index (int) – The location/index in which this PostProcessor should be added.

scrubadub.post_processors.remove_post_processor¶

scrubadub.post_processors.remove_post_processor(post_processor: Union[Type[PostProcessor], str]) → None[source]¶

Remove an already registered post-processor.

Parameters: post_processor (Union[Type['PostProcessor'], str]) – The PostProcessor (or its name) to remove from the scrubadub post-processor configuration.
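The behaviour of the two catalogue functions can be pictured with a small stand-in registry. This is a sketch under stated assumptions, not scrubadub's implementation: `catalogue`, `register`, `remove`, `autoloaded` and `StandInPostProcessor` are hypothetical names, and it is assumed that the catalogue is keyed by the post-processor's `name` and that `index` determines ordering.

```python
from typing import Dict, List, Optional, Type, Union


class StandInPostProcessor:
    """Hypothetical stand-in carrying the documented class attributes."""
    name: str = 'post_processor'
    autoload: bool = False
    index: int = 10000


# Hypothetical catalogue, keyed by post-processor name (an assumption).
catalogue: Dict[str, Type[StandInPostProcessor]] = {}


def register(post_processor: Type[StandInPostProcessor],
             autoload: Optional[bool] = None,
             index: Optional[int] = None) -> None:
    # Only override the class defaults when a value is explicitly given.
    if autoload is not None:
        post_processor.autoload = autoload
    if index is not None:
        post_processor.index = index
    catalogue[post_processor.name] = post_processor


def remove(post_processor: Union[Type[StandInPostProcessor], str]) -> None:
    # Accept either the class or its name, as the documented signature does.
    name = post_processor if isinstance(post_processor, str) else post_processor.name
    catalogue.pop(name, None)


def autoloaded() -> List[Type[StandInPostProcessor]]:
    # Assumption: autoloaded post-processors run in ascending index order.
    return [p for p in sorted(catalogue.values(), key=lambda p: p.index) if p.autoload]


class Shouter(StandInPostProcessor):
    name = 'shouter'


register(Shouter, autoload=True, index=5)
print([p.name for p in autoloaded()])  # ['shouter']
remove('shouter')
print(catalogue)  # {}
```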