scrubadub

Several convenience functions make using scrubadub quick and simple. Each one either removes the Filth from the text (such as scrubadub.clean) or returns a list of the Filth objects that were found (such as scrubadub.list_filth). Each works either on a single document given as a string (such as scrubadub.clean) or on a set of documents given as a dictionary or list (such as scrubadub.clean_documents).

scrubadub.clean

scrubadub.clean(text: str, locale: Optional[str] = None, **kwargs) -> str

Searches for Filth in a string of text and replaces it with placeholders.

>>> import scrubadub
>>> scrubadub.clean(u"contact me at joe@example.com")
'contact me at {{EMAIL}}'
Parameters
  • text (str) – The text containing possible PII that needs to be redacted

  • locale (str) – The locale of the text, in the format: two-letter lower-case language code, an underscore, then the two-letter upper-case country code, e.g. “en_GB” or “de_CH”

Returns

Text with all Filth replaced.

Return type

str
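
The locale format described above can be checked with a short regular expression. This is a minimal sketch: looks_like_locale is a hypothetical helper, not part of scrubadub, and it only tests the shape of the string, not whether the locale exists or is supported by any detector.

```python
import re

# Hypothetical helper, not part of scrubadub. It checks only the documented
# "xx_YY" shape: two lower-case letters, an underscore, two upper-case letters.
_LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def looks_like_locale(locale: str) -> bool:
    """Return True if the string has the documented locale shape."""
    return _LOCALE_PATTERN.fullmatch(locale) is not None
```

With this sketch, "en_GB" and "de_CH" pass, while "en-GB" and "EN_gb" are rejected.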

scrubadub.clean_documents

scrubadub.clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> Union[Sequence[str], Dict[Optional[str], str]]

Searches for Filth in documents and replaces it with placeholders.

documents can be a dict, in the format {'document_name': 'document'}, or a list of strings (each a separate document). This can be useful when processing many documents.

>>> import scrubadub
>>> scrubadub.clean_documents({'contact.txt': "contact me at joe@example.com",
...     'hello.txt': 'hello world!'})
{'contact.txt': 'contact me at {{EMAIL}}', 'hello.txt': 'hello world!'}

>>> scrubadub.clean_documents(["contact me at joe@example.com", 'hello world!'])
['contact me at {{EMAIL}}', 'hello world!']
Parameters
  • documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be redacted, given either as a list of documents or as a dictionary with the document name as the key and the document text as the value

  • locale (str) – The locale of the documents, in the format: two-letter lower-case language code, an underscore, then the two-letter upper-case country code, e.g. “en_GB” or “de_CH”

Returns

Documents in the same format as input, but with Filth redacted

Return type

list of str objects, dict of str objects; same as input
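
The "same format as input" behaviour can be illustrated with a toy stand-in. This is a sketch under stated assumptions, not scrubadub's implementation: toy_clean and toy_clean_documents are hypothetical names, and the e-mail pattern is deliberately simplistic.

```python
import re
from typing import Dict, Optional, Sequence, Union

Documents = Union[Sequence[str], Dict[Optional[str], str]]

# Deliberately simplistic e-mail pattern, for illustration only.
_TOY_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def toy_clean(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return _TOY_EMAIL.sub("{{EMAIL}}", text)


def toy_clean_documents(documents: Documents) -> Documents:
    """Clean every document and return the same container type as the input."""
    if isinstance(documents, dict):
        # Keep the document names as keys; only the text changes.
        return {name: toy_clean(text) for name, text in documents.items()}
    # Any other sequence is returned as a list, in the original order.
    return [toy_clean(text) for text in documents]
```

A dict input yields a dict with the same keys, and a list input yields a list in the same order.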

scrubadub.list_filth

scrubadub.list_filth(text: str, locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]

Return a list of Filth that was detected in the string text.

>>> import scrubadub
>>> scrubadub.list_filth(u"contact me at joe@example.com")
[<EmailFilth text='joe@example.com' beg=14 end=29 detector_name='email' locale='en_US'>]
Parameters
  • text (str) – The text containing possible PII that needs to be found

  • locale (str) – The locale of the text, in the format: two-letter lower-case language code, an underscore, then the two-letter upper-case country code, e.g. “en_GB” or “de_CH”

Returns

A list of all the Filth objects that were found

Return type

list of Filth objects
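
In the example output above, beg=14 and end=29 are consistent with ordinary Python slice offsets into the input string. The sketch below reproduces those numbers with a toy stand-in; ToySpan and toy_list_emails are hypothetical names and the e-mail pattern is an assumption, not scrubadub's detector.

```python
import re
from dataclasses import dataclass
from typing import List

# Deliberately simplistic e-mail pattern, for illustration only.
_TOY_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class ToySpan:
    """Toy stand-in for a Filth object: the matched text and its offsets."""
    text: str
    beg: int
    end: int


def toy_list_emails(text: str) -> List[ToySpan]:
    """Return one ToySpan per e-mail-like match, in order of appearance."""
    return [ToySpan(m.group(), m.start(), m.end()) for m in _TOY_EMAIL.finditer(text)]


text = "contact me at joe@example.com"
spans = toy_list_emails(text)
# The offsets slice the original string back to the matched text.
print(spans[0].beg, spans[0].end, text[spans[0].beg:spans[0].end])
# prints: 14 29 joe@example.com
```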

scrubadub.list_filth_documents

scrubadub.list_filth_documents(documents: Union[List[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]

Return a list of Filth that was detected in the documents.

documents can be a dict, in the format {'document_name': 'document'}, or a list of strings (each a separate document). This can be useful when processing many documents.

>>> import scrubadub
>>> scrubadub.list_filth_documents(
...     {'contact.txt': "contact me at joe@example.com", 'hello.txt': 'hello world!'}
... )
[<EmailFilth text='joe@example.com' document_name='contact.txt' beg=14 end=29 detector_name='email' locale='en_US'>]

>>> scrubadub.list_filth_documents(["contact me at joe@example.com", 'hello world!'])
[<EmailFilth text='joe@example.com' document_name='0' beg=14 end=29 detector_name='email' locale='en_US'>]
Parameters
  • documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be found, given either as a list of documents or as a dictionary with the document name as the key and the document text as the value

  • locale (str) – The locale of the documents, in the format: two-letter lower-case language code, an underscore, then the two-letter upper-case country code, e.g. “en_GB” or “de_CH”

Returns

A list of all the Filth objects that were found

Return type

list of Filth objects

scrubadub.Scrubber

All of the Detectors are managed by the Scrubber. The main job of the Scrubber is to handle situations in which the same section of text contains different types of Filth.
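
One way to picture the overlap problem is as interval merging: when two detectors flag overlapping character ranges, the ranges are combined into one span that remembers both types. The sketch below is an illustration of that idea only; merge_overlapping is a hypothetical function, and this is not claimed to be the algorithm the Scrubber uses.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (beg, end, filth type)
Merged = Tuple[int, int, List[str]]  # (beg, end, all filth types in the span)


def merge_overlapping(spans: List[Span]) -> List[Merged]:
    """Combine spans whose character ranges overlap into single spans."""
    merged: List[Merged] = []
    for beg, end, kind in sorted(spans):
        if merged and beg < merged[-1][1]:
            # This span starts before the previous one ends: extend it.
            last_beg, last_end, kinds = merged[-1]
            merged[-1] = (last_beg, max(last_end, end), kinds + [kind])
        else:
            merged.append((beg, end, [kind]))
    return merged


print(merge_overlapping([(14, 29, "email"), (18, 29, "url"), (40, 45, "phone")]))
# prints: [(14, 29, ['email', 'url']), (40, 45, ['phone'])]
```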

class scrubadub.scrubbers.Scrubber(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)

Bases: object

The Scrubber class is used to clean personal information out of dirty text. It manages a set of Detectors that are each responsible for identifying Filth. PostProcessor objects are used to alter the found Filth, for example to replace the Filth with a hash or token.

__init__(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)

Create a Scrubber object.

Parameters
  • detector_list (Optional[Sequence[Union[Type[Detector], Detector, str]]]) – The list of detectors to use in this scrubber.

  • post_processor_list (Optional[Sequence[Union[Type[PostProcessor], PostProcessor, str]]]) – The list of post-processors to use in this scrubber.

  • locale (str, optional) – The locale of the documents, in the format: two-letter lower-case language code, an underscore, then the two-letter upper-case country code, e.g. “en_GB” or “de_CH”.

add_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str], warn: bool = True)

Add a Detector to a Scrubber

You can add a detector to a Scrubber by passing one of three objects to this function:

  1. the uninitialised class, which is then initialised with default settings.

  2. an instance of a Detector class, which you can initialise with the settings desired.

  3. a string containing the name of the detector, which again initialises the class with default settings.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=False)
>>> scrubber.add_detector(detector)
Parameters
  • detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to add to this scrubber.

  • warn (bool, default True) – raise a warning if the locale is not supported by the detector.
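
The "class, instance, or name" pattern described for add_detector can be sketched with toy classes. Everything here is hypothetical (ToyDetector, ToyEmailDetector, ToyScrubber, the known_detectors table), and raising KeyError on a duplicate name is an assumption of the sketch rather than documented behaviour.

```python
from typing import Dict, Type, Union


class ToyDetector:
    """Stand-in for a detector base class (hypothetical, not scrubadub's)."""
    name = "toy"

    def __init__(self, strict: bool = True) -> None:
        self.strict = strict


class ToyEmailDetector(ToyDetector):
    name = "email"


class ToyScrubber:
    # Lookup table used to resolve a detector name to its class.
    known_detectors: Dict[str, Type[ToyDetector]] = {"email": ToyEmailDetector}

    def __init__(self) -> None:
        self.detectors: Dict[str, ToyDetector] = {}

    def add_detector(self, detector: Union[ToyDetector, Type[ToyDetector], str]) -> None:
        if isinstance(detector, str):
            # A name: look up the class and build it with default settings.
            detector = self.known_detectors[detector]()
        elif isinstance(detector, type):
            # An uninitialised class: build it with default settings.
            detector = detector()
        # Otherwise it is already an instance carrying the caller's settings.
        if detector.name in self.detectors:
            raise KeyError(f"detector {detector.name!r} is already registered")
        self.detectors[detector.name] = detector
```

Passing an instance is the only one of the three forms that lets the caller choose non-default settings, which matches the distinction drawn in the list above.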

remove_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str])

Remove a Detector from a Scrubber

You can remove a detector from a Scrubber by passing one of three objects to this function:

  1. the uninitialised class, which removes the initialised detector of the same name.

  2. an instance of a Detector class, which removes the initialised detector of the same name.

  3. a string containing the name of the detector, which removes the detector of that name.

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.remove_detector('url')
>>> detector = scrubadub.detectors.email.EmailDetector()
>>> scrubber.remove_detector(detector)
Parameters

detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to remove from this scrubber.

add_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str], index: Optional[int] = None)

Add a PostProcessor to a Scrubber

You can add a post-processor to a Scrubber by passing one of three objects to this function:

  1. the uninitialised class, which is then initialised with default settings.

  2. an instance of a PostProcessor class, which you can initialise with the settings desired.

  3. a string containing the name of the post-processor, which again initialises the class with default settings.

>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_post_processor('filth_replacer')
>>> scrubber.add_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
Parameters

post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to add to this scrubber.

remove_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str])

Remove a PostProcessor from a Scrubber

You can remove a post-processor from a Scrubber by passing one of three objects to this function:

  1. the uninitialised class, which removes the initialised post-processor of the same name.

  2. an instance of a PostProcessor class, which removes the initialised post-processor of the same name.

  3. a string containing the name of the post-processor, which removes the post-processor of that name.

>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_post_processor('filth_type_replacer')
>>> scrubber.remove_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
Parameters

post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to remove from this scrubber.

clean(text: str, **kwargs) -> str

This is the main method, which cleans all of the Filth out of the dirty text. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.

clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], **kwargs) -> Union[Dict[Optional[str], str], Sequence[str]]

This is the main method for a set of documents: it cleans all of the Filth out of every document in documents. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.

iter_filth(text: str, document_name: Optional[str] = None, run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]

Iterate over the Filth that is detected in text.

iter_filth_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]

Iterate over the Filth that is detected in documents.
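
The generator style of the iter_filth methods can be sketched as follows. This is a toy illustration, not the library's code: toy_iter_filth_documents and the e-mail pattern are assumptions. Naming list entries by their index as a string mirrors the document_name='0' shown in the list_filth_documents example earlier.

```python
import re
from typing import Dict, Iterator, Optional, Sequence, Tuple, Union

# Deliberately simplistic e-mail pattern, for illustration only.
_TOY_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

Documents = Union[Sequence[str], Dict[Optional[str], str]]
Found = Tuple[Optional[str], int, int, str]  # (document_name, beg, end, text)


def toy_iter_filth_documents(documents: Documents) -> Iterator[Found]:
    """Lazily yield one tuple per e-mail-like match, document by document."""
    if isinstance(documents, dict):
        items = documents.items()
    else:
        # List input: use the position in the list, as a string, as the name.
        items = ((str(i), text) for i, text in enumerate(documents))
    for name, text in items:
        for match in _TOY_EMAIL.finditer(text):
            yield (name, match.start(), match.end(), match.group())
```

Because the function is a generator, nothing is scanned until the caller starts iterating, so a large set of documents can be processed one match at a time.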