scrubadub
There are several convenience functions to make using scrubadub quick and simple.
These functions either remove the Filth from the text (such as scrubadub.clean) or
return a list of Filth objects that were found (such as scrubadub.list_filth).
They work either on a single document given as a string (such as scrubadub.clean) or
on a set of documents given as a dictionary or a list (such as scrubadub.clean_documents).
scrubadub.clean
- scrubadub.clean(text: str, locale: Optional[str] = None, **kwargs) -> str
Searches for Filth in text in a string and replaces it with placeholders.
>>> import scrubadub
>>> scrubadub.clean(u"contact me at joe@example.com")
'contact me at {{EMAIL}}'
- Parameters
text (str) – The text containing possible PII that needs to be redacted
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”
- Returns
Text with all Filth replaced.
- Return type
str
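The locale format described above can be checked before a call is made. The following is a minimal sketch and not part of scrubadub: the helper name is_locale_string is hypothetical, and it only checks the shape of the string, not whether the language or country code actually exists.

```python
import re

# Two lower-case letters, an underscore, then two upper-case letters, e.g. "en_GB".
_LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_locale_string(locale: str) -> bool:
    """Return True if ``locale`` has the shape described for the locale parameter.

    Hypothetical helper for illustration only; scrubadub does its own locale handling.
    """
    return _LOCALE_PATTERN.fullmatch(locale) is not None


if __name__ == "__main__":
    for candidate in ("en_GB", "de_CH", "EN_gb", "english"):
        print(candidate, is_locale_string(candidate))
```

A check like this could catch a mistyped locale such as "en-GB" early, before any text is processed.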
scrubadub.clean_documents
- scrubadub.clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> Union[Sequence[str], Dict[Optional[str], str]]
Searches for Filth in documents and replaces it with placeholders.
documents can be a dict, in the format {'document_name': 'document'}, or a list of strings (each a separate document). This can be useful when processing many documents.
>>> import scrubadub
>>> scrubadub.clean_documents({'contact.txt': "contact me at joe@example.com",
...     'hello.txt': 'hello world!'})
{'contact.txt': 'contact me at {{EMAIL}}', 'hello.txt': 'hello world!'}
>>> scrubadub.clean_documents(["contact me at joe@example.com", 'hello world!'])
['contact me at {{EMAIL}}', 'hello world!']
- Parameters
documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be redacted, given either as a list of documents or as a dictionary with the document name as the key and the document text as the value
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”
- Returns
Documents in the same format as input, but with Filth redacted
- Return type
list of str objects, dict of str objects; same as input
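To illustrate how the return value mirrors the input format, here is a minimal sketch that does not use scrubadub itself. The function clean_documents_sketch, the email regular expression, and the fixed {{EMAIL}} substitution are stand-ins chosen for illustration; scrubadub's real detectors are considerably more thorough.

```python
import re
from typing import Dict, Optional, Sequence, Union

# Deliberately simple stand-in for an email detector (an assumption for this sketch).
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

Documents = Union[Sequence[str], Dict[Optional[str], str]]


def _redact(text: str) -> str:
    """Replace every match of the stand-in pattern with a placeholder."""
    return _EMAIL.sub("{{EMAIL}}", text)


def clean_documents_sketch(documents: Documents) -> Documents:
    """Return the documents in the same container type they were given in."""
    if isinstance(documents, dict):
        # dict in, dict out: the document names are kept as keys.
        return {name: _redact(text) for name, text in documents.items()}
    # list in, list out: the order of the documents is preserved.
    return [_redact(text) for text in documents]


if __name__ == "__main__":
    print(clean_documents_sketch({"contact.txt": "contact me at joe@example.com"}))
    print(clean_documents_sketch(["contact me at joe@example.com", "hello world!"]))
```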
scrubadub.list_filth
- scrubadub.list_filth(text: str, locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]
Return a list of Filth that was detected in the string text.
>>> import scrubadub
>>> scrubadub.list_filth(u"contact me at joe@example.com")
[<EmailFilth text='joe@example.com' beg=14 end=29 detector_name='email' locale='en_US'>]
- Parameters
text (str) – The text containing possible PII that needs to be found
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”
- Returns
A list of all the Filth objects that were found
- Return type
list of Filth objects
scrubadub.list_filth_documents
- scrubadub.list_filth_documents(documents: Union[List[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]
Return a list of Filth that was detected in the documents.
documents can be a dict, in the format {'document_name': 'document'}, or a list of strings (each a separate document). This can be useful when processing many documents.
>>> import scrubadub
>>> scrubadub.list_filth_documents(
...     {'contact.txt': "contact me at joe@example.com", 'hello.txt': 'hello world!'}
... )
[<EmailFilth text='joe@example.com' document_name='contact.txt' beg=14 end=29 detector_name='email' locale='en_US'>]
>>> scrubadub.list_filth_documents(["contact me at joe@example.com", 'hello world!'])
[<EmailFilth text='joe@example.com' document_name='0' beg=14 end=29 detector_name='email' locale='en_US'>]
- Parameters
documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be found, given either as a list of documents or as a dictionary with the document name as the key and the document text as the value
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”
- Returns
A list of all the Filth objects that were found
- Return type
list of Filth objects
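The beg and end offsets and the document_name values in the example output can be reproduced with a small sketch. It does not call scrubadub: FoundSpan, list_spans_sketch and the email regular expression are hypothetical names introduced here, and naming list entries by their index as a string is an assumption based on the document_name='0' shown in the example output.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Sequence, Union

# Deliberately simple stand-in for an email detector (an assumption for this sketch).
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class FoundSpan(NamedTuple):
    """Hypothetical record of one match; not scrubadub's Filth class."""
    text: str
    document_name: Optional[str]
    beg: int  # index of the first matched character
    end: int  # index one past the last matched character


def list_spans_sketch(
    documents: Union[Sequence[str], Dict[Optional[str], str]]
) -> List[FoundSpan]:
    """Collect matches from all documents into one flat list."""
    if isinstance(documents, dict):
        named = documents
    else:
        # Assumption: list entries are named by their position, as a string.
        named = {str(index): text for index, text in enumerate(documents)}
    found = []
    for name, text in named.items():
        for match in _EMAIL.finditer(text):
            found.append(FoundSpan(match.group(), name, match.start(), match.end()))
    return found


if __name__ == "__main__":
    print(list_spans_sketch(["contact me at joe@example.com", "hello world!"]))
```

Documents without a match, such as 'hello world!', contribute nothing to the returned list.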
scrubadub.Scrubber
All of the Detectors are managed by the Scrubber. The main job of the
Scrubber is to handle situations in which the same section of text contains
different types of Filth.
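One way to picture this job is as merging overlapping character ranges before anything is replaced. The sketch below is only an illustration under that assumption and is not scrubadub's implementation: Span, merge_overlapping, and the choice to join the type labels with "+" are all hypothetical.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical detected range; not scrubadub's Filth class."""
    beg: int   # index of the first character
    end: int   # index one past the last character
    kind: str  # label of the detector that found it


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Merge spans that share at least one character into a single span."""
    merged: List[Span] = []
    # Sorting by (beg, end, kind) lets us compare each span with the last merged one.
    for span in sorted(spans):
        if merged and span.beg < merged[-1].end:
            last = merged[-1]
            if span.kind in last.kind.split("+"):
                kinds = last.kind
            else:
                kinds = f"{last.kind}+{span.kind}"
            merged[-1] = Span(last.beg, max(last.end, span.end), kinds)
        else:
            merged.append(span)
    return merged


if __name__ == "__main__":
    # In "visit joe@example.com" an email detector could flag characters 6-21
    # while a URL detector flags "example.com" at 10-21.
    print(merge_overlapping([Span(10, 21, "url"), Span(6, 21, "email")]))
```

Spans that merely touch (one ends where the next begins) are kept separate in this sketch, since they do not share any character.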
- class scrubadub.scrubbers.Scrubber(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)
Bases: object
The Scrubber class is used to clean personal information out of dirty dirty text. It manages a set of Detectors that are each responsible for identifying Filth. PostProcessor objects are used to alter the found Filth, for example to replace the Filth with a hash or token.
- __init__(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)
Create a Scrubber object.
- Parameters
detector_list (Optional[Sequence[Union[Type[Detector], Detector, str]]]) – The list of detectors to use in this scrubber.
post_processor_list (Optional[Sequence[Union[Type[PostProcessor], PostProcessor, str]]]) – The list of post-processors to use in this scrubber.
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- add_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str], warn: bool = True)
Add a Detector to a Scrubber.
You can add a detector to a Scrubber by passing one of three objects to this function:
the uninitialised class, which is then initialised with default settings;
an instance of a Detector class, which you can initialise with the desired settings;
a string containing the name of the detector, which again initialises the class with default settings.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=False)
>>> scrubber.add_detector(detector)
- Parameters
detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to add to this scrubber.
warn (bool, default True) – raise a warning if the locale is not supported by the detector.
- remove_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str])
Remove a Detector from a Scrubber.
You can remove a detector from a Scrubber by passing one of three objects to this function:
the uninitialised class, which removes the initialised detector of the same name;
an instance of a Detector class, which removes the initialised detector of the same name;
a string containing the name of the detector, which removes the detector of that name.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.remove_detector('url')
>>> detector = scrubadub.detectors.email.EmailDetector()
>>> scrubber.remove_detector(detector)
- Parameters
detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to remove from this scrubber.
- add_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str], index: Optional[int] = None)
Add a PostProcessor to a Scrubber.
You can add a post-processor to a Scrubber by passing one of three objects to this function:
the uninitialised class, which is then initialised with default settings;
an instance of a PostProcessor class, which you can initialise with the desired settings;
a string containing the name of the post-processor, which again initialises the class with default settings.
>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_post_processor('filth_replacer')
>>> scrubber.add_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
- Parameters
post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to add to this scrubber.
- remove_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str])
Remove a PostProcessor from a Scrubber.
You can remove a post-processor from a Scrubber by passing one of three objects to this function:
the uninitialised class, which removes the initialised post-processor of the same name;
an instance of a PostProcessor class, which removes the initialised post-processor of the same name;
a string containing the name of the post-processor, which removes the post-processor of that name.
>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_post_processor('filth_type_replacer')
>>> scrubber.remove_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
- Parameters
post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to remove from this scrubber.
- clean(text: str, **kwargs) -> str
This is the master method that cleans all of the filth out of the dirty dirty text. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.
- clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], **kwargs) -> Union[Dict[Optional[str], str], Sequence[str]]
This is the master method that cleans all of the filth out of the dirty dirty documents. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.
- iter_filth(text: str, document_name: Optional[str] = None, run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]
Iterate over the Filth of any type that is found in text, yielding one Filth object at a time.
- iter_filth_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]
Iterate over the Filth of any type that is found in documents, yielding one Filth object at a time.