scrubadub
There are several convenience functions to make using scrubadub quick and simple.
These functions either remove the Filth from the text (such as scrubadub.clean) or return a list of Filth objects that were found (such as scrubadub.list_filth).
They work either on a single document given as a string (such as scrubadub.clean) or on a set of documents given as a dictionary or a list (such as scrubadub.clean_documents).
scrubadub.clean
- scrubadub.clean(text: str, locale: Optional[str] = None, **kwargs) -> str
Searches for Filth in text in a string and replaces it with placeholders.
>>> import scrubadub
>>> scrubadub.clean(u"contact me at joe@example.com")
'contact me at {{EMAIL}}'
- Parameters
text (str) – The text containing possible PII that needs to be redacted
locale (str) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”
- Returns
Text with all Filth replaced.
- Return type
str
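The locale format described above can be checked before a call is made. The following is a minimal sketch using only the standard library; `is_valid_locale` is a hypothetical helper that is not part of scrubadub, and the library may apply its own, different validation.

```python
import re

# Two lower-case letters, an underscore, then two upper-case letters, e.g. "en_GB".
_LOCALE_PATTERN = re.compile(r"[a-z]{2}_[A-Z]{2}")


def is_valid_locale(locale: str) -> bool:
    """Return True if locale looks like 'en_GB' or 'de_CH' (hypothetical helper)."""
    return _LOCALE_PATTERN.fullmatch(locale) is not None


if __name__ == "__main__":
    # Print each candidate next to the result of the format check.
    for candidate in ("en_GB", "de_CH", "EN_gb", "english"):
        print(candidate, is_valid_locale(candidate))
```

A check like this only confirms the shape of the string; it does not confirm that the language or country code exists or that any detector supports it.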
scrubadub.clean_documents
- scrubadub.clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> Union[Sequence[str], Dict[Optional[str], str]]
Searches for Filth in documents and replaces it with placeholders.
documents can be in a dict, in the format of {'document_name': 'document'}, or as a list of strings (each a separate document). This can be useful when processing many documents.
>>> import scrubadub
>>> scrubadub.clean_documents({'contact.txt': "contact me at joe@example.com",
...                            'hello.txt': 'hello world!'})
{'contact.txt': 'contact me at {{EMAIL}}', 'hello.txt': 'hello world!'}
>>> scrubadub.clean_documents(["contact me at joe@example.com", 'hello world!'])
['contact me at {{EMAIL}}', 'hello world!']
- Parameters
documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be redacted, given as a list of documents or as a dictionary with the document name as the key and the document text as the value
locale (str) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”
- Returns
Documents in the same format as input, but with Filth redacted
- Return type
list of str objects, dict of str objects; same as input
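To illustrate the "same format as input" contract, here is a minimal sketch that is independent of scrubadub. The names `redact_emails` and `clean_documents_sketch` and the simplified email pattern are assumptions made for illustration; the library's own detectors are more thorough.

```python
import re
from typing import Dict, Optional, Sequence, Union

Documents = Union[Sequence[str], Dict[Optional[str], str]]

# Deliberately simple email pattern, for illustration only.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(text: str) -> str:
    """Stand-in for a single-document clean: swap emails for a placeholder."""
    return _EMAIL.sub("{{EMAIL}}", text)


def clean_documents_sketch(documents: Documents) -> Documents:
    """Clean every document and return the same container shape as the input."""
    if isinstance(documents, dict):
        # Keep the document names as keys, clean only the values.
        return {name: redact_emails(text) for name, text in documents.items()}
    # Any other sequence of strings comes back as a list in the same order.
    return [redact_emails(text) for text in documents]
```

Branching on the container type once, at the top, keeps the per-document cleaning step identical for both input forms.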
scrubadub.list_filth
- scrubadub.list_filth(text: str, locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]
Return a list of Filth that was detected in the string text.
>>> import scrubadub
>>> scrubadub.list_filth(u"contact me at joe@example.com")
[<EmailFilth text='joe@example.com' beg=14 end=29 detector_name='email' locale='en_US'>]
- Parameters
text (str) – The text containing possible PII that needs to be found
locale (str) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”
- Returns
A list of all the Filth objects that were found
- Return type
list of Filth objects
scrubadub.list_filth_documents
- scrubadub.list_filth_documents(documents: Union[List[str], Dict[Optional[str], str]], locale: Optional[str] = None, **kwargs) -> List[scrubadub.filth.base.Filth]
Return a list of Filth that was detected in documents.
documents can be in a dict, in the format of {'document_name': 'document'}, or as a list of strings (each a separate document). This can be useful when processing many documents.
>>> import scrubadub
>>> scrubadub.list_filth_documents(
...     {'contact.txt': "contact me at joe@example.com", 'hello.txt': 'hello world!'}
... )
[<EmailFilth text='joe@example.com' document_name='contact.txt' beg=14 end=29 detector_name='email' locale='en_US'>]
>>> scrubadub.list_filth_documents(["contact me at joe@example.com", 'hello world!'])
[<EmailFilth text='joe@example.com' document_name='0' beg=14 end=29 detector_name='email' locale='en_US'>]
- Parameters
documents (list of str objects, dict of str objects) – Documents containing possible PII that needs to be found, given as a list of documents or as a dictionary with the document name as the key and the document text as the value
locale (str) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”
- Returns
A list of all the Filth objects that were found
- Return type
list of Filth objects
scrubadub.Scrubber
All of the Detectors are managed by the Scrubber. The main job of the Scrubber is to handle situations in which the same section of text contains different types of Filth.
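One way to picture this job is as interval merging: when spans reported by different detectors overlap, they are combined into one span so that the text is only replaced once. The sketch below is a generic illustration using plain tuples; `Span` and `merge_overlapping` are hypothetical names, the `+`-joined label is an assumption, and scrubadub's own merging logic may differ in its details.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    beg: int   # index of the first character of the match
    end: int   # index one past the last character of the match
    kind: str  # label of what was found, e.g. "email" or "url"


def merge_overlapping(spans: List[Span]) -> List[Span]:
    """Merge spans that overlap, joining their labels with '+'."""
    merged: List[Span] = []
    for span in sorted(spans, key=lambda s: (s.beg, s.end)):
        if merged and span.beg < merged[-1].end:
            # Overlaps the previous span: extend it and combine the labels.
            last = merged[-1]
            merged[-1] = Span(last.beg, max(last.end, span.end), f"{last.kind}+{span.kind}")
        else:
            # No overlap: start a new span.
            merged.append(span)
    return merged
```

Spans that merely touch (one ends where the next begins) are left separate here, since no character belongs to both.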
- class scrubadub.scrubbers.Scrubber(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)
Bases: object
The Scrubber class is used to clean personal information out of dirty dirty text. It manages a set of Detectors that are each responsible for identifying Filth. PostProcessor objects are used to alter the found Filth. This could be to replace the Filth with a hash or token.
- __init__(detector_list: Optional[Sequence[Union[Type[scrubadub.detectors.base.Detector], scrubadub.detectors.base.Detector, str]]] = None, post_processor_list: Optional[Sequence[Union[Type[scrubadub.post_processors.base.PostProcessor], scrubadub.post_processors.base.PostProcessor, str]]] = None, locale: Optional[str] = None)
Create a Scrubber object.
- Parameters
detector_list (Optional[Sequence[Union[Type[Detector], Detector, str]]]) – The list of detectors to use in this scrubber.
post_processor_list (Optional[Sequence[Union[Type[PostProcessor], PostProcessor, str]]]) – The list of post-processors to use in this scrubber.
locale (str, optional) – The locale of the documents in the format: two-letter lower-case language code followed by an underscore and the two-letter upper-case country code, e.g. “en_GB” or “de_CH”.
- add_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str], warn: bool = True)
Add a Detector to a Scrubber.
You can add a detector to a Scrubber by passing one of three objects to this function:
  - the uninitialised class, which initialises the class with default settings.
  - an instance of a Detector class, where you can initialise it with the settings desired.
  - a string containing the name of the detector, which again initialises the class with default settings.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=False)
>>> scrubber.add_detector(detector)
- Parameters
detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to add to this scrubber.
warn (bool, default True) – raise a warning if the locale is not supported by the detector.
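The three accepted forms can be reduced to a single instance with a small dispatch. The following is a hedged sketch of that idea, not scrubadub's implementation: `ToyDetector`, `ToyEmailDetector`, `REGISTRY`, and `resolve_detector` are hypothetical names invented for this example.

```python
from typing import Dict, Type, Union


class ToyDetector:
    """Hypothetical stand-in for a detector base class."""
    name = "toy"


class ToyEmailDetector(ToyDetector):
    name = "email"

    def __init__(self, strict: bool = False) -> None:
        self.strict = strict


# Hypothetical lookup table from detector name to detector class.
REGISTRY: Dict[str, Type[ToyDetector]] = {ToyEmailDetector.name: ToyEmailDetector}


def resolve_detector(detector: Union[ToyDetector, Type[ToyDetector], str]) -> ToyDetector:
    """Turn a class, an instance, or a name into a detector instance."""
    if isinstance(detector, str):
        # A name: look the class up, then initialise it with default settings.
        return REGISTRY[detector]()
    if isinstance(detector, type) and issubclass(detector, ToyDetector):
        # An uninitialised class: initialise it with default settings.
        return detector()
    if isinstance(detector, ToyDetector):
        # An instance: use it as given, keeping its custom settings.
        return detector
    raise TypeError(f"Cannot resolve a detector from {detector!r}")
```

Only the instance form carries custom settings through; the class and name forms always produce a detector with defaults, which matches the description of the three options above.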
- remove_detector(detector: Union[scrubadub.detectors.base.Detector, Type[scrubadub.detectors.base.Detector], str])
Remove a Detector from a Scrubber.
You can remove a detector from a Scrubber by passing one of three objects to this function:
  - the uninitialised class, which removes the initialised detector of the same name.
  - an instance of a Detector class, which removes the initialised detector of the same name.
  - a string containing the name of the detector, which removes the detector of that name.
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.remove_detector('url')
>>> detector = scrubadub.detectors.email.EmailDetector()
>>> scrubber.remove_detector(detector)
- Parameters
detector (a Detector class, a Detector instance, or a string with the detector's name) – The Detector to remove from this scrubber.
- add_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str], index: Optional[int] = None)
Add a PostProcessor to a Scrubber.
You can add a post-processor to a Scrubber by passing one of three objects to this function:
  - the uninitialised class, which initialises the class with default settings.
  - an instance of a PostProcessor class, where you can initialise it with the settings desired.
  - a string containing the name of the post-processor, which again initialises the class with default settings.
>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_post_processor('filth_replacer')
>>> scrubber.add_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
- Parameters
post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to add to this scrubber.
- remove_post_processor(post_processor: Union[scrubadub.post_processors.base.PostProcessor, Type[scrubadub.post_processors.base.PostProcessor], str])
Remove a PostProcessor from a Scrubber.
You can remove a post-processor from a Scrubber by passing one of three objects to this function:
  - the uninitialised class, which removes the initialised post-processor of the same name.
  - an instance of a PostProcessor class, which removes the initialised post-processor of the same name.
  - a string containing the name of the post-processor, which removes the post-processor of that name.
>>> import scrubadub, scrubadub.post_processors
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.remove_post_processor('filth_type_replacer')
>>> scrubber.remove_post_processor(scrubadub.post_processors.PrefixSuffixReplacer)
- Parameters
post_processor (a PostProcessor class, a PostProcessor instance, or a string with the post-processor's name) – The PostProcessor to remove from this scrubber.
- clean(text: str, **kwargs) -> str
This is the master method that cleans all of the filth out of the dirty dirty text. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.
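Conceptually, cleaning comes down to replacing each detected span with its placeholder. A minimal sketch of that step follows, working from the kind of `beg` and `end` offsets shown in the `Filth` repr output earlier on this page. `FoundSpan` and `replace_spans` are hypothetical names, and the real method also runs detectors and post-processors, which are omitted here.

```python
from typing import List, NamedTuple


class FoundSpan(NamedTuple):
    beg: int          # start offset of the match in the text
    end: int          # end offset (exclusive) of the match in the text
    placeholder: str  # label to substitute, e.g. "EMAIL"


def replace_spans(text: str, spans: List[FoundSpan]) -> str:
    """Replace each non-overlapping span with '{{PLACEHOLDER}}'."""
    pieces = []
    cursor = 0
    for span in sorted(spans):
        pieces.append(text[cursor:span.beg])           # clean text before the match
        pieces.append("{{" + span.placeholder + "}}")  # the placeholder itself
        cursor = span.end
    pieces.append(text[cursor:])                       # clean text after the last match
    return "".join(pieces)
```

Building the result from slices of the original text means the offsets never need adjusting as placeholders of different lengths are substituted. The sketch assumes the spans do not overlap.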
- clean_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], **kwargs) -> Union[Dict[Optional[str], str], Sequence[str]]
This is the master method that cleans all of the filth out of the dirty dirty documents. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.
- iter_filth(text: str, document_name: Optional[str] = None, run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]
Iterate over the Filth detected in text, yielding one Filth object at a time.
- iter_filth_documents(documents: Union[Sequence[str], Dict[Optional[str], str]], run_post_processors: bool = True) -> Generator[scrubadub.filth.base.Filth, None, None]
Iterate over the Filth detected in documents, yielding one Filth object at a time.