Under the hood

scrubadub consists of three separate components:

  • Filth objects identify the specific parts of a piece of dirty dirty text that contain sensitive information, and they decide how that information should be replaced in the cleaned text.
  • Detector objects are used to detect specific types of Filth.
  • The Scrubber is responsible for managing all of the Detector objects and resolving any conflicts that may arise between different Detector objects.

Filth

Filth objects are responsible for marking particular sections of text as containing a given type of filth. They are also responsible for knowing how that text should be cleaned. Every type of Filth inherits from scrubadub.filth.base.Filth.

class scrubadub.filth.base.Filth(beg=0, end=0, text=u'')

Bases: object

This is the base class for all Filth that is detected in dirty dirty text.

prefix = u'{{'
suffix = u'}}'
type = None
lookup = <scrubadub.utils.Lookup object>
placeholder
identifier
replace_with(replace_with='placeholder', **kwargs)
merge(other_filth)
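
To make the attribute list above more concrete, here is a minimal sketch of how a Filth-like object could turn its type into a replacement string. It is a simplified stand-in, not scrubadub's implementation: the class names SketchFilth and SketchEmailFilth are hypothetical, and the rule that the placeholder is the upper-cased type name is an assumption made for illustration.

```python
class SketchFilth(object):
    """Simplified stand-in for a Filth object (not scrubadub's own code)."""

    prefix = u'{{'
    suffix = u'}}'
    type = None

    def __init__(self, beg=0, end=0, text=u''):
        self.beg = beg    # index where the sensitive text starts
        self.end = end    # index just past the sensitive text
        self.text = text  # the sensitive text itself

    @property
    def placeholder(self):
        # Assumption: the placeholder is the upper-cased type name.
        return self.type.upper()

    def replace_with(self, replace_with='placeholder', **kwargs):
        # Only the 'placeholder' strategy is sketched here.
        if replace_with == 'placeholder':
            return self.prefix + self.placeholder + self.suffix
        raise ValueError('unsupported replace_with: %r' % (replace_with,))


class SketchEmailFilth(SketchFilth):
    type = 'email'


# In the text u'Contact jane@example.com' the address spans indices 8 to 24.
filth = SketchEmailFilth(beg=8, end=24, text=u'jane@example.com')
replacement = filth.replace_with()
```

In this sketch the subclass only sets type; the base class supplies the delimiters and the replacement logic.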

There is also a convenience class, RegexFilth, which makes it easy to quickly remove new types of filth that can be identified with regular expressions:

class scrubadub.filth.base.RegexFilth(match)

Bases: scrubadub.filth.base.Filth

Convenience class for instantiating a Filth object from a regular expression match

regex = None
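
As a rough illustration of building a Filth object from a regular expression match, the sketch below copies the span and text out of a standard re match object. The classes SketchRegexFilth and SketchZipFilth and the five-digit pattern are hypothetical and do not come from scrubadub.

```python
import re


class SketchRegexFilth(object):
    """Simplified stand-in: derive beg, end and text from a regex match."""

    type = None
    regex = None

    def __init__(self, match):
        self.match = match
        self.beg = match.start()
        self.end = match.end()
        self.text = match.group()


class SketchZipFilth(SketchRegexFilth):
    # Hypothetical filth type: any standalone run of five digits.
    type = 'zip'
    regex = re.compile(r'\b\d{5}\b')


match = SketchZipFilth.regex.search(u'Ship to 90210 today')
filth = SketchZipFilth(match)
```

Under this design a new regex-based filth type only needs a type name and a compiled pattern.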

Detectors

scrubadub consists of several Detectors, which are responsible for identifying and iterating over the Filth that can be found in a piece of text. Every type of Filth has a Detector that inherits from scrubadub.detectors.base.Detector:

class scrubadub.detectors.base.Detector

Bases: object

filth_cls = None
iter_filth(text)

For convenience, there is also a RegexDetector, which makes it easy to quickly add new types of Filth that can be identified from regular expressions:

class scrubadub.detectors.base.RegexDetector

Bases: scrubadub.detectors.base.Detector

iter_filth(text)
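
The pairing between a detector and its filth_cls can be sketched as follows. This is a simplified model under stated assumptions rather than the library's code: every Sketch* name and the hashtag pattern are hypothetical, and the detector is assumed to read the pattern from its filth class.

```python
import re


class SketchRegexFilth(object):
    """Simplified stand-in for a regex-based filth object."""

    type = None
    regex = None

    def __init__(self, match):
        self.beg = match.start()
        self.end = match.end()
        self.text = match.group()


class SketchHashtagFilth(SketchRegexFilth):
    type = 'hashtag'
    regex = re.compile(r'#\w+')  # hypothetical pattern


class SketchRegexDetector(object):
    """Yield one filth object per regex match in the text."""

    filth_cls = None

    def iter_filth(self, text):
        if self.filth_cls is None or self.filth_cls.regex is None:
            raise NotImplementedError('set filth_cls to a class with a regex')
        for match in self.filth_cls.regex.finditer(text):
            yield self.filth_cls(match)


class SketchHashtagDetector(SketchRegexDetector):
    filth_cls = SketchHashtagFilth


found = [f.text for f in SketchHashtagDetector().iter_filth(u'so #clean and #tidy')]
```

Because iter_filth is a generator, matches are produced lazily and in the order they appear in the text.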

Scrubber

All of the Detectors are managed by the Scrubber. The main job of the Scrubber is to handle situations in which the same section of text contains different types of Filth.

class scrubadub.scrubbers.Scrubber(*args, **kwargs)

Bases: object

The Scrubber class is used to clean personal information out of dirty dirty text. It manages a set of Detectors that are each responsible for identifying their particular kind of Filth.

add_detector(detector_cls)

Add a Detector to scrubadub

remove_detector(name)

Remove a Detector from scrubadub
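
One plausible way to model add_detector and remove_detector is a registry keyed by name. The sketch below assumes the name is the type of the detector's filth_cls and that registering a duplicate raises KeyError. Both are assumptions for illustration, and the Sketch* classes are hypothetical.

```python
class SketchEmailFilth(object):
    type = 'email'


class SketchEmailDetector(object):
    filth_cls = SketchEmailFilth


class SketchScrubber(object):
    """Simplified stand-in for a detector registry."""

    def __init__(self):
        self._detectors = {}

    def add_detector(self, detector_cls):
        # Assumption: detectors are keyed by the type of filth they find.
        name = detector_cls.filth_cls.type
        if name in self._detectors:
            raise KeyError('detector %r is already registered' % (name,))
        self._detectors[name] = detector_cls()

    def remove_detector(self, name):
        self._detectors.pop(name)


scrubber = SketchScrubber()
scrubber.add_detector(SketchEmailDetector)
registered = sorted(scrubber._detectors)
scrubber.remove_detector('email')
remaining = sorted(scrubber._detectors)
```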

clean(text, **kwargs)

This is the master method that cleans all of the filth out of the dirty dirty text. All keyword arguments to this function are passed through to the Filth.replace_with method to fine-tune how the Filth is cleaned.

iter_filth(text)

Iterate over the Filth that the managed Detector objects find in the text.
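
The conflict handling described for the Scrubber can be illustrated with a small self-contained sketch: collect spans from several patterns, merge the ones that overlap, and splice placeholders into the text. This is not scrubadub's algorithm. The two patterns, the function names, and the '+'-joined placeholder for merged spans are all assumptions made for the example.

```python
import re

# Hypothetical detectors: a type name paired with a pattern.
SKETCH_PATTERNS = [
    ('email', re.compile(r'[\w.]+@\w+(?:\.\w+)+')),
    ('domain', re.compile(r'\b\w+\.com\b')),
]


def sketch_iter_filth(text):
    """Yield (beg, end, types) spans in order, merging overlapping ones."""
    spans = []
    for name, regex in SKETCH_PATTERNS:
        for match in regex.finditer(text):
            spans.append((match.start(), match.end(), [name]))
    spans.sort(key=lambda span: (span[0], span[1]))

    merged = None
    for beg, end, types in spans:
        if merged is not None and beg < merged[1]:
            # Overlap: widen the current span and remember both types.
            merged = (merged[0], max(merged[1], end), merged[2] + types)
        else:
            if merged is not None:
                yield merged
            merged = (beg, end, types)
    if merged is not None:
        yield merged


def sketch_clean(text):
    """Replace every merged span with a {{TYPE}} placeholder."""
    pieces = []
    cursor = 0
    for beg, end, types in sketch_iter_filth(text):
        pieces.append(text[cursor:beg])
        pieces.append(u'{{' + u'+'.join(types).upper() + u'}}')
        cursor = end
    pieces.append(text[cursor:])
    return u''.join(pieces)


cleaned = sketch_clean(u'mail jane@example.com now')
```

Here the domain match lies inside the email match, so the two are merged into one span and replaced once, which avoids substituting the same characters twice.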