Under the hood¶
scrubadub consists of three separate components:
- Filth objects are used to identify specific parts of a piece of dirty dirty text that contain sensitive information, and they are responsible for deciding how that information should be replaced in the cleaned text.
- Detector objects are used to detect specific types of Filth.
- The Scrubber is responsible for managing all of the Detector objects and resolving any conflicts that may arise between different Detector objects.
Filth¶
Filth objects are responsible for marking particular sections of text as
containing a particular type of filth. They are also responsible for knowing
how that text should be cleaned. Every type of Filth inherits from
scrubadub.filth.base.Filth.
class scrubadub.filth.base.Filth(beg=0, end=0, text=u'')
Bases: object
This is the base class for all Filth that is detected in dirty dirty text.
- prefix = u'{{'
- suffix = u'}}'
- type = None
- lookup = <scrubadub.utils.Lookup object>
- placeholder
- identifier
There is also a convenience class, RegexFilth, which makes it easy to
quickly remove new types of filth that can be identified from regular
expressions:
class scrubadub.filth.base.RegexFilth(match)
Bases: scrubadub.filth.base.Filth
Convenience class for instantiating a Filth object from a regular expression match.
- regex = None
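To make the relationship between the two classes concrete, here is a minimal, self-contained sketch. It does not import scrubadub: the Filth and RegexFilth classes below are simplified stand-ins that only mirror the signatures and class attributes listed above. The behaviour of placeholder (upper-casing the type), the replacement() helper, and the ZipCodeFilth subclass are assumptions made for illustration, not the library's implementation.

```python
import re


class Filth:
    """Simplified stand-in for scrubadub.filth.base.Filth (not the library's code)."""

    prefix = "{{"
    suffix = "}}"
    type = None

    def __init__(self, beg=0, end=0, text=""):
        self.beg = beg    # index where the sensitive text starts
        self.end = end    # index just past the sensitive text
        self.text = text  # the sensitive text itself

    @property
    def placeholder(self):
        # Assumption: the placeholder is simply the upper-cased filth type.
        return self.type.upper()

    def replacement(self):
        # Hypothetical helper: wrap the placeholder in the prefix and suffix.
        return self.prefix + self.placeholder + self.suffix


class RegexFilth(Filth):
    """Stand-in convenience class: build a Filth from a regex match object."""

    regex = None

    def __init__(self, match):
        super().__init__(beg=match.start(), end=match.end(), text=match.group())


class ZipCodeFilth(RegexFilth):
    # Hypothetical filth type, used only for this example.
    type = "zipcode"
    regex = re.compile(r"\b\d{5}\b")


text = "Ship it to 90210 today."
match = ZipCodeFilth.regex.search(text)
filth = ZipCodeFilth(match)

# Splice the replacement into the original text at the marked section.
cleaned = text[:filth.beg] + filth.replacement() + text[filth.end:]
print(cleaned)  # Ship it to {{ZIPCODE}} today.
```

The point of the design is that a new kind of filth only needs a type name and a regular expression; the positions and text come from the match object.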
Detectors¶
scrubadub consists of several Detectors, which are responsible for
identifying and iterating over the Filth that can be found in a piece of
text. Every type of Filth has a Detector that inherits from
scrubadub.detectors.base.Detector.
For convenience, there is also a RegexDetector, which makes it easy to
quickly add new types of Filth that can be identified from regular
expressions.
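The idea of a detector that iterates over the filth in a piece of text can be sketched as follows. This is a self-contained illustration that does not import scrubadub: the RegexDetector stand-in, the iter_filth method name, the filth_type attribute, and the Span tuple are all assumptions chosen for the example, and the email pattern is deliberately loose.

```python
import re
from collections import namedtuple

# Stand-in for a Filth object: where the match is, what it says, and its type.
Span = namedtuple("Span", ["beg", "end", "text", "type"])


class RegexDetector:
    """Simplified stand-in for a regex-based detector (not scrubadub's code)."""

    filth_type = None  # hypothetical attribute naming the kind of filth
    regex = None

    def iter_filth(self, text):
        # Yield one span per regular expression match, from left to right.
        for match in self.regex.finditer(text):
            yield Span(match.start(), match.end(), match.group(), self.filth_type)


class EmailDetector(RegexDetector):
    filth_type = "email"
    # Loose pattern, good enough for illustration only.
    regex = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


found = list(EmailDetector().iter_filth("Mail a@example.com or b@example.org"))
for span in found:
    print(span.beg, span.end, span.text)
```

Because the detector is a generator, the caller can stop early or combine the output of several detectors before deciding how to clean the text.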
Scrubber¶
All of the Detectors are managed by the Scrubber. The main job of the
Scrubber is to handle situations in which the same section of text contains
different types of Filth.
class scrubadub.scrubbers.Scrubber(*args, **kwargs)
Bases: object
The Scrubber class is used to clean personal information out of dirty dirty text. It manages a set of Detectors that are each responsible for identifying their particular kind of Filth.
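One possible way to handle a section of text that is flagged as more than one type of filth is to merge the overlapping spans before replacing them. The sketch below is self-contained and does not import scrubadub; the merge_overlaps and clean functions, the patterns, and the "+"-joined placeholder are assumptions made for illustration, not a description of how the Scrubber resolves conflicts internally.

```python
import re


def merge_overlaps(spans):
    """Merge (beg, end, kind) spans that overlap, joining their kinds with '+'."""
    merged = []
    for beg, end, kind in sorted(spans):
        if merged and beg < merged[-1][1]:
            # This span starts inside the previous one: extend it and combine kinds.
            prev_beg, prev_end, prev_kind = merged[-1]
            merged[-1] = (prev_beg, max(prev_end, end), prev_kind + "+" + kind)
        else:
            merged.append((beg, end, kind))
    return merged


def clean(text, patterns):
    """Replace every detected span with a {{KIND}} placeholder."""
    # Collect spans from every pattern, as a set of detectors would.
    spans = []
    for kind, regex in patterns.items():
        for match in regex.finditer(text):
            spans.append((match.start(), match.end(), kind))

    # Rebuild the text, swapping each merged span for its placeholder.
    pieces = []
    cursor = 0
    for beg, end, kind in merge_overlaps(spans):
        pieces.append(text[cursor:beg])
        pieces.append("{{" + kind.upper() + "}}")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hypothetical patterns: the name pattern overlaps the start of the email.
patterns = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "name": re.compile(r"\bjohn\b"),
}
result = clean("Contact john@example.com now", patterns)
print(result)  # Contact {{NAME+EMAIL}} now
```

Merging before replacing keeps the offsets of each span valid, since replacements are spliced in a single left-to-right pass over the original text.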