scrubadub.detectors

scrubadub consists of several Detectors, which are responsible for identifying and iterating over the Filth that can be found in a piece of text.

Base classes

Every Detector inherits from scrubadub.detectors.Detector.

scrubadub.detectors.Detector

class scrubadub.detectors.Detector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: object

This is the base class for all detectors.

A simple example of how to make a new detector is given below:

>>> import scrubadub
>>> class MyFilth(scrubadub.filth.Filth):
...     type = 'mine'
>>> class MyDetector(scrubadub.detectors.Detector):
...     name = 'my_fr_detector'
...     def iter_filth(self, text, document_name=None):
...         # This detector always returns this same Filth no matter the input.
...         # You should implement something better here.
...         yield MyFilth(beg=0, end=8, text='My stuff', document_name=document_name, detector_name=self.name)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(MyDetector)
>>> text = "My stuff can be found there."
>>> scrubber.clean(text)
'{{MINE}} can be found there.'

You can also advertise a Detector as supporting a certain locale by defining the `Detector.supported_locale()` function.

filth_cls

alias of scrubadub.filth.base.Filth

autoload: bool = False
__init__(name: Optional[str] = None, locale: str = 'en_US')[source]

Initialise the Detector.

Parameters
  • name (str, optional) – Overrides the default name of the Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

name: str = 'detector'
static locale_transform(locale: str) → str

Normalise the locale string, e.g. ‘fr’ -> ‘fr_FR’.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

The normalised locale string

Return type

str
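As an illustration of the normalisation described above, a minimal sketch is given below. The default-region mapping here is a hypothetical example, not scrubadub's actual table:

```python
# Illustrative sketch of a locale normaliser: a bare language code such as
# "fr" is expanded to a full "language_REGION" locale like "fr_FR".
# DEFAULT_REGIONS is a hypothetical mapping for illustration only.
DEFAULT_REGIONS = {"en": "US", "fr": "FR", "de": "DE"}

def locale_transform(locale: str) -> str:
    """Normalise a locale string, e.g. 'fr' -> 'fr_FR'."""
    locale = locale.strip().replace("-", "_")
    if "_" in locale:
        language, _, region = locale.partition("_")
        return f"{language.lower()}_{region.upper()}"
    language = locale.lower()
    region = DEFAULT_REGIONS.get(language, language.upper())
    return f"{language}_{region}"
```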

static locale_split(locale: str) → Tuple[Optional[str], Optional[str]]

Split the locale string into the language and region.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

The two-letter language code and the two-letter region code in a tuple.

Return type

tuple, (str, str)
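The split can be sketched roughly as follows (scrubadub's actual handling of edge cases such as empty strings may differ):

```python
from typing import Optional, Tuple

def locale_split(locale: str) -> Tuple[Optional[str], Optional[str]]:
    """Split 'en_GB' into ('en', 'GB'); a bare language code yields
    (language, None). A rough sketch, not scrubadub's implementation."""
    if not locale:
        return (None, None)
    language, _, region = locale.partition("_")
    return (language.lower(), region.upper() or None)
```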

iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in a list of documents.

Parameters
  • document_list (List[str]) – A list of documents to clean.

  • document_names (List[str]) – A list containing the name of each document.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

scrubadub.detectors.RegexDetector

For convenience, there is also a RegexDetector, which makes it easy to quickly add new types of Filth that can be identified from regular expressions:

class scrubadub.detectors.RegexDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.Detector

Base class to match PII with a regex.

This class requires that the filth_cls attribute be set to the class of the Filth that should be returned by this Detector.

>>> import re, scrubadub
>>> class NewUrlDetector(scrubadub.detectors.RegexDetector):
...     name = 'new_url_detector'
...     filth_cls = scrubadub.filth.url.UrlFilth
...     regex = re.compile(r'https.*$', re.IGNORECASE)
>>> scrubber = scrubadub.Scrubber(detector_list=[NewUrlDetector()])
>>> text = u"This url will be found https://example.com"
>>> scrubber.clean(text)
'This url will be found {{URL}}'
regex: Optional[Pattern[str]] = None
filth_cls

alias of scrubadub.filth.base.Filth

iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]
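Conceptually, the matching loop inside a RegexDetector looks something like the sketch below, which yields one record per regex match. This is a simplified illustration using a plain dict in place of scrubadub's Filth class:

```python
import re
from typing import Iterator

def iter_regex_filth(regex: re.Pattern, text: str,
                     filth_type: str = "filth") -> Iterator[dict]:
    """Simplified sketch of RegexDetector.iter_filth: yield one record
    per regex match, carrying the offsets and the matched text."""
    for match in regex.finditer(text):
        yield {
            "type": filth_type,
            "beg": match.start(),
            "end": match.end(),
            "text": match.group(),
        }
```

In the real detector, each match is wrapped in an instance of filth_cls rather than a dict, which is why that attribute must be set on the subclass.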

scrubadub.detectors.RegionLocalisedRegexDetector

class scrubadub.detectors.RegionLocalisedRegexDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegexDetector

Detector to detect Filth using regular expressions that are localised by region.

region_regex: Dict[str, Pattern] = {}
__init__(**kwargs)[source]

Initialise the Detector.

Parameters
  • name (str, optional) – Overrides the default name of the Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

classmethod supported_locale(locale: str) → bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool
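The core idea of a region-localised detector can be sketched as below: subclasses populate region_regex with one pattern per two-letter region code, and at initialisation the detector selects the pattern matching the region part of its locale. The patterns here are hypothetical placeholders, not scrubadub's real expressions:

```python
import re
from typing import Dict, Optional, Pattern

class RegionPatternPicker:
    """Sketch of RegionLocalisedRegexDetector's core idea: hold one regex
    per region code and select the right one from the locale's region."""
    region_regex: Dict[str, Pattern] = {
        # Hypothetical patterns for illustration only.
        "GB": re.compile(r"\b[A-Z]{2}\d{6}[A-Z]\b"),
        "US": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def __init__(self, locale: str = "en_US"):
        _, _, region = locale.partition("_")
        self.regex: Optional[Pattern] = self.region_regex.get(region.upper())

    def supported_locale(self, locale: str) -> bool:
        _, _, region = locale.partition("_")
        return region.upper() in self.region_regex
```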

Detectors enabled by default

These are the detectors that are enabled in the scrubber by default.

scrubadub.detectors.CredentialDetector

class scrubadub.detectors.CredentialDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Remove username/password combinations from dirty dirty text.

regex: Optional[Pattern[str]]

Compiled regular expression object.

filth_cls

alias of scrubadub.filth.credential.CredentialFilth

name: str = 'credential'
autoload: bool = True

scrubadub.detectors.CreditCardDetector

class scrubadub.detectors.CreditCardDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Remove credit-card numbers from dirty dirty text.

Supports Visa, MasterCard, American Express, Diners Club and JCB.

regex: Optional[Pattern[str]]

Compiled regular expression object.

name: str = 'credit_card'
filth_cls

alias of scrubadub.filth.credit_card.CreditCardFilth

autoload: bool = True

scrubadub.detectors.DriversLicenceDetector

class scrubadub.detectors.DriversLicenceDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Use regular expressions to detect UK driving licence numbers. Simple pattern matching only; no checksum validation.

region_regex: Dict[str, Pattern]
name: str = 'drivers_licence'
autoload: bool = True
filth_cls

alias of scrubadub.filth.drivers_licence.DriversLicenceFilth

scrubadub.detectors.EmailDetector

class scrubadub.detectors.EmailDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Use regular expression magic to remove email addresses from dirty dirty text. This method also catches email addresses like john at gmail.com.

regex: Optional[Pattern[str]]

Compiled regular expression object.

filth_cls

alias of scrubadub.filth.email.EmailFilth

name: str = 'email'
autoload: bool = True
at_matcher = re.compile('@|\\sat\\s', re.IGNORECASE)
dot_matcher = re.compile('\\.|\\sdot\\s', re.IGNORECASE)
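The at_matcher and dot_matcher patterns above let the detector handle spelled-out addresses such as "john at gmail dot com". A rough sketch of how such a candidate could be collapsed back to a canonical form (the clean-up function here is illustrative, not scrubadub's internal method):

```python
import re

# The same patterns shown above for EmailDetector.
at_matcher = re.compile(r"@|\sat\s", re.IGNORECASE)
dot_matcher = re.compile(r"\.|\sdot\s", re.IGNORECASE)

def normalise_email(candidate: str) -> str:
    """Collapse an obfuscated address like 'john at gmail dot com'
    into 'john@gmail.com' (a sketch of the detector's clean-up idea)."""
    candidate = at_matcher.sub("@", candidate)
    candidate = dot_matcher.sub(".", candidate)
    return candidate
```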
iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

scrubadub.detectors.en_GB.NationalInsuranceNumberDetector

class scrubadub.detectors.en_GB.NationalInsuranceNumberDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Use regular expressions to remove the GB National Insurance number (NINO). Simple pattern matching only; no checksum validation.

region_regex: Dict[str, Pattern]
name: str = 'national_insurance_number'
autoload: bool = True
filth_cls

alias of scrubadub.filth.en_GB.national_insurance_number.NationalInsuranceNumberFilth

scrubadub.detectors.PhoneDetector

class scrubadub.detectors.PhoneDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.Detector

Remove phone numbers from dirty dirty text using python-phonenumbers, a port of a Google project to correctly format phone numbers in text.

Set the locale on the scrubber or detector to set the region used to search for valid phone numbers. If the locale is set to ‘en_CA’ Canadian numbers will be searched for, while setting the locale to ‘en_GB’ searches for British numbers.

filth_cls

alias of scrubadub.filth.phone.PhoneFilth

name: str = 'phone'
autoload: bool = True
iter_filth(text, document_name: Optional[str] = None)[source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

classmethod supported_locale(locale: str) → bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool

scrubadub.detectors.PostalCodeDetector

class scrubadub.detectors.PostalCodeDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Detects postal codes; currently only British post codes are supported.

region_regex: Dict[str, Pattern]
filth_cls

alias of scrubadub.filth.postalcode.PostalCodeFilth

name: str = 'postalcode'
autoload: bool = True

scrubadub.detectors.en_US.SocialSecurityNumberDetector

class scrubadub.detectors.en_US.SocialSecurityNumberDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Use regular expressions to detect a social security number (SSN) in dirty dirty text.

region_regex: Dict[str, Pattern]
filth_cls

alias of scrubadub.filth.en_US.social_security_number.SocialSecurityNumberFilth

name: str = 'social_security_number'
autoload: bool = True

scrubadub.detectors.en_GB.TaxReferenceNumberDetector

class scrubadub.detectors.en_GB.TaxReferenceNumberDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Use regular expressions to detect the UK PAYE temporary reference number (TRN). Simple pattern matching only; no checksum validation.

region_regex: Dict[str, Pattern]
name: str = 'tax_reference_number'
autoload: bool = True
filth_cls

alias of scrubadub.filth.en_GB.tax_reference_number.TaxReferenceNumberFilth

scrubadub.detectors.TwitterDetector

class scrubadub.detectors.TwitterDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Use regular expression magic to remove twitter usernames from dirty dirty text.

regex: Optional[Pattern[str]]

Compiled regular expression object.

filth_cls

alias of scrubadub.filth.twitter.TwitterFilth

name: str = 'twitter'
autoload: bool = True

scrubadub.detectors.UrlDetector

class scrubadub.detectors.UrlDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Use regular expressions to remove URLs that begin with http://, https:// or www. from dirty dirty text.

With keep_domain=True, this detector only obfuscates the path on a URL, not its domain. For example, http://twitter.com/someone/status/234978haoin becomes http://twitter.com/{{replacement}}.
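The keep_domain behaviour can be sketched with a small regex split: keep the scheme and host, replace everything from the first path segment onwards. The pattern and function below are illustrative assumptions, not scrubadub's actual URL regex:

```python
import re

# Hypothetical sketch of keep_domain: capture the scheme and host in one
# group and the optional path in another, then obfuscate only the path.
URL_PARTS = re.compile(r"^(https?://[^/\s]+)(/\S*)?$", re.IGNORECASE)

def obfuscate_path(url: str, replacement: str = "{{replacement}}") -> str:
    match = URL_PARTS.match(url)
    if match is None:
        return url
    domain, path = match.group(1), match.group(2)
    if not path:
        return domain
    return f"{domain}/{replacement}"
```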

regex: Optional[Pattern[str]]

Compiled regular expression object.

filth_cls

alias of scrubadub.filth.url.UrlFilth

name: str = 'url'
autoload: bool = True

scrubadub.detectors.VehicleLicencePlateDetector

class scrubadub.detectors.VehicleLicencePlateDetector(**kwargs)[source]

Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector

Detects standard British licence plates.

region_regex: Dict[str, Pattern]
filth_cls

alias of scrubadub.filth.vehicle_licence_plate.VehicleLicencePlateFilth

name: str = 'vehicle_licence_plate'
autoload: bool = True

Optional detectors

These detectors are not loaded automatically; they need to be added to a Scrubber manually. The example below demonstrates the various ways that a detector can be added to a Scrubber:

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[scrubadub.detectors.TextBlobNameDetector()])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=True)
>>> scrubber.add_detector(detector)

For further information see the usage page.

scrubadub.detectors.DateOfBirthDetector

class scrubadub.detectors.DateOfBirthDetector(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]

Bases: scrubadub.detectors.base.Detector

This detector aims to detect dates of birth in text.

First all possible dates are found, then they are filtered to those that would result in people being between DateOfBirthFilth.min_age_years and DateOfBirthFilth.max_age_years, which default to 18 and 100 respectively.

If require_context is True, we search for one of the possible context_words near the found date. We search up to context_before lines before the date and up to context_after lines after the date. The context that we search for are terms like ‘birth’ or ‘DoB’ to increase the likelihood that the date is indeed a date of birth. The context words can be set using the context_words parameter, which expects a list of strings.
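The context check described above can be sketched as follows: look a few lines before and after the line containing the candidate date for any of the context words. This is an illustrative re-implementation, not the detector's actual code:

```python
from typing import List

def has_context(lines: List[str], date_line: int, context_words: List[str],
                context_before: int = 2, context_after: int = 1) -> bool:
    """Sketch of the context test: is any context word found within the
    window of lines around the line holding the candidate date?"""
    start = max(0, date_line - context_before)
    end = min(len(lines), date_line + context_after + 1)
    window = " ".join(lines[start:end]).lower()
    return any(word in window for word in context_words)
```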

>>> import scrubadub, scrubadub.detectors.date_of_birth
>>> scrubadub.filth.DateOfBirthFilth.min_age_years = 12
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.date_of_birth.DateOfBirthDetector(),
... ])
>>> scrubber.clean("I was born on 10-Nov-2008.")
'I was born {{DATE_OF_BIRTH}}.'
name: str = 'date_of_birth'
filth_cls

alias of scrubadub.filth.date_of_birth.DateOfBirthFilth

autoload: bool = False
context_words_language_map = {'de': ['geburt', 'geboren', 'geb', 'geb.'], 'en': ['birth', 'born', 'dob', 'd.o.b.']}
__init__(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]

Initialise the detector.

Parameters
  • context_before (int) – The number of lines of context to search before the date

  • context_after (int) – The number of lines of context to search after the date

  • require_context (bool) – Set to False if your dates of birth are not near words that provide context (such as “birth” or “DOB”).

  • context_words (bool) – A list of words that provide context related to dates of birth, such as the following: ‘birth’, ‘born’, ‘dob’ or ‘d.o.b.’.

  • name (str, optional) – Overrides the default name of the :class:Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Search text for Filth and return a generator of Filth objects.

Parameters
  • text (str) – The dirty text that this Detector should search

  • document_name (Optional[str]) – Name of the document this is being passed to this detector

Returns

The found Filth in the text

Return type

Generator[Filth]

classmethod supported_locale(locale: str) → bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code eg “en”, “es”.

Returns

True if the locale is supported, otherwise False

Return type

bool

scrubadub.detectors.SkypeDetector

class scrubadub.detectors.SkypeDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Skype usernames tend to be used inline in dirty dirty text quite often but also appear as skype: {{SKYPE}} quite a bit. This method looks at words within word_radius words of “skype” for things that appear to be misspelled or have punctuation in them as a means to identify skype usernames.

Default word_radius is 10, corresponding with the rough scale of half of a sentence before or after the word “skype” is used. Increasing the word_radius will increase the false positive rate and decreasing the word_radius will increase the false negative rate.
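The search described above can be sketched as scanning the words within word_radius of the word "skype" and keeping those shaped like a username. The "looks misspelled or has punctuation" test is approximated here with a digit-or-punctuation check; that approximation, and the helper names, are assumptions for illustration:

```python
import re

# The username pattern shown above for SkypeDetector.
SKYPE_USERNAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_\-\,\.]{5,31}")

def looks_like_username(token: str) -> bool:
    # The real detector looks for misspelled words or punctuation; this
    # sketch approximates that with "contains a digit or punctuation".
    return (SKYPE_USERNAME.fullmatch(token) is not None
            and any(ch.isdigit() or ch in "._-," for ch in token))

def candidate_skype_usernames(text: str, word_radius: int = 10) -> list:
    """Scan words within word_radius of 'skype' for username-like tokens."""
    tokens = text.split()
    found = []
    for i, token in enumerate(tokens):
        if "skype" not in token.lower():
            continue
        lo, hi = max(0, i - word_radius), min(len(tokens), i + word_radius + 1)
        for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
            if looks_like_username(neighbour):
                found.append(neighbour)
    return found
```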

filth_cls

alias of scrubadub.filth.skype.SkypeFilth

name: str = 'skype'
autoload: bool = False
word_radius = 10
SKYPE_TOKEN = '[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]+'
SKYPE_USERNAME = re.compile('[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]{5,31}')
iter_filth(text, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

scrubadub.detectors.TaggedEvaluationFilthDetector

class scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]

Bases: scrubadub.detectors.base.Detector

Use this Detector to mark tagged filth as true Filth. This is useful when you want to evaluate the effectiveness of a Detector using Filth that has been selected by a human.

Results from this detector are used as the “truth” against which the other detectors are compared. This is done in scrubadub.comparison.get_filth_classification_report, where the detection accuracies are calculated.

An example of how to use this detector is given below:

>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector         locale      precision    recall  f1-score   support

name     name_detector    en_US            1.00      1.00      1.00         1

                            accuracy                           1.00         1
                           macro avg       1.00      1.00      1.00         1
                        weighted avg       1.00      1.00      1.00         1

This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:

  • match (str) - a string value that will be searched for in the text

  • filth_type (str) - a string value that indicates the type of Filth, should be set to Filth.name. An example of these could be ‘name’ or ‘phone’ for name and phone filths respectively.

The known filth item dictionary may also optionally contain:

  • match_end (str) - if specified will search for Filth starting with the value of match and ending with the value of match_end

  • limit (int) - an integer describing the maximum number of characters between match and match_end, defaults to 150

  • ignore_case (bool) - Ignore case when searching for the tagged filth

  • ignore_whitespace (bool) - Ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)

  • ignore_partial_word_matches (bool) - Ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)

Examples of this:

  • {'match': 'aaa', 'filth_type': 'name'} - will search for an exact match to aaa and return it as a NameFilth

  • {'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} - will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz.

  • {'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} - will search for an exact match to 012345, ignoring any partial matches and return it as a PhoneFilth
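The match/match_end behaviour described in the examples above can be sketched as compiling each known filth item into a regex: the escaped match string, optionally followed by a bounded gap and the escaped match_end string. The builder function below is a hypothetical illustration, not scrubadub's implementation:

```python
import re
from typing import Optional

def build_known_filth_regex(match: str, match_end: Optional[str] = None,
                            limit: int = 150,
                            ignore_case: bool = False) -> re.Pattern:
    """Sketch of compiling a known filth item into a regex: an exact
    match, or match ... match_end with at most `limit` characters
    in between."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = re.escape(match)
    if match_end is not None:
        pattern += f".{{0,{limit}}}" + re.escape(match_end)
    return re.compile(pattern, flags)
```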

This detector is not enabled by default (since you need to supply a list of known filths) and so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.

filth_cls

alias of scrubadub.filth.tagged.TaggedEvaluationFilth

name: str = 'tagged'
autoload: bool = False
__init__(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]

Initialise the Detector.

Parameters
  • known_filth_items (list of dicts) – A list of dictionaries that describe items to be searched for in the dirty text. The keys match and filth_type are required, which give the text to be searched for and the type of filth that the match string represents. See the class docstring for further details of available flags in this dictionary.

  • tagged_filth (bool, default True) – Whether the filth has been tagged and should be used as truth when calculating filth finding accuracies.

  • name (str, optional) – Overrides the default name of the Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

static dedup_dicts(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem]) → List[scrubadub.detectors.tagged.KnownFilthItem][source]
create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) → scrubadub.filth.base.Filth[source]
iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

scrubadub.detectors.TextBlobNameDetector

class scrubadub.detectors.TextBlobNameDetector(name: Optional[str] = None, locale: str = 'en_US')[source]

Bases: scrubadub.detectors.base.RegexDetector

Use part of speech tagging from textblob to clean proper nouns out of the dirty dirty text. Disallow particular nouns by adding them to the NameDetector.disallowed_nouns set.

filth_cls

alias of scrubadub.filth.name.NameFilth

name: str = 'text_blob_name'
autoload: bool = False
disallowed_nouns = {'skype'}
iter_filth(text, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

classmethod supported_locale(locale: str) → bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool

scrubadub.detectors.UserSuppliedFilthDetector

class scrubadub.detectors.UserSuppliedFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]

Bases: scrubadub.detectors.tagged.TaggedEvaluationFilthDetector

Use this Detector to find some known filth in the text. An example might be if you have a list of employee numbers that you wish to remove from a document, as shown below:

>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.UserSuppliedFilthDetector([
...         {'match': 'Anika', 'filth_type': 'name'},
...         {'match': 'Larry', 'filth_type': 'name'},
...     ]),
... ])
>>> scrubber.clean("Anika is my favourite employee.")
'{{NAME}} is my favourite employee.'

This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:

  • match (str) - a string value that will be searched for in the text

  • filth_type (str) - a string value that indicates the type of Filth, should be set to Filth.name. An example of these could be ‘name’ or ‘phone’ for name and phone filths respectively.

The known filth item dictionary may also optionally contain:

  • match_end (str) - if specified will search for Filth starting with the value of match and ending with the value of match_end

  • limit (int) - an integer describing the maximum number of characters between match and match_end, defaults to 150

  • ignore_case (bool) - Ignore case when searching for the tagged filth

  • ignore_whitespace (bool) - Ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)

  • ignore_partial_word_matches (bool) - Ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)

Examples of this:

  • {'match': 'aaa', 'filth_type': 'name'} - will search for an exact match to aaa and return it as a NameFilth

  • {'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} - will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz.

  • {'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} - will search for an exact match to 012345, ignoring any partial matches and return it as a PhoneFilth

This detector is not enabled by default (since you need to supply a list of known filths) and so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.

name: str = 'user_supplied'
create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) → scrubadub.filth.base.Filth[source]

External detectors

These are detectors that are not included in the scrubadub package, usually because they come with large external dependencies that are not always needed. To use them you should first import their package and then add them to the Scrubber; an example is shown below:

>>> import scrubadub, scrubadub_address
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_address.detectors.AddressDetector)

scrubadub_address.detectors.AddressDetector

scrubadub_spacy.detectors.SpacyEntityDetector

class scrubadub_spacy.detectors.SpacyEntityDetector(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]

Bases: scrubadub.detectors.base.Detector

Use spaCy’s named entity recognition to identify possible Filth.

This detector is made to work with v3 of spaCy, since the NER model has been significantly improved in this version.

This is particularly useful to remove names from text, but can also be used to remove any entity that is recognised by spaCy. A full list of entities that spacy supports can be found here: https://spacy.io/api/annotation#named-entities.

Additional entities can be added like so:

>>> import scrubadub, scrubadub_spacy
>>> class MoneyFilth(scrubadub.filth.Filth):
...     type = 'money'
>>> scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map['MONEY'] = MoneyFilth
>>> detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector(named_entities=['MONEY'])
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean("You owe me 12 dollars man!")
'You owe me {{MONEY}} man!'

The dictionary scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map is used to map between the spaCy named entity label and the type of scrubadub Filth, while the named_entities argument sets which named entities are considered Filth by the SpacyEntityDetector.

filth_cls_map = {'DATE': <class 'scrubadub.filth.date_of_birth.DateOfBirthFilth'>, 'FAC': <class 'scrubadub.filth.location.LocationFilth'>, 'GPE': <class 'scrubadub.filth.location.LocationFilth'>, 'LOC': <class 'scrubadub.filth.location.LocationFilth'>, 'ORG': <class 'scrubadub.filth.organization.OrganizationFilth'>, 'PER': <class 'scrubadub.filth.name.NameFilth'>, 'PERSON': <class 'scrubadub.filth.name.NameFilth'>}
name: str = 'spacy'
language_to_model = {'de': 'de_dep_news_trf', 'en': 'en_core_web_trf', 'es': 'es_dep_news_trf', 'fr': 'fr_dep_news_trf', 'nl': 'nl_core_news_trf', 'zh': 'zh_core_web_trf'}
disallowed_nouns = {'skype'}
__init__(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]

Initialise the Detector.

Parameters
  • named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}

  • model (str, optional) – The name of the spacy model to use, it must contain a ‘ner’ step in the model pipeline (most do, but not all).

  • name (str, optional) – Overrides the default name of the Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

static check_spacy_version() → bool[source]

Ensure that the version of spaCy is v3.

static check_spacy_model(model) → bool[source]

Ensure that the spaCy model is installed.

iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in a list of documents.

Parameters
  • document_list (List[str]) – A list of documents to clean.

  • document_names (List[str]) – A list containing the name of each document.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

iter_filth(text: str, document_name: Optional[str] = None) → Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

classmethod supported_locale(locale: str) → bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool

scrubadub_spacy.detectors.SpacyNameDetector

class scrubadub_spacy.detectors.SpacyNameDetector(include_spacy: bool = True, **kwargs)[source]

Bases: scrubadub_spacy.detectors.spacy.SpacyEntityDetector

Extends the spaCy detector to look for tokens that often occur before or after people’s names. A prefix might be “Hello” as in “Hello Jane” or “Mrs” as in “Mrs Jane Smith”, and a suffix could be “PhD” as in “Jane Smith PhD”.
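The prefix/suffix idea can be sketched without spaCy at all: a token adjacent to a known title is flagged as a name candidate. The prefix and suffix sets below are small illustrative subsets of the full NAME_PREFIXES/NAME_SUFFIXES lists documented further down:

```python
# Small illustrative subsets of the detector's English title lists.
NAME_PREFIXES = {"mr", "mr.", "mrs", "mrs.", "dr", "dr.", "hello", "dear"}
NAME_SUFFIXES = {"phd", "md", "qc", "mba"}

def name_candidates(text: str) -> list:
    """Sketch of the prefix/suffix heuristic: a word next to a title
    such as 'Mrs' or 'PhD' is flagged as a possible name."""
    tokens = text.split()
    candidates = []
    for i, token in enumerate(tokens):
        if token.lower() in NAME_PREFIXES and i + 1 < len(tokens):
            candidates.append(tokens[i + 1])
        if token.lower() in NAME_SUFFIXES and i > 0:
            candidates.append(tokens[i - 1])
    return candidates
```

The real detector additionally requires the neighbouring token to carry a noun part-of-speech tag (see NOUN_TAGS below), which this sketch omits.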

See SpacyEntityDetector for further information on how to use this detector, as it shares many of the same options.

Currently only English prefixes and suffixes are supported, but titles from other languages can easily be added, as in the example below:

>>> import scrubadub, scrubadub_spacy
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NOUN_TAGS['de'] = ['NN', 'NE', 'NNE']
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NAME_PREFIXES['de'] = ['frau', 'herr']
>>> detector = scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector(locale='de_DE', model='de_core_news_sm',
...     include_spacy=False)
>>> scrubber = scrubadub.Scrubber(detector_list=[detector], locale='de_DE')
>>> scrubber.clean("bleib dort Frau Schmidt")
'bleib dort {{NAME+NAME}}'
name: str = 'spacy_name'
NAME_PREFIXES = {'en': ['mr', 'mr.', 'mister', 'mrs', 'mrs.', 'misses', 'ms', 'ms.', 'miss', 'dr', 'dr.', 'doctor', 'prof', 'prof.', 'professor', 'lord', 'lady', 'rev', 'rev.', 'reverend', 'hon', 'hon.', 'honourable', 'hhj', 'honorable', 'judge', 'sir', 'madam', 'hello', 'dear', 'hi', 'hey', 'regards', 'to:', 'from:', 'sender:']}
NAME_SUFFIXES = {'en': ['phd', 'bsc', 'msci', 'ba', 'md', 'qc', 'ma', 'mba']}
NOUN_TAGS = {'en': ['NNP', 'NN', 'NNPS']}
TOKEN_SEARCH_DISTANCE = 3
MINIMUM_NAME_LENGTH = 1
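The idea behind these class attributes can be illustrated with a simplified, standalone sketch: a proper-noun token is flagged as a candidate name when a known prefix appears within a few tokens before it. This mirrors the roles of NAME_PREFIXES, NOUN_TAGS and TOKEN_SEARCH_DISTANCE but is not the detector's real implementation:

```python
# Simplified illustration of title-based name flagging (not the real code):
# a token tagged as a proper noun is flagged if a known name prefix appears
# within TOKEN_SEARCH_DISTANCE tokens before it.
NAME_PREFIXES = {"mr", "mrs", "dr", "hello"}
NOUN_TAGS = {"NNP"}
TOKEN_SEARCH_DISTANCE = 3


def flag_names(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs; returns flagged name tokens."""
    flagged = []
    for i, (word, tag) in enumerate(tagged_tokens):
        if tag not in NOUN_TAGS:
            continue
        window = tagged_tokens[max(0, i - TOKEN_SEARCH_DISTANCE):i]
        if any(w.lower().rstrip(".") in NAME_PREFIXES for w, _ in window):
            flagged.append(word)
    return flagged
```

The real detector works on spaCy tokens and part-of-speech tags rather than plain (word, tag) tuples, and also handles suffixes and multi-token names.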
__init__(include_spacy: bool = True, **kwargs)[source]

Initialise the Detector.

Parameters
  • include_spacy (bool, default True) – include the default spaCy named-entity detection in addition to the title-based detection.

  • named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}.

  • model (str, optional) – The name of the spacy model to use, it must contain a ‘ner’ step in the model pipeline (most do, but not all).

  • name (str, optional) – Overrides the default name of the :class:Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

static find_names(doc: spacy.tokens.doc.Doc, tokens: Sequence[spacy.tokens.token.Token], noun_tags: List[str]) spacy.tokens.doc.Doc[source]

This function searches for possible names in a flagged set of tokens and adds them to the identified entities.

iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None][source]

Yields discovered filth in a list of documents.

Parameters
  • document_list (List[str]) – A list of documents to clean.

  • document_names (List[str]) – A list containing the name of each document.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

classmethod supported_locale(locale: str) bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool

scrubadub_stanford.detectors.StanfordEntityDetector

class scrubadub_stanford.detectors.StanfordEntityDetector(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]

Bases: scrubadub.detectors.base.Detector

Search for people’s names, organization names and locations within text using the Stanford 3-class NER model.

The three classes of this model can be enabled with the three arguments to the initialiser: enable_person, enable_organization and enable_location. An example of their usage is given below.

>>> import scrubadub, scrubadub_stanford
>>> detector = scrubadub_stanford.detectors.StanfordEntityDetector(
...     enable_person=False, enable_organization=False, enable_location=True
... )
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean('Jane is visiting London.')
'Jane is visiting {{LOCATION}}.'
filth_cls

alias of scrubadub.filth.base.Filth

name: str = 'stanford'
ignored_words = ['tennant']
stanford_version = '4.0.0'
stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'
__init__(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]

Initialise the Detector.

Parameters
  • name (str, optional) – Overrides the default name of the :class:Detector

  • locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

iter_filth(text, document_name: Optional[str] = None)[source]

Yields discovered filth in the provided text.

Parameters
  • text (str) – The dirty text to clean.

  • document_name (str, optional) – The name of the document to clean.

Returns

An iterator to the discovered Filth

Return type

Iterator[Filth]

classmethod supported_locale(locale: str) bool[source]

Returns true if this Detector supports the given locale.

Parameters

locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.

Returns

True if the locale is supported, otherwise False

Return type

bool

Catalogue functions

These functions register or remove Detectors from the Detector catalogue.

scrubadub.detectors.register_detector

scrubadub.detectors.register_detector(detector: Type[Detector], *, autoload: Optional[bool] = None) Type[Detector][source]

Register a detector for use with the Scrubber class.

You can use register_detector(NewDetector, autoload=True) after your detector definition to automatically register it with the Scrubber class so that it can be used to remove Filth.

The argument autoload decides whether a new Scrubber() instance should load this detector by default.

>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
Parameters
  • detector (Detector class) – The Detector to register with the scrubadub detector configuration.

  • autoload (Optional[bool]) – Whether to automatically load this Detector on Scrubber initialisation.

scrubadub.detectors.remove_detector

scrubadub.detectors.remove_detector(detector: Union[Type[Detector], str])[source]

Remove an already registered detector.

>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.catalogue.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
>>> scrubadub.detectors.catalogue.remove_detector(NewDetector)
Parameters

detector (Union[Type[Detector], str]) – The Detector to remove from the scrubadub detector configuration, given as either the class or its name.