scrubadub.detectors¶
scrubadub consists of several Detectors, which are responsible for
identifying and iterating over the Filth that can be found in a piece of
text.
Base classes¶
Every Detector inherits from scrubadub.detectors.Detector.
scrubadub.detectors.Detector¶
- class scrubadub.detectors.Detector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: object
This is the base class for all detectors.
A simple example of how to make a new detector is given below:
>>> import scrubadub
>>> class MyFilth(scrubadub.filth.Filth):
...     type = 'mine'
>>> class MyDetector(scrubadub.detectors.Detector):
...     name = 'my_fr_detector'
...     def iter_filth(self, text, document_name=None):
...         # This detector always returns this same Filth no matter the input.
...         # You should implement something better here.
...         yield MyFilth(beg=0, end=8, text='My stuff', document_name=document_name, detector_name=self.name)
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(MyDetector)
>>> text = "My stuff can be found there."
>>> scrubber.clean(text)
'{{MINE}} can be found there.'
You can also advertise a Detector as supporting a certain locale by defining the Detector.supported_locale() function.
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- autoload: bool = False¶
- __init__(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- name: str = 'detector'¶
- static locale_transform(locale: str) -> str¶
Normalise the locale string, e.g. ‘fr’ -> ‘fr_FR’.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The normalised locale string
- Return type
str
- static locale_split(locale: str) -> Tuple[Optional[str], Optional[str]]¶
Split the locale string into the language and region.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
The two-letter language code and the two-letter region code in a tuple.
- Return type
tuple, (str, str)
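The two locale helpers above can be illustrated with a standalone sketch. It does not call scrubadub: the function names `normalise_locale` and `split_locale` are hypothetical, the use of the standard library's `locale.normalize` is an assumption about one way to expand 'fr' into 'fr_FR', and returning `(None, None)` for an unparseable string is an assumption based only on the `Optional` return annotation.

```python
import locale as _locale
import re
from typing import Optional, Tuple


def normalise_locale(locale_string: str) -> str:
    """Expand a bare language code into a language_REGION string (sketch only)."""
    # locale.normalize returns values such as 'fr_FR.ISO8859-1';
    # drop the encoding and any modifier to keep only 'fr_FR'.
    normalised = _locale.normalize(locale_string)
    return normalised.split('.')[0].split('@')[0]


def split_locale(locale_string: str) -> Tuple[Optional[str], Optional[str]]:
    """Split 'en_GB' into ('en', 'GB'); return (None, None) if it does not parse."""
    match = re.fullmatch(r'([a-z]{2})_([A-Z]{2})', locale_string)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


print(split_locale('en_GB'))
print(split_locale(normalise_locale('fr')))
```

The real `Detector.locale_transform` and `Detector.locale_split` may treat edge cases differently; the sketch only shows the shape of the input and output described above.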
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.RegexDetector¶
For convenience, there is also a RegexDetector, which makes it easy to
quickly add new types of Filth that can be identified from regular
expressions:
- class scrubadub.detectors.RegexDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.Detector
Base class to match PII with a regex.
This class requires that the filth_cls attribute be set to the class of the Filth that should be returned by this Detector.
>>> import re, scrubadub
>>> class NewUrlDetector(scrubadub.detectors.RegexDetector):
...     name = 'new_url_detector'
...     filth_cls = scrubadub.filth.url.UrlFilth
...     regex = re.compile(r'https.*$', re.IGNORECASE)
>>> scrubber = scrubadub.Scrubber(detector_list=[NewUrlDetector()])
>>> text = u"This url will be found https://example.com"
>>> scrubber.clean(text)
'This url will be found {{URL}}'
- regex: Optional[Pattern[str]] = None¶
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
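As a rough picture of what a regex-driven detector does, the standalone sketch below walks over regex matches and records their positions, then substitutes a placeholder. It uses only the standard library; `Span`, `iter_spans` and `replace_spans` are hypothetical names for illustration and are not part of scrubadub, whose own implementation may differ.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a Filth object: a matched region of text."""
    beg: int
    end: int
    text: str


def iter_spans(regex: re.Pattern, text: str) -> Iterator[Span]:
    """Yield one Span per regex match, keeping start and end offsets."""
    for match in regex.finditer(text):
        yield Span(beg=match.start(), end=match.end(), text=match.group())


def replace_spans(text: str, spans: Iterable[Span], placeholder: str) -> str:
    """Replace each span with the placeholder, working from the end of the
    text so that earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.beg, reverse=True):
        text = text[:span.beg] + placeholder + text[span.end:]
    return text


# Same pattern and text as the NewUrlDetector example above.
regex = re.compile(r'https.*$', re.IGNORECASE)
text = "This url will be found https://example.com"
spans = list(iter_spans(regex, text))
cleaned = replace_spans(text, spans, '{{URL}}')
print(cleaned)
```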
scrubadub.detectors.RegionLocalisedRegexDetector¶
- class scrubadub.detectors.RegionLocalisedRegexDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Detector to detect Filth using regular expressions that are localised by region.
- region_regex: Dict[str, Pattern] = {}¶
- __init__(**kwargs)[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- classmethod supported_locale(locale: str) -> bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
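One way to picture a region-keyed regex table is the standalone sketch below. It assumes, without confirmation from this page, that the keys of `region_regex` are upper-case region codes such as 'GB'. The postcode pattern is a deliberately simplified, hypothetical example, and `supported_locale` and `regex_for_locale` here are plain functions written for illustration, not the library's methods.

```python
import re
from typing import Dict, Optional

# Hypothetical table: region code -> compiled pattern.
# The pattern is a simplified UK postcode shape, not scrubadub's own regex.
REGION_REGEX: Dict[str, re.Pattern] = {
    'GB': re.compile(r'\b[A-Z]{1,2}[0-9][0-9A-Z]?\s?[0-9][A-Z]{2}\b'),
}


def supported_locale(locale: str) -> bool:
    """A locale counts as supported when its region has an entry in the table."""
    _language, _, region = locale.partition('_')
    return region in REGION_REGEX


def regex_for_locale(locale: str) -> Optional[re.Pattern]:
    """Look up the pattern for the region part of the locale, if any."""
    _language, _, region = locale.partition('_')
    return REGION_REGEX.get(region)


print(supported_locale('en_GB'), supported_locale('en_US'))
print(regex_for_locale('en_GB').search('Send it to SW1A 1AA please').group())
```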
Detectors enabled by default¶
These are the detectors that are enabled in the scrubber by default.
scrubadub.detectors.CredentialDetector¶
- class scrubadub.detectors.CredentialDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Remove username/password combinations from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.credential.CredentialFilth
- name: str = 'credential'¶
- autoload: bool = True¶
scrubadub.detectors.CreditCardDetector¶
- class scrubadub.detectors.CreditCardDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Remove credit-card numbers from dirty dirty text.
Supports Visa, MasterCard, American Express, Diners Club and JCB.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- name: str = 'credit_card'¶
- filth_cls¶
alias of
scrubadub.filth.credit_card.CreditCardFilth
- autoload: bool = True¶
scrubadub.detectors.DriversLicenceDetector¶
- class scrubadub.detectors.DriversLicenceDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect UK driving licence numbers. Simple pattern matching, no checksum solution.
- region_regex: Dict[str, Pattern]¶
- name: str = 'drivers_licence'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.drivers_licence.DriversLicenceFilth
scrubadub.detectors.EmailDetector¶
- class scrubadub.detectors.EmailDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove email addresses from dirty dirty text. This method also catches email addresses like john at gmail.com.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.email.EmailFilth
- name: str = 'email'¶
- autoload: bool = True¶
- at_matcher = re.compile('@|\\sat\\s', re.IGNORECASE)¶
- dot_matcher = re.compile('\\.|\\sdot\\s', re.IGNORECASE)¶
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
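The `at_matcher` and `dot_matcher` patterns listed for EmailDetector can be tried in isolation. The sketch below copies the two patterns from this page and shows what they match; how the detector itself combines them is not shown here, so the `normalise_obfuscated_email` helper is a hypothetical illustration rather than the library's logic.

```python
import re

# Patterns copied from the EmailDetector attributes above.
at_matcher = re.compile(r'@|\sat\s', re.IGNORECASE)
dot_matcher = re.compile(r'\.|\sdot\s', re.IGNORECASE)


def normalise_obfuscated_email(candidate: str) -> str:
    """Rewrite ' at ' as '@' and ' dot ' as '.' (hypothetical helper)."""
    with_at = at_matcher.sub('@', candidate)
    return dot_matcher.sub('.', with_at)


print(normalise_obfuscated_email('john at gmail.com'))
print(normalise_obfuscated_email('john AT gmail DOT com'))
```

Both patterns require whitespace on each side of the word, so a string such as "attic" or "dotted" is left untouched.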
scrubadub.detectors.en_GB.NationalInsuranceNumberDetector¶
- class scrubadub.detectors.en_GB.NationalInsuranceNumberDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to remove the GB National Insurance number (NINO). Simple pattern matching, no checksum solution.
- region_regex: Dict[str, Pattern]¶
- name: str = 'national_insurance_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.national_insurance_number.NationalInsuranceNumberFilth
scrubadub.detectors.PhoneDetector¶
- class scrubadub.detectors.PhoneDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.Detector
Remove phone numbers from dirty dirty text using python-phonenumbers, a port of a Google project to correctly format phone numbers in text.
Set the locale on the scrubber or detector to set the region used to search for valid phone numbers. If the locale is set to ‘en_CA’ Canadian numbers will be searched for, while setting the locale to ‘en_GB’ searches for British numbers.
- filth_cls¶
alias of
scrubadub.filth.phone.PhoneFilth
- name: str = 'phone'¶
- autoload: bool = True¶
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) -> bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.PostalCodeDetector¶
- class scrubadub.detectors.PostalCodeDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects postal codes; currently only British postcodes are supported.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.postalcode.PostalCodeFilth
- name: str = 'postalcode'¶
- autoload: bool = True¶
scrubadub.detectors.en_GB.TaxReferenceNumberDetector¶
- class scrubadub.detectors.en_GB.TaxReferenceNumberDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector
Use regular expressions to detect the UK PAYE temporary reference number (TRN). Simple pattern matching, no checksum solution.
- region_regex: Dict[str, Pattern]¶
- name: str = 'tax_reference_number'¶
- autoload: bool = True¶
- filth_cls¶
alias of
scrubadub.filth.en_GB.tax_reference_number.TaxReferenceNumberFilth
scrubadub.detectors.TwitterDetector¶
- class scrubadub.detectors.TwitterDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Use regular expression magic to remove twitter usernames from dirty dirty text.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.twitter.TwitterFilth
- name: str = 'twitter'¶
- autoload: bool = True¶
scrubadub.detectors.UrlDetector¶
- class scrubadub.detectors.UrlDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Use regular expressions to remove URLs that begin with http://, https:// or www. from dirty dirty text.
With keep_domain=True, this detector only obfuscates the path on a URL, not its domain. For example, http://twitter.com/someone/status/234978haoin becomes http://twitter.com/{{replacement}}.
- regex: Optional[Pattern[str]]¶
Compiled regular expression object.
- filth_cls¶
alias of
scrubadub.filth.url.UrlFilth
- name: str = 'url'¶
- autoload: bool = True¶
scrubadub.detectors.VehicleLicencePlateDetector¶
- class scrubadub.detectors.VehicleLicencePlateDetector(**kwargs)[source]¶
Bases: scrubadub.detectors.base.RegionLocalisedRegexDetector
Detects standard British licence plates.
- region_regex: Dict[str, Pattern]¶
- filth_cls¶
alias of
scrubadub.filth.vehicle_licence_plate.VehicleLicencePlateFilth
- name: str = 'vehicle_licence_plate'¶
- autoload: bool = True¶
Optional detectors¶
These detectors need to be manually added to a Scrubber; they are not loaded automatically.
An example is shown below that demonstrates the various ways that a detector can be added to a Scrubber:
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[scrubadub.detectors.TextBlobNameDetector()])
>>> scrubber.add_detector(scrubadub.detectors.CreditCardDetector)
>>> scrubber.add_detector('skype')
>>> detector = scrubadub.detectors.DateOfBirthDetector(require_context=True)
>>> scrubber.add_detector(detector)
For further information see the usage page.
scrubadub.detectors.DateOfBirthDetector¶
- class scrubadub.detectors.DateOfBirthDetector(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Bases: scrubadub.detectors.base.Detector
This detector aims to detect dates of birth in text.
First all possible dates are found, then they are filtered to those that would result in people being between DateOfBirthFilth.min_age_years and DateOfBirthFilth.max_age_years, which default to 18 and 100 respectively.
If require_context is True, we search for one of the possible context_words near the found date. We search up to context_before lines before the date and up to context_after lines after the date. The context words are terms like ‘birth’ or ‘DoB’ that increase the likelihood that the date is indeed a date of birth. They can be set using the context_words parameter, which expects a list of strings.
>>> import scrubadub, scrubadub.detectors.date_of_birth
>>> scrubadub.filth.date_of_birth.DateOfBirthFilth.min_age_years = 12
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.date_of_birth.DateOfBirthDetector(),
... ])
>>> scrubber.clean("I was born on 10-Nov-2008.")
'I was born {{DATE_OF_BIRTH}}.'
- name: str = 'date_of_birth'¶
- filth_cls¶
alias of
scrubadub.filth.date_of_birth.DateOfBirthFilth
- autoload: bool = False¶
- context_words_language_map = {'de': ['geburt', 'geboren', 'geb', 'geb.'], 'en': ['birth', 'born', 'dob', 'd.o.b.']}¶
- __init__(context_before: int = 2, context_after: int = 1, require_context: bool = True, context_words: Optional[List[str]] = None, **kwargs)[source]¶
Initialise the detector.
- Parameters
context_before (int) – The number of lines of context to search before the date
context_after (int) – The number of lines of context to search after the date
require_context (bool) – Set to False if your dates of birth are not near words that provide context (such as “birth” or “DOB”).
context_words (List[str], optional) – A list of words that provide context related to dates of birth, such as the following: ‘birth’, ‘born’, ‘dob’ or ‘d.o.b.’.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Search text for Filth and return a generator of Filth objects.
- Parameters
text (str) – The dirty text that this Detector should search
document_name (Optional[str]) – Name of the document that is being passed to this detector
- Returns
The found Filth in the text
- Return type
Generator[Filth]
- classmethod supported_locale(locale: str) -> bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code eg “en”, “es”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
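The context check described for DateOfBirthDetector (searching context_before lines before a date and context_after lines after it for a context word) can be sketched in plain Python. This is an illustration of the idea only: `has_birth_context` is a hypothetical helper, the simple substring test is an assumption, and the real detector may tokenise or match differently. The default word list is the English entry of `context_words_language_map` shown above.

```python
from typing import List, Sequence

# English entry of context_words_language_map, as listed above.
CONTEXT_WORDS = ['birth', 'born', 'dob', 'd.o.b.']


def has_birth_context(
    lines: List[str],
    date_line: int,
    context_before: int = 2,
    context_after: int = 1,
    context_words: Sequence[str] = CONTEXT_WORDS,
) -> bool:
    """Return True if any context word appears in the window of lines
    around the line that contains the date (hypothetical helper)."""
    start = max(0, date_line - context_before)
    end = min(len(lines), date_line + context_after + 1)
    window = ' '.join(lines[start:end]).lower()
    return any(word in window for word in context_words)


document = ["Name: Jane", "Date of birth:", "", "10-Nov-2008", "Signed"]
print(has_birth_context(document, date_line=3))
print(has_birth_context(document, date_line=3, context_before=1))
```

With the default window the phrase "Date of birth:" two lines above the date is found; shrinking context_before to 1 moves it out of range, which is the trade-off the two parameters control.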
scrubadub.detectors.SkypeDetector¶
- class scrubadub.detectors.SkypeDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Skype usernames tend to be used inline in dirty dirty text quite often but also appear as skype: {{SKYPE}} quite a bit. This method looks at words within word_radius words of “skype” for things that appear to be misspelled or have punctuation in them as a means to identify skype usernames.
Default word_radius is 10, corresponding with the rough scale of half of a sentence before or after the word “skype” is used. Increasing the word_radius will increase the false positive rate and decreasing the word_radius will increase the false negative rate.
- filth_cls¶
alias of
scrubadub.filth.skype.SkypeFilth
- name: str = 'skype'¶
- autoload: bool = False¶
- word_radius = 10¶
- SKYPE_TOKEN = '[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]+'¶
- SKYPE_USERNAME = re.compile('[a-zA-Z][a-zA-Z0-9_\\-\\,\\.]{5,31}')¶
- iter_filth(text, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
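The word_radius idea can be illustrated with a standalone sketch. It reuses the `SKYPE_USERNAME` pattern listed above, but everything else is an assumption made for illustration: `candidates_near_skype` is a hypothetical function, splitting on whitespace is a simplification, and the "contains a digit or punctuation" test is only a crude stand-in for the detector's check for words that look misspelled or contain punctuation.

```python
import re
from typing import List

# Pattern copied from the SKYPE_USERNAME attribute above.
SKYPE_USERNAME = re.compile(r'[a-zA-Z][a-zA-Z0-9_\-\,\.]{5,31}')
# Crude stand-in for "misspelled or has punctuation": any digit or punctuation.
LOOKS_UNUSUAL = re.compile(r'[0-9_\-,.]')


def candidates_near_skype(text: str, word_radius: int = 10) -> List[str]:
    """Collect username-like tokens within word_radius words of 'skype'."""
    words = text.split()
    found: List[str] = []
    for index, word in enumerate(words):
        if 'skype' not in word.lower():
            continue
        start = max(0, index - word_radius)
        end = min(len(words), index + word_radius + 1)
        for position in range(start, end):
            token = words[position]
            if position == index or token in found:
                continue
            if SKYPE_USERNAME.fullmatch(token) and LOOKS_UNUSUAL.search(token):
                found.append(token)
    return found


sentence = "jane.doe_42 is what I use on skype"
print(candidates_near_skype(sentence))
print(candidates_near_skype(sentence, word_radius=1))
```

A small radius misses the username six words away from "skype", while a large radius would also scan unrelated tokens, which matches the false negative and false positive trade-off described above.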
scrubadub.detectors.TaggedEvaluationFilthDetector¶
- class scrubadub.detectors.TaggedEvaluationFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases: scrubadub.detectors.base.Detector
Use this Detector to find tagged filth as true Filth. This is useful when you want to evaluate the effectiveness of a Detector using Filth that has been selected by a human.
Results from this detector are used as the “truth” against which the other detectors are compared. This is done in scrubadub.comparison.get_filth_classification_report where the detection accuracies are calculated.
An example of how to use this detector is given below:
>>> import scrubadub, scrubadub.comparison, scrubadub.detectors.text_blob
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.text_blob.TextBlobNameDetector(name='name_detector'),
...     scrubadub.detectors.TaggedEvaluationFilthDetector([
...         {'match': 'Tom', 'filth_type': 'name'},
...         {'match': 'tom@example.com', 'filth_type': 'email'},
...     ]),
... ])
>>> filth_list = list(scrubber.iter_filth("Hello I am Tom"))
>>> print(scrubadub.comparison.get_filth_classification_report(filth_list))
filth    detector         locale    precision    recall  f1-score   support
 name    name_detector    en_US          1.00      1.00      1.00         1
                                     accuracy                1.00         1
                                    macro avg    1.00  1.00  1.00         1
                                 weighted avg    1.00  1.00  1.00         1
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) - a string value that will be searched for in the text
filth_type (str) - a string value that indicates the type of Filth, should be set to Filth.name. An example of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) - if specified will search for Filth starting with the value of match and ending with the value of match_end
limit (int) - an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) - Ignore case when searching for the tagged filth
ignore_whitespace (bool) - Ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) - Ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} - will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} - will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz.
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} - will search for an exact match to 012345, ignoring any partial matches and return it as a PhoneFilth
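To make the matching rules for known filth items concrete, the sketch below turns one such dictionary into a regular expression using only the standard library. It is a rough approximation under stated assumptions: `build_pattern` is a hypothetical helper, the detector's real search code is not shown on this page and may work differently, and the ignore_whitespace flag is left out to keep the example short.

```python
import re
from typing import Any, Dict


def build_pattern(item: Dict[str, Any]) -> re.Pattern:
    """Build a regex approximating one known filth item (hypothetical helper)."""
    flags = re.DOTALL  # let '.' span newlines between match and match_end
    if item.get('ignore_case'):
        flags |= re.IGNORECASE
    body = re.escape(item['match'])
    if 'match_end' in item:
        # Up to `limit` characters (default 150) between match and match_end.
        limit = item.get('limit', 150)
        body += '.{0,%d}?' % limit + re.escape(item['match_end'])
    if item.get('ignore_partial_word_matches'):
        # Word boundaries stop "Eve" from matching inside "Evening".
        body = r'\b' + body + r'\b'
    return re.compile(body, flags)


ranged = build_pattern({'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'})
whole_word = build_pattern(
    {'match': 'Eve', 'filth_type': 'name', 'ignore_partial_word_matches': True}
)
print(ranged.search('xx aaabbbzzz').group())
print(whole_word.search('Good Evening'))
```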
This detector is not enabled by default (since you need to supply a list of known filths) and so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- filth_cls¶
alias of
scrubadub.filth.tagged.TaggedEvaluationFilth
- name: str = 'tagged'¶
- autoload: bool = False¶
- __init__(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Initialise the Detector.
- Parameters
known_filth_items (list of dicts) – A list of dictionaries that describe items to be searched for in the dirty text. The keys match and filth_type are required, which give the text to be searched for and the type of filth that the match string represents. See the class docstring for further details of available flags in this dictionary.
tagged_filth (bool, default True) – Whether the filth has been tagged and should be used as truth when calculating filth finding accuracies.
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static dedup_dicts(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem]) -> List[scrubadub.detectors.tagged.KnownFilthItem][source]¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) -> scrubadub.filth.base.Filth[source]¶
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
scrubadub.detectors.TextBlobNameDetector¶
- class scrubadub.detectors.TextBlobNameDetector(name: Optional[str] = None, locale: str = 'en_US')[source]¶
Bases: scrubadub.detectors.base.RegexDetector
Use part of speech tagging from textblob to clean proper nouns out of the dirty dirty text. Disallow particular nouns by adding them to the TextBlobNameDetector.disallowed_nouns set.
- filth_cls¶
alias of
scrubadub.filth.name.NameFilth
- name: str = 'text_blob_name'¶
- autoload: bool = False¶
- disallowed_nouns = {'skype'}¶
- iter_filth(text, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) -> bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub.detectors.UserSuppliedFilthDetector¶
- class scrubadub.detectors.UserSuppliedFilthDetector(known_filth_items: List[scrubadub.detectors.tagged.KnownFilthItem], **kwargs)[source]¶
Bases: scrubadub.detectors.tagged.TaggedEvaluationFilthDetector
Use this Detector to find some known filth in the text. An example might be if you have a list of employee numbers that you wish to remove from a document, as shown below:
>>> import scrubadub
>>> scrubber = scrubadub.Scrubber(detector_list=[
...     scrubadub.detectors.UserSuppliedFilthDetector([
...         {'match': 'Anika', 'filth_type': 'name'},
...         {'match': 'Larry', 'filth_type': 'name'},
...     ]),
... ])
>>> scrubber.clean("Anika is my favourite employee.")
'{{NAME}} is my favourite employee.'
This detector takes a list of dictionaries (referred to as known filth items). These specify what to look for in the text to label as tagged filth. The dictionary should contain the following keys:
match (str) - a string value that will be searched for in the text
filth_type (str) - a string value that indicates the type of Filth, should be set to Filth.name. An example of these could be ‘name’ or ‘phone’ for name and phone filths respectively.
The known filth item dictionary may also optionally contain:
match_end (str) - if specified will search for Filth starting with the value of match and ending with the value of match_end
limit (int) - an integer describing the maximum number of characters between match and match_end, defaults to 150
ignore_case (bool) - Ignore case when searching for the tagged filth
ignore_whitespace (bool) - Ignore whitespace when matching (“asd qwe” can also match “asd\nqwe”)
ignore_partial_word_matches (bool) - Ignore matches that are only partial words (if you’re looking for “Eve”, this flag ensures it won’t match “Evening”)
Examples of this:
{'match': 'aaa', 'filth_type': 'name'} - will search for an exact match to aaa and return it as a NameFilth
{'match': 'aaa', 'match_end': 'zzz', 'filth_type': 'name'} - will search for aaa followed by up to 150 characters followed by zzz, which would match both aaabbbzzz and aaazzz.
{'match': '012345', 'filth_type': 'phone', 'ignore_partial_word_matches': True} - will search for an exact match to 012345, ignoring any partial matches and return it as a PhoneFilth
This detector is not enabled by default (since you need to supply a list of known filths) and so you must always add it to your scrubber with a scrubber.add_detector(detector) call or by adding it to the detector_list when initialising a Scrubber.
- name: str = 'user_supplied'¶
- create_filth(start_location: int, end_location: int, text: str, comparison_type: Optional[str], detector_name: str, document_name: Optional[str], locale: str) -> scrubadub.filth.base.Filth[source]¶
External detectors¶
These are detectors that are not included in the scrubadub package, usually because they come with large
external dependencies that are not always needed.
To use them you should first import their package and then add them to the Scrubber. An example of this is shown
below:
>>> import scrubadub, scrubadub_address
>>> scrubber = scrubadub.Scrubber()
>>> scrubber.add_detector(scrubadub_address.detectors.AddressDetector)
scrubadub_address.detectors.AddressDetector¶
scrubadub_spacy.detectors.SpacyEntityDetector¶
- class scrubadub_spacy.detectors.SpacyEntityDetector(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Bases: scrubadub.detectors.base.Detector
Use spaCy’s named entity recognition to identify possible Filth.
This detector is made to work with v3 of spaCy, since the NER model has been significantly improved in this version.
This is particularly useful to remove names from text, but can also be used to remove any entity that is recognised by spaCy. A full list of entities that spaCy supports can be found here: https://spacy.io/api/annotation#named-entities.
Additional entities can be added like so:
>>> import scrubadub, scrubadub_spacy
>>> class MoneyFilth(scrubadub.filth.Filth):
...     type = 'money'
>>> scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map['MONEY'] = MoneyFilth
>>> detector = scrubadub_spacy.detectors.spacy.SpacyEntityDetector(named_entities=['MONEY'])
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean("You owe me 12 dollars man!")
'You owe me {{MONEY}} man!'
The dictionary scrubadub_spacy.detectors.spacy.SpacyEntityDetector.filth_cls_map is used to map between the spaCy named entity label and the type of scrubadub Filth, while the named_entities argument sets which named entities are considered Filth by the SpacyEntityDetector.
- filth_cls_map = {'DATE': <class 'scrubadub.filth.date_of_birth.DateOfBirthFilth'>, 'FAC': <class 'scrubadub.filth.location.LocationFilth'>, 'GPE': <class 'scrubadub.filth.location.LocationFilth'>, 'LOC': <class 'scrubadub.filth.location.LocationFilth'>, 'ORG': <class 'scrubadub.filth.organization.OrganizationFilth'>, 'PER': <class 'scrubadub.filth.name.NameFilth'>, 'PERSON': <class 'scrubadub.filth.name.NameFilth'>}¶
- name: str = 'spacy'¶
- language_to_model = {'de': 'de_dep_news_trf', 'en': 'en_core_web_trf', 'es': 'es_dep_news_trf', 'fr': 'fr_dep_news_trf', 'nl': 'nl_core_news_trf', 'zh': 'zh_core_web_trf'}¶
- disallowed_nouns = {'skype'}¶
- __init__(named_entities: Optional[Iterable[str]] = None, model: Optional[str] = None, **kwargs)[source]¶
Initialise the Detector.
- Parameters
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}
model (str, optional) – The name of the spacy model to use, it must contain a ‘ner’ step in the model pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- iter_filth(text: str, document_name: Optional[str] = None) -> Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) -> bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub_spacy.detectors.SpacyNameDetector¶
- class scrubadub_spacy.detectors.SpacyNameDetector(include_spacy: bool = True, **kwargs)[source]¶
Bases: scrubadub_spacy.detectors.spacy.SpacyEntityDetector
Add an extension to the spacy detector to look for tokens that often occur before or after people’s names. A prefix might be Hello as in “Hello Jane”, or Mrs as in “Mrs Jane Smith”, and a suffix could be PhD as in “Jane Smith PhD”.
See the SpacyEntityDetector for further info on how to use this detector as it shares many similar options.
Currently only English prefixes and suffixes are supported, but other language titles can be easily added, as in the example below:
>>> import scrubadub, scrubadub_spacy
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NOUN_TAGS['de'] = ['NN', 'NE', 'NNE']
>>> scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector.NAME_PREFIXES['de'] = ['frau', 'herr']
>>> detector = scrubadub_spacy.detectors.spacy_name_title.SpacyNameDetector(locale='de_DE', model='de_core_news_sm',
...                                                                         include_spacy=False)
>>> scrubber = scrubadub.Scrubber(detector_list=[detector], locale='de_DE')
>>> scrubber.clean("bleib dort Frau Schmidt")
'bleib dort {{NAME+NAME}}'
- name: str = 'spacy_name'¶
- NAME_PREFIXES = {'en': ['mr', 'mr.', 'mister', 'mrs', 'mrs.', 'misses', 'ms', 'ms.', 'miss', 'dr', 'dr.', 'doctor', 'prof', 'prof.', 'professor', 'lord', 'lady', 'rev', 'rev.', 'reverend', 'hon', 'hon.', 'honourable', 'hhj', 'honorable', 'judge', 'sir', 'madam', 'hello', 'dear', 'hi', 'hey', 'regards', 'to:', 'from:', 'sender:']}¶
- NAME_SUFFIXES = {'en': ['phd', 'bsc', 'msci', 'ba', 'md', 'qc', 'ma', 'mba']}¶
- NOUN_TAGS = {'en': ['NNP', 'NN', 'NNPS']}¶
- TOKEN_SEARCH_DISTANCE = 3¶
- MINIMUM_NAME_LENGTH = 1¶
- __init__(include_spacy: bool = True, **kwargs)[source]¶
Initialise the Detector.
- Parameters
include_spacy (bool, default True) – Whether to include the default spacy entity detection in addition to the title detector.
named_entities (Iterable[str], optional) – Limit the named entities to those in this list, defaults to {'PERSON', 'PER', 'ORG'}.
model (str, optional) – The name of the spacy model to use; it must contain a ‘ner’ step in the model pipeline (most do, but not all).
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- static find_names(doc: spacy.tokens.doc.Doc, tokens: Sequence[spacy.tokens.token.Token], noun_tags: List[str]) spacy.tokens.doc.Doc[source]¶
This function searches for possible names in a flagged set of tokens and adds them to the identified entities.
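The idea of flagging tokens near a title and then keeping the likely names can be sketched without spacy. This is a simplified illustration and not the library's algorithm: `flag_candidate_tokens` and `likely_names` are hypothetical helpers, tokens are plain strings, and a capital-letter test stands in for the part-of-speech tags listed in NOUN_TAGS.

```python
from typing import List

# Small subsets of the prefix and suffix lists documented above.
NAME_PREFIXES = ["mr", "mrs", "dr", "hello", "dear"]
NAME_SUFFIXES = ["phd", "md", "qc"]
TOKEN_SEARCH_DISTANCE = 3  # same value as the class attribute above


def flag_candidate_tokens(tokens: List[str]) -> List[int]:
    """Return sorted indices of tokens within the search distance of a title."""
    flagged = set()
    for i, token in enumerate(tokens):
        lowered = token.lower()
        if lowered in NAME_PREFIXES:
            # Tokens after a prefix are candidates, e.g. "Mrs Jane Smith".
            stop = min(len(tokens), i + 1 + TOKEN_SEARCH_DISTANCE)
            flagged.update(range(i + 1, stop))
        if lowered in NAME_SUFFIXES:
            # Tokens before a suffix are candidates, e.g. "Jane Smith PhD".
            start = max(0, i - TOKEN_SEARCH_DISTANCE)
            flagged.update(range(start, i))
    return sorted(flagged)


def likely_names(tokens: List[str]) -> List[str]:
    """Keep flagged tokens that are capitalised and are not titles themselves."""
    titles = set(NAME_PREFIXES) | set(NAME_SUFFIXES)
    return [
        tokens[i]
        for i in flag_candidate_tokens(tokens)
        if tokens[i][:1].isupper() and tokens[i].lower() not in titles
    ]


print(likely_names("Hello Jane how are you".split()))
print(likely_names("Jane Smith PhD".split()))
```

In the real detector the filtering relies on the tagger output of the spacy model, so results on actual text will differ from this capital-letter heuristic.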
- iter_filth_documents(document_list: Sequence[str], document_names: Sequence[Optional[str]]) Generator[scrubadub.filth.base.Filth, None, None][source]¶
Yields discovered filth in a list of documents.
- Parameters
document_list (List[str]) – A list of documents to clean.
document_names (List[str]) – A list containing the name of each document.
- Returns
A generator yielding the Filth discovered in the documents
- Return type
Generator[Filth, None, None]
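The batch interface pairs each document with its name and yields results lazily. The sketch below shows that shape only: `ExampleFilth` is a hypothetical stand-in for a Filth object, the fixed search for "Jane" replaces real entity detection, and the length check is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Iterator, Optional, Sequence


@dataclass
class ExampleFilth:
    """Hypothetical stand-in for a Filth object."""
    beg: int
    end: int
    text: str
    document_name: Optional[str]


def iter_filth_documents(
    document_list: Sequence[str],
    document_names: Sequence[Optional[str]],
) -> Iterator[ExampleFilth]:
    """Yield one ExampleFilth per occurrence of 'Jane', tagged with its document name."""
    # Assumption: the two sequences must line up one-to-one.
    if len(document_list) != len(document_names):
        raise ValueError("Each document needs a corresponding name (or None).")
    needle = "Jane"
    for text, name in zip(document_list, document_names):
        start = text.find(needle)
        while start != -1:
            yield ExampleFilth(start, start + len(needle), needle, name)
            start = text.find(needle, start + len(needle))


found = list(iter_filth_documents(["Hello Jane", "No names here"], ["a.txt", None]))
print(found)
```

Because the function is a generator, nothing is scanned until the caller iterates over it, which is why the example wraps the call in `list(...)`.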
- classmethod supported_locale(locale: str) bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
scrubadub_stanford.detectors.StanfordEntityDetector¶
- class scrubadub_stanford.detectors.StanfordEntityDetector(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Bases: scrubadub.detectors.base.Detector
Search for people’s names, organizations’ names and locations within text using the Stanford 3-class model.
The three classes of this model can be enabled with the three arguments to the initialiser: enable_person, enable_organization and enable_location. An example of their usage is given below.
>>> import scrubadub, scrubadub_stanford
>>> detector = scrubadub_stanford.detectors.StanfordEntityDetector(
...     enable_person=False, enable_organization=False, enable_location=True
... )
>>> scrubber = scrubadub.Scrubber(detector_list=[detector])
>>> scrubber.clean('Jane is visiting London.')
'Jane is visiting {{LOCATION}}.'
- filth_cls¶
alias of
scrubadub.filth.base.Filth
- name: str = 'stanford'¶
- ignored_words = ['tennant']¶
- stanford_version = '4.0.0'¶
- stanford_download_url = 'https://nlp.stanford.edu/software/stanford-ner-{version}.zip'¶
- __init__(enable_person: bool = True, enable_organization: bool = True, enable_location: bool = False, **kwargs)[source]¶
Initialise the Detector.
- Parameters
name (str, optional) – Overrides the default name of the Detector
locale (str, optional) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- iter_filth(text, document_name: Optional[str] = None)[source]¶
Yields discovered filth in the provided text.
- Parameters
text (str) – The dirty text to clean.
document_name (str, optional) – The name of the document to clean.
- Returns
An iterator to the discovered Filth
- Return type
Iterator[Filth]
- classmethod supported_locale(locale: str) bool[source]¶
Returns true if this Detector supports the given locale.
- Parameters
locale (str) – The locale of the documents in the format: 2 letter lower-case language code followed by an underscore and the two letter upper-case country code, eg “en_GB” or “de_CH”.
- Returns
True if the locale is supported, otherwise False
- Return type
bool
Catalogue functions¶
These functions register or remove Detectors from the Detector catalogue.
scrubadub.detectors.register_detector¶
- scrubadub.detectors.register_detector(detector: Type[Detector], *, autoload: Optional[bool] = None) Type[Detector][source]¶
Register a detector for use with the Scrubber class.
You can use register_detector(NewDetector, autoload=True) after your detector definition to automatically register it with the Scrubber class so that it can be used to remove Filth.
The argument autoload decides whether a new Scrubber() instance should load this detector by default.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
- Parameters
detector (Detector class) – The Detector to register with the scrubadub detector configuration.
autoload (Optional[bool]) – Whether to automatically load this Detector on Scrubber initialisation.
scrubadub.detectors.remove_detector¶
- scrubadub.detectors.remove_detector(detector: Union[Type[Detector], str])[source]¶
Remove an already registered detector.
>>> import scrubadub
>>> class NewDetector(scrubadub.detectors.Detector):
...     pass
>>> scrubadub.detectors.catalogue.register_detector(NewDetector, autoload=False)
<class 'scrubadub.detectors.catalogue.NewDetector'>
>>> scrubadub.detectors.catalogue.remove_detector(NewDetector)
- Parameters
detector (Union[Type[Detector], str]) – The Detector to remove from the scrubadub detector configuration.
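How a catalogue with an autoload flag might behave can be sketched with a plain dictionary. This is an illustration under assumptions, not scrubadub's source: the `Detector` class, the `catalogue` dictionary and `autoloaded_names` are hypothetical stand-ins, and the sketch assumes that a string argument to `remove_detector` is the detector's name and that `autoload=None` leaves the class's own setting unchanged.

```python
from typing import Dict, List, Optional, Type, Union


class Detector:
    """Hypothetical stand-in for a detector base class."""
    name: str = "detector"
    autoload: bool = False


# Hypothetical catalogue keyed by detector name.
catalogue: Dict[str, Type[Detector]] = {}


def register_detector(
    detector: Type[Detector], *, autoload: Optional[bool] = None
) -> Type[Detector]:
    """Add the detector class to the catalogue and return it."""
    if autoload is not None:
        # Assumption: None means "keep whatever the class already declares".
        detector.autoload = autoload
    catalogue[detector.name] = detector
    return detector


def remove_detector(detector: Union[Type[Detector], str]) -> None:
    """Remove a detector, given either the class or its name."""
    name = detector if isinstance(detector, str) else detector.name
    catalogue.pop(name, None)


def autoloaded_names() -> List[str]:
    """Names that a freshly created scrubber would load by default."""
    return sorted(n for n, d in catalogue.items() if d.autoload)


class NewDetector(Detector):
    name = "new_detector"


registered = register_detector(NewDetector, autoload=True)
names_after_register = autoloaded_names()
remove_detector("new_detector")
names_after_remove = autoloaded_names()
print(names_after_register, names_after_remove)
```

Returning the class from `register_detector` mirrors the documented return type, `Type[Detector]`.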