Advanced usage

By default, scrubadub aggressively removes content from text that may reveal personal identity, but there are circumstances where such omissions are unnecessary and detrimental to downstream analysis. scrubadub lets users fine-tune how content is de-identified through the methods of the scrubadub.scrubbers.Scrubber class.

class scrubadub.scrubbers.Scrubber

The Scrubber class is used to clean personal information out of dirty dirty text.

clean_with_placeholders(text)

This is the master method that cleans all of the filth out of the dirty dirty text using the default options for all of the other clean_* methods below.

clean_proper_nouns(text, replacement='{{NAME}}')

Use part of speech tagging to clean proper nouns out of the dirty dirty text.

clean_email_addresses(text, replacement='{{EMAIL}}')

Use regular expression magic to remove email addresses from dirty dirty text. This method also catches email addresses like john at gmail.com.
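As a rough illustration of the idea, the sketch below uses a single regular expression that accepts either a literal @ or the word "at" between the local part and the domain. The pattern and the function name sketch_clean_email_addresses are assumptions made for this example; they are not scrubadub's actual expression or API.

```python
import re

# Hypothetical pattern for illustration only; scrubadub's own expression may differ.
# The separator is either "@" or the word "at" surrounded by whitespace.
EMAIL_PATTERN = re.compile(
    r"\b[A-Za-z0-9._%+-]+(?:@|\s+at\s+)[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b",
    re.IGNORECASE,
)


def sketch_clean_email_addresses(text, replacement="{{EMAIL}}"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub(replacement, text)


print(sketch_clean_email_addresses("contact john at gmail.com or jane@example.org today"))
```

Accepting "at" as a separator catches obfuscated addresses, but it also matches innocent phrases such as "look at example.com", so a pattern like this trades some false positives for better recall.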

clean_urls(text, replacement='{{URL}}', keep_domain=False)

Use regular expressions to remove URLs that begin with http://, https:// or www. from dirty dirty text.

With keep_domain=True, this method only obfuscates the path on a URL, not its domain. For example, http://twitter.com/someone/status/234978haoin becomes http://twitter.com/{{replacement}}.
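The keep_domain behaviour can be sketched with a regular expression that captures the domain and the path separately. The pattern, the group names, and the function name sketch_clean_urls are assumptions for this example rather than scrubadub's implementation. Leaving a URL with no path untouched when keep_domain=True is likewise a choice made by this sketch.

```python
import re

# Hypothetical pattern for illustration only. "domain" covers the scheme (or
# "www.") plus the host; "path" covers everything from the first slash onward.
URL_PATTERN = re.compile(
    r"(?P<domain>(?:https?://|www\.)[^\s/]+)(?P<path>/\S*)?",
    re.IGNORECASE,
)


def sketch_clean_urls(text, replacement="{{URL}}", keep_domain=False):
    """Replace URLs with a placeholder, optionally preserving the domain."""

    def substitute(match):
        if not keep_domain:
            return replacement
        if match.group("path"):
            # Keep the domain and obfuscate only the path.
            return match.group("domain") + "/" + replacement
        # Assumption of this sketch: with no path there is nothing to hide.
        return match.group("domain")

    return URL_PATTERN.sub(substitute, text)


example = "see http://twitter.com/someone/status/234978haoin now"
print(sketch_clean_urls(example))
print(sketch_clean_urls(example, keep_domain=True))
```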

clean_phone_numbers(text, replacement='{{PHONE}}', region='US')

Remove phone numbers from dirty dirty text using python-phonenumbers, a port of a Google project to correctly format phone numbers in text.

region specifies the best-guess region to start with (default: "US"). Specify None to consider only numbers with a leading +.

clean_credentials(text, username_replacement='{{USERNAME}}', password_replacement='{{PASSWORD}}')

Remove username/password combinations from dirty dirty text.

clean_skype(text, replacement='{{SKYPE}}', word_radius=10)

Skype usernames tend to be used inline in dirty dirty text quite often but also appear as skype: {{SKYPE}} quite a bit. This method looks at words within word_radius words of “skype” for things that appear to be misspelled or have punctuation in them as a means to identify skype usernames.

Default word_radius is 10, corresponding with the rough scale of half of a sentence before or after the word “skype” is used. Increasing the word_radius will increase the false positive rate and decreasing the word_radius will increase the false negative rate.
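To make the effect of word_radius concrete, the sketch below collects the words within word_radius positions of each mention of "skype" and flags those that contain something other than letters. This is a simplified stand-in: the function name sketch_skype_candidates is hypothetical, the misspelling check described above is omitted, and scrubadub's actual heuristics may differ.

```python
import string


def sketch_skype_candidates(text, word_radius=10):
    """Return words near "skype" that contain digits or inner punctuation."""
    tokens = text.split()
    candidates = []
    for index, token in enumerate(tokens):
        if "skype" not in token.lower():
            continue
        # Window of word_radius words on each side of the keyword.
        start = max(0, index - word_radius)
        stop = min(len(tokens), index + word_radius + 1)
        for position in range(start, stop):
            if position == index:
                continue
            # Ignore punctuation that merely surrounds the word.
            word = tokens[position].strip(string.punctuation)
            if word and not word.isalpha() and word not in candidates:
                candidates.append(word)
    return candidates


message = "you can reach me on skype at john.doe_42 most afternoons"
print(sketch_skype_candidates(message, word_radius=2))
print(sketch_skype_candidates(message, word_radius=1))
```

With word_radius=2 the username falls inside the window and is flagged; with word_radius=1 it falls outside and is missed, which is the false-negative trade-off described above.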