scrubadub

Remove personally identifiable information from free text. Sometimes we have additional metadata about the people we wish to anonymize. Other times we don’t. This package makes it easy to scrub personal information from free text without compromising the privacy of the people we are trying to protect.

scrubadub currently supports removing:

  • Names (proper nouns) via textblob
  • Email addresses
  • URLs
  • Phone numbers via phonenumbers
  • Username / password combinations
  • Skype usernames
  • Social security numbers

Quick start

Getting started with scrubadub is as easy as running pip install scrubadub and incorporating it into your Python scripts like this:

>>> import scrubadub

# John may be a cat, but he doesn't want other people to know it.
>>> text = u"John is a cat"

# Replace names with {{NAME}} placeholder. This is the scrubadub default
# because it maximally omits any information about people.
>>> scrubadub.clean(text)
u"{{NAME}} is a cat"

# Replace names with anonymous but consistent {{NAME-ID}} identifiers.
>>> scrubadub.clean(text, replace_with='identifier')
u"{{NAME-0}} is a cat"
>>> scrubadub.clean("John spoke with Doug.", replace_with='identifier')
u"{{NAME-0}} spoke with {{NAME-1}}."
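
The idea behind consistent identifiers can be illustrated with a small standalone sketch. This is not scrubadub's implementation: the clean_emails function, its lookup argument, and the simplified email pattern are all hypothetical names chosen for illustration, and the sketch uses only the standard library. It assumes that each distinct value receives the next unused integer the first time it is seen, which is the behavior the {{NAME-0}} and {{NAME-1}} output above suggests.

```python
import re

# Simplified email pattern for illustration only (an assumption, not
# the pattern scrubadub uses).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def clean_emails(text, replace_with="placeholder", lookup=None):
    """Replace email addresses with {{EMAIL}} or consistent {{EMAIL-n}} IDs.

    Hypothetical helper: `lookup` maps each distinct address to the
    integer assigned when it was first seen. Passing the same dict to
    several calls keeps identifiers consistent across texts.
    """
    lookup = {} if lookup is None else lookup

    def substitute(match):
        if replace_with == "placeholder":
            # Reveal nothing beyond the fact that an address was here.
            return "{{EMAIL}}"
        if replace_with == "identifier":
            # Same address (case-insensitive) -> same ID; new address -> next ID.
            key = match.group(0).lower()
            identifier = lookup.setdefault(key, len(lookup))
            return "{{EMAIL-%d}}" % identifier
        raise ValueError("unknown replace_with: %r" % (replace_with,))

    return EMAIL_RE.sub(substitute, text)


text = "Mail a@example.com, then b@example.org, then a@example.com again."
print(clean_emails(text))
print(clean_emails(text, replace_with="identifier"))
```

In this sketch the placeholder mode hides how many distinct addresses appear, while the identifier mode keeps it visible that the first and third addresses are the same without revealing what they are.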

There are many ways to tailor the behavior of scrubadub using different Detector and Filth classes (see under_the_hood). These advanced techniques allow users to fine-tune the manner in which scrubadub cleans dirty, dirty text.
