Change Log¶
This project uses semantic versioning to track version numbers, where backwards incompatible changes (highlighted in bold) bump the major version of the package.
latest changes in development for next release¶
2.0.0¶
There have been some changes in the scrubadub API, but few breaking changes. The headline changes include:
Several new detectors have been added (spacy, stanford NER, tax reference number, credit card, …).
Splitting of the scrubadub package into smaller parts.
Added ability to easily evaluate a Detector's performance, see Accuracy.
Started to localise detectors to function for more than one language/location.
Support for scrubbing multiple documents together.
Introduced the concept of a PostProcessor. This will allow more complex groupings of Filths and new types of tokenization.
New detector configuration/management system.
Scrubber¶
Detectors and PostProcessors can be added and removed using a string containing their default name, their class or an instance.
You can clean multiple documents with one Scrubber().clean_documents(docs) call.
A default set of Detectors is loaded instead of all Detectors. This is particularly useful for detectors that are slow or have complex dependencies, as they don't need to be loaded each time. However, this might need an explicit Scrubber().add_detector(detector) call for the same behaviour as before.
Added a locale parameter to the Scrubber initialiser.
A Scrubber will only auto-load detectors that support a given Scrubber locale.
The Scrubber will ensure that filth is valid with a call to Filth().is_valid().
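The three ways of adding a detector (default name, class or instance) and the multi-document call can be sketched as follows. This is a minimal, self-contained illustration: ToyScrubber and ToyEmailDetector are hypothetical stand-ins written for this example, not scrubadub's classes, and the real Scrubber may resolve detectors differently.

```python
import re


class ToyEmailDetector:
    """Hypothetical stand-in for a detector; not scrubadub's class."""
    name = "email"

    def find(self, text):
        # Return (start, end, label) for anything that looks like an email.
        pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
        return [(m.start(), m.end(), "EMAIL") for m in re.finditer(pattern, text)]


class ToyScrubber:
    """Hypothetical stand-in showing name/class/instance resolution."""
    # Catalogue mapping default names to detector classes.
    catalogue = {ToyEmailDetector.name: ToyEmailDetector}

    def __init__(self):
        self.detectors = {}

    def add_detector(self, detector):
        if isinstance(detector, str):      # a default name
            detector = self.catalogue[detector]()
        elif isinstance(detector, type):   # a class
            detector = detector()
        # Otherwise assume it is already an instance.
        if detector.name in self.detectors:
            raise KeyError(f"detector {detector.name!r} already added")
        self.detectors[detector.name] = detector

    def clean(self, text):
        spans = sorted(s for d in self.detectors.values() for s in d.find(text))
        # Replace from the end so earlier offsets stay valid.
        for start, end, label in reversed(spans):
            text = text[:start] + "{{" + label + "}}" + text[end:]
        return text

    def clean_documents(self, docs):
        # Clean several documents with one call.
        return [self.clean(doc) for doc in docs]


def build(detector):
    scrubber = ToyScrubber()
    scrubber.add_detector(detector)
    return scrubber


docs = ["Contact joe@example.com", "No filth here"]
results = [build(d).clean_documents(docs)
           for d in ("email", ToyEmailDetector, ToyEmailDetector())]
```

All three entries in results are the same list, since each form resolves to an equivalent detector instance.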
Detectors¶
The name of the detector has been separated from the type of filth found. This means multiple instances of the same detector (configured differently) can be in the same Scrubber instance and one Detector can return multiple types of Filth.
Detectors are now required to define an attribute called name, which should be unique within a Scrubber instance.
Detectors are now passed a locale argument to the Detector initialiser.
Detectors have an optional supported_locale(locale) function that returns a bool to indicate if a given Detector supports a locale.
Regular expressions used by the RegexDetector class have been moved from RegexFilth.regex to RegexDetector.regex.
Renamed SSNDetector to SocialSecurityNumberDetector.
New AddressDetector, which detects US, CA and GB addresses.
New CreditCardDetector, which detects credit card numbers (based on the Detector in the alphagov scrubadub fork).
New DateOfBirthDetector, which detects dates of birth (thanks to @mirandachong).
New DriversLicenceDetector, which detects GB drivers licence numbers.
New TaggedEvaluationFilthDetector, which is used to tag real filth in text when you’re evaluating the quality of your filth removal.
New UserSuppliedFilthDetector, which is used to find bits of Filth that you know will be in the text.
New PostalCodeDetector, which detects GB post codes.
New SpacyEntityDetector, which detects a range of named entities, including names (thanks to @aCampello).
New StanfordEntityDetector, which also detects a slightly different range of named entities, including names.
New NationalInsuranceNumberDetector, which detects GB National Insurance Numbers (NINO) (thanks to @mirandachong).
New TaxReferenceNumberDetector, which detects GB Tax Reference Numbers (TRN) (thanks to @mirandachong).
New VehicleLicencePlateDetector, which detects number plates on GB cars (based on the Detector in the alphagov scrubadub fork).
New RegionLocalisedRegexDetector, which is derived from the convenience class RegexDetector to allow for quickly creating regional regex based detectors.
Detectors can now be registered to a catalogue of Detectors. This allows detectors to be defined in separate packages.
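The separation of detector name from filth type, the locale argument and the supported_locale(locale) check can be illustrated with a small sketch. ToyRegionRegexDetector is a hypothetical stand-in, not scrubadub's RegionLocalisedRegexDetector, and the postcode regex is deliberately simplified and should not be taken as the pattern the library uses.

```python
import re


class ToyRegionRegexDetector:
    """Hypothetical stand-in for a region-localised regex detector."""
    name = "toy_postcode"        # identifies the detector
    filth_type = "postalcode"    # identifies what it finds
    # One (simplified, illustrative) regex per region code.
    region_regex = {
        "GB": re.compile(r"\b[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}\b"),
    }

    def __init__(self, name=None, locale="en_GB"):
        # A per-instance name lets two differently configured copies coexist.
        if name is not None:
            self.name = name
        self.locale = locale
        self.regex = self.region_regex[self.region(locale)]

    @staticmethod
    def region(locale):
        # Assumes a "language_REGION" locale string, e.g. "en_GB" -> "GB".
        return locale.split("_")[1]

    @classmethod
    def supported_locale(cls, locale):
        # True only if a regex is defined for the locale's region.
        return cls.region(locale) in cls.region_regex

    def find(self, text):
        return [(self.name, self.filth_type, m.group())
                for m in self.regex.finditer(text)]


first = ToyRegionRegexDetector(name="postcode_a")
second = ToyRegionRegexDetector(name="postcode_b")
found = first.find("Send it to SW1A 1AA please")
```

Here first and second share a filth type but carry different names, which is what allows both to live in the same scrubber-like container.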
Filth¶
Introduced three parameters in the constructor: detector_name, document_name and locale. These keep track of the Detector that found the Filth, the document it came from and the document's locale. This results in Filth objects being passed additional parameters on initialisation. If you have defined custom Filths they will need to be updated so that Filth.__init__ accepts the detector_name, document_name and locale keywords and calls the base class constructor.
Added a generate() function that allows you to generate fake examples of that Filth. This can be used to help evaluate detector performance.
Added an is_valid() function, which can be used to ensure that a piece of detected filth is indeed valid.
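The migration pattern for a custom Filth (accept the new keywords and forward them to the base constructor) might look roughly like the sketch below. ToyFilth is a hypothetical stand-in for the base class and TicketNumberFilth is an invented example type; the real Filth constructor may take further arguments not shown here.

```python
class ToyFilth:
    """Hypothetical stand-in for a base Filth class; not scrubadub's."""
    type = "unknown"

    def __init__(self, beg=0, end=0, text="", detector_name=None,
                 document_name=None, locale=None):
        self.beg = beg
        self.end = end
        self.text = text
        self.detector_name = detector_name   # which detector found it
        self.document_name = document_name   # which document it came from
        self.locale = locale                 # the document's locale

    def is_valid(self):
        return True


class TicketNumberFilth(ToyFilth):
    """Invented custom filth type, used only for illustration."""
    type = "ticket_number"

    def __init__(self, *args, prefix="TKT", **kwargs):
        # Forward detector_name, document_name and locale to the base class.
        super().__init__(*args, **kwargs)
        self.prefix = prefix

    def is_valid(self):
        # Only accept text such as "TKT-1234".
        head, _, tail = self.text.partition("-")
        return head == self.prefix and tail.isdigit()


filth = TicketNumberFilth(
    beg=0, end=8, text="TKT-1234",
    detector_name="ticket", document_name="doc1.txt", locale="en_GB",
)
```

Using *args and **kwargs in the subclass keeps it working if the base class gains further keyword parameters later.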
PostProcessors¶
- Introduction of simple PostProcessors:
  - FilthReplacer: Replace the filth with the type of filth (example@example.com -> EMAIL), a configurable hash (example@example.com -> 196aa39e9f8159ec) or a monotonically increasing number for each unique piece of filth, optionally including the filth type (example@example.com -> EMAIL-1).
  - PrefixSuffixReplacer: Add a prefix and/or suffix onto the replacement (EMAIL-1 -> {{EMAIL-1}}).
It is envisioned that other more complex operations can be done here too such as grouping filth (e.g. “John”, “John Doe” and “Mr. Doe” could be grouped together).
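The replacement strategies described above can be sketched in plain Python. The functions and the NumberedReplacer class are hypothetical illustrations, not scrubadub's FilthReplacer or PrefixSuffixReplacer, and the salted, truncated SHA-256 used for the hash is an assumption; the library's hashing scheme may differ.

```python
import hashlib


def replace_with_type(filth_type):
    # example@example.com -> EMAIL
    return filth_type.upper()


def replace_with_hash(text, salt="example-salt", length=16):
    # Salted SHA-256, truncated; an assumed scheme for illustration only.
    digest = hashlib.sha256((salt + text).encode("utf-8")).hexdigest()
    return digest[:length]


class NumberedReplacer:
    """Assigns an increasing number to each unique piece of filth."""

    def __init__(self, include_type=True):
        self.include_type = include_type
        self.seen = {}

    def replace(self, filth_type, text):
        key = (filth_type, text)
        if key not in self.seen:
            self.seen[key] = len(self.seen) + 1
        number = str(self.seen[key])
        if self.include_type:
            return f"{filth_type.upper()}-{number}"
        return number


def add_prefix_suffix(replacement, prefix="{{", suffix="}}"):
    # EMAIL-1 -> {{EMAIL-1}}
    return prefix + replacement + suffix


numbered = NumberedReplacer()
a = numbered.replace("email", "example@example.com")
b = numbered.replace("email", "other@example.com")
c = numbered.replace("email", "example@example.com")
wrapped = add_prefix_suffix(a)
```

Because the numbering is keyed on the filth text, a and c receive the same replacement while b gets the next number, so repeated mentions of one address stay linked after scrubbing.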
1.2.2¶
LeapBeyond are now supporting scrubadub with maintenance and development.
bug fixes:
StopIteration no longer supported in recent python versions (#41 via @roman-y-korolev)
Fix test runner with python 3 (#42 via @roman-y-korolev)
Update documentation to reflect new repository location (#49)
This is the last version that will be explicitly compatible with python 2.7.
1.2.1¶
bug fixes:
bumped textblob version (#43 via @roman-y-korolev)
fixed documentation (#32 via @ivyleavedtoadflax)
1.2.0¶
added python 3 compatibility (#31 via @davidread)
1.1.1¶
1.1.0¶
1.0.3¶
minor change to force Detector.filth_cls to exist (#13)
1.0.1¶
several bug fixes, including:
installation bug (#12)
1.0.0¶
major update to process Filth in parallel (#11)
0.1.0¶
0.0.1¶
initial release, ported from past projects