Various privacy laws and other data security regulations detail how private, financial, and other confidential records should be handled. These rules cover the usage and storage of data such as credit card numbers, social security numbers, social insurance information, and health care records. As described in a previous blog article, having this Personally Identifiable Information (PII) or Payment Card Industry (PCI) data loose in your system can cause all kinds of trouble, ranging from fines to loss of reputation. Thankfully, Discovery Attender for Windows has various methods to assist in finding and removing troublesome data that is found unencrypted in your email or file systems.

The first thing to realize when locating PCI or PII data is that it is rather tricky to gather accurate data without also collecting items that are not relevant. Since the PCI or PII in question usually matches a pattern of numeric characters, one of the basic methods to locate this information is to use regular expressions. A regular expression (aka Regex or GREP) is a defined sequence of characters that describes a pattern to match. Discovery Attender supports regular expressions with the syntax REGEX(“pattern”). For example:

  • Typical Credit Card pattern: RegEx(“(\d{4}([\D]?\d{4}){3}([\D]?\d{3})?|\d{4}[\D]?\d{6}[\D]?\d{5}([\D]?\d{4})?)[^A-Za-z0-9]”)
  • Typical Social Security Number pattern: RegEx(“\b(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([-]?)(?!00)\d\d\3(?!0000)\d{4}\b”)

However, when using typical regular expressions, false positives (items that match the regular expression pattern but are not actually credit card or social security numbers) become a problem. Myriad other numbers may match the same patterns. For example, the number 5501316362657721007 will match the regular expression pattern for a credit card, but is obviously not one.
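To make the false-positive problem concrete, here is a minimal sketch that runs the credit card pattern above against the number from the text. It uses Python's `re` module rather than the .NET engine inside Discovery Attender, on the assumption that both treat this particular pattern the same way; the function name `find_card_like` and the sample sentence are made up for illustration.

```python
import re

# The article's credit card pattern, written as a Python raw string.
# Assumption: Python's `re` handles this pattern like the .NET engine does.
CC_REGEX = re.compile(
    r"(\d{4}([\D]?\d{4}){3}([\D]?\d{3})?"
    r"|\d{4}[\D]?\d{6}[\D]?\d{5}([\D]?\d{4})?)[^A-Za-z0-9]"
)

def find_card_like(text):
    """Return every substring that has the shape of a card number."""
    # group(1) is the number itself, without the trailing delimiter.
    return [m.group(1) for m in CC_REGEX.finditer(text)]

# Hypothetical sample: a 19-digit identifier that is not a credit card.
sample = "Tracking id 5501316362657721007 was logged at noon."
print(find_card_like(sample))  # ['5501316362657721007']
```

The pattern only describes the shape of the number, so any 16- or 19-digit identifier followed by a non-alphanumeric character is reported.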

The trick is to catch all instances of PCI or PII data while eliminating as many of these false positives from the search as possible. To diminish the problem, it helps to employ a bit of programmatic logic in your scanning mechanism to whittle down false contenders. For credit card numbers, this is made a bit easier because a checksum called the Luhn algorithm can be applied to each prospective match. Social security numbers also have a few rules (no starting with ‘000’ or ‘666’, for example). Sticking to those will help further filter out incorrect matches. Another tactic is to check either side of the matching snippet for additional numeric values. In Discovery Attender, these rules are given a bit of a push with the ‘PATTERN’ keyword:

  • PATTERN(CC) uses a regular expression to match standard credit card patterns. It then programmatically checks that the match is isolated and not part of a larger number. It further tests the match by applying the Luhn algorithm for validation.
  • PATTERN(SSN) finds Social Security Numbers (US) by matching the standard pattern, then validating it for isolation and the expected range (i.e. not having all zeros in a digit group, or 666 or 900 to 999 in the first digit group).
  • PATTERN(SIN) finds Social Insurance Numbers (Canada) by matching the pattern, validating for isolation and the basic rules, and also applying the Luhn algorithm.
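The checks described above can be sketched in a few lines of Python. This is not Discovery Attender's implementation, only an illustration of the same ideas: a Luhn checksum, an isolation test, and the SSN range rules listed in the bullets. All function names and the simplified `CANDIDATE` regex are assumptions made for this example.

```python
import re

# Simplified candidate pattern (an assumption): 13 to 19 digits,
# optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\d(?:[ -]?\d){12,18}")

def luhn_valid(number):
    """Luhn check: double every second digit from the right, sum, mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def is_isolated(text, start, end):
    """True when no digit sits directly before or after text[start:end]."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return not (before.isdigit() or after.isdigit())

def find_cards(text):
    """Keep only candidates that are isolated and pass the Luhn check."""
    return [
        m.group()
        for m in CANDIDATE.finditer(text)
        if is_isolated(text, m.start(), m.end()) and luhn_valid(m.group())
    ]

def ssn_groups_valid(area, group, serial):
    """Range rules from the text for the three SSN digit groups."""
    if area == "000" or group == "00" or serial == "0000":
        return False
    # 666 and 900 to 999 are excluded from the first group.
    return area != "666" and area < "900"
```

With these helpers, the widely used test number 4111 1111 1111 1111 passes, while the 19-digit number 5501316362657721007 from earlier in the article is rejected because its Luhn sum is not a multiple of 10.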

In testing at Sherpa Software, applying a PATTERN removed over 75% of the false positives in our sample sets. However, that still leaves a number of them. Some false positives persist because identically formatted numbers are used for other, non-regulated purposes (e.g. account numbers) in your organization. Absent any format constraints, a ZIP code in ZIP+4 format will match social security number patterns. Additionally, both credit card and social security number patterns match data found in log files, random URLs, and spreadsheets. The link example below, for instance, will match PATTERN(CC).

[Image: link example (request)]

At this point, professional knowledge of the ESI in your organization comes in handy. Sherpa’s tools give users the flexibility to add further criteria, such as proximity and Boolean operators, to the PATTERNs or to create custom regular expressions. These should be used by someone familiar with the local ESI to craft additional, appropriate refinements to the search. For example, the following expression has two parts:

{credit* OR charge*} NEAR(3) {card* OR num*} NEAR(10) PATTERN(CC) OR card* NEAR(3) num* NEAR(10) PATTERN(CC) OR {AMEX OR AX OR “American Express” OR VISA OR VS OR MC OR MASTER* OR DS OR DISCOVER} NEAR(10) PATTERN(CC)

It will only match a result if it contains one of the following in close proximity to the credit card pattern:

  • The words credit card, charge card, credit card number, charge num, etc.
  • Names or abbreviations for common credit card providers

The same could be created for social security numbers:

{“ss#” OR “ss #” OR ssn OR soc*} NEAR(5, BEFORE) PATTERN(SSN)

The above expressions will significantly reduce false positives, but will fail to pick up anything that doesn’t include the matching keywords. For example, the following would not match the credit card expression, but probably is of interest.


Since the above example will match a PATTERN(CC), users must weigh the pluses and minuses of refining the criteria: the likelihood of missed expressions vs. the time and energy necessary for sifting through false positives. To give you the best chance of success, it is extremely important to test your criteria with items that should be found as well as those that shouldn’t. The integrated Keyword Tester should always be used to validate your keyword criteria, but it is especially useful when testing out regular expressions, patterns and search expressions.
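The trade-off can also be explored outside the product with a small test harness. The sketch below is a rough Python approximation, not Discovery Attender syntax or behavior: `CARD_SHAPE` is a simplified stand-in for PATTERN(CC), `keyword_near` loosely imitates a NEAR(10) proximity rule by counting whitespace-separated words on either side of the match, and the sample sentences are invented.

```python
import re

# Hypothetical stand-ins for PATTERN(CC) and the keyword list.
CARD_SHAPE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")
KEYWORDS = re.compile(
    r"\b(credit|charge|card|visa|amex|discover)\w*", re.IGNORECASE
)

def keyword_near(text, start, end, max_words=10):
    """True if a keyword appears within max_words words of the match."""
    before = text[:start].split()[-max_words:]
    after = text[end:].split()[:max_words]
    return KEYWORDS.search(" ".join(before + after)) is not None

def scan(text):
    """Return (loose_hits, strict_hits) for one piece of text."""
    matches = list(CARD_SHAPE.finditer(text))
    loose = [m.group() for m in matches]
    strict = [
        m.group() for m in matches
        if keyword_near(text, m.start(), m.end())
    ]
    return loose, strict

# Invented samples: (text, should this item be found?)
SAMPLES = [
    ("Please charge my Visa card 4111 1111 1111 1111 today.", True),
    ("Use 4111 1111 1111 1111 for the renewal.", True),
    ("Build log entry 1234567890123456 completed.", False),
]

for text, should_find in SAMPLES:
    loose, strict = scan(text)
    print(f"expected={should_find} loose={bool(loose)} strict={bool(strict)}")
```

In this sketch the loose search reports all three samples, including the log entry, while the strict search drops the log entry but also misses the second sample, which has no keyword near the number. Running both kinds of samples through your real criteria shows the same trade-off before you commit to a full search.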

Now that you have your data, what do you want to do with it? In Discovery Attender you can review, report, copy, move or delete (depending on the type of data store) the source documents. There are a few other items to be aware of when using Discovery Attender to find PCI and PII data:

  • There are a number of regular expression libraries on the Internet that have useful features for creating and testing regular expressions. However, many of these libraries are geared toward finding data in fields (e.g. validating against a web site payment entry form) rather than in a document or email body. Discovery Attender uses the .NET flavor of Regex, so it is essential that you use the integrated Keyword Tester to validate any custom regular expressions and not rely on those found on the web.
  • The Preview Pane and Text Search viewer cannot render regular expressions or highlight the found pattern. Two things can help mitigate this issue. First, the keyword snippets and keyword details views can give you a better idea of the context of the hits. Next, there is an option in the Settings to view an entire keyword snippet in the result. This will highlight an entire line of text which can be helpful for locating matches in large documents.
  • Always, always, always do a sample search first!

Sherpa Software offers other solutions for locating PCI or PII data in enterprise or Domino environments. Please contact your sales representative to set up a demonstration of features. Don’t hesitate to contact Technical Support if you have any questions about PCI or PII searching in Discovery Attender.