Just about every day, the national headlines include stories about data breaches that have compromised the security of thousands or millions of confidential electronic records, often containing credit card or social security numbers. Usually, these occur at organizations found to be in violation of various regulations or industry standards designed to prevent such incursions. These notable hacks and thefts have prompted organizations to review their own policies to avoid becoming fodder for the next news cycle. At Sherpa Software, we are concerned with managing and searching corporate data. This post will focus on one specific portion of this compliance challenge – finding credit card or social security numbers in your repositories of electronically stored information (ESI).

Two key areas of data compliance revolve around Payment Card Industry (PCI) data and Personally Identifiable Information (PII). PCI data falls under the aegis of the Data Security Standard (PCI DSS), currently in version 3.0, promulgated by a council of global payment brands (Visa, American Express, etc.). PII includes such things as social security numbers, dates of birth, personal health information, and other data that can identify an individual. Various privacy laws and industry regulations detail how PII and PCI records should be handled. Recommended protections range from maintaining secure networks to managing the flow of private information. For data stored at rest (e.g. on file shares or email servers), the details generally boil down to three things: don’t transmit or store this data in plain text, audit your systems, and employ policies to ensure your organization remains in compliance.

One helpful technique for performing these steps effectively is to scan systems for PCI or PII data that is not encrypted. If that data is found, you can take steps to remedy the issue via policy or software. Since the PCI or PII in question usually matches a pattern, one of the basic methods to locate it is pattern-based searching in the form of regular expressions. However, false positives – items that match the pattern but are not actually credit card or social security numbers – become a problem. A myriad of other numbers may match the same patterns. A typical social security number matches a nine-digit pattern, but, absent any constraints, so too will a zip code in zip+4 format. Additionally, both credit card and social security number patterns match data found in log files, random URLs, and spreadsheets. The trick is to catch all instances of PCI or PII data to achieve compliance while eliminating as many of these false positives from the search as possible, which speeds up the review and resolution of any problem areas.
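
To make the false-positive problem concrete, here is a minimal sketch in Python. The patterns, the function name `naive_scan`, and the sample text are illustrative assumptions, not the expressions used by any particular product. The nine-digit pattern is deliberately loose, so it flags a zip+4 code alongside a genuinely formatted social security number.

```python
import re

# Deliberately naive patterns (assumptions for illustration only):
# nine digits with optional separators, and 13-16 digits with optional separators.
SSN_PATTERN = re.compile(r"\d{3}[- ]?\d{2}[- ]?\d{4}")
CARD_PATTERN = re.compile(r"\d(?:[ -]?\d){12,15}")


def naive_scan(text):
    """Return (kind, matched_text) for every raw pattern hit in text."""
    hits = [("ssn", m.group()) for m in SSN_PATTERN.finditer(text)]
    hits += [("card", m.group()) for m in CARD_PATTERN.finditer(text)]
    return hits


sample = "SSN 123-45-6789 on file; ship to Pittsburgh, PA 15222-1234."
for kind, value in naive_scan(sample):
    print(kind, value)
```

Both the social security number and the zip+4 code are reported as `ssn` hits, which is exactly the kind of noise a reviewer would otherwise have to wade through.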

At this point, a simple regular expression search may leave too many false matches. To diminish the problem, it helps to employ a bit of programmatic logic in your scanning mechanism to whittle down the contenders. For credit card numbers, this is made a bit easier because a prospective match can be validated against its checksum. Social security numbers have a few rules (no starting with ‘000’ or ‘666’, for example) and several basic patterns with specific delimiters. Sticking to those will help further filter out false matches. Another trick is to check the characters on either side of the matched text: if they are numeric, the match is probably a fragment of a longer number. While testing at Sherpa Software, we found that applying this logic (programmed into our default PATTERN functionality) removed over 75% of the false positives in our sample sets.
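
The three filters described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Sherpa's PATTERN implementation: the function names are hypothetical, the card checksum is assumed to be the standard Luhn check, and the social security rules beyond ‘000’ and ‘666’ (no area number starting with 9, no ‘00’ group, no ‘0000’ serial) are additions based on the published numbering rules.

```python
import re

# Accepted SSN layouts (an assumption): 123-45-6789, 123 45 6789, 123456789.
# The backreference \2 forces the same delimiter in both positions.
SSN_FORMAT = re.compile(r"(\d{3})([- ]?)(\d{2})\2(\d{4})")


def luhn_valid(candidate):
    """Check a candidate card number against the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def ssn_plausible(candidate):
    """Apply layout and numbering rules to a candidate SSN."""
    m = SSN_FORMAT.fullmatch(candidate)
    if m is None:
        return False
    area, group, serial = m.group(1), m.group(3), m.group(4)
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def has_adjacent_digit(text, start, end):
    """True if the match at text[start:end] is embedded in a longer digit run."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return before.isdigit() or after.isdigit()
```

With these checks in place, a zip+4 code such as `15222-1234` is rejected by `ssn_plausible` because its delimiter falls in the wrong position, and a card-length number that fails `luhn_valid` can be dropped before it ever reaches a reviewer.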

But some false positives may persist because identically formatted numbers are used for other, non-regulated purposes (e.g. account numbers) in your organization. At this point, professional knowledge of the ESI in your organization comes in handy. Sherpa’s tools give users the flexibility to combine criteria such as proximity and Boolean operators with the regular expression functionality. These should be used by someone familiar with the local ESI to craft further appropriate refinements to the search.
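
As a rough idea of what a proximity refinement looks like, the Python sketch below keeps a pattern hit only when one of a set of keywords appears within a fixed number of characters of it. The function name `hits_near_keywords`, the keyword list, and the 40-character window are all hypothetical choices for illustration; this is not the syntax of Sherpa's tools, and the right keywords depend entirely on your own data.

```python
import re

# Strict dashed SSN layout, used here only to generate candidate hits.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def hits_near_keywords(text, keywords, window=40):
    """Keep only pattern hits with a keyword within `window` characters."""
    lowered = [k.lower() for k in keywords]
    kept = []
    for m in SSN_PATTERN.finditer(text):
        # Slice out the text surrounding the match and search it case-insensitively.
        context = text[max(0, m.start() - window): m.end() + window].lower()
        if any(k in context for k in lowered):
            kept.append(m.group())
    return kept


doc = ("Employee SSN: 123-45-6789. " + "x" * 60 +
       " Account ref 321-54-9876 closed.")
print(hits_near_keywords(doc, ["ssn", "social security"]))
```

Here the first number is kept because "SSN" appears beside it, while the identically formatted account reference is dropped. The same idea works in reverse: someone who knows the local ESI could exclude hits that sit next to a term such as "account".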

The goal of all this searching is to complete a key step in compliance: passing an audit for unencrypted PCI or PII in locally stored data. If your policy is sound and confidential data flow is well regulated, genuine hits should never be found. However, if your proactive steps do turn up unwanted PCI or PII data, you have an opportunity to resolve the situation – via policy, software and procedure – before your organization is featured, rather disparagingly, on the nightly news.