Dec 15, 2021

How to discover PII across your systems

Monitoring PII across your systems has become a necessary evil with privacy compliance creeping up on the agenda every year.

While there are numerous solutions out there that provide visibility into the data you store, there are three main methods of tracking personal data: surveys, scanning, and proxy implementation. But which of these techniques is best for your organization?

Surveys

Contrary to some of the marketing materials for the scanning software out there, surveys can be a cheap and effective way of discovering what personal data you hold and in which systems. This method suits smaller companies that don’t use many different SaaS tools and databases.

Carrying out surveys to discover personal data across your systems is as simple as it sounds. You just need to email all relevant stakeholders and ask them which:

  • SaaS they are using
  • Databases they use

In a small company, chances are that SaaS use is more pervasive from a data standpoint than databases, and this can make it hard to keep tight control over personal data.

Moreover, taking stock of the data you hold requires data owners to have good knowledge of their data practices. Larger companies, however, will need perfect knowledge of their data inventory in order to be compliant with privacy regulations.

This can be a big downside to surveys if perfect information is needed: because data owners are sifting through everything by hand, the process can take a long time and mistakes can be made.

However, surveys offer a primitive way to get a bird’s-eye view of the data in the company and are a quick and cheap solution for data discovery.

Scan

While conducting surveys can be a chore for those tasked with doing so and a bit of a do-it-and-forget-about-it exercise, there are also software tools out there that can take care of the manual work for you. 

These tools operate on a kind of set-it-and-forget-it basis. They work similarly to an anti-virus scanner: you launch the scan, go about your business, and then come back to it when the scan is done.

Just like anti-virus scanners, the time it takes to complete a scan of all systems for PII depends on the amount of data you store. Moreover, if you are using cloud-based tools (SaaS), the scanners will need to have integrations to sift through what you are storing off-site.

This can be a big downside if you have a lot of data to go through, or are using SaaS that needs custom integrations to be scanned. Conversely, if you want detailed insight into everything you hold and where, you have the time to do a full scan, and are using popular software tools, this is probably the solution for you.

In 2016, IDG reported that the average company stored 162.9TB of data. This would take a scanner around 70 days to process if 80% were unstructured data. Imagine how long that would take now.

While machine learning is helping scanners get faster, if you have auditors at your door and need to have an inventory fast, you would have to employ multiple scanners at a large cost, not to mention that more advanced scanners generally cost more.
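
As a rough sanity check on the figures above, the sketch below backs out the average throughput implied by scanning 162.9TB in about 70 days, then estimates how the duration shrinks when several scanners run in parallel. It assumes decimal terabytes (1TB = 10^12 bytes) and perfectly linear scaling across scanners, which real deployments rarely achieve. The function names are illustrative only.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def implied_throughput_mb_s(data_tb: float, days: float) -> float:
    """Average throughput (MB/s) needed to scan data_tb within the given days."""
    total_bytes = data_tb * BYTES_PER_TB
    return total_bytes / (days * SECONDS_PER_DAY) / 10**6


def scan_days(data_tb: float, throughput_mb_s: float, scanners: int = 1) -> float:
    """Estimated days to scan data_tb, assuming linear scaling across scanners."""
    if scanners < 1 or throughput_mb_s <= 0:
        raise ValueError("need at least one scanner and a positive throughput")
    total_mb = data_tb * BYTES_PER_TB / 10**6
    return total_mb / (throughput_mb_s * scanners) / SECONDS_PER_DAY


rate = implied_throughput_mb_s(162.9, 70)
print(f"Implied throughput: {rate:.1f} MB/s")  # roughly 26.9 MB/s
print(f"With 7 scanners: {scan_days(162.9, rate, scanners=7):.1f} days")  # about 10 days
```

Under these assumptions, cutting a 70-day scan to about 10 days means running seven scanners at once, which is where the cost starts to add up.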

It’s also worth mentioning the invasive nature of these scanning tools. In order for them to audit all of your data to discover what personal data you hold, you need to authorize access to many (if not all) of the places you store data. 

HTTP proxy

The last method is to discover PII by monitoring it with an HTTP proxy. This method involves using a standard proxy that receives traffic and forwards it to another service, which performs the analysis.

In a nutshell, it works by rerouting traffic through the proxy, which forwards it to an analyzer. The analyzer extracts personal information and turns it into a metadata record, which is then sent to a dashboard.
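
To make the analyzer step concrete, here is a minimal sketch of turning one intercepted payload into a metadata record. Everything in it is an assumption made for illustration: the `MetadataRecord` fields, the `analyze` function, the service names, and the two regular expressions, which are simplistic stand-ins for whatever detection a real analyzer would use. Note that the record stores only the types of PII found, not the values themselves.

```python
import json
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative patterns only; real detection would need to be far more robust.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


@dataclass
class MetadataRecord:
    """Hypothetical record describing one observed data flow."""
    source: str
    destination: str
    pii_types: list[str]
    observed_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def analyze(source: str, destination: str, body: str) -> MetadataRecord:
    """Return a record naming the PII types found in body; values are not kept."""
    found = sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(body))
    return MetadataRecord(source, destination, found)


payload = json.dumps({"user": "jane@example.com", "note": "call back later"})
record = analyze("web-frontend", "crm-api", payload)
print(record.pii_types)  # ['email']
```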

Think of it as a kind of traffic police looking for personal data that flows via API calls.

Unlike scanning, which runs through your systems processing data at rest, proxy data monitoring processes data in motion, meaning that it isn’t invasive to systems in the way scanning is. However, since it only handles data in motion, discovery can only happen if personal data is in transit and passes through the proxy. Also, the proxy can add latency to data flows, as it is effectively an additional hurdle that the data has to jump before reaching its destination.

Proxy systems scan data as it is being used, recognizing (using machine learning) and classifying the parts that are personal data. This can be helpful when trying to uncover risky data practices because you can see where the data has come from and where it is going.

It is worth noting, though, that like the solutions above, the proxy works asynchronously with data flows, and it is implemented to be as fast as possible so as not to harm latency and throughput. Failsafes can be added to “switch off” the proxy if it is causing data traffic jams.
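
One way such a failsafe might look is sketched below. It is an assumption, not a description of any particular product: the `LatencyFailsafe` class, the `handle` function, the latency budget, and the rolling window are all hypothetical. The idea is that requests are always forwarded unchanged, and a copy is queued for analysis only while the recent average overhead stays within budget.

```python
from collections import deque


class LatencyFailsafe:
    """Switch analysis off when recent proxy overhead exceeds a latency budget."""

    def __init__(self, budget_ms: float, window: int = 5) -> None:
        self.budget_ms = budget_ms
        self.samples = deque(maxlen=window)  # most recent overhead measurements
        self.enabled = True

    def record(self, overhead_ms: float) -> None:
        """Store one overhead measurement and re-evaluate the switch."""
        self.samples.append(overhead_ms)
        average = sum(self.samples) / len(self.samples)
        self.enabled = average <= self.budget_ms


def handle(request_body: str, failsafe: LatencyFailsafe, queue: list) -> str:
    """Forward the request unchanged; queue a copy for analysis only if enabled."""
    if failsafe.enabled:
        queue.append(request_body)  # analyzed later, off the request path
    return request_body


failsafe = LatencyFailsafe(budget_ms=10.0, window=3)
queue = []
handle("first request", failsafe, queue)
for overhead in (40.0, 45.0, 50.0):  # simulated traffic jam
    failsafe.record(overhead)
handle("second request", failsafe, queue)  # forwarded, but not analyzed
print(failsafe.enabled, queue)  # False ['first request']
```

How overhead is measured, and how the proxy gets switched back on once traffic recovers, are left open here and would depend on the deployment.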

If the service sits on a high enough volume of traffic for long enough, it can tell how many systems there are on the backend. This means there is no need to scan; you just monitor traffic to know which systems are there.
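
As a rough illustration of how an inventory can fall out of traffic alone, the sketch below groups observed calls by destination. The host names, PII labels, and the `build_inventory` helper are made up for the example; in practice the observations would come from the analyzer's metadata records.

```python
from collections import defaultdict

# Hypothetical observations: (destination host, PII types seen in that call)
observations = [
    ("crm-api.internal", ["email"]),
    ("billing.internal", ["email", "card_number"]),
    ("crm-api.internal", ["phone"]),
]


def build_inventory(observations: list) -> dict:
    """Map each observed backend system to the set of PII types it received."""
    inventory = defaultdict(set)
    for destination, pii_types in observations:
        inventory[destination].update(pii_types)
    return dict(inventory)


inventory = build_inventory(observations)
print(len(inventory))  # 2
print(sorted(inventory["crm-api.internal"]))  # ['email', 'phone']
```

A system that never receives traffic through the proxy will never show up in this inventory, which is the flip side of not scanning.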

Which is best?

In all honesty, it depends on your needs, the size of your company, and the amount of data you store. I’ve broken down the pros and cons of each method below.

Survey

Pros

  • Quick fix
  • Cheap

Cons

  • Mistake-prone
  • Time-consuming for large orgs
  • Can go unfinished without stakeholders on board
  • Changes aren’t logged

Scan

Pros

  • Set it and forget it
  • Thorough

Cons

  • Time-consuming for large data sets
  • Invasive
  • May need custom integrations
  • Costly
  • Limited insights into data usage

Proxy

Pros

  • Non-invasive
  • Quick setup

Cons

  • Effective only with a large amount of traffic
  • Doesn’t check stale data

Looking for a low-budget solution in an environment with relatively little data and few systems: survey or scan.

Looking for an in-depth view into everything you have ever collected: scan.

Looking to monitor data as it flows around your systems: HTTP proxy.


Author
Vladimir
Vladimir has almost four decades of professional experience, including 12 years as a researcher for the Space Research Institute of the Russian Academy of Sciences and eight years as executive director — head of CIB Application Architecture for Russia’s national bank (Sberbank).
