Tuesday, July 15, 2008

Making fudged PII

As I start dealing more with application development that sends data to external sources I think more about the security of personally identifiable data within that data. When testing web services and file transfers there is no reason to have real, personally identifiable information (PII) in that stream of test data. However, the destination for that data still needs values in those elements. There are a few ways to approach this:

  1. Random data
  2. Pre-generated test data, test cases, etc.
  3. Use existing data but generate PII from a real primary key

The main problem I see with randomly generated data is "re-testing". For example: you send data for ID 1234567 to a service, randomly generating four columns of test data, then that service requests a re-test with the same data.

Pre-generated data would be a set of test cases, known trouble patterns and other data created before testing occurs. This scenario is feasible for new development on new data systems, to test min/max/null value scenarios that would otherwise never appear in live data, or to force a specific set of data. Where this scenario becomes cumbersome is when there is a tightly integrated system with numerous historical pre-cursor processes; triple the complexity if that system is another vendor's package and not your own. For example: create customer, purchase 14 months worth of product, run through aging and re-bill process, reconcile A/R, and then skip a month of purchases. There could be over a hundred tables touched by that set of processes and if one is skipped/missed then the data for another group is no longer valid.

Generating test data from the primary key is feasible when the system has a long history of identified test data handy but simply needs PII altered to protect the identity of the individual or organization. By using a unique, primary key the "fudged" PII data will always match back to the source primary key for instances where "re-testing" is required. For example: ID 1234567 always generates social security number 898-75-5309 (not a real number but will validate in some systems).

The example below was written for Oracle SQL to show how to convert a seven digit primary key identifier into a social security number, date of birth and gender:
SELECT t1.id  --- VARCHAR2(7)
, TO_CHAR(899-FLOOR(TO_NUMBER(t1.id)/989901),'000')
||TO_CHAR(MOD(TO_NUMBER(t1.id),9999)-9999,'S0000') social_security_number
, TO_CHAR(TRUNC(TRUNC(SYSDATE-6574.5,'YEAR')-MOD(FLOOR(TO_NUMBER(t1.id)/10),2191.5)-CASE SUBSTR(t1.id,-1,1)
WHEN '2' THEN 2191.5
WHEN '3' THEN 2191.5
WHEN '4' THEN 4383
WHEN '5' THEN 6574.5
WHEN '6' THEN 8766
WHEN '7' THEN 10957.5
WHEN '8' THEN 13149
WHEN '9' THEN 15340.5
ELSE 0 END),'MM/DD/YYYY') date_of_birth
WHEN SUBSTR(t1.id,-1,1) < '5' THEN 'Male'
ELSE 'Female' END gender
, CASE SUBSTR(t1.id,-1,1)
WHEN '0' THEN 'AI' --- American Indian/Alaskan
WHEN '1' THEN 'AS' --- Asian/Pacific Islander
WHEN '2' THEN 'BL' --- Black/Non-Hispanic
WHEN '3' THEN 'HS' --- Hispanic
WHEN '4' THEN 'NR' --- Non-Resident Alien
ELSE 'WH' END ethnic_code
FROM table_name t1 WHERE [selection criteria];
The social security number is pretty straight forward. If area numbers (first three digits) higher than 772 do not validate, then use area 267 (237-267 have all groups allocated within them).

The date of birth is a little complicated but the attempt was to use the ones digit to generate one of seven date ranges with the lower two in the range getting more hits because they are the primary age group dealt with. Starting with a base age of 18 years of age, subtract one of seven six year blocks, and then subtract zero to six years.

Gender was a simple test of the ones digit to determine male or female.

Ethnic background was a simple translation of the ones digit to a code taking into account there are more numbers in four of the six groups.

This was a very bare example meant only to suggest direction. It would be interesting to build a library (although someone probably already has).

No comments: