- Karina Datascientist's Newsletter
PII in Databases: What You Can't Afford to Get Wrong
The other day, someone on TikTok messaged me asking for "an email database of any country."
I asked: "Do you understand that email addresses are personal information?"
His response: "I don't consider email to be personal info, hence it doesn't have the person's personal data."

I nearly dropped my phone.
Today, let's talk about PII (Personally Identifiable Information) - what it is, why it matters, and how to handle it properly in your data work.
Because getting this wrong isn't just embarrassing. It can cost companies millions and end careers.
What Is PII (And Why Should You Care)?
PII stands for Personally Identifiable Information - any data that can identify a specific person.
Why it matters: Laws like GDPR (Europe), CCPA (California), and similar regulations worldwide impose massive fines for mishandling personal data. Under GDPR, fines can reach €20 million or 4% of global annual turnover, whichever is higher.
But beyond legal issues - mishandling PII destroys trust. One data breach, and customers are gone.
I am sure you’ve heard about personal data leaks. I can think of a telco, airlines and an insurance company in Australia, as well as health data from a well-known women's health app.
What Counts as PII?
This is where people get confused. Let me break it down:
Direct PII (Obviously identifies someone):
Full name
Email address (yes, it IS PII!)
Phone number
National ID numbers (passport, driving licence, social security)
Physical address
Date of birth
Credit card numbers
Bank account details
Biometric data (fingerprints, facial recognition)
Indirect PII (Can identify someone when combined):
IP address
Device ID
Location data
Browser cookies
Employment information
Educational records
Medical records (even without names). These are usually considered extra sensitive. I am sure none of us would want our medical information floating around the internet.
Here's the tricky part: Even "anonymised" data can become PII when combined with other data.
Example: "35-year-old female software engineer in Manchester who visited Hospital X on Date Y" - No name, but you could probably figure out who this is.
It actually reminds me of when companies run ‘anonymous’ surveys but make your email address a mandatory field, and you think: why is it called anonymous then?!
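Here's a minimal sketch of how you could check the re-identification risk described above. It counts how many rows share the same combination of indirect identifiers; any combination that appears only once points to a single person. The records, the column names and the `risky_combinations` helper are all made up for illustration.

```python
from collections import Counter

# Hypothetical "anonymised" records: no names, only indirect identifiers.
records = [
    {"age": 35, "gender": "F", "job": "software engineer", "city": "Manchester"},
    {"age": 35, "gender": "F", "job": "nurse", "city": "Manchester"},
    {"age": 41, "gender": "M", "job": "software engineer", "city": "Leeds"},
    {"age": 41, "gender": "M", "job": "software engineer", "city": "Leeds"},
]

QUASI_IDENTIFIERS = ("age", "gender", "job", "city")


def risky_combinations(rows, keys, k=2):
    """Return identifier combinations shared by fewer than k rows."""
    counts = Counter(tuple(row[key] for key in keys) for row in rows)
    return [combo for combo, n in counts.items() if n < k]


unique_people = risky_combinations(records, QUASI_IDENTIFIERS)
for combo in unique_people:
    print("Re-identifiable combination:", combo)
```

If this list is not empty, the dataset is not really anonymous yet. Dropping or coarsening a column (age bands instead of exact age, region instead of city) is the usual next step.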
The "Email Isn't Personal Info" Myth
Let's address this directly: Email addresses are absolutely PII.
Why?
They directly identify an individual
They can be used to contact someone
They're linked to personal accounts
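GDPR's definition of personal data covers email addresses (the European Commission gives an email address as one of its examples)
You need a lawful basis, such as consent, to store and use them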
"But everyone shares their email!" - Doesn't matter. It's still regulated personal data.
How to Handle PII in Databases
Rule 1: Use USER_ID instead of PII
First, it speeds up your analysis. A user_id is a numeric column, while full name, email and address are varchar columns. Joining tables (looking up information) is usually faster on indexed numeric columns than on long text columns.
Second, it saves you from unnecessary drama. When I was running a data consultancy, I often asked clients to provide me with data without PII. Usually PII doesn’t add analytical value (unless you are analysing suburbs, email domains and so on) and it just raises extra security issues.
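Here's a rough sketch of what Rule 1 can look like, using Python's built-in sqlite3 module. The table and column names are my own invention for the example: PII sits in one `users` table, and the table you analyse carries only the numeric `user_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- PII lives in one tightly controlled table.
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        full_name TEXT,
        email TEXT
    );
    -- Analytical tables carry only the numeric key.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        amount REAL
    );
    CREATE INDEX idx_orders_user_id ON orders(user_id);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Test User One", "one@example.com"),
     (2, "Test User Two", "two@example.com")],
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

# The analysis never touches names or emails.
spend_per_user = conn.execute(
    "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id ORDER BY user_id"
).fetchall()
print(spend_per_user)
```

With this layout you can hand over the `orders` table for analysis and keep `users` behind stricter permissions.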
Rule 2: Encrypt Sensitive Data
PII should be encrypted both:
At rest (in the database)
In transit (when moving between systems)
This means even if someone gains unauthorised access, they can't read the data.
For example, I am working in telco at the moment, so I have access to phone numbers. However, I am not allowed to even export this data. It can be used only for lookups and, in very rare cases, some types of analysis. If I need to export something, such data gets hashed first. It is a bit like credit card numbers being masked on screen as **** **** **** 1234.
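To make the hashing and masking idea concrete, here's a small sketch using only the standard library. It is not how my employer does it; the function names and the key are assumptions for the example. It uses a keyed hash (HMAC-SHA256) rather than a plain hash, because phone numbers are short enough that plain hashes can be reversed by trying every number.

```python
import hashlib
import hmac

# Assumption: in real life the key comes from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed hash: the same input always gives the same token, so joins still work."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_card(card_number: str) -> str:
    """Keep only the last four digits visible."""
    digits = [c for c in card_number if c.isdigit()]
    return "**** **** **** " + "".join(digits[-4:])


phone_token = pseudonymise("+61400000000")  # made-up phone number
masked = mask_card("4111 1111 1111 1234")   # made-up card number
print(phone_token[:16], masked)
```

The token is useless for calling anyone, but it still lets you count or join on the same customer across exports.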
Rule 3: Implement Access Controls
Not everyone needs access to PII. Set up role-based access:
Data analysts: Access to anonymised data only
Customer service: Access to specific customer records when needed
Marketing: Access to aggregate data, not individual records
DBAs: Full access but all actions logged
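In a real database you would set this up with the database's own roles and permissions. Just to show the idea, here's a toy sketch in plain Python: each role maps to the columns it may read, and every read is logged. The role names, columns and `read_record` helper are hypothetical.

```python
# Hypothetical mapping of roles to the columns they are allowed to read.
ROLE_COLUMNS = {
    "analyst": {"user_id", "plan", "monthly_spend"},
    "customer_service": {"user_id", "full_name", "email", "plan"},
    "dba": {"user_id", "full_name", "email", "plan", "monthly_spend"},
}

audit_log = []  # every access is recorded, whatever the role


def read_record(role, record):
    """Return only the columns the role may see, and log the access."""
    allowed = ROLE_COLUMNS.get(role, set())  # unknown roles see nothing
    audit_log.append((role, record["user_id"]))
    return {col: val for col, val in record.items() if col in allowed}


customer = {
    "user_id": 1,
    "full_name": "Test User One",
    "email": "one@example.com",
    "plan": "premium",
    "monthly_spend": 59.0,
}

analyst_view = read_record("analyst", customer)
print(analyst_view)
```

The default matters here: a role that is not in the mapping gets an empty result instead of everything.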
Rule 4: Know Where Your Data Lives
This is HUGE and often overlooked.
GDPR requires:
Personal data of people in the EU must stay in the EU/EEA, or go only to countries with adequate data protection or under other approved safeguards
You must know exactly where your data is physically located
Cloud providers must comply with regional requirements
Real scenario: You're using AWS. Where are your servers?
US East? That's N. Virginia (us-east-1) or Ohio (us-east-2)
Europe? That could be Ireland (eu-west-1) or Frankfurt (eu-central-1)
If you're storing EU customer data on US servers without proper safeguards, you're violating GDPR.
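One way to stay on top of this is to keep an inventory of datasets and check it automatically. Here's a tiny sketch of that idea. The region codes are real AWS region names, but the allow-list and the dataset inventory are assumptions I made up; your own list should come from your legal or compliance team.

```python
# Assumption: an illustrative allow-list, not a statement of what the law permits.
ALLOWED_REGIONS = {
    "EU": {"eu-west-1", "eu-central-1"},
    "US": {"us-east-1", "us-east-2"},
}


def residency_violations(datasets):
    """Return names of datasets stored outside the regions allowed for their customers."""
    return [
        d["name"]
        for d in datasets
        if d["storage_region"] not in ALLOWED_REGIONS.get(d["customer_region"], set())
    ]


# Hypothetical inventory of where each dataset is stored.
datasets = [
    {"name": "eu_customers", "customer_region": "EU", "storage_region": "us-east-1"},
    {"name": "us_customers", "customer_region": "US", "storage_region": "us-east-1"},
]

violations = residency_violations(datasets)
print(violations)
```

A customer region that is missing from the allow-list is treated as a violation, so new markets have to be added on purpose.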
I worked for a Mental Health app, which dealt with very sensitive health records. We had to store data in different countries, as we had customers worldwide - US, UK, Australia.
What About "Public" Information?
"But I found these emails on LinkedIn/company websites - they're public!"
Still PII. Still regulated. Just because information is publicly available doesn't mean you can:
Scrape it without consent
Store it indefinitely
Use it for purposes the person didn't agree to
The "Right to be Forgotten"
Under GDPR, people can request:
To see what data you have about them
To delete their data
To correct inaccurate data
To move their data to another service
Your database structure needs to support this. Can you:
Find all records for a specific user?
Delete them completely (including backups)?
Export their data in readable format?
If not, you've got a problem.
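Here's a minimal sketch of what "find, export, delete" can look like when every table uses the same `user_id` key. It uses sqlite3 and made-up tables, and it does not cover backups or copies in other systems, which need their own process.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'one@example.com'), (2, 'two@example.com');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 15.0);
""")

# Every table that holds per-user rows. Keeping this list complete is the hard part.
USER_TABLES = ("users", "orders")


def export_user(conn, user_id):
    """Collect all rows for one user and return them as JSON."""
    data = {}
    for table in USER_TABLES:  # fixed table names, never user input
        cur = conn.execute(f"SELECT * FROM {table} WHERE user_id = ?", (user_id,))
        cols = [c[0] for c in cur.description]
        data[table] = [dict(zip(cols, row)) for row in cur.fetchall()]
    return json.dumps(data, indent=2)


def delete_user(conn, user_id):
    """Delete the user's rows from every table and return how many were removed."""
    removed = 0
    for table in USER_TABLES:
        cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
        removed += cur.rowcount
    conn.commit()
    return removed


exported = export_user(conn, 1)
removed = delete_user(conn, 1)
print(exported)
print("Rows removed:", removed)
```

This only works because of Rule 1: if some tables identified the customer by email or name instead of `user_id`, you would have to chase them down one by one.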
What Happens When You Get It Wrong
British Airways (2020): £20 million fine for data breach
Marriott (2020): £18.4 million fine for exposing customer data
Google (2021): €90 million fine for cookie violations
These aren't small companies with bad security. These are major corporations who still got it wrong.
As a data analyst, you're often the first line of defence. You're the one writing queries, exporting data, building dashboards.
One careless export. One unencrypted file. One database with poor access controls.
That's all it takes.
The Bottom Line
PII isn't just a legal concept. It's about respecting people's privacy and protecting their data.
As data analysts, we have access to incredibly sensitive information. With that access comes responsibility.
Getting this right isn't optional. It's not bureaucracy. It's professional responsibility.
And honestly? If someone asks you for "an email database of any country" - run.
Keep pushing 💪
Karina
P.S. If you're building portfolio projects, use synthetic or properly anonymised data. Never scrape or use real personal information without explicit consent. It's not worth the risk, and using real personal data shows potential employers you don't understand data ethics.
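If you're wondering how to get synthetic data, here's a minimal sketch with nothing but the standard library. The names, columns and `fake_customer` helper are made up, and the emails use example.com, a domain reserved for documentation, so they can never belong to a real person.

```python
import random

random.seed(42)  # same fake data on every run, handy for a portfolio project

FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Casey"]
LAST_NAMES = ["Smith", "Lee", "Patel", "Garcia", "Brown"]


def fake_customer(user_id):
    """Build one made-up customer record."""
    first = random.choice(FIRST_NAMES)
    last = random.choice(LAST_NAMES)
    return {
        "user_id": user_id,
        "full_name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}{user_id}@example.com",
        "monthly_spend": round(random.uniform(10, 200), 2),
    }


customers = [fake_customer(i) for i in range(1, 6)]
for customer in customers:
    print(customer)
```

For bigger projects, libraries such as Faker generate more varied fake data, but the principle is the same.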
Python tip
Stop losing track of variables in Jupyter notebooks
Use %who and %whos to see what's in your namespace. Perfect for debugging messy notebooks.

Grab your freebies if you haven’t already:
Data Playbook (CV template, Books on Data Analytics and Data Science, Examples of portfolio projects)
Need more help?
Just starting with Python? Wondering if programming is for you?
Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.
Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.
Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.
