This week I made a LinkedIn post about portfolio projects, suggesting the use of less common datasets.
Someone replied: "But how do I even get data? If you scrape it - you will get sued"
Scraping can get you into legal trouble. Real company data is confidential. And public datasets? Hiring managers have seen the same Iris/Airbnb/NYC bicycle analysis a thousand times.
So what do you do?
You generate your own data.
Synthetic data is artificially created data that looks and behaves like real data — same structure, same patterns — but none of the legal headaches. You can make it fit exactly the business problem you want to solve.
Let’s talk about how to generate your own data. But first, a small advertisement:
Every headline satisfies an opinion. Except ours.
Remember when the news was about what happened, not how to feel about it? 1440's Daily Digest is bringing that back. Every morning, they sift through 100+ sources to deliver a concise, unbiased briefing — no pundits, no paywalls, no politics. Just the facts, all in five minutes. For free.
So how do you generate your data?
Option 1: Python Libraries
There are three libraries worth knowing.
Faker is the most popular one. It generates realistic fake data — names, emails, addresses, dates, phone numbers. Multiple languages and locales too (for example, you can generate Arabic names or UAE addresses if you need to).
Great for building customer databases, transaction logs, user profiles.
# Install first:
# pip install faker pandas
import pandas as pd
import random
from faker import Faker
fake = Faker()
# -------------------------------------------------
# 1. Generate synthetic customers
# -------------------------------------------------
people = []
for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)
    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])
    people.append({
        "customer_id": i + 1,
        "name": fake.name(),
        "email": fake.email(),
        "gender": random.choice(["Male", "Female"]),
        "city": fake.city(),
        "country": fake.country(),
        "company": fake.company(),
        "job_title": fake.job(),
        "age": age,
        "signup_date": fake.date_between(start_date="-6y", end_date="today"),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": random.randint(20_000, 150_000),
        "churn": churn
    })
# -------------------------------------------------
# 2. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)
# -------------------------------------------------
# 3. Save dataset
# -------------------------------------------------
df.to_csv("customers_faker.csv", index=False)
Simple. Fast. Does the job.
Mimesis is basically Faker's faster cousin. Same idea, but significantly better performance on large datasets. If you're generating hundreds of thousands of rows, switch to Mimesis.
# Install first:
# pip install mimesis pandas
import pandas as pd
import random
from mimesis import Person, Address, Datetime, Finance
from mimesis.locales import Locale
# -------------------------------------------------
# 1. Create generators
# -------------------------------------------------
person = Person(Locale.EN)
address = Address(Locale.EN)
dt = Datetime()
finance = Finance()
# -------------------------------------------------
# 2. Generate synthetic customers
# -------------------------------------------------
people = []
for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)
    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])
    people.append({
        "customer_id": i + 1,
        "name": person.full_name(),
        "email": person.email(),
        "gender": person.gender(),
        "city": address.city(),
        "country": address.country(),
        "age": age,
        "signup_date": dt.date(start=2018, end=2024),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": finance.price(minimum=20_000, maximum=150_000),
        "churn": churn
    })
# -------------------------------------------------
# 3. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)
# -------------------------------------------------
# 4. Save dataset
# -------------------------------------------------
df.to_csv("customers_mimesis.csv", index=False)
SDV (Synthetic Data Vault) is where it gets really interesting. Unlike Faker and Mimesis, which just make up random realistic-looking data, SDV actually learns from a real dataset and generates synthetic data that preserves the statistical properties of the original. Same distributions. Same correlations. Same relationships between columns.
This is a game changer if you have access to a small sample of real data but can't share it publicly.
# Install first:
# pip install sdv pandas
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
# -------------------------------------------------
# 1. Load your real dataset
# -------------------------------------------------
real_data = pd.read_csv("customers.csv")
# -------------------------------------------------
# 2. Detect metadata (SDV figures out column types)
# -------------------------------------------------
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
# If you have an ID column you can specify it (optional)
# metadata.set_primary_key("customer_id")
# -------------------------------------------------
# 3. Train the synthesiser
# -------------------------------------------------
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# -------------------------------------------------
# 4. Generate synthetic rows
# -------------------------------------------------
synthetic_data = synthesizer.sample(num_rows=5000)
# -------------------------------------------------
# 5. Save to a new file
# -------------------------------------------------
synthetic_data.to_csv("customers_synthetic.csv", index=False)
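Before using the synthetic output, it's worth sanity-checking that it really does track the original statistics. A minimal sketch using only the standard library (the toy lists here are made-up stand-ins for a numeric column from your real and synthetic tables):

```python
import statistics

def compare_columns(real, synthetic):
    """Absolute difference in mean and standard deviation between
    real and synthetic versions of a numeric column."""
    return {
        "mean_diff": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_diff": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }

# Toy stand-ins for real_data["monthly_spend"] and the synthetic column
real_spend = [20.0, 35.5, 50.0, 65.0, 80.0, 120.0]
synthetic_spend = [22.0, 33.0, 52.5, 60.0, 85.0, 115.0]

diffs = compare_columns(real_spend, synthetic_spend)
print(diffs)  # small diffs = the synthetic column tracks the real one
```

Small differences suggest the synthesizer did its job; large ones mean you should retrain or adjust the metadata. SDV also ships its own quality reports, so this manual check is just a quick first look.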
Option 2: Just Ask Claude or ChatGPT
You don't always need to write code. Sometimes the fastest way to get synthetic data is to just ask an AI to generate it.
The key is being specific in your prompt. Vague prompts give you generic data. Specific prompts give you something actually useful.
Here's the formula I use:
What's the business context?
What columns do you need and what do they represent?
What format? (CSV, JSON)
What patterns should the data follow?
That last one is important. If you're building a churn model, you want churned customers to actually look different from retained ones — otherwise your model has nothing to learn from.
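The same principle applies if you generate data in code rather than via a prompt. Here is a minimal sketch (plain Python, with hypothetical thresholds I made up for illustration) of how to bake a churn signal into generated rows instead of flipping a fair coin, so low-tenure, high-ticket customers actually churn more often:

```python
import random

random.seed(42)

def churn_probability(tenure_months, support_tickets):
    """Hypothetical business rule: short tenure and many tickets
    push churn probability up; long tenure pulls it down."""
    p = 0.10                       # baseline churn rate
    if tenure_months < 12:
        p += 0.25
    if support_tickets >= 4:
        p += 0.20
    if tenure_months > 36:
        p -= 0.08
    return min(max(p, 0.02), 0.9)  # keep it a valid, noisy probability

rows = []
for i in range(10_000):
    tenure = random.randint(1, 60)
    tickets = random.choices(range(11), weights=[30, 25, 15, 10, 6, 4, 3, 3, 2, 1, 1])[0]
    churned = int(random.random() < churn_probability(tenure, tickets))
    rows.append({"tenure_months": tenure, "num_support_tickets": tickets, "churned": churned})

# Sanity check: new customers should churn more than long-tenured ones
new = [r["churned"] for r in rows if r["tenure_months"] < 12]
old = [r["churned"] for r in rows if r["tenure_months"] > 36]
print(sum(new) / len(new), sum(old) / len(old))
```

Because churn is drawn from a probability rather than hard-coded, the data still contains exceptions (the noise a model needs), but the overall pattern is learnable.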
If you are a beginner and don’t know the business rules, I’d recommend starting with a prompt like this:
Generate a realistic telecom churn dataset for a data analytics project.
First:
• Suggest typical customer features used in churn analysis
• Briefly explain what usually influences churn
Then:
• Generate a synthetic CSV with ~20000 rows
• Include realistic relationships but also noise
• Ensure churn rate ~8%
• Make the dataset suitable for exploratory analysis and a simple ML model
Return CSV file.
If you do know the business rules, your prompt can get more sophisticated:
Generate a realistic synthetic telecom churn dataset as a CSV.
Dataset specs:
50 customers
Each row = one customer snapshot
Use random seed 42 for reproducibility
Columns:
| Column | Type | Range/Notes |
|---|---|---|
| customer_id | integer | Unique, 1001–1050 |
| tenure_months | integer | 1–60 |
| monthly_spend_usd | float | Log-normal, mean ~65, range 20–150, right-skewed |
| data_usage_gb | float | 1–50, correlated with spend at ~0.6 |
| num_support_tickets | integer | 0–10, right-skewed (most customers = 0–2) |
| churned | integer | 0 or 1 |
Business logic — churn probability increases when:
tenure_months < 12
num_support_tickets >= 4
spend_to_usage_ratio is high (monthly_spend_usd / data_usage_gb) — calculate this internally, do not include in output
Combination of low tenure + high tickets = strong churn signal
Churn suppressors:
tenure_months > 36 → churn probability drops significantly
monthly_spend_usd > 100 AND data_usage_gb > 25 → rarely churn
Noise requirements:
Include ~5–8 exceptions (e.g. long-tenure customer who churns, low-ticket customer who churns anyway)
Correlations should look believable, not deterministic or perfectly separable
Constraints:
Target churn rate: 25–35% (verify this before returning)
Avoid unrealistic combos (e.g. tenure = 1 with extremely high spend and zero tickets and no churn)
All values must be realistic for a consumer telecom segment
Output instructions:
Return raw CSV content only
No markdown, no code fences, no explanation
First line must be the header row
Do not include spend_to_usage_ratio in the output
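Whichever prompt you use, verify the file the model returns before building on it: LLMs sometimes miss their own constraints. A minimal check using only the standard library (the sample CSV below is a made-up stand-in for the model's raw output):

```python
import csv
import io

def validate_churn_csv(text, min_rate=0.25, max_rate=0.35):
    """Parse CSV text and check the churn rate and basic value ranges."""
    rows = list(csv.DictReader(io.StringIO(text)))
    churn_rate = sum(int(r["churned"]) for r in rows) / len(rows)
    tenures_ok = all(1 <= int(r["tenure_months"]) <= 60 for r in rows)
    return {
        "n_rows": len(rows),
        "churn_rate": churn_rate,
        "rate_ok": min_rate <= churn_rate <= max_rate,
        "tenures_ok": tenures_ok,
    }

# Made-up sample standing in for the model's returned CSV
sample = """customer_id,tenure_months,monthly_spend_usd,num_support_tickets,churned
1001,3,45.2,5,1
1002,48,110.0,0,0
1003,7,30.5,2,1
1004,24,65.0,1,0
1005,40,95.5,1,0
1006,12,55.0,6,0
1007,55,130.0,0,0
1008,30,70.0,2,0
1009,2,25.0,4,1
1010,36,80.0,1,0
"""

report = validate_churn_csv(sample)
print(report)
```

If the churn rate or ranges are off, just tell the model what failed and ask it to regenerate.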
Try these options and see which one you like the most.
Keep pushing 💪,
Karina
Need more help?
Just starting with Python? Wondering if programming is for you?
Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.
Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.
Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

Data Analyst & Data Scientist

