This week I made a LinkedIn post about portfolio projects, suggesting you use less common datasets.

Someone replied: "But how do I even get data? If you scrape it - you will get sued"

Scraping can get you into legal trouble. Real company data is confidential. And public datasets? Hiring managers have seen the same Iris/Airbnb/NY bicycles analysis a thousand times.

So what do you do?

You generate your own data.

Synthetic data is artificially created data that looks and behaves like real data — same structure, same patterns — but none of the legal headaches. You can make it fit exactly the business problem you want to solve.

Let’s talk about how to generate your own data. But before that, a small advertisement:

Every headline satisfies an opinion. Except ours.

Remember when the news was about what happened, not how to feel about it? 1440's Daily Digest is bringing that back. Every morning, they sift through 100+ sources to deliver a concise, unbiased briefing — no pundits, no paywalls, no politics. Just the facts, all in five minutes. For free.

So how do you generate your data?

Option 1: Python Libraries

There are three libraries worth knowing.

Faker is the most popular one. It generates realistic fake data — names, emails, addresses, dates, phone numbers. Multiple languages and locales too (for example, you can generate Arabic names or UAE addresses if you need to).
Great for building customer databases, transaction logs, user profiles.

# Install first:
# pip install faker pandas

import pandas as pd
import random
from faker import Faker

fake = Faker()

# -------------------------------------------------
# 1. Generate synthetic customers
# -------------------------------------------------
people = []

for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)

    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])

    people.append({
        "customer_id": i + 1,
        "name": fake.name(),
        "email": fake.email(),
        "gender": random.choice(["Male", "Female"]),
        "city": fake.city(),
        "country": fake.country(),
        "company": fake.company(),
        "job_title": fake.job(),
        "age": age,
        "signup_date": fake.date_between(start_date="-6y", end_date="today"),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": random.randint(20_000, 150_000),
        "churn": churn
    })

# -------------------------------------------------
# 2. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)

# -------------------------------------------------
# 3. Save dataset
# -------------------------------------------------
df.to_csv("customers_faker.csv", index=False)

Simple. Fast. Does the job.
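One caveat with the snippet above: `churn` is drawn independently of every other column, so a model trained on it has nothing to learn. A minimal sketch of injecting a toy churn rule instead (the thresholds and probabilities here are made up purely for illustration):

```python
import random

random.seed(42)  # reproducibility

def churn_probability(tenure_months, monthly_spend):
    # Toy rule (made-up thresholds): short tenure and low spend raise risk
    p = 0.10
    if tenure_months < 12:
        p += 0.25
    if monthly_spend < 40:
        p += 0.15
    return p

rows = []
for _ in range(10_000):
    tenure = random.randint(1, 120)
    spend = round(random.uniform(10, 200), 2)
    churn = 1 if random.random() < churn_probability(tenure, spend) else 0
    rows.append({"tenure_months": tenure, "monthly_spend": spend, "churn": churn})

# Churned customers now skew toward shorter tenure, so a model has a signal
churned = [r for r in rows if r["churn"] == 1]
retained = [r for r in rows if r["churn"] == 0]
avg_tenure_churned = sum(r["tenure_months"] for r in churned) / len(churned)
avg_tenure_retained = sum(r["tenure_months"] for r in retained) / len(retained)
```

Swap `random.choice([0, 1])` for something like this in the Faker loop and your synthetic churn dataset stops being pure noise.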

Mimesis is basically Faker's faster cousin. Same idea, but significantly better performance on large datasets. If you're generating hundreds of thousands of rows, switch to Mimesis.

# Install first:
# pip install mimesis pandas

import pandas as pd
import random
from mimesis import Person, Address, Datetime, Finance
from mimesis.locales import Locale

# -------------------------------------------------
# 1. Create generators
# -------------------------------------------------
person = Person(Locale.EN)
address = Address(Locale.EN)
dt = Datetime()
finance = Finance()

# -------------------------------------------------
# 2. Generate synthetic customers
# -------------------------------------------------
people = []

for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)

    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])

    people.append({
        "customer_id": i + 1,
        "name": person.full_name(),
        "email": person.email(),
        "gender": person.gender(),
        "city": address.city(),
        "country": address.country(),
        "age": age,
        "signup_date": dt.date(start=2018, end=2024),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": finance.price(minimum=20_000, maximum=150_000),
        "churn": churn
    })

# -------------------------------------------------
# 3. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)

# -------------------------------------------------
# 4. Save dataset
# -------------------------------------------------
df.to_csv("customers_mimesis.csv", index=False)

SDV (Synthetic Data Vault) is where it gets really interesting. Unlike Faker and Mimesis, which just make up random realistic-looking data, SDV actually learns from a real dataset and generates synthetic data that preserves the statistical properties of the original. Same distributions. Same correlations. Same relationships between columns.

This is a game changer if you have access to a small sample of real data but can't share it publicly.

# Install first:
# pip install sdv pandas

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# -------------------------------------------------
# 1. Load your real dataset
# -------------------------------------------------
real_data = pd.read_csv("customers.csv")


# -------------------------------------------------
# 2. Detect metadata (SDV figures out column types)
# -------------------------------------------------
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)

# If you have an ID column you can specify it (optional)
# metadata.set_primary_key("customer_id")

# -------------------------------------------------
# 3. Train the synthesiser
# -------------------------------------------------
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# -------------------------------------------------
# 4. Generate synthetic rows
# -------------------------------------------------
synthetic_data = synthesizer.sample(num_rows=5000)

# -------------------------------------------------
# 5. Save to a new file
# -------------------------------------------------
synthetic_data.to_csv("customers_synthetic.csv", index=False)
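After sampling, it's worth checking how close the synthetic data actually is to the original. SDV has built-in evaluation tools, but a quick pandas comparison of means and correlations already tells you a lot. The two DataFrames below are stand-ins so the sketch runs on its own; in practice you'd load customers.csv and customers_synthetic.csv and keep the numeric columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in frames; in practice: real = pd.read_csv("customers.csv"),
# synth = pd.read_csv("customers_synthetic.csv") (numeric columns only)
real = pd.DataFrame({"age": rng.integers(18, 76, 1_000),
                     "monthly_spend": rng.uniform(10, 200, 1_000)})
synth = pd.DataFrame({"age": rng.integers(18, 76, 1_000),
                      "monthly_spend": rng.uniform(10, 200, 1_000)})

# Per-column means side by side
summary = pd.DataFrame({"real_mean": real.mean(), "synthetic_mean": synth.mean()})
summary["abs_diff"] = (summary["real_mean"] - summary["synthetic_mean"]).abs()

# Largest gap between the two correlation matrices (0 = identical structure)
corr_gap = (real.corr() - synth.corr()).abs().max().max()
```

If the means line up and `corr_gap` is small, the synthesiser kept the structure you care about.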

Option 2: Just Ask Claude or ChatGPT

You don't always need to write code. Sometimes the fastest way to get synthetic data is to just ask an AI to generate it.

The key is being specific in your prompt. Vague prompts give you generic data. Specific prompts give you something actually useful.

Here's the formula I use:

  • What's the business context?

  • What columns do you need and what do they represent?

  • What format? (CSV, JSON)

  • What patterns should the data follow?

That last one is important. If you're building a churn model, you want churned customers to actually look different from retained ones — otherwise your model has nothing to learn from.

If you're a beginner and don't know the business rules, I'd recommend starting with a prompt like this:

Generate a realistic telecom churn dataset for a data analytics project.

First:

• Suggest typical customer features used in churn analysis
• Briefly explain what usually influences churn

Then:

• Generate a synthetic CSV with ~20000 rows
• Include realistic relationships but also noise
• Ensure churn rate ~8%
• Make the dataset suitable for exploratory analysis and a simple ML model

Return CSV file.

If you do know the business rules, your prompt can get more sophisticated:

Generate a realistic synthetic telecom churn dataset as a CSV.

Dataset specs:

  • 50 customers

  • Each row = one customer snapshot

  • Use random seed 42 for reproducibility

Columns:

  • customer_id (integer): unique, 1001–1050

  • tenure_months (integer): 1–60

  • monthly_spend_usd (float): log-normal, mean ~65, range 20–150, right-skewed

  • data_usage_gb (float): 1–50, correlated with spend at ~0.6

  • num_support_tickets (integer): 0–10, right-skewed (most customers = 0–2)

  • churned (integer): 0 or 1

Business logic — churn probability increases when:

  • tenure_months < 12

  • num_support_tickets >= 4

  • spend_to_usage_ratio is high (monthly_spend_usd / data_usage_gb) — calculate this internally, do not include in output

  • Combination of low tenure + high tickets = strong churn signal

Churn suppressors:

  • tenure_months > 36 → churn probability drops significantly

  • monthly_spend_usd > 100 AND data_usage_gb > 25 → rarely churn

Noise requirements:

  • Include ~5–8 exceptions (e.g. long-tenure customer who churns, low-ticket customer who churns anyway)

  • Correlations should look believable, not deterministic or perfectly separable

Constraints:

  • Target churn rate: 25–35% (verify this before returning)

  • Avoid unrealistic combos (e.g. tenure = 1 with extremely high spend and zero tickets and no churn)

  • All values must be realistic for a consumer telecom segment

Output instructions:

  • Return raw CSV content only

  • No markdown, no code fences, no explanation

  • First line must be the header row

  • Do not include spend_to_usage_ratio in the output
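Whichever prompt you use, don't trust the output blindly. A quick sanity check on the returned CSV catches most violations; the inline string below is just a stand-in so the sketch runs, with column names taken from the prompt:

```python
import io
import pandas as pd

# Stand-in for the CSV the model returns (column names from the prompt spec)
csv_text = """customer_id,tenure_months,monthly_spend_usd,data_usage_gb,num_support_tickets,churned
1001,5,45.0,8.2,5,1
1002,48,110.0,30.1,0,0
1003,2,30.5,4.0,6,1
1004,36,70.0,20.0,1,0
"""

df = pd.read_csv(io.StringIO(csv_text))

churn_rate = df["churned"].mean()          # prompt asks for 0.25-0.35 on the real output
ids_unique = df["customer_id"].is_unique   # prompt asks for unique IDs
tenure_ok = df["tenure_months"].between(1, 60).all()
```

LLMs are good at following the shape of a spec and less reliable at exact rates, so verify the churn rate yourself before building anything on top of the file.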

Try these options and see which one you like the most.

Keep pushing 💪,

Karina

Need more help?

Just starting with Python? Wondering if programming is for you?

Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.

Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.

Grab your Pandas CheatSheet here. Everything you need to know about Pandas, from file operations to visualisations, in one place.

More from me: YouTube | TikTok | Instagram | Threads | LinkedIn

Data Analyst & Data Scientist
