This week I made a LinkedIn post about portfolio projects, suggesting the use of less common datasets.
Someone replied: "But how do I even get data? If you scrape it - you will get sued"
Scraping can get you into legal trouble. Real company data is confidential. And public datasets? Hiring managers have seen the same Iris/Airbnb/NYC bicycle analysis a thousand times.
So what do you do?
You generate your own data.
Synthetic data is artificially created data that looks and behaves like real data — same structure, same patterns — but none of the legal headaches. You can make it fit exactly the business problem you want to solve.
Let’s talk about how to generate your own data. But first, a small advertisement:
Every headline satisfies an opinion. Except ours.
Remember when the news was about what happened, not how to feel about it? 1440's Daily Digest is bringing that back. Every morning, they sift through 100+ sources to deliver a concise, unbiased briefing — no pundits, no paywalls, no politics. Just the facts, all in five minutes. For free.
So how do you generate your data?
Option 1: Python Libraries
There are three libraries worth knowing.
Faker is the most popular one. It generates realistic fake data — names, emails, addresses, dates, phone numbers. Multiple languages and locales too (for example, you can generate Arabic names or UAE addresses if you need to).
Great for building customer databases, transaction logs, user profiles.
# Install first:
# pip install faker pandas
import pandas as pd
import random
from faker import Faker
fake = Faker()
# -------------------------------------------------
# 1. Generate synthetic customers
# -------------------------------------------------
people = []
for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)
    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])
    people.append({
        "customer_id": i + 1,
        "name": fake.name(),
        "email": fake.email(),
        "gender": random.choice(["Male", "Female"]),
        "city": fake.city(),
        "country": fake.country(),
        "company": fake.company(),
        "job_title": fake.job(),
        "age": age,
        "signup_date": fake.date_between(start_date="-6y", end_date="today"),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": random.randint(20_000, 150_000),
        "churn": churn
    })
# -------------------------------------------------
# 2. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)
# -------------------------------------------------
# 3. Save dataset
# -------------------------------------------------
df.to_csv("customers_faker.csv", index=False)
Simple. Fast. Does the job.
Mimesis is basically Faker's faster cousin. Same idea, but significantly better performance on large datasets. If you're generating hundreds of thousands of rows, switch to Mimesis.
# Install first:
# pip install mimesis pandas
import pandas as pd
import random
from mimesis import Person, Address, Datetime, Finance
from mimesis.locales import Locale
# -------------------------------------------------
# 1. Create generators
# -------------------------------------------------
person = Person(Locale.EN)
address = Address(Locale.EN)
dt = Datetime()
finance = Finance()
# -------------------------------------------------
# 2. Generate synthetic customers
# -------------------------------------------------
people = []
for i in range(100_000):
    age = random.randint(18, 75)
    tenure = random.randint(1, 120)
    monthly_spend = round(random.uniform(10, 200), 2)
    churn = random.choice([0, 1])
    people.append({
        "customer_id": i + 1,
        "name": person.full_name(),
        "email": person.email(),
        "gender": person.gender(),
        "city": address.city(),
        "country": address.country(),
        "age": age,
        "signup_date": dt.date(start=2018, end=2024),
        "tenure_months": tenure,
        "monthly_spend": monthly_spend,
        "annual_income": finance.price(minimum=20_000, maximum=150_000),
        "churn": churn
    })
# -------------------------------------------------
# 3. Convert to DataFrame
# -------------------------------------------------
df = pd.DataFrame(people)
# -------------------------------------------------
# 4. Save dataset
# -------------------------------------------------
df.to_csv("customers_mimesis.csv", index=False)
SDV (Synthetic Data Vault) is where it gets really interesting. Unlike Faker and Mimesis, which just make up random realistic-looking data, SDV actually learns from a real dataset and generates synthetic data that preserves the statistical properties of the original. Same distributions. Same correlations. Same relationships between columns.
This is a game changer if you have access to a small sample of real data but can't share it publicly.
# Install first:
# pip install sdv pandas
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer
# -------------------------------------------------
# 1. Load your real dataset
# -------------------------------------------------
real_data = pd.read_csv("customers.csv")
# -------------------------------------------------
# 2. Detect metadata (SDV figures out column types)
# -------------------------------------------------
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
# If you have an ID column you can specify it (optional)
# metadata.set_primary_key("customer_id")
# -------------------------------------------------
# 3. Train the synthesiser
# -------------------------------------------------
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# -------------------------------------------------
# 4. Generate synthetic rows
# -------------------------------------------------
synthetic_data = synthesizer.sample(num_rows=5000)
# -------------------------------------------------
# 5. Save to a new file
# -------------------------------------------------
synthetic_data.to_csv("customers_synthetic.csv", index=False)
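Before using the synthetic output, it's worth sanity-checking that it really does track the original statistics. A minimal sketch using only the standard library (the toy lists here are made-up stand-ins for a numeric column from your real and synthetic tables):

```python
import statistics

def compare_columns(real, synthetic):
    """Absolute difference in mean and standard deviation between
    real and synthetic versions of a numeric column."""
    return {
        "mean_diff": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_diff": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }

# Toy stand-ins for real_data["monthly_spend"] and the synthetic column
real_spend = [20.0, 35.5, 50.0, 65.0, 80.0, 120.0]
synthetic_spend = [22.0, 33.0, 52.5, 60.0, 85.0, 115.0]

diffs = compare_columns(real_spend, synthetic_spend)
print(diffs)  # small diffs = the synthetic column tracks the real one
```

Small differences suggest the synthesizer did its job; large ones mean you should retrain or adjust the metadata. SDV also ships its own quality reports, so this manual check is just a quick first look.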
Option 2: Just Ask Claude or ChatGPT
You don't always need to write code. Sometimes the fastest way to get synthetic data is to just ask an AI to generate it.
The key is being specific in your prompt. Vague prompts give you generic data. Specific prompts give you something actually useful.
Here's the formula I use:
What's the business context?
What columns do you need and what do they represent?
What format? (CSV, JSON)
What patterns should the data follow?
That last one is important. If you're building a churn model, you want churned customers to actually look different from retained ones — otherwise your model has nothing to learn from.
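The same principle applies if you generate data in code rather than via a prompt. Here is a minimal sketch (plain Python, with hypothetical thresholds I made up for illustration) of how to bake a churn signal into generated rows instead of flipping a fair coin, so low-tenure, high-ticket customers actually churn more often:

```python
import random

random.seed(42)

def churn_probability(tenure_months, support_tickets):
    """Hypothetical business rule: short tenure and many tickets
    push churn probability up; long tenure pulls it down."""
    p = 0.10                       # baseline churn rate
    if tenure_months < 12:
        p += 0.25
    if support_tickets >= 4:
        p += 0.20
    if tenure_months > 36:
        p -= 0.08
    return min(max(p, 0.02), 0.9)  # keep it a valid, noisy probability

rows = []
for i in range(10_000):
    tenure = random.randint(1, 60)
    tickets = random.choices(range(11), weights=[30, 25, 15, 10, 6, 4, 3, 3, 2, 1, 1])[0]
    churned = int(random.random() < churn_probability(tenure, tickets))
    rows.append({"tenure_months": tenure, "num_support_tickets": tickets, "churned": churned})

# Sanity check: new customers should churn more than long-tenured ones
new = [r["churned"] for r in rows if r["tenure_months"] < 12]
old = [r["churned"] for r in rows if r["tenure_months"] > 36]
print(sum(new) / len(new), sum(old) / len(old))
```

Because churn is drawn from a probability rather than hard-coded, the data still contains exceptions (the noise a model needs), but the overall pattern is learnable.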
If you are a beginner and don’t know the business rules, I’d recommend starting with a prompt like this:
Generate a realistic telecom churn dataset for a data analytics project.
First:
• Suggest typical customer features used in churn analysis
• Briefly explain what usually influences churn
Then:
• Generate a synthetic CSV with ~20000 rows
• Include realistic relationships but also noise
• Ensure churn rate ~8%
• Make the dataset suitable for exploratory analysis and a simple ML model
Return CSV file.
If you do know the business rules, your prompt can get more sophisticated:
Generate a realistic synthetic telecom churn dataset as a CSV.
Dataset specs:
50 customers
Each row = one customer snapshot
Use random seed 42 for reproducibility
Columns:
| Column | Type | Range/Notes |
|---|---|---|
| customer_id | integer | Unique, 1001–1050 |
| tenure_months | integer | 1–60 |
| monthly_spend_usd | float | Log-normal, mean ~65, range 20–150, right-skewed |
| data_usage_gb | float | 1–50, correlated with spend at ~0.6 |
| num_support_tickets | integer | 0–10, right-skewed (most customers = 0–2) |
| churned | integer | 0 or 1 |
Business logic — churn probability increases when:
tenure_months < 12
num_support_tickets >= 4
spend_to_usage_ratio is high (monthly_spend_usd / data_usage_gb) — calculate this internally, do not include in output
Combination of low tenure + high tickets = strong churn signal
Churn suppressors:
tenure_months > 36 → churn probability drops significantly
monthly_spend_usd > 100 AND data_usage_gb > 25 → rarely churn
Noise requirements:
Include ~5–8 exceptions (e.g. long-tenure customer who churns, low-ticket customer who churns anyway)
Correlations should look believable, not deterministic or perfectly separable
Constraints:
Target churn rate: 25–35% (verify this before returning)
Avoid unrealistic combos (e.g. tenure = 1 with extremely high spend and zero tickets and no churn)
All values must be realistic for a consumer telecom segment
Output instructions:
Return raw CSV content only
No markdown, no code fences, no explanation
First line must be the header row
Do not include spend_to_usage_ratio in the output
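Whichever prompt you use, verify the file the model returns before building on it: LLMs sometimes miss their own constraints. A minimal check using only the standard library (the sample CSV below is a made-up stand-in for the model's raw output):

```python
import csv
import io

def validate_churn_csv(text, min_rate=0.25, max_rate=0.35):
    """Parse CSV text and check the churn rate and basic value ranges."""
    rows = list(csv.DictReader(io.StringIO(text)))
    churn_rate = sum(int(r["churned"]) for r in rows) / len(rows)
    tenures_ok = all(1 <= int(r["tenure_months"]) <= 60 for r in rows)
    return {
        "n_rows": len(rows),
        "churn_rate": churn_rate,
        "rate_ok": min_rate <= churn_rate <= max_rate,
        "tenures_ok": tenures_ok,
    }

# Made-up sample standing in for the model's returned CSV
sample = """customer_id,tenure_months,monthly_spend_usd,num_support_tickets,churned
1001,3,45.2,5,1
1002,48,110.0,0,0
1003,7,30.5,2,1
1004,24,65.0,1,0
1005,40,95.5,1,0
1006,12,55.0,6,0
1007,55,130.0,0,0
1008,30,70.0,2,0
1009,2,25.0,4,1
1010,36,80.0,1,0
"""

report = validate_churn_csv(sample)
print(report)
```

If the churn rate or ranges are off, just tell the model what failed and ask it to regenerate.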
Try these options and see which one you like the most.
Keep pushing 💪,
Karina
Need more help?
Just starting with Python? Wondering if programming is for you?
Master key data analysis tasks like cleaning, filtering, pivoting and grouping data using Pandas, and learn how to present your insights visually with Matplotlib in the ‘Data Analysis with Python’ masterclass.
Building your portfolio?
Grab the Complete EDA Portfolio Project — a full e-commerce analysis (ShopTrend 2024) with Python notebook, realistic dataset, portfolio template, and step-by-step workflow. See exactly how to structure professional portfolio projects.
Grab your Pandas CheatSheet here. Everything you need to know about Pandas - from file operations to visualisations in one place.

Data Analyst & Data Scientist

