Fabricator: Declarative Synthetic Data Generation
Why Use Fabricator?
When developing and testing data pipelines, having realistic test data is crucial. However, manually creating test data is:
- Time-consuming and error-prone
- Hard to maintain as schemas evolve
- Difficult to make reproducible
- Unlikely to capture the complexity of real-world data
The Fabricator module solves these challenges by providing a declarative way to generate synthetic data that:
- Closely mimics real-world data patterns
- Maintains referential integrity across tables
- Generates deterministically when seeded
- Scales easily as schemas grow
- Keeps test data generation code clean and maintainable
What Does It Do?
Fabricator takes a YAML configuration that describes your desired data structure and generates pandas DataFrames accordingly. Key features include:
- Rich Data Generation Options
  - Unique IDs with customizable formats
  - Realistic fake data (names, addresses, etc.) via Faker
  - Statistical distributions via NumPy
  - Custom arrays and lists
  - Date sequences and ranges
  - Value mapping and transformations
- Complex Relationships
  - Reference columns from other tables
  - Cross products for exhaustive combinations
  - Row-wise and column-wise operations
  - Weighted random sampling
  - Deterministic value mapping
- Data Quality Controls
  - Configurable null value injection
  - Type casting and validation
  - Size control (up/down sampling)
  - Seeding for reproducibility
  - Error handling and logging
How to Use It
Basic Example
Let's start with a simple example generating patient data:
patients:
  num_rows: 100
  columns:
    id:
      type: generate_unique_id
      prefix: PAT_
      id_length: 8
    name:
      type: faker
      provider: name
    admission_date:
      type: generate_dates
      start_dt: 2023-01-01
      end_dt: 2023-12-31
      freq: B  # Business days
This generates a DataFrame with:
- 100 rows
- Patient IDs like 'PAT_00000123'
- Realistic names
- Random admission dates on business days in 2023
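To make the shape of that output concrete, here is a minimal standard-library sketch of the two less obvious column types: zero-padded prefixed IDs and business-day dates. It does not use Fabricator. The helper names `make_ids` and `business_days` are hypothetical, and Fabricator's own generators may differ, for example in how the numeric part of an ID is drawn.

```python
import datetime
import random


def make_ids(prefix, id_length, num_rows, seed=0):
    """Return num_rows unique IDs of the form prefix + zero-padded number."""
    rng = random.Random(seed)
    # sample() draws without replacement, so the IDs cannot collide.
    numbers = rng.sample(range(10 ** id_length), num_rows)
    return [f"{prefix}{n:0{id_length}d}" for n in numbers]


def business_days(start, end):
    """Return every Monday-to-Friday date between start and end, inclusive."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0..4 are Monday..Friday
            days.append(current)
        current += datetime.timedelta(days=1)
    return days


ids = make_ids("PAT_", 8, 100)
dates = business_days(datetime.date(2023, 1, 1), datetime.date(2023, 12, 31))
```

Each ID is the 4-character prefix followed by 8 digits, and `dates` is the pool of weekdays from which admission dates could be drawn.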
Advanced Features
1. Column References and Dependencies
You can reference columns from the same or other tables:
encounters:
  num_rows: 500
  columns:
    id:
      type: generate_unique_id
      prefix: ENC_
    patient_id:
      type: copy_column
      source_column: patients.id
      sample:
        num_rows: "@encounters.num_rows"
    visit_date:
      type: row_apply
      input_columns: [patients.admission_date]
      row_func: "lambda x: x + datetime.timedelta(days=random.randint(1, 30))"
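Combining `copy_column` with `sample` amounts to drawing foreign keys from an existing key column, which is what keeps the two tables referentially consistent. A minimal sketch of that idea follows, using only the standard library. The function name `sample_foreign_keys` is hypothetical, and sampling with replacement is an assumption about how 100 patient IDs are stretched to 500 encounter rows.

```python
import random


def sample_foreign_keys(parent_ids, num_rows, seed):
    """Draw num_rows keys from parent_ids, with replacement."""
    rng = random.Random(seed)
    return rng.choices(parent_ids, k=num_rows)


# Stand-in for the patients.id column (100 rows).
patient_ids = [f"PAT_{n:08d}" for n in range(100)]

# Stand-in for encounters.patient_id (500 rows).
encounter_patient_ids = sample_foreign_keys(patient_ids, 500, seed=42)
```

Every value in `encounter_patient_ids` exists in `patient_ids`, so a join between the two tables never produces orphaned encounters.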
2. Complex Data Generation
Generate arrays, weighted choices, and statistical distributions:
diagnoses:
  num_rows: 200
  columns:
    codes:
      type: generate_random_arrays
      sample_values: ["E11.9", "I10", "J45.909", "F41.1"]
      min_length: 1
      max_length: 3
      delimiter: "|"  # Results in strings like "E11.9|I10"
    severity:
      type: generate_values
      sample_values:
        mild: 0.5      # 50% chance
        moderate: 0.3  # 30% chance
        severe: 0.2    # 20% chance
    lab_value:
      type: numpy_random
      distribution: normal
      loc: 100   # mean
      scale: 15  # standard deviation
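The three column types in this configuration correspond to familiar sampling operations. The sketch below reproduces them with Python's `random` module instead of Fabricator or NumPy. The helper `random_array` is hypothetical, and drawing distinct values within one array is an assumption, not documented behavior.

```python
import random

rng = random.Random(7)  # fixed seed so repeated runs give the same data


def random_array(sample_values, min_length, max_length, delimiter):
    """Join a random-length selection of distinct values into one string."""
    length = rng.randint(min_length, max_length)
    return delimiter.join(rng.sample(sample_values, length))


codes = [
    random_array(["E11.9", "I10", "J45.909", "F41.1"], 1, 3, "|")
    for _ in range(200)
]

# Weighted choice: the dict maps each label to its selection probability.
weights = {"mild": 0.5, "moderate": 0.3, "severe": 0.2}
severity = rng.choices(list(weights), weights=list(weights.values()), k=200)

# Normal distribution with mean 100 and standard deviation 15.
lab_value = [rng.gauss(100, 15) for _ in range(200)]
```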
3. Data Quality Controls
Control nulls and types:
demographics:
  columns:
    email:
      type: faker
      provider: email
      inject_nulls:
        probability: 0.1
        value: "NOEMAIL"
    age:
      type: numpy_random
      distribution: normal
      loc: 45
      scale: 15
      dtype: Int64  # Nullable integer type
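Null injection can be pictured as an independent coin flip per row. The sketch below is one way to express it in plain Python. The function name `inject_nulls` mirrors the config key but is a hypothetical helper, and the reading that `value` replaces the cell (with `None` as the default) is an assumption based on the `"NOEMAIL"` example above.

```python
import random


def inject_nulls(values, probability, value=None, seed=0):
    """Replace each entry with `value` with the given probability."""
    rng = random.Random(seed)
    return [value if rng.random() < probability else v for v in values]


emails = [f"user{n}@example.com" for n in range(1000)]
with_nulls = inject_nulls(emails, probability=0.1, value="NOEMAIL")
```

With a probability of 0.1, roughly one row in ten ends up holding the placeholder, and the exact rows are fixed by the seed.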
Real-World Example
Here's a more complex example showing how Fabricator handles real-world data patterns:
rtx_kg2:  # Knowledge Graph Data
  nodes:
    num_rows: 500
    columns:
      id:
        type: generate_unique_id
        prefix: "RTX:"
      name:
        type: generate_unique_id
        prefix: name_
        inject_nulls:
          probability: 0.2
      category:
        type: generate_values
        sample_values:
          - biolink:Drug
          - biolink:Disease
          - biolink:Gene
      all_categories:
        type: row_apply
        input_columns: ["nodes.category"]
        row_func: matrix.pipelines.fabricator.generators.get_ancestors_for_category_delimited
        row_func_kwargs:
          delimiter: "|"
      equivalent_curies:
        type: generate_random_arrays
        delimiter: "|"
        sample_values:
          - "CHEMBL:CHEMBL25"
          - "DRUGBANK:DB00316"
          - "MESH:D009369"
  edges:
    num_rows: 2000
    columns:
      subject:
        type: copy_column
        source_column: "nodes.id"
        sample:
          num_rows: "@edges.num_rows"
          seed: 590590
      object:
        type: copy_column
        source_column: "nodes.id"
        sample:
          num_rows: "@edges.num_rows"
          seed: 49494
      predicate:
        type: generate_values
        sample_values:
          - biolink:treats
          - biolink:interacts_with
          - biolink:affects
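Note that `subject` and `object` sample the same source column with different seeds. One plausible reason, sketched below in plain Python, is that two seeded samplers given the same seed produce identical sequences, which would turn every edge into a self-loop. The helper `sample_column` and the stand-in node IDs are hypothetical, and the sketch assumes sampling with replacement.

```python
import random


def sample_column(values, num_rows, seed):
    """Draw num_rows values with replacement using a dedicated seeded RNG."""
    return random.Random(seed).choices(values, k=num_rows)


node_ids = [f"RTX:{n:05d}" for n in range(500)]  # stand-in for nodes.id

subject = sample_column(node_ids, 2000, seed=590590)
object_ = sample_column(node_ids, 2000, seed=49494)
same_seed = sample_column(node_ids, 2000, seed=590590)

# Count edges whose subject and object are the same node.
self_loops_distinct = sum(s == o for s, o in zip(subject, object_))
self_loops_shared = sum(s == o for s, o in zip(subject, same_seed))
```

With a shared seed all 2000 edges are self-loops, while distinct seeds leave only the occasional chance collision.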
Best Practices
- Use Seeds for Reproducibility
  - Set a global seed when initializing MockDataGenerator
  - Use specific seeds for critical columns
  - Document seed values in comments
- Structure Your Configuration
  - Group related tables under namespaces
  - Order tables by dependency (referenced tables first)
  - Use clear, consistent naming
- Handle Data Quality
  - Configure appropriate null rates
  - Set explicit dtypes for important columns
  - Use validation where needed
- Performance Considerations
  - Use column_apply for operations needing full column context
  - Use row_apply for independent row operations
  - Consider chunking for very large datasets
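The distinction between row-wise and column-wise operations in the last group can be illustrated without Fabricator. In the sketch below, `apply_row_wise` and `apply_column_wise` are hypothetical stand-ins for the two styles, not the library's functions: the first calls a function once per row, the second hands whole columns to a function that needs a column-level quantity such as the mean.

```python
import statistics


def apply_row_wise(func, *columns):
    """Call func once per row; each row is independent of the others."""
    return [func(*row) for row in zip(*columns)]


def apply_column_wise(func, *columns):
    """Call func once with the full columns; func sees all rows at once."""
    return func(*columns)


weight_kg = [70.0, 82.5, 55.0]
height_m = [1.75, 1.80, 1.60]

# Row-wise: BMI depends only on the values in the same row.
bmi = apply_row_wise(lambda w, h: round(w / (h * h), 1), weight_kg, height_m)


def center(column):
    """Subtract the column mean, which requires the whole column."""
    mean = statistics.mean(column)
    return [value - mean for value in column]


# Column-wise: centering cannot be computed one row at a time.
centered_weight = apply_column_wise(center, weight_kg)
```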
Common Patterns
- Stable References

    # Generate stable mappings using hash_map
    territory:
      type: hash_map
      input_column: postal_code
      buckets: ["North", "South", "East", "West"]

- Derived Values

    # Calculate values based on other columns
    bmi:
      type: row_apply
      input_columns: [weight_kg, height_m]
      row_func: "lambda w, h: round(w / (h * h), 1)"

- Complex Combinations

    # Generate all possible combinations
    product_regions:
      type: column_apply
      input_columns: [products.id, regions.id]
      column_func: cross_product
      check_all_inputs_same_length: false
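The ideas behind the stable-reference and combination patterns can be sketched in a few lines of standard-library Python. Both helpers below, `hash_bucket` and `cross_product`, are hypothetical illustrations and not Fabricator's implementations. The sketch uses `hashlib` because Python's built-in `hash()` is salted per process for strings and so would not give stable results across runs.

```python
import hashlib
import itertools


def hash_bucket(value, buckets):
    """Map a value to one bucket; the same value always gets the same one."""
    digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]


def cross_product(*columns):
    """Return one output column per input, covering every combination."""
    rows = list(itertools.product(*columns))
    return [list(column) for column in zip(*rows)]


buckets = ["North", "South", "East", "West"]
territory = [hash_bucket(code, buckets) for code in ["90210", "10001", "90210"]]

product_ids, region_ids = cross_product(["P1", "P2", "P3"], ["R1", "R2"])
```

Here the first and third entries of `territory` are identical because they come from the same postal code, and the two output columns hold all 3 x 2 = 6 product and region pairings even though the inputs differ in length.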
Troubleshooting
Common issues and solutions:
Column Reference Errors:
- Ensure referenced tables/columns are generated first
- Check namespace prefixes if using multiple namespaces
- Verify column names match exactly
Type Mismatches:
- Set explicit dtypes when needed
- Check format of date strings
- Use appropriate nullable types (Int64, string, etc.)
Size Mismatches:
- Use resize=True when needed
- Check num_rows references
- Verify sample configurations
Seeding Issues:
- Set global seed for overall reproducibility
- Use column-specific seeds for fine control
- Document seed values used