Fabricator: Declarative Synthetic Data Generation

Why Use Fabricator?

When developing and testing data pipelines, realistic test data is crucial. However, test data created by hand is:

  • Time-consuming and error-prone to produce
  • Hard to maintain as schemas evolve
  • Difficult to make reproducible
  • Often less complex than real-world data

The Fabricator module solves these challenges by providing a declarative way to generate synthetic data that:

  • Closely mimics real-world data patterns
  • Maintains referential integrity across tables
  • Generates deterministically when seeded
  • Scales easily as schemas grow
  • Keeps test data generation code clean and maintainable

What Does It Do?

Fabricator takes a YAML configuration that describes your desired data structure and generates pandas DataFrames accordingly. Key features include:

  1. Rich Data Generation Options
     • Unique IDs with customizable formats
     • Realistic fake data (names, addresses, etc.) via Faker
     • Statistical distributions via NumPy
     • Custom arrays and lists
     • Date sequences and ranges
     • Value mapping and transformations

  2. Complex Relationships
     • Reference columns from other tables
     • Cross products for exhaustive combinations
     • Row-wise and column-wise operations
     • Weighted random sampling
     • Deterministic value mapping

  3. Data Quality Controls
     • Configurable null value injection
     • Type casting and validation
     • Size control (up/down sampling)
     • Seeding for reproducibility
     • Error handling and logging

How to Use It

Basic Example

Let's start with a simple example generating patient data:

patients:
    num_rows: 100
    columns:
        id:
            type: generate_unique_id
            prefix: PAT_
            id_length: 8
        name:
            type: faker
            provider: name
        admission_date:
            type: generate_dates
            start_dt: 2023-01-01
            end_dt: 2023-12-31
            freq: B  # Business days

This generates a DataFrame with:

  • 100 rows
  • Patient IDs like 'PAT_00000123'
  • Realistic names
  • Random admission dates on business days in 2023
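
To make the shape of that output concrete, the standard-library sketch below builds two of the columns by hand: a prefixed, zero-padded unique ID and the business days of 2023. It is an illustration only, not Fabricator's implementation. `make_ids` and `business_days` are hypothetical helpers, and drawing the numeric part of each ID at random is an assumption.

```python
import datetime
import random


def make_ids(prefix: str, id_length: int, count: int, seed: int) -> list[str]:
    """Return `count` distinct IDs: prefix plus a zero-padded number of `id_length` digits."""
    rng = random.Random(seed)
    # sample() draws without replacement, so every ID is unique.
    numbers = rng.sample(range(10 ** id_length), count)
    return [f"{prefix}{n:0{id_length}d}" for n in numbers]


def business_days(start: datetime.date, end: datetime.date) -> list[datetime.date]:
    """Return every Monday-to-Friday date in the inclusive range [start, end]."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
        current += datetime.timedelta(days=1)
    return days


ids = make_ids("PAT_", 8, 100, seed=42)
dates = business_days(datetime.date(2023, 1, 1), datetime.date(2023, 12, 31))
```

Each ID is 12 characters long (the 4-character prefix plus 8 digits), which matches the 'PAT_00000123' pattern described above.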

Advanced Features

1. Column References and Dependencies

You can reference columns from the same or other tables:

encounters:
    num_rows: 500
    columns:
        id:
            type: generate_unique_id
            prefix: ENC_
        patient_id:
            type: copy_column
            source_column: patients.id
            sample:
                num_rows: "@encounters.num_rows"
        visit_date:
            type: row_apply
            input_columns: [patients.admission_date]
            row_func: "lambda x: x + datetime.timedelta(days=random.randint(1, 30))"
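
The `copy_column` entry above stretches 100 patient IDs over 500 encounter rows, so IDs have to repeat while every value still points at an existing patient. A minimal sketch of that idea, assuming sampling with replacement (the helper `sample_foreign_keys` is hypothetical and not part of Fabricator):

```python
import random


def sample_foreign_keys(parent_ids: list[str], num_rows: int, seed: int) -> list[str]:
    """Draw `num_rows` values from `parent_ids` with replacement."""
    rng = random.Random(seed)
    return rng.choices(parent_ids, k=num_rows)


# Stand-in for the generated patients.id column.
patient_ids = [f"PAT_{n:08d}" for n in range(100)]

# Every encounter row receives an ID that exists in the parent table.
encounter_patient_ids = sample_foreign_keys(patient_ids, 500, seed=7)
```

Because values are only ever drawn from the parent column, referential integrity holds by construction, and the fixed seed makes the assignment repeatable.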

2. Complex Data Generation

Generate arrays, weighted choices, and statistical distributions:

diagnoses:
    num_rows: 200
    columns:
        codes:
            type: generate_random_arrays
            sample_values: ["E11.9", "I10", "J45.909", "F41.1"]
            min_length: 1
            max_length: 3
            delimiter: "|"  # Results in strings like "E11.9|I10"
        severity:
            type: generate_values
            sample_values:
                mild: 0.5    # 50% chance
                moderate: 0.3 # 30% chance
                severe: 0.2   # 20% chance
        lab_value:
            type: numpy_random
            distribution: normal
            loc: 100    # mean
            scale: 15   # standard deviation
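
The three column types above correspond to familiar sampling operations. The sketch below reproduces them with Python's `random` module as a rough mental model. It is not Fabricator's code: the helper names are hypothetical, and picking distinct codes within one array is an assumption.

```python
import random


def weighted_values(weights: dict[str, float], n: int, rng: random.Random) -> list[str]:
    """Draw n keys, each chosen with probability proportional to its weight."""
    return rng.choices(list(weights), weights=list(weights.values()), k=n)


def random_array(values: list[str], min_length: int, max_length: int,
                 delimiter: str, rng: random.Random) -> str:
    """Join a random-length selection of distinct values into one string."""
    length = rng.randint(min_length, max_length)  # inclusive on both ends
    return delimiter.join(rng.sample(values, length))


rng = random.Random(0)
severity = weighted_values({"mild": 0.5, "moderate": 0.3, "severe": 0.2}, 10_000, rng)
codes = [
    random_array(["E11.9", "I10", "J45.909", "F41.1"], 1, 3, "|", rng)
    for _ in range(200)
]
# Normal distribution with mean 100 and standard deviation 15.
lab_values = [rng.gauss(100, 15) for _ in range(200)]
```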

3. Data Quality Controls

Control nulls and types:

demographics:
    columns:
        email:
            type: faker
            provider: email
            inject_nulls:
                probability: 0.1
                value: "NOEMAIL"
        age:
            type: numpy_random
            distribution: normal
            loc: 45
            scale: 15
            dtype: Int64  # Nullable integer type
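
Null injection can be pictured as an independent coin flip per row. The following sketch assumes that `probability` is the per-row chance of replacement and that `value` is the placeholder written in place of the original. `inject_nulls` here is a hypothetical stand-in, not the library function.

```python
import random


def inject_nulls(values: list, probability: float, seed: int, null_value=None) -> list:
    """Replace each value with `null_value` with the given probability."""
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so the comparison succeeds
    # with the requested probability, independently for each row.
    return [null_value if rng.random() < probability else v for v in values]


emails = [f"user{n}@example.com" for n in range(10_000)]
with_placeholders = inject_nulls(emails, probability=0.1, seed=3, null_value="NOEMAIL")
```

With 10,000 rows and a probability of 0.1, about 1,000 entries end up as "NOEMAIL"; the exact count depends on the seed.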

Real-World Example

Here's a more complex example showing how Fabricator handles real-world data patterns:

rtx_kg2:  # Knowledge Graph Data
    nodes:
        num_rows: 500
        columns:
            id:
                type: generate_unique_id
                prefix: "RTX:"
            name:
                type: generate_unique_id
                prefix: name_
                inject_nulls:
                    probability: 0.2
            category:
                type: generate_values
                sample_values:
                    - biolink:Drug
                    - biolink:Disease
                    - biolink:Gene
            all_categories:
                type: row_apply
                input_columns: ["nodes.category"]
                row_func: matrix.pipelines.fabricator.generators.get_ancestors_for_category_delimited
                row_func_kwargs:
                    delimiter: "|"
            equivalent_curies:
                type: generate_random_arrays
                delimiter: "|"
                sample_values:
                    - "CHEMBL:CHEMBL25"
                    - "DRUGBANK:DB00316"
                    - "MESH:D009369"

    edges:
        num_rows: 2000
        columns:
            subject:
                type: copy_column
                source_column: "nodes.id"
                sample:
                    num_rows: "@edges.num_rows"
                    seed: 590590
            object:
                type: copy_column
                source_column: "nodes.id"
                sample:
                    num_rows: "@edges.num_rows"
                    seed: 49494
            predicate:
                type: generate_values
                sample_values:
                    - biolink:treats
                    - biolink:interacts_with
                    - biolink:affects

Best Practices

  1. Use Seeds for Reproducibility
     • Set a global seed when initializing MockDataGenerator
     • Use specific seeds for critical columns
     • Document seed values in comments

  2. Structure Your Configuration
     • Group related tables under namespaces
     • Order tables by dependency (referenced tables first)
     • Use clear, consistent naming

  3. Handle Data Quality
     • Configure appropriate null rates
     • Set explicit dtypes for important columns
     • Use validation where needed

  4. Performance Considerations
     • Use column_apply for operations needing full column context
     • Use row_apply for independent row operations
     • Consider chunking for very large datasets
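
The difference between the two apply styles can be shown in a few lines of plain Python. This is a conceptual sketch only: the `row_apply` and `column_apply` functions below are simplified stand-ins written for this example, not Fabricator's generators, and their signatures are assumptions.

```python
def row_apply(columns: list[list[float]], func) -> list[float]:
    """Call `func` once per row, passing that row's value from each column."""
    return [func(*row) for row in zip(*columns)]


def column_apply(columns: list[list[float]], func) -> list[float]:
    """Call `func` once, passing the complete columns."""
    return func(*columns)


weight_kg = [70.0, 82.5, 55.0]
height_m = [1.75, 1.80, 1.60]

# Row-wise: each BMI depends only on values from its own row.
bmi = row_apply([weight_kg, height_m], lambda w, h: round(w / (h * h), 1))


def center(values: list[float]) -> list[float]:
    """Subtract the column mean, which requires seeing every row."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Column-wise: centering needs the whole column at once.
centered_weight = column_apply([weight_kg], center)
```

A function like `center` cannot be expressed row by row because it needs an aggregate of the whole column, which is the case the column-wise style is meant for.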

Common Patterns

  1. Stable References

    # Generate stable mappings using hash_map
    territory:
        type: hash_map
        input_column: postal_code
        buckets: ["North", "South", "East", "West"]
    

  2. Derived Values

    # Calculate values based on other columns
    bmi:
        type: row_apply
        input_columns: [weight_kg, height_m]
        row_func: "lambda w, h: round(w / (h * h), 1)"
    

  3. Complex Combinations

    # Generate all possible combinations
    product_regions:
        type: column_apply
        input_columns: [products.id, regions.id]
        column_func: cross_product
        check_all_inputs_same_length: false
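
The first and third patterns rest on two small ideas: a hash that always sends the same input to the same bucket, and a Cartesian product of two columns. A minimal sketch of both, with hypothetical helper names and SHA-256 chosen purely for illustration (the hash Fabricator uses is not stated here):

```python
import hashlib
import itertools


def hash_bucket(value: str, buckets: list[str]) -> str:
    """Map a value to a bucket that stays the same across runs and machines."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]


def cross_product(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Return every (left, right) pair; the inputs may differ in length."""
    return list(itertools.product(left, right))


buckets = ["North", "South", "East", "West"]
territory = hash_bucket("90210", buckets)

# 3 products x 2 regions = 6 combinations.
pairs = cross_product(["P1", "P2", "P3"], ["R1", "R2"])
```

The sketch uses `hashlib` instead of the built-in `hash()` because Python randomizes string hashes per process by default, which would break stability between runs.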
    

Troubleshooting

Common issues and solutions:

Column Reference Errors:

  • Ensure referenced tables/columns are generated first
  • Check namespace prefixes if using multiple namespaces
  • Verify column names match exactly

Type Mismatches:

  • Set explicit dtypes when needed
  • Check format of date strings
  • Use appropriate nullable types (Int64, string, etc.)

Size Mismatches:

  • Use resize=True when needed
  • Check num_rows references
  • Verify sample configurations

Seeding Issues:

  • Set global seed for overall reproducibility
  • Use column-specific seeds for fine control
  • Document seed values used
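
One way to picture the interplay of a global seed and column-specific seeds is to give each column its own generator, derived from the global seed and the column name. The scheme below is a hypothetical illustration of that principle and makes no claim about how Fabricator combines seeds internally.

```python
import hashlib
import random


def column_seed(global_seed: int, column_name: str) -> int:
    """Derive a stable per-column seed from the global seed and the column name."""
    digest = hashlib.sha256(f"{global_seed}:{column_name}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_column(global_seed: int, column_name: str, n: int) -> list[int]:
    """Generate n integers from a generator private to this column."""
    rng = random.Random(column_seed(global_seed, column_name))
    return [rng.randint(0, 999) for _ in range(n)]


age_run_1 = generate_column(1, "age", 50)
age_run_2 = generate_column(1, "age", 50)   # same seed and name: same values
weight = generate_column(1, "weight", 50)   # different name: different stream
```

Because each column owns its generator, adding or reordering columns does not change the values of existing ones, which is a common source of unexpected differences between runs.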

Further Reading