datafiller

NAME

datafiller.py - generate random data from database schema extended with directives

SYNOPSIS

datafiller.py [--help --man ...] [schema.sql ...] > data.sql

DESCRIPTION

This script generates random data from a database schema enriched with simple directives in SQL comments to drive 29 data generators which cover typical data types and their combinations. Reasonable defaults are provided, especially based on key and type constraints, so that few directives should be necessary. The minimum setup is to specify the relative size of tables with directive mult so that data generation can be scaled.

See the "TUTORIAL" section below. Also, run with --validate=comics or --validate=library and look at the output for didactic examples.

OPTIONS

  • --csv-null=NULL

    Set NULL encoding for CSV output.

    Default is NULL.

  • --csv-separator=SEP

    Set separator for CSV output.

    Default is ,.

  • --debug or -D

    Set debug mode. Repeat for more.

    Default is no debug.

  • --drop

    Drop tables before recreating them. This implies option --filter, otherwise there would be no table to fill.

    Default is not to.

  • --encoding=enc or -e enc

    Set this encoding for input and output files.

    Default is no explicit encoding.

  • --filter or -f, reverse with --no-filter

    Work as a filter, i.e. send the schema input script to stdout and then the generated data. This is convenient for piping the result of the script directly to the database command for execution.

    Default is to only output generated data.

  • --freeze, reverse with --no-freeze

    Whether to use PostgreSQL COPY FREEZE option.

    Default is to use it, as it allows better performance.

  • --help or -h

    Show basic help.

  • --man or -m

    Show full man page based on POD. Yes, the perl thing:-)

  • --null RATE or -n RATE

    Probability to generate a null value for nullable attributes.

    Default is 0.01, which can be overridden by the null directive at the schema level, or by a per-attribute null rate.

  • --offset OFFSET or -O OFFSET

    Set default offset for integer generators on primary keys. This is useful to extend the already existing content of a database for larger tests.

    Default is 1, which can be overridden by the offset directive at the schema level, or by a per-attribute offset.

  • --pod COMMAND

    Override pod conversion command used by option --man.

    Default is 'pod2usage -verbose 3'.

  • --quiet or -q

    Generate less verbose SQL output.

    Default is to generate one echo when starting to fill a table.

  • --seed SEED or -S SEED

    Seed overall random generation with provided string.

    Default uses OS supplied randomness or current time.

  • --size SIZE

    Set overall scaling. The size is combined with the mult directive value on a table to compute the actual number of tuples to generate in each table.

    Default is 100, which can be overridden with the size directive at the schema level.

  • --target (postgresql|mysql|csv|tsv) or -t ...

    Target database engine. PostgreSQL support is well tested. MySQL support is more experimental. CSV (comma-separated values) and TSV (tabulation-separated values) support is partial.

    Default is to target PostgreSQL.

  • --test='directives...'

    Run tests for any generator with some directives. If the directives start with !, show a histogram. If the directives start with -, show results on one line. This is convenient for unit testing, or to check what the output of a set of directives would be.

    Examples: --size=100 --test='int size=10 mangle' would show 100 integers between 1 and 10 drawn uniformly. --test='!bool rate=0.3' may show False: 69.32%, True: 30.68%, stating the rate at which True and False were seen during the test.

    Directive type can be used within --test=... to provide the target SQL type when testing.

  • --transaction or -T

    Use a global transaction.

    Default is not to.

  • --tries=NUM

    How hard to try to satisfy a compound unique constraint before giving up on a given tuple.

    Default is 10.

  • --truncate

    Delete table contents before filling.

    Default is not to.

  • --type=CUSTOM

    Add a custom type. The default generator for this type will rely on a macro of the same name, if it exists. This option can be repeated to add more custom types. See also the type directive at the schema level.

  • --validate=(unit|internal|comics|library|pgbench)

    Output validation test cases. All but the unit tests can be processed with psql.

    sh> datafiller.py --validate=internal | psql
    

    This option sets --filter automatically, although it can be forced back with --no-filter.

    Default is to process argument files or standard input.

  • --version, -v or -V

    Show script version.

ARGUMENTS

Files containing SQL schema definitions. Standard input is processed if no file is provided.

TUTORIAL

This tutorial introduces how to use DataFiller to fill a PostgreSQL database for testing functionality and performance.

GENERATORS AND DIRECTIVES

The starting point of the script to generate test data is the SQL schema of the database taken from a file. It includes important information that will be used to generate data: attribute types, uniqueness, not-null-ness, foreign keys... This information is used to set up a default generator for each field. The generator parameters can be modified by providing additional directives.

The schema can be augmented with directives in comments so as to provide hints about data generation. A datafiller directive is a special SQL comment recognized by the script, with a df marker at the beginning. A directive must appear after the object to which it applies, either directly after the object declaration, in which case the object is implicit, or much later, in which case the object must be explicitly referenced:

-- this directive sets the default overall "size" parameter
  -- df: size=10
-- this directive defines a macro named "fn"
  -- df fn: word=/path/to/file-containing-words
-- this directive applies to table "Foo"
CREATE TABLE Foo( -- df: mult=10.0
  -- this directive applies to attribute "fid"
  fid SERIAL -- df: offset=1000
  -- use defined macro, choose "stuff" from the list of words
, stuff TEXT NOT NULL -- df: use=fn
);
-- ... much later
-- this directive applies to attribute "fid" in table "Foo"
-- df T=Foo A=fid: null=0.8

A SIMPLE LIBRARY EXAMPLE

Let us start with a simple example involving a library where readers borrow books. Our schema is defined in file library.sql as follows:

CREATE TABLE Book( 
  bid SERIAL PRIMARY KEY,
  title TEXT NOT NULL, 
  isbn ISBN13 NOT NULL 
);

CREATE TABLE Reader( 
  rid SERIAL PRIMARY KEY,
  firstname TEXT NOT NULL, 
  lastname TEXT NOT NULL, 
  born DATE NOT NULL, 
  gender BOOLEAN NOT NULL, 
  phone TEXT 
);

CREATE TABLE Borrow( 
  borrowed TIMESTAMP NOT NULL, 
  rid INTEGER NOT NULL REFERENCES Reader,
  bid INTEGER NOT NULL REFERENCES Book, 
  PRIMARY KEY(bid) -- a book is borrowed once at a time!
);

The first and only information you really need to provide is the relative or absolute size of relations. For scaling, the best way is to specify a relative size multiplier with the mult directive on each table, which will be multiplied by the size option to compute the actual size of data to generate in each table. Let us say we want 100 books in stock per reader, with 1.5 borrowed books per reader on average:

CREATE TABLE Book( -- df: mult=100.0
...
CREATE TABLE Borrow( --df: mult=1.5

The default multiplier is 1.0, so it does not need to be set on Reader. Then you can generate a data set with:

sh> datafiller.py --size=1000 library.sql > library_test_data.sql

Note that explicit constraints are enforced on the generated data, so that foreign keys in *Borrow* reference existing *Books* and *Readers*. However, the script cannot guess implicit constraints, thus if an attribute is not declared NOT NULL, then some NULL values will be generated. If an attribute is not unique, then the generated values will probably not be unique.

IMPROVING GENERATED VALUES

In the above generated data, some attributes may not reflect the reality one would expect from a library. Changing the default with per-attribute directives will help improve this first result.

First, book titles are all quite short, looking like title_number, with some collisions. Indeed the default is to generate strings with a common prefix based on the attribute name and a number drawn uniformly from the expected number of tuples. We can change to texts composed of between 1 and 7 English words taken from a dictionary:

title TEXT NOT NULL
-- df English: word=/etc/dictionaries-common/words
-- df: text=English length=4 lenvar=3

Also, we may have undesirable collisions on the ISBN attribute, because the default size of the set is the number of tuples in the table. We can extend the size at the attribute level so as to avoid this issue:

isbn ISBN13 NOT NULL -- df: size=1000000000

If we now look at readers, the result can also be improved. First, we can decide to keep the prefix and number form, but make the statistics more in line with what you can find. Let us draw from 1000 firstnames, most frequent 3%, and 10000 lastnames, most frequent 1%:

firstname TEXT NOT NULL,
   -- df: sub=power prefix=fn size=1000 rate=0.03
lastname TEXT NOT NULL,
   -- df: sub=power prefix=ln size=10000 rate=0.01

The default generated dates are a few days around now, which does not make much sense for our readers' birth dates. Let us set a range of birth dates.

born DATE NOT NULL, -- df: start=1923-01-01 end=2010-01-01

Most readers from our library are female: we can adjust the rate so that 25% of readers are male, instead of the default 50%.

gender BOOLEAN NOT NULL, -- df: rate=0.25

Phone numbers also have a prefix_number structure, which does not really look like a phone number. Let us draw a string of 10 digits, and adjust the nullable rate so that 1% of phone numbers are not known. We also set the size manually to avoid too many collisions, but we could have chosen to keep them as is, as some readers do share phone numbers.

phone TEXT
  -- these directives could be on a single line
  -- df: chars='0-9' length=10 lenvar=0
  -- df: null=0.01 size=1000000

The last table is about currently borrowed books. The timestamps default to around now; we are going to spread them over a period of 50 days, that is 24 * 60 * 50 = 72000 minutes (precision is 60 seconds).

borrowed TIMESTAMP NOT NULL -- df: size=72000 prec=60

Because of the primary key constraint, the borrowed books are the first ones. Let us mangle the result so that referenced book numbers are scattered.

bid INTEGER REFERENCES Book -- df: mangle

Now we can generate improved data for our one thousand readers library, and fill it directly to our library database:

sh> datafiller.py --size=1000 --filter library.sql | psql library

Our test database is ready. If we want more readers and books, we only need to adjust the size option. Let us query our test data:

-- show firstname distribution
SELECT firstname, COUNT(*) AS cnt FROM Reader
GROUP BY firstname ORDER BY cnt DESC LIMIT 3;
  -- fn_1_... | 33
  -- fn_2_... | 15
  -- fn_3_... | 12
-- compute gender rate
SELECT AVG(gender::INT) FROM Reader;
  -- 0.246

DISCUSSION

We could go on improving the generated data so that it is more realistic. For instance, we could skew the borrowed timestamp so that there are fewer old borrowings, or skew the book number so that old books (lower numbers) are less often borrowed, or choose firstnames and lastnames from actual lists.

When to stop improving is not obvious: On the one hand, real data may show particular distributions which impact the application behavior and performance, thus it may be important to reflect that in the test data. On the other hand, if nothing is really done about readers, then maybe the only relevant information is the average length of firstnames and lastnames because of the storage implications, and that's it.

ADVANCED FEATURES

Special generators make it possible to combine or synchronize other generators.

Let us consider a new email attribute for our Readers, for which we want to generate gmail or yahoo addresses. We can use a pattern generator for this purpose:

email TEXT NOT NULL CHECK(email LIKE '%@%')
  -- df: pattern='[a-z]{3,8}\.[a-z]{3,8}@(gmail|yahoo)\.com'

The pattern sets 3 to 8 lower case characters, followed by a dot, followed by 3 to 8 characters again, followed by either the gmail or yahoo domain. With this approach, everything is chosen uniformly: each letter in the first and last names appears with probability 1/26, the six possible lengths from 3 to 8 are equiprobable, and each domain appears in about half of the generated email addresses.

In order to get more control over these distributions, we could rely on the chars or alt generators which can be skewed or weighted, as illustrated in the next example.

Let us now consider an ip address for the library network. We want 20% to come from the librarian subnet '10.1.0.0/16' and 80% from the reader subnet '10.2.0.0/16'. This can be achieved with the alt directive, which makes a weighted choice between macro-defined generators:

-- define two macros
-- df librarian: inet='10.1.0.0/16'
-- df reader: inet='10.2.0.0/16'
ip INET NOT NULL
-- df: alt=reader:8,librarian:2
-- This would do as well: --df: alt=reader:4,librarian

Let us finally consider a log table for data coming from a proxy, which stores correlated ethernet and ip addresses, that is, an ethernet address always gets the same ip address, and we have about 1000 distinct hosts:

   -- df distinct: int size=1000
,  mac MACADDR NOT NULL -- df share=distinct
,  ip INET NOT NULL -- df share=distinct inet='10.0.0.0/8'

Because of the share directive, the same mac and ip will be generated together in a pool of 1000 pairs of addresses.

TUTORIAL CONCLUSION

There are many more directives to drive data generation, from simple type oriented generators to advanced combinators. See the documentation and examples below.

For very application-specific constraints that would not fit any generator, it is also possible to apply updates to modify the generated data afterwards.

DIRECTIVES AND DATA GENERATORS

Directives drive the data sizes and the underlying data generators. They must appear in SQL comments after the object on which they apply, although possibly on the same line, introduced by '-- df: '.

CREATE TABLE Stuff(         -- df: mult=2.0
  id SERIAL PRIMARY KEY,    -- df: step=19
  data TEXT UNIQUE NOT NULL -- df: prefix=st length=30 lenvar=3
);

In the above example, with option --size=1000, 2000 tuples will be generated (2.0*1000) with id 1+(i*19)%2000 and unique text data of length about 30+-3 prefixed with st. The sequence for id will be restarted at 2001.

The default size is the number of tuples of the containing table. This implies many collisions for a uniform generator.

DATA GENERATORS

There are 29 data generators which are selected by the attribute type or possibly by directives. Most generators are subject to the null directive which drives the probability of a NULL value. There is also a special share generator for synchronizing other generators.

Generators are triggered by using directives of their names. If none is specified, a default is chosen based on the attribute type.

  • alt generator

    This generator aggregates other generators by choosing one. The list of sub-generators must be specified as a list of comma-separated weighted datafiller macros provided by directive alt, see below. These generator definition macros must contain an explicit directive to select the underlying generator.
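
    For instance, a minimal sketch (macro names small and large are made up) where small is drawn three times out of four:

    -- df small: int size=10
    -- df large: int offset=1000000 size=10
    , n INTEGER NOT NULL -- df: alt=small:3,large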

  • array generator

    This generator generates an SQL array from another generator. The sub-generator is specified as a macro name with the array directive. It takes into account the length, lenvar, lenmin and lenmax directives.
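
    For instance, a sketch (macro name elem is made up) generating arrays of 1 to 5 small integers:

    -- df elem: int size=100
    , arr INT[] NOT NULL -- df: array=elem length=3 lenvar=2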

  • bit generator

    This generator handles BIT and VARBIT data to store sequences of bits. It takes into account the length, lenvar, lenmin and lenmax directives.

  • blob generator

    This is for blob types, such as PostgreSQL's BYTEA. It uses an int generator internally to drive its extent. It takes into account the length, lenvar, lenmin and lenmax directives. This generator does not support UNIQUE, but uniqueness is very likely if the blob length is significant and the size is large.

  • bool generator

    This generator is used for the boolean type. It is subject to the rate directive.

  • cat generator

    This generator aggregates other generators by concatenating their textual output. The list of sub-generators must be specified as a list of comma-separated datafiller macros provided by a cat directive, see below.
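
    For instance, a sketch (macro names pre and num are made up) generating values such as FR-000001, FR-000002 and so on:

    -- df pre: const=FR-
    -- df num: count format=06d
    , code TEXT NOT NULL -- df: cat=pre,num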

  • chars generator

    This alternate generator for text types generates random string of characters. It is triggered by the chars directive. In addition to the underlying int generator which allows to select values, another int generator is used to build words from the provided list of characters, The cgen directives is the name of a macro which specifies the int generator parameters for the random character selection. It also takes into account the length, lenvar, lenmin and lenmax directives. This generator does not support UNIQUE.

  • const generator

    This generator provides a constant text value. It is driven by the const directive.

  • count generator

    This generator provides a simple counter. It takes into account directives start, step and format.
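
    For instance, a sketch (the attribute is made up) counting from 10 by steps of 5 in 0-padded hexadecimal (000A, 000F, 0014, ...):

    , tag TEXT NOT NULL -- df: count start=10 step=5 format=04X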

  • date generator

    This generator is used for the date type. It uses an int generator internally to drive its extent. Its internal working is subject to directives start, end and prec.

  • ean generator

    This is for International Article Number (EAN) generation. It uses an int generator internally, so the number of distinct numbers can be adjusted with directive size. It takes into account the length and prefix directives. Default is to generate EAN-13 numbers. This generator does not support UNIQUE, but uniqueness is very likely.

  • file generator

    Inline file contents. The mandatory list of files is specified with directive files. See also directive mode.

  • float generator

    This generator is used for floating point types. The float directive allows specifying the sub-generator to use, see below. Its configuration also relies on directives alpha and beta. It does not support UNIQUE, but uniqueness is very likely.

  • inet generator

    This is for internet ip types, such as PostgreSQL's INET and CIDR. It uses an int generator internally to drive its extent. It takes into account the network directive to specify the target network. Handling IPv6 networks requires module netaddr.

  • int generator

    This generator is used directly for integer types, and indirectly by other generators such as text, word and date. Its internal working is subject to directives: sub, size (or mult), offset, shift, step, xor and mangle.

  • interval generator

    This generator is used for the time interval type. It uses the int generator internally to drive its extent. See also the unit directive.
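
    For instance, a sketch (the attribute is made up) generating durations of up to about four hours at a one-minute unit:

    , duration INTERVAL NOT NULL -- df: size=240 unit=m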

  • isnull generator

    Generates the special NULL value.

  • luhn generator

    This is for numbers which use Luhn's algorithm for their checksum, such as bank card numbers. It uses an int generator internally, so the number of distinct numbers can be adjusted with directive size. It takes into account the length and prefix directives. Default length is 16, default prefix is empty. This generator does not support UNIQUE, but uniqueness is very likely.
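
    For instance, a sketch generating 16-digit card-like numbers starting with 4000 and ending with a valid Luhn checksum:

    , card TEXT NOT NULL -- df: luhn prefix=4000 length=16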

  • mac generator

    This is for MAC addresses, such as PostgreSQL's MACADDR. It uses an int generator internally, so the number of generated addresses can be adjusted with directive size.

  • pattern generator

    This alternative generator for text types generates text based on a regular expression provided with the pattern directive. It uses internally the alt, cat, repeat, const and chars generators.

  • reduce generator

    This generator applies the reduction operation specified by directive op to generators specified with reduce as a comma-separated list of macros.
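
    For instance, a sketch (macro names a and b are made up) summing two independent draws:

    -- df a: int size=10
    -- df b: int size=10
    , s INTEGER NOT NULL -- df: reduce=a,b op=+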

  • repeat generator

    This generator aggregates the repetition of another generator. The repeated generator is specified in a macro with a repeat directive, and the number of repetitions relies on the extent directive. It uses an int generator internally to drive the number of repetitions, so it can be skewed by specifying a subtype with the sub directive.
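
    For instance, a sketch (macro name digit is made up) generating strings of 2 to 8 digits:

    -- df digit: chars='0-9' length=1 lenvar=0
    , digits TEXT NOT NULL -- df: repeat=digit extent=2-8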

  • string generator

    This generator is used by default for text types. This is a good generator for filling stuff without much ado. It takes into account prefix, and the length can be specified with the length, lenvar, lenmin and lenmax directives. The generated text is of length length +- lenvar, or between lenmin and lenmax. For CHAR(n) and VARCHAR(n) text types, automatic defaults are set.

  • text generator

    This aggregate generator generates aggregates of words drawn from any other generator specified by a macro in directive text. It takes into account directive separator for the word separator (default is a space), and directives prefix and suffix (default empty). It also takes into account the length, lenvar, lenmin and lenmax directives which handle the number of words to generate. This generator does not support UNIQUE, but uniqueness is very likely for a text with a significant length drawn from a dictionary.

  • timestamp generator

    This generator is used for the timestamp type. It is similar to the date generator but at a finer granularity. The tz directive allows to specify the target timezone.

  • tuple generator

    This aggregate generator generates composite types. The list of sub-generators must be specified with tuple as a list of comma-separated macros.

  • uuid generator

    This generator is used for the UUID type. It is really a pattern generator with a predefined pattern.

  • value generator

    This generator uses per-tuple values from another generator specified as a macro name in the value directive. If the same value is specified more than once in a tuple, the exact same value is generated.
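
    For instance, a sketch (macro name stamp is made up) where both attributes always hold the same date within a tuple:

    -- df stamp: date end=2013-06-25
    , created DATE NOT NULL -- df: value=stamp
    , updated DATE NOT NULL -- df: value=stamp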

  • word generator

    This alternate generator for text types is triggered by the word directive. It uses an int generator to select words from a list or a file. This generator handles UNIQUE if enough words are provided.
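
    For instance, a sketch drawing from an inline list of three words:

    , color TEXT NOT NULL -- df: word=:red,green,blue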

GLOBAL DIRECTIVES

A directive macro can be defined and then used later by inserting its name between the introductory df and the :. The specified directives are stored in the macro and can be reused later. For instance, macros words, mix, cfr and cen can be defined as:

--df words: word=/etc/dictionaries-common/words sub=power alpha=1.7
--df mix: offset=10000 step=17 shift=3
--df cfr: sub=scale alpha=6.7
--df cen: sub=scale alpha=5.9

Then they can be used in any datafiller directive with use=...:

--df: use=words use=mix
--df: use=mix

Or possibly for chars generators with cgen=...:

--df: cgen=cfr chars='esaitnru...'

There are four predefined macros: cfr and cen define skewed integer generators with the above parameters. french and english define chars generators which try to mimic the character frequencies of these languages.

The size, offset, null and seed directives can be defined at the schema level to override, from the SQL script, the default size multiplier, primary key offset, null rate or seed. However, they are ignored if the corresponding options are set.

The type directive at the schema level allows adding custom types, similarly to the --type option above.

TABLE DIRECTIVES

  • mult=float

    Size multiplier for scaling, that is computing the number of tuples to generate. This directive is exclusive from size.

  • nogen

    Do not generate data for this table.

  • null

    Set the default null rate for this table.

  • size=int

    Use this size, so there is no scaling with the --size option and mult directive. This directive is exclusive from mult.

  • skip=float

    Skip (that is generate but do not insert) some tuples with this probability. Useful to create some holes in data. Tables with a non-zero skip cannot be referenced.

ATTRIBUTE DIRECTIVES

A specific generator can be specified by using its name in the directives, otherwise a default is provided based on the attribute SQL type. Possible generators are: alt array bit blob bool cat chars const count date ean file float inet int interval isnull luhn mac pattern reduce repeat string text timestamp tuple uuid value word. See also the sub directive to select a sub-generator to control skewness.

  • alt=some:2,weighted,macros:2

    List of macros, possibly weighted (default weight is 1) defining the generators to be used by an alt generator. These macros must include an explicit directive to select a generator.

  • array=macro

    Name of the macro defining an array built upon this generator for the array generator. The macro must include an explicit directive to select a generator.

  • cat=list,of,macros

    List of macros defining the generators to be used by a cat generator. These macros must include an explicit directive to select a generator.

  • chars='0123456789A-Z\\n' cgen=macro

    The chars directive triggers the chars generator described above. Directive chars provides a list of characters which are used to build words, possibly including character intervals with '-'. A leading '-' in the list means the dash character as is. Characters can be escaped in octal (e.g. \\041 for !) or in hexadecimal (e.g. \\x3D for =). Unicode escapes are also supported: eg \\u20ac for the Euro symbol and \\U0001D11E for the G-clef. Also special escaped characters are: null \\0 (ASCII 0), bell \\a (7), backspace \\b (8), formfeed \\f (12), newline \\n (10), carriage return \\r (13), tab \\t (9) and vertical tab \\v (11). The macro name specified in directive cgen is used to setup the character selection random generator.

    For example:

    ...
    -- df skewed: sub=power rate=0.3
    , stuff TEXT -- df: chars='a-f' sub=uniform size=23 cgen=skewed
    

    The text is chosen uniformly in a list of 23 words, each word being built from characters 'abcdef' with the skewed generator described in the corresponding macro definition on the line above.

  • const=str

    Specify the constant text to generate for the const generator. The constant text can contain the escaped characters described with the chars directive above.

  • extent=int or extent=int-int

    Specify the extent of the repetition for the repeat generator. Default is 1, that is not to repeat.

  • files=str

    Path-separated patterns for the list of files used by the file generator. For instance, to specify image files in the ./img UN*X subdirectory:

    files='./img/*.png:./img/*.jpg:./img/*.gif'
    
  • float=str

    The random sub-generators for floats are those provided by Python's random:

    • beta

      Beta distribution, alpha and beta must be >0.

    • exp

      Exponential distribution with mean 1.0/alpha.

    • gamma

      Gamma distribution, alpha and beta must be >0.

    • gauss

      Gaussian distribution with mean alpha and stdev beta.

    • log

      Log normal distribution, see norm.

    • norm

      Normal distribution with mean alpha and stdev beta.

    • pareto

      Pareto distribution with shape alpha.

    • uniform

      Uniform distribution between alpha and beta. This is the default distribution.

    • vonmises

      Circular data distribution, with mean angle alpha in radians and concentration beta.

    • weibull

      Weibull distribution with scale alpha and shape beta.
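
    For instance, a sketch (the attribute is made up) drawing amounts from a normal distribution with mean 20.0 and standard deviation 5.0:

    price REAL NOT NULL -- df: float=gauss alpha=20.0 beta=5.0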

  • format=str

    Format output for the count generator. Default is d. For instance, setting 08X displays the counter as 0-padded 8-digits uppercase hexadecimal.

  • inet=str

    Use to specify in which IPv4 or IPv6 network to generate addresses. For instance, inet=10.2.14.0/24 chooses ip addresses between 10.2.14.1 and 10.2.14.254, that is network and broadcast addresses are not generated. Similarly, inet=fe80::/112 chooses addresses between fe80::1 and fe80::ffff. The default subnet limit is 24 for IPv4 and 64 for IPv6. A leading , adds the network address, a leading . adds the broadcast address, and a leading ; adds both, thus inet=';10.2.14.0/24' chooses ip addresses between 10.2.14.0 and 10.2.14.255.

  • length=int lenvar=int lenmin=int lenmax=int

    Specify the length either as a length and variation, or as length bounds: the number of characters of string data, the number of words of text data, or the byte length of a blob.

  • mangle

    Whether to automatically choose random shift, step and xor for an int generator.

  • mode=str

    Mode for handling files for the file generator. The value is either blob for binary files or text for text files in the current encoding. Default is to use the binary format, as it is safer to do so.

  • mult=float

    Use this multiplier to compute the generator size. This directive is exclusive from size.

  • nogen

    Do not generate data for this attribute, so it will get its default value.

  • null=float

    Probability of generating a null value for this attribute. This applies to all generators.

  • offset=int shift=int step=int

    Various parameters for generated integers. The generated integer is offset+(shift+step*i)%size. step must not be a divisor of size; if it is, it is ignored and replaced with 1.

    Defaults: offset is 1, shift is 0, step is 1.
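
    For instance, an illustrative sketch with size 5:

    id SERIAL PRIMARY KEY -- df: offset=1000 shift=2 step=3
    -- 1000+(2+3*i)%5 cycles through 1000..1004 in a scattered order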

  • op=(+|*|min|max|cat)

    Reduction operation for reduce generator.

  • pattern=str

    Provide the regular expression for the pattern generator.

    They can involve character sequences like calvin, character escape sequences (octal, hexadecimal, unicode, special) as in directive chars above, character classes like [a-z] and [^a-z] (exclusion), character class shortcuts like . \\d \\h \\H \\s and \\w which stand for [ -~] [0-9] [0-9a-f] [0-9A-F] [ \\f\\n\\r\\t\\v] and [0-9a-zA-Z_] respectively, as well as POSIX character classes within [:...:], for instance [:alnum:] for [0-9A-Za-z] or [:upper:] for [A-Z].

    Alternations are specified with |, for instance (hello|world). Repetitions can be specified after an object with {3,8}, which stands for repeat between 3 and 8 times. The question mark ? is a shortcut for {0,1}, the star sign * for {0,8} and the plus sign + for {1,8}.

    For instance:

    stuff TEXT NOT NULL -- df: pattern='[a-z]{3,5} ?(!!|\\.{3}).*'
    

    means 3 to 5 lower case letters, maybe followed by a space, followed by either 2 bangs or 3 dots, and ending with any non special character.

    The special [:GEN ...:] syntax allows embedding a generator within the generated pattern, possibly including specific directives. For instance, the following would generate unique email addresses thanks to the embedded counter:

    email TEXT UNIQUE NOT NULL
      -- df: pattern='\w+\.[:count format=X:]@somewhere\.org'
    
  • prefix=str

    Prefix for the string, ean, luhn and text generators.

  • rate=float

    For the bool generator, rate of generating True vs False. Must be in [0, 1]. Default is 0.5.

    For the int generator, rate of generating value 0 for generators power and scale.

  • repeat=macro

    Macro which defines the generator to repeat for the repeat generator. See also the extent directive.

  • seed=str

    Set the default global seed from the schema level. This can be overridden by option --seed. Default is to use the default random generator seed, usually relying on OS-supplied randomness or the current time.

  • separator=str

    Word separator for the text generator. Default is ' ' (a space).

  • share=macro

    Specify the name of a macro which defines an int generator used for synchronizing other generators: if several generators share the same macro, their values within each tuple are correlated, as illustrated in the tutorial above.

  • size=int

    Number of underlying values to generate or draw from, depending on the generator. For keys (primary, foreign, unique), this is necessarily the corresponding number of tuples. This directive is exclusive from mult.

  • start=date/time, end=date/time, prec=int

    For the date and timestamp generators, issue from start up to end at precision prec. Precision is in days for dates and seconds for timestamps. Default is to set end to the current date/time and prec to 1 day for dates and 60 seconds for timestamps. If both start and end are specified, the underlying size is adjusted.

    For example, to draw from about 100 years of dates ending on January 19, 2038:

    -- df: end=2038-01-19 size=36525
    
  • sub=SUBGENERATOR

    For integer generators, use this underlying sub-type generator.

    The integer sub-types also apply to all generators which inherit from the int generator, namely blob date ean file inet interval luhn mac string text timestamp repeat and word.

    The sub-generators for integers are:

    • serial

      This is really a counter which generates distinct integers, depending on offset, shift, step and xor.

    • uniform

      Generates uniform random integers between offset and offset+size-1. This is the default.

    • serand

      Generates integers based on serial up to size, then uses uniform. Useful for filling foreign keys.

    • power with parameter alpha or rate

      Raise probability to this alpha power. When rate is specified, alpha is computed so that value 0 is drawn at the specified rate. Uniform is similar to power with alpha=1.0, or rate=1.0/size. The higher alpha, the more skewed towards 0.

      Example distribution with --test='!int sub=power rate=0.3 size=10':

      value     0   1   2   3   4   5   6   7   8   9
      percent  30  13  10   9   8   7   6   6   5   5
      
    • scale with parameter alpha or rate

      Another form of skewing. Compared to power, the probability of drawing increasing values falls less steeply at the beginning, thus the probability of values at the end is lower.

      Example distribution with --test='!int sub=scale rate=0.3 size=10':

      value     0   1   2   3   4   5   6   7   8   9
      percent  30  19  12   9   7   6   5   4   3   2
      
  • suffix=str

    Suffix for text generator. Default is empty.

  • text=macro

    Macro which defines the word generator used as a word source by the text generator.

  • type=str

    At the schema level, add a custom type which will be recognized as such by the schema parser. At the attribute level, use the generator for this type.

  • unit=str

    The unit directive specifies the unit of the generated intervals. Possible values include s m h d mon y. Default is s, i.e. seconds.

  • word=file or word=:list,of,words

    The word directive triggers the word generator described above, or is used as a source of words by the text generator. Words are taken from the provided inline list or from the lines of a file. The default size is the size of the word list.

    If the file contents are ordered by word frequency, and the int generator is skewed (see sub), the first words can be made to occur more frequently.

  • xor=int

    The xor directive adds a non-linear xor stage for the int generator.

EXAMPLES

The first example is taken from pgbench. The second example is a didactic schema to illustrate directives. See also the library example in the tutorial above. As these schemas are embedded into this script, they can be invoked directly with the --validate option:

sh> datafiller.py --validate=pgbench -T --size=10 | psql pgbench
sh> datafiller.py --validate=comics -T --size=10 | psql comics
sh> datafiller.py --validate=library -T --size=1000 | psql library

PGBENCH SCHEMA

This schema is taken from the TPC-B benchmark. Each Branch has Tellers and Accounts attached to it. The History records operations performed when the benchmark is run.

-- TPC-B example adapted from pgbench

-- df regress: int sub=power alpha=1.5
-- df: size=1

CREATE TABLE pgbench_branches( -- df: mult=1.0
  bid SERIAL PRIMARY KEY,
  bbalance INTEGER NOT NULL,   -- df: size=100000000 use=regress
  filler CHAR(88) NOT NULL
);

CREATE TABLE pgbench_tellers(  -- df: mult=10.0
  tid SERIAL PRIMARY KEY,
  bid INTEGER NOT NULL REFERENCES pgbench_branches,
  tbalance INTEGER NOT NULL,   -- df: size=100000 use=regress
  filler CHAR(84) NOT NULL
);

CREATE TABLE pgbench_accounts( -- df: mult=100000.0
  aid BIGSERIAL PRIMARY KEY,
  bid INTEGER NOT NULL REFERENCES pgbench_branches,
  abalance INTEGER NOT NULL,   -- df: offset=-1000 size=100000 use=regress
  filler CHAR(84) NOT NULL
);

CREATE TABLE pgbench_history(  -- df: nogen
  tid INTEGER NOT NULL REFERENCES pgbench_tellers,
  bid INTEGER NOT NULL REFERENCES pgbench_branches,
  aid BIGINT NOT NULL REFERENCES pgbench_accounts,
  delta INTEGER NOT NULL,
  mtime TIMESTAMP NOT NULL,
  filler CHAR(22)
  -- UNIQUE (tid, bid, aid, mtime)
);

The integer *balance figures are generated with a skewed generator defined in macro regress. The negative offset setting on abalance will help generate negative values, and the regress skewed generator will make small values more likely.

If this is put in a tpc-b.sql file, then working test data can be generated and loaded with:

sh> datafiller.py -f -T --size=10 tpc-b.sql | psql bench

COMICS SCHEMA

This schema models Comics books written in a Language and published by a Publisher. Each book can have several Authors through Written. The Inventory tells on which shelf books are stored. Some of the books may be missing and a few may be available twice or more.

-- Comics didactic example.

-- Set default scale to 10 tuples for one unit (1.0).
-- This can be overwritten with the size option.
-- df: size=10

-- This relation is not re-generated.
-- However the size will be used when generating foreign keys to this table.
CREATE TABLE Language( --df: nogen size=2
  lid SERIAL PRIMARY KEY,
  lang TEXT UNIQUE NOT NULL
);

INSERT INTO Language(lid, lang) VALUES
  (1, 'French'),
  (2, 'English')
;

-- Define a char generator for names:
-- df chfr: int sub=scale rate=0.17
-- df name: chars='esaitnrulodcpmvqfbghjxyzwk' cgen=chfr length=8 lenvar=3

CREATE TABLE Author( --df: mult=1.0
  aid SERIAL PRIMARY KEY,
  -- There are 200 firstnames to draw from, most frequent 2%.
  -- In the 19th century John & Mary were given to >5% of the US population,
  -- and rates reached 20% in England in 1800. Currently firstnames are much
  -- more diverse, most frequent at about 1%.
  firstname TEXT NOT NULL,
    -- df: use=name size=200 sub=scale rate=0.02
  -- There are 10000 lastnames to draw from, most frequent 8/1000 (eg Smith)
  lastname TEXT NOT NULL,
    -- df: use=name size=10000 sub=scale rate=0.008
  -- Choose dates in the 20th century
  birth DATE NOT NULL,     -- df: start=1901-01-01 end=2000-12-31
  -- We assume that no two authors of the same name are born on the same day
  UNIQUE(firstname, lastname, birth)
);

-- On average, about 10 authors per publisher (1.0/0.1)
CREATE TABLE Publisher( -- df: mult=0.1
  pid SERIAL PRIMARY KEY,
  pname TEXT UNIQUE NOT NULL -- df: prefix=pub length=12 lenvar=4
);

-- On average, about 15.1 books per author (15.1/1.0)
CREATE TABLE Comics( -- df: mult=15.1
  cid SERIAL PRIMARY KEY,
  title TEXT NOT NULL,     -- df: use=name length=20 lenvar=12
  published DATE NOT NULL, -- df: start=1945-01-01 end=2013-06-25
  -- The biased scale generator is set to draw 1 95% of the time, others 5%.
  -- There are 2 language values because of size=2 on table Language.
  lid INTEGER NOT NULL REFERENCES Language, -- df: int sub=scale rate=0.95
  pid INTEGER NOT NULL REFERENCES Publisher,
  -- A publisher does not publish a title twice in a day
  UNIQUE(title, published, pid)
);

-- Most books are stored once in the inventory.
-- About 1% of the books are not in the inventory (skip).
-- Some may be there twice or more (15.2 > 15.1 on Comics).
CREATE TABLE Inventory( -- df: mult=15.2 skip=0.01
  iid SERIAL PRIMARY KEY,
  cid INTEGER NOT NULL REFERENCES Comics, -- df: sub=serand
  shelf INTEGER NOT NULL -- df: size=20
);

-- on average, about 2.2 authors per comics (33.2/15.1)
CREATE TABLE Written( -- df: mult=33.2
  -- serand => at least one per author and one per comics, then random
  cid INTEGER NOT NULL REFERENCES Comics, -- df: sub=serand mangle
  aid INTEGER NOT NULL REFERENCES Author, -- df: sub=serand mangle
  PRIMARY KEY(cid, aid)
);

WARNINGS, BUGS, FEATURES AND GRUMBLING

All software has bugs, this is software, hence it has bugs.

If you find one, please send a report, or even better a patch that fixes it!

BEWARE, you may lose your hair or your friends by using this script. Do not use this script on a production database, and never without dumping your data somewhere safe beforehand.

There is no SQL parser: tables and attributes are analysed with optimistic regular expressions.

Foreign keys cannot reference compound keys. That is a good thing, because you should not have compound foreign keys.

Handling of quoted identifiers is partial and may not work at all.

Beware that unique constraint checks for big data generation may require a lot of memory.

The script works more or less with Python versions 2.6, 2.7, 3.2, 3.3 and 3.4. However, Python compatibility between major or even minor versions is not the language's forte. The script tries to remain agnostic with respect to this issue at the price of some hacks. Sometimes it fails. Complain to Python's developers who do not care about script stability, and in some way do not respect other people's work. See the discussion in PEP387, which in my opinion is not always implemented.

Python is quite paradoxical, as it can achieve forward compatibility (from __future__ import ...) but often fails miserably at backward compatibility. Guido, this is the other way around! The useful feature is to make newer versions handle older scripts.

LICENSE

GNU GPLv3

Copyright 2013-2022 Fabien Coelho <fabien at coelho dot net>.

This is free software, both inexpensive and with sources.

The GNU General Public License v3 applies, see http://www.gnu.org/copyleft/gpl.html for details.

The summary is: you get as much as you paid for, and I am not responsible for anything.

If you are happy with this software, feel free to send me a postcard saying so! See my web page for current address http://www.coelho.net/.

If you have ideas or patches to improve this script, send them to me!

DOWNLOAD

This is datafiller.py version 2.0.1-dev (r856 on 2022-11-29).

Latest version and online documentation should be available from http://www.coelho.net/datafiller.html.

Download script: https://www.cri.ensmp.fr/people/coelho/datafiller.

VERSIONS

History of versions:

  • version 2.0.1-dev (r856 on 2022-11-29)

    In development.

    Make --drop option cascade with PostgreSQL. Add --freeze and --no-freeze options to trigger PostgreSQL COPY FREEZE option. Add license information in generated output. Add support for generating CSV and TSV (with inputs from Alex Reece). Add a message if word shifts its offset. Improved MySQL support by Alex Reece and Robert Lee (datetime, bigint...). Improved documentation and comments. Some code cleanup.

  • version 2.0.0 (r792 on 2014-03-23)

    • New generators

      Add regular expression pattern generator. Add luhn generator for data which use Luhn's algorithm checksum, such as bank card numbers. Add ean generator for supporting EAN13, ISBN13, ISSN13, ISMN13, UPC, ISBN, ISSN and ISMN types. Add file generator to inline file contents. Add uuid generator for Universally Unique IDentifiers. Add bit generator for BIT and VARBIT types. Add aggregate generators alt, array, cat, reduce, repeat and tuple. Add simple isnull, const and count generators. Add special share generator synchronizer, which allows generating correlated values within a tuple. Add special value generator which allows generating the exact same value within a tuple.

    • Changes

      Simplify and homogenize per-attribute generator selection, and possibly its subtype for int and float. Remove nomangle directive. Remove mangle directive from table and schema levels. Remove --mangle option. Improve chars directive to support character intervals with '-' and various escape characters (octal, hexadecimal, unicode...). Use --test=... for unit testing and --validate=... for the validation test cases. These changes are incompatible with prior versions and may require modifying some directives in existing schemas or scripts.

    • Enhancements

      Add a non-linear xor stage to the int generator. Integer mangling now relies on more and larger primes. Check that directives size and mult are exclusive. Add the type directive at the schema level. Improve inet generator to support IPv6 and not to generate by default network and broadcast addresses in a network; adding leading characters ,.; to the network allows changing this behavior. Add lenmin and lenmax directives to specify a length. Be more consistent about seeding to have deterministic results for some tests.

    • Options

      Add --quiet, --encoding and --type options. Make --test=... work for all actual data generators. Make --validate=... handle all validation cases. Add internal self-test capabilities.

    • Bug fixes

      Check directives consistency. Do ignore commented out directives, as they should be. Generate escaped strings where appropriate for PostgreSQL. Handle size better in generators derived from int generator. Make UTF-8 and other encodings work with both Python 2 and 3. Make it work with Python 2.6 and 3.2.

    • Documentation

      Improved documentation and examples, including a new "ADVANCED FEATURES" section in the "TUTORIAL".

  • version 1.1.5 (r360 on 2013-12-03)

    Improved user and code documentation. Improved validation. Add --no-filter option for some tests. Add some support for INET, CIDR and MACADDR types.

  • version 1.1.4 (r326 on 2013-11-30)

    Improved documentation, in particular add a "TUTORIAL" section and reorder some other sections. Add some pydoc in the code. Memoize enum regular expression. Raise an error if no generator can be created for an attribute, instead of silently ignoring it. Fix a bug which could generate primary key violations when handling some unique constraints.

  • version 1.1.3 (r288 on 2013-11-29)

    Improved documentation, including a new "KEYWORDS" section. Check consistency of parameters in float generator. Add some partial support for ALTER TABLE. Multiple --debug increases verbosity, especially in the parser stage. Support settings directives out of declarations. Better support of PostgreSQL integer types. Add separate "VERSIONS" section. One fix for python 3.3...

  • version 1.1.2 (r268 on 2013-06-30)

    Improved and simplified code, better comments and validation. Various hacks for python 2 & 3 compatibility. Make validations stop on errors. Check that lenvar is less than length. Fixes for length and lenvar overriding in string generator.

  • version 1.1.1 (r250 on 2013-06-29)

    Minor fix to the documentation.

  • version 1.1.0 (r248 on 2013-06-29)

    Improved documentation, code and comments. Add --test option for demonstration and checks, including an embedded validation. Add --validate option as a shortcut for script validation. Add seed, skip, mangle and nomangle directives. Add null directive on tables. Add nogen and type directives on attributes. Accept size 0. Change alpha behavior under float sub=scale so that the higher alpha, the more skewed towards 0. Add alternative simpler rate specification for scale and power integer generators. Deduce size when both start and end are specified for the date and timestamp generators. Add tz directive for the timestamp generator. Add float, interval and blob generators. Add some support for user-defined enum types. Add --tries option to control the effort to satisfy UNIQUE constraints. Add support for non integer foreign keys. Remove square and cube integer generators. Change macro definition syntax so as to be more intuitive. Add -V option for short version. Some bug fixes.

  • version 1.0.0 (r128 on 2013-06-16)

    Initial distribution.

KEYWORDS

Automatically populate, fill PostgreSQL database with test data. Generate test data with a data generation tool (data generator) for PostgreSQL. Import test data into PostgreSQL.

SEE ALSO

Relational data generation tools are often GUI or even Web applications, possibly commercial. I did not find a simple filter-oriented tool driven by directives, and I wanted to do something useful to play with python.