Script to analyze all entries in a table and suggest optimal column data types.

I’ve been working on a bash script that does this. I thought it was going to be really short and simple, but it is turning out to be quite complex. Before I continue with it, I want to know if such a thing already exists.

Basically I do a lot of queries to test each column like:

“select ‘false.’, `$field` from $TABLE where lpad(cast(0+`$field` as binary), length(`$field`), ‘0’) != `$field` limit 1;”

Which tests to see if the data is an INT ZEROFILL.

Is there already a solution for this? If not, would anyone else be interested in helping to develop this? I intended to create a source forge project for it once I got a working version. But, I’m not finding the time to complete it.

Okay, since no one is interested in helping me with it… Is anyone interested in using it when it is done?

Here is the basic outline. For each column in the table, analyze the values:
[] Decide if the column is currently BINARY.
] Get the length of the longest value.
[] If the values are numeric, then…[LIST=1]
] Decide whether they require zerofill.
[] Decide if they are unsigned.
] If they are all integers, decide what size is the best fit. Else…
[] If they are [fixed point] DECIMAL, decide their significant digits and number of decimal places. Else…
] Decide if DOUBLE or single precision FLOAT is required.
[] (optionally) Decide the significant digits and number of decimal places that best fit.
] If the values are temporal[LIST=1]
[] Decide which is the best fit. (YEAR, DATE, TIME, TIMESTAMP, or DATETIME in that order of precedence)
] If an ENUM would be appropriate, suggest it.
[] If a CHAR would be most efficient, suggest it (BINARY is needed).
] If the length you came up with first is < 255 (or 65535 post-MySQL 5.0.3) use VARCHAR (VARBINARY is needed). Else… Find the TEXT/BLOB size that best fits.

What’s missing:
[] BIT type (I don’t use it)
] SET type (I don’t want to code it) Spatial Types (I hope I never need it)

I hope that draws some interest. I’ll take your continued silence to mean that I’m an idiot and the only person in the community who would like a tool for doing this. :smiley:

The path I would take here is parsing the table schema, then contrasting the data type definitions with the actual data in the table.

This is a good idea. It would be worth adding.

My original purpose for the script was to serve my own need. I’m currently working for a company whose idea of data storage was delimited text files or excel files. I’ve loaded dozens of tables into MySQL from comma/tab delimited files. In the interest of speed, I generally create the tables with all varchar fields. These tables are currently being used by individuals in the company. When it is decided that an application is going to be built against data in one of these tables, I either optimize the table or I pull the needed columns out of many tables and build one streamlined table.

I’d like to be able to take all these nasty tables that I have (and continue to create) and clean them up. So, as it stands, i don’t much care about the current schema. I know what it is, and I know it is bad.

I think, You probably can start with use of this query:


Unfortunately that almost always just suggests that I make everything an ENUM. I think that should be a REALLY powerful tool, but turns out to be basically useless. (