A solution to a common data problem

In this post, I will explain what the database does that I am currently developing for the sea turtle researchers at Osa Conservation. In short, their problem is that have have multiple tables with sea turtle data, which are structured somewhat differently and which they want to bring together in one table. (See my previous post that describes this in detail.) Since, I could not find an existing software tool, I decided to build one that helps stitching their tables together.

Spot the differences… (fake data)

How does the tool help?

The table-stitching database – I haven’t found a good name for it so suggestions are welcome – helps solving this problem with the following strategy.

1. Changes are added to your data and thus become traceable, reviewable, and adaptable.

Instead of altering the problematic data, the table stitcher will generate and export a changed version of the data. Basically, you first import all your problematic data and then specify and export a revision with a uniform structure and formatting. All the changes are not made to the data itself, but administered as instructions next to the data. This way you can always see how the final result was generated.
An added benefit is that if the end result is not good enough, you can simply change the instructions and generate a new version of the final result (nothing is really final, not in life, not in databases).

Here a date is specified by connecting two columns with ‘-‘ and adding a year.

2. Integrated data-cleaning and formatting

The database allows you not only to make corrections to (that is, added to) individual cells in the import tables, it also gives you the possibility to change the formatting or regular mistakes in the data. For example, a column may contain dates in different formats, or someone often miss-spells a particular name. Advanced programming of specific transformations is also possible. With a simple test facility you can quickly see the result of the changed formatting and manual corrections.

This is the definition of step 3 of the calculation. Notice the data cleaning and substitution.

3. Generating multiple new tables

Depending on your data and your wishes, it could be that the tables that you want to stitch together contain data for multiple export tables. Perhaps some tables have extra columns that do not occur with others, so you could make a separate export of those columns only.

4. Documentation, documentation, documentation

While you are working with the database, you have ample opportunity to add meta-data, including descriptions of the imported tables and all their columns, and the export table(s) and their columns. It is not necessary, but advisable because it will help you later on when you need to explain everything to your reviewers, your successors, or yourself a couple of years later. The database also has plenty of possibilities to leave notes about all the import and export tables, their columns and the instructions of how to generate an export table.

Interested?

As I explained at the beginning, I am developing this database to help solve a problem for the sea turtle researchers, but I see potential for more use. It is in fact a perfect jump-board to start working on my ‘more databases‘ plan. However, I could do with some help.

If you have a problem like this, then please do contact me. We could help each other in different ways.

If you have this problem and would like to try out the database, then please do let me know. I am interested in having the database beta tested. You will need to either have a Mac computer, or a Windows computer with FileMaker Pro installed.

If you do have this problem, but don’t have time for testing the software, I would appreciate receiving sample data of your files. Just the first, say 100 rows would be enough. Please, also mention the magnitude of how many rows all tables have together (thousands, tens of thousands, … )

Finally, you could let me know how you would like to work with a database like this? Do you prefer to install software on your computer, or go on-line and work with the database in the cloud? Would you perhaps prefer not to do the work and make use of a service, i.e. I would be doing the work for you?

PS Technical restrictions

The database does have some restrictions.

It can only deal with data that is expressed in text and will only export data as text. Data expressed as text includes numbers, times and dates because they are written in text. But it can not deal with audio, video and images.

It exports as tab-text, coma separated values (.csv), or .xlsx, but even in .xlsx all data will be formatted as text. So in your spreadsheet software, you will have to manually set cell-formatting to the desired format, or try its import function to automatically determine the format.

At present, the maximum number of columns for import and export is 104. In spreadsheet terms, the last column that is imported is CZ.

A page-length version of this post and its predecessor can also be found here on my website for future reference.

.


2 thoughts on “A solution to a common data problem

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.