Creating a pleasant-to-use dataset


As a PhD student in robotics I've had to use several datasets, some more pleasant to use than others. This is a list of general guidelines, tips and often overlooked aspects that can make a dataset more pleasant to use. By pleasant to use I mean that as little time as possible is spent downloading, pre-processing and using the dataset, in order to maximize the time spent on actual research. Although the guidelines are mostly focused on robotics and computer vision datasets, most of them should be applicable to other domains as well.



Pre-process data at creation time


Do as much data pre-processing as possible at dataset creation time. Since the dataset is created only once but will be downloaded and used many times, any time spent on pre-processing during creation is saved many times over by the users of the dataset, who won't have to repeat this work every time.



Create datasets for computers, not humans


Prefer simple and fast loading by programs over easy inspection by humans. A dataset will be loaded by programs much more often than it will be inspected by humans. Optimizing the dataset for programs makes the implementation of dataset readers easier and reduces processing time.



Compress the data for distribution


Distributing uncompressed data results in slower downloads for users and increased storage space requirements for hosting the dataset. The ZIP format is a good candidate since it performs compression by default and can be easily decompressed on virtually any operating system.


If your dataset is larger than 4 GB it's a good idea to split it into multiple ZIP archives, both for compatibility with older filesystems and because smaller downloads are less likely to be interrupted. If you do split the dataset, prefer splitting it into independent ZIP files instead of a so-called split or spanned ZIP archive. A split ZIP archive requires all of the individual ZIP files in order to be decompressed, forcing users to download the whole dataset even if they're only interested in a small part of it.


On a related point, if the dataset consists of a series of files that must be processed in order and you split it into multiple ZIP archives, make sure the files are stored in order across the archives. That means that e.g. archive 0 contains files 0 to 999, archive 1 contains files 1000 to 1999 etc. This way users interested only in the beginning of the dataset can download just the first few ZIP archives.
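
For example, here's a minimal sketch of splitting a sequence of files into independent ZIP archives of 1000 files each, using Python's standard zipfile module. The frames directory and the file naming scheme are hypothetical:

```
import zipfile
from pathlib import Path

# Hypothetical layout: zero-padded frames image_000000.png, ... in ./frames
files = sorted(Path("frames").glob("image_*.png"))
chunk_size = 1000  # files per archive

for i in range(0, len(files), chunk_size):
    archive_name = f"dataset_part_{i // chunk_size:03d}.zip"
    # Each archive is independent so users can decompress any subset
    # of them without downloading the rest of the dataset.
    with zipfile.ZipFile(archive_name, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files[i:i + chunk_size]:
            zf.write(f)
```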



Select appropriate file formats


Select the appropriate file formats for your data. Prefer well-known and established file formats, as this ensures that there are libraries or code samples to read them in a large number of programming languages. Avoid creating custom file formats if possible, as users will likely have to write more code to read them. In most cases it's probably a bad idea to use a format that performs lossy compression, like JPEG, as this adds artifacts to the data.


PNG is a widely used and well supported format for image data with lossless compression. It can store a wide range of data since it supports color and grayscale images with or without an alpha channel. PNG images can be used to store any value that can be scaled to fit inside 1 to 4 integers, each 8 or 16 bits long:


Color images

Grayscale images

Depth images (typically scaled to millimeters and stored in a 16-bit grayscale image; see the sketch after this list)

Semantic instance or class IDs (stored in a grayscale image)

Normal maps (with x, y and z coordinates scaled and mapped to the red, green and blue channels)

Heightmaps (with some scaling factor)
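
As a sketch of the depth image case, a depth map in meters can be scaled to millimeters and stored as a 16-bit grayscale PNG using OpenCV and NumPy. The filename and the randomly generated depth values are placeholders:

```
import cv2
import numpy as np

# Hypothetical depth map in meters, as produced by a sensor or simulator.
depth_m = np.random.uniform(0.5, 5.0, (480, 640)).astype(np.float32)

# Scale to millimeters and store as a 16-bit grayscale PNG.
# Depths up to 65.535 m fit in a uint16 when expressed in millimeters.
depth_mm = np.round(depth_m * 1000.0).astype(np.uint16)
cv2.imwrite("depth_000000.png", depth_mm)

# Read it back with the unchanged flag to preserve the 16-bit values.
restored = cv2.imread("depth_000000.png", cv2.IMREAD_UNCHANGED)
restored_m = restored.astype(np.float32) / 1000.0
```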


TSV (Tab Separated Values) is a good choice for tabular data. Unlike CSV (Comma Separated Values), which requires quoting fields that contain commas and thus a specialized reader and writer, TSV can be read and written with simple string splitting and joining.
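
For instance, assuming no field contains tabs or newlines, TSV can be handled with plain string operations. The poses.tsv filename and its columns are hypothetical:

```
rows = [
    ("timestamp", "x", "y", "z"),
    ("4.007662458", "1.500", "0.200", "0.000"),
]

# Writing: join the fields with tabs, one row per line.
with open("poses.tsv", "w") as f:
    for row in rows:
        f.write("\t".join(row) + "\n")

# Reading: split each line on tabs, no quoting rules to handle.
with open("poses.tsv") as f:
    parsed = [line.rstrip("\n").split("\t") for line in f]
```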



Reduce the size of the data


Reducing the size of the data obviously reduces the download time and the space required to store it, which is already a worthwhile goal. What's less obvious is that it can also reduce the time required to process the dataset, since less data has to be moved around and processed. Some ways of reducing data size:


Remove all unnecessary metadata from files. While the gains for an individual file might be minuscule, datasets often consist of tens of thousands of files or more, at which point the size reduction becomes appreciable. Image metadata can be modified with exiftool, while audio and video metadata can be modified with ffmpeg.

Apply lossless compression to files. This mainly applies to image files. For JPEG images jpegoptim performs lossless optimization, while for PNG images there are pngcrush and optipng; see the sketch after this list. I've gotten some datasets down to 70-75% of their original size just by running optipng.

Remove unnecessary data. For example don't include an alpha channel in images where it doesn't make sense.
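
As a sketch, the lossless PNG optimization mentioned above could be applied to a whole dataset with a short script. It assumes the optipng binary is installed, and the dataset directory name is a placeholder:

```
import subprocess
from pathlib import Path

# Losslessly recompress every PNG in the dataset in place.
# -o2 selects a moderate optimization level; higher levels are slower
# but may compress slightly better.
for png in sorted(Path("dataset").rglob("*.png")):
    subprocess.run(["optipng", "-o2", str(png)], check=True)
```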



Use predictable units


Use units that will be familiar to users of your dataset. This of course depends on the domain of the dataset, but other datasets in the same domain are a good indication of what units users expect. If in doubt, use SI units. Make sure to also document all units; don't make users guess.


For example, in robotics position coordinates are typically expressed in meters. Even if the sensor or simulator you used to generate the dataset provides the position coordinates in some other unit, convert them to meters in the dataset.



Use predictable coordinate frames


This applies mainly to robotics and computer vision. The most common convention for the camera frame is z-forward x-right, while for the robot body it's x-forward z-up. Try to stick to those conventions in your dataset even if your original data uses a different one.
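
As an illustration, here's the rotation between the two conventions, assuming y-down completes the camera frame and y-left completes the body frame (the usual choices):

```
import numpy as np

# Rotation from an x-forward y-left z-up body frame to an
# x-right y-down z-forward camera frame.
# Each row is a camera axis expressed in body coordinates.
R_cam_body = np.array([
    [0.0, -1.0,  0.0],  # camera x (right)   = -body y (left)
    [0.0,  0.0, -1.0],  # camera y (down)    = -body z (up)
    [1.0,  0.0,  0.0],  # camera z (forward) =  body x (forward)
])

v_body = np.array([1.0, 0.0, 0.0])  # forward in the body frame
v_cam = R_cam_body @ v_body         # [0, 0, 1], forward in the camera frame
```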



Zero-pad numbers in filenames


Datasets often consist of a sequence of files that must be processed in order. If the file number is zero-padded then the lexicographical order of the filenames is the same as the order they are intended to be processed in. Lexicographical ordering is the default string sorting method in most programming languages, while numeric ordering typically requires extra parsing code.


For example, in lexicographical order image_10.png sorts before image_2.png, but with zero-padding image_10.png sorts after image_02.png as intended.
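
In Python this is a short format specifier. The six-digit width below is an arbitrary choice that just needs to be wide enough for the longest sequence in the dataset:

```
# Zero-pad frame numbers when writing files so that lexicographical
# and numeric order agree.
filenames = [f"image_{i:06d}.png" for i in range(1200)]

# sorted() uses lexicographical order, which now matches numeric order.
assert sorted(filenames) == filenames
```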



Write timestamps correctly


Datasets often need to include accurate timestamps. They are typically stored as an integer number of seconds plus an integer number of nanoseconds. When formatting the timestamp as a decimal number, make sure to pad the nanoseconds with the appropriate number of leading zeros. The same applies if you've got microseconds or milliseconds instead of nanoseconds.


For example, a timestamp of 4 seconds and 7662458 nanoseconds should be formatted as 4.007662458, not as 4.7662458.
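
A minimal formatting helper (the function name is hypothetical):

```
def format_timestamp(seconds: int, nanoseconds: int) -> str:
    # Pad the nanoseconds to 9 digits so that 4 s and 7662458 ns
    # become "4.007662458" instead of the incorrect "4.7662458".
    return f"{seconds}.{nanoseconds:09d}"

assert format_timestamp(4, 7662458) == "4.007662458"
```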



Add a version number


It's quite likely that some errors will make it into the initial release. Including a version number in the dataset lets users know that the dataset has changed since they last downloaded it. It also makes results reproducible, since the exact version of the dataset used can be cited. The version number can be as simple as a v1, v2 etc. in the name of the distributed file, plus a changelog on the dataset website listing the changes made in each version.



Allow non-interactive downloads


Allow users to download the dataset in a non-interactive manner, for example by passing several URLs to wget. This is especially important for large datasets which may consist of hundreds or thousands of individual files. Even better, provide a text file with the URLs of all files in the dataset so users don't have to scrape them off a webpage.
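
For example, given such a text file with one URL per line, the whole dataset can be fetched non-interactively with wget -i urls.txt, or with a short script like this sketch (urls.txt is a hypothetical name):

```
import urllib.request

# Download every file listed in urls.txt, one URL per line.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    # Derive the local filename from the last path component of the URL.
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
```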


Whether a dataset can be downloaded in a non-interactive manner mainly depends on where it's hosted. Cloud hosting services often don't allow this at all, or generate URLs that are some opaque hash of the file they point to, making it difficult to automatically generate a list of URLs. The safest bet is serving static files from a webserver with predictable URLs like http://example.com/cool-dataset/sequence_002.zip.



Provide helper scripts and sample code


If it makes sense for your dataset, provide helper scripts or sample code for common operations. This makes it easier for users to get started with the dataset and also acts as a form of implicit documentation. Sample code is even more important if the dataset contains any custom file formats, so users won't have to write readers from scratch.


Sotiris 2022-11-11
