The Essentials of Dataset Creation
Computer vision-based AI projects are heavily reliant on high-quality image and video data. In order to train the next generation of AI models, accurately annotated image and video datasets must be assembled at the scales that developers require. But before the annotation process can begin, companies need to ensure that they have access to large enough quantities of useful raw data. This is where dataset creation comes in.
This blog will cover the characteristics of a good dataset, before going on to detail the ways in which third-party providers and image and video annotation tools, like Keylabs, can ease the burden of data collection and creation.
Quality Datasets: The Basics
In order to reach the threshold of functionality for machine learning, training images and video must possess the following characteristics: quality, quantity, density, and variance. To show what these characteristics mean in practice, let's look at how they would be applied to a dataset for a customer-monitoring computer vision system in the retail sector:
- Quality: Image quality can be determined by a number of factors. High-definition images and video, in this case footage from a supermarket, are much easier to annotate. Clearly distinguishable objects with defined boundaries result in high-quality labeled images. Some raw image data can also feature artefacts and oddities as a result of human error. Poor framing of objects, for example, could result in unusable training data.
- Quantity: For machine learning models, too much data is rarely a problem. However, a dearth of quality images and video often is. The challenge for most companies is ensuring that the flow of training data is sufficient to avoid bottlenecks in development.
- Density: Images must feature a large enough number of objects to reflect more chaotic real-world situations. One customer strolling down the aisle is not a true representation of a busy supermarket. Underpopulated training images may result in poorly performing systems.
- Variance: This metric also correlates with the real world. Dataset images and video need to contain a varied collection of objects that might be of relevance to a computer vision model. In the retail example, this means different types of customers, different types of products, shopping carts, and wheelchairs: the diverse range of object classes that make up the retail experience.
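As a rough illustration of how these four characteristics might be checked before annotation work scales up, the sketch below summarises a small, already-labeled sample with simple proxy metrics. Everything in it is an assumption made for illustration: the `ImageRecord` structure, the `audit_dataset` function, the class names, and the 1280x720 resolution threshold are hypothetical, and resolution is only a crude stand-in for image quality.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    """Hypothetical record describing one annotated image."""
    width: int
    height: int
    # One class name per annotated object in the image.
    labels: list[str] = field(default_factory=list)


def audit_dataset(records, min_width=1280, min_height=720):
    """Summarise quality, quantity, density, and variance with proxy metrics.

    The resolution threshold is an assumed value, not a standard.
    """
    if not records:
        raise ValueError("dataset is empty")

    total = len(records)
    # Quality proxy: share of images at or above the assumed resolution.
    high_res = sum(
        1 for r in records if r.width >= min_width and r.height >= min_height
    )
    # Variance proxy: how many distinct classes appear, and how often.
    class_counts = Counter(label for r in records for label in r.labels)

    return {
        "quantity": total,
        "quality_share": high_res / total,
        # Density proxy: average number of annotated objects per image.
        "mean_objects_per_image": sum(len(r.labels) for r in records) / total,
        "distinct_classes": len(class_counts),
        "class_counts": dict(class_counts),
    }


# Hypothetical supermarket sample.
records = [
    ImageRecord(1920, 1080, ["customer", "customer", "shopping_cart"]),
    ImageRecord(640, 480, ["customer"]),
    ImageRecord(1920, 1080, ["customer", "wheelchair", "product", "product"]),
]
report = audit_dataset(records)
print(report)
```

A report like this will not replace visual review, but it can flag a sample that is too small, too sparse, or dominated by a single class before time is spent annotating the full dataset.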
Assembling and Annotating Datasets with the Help of Third Party Providers
There are a number of avenues that companies can go down to assemble datasets that meet the requirements detailed above. Companies can turn to open datasets, which can be found through resources such as Google’s Dataset Search. These datasets are often free, and are usually pre-labeled. However, because the data is designed for other use cases, it may not fit the specifics of a given company's AI project.
This could lead to lost time in quality control and validation. Companies can also create their own data in-house. This custom data will be better suited to their model’s needs. However, the time and resources devoted to the management of annotation can be prohibitive.
Increasingly, companies are looking to partner with third-party data annotation service providers to create high-quality datasets. Providers such as annotation specialist Keymakr can take on the burden of raw data collection and creation. Images can be scraped from the web, or collected through negotiation with image vendors.
If the data required by a project does not exist, then professional services can create it, utilising production studios and teams of photographers. This collaborative approach can be more costly. However, it removes the burden of management and data assembly from the AI company and helps ensure that the annotated training data will be in line with the needs of the model.
The final step in successful dataset creation is ensuring that the right annotation tools are being brought to bear on the raw data. Keylabs is a state-of-the-art annotation tool that provides a user-friendly experience alongside a full suite of labeling features and management options.
Contact a team member to book your personalized demo today.