Data Creation for AI Training
Data creation defined
Finding the right image and/or video data for computer-vision-based machine learning projects can be challenging. The vast majority of data for AI training is collected from open sources such as ImageNet and Google Open Images. Datasets assembled from online repositories are then annotated using annotation platforms, according to the specific needs of the project. Image and video annotation is an important stage in the development of AI models, as it gives machine learning engineers the opportunity to establish labeling best practices and to decide how to handle obscured or unclear objects in images.
Whilst data collection resources are sufficient for many AI projects, there are times when it can be hard to find images and video that are suitable for very specific training needs. When this is the case, it is necessary to use cameras and production facilities to produce new images and videos for AI training datasets. This is called data creation. This piece will examine when it is necessary to create data, how data creation works in practice, and how best to access data creation services.
What makes a quality dataset
Data creation is one way to achieve high-quality datasets that are fit for purpose. To be usable for machine learning, training images and videos must possess the following characteristics: quality, quantity, density, and variance. Bespoke image and video creation can play a part in securing all of these important qualities for training data:
- Quality: Image quality can be determined by a number of factors. High-definition images and video are much easier to annotate. Clearly distinguishable objects, with defined boundaries, result in high-quality labeled images. Some raw image data can also feature artefacts and oddities as a result of human error. Poor framing of objects could result in unusable training data outputs.
- Quantity: For machine learning models, too much data is rarely a problem. However, a dearth of quality images and video often is. The challenge for most companies is ensuring that the flow of training data is sufficient to avoid bottlenecks in development.
- Density: Images must feature a large enough number of objects to reflect more chaotic real-world situations. Underpopulated training images may result in poorly performing systems.
- Variance: This metric also correlates with the real world. Dataset images and video need to contain a varied collection of objects that might be of relevance to a computer vision model. Training data must reflect the diverse range of ontologies that make up real environments.
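The four characteristics above can be turned into rough numbers before deciding whether new data needs to be created. The following is a minimal sketch, not a standard tool: the record format (`width`, `height`, `labels`), the function name `dataset_summary`, and the resolution thresholds are all assumptions made for illustration. It uses resolution as a crude proxy for quality, image count for quantity, objects per image for density, and the number of distinct classes for variance.

```python
from collections import Counter
from statistics import mean


def dataset_summary(records, min_width=1280, min_height=720):
    """Summarise annotation records (hypothetical format) along four axes."""
    if not records:
        raise ValueError("records must not be empty")

    # Count how often each object class appears across the whole dataset.
    label_counts = Counter(label for r in records for label in r["labels"])

    return {
        # Quantity: number of annotated images.
        "quantity": len(records),
        # Quality proxy: share of images at or above the assumed resolution.
        "high_res_fraction": mean(
            1.0 if r["width"] >= min_width and r["height"] >= min_height else 0.0
            for r in records
        ),
        # Density: average number of labeled objects per image.
        "mean_objects_per_image": mean(len(r["labels"]) for r in records),
        # Variance proxy: how many distinct classes appear, and how often.
        "distinct_classes": len(label_counts),
        "class_counts": dict(label_counts),
    }


# Illustrative records only; real annotation exports will look different.
records = [
    {"width": 1920, "height": 1080, "labels": ["car", "car", "pedestrian"]},
    {"width": 640, "height": 480, "labels": ["car"]},
]
summary = dataset_summary(records)
print(summary)
```

A low `high_res_fraction`, a low `mean_objects_per_image`, or a heavily skewed `class_counts` would each point to a different kind of gap that bespoke data creation might fill.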
Overcoming bias with data creation
Data creation is essential when the necessary images and videos cannot be found through traditional sources. This can be because the computer vision project being undertaken is particularly novel. However, data creation is also a means of overcoming bias. Datasets that do not reflect complex aspects of real-world environments can lead to models that do not function optimally in all conditions. Creating images and videos from scratch is a way of ensuring varied training data that reflects a changing and diverse world:
- Light conditions: Most training data for computer vision models is captured during the day, in good light conditions. Many AI systems are also deployed predominantly in daytime, with sufficient light to operate. However, for many use cases (autonomous vehicles, drones) it is essential that models function effectively in low light and even at night. This means that training data must reflect differing light levels. Sometimes the only way to achieve this variation is via data creation.
- Weather: Data creation may also be necessary to reflect varied weather conditions. Weather patterns and climate are, of course, regionally specific. Training data created exclusively in predominantly dry climates may not reflect the low visibility caused by rain or fog in other locations. Images and video for safety-conscious AI systems must represent a wide range of weather types.
- Ethnic and gender diversity: For AI applications that are tasked with identifying humans and analyzing their behaviour, it is essential that AI training data reflects the diversity present in all human populations. Datasets that do not represent diverse ethnic and gender identities may lead to models that do not function well, or that act in an illegal or discriminatory manner.
- Cultural contexts: Different cultural norms may require data creation. For example, models trained to recognize western cutlery and eating practices may be confused by chopsticks. For AI models to function across the globe it is important that they are trained with images and videos that represent the particularities of each culture. Data creation can help capture these differences as they emerge during development.
- Road conditions: Automated vehicles are designed to operate on roads in any context. Data creation can help to capture road conditions that vary from the typical North American context. Driving on the left, differing road markings, different coloured road material, and varied street furniture all contribute to the complexity and diversity of roads across the world, which can be challenging to incorporate in training data.
- Signage: Training images must capture the specific road signage used in every country in which AI models operate. Data creation services may be the only means of capturing this complexity.
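One way to see where such bias might sit in an existing dataset is to count images per combination of conditions and flag the combinations that are thin or missing. The sketch below is illustrative only: the metadata tags (`light`, `weather`), the function name `coverage_gaps`, and the threshold are assumptions, and real datasets may not carry this metadata at all.

```python
from collections import Counter
from itertools import product


def coverage_gaps(metadata, expected, min_count):
    """Return condition combinations that have fewer than min_count images.

    metadata: one dict of condition tags per image (hypothetical format).
    expected: the values each condition should cover, e.g. day and night.
    """
    attributes = list(expected)

    # Count images for each combination of tags actually present.
    counts = Counter(tuple(m[a] for a in attributes) for m in metadata)

    # Walk every combination we expect to see, including ones with no images.
    return {
        combo: counts[combo]
        for combo in product(*(expected[a] for a in attributes))
        if counts[combo] < min_count
    }


# Illustrative tags only.
metadata = [
    {"light": "day", "weather": "clear"},
    {"light": "day", "weather": "clear"},
    {"light": "day", "weather": "rain"},
    {"light": "night", "weather": "clear"},
]
expected = {"light": ["day", "night"], "weather": ["clear", "rain", "fog"]}

gaps = coverage_gaps(metadata, expected, min_count=2)
for combo, count in gaps.items():
    print(combo, count)
```

Listing the expected values explicitly matters here: a condition that never appears in the data, such as night-time fog in this example, would otherwise go unnoticed. The flagged combinations are candidates for a targeted data creation shoot.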
Data creation in practice
Data annotation services are often best placed to create the image and video data that today’s AI innovators need. Keymakr is an annotation provider with data creation expertise. By combining annotation experience and production facilities, Keymakr can support AI projects with practical, affordable data creation:
- Data for autonomous vehicles: Data annotation providers, like Keymakr, have access to production facilities in different countries. This makes accessing diverse image and video data more straightforward for autonomous vehicle AI developers.
- In-cabin AI: In-cabin AI allows monitoring systems to watch and interpret human behaviour. This means that AI systems can tell if somebody is falling asleep or is impaired in some other way. In-cabin AI can also keep track of objects in the car, and warn drivers if they leave something behind. Creating training data for these models means filming individuals in cars in a variety of contexts. By creating data featuring people of different ethnicities, driving at night and during the day, Keymakr can broaden the scope and improve the quality of training datasets.
- Workplace settings: AI technology is increasingly being brought into the workplace as a way of helping employees and improving productivity. However, finding high-quality images and video from inside the average office can be a challenge. In-house production capacity enables data annotation providers, like Keymakr, to create images and video that reflect a variety of office or workplace contexts. Working with AI developers means that data of this kind can be tailored to the precise specifications of any model.
- Medical contexts: By committing to data security and privacy regulations, annotation providers can produce bespoke images and video for medical AI training. Expert verification is also crucial for this kind of data creation.
- Retail AI: Data from real-world retail locations can be of low quality, particularly video data. High-quality, effective retail AI training data can be created by using high-definition cameras and even film sets.
- Fitness applications: Fitness applications are often required to identify and analyze specific body movements. Finding these particular exercises in open source data pools can be difficult. Data creation facilities, like those at Keymakr, allow fitness AI companies to define exactly the movements and images that they need to be captured and annotated.
Creating bespoke datasets
Data creation allows AI developers to fine-tune training datasets. This leads to streamlined research and higher-functioning end models. Keymakr offers unique data creation facilities and expertise to computer vision innovators.