guides
Data Labeling: The Authoritative Guide
The quality of your data and labels determines how well your ML models perform. This guide covers what you need to know to obtain the best possible labels.
contents
Data Labeling for Machine Learning
![Data Labeling for Machine Learning](/_next/image?url=%2Fassets%2Fguides%2Flabel1.webp&w=2048&q=75)
Machine learning has transformed how we solve problems in computer vision and natural language processing. Driven by massive volumes of data, machine learning algorithms are exceptionally good at learning from data, identifying patterns, and producing insightful predictions, all without explicit programming.
Computer vision models trained on vast volumes of visual data can recognize objects with high accuracy. They don't need human programmers to expressly define how to identify faces, cars, or fruit; they learn it from the data.
Similarly, today's chatbots and voice assistants run on natural language processing models. Trained on vast quantities of text and audio data, these models can translate between languages, recognize speech, and understand the context of written information.
Rather than hand-coding these skills into software, machine learning engineers train these models on vast amounts of clean, relevant data. For models to produce useful predictions, that data must be labeled. Data labeling is a crucial but often neglected part of machine learning.
The purpose of this guide is to offer a thorough reference for data labeling and to share best practices drawn from Refonte.Ai's extensive experience solving the hardest data labeling problems.
What is Data Labeling?
Data labeling is the process of giving context or meaning to raw data so that machine learning algorithms can use the labels to learn and produce the intended output.
To better understand data labeling, we will go over the main forms of machine learning and the kinds of data that need to be labeled. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Each is covered in more detail in Why is Data Annotation Important? below.
Supervised machine learning algorithms use large volumes of labeled data to "train" neural networks or models to find patterns in the data that are relevant to a particular application. Data labelers add ground truth annotations to the data, and machine learning engineers feed that data into a machine learning system. For instance, data labelers for an autonomous vehicle object detection model will label every car in a scene. The machine learning model then analyzes the labeled dataset and learns to recognize the relevant patterns, after which it can make predictions on previously unseen data.
Types of Data
Structured vs. Unstructured Data
Structured data is highly ordered information of the kind found in spreadsheets and relational databases (RDBMS). Examples include customer records, phone numbers, social security numbers, revenue figures, serial numbers, and product descriptions.
Unstructured data, which includes things like photos, videos, LiDAR, radar, some text data, and audio data, is data that has not been organized according to predetermined schemas.
Images
Camera sensors first produce data in a raw format, which is then converted to .png or, better yet, .jpg files. The latter are compressed and require less storage than .png files, an important consideration when working with the massive volumes of data required to train machine learning models. Image data is also gathered by third-party services or scraped from the internet. Numerous applications, such as facial recognition, industrial defect detection, and diagnostic imaging, rely on image data.
![Camera sensors](/_next/image?url=%2Fassets%2Fguides%2Flabel2.webp&w=2048&q=75)
Videos
Video data is also captured in raw format from camera sensors and is composed of a sequence of frames saved in .mp4, .mov, or other video file formats. Because of its smaller file size, .mp4 is a standard in machine learning applications, much like .jpg for image data. Video data makes applications like fitness apps and driverless cars possible.
![Camera sensors](/_next/image?url=%2Fassets%2Fguides%2Flabel3.webp&w=3840&q=75)
3D Data (LiDAR, Radar)
Machine learning models can better comprehend a scene thanks to the use of 3D data, which helps models overcome the lack of depth information from 2D data sources like conventional RGB camera sensors.
![3D Data](/_next/image?url=%2Fassets%2Fguides%2Flabel4.webp&w=1920&q=75)
LiDAR (Light Detection and Ranging) is a remote sensing technique that uses light to create accurate three-dimensional representations of scenes. LiDAR data is saved as point clouds in raw and .las file formats and is typically converted to JSON before being processed by machine learning models.
Radar (Radio Detection and Ranging) is a remote sensing technique that uses radio waves to measure an object's distance, angle, and radial velocity relative to the radar source.
Text
Text data is information represented by characters and is typically kept in files with the extensions .txt, .docx, or .html. Natural Language Processing (NLP) applications powered by text include automatic translation, text-to-speech, speech-to-text, document information extraction, and virtual assistants that answer your questions.
Why is Data Annotation Important?
Machine learning powers innovative applications, and those applications are only possible with massive amounts of high-quality data. To appreciate the significance of data labeling, it is essential to understand the distinctions between supervised, unsupervised, and reinforcement learning.
Reinforcement Learning uses algorithms that learn to act in a way that maximizes a reward in a given environment. For example, DeepMind's AlphaGo used reinforcement learning, playing games against itself to become the strongest Go player in history. Reinforcement learning optimizes a reward function rather than learning from labeled data.
Supervised Learning vs. Unsupervised Learning
Supervised learning powers the most popular and powerful machine learning applications, such as spam detection and the ability of self-driving cars to recognize people, cars, and other obstacles. Supervised learning uses a significant amount of labeled data to train a model that can correctly classify data or predict outcomes.
Unsupervised learning powers systems like recommendation engines by assisting in the analysis and clustering of unlabeled data. These models do not require labeled data to "teach" the algorithm the expected outputs; instead, they learn from the properties of the dataset itself. K-means clustering is a popular method that seeks to partition n observations into k clusters, assigning each observation to the cluster with the nearest mean.
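For example, here is a minimal k-means sketch using scikit-learn; the two-dimensional feature matrix is synthetic and purely for illustration.

```python
# A minimal k-means clustering sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.cluster import KMeans

# 200 observations with 2 features each, e.g. embeddings of unlabeled items
X = np.random.rand(200, 2)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment for the first 10 observations
print(kmeans.cluster_centers_)  # the k cluster means
```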
While unsupervised learning has many wonderful uses, supervised learning has led to the majority of high-impact applications because of its superior accuracy and predictive power.
Data-centric artificial intelligence is a new paradigm that machine learning practitioners have created as a result of their focus shifting from improving models to enhancing data. ML code makes up a very small portion of real-world machine learning systems. To power greater AI, more precise data labeling and high-quality data are needed. Understanding every step of a well-defined data pipeline—from data collection techniques to data labeling and curation—is crucial as approaches to building more accurate machine learning models become more data-centric.
To help you get the most out of your data and, consequently, the most out of your models, this guide focuses on the most popular categories of data labels and on best practices for achieving high label quality.
How to Annotate Data
To build supervised learning models of the highest caliber, you need a lot of data with high-quality labels. How, then, should data be labeled? First, you must decide who will annotate your data. Labeling teams can be built in a variety of ways, each with its own advantages, disadvantages, and considerations. Let us start by discussing whether to use human labeling, automated labeling, or a combination of the two.
1. Choose Between Humans vs. Machines
Automated Data Labeling
Automating or semi-automating data labeling is feasible for big datasets containing known object types. Custom machine learning models trained to identify particular types of data automatically label the dataset.
You can only make use of automated data labeling after creating early, high-quality ground-truth datasets. Even with high-quality ground truth, it can be difficult to account for all edge cases and to fully trust automated data labeling to produce the best quality labels.
Human Only Labeling
Humans are extraordinarily good at many of the modalities we care about for machine learning applications, like vision and natural language processing. In many domains, human data labeling produces higher-quality labels than automated labeling.
However, because human judgment can be subjective, it can be difficult to train people to label the same data consistently. Moreover, because humans are relatively slower, the cost of human labeling for a given task can be higher than that of automated labeling.
Human in the Loop (HITL) Labeling
Human-in-the-loop labeling uses people's highly specialized skills to support automated data labeling. HITL data labeling can take the form of automatically labeled data audited by humans, or of active tooling that improves the quality and efficiency of human labeling. Automated labeling with a human in the loop is almost always more accurate and efficient than either approach alone.
![Human-in-the-loop](/_next/image?url=%2Fassets%2Fguides%2Flabel5.webp&w=1920&q=75)
2. Assemble Your Labeling Workforce
If you decide to include humans in your data labeling workflow, which we strongly advise, you will need to figure out how to source your labeling workforce. Will you scale with a third-party labeling company, recruit a team internally, or persuade friends and relatives to label your data for free? Below, we offer a framework to help you make this choice.
In-House Teams
Small businesses might not have the funding to dedicate a large amount of resources to data labeling; as a result, all team members—including the CEO—may have to label data themselves. This method might work for a small prototype, but it is not scalable.
Large, well-funded companies may decide to keep internal labeling teams in order to maintain control over the whole data flow. Although this method offers a great deal of control and flexibility, it is costly and requires a lot of management.
Businesses that handle sensitive data or have privacy concerns may also decide to use internal labeling teams. Although this is a perfectly reasonable strategy, it can be challenging to scale.
Pros: Strict control over data pipelines and subject matter knowledge
Cons: High costs associated with supervising and training labelers
Crowdsourcing
Crowdsourcing platforms offer a rapid and simple way to have a huge pool of people execute a variety of tasks quickly. These platforms work well for labeling data that doesn't raise privacy issues, such as publicly available datasets with straightforward instructions and annotations. However, the unskilled resource pool from crowdsourcing sites is a poor choice if more complicated labels are required or if sensitive data is involved. Crowdsourcing platforms often rely on workers who are poorly trained and lack domain expertise, which results in low-quality labeling.
Pros: Availability of a bigger pool of labelers
Cons: Questionable quality; high training and administrative costs for labelers
3rd Party Data Labeling Partners
Third-party data labeling organizations are proficient in producing high-quality data labels and frequently possess extensive knowledge of machine learning. These businesses can serve as your technical partners, offering guidance on the best ways to gather, organize, and classify your data as well as best practices for the full machine learning lifecycle. These businesses provide high-quality labels at competitive prices thanks to their highly skilled resource pools, cutting-edge automated data labeling operations, and sophisticated toolkits.
A huge workforce (1,000+ data labelers on any particular project) is needed to achieve exceptionally high quality (99%+) on a large dataset. It is challenging to scale to this volume at this quality using crowdsourcing platforms and internal teams. However, these companies can also be expensive and, if they are not acting as a trusted advisor, can convince you to label more data than you may need for a given application.
Pros: Excellent quality, low cost, and technical know-how; the best data labeling businesses have certifications such as SOC 2 and HIPAA that are pertinent to their industry
Cons: Less direct control over the labeling process; sensitive data requires a trusted partner with the necessary credentials
3. Select Your Data Labeling Platform
After deciding who will label your data, you must choose a data labeling platform. Here you can choose from a variety of approaches, such as building tools in-house, using open-source tools, or using paid labeling platforms.
Open Source Tools
These tools are available for free to anybody, with minor restrictions on commercial use. They are excellent for learning about and developing machine learning and AI, working on personal projects, and testing early commercial AI applications. The trade-off for being free is that they lack some of the sophistication and scalability of paid platforms, and they might not support all of the label types covered in this guide.
Many excellent open source alternatives might not be included in this list, which is intended to be representative but not exhaustive.
- CVAT: CVAT is a free, open-source, web-based data labeling tool originally created by Intel. CVAT supports numerous common label shapes, such as cuboids, polygons, and rectangles, and its collaboration features work well for smaller or beginner projects. However, the hosted web version limits users to 500 MB of data and 10 projects per person; CVAT can be run locally to circumvent these limits.
- LabelMe: LabelMe is a free, open-source data labeling tool developed by CSAIL that facilitates community participation in computer vision research datasets. After installing the tool, you can label your own data and contribute to other projects by labeling public datasets. LabelMe is quite limited compared to CVAT, and the web version no longer accepts new accounts.
- Stanford Core NLP: Stanford's CoreNLP is a feature-rich open-source platform for natural language processing and NLP labeling that includes text processing, Named Entity Recognition (NER), linking, and other features.
In-house Tools
Some major firms choose to build their own tools in-house to have more control over their ML pipelines. It is entirely up to you which functionality to develop to serve your use cases and solve your particular problems. However, this strategy is expensive, and these tools will need ongoing updates and maintenance to stay current with emerging technologies.
Commercial Platforms
Commercial platforms help you expand by providing expert labeling workforces, specialized support, and top-notch tools. They can also offer advice on machine learning and labeling best practices. Providing support to a large number of clients raises the standard of the platforms for all users, giving you access to cutting-edge features that open-source or in-house labeling systems would not offer.
Refonte.Ai Studio, the top commercial platform on the market, provides best-in-class labeling infrastructure to speed up your team, labeling tools to support any use case, and orchestration to maximize worker performance. Analyze, track, and improve the quality of your data with ease.
High-Quality Data Annotations
To get the most out of your machine learning applications, you must maximize the quality of your data labels, regardless of the annotation platform you choose.
Since data is the main input for machine learning, the old computer science principle "garbage in, garbage out" is very relevant: if your data or labels are of poor quality, you will get subpar results. Our goal is to present the most important quality metrics and discuss best practices so you can get the most out of your labeling quality.
Different Ways to Measure Quality
We cover some of the most critical quality metrics and then discuss best practices to ensure quality in your labeling processes.
Label Accuracy
It's critical to evaluate how closely labels match your expectations and guidelines. Consider a scenario in which you have assigned data labelers the task of marking pedestrians and directed them to label only objects that are carried, like a phone or backpack, not objects that are pushed or pulled. As you review sample tasks, do you find pushed strollers and luggage included in the annotations, indicating that the instructions are not being followed?
How accurately do labelers perform on benchmark tasks? Benchmark tasks are tests used to gauge the overall correctness of your labelers and your confidence in the quality of the rest of the labeled data. Is labeling consistent across labelers and across types of data? If label accuracy is inconsistent across labelers, your instructions may be unclear or your labelers may need more training.
Model Performance Improvement
How accurate is your model at the task at hand? Labeling quality is an important consideration, but model performance does not depend on it alone; both the quantity and the quality of data matter.
Now let's go over some of the most important metrics for model performance.
Precision
Precision defines what proportion of positive identifications were correct and is calculated as follows:
Precision = True Positives / (True Positives + False Positives)
A model that produces no false positives has a precision of 1.0.
Recall
Recall defines what proportion of actual positives were identified correctly by the model, and is calculated as follows:
Recall = True Positives / (True Positives + False Negatives)
A model that produces no false negatives has a recall of 1.0.
![Recall defines](/_next/image?url=%2Fassets%2Fguides%2Flabel6.webp&w=1920&q=75)
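As a quick illustration, the sketch below computes both metrics from raw counts; the example counts are made up.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Proportion of positive predictions that were correct."""
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    """Proportion of actual positives the model identified."""
    return true_positives / (true_positives + false_negatives)

# Example: 90 correct detections, 10 false alarms, 30 missed objects
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```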
Precision Recall Curve
A model with high recall but low precision produces a large number of results, but many of its predicted labels are incorrect when compared to the ground truth labels. Conversely, a model with high precision but low recall produces very few results, but most of its predicted labels match the ground truth. A perfect model has both excellent precision and excellent recall, yielding many correctly classified results.
Compared to a single statistic, precision recall curves offer a more comprehensive picture of model performance. There is a trade-off between precision and recall, and the precise numbers you want will depend on your model and its use.
For a diagnostic imaging application that detects cancerous tumors, for example, higher recall is preferred: it is better to incorrectly flag a non-cancerous tumor as cancerous than to classify a dangerous tumor as non-cancerous.
Conversely, applications like spam filters need very high precision to ensure that crucial emails are not mistakenly tagged as spam, even though this may allow more real spam to reach the inbox.
![Precision recall](/_next/image?url=%2Fassets%2Fguides%2Flabel6.webp&w=1920&q=75)
Intersection Over Union (IoU)
Intersection over Union (IoU) is an evaluation metric that provides an indirect way to verify label quality in computer vision applications.
By calculating the ratio of the predicted label's area of overlap to the area of the union of the predicted label and the ground truth label, IoU evaluates how accurate a predicted label is in comparison to the ground truth label.
The better trained the model is, the closer this ratio is to 1.
![IoU compares](/_next/image?url=%2Fassets%2Fguides%2Flabel8.webp&w=1920&q=75)
As we covered earlier, high-quality data and labels are crucial for model training, so IoU serves as a proxy for the quality of data labels. Depending on the type of object, you may target a different quality threshold. For example, if you are developing an augmented reality application that emphasizes human interaction, you might require an IoU of 0.95 for detecting faces but only 0.70 for detecting dogs.
Once more, it is crucial to keep in mind that IoU might also be impacted by other data-related variables, such as bias in the dataset or inadequate data.
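Below is a small sketch of how IoU can be computed for two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates; the coordinate convention and example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Predicted box vs. ground truth box
print(iou((10, 10, 110, 110), (20, 20, 120, 120)))  # ~0.68
```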
Confusion Matrices
Confusion matrices are a straightforward yet effective way to understand class confusion in your model. A confusion matrix is a grid that compares actual (ground truth) classes against predicted classes. By looking at the confusion matrix, you can quickly spot misclassifications, such as your model predicting a traffic sign when the actual object is a train.
![Confusion matrices](/_next/image?url=%2Fassets%2Fguides%2Flabel9.webp&w=640&q=75)
When these confusions are combined with confidence scores, it becomes easy to rank them by focusing on cases where the model is highly confident in an incorrect prediction. Class confusions are often caused by missing or incorrect labels, or by insufficient data covering the confused classes, such as trains and traffic signs.
![confusions with confidence](/_next/image?url=%2Fassets%2Fguides%2Flabel10.webp&w=1920&q=75)
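For illustration, the sketch below builds a small confusion matrix with scikit-learn; the class names and example labels are made up.

```python
# A minimal confusion-matrix sketch (assumes scikit-learn is installed).
from sklearn.metrics import confusion_matrix

ground_truth = ["train", "train", "traffic sign", "car", "car", "traffic sign"]
predictions  = ["train", "traffic sign", "traffic sign", "car", "car", "train"]

labels = ["car", "traffic sign", "train"]
cm = confusion_matrix(ground_truth, predictions, labels=labels)
print(labels)
print(cm)  # rows = ground truth, columns = predictions
```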
Best Practices for Achieving High-Quality Labels:
- Gather the best data you can: High-quality, consistently collected data is essential, and you should avoid biases that could reduce your model's usefulness. Ideally, your data collection pipeline is coupled with your labeling pipeline to maximize productivity and reduce turnaround times.
- Hire the best labelers for the job: Make sure the people labeling for you have the appropriate language skills, local knowledge, or domain experience for the task. Also make sure your labelers have the right incentives to produce labels of the highest caliber.
- Bring together humans and machines: for the highest precision labels, employ ML-powered labeling tooling with humans in the loop (HITL).
- Provide clear and comprehensive instructions: This will help to ensure that different labelers will label data consistently.
- Curate your data: As you work to improve your model's performance, curate your data. Examine your data with a data curation tool like Refonte.AI Nucleus to find data with incorrect or missing labels. To better understand poor model performance, review the IoU, ROC curve, and confusion matrix for your dataset. The most effective data curation tools let you interact with these charts directly, examining the data associated with a particular confusion and even forwarding it to your labeling team for correction. You may also find that some data is missing entirely, in which case you will need to collect additional data to label.
- Benchmark tasks and screening: To gauge the caliber of your labelers, collect high-confidence answers for a subset of labeling tasks and mix these benchmark tasks into regular assignments. Use a labeler's performance on benchmark tasks to find out whether they understand your instructions and can produce the quality you require. Labelers who fail your benchmark tasks can be filtered out so you can retrain them or remove them from the project.
- Inspect common answers for specific data: Examining common responses can help you find patterns in labeling errors. If every data labeler is mislabeling the same piece of data or misclassifying a certain object, there may be a problem beyond the labelers themselves. Review your guidelines, training procedures, and ground truth to make sure your expectations are understood. When common errors are found, incorporate them into your guidelines to help prevent them in the future.
- As you come across edge cases, update your guidelines and golden datasets.
- Create calibration batches to ensure that your instructions are clear and that quality is high on a small sample of your data before scaling up your labeling tasks.
- Establish a consensus pipeline: Implement a consensus pipeline for classification or text-based tasks with more subjectivity. Use a majority vote or a hierarchical approach based on the experience or proven quality of an individual or group of data labelers (see the sketch after this list).
- Establish layers of review: Establish a hierarchical review structure for computer vision tasks to ensure that the labels are as accurate as possible.
- Randomly sample labeled data for manual auditing: Randomly sample your labeled data and audit it yourself to confirm the quality of the sample. This approach will not guarantee that the entire dataset is labeled accurately but can give you a sense of general performance on labeling tasks.
- Retrain or remove poor annotators: If an annotator's performance does not improve over time and with retraining, then you may need to remove them from your project.
- Measure your model performance: The quality of your data labels is usually reflected in whether your model performs better or worse. Use model validation tools, such as Refonte.AI's validation tooling, to rigorously assess precision, recall, intersection over union, and any other metrics vital to your model's performance.
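To illustrate the consensus pipeline mentioned above, here is a minimal majority-vote sketch; the agreement threshold and the escalation behavior are assumptions you would tune for your own project.

```python
from collections import Counter

def majority_vote(labels, min_agreement=0.6):
    """Return the consensus label if enough labelers agree, otherwise flag for review.

    `labels` is the list of answers from independent labelers for one task;
    `min_agreement` is the fraction of labelers that must agree (an assumed threshold).
    """
    top_label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_agreement:
        return top_label
    return None  # no consensus; escalate to a reviewer or a more senior labeler

print(majority_vote(["cat", "cat", "dog"]))    # "cat"
print(majority_vote(["cat", "dog", "bird"]))   # None -> needs review
```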
Data Labeling for Computer Vision
Computer vision is the discipline of artificial intelligence focused on understanding data from 2D images, videos, or 3D inputs and generating predictions or recommendations based on that information. The human visual system is particularly advanced, and humans are very good at these kinds of vision tasks.
We examine the most pertinent data labeling categories for computer vision in this chapter and offer guidelines for properly labeling each category.
1. Bounding Box
Bounding boxes, the most often used and basic type of data label, are rectangular boxes that show where an object is in an image or video.
Data labelers draw a rectangular box over an object of interest, like a street sign or car. This box defines the object's X and Y coordinates.
![Data labelers](/_next/image?url=%2Fassets%2Fguides%2Flabel13.webp&w=1080&q=75)
By "bounding" an object with this kind of label, machine learning models can extract object features from a more focused region, saving compute resources and improving the accuracy of object detection.
Object detection is the practice of locating and classifying objects in an image. The X and Y coordinates can then be exported in a machine-readable format, such as JSON.
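For illustration, a single exported bounding box might look like the following; the exact field names and schema vary by platform, so treat this as a hypothetical format rather than a specific tool's output.

```python
import json

# One bounding-box annotation; the field names below are illustrative only.
annotation = {
    "label": "car",
    "geometry": "box",
    "left": 204,    # x of the top-left corner, in pixels
    "top": 118,     # y of the top-left corner, in pixels
    "width": 160,
    "height": 92,
}
print(json.dumps(annotation, indent=2))
```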
Typical Bounding Box Applications:
- Autonomous driving and robotics to detect objects such as cars, people, or houses
- Identifying damage or defects in manufactured objects
- Household object detection for augmented reality applications
- Anomaly detection in medical diagnostic imaging
Best Practices:
- Hug the borders as tightly as possible. Accurate labels capture the entire object while matching the box edges as closely as possible to the object's edges, reducing confusion for your model.
- Avoid object overlap. Bounding boxes work best when there is minimal overlap between objects, since heavy overlap hurts IoU. If objects overlap significantly, polygon or segmentation annotations may be a better choice.
- Object size: Smaller objects are better suited for bounding boxes, while larger objects are better suited for instance segmentation. However, annotating tiny objects may require more advanced techniques.
- Avoid Diagonal Lines: Bounding boxes perform poorly with diagonal lines such as walkways, bridges, or train tracks as boxes cannot tightly hug the borders. Polygons and instance segmentation are better approaches in these cases.
2. Classification
Object classification means applying a label to an entire image based on predefined categories, known as classes. Labeling images as containing a particular class such as "Dog," "Dress," or "Car" helps train an ML model to accurately predict objects of the same class when run on new data.
![Data labelers](/_next/image?url=%2Fassets%2Fguides%2Flabel14.webp&w=2048&q=75)
Typical Classification Applications:
- Activity Classification
- Product Categorization
- Image Sentiment Analysis
- Hot Dog vs. Not Hot Dog
Best Practices:
- Create clearly defined, easily understandable categories that are relevant to the dataset.
- Provide sufficient examples and training to your data labelers so that the requirements are clear and ambiguity between classes is minimized.
- Create benchmark tests to ensure label quality.
3. Cuboids
Cuboids are 3-dimensional labels that identify the width, height, and depth of an object, as well as the object's location.
Data labelers draw a cuboid over the object of interest such as a building, car, or household object, which defines the object's X, Y, and Z coordinates. These coordinates are then output in a machine-readable format such as JSON.
Cuboids are crucial for applications like indoor robotics, autonomous driving, and 3D room planners because they allow models to accurately comprehend an object's position in 3D space. It is also easier to comprehend a scene as a whole when these objects are reduced to geometric primitives.
![Cuboids enable](/_next/image?url=%2Fassets%2Fguides%2Flabel15.webp&w=1920&q=75)
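As with bounding boxes, the exported format varies by platform; the sketch below shows one plausible cuboid record with a center position, dimensions, and yaw, purely for illustration.

```python
import json

# One cuboid annotation; field names and conventions are illustrative only.
cuboid = {
    "label": "pedestrian",
    "geometry": "cuboid",
    "position": {"x": 12.4, "y": -3.1, "z": 0.9},              # center, in meters
    "dimensions": {"length": 0.6, "width": 0.5, "height": 1.7},
    "yaw": 1.57,  # rotation about the vertical axis, in radians
}
print(json.dumps(cuboid, indent=2))
```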
Typical Cuboid Applications:
- Develop prediction and planning models for autonomous vehicles using cuboids on pedestrians and cars to determine predicted behavior and intent.
- Indoor objects such as furniture for room planners
- Picking, safety, or defect detection applications in manufacturing facilities
Best Practices:
- Capture the corners and edges accurately. Like bounding boxes, ensure that you capture the entire object in the cuboid while keeping the label as tight to the object as possible.
- Avoid Overlapping labels where possible. Clean, non-overlapping cuboid data annotations will help your model improve object predictions and localizations in 3D space.
- Axis alignment is critical. Ensure that the alignment of your bounding boxes is on the same axis for objects of the same class.
- Remember your camera intrinsics. Applying cuboids without accounting for the camera's position and intrinsics will produce inaccurate predictions when objects appear at different positions relative to the camera. A "true" cuboid's front face will rarely be a perfect 90-degree rectangle unless it directly faces the camera. Likewise, in the annotation above, the top and bottom edges of the right side are drawn parallel, even though the edges of a cuboid parallel to the ground should converge toward the horizon.
- Pair 2D Data with 3D Depth Data such as LiDAR. 2D images inherently lack depth information, so pairing your 2D data with 3D depth data such as LiDAR will yield the best results for applications dependent on depth accuracy. See the section below on 3D Sensor Fusion for more information on this topic.
4. 3D Sensor Fusion
The process of merging data from several sensors to account for the shortcomings of each sensor is known as three-dimensional sensor fusion. Present machine learning models cannot understand complete scenes from 2D photos alone. It is difficult to estimate depth from a 2D image, and depending solely on 2D images is problematic due to occlusion and a narrow field of view. Some methods of autonomous driving only use cameras, while a more reliable method uses 3D systems to enhance 2D systems with sensors like radar and LiDAR to overcome the limits of 2D.
![Data labelers](/_next/image?url=%2Fassets%2Fguides%2Flabel16.webp&w=2048&q=75)
LiDAR (Light Detection and Ranging) is a technique that uses a laser to measure an object's distance in order to calculate an object's depth and produce three-dimensional (3D) images of the scene.
Radar (Radio Detection and Ranging) uses radio waves to measure an object's radial velocity, angle, and distance.
The image below shows a high-level overview of an interactive 3D sensor-fused scene.
![interactive 3D sensor](/_next/image?url=%2Fassets%2Fguides%2Flabel17.jpg&w=3840&q=75)
Typical 3D Sensor Fusion Applications
- Autonomous Vehicles
- Geospatial and mapping applications
- Robotics and automation
Best Practices:
- Ensure that your data labeling platform is calibrated to your sensor intrinsics (or better yet, ensure that your tooling is sensor agnostic) and supports different lens and sensor types, for example, fisheye and panoramic cameras.
- Look for a data labeling platform that can support large scenes, ideally with support for infinitely long scenes.
- Ensure that object tracking is consistent throughout a scene, even when an object leaves and returns to the scene.
- Include attribute support for understanding correlations between objects, such as truck cabs and trailers.
- Leverage linked instance IDs describing the same object across the 2D and 3D modalities.
5. Ellipses
Ellipses are oval data labels used to indicate where objects are located in an image. Data labelers draw an ellipse over each object of interest, such as a wheel, face, eye, or piece of fruit. This annotation specifies the object's location in two dimensions. The X and Y coordinates of the ellipse's four extremal vertices can then be exported in a machine-readable format, like JSON, to fully specify the ellipse's location.
![Ellipses](/_next/image?url=%2Fassets%2Fguides%2Flabel18.webp&w=2048&q=75)
Applications
- Face Detection
- Medical Imaging Diagnosis
- Wheel Detection
Best Practices:
- The data that needs to be labeled should be circular or oval; labeling rectangular objects with ellipses when a bounding box would work better is not beneficial.
- When there is a large degree of overlap for bounding boxes or when items are closely spaced or obscured, like in fruit bunches, use ellipses. Ellipses can give your model a more focused geometry by closely hugging the edges of these items.
6. Lines
Lines identify roadway markings and other linear objects. Data labelers draw lines over areas of interest, defining the line's vertices. Adding line labels to images helps teach your model to recognize boundaries more precisely. The X and Y coordinates of the line vertices can then be exported in JSON.
![Lines identify](/_next/image?url=%2Fassets%2Fguides%2Flabel19.webp&w=1080&q=75)
Typical Lines Applications
- Label roadway markers with straight or curved lines for autonomous vehicles
- Horizon lines for AR/VR applications
- Define boundaries for sporting fields
Best Practices:
- Label only the lines that matter most to your application.
- Match the lines to the shape of the lines in the image as closely as possible.
- Depending on the use case, it could be important for lines not to intersect.
- Center the line annotation within the line in the image to improve model performance.
7. Points
Points mark the spatial locations of significant features of an object. Data labelers mark each location of interest with a point, which records the location's X and Y coordinates. These points can be connected to one another, as when labeling the wrist, elbow, and shoulder of a person to indicate the major moving parts of an arm. These labels help machine learning models estimate poses more accurately or detect essential features of an object.
![spatial locations](/_next/image?url=%2Fassets%2Fguides%2Flabel20.webp&w=1920&q=75)
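As a sketch of the grouped-points idea, the structure below stores a few arm keypoints together with the connections between them; the point names and pixel coordinates are illustrative only.

```python
# A minimal keypoint-group sketch: points plus the connections between them.
arm_keypoints = {
    "points": {
        "shoulder": (412, 233),
        "elbow":    (455, 310),
        "wrist":    (470, 388),
    },
    "connections": [("shoulder", "elbow"), ("elbow", "wrist")],
}

for start, end in arm_keypoints["connections"]:
    print(start, "->", end, arm_keypoints["points"][start], arm_keypoints["points"][end])
```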
Typical Points Applications
- Pose estimation for fitness or health applications or activity recognition
- Facial feature points for face detection
Best Practices:
- Label only the points that are most critical to your application. For instance, if you are building a face detection application, focus on labeling salient points on the eyes, nose, mouth, eyebrows, and the outline of the face.
- Group points into structures (hand, face, and skeletal keypoints), and the labeling interface should make it efficient for taskers to visualize the interconnections between points in these structures.
8. Polygons
Although data labelers can create bounding boxes quickly and easily, boxes can leave significant gaps around objects and do not map accurately to irregular shapes. Choosing between bounding boxes and polygons is a trade-off between efficiency and accuracy. Bounding boxes can produce a machine learning model with enough accuracy for many applications; other applications need the higher accuracy of polygons at the cost of slower, more expensive annotation.
Data labelers create a polygon over an object of interest by clicking on relevant points along its outline, completing a fully connected annotation. These points define the polygon's vertices, and the X and Y coordinates of the vertices are then output as JSON.
![Data labelers](/_next/image?url=%2Fassets%2Fguides%2Flabel21.webp&w=2048&q=75)
Typical Polygons Applications
- Irregular objects such as buildings, vehicles, or trees for autonomous vehicles
- Satellite imagery of houses, pools, industrial facilities, planes, or landmarks
- Fruit detection for agricultural applications
Best Practices:
- Pay special attention to objects that have holes in them or that split into several polygons because of occlusion (a car behind a tree, for example). Subtract each hole's area from the object.
- Keep adjacent polygons from slightly overlapping one another.
- To make sure that you set dots close to the edges of each object, zoom in close to each one.
- Keep an eye out for curved edges and be sure to add additional vertices to 'smooth' them as much as possible.
- Make effective use of the Auto Annotate Polygon tool to label objects. Quickly and automatically create high-precision polygon annotations by using an initial, approximate bounding box to highlight particular objects of interest.
![Auto Annotate](/_next/image?url=%2Fassets%2Fguides%2Flabel22.jpg&w=3840&q=75)
Follow these steps to achieve success with the Auto Annotate Polygon tool:
- Include all parts of the object of interest.
- Exclude overlapping object instances and other objects as much as possible.
- Keep the bounding box tight to the borders of the object.
- Use click to include/exclude to refine the automatically-generated polygon by instantly performing local edits - include and exclude specific areas of interest.
- Further, refine the polygon by increasing or decreasing vertex density to smooth curved edges.
9. Segmentation
There are three primary forms of segmentation labels: panoptic, instance, and semantic. Segmentation labels correspond to pixel-wise labeling on an image.
Semantic Segmentation
Semantic segmentation gives each pixel in an image a class corresponding to the object it represents, such as a person, a car, or vegetation. This procedure, known as "dense prediction," is laborious and time-consuming.
You cannot discriminate between distinct objects of the same class using semantic segmentation (for more on this, see instance segmentation).
![semantic segmentation](/_next/image?url=%2Fassets%2Fguides%2Flabel23.webp&w=2048&q=75)
Instance segmentation
Label every pixel of each distinct object of an image. Unlike semantic segmentation, instance segmentation distinguishes between separate objects of the same class (i.e., identifying car 1 as separate from car 2).
![distinct object](/_next/image?url=%2Fassets%2Fguides%2Flabel24.webp&w=2048&q=75)
Panoptic Segmentation
Panoptic segmentation combines semantic and instance segmentation: every pixel in an image is assigned both a class label (semantic segmentation) and an instance label (instance segmentation). Instances can represent countable objects, such as vehicles or people, as well as regions like the road or the sky. Because it is more detailed and offers more context than semantic segmentation alone, panoptic segmentation enables deeper scene understanding.
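To make the three label types concrete, the toy NumPy sketch below encodes a 4x4 scene as a semantic mask, an instance mask, and a combined panoptic label; the class IDs and layout are assumptions for illustration.

```python
import numpy as np

# Semantic mask: one class ID per pixel (0 = background, 1 = car, 2 = road)
semantic = np.array([
    [2, 2, 2, 2],
    [2, 1, 1, 2],
    [2, 1, 1, 2],
    [0, 0, 0, 0],
])

# Instance mask: separates distinct objects of the same class (two cars -> IDs 1 and 2)
instance = np.array([
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 0],
])

# Panoptic label: a (class, instance) pair per pixel, stacked along the last axis
panoptic = np.stack([semantic, instance], axis=-1)
print(panoptic.shape)  # (4, 4, 2)
```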
Typical Segmentation Applications
- Autonomous Vehicles and Robotics: Identify pedestrians, cars, trees
- Medical Diagnostic imaging: tumors, abscesses in diagnostic imaging
- Clothing: Fashion retail
Best Practices
- Carefully trace the outlines of each shape to ensure that all pixels of each object are labeled.
- Use ML-assisted tooling like the boundary tool to quickly segment borders and objects of interest.
- After segmenting borders, use the flood fill tool to fill in and complete segmentation masks quickly.
- Use active tools like Autosegment to increase the efficiency and accuracy of your labelers.
![Panoptic Segmentation](/_next/image?url=%2Fassets%2Fguides%2Flabel25.jpg&w=3840&q=75)
Explore Coco-Stuff on Nucleus for a large collection of data with segmentation labels!
10. Special considerations for Video Labeling
You can apply many of the same labels to images and videos, but there are some special considerations for video labeling.
Temporal linking of labels
Video annotations add the dimension of time to the data fed to your models, so your model needs a way to understand related objects and labels between frames.
![Video annotations](/_next/image?url=%2Fassets%2Fguides%2Flabel26.jpg&w=3840&q=75)
Manually tracking objects and adding labels through many video frames is time and resource intensive. You can leverage several techniques to increase the efficiency and accuracy of temporally linked labels.
- First, you can leverage video interpolation between frames to smooth out labels and make it easier to track objects through the video (a simplified sketch follows at the end of this section).
- You should also look for tools that automatically duplicate annotations between frames, minimizing human intervention needed to correct for misaligned labels.
- If you are working with videos, ensure that the tools you are using can handle the storage capacity of the video file and can stitch together hour-long videos so that you retain the same context no matter how long the video.
In videos, objects may leave the camera view and return later. Leverage tools that track these objects automatically, or make sure to annotate these objects with the same unique IDs.
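As a simplified sketch of the interpolation idea mentioned above, the function below linearly interpolates a bounding box between two manually labeled keyframes; real tracking and interpolation tools are considerably more sophisticated.

```python
def interpolate_box(box_start, box_end, frame, frame_start, frame_end):
    """Linearly interpolate a bounding box (x, y, width, height) between two keyframes."""
    t = (frame - frame_start) / (frame_end - frame_start)
    return tuple(a + t * (b - a) for a, b in zip(box_start, box_end))

# Keyframes labeled at frame 0 and frame 10; estimate the box at frame 4
print(interpolate_box((100, 50, 80, 40), (140, 60, 80, 40), 4, 0, 10))
# (116.0, 54.0, 80.0, 40.0)
```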
Multimodal
Multimodal machine learning attempts to understand the world from multiple, related modalities such as 2D images, audio, and text.
Multimodal labeling combines multiple label types, such as human keypoints, bounding boxes, and transcribed audio with entity recognition, all connected within rich scenes.
![Multimodal machine](/_next/image?url=%2Fassets%2Fguides%2Flabel27.jpg&w=2048&q=75)
Typical Multimodal Applications
- AR/VR full scene understanding Video/GIF/Image (Object Detection + Human Keypoints + Audio Transcription + Entity Recognition)
- Sentiment analysis by combining video gestures and voice data
Best Practices
- Incorporate temporal linking to ensure that models fully understand the entire breadth of each scene.
- Identify which modalities are best suited for your application. For instance, if you are working on sentiment analysis for AR/VR applications, you will want to consider not only 2D video object or human keypoint labels but also audio transcription and entity recognition in addition to sentiment classification so that you have a rich understanding of the entire scene and how the individuals in the scene contribute towards a particular sentiment (i.e., if the person is yelling and gesturing wildly you can determine the sentiment is "upset").
- Include Human in the loop to ensure consistency across modalities. Assign complex scenes to only the most experienced taskers.
Synthetic Data
Synthetic data is digitally created data that imitates real-world data. It is often generated by artists using computer graphics tools, or programmatically using models such as Neural Radiance Fields (NeRFs), Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs). Synthetic data automatically comes with perfect ground truth labels; no extra human labeling is needed.
![Synthetic data](/_next/image?url=%2Fassets%2Fguides%2Flabel28.webp&w=2048&q=75)
Typical Synthetic Data Applications
- Digital Humans for autonomous vehicles and robotics, particularly in long-tail edge cases such as pedestrians walking on shoulders
- Digital humans for fitness and health applications. Getting enough real-world human pose data is difficult and expensive, whereas synthetic data is relatively easy to generate and cheaper.
- Manufacturing defect detection
Best Practices:
- If using a mix of Synthetic and real-world data, ensure that the labels of your real data are as accurate as possible. Synthetic data generates perfectly accurate labels, so any label inaccuracies in your real data will degrade your model's predictive capabilities.
- Leverage synthetic data for data that is difficult to collect due to privacy concerns, rare edge cases, to avoid bias, or prohibitively expensive data collection and labeling methods.
- Integrate synthetic data with your existing data pipelines to maximize your ROI
- Curate your data using best-in-class tools to ensure that you are surfacing the edge cases for which you need more data. Ideally, your dataset curation tool will also integrate into your labeling pipelines.
NLP Data Labeling
Labeling text enables natural language processing algorithms to understand, interact with, and generate text for various applications ranging from chatbots to machine translation to product review sentiment analysis.
Like computer vision, there is a wide variety of text label types, and we will cover the most common labels in this guide.
1. Part of Speech Tagging (POS)
Part of speech tagging categorizes words in a text (corpus) by part of speech, based on each word's definition and context. This basic tagging helps machine learning models understand natural language better.
Labeling parts of speech enables chatbots and virtual assistants to hold more relevant conversations and better understand the content they interact with.
![speech enables chatbots](/_next/image?url=%2Fassets%2Fguides%2Flabel29.webp&w=1920&q=75)
quote source: Refonte.AI Zeitgeist, Eric Schmidt, Schmidt Ventures
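As a quick illustration, the sketch below tags parts of speech with spaCy; it assumes spaCy and its small English model (en_core_web_sm) are installed, and the sentence is made up.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The labelers annotated thousands of images last week.")

for token in doc:
    print(token.text, token.pos_)   # e.g. "labelers NOUN", "annotated VERB"
```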
2. Named Entity Recognition (NER)
Named entity recognition is similar to part-of-speech tagging but focuses on classifying text into predefined categories such as person names, organizations, locations, and time expressions. Linking entities to establish contextual relationships (i.e., Ada Lovelace is the child of Lord and Lady Byron) adds another layer of depth to your model's understanding of the text.
![Entity recognition](/_next/image?url=%2Fassets%2Fguides%2Flabel30.webp&w=2048&q=75)
quote source: Refonte.AI Zeitgeist, Eric Schmidt, Schmidt Ventures
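For illustration, the sketch below extracts named entities with spaCy; again, the model and sentence are assumptions, and a production system would likely use domain-tuned models and custom entity types.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace worked with Charles Babbage in London in 1842.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Ada Lovelace PERSON", "London GPE", "1842 DATE"
```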
Applications
- Improve search terms
- Ad serving models
- Identify terms in customer interactions (i.e., support threads, chatbots, social media posts) to map to specific topics, brands, etc.
3. Classification
Classify text into predefined categories, such as customer sentiment or chatbot intent, to accurately monitor brand equity, trends, and more.
![speech enables chatbots](/_next/image?url=%2Fassets%2Fguides%2Flabel31.webp&w=2048&q=75)
Applications
- Customer sentiment
- Intent on social media or chatbots
- Active monitoring of brand equity
4. Audio
Transcribe audio data into text for natural language models to make sense of the data. In this case, the text data becomes the label for the audio data. Add further depth to text data with named entity recognition or classification.
5. Best Practices for labeling Text
- Use native speakers, ideally those with a cultural understanding that mirrors the source of the text.
- Provide clear instructions on the parts of speech to be labeled and train your labelers on the task.
- Set up benchmark tasks and build a consensus pipeline to ensure quality and avoid bias.
- Leverage rule-based tagging/heuristics to automatically label known named entities (i.e., "Eiffel Tower") and combine this with humans in the loop to improve efficiency and avoid subtle errors for critical cases.
- Deduplicate data to reduce labeling overhead.
- Leverage native speakers and labelers with relevant cultural experience to your use case to avoid confusion around subtle ambiguities in language. For example, Greeks will associate the color "purple" with sadness. At the same time, those from China and Germany will consider purple emotionally ambivalent, and those from the UK may think purple to be positive.
Conclusion
We hope this guide is useful to you as you continue learning about machine learning, and that you can apply these insights to improve your data labeling processes. We wanted to share the best practices we discovered while working with some of the largest corporations in the world and delivering billions of annotations.
As the machine learning industry expands, high-quality data and labeling are more important than ever. We advise you to approach machine learning holistically, starting by identifying your main business challenges and then moving on to data collection and labeling.
The purpose of this guide is to provide you with the information needed to establish high-caliber data annotation processes. If you are employing your own workforce for labeling, consider Refonte.AI Studio, which provides best-in-class annotation infrastructure built by knowledgeable annotators. Conversely, if you need labeling resources, Refonte.AI Rapid offers a simple way to ramp up to production levels and offload your data labeling. Please get in touch with us if you have any more questions; we will be pleased to assist you.