Guide to Computer Vision Applications
Understand what computer vision is, how it works, and take a deep dive into some of the top applications for computer vision by industry.
Introduction
Machine learning (ML) has transformed how we tackle computer vision and natural language processing problems, as we cover in our Authoritative Guide to Data Labeling.
This guide gives a general overview of computer vision (CV) applications in machine learning, including definitions, methods, subfields, and industry-by-industry breakdowns of computer vision use cases.
What is computer vision?
People have dreamed of creating machines with human-like intelligence for decades. Giving computers the capacity to "see" and comprehend their surroundings is a crucial first step toward that kind of artificial intelligence.
Computer vision is the field of artificial intelligence focused on creating machines that can process, interpret, and make sense of visual data (such as photos, videos, and other sensor data) in a manner comparable to human vision. From an engineering standpoint, computer vision systems seek not only to understand the environment they operate in, but also to automate tasks that the human visual system can accomplish.
How does computer vision work?
Computer vision draws inspiration from how the human visual system and brain work. The computer vision techniques we use today are built on pattern recognition: models are trained on vast volumes of visual data. Say, for example, we train a model on a million photos of flowers. After examining those photos, the algorithm finds patterns common to all flowers and can eventually recognize a flower in a new photograph.
Many computer vision systems rely on a type of deep learning model called a convolutional neural network (CNN). A CNN is made up of an input layer, hidden layers, and an output layer, which together learn to detect the patterns described above.
![convolutional neural network](/_next/image?url=%2Fassets%2Fguides%2Fvision1.webp&w=2048&q=75)
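To make the layer structure concrete, here is a minimal, illustrative CNN in PyTorch. The input size, channel counts, and number of classes are arbitrary choices for this sketch, not values from any particular system.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A toy CNN: an input layer, hidden convolutional layers, and an output layer."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Hidden convolutional layers learn visual patterns (edges, textures, parts).
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The output layer maps the pooled features to per-class scores.
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = SimpleCNN()
logits = model(torch.randn(1, 3, 32, 32))  # one random 32x32 RGB "image"
print(logits.shape)  # torch.Size([1, 10]): one score per class
```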
Images, videos, and other sensor data, such as light detection and ranging (LiDAR) and radio detection and ranging (RADAR) data, can all be used to train computer vision applications. Each data type has its advantages and disadvantages.
Images
Pros:
- Large-scale open-source datasets are available for image data (ImageNet, MS COCO, etc.).
- Cameras are inexpensive if you need to collect data from scratch.
- Images are easier to annotate compared to other data types.
Cons:
- Even the most popular large-scale datasets have known quality issues and gaps that can limit the performance of your models.
- If your use case requires depth perception (e.g. autonomous vehicles or robotics), images alone may not provide the accuracy you need.
- Static images alone are not sufficient to develop object-tracking models.
Videos
Pros:
- Again, cameras are inexpensive if you need to collect data from scratch.
- Enables the development of object tracking or event detection models.
Cons:
- More challenging to annotate compared to images, especially if pixel-level accuracy is required.
LiDAR
![object tracking](/_next/image?url=%2Fassets%2Fguides%2Fvision2.jpg&w=3840&q=75)
What is LiDAR?
LiDAR uses laser light pulses to scan its environment. When the laser pulse reaches an object, the pulse is reflected and returned to the receiver. The time of flight (TOF) is used to generate a three-dimensional distance map of objects in the scene.
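As a minimal sketch of the time-of-flight calculation described above: the pulse travels to the object and back, so the one-way distance is the speed of light times the round-trip time, divided by two.

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to an object from a LiDAR pulse's round-trip travel time."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2

print(tof_distance(667e-9))  # a ~667 ns round trip puts the object roughly 100 m away
```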
Pros:
- LiDAR sensors are more accurate and provide finer resolution data than RADAR.
- Allows for better depth perception when developing computer vision systems.
- LiDAR can also be used to determine the velocity of a moving object in a scene.
Cons:
- Advancements in LiDAR technology have brought down costs in the last few years, but it is still a more costly method of data collection than images or videos.
- Performance degrades in adverse weather conditions such as rain, fog, or snow.
- Calibrating multiple sensors for data collection is a challenge.
- Visualizing and annotating LiDAR data is technically challenging, requires more expertise, and can be expensive.
RADAR
What is RADAR?
RADAR sensors work much like LiDAR sensors but use radio waves instead of laser light to determine the distance, angle, and radial velocity of objects relative to the sensor.
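As a sketch of the radial-velocity measurement, a monostatic radar (transmitter and receiver co-located) can recover a target's speed from the Doppler shift of the returned wave. The carrier frequency and shift below are illustrative numbers, not values from any specific sensor.

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def radial_velocity(doppler_shift_hz: float, carrier_hz: float) -> float:
    """Radial velocity of a target from the Doppler shift of the returned wave."""
    return doppler_shift_hz * SPEED_OF_LIGHT / (2 * carrier_hz)

# Example: a 77 GHz automotive radar observing a ~10.3 kHz Doppler shift
print(radial_velocity(10_300, 77e9))  # ~20 m/s (~72 km/h)
```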
Pros:
- Radio waves are absorbed less than the light waves used by LiDAR, so RADAR can operate over relatively long distances, making it ideal for applications like aircraft or ship detection.
- RADAR performs relatively well in adverse weather conditions such as rain, fog, or snow.
- RADAR sensors are generally less expensive than LiDAR sensors.
Cons:
- Less angularly accurate than LiDAR and can lose sight of target objects on a curve.
- Less crisp/accurate images compared to LiDAR.
Notable research in computer vision
Advancements in the field of computer vision are driven by robust academic research. In this chapter, we will highlight some of the seminal research papers in the field in chronological order.
ImageNet: A large-scale hierarchical image database
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
Why it's important: This paper introduced the ImageNet dataset, which has been a standard benchmark in the field of computer vision since 2009.
AlexNet:
A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems (NIPS), 2012.
Why it’s important: This paper put convolutional neural networks (CNNs) on the map as a solution for complex vision classification tasks.
ResNet:
K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv, 2015.
Why it's important: This paper introduced key ideas to help train significantly deeper CNNs. Deeper CNNs are crucial to improving the performance of computer vision models.
MoCo:
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning,” 2020 IEEE Conference on Computer Vision and Pattern Recognition, 2020.
Why it's important: This was the first self-supervised learning paper that was competitive with supervised learning (and sparked the field of contrastive learning).
Vision Transformers:
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” 2021 International Conference on Learning Representations, 2021.
Why it's important: This paper showed how transformers, which were already dominant in natural language models, could be applied for vision.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis:
B. Mildenhall, P. Srinivasan, M. Tancik, J. Barron, R. Ramamoorthi, and R. Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” 2020 European Conference on Computer Vision, 2020.
Why it's important: This foundational paper, which has inspired hundreds of follow-up papers in the last few years, showed how to generate novel views of a 3D scene from a small number of captured images by representing the entire scene implicitly (as opposed to using classical computer graphics representations such as meshes and textures).
Masked Autoencoders:
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” arXiv, 2021.
Why it's important: This paper introduced a new self-supervised learning technique that applies the masking ideas that were successful in language models. Its advantages include efficiency and not relying on contrastive learning.
What are some subfields of computer vision?
Computer vision has numerous distinct subfields and subdomains. Object classification, object detection, object recognition, object tracking, event detection, and pose estimation are a few of the more popular ones. This chapter gives a quick summary and an example of each.
Object Classification
![object classification](/_next/image?url=%2Fassets%2Fguides%2Fvision3.webp&w=2048&q=75)
With object classification, models are trained to identify the class of a single object in an image. For example, given an image of an animal, the model returns the class of animal identified in the image (e.g. an image of a cat should come back as “cat”).
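As a hedged sketch of what this looks like in practice, the snippet below classifies an image with a pretrained torchvision model; the file path "cat.jpg" is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()  # resize, crop, and normalize as the model expects

batch = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # add a batch dimension
with torch.no_grad():
    probs = model(batch).softmax(dim=1)

print(weights.meta["categories"][probs.argmax().item()])  # e.g. "tabby" for a cat photo
```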
Object Detection
![object detection](/_next/image?url=%2Fassets%2Fguides%2Fvision4.webp&w=2048&q=75)
With object detection, models are trained to identify occurrences of specific objects in a given image or video. For example, given an image of a street scene and an object class of “pedestrian”, the model would return the locations of all pedestrians in the image.
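A hedged sketch of that pedestrian example, using a pretrained torchvision Faster R-CNN ("street.jpg" is a placeholder path and 0.8 is an arbitrary confidence threshold):

```python
import torch
from PIL import Image
from torchvision import models
from torchvision.transforms.functional import to_tensor

weights = models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.fasterrcnn_resnet50_fpn(weights=weights).eval()

image = to_tensor(Image.open("street.jpg"))
with torch.no_grad():
    pred = model([image])[0]  # boxes, labels, and confidence scores

# Keep confident detections of the COCO "person" class (label 1).
for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if label.item() == 1 and score > 0.8:
        print(box.tolist(), round(score.item(), 3))
```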
Object Recognition
![object recognition](/_next/image?url=%2Fassets%2Fguides%2Fvision5.webp&w=2048&q=75)
With object recognition, models are trained to recognize every relevant object in an image or video. For instance, given a picture of a street scene, an object recognition model might report the locations of every object it has been trained to identify (e.g. pedestrians, automobiles, buildings, street signs, etc.).
Object Tracking
![object tracking](/_next/image?url=%2Fassets%2Fguides%2Fvision6.jpg&w=3840&q=75)
Object tracking is the process of taking an initial set of object detections, assigning a unique ID to each detection, and then following each object as it moves through a video. For instance, in a video of a fulfillment center, an object tracking model would recognize a product, tag it with an ID, and then follow it over time as it moves around the space.
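A minimal, illustrative sketch of the ID-assignment step in tracking-by-detection: each new detection is matched to the existing track whose last box overlaps it most. Real trackers also handle motion prediction, occlusion, and track deletion.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

class IouTracker:
    def __init__(self, threshold: float = 0.3):
        self.tracks = {}  # track ID -> last known box
        self.next_id = 0
        self.threshold = threshold

    def update(self, detections):
        """Assign each detected box an existing or a brand-new track ID."""
        assigned = {}
        for det in detections:
            best = max(self.tracks, key=lambda t: iou(self.tracks[t], det), default=None)
            if best is not None and iou(self.tracks[best], det) >= self.threshold:
                assigned[best] = det  # same object, keep its ID
            else:
                assigned[self.next_id] = det  # a new object enters the scene
                self.next_id += 1
        self.tracks.update(assigned)
        return assigned

tracker = IouTracker()
print(tracker.update([(10, 10, 50, 50)]))  # frame 1: new object gets ID 0
print(tracker.update([(12, 11, 52, 51)]))  # frame 2: same object keeps ID 0
```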
Event Detection
![event detection](/_next/image?url=%2Fassets%2Fguides%2Fvision7.jpg&w=3840&q=75)
With event detection, models are trained to determine when a particular event has occurred. For example, given a video of a retail store, an event detection model would flag when a customer has picked up or bagged a product to enable autonomous checkout systems.
Pose Estimation
![pose estimation](/_next/image?url=%2Fassets%2Fguides%2Fvision8.jpg&w=3840&q=75)
With pose estimation, models are trained to detect and predict the position and orientation of a person, object, or key feature points in a scene. For instance, given an egocentric video of a person unlocking and opening a door, a pose estimation model would detect and predict the position and orientation of the person's hands.
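As a hedged sketch, torchvision ships a pretrained Keypoint R-CNN for human pose; "person.jpg" below is a placeholder path.

```python
import torch
from PIL import Image
from torchvision import models
from torchvision.transforms.functional import to_tensor

weights = models.detection.KeypointRCNN_ResNet50_FPN_Weights.DEFAULT
model = models.detection.keypointrcnn_resnet50_fpn(weights=weights).eval()

image = to_tensor(Image.open("person.jpg"))
with torch.no_grad():
    pred = model([image])[0]

# Each detected person gets 17 COCO keypoints (nose, wrists, elbows, knees, ...).
if len(pred["keypoints"]) > 0:
    print(pred["keypoints"][0].shape)  # torch.Size([17, 3]): x, y, visibility
```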
Depth Estimation
With depth estimation, models are trained to measure the distance of an object relative to a camera or sensor. Depth can be measured either from monocular (single) or stereo (multiple views) images. Depth estimation is critical to enable applications such as autonomous driving, augmented reality, robotics, and more.
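For the stereo case, the classical relation is that depth equals focal length times the camera baseline divided by the pixel disparity between the two views. A minimal sketch with illustrative numbers:

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth (in meters) of a point seen by a rectified stereo camera pair."""
    return focal_px * baseline_m / disparity_px

# Illustrative values: 700 px focal length, 12 cm baseline, 20 px disparity
print(stereo_depth(700, 0.12, 20))  # the point is 4.2 meters away
```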
Generation
Generative models, such as diffusion models, are a class of machine learning models that can generate new data based on their training data. For more on diffusion models, take a look at our Practical Guide to Diffusion Models.
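As a hedged illustration, the snippet below generates an image with a publicly available diffusion model via Hugging Face's diffusers library; the model ID is one public example, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Download a pretrained text-to-image diffusion model and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The model iteratively denoises random noise into an image matching the prompt.
image = pipe("a photograph of a field of flowers at sunrise").images[0]
image.save("flowers.png")
```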
What are the top computer vision use cases by industry?
Industries from automotive, software & internet, healthcare, retail & eCommerce, and robotics to government & public sector are developing computer vision applications. In this chapter, we provide a non-exhaustive list of top computer vision use cases for each industry.
Automotive
![automotive](/_next/image?url=%2Fassets%2Fguides%2Fvision9.jpg&w=3840&q=75)
The most common use case for computer vision in the automotive sector is autonomous driving. However, autonomous driving is not an all-or-nothing capability. Numerous automakers have been gradually integrating safety and autonomous technologies, with varying levels of autonomy and human control, into their cars. The degrees of autonomy in the automotive industry are standardized on a scale from 0 to 5.
Vehicles equipped with Advanced Driver-Assistance Systems (ADAS) have technical features that make driving safer. On this scale, ADAS functions typically fall under level one or level two autonomy. Examples of ADAS capabilities include adaptive cruise control, emergency braking assistance, lane-keeping, and parking assistance.
Autonomous vehicles (AVs), or self-driving cars, generally fall under level three, four, or five autonomy. They are capable of sensing the environment around them and navigating safely with little to no human input.
Software & Internet
The software and internet industry is pioneering computer vision applications in augmented and virtual reality (AR/VR), content understanding, and more.
![augmented reality](/_next/image?url=%2Fassets%2Fguides%2Fvision10.webp&w=1920&q=75)
Augmented reality (AR) is a technology that overlays computer-generated text, images, audio, and other virtual elements onto real-world objects. Examples of AR include virtual try-ons, Snapchat filters, Pokémon Go, interior design apps, 3D diagnostic imaging exploration, equipment/robotics repair, and more. Virtual reality (VR), by contrast, completely immerses the user in a virtual environment, masking the outside world.
![content understanding](/_next/image?url=%2Fassets%2Fguides%2Fvision11.webp&w=1200&q=75)
Content understanding is used to analyze, enrich, and categorize social media posts, videos, photos, and other types of content. One core application is content data enrichment: adding metadata to content to improve the rankings produced by recommendation models. With a deeper understanding of their content, teams can quickly identify opportunities to improve personalization. A second core application is trust and safety: automatically identifying user-generated content that violates a platform's guidelines. In short, content understanding enhances user safety, recommendation systems, and personalization.
Healthcare
![healthcare](/_next/image?url=%2Fassets%2Fguides%2Fvision12.jpg&w=1920&q=75)
The healthcare industry is utilizing computer vision technologies to improve patient outcomes and empower healthcare workers to make more informed decisions. The growing use of wearable technology and the standardization of medical imaging under the Digital Imaging and Communications in Medicine (DICOM) framework have enabled use cases such as:
- Diagnostics
- Patient monitoring
- Research and development
Retail & eCommerce
![retail and eCommerce](/_next/image?url=%2Fassets%2Fguides%2Fvision13.webp&w=1200&q=75)
Computer vision applications can help retailers and customers alike. For retailers, CV technology can improve consumer engagement, click-through rates, cost savings, and operational efficiency, all while improving product discovery and providing a more seamless shopping experience for customers. Use cases include:
- Autonomous Checkout
- Product Matching/Deduplication
- AI-generated product imagery
For more on AI for eCommerce, take a look at our Guide to AI for eCommerce.
Robotics
![robotics](/_next/image?url=%2Fassets%2Fguides%2Fvision14.jpg&w=3840&q=75)
Agriculture, warehousing, manufacturing, and other industries that use robotics have begun leveraging computer vision technology to improve their operations and enhance safety. Use cases include:
- Inventory sorting and handling
- Defect detection
- Automated harvesting
- Plant disease detection
Government & Public Sector
![government and public sector](/_next/image?url=%2Fassets%2Fguides%2Fvision15.jpg&w=3840&q=75)
The US government has troves of geospatial data across a number of sensor types, including electro-optical (EO), synthetic aperture radar (SAR), and full motion video (FMV). The amount of geospatial data produced by the US government and commercial satellite vendors is increasing, while the number of analysts stays roughly the same. Using computer vision to process, exploit, and disseminate (PED) this data is essential if the government is to realize the full potential of all available data and derive increasingly meaningful insights about how our world operates.
Use cases include:
- Assessment of the impact of natural disasters and war on infrastructure. Take a look at Detecting Destruction, named one of TIME magazine's best AI inventions.
- Intelligence, surveillance, and reconnaissance (ISR)
- Perimeter security
- Environmental monitoring from space
Conclusion
Computer vision applications are being developed across many different sectors. However, the performance of computer vision models depends heavily on the quality of the data they are trained on. Setting up a successful data pipeline involves data generation, curation, and annotation.
To supply data for your models, you can create your own datasets, use pre-existing open-source datasets, or synthetically generate or augment datasets. After generating data, you must curate it to surface rare edge cases and failure points in order to maximize model performance. From there, your data can be annotated. Check out our Authoritative Guide to Data Annotation for a more thorough explanation of how to annotate data for your machine learning initiatives.
We hope you found this guide helpful as you think about developing your own CV applications. If you already have data and are ready to get started, check out the Refonte.AI Generative AI Platform, which automatically trains state-of-the-art machine learning models.