
Training and Building Machine Learning Models

The Foundational Guide

Training Models for Machine Learning

Machine learning (ML) has transformed both state-of-the-art research and enterprises' ability to tackle previously difficult or intractable problems in computer vision and natural language processing, as we previously explained in our Authoritative Guide to Data Labeling. Thanks to massive datasets, predictive models can now learn to identify patterns with high accuracy, without being explicitly programmed for each task.

More broadly, machine learning (ML) models can produce useful, coherent writing without human intervention. They can also identify cars or retail merchandise, anticipate numerical outcomes such as temperature or a mechanical breakdown, and plan better ways to grasp objects. Want to get started training and building models for your business use case? You've come to the right place to learn how model training works, and how you too can start building your own ML models!

What Are Machine Learning (ML) Models?

ML models typically receive "high-dimensional" sets of data artifacts as inputs and produce a classification, a prediction, or some other indicator as output. These inputs can be text prompts, numerical data streams, pictures, videos, audio, or even three-dimensional point cloud data. The computing process that generates the model output is usually called "inference," a term borrowed from cognitive science. The "prediction" the model generates is based on patterns in past data.

What sets an ML model apart from hard-coded feature detectors (yes, face recognition once relied on identifying a certain arrangement of circles and lines) or basic heuristics, which are frequently just conditional statements, is a set of floating point numbers used as "weights," arranged into "layers" and connected by functions. The system is trained through trial and error, adjusting weights to minimize error (a metric typically referred to as "loss" in the ML world) over time. In nearly all ML models, there are far too many of these weights to adjust them manually or selectively; they must be "trained" iteratively and automatically in order to produce a useful and capable model. Ideally, the resulting model has "learned" from the training examples and can generalize to new examples it hasn't seen before in the real world.

Because these weights are trained iteratively, the ML engineer in charge of system design can typically only conjecture about how each individual weight contributes to the final model. Instead, she adjusts and fine-tunes the model architecture, the hyperparameters, and the dataset. Rather than micromanaging every last piece of the model, the ML engineer "steers the ship." The objective is to drive the model's error, or loss (as described above), closer and closer to zero over numerous training and evaluation cycles (referred to as "epochs"). A model "converges" when the loss drops to a minimum and typically stabilizes there. At this point, the model is deemed "as good as it's going to get," in the sense that further training is unlikely to yield any performance improvements.

When a model's performance metrics settle, it is often possible to recognize this and apply an "early stopping" strategy: investing more effort and resources in training that no longer significantly improves the model is pointless. You can assess your model at this point to see whether it's ready for production. Real-world user testing is frequently helpful for deciding whether you're "ready" to release the product that embodies your model, or whether you still need to make adjustments, add more data, and retrain. Externalities will typically cause model drift or failures over time, necessitating ongoing model maintenance and improvement.

Divvying up your data

To create a model that can "generalize" appropriately to data it has never seen before, it's helpful to train the model on the bulk of the available data (typically 50-90%), leave out 5-20% as a "validation" set for adjusting hyperparameters, and reserve a final 5-20% to truly evaluate model performance. Avoid "tainting" or "contaminating" the training set with data that will later be used for testing the model: the model may "memorize" the answer by overfitting on that particular example, which would impair its ability to generalize. This separation is a crucial component of almost all successful machine learning projects. Some researchers refer to the test set (which the model has never seen before) as the "hold-out" or "held-out" set of data. You can think of this as the "final exam" for the model, which the model shouldn't have seen verbatim before exam day, even if it has seen similar examples in the metaphorical problem sets (checked against the answer key) during prior training.
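
As a concrete illustration, here is a minimal scikit-learn sketch of a roughly 70/15/15 train/validation/test split; the percentages and the toy Iris dataset stand in for your own choices and data.

```python
# A minimal 70/15/15 train/validation/test split sketch with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # small example dataset standing in for your own data

# Hold out 15% as the "final exam" test set the model never sees during training.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)

# Split the remainder again so roughly 15% of the full dataset becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```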

What types of data can ML models be trained on?
Tabular data

Go straight to the Computer Vision section if all you're interested in is computer vision or some of the more advanced and modern data types that ML models can handle. That said, working with tabular data helps clarify how deep learning and convolutional neural networks came about, so we'll start with this more basic type of data.

Tabular data is typically made up of rows and columns. The columns of each row often represent different kinds of data, such as a date, person, transaction, or other granular information. Together, these columns act as "features" that help the model accurately forecast an outcome. As a data scientist, you can also decide to train the model on combinations of columns that you multiply, subtract, or otherwise mix. A wide range of models can be applied to tabular data to predict a label, a score, or another (often synthetic) metric based on the inputs. It's often helpful to eliminate columns that are "co-linear," although some models are designed to deprioritize columns that are effectively redundant for determining a predictive outcome.

The practice of keeping training and test data apart carries over to tabular data, preventing the model from overfitting ("memorizing" the training data so that it regurgitates examples it has seen but reacts incorrectly to ones it hasn't). You can even dynamically move the sections of the table that you'll use (randomizing the split, or shuffling the table first, is good practice) such that the train, validation, and test sets fall in windows that you can swap or rotate throughout your dataset. Cross-fold validation, sometimes referred to as n-fold validation, is the process of dividing your training and test sets into distinct sections of the table by "folding" the table n times.
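
A minimal n-fold cross-validation sketch with scikit-learn follows; the five folds and the logistic regression baseline are illustrative choices.

```python
# Shuffle the table, "fold" it five times, and score a simple baseline on each fold.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=folds)
print(scores, scores.mean())
```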

[Image: an example of tabular data. Source: Wikipedia]

A final point about tabular data, which we'll revisit in the computer vision section, is that data often has to be scaled to a usable range. This might mean mapping values that range from 1 to 1,000,000 into floating point numbers between 0 and 1.0, or between -1.0 and 1.0. Machine learning engineers often need to experiment with different types of scaling (a logarithmic scale is also useful for some datasets) in order for the model to reach its most robust state, generating the most accurate predictions possible.
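
For example, rescaling a wide-ranging column into the 0-to-1 range (or taking its logarithm) is only a few lines with scikit-learn and NumPy; the values below are made up.

```python
# Rescale a heavily skewed column into [0, 1], and compare with a log transform.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[1.0], [50.0], [1_000.0], [1_000_000.0]])

scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(values)
log_scaled = np.log10(values)  # a logarithmic scale sometimes suits skewed data better

print(scaled.ravel())
print(log_scaled.ravel())
```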

Text

No discussion of machine learning would be complete without discussing text. Large language models have stolen the show in recent years, and they generally serve two roles:

  • Translate languages from one to another (even a language with only minimal examples on the internet)
  • Predict the next section of text—this might be a synthesized verse of Rumi or Shakespeare, a typical response to a common IT support request, or even an impressively cogent response to a specific question

Beyond deep and massive models, various other approaches can be applied to text, frequently in conjunction with large language models. These include unsupervised techniques such as clustering, principal component analysis (PCA), and Latent Dirichlet Allocation (LDA). Feel free to investigate these further on your own, as they aren't strictly speaking "supervised learning" or "deep learning" methodologies. They can still be helpful when used alongside a model that has been trained on labeled text data.

An additional crucial consideration for any textual modeling approach is "tokenization." This entails breaking the training text into manageable chunks, which could be single words, entire phrases, stems, or even syllables. The Python-based Natural Language Toolkit (NLTK), albeit now rather old, includes the long-standing Treebank tokenizer and Snowball stemmer. SpaCy and Gensim include more modern tokenizers, and even PyTorch, a cutting-edge, actively developed Python ML library, includes its own tokenizers.
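
A minimal tokenization (and stemming) sketch with NLTK might look like the following; the sample sentence is arbitrary, and the stems are lowercased root forms rather than dictionary words.

```python
# Tokenize a sentence with NLTK's Treebank tokenizer, then stem each token with Snowball.
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem.snowball import SnowballStemmer

text = "Tokenization breaks the training text into manageable chunks."
tokens = TreebankWordTokenizer().tokenize(text)
stems = [SnowballStemmer("english").stem(t) for t in tokens]

print(tokens)
print(stems)
```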

Returning to large language models, it is usually beneficial to train them on extremely large corpora (also called "corpuses"). Text data generally requires much less storage than high-resolution imagery, so you can assemble your own corpora, such as the entire works of Shakespeare, every Wikipedia article ever written, all of the public-domain books available on Project Gutenberg, or all of the public code on GitHub, or, if you'd rather write a scraping tool, as much of the textual internet as you can save on a storage device that will be quickly accessible by the system on which you plan to train your model.

You can "re-train" or "fine-tune" large language models (LLMs) using text data that is unique to your use case. These may be standard questions with excellent answers paired with each one, or they could just be a sizable body of writing from the same author or company, from which the model can infer the next n words in the document. (In Natural Language, each piece of writing is referred to as a "document"!) Similar to this, translation models can be adjusted to support the input-output use case that is necessary for translating between languages starting with a general LLM that has already been trained.

Images

Images were among the first kinds of data that scientists used to train truly "deep" neural networks, in the 1990s and again over the past decade. Uncompressed images require far more storage space than tabular data and, in certain situations, even audio. In addition to width and height (measured in pixels), image size also varies with color depth. (For instance, do you want to save brightness values alone or color information as well? At how many bits, and across how many channels?) Handwritten digits are among the easier images to recognize, since doing so only requires comparing binary pixel values at low resolution. In the 1990s, neural networks were laboriously trained on large sequential computers to adjust model weights for handwriting recognition (notably Yann LeCun's LeNet network, deployed with Pitney Bowes for postal code detection), and the MNIST handwritten digits dataset is still considered a useful baseline computer vision dataset for proving the minimal viability of a new model architecture.

More broadly, images include the digital photos most people are familiar with, taken on smartphones or digital cameras. Images typically have three channels, one each for red, green, and blue, and 8- or 10-bit values are typically used to encode the brightness of each color channel. Some ML model layers may learn their "features" from particular color channels while disregarding patterns in other channels; others may look only at overall brightness (in the form of a grayscale image).
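
As a quick illustration of channels and bit depth, the following sketch uses Pillow and NumPy to load a photo as an RGB array and as a grayscale (brightness-only) array; the file path is a placeholder.

```python
# Inspect the shape and bit depth of an image's RGB and grayscale representations.
from PIL import Image
import numpy as np

img = np.asarray(Image.open("photo.jpg"))                 # H x W x 3 array of 8-bit RGB values
gray = np.asarray(Image.open("photo.jpg").convert("L"))   # H x W brightness-only array

print(img.shape, img.dtype)   # e.g. (1080, 1920, 3) uint8
print(gray.shape)             # e.g. (1080, 1920)
```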

Many models are trained on Microsoft's COCO dataset (very large, and including per-pixel labels), ImageNet (larger still, with 1,000 label classes), or CIFAR (smaller, with 10 or 100 label classes). Once these models reach a reasonable level of accuracy, they can be re-trained or fine-tuned on data specific to a use case: for example, breeds of dogs and cats, or more practically, vehicle types.
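
For instance, re-training an ImageNet-pretrained classifier on your own classes can be as simple as swapping out the final layer. The sketch below assumes a recent torchvision (0.13 or newer, for the weights enum API); the 37-class pet-breed scenario is a made-up example.

```python
# Fine-tune only the final layer of a pretrained ResNet-50 on a custom set of classes.
import torch
import torch.nn as nn
from torchvision import models

num_classes = 37  # hypothetical: e.g. pet breeds in your own labeled dataset
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained backbone and replace the classification head.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...then train model.fc on batches drawn from your own labeled images.
```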

Video

Video is simply a series of images accompanied by sound. Training data can be derived from individual frames or from the "delta," the difference between two frames. (Occasionally, this is represented as a sequence of motion vectors overlaying a lower-resolution image.) Individual video frames can generally be processed by a model in the same way as standalone images. The primary distinction is that adjacent frames can take advantage of the likely overlap between an object's detected location in one frame and its (usually nearby) location in the next. Per-frame computer vision can also benefit from contextual cues such as sound identification or voice recognition from the linked audio track.
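
To make the per-frame "delta" concrete, here is a small OpenCV sketch that computes the pixel-wise difference between adjacent frames; the video path is a placeholder.

```python
# Compute a simple frame-to-frame difference ("delta") over a video clip.
import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, prev = cap.read()
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    delta = cv2.absdiff(frame, prev)   # pixel-wise difference between adjacent frames
    prev = frame
cap.release()
```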

Audio

Digitized sound waves in binary representation can be used for sentiment analysis and voice recognition, in addition to playback and mixing. Audio files are frequently compressed; they may use the OGG, AAC, or MP3 codecs, but they all usually decompress to amplitude values of 8, 16, or 24 bits, with sample rates ranging from 8 kHz to 192 kHz (usually in multiples of 2). Voice recordings typically require less bit depth or quality to capture, even for (relatively) convincing synthesis and correct recognition. While Hidden Markov Models (HMMs) were the mainstay of speech-to-text services in the past, long short-term memory networks (LSTMs) have more recently taken center stage in speech recognition. They typically power the voice-activated digital assistants you may use or be acquainted with, such as Alexa, Google Assistant, Siri, and Cortana. Training speech-to-text and text-to-speech models usually requires substantial computing power, although efforts are being made to lower the entry barriers for these applications. Transformers have proven useful in improving the accuracy and noise-robustness of speech-to-text systems, as they have in numerous other use cases. While Siri, Alexa, and Google Assistant serve as examples of this advancement, OpenAI's Whisper also shows how resilient these models are against noise interference, mumbling, and other challenges. Whisper is distinct in that it is not embedded in a consumer-facing product, but can be accessed through an API.

3D Point Clouds

Point clouds store point positions in three-dimensional Cartesian (or sometimes polar, depending on the sensor output) space. On screen they initially look like a random arrangement of dots, but the user can usually "spin" them around an arbitrary axis with the mouse, revealing that the points are actually arranged in three dimensions. This data is usually gathered by radially spinning laser rangefinders mounted on a moving car's roof. For indoor settings, an infrared camera that uses "structured light" or a camera that uses "time-of-flight" can obtain comparable data, usually at a higher point density. The Xbox Kinect camera and the Wiimote, the remote controller for the Wii game console, may be familiar examples.

Multimodal Data

It turns out that the unreasonable effectiveness of deep learning does not require training on a single type of data at a time. Your model can be trained on both audio and image inputs to generate text. It can even be trained on many cameras at once, combining all the photos each camera takes of an object at one moment into a single frame. Certain deep neural networks work well on mixed input types, frequently handled as one huge (or "long") input vector, even if this would be perplexing to a person looking at the training data. Similarly, mismatched data-type pairs, such as text prompts and image outputs, can be used to train a model.

What are some common classes or types of ML models?
Support Vector Machines (SVMs)

SVMs are among the simpler types of machine learning models. Usually employed as classifiers (yes, it's acceptable to consider "hot dog" versus "not hot dog," or better yet, cat versus dog, as labels in this context), they are still helpful for "non-linear" forms of classification that cannot be handled by a straightforward linear or logistic regression (think nearest neighbors, or fitting a slope on a point chart), although they are no longer the state of the art in computer vision. You can learn more about SVMs on their documentation page at the scikit-learn website. We'll continue to use scikit-learn as a reference for non-state-of-the-art models, because their documentation and examples are arguably the most robust available. And scikit-learn (discussed in greater depth below) is a great tool for managing your dataset in Python, and then proving that simpler, cheaper-to-train, or computationally less complex models aren't suitable for your use case. (You can also use these simpler models as baselines against which you can compare your deep neural nets trained on Refonte.AI infrastructure!)
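
As a baseline example, here is a minimal scikit-learn SVM fit on a toy two-class dataset with a non-linear boundary; the RBF kernel and dataset are illustrative choices.

```python
# Fit a non-linear (RBF-kernel) SVM on a toy two-class dataset and report accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)  # classes with a curved boundary
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```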

Random Forest Classifiers

Whether you need a solution that isn't computationally complex or you're trying to model tabular data with plenty of collinearities, Random Forest Classifiers have an uncanny knack for discovering a good answer. The "forest" is an ensemble of decision trees, each trained on a different bucket of the data, and together these trees determine the final result. In a cluster map, Random Forest Classifiers are able to identify non-linear borders between neighboring classes, albeit without following the boundary precisely. You can learn more about Random Forest Classifiers, again, in scikit-learn's documentation.
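
A minimal scikit-learn Random Forest baseline might look like the following; the dataset and hyperparameters are illustrative.

```python
# Train a Random Forest on a built-in tabular dataset and report held-out accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)  # 200 trees in the "forest"
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```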

Gradient-Boosted Trees

Gradient boosting is another method built on decision trees. In addition to handling co-linearities, these models provide several hyperparameters that can be used to prevent overfitting (memorizing the training set to achieve high accuracy during training, accuracy which does not carry over to the held-out test set). The framework that really took off on Kaggle is called XGBoost, but LightGBM and CatBoost are also quite popular for models in this class. As the number of hyperparameters rises, the model's complexity starts to diverge from some of the scikit-learn models derived from simpler regressions. (In essence, this means your model can be tuned in more ways.) You can read all about how XGBoost works here. While there are some techniques to attribute model outputs to specific columns or "features" on the input side, perhaps with Shapley values, XGBoost models certainly demonstrate that not every ML model is truly "explainable." As you might guess, explainability is challenged by model complexity, so as we dive deeper into complex neural networks, you can begin to think of them more as "black boxes." Not every step of your model will necessarily be explainable, nor will it necessarily be useful to hypothesize why the values of each layer in your model ended up the way they did after training has completed.
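
For reference, a hedged XGBoost sketch is shown below (it assumes the xgboost package is installed); the hyperparameters are common starting points rather than tuned values, and several of them, such as max_depth and subsample, exist mainly to control overfitting.

```python
# Train a gradient-boosted tree classifier with XGBoost's scikit-learn-style API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=4,        # shallower trees help guard against overfitting
    learning_rate=0.1,
    subsample=0.8,      # row subsampling, another overfitting control
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```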

Feedforward Neural Networks

These are the most basic type of neural network: they take an input of a fixed length and produce a classification as output. "Feedforward" means that each node in the graph is connected only in a forward direction, from input to output. If the model extends beyond an input layer and an output layer (each with its own set of weights), the intermediate layers are termed "hidden."
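
A minimal feedforward network in PyTorch, with one hidden layer, might look like this; the layer sizes are arbitrary.

```python
# A tiny feedforward (fully connected) network: fixed-length input, class scores out.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),   # input layer: 20 features in, 64 hidden units out
    nn.ReLU(),           # non-linear activation between layers
    nn.Linear(64, 3),    # output layer: scores ("logits") for 3 classes
)

x = torch.randn(8, 20)   # a batch of 8 examples, each with 20 features
logits = model(x)
print(logits.shape)      # torch.Size([8, 3])
```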

Recurrent Neural Networks

Neural networks underwent a further evolution when they were made "recurrent." In other words, nodes later in the inference or classification pipeline can conditionally link back to nodes in earlier layers. Because of this back-linking, certain recurrent networks can be "unrolled" into more straightforward feedforward networks, though other connection patterns prevent this. These models can perform inference on inputs of varying lengths, but the varying amount of computation, depending on how many loops a classification takes, means that inference time can differ from one run to the next. (It also relaxes the fixed-size input requirement described earlier.)

Convolutional Neural Networks

Convolutional neural networks (CNNs) apply small, learned filters that slide across an input, most commonly an image, so the same set of weights is reused at every position. This weight sharing dramatically reduces the number of parameters compared to fully connected layers, and it lets the network learn local features, such as edges and textures in early layers and increasingly abstract shapes in deeper layers, regardless of where they appear in the input. CNNs underpin most of the image classification and object detection models discussed later in this guide. The word "convolution" itself refers to a mathematical technique that involves both multiplication and addition, which we'll describe in greater detail in a later section on model layers.

Long Short-Term Memory Networks (LSTMs)

If you've been following along thus far, you might notice that most models encountered up to this point have no notion of "memory" from one classification to the next. Every "inference" performed at runtime depends entirely on the model, with no influence from any inference that came before it. That's where Long Short-Term Memory networks, or LSTMs, come in. These networks have a "gate" at each node that controls whether information is carried forward unchanged. This can lessen the "vanishing gradient" problem of RNNs, wherein layer weights may "vanish" toward 0 when every weight must change at each epoch (a single step-wise update of all model weights based on the output loss function). The mechanism is loosely comparable to physiological "short-term memory," stored in the synaptic connections of the human brain, which can persist across many epochs: over time, certain connections get stronger, weaker, or remain unchanged. Returning to practical applications: some earlier well-known networks gained notoriety for their capacity to identify cat faces in enormous datasets collected from YouTube, such as YouTube-8M. Eventually a model could even operate in reverse, recalling the rough outline of a cat face given the "cat" label as an input.
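
For reference, instantiating an LSTM layer in PyTorch takes only a few lines; the sizes below are arbitrary.

```python
# Run a batch of variable-feature sequences through an LSTM layer.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(4, 10, 16)        # 4 sequences, 10 time steps, 16 features per step
output, (h_n, c_n) = lstm(x)      # h_n and c_n carry the gated "memory" forward
print(output.shape)               # torch.Size([4, 10, 32])
```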

Q-Learning

Q-Learning is a "model-free" learning methodology that directs updates to the network's layers through reinforcement using a "objective function." This procedure was implemented by DeepMind during their well-known Go match against world champion Lee Sedol. Since then, Q-Learning has proven to be remarkably effective at picking up RPG strategy games like StarCraft and WarCraft as well as other historically notable Atari games.

[Image: OpenAI and MuJoCo, from OpenAI Gym]

Word2Vec

We haven't spent much time on text models thus far, so it's time to introduce Word2Vec. Word2Vec is a family of models that maps words into a vector space, so that related words end up close together as measured by cosine similarity. Word2Vec can produce a distributed word representation using either the continuous "skip-gram" or the continuous "bag-of-words" (CBOW) architecture. "Skip-gram" appears to handle infrequent words better, while CBOW is faster. You can learn more about word2vec in the documentation for the industry-standard gensim Python package. If you're looking to work with biological sequences like DNA, RNA, or even proteins, word2vec also proves useful in these scenarios: it handles sequences of biological and molecular data in the same way it does words.
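
A tiny gensim sketch (using the gensim 4.x API) is shown below; the toy corpus and hyperparameters are purely illustrative, and sg=1 selects the skip-gram variant.

```python
# Train a toy Word2Vec model and inspect embeddings and cosine-similar neighbors.
from gensim.models import Word2Vec

sentences = [
    ["machine", "learning", "models", "learn", "from", "data"],
    ["word2vec", "maps", "words", "into", "a", "vector", "space"],
    ["similar", "words", "end", "up", "with", "similar", "vectors"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
vector = model.wv["words"]                      # 50-dimensional embedding for "words"
print(vector.shape)
print(model.wv.most_similar("words", topn=2))   # nearest neighbors by cosine similarity
```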

Transformers

After LSTMs and RNNs reigned as the state of the art for natural language processing for several years, in 2017 a group of researchers at Google Brain formulated a set of multi-head attention layers that began to perform unreasonably well on translation workloads. These "attention units" typically consist of a scaled dot product. The main drawback of this architecture was that training on large datasets, and verifying performance on long input strings during training, was both computationally intensive and time-consuming. Attention as a function of the matrices Q, K, and V is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V. Modifications to this approach typically focus on reducing the computational complexity from O(N^2) to O(N log N) with Reformers, or to O(N) with ETC or BigBird, where N is the input sequence length. The larger "teacher" model (in the case of BERT) is typically a product of self-supervised learning, starting with unsupervised pre-training on large Internet corpora, followed by supervised fine-tuning. Common tasks that transformers can reliably perform include:

  • Paraphrasing
  • Question Answering
  • Reading Comprehension
  • Sentiment Analysis
  • Next Sentence Prediction/Synthesis

The authors of this model class titled their 2017 research paper "Attention Is All You Need," and the title was prescient: transformers are now a lower-case category of their own, and they have influenced vision systems as well (known as Vision Transformers, or ViT for short), along with CLIP, DALL-E, GPT, BLOOM, and other highly influential models. Next, we'll jump into a series of specific and canonically influential models; you'll find the transformer-based models at the end of the list that follows.
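
Before moving on, here is a minimal NumPy sketch of the single-head scaled dot-product attention formula given above; the matrix shapes are arbitrary and chosen for illustration.

```python
# Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V, for one attention head.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # QK^T / sqrt(d_k)
    return softmax(scores) @ V        # weight the values by the attention distribution

seq_len, d_k = 5, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
print(attention(Q, K, V).shape)       # (5, 8)
```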

What are some commonly used models?
AlexNet (historical)

AlexNet was the model that demonstrated that compute power and convolutional neural nets could scale to classify as many as 1,000 different classes in Stanford's canonical ImageNet dataset (and the corresponding image classification challenge). The model consisted of a series of convolutional layers, hidden layers, ReLUs, and some "pooling" layers, all of which we'll describe in a later section. It was the first widely reproduced model to be trained on graphics processors (GPUs), two NVIDIA GTX 580s to be specific, and nearly every successor model past this point was also trained on GPUs. AlexNet won the 2012 ImageNet challenge (its paper appeared at N(eur)IPS that year), and it became the inspiration for many successor state-of-the-art networks like ResNet, RetinaNet, and EfficientDet. Whereas predecessor neural networks such as Yann LeCun's LeNet could perform fairly reliable classification into 10 classes (mapping to the 10 decimal digits), AlexNet could classify images into any of 1,000 different classes, complete with confidence scores.

Original paper here and Papers with Code.

ResNet

Residual Networks—ResNet for short—encompass a series of image classifiers of varying depth. (Depth, here, roughly scaling with classification accuracy and also compute time.) Often, while training networks like AlexNet, the model won't converge reliably. This “exploding/vanishing” (“vanishing” was explained above, while “exploding” means the floating point value rapidly increases to the maximum range of the data type) gradient problem becomes more challenging as the ML engineer adds more layers or blocks to the model. Compared to previous “VGG” nets of comparable accuracy, designating certain layers as residual functions drastically reduces complexity, enabling models up to 152 layers deep that still have reasonable performance in terms of inference time and memory usage. In 2015, ResNet set the standard by winning 1st place in ImageNet and COCO competitions, for detection, localization, and segmentation.

Original paper here on arXiv and Papers with Code.

Single Shot MultiBox Detector (SSD)

In 2015, Wei Liu and collaborators released the Single Shot MultiBox Detector (SSD). It was written in Caffe and offered neural networks an effective means of proposing bounding boxes for objects and dynamically refining those boxes for object detection. AlexNet demonstrated the utility of image classification to the market, while SSD cleared the way for increasingly efficient forms of object detection, even at high frame rates such as those of a webcam (60 frames per second, or "FPS"), an autonomous car, or a security camera (typically lower, perhaps 24 FPS).

Original paper here on arXiv and Papers with Code.

Faster R-CNN and Mask R-CNN

Bounding boxes, or rectangles, are useful for identifying things in images, but sometimes more specific information is beneficial as well. Fortunately, datasets containing images and matching per-pixel labels can be used to train models. Faster R-CNN and Mask R-CNN (a popular open-source implementation of which was released by Matterport, a residential 3D scanning firm) are notable, iterative models that produce strong pixel-wise predictions in a reasonable amount of time. These models can be especially helpful for robotics applications like taking goods out of boxes or rearranging objects in a scene. Facebook AI Research released Mask R-CNN on GitHub under the moniker "Detectron." Below, we'll discuss Detectron2, its successor.

Original papers here and here on arXiv and here and here on Papers with Code.

You Only Look Once (YOLO) and YOLOv3

Similar to the previously discussed SSD, Joseph Redmon at the University of Washington chose to manually code a different "single shot" object detector rather than using the then-emerging TensorFlow framework. The objective was to make the detector run incredibly quickly using only C and CUDA (a GPU programming language) code. His model design is still in use today by Ultralytics, a company centered around providing clients with YOLOv5 models, which are currently available in PyTorch. YOLO is an architecture that pushes the boundaries of high-quality and fast object recognition and has withstood the test of time. It is still relatively relevant today.

Original papers here and here on arXiv and here and here on Papers with Code.

Inception v3 (2015), RetinaNet (2017) and EfficientDet (2020)

In the decade since Alex Krizhevsky of the University of Toronto unveiled AlexNet, a new model emerged every year or two as the winner of the annual ImageNet and MS COCO challenges. Although high-speed, high-accuracy object detection may appear to be a "solved" problem today, there is always potential to find more compact, simpler models that offer better performance in terms of speed, quality, or other metrics. The state of the art for object detection based on "neural architecture search," or employing a model to choose among various configurations, sizes, and kinds of layers, has advanced somewhat. The best models available today, however, take inspiration from previous trials and no longer use an independent ML model to deliberately search for improved model configurations.

Original papers for Inception v3, RetinaNet, and EfficientDet are available on arXiv. You can find Inception v3, RetinaNet, and EfficientDet on Papers with Code as well.

[Figure: comparative performance of several models on ImageNet and MS COCO. Mingxing Tan et al., 2020]

U-Net

Following the successful application of deep neural networks to diagnosing diabetic retinopathy (eye scans for diabetes), several studies have investigated semantic segmentation models for other radiological applications, such as Computerized Tomography (CT) scans. U-Net, which was also initially implemented in Caffe, employed a relatively unusual mirror-image architecture whose high-dimensional inputs and outputs correspond to per-pixel labels, and it was used for both cell detection and anomaly labeling. U-Net is still relevant today because it influenced more recent models for semantic segmentation and image generation, such as Detectron2 and even Stable Diffusion.

Original paper here on Springer and Papers with Code.

Detectron2

With the release of the Panoptic-DeepLab paper in 2019, Facebook Research raised the bar even further for both object detection and semantic segmentation. The code is freely accessible in the Detectron2 repository on GitHub. As the cutting edge of low-latency, high-accuracy semantic segmentation and object detection, Detectron2 is still in use today. Every few years, it seems, the state of the art is firmly surpassed, often at the cost of processing power, and these models are eventually distilled or simplified to facilitate training and deployment. Perhaps in a few more years, memory and processing requirements will drop significantly, or a benchmark for unquestionably better quality will be set.

Original paper here on arXiv and Papers with Code.

GPT-3

As the industry shifted to natural language use cases, it concentrated heavily on translation, moving from word-by-word dictionary lookups to transformer models that predict the next word of the translated output using the entire source phrase as input. Well-known services like Google Translate became noticeably better as a result. Apart from Google's work, the next major turning point in natural language computing was the release of OpenAI's GPT. The final trained GPT-3 model, which has 175 billion parameters and was trained on Common Crawl, Wikipedia, books, and web-text datasets, was a notably huge model for its time and is still impressive as of this writing. Just as BERT (an earlier transformer-based language model) inspired smaller, more efficient derivatives, since GPT-3's release in 2020, the smaller InstructGPT model, released in 2022, has shown a greater ability to respond to directive prompts instructing the model to do something specific, all while including fewer weights. One important ingredient in this mix is RLHF, or Reinforcement Learning from Human Feedback, which seems to enable enhanced performance despite a smaller model size. A modified version of GPT-3 trained on public Python code from GitHub, known as Codex, is deployed as "GitHub Copilot" in Microsoft's open-source Visual Studio (VS) Code integrated development environment (IDE).

Original paper here on arXiv and GitHub.

BLOOM

BLOOM is a novel model designed to serve as a huge language model for text generation, similar to GPT-3. However, unlike GPT-3, which is only offered by OpenAI and Microsoft as a commercial API, the complete model and training dataset are freely accessible and open-source. Since the trained BLOOM model (full size) has 176 billion parameters, training it from scratch required a large amount of compute power. The model was trained as part of a BigScience workshop, in collaboration with French government and academic research organizations, on the IDRIS supercomputer located close to Paris. BLOOM's training data and model code are entirely open source, which may make it easier to fine-tune the model for particular applications.

Code and context available on the HuggingFace blog.

The following sections include references to papers covered in our guide to Diffusion Models.

DALL·E 2

When DALL·E 2 was released in 2022, it stunned the industry with its remarkable capacity to create believable, visually striking images based only on text prompts. Like GPT-3, and akin to the LSTM "classification in reverse" example mentioned earlier, DALL·E 2 is trained on a large dataset of internet-scraped pictures and accompanying descriptive text. To reduce the possibility that the model would produce harmful outputs, faces are eliminated and explicit or dangerous content is deleted using a combination of human feedback and models. Microsoft's Designer and Image Creator programs utilize DALL·E 2, which is accessible as an API.

The DALL·E 2 paper is available on the OpenAI website. Unfortunately no model code is available at this time.

Stable Diffusion

In contrast to OpenAI's API-only approach with DALL·E 2, several business founders, notably Emad Mostaque of Stability.ai and the team at RunwayML, chose to release a functionally similar model and dataset openly. Known as Stable Diffusion, it extends a method created a year earlier at LMU Munich. It creates images from text prompts or a source image by applying a sequence of noising and denoising steps. Another option is a method known as "outpainting," which involves instructing the model to enlarge the canvas in a particular direction and repeatedly running the model to synthesize tiles that extend the final image. The model architecture incorporates a U-Net as a core component and runs on consumer graphics cards (more than 10GB of VRAM is preferred). Unlike DALL·E 2, since Stable Diffusion can be run locally on the user's workstation, harmful text prompts are not filtered; it is mainly the responsibility of the user to use the model for productive rather than harmful purposes.

Code and context available on the HuggingFace model card page.

What are some common layer types in ML models?
Input Image (or other input matrix p x q x r)

Images are usually padded or scaled to a predetermined input size, such as 256 x 256 pixels. Since modern networks can learn from color values as well as brightness, models can concentrate on learning features from color data; color cues may be crucial for distinguishing one species of bird from another, or one kind of manufacturing defect from another. If the input image at inference time is too big to scale down to 256 x 256 without losing too much information, it can be scanned iteratively, one row at a time, with a sliding classification window that strobes across the picture from left to right and top to bottom. The distance between one window position and the next is known as the model's "stride." Meanwhile, on the training side, much research has been devoted to the best ways to crop, scale, and pad images with "blank" or "zero" pixels so that the model remains robust at inference time.


Convolutional Layer

Convolution is a mathematical operation and signal processing technique that accepts two signals as inputs and produces the integral of their product as one is slid across the other. To compute the output signal over time, the second input function is flipped across the y-axis and moved along the baseline of the first. Convolutions are frequently used to filter signals, emphasizing some frequency components while suppressing others.
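
To connect the mathematical operation to the network layer, the sketch below shows a discrete 1D convolution in NumPy alongside a 2D convolutional layer in PyTorch; the sizes and the smoothing kernel are arbitrary.

```python
# Slide-multiply-sum in 1D with NumPy, and the same idea as a learnable 2D layer in PyTorch.
import numpy as np
import torch
import torch.nn as nn

signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
kernel = np.array([0.25, 0.5, 0.25])                  # a simple smoothing filter
print(np.convolve(signal, kernel, mode="same"))       # slide, multiply, and sum

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
image_batch = torch.randn(1, 3, 256, 256)             # one RGB image, 256 x 256
features = conv(image_batch)
print(features.shape)                                 # torch.Size([1, 16, 256, 256])
```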

Sigmoid

With a linear activation function, a neural network can only learn a linear border between one class and another. Since many real-world problems have non-linear class borders, non-linear activation functions are required. The sigmoid, sigma(x) = 1 / (1 + e^(-x)), is a classic choice: it squashes any real-valued input into the range (0, 1) and is smooth and differentiable, which gradient-based training requires.

ReLU (Rectification Linear Unit)

Because a ReLU layer simply passes positive inputs through linearly and sets all negative inputs to zero, it is computationally far simpler than a sigmoid. ReLU turns out to be a quick and practical substitute for the far more costly sigmoid function when connecting layers.
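
A quick NumPy comparison of the two activations might look like this.

```python
# Compare sigmoid (smooth squashing into (0, 1)) with ReLU (zero out negatives).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(relu(x))
```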

Pooling layer

By downsampling features from different areas of the image, this layer renders the output "invariant" to small translations of objects or features. Although a convolution's "stride," or the separation between the centers of successive convolution "windows," can be changed, using a pooling layer to "summarize" the features found in each area of an image is more typical.


Softmax

This is usually a neural network's final activation function, used to normalize the output into a probability distribution over the expected output classes. It essentially converts the network's raw final-layer values ("logits") into probabilities for the desired output labels: softmax(z_i) = e^(z_i) / sum_j e^(z_j), so every output is positive and all outputs sum to 1.
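
A minimal sketch of the softmax computation follows; the logits are made up.

```python
# Turn raw class scores ("logits") into a probability distribution that sums to 1.
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())
```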

ML Modeling Frameworks/Libraries
Scikit-learn

Scikit-learn is a Swiss Army knife of frameworks in that it supports everything from linear regression to basic neural networks. ML engineers typically don't use it for bleeding-edge research or production systems, except perhaps for its data loaders. That said, it is incredibly useful for utilities that split training data from evaluation and test data, and for loading a number of standard base datasets. It can also be used for baseline (simpler) models, such as Support Vector Machines (SVMs) and Histogram of Oriented Gradients (HOG) features, which serve as sanity checks, comparing algorithmically and computationally simpler models to their more sophisticated, modern counterparts.

XGBoost

XGBoost is a framework that allows the simple training of gradient-boosted trees, and it has proven its value across a wide range of tabular use cases on Kaggle, an online data science competition platform. In many scenarios, before training an LSTM to make a prediction, it's helpful to try an XGBoost model first, to rule out the possibility that a simpler model might do the job equally well, if not better.

Caffe (historical)

Yangqing Jia wrote the Caffe framework while he was a PhD student at UC Berkeley, with the intent of giving developers access to machine learning primitives that he wrote in C++ and CUDA (a GPU computation language), with a Python interface. The framework was primarily useful for image classification and image segmentation with CNNs, R-CNNs, LSTMs, and fully connected networks. After hiring Jia, Facebook announced Caffe2, supporting RNNs for the first time, and soon merged the codebase into PyTorch (covered two sections later).

TensorFlow

Originally launched in 2015, TensorFlow quickly became the standard machine learning framework for training and deploying complex ML systems, with GPU support for both training and inference. While TensorFlow is still used in many production systems, it has seen decreased popularity in the research community. In 2019, Google launched TensorFlow 2.0, merging Keras' high-level capabilities into the framework. TensorBoard, a training and inference statistics and visualization tool, also gained preliminary support for PyTorch, a rival, more research-oriented framework.

PyTorch

Roughly a year after TensorFlow's release, Facebook (now Meta) released PyTorch, a port of NYU's Torch library from the relatively obscure Lua programming language over to Python, hoping for mainstream adoption. Since its release, PyTorch has seen consistent growth in the research community. As a baseline, it supports scientific computing in a manner similar to NumPy, but accelerated on GPUs. Generally speaking, it offers more flexibility to update a model's graph during training, or even inference. For some researchers the ability to perform a graph update is invaluable. The PyTorch team (in conjunction with FAIR, the Facebook AI Research group) also launched ONNX with Microsoft, an open source model definition standard, so that models could be ported easily from one framework to another. In September 2022, Meta (formerly Facebook) announced that PyTorch development would be orchestrated by its own newly created PyTorch Foundation, a subsidiary of the Linux Foundation.

Keras (non-TensorFlow support is now only historical)

In 2015, around the same time as TensorFlow's release, ML researcher François Chollet released Keras, in order to provide an "opinionated" library that helps novices get started with just a few lines of code and lets experts deploy best-practice, prefab model elements without having to concern themselves with the details. Initially, Keras supported multiple back-ends in the form of TensorFlow, Theano, and Microsoft's Cognitive Toolkit (CNTK). After Chollet joined Google a few years later, Keras was integrated into the version 2.0 release of TensorFlow, although its primitives can still be called via its standalone Python library. Chollet still maintains the library with the goal of automating as many "standard" parts of ML training and serving as possible.

Chainer (historical)

Initially released in 2015 by the team at Japan's Preferred Networks, Chainer was a Python-native framework especially useful in robotics. While TensorFlow had previously set the industry standard with a "define then run" modality, Chainer pioneered the "define-by-run" approach, which meant that every run of a model could redefine its architecture, rather than relying on a separate static model definition. This approach is similar to that of PyTorch, and ultimately Preferred Networks migrated its development efforts to PyTorch, a framework that had independently grown in popularity and matched some of Chainer's founding design principles.

YOLOv5

Although we have already addressed YOLO above as a model class, there have been significant updates to this class, with YOLOv5 serving as Ultralytics' PyTorch rewrite of the classic model. Ultimately, the rewrite uses 25% of the memory of the original and is designed for production applications. YOLO's continued support and development is a testament to the efficiency of its original design, and successive iterations of YOLO continue to provide competitive, high-performance semantic segmentation and object detection, even though the original model consciously chose to avoid using industry-standard frameworks like TensorFlow and PyTorch.

HuggingFace Transformers

The world of machine learning frameworks and libraries is no stranger to higher-level abstractions, with Keras serving as the original example of this approach. A wider-ranging and more recent attempt at a higher-level abstraction library for dataset ingest, training, and inference, Transformers provides easy few-line access in Python to numerous natural language, computer vision, audio, and multimodal models. Transformers also offers cross-compatibility with PyTorch, TensorFlow, and JAX. It is accessible through a number of sample Colab notebooks as well as on HuggingFace's own Spaces compute platform.
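
For example, a sentiment classifier really can be a few lines with the pipeline API (this downloads a default model on first run).

```python
# Few-line inference with the Hugging Face Transformers pipeline API.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Training this model was easier than I expected."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```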

Choosing Model Metrics

Even though you may start with accuracy or loss when selecting measures for your model's training, it's crucial to ensure that your metrics are business-aligned. This could mean that some classification or prediction scenarios are more important to you than others. Essentially, you want your model to align with your company's aims, not undermine or distract from them.

Minimizing loss

Whenever you train a model and apply a function to update the weights in different layers, the objective is always to "minimize loss"—that is, error. If loss is consistently declining, your model is converging toward a state where it can produce useful classifications or predictions. Loss is the metric the model iteratively updates its weights to optimize; however, the user of a model may be more interested in other, derived metrics, such as accuracy, precision, and recall, which we'll discuss next.

Maximizing accuracy

You'll assess the accuracy of the model every few "epochs" or so after changing the weights in the layers of your neural network to make sure it's becoming better rather than worse. Accuracy is usually assessed on both training and test sets to ensure that performance isn't maximized on one but minimized on the other (a phenomenon called "overfitting" occurs when a model performs well on training data but is unable to generalize to test data that it has never seen before).

Precision and recall

Precision defines what proportion of positive identifications were correct and is calculated as follows:

Precision = True Positives / (True Positives + False Positives)

  • A model that produces no false positives has a precision of 1.0

Recall defines what proportion of actual positives were identified correctly by the model, and is calculated as follows:

Recall = True Positives / (True Positives + False Negatives)

  • A model that produces no false negatives has a recall of 1.0
F-1 or F-n Score

F-1 is a statistic that takes both precision and recall as inputs: it is their harmonic mean. Perfect precision and recall yield a score of 1.0, while F-1 is 0 when either precision or recall is 0. Depending on your objectives, you can bias the harmonic mean to favor recall or precision by choosing a different value of n in the more general F-n score. For instance, which is more important to you: fewer false positives or fewer false negatives? Safety-critical situations will often require you to tolerate false positives but not false negatives, or vice versa.
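
A small sketch computing precision, recall, and the generalized F-beta score from raw counts is shown below; the counts are invented, and beta greater than 1 weights recall more heavily than precision.

```python
# Precision, recall, and F-beta = (1 + beta^2) * P * R / (beta^2 * P + R) from raw counts.
def f_beta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    score = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, score

print(f_beta(tp=90, fp=10, fn=30, beta=1.0))   # standard F-1
print(f_beta(tp=90, fp=10, fn=30, beta=2.0))   # F-2 favors recall over precision
```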

IoU (Intersection-over-Union)

This metric is the degree of overlap between the model's prediction and the ground truth (the data annotations). It applies to both 2D and 3D object detection problems and is especially relevant to autonomous vehicles. Consider two rectangles that intersect, with the intersection serving as your success target: a larger overlap indicates a better model. IoU is the quotient of the area of the overlap divided by the area of the union of the prediction and ground-truth bounding rectangles. The idea extends to three dimensions, where the overlap of two rectilinear prisms is the target, and the IoU (again between ground truth and prediction) is the volume of the overlapping prism divided by the volume of the union of the two intersecting prisms.
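
A minimal 2D IoU computation for axis-aligned boxes, given as (x1, y1, x2, y2) corners, might look like this; the example boxes are made up.

```python
# Intersection-over-Union for two axis-aligned rectangles.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes don't intersect).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175, roughly 0.143
```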

Best Practices and Considerations for Training Models
Time required

More complicated model architectures, larger datasets, and higher input resolutions generally take longer to train into a performant model, while simpler model designs and smaller datasets can be trained more quickly. Even though training on increasingly large datasets can help increase model accuracy (datasets with 1,000 or even 10,000 samples are a good place to start), more data samples won't always improve model performance beyond a certain point if they don't contain any new information, especially about rare or "edge" cases.

Model size

An additional limitation may be the total number of the model's weights. The more weights a model has, the more (GPU) memory it uses during training. Training a model across several memory banks, whether on different GPUs within the same system or on different systems entirely, requires additional work.

Dataset size

Usually, one needs at least 1,000 images or 1,000 rows of tabular data to build a meaningful model. Training on fewer examples may be feasible, but with tiny datasets you will likely run into issues such as overfitting. If a dataset of the right order of magnitude isn't available for your use case, you can still explore options like synthetic augmentation or simulation.

Compute requirements

Training models usually requires a lot of computation and time. Distributed training is appropriate for certain models, especially if the graph can be stored and replicated on several machines. Logistic regressions and SVMs are simple and quick to run, while larger models might take hours or even days to train. Even basic methods such as K-means clustering can demand a lot of computing power on large datasets, and using GPUs or specialized AI hardware like TPUs for training and inference can deliver significant speedups.

Spectrum: easy to hard
  • If there's a simpler tool for the job, it makes sense to start there. Simpler models include logistic regression and support vector machines (SVMs).
  • That said, the boundaries between one class and another won't always be smooth or “differentiable” or “continuous.” So sometimes more sophisticated tools are necessary.
  • Random Forest Classifiers and Gradient Boosted Machines can typically handle easier classification challenges like the Iris dataset.
  • Larger datasets and larger models typically require longer training times. So when more sophisticated models are required to achieve high accuracy, often it will take more time, more fine tuning, and a larger compute/storage budget to train the next model.
Conclusion

With the array of tools at our disposal, training machine learning models shouldn't be a difficult task, and getting started has never been simpler. Having said that, avoid using machine learning as a band-aid solution: before settling on a CNN, DNN, or LSTM as the best option, demonstrate that less complicated methods aren't just as effective. Although it can seem like an art, training models is best approached as a science. Your dataset's peculiarities and hyperparameters may affect the results, but if you look closely you can ultimately confirm or refute theories about what's driving better or worse outcomes. It's crucial to follow a thorough and systematic approach; maintaining an engineer's notebook or log is recommended. Tracking hyperparameters in GitHub alone is largely insufficient; you should ruthlessly test your models for regressions, track aggregate and specific performance metrics, and use tooling where possible to take the tedium out of writing scripts.
