Machine Learning Infrastructure: Dedicated Server Guide

Machine learning has moved from research laboratories into production infrastructure. Recommendation engines, fraud detection systems, natural language processing pipelines, computer vision models, and real-time inference APIs all run as live workloads serving real users, and all of them place demands on infrastructure that shared hosting environments were never designed to meet.

Furthermore, the infrastructure requirements of machine learning are specific and unforgiving. Training a model requires sustained, high-intensity CPU or GPU compute over hours or days. Inference, meanwhile, requires low-latency responses under concurrent load. Additionally, data preprocessing requires fast storage I/O for reading large datasets. All of these demand exclusive, predictable hardware, which is exactly what a dedicated server provides.

This guide explains what machine learning workloads actually need from infrastructure, where shared and cloud environments fall short, and how dedicated servers address each requirement specifically.

📖 New to dedicated server infrastructure?

Before exploring AI and ML workloads specifically, it helps to understand what dedicated servers are and how they differ from shared and virtualised environments. Read What Is a Dedicated Server?, a complete introduction to bare-metal infrastructure.

What Machine Learning Workloads Actually Require

Machine learning is not a single workload type; it is a family of distinct computational patterns, each with different infrastructure requirements. Understanding the difference between training, inference, and data processing is essential for choosing the right hardware configuration.

Model Training

Training a machine learning model involves iterating over a large dataset many times, adjusting model parameters on each pass to minimise prediction error. This process is computationally intensive, runs continuously for hours or days, and generates significant heat and power consumption.

CPU-based training, which covers a broad range of models including gradient boosting, random forests, classical neural networks, and many natural language processing architectures, requires sustained multi-core compute at high utilisation for extended periods. A training job that runs for 12 hours at 95% CPU utilisation in a shared environment competes with other tenants throughout that window, so its run time becomes unpredictable and the job may be throttled or terminated.

On a dedicated server, the full CPU capacity is available for the duration of the training run. No other tenant’s workload interrupts it, throttles it, or competes for the same cores.

Inference and Serving

Once a model is trained, it serves predictions in response to requests. This is inference. A recommendation engine responding to a user click, a fraud detection model evaluating a transaction, or a sentiment analysis API processing a text input all involve inference workloads.

Inference requires low and consistent latency. Each prediction request executes a forward pass through the model, which involves matrix multiplications across the model’s parameters. The speed of this computation depends on CPU clock speed, available memory bandwidth, and whether the model fits entirely in RAM or requires reading from storage.

For high-concurrency inference (many simultaneous prediction requests), core count determines how many requests execute in parallel. For low-latency single-request inference, clock speed matters more. The right specification depends on the specific inference pattern your application generates.
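The core-count side of this trade-off can be put into rough numbers. The sketch below assumes each request occupies one core for the whole forward pass and ignores batching, queuing, and memory-bandwidth contention. The function name and the example figures are illustrative, not measurements.

```python
def max_requests_per_second(cores: int, latency_ms: float) -> float:
    """Upper bound on throughput when each request occupies one core.

    Assumption: requests are independent and CPU-bound, with no batching,
    queuing delay, or memory-bandwidth contention.
    """
    if cores < 1 or latency_ms <= 0:
        raise ValueError("cores must be >= 1 and latency_ms must be > 0")
    return cores * 1000 / latency_ms


# Hypothetical figures: a 50 ms forward pass on a 20-core server.
print(max_requests_per_second(20, 50))  # 400.0
```

Real throughput will be lower than this bound, but it shows why more cores raise the throughput ceiling while a faster clock lowers the per-request latency.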

Data Preprocessing and Feature Engineering

Before training begins, raw data must be cleaned, transformed, normalised, and formatted, a process called preprocessing or feature engineering. This stage involves reading large files from storage, applying transformations in memory, and writing processed outputs back to disk.

Storage I/O performance is the primary bottleneck here. Reading a 500GB dataset of raw training data from a slow storage device takes significantly longer than reading it from NVMe, and in many pipelines this preprocessing step runs again before every training run.

Why Shared Hosting and Standard VPS Fall Short for AI Workloads

Machine learning workloads expose the structural limitations of shared infrastructure more quickly than most other workload types.

Resource Contention Under Sustained Load

Training workloads run at high CPU utilisation continuously for hours. On shared hosting or a VPS where CPU is divided among multiple tenants, this sustained demand triggers throttling: the provider limits CPU consumption to prevent one tenant from monopolising shared resources.

Throttling during training does not just slow the job down. It introduces variability: the training process takes longer, costs more in compute time, and, if the job runs against a time budget, may stop at a different point than it would on dedicated hardware where compute is consistent throughout.

Insufficient RAM for Large Models

Modern machine learning models are large. A mid-size language model may require 8 to 32GB of RAM just to load its parameters for inference. A training job that processes large batches of data simultaneously may require 64GB or more of working memory. Many VPS environments cap RAM at levels that simply cannot accommodate these requirements.

When a model does not fit in RAM, the system begins reading model parameters from disk during inference, a process that increases inference latency by orders of magnitude and makes real-time serving impractical.

Storage I/O Limitations

Shared storage environments, particularly network-attached storage used by many VPS providers, introduce latency to every disk read. For data preprocessing pipelines that read and write large datasets continuously, this latency compounds into training pipelines that are significantly slower than they need to be.

NVMe-based dedicated storage delivers random read latency around 20 microseconds. Network-attached storage in a shared VPS environment may deliver 1 to 10 milliseconds per operation, which is roughly 50 to 500 times slower and becomes very significant at dataset scale.
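To see how these per-operation latencies add up, the sketch below multiplies them by a hypothetical number of small random reads issued one after another. The read count is an assumption chosen for illustration, and real pipelines overlap many reads, so treat the result as an order-of-magnitude comparison only.

```python
def total_latency_seconds(operations: int, latency_us: float) -> float:
    """Time spent waiting if `operations` random reads run one after another."""
    return operations * latency_us / 1_000_000


reads = 1_000_000  # hypothetical number of small random reads

nvme = total_latency_seconds(reads, 20)          # 20 microseconds per read
nas_low = total_latency_seconds(reads, 1_000)    # 1 millisecond per read
nas_high = total_latency_seconds(reads, 10_000)  # 10 milliseconds per read

print(f"NVMe: {nvme:.0f} s")
print(f"Network-attached storage: {nas_low:.0f} to {nas_high:.0f} s")
```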

📖 How does NVMe storage accelerate data-intensive workloads?

Data preprocessing and dataset loading are storage-bound operations. Read How NVMe Storage Boosts Dedicated Server Performance, a complete breakdown of how NVMe latency and IOPS translate into faster data pipelines.

How Dedicated Servers Support Machine Learning Infrastructure

Exclusive CPU for Sustained Compute

A dedicated server’s CPU is exclusively yours for the duration of any workload. A training job that requires 20 cores at sustained high utilisation for 18 hours gets exactly that: no throttling, no competing tenant workloads, no variability introduced by shared infrastructure activity.

Moreover, this exclusivity has a direct impact on reproducibility. Machine learning experiments depend on consistent compute environments to produce comparable results across runs. Shared infrastructure, by contrast, introduces noise into the compute environment that dedicated hardware eliminates.

Ample RAM for Large Model Workloads

Dedicated servers in the Swify range offer configurations from 64GB to 256GB of DDR4 RAM. For machine learning workloads, this matters in several specific ways.

Large language models and deep neural networks loaded for inference require their parameters to reside entirely in RAM for fast prediction. A model with 7 billion parameters in 16-bit precision requires approximately 14GB of RAM, and that is before accounting for activation memory during inference. Larger models require proportionally more.
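The 14GB figure follows directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic, using decimal gigabytes and counting weights only; the function name is illustrative, and the 4-bit case is shown as one example of quantisation rather than a recommendation.

```python
def parameter_memory_gb(parameters: float, bits_per_parameter: int) -> float:
    """Memory for model weights only, in decimal gigabytes (1 GB = 10**9 bytes).

    Activation memory and runtime overhead are not included.
    """
    return parameters * bits_per_parameter / 8 / 1e9


print(parameter_memory_gb(7e9, 16))   # 14.0  (7B parameters, 16-bit)
print(parameter_memory_gb(7e9, 4))    # 3.5   (7B parameters, 4-bit quantised)
print(parameter_memory_gb(70e9, 16))  # 140.0 (70B parameters, 16-bit)
```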

Training workloads benefit from large RAM in a different way: bigger batch sizes. Processing more training examples simultaneously per GPU or CPU pass is more computationally efficient than processing small batches, and larger RAM allocations make larger batches possible without hitting memory limits.

High-Performance NVMe Storage for Data Pipelines

Swify dedicated servers use SSD storage, providing the fast sequential read throughput and low random access latency that data-intensive ML pipelines require. Loading a 100GB training dataset from NVMe storage takes a fraction of the time it would take from spinning HDD or network-attached storage, and when the dataset does not fit in memory this loading time repeats at the start of every training epoch.
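A rough way to estimate that loading time is to divide dataset size by sustained sequential throughput. In the sketch below the throughput figures are hypothetical round numbers chosen for illustration, not measurements of any particular drive or of Swify hardware.

```python
def load_time_seconds(dataset_gb: float, throughput_mb_per_s: float) -> float:
    """Time to stream a dataset once at a sustained sequential throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return dataset_gb * 1000 / throughput_mb_per_s


# Hypothetical sustained throughputs, for illustration only.
for label, mb_per_s in [("NVMe SSD", 3000), ("spinning HDD", 150)]:
    print(f"{label}: {load_time_seconds(100, mb_per_s):.0f} s per pass")
```

Multiplying the per-pass time by the number of epochs gives the total time a training run spends waiting on storage.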

For teams running multiple training experiments simultaneously (a common pattern in hyperparameter tuning and model comparison), fast storage ensures that dataset loading is not the limiting step in the experimental cycle.

Full Root Access for ML Environment Configuration

Machine learning workloads require specific software environments: Python runtime versions, CUDA drivers for GPU computing, specific versions of PyTorch, TensorFlow, scikit-learn, or XGBoost, and dependency management through conda or virtualenv. Shared environments rarely permit the level of system configuration that production ML infrastructure requires.

A dedicated server with full root access, however, allows installing exactly the software stack the workload needs (specific Python versions, library combinations, system-level optimisations for numerical computing) without the restrictions that managed environments impose.
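Once the stack is installed, it helps to record exactly what is running so that experiments can be repeated later. The sketch below uses only the Python standard library; the function name and the package list are illustrative, and packages that are not installed are reported as None rather than raising an error.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return interpreter, OS, and installed package versions as a dict."""
    snapshot = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = None  # not installed in this environment
    return snapshot


# Illustrative package list; adjust to the libraries the workload uses.
for key, value in environment_snapshot(["numpy", "scikit-learn", "xgboost"]).items():
    print(f"{key}: {value}")
```

Saving this output alongside each trained model gives a simple record of the environment that produced it.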

CPU-Based Machine Learning: Where Dedicated Servers Excel

While GPU-accelerated deep learning receives most of the attention in AI infrastructure discussions, a large proportion of practical machine learning work runs efficiently on CPU:

Gradient boosting models – XGBoost, LightGBM, and CatBoost are among the most widely deployed ML models in production, used extensively in fintech fraud detection, e-commerce recommendation, and healthcare risk scoring. They train efficiently on multi-core CPU hardware and benefit directly from high core counts.

Classical machine learning – random forests, support vector machines, logistic regression, and other algorithms that underpin much of production ML in regulated industries run entirely on CPU and scale well with core count.

Natural language processing inference – transformer-based NLP models like BERT derivatives, when quantised and optimised for CPU inference, serve predictions in production at acceptable latency on well-specified CPU hardware.

Data preprocessing pipelines – pandas, NumPy, and scikit-learn preprocessing steps run on CPU. For teams spending significant time on feature engineering and data preparation, a high-core-count dedicated server dramatically reduces pipeline execution time.

Batch inference jobs – running predictions on large datasets overnight or periodically, rather than in real-time, is a common production pattern that does not require GPU acceleration and runs well on multi-core CPU dedicated hardware.

For these workloads, a well-specified dedicated server (particularly a dual-socket configuration with 20 or 40 physical cores) provides meaningful performance advantages over shared environments without requiring GPU hardware.
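Much of the benefit of a high core count comes from dividing work into one chunk per core. Libraries such as scikit-learn expose an `n_jobs` parameter for this. The standard-library sketch below shows the underlying idea for a batch job: splitting a row range into near-equal contiguous chunks, one per worker. The function name and the row count are illustrative.

```python
import os


def split_rows(n_rows: int, n_workers: int) -> list[range]:
    """Split row indices 0..n_rows-1 into n_workers near-equal contiguous ranges."""
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    base, extra = divmod(n_rows, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional row each.
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


workers = os.cpu_count() or 1  # logical CPUs visible to the operating system
chunks = split_rows(1_000_000, workers)
print(f"{workers} workers, about {len(chunks[0])} rows each")
```

Each range can then be handed to a separate worker process, so that a 40-core server handles 40 chunks at once.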

📖 How do you choose the right CPU for compute-intensive workloads?

ML training and inference place specific demands on processor architecture. Read How to Choose the Right CPU for Your Dedicated Server, covering clock speed vs core count trade-offs and how to match processor specification to workload type.

Data Privacy and Compliance for AI Workloads

Machine learning models trained on personal data carry specific data protection obligations. For European businesses subject to GDPR, personal data of EU residents used for training can only be transferred outside the EEA under specific safeguards, so keeping the training process, and the data itself, within the EEA is the most straightforward way to comply.

Cloud providers, however, complicate this. Data processed on cloud infrastructure may transit or reside in data centres outside the EEA depending on region configuration, replication policies, and provider terms. Furthermore, demonstrating data residency on cloud infrastructure requires careful configuration and ongoing audit.

A dedicated server in a European data centre, by contrast, provides unambiguous data residency. The training data stays on hardware physically located in the Netherlands: no cross-border transit, no replication to non-EEA regions, no configuration complexity to maintain EEA compliance.

For AI applications in regulated sectors (healthcare diagnostic models trained on patient data, financial risk models trained on transaction records, HR systems trained on employee data), this compliance certainty is not optional. It is a requirement that dedicated European infrastructure satisfies by design.

Recommended Swify Configurations for ML Workloads

Different machine learning use cases map to different server specifications. The following guidance covers the most common patterns.

Data science and experimentation environments

For individual data scientists or small teams running experiments, training smaller models, and processing moderate-sized datasets, a single-socket configuration provides ample compute without the cost of dual-socket hardware.

Recommended: Dedicated 5 (Intel Xeon Gold 6138, 128GB RAM, 2TB SSD) at €175/month. It offers 20 cores for parallel experimentation, 128GB for loading large datasets and models, and NVMe-class SSD storage for fast data pipeline execution.

Production inference serving

For serving trained models in production (a prediction API handling concurrent requests from a live application), the priority is low-latency responses under concurrent load.

Recommended: Dedicated 2 (Intel Xeon Gold 5215, 128GB RAM, 2x 1TB SSD) at €150/month for moderate concurrency, or Dedicated 5 (Xeon Gold 6138, 128GB RAM) at €175/month for higher concurrency requirements. RAID 1 configuration on dual-drive plans provides storage redundancy for production deployments.

Large-scale training and batch processing

For training larger models, running hyperparameter optimisation across many parallel experiments, or executing large batch inference jobs, dual-socket configurations provide the core count and memory bandwidth that these workloads benefit from.

Recommended: Dedicated 7 (2x Intel Xeon Gold 6138, 128GB RAM, 2TB SSD) at €260/month for maximum training parallelism across 40 physical cores, or Dedicated 8 (2x Xeon Gold 6138, 256GB RAM, 2x 2TB SSD) at €320/month for workloads requiring both maximum compute and large memory.
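The recommendations in this section can be summarised as a small lookup: given a minimum core count and RAM requirement, pick the cheapest listed plan that satisfies both. The sketch below uses the plan names, RAM sizes, and prices quoted in this section. The 10-core figure for Dedicated 2 is an assumption taken from Intel's published specification for the Xeon Gold 5215 and is not stated in this guide. The function name is illustrative.

```python
# (name, physical cores, RAM in GB, price in EUR per month)
PLANS = [
    ("Dedicated 2", 10, 128, 150),  # core count assumed from Intel's Xeon Gold 5215 spec
    ("Dedicated 5", 20, 128, 175),
    ("Dedicated 7", 40, 128, 260),
    ("Dedicated 8", 40, 256, 320),
]


def cheapest_plan(min_cores: int, min_ram_gb: int):
    """Return the name of the cheapest plan meeting both minimums, or None."""
    candidates = [
        (price, name)
        for name, cores, ram_gb, price in PLANS
        if cores >= min_cores and ram_gb >= min_ram_gb
    ]
    return min(candidates)[1] if candidates else None


print(cheapest_plan(16, 64))   # Dedicated 5
print(cheapest_plan(32, 200))  # Dedicated 8
print(cheapest_plan(64, 128))  # None (no listed plan has 64 cores)
```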

Bare-metal infrastructure for your AI and ML workloads

Swify dedicated servers provide exclusive CPU, high-capacity RAM, and fast SSD storage in European data centres. This is the infrastructure foundation that machine learning workloads require, without shared environment limitations or cloud egress costs.

→ Explore Swify Dedicated Servers

Frequently Asked Questions

Can dedicated servers run machine learning workloads without a GPU?

Yes, for a broad range of practical ML workloads. GPU acceleration provides the largest benefit for deep learning training, particularly large neural network architectures trained on image or text data, where the parallel matrix operations map efficiently to GPU hardware.

However, gradient boosting models (XGBoost, LightGBM), classical machine learning algorithms, data preprocessing pipelines, and batch inference jobs all run efficiently on multi-core CPU hardware. Many production ML deployments rely entirely on CPU for inference, since GPU instances cost significantly more and CPU inference latency is acceptable for most use cases. A high-core-count dedicated server handles these workloads well. Read more about matching CPU specification to workload type in How to Choose the Right CPU for Your Dedicated Server.

How much RAM does a machine learning server need?

RAM requirements depend on the model size and dataset characteristics. For classical ML and gradient boosting, 64GB is sufficient for most workloads, enough to load large datasets into memory for fast processing without hitting swap. For deep learning inference, the model’s parameter count determines minimum RAM: a 7-billion-parameter model in 16-bit precision requires approximately 14GB for parameters alone, plus activation memory during inference.

For training workloads, larger RAM allows larger batch sizes, which improves training efficiency. For teams running multiple experiments simultaneously, each experiment occupies RAM independently: 128GB or 256GB configurations allow parallel experimentation without memory contention. The general principle is to size RAM so the active model and dataset fit entirely in memory, avoiding the severe performance penalty of swap usage. Read more in Understanding RAM Usage in Web Hosting Environments.

Why use a dedicated server for AI instead of cloud?

Cloud infrastructure is well-suited for short-duration, variable AI workloads, such as spinning up a large instance for a one-off training job and then terminating it. For sustained workloads (a production inference API serving predictions continuously, a data pipeline running daily, or a team running experiments regularly), dedicated infrastructure at a fixed monthly cost is significantly more economical.

Additionally, cloud egress costs (charges for data leaving the cloud provider’s network) add up significantly for AI workloads that move large datasets in and out. A dedicated server in a European data centre has no egress charges and provides full data residency certainty for GDPR-regulated training data. Read the full cost comparison in How Dedicated Servers Reduce Long-Term Infrastructure Costs.
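The point at which a fixed monthly price beats hourly billing can be estimated with a single division. In the sketch below the €175 monthly price is the Dedicated 5 figure from this guide, while the cloud hourly rate is a hypothetical placeholder, not a quote from any provider. Egress charges are left out.

```python
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12


def break_even_hours(dedicated_monthly_eur: float, cloud_hourly_eur: float) -> float:
    """Monthly usage above which the fixed-price server is cheaper."""
    if cloud_hourly_eur <= 0:
        raise ValueError("cloud_hourly_eur must be positive")
    return dedicated_monthly_eur / cloud_hourly_eur


# Hypothetical cloud rate of EUR 1.00 per hour for a comparable instance.
hours = break_even_hours(175, 1.00)
print(f"Break-even at {hours:.0f} hours, {hours / HOURS_PER_MONTH:.0%} of a month")
```

A workload that runs around the clock sits far above this threshold, while an occasional one-off training job sits below it.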

How does server location affect machine learning inference performance?

For real-time inference APIs serving user-facing applications, server location directly affects the latency users experience. Each prediction request travels from the application server or user device to the inference server and back, and physical distance imposes a minimum round-trip time that no hardware or software optimisation can reduce.

For European applications serving European users, hosting the inference server in a European data centre, such as one in the Netherlands, minimises this geographic latency component. A fraud detection model serving a European payment platform, or a recommendation engine for a European e-commerce site, benefits from a server location close to both the application layer and the end users. Read more in How Server Location Affects Website Speed.
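The minimum round-trip time imposed by distance can be estimated from the speed of light in optical fibre, which is roughly two-thirds of its speed in a vacuum. The sketch below assumes a straight fibre path and ignores routing and processing delays, so real latencies are higher. The distances are hypothetical examples.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBRE_FACTOR = 2 / 3  # light in optical fibre travels at roughly two-thirds of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a straight fibre path."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBRE_FACTOR)
    return 2 * one_way_seconds * 1000


# Hypothetical distances: within Europe versus across an ocean.
for km in (1_000, 6_000):
    print(f"{km} km: at least {min_round_trip_ms(km):.0f} ms round trip")
```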

What storage configuration is best for machine learning workloads?

Fast storage is critical for data-intensive ML workloads. Dataset loading at the start of each training epoch, checkpoint saving during training, and feature store reads during preprocessing all depend on storage throughput and latency. SSD storage, as provided on all Swify dedicated servers, delivers the random access performance and sequential throughput that these operations require.

For production inference servers where storage continuity matters, dual-drive configurations with RAID 1 provide fault tolerance: a single drive failure does not take the inference API offline. For training and experimentation environments where the primary concern is speed rather than redundancy, a single large SSD maximises available storage capacity. The 2TB and 4TB SSD options in Swify’s custom configuration accommodate large dataset storage without requiring external networked storage. Read more about storage architecture in How NVMe Storage Boosts Dedicated Server Performance.

Is a dedicated server suitable for running large language models?

It depends on the model size and the performance requirements. Smaller open-source language models (7B parameter models like Mistral or Llama variants) can run on CPU for inference with quantisation techniques that reduce memory requirements. On a dedicated server with 128GB or 256GB RAM and a high-core-count processor, these models can serve predictions at acceptable latency for many production use cases.

Larger models (70B parameters and above) typically require GPU acceleration for practical inference latency, which is outside the scope of CPU-only dedicated servers. For teams working with these larger architectures, CPU-only dedicated servers remain useful for data preprocessing, fine-tuning smaller models, running retrieval-augmented generation (RAG) pipelines, and serving embedding models that support vector search. These are all components of modern LLM applications that do not require GPU for the full inference path.
