De.KCD - Data Management Platform on the Cloud, step by step
Process and documentation on adapting a docker-based Data Management platform, Seek4Science, to Kubernetes, and using it in the cloud.
Goals and scope
This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while considering the benefits and costs: how much effort, how much complexity, which security risks, and what is gained. It does not replace a detailed explanation of each topic, but aims to save time by (1) helping you decide which solution to adopt, (2) giving you enough understanding of each solution to navigate its documentation efficiently, and (3) providing working, explained solutions that can be used as-is or as a base for your own.
It goes from a simple local setup to a full Kubernetes-based setup with distributed data, with side documentation on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging, and considerations for connecting applications in the cloud.
We try to give a clear view of the cost and benefit of each solution, so it is easy to get a rough idea of which one fits best. This is briefly summarized and emphasized in each section.
We will also compile advice and tips, and we welcome all contributions (and corrections).
Finally, it is targeted mostly at Research Projects and/or Institutions, and focuses on the aspects particular to these cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry.
How to use
Each section assumes that you know the preceding ones. If you already know the content of a section, feel free to jump to the next one.
When to…
When to go full cloud, or with a containerised solution, for your data management platforms. We provide a very simplified answer below and are working on a detailed decision tree.
Quick introduction to Linux/Unix
Online documentation and tutorials are enough for the basics. A book is strongly recommended for advanced topics.
Learning difficulty ranges from easy (the basics) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate, as you should have some knowledge of the security concerns.
- Check external introduction
- Check Galaxy/NFDI training for deeper knowledge
For most setups, Linux is the operating system of choice. Due to security concerns, a minimal knowledge of it is probably a must in all cases, from bare-metal setups to cloud-based installations, though the useful subset of knowledge will change.
We list below the commands you should know to survive, and their most relevant options (note that some of these commands now run on Windows in the Terminal shell), followed by the important folders, user and group access on Unix, private and public keys, and SSL/SSH. This part is more a checklist of things you should know for the following topics, and we recommend learning them before continuing. Wikipedia is a good starting point and many good tutorials are easy to find online. Books are needed only for a deep understanding, but are also strongly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternatives, but all Unix systems are very similar in their core usage and structure. The package managers differ, also between Linux distributions: Debian-based distributions, including Ubuntu, use dpkg and apt/apt-get, RedHat-based distributions use rpm and yum (now dnf), Alpine Linux uses apk, and there are also sandboxed or universal package managers like Flatpak and Snap, to list the main ones.
Commands that should be known
ls → list the content of a directory; ls -al → include all files (including hidden files, whose names start with .) and show a detailed list
cd → go to a directory; cd .. → parent directory; cd / → root directory; cd ~ → user home directory
ps → list running processes; ps -edf or ps aux → list running processes and their owner
top → show running processes and their memory/CPU usage (CTRL-C to exit)
cp / mv / mkdir / rm → copy, move/rename, create directories, remove files or directories
rm -r → recursive removal; rm -rf → dangerous, use with extreme caution: the force flag means it will delete everything without asking for confirmation.
more / less / cat → display file content; less is usually preferred (scrolling, searching)
tail, tail -f → show the end of a file, -f to follow it live (very useful for logs)
vi / vim / emacs / nano → most of the time we work on servers/containers with a text terminal, and being able to edit a file there is often needed. Vi, nano and Emacs are text-based editors present on most Linux distributions, though the lightest distributions used for container images might ship only the minimum (such as vi only, instead of the extended vim or emacs).
Even if vi might seem strange and hard to use at first glance, the basics are quick to learn and it is very convenient.
Emacs has a very different approach but is much more powerful, which can be useful for more complex work (if needed).
If you never used a terminal editor, search for a short “vi basics” or “nano basics” introduction.
man → show the manual for a command. Using man should be an automatic reflex for any less familiar usage. It is absolutely normal to forget how an option or a command works if it is not used very often; man is there for such cases. If you forget a command name, it is a good idea to keep a Linux cheat sheet around. A quick internet search will find many good one-page PDFs.
ln → create links (that point to a file); understand the difference between:
- Hard link: another directory entry for the same file (same inode); the data stays accessible as long as at least one hard link remains, and all hard links share the same permissions
- Symbolic link (symlink): a separate file that points to a path; it exists independently of the target, is left dangling if the target is removed or moved, and has its own permissions
For setting up an application, we generally use symbolic links, mostly because they can be created, replaced and managed independently of the target file.
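A minimal sketch to see the difference in practice (the file names are just placeholders):

```bash
echo "hello" > original.txt
ln original.txt hardlink.txt       # hard link: another name for the same inode
ln -s original.txt symlink.txt     # symbolic link: a separate entry pointing to the name

ls -li original.txt hardlink.txt symlink.txt   # -i shows the inode numbers

rm original.txt
cat hardlink.txt    # still prints "hello": the data survives as long as one hard link exists
cat symlink.txt     # fails: the symlink is now dangling
```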
chown / chmod → change the owner of a file or folder, change the permissions of a file or folder. Be familiar with:
- symbolic permissions: rwxrwxrwx, r-x------
- numeric modes: 755, 644
- examples: chmod o+r filename, chmod -R 755 directory, chown -R user:group directory
Ideally the pipe (|) should be well understood, as well as the I/O redirections (>, >>, <). If this is unclear, look for “Unix pipes and redirection”; it is a core concept.
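A few self-contained examples of pipes and redirections (the file names are placeholders):

```bash
# Pipe: the output of one command becomes the input of the next
ps -edf | grep nginx      # keep only the process lines mentioning nginx
ls -al /var/log | less    # browse a long listing page by page

# Redirection: send output to, or read input from, a file
ls -al > listing.txt      # > overwrites the file
ls -al >> listing.txt     # >> appends to it
wc -l < listing.txt       # < reads standard input from the file
```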
grep → search for the presence of a string in files, often combined with another command using a pipe (|)
find → search for files by name, type, size, or date; often combined with grep or xargs
mount / umount → attach or detach filesystems. Removable media must be mounted before access; for instance, a USB key must be mounted before use. Temporary mount points are often under /mnt or /media. Modern systems often automount removable media.
pwd / whoami →
- pwd → print current directory
- whoami → show current user
kill → kill does not only kill a process. It actually sends a signal, which most of the time is meant to terminate a process: most signals are delivered to the given process, but SIGKILL (kill -KILL processId) and SIGSTOP (kill -STOP processId) are enforced directly by the kernel, to forcefully kill or stop a process respectively. The default signal (i.e. no argument) for kill is SIGTERM (kill -TERM processId), which the process should handle to terminate properly. If the signal is not handled by the application, because it is not coded in or because the application is in a state where it cannot react (for instance an infinite loop), nothing will happen. In that case, only kill -KILL processId will kill the application.
SIGKILL termination is considered unsafe and should only be used on hung processes, after trying SIGTERM.
Signals have an integer value, some of them defined in the POSIX standard, such as kill -9 for SIGKILL, or kill -15 for SIGTERM.
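For illustration, a typical sequence (the process name and PID are placeholders):

```bash
ps -edf | grep myapp     # find the process ID, say 12345

kill 12345               # default SIGTERM; same as: kill -TERM 12345 or kill -15 12345

# if the process is still hanging after a while, let the kernel remove it
kill -KILL 12345         # same as: kill -9 12345
```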
eval → a powerful and dangerous command that executes its arguments as a shell command. It makes it possible to build a command as a string in a shell script and execute it with eval, enabling very advanced operations.
Knowing eval is important for security reasons: the presence of eval in a suspicious script calls for caution.
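A small illustration of how eval turns a string into a command (harmless here, but imagine the string coming from an untrusted source):

```bash
flags="-al"
cmd="ls $flags /var/log"
eval "$cmd"     # executes: ls -al /var/log
```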
xargs → build and execute a command from standard input
Very useful when a command produces a list of items (files, process IDs, etc.) that must be passed as arguments to another command. Typical usage combines find, grep, or pipes:
find . -name "*.log" | xargs rm
grep -l "ERROR" *.log | xargs less
This is needed because many commands do not accept input directly from stdin, but only as command-line arguments.
Be careful: xargs will execute the command on all received items.
Prefer using xargs -n 1 (one argument at a time) or test first with echo.
Unlike eval, xargs does not turn a string into a command (the command must be given explicitly and only receives its arguments from xargs). As such it is not dangerous in the way eval is.
If unfamiliar, search for “xargs explained” — understanding it greatly improves command-line efficiency.
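Putting the advice above into practice (the log file names are placeholders):

```bash
# Dry run: replace the real command with echo to see what would be executed
find . -name "*.log" | xargs echo rm

# Run the command once per item
find . -name "*.log" | xargs -n 1 rm

# Safer with file names containing spaces or newlines: null-separated items
find . -name "*.log" -print0 | xargs -0 rm
```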
curl / wget → interact with HTTP/HTTPS services from the command line. Useful for testing APIs and services, or for downloading files.
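Typical quick checks (the URLs are placeholders):

```bash
curl -I https://example.org                     # fetch only the HTTP headers
curl -s https://example.org/api/health          # fetch a response silently (e.g. a health endpoint)
wget https://example.org/releases/app.tar.gz    # download a file
```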
free / df / du → check system resources
- free → memory usage
- df → disk usage per filesystem
- du → disk usage per directory
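For all three, the -h option prints human-readable sizes, which makes the numbers much easier to read:

```bash
free -h              # memory and swap usage
df -h                # disk usage per mounted filesystem
du -sh /var/log/*    # size of each entry under /var/log
```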
command & / CTRL-Z + bg / fg → run a command in the background, suspend it, or bring it back to foreground
ssh / scp → connect to remote servers and copy files securely. Essential for working with remote machines. If unfamiliar, search for “SSH basics”.
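Basic usage (user, host, paths and key are placeholders):

```bash
ssh alice@server.example.org                                   # open a remote shell
ssh -i ~/.ssh/id_ed25519 -p 2222 alice@server.example.org      # specific key and non-standard port

scp backup.tar.gz alice@server.example.org:/srv/backups/       # copy a local file to the server
scp alice@server.example.org:/var/log/app.log ./app.log        # copy a remote file back
```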
systemctl / graceful reloads → systemctl is part of systemd, where the ‘d’ stands for ‘daemon’. Daemons are Unix’s background processes. What is important to know is that a web server, such as Apache or Nginx, will typically run as a daemon managed through systemctl. A graceful reload (systemctl reload, or apachectl graceful for Apache) asks for a clean reload of a service: finish the existing “transactions” before applying the new configuration. For a web server it means that currently open HTTP connections are served before the reload, minimizing the impact for the end user. Keep in mind that it knows nothing about an application (e.g. CMS) session served behind the web server (the web server is generally only a proxy to the web application), so such sessions might be forcefully closed anyway (depending on how the session is managed). Common commands:
systemctl status service
systemctl start|stop|restart service
systemctl reload service (graceful reload when supported)
apachectl configtest / nginx -t → test the current configuration of Apache or Nginx. Configuring a web server can quickly become complex, for instance when a reverse proxy is needed for several virtual hosts (e.g. several web applications, each served by its own Docker instance). These commands minimize the risk of deploying a broken configuration, but they still do not ensure that the configuration does what is expected.
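A common pattern is to chain the test with a graceful reload, so the reload only happens if the configuration is valid (service names depend on the distribution, e.g. apache2 vs httpd):

```bash
apachectl configtest && systemctl reload apache2
nginx -t && systemctl reload nginx
```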
rsync → efficient synchronization of files and directories. Often used for backups or deployments. Combined with database dumps, it can form a simple backup strategy; in that case it is important to monitor the process to ensure it keeps working.
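A minimal sketch of such a backup, assuming a MySQL database and made-up paths and host names; a real setup should also check exit codes and alert on failures:

```bash
# Dump the database, then mirror the application data and the dump to a backup host
mysqldump --single-transaction mydatabase > /srv/backups/mydatabase.sql
rsync -az --delete /srv/app-data/ /srv/backups/app-data/
rsync -az /srv/backups/ backupuser@backup.example.org:/backups/myapp/
```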
Folders you should know
- / → filesystem root
- /bin → essential system binaries
- /usr → user-space applications and libraries
- /usr/bin → most user commands
- /etc → system-wide configuration files (very important)
- /var → variable data
- /var/log → system and application logs (first place to check on errors)
- /home → user home directories
- /root → root user’s home directory
- /tmp → temporary files (often auto-cleaned)
- /mnt → temporary mount points
- /opt → optional or third-party software
Online resources
The Unix command line is well explained here. It is probably useful to know about the main principles of Unix, a good course is available here, and Wikipedia has a good overview of the Unix filesystem and its layout.
DeNBI Unix course -> TBD, adapt using a light Linux image (or public online VM)
What makes containers possible: cgroups
One more advanced element that is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to limit and account for the resource usage of groups of processes. Together with kernel namespaces, which isolate what a process can see (including a “sub” filesystem isolated from the host filesystem), they are what makes containers possible.
It is important to be aware of this to understand the difference between a container and a virtual machine, and to realise that the security risk is higher with a container: bypassing the isolation would allow direct access to the host filesystem. And it is not only a matter of exploits. As containers are part of the host, they can use folders outside of their isolated space (host volumes); if not properly set up, this could expose a sensitive part of the host to an intruder. Virtual machines, on the other hand, save their data within a virtual disk that is part of the virtual machine, and there is no access to the host system short of an actual exploit. So for a quick setup of an exposed application, a virtual machine might be a better solution.
A long-term setup should take proper care of security in all cases, and then there should be no benefit in using a virtual machine, aside from specific cybersecurity needs (like a honeypot).
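To make this concrete, here is how Docker exposes these mechanisms (the host path is a placeholder); the resource limits are enforced through cgroups, and host volumes should be mounted as narrowly as possible, read-only when you can:

```bash
# Resource limits (cgroups): cap memory and CPU for a container
docker run --rm -it --memory=256m --cpus=0.5 alpine sh

# Host volume: the container sees a real folder of the host filesystem
docker run --rm -it -v /srv/app-data:/data:ro alpine sh
```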
Bare-Metal setup and Virtual Machines
Vocabulary
“Bare-Metal” is a bit of a misnomer, as it originally means using a computer without an operating system, directly on the hardware (so using a low-level language such as assembly or C). But here a bare-metal setup means installing directly on a physical machine running an operating system, without any virtualization layer.
Note that an actual bare-metal setup could still be a viable option for specialised tasks: if it needs to be embedded (for a cheaper or more efficient solution on a specialised piece of hardware), or if it needs to be highly performant, though in the latter case it might be better to move the processing to the GPU (which is also a kind of bare-metal implementation). Choosing to do so is out of scope for this document, but it is important to do it knowingly, as it can have a high cost (i.e. difficulty of maintenance).
A Virtual Machine is a software solution that provides the functionality of a physical computer. It is possible (and generally needed) to install an operating system on it; most Virtual Machines provide an easy way to do so, and using one is akin to using an actual computer.
It can be an emulation, “imitating” a computer enough to run its software, generally used for specific cases such as running deprecated operating systems; QEMU is one of the main ones. An emulation provides full encapsulation, so is in theory more secure, and exploits are fixable by software updates.
Or a virtualization, where the hosted machine runs directly or semi-directly on the physical hardware, through a hypervisor that manages how the resources are used:
- Full virtualization: simulates enough of the computer hardware to run a guest OS. It needs significantly more resources than running directly on a physical computer, but provides full encapsulation, so in theory more security, and exploits are fixable by software updates.
- Hardware-assisted virtualization, where the hardware helps with the virtualization. Exploits using the hardware support might be difficult or impossible to fix, though they are exceptional.
- OS-level virtualization, where the operating system shares its resources with the guests. Docker and other container engines use this kind of virtualization. The reason containers are called containers and not virtual machines comes from the usage: instead of creating a virtual machine and setting it up with an operating system and additional software, they rely on fixed images that contain a minimal operating system (ideally) and layers of software to support the desired container. Once running, they behave much like a virtual machine.
A main protection against threats is being up to date: a recent or LTS (Long-Term Support, i.e. supported for a long time) operating system on the physical computer, Virtual Machine software, and virtualization software (which might have direct support in the CPU).
Containers - Overview of Docker usage
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy, assuming you have some knowledge of Linux.
A clear presentation of Docker is available here and the official documentation is good. The section you should consult is “Open Source”, i.e. about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which also offers a single-node Kubernetes and is the simplest way to use and test Docker and Kubernetes on Windows.
While Docker is the de facto standard, there are other container engines, such as Podman, containerd, or CRI-O, and a standard, the Open Container Initiative (OCI), that Podman, containerd and CRI-O follow and Docker almost follows (new images should be OCI compliant, old ones might not be), so Docker images might need some adaptation to run with Podman and others. Docker offers an OCI exporter for the images it builds.
Vocabulary
- An image is a template used to create containers. It contains:
- an operating system base (often minimal Linux),
- the application,
- its dependencies,
- default configuration.
A container is a running instance of an image. It:
- runs processes,
- listens on ports,
- can be started, stopped, or restarted.
Note that the desired application needs to be started (entrypoint), or only the operating system will run.
Containers are ephemeral:
- if a container is deleted, everything inside it is lost
- unless data is stored in a volume
A volume is a persistent storage area outside the container. It is used to store all data that must survive container restarts or upgrades.
Volumes are:
- mounted inside containers
- independent of the container lifecycle
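A minimal sketch with a named volume (container name, volume name and password are placeholders):

```bash
docker volume create myapp-db-data

docker run -d --name myapp-db \
  -e MYSQL_ROOT_PASSWORD=changeme \
  -v myapp-db-data:/var/lib/mysql \
  mysql:8

# The container can be removed and recreated: the data stays in the volume
docker rm -f myapp-db
docker volume ls
```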
Docker Compose is a tool to define and run multiple containers together using a single configuration file (docker-compose.yml). It is used to describe:
- which images to run,
- how containers are connected,
- ports,
- volumes,
- environment variables.
With Docker Compose, you typically manage an entire application stack:
- web server
- application
- database
- cache
All started with:
docker-compose up
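For illustration, a minimal sketch of what such a stack can look like; the service names, images, ports and passwords are placeholders, and a real platform such as Seek4Science will need a more elaborate file:

```bash
cat > docker-compose.yml <<'EOF'
services:
  app:
    image: myorg/myapp:latest
    ports:
      - "8080:80"
    environment:
      DB_HOST: db
    depends_on:
      - db
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: changeme
    volumes:
      - db-data:/var/lib/mysql
volumes:
  db-data:
EOF

docker compose up -d    # or "docker-compose up -d" with the older standalone binary
```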
Minimal mental model
- Image → what to run
- Container → the running application
- Volume → persistent data
- Docker Compose → a complete setup of multiple images and volumes.
If you keep this model in mind, most Docker-related documentation will start to make sense.
Useful things to know
A Docker image will be started (once used in a container) using an ENTRYPOINT and/or a CMD. These can be listed using docker inspect <image id>, or individually using docker inspect -f '{{.Config.Entrypoint}}' <image id> and docker inspect -f '{{.Config.Cmd}}' <image id>. This is useful when something goes wrong inside a container, to know how the container is supposed to start. To learn more about CMD and ENTRYPOINT, the official Docker documentation offers a great overview.
Changing something that is part of the image is possible inside a running container, but it will never be persistent and is generally a bad idea. One acceptable case is when debugging an application, provided the fix is applied to the code base (and thus to the image) as soon as the issue is found.
How to build a Dockerfile
The official online documentation is good and complete. Online tutorials or books are recommended for an easier approach.
Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.
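As an illustrative sketch only (base image, packages, paths and the start command are placeholders, not the actual Seek4Science setup), a Dockerfile describes the layers of an image, which is then built and run:

```bash
cat > Dockerfile <<'EOF'
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY app/ /opt/app/
WORKDIR /opt/app
EXPOSE 8080
CMD ["python3", "server.py"]
EOF

docker build -t myorg/myapp:latest .
docker run --rm -p 8080:8080 myorg/myapp:latest
```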
Moving to Docker compose
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty is more about the different components of the setup (for instance an image with a database, linked with an Authentication system).
Advantages
Things to take into account
Adding parameters
The different types of volumes
Namespaces and namespace collisions
By default, Docker Compose namespaces a setup using the name of the folder containing the docker-compose file. Using the same folder name, even under different parent folders, results in the same namespace. So running two different docker-compose setups from folders with the same name will use the same Docker network, and if the applications use the same images, one container may end up communicating with a container from the other docker-compose setup. For instance, you want two instances of the same application, testApp, that uses a MySQL database. The database is defined in the docker-compose file and used by testApp. The setup is the following:
application1/testApp/docker-compose.yml
application2/testApp/docker-compose.yml
This looks like a clean setup, with a distinct parent folder above each docker-compose folder (which makes a lot of sense if you want to store some specific files outside of the docker-compose folder). But as the direct parent folder of each docker-compose file has the same name, only one MySQL instance will be used by both (or there will be a binding error and one application will fail, which is probably the best outcome). A fix is shown below.
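One way to avoid the collision, without renaming folders, is to give each instance an explicit Compose project name (testapp1/testapp2 are arbitrary names here):

```bash
(cd application1/testApp && docker compose -p testapp1 up -d)
(cd application2/testApp && docker compose -p testapp2 up -d)

# the project name can also come from an environment variable (e.g. in an .env file)
COMPOSE_PROJECT_NAME=testapp1 docker compose up -d
```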
From Docker compose to Kubernetes
Kubernetes is composed of many elements and has its own vocabulary. Each element and the way they work together are not complex, but grasping a minimal working set of Kubernetes will take some effort. We recommend starting with an online tutorial, interactive or not (links on a separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running.
Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than setting up a production cluster. As software developers, we recommend setting up a cluster only as a testbed and relying on sysadmins for a production cluster.
Kubernetes is not more complex than Docker Compose, but it is much more than Docker Compose, with many elements. It is an orchestration engine: where Docker Compose can only ask for containers to be restarted when they stop, on the same machine, Kubernetes can choose on which machine the containers will run, duplicate them, kill them if they seem unhealthy and create healthy ones, create services out of these containers so they can be used by other clients (which could be other containers within Kubernetes), and provide network access to these containers without the clients knowing where they run.
For the simplest applications it can be complete overkill, but the more complex your application is (in terms of size and elements), the more there is to gain from Kubernetes. If you need an application that runs 24/7, with transparent updates, and that can scale from hundreds of clients to several thousands, Kubernetes will make things much easier, and in a well-thought-out way.
But these possibilities come with some difficulty in grasping it, especially if you are not a full sysadmin, and this documentation is made by and aimed at non-sysadmins.
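To give a feel for this before the dedicated section, a few kubectl commands against an existing test cluster; the image name is a placeholder and this is not a production setup:

```bash
kubectl create deployment testapp --image=myorg/myapp:latest --replicas=3   # run 3 copies
kubectl expose deployment testapp --port=80 --target-port=8080              # make them reachable as a service
kubectl scale deployment testapp --replicas=10                              # scale without downtime
kubectl get pods -w                                                         # watch Kubernetes replace unhealthy pods
```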
Quick overview
-> Kubernetes section
Other solutions
Proxmox, HashiCorp Nomad, Docker Swarm (new), Apache solutions, Amazon ECS, Serverless Container Platforms, Humanitec, Porter,
Scalability
Databases
Security
Going to an assembly
Automate the setup
Ansible & Terraform -> short introduction, links to documentation/
CI/CD/GitOps (Flux CD & others) -> also short introduction and links. Maybe a short tutorial for GitHub
This is a Quarto website.
To learn more about Quarto websites visit https://quarto.org/docs/websites.