De.KCD - Data Management Platform on the Cloud, step by step
Process and documentation on adapting a docker-based Data Management platform, Seek4Science, to Kubernetes, and using it in the cloud.
Goals and scope
This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while considering the benefits and costs: how much effort, how much complexity, which security risks, and what is gained. It does not replace a detailed explanation of each topic, but aims to save time by (1) helping you decide which solution to adopt, (2) giving you enough understanding of each solution to navigate its documentation efficiently, and (3) providing working, explained solutions that can be used as-is or as a base for your own.
It goes from a simple local setup to a full Kubernetes-based setup with distributed data, with side documentation on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging, and considerations for connecting applications in the cloud.
We try to give a clear view of the cost and benefit of each solution, so it is easy to get a rough idea of which one fits best. This is briefly summarized and emphasized in each section.
We will also compile advice and tips, and we welcome all contributions (and corrections).
Finally, it is targeted mostly at Research Projects and/or Institutions, and focuses on the aspects particular to these cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry.
How to use
Each section assumes that you know the preceding ones. If you already know the content of a section, feel free to jump to the next one.
When to…
When to go full cloud, or with a containerised solution, for your data management platforms. We provide a very simplified answer below and are working on a detailed decision tree.
Quick introduction to Linux/Unix
Online documentation and tutorials are enough for the basics. A book is strongly recommended for advanced topics.
Learning difficulty ranges from easy (the basics) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate, as you should have some knowledge of the security concerns.
- Check external introduction
- Check Galaxy/NFDI training for deeper knowledge
For most setups, Linux is the operating system of choice. Due to security concerns, a minimal knowledge of it is probably a must in all cases, from bare-metal setups to cloud-based installations, though the useful subset of knowledge will change.
We list below the commands you should know to survive, and their most relevant options (note that some of these commands now run on Windows in the Terminal shell), followed by the important folders, user and group access on Unix, private and public keys, and SSL/SSH. This part is more a checklist of things you should know for the following topics, and we recommend learning them before continuing. Wikipedia is a good starting point and many good tutorials are easy to find online. Books are needed only for a deep understanding, but are also strongly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternatives, but all Unix systems are very similar in their core usage and structure. The package managers differ, also between Linux distributions: Debian-based distributions, including Ubuntu, use dpkg and apt/apt-get, RedHat-based distributions use rpm and yum (now dnf), Alpine Linux uses apk, and there are also sandboxed or universal package managers like Flatpak and Snap, to list the main ones.
Commands that should be known
ls → list the content of a directory; ls -al → include all files (including hidden files, whose names start with .) and show a detailed list
cd → go to a directory; cd .. → parent directory; cd / → root directory; cd ~ → user home directory
ps → list running processes; ps -edf or ps aux → list running processes and their owner
top → show running processes and their memory/CPU usage (CTRL-C to exit)
cp / mv / mkdir / rm → copy, move/rename, create directories, remove files or directories
rm -r → recursive removal; rm -rf → dangerous, use with extreme caution: the force flag means it will delete everything without asking for confirmation.
more / less / cat → display file content; less is usually preferred (scrolling, searching)
tail, tail -f → show the end of a file, -f to follow it live (very useful for logs)
vi / vim / emacs / nano → most of the time we work on servers/containers with a text terminal, and being able to edit a file there is often needed. Vi, nano and Emacs are text-based editors present on most Linux distributions, though the lightest distributions used for container images might ship only the minimum (such as vi only, instead of the extended vim or emacs).
Even if vi might seem strange and hard to use at first glance, the basics are quick to learn and it is very convenient.
Emacs has a very different approach but is much more powerful, which can be useful for more complex work (if needed).
If you never used a terminal editor, search for a short “vi basics” or “nano basics” introduction.
man → show the manual for a command. Using man should be an automatic reflex for any less familiar usage. It is absolutely normal to forget how an option or a command works if it is not used very often; man is there for such cases. If you forget a command name, it is a good idea to keep a Linux cheat sheet around. A quick internet search will find many good one-page PDFs.
ln → create links (that point to a file); understand the difference between:
- Hard link: another directory entry for the same file (same inode); the data stays accessible as long as at least one hard link remains, and all hard links share the same permissions
- Symbolic link (symlink): a separate file that points to a path; it exists independently of the target, is left dangling if the target is removed or moved, and has its own permissions
For setting up an application, we generally use symbolic links, mostly because they can be created, replaced and managed independently of the target file.
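A minimal sketch to see the difference in practice (the file names are just placeholders):

```bash
echo "hello" > original.txt
ln original.txt hardlink.txt       # hard link: another name for the same inode
ln -s original.txt symlink.txt     # symbolic link: a separate entry pointing to the name

ls -li original.txt hardlink.txt symlink.txt   # -i shows the inode numbers

rm original.txt
cat hardlink.txt    # still prints "hello": the data survives as long as one hard link exists
cat symlink.txt     # fails: the symlink is now dangling
```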
chown / chmod → change the owner of a file or folder, change the permissions of a file or folder. Be familiar with:
- symbolic permissions: rwxrwxrwx, r-x------
- numeric modes: 755, 644
- examples: chmod o+r filename, chmod -R 755 directory, chown -R user:group directory
Ideally the pipe (|) should be well understood, as well as the I/O redirections (>, >>, <). If this is unclear, look for “Unix pipes and redirection”; it is a core concept.
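A few self-contained examples of pipes and redirections (the file names are placeholders):

```bash
# Pipe: the output of one command becomes the input of the next
ps -edf | grep nginx      # keep only the process lines mentioning nginx
ls -al /var/log | less    # browse a long listing page by page

# Redirection: send output to, or read input from, a file
ls -al > listing.txt      # > overwrites the file
ls -al >> listing.txt     # >> appends to it
wc -l < listing.txt       # < reads standard input from the file
```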
grep → search for the presence of a string in files, often combined with another command using a pipe (|)
find → search for files by name, type, size, or date; often combined with grep or xargs
mount / umount → attach or detach filesystems. Removable media must be mounted before access; for instance, a USB key must be mounted before use. Temporary mount points are often under /mnt or /media. Modern systems often automount removable media.
pwd / whoami →
- pwd → print current directory
- whoami → show current user
kill → kill does not only kill a process. It actually sends a signal, which most of the time is meant to terminate a process: most signals are delivered to the given process, but SIGKILL (kill -KILL processId) and SIGSTOP (kill -STOP processId) are enforced directly by the kernel, to forcefully kill or stop a process respectively. The default signal (i.e. no argument) for kill is SIGTERM (kill -TERM processId), which the process should handle to terminate properly. If the signal is not handled by the application, because it is not coded in or because the application is in a state where it cannot react (for instance an infinite loop), nothing will happen. In that case, only kill -KILL processId will kill the application.
SIGKILL termination is considered unsafe and should only be used on hung processes, after trying SIGTERM.
Signals have an integer value, some of them defined in the POSIX standard, such as kill -9 for SIGKILL, or kill -15 for SIGTERM.
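For illustration, a typical sequence (the process name and PID are placeholders):

```bash
ps -edf | grep myapp     # find the process ID, say 12345

kill 12345               # default SIGTERM; same as: kill -TERM 12345 or kill -15 12345

# if the process is still hanging after a while, let the kernel remove it
kill -KILL 12345         # same as: kill -9 12345
```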
eval → a powerful and dangerous command that executes its arguments as a shell command. It makes it possible to build a command as a string in a shell script and execute it with eval, enabling very advanced operations.
Knowing eval is important for security reasons: the presence of eval in a suspicious script calls for caution.
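A small illustration of how eval turns a string into a command (harmless here, but imagine the string coming from an untrusted source):

```bash
flags="-al"
cmd="ls $flags /var/log"
eval "$cmd"     # executes: ls -al /var/log
```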
xargs → build and execute a command from standard input
Very useful when a command produces a list of items (files, process IDs, etc.) that must be passed as arguments to another command. Typical usage combines find, grep, or pipes:
find . -name "*.log" | xargs rm
grep -l "ERROR" *.log | xargs less
This is needed because many commands do not accept input directly from stdin, but only as command-line arguments.
Be careful: xargs will execute the command on all received items.
Prefer using xargs -n 1 (one argument at a time) or test first with echo.
Unlike eval, xargs does not turn a string into a command (the command must be given explicitly and only receives its arguments from xargs). As such it is not dangerous in the way eval is.
If unfamiliar, search for “xargs explained” — understanding it greatly improves command-line efficiency.
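Putting the advice above into practice (the log file names are placeholders):

```bash
# Dry run: replace the real command with echo to see what would be executed
find . -name "*.log" | xargs echo rm

# Run the command once per item
find . -name "*.log" | xargs -n 1 rm

# Safer with file names containing spaces or newlines: null-separated items
find . -name "*.log" -print0 | xargs -0 rm
```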
curl / wget → interact with HTTP/HTTPS services from the command line. Useful for testing APIs and services, or for downloading files.
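Typical quick checks (the URLs are placeholders):

```bash
curl -I https://example.org                     # fetch only the HTTP headers
curl -s https://example.org/api/health          # fetch a response silently (e.g. a health endpoint)
wget https://example.org/releases/app.tar.gz    # download a file
```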
free / df / du → check system resources
- free → memory usage
- df → disk usage per filesystem
- du → disk usage per directory
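For all three, the -h option prints human-readable sizes, which makes the numbers much easier to read:

```bash
free -h              # memory and swap usage
df -h                # disk usage per mounted filesystem
du -sh /var/log/*    # size of each entry under /var/log
```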
command & / CTRL-Z + bg / fg → run a command in the background, suspend it, or bring it back to foreground
ssh / scp → connect to remote servers and copy files securely. Essential for working with remote machines. If unfamiliar, search for “SSH basics”.
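Basic usage (user, host, paths and key are placeholders):

```bash
ssh alice@server.example.org                                   # open a remote shell
ssh -i ~/.ssh/id_ed25519 -p 2222 alice@server.example.org      # specific key and non-standard port

scp backup.tar.gz alice@server.example.org:/srv/backups/       # copy a local file to the server
scp alice@server.example.org:/var/log/app.log ./app.log        # copy a remote file back
```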
systemctl / graceful reloads → systemctl is part of systemd, where the ‘d’ stands for ‘daemon’. Daemons are Unix’s background processes. What is important to know is that a web server, such as Apache or Nginx, will typically run as a daemon managed through systemctl. A graceful reload (systemctl reload, or apachectl graceful for Apache) asks for a clean reload of a service: finish the existing “transactions” before applying the new configuration. For a web server it means that currently open HTTP connections are served before the reload, minimizing the impact for the end user. Keep in mind that it knows nothing about an application (e.g. CMS) session served behind the web server (the web server is generally only a proxy to the web application), so such sessions might be forcefully closed anyway (depending on how the session is managed). Common commands:
systemctl status service
systemctl start|stop|restart service
systemctl reload service (graceful reload when supported)
apachectl configtest / nginx -t → test the current configuration of Apache or Nginx. Configuring a web server can quickly become complex, for instance when a reverse proxy is needed for several virtual hosts (e.g. several web applications, each served by its own Docker instance). These commands minimize the risk of deploying a broken configuration, but they still do not ensure that the configuration does what is expected.
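A common pattern is to chain the test with a graceful reload, so the reload only happens if the configuration is valid (service names depend on the distribution, e.g. apache2 vs httpd):

```bash
apachectl configtest && systemctl reload apache2
nginx -t && systemctl reload nginx
```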
rsync → efficient synchronization of files and directories. Often used for backups or deployments. Combined with database dumps, it can form a simple backup strategy; in that case it is important to monitor the process to ensure it keeps working.
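A minimal sketch of such a backup, assuming a MySQL database and made-up paths and host names; a real setup should also check exit codes and alert on failures:

```bash
# Dump the database, then mirror the application data and the dump to a backup host
mysqldump --single-transaction mydatabase > /srv/backups/mydatabase.sql
rsync -az --delete /srv/app-data/ /srv/backups/app-data/
rsync -az /srv/backups/ backupuser@backup.example.org:/backups/myapp/
```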
Folders you should know
- / → filesystem root
- /bin → essential system binaries
- /usr → user-space applications and libraries
- /usr/bin → most user commands
- /etc → system-wide configuration files (very important)
- /var → variable data
- /var/log → system and application logs (first place to check on errors)
- /home → user home directories
- /root → root user’s home directory
- /tmp → temporary files (often auto-cleaned)
- /mnt → temporary mount points
- /opt → optional or third-party software
Online resources
The Unix command line is well explained here. It is probably useful to know about the main principles of Unix, a good course is available here, and Wikipedia has a good overview of the Unix filesystem and its layout.
DeNBI Unix course -> TBD, adapt using a light Linux image (or public online VM)
What makes containers possible: cgroups
One more advanced element that is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to limit and account for the resource usage of groups of processes. Together with kernel namespaces, which isolate what a process can see (including a “sub” filesystem isolated from the host filesystem), they are what makes containers possible.
It is important to be aware of this to understand the difference between a container and a virtual machine, and to realise that the security risk is higher with a container: bypassing the isolation would allow direct access to the host filesystem. And it is not only a matter of exploits. As containers are part of the host, they can use folders outside of their isolated space (host volumes); if not properly set up, this could expose a sensitive part of the host to an intruder. Virtual machines, on the other hand, save their data within a virtual disk that is part of the virtual machine, and there is no access to the host system short of an actual exploit. So for a quick setup of an exposed application, a virtual machine might be a better solution.
A long-term setup should take proper care of security in all cases, and then there should be no benefit in using a virtual machine, aside from specific cybersecurity needs (like a honeypot).
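To make this concrete, here is how Docker exposes these mechanisms (the host path is a placeholder); the resource limits are enforced through cgroups, and host volumes should be mounted as narrowly as possible, read-only when you can:

```bash
# Resource limits (cgroups): cap memory and CPU for a container
docker run --rm -it --memory=256m --cpus=0.5 alpine sh

# Host volume: the container sees a real folder of the host filesystem
docker run --rm -it -v /srv/app-data:/data:ro alpine sh
```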
Bare-Metal setup and Virtual Machines
Vocabulary
“Bare-Metal” is a bit of a misnomer, as it originally means using a computer without an operating system, directly on the hardware (so using a low-level language such as assembly or C). But here a bare-metal setup means installing directly on a physical machine running an operating system, without any virtualization layer.
Note that an actual bare-metal setup could still be a viable option for specialised tasks: if it needs to be embedded (for a cheaper or more efficient solution on a specialised piece of hardware), or if it needs to be highly performant, though in the latter case it might be better to move the processing to the GPU (which is also a kind of bare-metal implementation). Choosing to do so is out of scope for this document, but it is important to do it knowingly, as it can have a high cost (i.e. difficulty of maintenance).
A Virtual Machine is a software solution that provides the functionality of a physical computer. It is possible (and generally needed) to install an operating system on it; most Virtual Machines provide an easy way to do so, and using one is akin to using an actual computer.
It can be an emulation, “imitating” a computer enough to run its software, generally used for specific cases such as running deprecated operating systems; QEMU is one of the main ones. An emulation provides full encapsulation, so is in theory more secure, and exploits are fixable by software updates.
Or a virtualization, where the hosted machine runs directly or semi-directly on the physical hardware, through a hypervisor that manages how the resources are used:
- Full virtualization: simulates enough of the computer hardware to run a guest OS. It needs significantly more resources than running directly on a physical computer, but provides full encapsulation, so in theory more security, and exploits are fixable by software updates.
- Hardware-assisted virtualization, where the hardware helps with the virtualization. Exploits using the hardware support might be difficult or impossible to fix, though they are exceptional.
- OS-level virtualization, where the operating system shares its resources with the guests. Docker and other container engines use this kind of virtualization. The reason containers are called containers and not virtual machines comes from the usage: instead of creating a virtual machine and setting it up with an operating system and additional software, they rely on fixed images that contain a minimal operating system (ideally) and layers of software to support the desired container. Once running, they behave much like a virtual machine.
A main protection against threats is being up to date: a recent or LTS (Long-Term Support, i.e. supported for a long time) operating system on the physical computer, Virtual Machine software, and virtualization software (which might have direct support in the CPU).
Containers - Overview of Docker usage
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy, assuming you have some knowledge of Linux.
A clear presentation of Docker is available here and the official documentation is good. The section you should consult is “Open Source”, i.e. about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which also offers a single-node Kubernetes and is the simplest way to use and test Docker and Kubernetes on Windows.
While Docker is the de facto standard, there are other container engines, such as Podman, containerd, or CRI-O, and a standard, the Open Container Initiative (OCI), that Podman, containerd and CRI-O follow and Docker almost follows (new images should be OCI compliant, old ones might not be), so Docker images might need some adaptation to run with Podman and others. Docker offers an OCI exporter for the images it builds.
Vocabulary
- An image is a template used to create containers. It contains:
- an operating system base (often minimal Linux),
- the application,
- its dependencies,
- default configuration.
A container is a running instance of an image. It:
- runs processes,
- listens on ports,
- can be started, stopped, or restarted.
Note that the desired application needs to be started (entrypoint), or only the operating system will run.
Containers are ephemeral:
- if a container is deleted, everything inside it is lost
- unless data is stored in a volume
A volume is a persistent storage area outside the container. It is used to store all data that must survive container restarts or upgrades.
Volumes are:
- mounted inside containers
- independent of the container lifecycle
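A minimal sketch with a named volume (container name, volume name and password are placeholders):

```bash
docker volume create myapp-db-data

docker run -d --name myapp-db \
  -e MYSQL_ROOT_PASSWORD=changeme \
  -v myapp-db-data:/var/lib/mysql \
  mysql:8

# The container can be removed and recreated: the data stays in the volume
docker rm -f myapp-db
docker volume ls
```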
Docker Compose is a tool to define and run multiple containers together using a single configuration file (docker-compose.yml). It is used to describe:
- which images to run,
- how containers are connected,
- ports,
- volumes,
- environment variables.
With Docker Compose, you typically manage an entire application stack:
- web server
- application
- database
- cache
All started with:
docker-compose up
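For illustration, a minimal sketch of what such a stack can look like; the service names, images, ports and passwords are placeholders, and a real platform such as Seek4Science will need a more elaborate file:

```bash
cat > docker-compose.yml <<'EOF'
services:
  app:
    image: myorg/myapp:latest
    ports:
      - "8080:80"
    environment:
      DB_HOST: db
    depends_on:
      - db
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: changeme
    volumes:
      - db-data:/var/lib/mysql
volumes:
  db-data:
EOF

docker compose up -d    # or "docker-compose up -d" with the older standalone binary
```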
Minimal mental model
- Image → what to run
- Container → the running application
- Volume → persistent data
- Docker Compose → a complete setup of multiple images and volumes.
If you keep this model in mind, most Docker-related documentation will start to make sense.
Useful things to know
A Docker image will be started (once used in a container) using an ENTRYPOINT and/or a CMD. These can be listed using docker inspect <image id>, or individually using docker inspect -f '{{.Config.Entrypoint}}' <image id> and docker inspect -f '{{.Config.Cmd}}' <image id>. This is useful when something goes wrong inside a container, to know how the container is supposed to start. To learn more about CMD and ENTRYPOINT, the official Docker documentation offers a great overview.
Changing something that is part of the image is possible inside a running container, but it will never be persistent and is generally a bad idea. One acceptable case is when debugging an application, provided the fix is applied to the code base (and thus to the image) as soon as the issue is found.
How to build a Dockerfile
The official online documentation is good and complete. Online tutorials or books are recommended for an easier approach.
Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.
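As an illustrative sketch only (base image, packages, paths and the start command are placeholders, not the actual Seek4Science setup), a Dockerfile describes the layers of an image, which is then built and run:

```bash
cat > Dockerfile <<'EOF'
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends python3 \
    && rm -rf /var/lib/apt/lists/*
COPY app/ /opt/app/
WORKDIR /opt/app
EXPOSE 8080
CMD ["python3", "server.py"]
EOF

docker build -t myorg/myapp:latest .
docker run --rm -p 8080:8080 myorg/myapp:latest
```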
Moving to Docker compose
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty is more about the different components of the setup (for instance an image with a database, linked with an Authentication system).
Advantages
Things to take into account
Adding parameters
The different types of volumes
Namespaces and namespace collisions
By default, Docker Compose namespaces a setup using the name of the folder containing the docker-compose file. Using the same folder name, even under different parent folders, results in the same namespace. So running two different docker-compose setups from folders with the same name will use the same Docker network, and if the applications use the same images, one container may end up communicating with a container from the other docker-compose setup. For instance, you want two instances of the same application, testApp, that uses a MySQL database. The database is defined in the docker-compose file and used by testApp. The setup is the following:
application1/testApp/docker-compose.yml
application2/testApp/docker-compose.yml
This looks like a clean setup, with a distinct parent folder above each docker-compose folder (which makes a lot of sense if you want to store some specific files outside of the docker-compose folder). But as the direct parent folder of each docker-compose file has the same name, only one MySQL instance will be used by both (or there will be a binding error and one application will fail, which is probably the best outcome). A fix is shown below.
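One way to avoid the collision, without renaming folders, is to give each instance an explicit Compose project name (testapp1/testapp2 are arbitrary names here):

```bash
(cd application1/testApp && docker compose -p testapp1 up -d)
(cd application2/testApp && docker compose -p testapp2 up -d)

# the project name can also come from an environment variable (e.g. in an .env file)
COMPOSE_PROJECT_NAME=testapp1 docker compose up -d
```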
From Docker compose to Kubernetes
Kubernetes is composed of many elements and has its own vocabulary. Each element and the way they work together are not complex, but grasping a minimal working set of Kubernetes will take some effort. We recommend starting with an online tutorial, interactive or not (links on a separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running.
Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than setting up a production cluster. As software developers, we recommend setting up a cluster only as a testbed and relying on sysadmins for a production cluster.
Kubernetes is not more complex than Docker Compose, but it is much more than Docker Compose, with many elements. It is an orchestration engine: where Docker Compose can only ask for containers to be restarted when they stop, on the same machine, Kubernetes can choose on which machine the containers will run, duplicate them, kill them if they seem unhealthy and create healthy ones, create services out of these containers so they can be used by other clients (which could be other containers within Kubernetes), and provide network access to these containers without the clients knowing where they run.
For the simplest applications it can be complete overkill, but the more complex your application is (in terms of size and elements), the more there is to gain from Kubernetes. If you need an application that runs 24/7, with transparent updates, and that can scale from hundreds of clients to several thousands, Kubernetes will make things much easier, and in a well-thought-out way.
But these possibilities come with some difficulty in grasping it, especially if you are not a full sysadmin, and this documentation is made by and aimed at non-sysadmins.
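To give a feel for this before the dedicated section, a few kubectl commands against an existing test cluster; the image name is a placeholder and this is not a production setup:

```bash
kubectl create deployment testapp --image=myorg/myapp:latest --replicas=3   # run 3 copies
kubectl expose deployment testapp --port=80 --target-port=8080              # make them reachable as a service
kubectl scale deployment testapp --replicas=10                              # scale without downtime
kubectl get pods -w                                                         # watch Kubernetes replace unhealthy pods
```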
Quick overview
-> Kubernetes section
Other solutions
Proxmox, HashiCorp Nomad, Docker Swarm (new), Apache solutions, Amazon ECS, Serverless Container Platforms, Humanitec, Porter,
Scalability
Databases
Security
Going to an assembly
Automate the setup
Ansible & Terraform -> short introduction, links to documentation/
CI/CD/GitOps (Flux CD & others) -> also short introduction and links. Maybe a short tutorial for GitHub
This is a Quarto website.
To learn more about Quarto websites visit https://quarto.org/docs/websites.