\documentclass{article}

\usepackage[style=authoryear]{biblatex}
\usepackage{graphicx}

\addbibresource{EOM_Software_Infrastructure_v0.1.0.bib}

\title{EOM Software Infrastructure \\ \large v0.1.0}
\author{Eric O'Neill Meehan}
\date{\today}
\begin{document}

\maketitle

\begin{abstract}
This paper recounts the development of \textit{eom-software-infrastructure} v0.1.0 and
\textit{ansible-role-eom-services} v0.1.0: an Ansible playbook and role used to deploy the network infrastructure and
application suite for \textit{eom.dev}. It describes the hardware and software stacks used in the EOM network as well as the
challenges faced in their deployment. Possible improvements are discussed throughout, and a more concrete roadmap for future
iterations is laid out in the conclusion. The purpose of this paper is to provide context to these release versions by
documenting design decisions, highlighting problematic areas, and enumerating potential improvements for future releases.
\end{abstract}
\section{Introduction}

Release v0.1.0 of \textit{eom-software-infrastructure} and \textit{ansible-role-eom-services} constitutes the minimum-viable
software and hardware infrastructure for \textit{eom.dev}. The former repository contains an Ansible playbook used to deploy a
bare-metal Kubernetes cluster; the latter contains an Ansible role that deploys a suite of web services to that cluster. This
paper documents the design, development, and deployment of \textit{eom.dev}. Difficulties encountered are discussed, and
desired features are detailed. The purpose of this paper is to provide context for this release version, as well as a
potential roadmap for future iterations of the network.
\section{Hardware}

Three machines from Dell Technologies were used in the making of this network: a PowerEdge R350, a PowerEdge T640, and a
Latitude 7230, fulfilling the roles of Kubernetes control-plane, worker, and administrator node respectively. The R350 and
T640 were networked over LAN to a router, while the 7230 connected to the network internally via WiFi and externally via
AT\&T mobile broadband. The following sections describe the hardware specifications of each machine.
\subsection{Dell PowerEdge R350}

The PowerEdge R350 has (specs)

\subsection{Dell PowerEdge T640}

The PowerEdge T640 is a more powerful machine, with (specs)
\subsubsection{RAID 10 Array}

A 32 TiB RAID 10 array consisting of four 16 TiB hard drives serves as the primary data store for the network. The T640's
PERC H350 controller provides hardware RAID and presents the array to the operating system as a single drive, which was
encrypted, partitioned with LVM, and mounted to the filesystem at \textit{/data/store-0}.
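The storage preparation can be expressed as Ansible tasks along the following lines; this is a minimal sketch in which the
device path, volume names, and the vaulted passphrase variable are illustrative assumptions rather than the exact tasks used
in this release:

\begin{verbatim}
# Illustrative sketch only; device and volume names are assumptions.
- name: Open the encrypted RAID volume
  community.crypto.luks_device:
    device: /dev/sdb
    state: opened
    name: store-0
    passphrase: "{{ vault_store0_passphrase }}"

- name: Create a volume group on the opened device
  community.general.lvg:
    vg: store0
    pvs: /dev/mapper/store-0

- name: Create a logical volume spanning the group
  community.general.lvol:
    vg: store0
    lv: data
    size: 100%FREE

- name: Format the logical volume
  community.general.filesystem:
    fstype: ext4
    dev: /dev/store0/data

- name: Mount the filesystem at /data/store-0
  ansible.posix.mount:
    path: /data/store-0
    src: /dev/store0/data
    fstype: ext4
    state: mounted
\end{verbatim}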
\subsubsection{Nvidia RTX A6000}

For data-science and AI workloads, the T640 was equipped with an RTX A6000 GPU from Nvidia. (specs). For use in a Kubernetes
cluster, installation of GPU drivers and device plugins is required. As will be discussed later, difficulties with Nvidia
drivers prevented the use of the device in this release version.
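Once the driver and the Nvidia device plugin are working, workloads will request the GPU through the extended
\texttt{nvidia.com/gpu} resource. A minimal sketch of such a pod follows; the image tag and names are illustrative
assumptions:

\begin{verbatim}
# Minimal GPU smoke-test pod; image and names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
\end{verbatim}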
\subsection{Dell Latitude 7230}

The Latitude 7230 Rugged Extreme is an IP-65 rated tablet computer designed by Dell \cite{dell_latitude}. The model used in
this project was configured with a 12th Gen Intel\textregistered{} Core\texttrademark{} i7-1260U, 32 GiB of memory, 1 TiB of
storage, and 4G LTE mobile broadband. With the detachable keyboard and active stylus accessories, this device is a capable
software development workstation and cluster administration node that is reliable in terms of performance, connectivity, and
durability.
\section{EOM Software Infrastructure}

Configuration of the above hardware was automated through the \textit{eom-software-infrastructure} repository. The following
subsections describe the major components of that automation.

\subsection{Ansible}
Ansible provides the primary automation framework for \textit{eom-software-infrastructure}. With the exception of installing
Ansible itself and configuring SSH keys, the deployment of \textit{eom.dev} is automated completely through playbooks and
roles. Third-party roles from the community-driven Ansible Galaxy were used to quickly deploy a significant portion of the
infrastructure, including Containerd, Kubernetes, and Docker, while custom roles were created to configure the more bespoke
elements of the network; among these, \textit{ansible-role-debian} and \textit{ansible-role-ericomeehan} were added to the
repository as submodules to deploy a base Debian configuration and user environment to each node. Idempotent playbooks
further allowed for incremental feature development with the ability to easily restore previous versions of the system.
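As an illustration, the playbook structure resembles the following sketch; the host group names and exact role list are
assumptions, not the repository's actual contents:

\begin{verbatim}
# site.yml -- illustrative sketch, not the repository's actual playbook.
- hosts: control_plane:workers
  become: true
  roles:
    - ansible-role-debian       # base OS and user environment
    - geerlingguy.containerd    # container runtime
    - geerlingguy.kubernetes    # cluster bootstrap

- hosts: admin
  become: true
  roles:
    - ansible-role-debian
    - geerlingguy.docker        # local application development
\end{verbatim}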
\subsection{Debian}

The installation of Debian 12 on each node was automated using preseed files. After an initial manual installation,
\textit{ansible-role-debian} dumps the debconf database of each node to a file, which was then sanitized and copied into the
\textit{eom-software-infrastructure} repository to drive subsequent installations. The role additionally performs a
preliminary configuration of the base installation, which includes enabling the firewall, configuring open ports, and setting
the MOTD.
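The dump step can be sketched as follows, assuming the \textit{debconf-utils} package is present; the task names and
destination paths are illustrative, and the actual tasks live in \textit{ansible-role-debian}:

\begin{verbatim}
# Illustrative tasks only; the real ones live in ansible-role-debian.
- name: Dump the debconf database for preseeding
  ansible.builtin.command: debconf-get-selections --installer
  register: preseed_dump
  changed_when: false

- name: Save the dump locally for sanitization
  ansible.builtin.copy:
    content: "{{ preseed_dump.stdout }}"
    dest: "preseeds/{{ inventory_hostname }}.cfg"
  delegate_to: localhost
\end{verbatim}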
\subsection{Nvidia}

Nvidia drivers were installed using the official Ansible role; however, significant difficulties were encountered in the
process. For the Bookworm release, Debian provides Nvidia driver version 535.183.01 and the Tesla 470 driver, whereas Nvidia
recommends the Data Center Driver version 550.95.07 for the RTX A6000 \cite{debian_nvidia} \cite{nvidia_search}. Adding to
the confusion, the official Ansible role that Nvidia offers for driver installation is incompatible with Debian
\cite{github_nvidia_driver}. An issue on that repository describes a workaround involving the modification of variables,
which was used as the basis for a fork of, and pull request to, the repository \cite{github_nvidia_issue}
\cite{eom_nvidia_driver} \cite{github_nvidia_pr}. Across attempts to install each driver version by multiple methods, the
same error persisted: \texttt{RmInitAdapter failed!} Given that the same GPU had functioned properly in the same server
running Arch Linux, this was assumed not to be a hardware malfunction. For the sake of progress, solving this issue was
deferred to a future iteration of the project; however, the GPU will be essential for data science and AI tasks and
represents a significant financial investment for the network, so resolving this issue will be a top priority.
\subsection{Containerd and Kubernetes}

As could be inferred from the roles defined for each PowerEdge server, Kubernetes was installed on bare metal. Alternatively,
the entire cluster could have been hosted on one server, using virtual machines to fulfill the roles of control plane and
workers. Doing so offers several advantages and would potentially mitigate issues encountered when attempting to deploy
services to the cluster; however, creating and running virtual machines adds overhead to both the project and the software
stack. For these reasons, this release version foregoes the use of virtual machines, though this may change in future
iterations. With this decision made, the installation of Containerd and Kubernetes was managed by the Ansible roles written
by Jeff Geerling \cite{geerling_containerd} \cite{geerling_kubernetes}. Docker was additionally installed on the Latitude
7230 for application development \cite{geerling_docker}.
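For reference, the Geerling Kubernetes role selects each host's function through a role variable. A host-variables sketch
might look like the following; the hostnames are placeholders, and the variable name is taken from the role's documentation
at the time of writing and may differ between versions:

\begin{verbatim}
# host_vars/r350.yml -- illustrative
kubernetes_role: control_plane

# host_vars/t640.yml -- illustrative
kubernetes_role: node
\end{verbatim}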
\section{EOM Services}

A suite of web services was deployed to the cluster described above using \textit{ansible-role-eom-services}, which was added
to the \textit{eom-software-infrastructure} repository as a submodule. Though v0.1.0 of both repositories is being released in
tandem, synchronicity between release versions is not intended to be maintained in the future. The role defines five services:
OpenLDAP and four instances of Apache HTTP Server, each configured to support a different function, namely Server Side
Includes (SSI), WebDAV, git-http-backend with Gitweb, and a reverse proxy. Together these form the content publication
platform: software repositories are distributed through \textit{git.eom.dev}, HTML content is published at
\textit{www.eom.dev}, and supplemental media is served at \textit{media.eom.dev}, with authentication policies defined in
Apache httpd and backed by OpenLDAP controlling access. The following sections describe each service in turn.
\subsection{OpenLDAP}

OpenLDAP provides the basis for single sign-on to the network. Here, the \textit{osixia/openldap} container was used; however,
the \textit{bitnami/openldap} container may be used in the future, as it offers support for extra schemas through environment
variables and is maintained by a verified publisher \cite{dockerhub_bitnami_openldap}. A single administrative user account
was applied to the database, to be authenticated from the Apache servers over HTTP basic access authentication, but no other
deviations from the default deployment were made. Although the container used here supports mapping additional LDIF
configurations to be applied when the database is first created, which would allow the applied configuration to be managed by
Ansible, \textit{slappasswd} salts passwords prior to hashing, so the resulting hash is non-deterministic and is difficult to
manage idempotently as an Ansible secret. To accommodate this, the LDIF file was applied manually and the LDAP database was
stored in a Kubernetes \textit{PersistentVolume}.
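A minimal sketch of such a deployment follows, assuming the \textit{osixia/openldap} image's documented environment variables;
the resource names, domain, and secret are illustrative assumptions:

\begin{verbatim}
# Illustrative sketch; names, domain, and secret are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openldap
spec:
  replicas: 1
  selector:
    matchLabels:
      app: openldap
  template:
    metadata:
      labels:
        app: openldap
    spec:
      containers:
        - name: openldap
          image: osixia/openldap:1.5.0
          env:
            - name: LDAP_DOMAIN
              value: eom.dev
            - name: LDAP_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: openldap-admin
                  key: password
          volumeMounts:
            - name: ldap-data
              mountPath: /var/lib/ldap
      volumes:
        - name: ldap-data
          persistentVolumeClaim:
            claimName: openldap-data
\end{verbatim}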
\subsection{Proxy}

TLS certificates in Kubernetes are typically requested from and provided by dedicated resources provisioned within the
cluster \cite{k8s_tls}. In a production environment, one would typically use a third-party service such as cert-manager to
request certificates from a trusted authority \cite{cert_manager_docs}. While the use of Helm charts makes the deployment of
cert-manager trivial, and its use offers benefits such as automated certificate rotation, it was found to be incompatible
with this iteration of the network. In order to request certificates from a trusted authority such as Let's Encrypt using
cert-manager, an \textit{issuer} of type ACME must be provisioned to solve HTTP01 or DNS01 challenges
\cite{cert_manager_issuer}. Given that the domain name registrar for \textit{eom.dev} does not offer API endpoints for
updating DNS records, an HTTP01 ACME issuer would need to be used. Such an issuer relies on Kubernetes \textit{ingress} rules
for proper routing of HTTP traffic, which in turn requires a production-ready \textit{ingress controller} such as
\textit{ingress-nginx} \cite{k8s_ingress_controllers}. The ingress-nginx controller presumes that the cluster runs in a cloud
environment; on bare metal it must either be run as a \textit{NodePort} service or be paired with \textit{MetalLB}
\cite{ingress_nginx_bare_metal}. The NodePort method was chosen here for simplicity. After adding a considerable number of
resources to the cluster, it was found that the HTTP01 self-check request would time out, leaving the Kubernetes
\textit{challenge}, \textit{request}, and \textit{certificate} resources stuck in a \textit{pending} state with the error
\texttt{Waiting for HTTP-01 challenge propagation: failed to perform self check GET request}. At this point, it was
determined that properly configuring issuers for the cluster was beyond the scope of this release version, so an alternative
solution was devised: an instance of Apache httpd was deployed as a NodePort service to function as a reverse proxy to HTTP
applications running as ClusterIP services. A generic TLS certificate was acquired manually using \textit{certbot} and
uploaded to the proxy server using an Ansible secret.
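The externally reachable piece of this arrangement is simply a NodePort service in front of the proxy pods; a minimal sketch,
with illustrative names and port numbers:

\begin{verbatim}
# Illustrative NodePort service for the reverse proxy.
apiVersion: v1
kind: Service
metadata:
  name: proxy
spec:
  type: NodePort
  selector:
    app: proxy
  ports:
    - name: https
      port: 443
      targetPort: 443
      nodePort: 30443   # must fall within the cluster's NodePort range
\end{verbatim}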
\subsection{Git}

There are many existing options for self-hosted git repositories, including GitLab, Gitea, and CGit. The git software package
itself, however, offers \textit{git-http-backend}, a simple CGI script that serves repositories over HTTP
\cite{git_http_backend}. With official documentation supporting the configuration of Gitweb as a front-end user interface, a
Docker container was created in order to deploy this stack on the network. While the generic container functioned as
expected, several issues were encountered when it was deployed to the cluster. For this environment, the httpd ScriptAlias
for git-http-backend was modified from \verb|"^/git/"| to simply \verb|"^/"| in order to stylize the URL
\textit{https://git.eom.dev/}; however, this caused httpd to pass requests for static resources (such as CSS and JavaScript
files) to the CGI script. In order to preserve the stylized URL, these static resources were uploaded to the media server
and Gitweb was reconfigured to reference them externally. It is worth noting that the configuration file for this
application resides directly in \textit{/etc}, which necessitates defining a sub-path for the volume mount in the Kubernetes
deployment in order to prevent overwriting the entire directory:
\begin{verbatim}
spec:
  containers:
    - name: gitweb
      image: ericomeehan/gitweb
      volumeMounts:
        - name: gitweb-config
          mountPath: /etc/gitweb.conf
          subPath: gitweb.conf
  volumes:
    - name: gitweb-config
      configMap:
        name: git-gitweb
\end{verbatim}
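The referenced ConfigMap can then carry the Gitweb settings themselves. A sketch follows, in which the project root and the
external stylesheet URL are illustrative assumptions rather than the deployed values:

\begin{verbatim}
# Illustrative ConfigMap; project root and stylesheet URL are assumptions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: git-gitweb
data:
  gitweb.conf: |
    $projectroot = "/var/lib/git";
    @stylesheets = ("https://media.eom.dev/gitweb/gitweb.css");
\end{verbatim}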
\subsection{Media}

Somewhat unusually, file sharing is achieved using WebDAV over Apache httpd. While HTTP is a cumbersome protocol for bulk file
transfers, it offers advantages that made it an appropriate choice for this phase of the project: configuration required only
slight modification of the files used to deploy the other services, the service is intended primarily for publishing files
over HTTP rather than for efficient uploading and downloading, and WebDAV is the foundation for the CalDAV calendars that will
be used to organize the next phase of this project. The remote filesystem was mounted locally on the Latitude 7230 using
\textit{davfs2}; however, performance was poor, so it may be necessary to provide this functionality by creating a
\textit{ReadWriteMany PersistentVolume} exposed both over WebDAV and over a more efficient protocol such as FTP or SMB.
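Such a shared volume would be claimed with the \texttt{ReadWriteMany} access mode so that multiple serving pods can mount it;
a sketch, with the storage class and size left as assumptions:

\begin{verbatim}
# Illustrative claim; storage class and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-data
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
\end{verbatim}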
\subsection{WWW}

The last service deployed was also the simplest: an HTML website hosted, again, on Apache httpd. The only modifications made
to the base configuration for this service were enabling server-side includes and applying the common authorization
configuration. Simplicity in form matches simplicity in function, as this service is intended to provide basic information
about the network and to link to external resources either by hyperlink or by SSI.
\subsubsection{Blog}

The blog page of www was designed specifically to organize articles stored on the media server. Taking inspiration from
arXiv's initiative to convert scholarly papers written in TeX and LaTeX to the more accessible HTML format, a similar
publication pipeline was employed here \cite{arxiv_html}. This article was composed in TeX and stored under source control at
\textit{https://git.eom.dev/software-infrastructure}. The same \textit{LaTeXML} software used by arXiv was used here to
generate XML, HTML, and PDF variants of the document, each of which was uploaded to
\textit{https://media.eom.dev/EOM\_Infrastructure\_v0.1.0/} \cite{github_latexml}. The HTML file was then embedded into the
blog page using SSI. Each of these steps was executed manually for this release version; however, an automated pipeline may
be used to compile, publish, and embed documents in the future.
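Such a pipeline could itself be expressed as Ansible tasks. The following is a rough sketch, assuming the LaTeXML
command-line tools are installed and using placeholder file names and credential variables:

\begin{verbatim}
# Illustrative pipeline; file names and credential variables are placeholders.
- name: Convert the TeX source to XML
  ansible.builtin.command: latexml --destination=paper.xml paper.tex

- name: Post-process the XML into HTML
  ansible.builtin.command: latexmlpost --destination=paper.html paper.xml

- name: Publish the HTML to the media server over WebDAV
  ansible.builtin.uri:
    url: https://media.eom.dev/EOM_Infrastructure_v0.1.0/paper.html
    method: PUT
    src: paper.html
    url_username: "{{ media_user }}"
    url_password: "{{ media_password }}"
    status_code: [200, 201, 204]
\end{verbatim}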
\section{Conclusion}

In this version of the EOM network infrastructure, the automated deployment of Kubernetes control-plane, worker, and
administrator nodes through Ansible has been achieved. Additionally, a suite of services providing file sharing, git
repository hosting, and content publishing has been developed for deployment on top of the aforementioned cluster. The result
is a durable and secure, though currently minimalistic, platform for software development and research publication. As the
release version implies, many features remain to be added both in the next minor version and in anticipation of the first
major version of the project. The next phase of this project will focus on fixing the Nvidia drivers and deploying services
on the network rather than modifying the cluster itself. Development will, as a result, take place more in the
\textit{ansible-role-nvidia-driver} and \textit{ansible-role-eom-services} repositories; a database and an API would be
particularly useful additions to the latter. Once the digital platform is more mature, consideration may turn towards
adapting the cluster to run on virtual machines, which would aid in addressing the problems with TLS issuers.
\printbibliography
\end{document}