This commit is contained in:
Eric Meehan 2024-08-21 20:03:04 -04:00
parent ea336b352b
commit e7124040c9
7 changed files with 372 additions and 15 deletions

296
docs/draft_1.tex Normal file

@ -0,0 +1,296 @@
\documentclass{article}
\usepackage[style=authoryear]{biblatex}
\usepackage{graphicx}
\addbibresource{EOM_Software_Infrastructure_v0.1.0.bib}
\title{EOM Software Infrastructure \\ v0.1.0}
\author{Eric O'Neill Meehan}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
This paper recounts the development of \textit{eom-software-infrastructure} v0.1.0 and
\textit{ansible-role-eom-services} v0.1.0: an Ansible playbook and role used to deploy the network infrastructure and
application suite for \textit{eom.dev}. This paper describes the hardware and software stacks used in the EOM network
as well as the challenges faced in their deployment. Possible improvements are discussed throughout, and a more
concrete roadmap for future iterations is expounded upon in conclusion. The purpose of this paper is to provide
context to these release versions by documenting design decisions, highlighting problematic areas, and enumerating
potential improvements for future releases.
\end{abstract}
\section{Introduction}
This paper recounts the development of v0.1.0 of \textit{eom-software-infrastructure} and its components: a software
repository used to deploy the self-hosted content publication platform, \textit{eom.dev}. As the first minor release version,
this iteration constitutes the minimum-viable infrastructure for the network. The purpose of this paper is to provide context
to this release by documenting development decisions, highlighting problematic areas, and proposing a roadmap for future
releases; further, this paper will outline the initial content publication pipeline that has been deployed to the network.
\section{Hardware Infrastructure}
Three machines from Dell Technologies were used in the making of this network: a PowerEdge R350, a PowerEdge T640, and a
Latitude 7230. The hardware specifications for each can be found in figures 1, 2, and 3 respectively. As shown in figure 4,
the R350 and T640 were networked over LAN to a router, and the 7230 was connected internally via WiFi and externally via
AT\&T mobile broadband. The T640 was also equipped with a 32 TiB RAID 10 array and an Nvidia RTX A6000 GPU.
\section{Software Infrastructure}
Configuration of the above hardware was source-controlled in the \textit{eom-software-infrastructure} repository. Debian 12
was selected as the operating system for all three nodes and, after an initial manual installation using the live installer,
preseed files were generated for each machine by dumping the debconf database as a mechanism for automating subsequent
installations. These files were sanitized for sensitive data and then added to the aforementioned repository. Further
configuration was managed through Ansible playbooks and roles within the same repository. A bespoke Debian configuration and
user environment were deployed to each node using \textit{ansible-role-debian} and \textit{ansible-role-ericomeehan}, which were
added as submodules to \textit{eom-software-infrastructure}. Ansible roles written by Jeff Geerling and distributed by Ansible
Galaxy were also added to the repository to configure the three machines into a Kubernetes cluster. Containerd and Kubernetes
were installed on the R350 and T640 to configure a Kubernetes control plane and worker respectively \cite{geerling_containerd}
\cite{geerling_kubernetes}. Further, Docker was installed on the 7230 for application development \cite{geerling_docker}.
Drivers for the T640's Nvidia GPU were installed using the role created by Nvidia and also distributed through Ansible Galaxy;
however, as will be detailed in the following section, successful execution of the role failed to enable the GPU in this
release version.
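As an illustration of how these community roles fit together, a minimal playbook in the style of
\textit{eom-software-infrastructure} is sketched below; the host names follow the repository inventory, but the variable
values are assumptions rather than the exact configuration used in this release.
\begin{verbatim}
---
# Hypothetical sketch of a cluster playbook built on the Geerling roles;
# the kubernetes_role values are assumptions.
- name: Configure Kubernetes control plane
  hosts: poweredge-r350
  become: true
  vars:
    kubernetes_role: control_plane
  roles:
    - role: geerlingguy.containerd
    - role: geerlingguy.kubernetes

- name: Configure Kubernetes worker
  hosts: poweredge-t640
  become: true
  vars:
    kubernetes_role: node
  roles:
    - role: geerlingguy.containerd
    - role: geerlingguy.kubernetes
\end{verbatim}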
\subsection{Nvidia Drivers}
Significant difficulties were encountered while attempting to install drivers for the RTX A6000. For the Bookworm release,
Debian provides version 535.183.01 of the Nvidia driver as well as the Tesla 470 driver; however, Nvidia recommends the Data Center Driver
version 550.95.07 with the RTX A6000 \cite{debian_nvidia} \cite{nvidia_search}. Adding to the confusion, Nvidia offers an
official Ansible role for driver installation that is incompatible with the Debian operating system
\cite{github_nvidia_driver}. An issue exists on the repository offering a workaround involving the modification of variables,
which was used as the basis for a fork of and pull request to the repository \cite{github_nvidia_issue}
\cite{eom_nvidia_driver} \cite{github_nvidia_pr}. Despite attempting to install each driver version through multiple methods,
the same error persisted:
\begin{verbatim}
RmInitAdapter failed!
\end{verbatim}
Given that the same GPU had functioned properly in the same server running Arch Linux, the problem was assumed not to be a hardware
malfunction. For the sake of progress, it was decided that solving this issue would be left for a future iteration of the
project; however, this unit will be essential for data science and AI tasks, and represents a significant financial investment
for the network. Resolving this issue will, therefore, be a top priority.
\section{Network Services}
A suite of services was deployed to the cluster using \textit{ansible-role-eom}, which was added to
\textit{eom-software-infrastructure} as a submodule. This repository defines five services for the platform: OpenLDAP and
four instances of Apache HTTP Server. A basic LDAP database was configured with a single administrative user to be
authenticated over HTTP basic access authentication from the Apache servers. Each instance of HTTPD was deployed with a unique
configuration to support a different service: Server Side Includes (SSI), WebDAV, git-http-backend with Gitweb, and a reverse
proxy. With the exception of git-http-backend and Gitweb, these services are available through the base functionality of
Apache HTTP Server, and need only be enabled by configuration.
TLS certificates in Kubernetes are typically requested from and provided by dedicated resources provisioned within the
cluster \cite{k8s_tls}. In a production environment, one would typically use a third-party service such as cert-manager to
request certificates from a trusted authority \cite{cert_manager_docs}. While the use of Helm charts makes the deployment of
cert-manager trivial, and its use offers benefits such as automated certificate rotation, it was found to be incompatible
with this iteration of the network. In order to request certificates from a trusted authority such as Let's Encrypt using
cert-manager, an \textit{issuer} of type ACME must be provisioned to solve HTTP01 or DNS01 challenges
\cite{cert_manager_issuer}. Given that the domain name registrar for \textit{eom.dev} does not offer API endpoints for
updating DNS records, an HTTP01 ACME issuer would need to be used. Such an issuer relies on Kubernetes \textit{ingress} rules
for proper routing of HTTP traffic, for which a production-ready \textit{ingress controller}, such as \textit{ingress-nginx},
is required \cite{k8s_ingress_controllers}. The ingress-nginx controller assumes that the Kubernetes cluster runs in a
cloud environment; on bare metal, the controller must either be exposed as a \textit{NodePort} service or be paired with
\textit{MetalLB} \cite{ingress_nginx_bare_metal}. The NodePort method was chosen here for simplicity. After adding a
considerable number of resources to the cluster, it was found that the HTTP01 self-check request would time out, leaving
the Kubernetes \textit{challenge}, \textit{certificate request}, and \textit{certificate} resources stuck in a
\textit{pending} state with the following error message:
\begin{verbatim}
Waiting for HTTP-01 challenge propagation: failed to perform self check GET request ...
\end{verbatim}
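For reference, the kind of ACME issuer that was attempted resembles the following sketch; the issuer name and contact
address are placeholders rather than values from the actual cluster.
\begin{verbatim}
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt              # placeholder name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com     # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
\end{verbatim}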
At this point, it was determined that properly configuring issuers for the cluster was beyond the scope of this release
version, so an alternative solution was devised: quite simply, an instance of Apache httpd was deployed as a NodePort service
to function as a reverse proxy for the HTTP applications running as ClusterIP services. A generic TLS certificate was acquired
manually using \textit{certbot} and uploaded to the proxy server using an Ansible secret.
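A sketch of such a NodePort service is shown below; the service name, selector, and port numbers are illustrative
assumptions rather than the exact resources deployed.
\begin{verbatim}
apiVersion: v1
kind: Service
metadata:
  name: proxy                # hypothetical name
spec:
  type: NodePort
  selector:
    app: proxy
  ports:
    - name: https
      port: 443
      targetPort: 443
      nodePort: 30443        # must fall within the cluster's NodePort range
\end{verbatim}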
\section{Content Publication}
The services described above work in tandem to produce a basic content publication platform. Software repositories are
distributed through \textit{git.eom.dev}, HTML content is published at \textit{www.eom.dev}, and supplemental media is served
at \textit{media.eom.dev}. Authentication policies defined in Apache httpd and powered by OpenLDAP control permissions for
access and actions, allowing for secure administration. The platform described here is quite generic, and can be further
tailored for a wide variety of use cases. For this initial release, the live network is being used to host its own source
code as well as this article. While serving git repositories was inherent to the Gitweb server, article publication required
a bespoke pipeline. Taking inspiration from arXiv's initiative to convert scholarly articles written in TeX and LaTeX to the
more accessible HTML, this paper was composed in TeX and compiled to HTML using the same LaTeXML software as Cornell
University's service \cite{arxiv_html} \cite{github_latexml}. The source code for this paper was added to the
\textit{eom-software-infrastructure} repository, and the compiled PDF and HTML documents were uploaded to
\textit{media.eom.dev} to be added to \textit{www.eom.dev} using SSI.
\section{Conclusion}
\end{document}
\begin{document}
\maketitle
\section{Introduction}
Release v0.1.0 of \textit{eom-software-infrastructure} and \textit{ansible-role-eom-services} constitutes the minimum-viable
software and hardware infrastructure for \textit{eom.dev}. The former repository contains an Ansible playbook used to deploy a
bare-metal Kubernetes cluster. The latter contains an Ansible role that deploys a suite of web services to that cluster. This paper
documents the design, development, and deployment of \textit{eom.dev}. Difficulties encountered are discussed, and desired
features are detailed. The purpose of this paper is to provide context to this release version, as well as a potential roadmap
for future iterations of the network.
\section{Hardware}
Three machines from Dell Technologies were used in the making of this network: a PowerEdge R350, a PowerEdge T640, and a
Latitude 7230, fulfilling the roles of Kubernetes control plane, worker, and administrator node, respectively. The following
sections describe the hardware specifications of each machine.
\subsection{Dell PowerEdge R350}
The PowerEdge R350 has (specs)
\subsection{Dell PowerEdge T640}
The PowerEdge T640 is a more powerful machine, with (specs)
\subsubsection{RAID 10 Array}
A 32 TiB RAID 10 array consisting of four 16 TiB hard drives serves as the primary data store for the network. The T640's
PERC H350 controller provides hardware RAID, offering superior performance to a software implementation. Presented to the
operating system as a single drive, the array was encrypted, partitioned with LVM, and mounted at \textit{/data/store-0}.
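The preparation of this volume could be expressed as Ansible tasks along the following lines; the device path, volume
names, and filesystem type are assumptions, and the actual array may have been prepared manually.
\begin{verbatim}
# Hypothetical sketch of preparing the RAID 10 volume; device path,
# names, and filesystem type are assumptions.
- name: Open the existing LUKS container on the RAID array
  community.general.luks_device:
    device: /dev/sdb
    name: store0
    state: opened
    passphrase: "{{ store0_passphrase }}"

- name: Create a volume group on the decrypted device
  community.general.lvg:
    vg: store0
    pvs: /dev/mapper/store0

- name: Allocate a logical volume for bulk storage
  community.general.lvol:
    vg: store0
    lv: data
    size: 100%FREE

- name: Create a filesystem on the logical volume
  community.general.filesystem:
    fstype: ext4
    dev: /dev/store0/data

- name: Mount the volume at /data/store-0
  ansible.posix.mount:
    src: /dev/store0/data
    path: /data/store-0
    fstype: ext4
    state: mounted
\end{verbatim}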
\subsubsection{Nvidia RTX A6000}
For data-science and AI workloads, the T640 was equipped with an RTX A6000 GPU from Nvidia. (specs). For use in a Kubernetes
cluster, installation of GPU drivers and device plugins is required. As will be discussed later, difficulties with Nvidia
drivers prevented the use of the device in this release version.
\subsection{Dell Latitude 7230}
The Latitude 7230 Rugged Extreme is an IP-65 rated tablet computer designed by Dell \cite{dell_latitude}. The model used in
this project was configured with a 12th Gen Intel® Core™ i7-1260U, 32 GiB of memory, 1 TiB of storage, and 4G LTE mobile
broadband. With the detachable keyboard and active stylus accessories, this device is a capable software development
workstation and cluster administration node that is reliable in terms of performance, connectivity, and durability.
\section{EOM Software Infrastructure}
Configuration of the above hardware was automated through the \textit{eom-software-infrastructure} repository.
\subsection{Ansible}
Ansible provides the primary automation framework for \textit{eom-software-infrastructure}. With the exception of installing
Ansible itself and configuring SSH keys, the deployment of \textit{eom.dev} is automated entirely through playbooks and roles.
Third-party roles from the community-driven Ansible Galaxy were used to quickly deploy a significant portion of the
infrastructure, including Containerd, Kubernetes, and Docker, while custom roles were created to configure the more bespoke
elements of the network. Idempotent playbooks further allowed for incremental feature development with the ability to easily
restore previous versions of the system.
\subsection{Debian}
The installation of Debian 12 on each node was automated using preseed files. After an initial manual installation,
\textit{ansible-role-debian} dumps the debconf database for each node to a file that was sanitized and copied to the
\textit{eom-software-infrastructure} repository. The role additionally performs a preliminary configuration of the base
installation, which includes enabling the firewall, configuring open ports, and setting the MOTD.
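A sketch of how such a dump might be expressed as Ansible tasks follows; the task structure and destination path are
assumptions rather than the actual contents of \textit{ansible-role-debian}.
\begin{verbatim}
# Hypothetical sketch of dumping the debconf database for preseeding;
# the destination path is an assumption.
- name: Dump the debconf database, including installer questions
  ansible.builtin.command:
    cmd: debconf-get-selections --installer
  register: debconf_dump
  changed_when: false

- name: Save the dump locally for sanitization
  ansible.builtin.copy:
    content: "{{ debconf_dump.stdout }}"
    dest: "preseed/{{ inventory_hostname }}.cfg"
  delegate_to: localhost
  become: false
\end{verbatim}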
\subsection{Nvidia}
Nvidia drivers were installed using the official Ansible role; however, significant difficulties were encountered during this
process, and the GPU could not be enabled in this release version.
\subsection{Kubernetes}
As could be inferred from the roles defined for each PowerEdge server, Kubernetes will be installed on bare metal.
Alternatively, the entire cluster could have been hosted on one server using virtual machines to fulfill the roles of
control-plane and workers. In fact, doing so offers several advatages, and would potentially mitigate issues encountered
when attempting to deploy services to the cluster; however, creating and running virtual machines adds overhead to both the
project and software stack. For these reasons, this release version foregoes the use of virtual machines, though this may
change in future iterations. This decision made, the installation of Kubernetes and Containerd was managed by Ansible roles
written by Jeff Geerling \cite{geerling_containerd} \cite{geerling_kubernetes}.
\section{EOM Services}
A suite of web services was deployed to the cluster described above using \textit{ansible-role-eom-services}, which was added
to the \textit{eom-software-infrastructure} repository as a submodule. Though v0.1.0 of both repositories is being released
in tandem, synchronicity between release versions is not intended to be maintained in the future. The following sections
enumerate the services defined in \textit{ansible-role-eom-services}.
\subsection{OpenLDAP}
OpenLDAP provides the basis for single sign-on to the network. Here, the \textit{osixia/openldap} container was used; however,
the \textit{bitnami/openldap} container may be used in the future, as it offers support for extra schemas through environment
variables and is maintained by a verified publisher \cite{dockerhub_bitnami_openldap}. Regardless, a single user account was
added to the database, and no other deviations from the default deployment were made. Although the container used here supports
mapping additional \textit{ldif} configurations to be applied when the database is first created, and doing so would allow the
applied configuration to be managed by Ansible, OpenLDAP salts passwords prior to hashing, making the storage of a properly
hashed \textit{slappasswd} value in an Ansible secret difficult if not impossible. To accommodate this, the \textit{ldif} file
was applied manually and the LDAP database was stored in a Kubernetes \textit{PersistentVolume}.
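The persistent storage backing the directory can be declared with a claim along the following lines; the claim name and
requested size are assumptions.
\begin{verbatim}
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openldap-data        # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi          # assumed size
\end{verbatim}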
\subsection{Proxy}
An instance of Apache httpd deployed as a NodePort service functions as a reverse proxy, terminating TLS and routing external
traffic to the HTTP applications running as ClusterIP services. A generic TLS certificate was acquired manually using
\textit{certbot} and uploaded to the proxy using an Ansible secret.
\subsection{Git}
There are many existing options for self-hosted git repositories, including GitLab, Gitea, and CGit. The git software package
itself, however, offers the \textit{git-http-backend} script, which is a simple CGI script to serve repositories over HTTP
\cite{git_http_backend}. With official documentation supporting the configuration of Gitweb as a front-end user interface, a
Docker container was created in order to deploy this stack on the network. While the generic container functioned as
expected, several issues were encountered when it was deployed to the cluster. For this environment, the httpd ScriptAlias
for git-http-backend was modified from \textit{"^/git/"} to simply \textit{"^/"} in order to stylize the URL
\textit{https://git.eom.dev/}; however, this caused httpd to pass requests for static resources (such as CSS and JavaScript
files) to the CGI scripts. In order to preserve the stylized URL, these static resources were uploaded to the media server
and Gitweb was reconfigured to use the external copies. It is worth noting that the configuration file for this
application resides directly in \textit{/etc}, which necessitates defining a \textit{subPath} for the volume mount in the
Kubernetes deployment in order to prevent overwriting the entire directory.
\begin{verbatim}
spec:
  containers:
    - name: gitweb
      image: ericomeehan/gitweb
      volumeMounts:
        - name: gitweb-config
          mountPath: /etc/gitweb.conf
          subPath: gitweb.conf
  volumes:
    - name: gitweb-config
      configMap:
        name: git-gitweb
\end{verbatim}
\subsection{Media}
Somewhat unusually, file sharing is achieved using WebDAV over Apache httpd. While HTTP is a cumbersome protocol for bulk file
transfers, it offers advantages that made it an appropriate choice for this phase of the project: configuration required only
slight modification of files used for the deployment of other services, the service was intended primarily for publishing files
over HTTP rather than uploading and downloading efficiently, and the WebDAV protocol inherently supports the creation of CalDAV
files that will be used to organize the next phase of this project. The remote filesystem was mounted locally on the Latitude
7230 using \textit{davfs2}; however, performance was poor, so it may be necessary to achieve this functionality by creating a
\textit{ReadWriteMany PersistentVolume} to be hosted both over WebDAV and a more efficient protocol such as FTP or SMB.
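Mounting the share on a workstation can be expressed as a single Ansible task along these lines; the mount point and
options are assumptions, and \textit{davfs2} still expects credentials to be provided separately in
\textit{/etc/davfs2/secrets}.
\begin{verbatim}
# Hypothetical sketch of mounting the WebDAV share with davfs2;
# the mount point and options are assumptions.
- name: Mount media.eom.dev over WebDAV
  ansible.posix.mount:
    src: https://media.eom.dev/
    path: /mnt/media
    fstype: davfs
    opts: _netdev,rw
    state: mounted
\end{verbatim}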
\subsection{WWW}
The last service deployed was also the simplest: an HTML website hosted, again, on Apache httpd. The only modifications made
to the base configuration for this service were enabling server-side includes and the common authorization configuration.
Simplicity in form matches simplicity in function, as this service is intended to provide basic information about the network
and link to external resources either by hyperlink or SSI.
\subsubsection{Blog}
The blog page of www was designed specifically to organize articles stored on the media server. Taking inspiration from
arXiv's initiative to convert scholarly papers written in the TeX and LaTeX formats to the more accessible HTML, a similar
publication pipeline was employed here \cite{arxiv_html}. This article was composed in the TeX format and stored under
source control at https://git.eom.dev/software-infrastructure. The same \textit{LaTeXML} software used by arXiv was used here
to generate XML, HTML, and PDF variants of the document, each of which was uploaded to
https://media.eom.dev/EOM_Infrastructure_v0.1.0/ \cite{github_latexml}. The HTML file was then embedded into the blog page
using SSI. Each of these steps was executed manually for this release version; however, an automated pipeline may be used to
compile, publish, and embed documents in the future.
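Such a pipeline might be automated with tasks along the following lines; the file names, paths, and credential variables
are assumptions about a future implementation rather than part of this release.
\begin{verbatim}
# Hypothetical sketch of an automated compile-and-publish step; file
# names, paths, and credential variables are assumptions.
- name: Compile the TeX source to HTML with LaTeXML
  ansible.builtin.command:
    cmd: latexmlc draft_1.tex --destination=draft_1.html
    chdir: docs

- name: Publish the compiled HTML to the media server over WebDAV
  ansible.builtin.uri:
    url: https://media.eom.dev/EOM_Infrastructure_v0.1.0/draft_1.html
    method: PUT
    src: docs/draft_1.html
    url_username: "{{ media_user }}"
    url_password: "{{ media_password }}"
    status_code: [200, 201, 204]
\end{verbatim}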
\section{Conclusion}
In this version of the EOM network infrastructure, the automated deployment of Kubernetes control plane, worker, and
administrator nodes through Ansible has been achieved. Additionally, a suite of services providing file sharing, git
repository hosting, and content publishing has been developed for deployment on top of the aforementioned cluster. The result
is a durable and secure, though currently minimalistic, platform for software development and research publication. As the
release version implies, many features remain to be added both in the next minor version and in anticipation of the first
major version of the project. The next phase of this project will focus on fixing Nvidia drivers and deploying services on
the network rather than modifying the cluster itself. Development will, as a result, take place more in the
\textit{ansible-role-nvidia-driver} and \textit{ansible-role-eom-services} repositories. A database and API would be particularly
useful additions to the latter. Once the digital platform is more mature, consideration may turn towards adapting the cluster
to run on virtual machines, which would aid in addressing the problems with TLS issuers.
\printbibliography
\end{document}

35
docs/draft_2.tex Normal file

@ -0,0 +1,35 @@
\documentclass{article}
\title{EOM Software Infrastructure \\ v0.1.0}
\author{Eric O'Neill Meehan}
\date{\today}
\begin{document}
\maketitle
\begin{abstract}
\end{abstract}
\section{Introduction}
Release v0.1.0 of \textit{eom-software-infrastructure} and its components constitutes the minimum-viable hardware and software
infrastructure for \textit{eom.dev}, a self-hosted media publication platform. For this release, three machines were
configured into a cluster hosting a suite of web services. The purpose of this paper is to add context to the release by
documenting design decisions, highlighting problematic areas, and enumerating potential improvements for future iterations;
further, this paper will describe the publication pipeline that has been deployed to the network.
\section{Hardware Infrastructure}
\section{Software Infrastructure}
\subsection{Nvidia Drivers}
\section{Network Services}
\subsection{Content Publication}
\section{Conclusion}
Release v0.1.0 of \textit{eom-software-infrastructure} and its components were used to configure two PowerEdge servers and a
Latitude tablet into a Kubernetes cluster, development environment, and publication platform used to host its own source code
and this supplementary paper.
\end{document}


@ -0,0 +1,16 @@
# Host vars for inspiron-3670
nvidia_driver_needed: true
packages:
- curl
- davfs2
- gimp
- git
- gphoto2
- latexml
- neovim
- passwordsafe
- texlive-full
- thunderbird
- tmux
- torbrowser-launcher
- w3m


@ -0,0 +1,3 @@
# Host vars for latitude-7230
ansible_connection: local

7
inspiron.yaml Normal file

@ -0,0 +1,7 @@
---
# Playbook for workstations
- name: Initialize system
hosts: inspiron-3670
become: true
roles:
- role: ericomeehan.nvidia_driver


@ -3,17 +3,15 @@ all:
children:
workstations:
hosts:
mobile-command:
latitude-7230:
ansible-host: 192.168.1.123
clusters:
inspiron-3670:
ansible-host: 192.168.1.210
imac:
ansible-host: 192.168.1.139
servers:
children:
alpha:
children:
control_plane:
hosts:
alpha-control-plane:
ansible-host: 192.168.1.137
workers:
hosts:
alpha-worker-0:
ansible-host: 192.168.1.138
poweredge-r350:
ansible-host: 192.168.1.137
poweredge-t640:
ansible-host: 192.168.1.138


@ -1,8 +1,7 @@
---
# Playbook for mobile-command
# Playbook for workstations
- name: Initialize system
hosts: mobile-command
connection: local
hosts: workstations
become: true
roles:
- role: ericomeehan.debian
@ -28,8 +27,11 @@
- git
- gphoto2
- gpsd
- latexml
- neovim
- passwordsafe
- python3-venv
- texlive-full
- thunderbird
- tmux
- torbrowser-launcher