diff --git a/docs/draft_1.tex b/docs/draft_1.tex
new file mode 100644
index 0000000..e9f37bf
--- /dev/null
+++ b/docs/draft_1.tex
@@ -0,0 +1,296 @@
\documentclass{article}

\usepackage[style=authoryear]{biblatex}
\usepackage{graphicx}

\addbibresource{EOM_Software_Infrastructure_v0.1.0.bib}

\title{EOM Software Infrastructure\\v0.1.0}
\author{Eric O'Neill Meehan}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
    This paper recounts the development of \textit{eom-software-infrastructure} v0.1.0 and
    \textit{ansible-role-eom-services} v0.1.0: an Ansible playbook and role used to deploy the network infrastructure
    and application suite for \textit{eom.dev}. It describes the hardware and software stacks used in the EOM network
    as well as the challenges faced in their deployment. Possible improvements are discussed throughout, and a more
    concrete roadmap for future iterations is laid out in the conclusion. The purpose of this paper is to provide
    context for these release versions by documenting design decisions, highlighting problematic areas, and
    enumerating potential improvements for future releases.
\end{abstract}

\section{Introduction}
This paper recounts the development of v0.1.0 of \textit{eom-software-infrastructure} and its components: a software
repository used to deploy the self-hosted content publication platform \textit{eom.dev}. As the first minor release,
this iteration constitutes the minimum-viable infrastructure for the network. The purpose of this paper is to provide
context for this release by documenting development decisions, highlighting problematic areas, and proposing a roadmap
for future releases; further, it outlines the initial content publication pipeline that has been deployed to the
network.

\section{Hardware Infrastructure}
Three machines from Dell Technologies were used in the making of this network: a PowerEdge R350, a PowerEdge T640, and
a Latitude 7230. The hardware specifications for each can be found in figures 1, 2, and 3 respectively. As shown in
figure 4, the R350 and T640 were networked over LAN to a router, and the 7230 was connected internally via WiFi and
externally via AT\&T mobile broadband. The T640 was also equipped with a 32 TiB RAID 10 array and an Nvidia RTX A6000
GPU.

\section{Software Infrastructure}
Configuration of the above hardware was source-controlled in the \textit{eom-software-infrastructure} repository.
Debian 12 was selected as the operating system for all three nodes and, after an initial manual installation using the
live installer, preseed files were generated for each machine by dumping the debconf database, providing a mechanism
for automating subsequent installations. These files were sanitized of sensitive data and then added to the
aforementioned repository. Further configuration was managed through Ansible playbooks and roles within the same
repository. A bespoke Debian configuration and user environment was deployed to each node using
\textit{ansible-role-debian} and \textit{ansible-role-ericomeehan}, which were added as submodules to
\textit{eom-software-infrastructure}. Ansible roles written by Jeff Geerling and distributed through Ansible Galaxy
were also added to the repository to configure the three machines into a Kubernetes cluster. Containerd and Kubernetes
were installed on the R350 and T640 to configure a Kubernetes control plane and worker respectively
\cite{geerling_containerd} \cite{geerling_kubernetes}.
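At a high level, the playbook simply applies these community roles to the appropriate inventory groups. The following
is a minimal sketch of such wiring (the group name and variable values here are illustrative, not the repository's
actual configuration):
\begin{verbatim}
---
# Sketch: configure the PowerEdge servers as a Kubernetes cluster
# using Jeff Geerling's community roles. Group and variable values
# are illustrative.
- name: Configure Kubernetes cluster
  hosts: servers
  become: true
  roles:
    - role: geerlingguy.containerd
    - role: geerlingguy.kubernetes
  vars:
    # 'control_plane' on the R350; 'node' on the T640.
    kubernetes_role: control_plane
\end{verbatim}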
Docker was additionally installed on the 7230 for application development \cite{geerling_docker}. Drivers for the
T640's Nvidia GPU were installed using the role created by Nvidia, also distributed through Ansible Galaxy; however,
as detailed in the following section, the role executed successfully but failed to enable the GPU in this release
version.
\subsection{Nvidia Drivers}
Significant difficulties were encountered while attempting to install drivers for the RTX A6000. For the Bookworm
release, Debian provides version 535.183.01 of the Nvidia driver along with the Tesla 470 driver; however, Nvidia
recommends using Data Center Driver version 550.95.07 with the RTX A6000 \cite{debian_nvidia} \cite{nvidia_search}.
Adding to the confusion, Nvidia offers an official Ansible role for driver installation that is incompatible with the
Debian operating system \cite{github_nvidia_driver}. An issue on that repository offers a workaround involving the
modification of role variables, which was used as the basis for a fork of, and pull request to, the repository
\cite{github_nvidia_issue} \cite{eom_nvidia_driver} \cite{github_nvidia_pr}. Each driver version was attempted through
multiple installation methods, but the same error persisted:
\begin{verbatim}
RmInitAdapter failed!
\end{verbatim}
Given that this same GPU had been functioning properly in the same server running Arch Linux, a hardware malfunction
was considered unlikely. For the sake of progress, solving this issue was deferred to a future iteration of the
project; however, this unit will be essential for data science and AI tasks, and it represents a significant financial
investment for the network. Resolving this issue in \textit{ansible-role-nvidia-driver} will, therefore, be a top
priority.

\section{Network Services}
A suite of services was deployed to the cluster using \textit{ansible-role-eom-services}, which was added to
\textit{eom-software-infrastructure} as a submodule. This repository defines five services for the platform: OpenLDAP
and four instances of Apache HTTP Server. A basic LDAP database was configured with a single administrative user, to
be authenticated via HTTP basic access authentication from the Apache servers. Each instance of httpd was deployed
with a unique configuration to support a different service: Server Side Includes (SSI), WebDAV, git-http-backend with
Gitweb, and a reverse proxy. With the exception of git-http-backend and Gitweb, these services are available through
the base functionality of Apache HTTP Server and need only be enabled by configuration.

TLS certificates in Kubernetes are typically requested from and provided by dedicated resources provisioned within the
cluster \cite{k8s_tls}. In a production environment, one would typically use a third-party service such as
cert-manager to request certificates from a trusted authority \cite{cert_manager_docs}. While the use of Helm charts
makes the deployment of cert-manager trivial, and its employment offers benefits such as automated certificate
rotation, it was found to be incompatible with this iteration of the network. In order to request certificates from a
trusted authority such as Let's Encrypt using cert-manager, an \textit{issuer} of type ACME must be provisioned to
solve HTTP01 or DNS01 challenges \cite{cert_manager_issuer}. Given that the domain name registrar for \textit{eom.dev}
does not offer API endpoints for updating DNS records, an HTTP01 ACME issuer would need to be used.
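For reference, an HTTP01 issuer would be declared roughly as follows (a hypothetical sketch; the resource name and
contact address are placeholders):
\begin{verbatim}
# Sketch: a cert-manager ClusterIssuer that solves ACME HTTP01
# challenges against Let's Encrypt. Names and email are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@eom.dev
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
\end{verbatim}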
Such an issuer relies on Kubernetes \textit{ingress} rules for proper routing of HTTP traffic, for which a
production-ready \textit{ingress controller}, such as \textit{ingress-nginx}, is required
\cite{k8s_ingress_controllers}. The \textit{ingress-nginx} controller presumes that the Kubernetes cluster exists in a
cloud environment; on bare metal, it must either run as a \textit{NodePort} service or rely on \textit{MetalLB}
\cite{ingress_nginx_bare_metal}. The NodePort method was chosen here for simplicity. After adding a considerable
number of resources to the cluster, it was found that the HTTP01 challenge request would time out, leaving the
Kubernetes \textit{challenge}, \textit{certificaterequest}, and \textit{certificate} resources stuck in a
\textit{pending} state with the following error message:
\begin{verbatim}
Waiting for HTTP-01 challenge propagation: failed to perform self check GET request ...
\end{verbatim}
At this point, it was determined that properly configuring issuers for the cluster was beyond the scope of this
release version, so an alternative solution was devised: quite simply, an instance of Apache httpd was deployed as a
NodePort service to function as a reverse proxy for HTTP applications running as ClusterIP services. A generic TLS
certificate was acquired manually using \textit{certbot} and uploaded to the proxy server using an Ansible secret.
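The Kubernetes Service for this proxy is then a straightforward NodePort declaration; a minimal sketch follows (names
and port numbers are illustrative):
\begin{verbatim}
# Sketch: expose the Apache httpd reverse proxy on a node port so
# that it is reachable from outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: proxy
spec:
  type: NodePort
  selector:
    app: proxy
  ports:
    - name: https
      port: 443
      targetPort: 443
      nodePort: 30443
\end{verbatim}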
\section{Content Publication}
The services described above work in tandem to produce a basic content publication platform. Software repositories are
distributed through \textit{git.eom.dev}, HTML content is published at \textit{www.eom.dev}, and supplemental media is
served at \textit{media.eom.dev}. Authentication policies defined in Apache httpd and backed by OpenLDAP control
permissions for access and actions, allowing for secure administration. The platform described here is quite generic
and can be tailored to a wide variety of use cases. For this initial release, the live network is being used to host
its own source code as well as this article. While serving git repositories was inherent to the Gitweb server, article
publication required a bespoke pipeline. Taking inspiration from arXiv's initiative to convert scholarly articles
written in TeX and LaTeX to the more accessible HTML, this paper was composed in TeX and compiled to HTML using the
same LaTeXML software as Cornell University's service \cite{arxiv_html} \cite{github_latexml}. The source code for
this paper was added to the \textit{eom-software-infrastructure} repository, and the compiled PDF and HTML documents
were uploaded to \textit{media.eom.dev} to be included in \textit{www.eom.dev} using SSI.

\section{Conclusion}

\end{document}

% --- Second, restructured draft ---

\begin{document}

\maketitle

\section{Introduction}
Release v0.1.0 of \textit{eom-software-infrastructure} and \textit{ansible-role-eom-services} constitutes the
minimum-viable software and hardware infrastructure for \textit{eom.dev}. The former repository contains an Ansible
playbook used to deploy a bare-metal Kubernetes cluster; the latter contains an Ansible role that deploys a suite of
web services to that cluster. This paper documents the design, development, and deployment of \textit{eom.dev}.
Difficulties encountered are discussed, and desired features are detailed. The purpose of this paper is to provide
context for this release version, as well as a potential roadmap for future iterations of the network.

\section{Hardware}
Three machines from Dell Technologies were used in the making of this network: a PowerEdge R350, a PowerEdge T640, and
a Latitude 7230, fulfilling the roles of Kubernetes control plane, worker, and administrator nodes respectively. The
following sections describe the hardware specifications of each machine.
\subsection{Dell PowerEdge R350}
The PowerEdge R350 has (specs)
\subsection{Dell PowerEdge T640}
The PowerEdge T640 is a more powerful machine, with (specs)
\subsubsection{RAID 10 Array}
A 32 TiB RAID 10 array consisting of four 16 TiB hard drives serves as the primary data storage container for the
network. The T640 provides hardware RAID through the PERC H350 system, offering superior performance to software
RAID. Presented to the operating system as a single drive, the array was encrypted, partitioned with LVM, and then
mounted to the filesystem at \textit{/data/store-0}.
\subsubsection{Nvidia RTX A6000}
For data-science and AI workloads, the T640 was equipped with an RTX A6000 GPU from Nvidia. (specs). For use in a
Kubernetes cluster, installation of GPU drivers and device plugins is required. As will be discussed later,
difficulties with Nvidia drivers prevented the use of the device in this release version.
\subsection{Dell Latitude 7230}
The Latitude 7230 Rugged Extreme is an IP65-rated tablet computer designed by Dell \cite{dell_latitude}. The model
used in this project was configured with a 12th Gen Intel® Core™ i7-1260U, 32 GiB of memory, 1 TiB of storage, and 4G
LTE mobile broadband. With the detachable keyboard and active stylus accessories, this device is a capable software
development workstation and cluster administration node that is reliable in terms of performance, connectivity, and
durability.

\section{EOM Software Infrastructure}
Configuration of the above hardware was automated through the \textit{eom-software-infrastructure} repository.
\subsection{Ansible}
Ansible provides the primary automation framework for \textit{eom-software-infrastructure}. With the exception of
installing the application itself and the configuration of SSH keys, the deployment of \textit{eom.dev} is automated
completely through playbooks and roles. Third-party roles from the community-driven Ansible Galaxy were used to
quickly deploy a significant degree of infrastructure, including Containerd, Kubernetes, and Docker, and custom roles
were created to configure the more bespoke elements of the network. Idempotent playbooks further allowed for
incremental feature development with the ability to easily restore previous versions of the system.
\subsection{Debian}
The installation of Debian 12 on each node was automated using preseed files. After an initial manual installation,
\textit{ansible-role-debian} dumps the debconf database of each node to a file, which was sanitized and copied to the
\textit{eom-software-infrastructure} repository.
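Within the role, the dump step might be expressed as tasks along these lines (a hypothetical sketch; the actual task
names and destination paths are defined in \textit{ansible-role-debian}):
\begin{verbatim}
# Sketch: dump each node's debconf database and save a copy
# locally for sanitization and source control.
- name: Dump debconf database for preseeding
  ansible.builtin.command: debconf-get-selections --installer
  register: preseed_dump
  changed_when: false

- name: Save preseed file to the controller
  ansible.builtin.copy:
    content: "{{ preseed_dump.stdout }}"
    dest: "preseeds/{{ inventory_hostname }}.cfg"
  delegate_to: localhost
\end{verbatim}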
The role additionally performs a preliminary configuration of the base installation, which includes enabling the
firewall, configuring open ports, and setting the MOTD.
\subsection{Nvidia}
Nvidia drivers were installed using the official Ansible role; however, significant difficulties were encountered
during this process.
\subsection{Kubernetes}
As the roles defined for each PowerEdge server imply, Kubernetes was installed on bare metal. Alternatively, the
entire cluster could have been hosted on one server using virtual machines to fulfill the roles of control plane and
workers. In fact, doing so offers several advantages and would potentially mitigate issues encountered when
attempting to deploy services to the cluster; however, creating and running virtual machines adds overhead to both
the project and the software stack. For these reasons, this release version forgoes the use of virtual machines,
though this may change in future iterations. With this decision made, the installation of Kubernetes and Containerd
was managed by the Ansible roles written by Jeff Geerling \cite{geerling_containerd} \cite{geerling_kubernetes}.

\section{EOM Services}
A suite of web services was deployed to the cluster described above using \textit{ansible-role-eom-services}, which
was added to the \textit{eom-software-infrastructure} repository as a submodule. Though v0.1.0 of both repositories is
being released in tandem, synchronicity between their release versions is not intended to be maintained into the
future. The following sections enumerate the services defined in \textit{ansible-role-eom-services}.
\subsection{OpenLDAP}
OpenLDAP provides the basis for single sign-on to the network. Here, the \textit{osixia/openldap} container was used;
however, the \textit{bitnami/openldap} container may be used in the future, as it offers support for extra schemas
through environment variables and is maintained by a verified publisher \cite{dockerhub_bitnami_openldap}. Regardless,
a single user account was applied to the database, and no other deviations from the default deployment were made.
Although the container used here supports mapping additional \textit{ldif} configurations to be applied when the
database is first created, and doing so would allow the applied configuration to be managed by Ansible, OpenLDAP salts
passwords prior to hashing, making the output of \textit{slappasswd} non-deterministic and the storage of a properly
hashed password in an Ansible secret difficult, if not impossible. To accommodate this, the \textit{ldif} file was
applied manually, and the LDAP database was stored in a Kubernetes \textit{PersistentVolume}.
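A claim for such a volume might be declared roughly as follows (the name and size are hypothetical):
\begin{verbatim}
# Sketch: persist the LDAP database across pod restarts.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openldap-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
\end{verbatim}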
\subsection{Proxy}
\subsection{Git}
There are many existing options for self-hosted git repositories, including GitLab, Gitea, and CGit. The git software
package itself, however, offers the \textit{git-http-backend} script, a simple CGI script that serves repositories
over HTTP \cite{git_http_backend}. With official documentation supporting the configuration of Gitweb as a front-end
user interface, a Docker container was created in order to deploy this stack on the network. While the generic
container functioned as expected, several issues were encountered when it was deployed to the cluster. For this
environment, the httpd ScriptAlias for git-http-backend was modified from \textit{"^/git/"} to simply \textit{"^/"}
in order to stylize the URL \textit{https://git.eom.dev/}; however, this caused httpd to pass requests for static
resources (such as CSS and JavaScript files) to the CGI scripts. In order to preserve the stylized URL, these static
resources were uploaded to the media server and Gitweb was reconfigured to use the external resources. It is worth
noting that the configuration file for this application resides directly in \textit{/etc}, which necessitates defining
a sub-path for the volume mount of the Kubernetes deployment in order to avoid overwriting the entire directory:
\begin{verbatim}
spec:
  containers:
    - name: gitweb
      image: ericomeehan/gitweb
      volumeMounts:
        # Mount only gitweb.conf, not the whole ConfigMap volume,
        # so the rest of /etc remains intact.
        - name: gitweb-config
          mountPath: /etc/gitweb.conf
          subPath: gitweb.conf
  volumes:
    - name: gitweb-config
      configMap:
        name: git-gitweb
\end{verbatim}
\subsection{Media}
Somewhat unusually, file sharing is achieved using WebDAV over Apache httpd. While HTTP is a cumbersome protocol for
bulk file transfers, it offers advantages that made it an appropriate choice for this phase of the project:
configuration required only slight modification of the files used for the deployment of other services, the service
was intended primarily for publishing files over HTTP rather than for efficient uploading and downloading, and the
WebDAV protocol inherently supports the CalDAV calendars that will be used to organize the next phase of this project.
The remote filesystem was mounted locally on the Latitude 7230 using \textit{davfs2}; however, performance was poor,
so it may be necessary to achieve this functionality by creating a \textit{ReadWriteMany PersistentVolume} to be
served both over WebDAV and over a more efficient protocol such as FTP or SMB.
\subsection{WWW}
The last service deployed was also the simplest: an HTML website hosted, again, on Apache httpd. The only
modifications made to the base configuration for this service were enabling server-side includes and the common
authorization configuration. Simplicity in form matches simplicity in function, as this service is intended to provide
basic information about the network and link to external resources either by hyperlink or SSI.
\subsubsection{Blog}
The blog page of www was designed specifically to organize articles stored on the media server. Taking inspiration
from arXiv's initiative to convert scholarly papers written in the TeX and LaTeX formats to the more accessible HTML,
a similar publication pipeline was employed here \cite{arxiv_html}. This article was composed in the TeX format and
stored under source control at https://git.eom.dev/software-infrastructure. The same \textit{LaTeXML} software used by
arXiv was used here to generate XML, HTML, and PDF variants of the document, each of which was uploaded to
https://media.eom.dev/EOM_Infrastructure_v0.1.0/ \cite{github_latexml}. The HTML file was then embedded into the blog
page using SSI. Each of these steps was executed manually for this release version; however, an automated pipeline may
be used to compile, publish, and embed documents in the future.
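Expressed as Ansible tasks, such a pipeline might look roughly like the following (a speculative sketch; file names
and credentials are placeholders):
\begin{verbatim}
# Sketch: compile the TeX source with LaTeXML, then publish the
# result to the media server over WebDAV.
- name: Compile TeX source to HTML
  ansible.builtin.command: >
    latexmlc --destination=article.html article.tex
  delegate_to: localhost

- name: Upload the compiled document over WebDAV
  ansible.builtin.uri:
    url: https://media.eom.dev/articles/article.html
    method: PUT
    src: article.html
    url_username: "{{ media_user }}"
    url_password: "{{ media_password }}"
    status_code: [201, 204]
  delegate_to: localhost
\end{verbatim}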
\section{Conclusion}
In this version of the EOM network infrastructure, the automated deployment of Kubernetes control-plane, worker, and
administrator nodes through Ansible has been achieved. Additionally, a suite of services providing file sharing, git
repository hosting, and content publishing has been developed for deployment on top of the aforementioned cluster.
The result is a durable and secure, though currently minimalistic, platform for software development and research
publication. As the release version implies, many features remain to be added, both in the next minor version and in
anticipation of the first major version of the project. The next phase of this project will focus on fixing the
Nvidia drivers and deploying services on the network rather than modifying the cluster itself. Development will, as a
result, take place primarily in the \textit{ansible-role-nvidia-driver} and \textit{ansible-role-eom-services}
repositories. A database and an API would be particularly useful additions to the latter. Once the digital platform
is more mature, consideration may turn towards adapting the cluster to run on virtual machines, which would aid in
addressing the problems with TLS issuers.

\printbibliography

\end{document}
diff --git a/docs/draft_2.tex b/docs/draft_2.tex
new file mode 100644
index 0000000..9ed867d
--- /dev/null
+++ b/docs/draft_2.tex
@@ -0,0 +1,35 @@
\documentclass{article}

\title{EOM Software Infrastructure\\v0.1.0}
\author{Eric O'Neill Meehan}
\date{\today}

\begin{document}

\maketitle

\begin{abstract}
\end{abstract}

\section{Introduction}
Release v0.1.0 of \textit{eom-software-infrastructure} and its components constitutes the minimum-viable hardware and
software infrastructure for \textit{eom.dev}, a self-hosted media publication platform. For this release, three
machines were configured into a cluster hosting a suite of web services. The purpose of this paper is to add context
to the release by documenting design decisions, highlighting problematic areas, and enumerating potential
improvements for future iterations; further, this paper will describe the publication pipeline that has been deployed
to the network.

\section{Hardware Infrastructure}

\section{Software Infrastructure}
\subsection{Nvidia Drivers}

\section{Network Services}
\subsection{Content Publication}

\section{Conclusion}
Release v0.1.0 of \textit{eom-software-infrastructure} and its components was used to configure two PowerEdge servers
and a Latitude tablet into a Kubernetes cluster, development environment, and publication platform that hosts its own
source code and this supplementary paper.

\end{document}
diff --git a/host_vars/inspiron-3670.yaml b/host_vars/inspiron-3670.yaml
new file mode 100644
index 0000000..16455de
--- /dev/null
+++ b/host_vars/inspiron-3670.yaml
@@ -0,0 +1,16 @@
# Host vars for inspiron-3670
nvidia_driver_needed: true
packages:
  - curl
  - davfs2
  - gimp
  - git
  - gphoto2
  - latexml
  - neovim
  - passwordsafe
  - texlive-full
  - thunderbird
  - tmux
  - torbrowser-launcher
  - w3m
diff --git a/host_vars/latitude-7230.yaml b/host_vars/latitude-7230.yaml
new file mode 100644
index 0000000..f5de822
--- /dev/null
+++ b/host_vars/latitude-7230.yaml
@@ -0,0 +1,3 @@
# Host vars for latitude-7230

ansible_connection: local
diff --git a/inspiron.yaml b/inspiron.yaml
new file mode 100644
index 0000000..0647da4
--- /dev/null
+++ b/inspiron.yaml
@@ -0,0 +1,7 @@
---
# Playbook for inspiron-3670
- name: Initialize system
  hosts: inspiron-3670
  become: true
  roles:
    - role: ericomeehan.nvidia_driver
diff --git a/inventories/attlocal.yml b/inventories/attlocal.yml
index 84f1337..5acd805 100644
--- a/inventories/attlocal.yml
+++ b/inventories/attlocal.yml
@@ -3,17 +3,15 @@ all:
   children:
     workstations:
       hosts:
-        mobile-command:
-          ansible-host: 192.168.1.123
-    clusters:
-      children:
-        alpha:
-          children:
-            control_plane:
-              hosts:
-                alpha-control-plane:
-                  ansible-host: 192.168.1.137
-            workers:
-              hosts:
-                alpha-worker-0:
-                  ansible-host: 192.168.1.138
+        latitude-7230:
+          ansible_host: 192.168.1.123
+        inspiron-3670:
+          ansible_host: 192.168.1.210
+        imac:
+          ansible_host: 192.168.1.139
+    servers:
+      hosts:
+        poweredge-r350:
+          ansible_host: 192.168.1.137
+        poweredge-t640:
+          ansible_host: 192.168.1.138
diff --git a/mobile-command.yaml b/workstations.yaml
similarity index 85%
rename from mobile-command.yaml
rename to workstations.yaml
index d75bde9..1398128 100644
--- a/mobile-command.yaml
+++ b/workstations.yaml
@@ -1,8 +1,7 @@
 ---
-# Playbook for mobile-command
+# Playbook for workstations
 - name: Initialize system
-  hosts: mobile-command
-  connection: local
+  hosts: workstations
   become: true
   roles:
     - role: ericomeehan.debian
@@ -28,8 +27,11 @@
       - git
       - gphoto2
       - gpsd
+      - latexml
       - neovim
       - passwordsafe
+      - python3-venv
+      - texlive-full
       - thunderbird
       - tmux
       - torbrowser-launcher