Inside Higher Ed · 1 day ago

InfraOps Reliability Administrator

Tallahassee, FL

Full-time

Hybrid

Entry, Mid Level

2+ years exp

FSU’s Department of Information Technology Services is seeking an InfraOps Reliability Administrator. The role involves designing, building, and managing infrastructure and servers to support IT teams and users, with a strong focus on automation, reliability, and security in a hybrid cloud environment.

Digital Media, Education, Higher Education, Journalism, Recruiting

Responsibilities

Design, build, automate, and optimize infrastructure using modern tools and site reliability engineering practices

Manage primarily Windows servers in a hybrid cloud environment, with a focus on reliability, observability, security, and continuous improvement

Collaborate across teams and leverage automation, scripting, data-informed decision-making, and self-directed professional development to deliver secure, scalable, and customer-focused solutions

Use tools such as Terraform, Azure DevOps, Visual Studio Code, and scripting languages like PowerShell and Bash to manage infrastructure as code (IaC) and configuration as code (CaC), ensuring consistency, repeatability, and auditability of systems

Use observability solutions, such as Elastic, to monitor deployments and support data-informed decisions and rapid experiments that drive continuous improvement

Work with CI/CD pipelines to automate deployment, validation, and testing processes, ensuring systems are secure by design, mitigate vulnerabilities, and are compliant with security policies and standards

Follow secure coding practices, adhere to coding standards, and leverage version control, automated testing, and test-driven development to produce high-quality, secure, and maintainable code

Use AI-assisted tools to accelerate development, validation, and troubleshooting

Participate in pair programming sessions as appropriate to write code and resolve deployment issues

Deploy and manage Windows and Linux servers across a hybrid environment that includes Microsoft Azure and over a dozen geographically dispersed on-premises locations

Ensure that all systems are secure by design, follow zero trust principles, and are scalable, observable, and aligned with business needs

Provision infrastructure with reliability, maintainability, and consistency in mind, and implement observability prior to production to support proactive monitoring and data-informed decisions

Collaborate with cross-functional teams and stakeholders throughout the infrastructure lifecycle to ensure solutions align with customer needs; prioritize high-value work, assess feasibility, and conduct security reviews of new systems and applications; deliver exceptional customer service and maintain clear communication to support successful outcomes

Design and implement solutions that make work easier, reduce manual effort, improve system reliability, and streamline operations across provisioning, configuration, monitoring, and remediation

Use AI, scripting, workflow automation, or robotic process automation (RPA) tools to reduce operational overhead and accelerate delivery

Use observability tools to monitor automation performance, ensure reliability, and identify data-informed opportunities for continuous improvement

Collaborate with peers and stakeholders to prioritize high-value automation opportunities and ensure that solutions are effective, secure, and aligned with business needs

Manage and troubleshoot enterprise-grade network infrastructure, including wireless access points, switches, routers, load balancers, and next-generation firewalls

Diagnose and resolve network issues using packet captures, OS command outputs, diagnostic consoles, logs, or other tools

Leverage network observability tools to make data-informed decisions and identify opportunities for improvement

Implement and maintain security measures to protect data, systems, and network availability

Collaborate with network and security teams to validate new systems and configurations, expand observability, reduce exploitable vulnerabilities, implement security controls, and enhance system resilience and usability for customers

Create and maintain clear, concise documentation for knowledge sharing, process repeatability, and operational continuity

Develop system diagrams, deployment guides, and standard operating procedures (SOPs) that support usability, compliance, and reliability

Continuously refine documentation and processes as systems evolve, incorporating feedback and lessons learned

Ensure all procedures align with FSU ITS Security Policies and Standards

Participate in peer reviews to validate documentation for accuracy, clarity, and usability

Respond to system alerts, outages, and support requests in accordance with established incident management procedures, collaborating with peers and stakeholders to ensure rapid resolution

Use observability tools to support rapid diagnosis and resolution, and create new monitoring as needed to improve visibility

Participate in post-incident reviews, highlighting key data points and observability insights to identify root causes and opportunities for system or process improvements

Implement improvements to prevent the recurrence of issues and to enhance system reliability

Participate in an on-call rotation, typically one week per month, which includes after-hours support for deployments, changes, or incidents, including on holidays and weekends

Actively work to reduce the need for after-hours assistance by leveraging automated deployment solutions, improving system reliability, and lowering the risk and complexity of changes

Assist with IT security investigations as needed

Ensure incident response processes align with the expectations of IT management, technical teams, and customers

Complete both assigned and self-directed professional development to stay current with evolving technologies, tools, and practices

Explore technical subjects that interest you, even beyond current projects

Use provided learning platforms, such as LinkedIn Learning

Participate in the ITS Professional Development Bonus Plan by completing manager-approved certifications

Pursue relevant training, certifications, and conferences aligned with team goals, subject to approval

Research and validate emerging tools, including AI, automation, observability, and other innovations, to assess their value for our organization

Apply a mindset of rapid experimentation using data to guide decisions, improvements, and the next experiment

Participation in knowledge-sharing sessions, communities of practice, and collaborative learning opportunities is encouraged

Qualifications

Infrastructure as Code, Scripting Automation, CI/CD Pipelines, Windows Server Management, Network Administration, Terraform, Azure DevOps, Observability Tools, Cloud Environments, Container Orchestration, Technical Documentation, Communication Skills, Continuous Learning, Problem Solving

Required

Bachelor's degree in Computer Science, MIS, or other appropriate degree and two years of experience, or a high school diploma or equivalent and six years of experience. (Note: a combination of appropriate post-high-school education and experience equal to six years is also acceptable.)

Preferred

Proven ability to learn new tools and technologies quickly, with a track record of self-directed learning and adaptability in fast-paced environments

Demonstrated commitment to continuous learning and professional development

Proficient in scripting for infrastructure automation using PowerShell, with the ability to write, debug, and maintain scripts independently or with tools like GitHub Copilot; familiarity with Python or Bash is a plus

Experience using infrastructure and configuration as code tools such as Terraform, Ansible, PowerShell, or similar, with version control practices using Git, and integrated development environments like Visual Studio Code

Experience creating and troubleshooting CI/CD pipelines using tools such as Azure DevOps, GitHub Actions, or GitLab to automate infrastructure deployment and configuration

Experience provisioning and managing infrastructure in cloud environments such as Azure, AWS, or Google Cloud, with an understanding of repeatable deployment processes, and troubleshooting network connectivity with next-generation firewalls

Experience deploying containers and familiarity with container orchestration technologies such as Kubernetes

Proficient using observability tools such as Elastic, Dynatrace, Prometheus, Grafana, Splunk, Datadog, or others, to ingest new types of data, build dashboards and alerts, and derive insights for performance tuning and incident response

Experience improving infrastructure design, automation, or troubleshooting by testing ideas, learning from results, and making thoughtful adjustments over time

Experience supporting Windows and Linux systems in an Active Directory domain, including deployment, configuration, and troubleshooting, as well as managing virtual infrastructure using platforms such as Hyper-V or VMware

Experience leveraging AI tools to accelerate task completion and improve operational efficiency

Demonstrated ability to write and troubleshoot firewall rules and quickly diagnose issues across firewalls, switches, and wireless access points from vendors such as Palo Alto, Juniper, Aruba, Arista, Fortinet, Extreme, Brocade, Cisco, or others, with a focus on identifying root causes across network, OS, and application layers

Strong understanding of secure-by-design and zero trust principles, with experience applying secure configurations and patching strategies in operational environments

Demonstrated experience in infrastructure projects by planning and executing technical tasks such as system deployments, launching new remote locations, or automating business processes. This includes prioritizing high-value work, ensuring long-term maintainability through documentation and repeatable processes, leveraging automation where appropriate, and working closely with cross-functional teams to drive project success

Strong written and verbal communication skills, including the ability to document processes, contribute in team discussions, and explain technical concepts to various audiences

Proficient in creating technical diagrams to communicate infrastructure design or operational workflows

Benefits

FSU offers a robust Total Rewards package.

Visit our website to learn more about our Compensation, Benefits, Wellness, Recognition, and Employee Development programs.

Use our interactive tool to calculate Total Compensation options based on potential salary, benefits and retirement contributions, earned leave, and other employment-related perks.

Approved training resources will be paid for by the organization.