element: Walmart’s Machine Learning Platform

Bagavath Subramaniam
6 min read · Oct 4, 2022

Authors: Hema Rajesh, Bagavath Subramaniam

element Logo

Over the last few years, we have witnessed significant growth in the use of AI/ML, from sourcing to delivery and everything in between. To truly realize the potential of AI/ML at scale, we need a machine learning platform that standardizes the overall model training and deployment process.

Vision

Our vision has been to “provide competitive advantage by developing transformational ML platform capabilities that simplify the adoption of AI/ML at scale.”

In line with this vision, ‘element’ was developed as an end-to-end platform that lets data science teams work across their entire project flow, as shown below.

Figure 1: Data Science Lifecycle

Strategy

The platform capabilities evolved following a top-down strategic alignment with our vision, business and architecture objectives as shown below.

element Strategy Pyramid

Execution

Best-of-Breed

Our strategy is to reuse existing enterprise services, leverage open source, encourage inner source, and build strong industry partnerships to enable productivity, speed, and innovation at scale. element was built from the ground up to address requirements at several levels: we built new capabilities where no solution existed and reused existing services, frameworks, and products wherever possible, as highlighted in the architecture diagram below.

element architecture

Some of the key architecture components that were custom-built for our environment and use cases are listed below.

  • Resource Management: The platform takes care of resource provisioning and optimization across multiple clouds, abstracting the underlying cloud complexities away from data scientists.
  • Cost Attribution: Because workloads run on a shared multi-tenant cluster for higher resource utilization, a custom cost attribution framework has been developed.
  • User Management & Collaboration: The platform provides AD-controlled shared workspaces for teams to collaborate on model development and share ML artifacts.
  • Notebooks: Open-source notebooks are customized to improve security and developer productivity.
  • Realtime deployment: The platform offers a custom-built MLOps framework, based on Walmart's standard CI/CD tools, that lets users deploy and monitor models seamlessly across multiple clouds and regions.
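To make the cost attribution idea concrete, here is a minimal sketch of one common approach: splitting a shared cluster's bill across teams in proportion to the resources each team consumed. This is an illustration under stated assumptions, not element's actual framework; the `UsageRecord` type, the `attribute_cost` function, and the separate CPU and GPU cost pools are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical metering record for one workload on the shared cluster."""
    team: str
    cpu_core_hours: float
    gpu_hours: float


def attribute_cost(
    records: Iterable[UsageRecord],
    cpu_pool_cost: float,
    gpu_pool_cost: float,
) -> Dict[str, float]:
    """Split each pool's bill across teams in proportion to their usage."""
    cpu_by_team: Dict[str, float] = defaultdict(float)
    gpu_by_team: Dict[str, float] = defaultdict(float)
    for record in records:
        cpu_by_team[record.team] += record.cpu_core_hours
        gpu_by_team[record.team] += record.gpu_hours

    total_cpu = sum(cpu_by_team.values())
    total_gpu = sum(gpu_by_team.values())

    costs: Dict[str, float] = {}
    for team in cpu_by_team:
        # A pool with no recorded usage contributes nothing to any team.
        cpu_share = cpu_by_team[team] / total_cpu if total_cpu else 0.0
        gpu_share = gpu_by_team[team] / total_gpu if total_gpu else 0.0
        costs[team] = cpu_share * cpu_pool_cost + gpu_share * gpu_pool_cost
    return costs


if __name__ == "__main__":
    usage = [
        UsageRecord("pricing", cpu_core_hours=30, gpu_hours=0),
        UsageRecord("search", cpu_core_hours=10, gpu_hours=5),
    ]
    print(attribute_cost(usage, cpu_pool_cost=400.0, gpu_pool_cost=100.0))
```

A proportional split like this charges idle capacity to the teams that used the cluster; a real framework would also have to decide how to treat reserved but unused resources.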

The platform has been developed on a strong foundation of open-source tools and frameworks. We also enhanced some of these frameworks for our specific needs: adding a custom authentication and authorization layer for Jupyter notebooks, implementing multi-tenancy support for MLflow, adding custom Jupyter/Theia plugins that report runtime resource usage in notebooks and automate environment setup from a definition file, and developing custom Airflow operators for batch execution of notebooks. The platform has also been opened up for contributions from other internal teams (inner source) so that its capabilities can be extended cost-efficiently.

element open source stack

As we embraced cloud technologies and experimented with multiple cloud platforms, the platform had to evolve fast with the changing landscape. Implementing the company’s “multi-cloud” strategy, ‘element’ focussed on providing platform capabilities that could seamlessly integrate and run cloud-agnostic ML workloads across multiple clouds & regions.

AI/ML is a space where breakthrough innovations emerge frequently. To keep up with this pace, we have also partnered with industry leaders like Nvidia, Microsoft, Google, and IBM to augment the platform with transformational AI capabilities for data scientists. We adopted cloud-native solutions, libraries, and hardware capabilities wherever they were appropriate.

This is an ongoing journey, and the platform will continue to move with advancements across public cloud platforms, open source, and third-party software providers. We will plan the adoption of best-of-breed options with long-term cost, competitive advantage, and operational workload implications in mind.

Speed and Scale

The target personas for the platform include Data Engineers, Data Scientists, and Machine Learning Engineers, all with varying degrees of Data Engineering, ML, and DevOps skills. ‘element’ provides easy, collaborative development for Data Engineers and Data Scientists and a seamless deployment framework for Machine Learning Engineers, while integrating well with other shared services within our ecosystem.

The platform provides 20+ data connectors that can seamlessly pull data from disparate source systems and bring it into a collaborative (cloud) IDE attached to a shared workspace where teams can work together. The platform also offers ready-to-use data science environments with pre-packaged libraries and dependencies for custom development. In addition, it provides pre-built no-code AI Services that enable data engineers, software developers, and citizen data scientists to leverage the power of AI/ML without data science skills or expertise.

With greater focus on increasing developer productivity, element MLOps automates the different steps in the model training and model deployment processes enabling data scientists to accelerate the pace of ML adoption in the organization. It has helped reduce the time taken to deploy a model to production from weeks to hours with the following key benefits:

  • Automated model deployment using enterprise standard CI/CD tools
  • Multi-cloud and multi-region deployments
  • Multiple deployment environments (Dev/Stage/Prod) and automatic promotion
  • Optimized costs through auto-scaling compute
  • Multiple Deployment Strategies to reduce deployment risks
  • Observability with integration to centralized logging and monitoring services
  • Disaster Recovery (DR) for deployments
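As an illustration of what automatic promotion between Dev, Stage, and Prod can look like, the following is a minimal sketch of a health-gated promotion check. It is not element's implementation: the `HealthReport` fields, the threshold values, and the function names are hypothetical, and a real pipeline would run such a gate from its CI/CD tooling.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed promotion order; the environment names mirror the list above.
ENVIRONMENTS = ("dev", "stage", "prod")


@dataclass(frozen=True)
class HealthReport:
    """Hypothetical health summary for a deployed model."""
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float    # 95th percentile latency in milliseconds


def is_healthy(
    report: HealthReport,
    max_error_rate: float = 0.01,        # illustrative threshold
    max_p95_latency_ms: float = 200.0,   # illustrative threshold
) -> bool:
    """A deployment passes the gate only if every metric is within bounds."""
    return (
        report.error_rate <= max_error_rate
        and report.p95_latency_ms <= max_p95_latency_ms
    )


def next_environment(current: str, report: HealthReport) -> Optional[str]:
    """Return the environment to promote to, or None if no promotion applies."""
    if current not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {current!r}")
    if not is_healthy(report):
        return None  # hold the model where it is
    position = ENVIRONMENTS.index(current)
    if position + 1 == len(ENVIRONMENTS):
        return None  # already in the last environment
    return ENVIRONMENTS[position + 1]


if __name__ == "__main__":
    print(next_environment("dev", HealthReport(error_rate=0.001, p95_latency_ms=120.0)))
```

Keeping the gate a pure function of the current environment and a health report makes it easy to test independently of any deployment tooling.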

Bend-the-Cost-Curve

As more use cases are onboarded to the platform, total cloud spend has been growing rapidly as well. Every Day Low Cost (EDLC) is a core operating principle for us, so the platform aims to apply effective cost-reduction strategies to the workloads running on it without any user intervention. The platform has helped keep costs under control through the following capabilities:

  • Shared infrastructure for multi-tenant workloads, which increases the platform’s cloud infrastructure resource utilization (up to 80%)
  • On-demand resource provisioning on the cloud
  • Cloud auto-scaling (upscale/downscale) for all workloads
  • Smart choice of cloud VM (Virtual Machines) flavours based on user-workload demands
  • Auto clean up of idle resources
  • Resource recommendations and utilization insights that call for user action
  • Optimized operational cost through low-cost/pre-emptible resources wherever possible
  • Use of other enterprise services to reduce operational footprint and thereby cost
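The "smart choice of VM flavours" item above can be sketched as a simple selection rule: among the flavours that satisfy a workload's resource request, pick the cheapest. This is only an illustration of the idea; the `VmFlavour` type, the catalog entries, and the prices below are made up and do not describe element's logic or any cloud provider's actual offerings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VmFlavour:
    """Hypothetical catalog entry for a VM type."""
    name: str
    cpus: int
    memory_gb: int
    gpus: int
    hourly_cost: float


def pick_flavour(
    catalog: Sequence[VmFlavour],
    cpus: int,
    memory_gb: int,
    gpus: int = 0,
) -> Optional[VmFlavour]:
    """Return the cheapest flavour that meets the request, or None if none fits."""
    candidates = [
        flavour
        for flavour in catalog
        if flavour.cpus >= cpus
        and flavour.memory_gb >= memory_gb
        and flavour.gpus >= gpus
    ]
    return min(candidates, key=lambda flavour: flavour.hourly_cost, default=None)


if __name__ == "__main__":
    # Illustrative catalog; names and prices are invented for this example.
    catalog = [
        VmFlavour("small", cpus=4, memory_gb=16, gpus=0, hourly_cost=0.20),
        VmFlavour("large", cpus=16, memory_gb=64, gpus=0, hourly_cost=0.80),
        VmFlavour("gpu-1", cpus=8, memory_gb=32, gpus=1, hourly_cost=1.50),
    ]
    print(pick_flavour(catalog, cpus=6, memory_gb=24))
```

A production scheduler would likely weigh more than price, for example availability of pre-emptible capacity or bin-packing efficiency on shared nodes.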

Governance

As we continue to innovate and scale up the use of AI and ML, a strong governance framework is needed to ensure that all models meet technical requirements, are legally compliant, and do not present any ethical concerns. Our governance framework comprises two key elements:

  1. Policies and Process definitions: These are the guiding principles for the ethical and responsible use of technology and data, as defined by leaders and experienced practitioners from the Data Science Governance Council.
  2. Technology and enablers: These are the tools and automation frameworks that enforce the Governance policies and make compliance with them easier; they are operationalized through the element platform.

We have built some of the Governance capabilities listed below and are in the process of building the rest:

  • Access Control: Authenticated and Authorized access to artifacts (datasets, models, notebooks)
  • Data Security: Encryption of data at rest and in motion
  • Accountability: Record of ownership of artifacts
  • Auditability: Bring visibility to all stakeholders throughout the life cycle of models
  • Health Monitors: Model Performance and Health Monitoring dashboards
  • Continuous Improvement: Data Drift/Skew monitoring to identify model retraining needs
  • Fairness Assessment: Fairness/Bias Monitoring for models which are sensitive in nature
  • Transparency: Provide explainable outputs to improve trust for business stakeholders
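One widely used way to quantify the data drift mentioned under Continuous Improvement is the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in production. The sketch below is a generic, dependency-free illustration and not a description of how element measures drift; the function name, the equal-width binning, and the `eps` smoothing constant are assumptions made for the example.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a baseline sample and a recent sample of one feature.

    Bins are equal-width over the baseline's range; values outside that
    range are counted in the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    low, high = min(expected), max(expected)
    if high == low:
        raise ValueError("baseline sample has no spread to bin")
    width = (high - low) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            index = min(max(index, 0), bins - 1)  # clamp into the edge bins
            counts[index] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(count / len(values), eps) for count in counts]

    baseline = proportions(expected)
    recent = proportions(actual)
    return sum(
        (r - b) * math.log(r / b) for b, r in zip(baseline, recent)
    )


if __name__ == "__main__":
    training = [float(v) for v in range(100)]
    production = [v + 50.0 for v in training]  # shifted distribution
    print(population_stability_index(training, production))
```

A PSI of zero means the two binned distributions are identical, and larger values indicate stronger drift. The threshold at which a score should trigger retraining is a policy choice that depends on the model and feature.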

Conclusion

The element platform has matured into a rich set of capabilities that align closely with both our Business and Architecture objectives. It has 400+ daily active users and 1000+ monthly unique users from 250+ teams across the organization. It has evolved continuously, with new features added every month that bring in innovations from the open-source world and our partners. It started on a single Kubernetes cluster on our on-premises cloud infrastructure with a handful of services, connectors, and CPU compute. Now it is deployed across multiple clouds and regions with around two dozen services, spawning workloads on thousands of CPU cores and hundreds of GPUs. We have learnt a great deal over the years about the challenges of running a platform for a diverse set of users, clouds, and use cases at our scale. We will write about specific challenges and how we addressed them in future blog posts.
