Design and Implementation of a Deep Learning Platform for a Two-Level Multi-Center Architecture

Abstract: Deep learning work in large enterprises suffers from scattered management and a large amount of redundant construction. To support whole-process management of large-scale deep learning and efficient reuse of model results, this paper proposes a deep learning platform set against the two-level, multi-center deployment architecture of State Grid Corporation of China. The system distributes the management of training, inference, data, and models across different centers, which cooperate to close the deep learning loop. A private cloud built on Kubernetes supports the parallel computation of large batches of deep learning applications. The front-end interface uses operator-based workflow orchestration to provide visual modeling and functional extensibility. Experimental results show that the system supports parallel execution of multiple deep learning tasks and that the additional performance overhead is acceptable.
