
Sunday, 10 May 2026

How to Make Kou Rou (Braised Pork Belly)

Anti-gravity craft, element 115, a life erased: the story Bob Lazar has told for 36 years may be true!

Air India Flight 171 crash | down 32 seconds after takeoff

- Better not to fly on Boeing aircraft anymore; flying Airbus is much safer.

Related post: https://briteming.blogspot.com/2026/05/blog-post_58.html

timednews.com

Suzhou, Don't Cry for Me

 

deep-learning-project-template

PyTorch Lightning code guidelines for conferences.

Use this seed to start new deep learning / ML projects.

  • Built-in setup.py
  • Built-in requirements
  • Examples with MNIST
  • Badges
  • Bibtex

Goals

The goal of this seed is to give ML paper code a common structure so that work can easily be extended and replicated.

DELETE EVERYTHING ABOVE FOR YOUR PROJECT


Your Project Name


Description

What it does

How to run

First, install dependencies

# clone project   
git clone https://github.com/YourGithubName/deep-learning-project-template

# install project   
cd deep-learning-project-template 
pip install -e .   
pip install -r requirements.txt
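If you prefer an isolated environment, here is a minimal sketch using venv; this is optional and not part of the template itself:

# optional: create and activate a clean virtual environment first
python3 -m venv .venv
source .venv/bin/activate

# then install as above
pip install -e .
pip install -r requirements.txt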

Next, navigate to any file and run it.

# module folder
cd project

# run module (example: mnist as your main contribution)   
python lit_classifier_main.py 
from  https://github.com/Lightning-AI/deep-learning-project-template

ColossalAI

 



Colossal-AI: Making large AI models cheaper, faster, and more accessible


| English | 中文 |

Instantly Run Colossal-AI on Enterprise-Grade GPUs

Skip the setup. Access a powerful, pre-configured Colossal-AI environment on HPC-AI Cloud.

Train your models and scale your AI workload in one click!

  • NVIDIA Blackwell B200s: Experience the next generation of AI performance (See Benchmarks). Now available on cloud from $2.47/hr.
  • Cost-Effective H200 Cluster: Get premier performance with on-demand rental from just $1.99/hr.

Get Started Now & Claim Your Free Credits →

Instant Access to Top Open Models at Half the Cost

Skip the hassle. Access powerful, long-context LLMs seamlessly through HPC-AI Model APIs.

Build your AI agents, chatbots, and RAG applications with HPC-AI Model APIs!

  • Latest & Greatest Models: Experience state-of-the-art performance with Kimi 2.5, MiniMax 2.5, and GLM 5.1. Perfect for massive 2M+ context windows and complex coding tasks.

  • Unbeatable Pricing: Stop overpaying for API endpoints. Get premier inference speed at up to 50% lower cost than OpenRouter.

Get Started Now & Claim Your $4 Free Credits →

Colossal-AI Benchmark

To see how these performance gains translate to real-world applications, we conducted a large language model training benchmark using Colossal-AI on Llama-like models. The tests were run on both 8-card and 16-card configurations for 7B and 70B models, respectively.

| GPU | GPUs | Model Size | Parallelism | Batch Size per DP | Seqlen | Throughput | TFLOPS/GPU | Peak Mem (MiB) |
| H200 | 8 | 7B | zero2 (dp8) | 36 | 4096 | 17.13 samp/s | 534.18 | 119040.02 |
| H200 | 16 | 70B | zero2 | 48 | 4096 | 3.27 samp/s | 469.10 | 150032.23 |
| B200 | 8 | 7B | zero1 (dp2) + tp2 + pp4 | 128 | 4096 | 25.83 samp/s | 805.69 | 100119.77 |
| B200 | 16 | 70B | zero1 (dp2) + tp2 + pp4 | 128 | 4096 | 5.66 samp/s | 811.79 | 100072.02 |

The results from the Colossal-AI benchmark provide the most practical insight. For the 7B model on 8 cards, the B200 achieved a 50% higher throughput and a significant increase in TFLOPS per GPU. For the 70B model on 16 cards, the B200 again demonstrated a clear advantage, with over 70% higher throughput and TFLOPS per GPU. These numbers show that the B200's performance gains translate directly to faster training times for large-scale models.
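As a quick sanity check, the quoted speedups can be recomputed from the throughput column of the table above:

# 7B on 8 cards: 25.83 / 17.13 ≈ 1.51 (about 50% higher)
# 70B on 16 cards: 5.66 / 3.27 ≈ 1.73 (just over 70% higher)
python3 -c "print(25.83/17.13, 5.66/3.27)"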

Latest News

Table of Contents

Why Colossal-AI

Prof. James Demmel (UC Berkeley): Colossal-AI makes training AI models efficient, easy, and scalable.

(back to top)

Features

Colossal-AI provides a collection of parallel components for you. We aim to let you write distributed deep learning models just as you would write a model on your laptop. We provide user-friendly tools to kickstart distributed training and inference in a few lines, as sketched below.
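As a rough sketch of what launching looks like in practice (train.py is a placeholder for your own training script, and the colossalai run launcher is assumed to be on your PATH after installation):

# launch a training script on 8 local GPUs with the bundled CLI launcher
colossalai run --nproc_per_node 8 train.py

# the same script can usually also be launched with the stock PyTorch launcher
torchrun --nproc_per_node 8 train.py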

(back to top)

Colossal-AI in the Real World

Open-Sora

Open-Sora: Revealing Complete Model Parameters, Training Details, and Everything for Sora-like Video Generation Models [code] [blog] [Model weights] [Demo] [GPU Cloud Playground] [OpenSora Image]

(back to top)

Colossal-LLaMA-2

[GPU Cloud Playground] [LLaMA3 Image]

| Model | Backbone | Tokens Consumed | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (5-shot) | GAOKAO (0-shot) | CEval (5-shot) |
| Baichuan-7B | - | 1.2T | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
| Baichuan-13B-Base | - | 1.4T | 50.51 (51.60) | 55.73 (55.30) | 47.20 | 51.41 | 53.60 |
| Baichuan2-7B-Base | - | 2.6T | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
| Baichuan2-13B-Base | - | 2.6T | 54.84 (59.17) | 62.62 (61.97) | 52.08 | 58.25 | 58.10 |
| ChatGLM-6B | - | 1.0T | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
| ChatGLM2-6B | - | 1.4T | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
| InternLM-7B | - | 1.6T | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
| Qwen-7B | - | 2.2T | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
| Llama-2-7B | - | 2.0T | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | 37.43 | 29.92 | 32.00 | 27.57 | - |
| wenge-research/yayi-7b-llama2 | Llama-2-7B | - | 38.56 | 31.52 | 30.99 | 25.95 | - |
| ziqingyang/chinese-llama-2-7b | Llama-2-7B | - | 33.86 | 34.69 | 34.52 | 25.18 | 34.2 |
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | 43.73 | 42.04 | 37.64 | 30.61 | - |
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | 48.41 | 38.31 | 38.45 | 27.72 | - |
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | 49.96 | 41.10 | 39.83 | 33.00 | - |
| IDEA-CCNL/Ziya-LLaMA-13B-v1.1 | Llama-13B | 0.11T | 50.25 | 40.99 | 40.04 | 30.54 | - |
| Colossal-LLaMA-2-7b-base | Llama-2-7B | 0.0085T | 53.06 | 49.89 | 51.48 | 58.82 | 50.2 |
| Colossal-LLaMA-2-13b-base | Llama-2-13B | 0.025T | 56.42 | 61.80 | 54.69 | 69.53 | 60.3 |

ColossalChat

ColossalChat: An open-source solution for cloning ChatGPT with a complete RLHF pipeline. [code] [blog] [demo] [tutorial]

  • Up to 10 times faster for RLHF PPO Stage3 Training

  • Up to 7.73 times faster for single server training and 1.42 times faster for single-GPU inference

  • Up to 10.3x growth in model capacity on one GPU
  • A mini demo training process requires only 1.62GB of GPU memory (any consumer-grade GPU)

  • Increase the capacity of the fine-tuning model by up to 3.7 times on a single GPU
  • Keep at a sufficiently high running speed

(back to top)

AIGC

Acceleration of AIGC (AI-Generated Content) models such as Stable Diffusion v1 and Stable Diffusion v2.

  • Training: Reduce Stable Diffusion memory consumption by up to 5.6x and hardware cost by up to 46x (from A100 to RTX3060).

  • Inference: Reduce inference GPU memory consumption by 2.5x.

(back to top)

Biomedicine

Acceleration of AlphaFold Protein Structure

  • FastFold: accelerated training and inference on GPU clusters, faster data processing, and inference on sequences containing more than 10,000 residues.

  • xTrimoMultimer: accelerating structure prediction of protein monomers and multimers by 11x.

(back to top)

Parallel Training Demo

LLaMA3

LLaMA2

  • 70 billion parameter LLaMA2 model training accelerated by 195% [code] [blog]

LLaMA1

  • 65-billion-parameter large model pretraining accelerated by 38% [code] [blog]

MoE

  • Enhanced MoE parallelism: open-source MoE model training can be 9x more efficient [code] [blog]

GPT-3

  • Save 50% of GPU resources with 10.7% acceleration

GPT-2

  • 11x lower GPU memory consumption, and superlinear scaling efficiency with Tensor Parallelism

  • 24x larger model size on the same hardware
  • over 3x acceleration

BERT

  • 2x faster training, or 50% longer sequence length

PaLM

OPT

  • Open Pretrained Transformer (OPT) is a 175-billion-parameter AI language model released by Meta. Its publicly available pretrained weights encourage AI programmers to build various downstream tasks and application deployments.
  • 45% speedup when fine-tuning OPT at low cost, in just a few lines of code. [Example] [Online Serving]

Please visit our documentation and examples for more details.

ViT

  • 14x larger batch size and 5x faster training with tensor parallelism = 64

Recommendation System Models

  • Cached Embedding, utilize software cache to train larger embedding tables with a smaller GPU memory budget.

(back to top)

Single GPU Training Demo

GPT-2

  • 20x larger model size on the same hardware

  • 120x larger model size on the same hardware (RTX 3080)

PaLM

  • 34x larger model size on the same hardware

(back to top)

Inference

Colossal-Inference

Grok-1

  • 314-billion-parameter Grok-1 inference accelerated by 3.8x: an easy-to-use Python + PyTorch + HuggingFace version for inference.

[code] [blog] [HuggingFace Grok-1 PyTorch model weights] [ModelScope Grok-1 PyTorch model weights]

SwiftInfer

  • SwiftInfer: inference performance improved by 46%; this open-source solution breaks the LLM length limit for multi-round conversations

(back to top)

Installation

Requirements:

If you encounter any problem with installation, you may want to raise an issue in this repository.

Install from PyPI

You can easily install Colossal-AI with the following command. By default, we do not build PyTorch extensions during installation.

pip install colossalai

Note: only Linux is supported for now.

However, if you want to build the PyTorch extensions during installation, you can set BUILD_EXT=1.

BUILD_EXT=1 pip install colossalai

Otherwise, CUDA kernels will be built during runtime when you actually need them.

We also release a nightly version to PyPI every week, which gives you access to unreleased features and bug fixes from the main branch. Install it via

pip install colossalai-nightly
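After any of these installs, a quick way to confirm which build is active (assuming the package exposes __version__, as most PyPI packages do):

python3 -c "import colossalai; print(colossalai.__version__)"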

Download From Source

The version of Colossal-AI will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problems. :)

git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# install colossalai
pip install .

By default, we do not compile CUDA/C++ kernels; ColossalAI will build them at runtime. If you want to install and enable CUDA kernel fusion (compulsory when using the fused optimizer):

BUILD_EXT=1 pip install .

Users on CUDA 10.2 can still build ColossalAI from source, but you need to manually download the cub library and copy it to the corresponding directory.

# clone the repository
git clone https://github.com/hpcaitech/ColossalAI.git
cd ColossalAI

# download the cub library
wget https://github.com/NVIDIA/cub/archive/refs/tags/1.8.0.zip
unzip 1.8.0.zip
cp -r cub-1.8.0/cub/ colossalai/kernel/cuda_native/csrc/kernels/include/

# install
BUILD_EXT=1 pip install .

(back to top)

Use Docker

Pull from DockerHub

You can directly pull the docker image from our DockerHub page. The image is automatically uploaded upon release.
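A minimal sketch of pulling an image; the repository name and tag below are assumptions, so check the DockerHub page for the actual published tags:

# image name assumed; verify on the project's DockerHub page
docker pull hpcaitech/colossalai:latest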

Build On Your Own

Run the following command to build a docker image from Dockerfile provided.

Building Colossal-AI from scratch requires GPU support; you need to use the NVIDIA Docker Runtime as the default when running docker build. More details can be found here. We recommend installing Colossal-AI from our project page directly.

cd ColossalAI
docker build -t colossalai ./docker

Run the following command to start the docker container in interactive mode.

docker run -ti --gpus all --rm --ipc=host colossalai bash

(back to top)

Community

Join the Colossal-AI community on Forum, Slack, and WeChat(微信) to share your suggestions, feedback, and questions with our engineering team.

Contributing

Following the successful examples of BLOOM and Stable Diffusion, all developers and partners with computing power, datasets, or models are welcome to join and build the Colossal-AI community, working toward the era of big AI models!

You may contact us or participate in the following ways:

  1. Leave a Star ⭐ to show your support. Thanks!
  2. Post an issue or submit a PR on GitHub, following the guidelines in Contributing.
  3. Send your official proposal by email to contact@hpcaitech.com.

Thanks so much to all of our amazing contributors!

(back to top)

CI/CD

We leverage the power of GitHub Actions to automate our development, release and deployment workflows. Please check out this documentation on how the automated workflows are operated.

from  https://github.com/hpcaitech/ColossalAI

 

How to Make Pork Trotters

 

Saturday, 9 May 2026

How to Make Eggplant

- This can be simplified: no need to add bird's-eye chilies or ground pepper.

US lawmakers press Trump to come clean; Israel's 90 nuclear warheads fully exposed

- The world may yet be doomed by the US and Israel, even though in the end they won't dare press the nuclear button. Trump, that lunatic, insisted on joining Israel's war against Iran, and in doing so buried American hegemony for good. Trump is not making America great again; he is making America die sooner.

Tuesday, 5 May 2026

Our Tomorrow Is Sweeter Than Honey (我们的明天比蜜甜)

 https://drive.google.com/file/d/1e4DtqD5O01q12lr2_xGXMe1CA-fDu6Yg/view

Monday, 4 May 2026

The Korean War Bought the CCP Comprehensive Aid from the Soviet Union

 

Below is a summary of the main elements of Soviet aid:

1. Core industry: the "156 Projects"

This was the centerpiece of Soviet aid. Although the number of projects actually completed was adjusted at various stages (about 150 in the end), it built the skeleton of China's modern industry.

Energy and raw materials: construction of the Fushun power station, the expansion of the Anshan Iron and Steel Company (Angang), and numerous coal mines and non-ferrous metal smelters.

Machine building: establishment of the Changchun First Automobile Works, the Luoyang Tractor Factory, the Harbin Electrical Machinery Works, and others, giving China the ability to produce automobiles, tanks, and heavy machinery for the first time.

Geographic distribution: most projects were placed in the northeast, central China, and the western regions, which in effect corrected old China's lopsided concentration of industry along the coast.

2. Military and defense technology

Against the backdrop of the Korean War, Soviet military aid played a decisive role.

Rearmament: the Soviet Union supplied large quantities of active-service equipment (such as MiG-15 fighters, tanks, and artillery), helping the PLA transform from "millet plus rifles" into a modern army.

Defense industrialization: Soviet aid built up military industries in aviation, missiles, tanks, and radio, and transferred a large number of weapons-manufacturing licenses.

Nuclear beginnings: in the early years the Soviet Union supplied an experimental reactor and some technical data for China's nuclear weapons program; although it later withdrew its experts, the initial foundation was laid by the Soviets.

3. Technology transfer and talent development

The Soviet Union adopted a hands-on teaching model, and the impact of this soft aid arguably exceeded that of the hardware.

Dispatched experts: more than 10,000 Soviet experts came to China over the years, working directly inside factories, schools, and government departments.

Training students abroad: China sent about 38,000 students and trainees to study in the Soviet Union.

Technical documentation: the Soviet Union provided tens of thousands of sets of machine blueprints, process specifications, and technical standards, helping China build a complete Soviet-style technical system.

4. Loans and financial support

Through long-term, low-interest loans, the Soviet Union solved the funding shortage of China's early industrialization.

Major loans: these included the US$300 million loan agreement signed in 1950, as well as later special-purpose loans for purchasing military equipment and industrial machinery.

Repayment: China repaid the debt mainly by exporting agricultural and sideline products and raw materials such as tungsten, tin, and rubber.

--------------

 

The Soviet Union's comprehensive aid to the CCP cannot be denied.

---------------

 

1. It was Khrushchev who aided China on a massive scale, not Stalin, the instigator of the Korean War. Stalin died on 5 March 1953. The peak of Soviet aid to China came during China's First Five-Year Plan (1953-1957), by which time Stalin was already dead and all of his policies were being repudiated by Khrushchev.

2. Stalin was the scourge that did China the greatest harm, bar none. That includes the Vladivostok massacre, on the same scale as the Nanjing Massacre, in which three hundred thousand compatriots died violent deaths. Stalin purged the Chinese population from the Chinese lands Russia had occupied, and, with the help of his junior partner Mao Zedong, a third of China's territory left China permanently, including Outer Mongolia.

3. During the Korean War, Stalin was still occupying China's Lüshun (Port Arthur) and refused to return it. Khrushchev later returned Lüshun of his own accord in May 1955 (two years after Stalin's death), and at the time Mao Zedong even insisted he didn't want it. By then the Korean War had been over (since 27 July 1953) for nearly two years.

 

Trump places his bet. Who will call?

 

Sunday, 3 May 2026

On the difference between vibe coding and agentic engineering

Karpathy recently spoke for over an hour at Sequoia's AI Ascent 2026. It was dense with information, so let me try to lay out the core points.

He said that starting last December he suddenly felt he "couldn't keep up." The reason is simple: the latest models almost never make mistakes when writing code. He kept letting the agent do a little more, and it got things right every time, so he gradually came to trust it completely and entered a state of pure vibe coding. He stressed that this inflection point is extremely steep: many people are still stuck on last year's impression of using ChatGPT to look things up, but post-December agentic workflows are something else entirely.

He then turned to the concept of "Software 3.0." Software 1.0 is humans writing code; 2.0 is training neural networks on data; in 3.0, programming becomes writing prompts and managing context windows. He gave a vivid example: he vibe-coded an app called MenuGen that photographs a restaurant menu and generates pictures of the dishes, only to discover later that if you simply hand the photo to Gemini with a one-line prompt, the model renders the dish images onto the original picture directly; the entire app never needed to exist. The lesson is that we can't treat AI as a mere "accelerator": it can do things that were simply impossible before.

On the difference between vibe coding and agentic engineering, he was very clear: vibe coding raises the floor, letting everyone write software; agentic engineering protects the ceiling, using agents to accelerate without sacrificing quality. The latter is a serious engineering discipline, because agents are inherently "spiky" entities with wildly uneven capability distributions, and you have to learn to orchestrate them.

He spent a fair amount of time explaining why model capability is so jagged. The core reason is verifiability plus lab attention: domains that can be verified in an RL environment (math, code) advance at breakneck speed, while other domains stall. He gave a tragicomic example: the strongest model can refactor a 100,000-line codebase, yet will tell you "the car wash is only 50 meters away, just walk there." This absurd inconsistency means you must stay in the loop and keep exercising judgment.

Finally, he discussed what value humans have left. At this stage agents are like interns: you own the aesthetics, taste, architecture, and top-level planning, while the agent fills in the details. He quoted a line he keeps turning over: "You can outsource your thinking, but you can't outsource your understanding." Understanding remains the bottleneck, because you have to know what to build, why it's worth building, and how to direct your agents. That's also why he works on his LLM knowledge-base project: he finds that re-projecting information from different perspectives helps him truly understand things, and that's something models still don't do well.

If the US Withdraws Its Troops from Germany, Germany Will Be Secretly Pleased

The Spark of Life

In the early years after Liberation, many young people answered the country's call and joined the teams opening up the frontier, coming to the foothills of the Tianshan mountains in the northwest to build the first state farm there. Liu Haiying was one of the youngest among them; she worked hard and cheerfully, and by diligently studying the trade she quickly learned to drive a tractor.

How Did Boeing Destroy Itself?

- That today's US military can't beat Iran is likewise the bitter fruit of America's wholesale shift from the real economy to the virtual one: abandoning manufacturing to play at finance. It has no one else to blame. With manufacturing hollowed out, air-defense and interceptor missiles can't be produced in sufficient quantity or on time, so how is it supposed to fight a war?

Saturday, 2 May 2026

The precious Oscar-winning Chinese War of Resistance documentary Kukan: The Secret of Unconquerable China (苦干, 1941), winner of a Special Documentary Award at the 14th Academy Awards

https://drive.google.com/file/d/11zqc4u-NhJS5c27F2GlEOea6XwHYpYj1/view

The real purpose of Trump's May visit to China: three lists expose America's soft spots

- Countdown to Trump's visit to China! America begs China not to stand by and let it die

China and Pakistan join hands to hand Iran a blockbuster gift package; America's blockade of the Strait of Hormuz fails completely

America's spring planting rots in the fields, Trump's voter base revolts en masse, and China holds the lifesaving key but won't open the door

- America deserves it: this is the bitter fruit of the war on Iran that the Big Mouth started with his own hands. Trump the loudmouth can only reap what he has sown.

Iran strikes a US aircraft carrier seven times; the USS Ford flees in disarray

Trump reposts a controversial post, and a Chinese-American congressman lashes out

- Trump is simply a lunatic. That such a character became US president is truly the American people's misfortune.

Friday, 1 May 2026

Japan Will Become a Battleground

- Fine, let it keep going down this road. When the time comes, China will settle new scores and old together and strike Japan hard.

Iran vows to bury the US military at the bottom of the sea

- 6.8 billion yuan to procure 8,500 robots: that works out to 800,000 yuan apiece.

Argo-CD

 

Declarative Continuous Deployment for Kubernetes

 
 


Argo CD - Declarative Continuous Delivery for Kubernetes

What is Argo CD?

Argo CD is a declarative GitOps continuous delivery tool for Kubernetes.

Argo CD UI

Argo CD Demo

Why Argo CD?

  1. Application definitions, configurations, and environments should be declarative and version controlled.
  2. Application deployment and lifecycle management should be automated, auditable, and easy to understand.
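To make the declarative idea concrete, here is a minimal sketch of an Argo CD Application manifest applied with kubectl; the application name, repo URL, path, and namespaces are placeholders, not values from this README:

kubectl apply -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: guestbook                # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/your-config-repo   # placeholder Git repo
    targetRevision: HEAD
    path: guestbook              # directory of manifests inside the repo
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: {}                # keep cluster state synced to Git automatically
EOF

With this applied, Argo CD continuously compares the manifests at the Git path against the live cluster and reports or corrects any drift.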

Who uses Argo CD?

Official Argo CD user list

Documentation

To learn more about Argo CD, see the complete documentation. Check out the live demo at https://cd.apps.argoproj.io/.

Community

Contribution, Discussion and Support

You can reach the Argo CD community and developers via the following channels:

Participation in the Argo CD project is governed by the CNCF Code of Conduct

Blogs and Presentations

  1. Awesome-Argo: A Curated List of Awesome Projects and Resources Related to Argo
  2. Unveil the Secret Ingredients of Continuous Delivery at Enterprise Scale with Argo CD
  3. GitOps Without Pipelines With ArgoCD Image Updater
  4. Combining Argo CD (GitOps), Crossplane (Control Plane), And KubeVela (OAM)
  5. How to Apply GitOps to Everything - Combining Argo CD and Crossplane
  6. Couchbase - How To Run a Database Cluster in Kubernetes Using Argo CD
  7. Automation of Everything - How To Combine Argo Events, Workflows & Pipelines, CD, and Rollouts
  8. Environments Based On Pull Requests (PRs): Using Argo CD To Apply GitOps Principles On Previews
  9. Argo CD: Applying GitOps Principles To Manage Production Environment In Kubernetes
  10. Creating Temporary Preview Environments Based On Pull Requests With Argo CD And Codefresh
  11. Tutorial: Everything You Need To Become A GitOps Ninja 90m tutorial on GitOps and Argo CD.
  12. Comparison of Argo CD, Spinnaker, Jenkins X, and Tekton
  13. Simplify and Automate Deployments Using GitOps with IBM Multicloud Manager 3.1.2
  14. GitOps for Kubeflow using Argo CD
  15. GitOps Toolsets on Kubernetes with CircleCI and Argo CD
  16. CI/CD in Light Speed with K8s and Argo CD
  17. Machine Learning as Code. Among other things, describes how Kubeflow uses Argo CD to implement GitOps for ML
  18. Argo CD - GitOps Continuous Delivery for Kubernetes
  19. Introduction to Argo CD : Kubernetes DevOps CI/CD
  20. GitOps Deployment and Kubernetes - using Argo CD
  21. Deploy Argo CD with Ingress and TLS in Three Steps: No YAML Yak Shaving Required
  22. GitOps Continuous Delivery with Argo and Codefresh
  23. Stay up to date with Argo CD and Renovate
  24. Setting up Argo CD with Helm
  25. Applied GitOps with Argo CD
  26. Solving configuration drift using GitOps with Argo CD
  27. Decentralized GitOps over environments
  28. Getting Started with ArgoCD for GitOps Deployments
  29. Using Argo CD & Datree for Stable Kubernetes CI/CD Deployments
  30. How to create Argo CD Applications Automatically using ApplicationSet? "Automation of GitOps"
  31. Progressive Delivery with Service Mesh – Argo Rollouts with Istio

from  https://github.com/argoproj/argo-cd

( https://github.com/argoproj/gitops-engine)

----------

 

This repository contains Kustomize manifests that point to the upstream manifest of each Kubeflow component and provides an easy way for people to change their deployment according to their needs. ArgoCD application manifests for each component are used to deploy Kubeflow. The intended usage is to fork this repository, make your desired kustomizations, run a script to point the ArgoCD application specs at your fork, and finally apply a master ArgoCD application that deploys all the other applications.

To run the script below, yq version 4 must be installed.

Overview of the steps (see the condensed command sequence after this list):

  • fork this repo
  • modify the kustomizations for your purpose
  • run ./setup_repo.sh <your_repo_fork_url>
  • commit and push your changes
  • install ArgoCD
  • run kubectl apply -f kubeflow.yaml
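Condensed into one command sequence, the whole flow looks roughly like this; the repository URL is a placeholder for your own fork:

# fork this repo on GitHub first, then:
git clone https://github.com/<your-username>/argoflow.git
cd argoflow

# point all ArgoCD application specs at your fork
./setup_repo.sh https://github.com/<your-username>/argoflow.git
git commit -am "point applications at my fork" && git push

# install ArgoCD, then deploy Kubeflow
kustomize build argocd/ | kubectl apply -f -
kubectl apply -f kubeflow.yaml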

Folder setup

Root files

Prerequisite

  • kubectl (latest)
  • kustomize 4.0.5
  • docker (if using kind)
  • yq 4.x

Quick Start using kind

Install kind

On linux:

curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.10.0/kind-linux-amd64
chmod +x ./kind
mv ./kind /<some-dir-in-your-PATH>/kind

On Mac:

curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.10.0/kind-darwin-amd64
chmod +x ./kind
mv ./kind /<some-dir-in-your-PATH>/kind

On Windows:

curl.exe -Lo kind-windows-amd64.exe https://kind.sigs.k8s.io/dl/v0.10.0/kind-windows-amd64
Move-Item .\kind-windows-amd64.exe c:\some-dir-in-your-PATH\kind.exe

Deploy kind cluster

Note - This will overwrite any existing ~/.kube/config file. Please back up your current file if it already exists.

kind create cluster --config kind/kind-cluster.yaml

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.3.6/components.yaml
kubectl patch deployment metrics-server -n kube-system -p '{"spec":{"template":{"spec":{"containers":[{"name":"metrics-server","args":["--cert-dir=/tmp", "--secure-port=4443", "--kubelet-insecure-tls","--kubelet-preferred-address-types=InternalIP"]}]}}}}'

Deploy MetalLB

Edit the IP range in configmap.yaml so that it is within the range of your docker network. To get your docker network range, run the following command:

docker network inspect -f '{{.IPAM.Config}}' kind
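For example, if the kind network is 172.18.0.0/16 (a common default, but an assumption here), the address pool in configmap.yaml might look like the sketch below; adjust the range to the output of the command above, and edit metallb/configmap.yaml in the repo rather than applying this directly:

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 172.18.255.200-172.18.255.250   # assumed free range inside the docker network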

After updating the metallb configmap, deploy it by running:

kustomize build metallb/ | kubectl apply -f -

Deploy Argo CD

Deploy Argo CD with the following command:

kustomize build argocd/ | kubectl apply -f -

Expose Argo CD with a LoadBalancer to access the UI by executing:

kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

Get the IP of the Argo CD endpoint:

kubectl get svc argocd-server -n argocd

Login with the username admin and the output of the following command as the password:

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Deploy Kubeflow

To deploy Kubeflow, execute the following command:

kubectl apply -f kubeflow.yaml

Note - This deploys all components of Kubeflow 1.3, so it might take a while for everything to start. The hardware requirements are currently unknown, so your mileage may vary. This deployment uses the manifests in this repository directly; for instructions on customizing the deployment and having Argo CD use those manifests, see the next section.

Get the IP of the Kubeflow gateway with the following command:

kubectl get svc istio-ingressgateway -n istio-system

Log in to Kubeflow with the email address user@kubeflow.org and the password 12341234.

Remove kind cluster

Run: kind delete cluster

Installing ArgoCD

For this installation, the HA version of ArgoCD is used. Due to pod tolerations, 3 nodes are required for this installation. If you do not wish to use an HA installation of ArgoCD, edit this kustomization.yaml and remove /ha from the URI.

  1. Next, to install ArgoCD execute the following command:

    kustomize build argocd/ | kubectl apply -f -
  2. Install the ArgoCD CLI tool from here

  3. Access the ArgoCD UI by exposing it through a LoadBalancer, Ingress, or by port-forwarding using kubectl port-forward svc/argocd-server -n argocd 8080:443

  4. Login to the ArgoCD CLI. First get the default password for the admin user: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

    Next, login with the following command: argocd login <ARGOCD_SERVER> # e.g. localhost:8080 or argocd.example.com

    Finally, update the account password with: argocd account update-password

  5. You can now login to the ArgoCD UI with your new password. This UI will be handy to keep track of the created resources while deploying Kubeflow.

Note - Argo CD needs to be able to access your repository to deploy applications. If the fork of this repository that you are planning to use with Argo CD is private, you will need to add credentials so it can access the repository. Please see the instructions provided by Argo CD here.

Installing Kubeflow

The purpose of this repository is to make it easy for people to customize their Kubeflow deployment and have it managed through a GitOps tool like ArgoCD. First, fork this repository and clone your fork locally. Next, apply any customizations you require in the kustomize folders of the Kubeflow applications. What follows is a set of recommended changes that we encourage everybody to make.

Credentials

The default username, password and namespace of this deployment are: user, 12341234 and kubeflow-user respectively. To change these, edit the user and profile-name (the namespace for this user) in params.env.

Next, in configmap-path.yaml under staticPasswords, change the email, hash, and username for the account you use.

staticPasswords:
- email: user
  hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
  username: user

The hash is the bcrypt hash of your password. You can generate it using this website, or with the command below:

python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

To add new static users to Dex, add entries to configmap-path.yaml and set a password as described above. If you have already deployed Kubeflow, commit these changes to your fork so Argo CD detects them. You will also need to kill the Dex pod or restart the Dex deployment. This can be done in the Argo CD UI, or by running the following command:

kubectl rollout restart deployment dex -n auth

Ingress and Certificate

By default the Istio Ingress Gateway is setup to use a LoadBalancer and to redirect HTTP traffic to HTTPS. Manifests for MetalLB are provided to make it easier for users to use a LoadBalancer Service. Edit the configmap.yaml and set a range of IP addresses MetalLB can use under data.config.address-pools.addresses. This must be in the same subnet as your cluster nodes.

If you do not wish to use a LoadBalancer, change the spec.type in gateway-service.yaml to NodePort.

To provide HTTPS out-of-the-box, the kubeflow-self-signing-issuer used by internal Kubeflow applications is setup to provide a certificate for the Istio Ingress Gateway.

To use a different certificate for the Ingress Gateway, change the spec.issuerRef.name to the cert-manager ClusterIssuer you would like to use in ingress-certificate.yaml and set the spec.commonName and spec.dnsNames[0] to your Kubeflow domain.

If you would like to use LetsEncrypt, a ClusterIssuer template is provided in letsencrypt-cluster-issuer.yaml. Edit this file according to your requirements and uncomment the line in the kustomization.yaml file so it is included in the deployment.

Customizing the Jupyter Web App

To customize the list of images presented in the Jupyter Web App, and other related settings such as allowing custom images, edit the spawner_ui_config.yaml file.

Change ArgoCD application specs and commit

To simplify the process of telling ArgoCD to use your fork of this repo, a script is provided that updates the spec.source.repoURL of all the ArgoCD application specs. Simply run:

./setup_repo.sh <your_repo_fork_url>

If you need to target a specific branch or release of your fork, you can add a second argument to the script to specify it.

./setup_repo.sh <your_repo_fork_url> <your_branch_or_release>

To change which Kubeflow or third-party components are included in the deployment, edit the root kustomization.yaml and comment or uncomment the components you do or don't want.

Next, commit your changes and push them to your repository.

Deploying Kubeflow

Once you've committed and pushed your changes to your repository, you can either deploy components individually or deploy them all at once. For example, to deploy a single component you can run:

kubectl apply -f argocd-applications/kubeflow-roles-namespaces.yaml

To deploy everything specified in the root kustomization.yaml, execute:

kubectl apply -f kubeflow.yaml

After this, you should start seeing applications being deployed in the ArgoCD UI, along with the resources each application creates.

Updating the deployment

By default, all the ArgoCD application specs included here are set up to automatically sync with the specified repoURL. If you would like to change something about your deployment, simply make the change, commit it, and push it to your fork of this repo. ArgoCD will automatically detect the changes and update the necessary resources in your cluster.
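A typical update therefore looks like this sketch; the file path and branch name are examples, not fixed paths in this repo:

# edit a kustomization in your fork
vim apps/jupyter-web-app/spawner_ui_config.yaml   # example file

# commit and push; no kubectl needed
git commit -am "update notebook image list"
git push origin main

# Argo CD detects the new commit and syncs the affected applications automatically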

Bonus: Extending the Volumes Web App with a File Browser

A common problem is how to easily upload or download data to and from the PVCs mounted as workspace volumes for Notebook Servers. To make this easier, a simple PVCViewer Controller was created (a slightly modified version of the tensorboard-controller). This feature was not ready in time for 1.3, so I am documenting it here only as an experimental feature, as I believe many people would like this functionality. The images are pulled from my personal Docker Hub profile, but I can provide instructions for people who would like to build the images themselves. Note that the PVC Viewer will work with ReadWriteOnce PVCs, even when they are mounted to an active Notebook Server.

Here is an example of the PVC Viewer in action:

PVCViewer in action

To use the PVCViewer Controller, it must be deployed along with an updated version of the Volumes Web App. To do so, deploy experimental-pvcviewer-controller.yaml and experimental-volumes-web-app.yaml instead of the regular Volumes Web App. If you are deploying Kubeflow with the kubeflow.yaml file, you can edit the root kustomization.yaml and comment out the regular Volumes Web App and uncomment the PVCViewer Controller and Experimental Volumes Web App.

Troubleshooting

I can't get letsencrypt to work. The cert-manager logs show 404 errors.

The letsencrypt HTTP-01 challenge is incompatible with using OIDC (Link). If your DNS server allows programmatic access, use the DNS-01 challenge solver instead.

I am having problems getting the deployment to run on a cluster deployed with kubeadm and/or kubespray.

The kube-apiserver needs additional arguments if you are running a Kubernetes version below the recommended 1.20: --service-account-issuer=kubernetes.default.svc and --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key.

If you are using kubespray, add the following snippet to your group_vars:

kube_kubeadm_apiserver_extra_args: 
  service-account-issuer: kubernetes.default.svc
  service-account-signing-key-file: /etc/kubernetes/ssl/sa.key

I have unbound PVCs with rook-ceph.

Note that the rook deployment shipped with ArgoFlow requires an HA setup with at least 3 nodes.

Make sure that there is a clean partition or drive available for rook to use.

Change the deviceFilter in cluster-patch.yaml to match the drives you want to use. For NVMe drives, change the filter to ^nvme[0-9]. If you have previously deployed rook on any of the disks, format them, remove the folder /var/lib/rook on all nodes, and reboot. Alternatively, follow the rook-ceph disaster recovery guide to adopt an existing rook-ceph cluster.
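To check whether a node still has an unclaimed disk, one quick way is the sketch below, run on each storage node:

# list block devices with their filesystems; devices with an empty FSTYPE
# column are candidates for rook to claim
lsblk -f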

from  https://github.com/argoflow/argoflow

------

 


 

Your Open ML Platform

deployKF Logo



About deployKF

What is deployKF?

deployKF builds machine learning platforms on Kubernetes.
We combine the best of Kubeflow, Airflow, and MLflow into a complete platform that is easy to deploy and maintain.

More tools are coming soon; see our current and future tools.

Why use deployKF?

deployKF combines the ease of a managed service with the flexibility of a self-hosted solution.

Our goal is that any Kubernetes user can build a machine learning platform for their organization, without needing specialized MLOps knowledge, or a team of experts to maintain it.

The key features of deployKF are:

Video Introduction

Title: deployKF: A better way to deploy Kubeflow (and more)
Event: Kubeflow Summit 2023

Featured Stories

We are always excited to see how and where deployKF is being used!

Here are some stories of deployKF being used in the wild:

| Organization | Article / Video |
| Cloudflare | A look inside the Cloudflare ML Ops platform |

Have a story to share? Let us know!



Using deployKF

Getting Started

To help you get started with deployKF, we have prepared a number of guides:

Release Information

For more information about our releases, please see:

Support the Project

deployKF is a new and growing project. If you like what we are doing, please help others discover us by sharing the project with your colleagues and/or the wider community.

We greatly appreciate GitHub Stars ⭐ on the deployKF/deployKF repository:

Star History Chart


Other Resources

Commercial Support

To discuss commercial support options for deployKF, please connect with Aranui Solutions, the company started by the creators of deployKF. Learn more on the Aranui Solutions Website.

Community

The deployKF community uses the Kubeflow Slack for informal discussions among users and contributors.

Please see our community page for more information.

History of deployKF

deployKF was originally created and is maintained by Mathew Wicks (GitHub: @thesuperzapper), a Kubeflow lead and maintainer of the popular Apache Airflow Helm Chart. deployKF is a community-led project that welcomes contributions from anyone who wants to help.

 from  https://github.com/deployKF/deployKF