
Tuesday 19 December 2017

Installing CUDA on Ubuntu 16.04 LTS


If anything goes wrong, look for the error in the Apache log: vi /var/log/apache2/error.log
# change user password and add ssh key
sudo passwd username
vi ~/.ssh/authorized_keys

# needed for Amber:
sudo apt-get install csh
sudo apt-get install gfortran
## needed for X11
sudo apt-get install xorg-dev
## needed for NAB and antechamber
sudo apt-get install flex
## optional: compression libraries (zlib, bzip2)
sudo apt-get install zlib1g-dev libbz2-dev
## OpenMP and MPI for Amber
sudo apt-get install openmpi-bin libopenmpi-dev

# install php
sudo apt-get install php
sudo apt-get install libapache2-mod-php7.0
### mcrypt (encryption extension) and curl (used for asynchronous calls)
sudo apt-get install php7.0-mcrypt
sudo apt-get install php7.0-curl

# install mysql
sudo apt-get install mysql-server
#sudo apt-get install mysql-server-core-5.7
#sudo apt-get install mysql-client-core-5.7
sudo apt-get install php-mysql
# the following are needed by phpMyAdmin; installing php-mysqli may fail on Ubuntu 16
sudo apt-get install php-mbstring
sudo apt-get install php-mysqli

# restart service
sudo service apache2 restart
sudo /etc/init.d/mysql restart

Further troubleshooting of permissions:

Default user and group for apache2: www-data
sudo chown -R yourname:www-data foldername
sudo chmod -R g+x foldername

Additionally, install AmberTools16

tar -xjf AmberTools16.tar.bz2
# Set AMBERHOME once
export AMBERHOME=`pwd`
# Apply the available patches
./update_amber --update
# configure may install many additional packages
./configure gnu
# The environment script must be loaded before building
source /home/user/Software/Amber16/amber.sh
make install
# To install the MPI and OpenMP version, repeat the steps once more
./configure -mpi -openmp gnu
source /home/user/Software/Amber16/amber.sh
make install
To install CUDA:
### If an old NVIDIA driver needs to be removed first, use the following two commands
#sudo apt-get purge nvidia*
#reboot

### If you don't want to install the NVIDIA toolkit the official way:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-367 nvidia-367-dev
sudo apt-get install libcuda1-367 nvidia-cuda-dev nvidia-cuda-toolkit
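The method above does get CUDA installed. However, the CUDA executables end up in /usr/bin/ and the libraries are scattered, so if a program needs the CUDA_HOME variable it is hard to know what to point it at, and installing several CUDA toolkits side by side becomes awkward.
In the end I recommend the more standard official installation. When I tried, the official download page offered CUDA 7.5 deb packages only for Ubuntu 14.04 and 15.04, but it also offered a run file (the same file for both Ubuntu versions). The steps here follow the article NVIDIA CUDA with Ubuntu 16.04 beta on a laptop (if you just cannot wait):
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda_7.5.18_linux.run
# Install CUDA. Note: do NOT install the NVIDIA driver bundled in the run file! For the answers, see the following block
sudo sh cuda_7.5.18_linux.run --override
You may see prompts like the following (I was missing one library here; ln -s /usr/lib/x86_64-linux-gnu/libGLU.so.1 /usr/lib/x86_64-linux-gnu/libGLU.so fixed it):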
Do you accept the previously read EULA? (accept/decline/quit): accept    
You are attempting to install on an unsupported configuration. Do you wish to continue? ((y)es/(n)o) [ default is no ]: y
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 352.39? ((y)es/(n)o/(q)uit): n
Install the CUDA 7.5 Toolkit? ((y)es/(n)o/(q)uit): y
Enter Toolkit Location [ default is /usr/local/cuda-7.5 ]: 
Do you want to install a symbolic link at /usr/local/cuda? ((y)es/(n)o/(q)uit): y
Install the CUDA 7.5 Samples? ((y)es/(n)o/(q)uit): y
Enter CUDA Samples Location [ default is /home/hom ]: /usr/local/cuda-7.5
Error: unsupported compiler: 5.4.0. Use --override to override this check.
Missing recommended library: libGLU.so

Error: cannot find Toolkit in /usr/local/cuda-7.5
If a deb package exists for your Ubuntu version, installation is:
# If a deb package exists...
sudo dpkg -i cuda-repo-<distro>_<version>_<architecture>.deb
sudo apt-get update
sudo apt-get install cuda
If the installation succeeded, configure CUDA. Add the following to .bashrc or .profile:
export CUDA_HOME="/usr/local/cuda" 
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
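When several CUDA toolkits are installed side by side, it can be handy to build the same three variables per process instead of editing .bashrc. The following is a minimal Python sketch of that idea; the function name cuda_env and the example toolkit path are assumptions made for illustration, not part of CUDA or Amber.

```python
def cuda_env(cuda_home="/usr/local/cuda", base_env=None):
    """Return an environment dict with the three CUDA variables set.

    base_env is the starting environment; it is copied, never modified.
    """
    env = dict(base_env or {})
    env["CUDA_HOME"] = cuda_home

    def prepend(new_dirs, old_value):
        # Put the CUDA directories first and skip empty entries.
        old_dirs = [d for d in old_value.split(":") if d]
        return ":".join(new_dirs + old_dirs)

    env["PATH"] = prepend([cuda_home + "/bin"], env.get("PATH", ""))
    env["LD_LIBRARY_PATH"] = prepend(
        [cuda_home + "/lib64", cuda_home + "/lib"],
        env.get("LD_LIBRARY_PATH", ""),
    )
    return env


# Hypothetical example: select the 7.5 toolkit for one child process.
env = cuda_env("/usr/local/cuda-7.5", {"PATH": "/usr/bin:/bin"})
print(env["PATH"])
print(env["LD_LIBRARY_PATH"])
```

To run a real command with it, pass the result as env= to subprocess.run, using os.environ as base_env.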
Test whether the installation works:
# Just to test whether CUDA is OK
nvcc -V
cd /usr/local/cuda/samples
sudo chown -R <username>:<usergroup> .
cd 1_Utilities/deviceQuery
make
./deviceQuery
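
If the nvcc -V check has to be scripted, the release number can be pulled out of its output. This is a small Python sketch; the sample line below is only an illustration of the "release X.Y" field that nvcc prints, and parse_nvcc_release is a hypothetical helper name.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' field of `nvcc -V` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


# Illustrative sample line; on a real machine, feed in the text captured
# from subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
sample = "Cuda compilation tools, release 7.5, V7.5.17\n"
print(parse_nvcc_release(sample))  # prints (7, 5)
```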

After reloading with source ~/.bashrc, you can install the CUDA version of Amber:
./configure -mpi -openmp -cuda gnu
source /home/user/Software/Amber16/amber.sh
make install
You may run into the following error:
In file included from /usr/local/cuda/include/cuda_runtime.h:76:0,
                 from <command-line>:0:
/usr/local/cuda/include/host_config.h:115:2: error: #error -- unsupported GNU version! gcc versions later than 4.9 are not supported!
 #error -- unsupported GNU version! gcc versions later than 4.9 are not supported!
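This is because CUDA restricts the compiler to gcc 4.9 or earlier, while Ubuntu 16.04 already ships gcc 5.4. To get around it, edit one header file:
cd /usr/local/cuda/include/ # go to the CUDA header directory (cd is a shell builtin, so no sudo)
sudo cp host_config.h host_config.h.bak # back up the original header
sudo gedit host_config.h # edit the header
# Search (Ctrl+F) for 4.9, around line 115. Line 113 should read #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 9). Since our gcc is 5.4, change both 4s in that line to 5, then save and exit.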
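To see why changing both 4s to 5 is enough for gcc 5.4, the preprocessor condition can be mirrored in a few lines of Python. This is only a sketch of the comparison logic; gcc_rejected is a hypothetical name and is not part of CUDA.

```python
def gcc_rejected(major, minor, limit_major=4, limit_minor=9):
    """Mirror of: #if __GNUC__ > M || (__GNUC__ == M && __GNUC_MINOR__ > m)."""
    return major > limit_major or (major == limit_major and minor > limit_minor)


print(gcc_rejected(5, 4))        # original limit 4.9: prints True (the #error fires)
print(gcc_rejected(5, 4, 5, 9))  # after changing both 4s to 5: prints False
```

Note that the edit only silences the check; it does not make that CUDA release officially support the newer compiler.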
Then run make clean; make install again. The build may now stop at a different error:
mpicc -Duse_SPFP -O3 -mtune=native -DMPICH_IGNORE_CXX_SEEK -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -DBINTRAJ -DMPI   -DCUDA -DMPI -DMPICH_IGNORE_CXX_SEEK -I/usr/local/cuda/include -IB40C -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi  -c gputypes.cpp
/usr/local/cuda/bin/nvcc -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -use_fast_math -O3  -Duse_SPFP -DCUDA -DMPI -DMPICH_IGNORE_CXX_SEEK -I/usr/local/cuda/include -IB40C -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi   -c kForcesUpdate.cu
/usr/include/string.h: In function ‘void* __mempcpy_inline(void*, const void*, size_t)’:
/usr/include/string.h:652:42: error: ‘memcpy’ was not declared in this scope
   return (char *) memcpy (__dest, __src, __n) + __n;
                                          ^
Makefile:38: recipe for target 'kForcesUpdate.o' failed
To fix this, I edited config.h and appended -D_FORCE_INLINES to the defines, giving PMEMD_CU_DEFINES=-DCUDA -DMPI -DMPICH_IGNORE_CXX_SEEK -D_FORCE_INLINES. The build then succeeded! (See [memcpy' was not declared in this scope (Ubuntu 16.04)](https://github.com/opencv/opencv/issues/6500).) When installing GROMACS, add set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_FORCE_INLINES") to CMakeLists.txt.

Ref:
  1. CUDA Toolkit Documentation
  2. Ubuntu 16.04下安装Tensorflow(GPU)
  3. testing Ubuntu 16.04 for CUDA development. Awesome integration!
  4. NVIDIA CUDA with Ubuntu 16.04 beta on a laptop (if you just cannot wait)
  5. Ubuntu 14.04 安装配置CUDA, CUDA入门教程
  6. Amber11+AmberTools1.5+CUDA加速 安装总结
  7. Ubuntu15.10_64位安装Theano+cuda7.5详细笔记: a good walkthrough of installing Theano from scratch
  8. ubuntu 16.04 编译opencv3.1,opencv多版本切换

Bug

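Asynchronous call problem
For the asynchronous call, the timeout that interrupts the request can be given in milliseconds with CURLOPT_TIMEOUT_MS (CURLOPT_TIMEOUT is in seconds; the millisecond option is only supported in newer versions).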
$curl = curl_init();
 
curl_setopt($curl, CURLOPT_URL, 'http://www.mysite.com/myscript.php');
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 1);
 
curl_exec($curl);
curl_close($curl);
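Because the asynchronous call is made from another script and the new server is somewhat slow to resolve the URL, a TIMEOUT of 1 makes the request end before this step has responded, so the script does not continue. The workaround is a longer timeout; better still, merge everything into one script or move the download step into the main script. For now I simply changed it to 6 seconds.
Later, uploading files still failed. The cause was that PHP 7 has removed some legacy usages; switching to the new ones fixed it.
phpMyAdmin problems
Installation: download from the official site, extract it, and place it in some directory; it then runs as is.
Login fails, and neither setting the login method to http/cookie nor supplying the user password fixes it.
Searching apache2/error.log turns up the following error: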
PHP Fatal error: Uncaught Error: Call to undefined function mb_detect_encoding() phpmyadmin
See PHP Fatal error when trying to access phpmyadmin mb_detect_encoding and Fatal error: Call to undefined function mb_detect_encoding(). Checking phpinfo showed that two important PHP libraries were missing, mbstring and mysqli; installing them fixes it.
PHP Fatal error: Uncaught Error: Call to undefined function mysql_connect()
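Advice online says php_mysql is not installed. It still failed after installing it. Others suggest editing /etc/php/7.0/apache2/php.ini to uncomment the mysql extension, but in 7.0 there is only mysqli, no mysql, so that did not work either.
The cause is that the mysql_* functions were deprecated in PHP 5.5 and removed in PHP 7.0; use the mysqli_* functions instead. For details see MySQL Improved Extension.
$con = mysqli_connect($hostname, $user, $passwd); // same argument order as mysql_connect
if (!$con) {
       // no connection object exists yet, so use mysqli_connect_error()
       die('Could not connect: ' . mysqli_connect_error());
}
//// Note: the argument order differs!!!
//mysql_select_db(dbname,$con);
mysqli_select_db($con, $dbname);
//mysql_query(query)
$result = mysqli_query($con, $query_string);
//mysql_fetch_array($result)
$row = mysqli_fetch_array($result);
//mysql_close($con)
mysqli_close($con);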
---------------------

Deploying a CUDA, cuDNN, and tensorflow-gpu runtime environment on CentOS 7.6


Background

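There are countless blog posts about installing a tensorflow-gpu environment, but after following many of them I still could not get the whole stack set up on CentOS 7.6. After roughly a week of effort spread over several months, everything is finally installed, so here is the process. The main goal of this post is to give a very detailed installation procedure (every operation, starting from a fresh environment) together with a description of the environment, so that later readers can quickly locate problems from the differences.

Test environment

The system version is CentOS 7.6.1810.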

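Choosing versions

Since the goal is to run tensorflow programs, every version has to follow tf. Up to now (yesterday, to be exact), the tensorflow-gpu version installed by default was 1.12. It has been out for quite a while; prefer a version that is both recent and stable.

According to the tf version support table linked below, 1.12 supports (only) CUDA 9 and cuDNN 7. Note that although no minor version is listed, in practice it supports 9.0 and only 9.0; other versions fail with errors about missing .so libraries.

Tested build configurations

So choose CUDA 9.0 and cuDNN 7.0.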
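
The version choice can be written down as a small lookup so that scripts fail early on an unknown combination. This is a minimal Python sketch; the table holds only the 1.12 pairing stated above, and the names TESTED_BUILDS, required_versions and pip_requirement are hypothetical. Other rows would have to be copied from the Tested build configurations table.

```python
# Pairing taken from the text above; add further rows from the
# "Tested build configurations" table as needed.
TESTED_BUILDS = {
    "1.12": {"cuda": "9.0", "cudnn": "7"},
}


def required_versions(tf_version):
    """Return the CUDA/cuDNN versions recorded for a tensorflow-gpu release."""
    key = ".".join(tf_version.split(".")[:2])  # "1.12.0" -> "1.12"
    if key not in TESTED_BUILDS:
        raise KeyError("no recorded build configuration for tensorflow-gpu " + key)
    return TESTED_BUILDS[key]


def pip_requirement(tf_version):
    """Build the pinned requirement string to pass to pip install."""
    return "tensorflow-gpu==" + tf_version


print(required_versions("1.12.0"))
print(pip_requirement("1.12"))
```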

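Even if your versions and environment are identical, do not copy and paste the commands blindly, because many file names change over time.

This post was last updated on 20190305. If following the steps produces errors, you can contact me by email.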

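Installing the Nvidia GPU driver

Most tutorials recommend installing the driver through the cuda installer, but I tried many times without success and kept getting the error

NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I tried many of the fixes found online, such as disabling the nouveau module and installing kernel packages, and none solved it. Installing cuda 9.2 did run successfully, but the annoying tensorflow only accepts 9.0!

Later I tried installing the Nvidia driver on its own and found it works perfectly, as long as it is compatible with the cuda version. So in this tutorial the Nvidia driver is installed separately. There is another reason for that: I plan to run programs in docker later, so only the essential software goes on the host.

First, open the link below and select the relevant parameters. There is no CentOS option, so I chose Linux 64-bit RHEL7. Then search to reach the download page, click download, agree and continue, and copy the file's download link from the browser's download queue.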

Download Drivers

Mine was http://us.download.nvidia.com/tesla/384.183/nvidia-diag-driver-local-repo-rhel7-384.183-1.0-1.x86_64.rpm

Connect to the server, switch to the root user, and download the file first.

sudo su

# install wget first if it is missing
yum install -y wget

wget http://us.download.nvidia.com/tesla/384.183/nvidia-diag-driver-local-repo-rhel7-384.183-1.0-1.x86_64.rpm

(It seems that current downloads are all run files, which can simply be executed.)

Then install the driver; some dependencies such as dkms are installed automatically.

rpm -i nvidia-diag-driver-local-repo-rhel7-384.183-1.0-1.x86_64.rpm

yum clean all

yum install -y cuda-drivers

If it reports that dkms is required, install it:

yum install -y epel-release
yum install -y --enablerepo=epel dkms

After installation, run the nvidia-smi command; if all went well, you will see its output.
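
Since the driver only has to be compatible with the cuda version, a quick scripted check can compare the installed driver version against a minimum. The Python sketch below assumes that the driver version embedded in the CUDA 9.0 runfile name (384.81) is the minimum to require; the helper names are hypothetical. It compares versions numerically, because a plain string comparison would rank "384.183" below "384.81".

```python
def parse_version(text):
    """Turn '384.183' into (384, 183) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, minimum="384.81"):
    """True if driver_version is at least the assumed minimum."""
    return parse_version(driver_version) >= parse_version(minimum)


# The installed version can be read with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
print(driver_supports("384.183"))  # prints True
print("384.183" >= "384.81")       # prints False: string comparison misleads
```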

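That completes the driver installation. It is quite simple, without the complicated installation steps and module changes described online.

If you only want to use the GPU inside docker, you can skip the CUDA and cuDNN installation below.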

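Installing cuda 9.0

First go to the page below and choose the cuda version you want, such as 9.0, then select your environment parameters.

CUDA Toolkit Archive

The site offers three installation methods: runfile (local), rpm (local), and rpm (network).

If you have many servers to deploy, the local forms are recommended, since the file is fairly large. Either runfile or rpm works; I chose runfile here.

Click download and get the download link. (Version 9.0 now has many patches; install them all in order, starting with the base installer. The steps are similar.)

wget -O cuda_9.0.176_384.81_linux.run https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run

sh cuda_9.0.176_384.81_linux.run

Because the file is large, wait a while until the README appears, then keep pressing space until the final confirmation, where of course you choose accept (do I have a choice...). Next it asks whether to install the driver; since that was done in the previous step, answer n to skip. Throughout the process answer y only at the cuda toolkit step and n everywhere else, then wait for the installation to finish. (One step asks whether to create a symbolic link. I don't think it is needed, since tensorflow cannot be upgraded transparently anyway.)

If all goes well, cuda is now installed (the wait is fairly long). At the end it prompts you to add the cuda paths to your environment variables.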

echo 'export PATH=/usr/local/cuda-9.0/bin:$PATH' >> /etc/profile.d/cuda-9.0.sh
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda-9.0.sh

source /etc/profile.d/cuda-9.0.sh

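The last step can be skipped and left to take effect after a server reboot, but rebooting a server is really slow.

cuda is now installed. You can run /usr/local/cuda-9.0/extras/demo_suite/deviceQuery to check whether the installation succeeded.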

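Installing cuDNN 7.0

First, I have to complain: Nvidia, you sell hardware, yet downloading a driver requires registering an account...

Open the link below and choose a suitable version, paying attention to the cuda version number. After choosing, you are told to register an account. Once registered as required, go through the process again and you can download.

cuDNN Archive

Again get the download link, then on the server:

# Replace the link below with your own, otherwise access is forbidden
wget -O cudnn-9.0-linux-x64-v7.tgz https://developer.download.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7.tgz

The archive contains only a few files; copy them into the cuda directory.

tar zxvf cudnn-9.0-linux-x64-v7.tgz

cp cuda/include/* /usr/local/cuda-9.0/include
cp cuda/lib64/* /usr/local/cuda-9.0/lib64

rm -rf cuda
rm -f cudnn-9.0-linux-x64-v7.tgz

cuDNN is now installed.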
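
Because a version mismatch shows up as a missing .so file, it can help to check up front that the expected shared libraries can be loaded. The following Python sketch tries to load each one with ctypes. The three library names are an assumption about what a CUDA 9.0 / cuDNN 7 install provides, and missing_libraries is a hypothetical helper; adjust the list to your versions.

```python
import ctypes

# Assumed sonames for CUDA 9.0 and cuDNN 7; adjust to your versions.
REQUIRED_LIBS = ("libcudart.so.9.0", "libcublas.so.9.0", "libcudnn.so.7")


def missing_libraries(names=REQUIRED_LIBS, loader=ctypes.CDLL):
    """Return the names that the dynamic loader cannot find or open."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            # ctypes.CDLL raises OSError when the library cannot be loaded.
            missing.append(name)
    return missing


# On the server, after sourcing /etc/profile.d/cuda-9.0.sh:
#   print(missing_libraries())   # an empty list means all three load
```

The loader parameter exists only so the function can be exercised without a GPU machine.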

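Installing tensorflow-gpu

Installing tf itself is easy, but one thing needs attention: you must pin the version, otherwise tf may be upgraded some day and you end up with an incompatible version. (For example, yesterday the default was still 1.12, and today it became 1.13. Painful.)

# A fresh environment may lack pip; install it first.
yum install -y python-pip

pip install tensorflow-gpu==1.12

The package is large; it took a few minutes even on a server with a good external connection. If your network is poor, downloading the package file beforehand is recommended.

After installing, write a file to test it:
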
import tensorflow as tf

hello = tf.constant('Hello, Tensorflow')

sess = tf.Session()

print(sess.run(hello))
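
The hello test confirms that tensorflow imports and runs, but not that it sees the GPU. A hedged sketch of a follow-up check is to filter the device list for GPU entries. The filter below is plain Python with a hypothetical name, gpu_device_names; the commented lines show how it would be fed from TensorFlow 1.x, and the fake device list is only there so the example runs without tensorflow.

```python
from types import SimpleNamespace


def gpu_device_names(devices):
    """Return the names of all devices whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


# With TensorFlow 1.x installed, the real device list comes from:
#   from tensorflow.python.client import device_lib
#   print(gpu_device_names(device_lib.list_local_devices()))

# Stand-in objects with the same two attributes, for illustration only.
fake_devices = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:GPU:0", device_type="GPU"),
]
print(gpu_device_names(fake_devices))  # prints ['/device:GPU:0']
```

An empty list on a GPU machine usually points back at the driver, CUDA, or cuDNN steps above.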

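Even after all this development, tensorflow's version compatibility still seems poor: different tf versions depend on different cuda versions. A server used by one person is fine, but on a GPU server shared by a lab the versions are bound to become an unmanageable mess.

Using docker is recommended: it isolates the messy environments without much performance loss.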

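nvidia-docker2

Uninstalling old nvidia drivers and cuda

If the system already has an nvidia driver, uninstall it first.

The uninstall method depends on how the old driver was installed.

Installed from a run file

Use nvidia-smi to check the nvidia driver version, then download the driver file of the corresponding version (tested: it does not need to match exactly).

# Uninstall cuda-toolkit
sudo bash /usr/local/cuda/bin/uninstall_***

# Uninstall the nvidia driver
# For example, if the file is NVIDIA-Linux-x86-384.183.run, run:
sudo bash NVIDIA-Linux-x86-384.183.run --uninstall

Installed from rpm

sudo yum list | grep nvidia

sudo yum remove <packages installed by the old driver>

After uninstalling, reboot the server (reboot with care), then install the new driver.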

Troubleshooting

  1. X server problem
ERROR: You appear to be running an X server; please exit X before
installing. For further details, please see the section INSTALLING
THE NVIDIA DRIVER in the README available on the Linux driver
download page at www.nvidia.com.

Solution:

# To stop:
sudo init 3

# install the driver

# To resume:
sudo init 5


reference:

https://blog.csdn.net/jiangpeng59/article/details/78215642

https://www.jianshu.com/p/78a936c27ec4

https://blog.csdn.net/jiede1/article/details/81062552

https://serverfault.com/questions/942844/nvidia-smi-cant-communicate-with-nvidia-driver

-----------------------------------------------------------------------------------

PhoenixGo is a Go(围棋) AI program which implements the AlphaGo Zero paper "Mastering the game of Go without human knowledge". It is also known as "BensonDarr" and "金毛测试" in FoxGo, "cronus" in CGOS, and the champion of World AI Go Tournament 2018 held in Fuzhou China.

If you use PhoenixGo in your project, please consider mentioning it in your README.

If you use PhoenixGo in your research, please consider citing the library as follows:

@misc{PhoenixGo2018,
  author = {Qinsong Zeng and Jianchang Zhang and Zhanpeng Zeng and Yongsheng Li and Ming Chen and Sifan Liu},
  title = {PhoenixGo},
  year = {2018},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Tencent/PhoenixGo}}
}

Building and Running

On Linux

Requirements

  • GCC with C++11 support
  • Bazel (0.19.2 is known-good)
  • (Optional) CUDA and cuDNN for GPU support
  • (Optional) TensorRT (for accelerating computation on GPU, 3.0.4 is known-good)

The following environments have also been tested by independent contributors : here. Other versions may work, but they have not been tested (especially for bazel).

Download and Install Bazel

Before starting, you need to download and install bazel, see here.

For PhoenixGo, bazel 0.19.2 is known-good; read Requirements for details.

If you have issues installing or starting bazel, you may want to try this all-in-one command line for easier building instead, see FAQ question.

Building PhoenixGo with Bazel

Clone the repository and configure the building:

$ git clone https://github.com/Tencent/PhoenixGo.git
$ cd PhoenixGo
$ ./configure

./configure will start the bazel configuration: it asks where CUDA and TensorRT have been installed; specify them if needed.

Then build with bazel:

$ bazel build //mcts:mcts_main

Dependencies such as Tensorflow will be downloaded automatically. The build may take a long time.

Recommendation: the bazel build uses a lot of RAM. If your build environment is short of RAM, you may need to restart your computer and exit other running programs to free as much RAM as possible.

Running PhoenixGo

Download and extract the trained network:

$ wget https://github.com/Tencent/PhoenixGo/releases/download/trained-network-20b-v1/trained-network-20b-v1.tar.gz
$ tar xvzf trained-network-20b-v1.tar.gz

The PhoenixGo engine supports GTP (Go Text Protocol), which means it can be used with a GUI with GTP capability, such as Sabaki. It can also run on command-line GTP server tools like gtp2ogs.

But PhoenixGo does not support all GTP commands, see FAQ question.

There are 2 ways to run the PhoenixGo engine:

1) start.sh : easy use

Run the engine : scripts/start.sh

start.sh will automatically detect the number of GPUs, run mcts_main with proper config file, and write log files in directory log.

You could also use a customized config file (.conf) by running scripts/start.sh {config_path}. If you want to do that, see also #configure-guide.

2) mcts_main : full control

If you want full control over all the options of mcts_main (such as changing the log destination, or if start.sh is not compatible with your specific use), you can run bazel-bin/mcts/mcts_main directly instead.

For a typical usage, these command line options should be added:

  • --gtp to enable GTP mode
  • --config_path=replace/with/path/to/your/config/file to specify the path to your config file
  • it is also needed to edit your config file (.conf) and manually add the full path to ckpt, see FAQ question. You can also change options in config file, see #configure-guide.
  • for other command line options , see also #command-line-options for details, or run ./mcts_main --help . A copy of the --help is provided for your convenience here

For example:

$ bazel-bin/mcts/mcts_main --gtp --config_path=etc/mcts_1gpu.conf --logtostderr --v=0
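
When driving the engine from a script instead of a GUI, each reply on stdout follows the GTP response format: a line starting with = on success or ? on failure, an optional numeric command id, the payload, and a blank line as terminator. The Python sketch below only parses such a reply; parse_gtp_response is a hypothetical helper and the sample replies are illustrative, not captured PhoenixGo output.

```python
import re


def parse_gtp_response(raw):
    """Split one GTP reply into (ok, command_id, payload).

    ok is True for '=' replies and False for '?' replies;
    command_id is None when the reply carries no id.
    """
    text = raw.strip()
    match = re.match(r"([=?])(\d*)\s*(.*)", text, re.DOTALL)
    if match is None:
        raise ValueError("not a GTP response: %r" % raw)
    status, command_id, payload = match.groups()
    return status == "=", (int(command_id) if command_id else None), payload


print(parse_gtp_response("= D4\n\n"))                # prints (True, None, 'D4')
print(parse_gtp_response("?1 unknown command\n\n"))  # prints (False, 1, 'unknown command')
```

A caller would write a command such as genmove b followed by a newline to the engine's stdin, read lines until the blank terminator, and pass the collected text to this function.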

(Optional) : Distribute mode

PhoenixGo supports running with distributed workers, if there are GPUs on different machines.

Build the distribute worker:

$ bazel build //dist:dist_zero_model_server

Run dist_zero_model_server on distributed worker, one for each GPU.

$ CUDA_VISIBLE_DEVICES={gpu} bazel-bin/dist/dist_zero_model_server --server_address="0.0.0.0:{port}" --logtostderr

Fill ip:port of workers in the config file (etc/mcts_dist.conf is an example config for 32 workers), and run the distributed master:

$ scripts/start.sh etc/mcts_dist.conf

On macOS

Note: Tensorflow stopped providing GPU support on macOS as of 1.2.0, so you are only able to run on CPU.

Use Pre-built Binary

Download and extract CPU-only version (macOS)

Follow the document included in the archive : using_phoenixgo_on_mac.pdf

Building from Source

Same as Linux.

On Windows

Recommendation: See FAQ question, to avoid syntax errors in config file and command line options on Windows.

Use Pre-built Binary

GPU version :

The GPU version is much faster, but works only with compatible nvidia GPU. It supports this environment :

  • CUDA 9.0 only
  • cudnn 7.1.x (x is any number) or lower for CUDA 9.0
  • no AVX, AVX2, AVX512 instructions supported in this release (so it is currently much slower than the linux version)
  • there is no TensorRT support on Windows

Download and extract GPU version (Windows)

Then follow the document included in the archive : how to install phoenixgo.pdf

Note: to support special features such as CUDA 10.0 or AVX512, you can make your own build for Windows, see #79

CPU-only version :

If your GPU is not compatible, or if you don't want to use a GPU, you can download this CPU-only version (Windows).

Follow the document included in the archive : how to install phoenixgo.pdf

Configure Guide

Here are some important options in the config file:

  • num_eval_threads: should equal the number of GPUs
  • num_search_threads: should be a bit larger than num_eval_threads * eval_batch_size
  • timeout_ms_per_step: how much time to use for each move
  • max_simulations_per_step: how many simulations (also called playouts) to run for each move
  • gpu_list: which GPUs to use, separated by commas
  • model_config -> train_dir: directory where the trained network is stored
  • model_config -> checkpoint_path: which checkpoint to use; taken from train_dir/checkpoint if not set
  • model_config -> enable_tensorrt: whether to use TensorRT
  • model_config -> tensorrt_model_path: which TensorRT model to use, if enable_tensorrt is set
  • max_search_tree_size: the maximum number of tree nodes; adjust it to your memory size
  • max_children_per_node: the maximum number of children of each node; adjust it to your memory size
  • enable_background_search: ponder during the opponent's time
  • early_stop: genmove may return before timeout_ms_per_step if the result would not change any more
  • unstable_overtime: think for timeout_ms_per_step * time_factor more if the result is still unstable
  • behind_overtime: think for timeout_ms_per_step * time_factor more if the winrate is less than act_threshold
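
The rule for num_search_threads can be turned into a one-line calculation. This Python sketch is only an illustration: the README says "a bit larger" without giving a number, so the margin parameter and its default of 1 are assumptions, and suggest_search_threads is a hypothetical helper, not part of PhoenixGo.

```python
def suggest_search_threads(num_eval_threads, eval_batch_size, margin=1):
    """Suggest num_search_threads slightly above num_eval_threads * eval_batch_size.

    margin is an assumed stand-in for "a bit larger"; tune it for your setup.
    """
    if num_eval_threads < 1 or eval_batch_size < 1 or margin < 1:
        raise ValueError("all arguments must be positive integers")
    return num_eval_threads * eval_batch_size + margin


# One GPU (so one eval thread) with a batch size of 4:
print(suggest_search_threads(1, 4))  # prints 5
```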

Options for distribute mode:

  • enable_dist: enable distribute mode
  • dist_svr_addrs: ip:port of distributed workers, multiple lines, one ip:port in each line
  • dist_config -> timeout_ms: RPC timeout

Options for async distribute mode:

Async mode is used when there is a huge number of distributed workers (more than 200), which would need too many eval threads and search threads in sync mode. etc/mcts_async_dist.conf is an example config for 256 workers.

  • enable_async: enable async mode
  • enable_dist: enable distribute mode
  • dist_svr_addrs: multiple lines, each a comma-separated list of ip:port
  • num_eval_threads: should equal the number of dist_svr_addrs lines
  • eval_task_queue_size: tune according to the number of distributed workers
  • num_search_threads: tune according to the number of distributed workers

Read mcts/mcts_config.proto for more config options.

Command Line Options

mcts_main accepts options from the command line:

  • --config_path: path of the config file
  • --gtp: run as a GTP engine; if disabled, only generate the next move
  • --init_moves: initial moves on the go board, for example usage, see FAQ question
  • --gpu_list: override gpu_list in the config file
  • --listen_port: works with --gtp, run the gtp engine on a TCP port
  • --allow_ip: works with --listen_port, list of client ips allowed to connect
  • --fork_per_request: works with --listen_port, whether to fork for each request

Glog options are also supported:

  • --logtostderr: log message to stderr
  • --log_dir: log to files in this directory
  • --minloglevel: log level, 0 - INFO, 1 - WARNING, 2 - ERROR
  • --v: verbose log, --v=1 to turn on some debug logging, --v=0 to turn it off

mcts_main --help for more command line options. A copy of the --help is provided for your convenience here

Analysis

For analysis purpose, an easy way to display the PV (variations for main move path) is --logtostderr --v=1 which will display the main move path winrate and continuation of moves analyzed, see FAQ question for details

It is also possible to analyse .sgf files using analysis tools such as :

  • GoReviewPartner : an automated tool to analyse and/or review one or many .sgf files (saved as .rsgf file). It supports PhoenixGo and other bots. See FAQ question for details

FAQ

You will find a lot of useful and important information, also most common problems and errors and how to fix them

Please take time to read the FAQ

from https://github.com/Tencent/PhoenixGo 

 
