Distributed TensorFlow: InternalError - Blas GEMM launch failed

Question


I am experimenting with distributed TensorFlow, starting with two processes on localhost (Windows 10, Python 3.6.6, TensorFlow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer) trained on a subset of the UrbanSounds dataset (5268 samples with 193 features each).

Following this well-written post, https://learningtensorflow.com/lesson11/, I was able to reproduce its basic example, which computes the mean of the results from two different processes. For my dataset I modified the code as shown below, splitting the samples into two parts and letting the two processes compute the cost function separately. But after the RPC server starts successfully, both processes end with the following error:

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193

[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

I suspect some basic mistake in the neural network setup or in how the datasets are prepared for feed_dict, but I cannot see it, so I need another pair of eyes. Another observation from this experiment is that GPU usage shoots up to the maximum and then the code aborts. Could you please help me find the mistake in the code or in my TensorFlow distribution strategy?

Thanks.

### ERROR TRACE (removed duplicate rows ...) ####
train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]
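
Note that the first error in the trace is the cuBLAS handle failure (`CUBLAS_STATUS_ALLOC_FAILED`); the GEMM errors appear to follow from it. A quick check of whether the GPU is involved at all might be to launch both tasks with the GPU hidden. This is a minimal sketch, assuming the script name from the trace (`tf_dis_audio_test.py`); `task_env` and `launch` are hypothetical helpers, and setting `CUDA_VISIBLE_DEVICES=-1` is the usual way to hide CUDA devices from a process.

```python
import os
import subprocess
import sys


def task_env(hide_gpu):
    """Return a copy of the current environment, optionally hiding CUDA devices."""
    env = dict(os.environ)
    if hide_gpu:
        # With no visible CUDA device, TensorFlow falls back to the CPU.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def launch(script, task_number, hide_gpu=True):
    """Start one task of the cluster as a separate Python process."""
    cmd = [sys.executable, script, str(task_number)]
    return subprocess.Popen(cmd, env=task_env(hide_gpu))


# Usage (not executed here): start both tasks on the CPU only.
# procs = [launch("tf_dis_audio_test.py", n) for n in (0, 1)]
# for p in procs:
#     p.wait()
```

If both tasks train normally this way, the failure is more likely tied to GPU memory than to the network definition or the feed_dict data.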

### CODE SAMPLE ###
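import sys
from datetime import datetime

import numpy as np
import tensorflow as tf  # TensorFlow 1.8.0

# selected UrbanSounds dataset (loading code not shown)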
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1/np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0,1), as:
#>python this_script.py  0
#>python this_script.py  1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({"local": ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():    
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean = 0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean = 0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")
    
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)
    
    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0
    print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# hack to fix the GPU out of memory issue
# but it does not make any good, GPU still shoots :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # the session is already connected to the RPC host
    ss.run(tf.global_variables_initializer())

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[:batch_size-1], Y:train_labels[:batch_size-1]})

        # run session for task#1
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[batch_size:-1], Y:train_labels[batch_size:-1]})

        # recording the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: cost {:.3f}".format(epoch, task_number, cost))

    print("Accuracy SGD ({}): {:.3f}".format(
        epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))
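
As a sanity check on the feed_dict preparation, the slicing above can be replayed with plain NumPy. The sketch below uses a zero-filled stand-in for `train_data` (an assumption; only the shape printed in the log matters) and shows that both slices yield 263 rows, which matches `a.shape=(263, 193)` in the error. So the shapes fed to `MatMul` look consistent, although each slice silently drops one row. `np.array_split` is shown as one possible alternative that keeps every row.

```python
import numpy as np

n_tasks = 2
n_samples, n_features = 528, 193  # train_data shape printed in the log

# Stand-in for train_data; only the shape matters here.
train_data = np.zeros((n_samples, n_features), dtype=np.float32)

batch_size = n_samples // n_tasks        # 264
shard_0 = train_data[:batch_size - 1]    # slice fed by task 0
shard_1 = train_data[batch_size:-1]      # slice fed by task 1
print(shard_0.shape, shard_1.shape)      # (263, 193) (263, 193)

# Alternative that keeps every row: one equal shard per task.
shards = np.array_split(train_data, n_tasks)
print([s.shape for s in shards])         # [(264, 193), (264, 193)]
```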

Shakeel Anjum, Aug 27, 2018 at 16:05


Tags: python, tensorflow, distributed-computing

1 answer


Shakeel Anjum · Accepted Answer · 2018-09-03T08-49-00.000Z

Simply moving the same code to Ubuntu 16.04.4 LTS solved the problem for me.

I am not sure, but it looks like something related to gRPC and the firewall on Windows 10.
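
One more thing that may be worth trying on Windows: the trace reports `failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED` before the GEMM error, which would fit two processes competing for memory on the single GTX 1070. The sketch below is untested on the original setup and rests on an assumption: that the GPU is claimed when the in-process `tf.train.Server` is created, so the memory options have to be passed to the server rather than only to `tf.Session`. The helper `gpu_fraction_per_process` and its `usable=0.9` default are hypothetical, chosen only for illustration.

```python
def gpu_fraction_per_process(n_processes, usable=0.9):
    """Split a usable share of GPU memory evenly across processes."""
    if n_processes < 1:
        raise ValueError("n_processes must be >= 1")
    return usable / n_processes


def make_server(task_number, n_tasks=2):
    """Create the in-process server with GPU memory limits (TensorFlow 1.x API)."""
    # Imported lazily so the helper above can be used without TensorFlow.
    import tensorflow as tf

    cluster = tf.train.ClusterSpec(
        {"local": ["localhost:2001", "localhost:2002"]})
    gpu_options = tf.GPUOptions(
        per_process_gpu_memory_fraction=gpu_fraction_per_process(n_tasks),
        allow_growth=True,  # allocate GPU memory on demand, not all up front
    )
    config = tf.ConfigProto(gpu_options=gpu_options)
    # Assumption: the server, not the session, owns the GPU device,
    # so the options are passed here.
    return tf.train.Server(cluster, job_name="local",
                           task_index=task_number, config=config)
```

Each task would then call `make_server(task_number)` in place of the `tf.train.Server(...)` line in the question.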

If anyone runs into this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.

Cheers.