import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="0"
You can double-check that you have the correct devices visible to TF:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
To determine the device number of your GPU in TensorFlow, you can use the following code snippet:
import tensorflow as tf
# Get a list of all available GPU devices
devices = tf.config.list_physical_devices('GPU')
# Iterate over the devices and print their details
for device in devices:
    print(device)
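The names printed by tf.config.list_physical_devices look like /physical_device:GPU:0, while the string you pass to tf.device is the shorter logical form. A minimal sketch of that mapping (the sample names below are assumptions, not output from a real machine):

```python
def to_logical_name(physical_name):
    # '/physical_device:GPU:0' -> '/GPU:0'
    device_type, index = physical_name.rsplit(':', 2)[-2:]
    return f'/{device_type}:{index}'

# Hypothetical names, shaped like tf.config.list_physical_devices('GPU') output
names = ['/physical_device:GPU:0', '/physical_device:GPU:1']
print([to_logical_name(n) for n in names])
# ['/GPU:0', '/GPU:1']
```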
Installing CUDA and the CUDA toolkit
To install CUDA, first check whether a driver is already installed:
nvidia-smi
If not, install the toolkit:
apt install nvidia-cuda-toolkit
Check what CPU you have
lscpu | egrep 'Model name|Socket|Thread|NUMA|CPU\(s\)'
Check the run-time driver information with:
cat /proc/driver/nvidia/version
When I run the code with 256 and 128 epochs, I get an out-of-memory (OOM) error:
File "/home/salehmak/anaconda3/envs/malnet-image/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1024,256,1,1] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node resnet50/conv4_block1_3_conv/Conv2D (defined at main.py:161)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_train_function_17083]
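A useful sanity check when reading OOM messages is to estimate a tensor's memory from the reported shape: number of elements times bytes per element. For the [1024,256,1,1] float tensor above (float32 = 4 bytes):

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Rough memory footprint of a dense tensor (float32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element

# The tensor from the OOM message: 1024*256*1*1 floats = 1 MiB
print(tensor_bytes([1024, 256, 1, 1]))  # 1048576 bytes
```

The single tensor is small; the OOM comes from the sum of all activations, weights, and gradients held at once, which is why reducing the batch size is the usual first fix.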
With an input size smaller than the required minimum of 32x32 (here 24x24):
ValueError: Input size must be at least 32x32; got input_shape=(24, 24, 3)
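One way around this (a sketch, assuming the input is a NumPy HxWxC image array) is to zero-pad the images up to the 32x32 minimum before feeding them to the model:

```python
import numpy as np

def pad_to_min(img, min_hw=32):
    """Zero-pad an HxWxC image so both H and W are at least min_hw."""
    h, w = img.shape[:2]
    pad_h = max(min_hw - h, 0)
    pad_w = max(min_hw - w, 0)
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))

img = np.zeros((24, 24, 3), dtype=np.float32)
print(pad_to_min(img).shape)  # (32, 32, 3)
```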
Working with multiple GPUs:
import tensorflow as tf

def get_gpu_names():
    gpu_names = []
    physical_devices = tf.config.list_physical_devices('GPU')
    for device in physical_devices:
        gpu_names.append(device.name)
    return gpu_names

def get_gpu_ids():
    gpu_ids = []
    physical_devices = tf.config.list_physical_devices('GPU')
    for device in physical_devices:
        gpu_ids.append(int(device.name.split(':')[-1]))
    return gpu_ids

# Example usage
gpu_names = get_gpu_names()
gpu_ids = get_gpu_ids()
print("GPU Names:", gpu_names)
print("GPU IDs:", gpu_ids)
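Once you have the GPU ids, one simple way to spread work across them is round-robin assignment of batches to device strings (a sketch; tf.distribute.MirroredStrategy is the usual higher-level route, but the underlying mapping is just modular arithmetic):

```python
def assign_batches(num_batches, gpu_ids):
    """Map each batch index to a '/GPU:n' device string, round-robin."""
    return {i: f'/GPU:{gpu_ids[i % len(gpu_ids)]}' for i in range(num_batches)}

# With two GPUs (ids 0 and 1), batches alternate between devices:
print(assign_batches(4, [0, 1]))
# {0: '/GPU:0', 1: '/GPU:1', 2: '/GPU:0', 3: '/GPU:1'}
```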
One-line verify:
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"