How do I run model.fit correctly on the GPU? Unexpected behavior

Problem description

I am currently taking a Udemy course on Python for data science. It contains the following example of training a model in TensorFlow:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

model = Sequential()

# Choose whatever number of layers/neurons you want.
model.add(Dense(units=78,activation='relu'))
model.add(Dense(units=39,activation='relu'))
model.add(Dense(units=19,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))

# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-Feedforward-neural-netw

model.compile(loss='binary_crossentropy',optimizer='adam')

model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
          )

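My goal now is to run this on my GPU. To do that, I changed the last part as follows (the number of epochs is low on purpose; I only want to see how long each epoch takes before scaling up):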

with tf.device("/gpu:0"):
    model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
              )

For comparison, I also ran:

with tf.device("/cpu:0"):
    model.fit(x=X_train,y=y_train,epochs=3,validation_data=(X_test,y_test),verbose=1
              )

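However, the result was very unexpected. Either both versions take up all of the GPU's memory but apparently do no computation on it, and each epoch takes exactly the same time, or the GPU version simply crashes with the following error: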

C:\Program Files (x86)\Microsoft Visual Studio\Shared\Anaconda3_64\envs\gpu\lib\site-packages\six.py in raise_from(value,from_value)
 
InternalError:  Blas GEMM launch failed : a.shape=(32, 78), b.shape=(78, 78), m=32, n=78, k=78
     [[node sequential/dense/MatMul (defined at <ipython-input-115-79c9a84ee89a>:8) ]] [Op:__inference_distributed_function_874]
 
Function call stack:
distributed_function

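Sometimes it crashes; sometimes it works but takes just as long as on the CPU. Sometimes even the CPU version takes 20 seconds per epoch, and at other times 40 seconds. The code is unchanged; the only difference is that I restarted the kernel in between. I really don't understand it.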
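One way to narrow this down is to check which devices TensorFlow can see and to log where each op is placed. The following is a minimal sketch, assuming TensorFlow 2.1 or later; the helper name describe_devices is made up for illustration, and the import is guarded only so the snippet also runs where TensorFlow is not installed.

```python
def describe_devices():
    """Return the names of the GPUs TensorFlow can see (empty if none, or if TF is missing)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Ask TensorFlow to log the device every op runs on. This should be
    # called at program start, before any op or model is created.
    tf.debugging.set_log_device_placement(True)
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


print(describe_devices())
```

With placement logging enabled, a later model.fit call should print lines showing whether ops such as MatMul are placed on GPU:0 or CPU:0.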

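When I test my GPU and conda environment with the following code, everything seems to work fine and reproducibly, and the GPU is about 20 times faster than the CPU: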

# https://gist.github.com/ikarus-999/1a845437b454cdfcc1eb5455d373fe63
import sys
import numpy as np
import tensorflow.compat.v1 as tf # compatibility for TF 1 code
from datetime import datetime
 
def test_device(device_name: str):
    shape = (10000, 10000)
    startTime = datetime.now()
    with tf.device(device_name):
        random_matrix = tf.random.uniform(shape=shape, minval=0, maxval=1)
        dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
        sum_operation = tf.reduce_sum(dot_operation)

    result = sum_operation

    print("Shape:", shape, "Device:", device_name)
    print("-" * 50)
    print(result)
    print("Time taken:", datetime.now() - startTime)
    print("\n" * 2)
    
test_device("/cpu:0") # 6 sec
test_device("/gpu:0") # 0.3 sec

So I am sure I am doing something wrong.
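The test above times a single run per device, so any one-time start-up cost ends up in the measurement. A small timing helper with a warm-up run can make such comparisons more stable. This is only a sketch: time_call and its parameters are hypothetical names, and the workload in the usage line is a stand-in for a real TensorFlow call.

```python
import time


def time_call(fn, warmup=1, repeats=3):
    """Run fn a few times and return the best wall-clock duration in seconds."""
    # Warm-up runs are not timed; they absorb one-time initialisation cost.
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations)


# Stand-in workload; in practice fn would wrap the TensorFlow computation.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best of 3: {elapsed:.4f} s")
```

When timing TensorFlow code this way, fn should force the result (for example by converting it to a NumPy value) so that the work has actually finished when the clock stops.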

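TL;DR:

What is the correct way to call model.fit on the GPU?

How can different runs of unchanged code lead to such different results (crashes, widely varying computation times)?

Thank you very much for your help!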

Solution

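After a lot of trial and error, I finally found a working way to force either the CPU or "mixed" operation. GPU only, however, does not seem to work. The with tf.device() approach from my original post seems to have no effect in this case. To use only the CPU, I have to hide the GPU (TensorFlow 2.1.0):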

CPU only

# force CPU (make CPU visible)
cpus = tf.config.experimental.list_physical_devices('CPU')
print(cpus)
tf.config.set_visible_devices([], 'GPU')  # hide the GPU
tf.config.set_visible_devices(cpus[0], 'CPU')  # unhide potentially hidden CPU
tf.config.get_visible_devices()

model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
          )

This takes 3-4 seconds per epoch and puts no load on the GPU.

GPU only

Restart the kernel, then:

# force GPU (make GPU visible)
# note: does not work without restarting the kernel, otherwise:
# "Visible devices cannot be modified after being initialized"
gpus = tf.config.experimental.list_physical_devices('GPU')
print(gpus)
tf.config.set_visible_devices([], 'CPU')  # hide the CPU
tf.config.set_visible_devices(gpus[0], 'GPU')  # unhide potentially hidden GPU
tf.config.get_visible_devices()

model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
          )

This does not work, because the model apparently needs the CPU:

"NotFoundError: No CPU devices are available in this process"

Default (mix of CPU and GPU)

Restart the kernel, then:

model.fit(x=X_train,y=y_train,epochs=25,batch_size=256,validation_data=(X_test,y_test),verbose=1
          )

This takes 5-6 seconds per epoch, consumes all of the GPU's RAM, and uses only a small amount of the GPU's processing power.

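If the default mode (CPU and GPU) raises the following error, the GPU is already in use by another process, and restarting Windows helps: "InternalError: Blas GEMM launch failed"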
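By default TensorFlow reserves nearly all GPU memory when it first initialises the device, which matches the full-RAM usage seen here. Enabling memory growth makes it allocate memory on demand instead. Whether that also avoids the Blas GEMM failure on this machine is an assumption that has not been verified here. A minimal sketch, assuming TensorFlow 2.1 or later (enable_memory_growth is a made-up helper name, and the import is guarded only so the snippet runs without TensorFlow):

```python
def enable_memory_growth():
    """Turn on on-demand GPU memory allocation; return the GPUs that were configured."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    configured = []
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            # Must be called before the GPU has been initialised.
            tf.config.experimental.set_memory_growth(gpu, True)
            configured.append(gpu.name)
        except RuntimeError as err:
            # Raised when the device was already initialised in this process.
            print("Could not set memory growth:", err)
    return configured


print(enable_memory_growth())
```

Like the visible-device settings, this has to run right after a kernel restart, before any model or tensor is created.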

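Several mysteries remain for me:

  • Why is "mixed" mode slower than CPU only?
  • Is it possible to change the visible devices without restarting the kernel, to avoid the following error? "Visible devices cannot be modified after being initialized"
  • Why does the with tf.device() approach have no effect for this model, while it works for the test_device() code?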
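On the second point, another way to hide the GPU is the CUDA_VISIBLE_DEVICES environment variable, which CUDA reads when it starts up. It does not remove the need for a restart, since it also has to be set before TensorFlow is imported, but it keeps the choice of device out of the TensorFlow configuration calls. A minimal sketch (hide_gpus_from_cuda is a hypothetical helper name; the TensorFlow import is guarded only so the snippet runs without it):

```python
import os


def hide_gpus_from_cuda():
    """Hide all CUDA devices from libraries imported later in this process."""
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Call this before `import tensorflow`; it has no effect on an already
# initialised runtime.
hide_gpus_from_cuda()

try:
    import tensorflow as tf
    # Lists the GPUs TensorFlow can still see after the variable was set.
    print(tf.config.list_physical_devices("GPU"))
except ImportError:
    print("TensorFlow is not installed; only the environment variable was set.")
```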

If anyone can offer some insight, I would be very grateful :)