Test second shell script MCMC_800_1s-1.sh #5

Closed
gcapes opened this issue Apr 3, 2024 · 27 comments

@gcapes
Collaborator

gcapes commented Apr 3, 2024

Check that I can get this script to run on CSF3.

@gcapes gcapes self-assigned this Apr 3, 2024
@gcapes gcapes changed the title Test second shell script MCMC_800_1s-1.txt Test second shell script MCMC_800_1s-1.sh Apr 4, 2024
@gcapes
Collaborator Author

gcapes commented Apr 4, 2024

I get these messages from the imports section of the Python script:

  import pandas as pd
2024-04-04 10:06:08.867980: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-04 10:06:11.265771: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-04 10:06:11.267078: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-04 10:06:45.939341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

@RoryAtBar Have you encountered this previously?

I've made sure to install the correct versions of tensorflow and tensorflow-probability, which I've now added to a requirements file and documented in the CSF setup file.

The versions of packages I have from pip freeze are here:

absl-py==2.1.0
arviz==0.17.0
astunparse==1.6.3
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
check-shapes==1.1.1
cloudpickle==3.0.0
cons==0.4.6
contourpy==1.2.0
cycler==0.12.1
decorator==5.1.1
Deprecated==1.2.14
dm-tree==0.1.8
dropstackframe==0.1.0
etuples==0.3.9
fastprogress==1.0.3
filelock==3.13.1
flatbuffers==23.5.26
fonttools==4.47.2
gast==0.4.0
google-auth==2.27.0
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
gpflow==2.9.0
grpcio==1.60.0
h5netcdf==1.3.0
h5py==3.10.0
idna==3.6
jax==0.4.24
keras==2.12.0
kiwisolver==1.4.5
lark==1.1.9
libclang==16.0.6
logical-unification==0.4.6
Markdown==3.5.2
MarkupSafe==2.1.4
matplotlib==3.8.2
miniKanren==1.0.3
ml-dtypes==0.2.0
multipledispatch==1.0.0
numpy==1.24.3
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==23.2
pandas==2.2.0
pillow==10.2.0
protobuf==4.23.4
pyasn1==0.5.1
pyasn1-modules==0.3.0
pymc==5.10.3
pyparsing==3.1.1
pytensor==2.18.6
python-dateutil==2.8.2
pytz==2023.4
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.12.0
six==1.16.0
tabulate==0.9.0
tensorboard==2.12.3
tensorboard-data-server==0.7.2
tensorflow==2.12.1
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-probability==0.20.1
termcolor==2.4.0
toolz==0.12.1
typing_extensions==4.5.0
tzdata==2023.4
urllib3==2.2.0
Werkzeug==3.0.1
wrapt==1.14.1
xarray==2024.1.1
xarray-einstats==0.7.0

@gcapes
Collaborator Author

gcapes commented Apr 16, 2024

Rory suggested trying gpflow <= 2.5.2

@gcapes
Collaborator Author

gcapes commented Apr 25, 2024

I have resubmitted with gpflow==2.5.2 and it looks to be running so far...

@gcapes
Collaborator Author

gcapes commented Apr 26, 2024

OK, so I get what looks to be sensible output, but also this error. Should I be concerned, and do you know how I can fix it? @RoryAtBar

  import pandas as pd
2024-04-25 09:17:55.314937: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 09:17:58.346260: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 09:17:58.347395: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 09:18:38.423735: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-25 10:17:51.302837: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
Traceback (most recent call last):
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py", line 432, in <module>
    idata = pm.sample(tune=10000, draws=20000, step=step,cores=1, chains=5)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/.venv/lib/python3.11/site-packages/pymc/sampling/mcmc.py", line 744, in sample
    model.check_start_vals(ip)
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/.venv/lib/python3.11/site-packages/pymc/model/core.py", line 1660, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'Friction_interval__': array(0.), 'Conductance_interval__': array(0.)}

Logp initial evaluation results:
{'Friction': -1.39, 'Conductance': -1.39, 'likelihood': nan}
You can call `model.debug()` for more details.

@RoryAtBar
Owner

I'm not sure about the GPU messages, but they don't sound like a major problem.

Yes, I have encountered the issue with the initial evaluation results before. The problem is essentially that the likelihood function is somehow mis-specified and is giving spurious results, so the chains are being initialised outside of what should be allowed by the prior probability distribution (which is specified in the pm.Model() context manager).

The likelihood function uses the Gaussian process model, so there could be something wrong with the GP. Does the script plot the fit of the GP? If the GP looks OK, then I'll need to plot out some of the values of the likelihood function.

It might be worth me having a play with the script. I can have a look early next week.
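
For reference, the log above shows a failed Cholesky decomposition shortly before the likelihood evaluates to nan. One possible cause (an assumption, not something confirmed in this thread) is a kernel matrix that is singular or nearly so, for example because two training inputs coincide. The dependency-free sketch below shows the usual workaround of adding a small "jitter" to the diagonal. The function names and jitter values are illustrative only and are not taken from the project script.

```python
import math


def cholesky(a):
    """Return lower-triangular L with L L^T = a; raise ValueError if a is not positive definite."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                # "not s > 0" also catches NaN pivots.
                if not s > 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(s)
            else:
                lower[i][j] = s / lower[j][j]
    return lower


def cholesky_with_jitter(a, jitters=(0.0, 1e-10, 1e-8, 1e-6, 1e-4)):
    """Retry the factorisation with increasing diagonal jitter; return (L, jitter used)."""
    for jitter in jitters:
        shifted = [
            [value + (jitter if i == j else 0.0) for j, value in enumerate(row)]
            for i, row in enumerate(a)
        ]
        try:
            return cholesky(shifted), jitter
        except ValueError:
            continue  # still not positive definite: try a larger jitter
    raise ValueError("no jitter value made the matrix positive definite")


# Two identical inputs give two identical rows, so this kernel matrix is singular.
K = [[1.0, 1.0], [1.0, 1.0]]
lower, used_jitter = cholesky_with_jitter(K)
print("jitter needed:", used_jitter)
```

If the jitter needed is large compared with the kernel variance, that points to a problem with the data or the kernel rather than to rounding error.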

@gcapes
Collaborator Author

gcapes commented May 9, 2024

Hi Rory, you asked on Slack:

I think the script plots the Gaussian process against the FEM data. Did the script produce a JPEG file that shows blue lines running through black dots?

Not that I can see - I guess this means it's an important error :)

@gcapes
Collaborator Author

gcapes commented May 17, 2024

Hi @RoryAtBar
Did you manage to have a look at this?

@RoryAtBar
Owner

RoryAtBar commented May 17, 2024 via email

Hi Gerard,

I have added a solution to an extra branch (gp_kernel_tester) which trains GP models of increasing flexibility until one works. It's crude and not scientifically rigorous, but it is adequate for this specific problem. It might need to be changed at a later date if a more general solution is needed.
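
As a rough illustration of the "try candidates until one works" idea (a sketch only, not the code in gp_kernel_tester), the loop below fits candidates in order and returns the first whose fit neither raises nor produces a non-finite score. The `fit` callback, the candidate ordering, and the scores are all hypothetical stand-ins for real GP training.

```python
import math


def first_working_model(candidates, fit):
    """Fit candidates in order; return (name, model, score) for the first usable one."""
    for name in candidates:
        try:
            model, score = fit(name)
        except (ValueError, ArithmeticError):
            continue  # numerical failure: move on to the next candidate
        if math.isfinite(score):
            return name, model, score
    raise RuntimeError("no candidate produced a finite score")


def fake_fit(name):
    """Hypothetical fit: the first candidate raises, the second returns NaN, the third works."""
    if name == "Matern12":
        raise ValueError("Cholesky decomposition failed")
    scores = {"Matern32": float("nan"), "Matern52": -12.5}
    return object(), scores[name]


name, model, score = first_working_model(["Matern12", "Matern32", "Matern52"], fake_fit)
print(name, score)
```

Stopping at the first success is what makes this approach crude: a later candidate might fit better, which the loop never finds out.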

Seems to be working for now.

@gcapes
Collaborator Author

gcapes commented May 31, 2024

Just submitted a job using this new script.

@gcapes
Collaborator Author

gcapes commented May 31, 2024

AttributeError: module 'gpflow.models' has no attribute 'Matern52'
@RoryAtBar any ideas on this one?

@RoryAtBar
Owner

RoryAtBar commented May 31, 2024 via email

This is from MCMC_800C_1s-1.py, right?

The error makes it sound like somewhere in the code there is a line that says:
gpflow.models.Matern52()

If that was the case, then the fix is to change this line to
gpflow.models.GPR()
and make sure that the kernel is specified correctly i.e.

kernel=gpflow.kernels.Matern52()

so that the full call reads:

model = gpflow.models.GPR(
    (X_normed, Y[cond_filter, None]),
    kernel=gpflow.kernels.Matern52(
        np.shape(X_normed)[-1], lengthscales=np.ones(np.shape(X_normed)[-1])
    ),
)

I had this error previously: when creating the gp_kernel_tester branch I put that line in by mistake, then fixed it. When you sent this error, I presumed I had simply forgotten to push the fix to GitHub. However, I can't find the error in the code. Would you be able to direct me to it?

@gcapes
Collaborator Author

gcapes commented Jun 4, 2024

Looks like you found it :)
With the changes you made in 943fcc4 and e407eb9 this script now looks to be running ok.

@gcapes
Collaborator Author

gcapes commented Jun 5, 2024

@RoryAtBar
Could you take a quick look at these logs and confirm whether they're as expected?

$ cat MCMC_800C_1s-1.sh.e4990184 
mkdir: cannot create directory ‘/mnt/iusers01/support/mbexegc2/scratch/MCMC_GPsurrgt_800C_1s-1_cond0-1500_20000_chain’: File exists
/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py:10: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd
2024-06-04 09:15:26.741914: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-04 09:15:29.923159: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-04 09:15:29.924662: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-04 09:16:18.024203: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-04 09:44:59.306185: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
2024-06-04 09:45:06.553742: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py:465: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  Force_at_800C_1s[n] = dat[:,1][abs((dat[:,0]+x_correction)-step)==min(abs((dat[:,0]+x_correction)-step))]
Sequential sampling (5 chains in 1 job)
CompoundStep
>Metropolis: [Friction]
>Metropolis: [Conductance]
Sampling 5 chains for 10_000 tune and 20_000 draw iterations (50_000 + 100_000 draws total) took 23522 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

The output log file looks OK, except that the progress bar lines have old-style Mac (CR) line endings, which don't match the rest of the file. Do you know which part of the code generates these?

MCMC_800C_1s-1.sh.o4990184: |█████████████████████████████| 100.00% [30000/30000 1:16:54<00:00 Sampling chain 4, 0 divergences]
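
A likely source (an assumption, since the thread does not confirm it) is the sampling progress bar, which redraws itself in place by writing a carriage return rather than a newline. Passing `progressbar=False` to `pm.sample` should suppress it. Alternatively, the log can be cleaned afterwards. The sketch below keeps only the last redraw of each line; the function name and the sample text are made up for illustration.

```python
def collapse_carriage_returns(text):
    """Keep only the final carriage-return-separated segment of each line."""
    cleaned = []
    for line in text.split("\n"):
        # Earlier segments were overwritten by later redraws of the same line.
        segments = [segment for segment in line.split("\r") if segment]
        cleaned.append(segments[-1] if segments else "")
    return "\n".join(cleaned)


raw = "start\n 10% [3000/30000]\r 50% [15000/30000]\r100% [30000/30000]\nend\n"
print(collapse_carriage_returns(raw))
```

This is a simplification of what a terminal does (it ignores the case where a shorter redraw only partly overwrites a longer one), but it is enough to make a batch log readable.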

@RoryAtBar
Owner

RoryAtBar commented Jun 5, 2024 via email

Very sorry for the slow response.

The results look OK. The actual values are a bit odd, possibly because this was tested with a limited set of data (conductance limited to 1500).

I'm not getting the issue with the limited effective sample size. Maybe I'm using a different set of input data from you? All I have done is use the scripts currently in the main branch.
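
For context on the rhat warning in the log, here is a minimal sketch of the classic Gelman-Rubin statistic, which compares the variance between chains with the variance within them. This is the older, simpler form, not the rank-normalised split version described in the paper the warning links to, so the numbers will not match the library's output exactly. The example chains are invented.

```python
import math
from statistics import mean, variance


def gelman_rubin_rhat(chains):
    """Classic (non-split, non-rank-normalised) R-hat for equal-length chains."""
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")
    within = mean(variance(chain) for chain in chains)          # W
    between = n * variance([mean(chain) for chain in chains])   # B
    pooled = (n - 1) / n * within + between / n                 # estimated posterior variance
    return math.sqrt(pooled / within)


# Chains exploring the same region versus chains stuck in different regions.
mixed = gelman_rubin_rhat([[0, 1, 2, 3], [1, 0, 3, 2]])
stuck = gelman_rubin_rhat([[0, 1, 2, 3], [10, 11, 12, 13]])
print(mixed, stuck)
```

Values well above 1 mean the chains disagree about where the posterior mass is, which is consistent with the suggestion that different input data or a different starting point could change whether the warning appears.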

@gcapes
Collaborator Author

gcapes commented Jun 14, 2024

Hi Rory,

That's encouraging. It's been a while since I last looked at this but I think I was using the gp_kernel_tester branch.

@RoryAtBar
Owner

RoryAtBar commented Jun 14, 2024 via email

@gcapes
Collaborator Author

gcapes commented Jun 14, 2024

Might be different versions of the libraries, perhaps?
I'll try to have another look at this next week when I've got back up to speed with things :)

@gcapes
Collaborator Author

gcapes commented Jul 25, 2024

I'll re-run this next week to see if I still get the error. Rory said there's a bit of randomness involved and I might have got a bad seed. It can be set up to restart if it fails, but currently isn't.
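
A restart-on-failure wrapper could look something like the sketch below. It is only an outline under stated assumptions: `run` stands for whatever function builds the model and calls `pm.sample` (which accepts a `random_seed` argument), and `SamplingFailed` is a hypothetical stand-in for `pymc.exceptions.SamplingError` so that the example runs without PyMC installed.

```python
class SamplingFailed(RuntimeError):
    """Hypothetical stand-in for pymc.exceptions.SamplingError."""


def run_with_retries(run, seeds):
    """Call run(seed) for each seed in turn; return (seed, result) on the first success."""
    failures = []
    for seed in seeds:
        try:
            return seed, run(seed)
        except SamplingFailed as exc:
            failures.append((seed, str(exc)))  # remember why this seed failed
    raise SamplingFailed(f"all attempts failed: {failures}")


def fake_run(seed):
    """Hypothetical sampler: pretend that only seed 3 gives a valid starting point."""
    if seed != 3:
        raise SamplingFailed(f"bad starting point for seed {seed}")
    return f"trace for seed {seed}"


seed, result = run_with_retries(fake_run, seeds=[1, 2, 3, 4])
print(seed, result)
```

Recording the seed that finally worked keeps the run reproducible, which matters if the results are later compared between machines.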

@gcapes
Collaborator Author

gcapes commented Jul 30, 2024

I forgot that this script uses the output from the first one... I was tidying up and deleted the output so I'm running it again before I can run the second script. 🙄

@gcapes
Collaborator Author

gcapes commented Aug 6, 2024

Second script now running using the test-second-step branch, having run the first job using the main branch.

@gcapes
Collaborator Author

gcapes commented Aug 6, 2024

Same error - re-reading some detail, I see this was the wrong branch! Resubmitting on gp_kernel_tester

@gcapes
Collaborator Author

gcapes commented Aug 21, 2024

I think this has run successfully now. Is this image a good indication that the job has run well?

Image

If so I'll move on to trying to re-jig the code into MatFlow

@gcapes gcapes closed this as completed Aug 21, 2024
@RoryAtBar
Owner

Thanks Gerard,

Unfortunately, the image shows an extreme case of overfitting. I have re-jigged the way the Gaussian processes are trained for the part of the project I am currently working through. At the risk of you killing me, can we have a call where I show you how I want it to work?

  1. I want to change the GP from fitting individual data points to fitting basis functions using scikit-fda.
  2. Randomly separate the data into training and validation sets, and test the fit on the validation set (about 20% of the samples would be used not for conditioning the GP, but for checking that the predicted values fit correctly).
  3. Automatically check which of four kernels fits best, rather than picking the first one that fits at all.

The MCMC step in that script then needs a small modification to adapt to the above changes.
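
Points 2 and 3 can be sketched roughly as follows. This is an outline under assumptions, not the planned implementation: the four kernel names, the RMSE criterion, and the `predict_validation` callback are all placeholders, and the predictions are invented numbers standing in for trained GPs.

```python
import math
import random


def split_indices(n_samples, validation_fraction=0.2, seed=0):
    """Randomly split range(n_samples) into (train, validation) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = max(1, round(n_samples * validation_fraction))
    return sorted(indices[n_validation:]), sorted(indices[:n_validation])


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def best_kernel(kernel_names, predict_validation, observed):
    """Score every candidate on the validation data and return the lowest-RMSE one."""
    scores = {name: rmse(predict_validation(name), observed) for name in kernel_names}
    return min(scores, key=scores.get), scores


train, validation = split_indices(10)

# Hypothetical validation predictions standing in for four trained GPs.
observed = [1.0, 2.0]
fake_predictions = {
    "Matern12": [1.5, 2.5],
    "Matern32": [1.1, 2.1],
    "Matern52": [1.0, 2.05],
    "SquaredExponential": [0.0, 0.0],
}
winner, scores = best_kernel(list(fake_predictions), fake_predictions.__getitem__, observed)
print(winner)
```

Because every candidate is scored on data the GP was not conditioned on, an overfitted model should show up as a poor validation score instead of being accepted because it trained without error.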

@gcapes
Collaborator Author

gcapes commented Aug 21, 2024

Sure - I could do tomorrow or Friday?
