DataLossError (see above for traceback): corrupted record at 12 #13463
Comments
@huangrandong is this problem repeatable, or did it happen just one time?
@cy89, thank you for your response. This problem has happened many times; it comes up whenever I run my program, but I cannot reproduce it in a repeatable way. The cause may be my machine's configuration: the same program runs on another machine without showing the error.
Can you post a small example that will cause the `DataLossError`?
@reedwm My code writes a NumPy array into a TFRecord file and reads it back from the same file. The function that creates the TFRecord file builds the filename with `img_tfrecord_name = image_base_name + ".tfrecord"`; the reading side defines `_parse_function_for_train(example_proto)` and `CreateTrainDataset()`, and then calls `batched_train_dataset = CreateTrainDataset()`.
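For orientation only, here is a minimal, self-contained sketch of the kind of write/read pipeline described above. It is a reconstruction, not the original snippet: the feature names, the image and label shapes (taken from the error log below), the filenames, and the batch size are assumptions, and it assumes a TF 1.x release where `tf.data.TFRecordDataset` is available.

```python
import numpy as np
import tensorflow as tf

def write_example(image_base_name, image, label):
    # Serialize one image/label pair into its own TFRecord file.
    img_tfrecord_name = image_base_name + ".tfrecord"
    with tf.python_io.TFRecordWriter(img_tfrecord_name) as writer:
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image.tostring()])),
            "label": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[label.tostring()])),
        }))
        writer.write(example.SerializeToString())

def _parse_function_for_train(example_proto):
    # Parse a serialized Example back into dense image/label tensors.
    features = {
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_single_example(example_proto, features)
    image = tf.reshape(tf.decode_raw(parsed["image"], tf.uint8), [512, 512, 3])
    label = tf.reshape(tf.decode_raw(parsed["label"], tf.float32), [128, 128, 9])
    return image, label

def CreateTrainDataset():
    dataset = tf.data.TFRecordDataset(["image_0.tfrecord"])  # placeholder file list
    dataset = dataset.map(_parse_function_for_train)
    return dataset.batch(8)  # placeholder batch size

write_example("image_0",
              np.zeros([512, 512, 3], np.uint8),
              np.zeros([128, 128, 9], np.float32))
batched_train_dataset = CreateTrainDataset()
```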
@huangrandong can you post a complete, self-contained example I can copy to a text file and run? The code above leaves several variables undefined. @saxenasaurabh @vrv, any idea what the problem could be?
@reedwm you can define the variables that the code doesn't define yourself. The code writes a NumPy image array and a label array into the TFRecord file, and then reads the two arrays back from it.
It's much easier to quickly reproduce these issues if I have a self-contained example without having to define variables. Perhaps the issue only occurs for certain values of those variables.
I also had reports of this error, which appears to occur randomly during training. It happened on multiple occasions and with different reported offsets (see OpenNMT/OpenNMT-tf#19). To investigate the issue, I wrote a small script that repeatedly loops over the same TFRecord dataset that threw the error and applies the same processing as done during training. However, I was not able to reproduce it, indicating that no records are corrupted in the file and something else is going on during training. Any pointers to better investigate the issue would be appreciated.
Same problem here. For several different sets of TFRecord files we get this error at random times during training.
I have reproduced the error at the same record location across three runs. The first and third runs hit the error in the middle of 'Filling up shuffle buffer', and the second hit it at the beginning of that phase. In my case, the error looks closely related to the shuffle-buffer process, although changing the buffer size didn't help. I hope this is helpful for debugging.
Allow me to further complicate matters (although I am not 100% sure it is the same issue). I have some custom data and know that the TFRecord file is not corrupt, because I've iterated over it successfully before, using the same code. Now I've encountered the same situation that homink described. Assuming that it is related, is there any caching involved when reading the .tfrecord file, either from TensorFlow, Python, or the OS? (I am currently running it on Win10.)
@FirefoxMetzger I am having this issue too, so I tried restarting my machine, as you did, and it did not fix the problem. I'm using Ubuntu 16.04.
/CC @mrry @saxenasaurabh, any ideas what the issue could be? This is hard to debug without a small example that reproduces the issue.
AFAICT, this problem only affects ZLIB-compressed TFRecord files (because that is the sole source of this error message in the record reader). /CC @saxenasaurabh @rohan100jain, who last touched the ZLIB-related code in that file.
I can confirm that the issue was encountered without any compression configured, unless compression is the default (which it is not, AFAIK).
Pardon my mistake: indeed there are other code paths that can print that message, and each of them is related to a CRC mismatch.
Any more thoughts on this? It's a big issue for me, but I don't know where to start debugging. Each time I reprocess my data, the errors appear in different locations. Sometimes it takes a couple of training epochs to occur.
/sub This is happening to us as well, any ideas? Edit to add: we are using zlib compression, reading a bunch of files off GCS with interleave and shuffling them into one large Dataset; as a result, there's no way to catch the error and carry on. Is it possible this is some GCS transient? I'm also having trouble reproducing it with the same data.
Does the `tf.data.experimental.ignore_errors()` transformation help in your case?
@guillaumekln thanks for the pointer to `ignore_errors()`. How does it handle the end of the dataset? Does it also swallow the `OutOfRangeError`?
I think it does.
Not so sure about this. The following snippet does raise the `OutOfRangeError`:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.apply(tf.data.experimental.ignore_errors())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    while True:
        print(sess.run(next_element))
```
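For the corrupted-record case discussed in this issue, the same transformation can be applied to a `TFRecordDataset`, so that records failing the CRC check are skipped instead of aborting training. A minimal sketch with a placeholder filename (note that `ignore_errors` also hides any other error raised while producing an element):

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train.tfrecord"])  # placeholder filename
dataset = dataset.apply(tf.data.experimental.ignore_errors())
iterator = dataset.make_one_shot_iterator()
next_record = iterator.get_next()

with tf.Session() as sess:
    while True:
        try:
            sess.run(next_record)
        except tf.errors.OutOfRangeError:
            break  # end of data; corrupted records were silently dropped
```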
@guillaumekln You're right, what I'm seeing may not have to do with this after all. UPDATE: #25700 (comment)
Don't use `org_train_image_resize_raw = org_train_image_resize.tostring()`.
Why does this modification solve the problem?
In my case, I solved this problem in this way:
I ran into this problem once today while using the GPU runtime on Google Colab, and I fixed it by simply restarting my notebook. Here is the address of my notebook.
I had this issue. After executing `sudo sh -c "sync; echo 1 > /proc/sys/vm/drop_caches"` or restarting my machine, the problem was fixed temporarily and I could run a few epochs, but the issue reappeared some time later. Finally, I suspected it might be related to my RAM and used memtest86 (https://www.youtube.com/watch?v=9_xFNojChNA) to test each module. It turned out that one of my RAM modules was faulty. I have never had this problem again since removing the faulty RAM.
I just restarted my computer and it works. Don't know what the problem is; it could be a memory issue.
Fixed by increasing the number of TFRecord shards. To check the TFRecord files:
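The check itself was not captured above. A minimal sketch of such a check, assuming uncompressed files, the TF 1.x `tf.python_io.tf_record_iterator` API, and placeholder shard names:

```python
import tensorflow as tf

def check_tfrecords(filenames):
    """Read every record in every file so that any CRC mismatch surfaces here."""
    total = 0
    for filename in filenames:
        try:
            for _ in tf.python_io.tf_record_iterator(filename):
                total += 1
        except tf.errors.DataLossError as e:
            print("Corrupted file %s: %s" % (filename, e))
    print("Read %d records." % total)

check_tfrecords(["train-00000-of-00010.tfrecord"])  # placeholder shard name
```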
I encountered this problem on macOS. The cause was a .DS_Store file in my directory: my code only needs the record files, so once I filtered out the .DS_Store file, it ran successfully.
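A minimal sketch of that kind of filtering (the directory path and the `.tfrecord` extension are assumptions):

```python
import glob
import os

# Only pick up actual record files, so stray files such as .DS_Store
# are never handed to the TFRecord reader.
data_dir = "/path/to/records"  # placeholder directory
filenames = sorted(glob.glob(os.path.join(data_dir, "*.tfrecord")))
```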
I encountered a case where the TFRecords were occasionally genuinely corrupt (verified with the check code posted above) when producing them in parallel, but not when producing them in a single process. It turned out that the list of TFRecords to be produced was not unique, which would occasionally make two processes write to the same file at the same time, causing corruption. So if you encounter this issue and you are using parallelism, double check that your dataset doesn't contain duplicate items.
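A quick way to guard against that, sketched here with a hypothetical `output_paths` list, is to check the shard list for duplicates before launching the parallel writers:

```python
from collections import Counter

# Hypothetical list of shard paths handed out to parallel writer processes.
output_paths = ["shard-0000.tfrecord", "shard-0001.tfrecord", "shard-0000.tfrecord"]

# Any path listed more than once risks two workers writing the same file
# at the same time, which corrupts it.
duplicates = [p for p, n in Counter(output_paths).items() if n > 1]
if duplicates:
    print("Duplicate TFRecord output paths:", duplicates)
```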
I also encountered this issue.
I was getting `DataLossError (see above for traceback): Attempted to pad to a smaller size than the input element.` I tried everything suggested in this thread, but none of it worked. First of all, iterating over the TFRecord files did not detect any corrupt record, so the records were not corrupt, yet I was still getting this error. After multiple trials and errors I found that my training set had up to 324 boxes for a few images, so all I had to do was update the max-box parameter used for training. My check was a function `val_fun1(filenames)` that built a dataset, created a one-shot iterator with `dataset.make_one_shot_iterator()`, and iterated over it inside a `tf.Session` to read the records. Thanks... Pankaj
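Only fragments of that validation function survived above; a hedged reconstruction of what such a check might look like (the dataset construction and error handling are assumptions):

```python
import tensorflow as tf

def val_fun1(filenames):
    # Build the same kind of input pipeline as training and read every
    # record once, so a corrupted record fails here instead of mid-training.
    dataset = tf.data.TFRecordDataset(filenames)
    iterator = dataset.make_one_shot_iterator()
    next_record = iterator.get_next()
    count = 0
    with tf.Session() as sess:
        while True:
            try:
                sess.run(next_record)
                count += 1
            except tf.errors.OutOfRangeError:
                break
            except tf.errors.DataLossError as e:
                print("Corrupted record after %d good records: %s" % (count, e))
                break
    print("Checked %d records." % count)
```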
Hello my friend, I am new to Python. Could you help me with how to run this script?
I was facing the same problem, but it happened after training for several epochs.
I encountered the same problem. `sync; echo 1 > /proc/sys/vm/drop_caches` works for me.
Thank you, your answer resolved a problem that had bothered me for three hours.
I had a similar problem while preparing my data as gzip files to later use them for training a BERT model.
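One common cause of this error with compressed files is a mismatch between how the records were written and how they are read. A minimal sketch, assuming GZIP compression and TF 1.x APIs, with placeholder filenames:

```python
import tensorflow as tf

# Write with GZIP compression.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
with tf.python_io.TFRecordWriter("data.tfrecord.gz", options=options) as writer:
    writer.write(b"example payload")

# Read the same file back: compression_type must match what was used when
# writing, otherwise the reader sees garbage and reports a corrupted record.
dataset = tf.data.TFRecordDataset(["data.tfrecord.gz"], compression_type="GZIP")
```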
Where should I add this code?
I have a big problem. I use a TFRecord file to import data into my TensorFlow program, but after the program has run for a while, it raises a DataLossError:
System information
OS Platform and Distribution : Linux Ubuntu 14.04
TensorFlow installed from : Anaconda
TensorFlow version : 1.3.0
Python version: 2.7.13
CUDA/cuDNN version: 8.0 / 6.0
GPU model and memory: Pascal TITAN X
Describe the problem
2017-10-03 19:45:43.854601: W tensorflow/core/framework/op_kernel.cc:1192] Data loss: corrupted record at 12
Traceback (most recent call last):
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 312, in Training
feed_dict={learning_rate: lr})
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]
Caused by op u'IteratorGetNext', defined at:
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 251, in Training
batch_image, batch_label = iterator.get_next()
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 304, in get_next
name=name))
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 379, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
DataLossError (see above for traceback): corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]
Thanks to anyone who can answer this question.