operation – the area under the curve for those findings was near perfect, while for the diseases it was, of course, slightly lower. Jonathan was surprised to learn just how high the disagreement between radiologists is. He points out that there are papers in the literature stating this explicitly, so it was no surprise to his colleagues, but it was to him. It is not rare for a finding to have only 60-70% agreement between two expert radiologists. Even if they employed an additional in-house radiologist to tag the images, the labels would still be noisy, because another radiologist would disagree around 30% of the time. Basically, X-ray is not clear enough to give an unequivocal diagnosis. It is used mostly for screening rather than for diagnostics: you can see that something may be wrong, even if experts disagree on what exactly it is, and that is enough information to decide whether to keep someone in the hospital or send them for another review. In CT, you would not see such a high level of disagreement between experts.

Jonathan adds: "We have a debate in our company about whether we should use large datasets with noisy labels, or smaller datasets with near-perfect labels, where three radiologists look at every image and tell you what the correct finding is. What we figured out in this work is that we don't really have to send our studies to our own radiologists for tagging, because a radiologist has already viewed every one of our studies: it's in the text of the report, and the radiologist who wrote the report is no better or worse than our in-house radiologists. We can just take the report's opinion as the input data. The value we get from having all of our million studies already labelled far outweighs the noise, both from the radiologists' disagreement and from the algorithm that reads the report."

When they looked at the reports, they figured out that the basic unit of analysis is the sentence. When they ran statistics on the reports, they found that some sentences, both positive and negative ones, appeared tens of thousands of times in the data. How did they cover millions of reports? Jonathan explains that they took the 20,000 most prevalent sentences and had humans (medical students) tag those sentences. Every sentence that they
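To make that frequency-based selection step concrete, here is a minimal sketch, assuming the reports are available as plain-text strings; the function name and the naive period-based sentence splitting are illustrative assumptions, not the team's actual pipeline:

    from collections import Counter

    def most_prevalent_sentences(reports, top_k=20_000):
        """Return the top_k most frequent sentences across all reports,
        i.e. the ones worth sending to human taggers first."""
        counts = Counter()
        for report in reports:
            # Naive split on periods; a real pipeline would use a proper
            # sentence tokenizer.
            for sentence in report.split("."):
                # Normalise whitespace and case so repeated boilerplate
                # sentences collapse into one counter entry.
                sentence = " ".join(sentence.split()).lower()
                if sentence:
                    counts[sentence] += 1
        return [sentence for sentence, _ in counts.most_common(top_k)]

    # Example: the few boilerplate sentences dominating the corpus surface first.
    reports = [
        "No focal consolidation. Heart size is normal.",
        "Heart size is normal. No pleural effusion.",
        "No focal consolidation. No pleural effusion.",
    ]
    print(most_prevalent_sentences(reports, top_k=3))

Because report language is highly formulaic, tagging only the few most common sentences can cover a large share of all sentence occurrences in the corpus, which is what makes the approach scale to millions of reports.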