Computer Vision News - September 2024

Cover highlights: Best Student Paper at UR 2024; Best Art Paper at SIGGRAPH 2024.

UR24 Best Student Paper Award

M2CURL: Sample-Efficient Multimodal Reinforcement Learning via Self-Supervised Representation Learning for Robotic Manipulation

Fotios Lygerakis is a third-year PhD student and University Assistant at the University of Leoben in Austria. He speaks to us fresh from winning the Best Student Paper Award at the International Conference on Ubiquitous Robots (UR2024).

In this paper, Fotios and his fellow researchers tackle an important robotics problem – how to effectively fuse different sensing modalities, such as vision and touch, so that a reinforcement learning algorithm can make sense of them and robots can make better decisions during manipulation tasks.

These algorithms are notoriously data-hungry, requiring vast amounts of examples to learn from and make informed decisions. This situation intensifies when dealing with high-dimensional inputs like vision and touch.

“Imagine you have big images that have a lot of pixels, and each pixel can take a lot of values,” Fotios says. “The same goes for tactile sensing. You have very discrete information. It’s also high dimensional, and if you give all this high-dimensional raw data to the algorithm, it will never figure out how to map observation to an actual action.”

To address this, the team focused on a key task: making sense of this information and turning it into a form the learning algorithm can understand. This process involves learning representations, like patterns in the data, that are much more digestible for the algorithm. By distilling the raw data into meaningful patterns, the algorithm can learn more efficiently, needing fewer samples to achieve a higher success rate.

While the concept of this research is not entirely new, what distinguishes it from other works is the innovative way in which it pays attention to each modality during the learning process. “What has been done before is that researchers learned their representation for one modality – for vision, for example – then learned the representation for another modality, and then they just added them together and gave it to the algorithm,” Fotios explains. “In that case, this learning part doesn’t pay attention to what the other modality is doing, and it’s quite important for the algorithm to know how these two modalities look alike in this latent representation space.”

Another innovative aspect of this research is the use of self-supervised representation learning. In scenarios where labeled data is scarce and expensive, self-supervised learning allows the algorithm to learn from vast amounts of unlabeled data. The team combined this with their multimodal learning approach, using both an intra-modal loss (which compares representations within the same modality) and an inter-modal loss (which compares representations across different modalities). This dual attention approach enabled the algorithm to learn better representations that pay attention to both modalities simultaneously, leading to an improved understanding of the environment and, consequently, better decision-making when producing an action.

This work not only addresses a significant challenge in robotic learning but also opens up new avenues for research. “We show there is a lot of work that needs to be done on combining the learning of multiple modalities, especially in the realm of vision and touch,” Fotios points out. “Particularly in robotics, they kind of help each other. For example, vision gives you more global information about the world, but when you have something in your hand, or you’re touching something, most information comes from the tactile sense.”

The team is currently exploring the temporal aspect of these representations – how they change over time and how each modality’s importance shifts depending on the task. For instance, when a human grasps an object, vision is crucial initially, but tactile feedback becomes more important once the object is in hand. Understanding and modeling this temporal shift in importance could lead to even more sophisticated robotic systems.

Winning the Best Student Paper Award at UR2024 is no small feat. What does Fotios believe sets his work apart from the competition? “I think there are two parts to it,” he tells us. “First was the paper itself, the idea and how we implemented it, how important the problem was we were solving, and how novel was the way that we were solving it. Also, how you communicate your work, which is hard for us researchers because we think everybody thinks the way we think, and then we don’t explain why something is important. We did an okay job of explaining what the problem is and how our method actually solves the problem. Second, they told us that the final decision would be based on the presentation, so we really put effort into making our presentation informative and accessible to everyone and highlighting the problem and the novelty in solving it!”

Fotios hails from Thessaloniki, Greece, “a really beautiful city” and the second biggest in Greece. After completing a Master’s in Engineering at the Technical University of Crete, he discovered a passion for research. This passion eventually led him to pursue research jobs and his PhD in robot learning in Austria. “I’m trying to make robots smarter,” he reveals. “I’m trying to make them understand and perceive the world as we humans do, focusing on how you can harvest vision and touch and combine them to help robots learn interesting tasks!”
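For readers who want a concrete picture of the intra-modal and inter-modal losses described in this article, here is a minimal sketch. It assumes a standard contrastive (InfoNCE) objective over embeddings of two augmented views per modality; the article does not spell out the exact loss used in the paper, so the functions `info_nce` and `multimodal_loss`, the temperature and the weights are all illustrative assumptions, not the authors' implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    """Contrastive (InfoNCE) loss: queries[i] should match keys[i] and no other key."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # cross-entropy with the matching index as label
    return total / len(q)


def multimodal_loss(vis_a, vis_b, tac_a, tac_b, w_intra=1.0, w_inter=1.0):
    """Combine intra-modal and inter-modal terms (hypothetical weighting).

    vis_a/vis_b and tac_a/tac_b are embeddings of two augmented views of the
    same batch of visual and tactile observations.
    """
    intra = info_nce(vis_a, vis_b) + info_nce(tac_a, tac_b)  # within each modality
    inter = info_nce(vis_a, tac_b) + info_nce(tac_a, vis_b)  # across modalities
    return w_intra * intra + w_inter * inter
```

The inter-modal terms are what make the vision encoder "pay attention" to touch and vice versa: the loss is low only when the visual and tactile embeddings of the same moment are closer to each other than to embeddings of other moments in the batch.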

SIGGRAPH 2024 Best Art Paper Award

Critical Climate Machine: A Visual and Musical Exploration of Climate Misinformation through Machine Learning

Gaëtan Robillard is a visual artist and visiting professor at the Université Gustave Eiffel and contributes to research with the Canadian research group ARCANES. Jérôme Nika is an electronic musician and a researcher at Ircam. They speak to us fresh from winning the Best Art Paper award at SIGGRAPH 2024 for their multi-sensory work exploring climate misinformation through machine learning.

From left: Dionysios Papanikolaou, Gaëtan Robillard and Jérôme Nika at the Ircam studios. Photography: Gabriel de la Chapelle

Critical Climate Machine is a visual and musical art installation using machine learning to tackle the issue of climate misinformation. The installation combines machine learning algorithms, sculpture, visualization, and sound to create an immersive experience that informs and challenges its audience.

In this paper, Gaëtan and Jérôme present two algorithms from the installation, revealing how they were used to explore misleading claims and how they interact to produce an educative setup for climate discourse, with critical discourse for refuting false claims about climate change. The installation seeks to educate the public to be more responsive to the problem and experiment with new methods of communicating climate discourse, using sound and visuals as remediation tools.

While Gaëtan created the artwork, Jérôme collaborated on the sound design aspect of the project alongside Dionysios Papanikolaou and Tony Houziaux. “This sound design uses a digital instrument that I developed as a researcher and is the instrument that I use myself as an electronic musician,” he tells us. “The idea of this instrument is to feed the machine with memories – a database where we pick and retrieve sound slices – and fit it with scenarios – temporal guides that guide the machine when generating new sounds. The idea here was to hybridize some things Gaëtan recorded and to articulate: What are the guides? What are the memories?”

The installation itself is divided into three parts set in a large room. As visitors enter, they encounter a data sculpture on the floor. It processes and visualizes X (formerly Twitter) data to classify and quantify climate-skeptical discourse as a landscape of error codes. Initially, this operated in real-time, but changes in X’s policies mean it works differently now, although the core concept remains.

This data-driven sculpture is accompanied by a thermal visualization that reflects the computational process’s heat generation and the amount of misinformation detected at a given time. The visualization uses colors to convey this – the warmer the installation and the higher the amount of misinformation, the warmer the color tones.

Critical Climate Machine (detail), installation view at Deutsches Museum (Zukunftsmuseum), Nürnberg, 2021. © Gaëtan Robillard

Critical Climate Machine, various material, data collection and classification program X (Twitter), octophonic sound installation, 30-inch screen, variable dimensions, 2021-2024. Installation view at Université Gustave Eiffel, 2024. © Gaëtan Robillard. Photography: Gabriel de La Chapelle.
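The "memories" and "scenarios" Jérôme describes can be pictured with a toy sketch. This is not his instrument: it is a deliberately simplified illustration that assumes each sound slice in the memory carries a single label and that the scenario is a sequence of such labels. The `Slice` class, the `generate` function and the labels "advocate" and "skeptic" are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """One entry of the 'memory': a short audio segment and a descriptive label."""
    start_s: float  # start time in the source recording, in seconds
    label: str      # hypothetical annotation, e.g. "advocate" or "skeptic"


def generate(memory, scenario):
    """Follow the scenario label by label, retrieving a matching slice for each step.

    When the slice that follows the previous pick in the memory also matches,
    prefer it, so the output keeps some continuity with the original recording.
    """
    output = []
    prev = None  # index in memory of the previously chosen slice
    for wanted in scenario:
        candidates = [i for i, s in enumerate(memory) if s.label == wanted]
        if not candidates:
            raise ValueError(f"no slice in memory matches label {wanted!r}")
        if prev is not None and prev + 1 in candidates:
            choice = prev + 1
        else:
            choice = candidates[0]
        output.append(memory[choice])
        prev = choice
    return output
```

In this picture the scenario decides the structure of the output while the memory supplies its texture, which is one way to read the questions "What are the guides? What are the memories?"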

The third part of the installation is an eight-speaker surround sound system, which immerses the audience in a dialogue between voices representing both climate advocates and skeptics. These voices circulate through the space to reflect the ongoing public debate about climate change. “You have a space where the two types of discourse are opposing each other, and all the time, systematically, the climate advocates attempt to establish scientific facts,” Gaëtan points out. “It’s a communication tool composed very specifically with the algorithm made and developed by Jérôme.”

People of all ages have visited the installation, and it has received a wide range of responses. Gaëtan was particularly surprised by the strength of reaction from its younger visitors. “The young people aged between 8 and 14 are very much concerned,” Gaëtan reveals. “Somehow, much more than the adults. Their desire to engage was striking for me. One of the most interesting reactions is that the young people want to extend the discourse by themselves!”

“Patterns of Heat”, visualizations captured at three different stages. Each point represents one of the analysed tweets. Below each color grid, a randomly selected tweet is displayed. © Gaëtan Robillard

While some viewers found the sound installation challenging, describing it as stressful or overwhelming due to the overlapping sounds and repetitive patterns, this discomfort was actually an intentional aspect of the work. “The sound installation is not necessarily made to be sensed as something very beautiful,” Gaëtan explains. “There’s a certain pressure that was designed. There are repetitions and overlapping sounds, so if you spend a lot of time with it, there’s this discomfort that’s part of the project.”

The machine learning algorithm wasn’t designed to generate discourse with minimal input. Instead, the focus was on carefully tuning the instruments, formalizing how they should operate, and being precise about the artistic intention behind the work. Jérôme highlights the importance of reflexivity in the creative process: “The sound installation uses recordings of voices arguing about climate issues, and the idea was: what do we want to do with that? Do we want this type of argument to be the structure, or do we want it to be the material? Do we want to have a kind of hybridization between both? Do we want to see the meaning of the sentences in the data, or do we want to forget about the meaning but use its acoustic characteristics?”

The fact that machine learning can operate autonomously was not used to facilitate the process. On the contrary, the team became more and more reflexive and precise about the inputs and outputs and what they wanted to see in the data. The team intentionally chose not to display the content from Twitter but rather abstracted data derived from it, focusing on structure and texture rather than the literal meaning.
Looking ahead, Gaëtan and Jérôme are excited about the installation’s potential to travel and reach a wider audience. After successful exhibitions in Germany and at Université Gustave Eiffel earlier this year, they are in discussions to showcase the Critical Climate Machine in other French cities, including Lille and Lyon. They are also hopeful about eventually taking the installation to the United States.

Two students – Laëtitia Ngaha and Anastasiya Balan – recording dialogs between climate skeptics and climate advocates. © Gaëtan Robillard

Beyond writing award-winning papers, Gaëtan and Jérôme maintain active careers in their respective fields. Jérôme spends much of his time tuning his digital instruments and collaborating with musicians. Gaëtan balances teaching and exhibitions with the constant hunt for funding for his research and artistic creations.

Reflecting on their collaborative process, Gaëtan emphasizes the importance of creativity and interaction with machine learning systems in their work. “We concluded that it’s very important to focus on data aspects or making with data, and also interactivity itself, more so than automating creation,” he adds. “Jérôme’s work has been very fruitful because we worked as a team on the sound. There were four of us, and it was very interactive amongst our team. I think the algorithms devised for generating music by Jérôme were very purposeful!”

Chaining two generative agents respectively in charge of the structure and the texture of the outputs. © Jérôme Nika

Playlist - Graphics in 5 minutes

Steve Seitz is a Professor at the University of Washington and a Distinguished Scientist at Google. He speaks to us about Graphics in 5 minutes, his incredible YouTube playlist, which promises a university-level computer graphics course in two hours.

Steve, can you tell us about your work in general?

I work in computer vision, and my specialty is 3D computer vision and applications of vision for computer graphics. I got to the University of Washington in 2000, and I was at the faculty of Carnegie Mellon for a couple of years before that.

We are almost of the same generation, I think, but you have much more hair.

[Steve laughs] Well, we’ll see how long it lasts!

Let’s talk about the playlist. I discovered it because Shmuel Peleg sent it to his mailing list.

It’s a set of lectures, really, but I call them cartoons. The idea is to give you a fairly deep understanding of a topic in just five minutes. I teach a computer graphics course at the University of Washington, and what I found was that you could convey the same material as a 60-minute class in five minutes using the right techniques. That was the inspiration for this series.

Even very busy people could find five minutes for that.

[laughs] Hopefully, yeah. I try to make them as entertaining as possible. I think people will find them both fun and informative.

Thinking particularly about the most junior computer vision scholars, how do you suggest they get started with this series?

Yeah, it’s a good question. I probably wouldn’t do it all at once, although you could. In about the time that it takes to watch a normal movie, in about two hours, you can take an entire quarter of the class. You can finish it. So, you could watch it back-to-back, but most people I know like to watch a few at a time. I teach this class at the University of Washington based on this channel. For that, the students watch three of these videos a week, and then they do projects and so forth. That’s one way to do it. I know other people who like to binge-watch them. Maybe they’ll watch three or four a night for a few days. That’s also possible. Whatever you like the most.

The cartoon style of presentation, from the "Projection" video

Are all of them fit for computer vision people?

They’re certainly applicable to anyone wanting to learn computer vision or computer graphics. It’s designed for a university-level class, but I designed the videos so that anyone can watch them. My wife, who is not trained in computer science, she’s watched all of them. My 18-year-old son has also watched most of them. They’re designed for anyone to be able to follow, although there’s a little bit of calculus in one or two of them, so you might not get all pieces of it if you haven’t had the math, but most of them don’t require much math at all.

Did you create this on your own initiative?

Yes. I was assigned to teach computer graphics, and normally, I teach computer vision, so I hadn’t taught the graphics course in many years. What I thought I would do was find the best graphics classes online and just assign those to my students, thinking that there surely must be better teachers of computer graphics than me. [laughs] I thought I would find the best lectures online and then just assign them, but I found that it was very difficult for me to sit through an hour-long lecture on YouTube, and maybe this is just because I’m not very patient, but I felt like if I can’t sit through an hour-long lecture, it’s not fair to assign it to my students. But what I did find was one lecture on affine transformations, just a five-minute cartoon on Algorithm Archive, which is a fantastic video. This covered about two-thirds of the material that I normally cover in my lecture, and I thought, wow, this is amazing. Why don’t I find one of these little YouTube lectures for every single topic in the course, and then I’ll be set? But it turned out that that was the only one. This affine transformations cartoon was the only one that covered the topics in the class in five minutes, and so I was inspired to do the rest of them myself. As I started doing this, I realized I really enjoyed this process of creatively trying to figure out how to teach a complex topic in five minutes, and so, even after I finished the graphics class, I kept going and did a few more. Basically, whenever I find a topic that I don’t understand well enough, I will force myself to learn it and explain it to other people in the same way that I learned it. I’ve done a few on machine learning. There’s one on ChatGPT and large language models – that’s probably my most popular. I did a few on reinforcement learning. The next one I’m doing is going to be on the Aurora Borealis, so I’m teaching myself physics in order to be able to do this one.

Giant home-made pinhole camera (also from the Projection video)

“My most popular video, on transformers and large language models.”

I did something similar with this magazine, and now, I have published 201 issues, but out of 7,000 pages of content, the ones from 2016 are starting to look a little bit outdated. Are you afraid that in the years to come, people will come to your playlist and say it’s out of date?

Yeah, I’ve thought about this issue. It’s a good question. I guess I have two answers. One is that I try to choose topics that are fundamental and may be less likely to go out of date quickly, although I’m sure some of them will go out of date. The other thing is that these are only five-minute videos, so if one or two of them go out of date, that’s perfectly fine. I’ll try to create some new ones on the latest material. I enjoy doing it so much that it’s not hard to find the time! On the other hand, one of the challenges is that things change so rapidly today in our field, right? Every week, there are 10 new arXiv papers coming out, and so it can be hard to keep up. Each of these videos takes me at least three weeks. I’m not working full-time. I have two other jobs. Given the limited time I have in the evenings and weekends, it takes me three weeks at least, so that’s too slow to keep up in terms of the latest breakthroughs. That’s a challenge – how do you keep up with the cadence of innovation?

Do you ever feel that other pillars of our community should participate in the effort with you?

Yeah, that would be fantastic! I would love for others to start creating these. In fact, a number of people have asked me, ‘How do I create them?’ In my playlist, I have a video that says how to create one of these five-minute cartoons. Please, if any of your readers are interested, I would love to have them start creating videos just like this.

Another machine-learning video, on reinforcement learning

Do you have any funny stories to tell us about the production of these videos?

Yeah, maybe two things. One is that I always fact-check these videos, so I send them to an expert who I consider more knowledgeable than I on each of these topics, and they’ll tell me what they think. Another is that some of the ideas that I have are ridiculously hard to execute. For example, I did a series on splines, which are a technique for creating curves, and I had heard, when I taught the course and read books and so forth, that the mathematical technique of splines was modeled after old shipbuilding tools. They had these big metal weights called duck weights, and they had a wood strip, and then they would pull the strip with different weights, and it would create these curves, and then you could trace them out. Then, the mathematics tried to approximate the dynamics of elasticity – bending energies of the curve. What I wanted to do was find the original duck weights. This turned out to be very, very difficult. There are almost no images on the web of duck weights, at least none that are Creative Commons that I could use, but I found one manufacturer in the United States who still produces these, and I ordered a set of them. [laughs] They took several weeks to arrive, and then I had to find the right thickness of wood strips, so I probably went through 20 or 30 different wood strips from Amazon and Home Depot. I finally got the thickness of the wood that I could approximate this whole thing and so it took me months to create what turned out to be a 20-second segment of the spline video. [laughs]

Do you still have them?

I still have them. Here’s one!

How the duck weight is used to create curves

What else should our readers know?

Everyone always asks me: Can you really learn an hour’s worth of material in five minutes? I did a formal user study to prove that you can. I taught the computer graphics class at the University of Washington and divided the class into two groups, A and B. Group A watched these five-minute videos, and group B, I gave traditional lectures with PowerPoint but covering exactly the same material – literally the same slides and visuals in the videos as in the lectures. The lectures did take an hour for each five-minute video. Then, I repeated this process twice, so each group got to see lectures for half the time and videos for the other, and I asked the students which they preferred. Two-thirds of them preferred the videos, the shorter format. Then I also compared performance on exams, homework and projects, and there was no correlation, which means that the students who learn from the short videos learn just as well as the ones who learn from the lectures!

Read 160 FASCINATING interviews with Women in Science

“This is me reading a news piece online about my BSc student Sara Sterlie. She did a really nice BSc thesis on bias in large language models, and she has become a bit of a posterwoman both for the university and disseminating to other disciplines, when it comes to women, tech, and responsible AI.”

Women in Computer Vision

Aasa Feragen is a Professor at the Technical University of Denmark.

Aasa, people know you have spent most of your professional career between Finland and Denmark, but you are actually Norwegian.

Yes, I’m from Norway, which is why when I pronounce my name, I sound completely different.

What is your work about?

My education is in mathematics. When I started working – can I tell the long story?

We love long stories.

[laughs] I always thought I would do research in mathematics, but then when I finished my PhD, the financial crisis had just hit the job market, and, especially as a theoretical mathematician, there were no jobs to get. I found a three-month postdoc doing medical image analysis at a group in Copenhagen – I think it was three months because they considered me high-risk, and they wanted to figure out what sort of a person I was. It was also with the goal of applying for more funding so I could stay. That worked out. I got funding, I got to stay as a postdoc, and this is how I started working with medical image analysis. In the beginning, my work was very mathematical, very geometrical, but through the years, I have become better at maybe the more engineering and technical parts. Today, I work also sometimes with very applied and clinical – well, maybe this community would not call it clinical, but from where I’m coming from, it’s very applied and very clinical, where we also start looking at how do our algorithms interact with clinicians and with patients and so on. There’s the mathematical end of the spectrum, where we think about how do we even formulate the technical problems that we’re trying to solve. We formulate them in terms of math, and then we use computers to solve those mathematical problems, and that gives us an algorithm. But then I also work on the other end, where we start thinking about how this affects other people.

No more geometry?
Well, it’s not completely gone, but at the moment, I’m not doing so much geometry, unfortunately. I very much enjoy it, but it just doesn’t come naturally at the moment.

I am very much disappointed – you are not as high risk as they thought that you were. You finally blended pretty well into the new field!

I’m not sure. That sounds really depressing. [we laugh] Now you’re calling me boring! [Aasa laughs]

Let’s get out of the boring part! What is exciting about what you do?

What I really like is the interaction you have with people that have very different backgrounds. Trying to understand what on earth they are talking about when their language is not the same as your own. Clinicians think about things differently from what I do. They use different words. When I talk to clinicians, I talk like a baby. I don’t know what anything is called in their world.

Nassir Navab told me a few years ago at CARS about funny things he does to bring surgeons to the lab to see demos and talk to students.

Yeah, and he’s probably right about that. I haven’t been that active with clinicians until very recently. I very early on had one project where I was talking to clinicians, and what I really enjoy about it is the aspects that have to do with trying to understand each other. I really find that interesting.

“Ralph, you took this photo of me and my youngest daughter Ingrid at MICCAI 2019 in Shenzhen, when one of my students had an oral and I brought my 4-month-old along! Let’s showcase that having children doesn't necessarily need to stop you from anything!”

Do you have any advice for the next clinicians who talk to you?

To bear with me! [she laughs] But they usually do, to be honest. Usually, they’re fine when you walk in and say, I don’t know what the thing I’m talking about is called, but I would describe it like this. Most of them engage very happily.

They are patient with you.

Yes, exactly.

Do you return this kindness?

I try. [both laugh] It’s something I work with students with a lot because, whereas I find this communication very interesting, not everybody feels the same way. Some people would just like to get the talking done and then get started on the work. If you’re not ready to invest enough in that communication, your work won’t be right. Sometimes, I work a lot with the students to make sure that they are patient enough with the clinicians, and sometimes, I work with the clinicians to make sure they are patient enough with the students.

You make peace in the world. Fantastic!

I wouldn’t say I make peace in the world. [laughs] Maybe sometimes this work comes from there being not so much peace, always. I think the advice for both sides now is just to remember that the other people are good at something else than you and that you need to give them the benefit of the doubt.

How many students do you have now?

At the moment, I have four PhD students and four postdocs, but I have a couple of people leaving and a couple of people coming in.

“In 2022, my PhD student Steffen Czolbe received an elite researcher travel award, handed over by the (then) Crown Princess Mary of Denmark (she is now queen). It's one of those few occasions for Danish researchers to feel fancy!”

What takes up the most of your mind about your teaching activity?

Both with my research students and with university students, I spend a lot of time thinking about how to teach them the things I want to teach them and how to steer them scientifically in the right way while at the same time also making sure that they’re happy. There’s a lot of pressure in the academic world, both when you’re a student but also when you’re a researcher. If there’s too much pressure and you feel too pressured, you can’t concentrate on doing good science. Then you start thinking: how many papers will I write? How many will cite my papers? Will I get my next job? Or if you’re a student, you think: how do I get the best grade on the next exam? If your mind is occupied with those things, you won’t do good science. You won’t learn, or you won’t write good papers. So, somehow making sure that people feel safe and happy enough that they can concentrate.

I am sure that you quite often crack some jokes in class.

I don’t think I’m very good at telling jokes – I mostly make fun of myself. I can tell you one silly thing that I did, but it’s not a joke. I have a daughter who’s nine years old, and I bike to work, and when I bike to work, I wear sports clothes because it’s a half-hour bike ride, and I get sweaty. When I go to work, I change into some other clothes. In the winter, at some point, I had brought with me a dress and some tights to wear under the dress… only the tights belonged to my daughter. They were very short and small. I couldn’t wear them. I even tried to put them on – it didn’t work! [laughs] When I came to class to teach, I was wearing these blue sports tights under my dress, and I had to explain to the whole class why I was looking like that.

What did the class say?

They laughed at me, of course.

What did your daughter say?

I think she laughed at me as well. [Aasa laughs] That’s the sort of joking I do. I don’t have good jokes.

Why did you choose Finland and Denmark?
“My first week at the Technical University of Denmark – they hired me as full professor when I was 8 months pregnant, and I had 9 days at work before going on maternity leave!”

Long story again. I come from the countryside in Norway, and as a child, I was that very strange child in the countryside that liked to read and actually enjoyed school. I was bullied. I wanted to get away, so I went, not to Finland, to the US. I actually studied the first year of my bachelor’s in the US, but I couldn’t really afford to go to an American university where you have to pay your own way. I had gone there thinking, let’s go for one year and see if I can get a fellowship or a scholarship to do the whole education. Only problem was, when I was there, it turned out I really liked physics, so I changed my major to physics. I actually went there to do chemistry. In physics at that university, there were virtually no scholarships that I could apply for as an international. If I wanted to continue studying physics, I had to leave again, so I went to Finland. I didn’t want to go home, so I went to Finland.

This means that when you are rich, you will have to make a scholarship for physics students.

Maybe, yes. I went and studied physics in Finland for another year, but then in my third year, I was done with my minor in Mathematics, and I didn’t want to stop doing the mathematics, so again, I swapped and finished a master’s degree in mathematics in Finland. That’s how I ended up there. Then, I started a PhD. Again, this is really old school academia. I also had a very old professor as my supervisor. He was very hands-off. I think in the course of three and a half years, I had three meetings with him about science. I didn’t feel like I was getting supervision. Frankly, I was really annoyed. At some point, he was going for a one-year sabbatical where he was going to go and sit on his island because he had bought an island and often he would just work remotely from his island.
This is 20 years ago, before Zoom or anything, so when he was on his island, we were just left on our own. That’s why I only saw him three times. He was going to go sit on his island for a year, so I said, maybe then I can go to Denmark as a visiting student. So, I did. I went to a place called Aarhus, the second biggest city in Denmark, as a visiting student. After having been there for a year, I asked, ‘Can I stay?’ They allowed me to stay as a visiting student for the rest of my PhD.

We have spoken about the past, but not much about the future. What island are you going to buy?

I’m not buying any islands. I think this was not good for the PhD students! [Aasa laughs]

Where are you going?

All of this past shows, at least in terms of supervision, a lot of how I do not want to be myself as a supervisor. I want to be a very present supervisor. I want to build my group and have a happy group here in Denmark. I want to grow my group in that way. You know, being with the students, trying to understand them just like the clinicians, and trying to do good science together with them. I really love it when you can see the students grow and you can see them going from being the shy first-year student to being the confident senior PhD that is helping the new people coming in. That’s really nice.

Do you have a message for the community?

I’ve been Program Chair for MICCAI this year, and one of the things that I think we’re seeing in some of the cases we’re handling is that in parts of the community, there is a very high pressure. Pressure to publish, pressure to do well, to build a CV. I just want to remind the community that, at the end of the day, we shouldn’t be striving for more papers; we should be striving for better science. Especially as young scientists, if you don’t have that as your main goal, I think it will be very hard to maintain your motivation, and very hard to stay happy, but also really hard to do science that makes a better world.

Aasa Feragen: “PhD supervision in my garden during Covid restrictions”

“Professor party after my inaugural lecture. It shows the nice social feel of the university section that I am part of.”

“During the winter 2022-23, energy prices were skyrocketing and universities turned down the temperature. I'm Norwegian and like to think of myself as robust, but this was cold!”

Read 160 FASCINATING interviews with Women in Computer Vision!

Congrats, Doctor Erik!

Erik Sandström recently obtained his PhD from the Computer Vision Lab at ETH Zürich, working with Luc Van Gool and Martin R. Oswald. In his thesis, he focused on dense SLAM with monocular RGB or RGB-D input. He is now looking for industry positions in 3D Vision - feel free to reach out, he’s a catch!

Visual Simultaneous Localization and Mapping (VSLAM) is a longstanding problem in computer vision. The input is a video from an RGB or an RGB-D camera, and the output is generated in real time in the form of camera poses per frame and the geometry of the surroundings. Traditionally, sparse SLAM was the dominant solution, where only a subset of the map is reconstructed. On the other hand, dense SLAM, where we seek to reconstruct geometry for every input pixel, lends itself to a multitude of downstream tasks like path planning, perception and various AR solutions. Traditional dense SLAM used either voxel grids storing signed distances or surfels. Since the introduction of neural implicit representations, neural radiance fields and, lately, 3D Gaussian splats, renewed attention has been paid to dense SLAM, with the aim of leveraging the benefits of these representations for improved accuracy and robustness.

In my PhD, I focused on three main problems related to dense SLAM: handling input depth noise, identifying suitable scene representations, and improving RGB-only SLAM robustness.

To address depth sensor noise, we investigated approaches for both multi-sensor fusion and single sensors. The goal of multi-sensor fusion is to generate an output geometry more accurate than what could be achieved with any single sensor in isolation. We framed this as the problem of learning sensor properties in order to predict an accurate weighting between the sensors in 3D. When using a single depth sensor, we framed the problem as self-supervised online learning of depth uncertainty, showing improvements over existing methods that use offline-pretrained strategies.

To tackle the question of what 3D scene representation to use, we first noted that existing methods used grid-based data structures for storing the 3D map. This representation does not easily lend itself to global updates of the map as a result of, e.g., loop closure. Instead, we developed a point-based neural implicit representation for SLAM, where the density of points relates to the information density of the input frames. Furthermore, we coupled this to a robust pose graph optimization formulation, enabling globally consistent pose and map estimation.

SLAM without depth from a sensor leads to geometric ambiguities, making it a harder problem to solve. We identified two key components that aid the performance of such systems. First, we introduced an optimization layer which combines multi-view depth estimation via dense optical flow and monocular depth estimation. Second, we showed how to build globally consistent 3D Gaussian splats by deforming the Gaussians at loop closure and global bundle adjustment.

Our code is open source on GitHub at eriksandstroem. I hope my thesis will inspire the next generation of researchers to pursue dense SLAM!
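The idea behind multi-sensor fusion, weighting each sensor by how much it can be trusted, can be illustrated with a small Python sketch. This is not the method from the thesis, which learns the weighting in 3D from sensor properties. As a stand-in, the sketch assumes hand-supplied per-pixel noise levels and uses plain inverse-variance weights; the function name `fuse_depth` and all sample values are hypothetical.

```python
from typing import List, Optional


def fuse_depth(
    depth_a: List[Optional[float]],
    sigma_a: List[float],
    depth_b: List[Optional[float]],
    sigma_b: List[float],
) -> List[Optional[float]]:
    """Fuse two per-pixel depth readings by inverse-variance weighting.

    Assumption for illustration only: each sensor reports a depth (or None
    for a missing reading) and a hand-set noise standard deviation per pixel.
    """
    fused: List[Optional[float]] = []
    for da, sa, db, sb in zip(depth_a, sigma_a, depth_b, sigma_b):
        # A missing reading gets zero weight; otherwise weight = 1 / variance.
        wa = 0.0 if da is None else 1.0 / (sa * sa)
        wb = 0.0 if db is None else 1.0 / (sb * sb)
        if wa + wb == 0.0:
            fused.append(None)  # neither sensor observed this pixel
            continue
        value_a = da if da is not None else 0.0
        value_b = db if db is not None else 0.0
        fused.append((wa * value_a + wb * value_b) / (wa + wb))
    return fused


if __name__ == "__main__":
    # Hypothetical readings in metres: sensor B is noisier on the first pixel,
    # and sensor A has no reading on the second pixel.
    print(fuse_depth([2.0, None], [1.0, 1.0], [4.0, 4.0], [2.0, 2.0]))
```

With fixed weights like these, the fused value always lies between the two readings and is pulled towards the less noisy sensor. A learned weighting, as described above, can instead adapt to where each sensor tends to fail.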

Deep Learning for the Eyes by Christina Bornberg @datascEYEnce

Hello everyone, welcome to another datascEYEnce story! I am Christina Bornberg and I interview people in the field of deep learning for ophthalmology to showcase their research here in the Computer Vision News magazine! This time, I want to introduce Emanuele to you!

Retina Revealed: Analysing Eye Images to Diagnose Different Diseases
featuring Emanuele Trucco

Emanuele Trucco, a professor of computer vision at the School of Science and Engineering at the University of Dundee, has been part of the Division of Computing since 2007. His research focuses on retinal image analysis and biomarker research, collaborating closely with doctors to explore what the retina can tell us about the progression of systemic conditions such as dementia and Alzheimer’s, cardiovascular disease and diabetes, as well as common retinal diseases.

Before we get started, I want to give our readers a task: complete the sentence: The eye is a window to _ _ _ _ _ _ _ _ disease. Since Emanuele, among other things, focuses on these kinds of diseases, you will find the answer by reading about his research here! Or you can cheat and find the answer in the last paragraph!

How did you get into deep learning for ophthalmology?

Emanuele answered the ophthalmology part simply: he didn’t choose the eye, the eye chose him. His journey into the world of retinal images began after an unexpected series of events many, many years ago. A research office of the university contacted him about a company that was interested in a collaboration, and the rest is history! And as for deep learning: deep learning nowadays is not a choice, it’s a must.

Could you tell me about your previous research and projects?

Emanuele has a great number of projects. We spoke about VAMPIRE, GoDARTS and myDiabetes, among others. To my surprise, Emanuele told me that doctors have not been reserved but eager to use diagnosis software, often requesting to collaborate on research. I will give a short summary of the projects here:

The VAMPIRE collaboration

VAMPIRE (Vessel Assessment and Measurement Platform for Images of the REtina) is a software application for efficient, semi-automatic quantification of retinal vessel properties in large collections of fundus camera images. The platform can assess vessel width, tortuosity, and branching patterns, providing valuable biomarkers for a range of systemic diseases, including cardiovascular and neurodegenerative conditions.

The GoDARTS study and myDiabetes

One of the studies that took advantage of VAMPIRE was the Genetics of Diabetes Audit and Research Tayside Scotland (GoDARTS) study in Tayside, Scotland. The study investigated the genetic basis of diabetes. A collection of retinal images, clinical histories, full drug histories and genetic profiles of patients helped to identify genetic and environmental factors of the disease. Having access to data dating back to 1987 makes GoDARTS a longitudinal cohort. Such rich data has been instrumental in augmenting diabetes management and commercialising tools like “myDiabetes.”

What kind of exciting results did you encounter and what are some future directions?

One of the most exciting details of his work is that retinal images and genome data are rather uncorrelated, and hence combining them adds a lot of information to the understanding and diagnosis of cardiovascular disease. With these words, he highlighted the importance of multi-modality. Looking ahead, Emanuele is exploring the time dimension in his research. By analysing the history of retinal images using techniques like LSTM, time series genetic algorithms, and transformers, his team aims to provide a more comprehensive understanding of these diseases.

And finally, what are the main challenges or limitations you have encountered during your research?

Of course, I didn’t only want to hear the positive parts of the research because it is not always rainbows and butterflies! Emanuele sees a range of limitations, especially in computational cost. It is currently unfeasible for many labs to access or afford an environment for experiments. Therefore, research into computationally more efficient large models is needed. Another limitation is the collection and preparation of data. It is a lengthy process but is of great importance, since the accuracy and robustness of a neural network depend on the amount and variety of data it has been trained on.

I really enjoyed the interview with Emanuele and want to thank him again for telling me about his research and projects. And now, the answer you have all been waiting for! The eye is a window to vascular disease, which I personally find quite fascinating as it is very easy to take an image of the retinal vasculature!
Doctors have recognised that changes to the tiny blood vessels in the retina are indicators of broader vascular disease, including problems with the heart, diabetic vascular diseases, and vascular dementia. Computer Vision News Publisher: RSIP Vision Copyright: RSIP Vision Editor: Ralph Anzarouth All rights reserved Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and all conference organizers.

Grand Challenge - Medical Imaging

Medical Imaging and Data Resource Center (MIDRC) XAI Challenge

Chest x-Ray: identify and explain AI’s predictions

This Grand Challenge is organized and overseen by the Medical Imaging and Data Resource Center’s (MIDRC) Grand Challenge Working Group (GCWG). Key members of the GCWG are Karen Drukker (left), a Research Associate Professor at the University of Chicago, Lubomir Hadjiiski (center), a Professor of Radiology at the University of Michigan, and Sam Armato (right), a Professor in the Department of Radiology and Medical Physics at the University of Chicago. They are all here with Emily Townley, the MIDRC Program Manager at the American Association of Physicists in Medicine (AAPM), to tell us more.

The XAI Challenge aims to address a critical issue in the field of AI and medical imaging: the need for explainable AI. With a focus on chest radiographs, challenge participants must identify pulmonary disease, specifically pneumonia, and explain the reasoning behind their AI’s predictions. They are not only trying to get the correct answers for the cases in the challenge but also to demonstrate that their AI system is getting the right answer for the right reason, which is a challenge in itself in the world of AI.

Alongside the concept of explainable AI, there is a complementary idea of trustworthy AI. The challenge comes at a time when AI in medicine often operates as a black box, making it difficult for radiologists to trust its decisions. The team wants to develop methods that allow radiologists to trust their output in a way that benefits patient care. “The better the decisions, the more sure everyone is of what’s going on, the earlier the patient can be treated, which results in better outcomes,” Karen tells us. “Fewer mistakes and better treatment. That’s the ultimate goal!”

Participants in the challenge will develop and train AI models that predict, on a pixel-by-pixel basis, the probability that any given chest radiograph shows signs of pneumonia. “We’ll compare the participants’ output probabilities with expert annotations,” Sam explains. “We had three radiologists outlining the regions of disease that they saw in those images, and that’s the comparison we’ll make against the output of the participants’ methods.”

However, the challenge is not without its obstacles as existing methods of improving AI’s explainability are imperfect. “There are existing techniques out there, like Grad-CAM, but they’re actually not very good,” Karen adds. “They don’t explain much of anything!” One of the additional targets of this challenge is to understand what the limits are at this point in time and to explore new, improved methods that can better achieve that goal.

Lubomir notes that the challenge requires delving deeper than participants usually would into the inner workings of deep neural networks and AI systems. “Once you have multiple participants, you can do many different comparisons,” he tells us. “They’re on the same playing field competing, and we can do some secondary analysis and understand what’s going on.”

The spectrum of pneumonia severity is broad, and the challenge dataset reflects this diversity. It could be argued that for very severe cases, AI is unnecessary because the answer is obvious. Still, most fall into the more subtle range, where AI could significantly assist in diagnosing cases that radiologists can easily overlook. “We’re developing these AI systems to try to improve the diagnostic abilities of physicians because AI is not on its own,” Lubomir adds. “It’s a helping tool to the physician. We hope the AI can find some additional features or descriptors of the disease that aren’t there in the mainstream training of our physicians. Potentially, it can detect it earlier, and maybe, if trained correctly, it can detect more specific types of pneumonia.”

The variance in severity across different patients is exactly why challenges like this are essential; all participants will work with the same set of cases and the range of difficulties they represent. “In the literature, if individual groups publish their own results, they typically have their own database of cases, and it’s really difficult to compare apples to apples because you’re not quite sure of the range and distributions and subtleties of the cases that individual groups use,” Sam points out. “Comparing performance across methods is not straightforward. That is exactly why we’re doing a challenge: to be able to level the playing field so all groups and all methods have the same cases.”

Although this challenge focuses on chest X-rays, other imaging modalities, such as CT, have seen significant improvements in quality and diagnostic potential in recent years. Of course, any improvements in diagnosis rely on people putting themselves forward for or agreeing to imaging in the first place, as well as other considerations, such as health disparities and more complex social issues. “People can still wait till later, but that would be ill-advised,” Karen adds. “Here in the United States, people tend to see the physician sooner, and treatment may be more aggressive than in other countries. I don’t know. It’s maybe a little bit of a cultural difference, too.”

For teams considering participating in the challenge, Sam encourages a bold and collaborative approach while emphasizing its friendly and constructive nature. “The advice we would give to participants or people thinking about participating is to just go for it!” he says. “We’re looking for people to do their best and be as clever as possible in addressing the problem. The whole purpose of this is to try to move the field forward collectively. Every group that gets good results, or maybe not so good results, all of that is contributing to the knowledge of what is the best
