Computer Vision News - October 2024

A publication by RSIP Vision – October 2024
Best Paper Runner-Up MICCAI (page 8)
European Biometrics Research Award (page 14)
Thank you Ron and RSIP Vision! (page 24)
Enthusiasm is common, endurance is rare!

Rasterized Edge Gradients: Handling Discontinuities Differentiably
ECCV Best Paper Honorable Mention

Rasterization is a commonly used method for rendering meshes. Despite advancements in techniques like NeRF and Gaussian splatting, meshes remain popular in computer graphics due to their high degree of hardware optimization and explicit parameterization. Meshes are easier to control and manipulate for tasks such as tracking and offer a more structured and user-friendly approach than representations like the point clouds of Gaussian splatting, especially in real-time applications.

Stanislav Pidhorskyi is a Research Scientist at Meta. Originally from Kharkiv, a city in the east of Ukraine that has been under attack since the Russian invasion in February 2022, Stan speaks to us after winning a Best Paper Honorable Mention award at ECCV 2024 for his work on making rasterization differentiable. Congrats, Stan!

Ukraine Corner – Russian Invasion of Ukraine: The major conference in our field, CVPR, adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine. We decided to host a Ukraine Corner here as well.

“Meshes can either be rendered using ray tracing methods or rasterization,” Stan tells us. “Ray tracing methods are typically slow, while rasterization is significantly faster. In our application, we’re particularly interested in fast methods that are also fast during training!”

When fitting a model to a dynamic sequence, which has significantly more data than a static scene, a fast method is required for rendering and backpropagating the gradients. However, differentiable rendering of surfaces is challenging because surfaces create discontinuities when projected onto the image plane, and this is even more challenging in the case of rasterization. Ray tracing, in contrast, is often considerably more computationally intensive, but it allows for principled approaches to handling discontinuities: the pixel value is computed as an integral over the pixel area via Monte Carlo integration, and once a discontinuity is integrated over, it is no longer a discontinuity.

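In ray-tracing-based differentiable renderers, this is commonly formalized as follows (generic notation of the kind used in the differentiable rendering literature, not taken from Stan’s paper): the pixel value is an integral of the shading function f over the pixel footprint P, and when f jumps across a parameter-dependent boundary, its derivative picks up a boundary term.

```latex
% Pixel value as an integral over the pixel footprint P:
%   I(\theta) = \int_P f(x; \theta) \, dx
% If f is discontinuous across a boundary \partial D(\theta) that moves
% with the scene parameters \theta, the derivative splits into an
% interior term and a boundary term:
\frac{dI}{d\theta}
  = \int_P \frac{\partial f}{\partial \theta}(x;\theta)\, dx
  \;+\; \int_{\partial D(\theta)} \bigl(f^{-}(x) - f^{+}(x)\bigr)\,
        \Bigl\langle \frac{\partial x}{\partial \theta},\, n(x) \Bigr\rangle \, d\ell
```

Here f⁻ and f⁺ are the integrand’s values on either side of the boundary, n is the boundary normal, and ∂x/∂θ describes how boundary points move with the parameters. A ray tracer can Monte Carlo-sample both terms; a rasterizer, with its fixed pixel grid, cannot sample the boundary term.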
Moving the differentiation operator under the integral and applying the product rule handles these discontinuities by splitting the derivative into two integrals – one over the area and another over the boundary. “Unfortunately, with rasterization, we can’t sample this boundary explicitly,” Stan points out. “We have a fixed grid of pixels, and that’s our samples. We can’t decide to sample more or decide to sample on the boundary.”

Some methods attempt to address this challenge by relaxing the problem – substituting the original, harder problem with an easier, more manageable one. A notable example is soft rasterization, which replaces the original triangles in a mesh with triangles with blurry edges to eliminate discontinuities. However, this introduces other issues: it is not what you wanted to render, and it adds cost. The closest method to this work is Nvdiffrast, which uses differentiable analytic antialiasing. It is fast but produces sparse gradients and modifies the rendered image.

“In our case, we took a slightly different direction, so we don’t modify the rendered image, but we can still backpropagate the gradients,” Stan explains. “We have this discontinuous rasterization, but we assume there is another fully continuous process that produces the exact same result as we get with our initial non-differentiable rasterization.”

The novelty of this work lies in the introduction of micro-edges. In this virtual construct, edges are placed precisely between pixels so that, when rendered with an imaginary, fully-antialiased renderer, images appear identical to those of vanilla non-differentiable rasterization. What this means is that we can assume the image was generated by this differentiable, fully-antialiased renderer, even though it wasn’t. We then compute the gradients for the micro-edges as if the image had been rendered this way – without actually modifying the image itself.

While the method is rooted in computer graphics techniques, it is designed to serve core computer vision tasks like 3D reconstruction. The algorithm optimizes a model for reconstruction using a collection of images of a static or dynamic scene. “This is inverse computer graphics,” Stan notes. “We go from camera images to the model. To do that, we build a rendering pipeline and can then backpropagate the gradients to optimize the model for reconstruction. I showed a scene of a monkey, one of my daughter’s favorite toys, during my oral presentation. I took around 500 pictures of it in my backyard and ran COLMAP on it to get the extrinsics of the cameras. Then, just from this collection of images, I can optimize the mesh and the texture and get a model of the toy!”

Another groundbreaking aspect of this work is its novel approach to handling geometry intersections in rasterization – the first method of its kind, as far as Stan is aware. “I didn’t find anyone attempting to handle geometry intersections in the literature,” he reveals. “I understand why. In most cases, it’s challenging, and for static scenes, it’s not really necessary, but it can be very necessary for dynamic scenes!”

He recalls an example of a human head reconstruction. Here, subtle details, like a tongue accidentally penetrating the teeth during specific frames, are nearly impossible to optimize without properly handling geometry intersections. The error might be small, just one or two millimeters, but it significantly affects realism. In scenarios like this, a differentiable renderer that cannot backpropagate gradients effectively would fail to correct the error.

“The naive, straightforward way to handle it would be, before each optimization iteration, to go through the geometry, detect intersecting triangles, and split them, so that we no longer have geometry with intersections,” he points out. “Still, this operation has to be differentiable, and that is costly. In my case, with zero overhead, I add the possibility to optimize intersecting geometry.”

While there may be other ways to solve this problem, Stan’s research represents one step in a broader exploration of differentiable rendering techniques. Looking ahead, he hopes it will inspire others to build upon it, pushing the field toward more efficient and principled methods. “Maybe someone can come up with something as fast as rasterization but more principled,” Stan ponders. “There are many possibilities. I compare with ray tracing methods like Mitsuba and Redner. Those are principled ray tracing packages that use ray tracing to render meshes and compute global illumination and shadows. I’m just doing rasterization. The shading afterward is deferred. I use a neural network in most applications to perform the shading in screen space.”

Stan has been developing this method for several years, having initially had the idea in 2021. “It was the longest time from an idea to actually publishing a paper,” he says. “At our current fast pace of everything, that almost never happens.” The first implementation worked well but felt almost too simple. After receiving feedback from users regarding its inability to handle geometry intersections, he got to work on a solution. “At first, I didn’t think it was possible, but I figured out how to handle it,” he continues. “Then, through the years, I fixed this little thing, fixed that little thing, and it turned out to be a useful tool many people used. That was when I was like, ‘Okay, yeah, now I have to publish it!’”

Thank goodness he did, or we may not be sitting here today with him having just accepted a Best Paper Honorable Mention award at ECCV 2024 in Milano! “That was a big surprise for me!” he smiles. “There are a huge number of papers that truly deserve it. The committee deciding to give it to me is a huge honor!”

For Stan, the work comes from a lifelong passion for computer graphics and computer vision. “There can’t be anything better than a combination of the two!” he laughs. He remembers being captivated by Quake, the first 3D computer game he encountered as a child. Although the geometry and textures were low resolution by today’s standards, the experience stayed with him. “I had never seen anything like it,” he recalls. “It’s a 3D world on the screen, and it feels real. It feels volumetric. I was immediately very fascinated. How does it work? How does it render this? I was never into games, but I was always into what technology exists behind them.” This sense of wonder and curiosity has driven his career ever since and led him to where he is now. He adds enthusiastically: “I’m in the right place at the right time doing what I love!”

MICCAI 2024 Best Paper Runner-Up

ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling

Chantal Pellegrini and Ege Özsoy are PhD students in Nassir Navab’s lab at the Technical University of Munich. They speak to us after winning the Best Paper Runner-Up Award at MICCAI 2024 for their work on achieving a holistic, adaptable, real-time understanding of surgical environments.

In this paper, Chantal and Ege use external cameras to capture the entire surgical process and comprehensively understand what happens in the operating room (OR) second by second. “We use scene graphs, where nodes represent people or objects in the OR and the relations between them,” Chantal explains. “If you imagine the head surgeon is doing something – drilling in the patient’s knee, for example – then this is one relation in our scene graph.”

Summarizing everything happening in a scene in a structured representation has been the focus of the team’s research for the past few years. Specifically, this paper aims to enable knowledge guidance and a scalable, adaptable approach to the real world. “What we mean by adaptability is that every OR looks different,” Chantal continues. “In every hospital in every country, you have different ORs with different people and tools. It’s not feasible to train a new model every time. Therefore, we want adaptability during test time, where we can tell our model, ‘Today, this tool looks like this,’ or, ‘This action looks like this.’”

The team had the idea that large language models (LLMs) could offer an advantage here by leveraging their knowledge of the world. “We wanted to see if they could help us understand the OR better,” Ege tells us.

“The first experiment we did was to find out if they understood things out of the box. Could we upload pictures from the OR and have them tell us what we’re talking about?” However, the results were not very successful. Rather than producing valuable insights, the model would often generate nonsensical outputs, such as claiming that a surgeon was performing an action on a piece of furniture instead of a patient. The team realized a specialized network needed to be built on top of these powerful models. “The knowledge integration we do relies on being able to tell the model what it should look for,” Chantal points out. “For this, we still believe these large pretrained models are very beneficial.”

The work features several key innovations, including an image pooler that compresses information from multiple views of the OR into a single, fixed-size representation using a transformer network, ensuring the model receives a comprehensive visual summary of the scene.

A question that often arises for the team is why scene graphs are so critical to their research. “From the simplest end, OR light makers come to us and say, ‘We’d like to automatically adjust the lights as a function of what’s going on in the surgery,’” Ege reveals. “Ideally, in the future, a digital OR would have something like an API that you can query to get information about this.”

Another practical use case is more directly tied to improving surgical outcomes. The World Health Organization has shown that following a surgical checklist reduces the likelihood of errors and adverse outcomes.

However, overwhelmed clinical staff do not always have the capacity to keep track of these steps during a procedure, which is where an automatic system could step in to enhance patient safety.

Looking further into the future, Ege envisions a time when robots might assist with, or even replace, specific roles in the OR, such as the circulating nurse. “It’s not enough to only replace their dexterity or spatial awareness,” he says. “You need to replace contextual awareness – the entire brain of that person!” This level of understanding would require AI systems to perceive the surgery much more deeply, recognizing and interpreting all the interactions between people, tools, and tasks. Scene graphs representing these relationships are vital for achieving such a complex system.

One challenge the team faced was making scene graphs and LLMs more compatible. To address this, they devised a method to represent scene graphs as sequences, encoding them as a list of triplets in the form <subject, object, predicate>. “It becomes semantically the same thing,” Ege explains. Initially, they considered allowing the LLM to work with human-readable names, such as <head surgeon, patient, suturing>, but found that this led to overfitting during training: the model would focus on familiar terms rather than adapting to new situations. To overcome this, they introduced a symbolic representation. Instead of using terms like ‘head surgeon’ or ‘patient,’ they encoded everything with symbols (e.g., A, B, C, or Alpha, Beta, Gamma) and provided corresponding descriptions in the prompts. This change forced the model to read and analyze the descriptions instead of overfitting to labels. “We switch the meaning of the symbols every sample,” Chantal reveals. “In one sample, Alpha is ‘drilling’; in the next, it’s Beta, Gamma, etc.” By randomizing the symbols, the model has to interpret the descriptions rather than simply remembering class names. This approach allows the system to handle unseen actions or tools on the fly by adding and defining new symbols without retraining.
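As a toy illustration of this per-sample symbol shuffling (our own sketch with made-up labels and descriptions – not the authors’ code), the serialization might look like this:

```python
import random

# Toy vocabulary of entities/predicates with textual descriptions
# (invented examples; the real system uses OR-specific ones).
DESCRIPTIONS = {
    "head surgeon": "person leading the procedure",
    "patient": "person being operated on",
    "drilling": "using a powered drill on bone",
}

def serialize(triplets):
    """Encode scene-graph triplets with per-sample random symbols.

    Each (subject, object, predicate) triplet is rewritten with symbols
    (A, B, C, ...) whose meaning is redefined in a legend for every
    sample, so the model must read the descriptions instead of
    memorizing class names."""
    names = sorted({name for triplet in triplets for name in triplet})
    symbols = [chr(ord("A") + i) for i in range(len(names))]
    random.shuffle(symbols)  # fresh symbol assignment per sample
    mapping = dict(zip(names, symbols))
    legend = "; ".join(f"{mapping[n]} = {DESCRIPTIONS.get(n, n)}" for n in names)
    body = " ".join(f"<{mapping[s]},{mapping[o]},{mapping[p]}>" for s, o, p in triplets)
    return f"Legend: {legend}\nGraph: {body}"

print(serialize([("head surgeon", "patient", "drilling")]))
```

Because the legend is regenerated every sample, adding a new tool or action is just a matter of adding one more symbol and its description at inference time.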

“It’s trivial to add a 13th letter, define what it is, and suddenly you can predict something that you’ve never seen during training,” Ege notes. “For us, the entire intuition behind using LLMs was never just to be state of the art but to allow this adaptability during inference time, and these textual and visual descriptors or prompts were the key to making that work!”

To further enhance adaptability, they needed to address a limitation in their dataset, which only included recordings of 10 simulated surgeries. “You often saw the same things,” Chantal recalls. “For our model to learn how to use these descriptions and link them to objects in the scene, we needed to increase the variability in our data.” To solve this, they used Stable Diffusion to generate synthetic images of surgical tools in varying colors, shapes, and sizes. This allowed the model to practice identifying new objects by linking detailed textual descriptions to these synthetic images in the scene.

With all this innovation and the work’s several major contributions to surgical data science, it is no surprise that it was selected as Best Paper Runner-Up at MICCAI 2024. Chantal and Ege both express their gratitude and surprise at the honor. “We knew we were on the shortlist, but of course, you never expect something like this,” Ege says.

What does he think swayed the jury in their direction? “It was a very packed paper,” he continues. “We pulled off a lot of valuable contributions, and the entire direction of real-world OR understanding is a very exciting topic for the community. While we’re not seeing this as clinically ready, we believe it makes significant progress in that direction.” Chantal adds that the work rejects the usual path of focusing on solving a specific problem in a closed domain: “One thing I’m really proud of in this work is that we look ahead at how we can realistically use these systems without having to retrain all the time, and this makes it more of a step toward a real-world application.”

They identify a key area for growth: incorporating more diverse types of information into the model. “Right now, we’re only looking at images,” Chantal clarifies. “There’s other information, such as the sound in the surgery, or if it’s robotic surgery, the tracking information of the robot, or even information about the patient, such as their vital signs. Including all this information is crucial to a fully holistic understanding! It’s a big step, but it’s where the journey is going.”

Could we be getting a sneak preview of next year’s Best Paper? They both laugh and say in unison: “Let’s see!”

European Biometrics Research Award

Privacy-preserving face analytics using deep learning methods

Peter Rot is a researcher and final-year PhD student at the University of Ljubljana, where he is a member of the Computer Vision Laboratory at the Faculty of Computer and Information Science and the Laboratory for Machine Intelligence at the Faculty of Electrical Engineering. He speaks to us fresh from winning the European Biometrics Research Award 2024 for his thesis on soft-biometric privacy-enhancing techniques for facial recognition systems.

Facial recognition technology is a popular and sometimes controversial topic due to its widespread use and the ethical and privacy concerns it raises. Modern facial recognition models create detailed biometric templates from input images, which are then used to compare faces and infer identity. However, in addition to capturing identity, they store other potentially sensitive facial attributes, like a person’s gender or ethnicity.

In this work, Peter takes on the challenge of addressing privacy issues by proposing innovative solutions to protect sensitive data. He explores the balance between preserving identity-related facial features and minimizing the exposure of soft biometrics – attributes such as age, gender, ethnicity, and health-related cues that can be inferred from an image of a face.

“Privacy in the context of face analytics can be approached from two high-level angles,” Peter tells us. “One is de-identification. Let’s say you have an image of a face and want to remove identity-related cues from it. You can just put a black box over it, but that’s probably not the best way, or you can try to do something more photorealistic. You can remove identity-related cues but preserve soft biometrics. The other angle is to preserve identity but try to remove soft-biometric attributes, so it’s like the inverse of such de-identification.”

This work is particularly important when biometric templates are stored in cloud-based systems, to protect them from misuse by authorized persons for profit or discriminatory practices. In this regard, it fits well with the European Union’s more cautious approach to artificial intelligence and its focus on privacy regulation. Peter suspects this alignment caught the attention of the award judges at the European Association for Biometrics (EAB). “If you put a constraint on a problem, you also have innovation in that respect,” he points out. “It’s kind of suppressing creativity and innovation in one sense, but from another perspective, when you have additional constraints, it forces you to think in directions that others don’t care about!”

Soft biometrics can be divided into two broad categories: local and global. Local soft-biometrics, such as hair length, lipstick, or the presence of a beard, are defined by pixels that are close together. Global soft-biometrics, like gender or ethnicity, require consideration of many different facial features spatially distant from one another.

Figure: The goal of soft-biometric privacy-enhancing techniques (SB-PETs) is to establish a control mechanism over information (illustrated by the locking system) that disables the extraction of specific soft-biometric attributes (locked/unlocked) while still preserving identity information (always unlocked), for example for face verification purposes.

One of the most challenging aspects of Peter’s research involves manipulating these global soft-biometrics without erasing all the biometric utility. “When trying to preserve the privacy of global soft-biometric attributes, you need to modify many regions on the face,” he explains. “The same goes if you’re trying to do privacy enhancement on the template level. If you say, let’s remove gender information from a face, and you’re trying to modify all those cues, you can end up with an image which doesn’t have any identity-related information at all.”

Peter approached this problem from two different angles. He explored the first in PrivacyProber, the first contribution of his PhD dissertation, which involved working at the image level, manipulating specific regions to protect soft biometrics while retaining identity. Although many state-of-the-art soft-biometric privacy-enhancing techniques employ this method, it is prone to reconstruction attacks, where protected information (such as gender) can be extracted from modified images.

Figure: Soft-biometric attributes can be divided based on the proximity of the pixels that encode them: local soft-biometrics have pixels in close proximity, while the cues from which global soft-biometrics can be inferred are extracted from a combination of local ones. Therefore, to enhance the privacy of a global soft-biometric attribute, multiple regions of an image must be manipulated; however, such manipulations decrease the amount of discriminative, identity-related information. In our work, we focused on exploring the trade-offs between privacy and utility.

Figure: The first contribution of my PhD dissertation is PrivacyProber (published in IEEE Transactions on Dependable and Secure Computing). In this work, we showed that existing SB-PETs, which manipulate raw image pixels, are to a high degree prone to reconstruction attacks. The original image is denoted I_or, and we produce its privacy-enhanced version I_pr (note that privacy enhancement flips the gender prediction) using a chosen privacy-enhancing mechanism ψ. By transforming I_pr, we were able to recover the concealed soft-biometric information, demonstrating the vulnerability of contemporary SB-PETs.

The second approach, which formed the core of the work that earned him the EAB award, was ASPECD: Adaptable Soft-Biometric Privacy-Enhancement using Centroid Decoding for Face Verification. This contribution focuses on template-level manipulation. “You have an image, you extract the features, and then we have multiple modules that try to preserve the privacy of each attribute separately,” he explains. “If users have different privacy-related preferences – let’s say someone is only concerned about ethnicity – you can apply only that module. After it’s applied, it doesn’t preserve gender-related information. It just targets one or both; it depends on what the user wants. In each case, you would get a template comparable to all other preferences.”
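To picture the plumbing of such per-attribute modules, here is a deliberately crude sketch (entirely our own illustration – the module internals, names, and dimensions are invented, and the real ASPECD modules are trained so that differently protected templates stay mutually comparable, which random maps do not achieve):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # assumed template dimensionality (hypothetical)

def make_module():
    """Stand-in for a learned privacy module: a fixed orthogonal map.

    In the real system, each module is a trained network that conceals
    one soft-biometric attribute in the template; a random rotation
    here only mimics the interface so the sketch runs."""
    q, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))
    return lambda t: q @ t

MODULES = {"gender": make_module(), "ethnicity": make_module()}

def protect(template, preferences):
    """Apply one module per attribute the user wants concealed."""
    t = template / np.linalg.norm(template)
    for attr in preferences:
        t = MODULES[attr](t)
        t /= np.linalg.norm(t)  # keep templates unit-length
    return t

face_template = rng.standard_normal(DIM)
t_red = protect(face_template, ["ethnicity"])               # ethnicity only
t_black = protect(face_template, ["gender", "ethnicity"])   # both attributes
print(t_red.shape, t_black.shape)
```

The point of the design is the composability: each user selects a subset of modules, yet verification remains possible across subsets.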

Although these methods are effective, they are not without limitations, which leaves plenty of room for extensions to this work. One key challenge is the correlation between certain soft-biometric attributes. “A beard reveals information about both age and gender, for example, so you’re limited in how much you can target each of those without affecting the other,” Peter points out. “Another interesting result is that different recognition models encode soft-biometric information in different ways, with different degrees of correlation.”

Ultimately, training face recognition models to inherently protect soft biometrics could be a more robust long-term solution. By embedding these features into the design of recognition networks from the outset, it may be possible to achieve better privacy protection while maintaining the high accuracy biometric systems require.

Figure: Our next contribution was ASPECD: Adaptable Soft-Biometric Privacy-Enhancement using Centroid Decoding. In this work, we do not focus on raw face images but on face templates (i.e., vectors extracted from a face, from which the similarity between faces can be calculated). It supports users with different privacy-related preferences (see the arrows of different colors, where the red user wants to preserve the privacy of ethnicity only, the black user of gender and ethnicity, etc.), while comparison between templates protected with different preferences remains possible.

ECVA Young Researcher Award

ECCV 2024 Paper

E.T. the Exceptional Trajectories: Text-to-camera-trajectory generation with character awareness

Robin Courant is a PhD student at École Polytechnique in Paris, co-supervised by Vicky Kalogeiton and Marc Christie. His research interests include computer vision, deep learning, and cinematography (Francis Ford Coppola is his favorite filmmaker – as seen in the photo on the right!). Robin speaks to us about his excellent ECCV paper, which opens up new possibilities at the meeting point of art and technology.

Cinematography is an art form where camera movement plays a critical role in bringing a story to life on a cognitive, emotional, and aesthetic level. Traditionally, this expertise has been reserved for skilled directors who understand the common language of film grammar, making every camera angle and trajectory count.

In this paper, Robin proposes a way for this knowledge to be democratized. “It’s about trying to translate the film grammar of directors – the artistic point of view of cinematography,” he tells us. “When you shoot a scene, an amateur won’t do the same things as a professional director, because directors are experts and know the film grammar. They know what feelings a certain camera trajectory will make the audience perceive.”

The idea of this work is to generate camera trajectories based on simple textual prompts. Rendering 3D scenes manually in tools like Blender or Unity can be a painstaking process. Here, users can say, ‘I want a camera trajectory moving to the right,’ and create a professional-looking shot that they can then rectify, modify, or leave as-is. The ultimate goal is to make camera trajectory design accessible to everyone, helping general users create compelling 3D scenes without needing years of experience or training.

A common concern when translating artistic processes into algorithms is the fear of losing creativity. Good cinema thrives on breaking the rules and pushing boundaries. While the current goal is not to break new ground in cinematography, Robin says the system may occasionally produce unexpected results. “Since our approach is new and imperfect in the way it copies every director’s trajectory, maybe some outlier could create a new camera trajectory that has never been seen before,” he points out. “Then we could say, okay, we’ve created a new cinema movement!”

Of course, there are always challenges when entering a project like this. One of the most daunting aspects was processing the vast datasets required to make the system work. “It was months of tedious work,” Robin admits. “While we used off-the-shelf methods to extract character and camera poses from videos with good accuracy, we had a lot of post-processing to do. It wasn’t the most interesting part of the project, but at the end of the day, it’s what makes it work!”

A crucial part of the work is the camera generation framework known as DIRECTOR, which stands for DiffusIon tRansformEr Camera TrajectORy and is designed to create smooth and realistic camera movements based on textual prompts. It is a diffusion-based model that learns the distribution of a ground-truth dataset. The training data distribution is progressively perturbed with Gaussian noise until it matches a standard normal distribution; a neural network is then used to iteratively denoise samples drawn from that Gaussian, ultimately generating new camera trajectories that align with the ground-truth distribution.

While DIRECTOR is based on established diffusion theory, the architecture draws inspiration from the Diffusion Transformer (DiT) model, which proposed different configurations and ways to incorporate conditioning within the diffusion framework. “We took inspiration from DiT and wanted to bring this kind of architecture into the motion world,” Robin explains. “Here, we’re dealing with camera movement, but we could place our work in the human motion community. It would be the closest community to ours.”
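For readers unfamiliar with the mechanics, a bare-bones DDPM-style sampling loop looks roughly like this (a generic sketch with a toy stand-in network – not DIRECTOR itself, whose denoiser is a text-conditioned DiT-style transformer):

```python
import torch

T = 1000  # number of diffusion steps (typical choice, assumed here)
betas = torch.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

# Placeholder denoiser: an untrained toy MLP stands in so the sketch runs.
denoiser = torch.nn.Sequential(
    torch.nn.Linear(6 + 1, 128), torch.nn.SiLU(), torch.nn.Linear(128, 6)
)

@torch.no_grad()
def sample(n_frames=300):
    """Iteratively denoise pure Gaussian noise into a trajectory of
    n_frames camera poses (each pose flattened to 6 numbers here)."""
    x = torch.randn(n_frames, 6)  # start from the normal distribution
    for t in reversed(range(T)):
        t_in = torch.full((n_frames, 1), t / T)
        eps = denoiser(torch.cat([x, t_in], dim=1))  # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        # Standard DDPM posterior mean step:
        x = (x - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

print(sample().shape)  # torch.Size([300, 6])
```

In the real model, the text prompt (and character information) would be fed to the denoiser as conditioning at every step.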

Looking to the future, Robin tells us the team is already working on a follow-up to this work, generating camera trajectories and human motion simultaneously to develop entire 3D scenes based solely on user input. However, he identifies textual conditioning as an area with room for improvement. “I guess it’s true for most 3D generation,” he says. “Textual conditioning is too limited in general and not user-friendly enough. I want a way to condition things with human perception, but I don’t know what it would be yet...”

Robin’s supervisors, Vicky Kalogeiton and Marc Christie, told us that Robin’s paper is special because it is one of the first attempts to democratize cinematography, making it more accessible and controllable for everyday users. “It’s super great to work with Vicky,” Robin adds. “For every PhD student, sometimes there are down moments, and you don’t know where to go, and she’s always there to motivate you. Then, after a meeting with her, you’re back at work.” Do they ever disagree on the best path to follow? “Not during my first year because I was new, but now that I’m a second-year PhD student, I have my own ideas!” he laughs. “Yeah, for sure, I fight, and we discuss to produce better ideas together.”

Editorial

Dear reader,

After almost 9 years, we say a fond farewell to our dear publisher RSIP Vision. This marks our final publication with them, capping an impressive series of 202 magazines that started in 2016. The full archives are here. Computer Vision News will continue its publications even after this separation!

This bond is the story of a bridge between industry and academia; also, between scientific journals and commercial publications. It’s the story of over 7,000 pages of scientific content distributed for free to the whole AI world. It’s also the story of thousands of scientists, whom I have interviewed and whose work I have the honor to disseminate. I will keep doing that!

My affectionate thought goes today to Ron Soferman, the CEO of RSIP Vision, the company that has generously sponsored this endeavor without ever asking for anything in exchange. Thank you, Ron, for believing in my idea and for giving me all the means to make the dream come true. How wonderful it would be if some readers, some members of the algorithm community, and some of the computer scientists who were featured in these magazines could send him a word of thanks! If you’d like, you can send your messages to me, and I’ll ensure they reach him.

Let’s talk about the future: the IEEE Computer Society - the marvelous folks who organize CVPR, WACV and ICCV - will keep partnering with Computer Vision News and have officially asked us to keep covering the meetings with the best content in town.

Thank you, Walter Scheirer, the IEEE, and Nicole Finn for your precious support, friendship and trust! Together, we will continue to have an impact and make history in the AI field. No one has even tried to imitate Computer Vision News.

Enjoy the reading and keep being part of this awesome community by subscribing for free here!

Ralph Anzarouth
Editor, Computer Vision News
editor@computervision.news

Computer Vision News
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and all conference organizers.

Women in Science

Alma Andersson is a senior machine learning scientist at Genentech.

What does a senior machine learning scientist do at Genentech?
Well, it depends on which department you’re in, but for me, I focus on method development, specifically AI and machine learning methods applied to clinical trial design. Also, target discovery, trying to facilitate the work of our biologists and the people designing the clinical trials. Developing tools for them to be better informed and make more strategic decisions.

When we say that AI may save lives, are we talking about something similar to what you are doing?
I think so, yes. I definitely think AI can save a lot of time, and implicitly, that also saves lives. We get drugs faster to the market, and we can also, hopefully, select the best drug for you as a patient. Of course, we don’t want anyone to get sick, but in case you get sick, then hopefully we can give you a better drug and a better treatment plan.

We might also be able to better discover the drugs, discover them faster, and narrow down the different targets and ideas that we pursue to make sure that we only go for the best one. That’s where AI is really effective: narrowing down the different options that we have, so that our expert biologists can then go and actually explore what we tell them to focus on. Sometimes, we’re completely wrong, but sometimes, we help them quite a lot.

What happens if you find the wrong medicine?
[laughs] Yeah, I think that’s discovered way before it hits the market, so I wouldn’t be too worried about that. If I make a mistake, someone else in the company will catch it way before the patient does, so I guess it’s just going to be me who gets a bad rep within the company. But we do sometimes present incorrect results because it can never be 100% accurate. What we’re always working towards is to decrease the error rate, essentially, so that the suggestions we make to the people who then go on to design and test the drugs are as reliable as possible.

It is generous of you as a healthy person to be dedicating the best years of your life to finding medicine for ailments that you will never have.
Well, you never know, right? [laughs] Jokes aside, it’s very nice to feel you’re making a contribution to society. A lot of people say bad things about pharma, but in the end, someone’s got to develop the drugs, and it’s a very hard and lengthy process. I think that justifies, to some extent, high prices in some cases. It makes me really happy to be able to work with my main interest. I’m so interested in the technical aspects and learning more about machine learning, but then to actually make some good use of it. I could work at a high-frequency trading kind of firm, but I’m not sure how that would help society, so this feels like a good position for me, at least.

Is this something that you could foresee 20 years ago?
20 years ago, I don’t think I foresaw a lot of things! [laughs] I did find, from almost 20 years ago, a plan for the future that we wrote in second grade, and it said, ‘I want to be an engineer and save lives.’ I did get an engineering degree, and in some minor way, maybe I contribute to saving lives.

Perhaps I foresaw a few things, but I don’t think it was anything I could have guessed, like ending up in the US or pursuing an academic path before I joined industry. Neither of my parents has a higher education. I didn’t even know what a PhD was when I started uni, so it was a very random position for me to end up in.

How near are you to your ideal definition of what a scientist should care about?
Oh, that’s a good question! I’m maybe 80% there, because I do work with methods and ideas and people that I care about, and I feel like I work towards a purpose. I think most scientists are driven by two different things: curiosity for their topic and the desire to have an impact. Being able to work towards both those things is super rewarding for me. That puts me at 80%. The remaining 20% would come from feeling I had an even faster impact on things. Working in a really big company, everything takes a bit of time, and I’m young and eager and in a rushed mode sometimes. Perhaps in a few years, I’ll give you 95% because then I’ll have acclimatized to the slower pace of things, but I’m very happy with what I’m doing now.

You talked about curiosity, you talked about impact, but you forgot the third dimension, which is probably the most important one.
Okay… and this is?

Integrity.
That’s true, yes. [laughs] That’s a very good point. That’s actually something that I’m quite surprised about, given my position. I chose to work not for an academic lab but in industry, so I thought I’d have no say in what I did or what sort of projects I pursued, but I was super surprised. Coming from Sweden, I heard so many bad things about US corporate institutions. Like you’ll just have to do what your manager says. But I’m actually able to define my own type of projects and also say no when I don’t think something’s worthwhile to pursue. Of course, I have to provide good evidence and good arguments. I have quite a lot of integrity in terms of what I work on and what I can pursue in terms of my interests. I think that axis checks a pretty high box as well.

I am happy that we agree about the three-dimensionality of science.

Yes. Now that I learnt about the third axis, I fully agree! [she laughs]

We have talked about what was in your mind 20 years ago and what is in your mind now, but what will be in your mind 20 years from now?
Wow, that’s a very good question. I never really think that far ahead. 20 years from now, I hope that I have more of my own team, where I can mentor and lead people and give them a similar experience to what I have had, where I can encourage them to pursue their interests but also guide them where they need guidance and mentor them. Leadership is really about mentoring for me, so that’s something I’d like to be able to do in the future. In the upcoming 20 years, it’s gathering as much experience and knowledge as I possibly can and then passing it down to people who are better than me at a lot of things, but maybe I can guide in some way.

To see how you lift other people up around you.
I always enjoy teaching. I was a teaching assistant at my university. I was teaching more than I went to my own classes, even. Mainly because I was poor and needed money, but I also enjoyed it! Being able to see people learn something new and how you help them in that journey. That’s super important to me. That’s also why I chose a position where I knew you could grow and create your own mini lab within the company.

Can I infer that you had good mentors?
Yeah, I want to say that I’ve had a few really good mentors throughout my life.

Do you want to tell us about a specific mentor?
Can I do three? The first one was Lucie Delemotte, one of my teachers in Dynamical Systems at my university. She saw that I was really interested in the topic, and she asked me if I wanted to do a research position in her lab. That’s very uncommon in Sweden. You never really get to do that.


I remember coming to a research lab as a second-year bachelor’s student, and I was just in awe. I was like, this is what I want to do for the rest of my life! [laughs] I want to be in a lab and do research. It was a computational lab, but just her giving me that opportunity and then giving me advice was one of the most useful things that I’ve experienced.

The second one is my PhD supervisor, Joakim Lundeberg. He was very successful in his field, and I joined the lab as a computational person, and he was on the experimental side. He had no idea about my type of stuff, but he always tried to understand it, and he was very humble about the fact that he knew more about one thing and I knew more about another. That’s very impressive from someone who’s so successful in one part of science, and then they’re open to learning from someone who’s very young and inexperienced. That really left an imprint. He gave me a lot of really good opportunities as well. That’s someone I’m very thankful to.

The third person is my current manager, Aïcha Bentaieb, who you know, actually. She’s been very patient with me and helped me with my transition to industry.

Not only do I know Aïcha, but all my readers know her, because she was named and shamed on these pages exactly eight years ago. Time flies.
Exactly, yes! [laughs]

Aïcha also was not able to say bad things about you, so it is probably reciprocal.
[laughs] That’s good. That’s very nice of her. I bet there are a lot of things to say!

Alma, we are being very positive. We find good medicines. We make good science. We get good mentors. We have great opportunities. We have a bright future. But there are ups and downs. How do you manage the downs?

That’s a good question. I try to keep a positive mindset as often as possible and then bunker up on good experiences that I can pull from when I have a down period, but sometimes that’s not enough, and that’s when it’s really important to have a good support system around you. That doesn’t have to be specifically in your work environment. It could be, but friends and partners help too – my girlfriend, for example, is someone who’s really nice to just have on the side and speak with and vent to if something’s going badly. It’s very helpful to have people who are non-scientists, who can help me decouple sometimes and isolate work from my personal life. It’s still a bit of luck: if you’re having a bad time and you have colleagues who are there for you and able to rekindle your spirit a bit, then that’s also very helpful. Having people around you where sometimes, when you’re on a low, maybe they’re on a high, and you can lift each other up. It’s a combination of that. This is very random, but I’m a very active person, and I think that helps a lot. When I’m down or frustrated, I always go for a run or a hike, and being in nature really lifts my spirit as well. I realize there’s more than just the work that I do, and I’m more than my job or the latest method that I’m developing. That’s another way for me to reconnect and restart.

Many years ago at ECCV 2016, Cheng Zhang told me that whenever she has a bad day, she says to herself: ‘It’s a bad day; it’s not a bad life!’
Yeah, that’s a very good attitude. I try to do that as well.

She is a brilliant scientist and very successful too, so this also probably depends on your attitude.
Yes, for sure, but it also depends a lot on the people around you. I think it’s easy to sometimes think that you’re the strong one, but then you realize you’re not an island. You’re an average of the people around you! I don’t want to attribute everything to my own mindset. I’m very appreciative of everyone around me as well.

We have not spoken about the fact that you are not Californian. You come from very far away. You landed with a parachute in California somehow…

Something like that, yeah! [laughs]

Do you want to say a couple of words about Sweden?
Yes, of course. It’s the most average country you’ll ever visit. If you’re just looking for a nice and chill life, it’s the most fabulous place you can find. It’s safe, it’s calm, it has good social support, and it has amazing nature. It’s a very nice place. I do think that sometimes it’s so remote that it lacks the attraction that some other cities have, like San Francisco, New York, or other tech centres. They don’t have the sun, they don’t have the big cities, so they have nothing to pull talent there. That’s why I came here instead. It’s a very nice country, but maybe not as high up on the list for talent within certain domains. I guess that’s why I ended up here.

That’s funny. I grew up in Milano in Italy, and our city anthem makes fun of the people from Napoli: “Everyone sings ‘far away from Napoli you die’, but then they come here to Milano!” Your story reminds me a little bit of that.
Yeah, that’s so funny! [laughs]

Alma, this has been a fascinating interview, but as we draw to a close, what is your final word for the community?
As an AI scientist, I find the AI revolution – or the AI hype – extremely interesting, and I’m super excited about all of the things we have ahead of us. But I also think we should be cautious and make sure we think about certain safety aspects, biases, and what could happen if some of the methods we develop end up in the wrong hands. It’s easy to be very forward-looking and just think about the exciting stuff, but the best time to think ahead is before something happens. Let’s not have 20/20 hindsight but more like 20/20 foresight!

Read 160 FASCINATING interviews with Women in Computer Vision!

Deep Learning by RSIP Vision

Extracting 3D Information from a Single Medical Image Using Deep Learning

Arik Rond, Vice President of Research & Development at RSIP Vision, talks to us about extracting 3D information from medical images and videos and how recent advances in deep learning are changing the game, enabling more precise and efficient solutions.

Accurately measuring and analyzing 3D structures is crucial for clinicians in medical procedures, and doing so has traditionally required complex techniques like stereo imaging or video analysis. However, a new solution has emerged: approximate 3D information can now be extracted from a single image, simplifying surgeries and enhancing precision.

Many interventions can greatly benefit from this kind of technology; one of the most common is kidney stone surgery and removal. A typical procedure inserts a ureteroscope into the kidney to visualize the stone while a laser breaks it down. “The question is when to stop,” Arik tells us. “The surgeon needs to remove the stone at some point but doesn’t want to do so before it’s small enough because it’ll get stuck in the ureter, requiring additional action.”

Until now, two primary methods have been used for extracting 3D information. One is video imaging, which relies on the camera’s movement to capture multiple angles of the target object. By analyzing this motion, mathematical tools can estimate the size and structure of the object. “This isn’t even deep learning,” Arik points out. “The mathematical foundation is good, but it’s slow, and there are some issues that can make these algorithms fail. For example, when there is tissue motion, especially non-rigid motion, the solution may fail. Also, the physician needs to move the camera to get enough 3D context, which makes the medical procedure more complex.”

Another approach is stereo imaging, where two images are captured simultaneously from slightly different angles. This approach is more common in robotic surgeries or with 3D laparoscopes. Most standard procedures, such as kidney stone surgery, do not use stereo imaging equipment.

However, the latest advancements in deep learning have made it possible to extract 3D information from just a single image. “We take one picture and get a depth map, which means we know how far each pixel is from the camera,” Arik explains. “A neural network computes the depths, which are combined with the RGB feed to get an RGBD image. From that, we can get a 3D mesh model.”
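To make the geometry concrete, here is a minimal sketch (our own illustration, assuming a calibrated pinhole camera and metric depth – not RSIP Vision’s implementation) of lifting pixels to 3D with a predicted depth map, the basis of the two-point measurement described below:

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with metric depth to a 3D point in the
    camera frame, using pinhole intrinsics (fx, fy, cx, cy)."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def stone_size_ok(p1_uv, p2_uv, depth_map, intrinsics, threshold_mm=3.0):
    """Distance between two clicked pixels vs. an extraction threshold.

    threshold_mm is a hypothetical value chosen for illustration."""
    fx, fy, cx, cy = intrinsics
    p1 = backproject(*p1_uv, depth_map[p1_uv[1], p1_uv[0]], fx, fy, cx, cy)
    p2 = backproject(*p2_uv, depth_map[p2_uv[1], p2_uv[0]], fx, fy, cx, cy)
    size = np.linalg.norm(p1 - p2)
    return size, size <= threshold_mm

# Toy usage: a flat depth map at 20 mm and two clicks across the stone.
depth_map = np.full((480, 640), 20.0)  # predicted depths, in millimetres
size, ok = stone_size_ok((300, 240), (340, 240), depth_map,
                         intrinsics=(600.0, 600.0, 320.0, 240.0))
print(f"{size:.1f} mm, extract: {ok}")
```

In practice the depth would come from the monocular depth network described above, and the intrinsics from the scope’s calibration.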

Moving from video or stereo imaging to a single image has several advantages. It greatly simplifies the procedure for the surgeon, who can focus on the operation at hand without needing to perform specific movements or capture images from multiple angles; the software handles all the necessary computations. As well as being widely applicable, it is also an effective tool for real-time surgical decision-making. In the kidney stone example, the surgeon can click on two points in the image, get the 3D position of each, and calculate the distance between them. If this is smaller than a defined threshold, the surgeon gets an indication and can make an informed choice about whether to extract the stone or continue to break it up.

Training the algorithms behind this innovative technology requires a substantial amount of data. A combination of real and simulated data is often used to develop and refine these models. “Simulations give us a very good ground truth,” Arik advises. “There are several variants. Some train a complex model with synthetic data, which works quite well on real cases and is then used to train a smaller model. Some train on a mix of real and synthetic cases. In all cases, simulated data is critical to the training set. Once we have ground truth, using simulated or real data, this fits the regular flow of training deep neural networks.”

While this technology holds immense promise, it is still relatively new and needs to be tested against ground truth in real-world cases. For these experiments, removed stones will need to be measured for their actual size, and these measurements compared against the ones computed from videos. Ultimately, the hope is that the technology will be widely adopted in medical practice, giving surgeons a powerful new way to analyze 3D structures and make critical decisions during procedures.

If you think we could help with your project, contact RSIP Vision today for an informal discussion about your work.
