WACV 2024 Daily - Friday

A publication of the Winter Conference on Applications of Computer Vision. WACV 2024, Friday.

Umur's picks of the day:

[1524] Rotation-Constrained Cross-View Feature Fusion for Multi-View Appearance-Based Gaze Estimation
[1727] UNSPAT: Uncertainty-Guided Spatio-Temporal Transformer for 3D Human Pose and Shape Estimation on Videos
[134] 3D-Aware Talking-Head Video Motion Transfer
[143] Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
[VIRTUAL] Towards Realistic Generative 3D Face Models

Umur Aybars Çiftçi is a Research Assistant Professor at the State University of New York at Binghamton, where he leads the Image and Acoustic stream of the FRI program. His main research focuses on deepfake detection and generation using human-centric approaches. His research group also actively studies adversarial defenses against malicious uses of AI and novel approaches to using AI for good. Always on the lookout for the watermarks that mark us as human against synthetically generated or manipulated videos, Umur currently works with Ilke Demir of Intel in her battle against deepfakes, in collaboration with her Trusted Media team.

Umur forgot to tell you that he's also presenting his poster #121 today with Ilke: How Do Deepfakes Move? Motion Magnification for Deepfake Source Detection.

If you don't remember Umur from his social-media-inspired poster at WACV 2023, you can certainly catch glimpses of him still trying to avoid cameras (saying "My Face My Choice!"), enthusiastically talking about deepfakes, and exploring WACV posters for inspiration and collaboration opportunities.

Oral Presentation

Controlling Rate, Distortion, and Realism: Towards a Single Comprehensive Neural Image Compression Model

Shoma Iwai is a second-year PhD student at Tohoku University in Miyagi Prefecture, Japan. His paper proposes a single solution to two common challenges in image compression with deep learning. He speaks to us ahead of his oral presentation this afternoon.

In recent years, image compression has seen significant advancements, with neural network-driven techniques gaining attention. Several works have employed deep generative methods, such as GANs and diffusion models, to improve perceptual quality and realism. However, optimizing models for different bit rates remains a key challenge. "In image compression with deep learning, most models are optimized for a single target bit rate," Shoma begins. "In other words, we need to train multiple models to compress images into different bit rates. Enhancing the perceptual quality of compressed images is another problem, especially when we compress images to a very small data size. In that case, a lot of information is lost." Although there are existing methods that tackle these issues individually, very few studies address both, which was the motivation behind this work. The proposed variable-rate GAN-based approach places a key emphasis on the discriminator's role in training. Shoma explains that he experimented with various discriminator designs to identify the one most suitable for the task and, additionally, introduced a novel adversarial loss function.
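For readers unfamiliar with the variable-rate idea, the sketch below illustrates one common way a single model can serve multiple bit rates: conditioning the latent on a learned, per-level gain. It is a minimal illustration in the spirit of gain-unit approaches from the variable-rate literature, not Shoma's actual architecture; every module, dimension, and name here is an assumption.

```python
# Minimal sketch of variable-rate neural image compression (not the paper's
# architecture): one encoder/decoder pair, plus a learned per-level "gain"
# vector that rescales the latent so a single model covers many bit rates.
import torch
import torch.nn as nn

class VariableRateCodec(nn.Module):
    def __init__(self, num_levels: int = 8, latent_ch: int = 192):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, latent_ch, 5, stride=2, padding=2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 5, stride=2, padding=2, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1),
        )
        # One learned gain (and inverse gain) per target rate level.
        self.gain = nn.Parameter(torch.ones(num_levels, latent_ch))
        self.inv_gain = nn.Parameter(torch.ones(num_levels, latent_ch))

    def forward(self, x, level: int):
        y = self.encoder(x) * self.gain[level].view(1, -1, 1, 1)
        y_hat = torch.round(y)              # quantization of the latent
        y_hat = y_hat + (y - y.detach())    # straight-through gradient estimator
        x_hat = self.decoder(y_hat * self.inv_gain[level].view(1, -1, 1, 1))
        return x_hat, y_hat

codec = VariableRateCodec()
x = torch.rand(1, 3, 256, 256)
recon, latent = codec(x, level=3)  # same model; different levels give different rates
```

In practice, an entropy model over the quantized latent and a rate term in the loss would complete the picture; the point of the sketch is only how one set of weights can be steered to several operating points.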

"We show that these two methods improve performance and bridge the gap between the state-of-the-art and this high-controllability method," he tells us. "Our performance matches the current state-of-the-art method proposed at CVPR last year. At the same time, our method can control the bit rate with a single model. It's a very good result."

Thinking about next steps, Shoma would like to add even more controllability to the model by incorporating Region of Interest (ROI) coding. This feature enables pixel-level compression control, allowing users to prioritize the quality of specific regions of the image, a crucial advancement for practical applications. "For example, if there are three people in a picture, in most cases, the quality of the people is important, but the quality of the background isn't as much," he says. "We can maintain the high quality of the three people's regions and reduce the data size of the background."
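As a rough illustration of what ROI coding could look like during training, the sketch below weights the reconstruction loss by a per-pixel region-of-interest mask. It is a hypothetical sketch, not the paper's method; the function name, the weighting scheme, and the mask source are all assumptions.

```python
# Hypothetical sketch of ROI-weighted distortion: a per-pixel mask boosts
# reconstruction quality inside regions of interest (e.g., people) while
# letting the background compress more aggressively.
import torch

def roi_weighted_mse(x, x_hat, roi_mask, roi_weight=4.0):
    """roi_mask: (B, 1, H, W) in [0, 1], with 1 inside regions of interest."""
    weights = 1.0 + (roi_weight - 1.0) * roi_mask  # background 1x, ROI roi_weight x
    return (weights * (x - x_hat) ** 2).mean()

x = torch.rand(1, 3, 256, 256)       # original image
x_hat = torch.rand(1, 3, 256, 256)   # decoded image
mask = torch.zeros(1, 1, 256, 256)
mask[..., 64:192, 64:192] = 1.0      # e.g., a detected person region
loss = roi_weighted_mse(x, x_hat, mask)
```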

Shoma reveals that his interest in image compression stemmed from an undergraduate class on image processing, where the details of JPEG compression were explained. JPEG is the most common image compression technique, with quality parameters to control the compressed data size. The statistical analysis and engineering work behind it, and the widespread application of its standard techniques, fascinated him. So much so that he chose image compression as his research topic when he started work as a graduate student.

How does he feel now that he is about to present his work at this prestigious conference? No mean feat when you consider that not only was his paper chosen from thousands to be part of the event, but it was also picked to be one of a limited number of orals on the agenda. "That's a good question, and frankly, I'm so surprised!" he laughs. "Obviously, I was very happy when I heard it was accepted, and now it's chosen as an oral presentation. If we think about the applicability, this conference is about applications of computer vision, and our method can be applied to various use cases. Of course, there are a lot of challenges to implementation in the real world, but it has high controllability and a lot of potential. I think that's why it was chosen."

To learn more about Shoma's work, visit Orals 2.1 [Paper 2] today at 15:00-16:00 (Naupaka).

Poster Presentation

Single-Image Deblurring, Trajectory and Shape Recovery of Fast Moving Objects with Denoising Diffusion Probabilistic Models

Radim Špetlík is a PhD student at the Czech Technical University under the supervision of Jiri Matas. His paper proposes a novel method to reconstruct a video sequence from a single blurry image of a fast-moving object. He speaks to us ahead of his poster presentation tomorrow.

Radim's innovative approach involves simultaneous deblurring of the object and temporal super-resolution, a process that generates multiple renderings of the deblurred object at consecutive time steps. Existing methods require at least three consecutive images to reconstruct a fast-moving object due to the need for background estimation. However, these approaches face significant challenges in simultaneously estimating the background, reconstructing the object's original shape and texture, and determining its trajectory. "We tried a lot of things, but in the end, we didn't have the proper tool," he tells us. "We needed a very strong generative model. With the denoising diffusion probabilistic model, we finally got the right tool, and we were able to tackle the background and foreground reconstruction, the object deblurring, and the shape recovery at the same time." Radim employed a 3D diffusion model architecture conditioned on a single image. The robustness and ease of use of diffusion models were key to the method's success. In contrast to Generative Adversarial Networks (GANs), they simplified the process, making it more manageable.

"The hardest part was trying not to make mistakes because, compared to GANs, you just give the network the right loss, check your data, and it works!" he explains. "The diffusion model we used is given white Gaussian noise, and it transforms the noise into something meaningful. What we trained the model to do is, given this single image of a blurred object that is moving very fast and K Gaussian white noise initializations, reconstruct K temporally consecutive images of that object as if it were captured by a high-speed camera. It turned out to work very well."
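To make the setup concrete, here is a minimal sketch of the kind of conditional DDPM sampling loop Radim describes: K noise volumes, conditioned on the single blurry image, are iteratively denoised into K consecutive sharp sub-frames. The `denoiser` network, the noise schedule, and the shapes are placeholders, not Radim's code.

```python
# Minimal DDPM sampling sketch: conditioned on one blurry image, K Gaussian
# noise initializations are denoised into K temporally consecutive frames.
import torch

@torch.no_grad()
def sample_subframes(denoiser, blurry, K=24, T=1000, shape=(3, 128, 128)):
    betas = torch.linspace(1e-4, 0.02, T)          # standard linear schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    # One noise volume per sub-frame, stacked along a "time" axis: (1, K, C, H, W)
    x = torch.randn(1, K, *shape)
    for t in reversed(range(T)):
        # Placeholder network: predicts the added noise, conditioned on the blur.
        eps = denoiser(x, t, blurry)
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x  # (1, K, C, H, W): K consecutive deblurred frames
```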

Obtaining the training data was a challenge in itself. Instead of fine-tuning a pre-trained model, such as Stable Diffusion, Radim opted for an approach more suited to this specific deblurring and temporal super-resolution application. He didn't want to lose detail by mapping images into a lower-dimensional feature space, which often happens when using a variational autoencoder. "We trained the denoising model at the full input resolution," he reveals. "For that, we needed a lot of data. We generated our training datasets with the computer rendering program Blender. In Blender, we loaded a lot of existing 3D models and rendered them with different textures, as if they were captured by a high-speed camera. We got 24 images of these fast-moving objects, and the object was more or less sharp in every consecutive image. We then simulated the fast-moving object blur by averaging the 24 images." The process generated tens of thousands of images, and, with this vast amount of data, Radim trained the denoising diffusion probabilistic model and was pleased to discover that it generalized effectively to real-world data.
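The blur-synthesis step he describes is straightforward to reproduce: average the 24 consecutive sharp renders into one motion-blurred frame. A minimal sketch, with hypothetical file names:

```python
# Sketch of the blur synthesis Radim describes: average 24 consecutive
# high-speed renders of a moving object into one motion-blurred image.
# Paths and file layout are hypothetical.
import numpy as np
from PIL import Image

frames = [np.asarray(Image.open(f"render_{i:02d}.png"), dtype=np.float32)
          for i in range(24)]               # 24 sharp consecutive renders
blurred = np.mean(frames, axis=0)           # temporal average = synthetic blur
Image.fromarray(blurred.astype(np.uint8)).save("blurred.png")
```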

Looking ahead, he is already discussing potential future directions for the work and remains optimistic about further refining the approach. One avenue involves improving single-image temporal super-resolution, aiming to match the performance of multi-frame methods that require more than one input image. "We thought we wouldn't be able to get the same performance as these multi-frame methods because it's a very heavily ill-posed problem," Radim points out. "Knowing something about your background helps a lot. Next, we'd like to separate the tasks of estimating the background and foreground and apply different strategies to the two."

To learn more about Radim's work, visit Posters 3 [Paper 68] tomorrow at 17:15-19:15 (Naupaka).

Poster Presentation

Mini but Mighty: Finetuning ViTs with Mini Adapters

Imad Eddine Marouf is a second-year PhD student at Télécom-Paris in France under the supervision of Enzo Tartaglione and Stéphane Lathuilière. In this paper, he proposes MiMi, a parameter-efficient training framework for Vision Transformers (ViTs). He speaks to us following yesterday's poster presentation.

This work explores a parameter-efficient fine-tuning method to adapt large Vision Transformer models for downstream tasks using fewer computational resources. "We have many good models, but the issue is they're large, like LLMs and ViTs," Imad begins. "We'd like to adapt these large models for some specific tasks but don't want to use too much computational power. There is a method known as parameter-efficient fine-tuning, where the objective is to get really good performance from these large models on downstream tasks but with less computational power." The title 'Mini but Mighty' (MiMi), credited to Imad's supervisor Enzo, encapsulates the work's objective: achieving robust performance with large models using minimal computational resources. Addressing the practical applications of the work,

he highlights its ability to overcome the computational inefficiency of adapting large models for image classification, where fine-tuning the full model separately for each task wastes computation and can yield suboptimal performance. Imad observes that adapters perform poorly when their dimensions are small. To solve this, the method starts with large adapters that can reach high performance and iteratively reduces their size. However, another challenge is determining which parameters of the large model should be fine-tuned for optimal performance on a specific downstream task. With models boasting millions of parameters, selecting the right layers becomes a critical consideration. "Since we have a very large model, we don't know which layers to focus on for the downstream task," he explains. "Each task is different, so the layers in the model are quite different. We were grateful to find a way to do this dynamically. In our approach, given any downstream task, the model itself will decide which layers to focus on to get the best performance."

In terms of downstream tasks, Imad focused mainly on image classification. "We evaluated our method on 29 image classification datasets featuring medical images, cars, and sketches," he reveals. "A well-known benchmark we evaluated on was DomainNet, with images for 345 classes but in different formats – some are sketches, some are real images, and

some are animated images." The objective is to get the best performance across all these benchmarks. MiMi surpassed existing approaches in identifying the optimal balance between accuracy and trained parameters across the three benchmarks: DomainNet, VTAB, and Multi-task.
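For context, the sketch below shows a generic bottleneck adapter of the kind MiMi builds on: a small residual MLP added to each block of a frozen ViT, where only the adapters are trained. The bottleneck width is the knob MiMi starts large and then iteratively shrinks. This is a generic illustration, not the actual MiMi code; the dimensions and names are assumptions.

```python
# Generic bottleneck-adapter sketch (not the actual MiMi code): a small
# residual MLP inserted into each ViT block, trained while the backbone
# stays frozen. MiMi starts with a large `bottleneck` and shrinks it.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, d_model)    # project back up
        self.act = nn.GELU()

    def forward(self, x):
        # Residual connection keeps the frozen backbone's path intact.
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the pretrained ViT, train only the adapters.
# vit = timm.create_model("vit_base_patch16_224", pretrained=True)
# for p in vit.parameters():
#     p.requires_grad = False
# adapters = [Adapter(768, bottleneck=64) for _ in vit.blocks]  # 64 is an assumption
```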

Looking ahead, Imad outlines two key future directions for this work. First, there is an interest in streamlining the iterative process involved in finding the optimal model configuration, aiming for a more efficient and straightforward approach. Second, the plan is to extend the evaluation of the method to additional computer vision tasks, such as semantic segmentation and object detection. "With the image classification task, we've noticed that, depending on the downstream task you want to adapt your model to, it will adapt certain layers and discard the others," he points out. "We'd like to find a way to remove the iterative process because, in our approach, it's an iterative process to find the right configuration for the model."

Imad's enthusiasm for this work stems from living in an era where we have really large models at our disposal, which are excellent in general but sometimes need improvement on specific tasks. "I'd like to do this in an efficient way, using less computational power and still getting really good performance," he tells us. "I like working on computer vision in general. It's really interesting. That's why I'm focusing mainly on computer vision tasks."

Women in Computer Vision

Oana Ignat is a postdoctoral researcher at the University of Michigan. She is a co-author of a paper presented as a poster at WACV 2024: "Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models".

Oana, what do you do at UoM?

I finished my PhD there last year in August, and I continue my research at the intersection of computer vision and natural language processing. I build AI models and datasets that focus on human action understanding. That was my thesis work. Now, I'm extending this work towards analyzing model performance across demographics. Across different languages, different countries, income levels, and so on.

Does that include Romania?

Oh, yes! [she laughs] I always look to include Romania in my datasets. I really want to do more work in that area, for sure.

What is the goal of this research?

We want to see how foundational, state-of-the-art models, for example, CLIP – I have a recent paper that was presented at EMNLP in Singapore a few weeks ago – work across different demographics, because these models are usually made by research laboratories in Western countries and focus mostly on Western data. We want to see how well they perform across the world. In this paper, we look specifically at income. How this model works across different income levels. Images from households at different income levels in different countries. We found that there is a considerable gap in performance. The model performs much better on high-income versus low-income images.

How does this help us?

Well, we draw attention to this. First, we show that this gap in performance exists. This is important because it means that this model will not work well for images from those income levels or from those countries. We show that this performance is not uniform. It's not globally uniformly distributed. We want to show that we have to make sure that we train these models on data from different countries, from different demographics, and also include

annotators. People who annotate data should be included from different countries and different income levels, because we see that images look different. Even for very common household items, like a toothbrush or a refrigerator, we're used to seeing a certain image of them in Western countries, but they vary depending on demographics.

Do you like being a researcher?

Yeah, for sure. I really enjoy it. I gradually got here. Back in Romania, I worked a bit in industry. As a software developer, but still focused on research applications. I really enjoyed that because you never know what you're going to get. It's research. It's work in progress. It's for discovering new things. That's what I enjoy. It's not predictable.

How long have you been in the States?

Quite long now. Already six years.

Out of everything you do, what is one thing that you would not be able to do if you were still in Romania?

That's an interesting question. I think there are many more opportunities here in the US. I can see the university is on board with any activity I propose and will sponsor it. I feel like things are moving much faster here. People are really open to investing time and money in projects. Maybe in Romania, it's a bit more difficult to do that because the budget is stricter. There's more paperwork to do. Also, I find the optimism of people here much more encouraging. When you have an idea, usually, people are open to it. They want to try it and are excited. That excitement is contagious, and I think it leads to projects having more success and making more progress.

Is there a chance you will go back to Romania one day to work?

[Oana laughs] I'm constantly thinking about that, yes. Actually, when I started the PhD, I started with the thought in mind that I would come back after the PhD and teach, be a professor, and also collaborate with industry, because I'm in the middle. I like the application side of industry and the fast progress, and I also like the research side, the academia side, where you're open to thinking about diverse problems. Now, I'm not sure. I want to stay a bit longer in the US to explore more and have more opportunities, but I also want to collaborate with students from Romania, with professors from Romania, so I'm looking for that.

If one day you go back to Romania,

what is the one thing that you will bring back with you from the States?

Oh, I think many things! [she laughs] I hope so. Maybe one thing is this attitude that things are possible, that you have to stay optimistic and push hard for things to happen and be determined. I really value that.

I am going to ask you the opposite now – what did you bring to the Americans that they did not have before you came from Romania?

Oh, hmm. [laughs] I'm not sure. Let's see. Maybe some food! [she laughs] The national food, sarmale, maybe. Soup! I really miss the soup from back home. I wish there were more soup places in the US. [laughs] It's very healthy!

Where are you from in Romania?

I'm from Botoșani. It's a small city in the northeast of Romania.

In the direction of Ukraine?

Exactly, yes. We're at the border with Ukraine.

Wow. I am very fond of Odessa. Not very far from you guys.

Yes, yes. I hear stories about Odessa all the time from my parents, but I've never been there.

Were they in Odessa?

No, but maybe my father visited, or some friends of his. I don't know, but I've heard the name before.

With this dichotomy between industry and academia, is there a chance that you might find yourself in a position one day that bridges the two?

Yes, that's what's been on my mind. It took me a long time to decide what I wanted to do after the PhD program. Whether to go into industry or to go into academia as a professor. I had a few internships in industry, and I also got to experience it before the PhD. When I came back as a postdoc, I really enjoyed the mentoring aspect. I think that was what finally made me decide to go into academia. I really like to mentor students and work with them. In industry, there's not much opportunity for that, or things are much more product-oriented. It would be nice to have some collaboration between those two.

Can you see how that could happen? What would be the setting?

I know, at least on sabbaticals, there are professors who go to industry for a bit, maybe for a year or a semester. They're like advisors on research projects in industry. Yeah, that's an open possibility. I would like to try that to see how it is.

You have spoken to us about your current research. Are there projects coming up that you haven't told us about yet?

[Oana laughs] Yeah, there are always projects in the works. Now, I'm working with my mentee. We're working together on a multilingual dataset generated by ChatGPT. Large language models are very common, very popular nowadays, so it's very interesting to see how they work and what they output. I'm especially interested in how their output differs across languages. I'm really interested in the analysis of the data generated by GPT.

Would it be a safe bet to guess which languages you chose for your cross-language work?

Do you want to guess the languages that I'm working on?

Would it be easy to guess them?

Oh, well, yeah, maybe! [laughs] Yes, I think so.

Okay, so English, Romanian.

Yes! [she laughs] It was an easy guess. But we have 10 languages. We have more than those two.

What is the goal of this research?

We're looking mostly at analyzing the data. We're generating hotel reviews in these different languages. We want to compare the generated data with real data and see if a model can easily distinguish between what is real and what is generated. I think this is very relevant for reviews because there are more and more automatically generated reviews out there on the internet, and we want to see if we can train a model to distinguish them.

Because they are fake?

Yes, exactly. We want to catch the fake information.

Are you going to help the world prevent scams, frauds, and cheating?

Yes, hopefully it will help. It will contribute to finding those and also to analyzing what is different about this data.

Many reviews are actually written in very poor language, with typos. People never reread them. Is this noise an obstacle to your research?

That's a good point. We were thinking about this, actually. We were brainstorming what to consider when we generate this data. We noticed that GPT doesn't really generate this kind of data. It's actually the opposite. It's very polished. The style is very formal. It doesn't really look like what a person would write. Maybe we should try to include some noise in the generation process.

I am jumping to a different subject. Is there an eminent Romanian scientist from the past that you admire?

I immediately think of my advisor, Rada Mihalcea. I think she can be counted among the greatest researchers.

What did you learn from her that you would take into your own teaching?

I think this attitude of never giving up. She learns a lot from rejections. She never shies away from trying new things and submitting to conferences. Even if we get rejections, she always says the more rejections she has, the more she learns and the more success she has. That's very inspiring to me because I didn't take rejections well. They would discourage me very fast. But as a researcher, you have to deal with that on a regular basis, and it's very good to adopt the mentality that rejection is not a step back. It's a step forward. You learn from it, and you continue stronger in the future.

Read 100 FASCINATING interviews with Women in Computer Vision!

Don't miss the BEST OF WACV 2023 in Computer Vision News of February. Subscribe for free and get it in your mailbox! Click here

UKRAINE CORNER

Russian Invasion of Ukraine

WACV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. CVPR expresses its solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war. We decided to host a Ukraine Corner in the WACV Daily as well.

Mykhailo (Misha) Shvets presents his research on "Joint Depth Prediction and Semantic Segmentation with Multi-View SAM" at WACV 2024. The work combines cost-volume-based MVS and Transformer-based segmentation methods, introducing a novel multi-view, multi-task architecture. Having recently completed his PhD at UNC Chapel Hill, Misha looks forward to the next phase of his career. As a Ukrainian national, he wears a traditional vyshyvanka shirt to express solidarity with his country. Find Misha at WACV to discuss his research and Ukraine relief efforts.

Workshop

Sarah Ostadabbas (left) is an Associate Professor of Electrical and Computer Engineering at Northeastern University (NU), where she is also Director of the Augmented Cognition Laboratory (ACLab), Co-Director of the Center for Signal Processing, Imaging, Reasoning, and Learning (SPIRAL), and Director of Women in Engineering. Michael Wan (center) is a Research Scientist at the Institute for Experiential AI (EAI) at NU, working at the Roux Institute campus in Portland, Maine. Elaheh Hatamimajoumerd (right) is a Postdoctoral Researcher at ACLab and EAI. They are the co-organizers of the Computer Vision with Small Data (CV4Smalls) Workshop and are here to tell us more about Sunday's main event.

The inaugural CV4Smalls Workshop brings computer vision and machine learning specialists together to address the challenges of working with small datasets involving infants and animals. Whether it's the difficulty of gathering or accessing data or the prohibitive costs and security concerns associated with labeling, the workshop is a platform to ensure that those grappling with such data constraints can leverage and benefit from ongoing advancements in the field. Previous workshops at other conferences have focused on data-efficient machine learning, adapting approaches for domains with limited data, including concepts like few-shot and zero-shot learning. However, the data being discussed here is significantly smaller, and it is in the wild, with distribution even within classes coming from diverse sources.

"In healthcare, military, and security applications, it's very hard to collect data," Sarah tells us. "Even if you can collect data, you can't crowdsource to get labels because they're very private and secure. Sometimes, it's even hard to get adjacent domain data. You can't say, I'm sitting on this specific dataset on human behavior, and I'll just fine-tune it a bit to bring it to our application."

Bridging the domain gap is challenging. Despite remarkable strides in computer vision for human-centric applications, challenges persist in domains such as infant monitoring and endangered animal studies. The organizers believe addressing these challenges will improve outcomes in these domains and contribute to advancements in related fields facing similar data constraints. "If you're interested in applied machine learning, then you're interested in small-data machine learning because we're never going to have the huge datasets we want," Michael declares. "There's often still a focus on technical innovation or finding data that fits interesting theory. We do lots of things that have a theoretical grounding in our work, but we're focusing on domains where it's critical to solve these problems even though the datasets are so small that we can't necessarily apply existing, well-known small-data machine learning tools."

Infant Static Pose Generation: The structure of the proposed three-phase infant Static Generative Pose Model: 3D pose estimation, posture-guided generation, and image rendering.

The workshop will start with a general discussion on the challenges faced in computer vision with small data, focusing on infants and endangered animals. Then, the day will be split, with the first half related to infant health monitoring and the second dedicated to animal

behavior, including lightning talks and keynotes. Keynote speakers include the eminent Michael Black, who will discuss advancements in monitoring humans in the wild, from 2D and 3D pose to motion and activity. There will be a Best Paper Award announcement before the closing remarks at the end of the day.

Infant Dynamic Pose Generation: Overview of the proposed dynamic generative pose model. Starting with pose extraction from infant videos, we will adapt a Motion Diffusion Model to infant motion data, recalibrated for infant proportions, and fine-tune it for infant-specific movements. The final output will be scaled to infant body ratios using an infant 3D shape model.

"We have the top three papers selected, but we're still finalizing the best paper among them," Sarah teases. "It's a hard selection. We received a very high-quality set of papers. We also had a mini rebuttal, which allowed us to see how people addressed the reviews they received. It's been a rigorous process."

Just before the awards, a panel discussion will host experts from robotics, human-computer interaction, psychology, neuroscience, and computer science, highlighting the cross-disciplinary nature of the workshop. Moderated by Sarah, it promises to be a unique opportunity for the audience to engage with leaders in the field. Michael, who now works independently, tells us CV4Smalls stems from years of research in Sarah's lab. Beyond a single workshop, it represents a unique and ongoing opportunity to shape the emerging field of computer vision and machine learning applications for infant and animal domains. "That's something we've been pioneering in Sarah's lab, and it makes it really exciting because we're not just talking about a 0.5% improvement in the latest metrics

for 2D human face pose estimation or something," he explains. "We're trying to solve problems, and we're trying to define the problems as well, like with the Multiple Toddler Tracking in Indoor Videos paper. What are the important things to pay attention to? What's dangerous to safety? What's of interest to the parents?"

Elaheh adds: "What makes this workshop unique is its strong backbone. We're not just bringing part of the lab; we're bringing our collaborators. The regular open problem in computer vision is that we have a dataset, and people come and make some improvements on top of an already developed method. However, we studied different problems in the domains and worked on the building blocks of this ourselves. It will be good to show people the challenges we faced, all the open problems. We sometimes didn't even have small data and were working toward that. We believe it will have a huge impact on the whole computer vision community. We're so excited to share it with the rest of the world! Who can say they don't care about infants and endangered animals?"

Another important emphasis of the event is on the positive applications of AI amidst current concerns about its potential negative impact. For example, AI tools could detect and monitor the early signs of autism or torticollis, so that these conditions could be addressed or rehabilitated sooner. "That would have a cascading positive effect on people's lives," Sarah says. "Rather than just talking about the scariness of AI, I think now is a good time for these two domains to show the positives that our field and the students, collaborators, and researchers working in it can bring to the world for a better future."

SPAC-Net: An overview of the architecture of our synthetic prior-aware animal ControlNet (SPAC-Net), composed of three parts: pose augmentation, style transfer, and dataset generation. The SPAC-Net pipeline leads to the generation of our probabilistically valid SPAC-Animals dataset.

ID Switch and Fragmentation Errors: Toddler 1 and Toddler 3 in the top image have had their ID numbers swapped with each other in the bottom image, constituting an ID switch error. Toddler 2, present in the top image, is no longer tracked in the bottom image and is treated as a new toddler assigned ID 4, indicative of a fragmentation error. The proposed method aims to decrease both ID switches and fragmentation errors.

When she joined NU six years ago, Sarah brought small data into view when the prevailing focus was on big data. However, with models growing in size and complexity, more and more datasets are now seen as small data. It is no longer a niche problem but relatively widespread, with more people needing to think about making their models work with less data. "With these two specific applications, when we started talking to people, we saw clusters of labs and researchers working in the field, so this workshop brings them together," she reports. "It's a heartwarming feeling to see that other groups are working on this challenging problem. Reading the papers coming to us, we said, 'Wow! Interesting!' They've been working on this field from another perspective. We're looking forward to meeting those researchers who really care."

The organizers invite everyone to join the CV4Smalls Workshop on Sunday and stay tuned for the Best Paper and Special Issue journal announcements.

Multiple Toddler Tracking: MTTSort for multiple toddler tracking in indoor videos. This diagram illustrates two significant enhancements to the traditional DeepSort framework: (1) Pooled Aggregated Feature Association with a Custom Buffer, a mechanism that accumulates and consolidates features across consecutive frames in a user-defined buffer, and (2) Attention-Based Feature Extraction with a Vision Transformer (ViT), which replaces conventional CNNs for a more refined, attention-focused feature extraction process.
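As a rough illustration of the "pooled aggregated feature association" idea in the caption above, the sketch below accumulates per-track appearance features over a user-defined buffer of recent frames and averages them before matching. All class and function names are assumptions, not the MTTSort implementation.

```python
# Hedged sketch of pooled feature association: per-track appearance features
# are accumulated over a buffer of recent frames and consolidated by
# averaging, giving a more stable descriptor for track-detection matching.
from collections import deque
import numpy as np

class TrackFeatureBuffer:
    def __init__(self, buffer_size: int = 30):
        self.buffer = deque(maxlen=buffer_size)  # keeps the last N frame features

    def add(self, feature: np.ndarray):
        self.buffer.append(feature / np.linalg.norm(feature))

    def pooled(self) -> np.ndarray:
        f = np.mean(np.stack(list(self.buffer)), axis=0)  # consolidate across frames
        return f / np.linalg.norm(f)

def cosine_affinity(track_feat: np.ndarray, det_feat: np.ndarray) -> float:
    # Similarity between a track's pooled feature and a new detection's feature.
    return float(track_feat @ (det_feat / np.linalg.norm(det_feat)))
```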

Did you read Computer Vision News of December? Read it here

More Orals… ☺

Elad Hirsch (top), presenting his oral work "Asymmetric Image Retrieval With Cross Model Compatible Ensembles". Jie Zhang (bottom), presenting his oral work "Contextual Affinity Distillation for Image Anomaly Detection".

Keynote… ☺

Dima Damen (University of Bristol), presenting her keynote speech "Opportunities in Egocentric Video Understanding".

More Posters… ☺

Monika Wysoczanska (top), a PhD student at Warsaw University of Technology, presenting her work on leveraging image-text aligned models for open-vocabulary semantic segmentation at no extra cost: no training and no annotations. Cagri Gungor (bottom), a PhD student at the University of Pittsburgh, explaining his work on how the depth modality boosts weakly supervised object detection performance by analyzing the relationship between language context and depth.
