TechcraftingAI Computer Vision

Brad Edwards

TechcraftingAI Computer Vision brings you summaries of the latest arXiv research daily. Research is read by your virtual host, Sage. The podcast is produced by Brad Edwards, an AI Engineer from Vancouver, BC, and a graduate student of computer science studying AI at the University of York. Thank you to arXiv for use of its open access interoperability.

  1. 15/06/2024

    Ep. 247 - Part 3 - June 13, 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024. 00:21: LRM-Zero: Training Large Reconstruction Models with Synthesized Data 01:56: Scale-Invariant Monocular Depth Estimation via SSI Depth 03:08: GGHead: Fast and Generalizable 3D Gaussian Heads 04:55: Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset 06:34: Towards Vision-Language Geo-Foundation Model: A Survey 08:11: SimGen: Simulator-conditioned Driving Scene Generation 09:44: Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition 11:03: Sagiri: Low Dynamic Range Image Enhancement with Generative Diffusion Prior 12:32: LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living 13:56: WonderWorld: Interactive 3D Scene Generation from a Single Image 15:21: Modeling Ambient Scene Dynamics for Free-view Synthesis 16:29: Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA 17:50: Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms 19:39: Real-Time Deepfake Detection in the Real-World 21:17: OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation 23:02: Yo'LLaVA: Your Personalized Language and Vision Assistant 24:30: MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations 26:26: Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion 28:03: Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models 29:59: ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing 31:24: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities 33:16: Towards Evaluating the Robustness of Visual State Space Models 34:57: Data Attribution for Text-to-Image Models by Unlearning Synthesized Images 36:09: CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras 37:37: Scene Graph Generation in Large-Size VHR Satellite Imagery: A Large-Scale Dataset and A Context-Aware Approach 40:02: MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding 41:40: Explore the Limits of Omni-modal Pretraining at Scale 42:46: Interpreting the Weight Space of Customized Diffusion Models 43:58: Depth Anything V2 45:12: An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels 46:23: Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models 48:11: Rethinking Score Distillation as a Bridge Between Image Distributions 49:44: VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

    52 min
  2. 15/06/2024

    Ep. 247 - Part 2 - June 13, 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024. 00:21: INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance 02:11: Large-Scale Evaluation of Open-Set Image Classification Techniques 03:43: PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation 05:00: MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era 06:41: Auto-Vocabulary Segmentation for LiDAR Points 07:30: AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring 08:43: EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts 10:23: Fine-Grained Domain Generalization with Feature Structuralization 12:03: SR-CACO-2: A Dataset for Confocal Fluorescence Microscopy Image Super-Resolution 14:13: ReMI: A Dataset for Reasoning with Multiple Images 15:41: A Large-scale Universal Evaluation Benchmark For Face Forgery Detection 17:26: Thoracic Surgery Video Analysis for Surgical Phase Recognition 18:58: Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval 20:40: Adaptive Slot Attention: Object Discovery with Dynamic Slot Number 22:26: CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification 24:22: Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024 25:21: Optimizing Visual Question Answering Models for Driving: Bridging the Gap Between Human and Machine Attention Patterns 26:30: WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals 27:44: MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction 29:28: Comparison Visual Instruction Tuning 30:51: MirrorCheck: Efficient Adversarial Defense for Vision-Language Models 32:14: Deep Transformer Network for Monocular Pose Estimation of Ship-Based UAV 33:10: Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos 34:33: Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models 36:04: StableMaterials: Enhancing Diversity in Material Generation via Semi-Supervised Learning 37:30: Parameter-Efficient Active Learning for Foundational models 38:31: Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation 40:22: Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases 42:38: Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans 44:36: Memory-Efficient Sparse Pyramid Attention Networks for Whole Slide Image Analysis 46:19: Instance-level quantitative saliency in multiple sclerosis lesion segmentation 48:37: CMC-Bench: Towards a New Paradigm of Visual Signal Compression 50:05: Needle In A Video Haystack: A Scalable Synthetic Framework for Benchmarking Video MLLMs 52:05: CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models

    53 min
  3. 15/06/2024

    Ep. 247 - Part 1 - June 13, 2024

    ArXiv Computer Vision research for Thursday, June 13, 2024. 00:21: FouRA: Fourier Low Rank Adaptation 01:41: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation 03:18: Few-Shot Anomaly Detection via Category-Agnostic Registration Learning 04:57: Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting 06:46: ToSA: Token Selective Attention for Efficient Vision Transformers 08:00: Computer vision-based model for detecting turning lane features on Florida's public roadways 09:08: Improving Adversarial Robustness via Feature Pattern Consistency Constraint 10:52: Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network 12:10: NeRF Director: Revisiting View Selection in Neural Volume Rendering 13:36: Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency 15:03: Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality 16:40: COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing 18:16: Fusion of regional and sparse attention in Vision Transformers 19:26: Zoom and Shift are All You Need 20:17: EgoExo-Fitness: Towards Egocentric and Exocentric Full-Body Action Understanding 21:49: The Penalized Inverse Probability Measure for Conformal Classification 23:24: OpenMaterial: A Comprehensive Dataset of Complex Materials for 3D Reconstruction 24:47: Blind Super-Resolution via Meta-learning and Markov Chain Monte Carlo Simulation 26:30: Computer Vision Approaches for Automated Bee Counting Application 27:17: Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding 28:16: A Label-Free and Non-Monotonic Metric for Evaluating Denoising in Event Cameras 29:43: Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer 31:25: Neural NeRF Compression 32:29: Preserving Identity with Variational Score for General-purpose 3D Editing 33:50: AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings 34:51: Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition 36:10: Enhancing Cross-Modal Fine-Tuning with Gradually Intermediate Modality Generation 37:34: AMSA-UNet: An Asymmetric Multiple Scales U-net Based on Self-attention for Deblurring 38:49: Cross-Modal Learning for Anomaly Detection in Fused Magnesium Smelting Process: Methodology and Benchmark 40:45: A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding 42:02: Steganalysis on Digital Watermarking: Is Your Defense Truly Impervious? 43:28: FacEnhance: Facial Expression Enhancing with Recurrent DDPMs 45:11: How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models 47:08: Suitability of KANs for Computer Vision: A preliminary investigation

    48 min
  4. 13/06/2024

    Ep. 246 - Part 3 - June 12, 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024. 00:20: From a Social Cognitive Perspective: Context-aware Visual Social Relationship Recognition 02:09: APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation 03:57: 2.5D Multi-view Averaging Diffusion Model for 3D Medical Image Translation: Application to Low-count PET Reconstruction with CT-less Attenuation Correction 05:47: DDR: Exploiting Deep Degradation Response as Flexible Image Descriptor 06:58: Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze 08:02: LaneCPP: Continuous 3D Lane Detection using Physical Priors 09:23: FontStudio: Shape-Adaptive Diffusion Model for Coherent and Consistent Font Effect Generation 11:10: VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks 12:46: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos 14:39: OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text 16:49: AWGUNET: Attention-Aided Wavelet Guided U-Net for Nuclei Segmentation in Histopathology Images 18:15: Diffusion Soup: Model Merging for Text-to-Image Diffusion Models 19:58: Coherent Optical Modems for Full-Wavefield Lidar 21:32: Transformation-Dependent Adversarial Attacks 22:45: PixMamba: Leveraging State Space Models in a Dual-Level Architecture for Underwater Image Enhancement 24:10: GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices 25:57: ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery 27:26: Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement 28:51: Real2Code: Reconstruct Articulated Objects via Code Generation 30:02: Human 3Diffusion: Realistic Avatar Creation via Explicit 3D Consistent Diffusion Models 31:42: RMem: Restricted Memory Banks Improve Video Object Segmentation 33:12: What If We Recaption Billions of Web Images with LLaMA-3? 34:42: Real3D: Scaling Up Large Reconstruction Models with Real-World Images 36:07: Enhancing End-to-End Autonomous Driving with Latent World Model 37:12: Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation 38:43: On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models 40:16: Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models 42:15: ICE-G: Image Conditional Editing of 3D Gaussian Splats

    44 min
  5. 13/06/2024

    Ep. 246 - Part 2 - June 12, 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024. 00:21: From Sim-to-Real: Toward General Event-based Low-light Frame Interpolation with Per-scene Optimization 01:44: Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement 03:20: Adversarial Patch for 3D Local Feature Extractor 04:00: Valeo4Cast: A Modular Approach to End-to-End Forecasting 05:38: The impact of deep learning aid on the workload and interpretation accuracy of radiologists on chest computed tomography: a cross-over reader study 08:50: Universal Scale Laws for Colors and Patterns in Imagery 10:11: CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer 11:44: ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs 13:25: Continuous fake media detection: adapting deepfake detectors to new generative techniques 15:18: Category-level Neural Field for Reconstruction of Partially Observed Objects in Indoor Environment 16:23: One-Step Effective Diffusion Network for Real-World Image Super-Resolution 18:12: 2nd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation 19:22: Diffusion-Promoted HDR Video Reconstruction 21:09: Runtime Freezing: Dynamic Class Loss for Multi-Organ 3D Segmentation 21:52: A Sociotechnical Lens for Evaluating Computer Vision Models: A Case Study on Detecting and Reasoning about Gender and Emotion 23:54: DistilDoc: Knowledge Distillation for Visually-Rich Document Applications 25:28: Using Deep Convolutional Neural Networks to Detect Rendered Glitches in Video Games 26:39: OpenCOLE: Towards Reproducible Automatic Graphic Design Generation 27:23: Dataset Enhancement with Instance-Level Augmentations 28:33: Interpretable Representation Learning of Cardiac MRI via Attribute Regularization 29:33: A New Class Biorthogonal Spline Wavelet for Image Edge Detection 30:48: Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata 32:10: Vessel Re-identification and Activity Detection in Thermal Domain for Maritime Surveillance 33:32: AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer 35:09: From Chaos to Clarity: 3DGS in the Dark 36:32: LaMOT: Language-Guided Multi-Object Tracking 38:07: UDON: Universal Dynamic Online distillatioN for generic image representations 39:49: WMAdapter: Adding WaterMark Control to Latent Diffusion Models 40:48: Blind Image Deblurring using FFT-ReLU with Deep Learning Pipeline Integration 42:06: DocSynthv2: A Practical Autoregressive Modeling for Document Generation

    43 min
  6. 13/06/2024

    Ep. 246 - Part 1 - June 12, 2024

    ArXiv Computer Vision research for Wednesday, June 12, 2024. 00:20: FaithFill: Faithful Inpainting for Object Completion Using a Single Reference Image 01:21: Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation 02:49: Unveiling the Power of Wavelets: A Wavelet-based Kolmogorov-Arnold Network for Hyperspectral Image Classification 04:26: Flexible Music-Conditioned Dance Generation with Style Description Prompts 05:52: Robust 3D Face Alignment with Multi-Path Neural Architecture Search 07:00: Small Scale Data-Free Knowledge Distillation 08:48: KernelWarehouse: Rethinking the Design of Dynamic Convolution 10:31: A Comprehensive Survey on Machine Learning Driven Material Defect Detection: Challenges, Solutions, and Future Prospects 12:34: Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation 14:02: IFTD: Image Feature Triangle Descriptor for Loop Detection in Driving Scenes 14:54: Multi-Teacher Multi-Objective Meta-Learning for Zero-Shot Hyperspectral Band Selection 16:30: DemosaicFormer: Coarse-to-Fine Demosaicing Network for HybridEVS Camera 18:10: Spatial-Frequency Dual Progressive Attention Network For Medical Image Segmentation 20:07: Accurate Explanation Model for Image Classifiers using Class Association Embedding 21:55: Real-world Image Dehazing with Coherence-based Label Generator and Cooperative Unfolding Network 23:11: SimSAM: Simple Siamese Representations Based Semantic Affinity Matrix for Unsupervised Image Segmentation 24:06: Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization 25:34: OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding 26:58: Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model 28:26: Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models 29:52: Deep Learning for Slum Mapping in Remote Sensing Images: A Meta-analysis and Review 31:49: LVBench: An Extreme Long Video Understanding Benchmark 33:14: Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking 34:48: A Robust Pipeline for Classification and Detection of Bleeding Frames in Wireless Capsule Endoscopy using Swin Transformer and RT-DETR 36:23: 3D CBCT Challenge 2024: Improved Cone Beam CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement 37:29: MWIRSTD: A MWIR Small Target Detection Dataset 38:34: CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models 40:27: A²-MAE: A spatial-temporal-spectral unified remote sensing pre-training method based on anchor-aware masked autoencoder 42:35: Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams 44:26: Identification of Conversation Partners from Egocentric Video

    46 min
  7. 13/06/2024

    Ep. 245 - Part 3 - June 11, 2024

    ArXiv Computer Vision research for Tuesday, June 11, 2024. 00:21: DERM12345: A Large, Multisource Dermatoscopic Skin Lesion Dataset with 38 Subclasses 01:44: Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration 02:49: Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning 04:04: OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding 06:01: 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models 07:24: VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs 08:58: Image Neural Field Diffusion Models 10:11: Comparing Deep Learning Models for Rice Mapping in Bhutan Using High Resolution Satellite Imagery 12:29: GLAD: Towards Better Reconstruction with Global and Local Adaptive Diffusion Models for Unsupervised Anomaly Detection 14:26: ReduceFormer: Attention with Tensor Reduction by Summation 15:23: Trim 3D Gaussian Splatting for Accurate Geometry Representation 16:44: SPIN: Spacecraft Imagery for Navigation 18:24: Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions 20:00: Understanding Visual Concepts Across Models 21:12: Instant 3D Human Avatar Generation using Image Diffusion Models 22:47: Neural Gaffer: Relighting Any Object via Diffusion 24:19: Autoregressive Pretraining with Mamba in Vision 25:51: Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance 27:19: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning 28:50: Situational Awareness Matters in 3D Vision Language Reasoning 30:10: Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense? 31:46: Zero-shot Image Editing with Reference Imitation 33:08: Image and Video Tokenization with Binary Spherical Quantization 34:18: An Image is Worth 32 Tokens for Reconstruction and Generation 36:28: Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring

    38 min
  8. 13/06/2024

    Ep. 245 - Part 2 - June 11, 2024

    ArXiv Computer Vision research for Tuesday, June 11, 2024. 00:21: NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images 01:27: Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph 03:14: T2S-GPT: Dynamic Vector Quantization for Autoregressive Sign Language Production from Text 04:45: Benchmarking and Boosting Radiology Report Generation for 3D High-Resolution Medical Images 06:23: FaceGPT: Self-supervised Learning to Chat about 3D Human Faces 07:52: RecMoDiffuse: Recurrent Flow Diffusion for Human Motion Generation 09:15: VoxNeuS: Enhancing Voxel-Based Neural Surface Reconstruction via Gradient Interpolation 10:51: RAD: A Comprehensive Dataset for Benchmarking the Robustness of Image Anomaly Detection 12:05: RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker 13:52: MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD 15:15: Can Foundation Models Reliably Identify Spatial Hazards? A Case Study on Curb Segmentation 16:56: MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance 18:20: Open-World Human-Object Interaction Detection via Multi-modal Prompts 20:03: Which Country Is This? Automatic Country Ranking of Street View Photos 20:44: Needle In A Multimodal Haystack 22:10: Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models 23:24: Towards Realistic Data Generation for Real-World Super-Resolution 24:37: Unsupervised Object Detection with Theoretical Guarantees 25:43: Embedded Graph Convolutional Networks for Real-Time Event Data Processing on SoC FPGAs 27:45: A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation 29:01: Cinematic Gaussians: Real-Time HDR Radiance Fields with Depth of Field 30:24: Minimizing Energy Costs in Deep Learning Model Training: The Gaussian Sampling Approach 32:09: Global-Regularized Neighborhood Regression for Efficient Zero-Shot Texture Anomaly Detection 33:52: Deep Implicit Optimization for Robust and Flexible Image Registration 35:28: Visual Representation Learning with Stochastic Frame Prediction

    37 min
