
Microsoft releases Windows Vision Skills preview, making it easy to call computer vision APIs

via: 博客园     time: 2019/6/13 19:05:50

Editor's note: Microsoft recently released the Windows Vision Skills preview, which currently includes APIs for object detection, human skeleton detection, and emotion recognition. With Windows Vision Skills, you do not need deep computer vision expertise; you can simply call the APIs to solve some computer vision problems.

Computer vision technology has a wide range of application scenarios and strong market demand. Microsoft recently released a preview version of Windows Vision Skills, designed to simplify the deployment of computer vision technology on Windows and to help developers solve some computer vision problems by calling a set of APIs. Currently, Windows Vision Skills includes APIs for three specific computer vision skills: object detection, human skeleton detection, and emotion recognition.


Figure 1: Results of object detection, human skeleton detection, and emotion recognition from left to right

For developers, the Windows Vision Skills framework greatly lowers the barrier to applying computer vision technology. Application developers can use WinRT APIs to integrate built-in vision skills, such as object detection and human skeleton detection, into Windows applications (.NET, Win32, and UWP) without having to understand the complex algorithms and designs behind them. This greatly shortens the development cycle and improves development efficiency. In addition, computer vision developers can use the hardware acceleration framework on Windows devices to package their solutions as vision skills without worrying about the underlying design.


Among the three computer vision APIs released so far, the human skeleton detection technology comes from the Intelligent Multimedia Group of Microsoft Research Asia. Because real-world applications place extremely high demands on a model's processing speed and resource consumption, skeleton detection models built for research purposes rarely meet practical needs. To this end, researchers in the Intelligent Multimedia Group of Microsoft Research Asia designed an efficient, lightweight skeleton detection model with only 4M parameters.

We compared this lightweight skeleton detection model with OpenPose, an open-source model that is widely used in the industry. Because OpenPose's neural network is relatively complex and not suited to direct use in products, we simplified the OpenPose model (i.e., reduced its 6 stages to 1 stage to lower model complexity). Compared with this simplified OpenPose model, our proposed lightweight skeleton detection model achieved a roughly 90% reduction in computational load (86G FLOPS vs. 9G FLOPS) and a 4x increase in CPU processing speed.
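The quoted reduction can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python: the function and constant names are hypothetical, and the two figures are taken directly from the comparison above.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Return the fraction of the baseline cost removed by the optimized model."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline


# Figures quoted in the article (in units of G FLOPS).
SIMPLIFIED_OPENPOSE_G = 86.0  # simplified, 1-stage OpenPose
LIGHTWEIGHT_MODEL_G = 9.0     # proposed lightweight skeleton model

reduction = relative_reduction(SIMPLIFIED_OPENPOSE_G, LIGHTWEIGHT_MODEL_G)
print(f"Compute reduction: {reduction:.1%}")  # prints "Compute reduction: 89.5%"
```

The result, 77/86 or about 89.5%, is consistent with the rounded 90% figure.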

Dr. Zeng Wenjun, Principal Investigator of Microsoft Research Asia, said, “Microsoft Research Asia has long been committed to basic research and to bringing its results into products. Our range of vision technologies, such as object tracking and pedestrian recognition, will be released on the Windows Vision Skills framework and the Microsoft Cognitive Services platform.”

Human skeleton detection is a basic task in computer vision and plays an important role in understanding and analyzing people in images and videos. The skeleton detection model detects and locates key points of the human body in images and videos (such as shoulders, wrists, and knees), as shown in Figure 2. Because the human skeleton carries many kinds of information, such as body characteristics, posture, and movement, the skeleton detection model has great application value in fields such as entertainment, education, and medical care.


Figure 2: Human skeleton detection
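To make the idea of key points concrete, here is a minimal Python sketch of how a detected skeleton might be represented and filtered before drawing. It is not the output format of the Windows Vision Skills API: the `Keypoint` type, the joint names, the `LIMBS` list, and the 0.5 confidence threshold are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class Keypoint(NamedTuple):
    """A hypothetical detected joint: normalized image position plus a score."""
    x: float
    y: float
    confidence: float


# Hypothetical limb definitions: pairs of joint names to connect when drawing.
LIMBS: List[Tuple[str, str]] = [
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
    ("left_hip", "left_knee"),
    ("left_knee", "left_ankle"),
]


def visible_limbs(
    skeleton: Dict[str, Keypoint], min_confidence: float = 0.5
) -> List[Tuple[str, str]]:
    """Keep only limbs whose two end joints were both detected confidently."""
    return [
        (a, b)
        for a, b in LIMBS
        if a in skeleton
        and b in skeleton
        and skeleton[a].confidence >= min_confidence
        and skeleton[b].confidence >= min_confidence
    ]


# Made-up detections for one arm; the wrist has a low score.
example = {
    "left_shoulder": Keypoint(0.40, 0.20, 0.9),
    "left_elbow": Keypoint(0.45, 0.35, 0.8),
    "left_wrist": Keypoint(0.50, 0.50, 0.3),
}
print(visible_limbs(example))  # [('left_shoulder', 'left_elbow')]
```

Dropping low-confidence joints before drawing is one common way to avoid rendering limbs for body parts that are occluded or outside the frame.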

Human skeleton detection has the following application scenarios:

Virtual Reality: In social and entertainment settings, people sometimes want to add special effects based on the human skeleton to create enhanced, engaging content that helps convey information.

Behavior recognition: The human skeleton captures the body's posture and movement and thus provides important cues for recognizing types of human behavior. Observational studies by psychophysicist Gunnar Johansson showed that human behavior can be identified by observing the movement of a limited number of joint points on the human body. In recent years, much work has studied how to design skeleton-based behavior recognition models.

Human-computer interaction: A core problem in machine intelligence is perceiving and understanding human language, and even body language, in order to respond in a timely way. Explicit skeleton information makes body language and instructions easier to understand.

Motion analysis: In medical rehabilitation and physical exercise, intelligent analysis of human movement can greatly reduce the manual effort required and improve rehabilitation and training efficiency. For example, in the assessment and rehabilitation of osteoarthrosis, skeleton detection can be used to analyze a patient's walking pattern and to assess joint flexibility and the severity of the disease.
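As a rough illustration of the motion analysis scenario, the Python sketch below computes the angle at a joint from three detected key points, for example the knee angle from the hip, knee, and ankle positions. The function name and the sample coordinates are hypothetical, and a real gait assessment would involve far more than a single 2D angle.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Return the angle at point b, in degrees, formed by segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        raise ValueError("adjacent key points must not coincide with the joint")
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# Made-up image coordinates (pixels) for hip, knee, and ankle.
straight_leg = joint_angle((100.0, 100.0), (100.0, 200.0), (100.0, 300.0))
bent_leg = joint_angle((100.0, 100.0), (100.0, 200.0), (200.0, 200.0))
print(round(straight_leg), round(bent_leg))  # 180 90
```

Tracking such an angle over the frames of a video gives a simple range-of-motion curve for the joint, which is one way the flexibility mentioned above could be quantified.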


You can find usage examples for the Windows Vision Skills human skeleton detection, object detection, and emotion recognition APIs at the following website:

Use example


For more information, please refer to the Windows Vision Skills tutorial and the NuGet.org package:

Use tutorial



NuGet package


As high-level semantic information about the human body, the human skeleton is often used as effective auxiliary information for other research tasks. For example, in person re-identification, human skeleton information is often used to help detect body parts and thereby address the spatial semantic misalignment between different images. In an upcoming article, we will discuss in detail our CVPR 2019 paper on person re-identification. In that paper, to address the practical problem of spatial misalignment in person re-identification, we use finer-grained dense semantics (Dense Pose) to help the network learn robust features.

Paper: Densely Semantically Aligned Person Re-identification, CVPR, 2019

Zhizheng Zhang, Cuiling Lan, Wenjun Zeng, Zhibo Chen

About the Author


Lan Cuiling, a researcher in the Intelligent Multimedia Group of Microsoft Research Asia, works on computer vision and signal processing. Research interests include behavior recognition, pose estimation, pedestrian recognition, and video analysis, and Lan has published 30 papers in several top conferences and journals.
