Tuesday, November 16, 2010

How does the Kinect really work?

I’m quite tired of seeing the nonsense people try to pass off as facts about how the Kinect works, so instead of complaining about it I decided to write this handy guide.

First of all, what it isn’t: it is not an EyeToy or just a webcam.

Hardware:
That weird (huge) box contains:
  • a VGA 640x480 color camera (CMOS) with a Bayer color filter
  • an IR 640x480 camera (CMOS)*
  • an IR projector
  • a motor, an assortment of control chips and 4 microphones
  • a fan!
(* it seems the output depth map is 640x480 but the IR camera sensor size is closer to 1600x1200)
 
The hardware is something researchers have been pining after for a loooong time, because it’s a cheap camera that provides a distance measurement for every pixel. Previously all they could do was guess and hope for ideal conditions so their algorithms would work.

How it works:
It is NOT a time-of-flight camera, as a lot of websites have incorrectly stated. It is a structured light camera. You’ve seen those videos of 3D scanners projecting stripes onto things, as if imitating venetian blinds? It’s the same principle at work, only there’s nothing venetian about it.

That IR projector mentioned above is an IR laser that passes through a diffraction grating and turns into a lot of IR dots.

(missing picture: the IR dot pattern)
They look chaotic because that’s how the Kinect can tell what’s going on. NO, IT WILL NOT GIVE YOU CANCER. It is a class 1 laser, so it’s safe to be exposed to it indefinitely. Each small patch of the dot pattern is unique, and the IR camera sees how it shifts and distorts and says “hmm, the distance to that point is x”.
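To illustrate why uniqueness matters, here is a toy 1D sketch in Python — not Microsoft's actual algorithm, just the general idea: the device stores a reference copy of the pattern from a known calibration distance, then finds where each small window of it reappears in the live image. The pattern values, window size, and shift below are all made up.

```python
import random

random.seed(1)
# Toy "reference pattern": the dot layout memorised at calibration time.
# random.sample guarantees all values are distinct, so every window of
# the pattern is unique -- the property described above.
reference = random.sample(range(1000), 64)
WINDOW = 9  # correlation window size (illustrative)

def find_disparity(observed, start):
    """Return how far the reference window at `start` has shifted
    in the observed image (the disparity, in pixels)."""
    target = reference[start:start + WINDOW]
    best_shift, best_score = 0, -1
    for shift in range(len(observed) - WINDOW + 1):
        window = observed[shift:shift + WINDOW]
        score = sum(a == b for a, b in zip(target, window))
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift - start

# Simulate a surface that shifts the whole pattern right by 5 pixels
# (-1 is just padding that can never match a real dot value):
observed = [-1] * 5 + reference[:-5]
print(find_disparity(observed, 20))  # → 5
```

In the real device the matching happens in 2D, and the disparity found for each patch is then converted into a distance.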
Combine that data with the color image from the camera and you have this:
(missing picture: the combined color and depth stream)
That ladies and gentlemen is a realtime 3D video stream!
Those odd shadows are where the IR camera can’t see. Combine a few more Kinects, you say? You can’t. Remember how I said the dots are unique? They become a lot less unique when two or more dot fields overlap in the same area. The Kinects would become confused, sad and depressed.
Switching between them so that only one dot field is active at a time might work, but that would divide the 30 Hz framerate by the number of Kinects, rendering the whole thing useless.
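To make the “realtime 3D video stream” above concrete: once every pixel has a depth, back-projecting it through a pinhole camera model gives a 3D point, and colouring those points with the RGB image gives the stream shown. A minimal sketch — the focal length and image centre here are illustrative stand-ins, not the Kinect’s real calibration:

```python
# Back-project a depth pixel into a 3D point (pinhole camera model):
#   x = (u - cx) * z / fx,   y = (v - cy) * z / fy,   z = depth
FX, FY = 580.0, 580.0   # assumed focal lengths in pixels
CX, CY = 320.0, 240.0   # assumed principal point (image centre)

def pixel_to_point(u, v, depth_m):
    """Map pixel (u, v) with depth in metres to a 3D point (x, y, z)."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)

# The centre pixel lies on the optical axis:
print(pixel_to_point(320, 240, 2.0))  # → (0.0, 0.0, 2.0)
```

Doing this for all 640x480 pixels, 30 times a second, is exactly what the depth stream gives you for free.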


Unfortunately, laser light is already polarized to some degree, so a hack that uses polarizing filters to let multiple Kinects operate simultaneously would prove challenging. In the same vein, wavelength shifting would be insanely complicated and expensive, as narrow-band IR filters are not cheap or easy to come by.

Radu Bogdan Rusu has said that "Judging from a few basic tests with two Kinects, I can't seem to get them to interfere with each other to the point where the data is unusable."

Also, here is a great explanation of how the pattern is used to determine depth:
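The geometry behind it is ordinary stereo triangulation between the projector and the IR camera: a dot’s horizontal shift (disparity) is inversely proportional to its distance. A minimal sketch — the focal length and baseline below are rough figures, not official calibration values:

```python
# Depth from disparity: z = f * b / d, where f is the focal length
# in pixels, b the projector-to-camera baseline in metres, and d
# the observed dot shift (disparity) in pixels.
F_PIXELS = 580.0   # assumed IR camera focal length (pixels)
BASELINE = 0.075   # assumed projector-camera baseline (metres)

def depth_from_disparity(disparity_px):
    """Return the distance in metres for a dot shifted by `disparity_px`."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return F_PIXELS * BASELINE / disparity_px

# Halving the disparity doubles the distance:
print(depth_from_disparity(43.5))   # ~1 m
print(depth_from_disparity(21.75))  # ~2 m
```

The inverse relationship also explains why depth resolution gets coarser with distance: a one-pixel disparity change covers a much larger depth range at 4 m than at 1 m.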



The software:
The software is also quite revolutionary. One guy had this to say about it:
The sophistication and performance of the algorithms rival or exceed anything that I've seen in academic research, never mind a consumer product.
...
We would all love to one day have our own personal holodeck. This is a pretty measurable step in that direction.

 
Who’s Johnny Chung Lee? Remember this guy?
(missing picture: Johnny Chung Lee)
Him.


What’s he working on? The Kinect.
He’s not exaggerating either. The software is nothing short of revolutionary. Estimates are it took 5 to 10 years to develop! That’s the average lifespan of a ferret. 




(missing picture: Ferret computer scientists)




How it works:

The team first built software that could learn, then sent it to school to learn how humans look and move.
And it works! It actually works. Until now computer vision has been this thing that sort of works and breaks if you look at it funny, but this actually works.
As if that wasn’t enough, it also understands your voice and tracks it around the room (those 4 microphones were not just for weight balancing).

Until now, researchers would pay thousands for hardware that wasn’t half as good as this, and the community has embraced it with open arms. Microsoft has said they’ll release an SDK to work with their wonderful software (hurry up, will you?). Until now computers couldn’t actually see; even the limited information they could receive wasn’t easy to interpret, so it mostly went unused. The Kinect has given computers more than eyes, it’s given them vision.


One question remains: why living-room gaming as the first target? It seems like an odd first choice. My personal theory is that it’s a Trojan horse tactic: first get it into every home, then enable everyone to write software for it. Developers, developers, developers.
One thing is certain. This has changed the face of computing and robotics forever.

34 comments:

  1. Thanks for this write up. In comment to the "why the living room" -- well, every household has one. It's more or less the "killer app" for it. Sadly, us geeks are the ones thinking of its applications beyond being a fancy toy.

  2. What do you think about the Kinect's class 1 IR laser? Is it dangerous for the eyes at short distance?
    What about people playing for a long time?

  3. A class 1 laser is eye safe under all conditions. No worries.

  4. According to research, the depth camera is 640x480, not 320x240

  5. you could use more than one kinect by polarising the light - or swapping the IR leds for ones emitting a different IR frequency

  6. There are no IR leds. It's a laser diode inside and lasers are already polarized. The idea might be worth exploring but it will be no walk in the park.

  7. Nice wrap-up. Thank you.

  8. Thanks for an interesting read. I have a couple of quibbles. Computer Vision researchers don't spend thousands on hardware; quite the opposite. The majority of systems use one or two cameras. Also, i'm not totally convinced about the scale of this revolution you're expecting. This system won't scale to large or outdoor installations easily, nor will it handle the majority of tasks computer vision focuses on (the detection of events, for example). That said, for small room systems this could be very useful, gesture recognition springs to mind.

  9. This comment has been removed by a blog administrator.

  10. Is the second image of the sensor (without the cover) photographed by you? Just curious - I would like to use it on a different webpage, and want to make sure that I give the appropriate credits. Thanks.

  11. No it's from the Ifixit teardown. The image links to it.

  12. My first thought for making the two light fields work with each other is a shutter system over the laser projector and camera. An external mod. I'd be interested in seeing how much light loss the IR(near ir) camera can deal with before depth data is lost. I'm pretty sure it could deal with a 1/15th exposure time allowing for at least two to work together.

    The next question is then what is the Jello effect like on the camera and how much does the shutter cause distortion?

    I would have assumed the camera uses a 1/60 shutter; 1/30th results in much motion blurring.

  13. yes, hurry up and release SDK PLEASE!!!!

  14. This comment has been removed by the author.

    You said: "Each of those points is unique and the IR camera sees how it distorts and says 'hmm the distance to that point is x'."

    The QUESTION is: Nobody really knows how they did that! Please clarify!

  16. My best guess:

    Each point has a default position at the minimum distance the Kinect measures. The camera recognizes a point based on its position in the grid and measures the difference between its actual position and the default position. It's the same concept as an IR projector's image getting bigger the farther away it is.

  17. Thanks for that article! I have one question - do you know if you can also operate kinect without the fan? I have to make my kinect as small as possible and it would be helpful if I could simply remove the fan without destroying kinect.

    Stefan

  18. The fan is there to keep the components in a specific range of temperature (specifically the IR camera and laser). Removing it should be interesting :P
    My guess is that it would work but who knows what your distance error will be. It certainly won't be ruined by a few tests

  19. They could prevent crosstalk by using a pseudorandom modulation scheme. The idea is that the laser's intensity can be controlled, and that a series of random intensities is repeatedly applied to the laser to form a "chirp." As the chirp is being projected it could also "simultaneously" be captured. Points that do not match the expected intensity range (intensity would vary given distance, surface reflectivity, etc) would be discarded. Points that do would be kept for the duration of the chirp. At the end of the capturing sequence, the depth map could be computed as normal from the remaining points.

    Unfortunately, this would increase the cost of the hardware, as the laser would need to be controlled by a DAC and the framerate of the camera would need to be increased. Also, unless the modulation frequency is high enough (which hardware costs would probably prohibit), a motion tracking algorithm might also need to be added to keep real dots from being falsely filtered due to their position changing during a chirp sequence.

  20. Actually, for long enough chirps, a pseudorandom square wave would probably be good enough. That would at least eliminate the need for the DAC.

  21. I just acquired a Kinect and made some incredible measurements. The laser output power is about 40mW to 60mW, i.e. it exceeds the class 1 limit a hundred times over. Moreover this is IR light, so there is no blink reflex to protect the eye. Although the diffractive optical element - which generates the structured dot pattern for 3D measurement - separates the incident laser beam into thousands of low power beams, at short distance (a few centimeters) all this power is focused on the retina. I've been involved in laser product certification, and I can't understand how Microsoft got this class 1 certification. To me this is a very dangerous device and I would recommend never looking at the laser dot pattern at less than 50cm. Take care with the children.

  23. This is 100% bullshit.
    Do you really think Microsoft would release a device that's not 100% safe? Do you really think any sensible company would take that insane risk? I have a feeling they understand this subject just a bit more than you do. The device is a certified class 1 laser, and that means you can do whatever you want with it, from any distance, for any amount of time, and it will still be safe. Unless you have some *REAL* proof... stop scaring people.

  24. I confirm that a class 1 laser cannot deliver more than 0.39mW into a 7mm diameter pupil at 7cm from the laser source (ANSI Z136/IEC 60825). This is not, by far, the case with the Kinect. So the Kinect probably doesn't use a laser source but a high power IR LED. Anyway, it is not recommended to stare at a 60mW IR source (coherent or not) at short distance.

  25. I did the same with my handycam as seen in the video, and I really don't know if it is safe to look at this laser light for hours and hours, because when you look through the camera you can't see anything but a bright light.

  26. Hi, I came across your website as I was looking for a technical description of the Kinect. I find it very interesting because I am doing some research on avatars, and how we could use them to create 'signing avatars' that could interpret into BSL (British Sign Language) rather than having real-life interpreters all the time.
    There have been comments that the Kinect avatars are actually half a second behind real-life movement; does this affect game-play? Just how much detail can be obtained with the Kinect, and could this software be adapted to have short recordings of different signs stored into a dictionary to be called by voice-activation?
    Sorry, a million questions at once, but this might be a revolutionary technical find that could change so many people's lives! :)

  27. The delay is mostly caused by the processing software; the hardware is perfectly capable of realtime work, so it also comes down to the processing power of the computer.

    Your application is tricky because in its default setting the Kinect doesn't register fingers. It is perfectly capable of doing so from close range, but the software doesn't support it. The good news is that you can write your own software using OpenNI, and soon using the Microsoft SDK. It's certainly possible to do what you want and the Kinect is perfect for it, but it won't be easy.

    If you have any more questions I'd be happy to answer them.

  28. Thank you for replying :) Made my day!
    Yes, I was thinking the finger recognition didn't really work.. I take it, it also doesn't really notice facial expressions?
    I was also looking into the PS3 Move, I don't know how much you know about that.. would that be better for finger/facial expressions? (don't worry if you don't know the answer to that!!)
    :)

  29. The PS3 Move wouldn't work AT ALL. All the Move does is track the controller through the room. No other 3D data is processed.

    For facial expressions you could process the video information from the Kinect. It's doable with OpenCV but not easy.

  30. Thank you so much for explaining how the Kinect actually works. This is a pretty good article until you start talking about personal holodecks; the tone changes to a salesman's after that point. "The Kinect has given computers more than eyes, it's given them vision." LOL... if you were a computer vision guy you wouldn't say that.

  31. I work with computer vision and I think he's right.

  32. for getting multiple angles, why double up the entire system? what about just using a second IR camera to decode the one dot-field, from a different vantage point? I.e., there'd still be a single unique dot field, you'd just pull the same distance-inference trick with the video stream from the new, 2nd IR cam.

  33. haha... I liked your last comment about the living room video console, mostly because it brings to attention the following notion:

    It took entertainment to bring a novel technology with so many other applications to fruition.

    I think the Kinect is awesome and I am using it to build a robot, but it's a bit telling of human nature where our priorities lie.
