MLNI – Midterm Project (Wei Wang)

The Concept

For the midterm project, I looked at the concept of social labels. People today are heavily influenced by the labels that others passively put on them. Even positive labels can become a burden that pushes people to change their behavior to match what the label suggests, leaving no room to see their true selves. For this project, I wanted to let people physically get rid of these labels by pushing the text on the screen away with their hands. Once the labels are pushed off the screen, the real figure of the person in front of the camera is revealed. I decided to use Chinese characters meaning “optimistic”, “pessimistic”, “lazy”, and so on, so that the words have a fixed length and the visuals stay organized.

The Process

I divided the programming into two parts: the interaction with the texts, and the manipulation of the webcam image. First, I used PoseNet to capture the wrist keypoints and object-oriented programming to build an array of particles with a small animation. When a wrist position reaches any of the particles, that particle flies off the screen. The blue and green points in the video below represent the left and right wrists respectively.
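The particle interaction works roughly as follows. This is a minimal sketch rather than my exact code; names like Particle, particles, rightWristX and leftWristX are placeholders, and the p5.js globals dist() and text() are assumed.

// a minimal sketch of the text-particle interaction (names are placeholders)
class Particle {
  constructor(x, y, word) {
    this.x = x;
    this.y = y;
    this.word = word;
    this.pushed = false;       // whether a wrist has reached this particle
  }
  update(wristX, wristY) {
    if (dist(wristX, wristY, this.x, this.y) < 50) {
      this.pushed = true;      // the wrist touched the particle
    }
    if (this.pushed) {
      this.x += 20;            // once pushed, the particle flies off the screen
    }
  }
  display() {
    text(this.word, this.x, this.y);
  }
}

// inside draw(): update every particle with the wrist positions from PoseNet
for (let p of particles) {
  p.update(rightWristX, rightWristY);
  p.update(leftWristX, leftWristY);
  p.display();
}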

Then I started manipulating the webcam image shown on the screen. I used BodyPix for segmentation, captured the hand areas it detected, and showed the real-time image only in the parts of the screen where the hands were or had been. To do this, I used an array to record every pixel index the hands had reached, and revealed the image at those positions. I also drew semi-transparent ellipses over the surroundings, so that only the person in front of the camera appears clearly and the personal figure is emphasized.
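In simplified form, the reveal step looks roughly like this. It is a sketch, not my exact code: it assumes the BodyPix part-segmentation array (data) lines up one-to-one with the canvas pixels, and revealedMask and isHand() are placeholder names.

// a simplified sketch of revealing webcam pixels where the hands are or have been
// data: BodyPix part segmentation, one part id per pixel
// revealedMask: remembers every pixel index a hand has ever covered
function isHand(partId) {
  return partId == 21 || partId == 23;           // the hand part ids used in my code
}

video.loadPixels();
loadPixels();
for (let i = 0; i < data.length; i++) {
  if (isHand(data[i]) || revealedMask[i]) {
    revealedMask[i] = true;
    pixels[i * 4] = video.pixels[i * 4];         // copy the webcam pixel (R)
    pixels[i * 4 + 1] = video.pixels[i * 4 + 1]; // (G)
    pixels[i * 4 + 2] = video.pixels[i * 4 + 2]; // (B)
    pixels[i * 4 + 3] = 255;                     // fully opaque
  }
}
updatePixels();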

Finally, I put these two parts together and achieved the visual effect I expected, though I spent a great amount of time getting the speed right. Here is the final version of my project.

The Challenge

The greatest challenge of the project was speed. After I combined the code of the two parts, the models ran too slowly to interact with. I took three main approaches to speed things up.

First, I was originally using an array to store the index positions of the hand segments. Each frame, I used array.includes() to check, pixel by pixel, whether that pixel's index had already been added to the array. So for every pixel on the screen, the computer had to walk through the array from the first element until it found a matching index.

// slow version: new_index_array stores the indices themselves,
// so every lookup has to scan through the array
if (data[index] == 21 || data[index] == 23 || new_index_array.includes(index)) {
  if (new_index_array.includes(index) == false) {
    new_index_array.push(index);
  }
}

Unfortunately, this was really time-consuming. So instead of appending the position index each time a hand reached it, I created a global array at the start whose length matched the number of pixels on the screen, with every element set to 'false'. Once a position was reached by a hand, I set the element at that position to true. When deciding which part of the screen to uncover, the computer could then jump directly to that index and check whether the element was true or false.

// fast version: direct lookup by index instead of scanning the array
if (data[index] == 21 || data[index] == 23 || new_index_array[index] == true) {
  if (new_index_array[index] == false) {
    new_index_array[index] = true;
  }
}

Second, running PoseNet and BodyPix at the same time slowed the project down. So I removed PoseNet and computed the hand positions from BodyPix instead, by averaging the pixel positions of each detected hand.

let sumX_right = 0, sumX_left = 0;
let sumY_right = 0, sumY_left = 0;
let avgX_right = 0, avgX_left = 0;
let avgY_right = 0, avgY_left = 0;
let count_right = 0, count_left = 0;

// to calculate the sum of all x and y positions of hands
if (data[index] == 21) {
  sumX_right += x;
  sumY_right += y;
  count_right++;
}

if (data[index] == 23) {
  sumX_left += x;
  sumY_left += y;
  count_left++;
}

// to calculate the average of x and y positions
if (count_left > 0) {
  avgX_left = sumX_left / count_left;
  avgY_left = sumY_left / count_left;
  leftHandX = map(avgX_left, 0, img.width, 0, windowWidth);
  leftHandY = map(avgY_left, 0, img.height, 0, windowHeight);
}

if (count_right > 0) {
  avgX_right = sumX_right / count_right;
  avgY_right = sumY_right / count_right;
  rightHandX = map(avgX_right, 0, img.width, 0, windowWidth);
  rightHandY = map(avgY_right, 0, img.height, 0, windowHeight);
}

Thirdly, with help from Moon, I figured out that the text animation took up too much computational power, because it involved more complicated calculations to get the positions of the pixels that made up the texts. I also learned an efficient way to check how fast each part of the code runs alongside the machine learning models: printing the frame count whenever the model returns a result. Finally, instead of text(), I laid images of the words over the screen and adjusted their size.
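The timing trick looks roughly like this. It is a minimal sketch assuming an ml5-style result callback (gotResults is a placeholder name); it simply logs how many frames pass between consecutive results.

// a minimal sketch of the timing method: log the frame count whenever the model returns a result
let lastResultFrame = 0;

function gotResults(segmentation) {
  // frames elapsed since the previous result indicate how slow this part of the pipeline is
  console.log('frames since last result:', frameCount - lastResultFrame);
  lastResultFrame = frameCount;
}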

Future Improvement

The visualization of the webcam image could be further refined to look better aesthetically and more consistent with the words on the screen. During development, I tried using ellipses to blur the surroundings, but removed them to speed up the model. Also, the edge between the revealed and unrevealed parts is sharp and hard; smoothing it would make the visualization friendlier. One possible approach is sketched below.
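This sketch shows one possible way to soften the edge, under the assumption that the revealed area is drawn as shapes into an offscreen p5.Graphics layer rather than per pixel as in my current code; maskLayer and revealed are placeholder names.

// a possible approach: draw the revealed area into an offscreen layer, blur it, use it as an alpha mask
let maskLayer = createGraphics(width, height);
maskLayer.clear();                    // fully transparent background
maskLayer.noStroke();
maskLayer.fill(255);
// ... draw opaque shapes wherever the hands have been ...
maskLayer.filter(BLUR, 4);            // feather the boundary of the revealed region

let revealed = video.get();           // grab the current webcam frame as an image
revealed.mask(maskLayer.get());       // keep the frame only where the mask is opaque
image(revealed, 0, 0, width, height);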
