MLNI-Final Project Concept — Crystal Liu

Initial Thought:

I want to develop my final project based on my midterm project. As I said in my midterm documentation, I want to add a storytelling part and smoother interaction to my final project. For the storytelling part, I plan to design a theme around a festival. Since Christmas is around the corner, I chose Santa's journey on Christmas Eve as the main topic.

     

If the users touch the words "Merry Christmas", they will see this crystal ball.

As the user gets closer and closer to this image, the image will get bigger and bigger, as if the user were approaching the crystal ball in the real world. Once the distance reaches a certain point, the user will see another image, which means they have entered the scene successfully:

I will set a large size for these images, larger than the canvas size. The users can drag the image by stretching their left or right hand. They can also trigger things in the image. For example, if the butterfly approaches the elf who is raising his hands in the air in the second image, the user will hear "Merry Christmas" in an excited voice. This is the first scene. The users can go to the next scene by letting the butterfly get close to the right edge of the image. If they do so, they will see an arrow guiding them to the next scene. Every scene has its own surprising part, as in my midterm, and I plan to add some hints to guide the users. As Tristan suggested, I can use a fade function to let the users recognize they just triggered something.

Technology

The core technology is still PoseNet. I was inspired by Shenshen's and Billy's midterm projects. The users can zoom in on the image by getting closer to the screen. Also, I want to make some filters for the users, with the image or GIF positioned based on the PoseNet keypoints. I also want to use style transfer to enrich the visual output, but I'm afraid the model will get stuck and won't run smoothly.
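Something like the following sketch is what I have in mind for the zoom, assuming the distance between the two eye keypoints from PoseNet can serve as a rough proxy for how close the user is to the screen. The image file name and all the numbers are only placeholders:

```js
let video, poseNet, pose;
let scene; // the crystal-ball background image

function preload() {
  scene = loadImage('crystal_ball.png'); // placeholder file name
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.hide();
  poseNet = ml5.poseNet(video, () => console.log('PoseNet ready'));
  poseNet.on('pose', results => {
    if (results.length > 0) pose = results[0].pose;
  });
}

function draw() {
  background(0);
  if (pose) {
    // the farther apart the eyes appear, the closer the user is to the camera
    const eyeDist = dist(pose.leftEye.x, pose.leftEye.y,
                         pose.rightEye.x, pose.rightEye.y);
    const zoom = map(eyeDist, 40, 200, 1, 3, true); // grows as the user approaches
    imageMode(CENTER);
    image(scene, width / 2, height / 2, scene.width * zoom, scene.height * zoom);
    if (eyeDist > 180) {
      // close enough: switch to the next scene here
    }
  }
}
```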

 

Week 10 Assignment: Style Transfer — Crystal Liu

Model training

I chose this painting by Picasso as the target style and used cd image_path and qsub image_name colfax to upload this image to the DevCloud.

But when I submitted my training task, I couldn't see the expected result. The professor checked and found that I didn't have a local train.sh, so I created one and uploaded it to the DevCloud. This time it worked, and then I downloaded the model using scp -r colfax:train_style_transfer_devCloud/models 111 (my name). At first I ran this on the DevCloud platform, so the download failed. Then I realized that I should run it in my local terminal, and I got my model successfully.

Application

I changed the original code to apply my own model and got this:

This image is abstract and quite different from my original image, but it looks funny. Also, the canvas isn't smooth when the style changes. I think the reason might be the heavy computation, as in my midterm project. I thought this was too simple because the style could only be changed by pressing the space bar, so I added the PoseNet model to replace the key press with a pose: when the coordinates of the user's nose reached a certain range, the style would change. However, every time I changed my position, the style would flip between the original one and the Picasso style, so it looked weird and kept blinking. I had to give up this method and went back to the keyPressed function.
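A rough sketch of the version I ended up with, assuming my trained model sits in a models/picasso folder (the path is a placeholder) and following the ml5 style-transfer video pattern, with the space bar toggling the style:

```js
let video, style, poseNet, pose;
let styleOn = false; // whether to show the transferred frame
let resultImg;       // latest transferred frame

function setup() {
  createCanvas(320, 240);
  video = createCapture(VIDEO);
  video.size(320, 240);
  video.hide();
  resultImg = createImg('');
  resultImg.hide();
  style = ml5.styleTransfer('models/picasso', video, transferFrame);
  poseNet = ml5.poseNet(video, () => {});
  poseNet.on('pose', results => {
    if (results.length > 0) pose = results[0].pose;
  });
}

function transferFrame() {
  style.transfer((err, result) => {
    if (result) resultImg.elt.src = result.src;
    transferFrame(); // keep transferring the next frame
  });
}

function draw() {
  if (styleOn && resultImg.elt.src) {
    image(resultImg, 0, 0, width, height); // Picasso-styled frame
  } else {
    image(video, 0, 0, width, height);     // original camera frame
  }
}

function keyPressed() {
  if (key === ' ') styleOn = !styleOn; // pressing space switches the style
}
```

If I try the nose trigger again, adding a short cooldown before the style is allowed to flip back might be enough to stop the blinking.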

But it inspired me a lot for my final project. Maybe I can use the pitch or tone of a sound to trigger the change of style so that it enriches the visual output. For example, if the tone is high enough, the style will transfer to another one. However, the precondition is to solve the problem of the laggy live video.
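ml5 also has a pitch detection model (CREPE) that might work for this trigger. A very rough sketch, where the local model folder and the 500 Hz threshold are only placeholders:

```js
let mic, pitch;
let styleOn = false;

function setup() {
  noCanvas();
  mic = new p5.AudioIn();
  mic.start(() => {
    // the CREPE model files have to be downloaded and served locally
    pitch = ml5.pitchDetection('./model/', getAudioContext(), mic.stream, listen);
  });
}

function listen() {
  pitch.getPitch((err, frequency) => {
    if (frequency && frequency > 500) {
      styleOn = true; // the tone is high enough: switch to the transferred style
    }
    listen();         // keep listening for the next pitch estimate
  });
}
```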

Midterm Documentation — Crystal Liu

Background

My inspiration is a project called Teachable Machine. This model can be trained in real time. The user can make several similar poses as the input for a class, with a maximum of three classes. Each pose corresponds to an image or GIF. After setting up the dataset, once the user makes one of the three poses, the corresponding result will come out.

For me the core idea is excellent, but the form of the output is a little limited. There are also some projects about motion and music or other sounds.

So I want to add audio as output. Also, the sounds of different musical instruments are really artistic and people are familiar with them. Thus, my final thought is to let users trigger the sound of an instrument by acting as if they are playing the corresponding musical instrument.

Motivation

My expected midterm project is an interactive virtual instrument. First, the trained model can identify differences in how musical instruments are played. Once it gets the result, it will play the corresponding sound of the instrument. There will also be a picture of the instrument on the screen around the user.

For example, if the user pretends to be playing a guitar, the model will recognize the instrument as a guitar and automatically play a guitar sound. Then an image of a guitar will show on the screen. The expected result is that it looks as if the user is really playing the guitar on the screen.

Methodology

In order to achieve my expectation, I need technology that can locate each part of my body, identify different poses, and then classify them automatically and immediately. Based on what we have learned, I decided to use PoseNet for the locating part.

I plan to set a button on the camera canvas so that the users don't need to press the mouse to input information and the interaction will be more natural. To achieve this, I need to set a range for the coordinates of my hand. When I lift my hand into that range, the model will start receiving the input images, and 3 seconds later it will automatically stop. The next time the user makes a similar pose, the model will give the corresponding output. KNN is a classic algorithm for classification, so it can be used to let the model classify the poses in a short time and achieve real-time training.

Experiment

Virtual button

The first step in making my project is to replace the buttons that users press with the mouse with virtual buttons triggered by a body part, for example, the wrist. To achieve this, I searched for an image for the button and found the following GIF to represent the virtual button. To avoid accidental touches, I put the buttons at the top of the screen, as shown in the following picture.

I used PoseNet to get the coordinates of the user's left and right wrists and then set a range for each virtual button. If the user's wrist approaches a button, the button changes into a GIF of the corresponding instrument (guitar, drum or violin). These GIFs act as feedback to let the users know they have successfully triggered the button.
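A minimal sketch of the wrist-based button check, assuming the named keypoints that ml5's PoseNet returns; the button positions and the hit radius are placeholders:

```js
// placeholder button positions along the top of the canvas
const buttons = [
  { label: 'guitar', x: 110, y: 60 },
  { label: 'drum',   x: 320, y: 60 },
  { label: 'violin', x: 530, y: 60 },
];
const HIT_RADIUS = 60; // how close the wrist has to get, in pixels

// returns the label of the button a wrist is touching, or null
function checkButtons(pose) {
  for (const wrist of [pose.leftWrist, pose.rightWrist]) {
    if (wrist.confidence < 0.5) continue; // skip unreliable keypoints
    for (const b of buttons) {
      if (dist(wrist.x, wrist.y, b.x, b.y) < HIT_RADIUS) {
        return b.label; // this button is triggered: swap its image for the GIF
      }
    }
  }
  return null;
}
```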

After that, the model should automatically record the video as the dataset. In the original code, if the user presses the button once, five examples are added to class A. For my virtual button, the recording part should run once the user triggers the button. However, I need a delay to give users time to put their hands down and prepare to play the instrument, because the model shouldn't count the frames of users lowering their hands as part of the dataset. So I set a 3-second delay. But collecting examples is discontinuous if I keep raising my hand and dropping it.
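A sketch of the delayed collection, assuming a global pose variable updated by the PoseNet callback and the feature-flattening used in the ml5 PoseNet + KNN examples; the frame count and interval are placeholders:

```js
const knn = ml5.KNNClassifier();
let collecting = false; // can be shown on screen as feedback

// flatten the current pose into a feature array (as in the ml5 PoseNet + KNN example)
function poseArray(pose) {
  return pose.keypoints.map(k => [k.score, k.position.x, k.position.y]);
}

// called when the wrist triggers a class button
function startCollecting(label) {
  setTimeout(() => {            // wait 3 s so the user can put the hand down
    collecting = true;
    let frames = 0;
    const timer = setInterval(() => {
      if (pose) knn.addExample(poseArray(pose), label);
      frames++;
      if (frames >= 30) {       // roughly 3 s of examples at 10 per second
        clearInterval(timer);
        collecting = false;
      }
    }, 100);
  }, 3000);
}
```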

Sound

The second step is to add audio as the output. At first, I wrote: if the classification result is A, then the song will play (song.play();). But the result was that the song played a thousand times in one second, so I could only hear noise, not the sound of a guitar. I asked Cindy and Tristan for help, and they suggested the following method: if the result is A and the song is not already playing, the song will play. Finally it worked, and there was only one sound at a time.
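In code, their fix looks roughly like this inside the KNN result callback; the label and the sound file name are placeholders:

```js
let guitarSound;

function preload() {
  guitarSound = loadSound('sounds/guitar.mp3'); // placeholder file
}

// callback passed to knn.classify(...)
function gotResults(err, result) {
  if (err) return;
  // only start the sound if this class won AND the sound is not already playing
  if (result.label === 'guitar' && !guitarSound.isPlaying()) {
    guitarSound.play();
  }
}
```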

UI

The third step is to beautify the UI of my project. First is the title: Virtual Instrument. I made a rectangle as the border and added an image to decorate it. It took some time to shrink the border to a smaller size. I also added a shadow to the words and added 🎧🎼 to emphasize music.

Then I added some GIFs which show the connection between body movement and music. They are beside the camera canvas.

At last I added the border to the result part:

Problems

The problems I found in the experiment are as follows:

  1. The process of recording and collecting examples is discontinuous. It often gets stuck. The advantage is that the users can tell whether the collection has ended by seeing whether the picture is smooth or stuck. The lag may also have something to do with my computer.
  2. Sometimes the user might touch two buttons at the same time, and it is hard for me to prevent this in the code. So I just changed the range of each button to widen the gap between them.
  3. I set a button to start predicting, but it was hard for the model to catch the coordinates of the left wrist, so sometimes it took a long time for the user to start predicting. Thus, I changed the confidence score threshold from 0.8 to 0.5 to make it better (see the sketch after this list).
  4. Once the user pressed the start button, there would be a drum sound even though the user didn't do anything, which confused me. Maybe it is because KNN cannot return a result that doesn't belong to any class: the model can only pick the most likely class for the input and play the corresponding sound.
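A small sketch of the threshold change mentioned in point 3, assuming the named keypoints that ml5's PoseNet returns:

```js
const MIN_SCORE = 0.5; // lowered from 0.8 so the left wrist is accepted more often

function wristIsUsable(pose) {
  // only trust the keypoint when PoseNet is at least 50% confident about it
  return pose.leftWrist.confidence > MIN_SCORE;
}
```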

Therefore, the next step is to solve these problems and enrich the output. For example, I can add more kinds of musical instruments, and the melody can change according to the speed of the body movement.

Social impact

My goal is to create a project that lets people interact with AI in an artistic and natural way. They can make the sound of a musical instrument without having a real physical one. It is also a way to combine advanced technology and everyday art, and it provides an interesting and easy way to help people learn about and experience Artificial Intelligence in their daily life. In a word, it offers a new way of interaction between humans and computers. I think this project could be displayed in a museum and serve as a medium to bridge the viewers and the work.

Further development

The next step is to solve the problems that happened in the experiment and enrich the output. I also need to fully utilize the advantage of real-time training in my final project. My idea is to give users more opportunities to show their thoughts and creativity: to let them decide which kind of input triggers which kind of output. Also, I want to add a style transfer model to enrich the visual output. The style of the canvas can change as the mood of the melody changes. For example, if the user chooses a romantic style of melody, the color of the canvas can turn pink and pink bubbles will appear on the canvas. But the most essential problem is how to let the users create their own ways of expression through real-time training. How to make the interaction smoother is also important for the final project. In addition, I want to use sound classification to play the role of the virtual button. On the one hand, the users can create their own sound commands to control the model; on the other hand, this function can avoid the problem that happened in my previous experiment. But I am worried that the sound classification model cannot work accurately enough, so the result may not reach my expectation.

MLNI Week Assignment–Crystal Liu

This week I want to use KNN and PoseNet to make a filter, because I want to make some filters for users in my final project. I made three themes for the filters. One is a Christmas theme: if the user makes a set pose, there will be falling snow, a red hat and a white beard on the screen. For the Halloween theme, I made a spider and a pumpkin head. For the Spring Festival theme, I added fireworks and a lantern.

At first, I wanted to change the RGB of the canvas. For example, if the classification result is A, everything on the canvas looks yellow; if the result is B, the color is blue; if C, the color is red. However, it didn't work. That was because I directly used the code from last class, which drew a still screenshot, so the canvas froze. Then I made some changes but it still couldn't work. But it gave me an idea for my final project: if I solve this problem, I can apply it to my final project.
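Looking back, I think the canvas froze because the video was only drawn once. A sketch of what I probably need, where the live video is redrawn every frame and tint() applies the color for the current KNN label (the labels and colors are placeholders):

```js
let currentLabel = '';

function draw() {
  // choose a tint color based on the latest KNN result
  if (currentLabel === 'A')      tint(255, 255, 0); // yellow
  else if (currentLabel === 'B') tint(0, 0, 255);   // blue
  else if (currentLabel === 'C') tint(255, 0, 0);   // red
  else                           noTint();

  // redraw the live video every frame so the canvas never freezes into a screenshot
  image(video, 0, 0, width, height);
  noTint();
}

// callback passed to knn.classify(...)
function gotResults(err, result) {
  if (!err && result) currentLabel = result.label;
}
```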

Midterm Project Documentation–Crystal Liu

My plan

My plan is to create an interactive painting. The user is a butterfly in the painting, and the coordinates of the butterfly follow the location of the user's left eye. I want to add some surprising parts for users to explore. For example, there is a tree in the picture; if the butterfly gets close enough to it or even touches it, the leaves will fall to the ground. Also, the user can do specific poses to add objects to the picture. For example, if the user puts his or her hand in the air, a cloud will appear in the sky. To achieve this effect, I used PoseNet as the core model, since it can get the coordinates of the user's body parts.

My process

First, I searched for some beautiful paintings of natural scenes and set the following picture as the background. Then I studied the painting to find the surprising parts.

I found that there were plenty of trees, so maybe I could add a tree to the painting: when the butterfly approached, the leaves would fall to the ground. Then I searched online for a GIF of a tree and found this one. I used a screenshot of it to get a still PNG file.

 

First, I put the still image into the draw function and added an if statement so that when the butterfly approaches the still tree, the animated tree shows up.
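A sketch of that if statement, assuming treeStill and treeGif are already loaded and (bfX, bfY) is the butterfly's position; the tree position and trigger distance are placeholders:

```js
function drawTree() {
  const treeX = 400, treeY = 250;     // placeholder position of the tree in the painting
  if (dist(bfX, bfY, treeX, treeY) < 80) {
    image(treeGif, treeX, treeY);     // butterfly is close: show the falling leaves
  } else {
    image(treeStill, treeX, treeY);   // otherwise keep the still tree
  }
}
```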

I also applied the same method to this group:

 

To achieve the other idea, I used two body parts and the distance between them to trigger things, since the model can't recognize poses directly. I used the left eye and the left wrist. For example, if the hand is a certain amount above the eye, a cloud comes out. In this way, the user needs to lift his or her hand to trigger the cloud.

These are the visual parts. I also noticed that there were deer in the painting, so I wanted to add audio output: if the butterfly gets close to the deer, the user will hear the cry of a deer.

Difficulties

At first I used the coordinates of the left wrist to control the position of the butterfly, but it was too tiring to keep my hand in the air for such a long time. I also tried using the nose position, but that was not good either. In the end I decided to use the position of the left eye to move the butterfly.

Also, the butterfly couldn't fly smoothly. The professor suggested that I use lerp() to make it smooth.
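A minimal version of that smoothing, assuming (x, y) is the left-eye keypoint from PoseNet, (bfX, bfY) is where the butterfly is currently drawn, and painting and butterfly are the loaded images:

```js
function draw() {
  image(painting, 0, 0);
  // move the butterfly a small fraction of the way toward the left eye every frame
  bfX = lerp(bfX, x, 0.05);
  bfY = lerp(bfY, y, 0.05);
  image(butterfly, bfX, bfY);
}
```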

Here, the number is 0.05, so the butterfly covers 5% of the distance from (bfX, bfY) to (x, y) each frame, and then 5% of the remaining distance, and so on. It gets really close to the destination but never quite reaches it.

Another difficulty for me was differentiating the coordinates of the left wrist and the left eye. At first I wrote x1 = x, y1 = y in the left-wrist section and x2 = x, y2 = y in the left-eye section. I didn't declare x1 and x2, but it still worked. That was not the right way to do it, though. The professor told me I should declare x1 (or another name), give the variable a value first, and then use the distance function to calculate the result.
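Roughly what the professor suggested, assuming PoseNet has been set up as before and gotPoses is registered as its 'pose' callback:

```js
// declared once at the top of the sketch, not inside the PoseNet callback
let x1 = 0, y1 = 0; // left wrist
let x2 = 0, y2 = 0; // left eye

// registered in setup(): poseNet.on('pose', gotPoses);
function gotPoses(results) {
  if (results.length > 0) {
    const pose = results[0].pose;
    x1 = pose.leftWrist.x;
    y1 = pose.leftWrist.y;
    x2 = pose.leftEye.x;
    y2 = pose.leftEye.y;
  }
}

function draw() {
  // now the two points can be compared, e.g. for the cloud trigger
  const handEyeDistance = dist(x1, y1, x2, y2);
}
```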

Reflection

I received so many excellent suggestions during the presentation. I am really interested in Professor Moon's idea. My thought is that I can set different themes and give users the corresponding filters. For example, in the Christmas-themed map, if the user finds the "present" in the painting and collects it, a red hat and a white beard will appear on his or her face (since I don't want to hide the camera, the filter can appear on the camera canvas). I also like the suggestion provided by Tristan: I can make the interaction smoother by making it continuous and gradual. For example, as he said, the user can hear a slight sound of running water when approaching the waterfall, and the sound will become louder and louder as he gets closer; if the user leaves the waterfall, the sound will gradually fade away. Besides, Shenshen's project inspired me a lot. Since I want to expand my painting into various pictures with some connections (the storytelling part), Shenshen's project provides a way for the users to explore or observe my project: they can use their eyes to "walk" in the painting. I can add these thoughts to my final project. I really appreciate everyone who helped me improve my project or tested it, especially Professor Moon!