Text-based Editing of Talking-head Video (SIGGRAPH 2019)

July 28, 2019

In this work, we present the first text-based video editing approach that lets editors insert new text, in addition to cutting, copying, and pasting the existing transcript text. Our approach allows editing at any point and synthesizes the corresponding, correctly lip-synced video.

"Okay, the market closed today with Apple's stock price at one hundred and ninety-one point four five dollars per share."

Here we replace "ninety-one point four" with "eighty-two point two".

"Okay, the market closed today with Apple's stock price at one hundred and eighty-two point two five dollars per share."

Given a video of a talking head and a transcript of the speech, we first align the transcript to the video at the level of phonemes. We also register a 3D parametric face model with the video. Given any edit operation, we perform a viseme search to find the best match between subsequences of phonemes in the edit and the input video. This step gives us parameters of our face model, which we further blend to obtain temporally coherent edits. We use these to synthesize a lower face region and combine it with the input to create a composite. To bridge the domain gap between the composites and real video footage, we use a neural face rendering approach. This gives us the final edited video.

We now show more text-based editing results. We can add new words in a sentence. Here we replace "well for our Dow" with "why are you". And here's another edit:

"I love the smell of napalm in the morning."

"I love the smell of french toast in the morning."

Note that the synthesized words were not spoken by the subjects in the training video. The audio of the new words was separately recorded. We also provide examples using a synthesized voice.

"I got fascinated by neural networks."

Here we show the retrieved frames which were used to synthesize the mouth motion.

"I got fascinated by neural networks."

Even though these frames come from different parts of the video and are not temporally coherent, our method produces temporally smooth outputs.

"I got fascinated by neural networks."

We can also synthesize audio using text-to-speech systems like the VoCo system of Jin et al. In this example, the audio for "ice creams" was generated using VoCo.

"She sells ice creams by the seashore."

We can also delete words.

"Learning from examples and, and scientists over the last few decades." "Learning from examples and scientists over the last few decades."

"Why yell or worry over silly items?" "Why yell over silly items?"

Our results can be seamlessly composited into the original video sequence, which allows us to edit videos of arbitrary resolutions.

"She sells ice cream by the seashore."

In addition to such edits, we can also synthesize full sentences just from text, to give a virtual assistant a face.

"She sells ice cream by the seashore."

Here we again show the frames from which the face model parameters were retrieved.

"She sells ice cream by the seashore." "She sells ice cream by the seashore."

"Did you hear about the crook who stole a calendar?" "No, I did not."

Full sentence synthesis enables our method to be used for video translation. Here we enable our non-German-speaking subject to speak German.

[German speech]

MorphCut is a video transition tool in Premiere Pro for removing jump cuts that is based on the work of Berthouzoz et al. In our setting, MorphCut produces artifacts, as it requires the transition to be in a relatively static part of the video.

"So deep learning is inside machine learning, is one of the approaches to machine learning." "So deep learning is inside machine learning, is one of the approaches to machine learning."

"Learning from examples and scientists over the last few decades, after about..." "Learning from examples and scientists over the last few decades, after..."

MorphCut cannot be used to composite short segments and thus cannot be used to blend our retrieved video sequences.

"The market closed today with Apple's stock price at one hundred and eighty-two point two five dollars per share." "Okay, the market closed today with Apple's stock price at one hundred and eighty-two point two five dollars per share."

We compare our neural face rendering network to Deep Video Portraits of Kim et al. As this approach does not perform text-based editing, we compare in a self-reenactment setting, where a test sequence is reproduced. Note that, for fairness, we train their approach using our recurrent generator network. Deep Video Portraits cannot deal with dynamic background or even dynamic foreground, such as the motion of the hands or the shirt. Our method handles these challenging scenarios well. We also synthesize a higher-quality mouth region compared to Kim et al.

Our approach synthesizes a higher-quality and temporally more stable mouth region than Face2Face of Thies et al.

Without our parameter blending strategy, results are temporally incoherent.

"The market closed today with Apple's stock price at one hundred and eighty-two point two five dollars per share."

We blend the parameters of the face model in every transition region. This leads to realistic-looking motion which is well aligned to the edited text.

"The market closed today with Apple's stock price at one hundred and eighty-two point two five dollars per share."

"I got into neural networks." "I got into neural networks."

Here we evaluate result quality with respect to the size of the data set.

"I love the smell of french toast in the morning." "I love the smell of french toast in the morning." "I love the smell of french toast in the morning." "I love the smell of french toast in the morning."

Best results are obtained using the full data set. The quality of results degrades gracefully with the size of the data set.

We evaluate the realism of our results by performing a web-based user study. We asked users to rate the realism of source videos which we want to edit ("The dingo ate your baby, really"), reference videos where the subjects speak the edited sentences ("The dingo ate my baby"), and edited videos obtained using our approach ("The dingo ate my baby").

"She sells seashells by the seashore." "She sells ice cream by the seashore." "She sells ice cream by the seashore."

Our results were considered realistic by more than half of the participants.

Thank you for watching.
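
The search step described in the transcript, matching subsequences of phonemes in the edit against the input video, can be illustrated with a small Python sketch. This is not the authors' implementation: the transcript does not state the matching cost, so the sketch assumes exact phoneme matches and simply covers the edited phoneme sequence with the fewest contiguous pieces found in the source. The function names `find_in_source` and `cover_edit` and the `Segment` tuple are hypothetical.

```python
from typing import List, Optional, Tuple

# (start index in the edit, start index in the source, length); hypothetical layout
Segment = Tuple[int, int, int]


def find_in_source(source: List[str], piece: List[str]) -> Optional[int]:
    """Return the first index where `piece` occurs contiguously in `source`."""
    n = len(piece)
    for i in range(len(source) - n + 1):
        if source[i:i + n] == piece:
            return i
    return None


def cover_edit(source: List[str], edit: List[str]) -> Optional[List[Segment]]:
    """Cover `edit` with the fewest contiguous pieces that occur in `source`.

    Assumption: pieces must match exactly. A real system would more likely
    score approximate matches between visually similar phonemes (visemes).
    Returns None when some phoneme of the edit never occurs in the source.
    """
    inf = float("inf")
    # best[j] = minimal number of pieces needed to cover edit[:j]
    best = [0.0] + [inf] * len(edit)
    back: List[Optional[Segment]] = [None] * (len(edit) + 1)
    for j in range(1, len(edit) + 1):
        for i in range(j):
            if best[i] == inf:
                continue  # edit[:i] cannot be covered, so skip this split
            pos = find_in_source(source, edit[i:j])
            if pos is not None and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = (i, pos, j - i)
    if best[-1] == inf:
        return None
    # Walk the back-pointers from the end of the edit to recover the pieces.
    segments: List[Segment] = []
    j = len(edit)
    while j > 0:
        seg = back[j]
        assert seg is not None
        segments.append(seg)
        j = seg[0]
    return segments[::-1]
```

For instance, with the source phonemes of "she sells seashells" written in ARPAbet-style symbols and an edit that reorders them to "sells she", `cover_edit` returns two pieces: one for "sells" and one for "she". Each piece points at a span of source frames whose face model parameters could then be retrieved and blended at the seams, as the transcript describes.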


49 Replies to “Text-based Editing of Talking-head Video (SIGGRAPH 2019)”

    All simple recordings will be inadmissible in court, lol. The complicated ones, like chasing and stabbing, are harder or near impossible to execute.

    All this kind of tech is developed by bogus or shady companies tied to western governments.

    Social media became so dominant that anyone could potentially be their own news outlet. This took all the power away from corporate media that was in bed with the State. It's in the mainstream and State-run media's interest to muddy the water on truth, and this is what they are doing.

    Ask yourself this question: why would average people, who have little or no resources and who depend on video evidence to prove their innocence when they have interactions with police and the State, create something like this?

    Couple this with a neural network that synthesises voices and an algorithm that scans video recordings to see faces from all different angles, and you can have software that can automatically create videos of any public figure saying whatever you want.

    So where is the download link?

    This research should not be public! I can't imagine many positive uses for this technology, however, I can imagine plenty of malicious uses for it. Scientists should now be focusing far more effort on using machine-learning to rapidly and accurately analyze and discriminate deep fakes like these from real videos. Because these tools have become so advanced and easy to use, Pandora's Box is now open, so it is essential to the survival of democracy and social trust around the world to develop even more powerful deep fake-spotting tools. I hope your team will be working on these!

    Said this before, will say it again: don't trust if you can't verify. We've been here before with pictures and the editing thereof.
    It's not some online generator. You still require knowledge of the program and digital media in general, in order to make a convincing fake.

    They could use AI research to cure cancer, but no… making it impossible to distinguish real video/audio evidence from fake is more fun, right?

    Trump original: "I like french toast in the morning."
    Trump edited: "Launch all nuclear missiles immediately, I like french toast in the morning."
    "Okay mr. president we will begin a neverending nuclear holocaust… and definitely have that toast ready for you?"

    What is the source of the voice at the end? Sure sounds like the Bell Telephone Lab synthesized voice of 1962. An inside joke?

    President says something, triggers an invasion/war, then claims it's "fake" and shows the "original" video.

    If you use this on yourself, will you be pronouncing Rs correctly, or will you still be saying them as an H sound?

    There must be some way to use a program that can scan it and see that it's been manipulated, so maybe it's not as dangerous as one might think.

    imagine yourself making project like that only to prove pewdiepie is really a nazi

    Ladies and gentlemen, welcome to Orwell's 1984!

    Pretty convincing, but you can hear the anomaly where the reverb behind the edited part doesn't quite match the edited words.

    remember when everyone was freaking out over that one weird frame in that interview with julian assange? this is why

    This won't be seen for a very long time. This isn't as dangerous as people are saying. It is all just fear mongering

    Undoubtedly one of the most culturally impactful emerging technologies

    thnk u

    people who promote this technology should be arrested and never allowed to see the light of day again.

    Jason Jason Jason

    Well, now we know deepfakes are on par with politicians when it comes to telling the truth.

    Because of this, I can now 100% say everything on the internet can't be trusted, every single thing.

    CBS NEWS: Putin declare war with USA, we have video evidence!

    Evil people who love to lie 🙂

    The Russians are going to wreak havoc with this.

    I know the technology has to advance in every frontier, but how can this improve the lives of people?

    The Chinese dude Zeyu Jin works for the communist party of China so all the research that was done is already on their hands.

    I am curious how many people are in prison right now, after having been convicted with the use of edited video. If it has not already happened it is coming soon, very soon.

    Now they can begin to rewrite history.

    We need to live in a world of truths and so some bright spark invents a lie machine. This will be used to manipulate and create fear and to push agendas to benefit a few while everyone else suffers.

    Have you noticed origin of the man who narrates the video?

    How to start WW3 in a few easy steps.

    Next we will see fake (white) Jesus talking in videos and people will worship him. Idolatry of images and statues is SIN. That's why lots of people will worship the antichrist even some Christians who think Jesus is the man we've painted or the one who play his role in movies. Jesus is black, his skin is black and He has white frizzy (4c type) hair, like white wool. Revelation 1. This technology is the technology of the antichrist. Repent! Turn away from sin! Walk in Holiness ! Prepare the way because the Messiah is coming soon!!

    I’ll tell you the problem with the scientific power that you’re using here: it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could and before you even knew what you had you patented it and packaged it and slapped it on a plastic lunchbox, and now you’re selling it, you want to sell it.

    In layman's terms, fake news clips on the horizon.

    This technology should either be illegal or there has to be some form of technology that can be used to discover the usage of this technology. The only reason anyone would use this is for purposes of crime, deception, fraud or worse political censorship.

    Denial of Reality. Some compromising video evidence used to control or blackmail someone becomes public. Due to this technology being shown and accepted by the public the video evidence is discounted as fake.

    oh boy

    They are working on making "The Cube" and eventually will be inside it along with the rest of us.

    I bet he would like to race against Ricky Bobby.

    For what good purpose would this software be used? I am not a bible thumper but:
    John 8:44 New International Version (NIV)

    44 You belong to your father, the devil, and you want to carry out your father’s desires. He was a murderer from the beginning, not holding to the truth, for there is no truth in him. When he lies, he speaks his native language, for he is a liar and the father of lies.

    Bravo. I guess the pace of our current march toward dystopia just wasn't enough for you guys, huh?

    The only reason to create this technology is to destroy ourselves. What a worthless exercise. Imagine the people behind this thinking what a great job they're doing. There is no good reason for this to exist, except to spread fake news, completely control thought and crush the human spirit. Maybe that's the point.

    How will we live in a world without discernable integrity?
    Remember how many pieces of previously cutting-edge software are now commonplace on PCs and laptops around the globe: CAD software, 3D animation, etc. To some extent this evolution was tempered by the parallel evolution in computing power. Now everyone has computing power at their fingertips. Imagine the time when anyone can get an easy-to-use copy of deepfake software on their laptop. No one will know whether what they see and hear is real or fake, who to believe or who they can trust. Anything a person says can be discounted as fake if it doesn't suit. Imagine a world where you have to question everything but where the majority take all for granted.

