A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.
While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, significant biases, governance, transparency and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technology we choose to apply.
"Science cannot tell us what we ought to do, only what we can do."
- Jean-Paul Sartre, Being and Nothingness
Here is a work in progress of the most interesting research papers of 2022. In short, it is a curated list of the latest breakthroughs in AI and Data Science, ordered by release date, with a clear video explanation, a link to a more in-depth article, and code (if applicable). Enjoy the read!
The complete reference to each paper is listed at the end of this repository. Star this repository to stay up to date! ⭐️
Maintainer: louisfb01
Subscribe to my newsletter - The latest updates in AI explained every week.
Feel free to message me any interesting paper I may have missed, so I can add it to this repository.
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.
👉Follow this quick guide, use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your W&B account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)
🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!
- Resolution-robust Large Mask Inpainting with Fourier Convolutions [1]
- Stitch it in Time: GAN-Based Facial Editing of Real Videos [2]
- NeROIC: Neural Rendering of Objects from Online Image Collections [3]
- SpeechPainter: Text-conditioned Speech Inpainting [4]
- Towards real-world blind face restoration with generative facial prior [5]
- 4D-Net for Learned Multi-Modal Alignment [6]
- Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [7]
- Hierarchical Text-Conditional Image Generation with CLIP Latents [8]
- MyStyle: A Personalized Generative Prior [9]
- OPT: Open Pre-trained Transformer Language Models [10]
- BlobGAN: Spatially Disentangled Scene Representations [11]
- A Generalist Agent [12]
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [13]
- Paper references
You’ve most certainly experienced this situation once: you take a great picture with your friend, and someone is photobombing behind you, ruining your future Instagram post. Well, that’s no longer an issue. Whether it’s a person or a trashcan you forgot to remove before taking your selfie, this AI will automatically remove the undesired object or person from the image and save your post. It’s just like having a professional Photoshop designer in your pocket, all with a simple click!
This task of removing part of an image and replacing it with what should appear behind has been tackled by many AI researchers for a long time. It is called image inpainting, and it’s extremely challenging...
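To make the idea concrete, here is a toy, non-learned sketch of the inpainting setup (this has nothing to do with LaMa's actual architecture): mark the unwanted region with a binary mask, then repeatedly diffuse the surrounding pixels into the hole. Learned models like LaMa replace this naive fill with a network that predicts plausible content instead.

```python
import numpy as np

def naive_inpaint(image, mask, iters=300):
    """Toy fill: repeatedly replace each masked pixel with the mean of its
    4 neighbours, diffusing the surrounding colours into the hole.
    Learned inpainting models predict plausible content here instead."""
    out = image.astype(float).copy()
    ys, xs = np.where(mask)
    h, w = out.shape
    for _ in range(iters):
        out[ys, xs] = (out[np.clip(ys - 1, 0, h - 1), xs] +
                       out[np.clip(ys + 1, 0, h - 1), xs] +
                       out[ys, np.clip(xs - 1, 0, w - 1)] +
                       out[ys, np.clip(xs + 1, 0, w - 1)]) / 4
    return out

# A flat grey "photo" with a bright photobomber patch, masked for removal:
img = np.full((32, 32), 100.0)
img[10:20, 10:20] = 255.0              # the unwanted object
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True              # the region to fill in
restored = naive_inpaint(img, mask)    # the hole blends back to ~100
```

On real photos this naive diffusion produces blurry smears for anything but tiny holes, which is exactly why large-mask inpainting needs learned models.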
- Short Video Explanation:
- Short read: This AI Removes Unwanted Objects From your Images!
- Paper: Resolution-robust Large Mask Inpainting with Fourier Convolutions
- Code
- Colab Demo
- Product using LaMa
You've most certainly seen movies like the recent Captain Marvel or Gemini Man, where Samuel L. Jackson and Will Smith appeared to look much younger. Achieving this requires hundreds if not thousands of hours of work from professionals manually editing the scenes they appear in. Instead, you could use a simple AI and do it within a few minutes. Indeed, many techniques allow you to add smiles or make someone look younger or older, all automatically, using AI-based algorithms. This is called AI-based face manipulation in videos, and here's the current state of the art in 2022!
- Short Video Explanation:
- Short read: AI Facial Editing of Real Videos! Stitch it in Time Explained
- Paper: Stitch it in Time: GAN-Based Facial Editing of Real Videos
- Code
Neural rendering is the ability to generate a photorealistic model in space from pictures of an object, person, or scene of interest. In this case, you’d have a handful of pictures of a sculpture and ask the machine to understand what the object in these pictures should look like in space. You are basically asking a machine to understand physics and shapes from images. This is quite easy for us since we only know the real world and depths, but it’s a whole other challenge for a machine that only sees pixels. It’s great that the generated model looks accurate, with realistic shapes, but what about how it blends into a new scene? And what if the lighting conditions vary across the pictures taken, and the generated model looks different depending on the angle you look at it? This would automatically seem weird and unrealistic to us. These are the challenges Snapchat and the University of Southern California attacked in this new research.
- Short Video Explanation:
- Short read: Create Realistic 3D Renderings with AI!
- Paper: NeROIC: Neural Rendering of Objects from Online Image Collections
- Code
We’ve seen image inpainting, which aims to remove an undesirable object from a picture. The machine learning-based techniques do not simply remove the objects, but they also understand the picture and fill the missing parts of the image with what the background should look like. The recent advancements are incredible, just like the results, and this inpainting task can be quite useful for many applications like advertisements or improving your future Instagram post. We also covered an even more challenging task: video inpainting, where the same process is applied to videos to remove objects or people.
The challenge with videos comes with staying consistent from frame to frame without any buggy artifacts. But now, what happens if we correctly remove a person from a movie and the sound is still there, unchanged? Well, we may hear a ghost and ruin all our work.
This is where a task I never covered on my channel comes in: speech inpainting. You heard that right: researchers from Google just published a paper aiming at inpainting speech, and, as we will see, the results are quite impressive. Okay, we might rather hear than see the results, but you get the point. The model can correct your grammar and pronunciation, or even remove background noise. All things I definitely need to keep working on, or… I could simply use their new model. Listen to the examples in my video!
- Short Video Explanation:
- Short read: Speech Inpainting with AI!
- Paper: SpeechPainter: Text-conditioned Speech Inpainting
- Listen to more examples
Do you also have old pictures of yourself or close ones that didn’t age well or that you, or your parents, took before we could produce high-quality images? I do, and I felt like those memories were damaged forever. Boy, was I wrong!
This new and completely free AI model can fix most of your old pictures in a split second. It works well even with very low or high-quality inputs, which is typically quite the challenge.
This week’s paper, called Towards Real-World Blind Face Restoration with Generative Facial Prior, tackles the photo restoration task with outstanding results. What’s even cooler is that you can try it yourself, in your preferred way: they have open-sourced their code and created a demo and online applications you can try right now. If the results you’ve seen above aren’t convincing enough, just watch the video and let me know what you think in the comments. I know it will blow your mind!
- Short Video Explanation:
- Short read: Impressive photo restoration by AI!
- Paper: Towards real-world blind face restoration with generative facial prior
- Code
- Colab Demo
- Online app
How do autonomous vehicles see?
You’ve probably heard of the LiDAR sensors or other weird cameras they use. But how do they work, how do they see the world, and what exactly do they see compared to us? Understanding how they work is essential if we want to put them on the road, especially if you work in government or are building the next regulations, but also as a client of these services.
We previously covered how Tesla’s Autopilot sees and works, but it is different from conventional autonomous vehicles. Tesla uses only cameras to understand the world, while most other companies, like Waymo, use regular cameras plus 3D LiDAR sensors. These LiDAR sensors are pretty simple to understand: instead of images like regular cameras, they produce 3D point clouds. They measure the distance to objects by calculating the travel time of the laser pulses they project onto them.
Still, how can we efficiently combine this information and have the vehicle understand it? And what does the vehicle end up seeing? Only points everywhere? Is it enough for driving on our roads? We will look into this with a new research paper by Waymo and Google Research...
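The travel-time idea above boils down to one line of arithmetic: the pulse covers the sensor-to-object distance twice, at the speed of light. A minimal sketch (the 200-nanosecond pulse is just a made-up example):

```python
C = 299_792_458  # speed of light in m/s

def lidar_distance(round_trip_seconds: float) -> float:
    """Distance to the reflecting surface: the pulse travels there
    and back, so divide the total path (c * t) by two."""
    return C * round_trip_seconds / 2

# A pulse returning after ~200 nanoseconds hit something ~30 m away:
distance_m = lidar_distance(200e-9)   # ≈ 29.98 m
```

Repeating this for millions of pulses per second, swept across the scene, is what produces the 3D point cloud the paper fuses with camera images.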
- Short Video Explanation:
- Short read: Combine Lidar and Cameras for 3D object detection - Waymo
- Paper: 4D-Net for Learned Multi-Modal Alignment
As if taking a picture wasn’t a challenging enough technological prowess, we are now doing the opposite: modeling the world from pictures. I’ve covered amazing AI-based models that can take images and turn them into high-quality scenes. It is a challenging task, taking a few images from the 2-dimensional picture world to create how the object or person would look in the real world.
Take a few pictures and instantly have a realistic model to insert into your product. How cool is that?!
The results have dramatically improved upon the first model I covered in 2020, called NeRF. And this improvement isn’t only about the quality of the results. NVIDIA made it even better.
Not only is the quality comparable, if not better, but it is more than 1,000 times faster, with less than two years of research in between.
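A big part of that speed-up is the paper's multiresolution hash encoding: instead of a huge dense feature grid, each 3D grid vertex is hashed into a small trainable table. Here's a rough sketch of the spatial hash, using the per-dimension primes reported in the paper; the table size and feature width are illustrative defaults, and everything around the hash is heavily simplified:

```python
import numpy as np

PRIMES = (1, 2_654_435_761, 805_459_861)  # per-dimension primes from the paper
TABLE_SIZE = 2 ** 14                      # entries per resolution level (illustrative)

def spatial_hash(ix: int, iy: int, iz: int) -> int:
    """Map an integer 3D grid coordinate to a slot in the feature table
    by XOR-ing the scaled coordinates, then wrapping to the table size."""
    return (ix * PRIMES[0] ^ iy * PRIMES[1] ^ iz * PRIMES[2]) % TABLE_SIZE

# Each resolution level owns a small table of trainable feature vectors;
# a 3D point is encoded by looking up (and interpolating) its cell corners.
features = np.zeros((TABLE_SIZE, 2))      # 2 features per entry, as in the paper
slot = spatial_hash(13, 7, 42)
corner_features = features[slot]
```

Because the table is tiny compared to a dense grid, lookups are cheap and cache-friendly, and hash collisions are simply left for the network to resolve during training.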
- Short Video Explanation:
- Short read: NVIDIA Turns Photos into 3D Scenes in Milliseconds
- Paper: Instant Neural Graphics Primitives with a Multiresolution Hash Encoding
- Code
Last year I shared DALL·E, an amazing model by OpenAI capable of generating images from text input with incredible results. Now it’s time for its big brother, DALL·E 2, and you won’t believe the progress in a single year! DALL·E 2 is not only better at generating photorealistic images from text: the results are also four times the resolution!
As if that wasn’t already impressive enough, the recent model learned a new skill: image inpainting.
DALL·E could generate images from text inputs.
DALL·E 2 can do it better, but it doesn’t stop there. It can also edit those images and make them look even better! Or simply add a feature you want, like some flamingos in the background.
Sounds interesting? Learn more in the video or read more below!
- Short Video Explanation:
- Short read: OpenAI's new model DALL·E 2 is amazing!
- Paper: Hierarchical Text-Conditional Image Generation with CLIP Latents
This new model by Google Research and Tel Aviv University is incredible. You can see it as a very, very powerful deepfake that can do anything.
Take a hundred pictures of any person and you have their persona encoded, letting you fix, edit or create any realistic picture you want.
This is both amazing and scary if you ask me, especially when you look at the results. Watch the video to see more results and understand how the model works!
- Short Video Explanation:
- Short read: Your Personal Photoshop Expert with AI!
- Paper: MyStyle: A Personalized Generative Prior
- Code (coming soon)
We’ve all heard about GPT-3 and have somewhat of a clear idea of its capabilities. You’ve most certainly seen some applications born strictly due to this model, some of which I covered in a previous video about the model. GPT-3 is a model developed by OpenAI that you can access through a paid API, but you have no access to the model itself.
What makes GPT-3 so strong is both its architecture and its size. It has 175 billion parameters, about twice the number of neurons we have in our brains! This immense network was pretty much trained on the whole internet to understand how we write, exchange, and understand text. This week, Meta took a big step forward for the community: they released a model that is just as powerful, if not more so, and have completely open-sourced it.
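To get a feel for that scale, here's a back-of-the-envelope sketch of what it takes just to store 175 billion weights (the precisions are chosen purely for illustration):

```python
PARAMS = 175_000_000_000  # 175 billion parameters

def weights_size_gb(params: int, bytes_per_param: int) -> float:
    """Raw storage for the weights alone, in gigabytes (10**9 bytes)."""
    return params * bytes_per_param / 1e9

fp32_gb = weights_size_gb(PARAMS, 4)  # 700.0 GB as 32-bit floats
fp16_gb = weights_size_gb(PARAMS, 2)  # 350.0 GB as 16-bit floats
```

Even in half precision, the weights alone dwarf any single GPU's memory, which is part of why an open release like OPT matters: very few groups could have trained such a model themselves.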
- Short Video Explanation:
- Short read: Meta's new model OPT is GPT-3's closest competitor! (and is open source)
- Paper: OPT: Open Pre-trained Transformer Language Models
- Code
BlobGAN allows for unreal manipulation of images, made super easy by controlling simple blobs. Each of these small blobs represents an object, and you can move them around, make them bigger or smaller, or even remove them, and it will have the same effect on the object it represents in the image. This is so cool!
As the authors shared in their results, you can even create novel images by duplicating blobs, producing images unseen in the dataset, like a room with two ceiling fans! Correct me if I’m wrong, but I believe it is one of the first papers, if not the first, to make modifying images as simple as moving blobs around, allowing for edits that were unseen in the training dataset.
And unlike the models of some companies we all know, you can actually play with this one! They shared their code publicly, along with a Colab demo you can try right away. Even more exciting is how BlobGAN works. Learn more in the video!
- Short Video Explanation:
- Short read: This is a BIG step for GANs! BlobGAN Explained
- Paper: BlobGAN: Spatially Disentangled Scene Representations
- Code
- Colab Demo
Gato, from DeepMind, was just published! It is a single transformer that can play Atari games, caption images, chat with people, control a real robotic arm, and more! Indeed, it is trained once and uses the same weights to achieve all those tasks. And as per DeepMind, this is not only a transformer but also an agent. This is what happens when you mix Transformers with progress on multi-task reinforcement learning agents.
Gato is a multi-modal agent, meaning that it can create captions for images or answer questions as a chatbot. You’d say that GPT-3 can already do that, but Gato can do more… The multi-modality comes from the fact that Gato can also play Atari games at a human level, or even do real-world tasks like controlling robotic arms to move objects precisely. It understands words, images, and even physics...
- Short Video Explanation:
- Short read: DeepMind's new model Gato is amazing!
- Paper: A Generalist Agent
If you thought DALL·E 2 had great results, wait until you see what this new model from Google Brain can do.
DALL·E 2 is amazing but often lacks realism, and this is what the team attacked with their new model, called Imagen.
They share a lot of results on their project page, as well as a benchmark they introduced for comparing text-to-image models, on which they clearly outperform DALL·E 2 and previous image generation approaches. Learn more in the video...
- Short Video Explanation:
- Short read: Google Brain's Answer to DALL·E 2: Imagen
- Paper: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Project page with results
If you would like to read more papers and have a broader view, here is another great repository covering 2021: 2021: A Year Full of Amazing AI Papers - A Review. And feel free to subscribe to my weekly newsletter to stay up to date with new publications in AI for 2022!
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
[1] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K. and Lempitsky, V., 2022. Resolution-robust Large Mask Inpainting with Fourier Convolutions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 2149–2159)., https://arxiv.org/pdf/2109.07161.pdf
[2] Tzaban, R., Mokady, R., Gal, R., Bermano, A.H. and Cohen-Or, D., 2022. Stitch it in Time: GAN-Based Facial Editing of Real Videos. https://arxiv.org/abs/2201.08361
[3] Kuang, Z., Olszewski, K., Chai, M., Huang, Z., Achlioptas, P. and Tulyakov, S., 2022. NeROIC: Neural Rendering of Objects from Online Image Collections. https://arxiv.org/pdf/2201.02533.pdf
[4] Borsos, Z., Sharifi, M. and Tagliasacchi, M., 2022. SpeechPainter: Text-conditioned Speech Inpainting. https://arxiv.org/pdf/2202.07273.pdf
[5] Wang, X., Li, Y., Zhang, H. and Shan, Y., 2021. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9168–9178), https://arxiv.org/pdf/2101.04061.pdf
[6] Piergiovanni, A.J., Casser, V., Ryoo, M.S. and Angelova, A., 2021. 4D-Net for Learned Multi-Modal Alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 15435–15445), https://openaccess.thecvf.com/content/ICCV2021/papers/Piergiovanni_4D-Net_for_Learned_Multi-Modal_Alignment_ICCV_2021_paper.pdf
[7] Müller, T., Evans, A., Schied, C. and Keller, A., 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf
[8] A. Ramesh et al., 2022, "Hierarchical Text-Conditional Image Generation with CLIP Latents", https://cdn.openai.com/papers/dall-e-2.pdf
[9] Nitzan, Y., Aberman, K., He, Q., Liba, O., Yarom, M., Gandelsman, Y., Mosseri, I., Pritch, Y. and Cohen-Or, D., 2022. MyStyle: A Personalized Generative Prior. arXiv preprint arXiv:2203.17272.
[10] Zhang, Susan et al. “OPT: Open Pre-trained Transformer Language Models.” https://arxiv.org/abs/2205.01068
[11] Epstein, D., Park, T., Zhang, R., Shechtman, E. and Efros, A.A., 2022. BlobGAN: Spatially Disentangled Scene Representations. arXiv preprint arXiv:2205.02837.
[12] Reed, S. et al., 2022, DeepMind: Gato, A Generalist Agent, https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf
[13] Saharia et al., 2022, Google Brain, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, https://gweb-research-imagen.appspot.com/paper.pdf