Google made an announcement recently (yesterday, as of writing this) unveiling a new project called "Magi." This initiative aims to integrate advanced large language models with their search capabilities. As there is no public demo available for review, I've decided to create a clone inspired by the image provided below:
There are already competitors in this space, like Perplexity AI (perplexity.ai), a chat tool that uses foundational language models, such as GPT-4 from OpenAI, along with current information from the internet. It not only provides answers, but also references to the sources that contributed to those answers. This simple yet powerful approach addresses the limitation of potentially outdated training data: by returning the sources used to produce an answer, you can verify its accuracy, which combats the issue of language models generating incorrect answers.
This may sound like a major project and a serious undertaking, but modern tools have made it surprisingly easy.
Technical analysis: How is it possible that Perplexity.ai is so fast?
Looking at the sources used for the answer, we can also see that they are the same sites, in exactly the same order, as in the corresponding Google search request. It could be Bing, but given that they are raising their API prices by 300-500% in the next few weeks, that is doubtful, or it would at least require some awkward conversations with their investors, as I don't think they could ever reach profitability then. This means they have to do one of two things:
- Derive their QA context from the search response metadata only, or
- Use a hard cutoff while streaming data to maintain snappiness
I will implement both.
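For the hard-cutoff approach, here is a minimal sketch of what I have in mind, assuming a Node 18+ runtime with a global `fetch`; the helper name `fetchWithCutoff` is my own:

```ts
// Minimal sketch: stream a response body and give up once a time budget is spent.
// Assumes Node 18+ (global fetch with web streams); helper name is hypothetical.
async function fetchWithCutoff(url: string, budgetMs: number): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);

  const decoder = new TextDecoder();
  let text = "";

  try {
    const response = await fetch(url, { signal: controller.signal });
    const reader = response.body!.getReader();

    // Read chunks until the body ends or the abort fires.
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      text += decoder.decode(value, { stream: true });
    }
  } catch {
    // AbortError: the budget ran out; keep whatever we managed to read.
  } finally {
    clearTimeout(timer);
  }

  return text;
}
```

The same pattern is reused below with a 750 ms budget per scraped site.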
We want to build a search engine that is ethical and respectful of website owners (as expressed by their `robots.txt` instructions). If a website declines to be crawled or indexed, we will respect its wishes. This can lead to the edge case where all the target sites in a query have declined crawling and we cannot construct a reference set of embeddings. If that happens, we simply do not show an AI chat message and instead only display the Google results as-is.
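As a concrete illustration, a deliberately naive `robots.txt` check could look like the sketch below. It only understands `User-agent: *` groups and `Disallow:` prefixes, applies the same time-budget idea, and is my own assumption rather than a full parser:

```ts
// Naive robots.txt check: fetch the file with a small time budget and test
// whether a path is disallowed for all user agents. A sketch, not a full parser
// (no Allow, wildcards, or Crawl-delay handling).
async function isScrapingAllowed(siteUrl: string, budgetMs = 200): Promise<boolean> {
  const { origin, pathname } = new URL(siteUrl);
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);

  try {
    const res = await fetch(`${origin}/robots.txt`, { signal: controller.signal });
    if (!res.ok) return true; // no robots.txt: assume allowed

    const rules = await res.text();
    let appliesToUs = false;

    for (const rawLine of rules.split("\n")) {
      const line = rawLine.split("#")[0].trim();
      if (/^user-agent:/i.test(line)) {
        appliesToUs = line.split(":")[1].trim() === "*";
      } else if (appliesToUs && /^disallow:/i.test(line)) {
        const prefix = line.slice(line.indexOf(":") + 1).trim();
        if (prefix !== "" && pathname.startsWith(prefix)) return false;
      }
    }
    return true;
  } catch {
    return false; // timeout or network error: skip the site
  } finally {
    clearTimeout(timer);
  }
}
```

A production crawler would also honor `Allow`, wildcards, and `Crawl-delay`, which is exactly the rule the pipeline below admits it cannot always guarantee.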
- User performs a search query
- Instantly return Google/Bing search results and display “Thinking…” in the Chat section
- “Asking websites” — scrapes up to the top 10 results
- First, we query each website's `robots.txt` to check whether we are allowed to scrape it. If scraping is disallowed, or the query takes more than 200 ms, that site is ignored. Since this is a small educational application, we cannot guarantee that we respect the `Crawl-delay` rule, for example when two related search queries return the same domain within that period.
- If we are allowed to scrape, we scrape as much data as possible from the URL over a maximum of 750 ms. If the entire target website cannot be scraped in this timeframe, we sever the connection and proceed with whatever we have.
- We do not expect that all 10 target URLs will respond in this time frame.
- “Producing chunks” — breaks websites into usable chunks with LangChain
- “Producing embeddings” — creates embeddings from the chunks using OpenAI
- “Computing similarity” — computes the cosine similarity using LangChain
- Stuff the related sections of text as context into an LLM query and re-ask the query from step 1 given this data, using a manual tweak of the default prompt in LangChain's QA chain (a sketch of this step follows after the list).
- Display a “similar questions” or “related searches” section
- Since neither Google nor Bing provides this data, we ask an LLM to dream up a few related questions. We do this by setting a really high temperature (like 1.0).
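To make the chunking, embedding, and stuffing steps concrete, here is a condensed sketch. It assumes the scraped page text is already available (e.g. via the cutoff fetch above), reads the OpenAI key from an environment variable for brevity, uses LangChain's `RecursiveCharacterTextSplitter` and `OpenAIEmbeddings` (import paths vary between LangChain versions), and stuffs the top chunks directly into a chat completion instead of LangChain's QA chain, so the prompt is my own approximation rather than the tweaked LangChain prompt mentioned above:

```ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";

const openaiApiKey = process.env.OPENAI_API_KEY!;

// 1. "Producing chunks": break each scraped page into overlapping text chunks.
async function chunkPages(pages: string[]): Promise<string[]> {
  const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
  const chunks: string[] = [];
  for (const page of pages) {
    chunks.push(...(await splitter.splitText(page)));
  }
  return chunks;
}

// 2. "Producing embeddings": embed the chunks and the user's query.
async function embedAll(chunks: string[], query: string) {
  const embeddings = new OpenAIEmbeddings({ openAIApiKey: openaiApiKey });
  const chunkVectors = await embeddings.embedDocuments(chunks);
  const queryVector = await embeddings.embedQuery(query);
  return { chunkVectors, queryVector };
}

// 3. Stuff the most similar chunks into a chat completion and re-ask the query.
async function answerWithContext(query: string, topChunks: string[]): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${openaiApiKey}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      messages: [
        {
          role: "system",
          content:
            "Answer the question using only the provided context. " +
            "If the context is insufficient, say so.",
        },
        { role: "user", content: `Context:\n${topChunks.join("\n---\n")}\n\nQuestion: ${query}` },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```

How the top chunks are selected is the cosine-similarity step; a sketch of that computation appears in the cost section below.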
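The “related searches” box is a single high-temperature completion; again a sketch, with the prompt wording being my own guess:

```ts
// Ask the model to invent related questions; temperature 1.0 encourages variety.
async function relatedQuestions(query: string): Promise<string[]> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-3.5-turbo",
      temperature: 1.0,
      messages: [
        {
          role: "user",
          content: `List three short search queries related to: "${query}". One per line.`,
        },
      ],
    }),
  });
  const data = await res.json();
  return data.choices[0].message.content
    .split("\n")
    .filter((line: string) => line.trim() !== "");
}
```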
Current direct-to-consumer prices for the relevant API calls:
Model / API | Price |
---|---|
gpt-3.5-turbo | $0.002 / 1K tokens |
Ada (embeddings) | $0.0004 / 1K tokens |
Google Search API | $0.005 / query |
Let’s assume that we embed 25,000 tokens from the search results (on average) and perform a 4,000-token search. We then end up with roughly the following cost per search:
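Assuming the embeddings run on Ada and the 4,000-token completion runs on gpt-3.5-turbo, the arithmetic works out to:

- Embeddings: 25 × $0.0004 = $0.010
- Completion: 4 × $0.002 = $0.008
- Google Search API: $0.005 / query
- Total: roughly $0.023 per search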
It is expensive to compute chunks and embeddings all the time. How can we overcome this?
- Make chunks + embeddings live in an ephemeral database (like Redis) with at most X GB of storage
- Only scrape a website if the timestamp from Google/Bing has changed
- Return embeddings only for the sites in the query (E1, E2, E3, …), collect them in an array, perform a super-cheap cosine similarity computation, and return the top results with their scores (sketched below). This way we don’t need to maintain a vector database at all, since we only focus on the top-10 webpages and their contents and nothing else.
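The similarity computation itself is just a few lines over the embedding arrays; no vector database is needed. A sketch (the function names are my own):

```ts
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks against the query embedding and keep the best k.
function topChunks(
  queryVector: number[],
  chunkVectors: number[][],
  chunks: string[],
  k = 4
): { chunk: string; score: number }[] {
  return chunkVectors
    .map((vec, i) => ({ chunk: chunks[i], score: cosineSimilarity(queryVector, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```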
The app supports both dark and light mode.
- Open `pages/index.vue` and enter your own keys where it says:

```js
const google_api_key = 'YOUR-GOOGLE-API-KEY';
const google_api_cx = 'YOUR-GOOGLE-API-CX';
const openaiApiKey = 'YOUR-OPENAI-API-KEY';
```
- Install the dependencies: `npm install`
- Run the app: `npm run dev --`
That's it.