Thought Process & Architecture Overview of my Perplexity.ai Clone

Introduction

In this blog post, I will share my thought process and the architectural overview of my project, detailing the challenges I faced, the technologies I employed, and the lessons I learned along the way. Whether you're a seasoned developer or a curious newcomer to the world of AI, I hope to provide valuable insights into the development of a generative AI application and inspire you to embark on your own projects in this fascinating field.
You can find the source code at: GitHub - Perplexity Clone

System Architecture

Generate search engine queries

You might be wondering why I generate search queries instead of sending the user's question straight to the search engine. I added this step because, for some questions, the best search query differs from the question itself. For example, take the question: "Give me an analysis of Nvidia's stock price over the last quarter."
If we send this question to the search engine as-is, it may return websites that analyzed some earlier "last quarter" instead of the most recent quarter relative to today.
So we need to generate a proper search engine query to find the correct resources.
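
One way to handle the "last quarter" problem is to put today's date into the prompt that asks the LLM for search queries. The snippet below is only a sketch of that idea: the function name `build_query_prompt` and the prompt wording are assumptions made for illustration, not the project's actual code, and the call to the LLM itself is left out.

```python
from datetime import date


def build_query_prompt(question: str, today: date) -> str:
    """Build a prompt asking an LLM to turn a question into search queries.

    The wording is hypothetical. The key point is that today's date is
    included, so relative phrases like "last quarter" can be resolved.
    """
    return (
        f"Today's date is {today.isoformat()}.\n"
        "Rewrite the user's question as up to 3 concise search engine queries.\n"
        "Resolve relative time phrases such as 'last quarter' into explicit "
        "dates or periods.\n"
        "Return one query per line.\n\n"
        f"Question: {question}"
    )


prompt = build_query_prompt(
    "Give me analysis on stock price of Nvidia over last quarter.",
    date(2024, 7, 15),  # example date; in practice this would be date.today()
)
print(prompt)
```

With the date in the prompt, the model has what it needs to produce a query that names the actual quarter rather than repeating "last quarter".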

Extracting relevant websites

I used the DuckDuckGo search engine to run the generated search query and find relevant websites, then picked 5 of the results at random for generating the knowledge base.
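
The random selection step could look roughly like the sketch below. The search call itself is omitted because the post does not say which DuckDuckGo client is used; `pick_sources` and its parameters are hypothetical names, and the input is assumed to be a plain list of result URLs.

```python
import random


def pick_sources(urls, k=5, rng=None):
    """Return up to k distinct URLs chosen at random from the search results."""
    rng = rng or random.Random()
    # Drop duplicate URLs while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    # min() guards against searches that return fewer than k results.
    return rng.sample(unique, min(k, len(unique)))


# Placeholder URLs standing in for real search results.
results = [f"https://example.com/page-{i}" for i in range(8)]
print(pick_sources(results))
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible, which helps when debugging.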

Scraping and normalizing the websites

I scraped each website by sending a GET request to the URLs extracted in the previous step, then sanitized and normalized the returned HTML.
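
As an illustration of the GET step, here is a minimal sketch using only Python's standard library. The project may well use a different HTTP client; `fetch_html`, the timeout, and the User-Agent value are assumptions chosen for the example.

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send a GET request to `url` and return the response body as text."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    request = Request(url, headers={"User-Agent": "perplexity-clone-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as `fetch_html("https://example.com")` returns the raw page, which is then handed to the normalization step. Error handling (timeouts, HTTP errors) is left out to keep the sketch short.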

Normalizing HTML

We need to normalize the HTML because the raw content is huge and most of it is not useful for generating our knowledge base. If it is left in, it may cause the LLM to hallucinate while generating the knowledge base.

I normalized the HTML by:

  1. Selecting the main block if available, otherwise the body block

  2. Removing all script and style elements

  3. Removing all svg, header, footer, img, form, button, textarea, and input elements

  4. Removing class attributes from elements, as most styling information is useless for generating the knowledge base
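
The four steps above can be sketched with Python's standard-library `html.parser`. This is an illustration under assumptions, not the project's code: `normalize_html`, `_Normalizer`, `DROP_TAGS`, and `DROP_VOID` are hypothetical names, the main-block check is a simple substring test, and a real implementation would more likely use a dedicated HTML parsing library.

```python
from html import escape
from html.parser import HTMLParser

# Elements removed together with everything inside them (steps 2 and 3).
DROP_TAGS = {"script", "style", "svg", "header", "footer",
             "form", "button", "textarea"}
# Void elements have no closing tag, so they are simply skipped (step 3).
DROP_VOID = {"img", "input"}


class _Normalizer(HTMLParser):
    def __init__(self, root_tag):
        super().__init__(convert_charrefs=True)
        self.root_tag = root_tag
        self.root_depth = 0  # > 0 while inside the chosen root element
        self.drop_depth = 0  # > 0 while inside a removed element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.root_tag:
            self.root_depth += 1
            if self.root_depth == 1:
                return  # keep the root's content, not the root tag itself
        if self.root_depth == 0 or tag in DROP_VOID:
            return
        if tag in DROP_TAGS:
            self.drop_depth += 1
            return
        if self.drop_depth:
            return
        # Step 4: keep every attribute except class.
        kept = "".join(f' {k}="{escape(v or "")}"' for k, v in attrs if k != "class")
        self.parts.append(f"<{tag}{kept}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags (<br/>) have no content; emit only the opening tag.
        if tag not in DROP_TAGS and tag != self.root_tag:
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in DROP_VOID:
            return
        if tag == self.root_tag and self.root_depth:
            self.root_depth -= 1
            if self.root_depth == 0:
                return
        if self.root_depth == 0:
            return
        if tag in DROP_TAGS:
            self.drop_depth = max(0, self.drop_depth - 1)
            return
        if not self.drop_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.root_depth and not self.drop_depth:
            self.parts.append(escape(data, quote=False))


def normalize_html(html: str) -> str:
    # Step 1: prefer <main> when the page has one, otherwise fall back to <body>.
    root = "main" if "<main" in html.lower() else "body"
    parser = _Normalizer(root)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


page = ('<html><body><header>Nav</header><main><p class="lead" id="a">Hello '
        '<b>world</b></p><script>track()</script><img src="x.png"></main>'
        '<footer>Contact</footer></body></html>')
print(normalize_html(page))
```

The sketch assumes reasonably well-formed HTML. Pages with unclosed or badly nested tags would need a more forgiving parser.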

Generating Knowledge Base

You might be wondering why I didn't generate the answer directly from the content above. In some cases you need to understand the data and apply some logic to it before responding, and answering directly from the raw content may fail there.
A knowledge base is also useful when answering follow-up questions, as we don't have to process the extracted HTML again.

I generated the knowledge base in Markdown format, as this can lead to better responses: Markdown helps the LLM understand the structure of the data and its importance.

Generating Answer

This step uses the knowledge base generated above to answer the question asked. It also cites the websites it used to answer the question.
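
One common way to get citations is to number the source websites in the prompt and ask the model to refer to them as [1], [2], and so on. The sketch below shows that approach. It is an assumption about how this could be done, not a description of the project's code, and `build_answer_prompt`, `cited_urls`, and the prompt wording are all hypothetical.

```python
import re


def build_answer_prompt(question, knowledge_base, sources):
    """Ask the LLM to answer from the knowledge base and cite sources as [n]."""
    numbered = "\n".join(f"[{i}] {url}" for i, url in enumerate(sources, start=1))
    return (
        "Answer the question using only the knowledge base below.\n"
        "Cite the sources you rely on with their numbers, e.g. [1].\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Knowledge base:\n{knowledge_base}\n\n"
        f"Question: {question}"
    )


def cited_urls(answer, sources):
    """Map the [n] markers found in an answer back to the source URLs."""
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    # Ignore numbers that do not correspond to a known source.
    return [sources[n - 1] for n in numbers if 1 <= n <= len(sources)]
```

After the model responds, `cited_urls(answer, sources)` gives the list of websites to show alongside the answer.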

Current Limitations of My Project

  1. Execution takes a long time, as the current implementation is completely sequential.

  2. It fails on analytical questions, where it needs to get data from the internet and process that data to provide analytics. E.g., stock price analysis of any stock.

  3. It does not support follow-up questions.

Future Goals

  1. Make execution faster by using an asynchronous implementation and streaming responses

  2. Make normalization much more efficient

  3. Implement chat history

  4. Implement the ability to ask follow-up questions and answer them using the generated knowledge base

  5. Improve the data used for generating the answer (in terms of size) by using vectors to store and retrieve the knowledge base
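
To give a feel for the first goal, here is a minimal sketch of fetching several pages concurrently with `asyncio`. It assumes Python 3.9+ (for `asyncio.to_thread`) and an existing blocking fetch function. `fetch_all` and `slow_fetch` are hypothetical names, and `slow_fetch` only simulates a network call.

```python
import asyncio
import time


async def fetch_all(urls, fetch):
    """Run a blocking `fetch(url)` for every URL concurrently in worker threads."""
    tasks = [asyncio.to_thread(fetch, url) for url in urls]
    # return_exceptions=True keeps one failing site from cancelling the rest.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Keep only the pages that were fetched successfully.
    return {url: r for url, r in zip(urls, results) if not isinstance(r, Exception)}


def slow_fetch(url):
    """Stand-in for a real HTTP request; sleeps to simulate network latency."""
    time.sleep(0.1)
    return f"<html>content of {url}</html>"


urls = [f"https://example.com/page-{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls, slow_fetch))
print(len(pages), "pages fetched")
```

Because the five simulated requests wait in parallel, the total time is close to that of a single request rather than the sum of all five, which is the gain the sequential version misses.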

The End

I would love to hear your thoughts, suggestions, and criticism on the project.
You can find me at:
1. LinkedIn: @shah-dhwanil
2. X (Twitter): @shah-dhwanil
3. Email: itzdhwanil@gmail.com