Over the past few months, we seem to be in the midst of an artificial intelligence revolution. Besides the widely known OpenAI ChatGPT, a steady stream of novel, interesting, and practical AI applications has emerged. While using these applications, I have genuinely felt my productivity increase.
However, there seems to be little information currently available about GPT application development knowledge and pathways, so I decided to organize some of my experiences and thoughts into a series, hoping to help everyone.
This article mainly introduces development considerations related to GPT applications. In April this year, I learned about GPT-related technologies while developing the open-source project ChatFiles. Due to limited time and energy, I hadn’t had the chance to organize these insights into an article until recently, when I had new ideas and open-sourced VectorHub, another project based on GPT Prompt and Embeddings technologies. This deepened my understanding of GPT and Embeddings, which led to this sharing.
Starting from Prompts
----------
AI application development has attracted many developers recently. Besides the well-known ChatGPT, a large number of AI applications with real practical value have emerged, such as AI-based translation applications like openai-translator and immersivetranslate; writing applications like Notion AI; programming assistance applications like GitHub Copilot and GitHub Copilot Chat.
Some of these applications optimize existing experiences, such as the GPT-based translation openai-translator, whose translation quality and reading experience far surpass previous machine translation systems. Others provide previously unattainable functions, such as code completion and generation by GitHub Copilot, or functionalities like answering coding questions, explaining code, generating unit tests, and suggesting fixes for bugs offered by GitHub Copilot Chat. The difficulty of implementing these features was unimaginable before.
Although these applications differ in functionality, they all primarily rely on GPT Prompt implementations at the core. A Prompt is text or instructions given to the model, used to guide the model to generate natural language output (Completion). It can provide contextual information to the model and is crucial for the model’s output quality.
We know that GPT (Generative Pre-trained Transformer) is a model built mainly on two stages: pre-training and fine-tuning.
During the pre-training stage, a large corpus such as Wikipedia, news articles, and novels is used for foundational training. After training, given an input sentence, the model predicts the probability of the next word based on the knowledge learned during pre-training. By iteratively predicting the next word, it can generate a sentence or a longer text. This mechanism is why it is called generative AI.
This input sentence is what we call the Prompt, and it forms the basis of generative AI’s probabilistic generation. This explains why the same Prompt input may yield different outputs each time since the results are generated probabilistically.
Thus, we can understand why Prompt is crucial to GPT application development: beyond fine-tuning, prompting is the only way we directly interact with the GPT model (of course, we can also adjust model configurations such as temperature and top_p to control GPT’s diversity or creativity, but these do not significantly affect output quality or downstream task handling). Therefore, Prompt is the core part of GPT application development and requires the most developer consideration and optimization.
After pre-training comes the fine-tuning stage, in which the pre-trained GPT model is further trained on task-specific datasets. This allows the model to better understand Prompts and generate task-related text. Through fine-tuning, GPT can adapt to various tasks such as text classification, sentiment analysis, and Q&A systems. However, due to the high cost and unstable results, fine-tuning is not a very attractive option for most GPT developers at the moment. Hence, most GPT applications are built on Prompts.
Prompt Learning Path
----------
For basic Prompt knowledge, you can start with Andrew Ng’s ChatGPT Prompt Engineering. This video course, which runs under two hours, quickly introduces how to use Prompts and what makes them appealing.
After gaining a preliminary understanding, I recommend the Prompt Engineering Guide, which contains extensive fundamental knowledge and future directions for Prompts. For GPT application developers, besides learning basics, this guide offers insightful thoughts from both engineering and academia, which are invaluable for AI application development.
Finally, I highly recommend reviewing OpenAI’s official GPT Best Practices document. This official guide contains many Prompt examples and usage tips. For GPT developers it is very valuable because it summarizes best practices across different business domains, gathered from partners, hackathons, and other collaborations. It offers considerable inspiration for developers! Below is an excerpt from this document:
> Write Clear Prompts
>
> 1⃣️: Include detailed information in your question, such as in the following examples:
>
> Poor question: Summarize the meeting notes.
>
> Excellent question: Summarize the meeting notes into a paragraph. Then list participants’ key points in a Markdown list format. Finally, if there are any, list the speakers’ suggested next steps or action items.
>
> Poor question: Write code to calculate the Fibonacci sequence…
>
> — Guangzheng Li (@iguangzhengli) June 12, 2023
Prompt Best Practices
----------
Regarding best practices for writing Prompts, the most recommended resource is of course OpenAI’s official GPT Best Practices, but for developing GPT applications I would also like to share some practical experience of my own alongside it.
### Clarity and Detail
In reality, most developers use GPT mainly to solve programming problems or ask questions, which often leads them to apply their past experience using search engines like Google to GPT.
For example, if you want to know how to write a Fibonacci sequence in Python, you might previously have searched `python fibonacci` on Google. That is sufficient, because Google uses inverted indexes and the PageRank algorithm and can return high-quality web pages from keywords alone.
Such brief input is the simplest and most efficient; even if you add more words, like `how to write python fibonacci`, the difference in Google’s output quality is minimal.
For GPT, however, an input like `python fibonacci` is unfriendly: the model cannot clearly understand your intent and may produce irrelevant results (depending on model quality).
But if you input `Write a Python function to efficiently compute the Fibonacci sequence. Comment each line of code to explain what each part does and why it is written that way.`, this input is clear and detailed, so GPT can understand your intent precisely. As a result, it produces more accurate output, raising the ceiling of output quality while also guaranteeing a quality floor.
This is completely different from developers’ past experience with Google and is easily overlooked by GPT developers and users. Early this year, while developing ChatFiles, I collected user Prompts anonymously and found that over 95% of them were very simple, sometimes to the point of being overly terse.
Therefore, when developing GPT applications, developers must pay attention to Prompt clarity and detail. Try multiple times and select a Prompt that ensures stable output quality and consistent formatting. This is key to ensuring GPT application quality.
### How to Handle More Complex Tasks
I believe all developers can design good Prompts and obtain decent output quality for simple tasks with some extra effort. But for complex tasks, improving GPT output quality requires two crucial techniques: encouraging GPT to reason rather than directly answer, and breaking down tasks for guided processing.
#### Reasoning Instead of Answering
Reasoning instead of answering means that the Prompt instructs the GPT model not to immediately judge correctness or give an answer, but to engage in deeper thinking. You can ask it to first list various perspectives on the problem, break down the task, explain the reasoning behind each step, and then draw the final conclusion. Adding step-by-step reasoning requirements in the Prompt allows the language model to spend more time on logical thinking, leading to more reliable and accurate results.
For example, an official OpenAI example shows that if you want GPT to judge whether a student’s answer is correct, a Prompt like `Judge if the student's solution is correct` may produce wrong answers for complex computational problems, because GPT does not reason through the problem before responding but immediately outputs a judgment. Much as a person cannot instantly work out complicated math, this leads to incorrect answers.
If the Prompt is changed to `First solve the problem yourself, then compare your solution with the student's and evaluate whether the student's solution is correct. Do not decide if the student's solution is correct before completing your own work.`, this explicit instruction makes GPT spend more time deriving the answer, resulting in more accurate outcomes.
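As a minimal sketch of how such an instruction might be sent with the pre-1.0 `openai` Python SDK (the math problem, student answer, and model name here are placeholders, not OpenAI’s original example):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# System prompt mirroring the "work it out yourself first" instruction above.
system_prompt = (
    "First solve the problem yourself, then compare your solution with the "
    "student's and evaluate whether the student's solution is correct. "
    "Do not decide if the student's solution is correct before completing "
    "your own work."
)
# Hypothetical problem whose student answer is deliberately wrong (120 / 1.5 = 80).
user_prompt = (
    "Problem: A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "Student's solution: 120 / 1.5 = 90 km/h"
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0,  # low temperature for a more deterministic judgement
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response["choices"][0]["message"]["content"])
```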
#### Task Breakdown
Breaking down tasks for guided processing means splitting a complex task into multiple subtasks and guiding GPT to reason about each separately, then integrating the subtask results to get the final result. This method helps GPT focus on each subtask, improving output quality.
For example, when summarizing a book, a direct overall summary by GPT might be unsatisfactory. Instead, you can use a series of subtasks to summarize each section and then aggregate the generated summaries.
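A rough sketch of this split-then-combine pattern, again assuming the pre-1.0 `openai` SDK and a hypothetical `chapters` list (not code from any particular framework):

```python
import openai

def summarize(text: str, instruction: str) -> str:
    """Ask GPT to summarize a single piece of text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return response["choices"][0]["message"]["content"]

# Hypothetical input: one string per chapter of the book.
chapters = ["...chapter 1 text...", "...chapter 2 text..."]

# Subtasks: summarize each chapter separately so GPT can focus on one at a time.
chapter_summaries = [
    summarize(ch, "Summarize this chapter in 3-5 sentences.") for ch in chapters
]

# Integration: combine the partial summaries into one overall summary.
book_summary = summarize(
    "\n\n".join(chapter_summaries),
    "Combine these chapter summaries into a single summary of the whole book.",
)
print(book_summary)
```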
Of course, task breakdown also introduces challenges: if a single subtask’s output quality suffers, the overall output quality is affected. Additionally, token costs can be high, so guided task breakdown adds expenses. Nevertheless, designing and splitting complex tasks is a core issue all GPT applications must consider and a key to maintaining their AI moats. It is also a central design topic in large model AI frameworks like LangChain. I may write a separate article on this.
### Usage Tips
Besides the above important practices, some smaller tips can also help application development. Here are some I’ve summarized:
- Provide few-shot examples: give the model one or two input-output samples so it understands our requirements and expected output formats.
- Request structured output in the Prompt: output in JSON format facilitates subsequent code processing.
- Use delimiters: use delimiters such as `"""` to separate different instructions and contexts, preventing the system Prompt and the user input Prompt from getting mixed up or conflicting.
GPT Embeddings Application Development
----------
Above, we mainly introduced how to develop AI applications based on Prompts. But sometimes we face a new problem: a large model’s training data is often months or even years old. When we need a GPT application to answer based on the latest information, such as recent news or private documents, the model itself has not been trained on that material and cannot handle such questions.
At this point, we can have the model use reference texts to answer questions. For example, our prompt can be written as:
```
You will receive a document separated by triple quotes and a question.
Your task is to use the provided document to answer the question, citing the paragraphs used to answer.
If the document does not contain the information required to answer the question, simply write: "Information insufficient."
If providing an answer to the question, it must include citation annotations.
Use the following format to cite related paragraphs ({ "citation": … }).
```

This usage method allows GPT to answer based on the reference text we provide it. For example, if we want to ask who the latest World Cup champion is, we can attach the latest World Cup news as the reference text. GPT will first understand the entire news article before answering the question. This approach can address the issues of large models related to timeliness and specific downstream tasks.
However, this solution introduces another problem: the length limit on the reference text. GPT prompts have size limits; for example, the gpt-3.5-turbo model has a 4K-token limit (roughly 3,000 English words). This means a user can supply only about that much text in a single request for GPT to understand and reason over.
Once the required reference text exceeds this limit, it is impossible to get the answer from GPT in one go. Developers need to split the reference text into multiple parts, convert each part into a vector via GPT Embeddings, and store them in a vector database. For more on vector databases, see my other blog post: [Vector Database](https://guangzhengli.com/blog/zh/vector-database/). When asking a question against the reference text, you first convert the question into a vector, retrieve the most similar entries from the vector database, and finally map the retrieved vectors back to their text. This yields reference text that both fits within GPT's token limit and is relevant to the question.
We take both the question and this related reference text, submit them to GPT, and get the answer we want. This process is the core of GPT Embeddings application development. Its core idea is to retrieve the text segments most relevant to the question via vector search, thereby bypassing the GPT token limit.

The overall development process is as follows:
1. Load documents to obtain the target text. For example, mainstream LLM frameworks like LangChain offer two kinds of loaders: File Loaders and Web Loaders.
    1. File Loaders read from the file system, e.g. PDF files, Word documents, etc.
    2. Web Loaders read from the network, e.g. a web page, AWS S3, etc.
2. Split the target text into multiple passages, mainly using one of the following strategies.
    1. Split by character count, e.g. 1,000 characters per passage. The advantage is simplicity; the disadvantage is that a paragraph may be cut into several pieces, reducing coherence and potentially lowering answer quality due to missing context.
    2. Split by punctuation, e.g. using newlines as delimiters. This keeps passages more coherent, but passage sizes vary and may exceed GPT token limits.
    3. Split by GPT token limits, e.g. grouping every 2,000 tokens; at query time, retrieve the two most relevant passages, which together total only 4,000 tokens and thus avoid the 4,096-token limit.
3. Store all of the split text blocks in the [vector database](https://guangzhengli.com/blog/zh/vector-database/).
4. Convert the user's question into a vector, then retrieve the most relevant passages from the vector database. (Note that this retrieval is not traditional fuzzy matching or an inverted index but semantic search, which is why the retrieved text can answer user questions. For details, see the other blog post [Vector Database](https://guangzhengli.com/blog/zh/vector-database/).)
5. Combine the retrieved passages, the user's question, and the system prompt into a single prompt for the Embeddings scenario, e.g. one that explicitly instructs the model to answer from the reference text rather than GPT’s own knowledge.
6. GPT answers this final prompt to produce the final answer.
If you are interested in the specific implementation code, you can check the [Retrieval chapter](https://js.langchain.com/docs/modules/data_connection/) in LangChain.
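For a sense of what these steps look like without a framework, here is a minimal Python sketch using the pre-1.0 `openai` SDK, a naive character-count splitter, and an in-memory cosine-similarity search standing in for a real vector database (all file names and helper names are illustrative, not LangChain's API):

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

def split_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Naive character-count splitter (strategy 2.1 above)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def embed(texts: list[str]) -> np.ndarray:
    """Convert text chunks into embedding vectors."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

document = open("reference.txt").read()    # step 1: load the document (hypothetical file)
chunks = split_text(document)              # step 2: split it into passages
chunk_vectors = embed(chunks)              # step 3: "store" the vectors (in memory here)

question = "Who won the latest World Cup?"
question_vector = embed([question])[0]     # step 4: embed the question

# Cosine similarity against every chunk; a real vector database does this at scale.
scores = chunk_vectors @ question_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[-2:]]

# Step 5: assemble the final prompt from the retrieved passages and the question.
prompt = (
    "Answer the question using only the reference text below.\n\n"
    'Reference text:\n"""\n' + "\n\n".join(top_chunks) + '\n"""\n\n'
    "Question: " + question
)
answer = openai.ChatCompletion.create(     # step 6: ask GPT for the final answer
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(answer["choices"][0]["message"]["content"])
```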
The greatest advantage of GPT Embeddings applications is that they can solve problems GPT alone cannot answer. No training or fine-tuning is needed; converting text into vectors and retrieving it is enough, which makes certain business scenarios possible at very low cost.
However, this solution also introduces some new issues, such as how embedding text splitting and retrieval quality largely affect final results, and how to balance query scope, quality, and query time. How should one handle situations where retrieved reference text cannot answer user questions? These are critical questions developers must carefully consider when facing business needs and scenarios.
Looking further, historically all documents were written for humans, which is not the best organizational pattern for vector retrieval. Will there be specially written texts aimed at AI and retrieval in the future, allowing AI to better understand and fit database retrieval? These questions require long-term reflection and demonstration and will not be expanded upon here.
GPT Agents Application Development
---------------
Besides the prompt and embeddings approaches above, GPT application development has another very common requirement: integrating with existing systems, or more concretely, with existing APIs.
Since the software industry has developed for many years, many companies have their own systems and APIs. These APIs can greatly expand GPT application capabilities. If GPT applications want to be implemented in real life scenarios, integration with existing systems is inevitable.
For instance, if you want to ask GPT a simple question such as "What is the weather in Beijing today?" From the previous sections, we know GPT itself cannot answer this type of question. However, if GPT can call a weather API by itself, the development would become very convenient.
To realize GPT calling a weather query API, we face two problems: one is letting the GPT application understand the API's functionality and call it when appropriate; the other is ensuring input and output are structured to guarantee system stability.
### Understanding and Calling Existing APIs
The best way to let GPT understand an API’s function is obviously for developers to manually add names and specific description information for the API, including the input and output structures and what each field represents. This greatly affects GPT’s judgment and its final decision to call the API.
As shown in the OpenAI official [function calling example](https://openai.com/blog/function-calling-and-other-api-updates), the description of calling the weather API is as follows:
```python
function_descriptions = [
    {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {
                    "type": "string",
                    "description": "The temperature unit to use. Infer this from the users location.",
                    "enum": ["celsius", "fahrenheit"],
                },
            },
            "required": ["location"],
        },
    }
]
```
The above method clearly describes that to call this function, at least the current location must be provided, and optionally the temperature unit. GPT will decide whether to call the function based on whether the question is relevant to the function description. Specific code details will be described in a future post and are not elaborated here.
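In the meantime, a rough sketch of the call flow, reusing the `function_descriptions` list above (the pre-1.0 `openai` SDK is assumed, and `get_current_weather` here is a stub rather than a real weather API):

```python
import json
import openai

def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Stub standing in for a real weather API call."""
    return json.dumps({"location": location, "temperature": 22, "unit": unit})

messages = [{"role": "user", "content": "What is the weather in Beijing today?"}]

# First call: let the model decide whether the question matches a function.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=function_descriptions,
    function_call="auto",
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    # The model returns the function name plus JSON-encoded arguments.
    args = json.loads(message["function_call"]["arguments"])
    result = get_current_weather(**args)

    # Second call: hand the function result back so GPT can phrase the answer.
    messages.append(message)
    messages.append({"role": "function", "name": "get_current_weather", "content": result})
    final = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613", messages=messages)
    print(final["choices"][0]["message"]["content"])
```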
We can provide many API specification documents to let LLMs understand and learn these APIs’ functionalities, usage, and how to combine them. Eventually, given an overall request, the system can decompose it into multiple subtasks, each interacting with a specific API, achieving automation and AI goals.
The existing APIs described here are not limited to HTTP-based requests; they can include converting to SQL to query databases, calling SDKs to implement complex functions, and may even extend in the future to operating physical switches, robotic arms, and so on. Personally, judging from GPT's current capabilities and development trends, I believe human-machine interaction will undergo tremendous changes.
### Structured Output
Besides letting GPT understand and call existing APIs, it's also necessary for GPT's output results to be understandable by existing systems, which requires GPT output to be structured, such as data in JSON format.
You might immediately think that this can be done with prompts, and in most cases it can, but in some instances, prompt-based methods are not as stable as function calling. Traditional systems require very high stability. For example:
```python
student_1_description = (
    "Xiao Wang is a second-year student majoring in Computer Science at Peking "
    "University with a GPA of 3.8. He is excellent at programming and an active "
    "member of the university's Robotics Club. He hopes to pursue a career in AI "
    "after graduation."
)
```
For this text, we can request JSON output via a prompt, for example:
```
Please extract the following information from the given text and return it as a JSON object:

name
major
school
grades
club

This is the body of text to extract the information from:
{student_1_description}
```
However, a tricky part is that we cannot guarantee whether GPT outputs grades as `3.8` or `3.8 GPA`. These two results make no difference to humans but are completely different for computers—the former is a float number, the latter a string. For some languages, conversion may fail directly.
Of course, we can add more descriptions to the prompt to reduce such issues, but reality is complex. It’s hard to fully describe the requirements precisely in natural language and guarantee GPT’s responses remain consistent every time. For these issues, OpenAI’s function calling can somewhat solve the problem of interaction between natural language and machine language.
As in the above example, the problem can be described as a function where the grades field is specified as an integer to avoid such issues. This structured ability is crucial for developing a stable system.
```python
student_custom_functions = [
    {
        "name": "extract_student_info",
        "description": "Get the student information from the body of the input text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "string",
                    "description": "Name of the person",
                },
                "major": {
                    "type": "string",
                    "description": "Major subject.",
                },
                "school": {
                    "type": "string",
                    "description": "The university name.",
                },
                "grades": {
                    "type": "integer",
                    "description": "GPA of the student.",
                },
                "club": {
                    "type": "string",
                    "description": "School club for extracurricular activities.",
                },
            },
        },
    }
]
```
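A brief sketch of how these function definitions might be used and the structured result parsed (again assuming the pre-1.0 `openai` SDK and the `student_1_description` text above):

```python
import json
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": student_1_description}],
    functions=student_custom_functions,
    function_call={"name": "extract_student_info"},  # force the structured path
)

# The arguments field is a JSON string whose shape follows the schema above,
# so "grades" comes back as a number rather than free text like "3.8 GPA".
arguments = response["choices"][0]["message"]["function_call"]["arguments"]
student_info = json.loads(arguments)
print(student_info["name"], student_info["grades"])
```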
For a complete debugging process, see this [OpenAI function calling example tutorial](https://www.datacamp.com/tutorial/open-ai-function-calling-tutorial).
GPT Application Demand Analysis
----------
The sections above mainly cover techniques for developing GPT applications, but to build a product you still need to start from business demand and think about what business value you can create to meet potential user needs.
> Updated 2023/09/06: The official LangChain documentation has been reorganized into RAG (Retrieval Augmented Generation) and Agents sections. This shows that, after the GPT boom of the past few months, the industry now regards RAG and Agents as the two main directions for putting LLMs into production; RAG corresponds to the GPT Embeddings part discussed above. Examples of both directions are given below. I think keeping an eye on frontline frameworks like LangChain is very valuable for developers, because they distill the best practices and experience gained across real business implementations, which greatly helps developers understand business demand.
### Content Generation
Content generation is currently the most mainstream demand and the AI application category with the highest traffic. Besides widely known products like ChatGPT, applications such as [Character AI](https://beta.character.ai/), built primarily on Prompt development for AI companions (role-playing), also attract enormous traffic.
The broader content generation category also includes image generation such as [Midjourney](https://www.midjourney.com/), voice generation such as [ElevenLabs](https://elevenlabs.io/), creative writing such as [copy ai](https://www.copy.ai/), and niche areas such as novel-writing assistance like [AI-Novel](https://ai-novel.com/).
Content generation is the AI demand with the most Internet traffic at present and is also the easiest to land, thanks to its wide range of application scenarios. Because content generation assistance often directly improves productivity, willingness to pay tends to be high, and competition in this field is correspondingly the fiercest.
### GPT Embeddings Demand
I think GPT Embeddings is a direction with great potential. When [ChatFiles](https://github.com/guangzhengli/ChatFiles) was first open-sourced, I received inquiries about optimizing current scenarios and business within customer service, sales, operating manuals, knowledge bases, etc. My personal answer is that GPT Embeddings has strong prospects in these scenarios.
Startups currently visible in this field include [mendable ai](https://www.mendable.ai/), which has captured a certain market share and powers the documentation Q&A features of leading GPT frameworks like [LangChain](https://js.langchain.com/). They are also actively expanding into sales, customer service, and other business scenarios.
Besides this, the highest traffic in this field currently probably belongs to [ChatPDF](https://www.chatpdf.com/), which allows uploading PDF files and then asking questions or making requests like summarizing based on PDFs. This direction has also spawned various niche demands, like assisting paper reading and collaboration, such as [Jenni AI](https://jenni.ai/).
### GPT Agents Demand
Demands for GPT Agents are diverse because they depend on integrating existing systems. We can analyze the APIs of existing systems to determine what business value we can create. For example, many of GPT's online features are realized by integrating SerpAPI, a meta-search engine that aggregates the major search engines; this lets ChatGPT answer questions based on search results, such as current weather, stock prices, news, and so on.
Among these, the best-known projects are [Auto GPT](https://github.com/Significant-Gravitas/Auto-GPT) and [AgentGPT](https://github.com/reworkd/AgentGPT); if you are interested in GPT Agents applications, they are worth a look. Additionally, companies like [cal.com](https://github.com/calcom/cal.com/tree/main/apps/ai) integrate AI agents so that appointments can be booked in natural language, which may offer a different kind of inspiration.
### Unstructured Input and Structured Output
Another point I feel many people overlook is GPT/LLM's ability to process unstructured data. In the past, processing even simple texts required significant development time—for example, extracting key information such as name, phone number, address, etc., from text messages. Since message templates vary, this information is unstructured.
We cannot convert such content into structured output formats like JSON with a simple method, so we often have to extract the information with a complex set of regular expressions or NLP (Natural Language Processing) techniques. However, writing regular expressions is complicated: each SMS template needs its own, and because templates change over time, we have to keep adjusting them. This consumes a great deal of development time and engineering effort.
NLP technology can only target certain specific scenarios, such as extracting phone numbers, addresses, etc. For different scenarios, we need different NLP techniques. So once business requirements change, such as recognizing license plates, we need to redevelop and adjust accordingly.
However, with a unified API like OpenAI's GPT, we only need to give the service different prompts; guided by those prompts, GPT performs the reasoning and returns the desired results. This greatly simplifies the development process, shortens development time, and lets us respond quickly to market changes.
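For instance, a single prompt along these lines (the text message below is made up for illustration) can stand in for a pile of regular expressions:

```
Extract the name, phone number, and address mentioned in the text message below,
and return a JSON object with the keys "name", "phone", "address".
If a field is missing, set it to null.

Text message: """Hi Ms. Zhang, your package will be delivered to Building 3,
Haidian District, Beijing today. Courier Xiao Li, 138-0000-0000."""
```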
Moreover, due to GPT's powerful generalization ability, we only need to consider how to handle unstructured data for different scenarios. This capability to handle unstructured data will change many traditional business development processes and methods in future software development, having a profound impact on the software lifecycle.
### Natural Language Interaction
Finally, I want to say that GPT's ability to process natural language will deeply change the way humans interact with machines. From command lines to graphical interfaces to touch screens, human-computer interaction has kept evolving, yet most people's interaction with machines still requires programmers as intermediaries: business analysts and developers extract the requirements, programmers build graphical interfaces in code, and people then interact with those interfaces.
This process loses information in transmission and creates significant limitations and costs. If machines can understand natural language, we will be able to interact with them directly; familiar concepts such as software, machines, and intelligence will change enormously, and human-computer interaction will be transformed dramatically.
If you find it difficult to understand this uncertain and unprecedented change, you can check out the [open interpreter](https://github.com/KillianLucas/open-interpreter) project, which generates code based on natural language and executes it directly on the computer. Although it is very primitive and unstable, it shows us the future transformation of human-computer interaction.
References
----------
* [https://github.com/guangzhengli/ChatFiles](https://github.com/guangzhengli/ChatFiles)
* [https://github.com/guangzhengli/vectorhub](https://github.com/guangzhengli/vectorhub)
* [https://js.langchain.com/docs/modules/data_connection](https://js.langchain.com/docs/modules/data_connection)
* [https://www.datacamp.com/tutorial/open-ai-function-calling-tutorial](https://www.datacamp.com/tutorial/open-ai-function-calling-tutorial)
* [https://openai.com/blog/function-calling-and-other-api-updates](https://openai.com/blog/function-calling-and-other-api-updates)
* [https://openai.com/blog/chatgpt-plugins#code-interpreter](https://openai.com/blog/chatgpt-plugins#code-interpreter)
* [https://github.com/KillianLucas/open-interpreter](https://github.com/KillianLucas/open-interpreter)
* [https://www.chatpdf.com](https://www.chatpdf.com/)
* [https://jenni.ai](https://jenni.ai/)
* [https://github.com/reworkd/AgentGPT](https://github.com/reworkd/AgentGPT)
* [https://a16z.com/how-are-consumers-using-generative-ai](https://a16z.com/how-are-consumers-using-generative-ai)