Prompt Playbook: AI Fundamentals PART 4

Hey Prompt Entrepreneur,

Data!

Boring but important. It’s the lifeblood of AI models and can make or break an implementation.

Lots of people learn a little about machine learning and AI and assume (understandably) that MORE data is always better.

This is (kinda) correct for big foundation models. But as we talked about in the previous Part, we’re not trying to build our own foundation models for our businesses. That’s not a game we can win.

Instead we’re deploying smaller, more strategic plays. And to pull these off we need to talk about what data we need.

"More is better" is actually a misconception in many practical AI applications for businesses. It would be more accurate to say "more relevant, high-quality data is better, up to a point" - but that's not as catchy!!

Let’s get started:

Summary

Fuelling our models

  • Why data is the new competitive battleground in AI

  • GIGO: The critical importance of data quality for prediction

  • The three types of AI data and how they're used

  • Why tech giants are getting "thirsty" for your data

  • Data privacy considerations that impact your strategy

  • Building your data advantage as an entrepreneur

The New Competitive Battleground

There's an increasingly important shift happening in the AI landscape. While the headlines focus on which company has the "best" model, the real battle is happening elsewhere - over data.

Here's why this matters: foundation models are rapidly commoditising. GPT-4, Claude, Llama, Mistral, and others are becoming more similar in capabilities. What isn't commoditising is proprietary data - the unique information that only your business has access to.

All the foundation models were initially trained on (basically) the same data: the internet. Now that shared starting point is no longer enough to gain an edge.

Consider these data plays:

  • The New York Times suing OpenAI over using their content for training, while other publishers strike licensing deals with OpenAI instead

  • Reddit signing a $60 million deal with Google for data access

  • OpenAI reportedly exploring the creation of a social media network (likely to generate fresh training data)

  • and countless other examples of AI companies teaming up with content/data providers of all stripes

These aren't random business decisions - they're strategic moves in the new data economy. The companies building foundation models are getting increasingly "thirsty" for high-quality data, and they're willing to pay premium prices to get it.

This creates both challenges and opportunities for entrepreneurs. While you may not be able to compete with OpenAI on model development, you might have access to unique data in your niche that could be extraordinarily valuable.

This is your edge.

For entrepreneurs, this creates a clear opportunity. As we discussed in the previous Part, you shouldn't build base models from scratch - that's a game for companies with massive resources. Instead, your advantage comes from leveraging your unique data to customise existing models.

How Entrepreneurs Should Think About Data

Remember the last Part’s message: don't build your own models from scratch. That doesn't mean data doesn't matter - quite the opposite. It means you need to be strategic about how you use data with existing models.

We discussed two primary ways entrepreneurs can leverage data with foundation models. Here’s a quick reminder:

Fine-tuning is like giving the model additional education in your specific domain. You provide examples that teach the model to better understand your industry language, respond in your brand voice, or perform specific tasks relevant to your business.

Retrieval-Augmented Generation (RAG) is like giving the model a custom reference library to consult. Instead of hoping the model already knows about your products, services, or domain, you explicitly provide this information at runtime.

Both approaches let you benefit from the capabilities of foundation models while adding your unique advantage. And both rely entirely on the quality of your data.

Why Data Quality Matters

Let’s talk about data.

Quality trumps quantity when it comes to AI data. It doesn't matter if you have millions of records if they're inconsistent, inaccurate, or irrelevant.

A constant refrain you must remember is Garbage In, Garbage Out (GIGO). If you flood your AI with crappy data, its performance will actually get worse. MORE isn’t better.

So what makes data valuable for AI? Relevance is king - data directly related to the problems you're trying to solve is worth far more than massive quantities of tangential information. A thousand examples of exactly what you want to predict are worth more than a million examples of something vaguely related.

If you are building a bot to answer customer service questions, give it lots of customer service interactions. Seems obvious, but companies make this mistake all the time.

Cleanliness is critical too. Errors, inconsistencies, and noise in your data dramatically reduce its value. I've seen companies spend thousands on sophisticated AI infrastructure only to see it underperform because they skimped on data cleaning. You need to do a few runs to tidy everything up - thankfully AI can help here but don’t be lazy!

Representativeness matters as well. Your data should accurately reflect the real-world conditions where your AI will operate. If your customer service dataset only includes interactions with happy customers, your AI will struggle when it encounters its first angry client.

For entrepreneurs, focusing on these quality factors is far more important than obsessing over quantity. You don't need millions of examples - you need the right examples. Which is great news because as business owners and entrepreneurs we are often sitting on lots of data or (we’ll talk about this momentarily) in a great place to set up collection.

Practical Data Implementation for Entrepreneurs

Let's get practical about how to actually implement fine-tuning and RAG systems with your data.

Fine-tuning Implementation

Fine-tuning is becoming increasingly accessible. It sounds scary, but here’s a high-level rundown.

First, prepare your data. For OpenAI's fine-tuning, you'll need to format your data as JSONL files with prompt-completion pairs. It’ll look something like this:

{"prompt": "Customer question: How do I reset my password?", "completion": "To reset your password, click on the 'Forgot Password' link on the login page and follow the instructions sent to your email."}
{"prompt": "Customer question: Where is my order?", "completion": "You can track your order by logging into your account and visiting the 'Order History' section. There you'll find real-time updates on your delivery."}

Think of this like model answers. Each prompt-completion pair teaches the model how you want it to respond to similar queries. You'll typically need several hundred to a few thousand high-quality examples for effective fine-tuning.

These can be farmed out on platforms like Mechanical Turk or done in-house. It depends on the specifics of the fine-tuning!

Once your data is prepared, you can use platforms like OpenAI's fine-tuning API, Hugging Face, or Replicate to train your custom model. Costs vary but expect to pay hundreds to a few thousand dollars depending on the model size and amount of data. Very low costs compared to building a model from scratch!!
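
To make this concrete, here's a minimal sketch of kicking off a job with OpenAI's Python library. The filename and base model are placeholders, and note that newer chat models expect a slightly different JSONL layout ("messages" lists rather than prompt-completion pairs), so check the current docs before running anything like this:

# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

# 1. Upload the JSONL training file we prepared above
training_file = client.files.create(
    file=open("customer_service.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start the fine-tuning job on a base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",  # placeholder - check currently supported models
)

print(job.id, job.status)  # poll the job until it reports success

That's genuinely most of it - the hard work is in the data preparation, not the API calls.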

The practical benefit? A fine-tuned model can respond more consistently to queries in your domain, using your preferred tone and following your business policies - without needing to spell these out in each and every prompt.

RAG Implementation

RAG systems are even more accessible and often more immediately useful for entrepreneurs.

First, gather your knowledge base. This could be product documentation, FAQs, blog posts, internal wikis, or any text relevant to your domain. You'll need to extract the text from various formats (PDFs, websites, databases) using tools like PyPDF, BeautifulSoup, or dedicated services.
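
As a rough illustration, here's what that extraction step can look like in Python with pypdf and BeautifulSoup (the filename and URL are placeholders for your own documents):

# pip install pypdf beautifulsoup4 requests
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader

# Extract text from a PDF (e.g. product documentation)
reader = PdfReader("product_manual.pdf")
pdf_text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Extract text from a web page (e.g. your public FAQ)
html = requests.get("https://example.com/faq").text
soup = BeautifulSoup(html, "html.parser")
faq_text = soup.get_text(separator="\n", strip=True)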

Be careful here, as there is a tendency to throw everything and the kitchen sink into your knowledge base. It’s tempting! But remember the GIGO rule above!

Next, chunk your documents into smaller, digestible pieces. Typically, these are paragraphs or sections of 200-1000 tokens. This chunking ensures the model receives relevant context without overwhelming it. You can use tools like LangChain and LlamaIndex for this. Don’t worry, it’s not manual!
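
If you're curious what that looks like under the hood, here's a small sketch using LangChain's text splitter. One assumption to flag: its chunk_size is measured in characters by default, so treat the numbers as rough equivalents of the token ranges above:

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # approximate chunk length, in characters
    chunk_overlap=100,  # overlap so ideas aren't cut off mid-thought
)
chunks = splitter.split_text(pdf_text + "\n" + faq_text)  # text from the previous step
print(f"{len(chunks)} chunks ready for embedding")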

Then, create “embeddings” for each chunk. Embeddings are numerical representations of text that capture semantic meaning. Sounds hard and complex, but again, don’t worry: tools do this for us. OpenAI’s embedding API or Hugging Face models can handle the whole process. These embeddings allow for semantic search - finding information based on meaning, not just keywords.
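
Continuing the sketch, turning chunks into embeddings is a single API call (the model name here is one of OpenAI's current embedding models; swap in whatever you prefer):

# pip install openai
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",  # OpenAI embedding model
    input=chunks,                    # the list of chunks from the previous step
)
vectors = [item.embedding for item in response.data]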

Store these embeddings in a vector database like Pinecone or Weaviate. Then, when a user asks a question, you search this database for the chunks most relevant to their query.
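
Here's roughly what storing and searching looks like with Pinecone's Python client. This assumes an index called "my-knowledge-base" already exists with the right dimension for your embedding model (1536 for text-embedding-3-small), and the API key is a placeholder:

# pip install pinecone openai
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_KEY")  # placeholder key
index = pc.Index("my-knowledge-base")

# Store each chunk's embedding alongside its original text
# (reusing `vectors` and `chunks` from the earlier steps)
index.upsert(vectors=[
    {"id": str(i), "values": vec, "metadata": {"text": chunk}}
    for i, (vec, chunk) in enumerate(zip(vectors, chunks))
])

# At question time: embed the question the same way, then search by meaning
question = "How do I reset my password?"
q_vec = client.embeddings.create(
    model="text-embedding-3-small", input=[question]
).data[0].embedding
results = index.query(vector=q_vec, top_k=3, include_metadata=True)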

Finally, retrieve the most relevant chunks and send them to the model along with the user's query. The model then generates a response using both its general knowledge and your specific information.
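
And the last step, continuing the same sketch: stuff the retrieved chunks into the prompt and let the model answer, grounded in your own material:

# Pull the stored text back out of the search results
context = "\n\n".join(m.metadata["text"] for m in results.matches)

answer = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model works here
    messages=[
        {"role": "system",
         "content": "Answer using ONLY the context provided. "
                    "If the answer isn't in the context, say so."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)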

Tools like LangChain and LlamaIndex have simplified this entire process with pre-built components for each step. So whilst it’s kinda neat to understand the steps above, you honestly no longer need to wire them up by hand - you choose your files, upload them, and let the various tools do the heavy lifting.

The practical benefit? Your AI can reference exactly the information you want it to, staying current with your latest products or policies without retraining. It dramatically reduces hallucinations and keeps responses grounded in your actual business context.

Building Your Data Advantage

OK that’s all well and good. But how do we do this practically?

Start by identifying your data assets. What information do you have that competitors don't? Customer interactions, domain expertise, proprietary processes - these could all be valuable sources for fine-tuning and/or RAG systems.

Then, design for data collection. Build data capture into your products and processes from the ground up. You decide first what sort of data would be useful and then work out mechanisms to collect that information. Every customer interaction should be seen as a potential opportunity to gather valuable training or RAG data.

AI can help you here! Here’s a prompt to kick off this process:

You are an expert in identifying valuable proprietary data assets in businesses. Help me discover the unique data advantage in my company.

About my business:
- Industry: [your industry]
- Main products/services: [brief description]
- Customer interactions: [how you interact with customers]
- Existing data collection: [what data you already collect]

Based on this information:

1. Identify 5-7 unique data assets my business likely has that competitors may not have access to
2. For each data asset, explain:
   - Why it would be valuable for AI applications
   - Whether it would be better for fine-tuning or RAG systems
   - What specific business problems it could help solve
3. Suggest practical ways to better capture, organize, and utilize this data
4. Identify any potential privacy or ethical considerations

My goal is to leverage these data assets to create AI solutions that provide unique value to my customers.

What's Next?

In the final Part we'll conclude our week on AI fundamentals by exploring "The AI Stack" - the various layers of technology that make up modern AI systems.

We'll help you understand build vs. buy decisions, how different components fit together, and how to develop an AI strategy that's both ambitious and pragmatic.

We’ll be pulling everything that we’ve covered so far into a strategy you can deploy in your own business or with others as a consultant.

Keep Prompting,

Kyle

Anything else? Hit reply to this email and let’s chat.
