Why Data is Important for LLM

Dev.to / 3/20/2026


Key Points

The article argues that prompt quality hinges on providing clear context and appropriate input data, as shown by a scheduling example where adding context changed the result.
It identifies three data types—structured, unstructured, and semi-structured—and explains that their formats influence how data is stored, queried, and fed into LLMs.
Structured data has a fixed format and is easy to query, with examples such as financial reports, survey results, and class schedules.
Unstructured data lacks a predefined format, making it harder to query directly and requiring different strategies for model ingestion and interpretation.
As organizations scale their use of LLMs (e.g., Gemini, ChatGPT, Qwen, Kimi), the impact of data quality and data-type considerations becomes more significant for producing reliable outputs.

I used to think I could feed any data into an AI and expect good output. One small mistake I still make sometimes is giving too little context when prompting. I remember asking:

"Create me a set of schedule to support my fundamental daily learning on Software and AI Engineer".

It did create a schedule, and it technically worked, but not quite! It gave me a straight 8-hour schedule with no breaks. What I actually wanted was:

Multiple learning sessions (morning, afternoon, evening)
Breaks in between
Software Engineer topics in the morning and afternoon
AI topics in the evening

As you can see, even though the intent is the same (create a set of schedules), the outcome is very different just because of missing context. This simple example already shows how critical input data is. And that's just prompting. When we scale this up to real-world systems feeding data into LLMs like Gemini, ChatGPT, Qwen, or Kimi, the impact becomes much bigger.

Data Types

Speaking of data, I think we also need to understand what data goes in, and not just nod along with "Oh, let's feed some data" without really understanding what kind of data we're dealing with. There are three main types of data:

Structured Data
Unstructured Data
Semi-structured Data

Structured Data

Structured data has a fixed, predefined format.

Think of spreadsheets or relational databases—data neatly organized into rows and columns. It’s easy to query, validate, and process.

Examples:

Financial reports
Survey results
Class schedules
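To make "easy to query" concrete, here is a minimal sketch that loads a class schedule into an in-memory SQLite table and filters it by day. The table name, column names, and rows are all made up for illustration; they are not from any real system.

```python
import sqlite3

# Hypothetical class schedule rows: (day, start time, subject).
ROWS = [
    ("Mon", "09:00", "Algorithms"),
    ("Mon", "13:00", "Databases"),
    ("Tue", "19:00", "Machine Learning"),
]

def classes_on(day, rows):
    """Return (start, subject) pairs for one day, ordered by start time."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE schedule (day TEXT, start TEXT, subject TEXT)")
        conn.executemany("INSERT INTO schedule VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT start, subject FROM schedule WHERE day = ? ORDER BY start",
            (day,),
        )
        return cursor.fetchall()
    finally:
        conn.close()

print(classes_on("Mon", ROWS))
```

Because every row has the same columns, a single query answers the question. No text parsing or guessing is needed.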

Unstructured Data

Unstructured data has no predefined format.

This is actually what we interact with most of the time in real life.

Examples:

Emails
Images
Videos
Chat messages

Unlike structured data, this type requires additional processing (like NLP or computer vision) before it becomes useful.

Semi-structured Data

Semi-structured data sits somewhere in between.

It doesn’t follow a strict table format, but it still contains some organization through metadata or tags.

A good example is a social media post:

The image itself → unstructured
Metadata (caption, hashtags, timestamp) → structured elements
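As a rough illustration of the split described above, the sketch below parses a social media post represented as JSON. The field names (`caption`, `hashtags`, `timestamp`, `image_file`) are assumptions chosen for the example, not any platform's real schema. The tagged fields can be read directly, while the image is only a reference that would still need separate processing.

```python
import json

# Hypothetical post payload; the schema is invented for illustration.
POST_JSON = """
{
  "caption": "Morning study session",
  "hashtags": ["#ai", "#learning"],
  "timestamp": "2026-03-20T08:00:00",
  "image_file": "photo.jpg"
}
"""

def summarize_post(raw):
    """Pull out the structured elements of a semi-structured post."""
    post = json.loads(raw)
    return {
        "caption": post.get("caption"),
        "hashtags": post.get("hashtags", []),
        "timestamp": post.get("timestamp"),
        # The image content itself stays unstructured; we only know it exists.
        "has_image": "image_file" in post,
    }

print(summarize_post(POST_JSON))
```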

How Data Shapes AI

So far, my own example already shows that data is crucial. In the machine learning domain, there is a well-known principle:

Garbage In, Garbage Out (GIGO)

Basically, if your input data is messy, incomplete, or misleading, your output will reflect that. Imagine if models were fed:

Noisy datasets
Biased sources
Incomplete information

I don't think we would have what we have today if these models had been trained on poor-quality data.

Real World Example

Currently, I’m building a project that uses OCR + LLM to extract and parse data from shopping receipts into a backend system, which is then visualized on a dashboard. I did a lot of trial and error, especially on receipts in poor condition. Here's the first example:

Case 1: Blurry Receipt

In this receipt, two critical things are blurry: the date and the item name. I used RapidOCR to scan the receipt, and here's what I got:

Bangorejo Sol WARUNG lobaru, Kwar SAYUR Sukoharjo UPSP KIIA Gr rugu!
/02/ 10115 /2026 Kasir:KASIK Jam 10:57
PHIFIK 1PCKx NGKUNG 8. 000= BALADO 8. .000
KEMBALI.. JUMLAH OT A UANG .. 10 8 2. 000 .000 .000
1 Items Pembayai TUNAI
TERIMA KASIH
ATAS KUNJUNGAN ANDA
PEMESANAN xxxxxxxxxxxx
KRITIK DAN SARAN xxxxxxxxxxxx

As you can see, the date is fragmented (only 02 and 2026 survive) and the item name is partially corrupted. Yes, the gaps in the scanned text are my biggest obstacle in laying out what has been scanned, but apart from that, bad data produces bad output.
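One way to catch a fragmented date before it reaches the LLM is a small pre-check on the OCR text. The sketch below is not part of my pipeline; it is a minimal illustration that assumes dates are printed day first (dd/mm/yyyy or dd-mm-yyyy), possibly with stray spaces around the separators. The function name `extract_date` is hypothetical.

```python
import re
from datetime import date

# Assumption: day-first dates with "/" or "-" separators and optional spaces.
DATE_PATTERN = re.compile(
    r"(?<!\d)(\d{1,2})\s*[-/]\s*(\d{1,2})\s*[-/]\s*(\d{4})(?!\d)"
)

def extract_date(ocr_line):
    """Return a date if the line holds a complete dd/mm/yyyy, else None."""
    match = DATE_PATTERN.search(ocr_line)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits were present but do not form a real calendar date.
        return None

# The fragmented line from the blurry receipt: no complete date survives.
print(extract_date("/02/ 10115 /2026 Kasir:KASIK Jam 10:57"))
# A line where all three parts are intact, despite the stray spaces.
print(extract_date("13- 02- 2026 16:30:18"))
```

When the check returns None, the receipt can be flagged for a rescan or manual entry instead of hoping the model guesses the missing day.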

Then I passed this into an LLM (running via Ollama, model: GPT-OSS:120b-cloud):

response data: {'receipt_id': 'c5f962ce-1a48-432e-a4d2-a24ea048597f', 'merchant_name': {'value': 'WARUNG SAYUR KIIA'}, 'date': {'value': None}, 'time': {'value': '10:57'}, 'items': [{'name': {'value': 'NGKUNG BALADO'}, 'qty': {'value': 1}, 'price': {'value': 8000}, 'total_price': {'value': 8000}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}], 'total_amount': {'value': 8000}}

The model cleaned up some parts, especially the merchant name, but:

The date is completely missing
Item names are still incomplete

Here's another example:

Case 2: Dirty Receipt

Now what about a noisy, dirty receipt? Here's what I got:

ALFAMART KOMARASAN SUKOHARDO
KWARASAN SUKOHAROZF
CV ANXSRAH MARATA
JL. RAYA SOLO 4 BAKI NO. 24A ABRT OR
NPWP : 73  |  030.3-532060
Bon  |  701 13024083 Kasir POVRE
LERL AIR5L  |  1 16,090 16.0e
Disc  |  500
KP BRAMDING (5) 1 200 260
CNIONHRN (S) 1 200 200
Total Item! 16.009
Total Disc 589
Total Belanja 15, ,500
Tunai 15, 500
Kerbalian  |  D
PPN  |  DPP: 14. 324 PPM: 1.575
Tgl.  |  13- 02- 2026 16:30:18 V. 2025 11. 6

As you can see, the OCR got some information right, while other parts are quite broken. The LLM returned this output:

response data: {'receipt_id': 'ff78be41-a3d7-4f35-b9ca-e54ca7eb5b9a', 'merchant_name': {'value': 'ALFAMART'}, 'date': {'value': '2026-02-13'}, 'time': {'value': '16:30'}, 'items': [{'name': {'value': 'LERL AIR5L'}, 'qty': {'value': 1}, 'price': {'value': 16090}, 'total_price': {'value': 15590}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': 'nominal'}, 'discount_value': {'value': 500}, 'voucher_amount': {'value': 0}}, {'name': {'value': 'KP BRAMDING (5)'}, 'qty': {'value': 1}, 'price': {'value': 200}, 'total_price': {'value': 260}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}, {'name': {'value': '(S)'}, 'qty': {'value': 1}, 'price': {'value': 200}, 'total_price': {'value': 200}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}], 'total_amount': {'value': 16050}}

Interestingly:

The date and time were reconstructed correctly
But item names and pricing details are inconsistent

The body (line items) is where things degrade the most.
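The pricing inconsistencies can be detected mechanically. The sketch below takes a trimmed copy of the response above and checks each line item against the rule `qty * price - discount_value == total_price`. That rule is my assumption about how the fields relate (the first item does follow it); the helper names are hypothetical.

```python
# Trimmed copy of the Case 2 response, keeping only the fields being checked.
RESPONSE = {
    "items": [
        {"name": {"value": "LERL AIR5L"}, "qty": {"value": 1},
         "price": {"value": 16090}, "total_price": {"value": 15590},
         "discount_value": {"value": 500}},
        {"name": {"value": "KP BRAMDING (5)"}, "qty": {"value": 1},
         "price": {"value": 200}, "total_price": {"value": 260},
         "discount_value": {"value": 0}},
        {"name": {"value": "(S)"}, "qty": {"value": 1},
         "price": {"value": 200}, "total_price": {"value": 200},
         "discount_value": {"value": 0}},
    ],
}

def unwrap(field):
    """Read the inner value from a {'value': ...} wrapper, or None."""
    return field.get("value") if isinstance(field, dict) else None

def inconsistent_items(response):
    """Return (name, expected_total, reported_total) for items that fail the rule."""
    flagged = []
    for item in response.get("items", []):
        qty = unwrap(item.get("qty")) or 0
        price = unwrap(item.get("price")) or 0
        discount = unwrap(item.get("discount_value")) or 0
        reported = unwrap(item.get("total_price"))
        expected = qty * price - discount
        if reported != expected:
            flagged.append((unwrap(item.get("name")), expected, reported))
    return flagged

print(inconsistent_items(RESPONSE))
```

A check like this only catches arithmetic that disagrees with itself. It cannot tell whether a price that is internally consistent was misread by the OCR in the first place.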

Key Insight

This pipeline highlights something important:

LLMs don’t “fix” bad data, but rather interpret it

If the OCR output is already corrupted/broken:

Missing tokens → missing fields
Wrong tokens → hallucinated or incorrect values
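Missing fields are the easier half to detect. As a minimal sketch, the function below scans a parsed response for required fields whose value is absent or None. The list of required fields is my own choice for illustration, and the sample is a trimmed copy of the Case 1 response.

```python
# Assumption: these are the fields the backend cannot do without.
REQUIRED_FIELDS = ("merchant_name", "date", "time", "total_amount")

# Trimmed copy of the Case 1 response.
CASE_1 = {
    "merchant_name": {"value": "WARUNG SAYUR KIIA"},
    "date": {"value": None},
    "time": {"value": "10:57"},
    "total_amount": {"value": 8000},
}

def missing_fields(response, required=REQUIRED_FIELDS):
    """List required fields that are absent or carry a None value."""
    missing = []
    for name in required:
        field = response.get(name)
        value = field.get("value") if isinstance(field, dict) else None
        if value is None:
            missing.append(name)
    return missing

print(missing_fields(CASE_1))
```

Wrong tokens are harder, because a hallucinated value looks just as valid as a correct one unless there is something to cross-check it against.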

Conclusion

Whatever we feed into LLMs, whether through prompting, preprocessing, or direct model input, data quality directly shapes the output.

Or, as the principle puts it:

Garbage In, Garbage Out
