
Why Data Is Important for LLMs

Dev.to / 3/20/2026


Key Points

  • The article argues that prompt quality hinges on providing clear context and appropriate input data, as shown by a scheduling example where adding context changed the result.
  • It identifies three data types—structured, unstructured, and semi-structured—and explains that their formats influence how data is stored, queried, and fed into LLMs.
  • Structured data has a fixed format and is easy to query, with examples such as financial reports, survey results, and class schedules.
  • Unstructured data lacks a predefined format, making it harder to query directly and requiring different strategies for model ingestion and interpretation.
  • As organizations scale their use of LLMs (e.g., Gemini, ChatGPT, Qwen, Kimi), the impact of data quality and data-type considerations becomes more significant for producing reliable outputs.

I used to think I could feed any data into an AI model and expect good output. One small mistake I still sometimes make is giving too little context when prompting. I remember asking:

"Create me a set of schedule to support my fundamental daily learning on Software and AI Engineer".

It did create a schedule. It technically worked, but not quite: it gave me eight straight hours with no breaks. What I actually wanted was:

  • Multiple learning sessions (morning, afternoon, evening)
  • Breaks in between
  • Software Engineer topics in the morning and afternoon
  • AI topics in the evening

As you can see, even though both requests share the same intent (create a schedule), the outcome is very different just because of missing context. This simple example already shows how critical input data is. And that’s just prompting. When we scale this up to real-world systems feeding data into LLMs like Gemini, ChatGPT, Qwen, or Kimi, the impact becomes much bigger.
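To make the difference concrete, here is a minimal sketch of attaching that missing context to the request. The helper name build_prompt and the exact wording of the task are assumptions for illustration; the constraints are the four items listed above.

```python
def build_prompt(task: str, constraints: list[str]) -> str:
    """Combine a task with explicit constraints into one prompt string."""
    lines = [task.strip(), "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)


# Hypothetical rewording of the original request.
task = "Create a daily learning schedule for Software and AI Engineering fundamentals."
constraints = [
    "Multiple learning sessions (morning, afternoon, evening)",
    "Breaks in between",
    "Software Engineer topics in the morning and afternoon",
    "AI topics in the evening",
]

prompt = build_prompt(task, constraints)
print(prompt)
```

The intent stays the same; only the amount of input data the model receives changes.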

Data Types

Speaking of data, I think we also need to understand what kind of data goes in, instead of just nodding along with "Oh, let's feed in some data" without really knowing what we're dealing with. There are three main types of data:

  • Structured Data
  • Unstructured Data
  • Semi-structured Data

Structured Data

Example of structured data. Image by Christine Sandu

Structured data has a fixed, predefined format.

Think of spreadsheets or relational databases—data neatly organized into rows and columns. It’s easy to query, validate, and process.

Examples:

  • Financial reports
  • Survey results
  • Class schedules
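As a small illustration of why a fixed format is easy to query, the sketch below loads a class schedule as CSV and filters it. The rows and column names are made up for this example.

```python
import csv
import io

# Hypothetical class schedule; the rows are invented for illustration.
SCHEDULE_CSV = """day,start,end,subject
Monday,08:00,09:30,Algorithms
Monday,10:00,11:30,Databases
Tuesday,08:00,09:30,Machine Learning
"""

rows = list(csv.DictReader(io.StringIO(SCHEDULE_CSV)))

# Every row has the same columns, so a query is just a filter on a column.
monday_subjects = [r["subject"] for r in rows if r["day"] == "Monday"]
print(monday_subjects)  # ['Algorithms', 'Databases']
```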

Unstructured Data

Example of unstructured data. Image by Saad Chaudry

Unstructured data has no predefined format.

This is actually what we interact with most of the time in real life.

Examples:

  • Emails
  • Images
  • Videos
  • Chat messages

Unlike structured data, this type requires additional processing (like NLP or computer vision) before it becomes useful.

Semi-structured Data

Example of semi-structured data. Image by Thought Catalog

Semi-structured data sits somewhere in between.

It doesn’t follow a strict table format, but it still contains some organization through metadata or tags.

A good example is a social media post:

  1. The image itself → unstructured
  2. Metadata (caption, hashtags, timestamp) → structured elements
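The split between the two parts can be sketched with a JSON payload. The field names and values below are assumptions for illustration and do not come from any real platform's API.

```python
import json

# Hypothetical post payload; field names are assumptions, not a real API.
post_json = """
{
  "image_url": "https://example.com/photo.jpg",
  "caption": "Morning market run #groceries #local",
  "hashtags": ["groceries", "local"],
  "timestamp": "2026-02-13T16:30:18"
}
"""

post = json.loads(post_json)

# Tagged fields can be read directly, much like structured data.
print(post["hashtags"], post["timestamp"])

# Keys are not guaranteed to exist, which is what makes this semi-structured.
optional_location = post.get("location")  # None here, since the key is absent

# The image behind image_url is still unstructured: its content needs
# computer vision or OCR before it can be queried.
```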

How Data Shapes AI

By now, my own example should make it clear that data is crucial. In machine learning, there is a well-known principle:

Garbage In, Garbage Out (GIGO)

Basically, if your input data is messy, incomplete, or misleading, your output will reflect that. Imagine models being trained on:

  • Noisy datasets
  • Biased sources
  • Incomplete information

I don't think we would have the models we have today if they had been trained on poor-quality data.

Real World Example

Currently, I’m building a project that uses OCR + LLM to extract and parse data from shopping receipts into a backend system, where it is then visualized on a dashboard. I did a lot of trial and error, especially with receipts in poor condition. Here's the first example:

Case 1: Blurry Receipt

Missing information on shopping receipt

In the example above, two critical things are blurry: the date and the item name. I used RapidOCR to scan the receipt, and here's what I got:

Bangorejo Sol WARUNG lobaru, Kwar SAYUR Sukoharjo UPSP KIIA Gr rugu!
/02/ 10115 /2026 Kasir:KASIK Jam 10:57
PHIFIK 1PCKx NGKUNG 8. 000= BALADO 8. .000
KEMBALI.. JUMLAH OT A UANG .. 10 8 2. 000 .000 .000
1 Items Pembayai TUNAI
TERIMA KASIH
ATAS KUNJUNGAN ANDA
PEMESANAN xxxxxxxxxxxx
KRITIK DAN SARAN xxxxxxxxxxxx

As you can see, the date is fragmented (only 02 and 2026 survive) and the item name is partially corrupted. The gaps in the scanned text are my biggest obstacle when reconstructing the receipt's layout, but beyond that, the point stands: bad data produces bad output.
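To show why the date cannot be recovered from this line, here is a minimal sketch of a strict date extractor. It is not part of the project described here; the function name and the assumed day/month/year order are illustrative assumptions.

```python
import re
from datetime import date
from typing import Optional

# Assumes day/month/year order, separated by "/" or "-" with optional spaces.
DATE_RE = re.compile(r"(\d{1,2})\s*[/-]\s*(\d{1,2})\s*[/-]\s*(\d{4})")


def extract_date(ocr_line: str) -> Optional[date]:
    """Return a date only if day, month, and year are all present and valid."""
    match = DATE_RE.search(ocr_line)
    if match is None:
        return None
    day, month, year = (int(g) for g in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 produced by a misread digit
        return None


blurry = "/02/ 10115 /2026 Kasir:KASIK Jam 10:57"  # OCR line from the receipt above
print(extract_date(blurry))  # None: the day is missing, so nothing is guessed

clean = "13/02/2026"  # hypothetical well-scanned date for comparison
print(extract_date(clean))  # 2026-02-13
```

Returning None instead of guessing keeps the missing value visible to whatever consumes the result.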

Then I passed this into an LLM (running via Ollama, model: GPT-OSS:120b-cloud):

response data: {'receipt_id': 'c5f962ce-1a48-432e-a4d2-a24ea048597f', 'merchant_name': {'value': 'WARUNG SAYUR KIIA'}, 'date': {'value': None}, 'time': {'value': '10:57'}, 'items': [{'name': {'value': 'NGKUNG BALADO'}, 'qty': {'value': 1}, 'price': {'value': 8000}, 'total_price': {'value': 8000}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}], 'total_amount': {'value': 8000}}
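A response like this can be checked mechanically before it reaches the backend. The sketch below is a minimal illustration rather than the project's actual code: the REQUIRED tuple is an assumption about which top-level fields a dashboard could not do without, and the dictionary is copied from the response above.

```python
# Assumed set of fields that must not be empty; adjust to the real schema.
REQUIRED = ("merchant_name", "date", "time", "total_amount")


def missing_required(response: dict, required=REQUIRED) -> list[str]:
    """List required top-level fields whose wrapped value is None or absent."""
    return [f for f in required if response.get(f, {}).get("value") is None]


response = {
    "receipt_id": "c5f962ce-1a48-432e-a4d2-a24ea048597f",
    "merchant_name": {"value": "WARUNG SAYUR KIIA"},
    "date": {"value": None},
    "time": {"value": "10:57"},
    "items": [
        {
            "name": {"value": "NGKUNG BALADO"},
            "qty": {"value": 1},
            "price": {"value": 8000},
            "total_price": {"value": 8000},
        }
    ],
    "total_amount": {"value": 8000},
}

print(missing_required(response))  # ['date']
```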

The model cleaned up some parts, especially the merchant name, but:

  • The date is completely missing
  • Item names are still incomplete

Here's another example:

Case 2: Dirty Receipt

Dirty receipt

Now what about a noisy, dirty receipt? Here's what I got:

ALFAMART KOMARASAN SUKOHARDO
KWARASAN SUKOHAROZF
CV ANXSRAH MARATA
JL. RAYA SOLO 4 BAKI NO. 24A ABRT OR
NPWP : 73  |  030.3-532060
Bon  |  701 13024083 Kasir POVRE
LERL AIR5L  |  1 16,090 16.0e
Disc  |  500
KP BRAMDING (5) 1 200 260
CNIONHRN (S) 1 200 200
Total Item! 16.009
Total Disc 589
Total Belanja 15, ,500
Tunai 15, 500
Kerbalian  |  D
PPN  |  DPP: 14. 324 PPM: 1.575
Tgl.  |  13- 02- 2026 16:30:18 V. 2025 11. 6

As you can see, the OCR got some information right, while other parts are quite broken. The LLM returned this output:

response data: {'receipt_id': 'ff78be41-a3d7-4f35-b9ca-e54ca7eb5b9a', 'merchant_name': {'value': 'ALFAMART'}, 'date': {'value': '2026-02-13'}, 'time': {'value': '16:30'}, 'items': [{'name': {'value': 'LERL AIR5L'}, 'qty': {'value': 1}, 'price': {'value': 16090}, 'total_price': {'value': 15590}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': 'nominal'}, 'discount_value': {'value': 500}, 'voucher_amount': {'value': 0}}, {'name': {'value': 'KP BRAMDING (5)'}, 'qty': {'value': 1}, 'price': {'value': 200}, 'total_price': {'value': 260}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}, {'name': {'value': '(S)'}, 'qty': {'value': 1}, 'price': {'value': 200}, 'total_price': {'value': 200}, 'category': {'value': 'Food & Beverage'}, 'discount_type': {'value': None}, 'discount_value': {'value': 0}, 'voucher_amount': {'value': 0}}], 'total_amount': {'value': 16050}}

Interestingly:

  • The date and time were reconstructed correctly
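  • But item names and pricing details are inconsistent: the second item has a price of 200 but a total_price of 260, and total_amount is 16050 while the receipt's Total Belanja line appears to read 15,500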

The body (line items) is where things degrade the most.
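Inconsistencies in the line items can be caught with simple arithmetic. The sketch below uses the values from the Case 2 response, flattened for brevity. It assumes that a line total should equal qty × price minus the discount, and it reads the OCR'd "Total Belanja 15, ,500" as 15500. Both are assumptions, not rules confirmed by the receipt format.

```python
# Values taken from the Case 2 LLM response, flattened for brevity.
items = [
    {"name": "LERL AIR5L", "qty": 1, "price": 16090, "discount": 500, "total": 15590},
    {"name": "KP BRAMDING (5)", "qty": 1, "price": 200, "discount": 0, "total": 260},
    {"name": "(S)", "qty": 1, "price": 200, "discount": 0, "total": 200},
]
total_amount = 16050   # total_amount reported by the LLM
printed_total = 15500  # assumed reading of "Total Belanja 15, ,500"


def inconsistent_items(items: list[dict]) -> list[str]:
    """Names of items where qty * price - discount differs from the line total."""
    return [
        it["name"]
        for it in items
        if it["qty"] * it["price"] - it["discount"] != it["total"]
    ]


bad = inconsistent_items(items)
sum_matches = sum(it["total"] for it in items) == total_amount
matches_receipt = total_amount == printed_total

print(bad)              # ['KP BRAMDING (5)']
print(sum_matches)      # True: the LLM summed its own line totals
print(matches_receipt)  # False: that sum disagrees with the printed total
```

The response is internally consistent at the total level, yet still disagrees with the receipt, so a check against the printed total catches what a sum-of-items check alone would miss.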

Key Insight

This pipeline highlights something important:

LLMs don’t “fix” bad data; they interpret it

If the OCR output is already corrupted/broken:

  • Missing tokens → missing fields
  • Wrong tokens → hallucinated or incorrect values

Conclusion

Whatever we feed into LLMs, whether through prompting, preprocessing, or direct model input, its quality directly shapes the output.

Or, as the principle puts it:

Garbage In, Garbage Out