Semi-Structured Data Processing

What is semi-structured data?

Semi-structured data refers to data that does not follow the rigid schema of a traditional relational database, but still preserves some level of structure or organization. Unlike completely unstructured data, semi-structured data contains patterns that make it possible to parse and analyze it without enforcing a strict schema upfront.

Common semi-structured formats are JSON (JavaScript Object Notation) and XML (eXtensible Markup Language), both widely used to represent information as a flexible hierarchy. In practice, this means JSON documents with nested lists, objects, and attributes that vary from record to record, or XML files whose nested tags represent hierarchical information.

Examples of semi-structured data

When working with APIs, it is very common to receive nested data. In other words, information is organized as a hierarchy, like boxes inside other boxes. A practical JSON example is shown below:

{
  "user": {
    "id": "123456",
    "name": "Alice",
    "email": "alice@example.com",
    "birthday": "1990-05-15",
    "purchases": [
      {
        "productId": "789",
        "productName": "Interesting Book",
        "price": 29.99,
        "purchaseDate": "2023-02-23"
      },
      {
        "productId": "456",
        "productName": "Fun Mug",
        "price": 12.99,
        "purchaseDate": "2023-01-10"
      }
    ]
  }
}
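To make the "boxes inside other boxes" idea concrete, the sketch below reads a shortened copy of the JSON document above using only Python's standard `json` module. The variable names (`raw`, `product_names`, `total_spent`) are illustrative choices, not part of any Dadosfera API.

```python
import json

# Shortened copy of the example document, embedded as a string
# so the sketch is self-contained.
raw = """
{
  "user": {
    "id": "123456",
    "name": "Alice",
    "purchases": [
      {"productId": "789", "productName": "Interesting Book", "price": 29.99},
      {"productId": "456", "productName": "Fun Mug", "price": 12.99}
    ]
  }
}
"""

data = json.loads(raw)

# Each level of nesting becomes one more key lookup.
user = data["user"]

# The "purchases" list holds one object per purchase.
product_names = [purchase["productName"] for purchase in user["purchases"]]
total_spent = round(sum(purchase["price"] for purchase in user["purchases"]), 2)

print(user["name"], product_names, total_spent)
```

Every additional level of nesting adds another lookup or loop, which is part of what makes deeply nested documents awkward to analyze directly.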

An XML example is shown below:

<empresa>
    <nome>ABC Ltda.</nome>
    <departamentos>
        <departamento>
            <nome>Vendas</nome>
            <funcionarios>20</funcionarios>
        </departamento>
        <departamento>
            <nome>TI</nome>
            <funcionarios>15</funcionarios>
        </departamento>
    </departamentos>
</empresa>
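The same kind of traversal applies to XML. The sketch below parses the document above with Python's standard `xml.etree.ElementTree` module and collects the headcount of each department. The names `raw`, `company`, and `headcount` are illustrative assumptions, and the tag names are kept exactly as they appear in the example.

```python
import xml.etree.ElementTree as ET

# Copy of the example document, embedded as a string so the sketch is self-contained.
raw = """
<empresa>
    <nome>ABC Ltda.</nome>
    <departamentos>
        <departamento>
            <nome>Vendas</nome>
            <funcionarios>20</funcionarios>
        </departamento>
        <departamento>
            <nome>TI</nome>
            <funcionarios>15</funcionarios>
        </departamento>
    </departamentos>
</empresa>
"""

root = ET.fromstring(raw.strip())

# findtext looks only at direct children, so this is the company name,
# not a department name.
company = root.findtext("nome")

# iter walks every descendant with the given tag, at any depth.
headcount = {
    dept.findtext("nome"): int(dept.findtext("funcionarios"))
    for dept in root.iter("departamento")
}

print(company, headcount, sum(headcount.values()))
```

Note that XML text values are always strings, so numeric fields such as the employee count need an explicit conversion.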

Analyzing data in these formats can be challenging because nested values do not map directly onto the rows and columns that most analysis tools expect, which is why unnesting (flattening) is a common step. In Dadosfera, you can do this with Python or SQL by using the Intelligence module or the Query module.
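As a rough illustration of what flattening means, the sketch below turns the nested user document from the JSON example into one flat row per purchase, repeating the user's fields on every row. It is plain Python, not a Dadosfera-specific API: the helper `flatten_purchases` and the `user_` column prefix are hypothetical choices made for this example.

```python
def flatten_purchases(document: dict) -> list[dict]:
    """Return one flat row per purchase, repeating the parent user's fields."""
    user = document["user"]
    # Scalar user fields are copied onto every row, prefixed to avoid name clashes.
    parent = {f"user_{key}": value for key, value in user.items() if key != "purchases"}
    # The nested list is expanded: each purchase becomes its own row.
    return [{**parent, **purchase} for purchase in user.get("purchases", [])]


# Shortened copy of the example document, kept inline so the sketch is self-contained.
document = {
    "user": {
        "id": "123456",
        "name": "Alice",
        "purchases": [
            {"productId": "789", "productName": "Interesting Book", "price": 29.99},
            {"productId": "456", "productName": "Fun Mug", "price": 12.99},
        ],
    }
}

rows = flatten_purchases(document)
for row in rows:
    print(row)
```

Each resulting row has the same set of keys, so the output can be loaded into a table. The trade-off is that parent fields are duplicated once per child record, and a user with no purchases produces no rows at all.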
