Entity resolution

1. Entity resolution

While structured outputs describe how the data is returned, the data itself is still vulnerable to inconsistencies.

2. LLMs are stateless

Every LLM call is stateless and made in isolation. This means that even with precise instructions, it is possible for the LLM to return two different descriptions of the same data in two different calls.

3. Entity suggestions

For a small dataset like ours, we can query our knowledge graph using Cypher to find an exhaustive list of characters.

4. Entity suggestions

To ground the LLM's output in these characters, we can include this list of entities in a system prompt along with instructions to ignore any that aren't included in the list. For larger datasets, providing examples up front may not be possible. In these cases, we can use node properties and the relationships to identify duplicate nodes. Let's take an example from our Romeo and Juliet dataset.
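For the small-dataset case, the grounding step might look something like the sketch below. The character list is hard-coded and hypothetical (in practice it would come from the Cypher query against the knowledge graph), and `build_system_prompt` is an illustrative helper name, not part of any library.

```python
# Hypothetical character list; in practice this would be the result of a
# Cypher query against the knowledge graph.
KNOWN_CHARACTERS = ["Romeo Montague", "Juliet Capulet", "Balthasar"]


def build_system_prompt(known_entities):
    """Return a system prompt that restricts the model to known entities."""
    bullet_list = "\n".join(f"- {name}" for name in sorted(known_entities))
    return (
        "Only use characters from the list below, spelled exactly as shown.\n"
        "Ignore any character that is not in the list.\n\n"
        f"Known characters:\n{bullet_list}"
    )


system_prompt = build_system_prompt(KNOWN_CHARACTERS)
```

The resulting string would be passed as the system message on every call, since each call is made in isolation and cannot remember the list from a previous one.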

5. Graph-based entity resolution

The LLM has hallucinated a character called Romeo Capulet.

6. Graph-based entity resolution

A quick Cypher statement reveals that both Romeo Capulet and Romeo Montague often speak to Juliet Capulet,

7. Graph-based entity resolution

and they both interacted positively with Balthasar,

8. Graph-based entity resolution

whose relationships reveal that he is employed by the Montague Family.

9. Graph-based entity resolution

Each of these facts can be assigned a score. If the combined score is high enough, the pair can be flagged for a human to check, a SIMILAR_TO relationship can be created between the two nodes, or the characters can be merged into a single node with a Cypher statement.
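A rough sketch of that scoring idea follows. The weights, the thresholds, the evidence names, and the decision to treat the three actions as escalating tiers are all assumptions made for illustration; none of them come from the course material.

```python
# Hypothetical weights and thresholds; tune these for your own data.
WEIGHTS = {
    "shares_first_name": 2.0,
    "speaks_to_same_character": 1.0,
    "common_positive_interaction": 1.0,
    "linked_to_same_family": 1.5,
}
REVIEW_THRESHOLD = 2.0
SIMILAR_THRESHOLD = 3.5
MERGE_THRESHOLD = 5.0


def score_candidate(evidence):
    """Sum the weights of every piece of evidence that holds."""
    return sum(WEIGHTS[name] for name, holds in evidence.items() if holds)


def choose_action(score):
    """Map a score onto one of the three actions, or no action at all."""
    if score >= MERGE_THRESHOLD:
        return "merge"
    if score >= SIMILAR_THRESHOLD:
        return "similar_to"
    if score >= REVIEW_THRESHOLD:
        return "flag_for_review"
    return "ignore"


# Evidence for the Romeo Capulet / Romeo Montague pair (family link assumed False).
evidence = {
    "shares_first_name": True,
    "speaks_to_same_character": True,
    "common_positive_interaction": True,
    "linked_to_same_family": False,
}
score = score_candidate(evidence)
action = choose_action(score)
```

Keeping the weights in one dictionary makes it easy to adjust a single rule without touching the decision logic.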

10. Identifying similar people

This Cypher statement finds all characters that share relationships to the same characters. To find potential duplicates of romeo-capulet, we match all paths that start at that node, go to any other character, and continue beyond that character to another character. The query returns the name and id of each potential duplicate, along with conditions that build up a set of rules for scoring candidates. Here the conditions are whether the two share a name, whether they have relationships to the same family through the BELONGS_TO relationship, the number of nodes they have in common, and how many paths exist between them. Each condition can be assigned an arbitrary score, and action taken if certain criteria are met. Here, we just order by the number of nodes in common.
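The same two-hop logic can be sketched in plain Python over an in-memory edge list. This is not the course's Cypher statement: the edges, the relationship types other than BELONGS_TO, the ids, and the `find_candidates` helper are all hypothetical and exist only to illustrate the traversal and the returned conditions.

```python
from collections import defaultdict

# Toy, hand-written edges (hypothetical; not the course dataset).
EDGES = [
    ("romeo-montague", "SPEAKS_TO", "juliet-capulet"),
    ("romeo-capulet", "SPEAKS_TO", "juliet-capulet"),
    ("nurse", "SPEAKS_TO", "juliet-capulet"),
    ("romeo-montague", "INTERACTED_WITH", "balthasar"),
    ("romeo-capulet", "INTERACTED_WITH", "balthasar"),
    ("romeo-montague", "BELONGS_TO", "montague-family"),
    ("romeo-capulet", "BELONGS_TO", "capulet-family"),
]
NAMES = {
    "romeo-montague": "Romeo Montague",
    "romeo-capulet": "Romeo Capulet",
    "juliet-capulet": "Juliet Capulet",
    "balthasar": "Balthasar",
    "nurse": "Nurse",
}


def find_candidates(target_id):
    """Return characters two hops from target_id, most shared nodes first."""
    neighbours = defaultdict(list)  # character <-> character, undirected
    families = defaultdict(set)     # character -> families via BELONGS_TO
    for source, rel_type, dest in EDGES:
        if rel_type == "BELONGS_TO":
            families[source].add(dest)
        else:
            neighbours[source].append(dest)
            neighbours[dest].append(source)

    shared = defaultdict(set)  # candidate -> intermediate characters
    paths = defaultdict(int)   # candidate -> number of two-hop paths
    for middle in neighbours[target_id]:
        for other in neighbours[middle]:
            if other != target_id:
                shared[other].add(middle)
                paths[other] += 1

    first_name = NAMES[target_id].split()[0]
    rows = [
        {
            "id": other,
            "name": NAMES[other],
            "same_first_name": NAMES[other].split()[0] == first_name,
            "same_family": bool(families[target_id] & families[other]),
            "nodes_in_common": len(shared[other]),
            "paths": paths[other],
        }
        for other in shared
    ]
    return sorted(rows, key=lambda row: row["nodes_in_common"], reverse=True)


candidates = find_candidates("romeo-capulet")
```

Each row carries the raw conditions rather than a final score, so the scoring rules can change without rerunning the traversal.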

11. Using similarity relationships

We can merge the nodes together, or create a SIMILAR_TO relationship between them for traceability. We can then use the zero-star Cypher technique, a variable-length pattern with a lower bound of zero, to start at a node and use either that node or a node one SIMILAR_TO relationship away as the starting point for a pattern. In this case, the pattern finds all lines connected to Romeo Montague or Romeo Capulet.
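To make the zero-or-one-hop idea concrete, here is a small Python analogue over toy data. The line ids, the `SPOKEN_BY` pairs, and both helper functions are hypothetical; in the graph itself this would be expressed as a variable-length Cypher pattern rather than Python loops.

```python
# Toy data (hypothetical ids, invented for illustration).
SIMILAR_TO = [("romeo-capulet", "romeo-montague")]
SPOKEN_BY = [
    ("line-1", "romeo-montague"),
    ("line-2", "romeo-capulet"),
    ("line-3", "juliet-capulet"),
]


def zero_or_one_hop(start_id):
    """Return the start node plus every node one SIMILAR_TO hop away."""
    nodes = {start_id}  # zero hops: the start node itself
    for left, right in SIMILAR_TO:  # one hop, in either direction
        if left == start_id:
            nodes.add(right)
        elif right == start_id:
            nodes.add(left)
    return nodes


def lines_for(start_id):
    """Find all lines spoken by the start node or its similar nodes."""
    speakers = zero_or_one_hop(start_id)
    return [line for line, speaker in SPOKEN_BY if speaker in speakers]


romeo_lines = lines_for("romeo-montague")
```

Because the start node is always included, the same lookup still works for characters that have no SIMILAR_TO relationship at all.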

12. Let's practice!

It's time to try it for yourself!
