Entity resolution
1. Entity resolution
While structured outputs describe how the data is returned, the data itself is still vulnerable to inconsistencies.
2. LLMs are stateless
Every LLM call is stateless and made in isolation. This means that even with precise instructions, the LLM can return two different descriptions of the same data in two different calls.
3. Entity suggestions
For a small dataset like ours, we can query our knowledge graph using Cypher to find an exhaustive list of characters.
4. Entity suggestions
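As a minimal sketch, the character list could be fetched and then folded into a prompt along the following lines. The `Character` label, the `name` property, and both helper functions are assumptions made for illustration; `session` is assumed to be a Neo4j Python driver session.

```python
# Hypothetical schema: nodes labelled Character with a `name` property.
CHARACTER_LIST_QUERY = """
MATCH (c:Character)
RETURN DISTINCT c.name AS name
ORDER BY name
"""


def fetch_character_names(session):
    """Run the query on a Neo4j driver session and return the names."""
    return [record["name"] for record in session.run(CHARACTER_LIST_QUERY)]


def build_system_prompt(names):
    """Fold the known character names into a grounding system prompt."""
    bullet_list = "\n".join(f"- {name}" for name in names)
    return (
        "You are extracting characters from Romeo and Juliet.\n"
        "Only use character names from the list below, spelled exactly "
        "as shown. Ignore any character that is not in the list.\n\n"
        f"{bullet_list}"
    )


# Example with a hand-written list instead of a live database.
print(build_system_prompt(["Balthasar", "Juliet Capulet", "Romeo Montague"]))
```

The wording of the prompt is only one possibility; the important part is that the list comes from the graph rather than from the model.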
To ground the LLM's output in these characters, we can include this list of entities in a system prompt along with instructions to ignore any that aren't included in the list. For larger datasets, providing examples up front may not be possible. In these cases, we can use node properties and relationships to identify duplicate nodes. Let's take an example from our Romeo and Juliet dataset.
5. Graph-based entity resolution
The LLM has hallucinated a character called Romeo Capulet.
6. Graph-based entity resolution
A quick Cypher statement reveals that both Romeo Capulet and Romeo Montague often speak to Juliet Capulet.
7. Graph-based entity resolution
Both of them also interacted positively with Balthasar.
8. Graph-based entity resolution
Balthasar's relationships, in turn, reveal that he is employed by the Montague family.
9. Graph-based entity resolution
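The kind of check described so far, looking for nodes that both Romeos are connected to, might be sketched as follows. The query is an illustration only: the `Character` label and every id other than `romeo-capulet` are assumptions. The small in-memory helper mirrors what the query computes, so the idea can be tried without a database.

```python
# Hypothetical query: nodes that sit between two given characters.
SHARED_CONTACTS_QUERY = """
MATCH (a:Character {id: $first_id})-[r1]-(shared)-[r2]-(b:Character {id: $second_id})
RETURN shared.name AS shared, type(r1) AS first_rel, type(r2) AS second_rel
"""


def shared_neighbours(edges, first, second):
    """Return the nodes directly connected to both `first` and `second`."""
    neighbours = {}
    for a, b in edges:
        # Treat every edge as undirected, like the `-[r]-` pattern above.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours.get(first, set()) & neighbours.get(second, set())


# Toy edge list based on the facts in the narration (ids are assumed).
EDGES = [
    ("romeo-capulet", "juliet-capulet"),
    ("romeo-montague", "juliet-capulet"),
    ("romeo-capulet", "balthasar"),
    ("romeo-montague", "balthasar"),
    ("balthasar", "montague-family"),
]

print(sorted(shared_neighbours(EDGES, "romeo-capulet", "romeo-montague")))
# → ['balthasar', 'juliet-capulet']
```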
Each of these facts can be assigned a score. If the total score is high enough, the pair can be flagged for a human to check, a SIMILAR_TO relationship can be created between the two nodes, or the characters can be merged into a single node with a Cypher statement.
10. Identifying similar people
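A rough sketch of a candidate query and a scoring step is shown here. The `Character` label, the first-name comparison, the weights, and the thresholds are all assumptions chosen for illustration, not values from the course.

```python
# Hypothetical candidate query: characters reachable from the source
# node through one intermediate character.
CANDIDATE_QUERY = """
MATCH path = (source:Character {id: $id})--(shared:Character)--(candidate:Character)
WHERE candidate <> source
WITH source, candidate,
     count(DISTINCT shared) AS nodes_in_common,
     count(path) AS paths
RETURN candidate.id AS id,
       candidate.name AS name,
       split(candidate.name, ' ')[0] = split(source.name, ' ')[0] AS shares_first_name,
       EXISTS {
         MATCH (source)-[:BELONGS_TO]->(family)<-[:BELONGS_TO]-(candidate)
       } AS same_family,
       nodes_in_common,
       paths
ORDER BY nodes_in_common DESC
"""

# Arbitrary example weights; tune these for a real dataset.
DEFAULT_WEIGHTS = {
    "shares_first_name": 3,
    "same_family": 2,
    "per_node_in_common": 1,
}


def score_candidate(row, weights=None):
    """Turn one result row into a single score."""
    weights = weights or DEFAULT_WEIGHTS
    score = 0
    if row["shares_first_name"]:
        score += weights["shares_first_name"]
    if row["same_family"]:
        score += weights["same_family"]
    score += weights["per_node_in_common"] * row["nodes_in_common"]
    return score


def choose_action(score, review_at=3, link_at=5, merge_at=8):
    """Map a score to one of the actions described in the narration."""
    if score >= merge_at:
        return "merge"
    if score >= link_at:
        return "similar_to"
    if score >= review_at:
        return "flag_for_review"
    return "ignore"


row = {"shares_first_name": True, "same_family": False, "nodes_in_common": 2}
print(score_candidate(row), choose_action(score_candidate(row)))
# → 5 similar_to
```

Keeping the scoring in application code rather than in the query makes the weights easy to adjust, although the same rules could equally be expressed in Cypher.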
This Cypher statement finds all characters that share relationships to the same characters. To find potential duplicates of romeo-capulet, we match all paths that go from that node to any other character, and on from that character to another character. The query returns the name and id of each potential duplicate, along with conditions that can be used to build up a set of rules for scoring candidates. Here we have a condition for whether they share a name, whether they have relationships to the same family through the BELONGS_TO relationship, the number of nodes they have in common, and how many paths exist between the nodes. Each condition can be assigned an arbitrary score, and action can be taken if certain criteria are met. Here, we just order by the number of nodes in common.
11. Using similarity relationships
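A sketch of how a SIMILAR_TO relationship might be created and then used as a starting point is given here. It assumes that "zero-star" refers to a variable-length pattern with a lower bound of zero, such as `*0..1`. The `Character` and `Line` labels, the `SPEAKS` relationship, the `text` property, and the `romeo-montague` id are hypothetical. The small Python function imitates the zero-or-one hop so that its behaviour can be checked without a database.

```python
# Hypothetical: link two suspected duplicates for traceability.
LINK_QUERY = """
MATCH (a:Character {id: $first_id}), (b:Character {id: $second_id})
MERGE (a)-[:SIMILAR_TO]->(b)
"""

# Hypothetical: start at a node, or at a node one SIMILAR_TO hop away,
# then continue the pattern from whichever node was matched.
LINES_QUERY = """
MATCH (:Character {id: $id})-[:SIMILAR_TO*0..1]-(speaker:Character)
MATCH (speaker)-[:SPEAKS]->(line:Line)
RETURN speaker.name AS speaker, line.text AS text
"""


def zero_or_one_hop(start, similar_pairs):
    """Imitate `-[:SIMILAR_TO*0..1]-`: the start node itself (zero hops)
    plus any node linked to it by a single SIMILAR_TO pair (one hop)."""
    reachable = {start}
    for a, b in similar_pairs:
        if a == start:
            reachable.add(b)
        elif b == start:
            reachable.add(a)
    return reachable


pairs = [("romeo-capulet", "romeo-montague")]
print(sorted(zero_or_one_hop("romeo-montague", pairs)))
# → ['romeo-capulet', 'romeo-montague']
```

Because the lower bound is zero, a character with no SIMILAR_TO relationships still matches as its own starting point, so the same query works for every character.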
We can merge the nodes together, or create a SIMILAR_TO relationship between them for traceability. We can then use the zero-star Cypher technique to start at a node and use either that node, or a node one relationship away that meets the condition, as the starting point for a pattern. In this case, the pattern finds all lines connected to Romeo Montague or Romeo Capulet.
12. Let's practice!
It's time to try it for yourself!