.. Copyright (C) 2001-2019 NLTK Project
.. For license information, see LICENSE.TXT

=======
Chat-80
=======

Chat-80 was a natural language system which allowed the user to
interrogate a Prolog knowledge base in the domain of world
geography. It was developed in the early '80s by Warren and Pereira; see
`<http://acl.ldc.upenn.edu/J/J82/J82-3002.pdf>`_ for a description and
`<http://www.cis.upenn.edu/~pereira/oldies.html>`_ for the source
files.

The ``chat80`` module contains functions to extract data from the Chat-80
relation files ('the world database'), and convert them into a format
that can be incorporated in the FOL models of
``nltk.sem.evaluate``. The code assumes that the Prolog
input files are available in the NLTK corpora directory.
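
For example, the extraction can be run end to end roughly as follows (a
minimal sketch, assuming the Prolog files are installed with the NLTK data;
both functions are demonstrated in detail below)::

    from nltk.sem import chat80

    # Extract Concept objects from one relation file ...
    schema = ['city', 'country', 'population']
    concepts = chat80.clause2concepts('cities.pl', 'city', schema)

    # ... and fold their extensions into a single Valuation.
    val = chat80.make_valuation(concepts, read=True)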

The Chat-80 World Database consists of the following files::

    world0.pl
    rivers.pl
    cities.pl
    countries.pl
    contain.pl
    borders.pl

This module uses a slightly modified version of ``world0.pl``, in which
a set of Prolog rules has been omitted. The modified file is named
``world1.pl``. Currently, the file ``rivers.pl`` is not read in, since
it uses a list rather than a string in the second field.

Reading Chat-80 Files
=====================

Chat-80 relations are like tables in a relational database. The
relation symbol acts as the name of the table; the first argument acts
as the 'primary key'; and subsequent arguments are further fields in
the table. In general, the name of the table provides a label for a
unary predicate whose extension is all the primary keys. For example,
relations in ``cities.pl`` are of the following form::

    'city(athens,greece,1368).'

Here, ``'athens'`` is the key, and will be mapped to a member of the
unary predicate *city*.

By analogy with NLTK corpora, ``chat80`` defines a number of 'items'
which correspond to the relations.

>>> from nltk.sem import chat80
>>> print(chat80.items) # doctest: +ELLIPSIS
('borders', 'circle_of_lat', 'circle_of_long', 'city', ...)

The fields in the table are mapped to binary predicates. The first
argument of the predicate is the primary key, while the second
argument is the data in the relevant field. Thus, in the above
example, the third field is mapped to the binary predicate
*population_of*, whose extension is a set of pairs such as
``'(athens, 1368)'``.
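
In outline, the mapping from a Prolog clause to these predicates can be
pictured as follows (an illustrative sketch only, not the module's own
parsing code; the helper ``parse_clause`` is invented for the example)::

    def parse_clause(clause):
        """Split "rel(key,f1,f2)." into a relation name and its fields."""
        rel, _, body = clause.partition('(')
        fields = body.rstrip(').').split(',')
        key, rest = fields[0], fields[1:]
        # The key joins the unary predicate named by the relation ...
        unary = (rel, key)
        # ... and each remaining field yields a pair for a binary predicate.
        binary = [(key, value) for value in rest]
        return unary, binary

    parse_clause('city(athens,greece,1368).')
    # (('city', 'athens'), [('athens', 'greece'), ('athens', '1368')])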

An exception to this general framework is required by the relations in
the files ``borders.pl`` and ``contain.pl``. These contain facts of the
following form::

    'borders(albania,greece).'

    'contains0(africa,central_africa).'

We do not want to form a unary concept out of the element in
the first field of these records, and we want the label of the binary
relation to be just ``'border'`` and ``'contain'`` respectively.

In order to drive the extraction process, we use 'relation metadata bundles'
which are Python dictionaries such as the following::

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

According to this, the file ``city['filename']`` contains a list of
relational tuples (or more accurately, the corresponding strings in
Prolog form) whose predicate symbol is ``city['label']`` and whose
relational schema is ``city['schema']``. The notion of a ``closure`` is
discussed in the next section.
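
Such a bundle supplies everything needed to drive the extraction for one
relation; a minimal sketch, assuming that ``clause2concepts`` (shown in the
next section) also accepts a ``closures`` keyword argument::

    from nltk.sem import chat80

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

    # Each field of the bundle supplies one argument of the extraction call.
    concepts = chat80.clause2concepts(city['filename'],
                                      city['label'],
                                      city['schema'],
                                      closures=city['closures'])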

Concepts
========

In order to encapsulate the results of the extraction, a class of
``Concept``\ s is introduced. A ``Concept`` object has a number of
attributes, in particular a ``prefLabel``, an ``arity`` and an
``extension``.

>>> c1 = chat80.Concept('dog', arity=1, extension=set(['d1', 'd2']))
>>> print(c1)
Label = 'dog'
Arity = 1
Extension = ['d1', 'd2']

The ``extension`` attribute makes it easier to inspect the output of
the extraction.

>>> schema = ['city', 'country', 'population']
>>> concepts = chat80.clause2concepts('cities.pl', 'city', schema)
>>> concepts
[Concept('city'), Concept('country_of'), Concept('population_of')]
>>> for c in concepts: # doctest: +NORMALIZE_WHITESPACE
...     print("%s:\n\t%s" % (c.prefLabel, c.extension[:4]))
city:
    ['athens', 'bangkok', 'barcelona', 'berlin']
country_of:
    [('athens', 'greece'), ('bangkok', 'thailand'), ('barcelona', 'spain'), ('berlin', 'east_germany')]
population_of:
    [('athens', '1368'), ('bangkok', '1178'), ('barcelona', '1280'), ('berlin', '3481')]

In addition, the ``extension`` can be further
processed: in the case of the ``'border'`` relation, we take the
**symmetric** closure, and in the case of the ``'contain'``
relation, we carry out the **transitive closure**. The closure
properties associated with a concept are indicated in the relation
metadata, as mentioned earlier.

>>> borders = set([('a1', 'a2'), ('a2', 'a3')])
>>> c2 = chat80.Concept('borders', arity=2, extension=borders)
>>> print(c2)
Label = 'borders'
Arity = 2
Extension = [('a1', 'a2'), ('a2', 'a3')]
>>> c3 = chat80.Concept('borders', arity=2, closures=['symmetric'], extension=borders)
>>> c3.close()
>>> print(c3)
Label = 'borders'
Arity = 2
Extension = [('a1', 'a2'), ('a2', 'a1'), ('a2', 'a3'), ('a3', 'a2')]
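
For the ``'contain'`` relation, the same ``close()`` step computes a
transitive closure. The underlying idea can be sketched independently of the
module as follows (an illustrative implementation, not the one used by
``chat80``)::

    def transitive_closure(pairs):
        """Repeatedly add (x, z) whenever (x, y) and (y, z) are present."""
        closure = set(pairs)
        while True:
            new = {(x, z) for (x, y) in closure
                          for (y2, z) in closure if y == y2}
            if new <= closure:
                return closure
            closure |= new

    transitive_closure({('africa', 'central_africa'),
                        ('central_africa', 'chad')})
    # {('africa', 'central_africa'), ('central_africa', 'chad'),
    #  ('africa', 'chad')}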

The ``extension`` of a ``Concept`` object is then incorporated into a
``Valuation`` object.

Persistence
===========

The functions ``val_dump`` and ``val_load`` are provided to allow a
valuation to be stored in a persistent database and re-loaded, rather
than having to be re-computed each time.
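
In outline the round trip looks like this (a sketch only: it assumes that
``val_dump`` takes a list of relation metadata bundles plus a database
filename, that ``val_load`` returns the stored ``Valuation``, and the
filename ``'chat80.db'`` is just an example)::

    from nltk.sem import chat80

    city = {'label': 'city',
            'closures': [],
            'schema': ['city', 'country', 'population'],
            'filename': 'cities.pl'}

    # Build the valuation once and store it ...
    chat80.val_dump([city], 'chat80.db')

    # ... then in a later session simply reload it.
    val = chat80.val_load('chat80.db')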

Individuals and Lexical Items
=============================

As well as deriving relations from the Chat-80 data, we also create a
set of individual constants, one for each entity in the domain. The
individual constants are string-identical to the entities. For
example, given a data item such as ``'zloty'``, we add to the valuation
a pair ``('zloty', 'zloty')``. In order to parse English sentences that
refer to these entities, we also create a lexical item such as the
following for each individual constant::

    PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'

The set of rules is written to the file ``chat_pnames.fcfg`` in the
current directory.
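
Rules in this format can be generated mechanically; a minimal sketch
(illustrative only, not the module's own code; the helper ``pname_rule`` is
invented for the example)::

    def pname_rule(entity):
        """Build one proper-name production for an individual constant."""
        return ("PropN[num=sg, sem=<\\P.(P %s)>] -> '%s'"
                % (entity, entity.capitalize()))

    pname_rule('zloty')
    # "PropN[num=sg, sem=<\P.(P zloty)>] -> 'Zloty'"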

SQL Query
=========

The ``city`` relation is also available in RDB form and can be queried
using SQL statements.

>>> import nltk
>>> q = "SELECT City, Population FROM city_table WHERE Country = 'china' and Population > 1000"
>>> for answer in chat80.sql_query('corpora/city_database/city.db', q):
...     print("%-10s %4s" % answer)
canton     1496
chungking  1100
mukden     1551
peking     2031
shanghai   5407
tientsin   1795

The (deliberately naive) grammar ``sql0.fcfg`` translates from English
to SQL:

>>> nltk.data.show_cfg('grammars/book_grammars/sql0.fcfg')
% start S
S[SEM=(?np + WHERE + ?vp)] -> NP[SEM=?np] VP[SEM=?vp]
VP[SEM=(?v + ?pp)] -> IV[SEM=?v] PP[SEM=?pp]
VP[SEM=(?v + ?ap)] -> IV[SEM=?v] AP[SEM=?ap]
NP[SEM=(?det + ?n)] -> Det[SEM=?det] N[SEM=?n]
PP[SEM=(?p + ?np)] -> P[SEM=?p] NP[SEM=?np]
AP[SEM=?pp] -> A[SEM=?a] PP[SEM=?pp]
NP[SEM='Country="greece"'] -> 'Greece'
NP[SEM='Country="china"'] -> 'China'
Det[SEM='SELECT'] -> 'Which' | 'What'
N[SEM='City FROM city_table'] -> 'cities'
IV[SEM=''] -> 'are'
A[SEM=''] -> 'located'
P[SEM=''] -> 'in'

Given this grammar, we can express, and then execute, queries in English.

>>> cp = nltk.parse.load_parser('grammars/book_grammars/sql0.fcfg')
>>> query = 'What cities are in China'
>>> for tree in cp.parse(query.split()):
...     answer = tree.label()['SEM']
...     q = " ".join(answer)
...     print(q)
...
SELECT City FROM city_table WHERE Country="china"

>>> rows = chat80.sql_query('corpora/city_database/city.db', q)
>>> for r in rows: print("%s" % r, end=' ')
canton chungking dairen harbin kowloon mukden peking shanghai sian tientsin

Using Valuations
-----------------

In order to convert such an extension into a valuation, we use the
``make_valuation()`` function; setting ``read=True`` creates and returns
a new ``Valuation`` object which contains the results.

>>> val = chat80.make_valuation(concepts, read=True)
>>> 'calcutta' in val['city']
True
>>> [town for (town, country) in val['country_of'] if country == 'india']
['bombay', 'calcutta', 'delhi', 'hyderabad', 'madras']
>>> dom = val.domain
>>> g = nltk.sem.Assignment(dom)
>>> m = nltk.sem.Model(dom, val)
>>> m.evaluate(r'population_of(jakarta, 533)', g)
True