Making (mini) datasets using JSON files

Ways of thickening, layering, ..., text.


JSON?

JSON is an open standard file format that stores information as objects made of key-value pairs.

It looks like this:

{
    "name": "XPUB1",
    "date": "26-10-2020",
    "number_of_students": 10
}

In a JSON file, you can store strings and numbers, but also lists or nested objects (what Python calls dictionaries).

Note, for example, that the number 10 is written without quotation marks (") around it.
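
A minimal sketch of how this looks from Python, assuming the standard-library `json` module: parsing the example above turns the quoted values into strings and the unquoted 10 into a number. The names `text` and `record` are only used for illustration.

```python
import json

# The JSON example from above, written as a Python string
text = '{"name": "XPUB1", "date": "26-10-2020", "number_of_students": 10}'

# json.loads() parses JSON text into a Python dictionary
record = json.loads(text)

print(type(record['name']))                # quoted value: a string
print(type(record['number_of_students']))  # unquoted value: a number
```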

Dictionaries & JSON files

Python uses a dictionary to store data in the same way, as key-value pairs.

dataset = {}

We will use a dictionary to store the data, and then save it as a JSON file.

In this way, we can add one (or multiple) layer(s) of value to a word ..., such as:

{
    "common": "English",
    "language": "communication system",
    "formal": 6,
    "semantic": ["language", "formal", "informal"],
    "English": {
        "type": "word",
        "number of letters": 7
    }
}
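
The same kind of layered structure can be built up in Python, one key at a time. This is only a sketch: the name `layers` and the values are illustrative assumptions, not part of the dataset made below.

```python
import json

# A hypothetical dictionary in which each word gets a different kind of value
layers = {}
layers['common'] = 'English'                  # a string
layers['formal'] = 6                          # a number
layers['semantic'] = ['language', 'formal']   # a list
layers['language'] = {'type': 'word', 'number of letters': 8}  # another dictionary

# Nested values are reached by chaining keys
print(layers['language']['type'])

# json.dumps() turns the whole structure into JSON text
print(json.dumps(layers, indent=4))
```

Note that Python accepts both single and double quotes for strings, while JSON text only uses double quotes; `json.dumps()` takes care of that difference.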
In [1]:
# Start with an empty dictionary
dataset = {}
# Adding a new key to the dictionary, assigning a string as value:
dataset['new'] = 'NEW WORD'
# or assigning a number as value (this overwrites the string above):
dataset['new'] = 10
In [ ]:
# Printing the tag of a word in the dictionary
# (the key 'are' only exists once the verb dataset below has been made)
print(dataset['are'])
In [ ]:
# Printing all the keys in the dataset
print(dataset.keys())

Making a sample dataset

In [3]:
# This is sample data, a list of words and POS tags:
data = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]
In [4]:
# Making a dataset with only verbs
dataset = {}

for word, tag in data:
    if 'VB' in tag:
        dataset[word] = tag

print(dataset)
{'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}
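
Note that a dictionary keeps each key only once, so words that occur twice in the sample (such as 'be' and 'expressed') appear a single time in the result. If the number of occurrences matters, one option is to store a count as the value instead of the tag. A minimal sketch, using a shortened copy of the sample data under the assumed name `sample` and collecting the result under the assumed name `counts`:

```python
# A shortened, illustrative copy of the sample data
sample = [('are', 'VBP'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'),
          ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN')]

counts = {}
for word, tag in sample:
    if 'VB' in tag:
        # .get() returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1

print(counts)
```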

Saving as a JSON file

In [5]:
import json
In [6]:
out = json.dumps(dataset, indent=4)
print(out)
{
    "are": "VBP",
    "extends": "VBZ",
    "be": "VB",
    "expressed": "VBN"
}
In [7]:
# Using "with" closes the file automatically after writing
with open('json-dataset.json', 'w') as f:
    f.write(out)
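
To check the result, the file can be read back in. This is a sketch that assumes the file name `json-dataset.json` from the cell above; the first lines write the file again so that the example runs on its own, and the name `loaded` is only illustrative.

```python
import json

dataset = {'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}

# Write the file (same content as in the cells above)
with open('json-dataset.json', 'w') as f:
    f.write(json.dumps(dataset, indent=4))

# json.load() reads a JSON file back into a Python dictionary
with open('json-dataset.json') as f:
    loaded = json.load(f)

print(loaded['are'])
```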