XPUB

S13-Words-for-the-Future-notebooks


Making (mini) datasets using JSON files¶

Ways of thickening, layering, ..., text.

JSON?¶

JSON (JavaScript Object Notation) is an open standard file format that stores information as objects made of key-value pairs.

It looks like this:

{
    "name": "XPUB1",
    "date": "26-10-2020",
    "number_of_students": 10
}

In a JSON file you can store strings and numbers, but also lists or other (nested) objects.

Note, for example, how the number 10 is written without quotation marks (") around it.
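
As a small check of this, the sketch below parses the example with Python's standard json module and prints which Python type each value becomes. The names `text` and `record` are only illustrative and are not used elsewhere in this notebook.

```python
import json

# The JSON example from above, written as a Python string.
text = '''
{
    "name": "XPUB1",
    "date": "26-10-2020",
    "number_of_students": 10
}
'''

# json.loads() turns JSON text into Python objects.
record = json.loads(text)

for key, value in record.items():
    # Quoted values become str, the unquoted number becomes int.
    print(key, '->', type(value).__name__)
```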

Dictionaries & JSON files¶

Python's dictionary type stores data in the same way, as key-value pairs. An empty dictionary is made like this:

dataset = {}

We will use a dictionary to store the data, and then save it as a JSON file.

In this way, we can add one (or multiple) layer(s) of value to a word, such as:

{
    "common": "English",
    "language": "communication system",
    "formal": 6,
    "semantic": ["language", "formal", "informat"],
    "control": {
        "type": "word",
        "number of letters": 7
    }
}
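
A minimal sketch of how such layers could be built in Python and turned into JSON text. The variable `layers` and the key names 'tag' and 'neighbours' are made up for this illustration; use whatever layers make sense for your own dataset.

```python
import json

# Hypothetical layers for one word; the key names are illustrative only.
layers = {}
layers['language'] = {
    'type': 'word',
    'tag': 'NN',
    'number of letters': len('language'),
    'neighbours': ['control', 'common'],
}

# json.dumps() writes Python's single-quoted strings as double-quoted JSON strings.
print(json.dumps(layers, indent=4))
```

Note that JSON itself only accepts double quotes around strings, while Python accepts both.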

In [1]:

# Start with an empty dictionary
dataset = {}
# Adding a new key to the dictionary, assigning a string as value:
dataset['new'] = 'NEW WORD'
# or assigning a number as value (this overwrites the string above):
dataset['new'] = 10

In [ ]:

# Printing the tag of a word in the dictionary
# (run this after the dataset of verbs has been made, see below)
print(dataset['are'])

In [ ]:

# Printing all the keys in the dataset
print(dataset.keys())

Making a sample dataset¶

In [3]:

# This is sample data, a list of words and POS tags:
data = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]

In [4]:

# Making a dataset with only verbs
dataset = {}

for word, tag in data:
    if 'VB' in tag:
        dataset[word] = tag

print(dataset)

{'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}
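
Because the keys of a dictionary are unique, a word that occurs more than once in the text is stored only once. One possible variation, sketched here as an assumption about what you might want, is to add a second layer that counts how often each verb occurs. The names `sample` and `verbs` and the key 'count' are hypothetical.

```python
# A shortened copy of the sample data, so this cell runs on its own.
sample = [('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'),
          ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN')]

verbs = {}
for word, tag in sample:
    if tag.startswith('VB'):
        # Create the entry the first time a verb is seen ...
        if word not in verbs:
            verbs[word] = {'tag': tag, 'count': 0}
        # ... and count every occurrence.
        verbs[word]['count'] += 1

print(verbs)
```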

Saving as JSON file¶

In [5]:

import json

In [6]:

out = json.dumps(dataset, indent=4)
print(out)

{
    "are": "VBP",
    "extends": "VBZ",
    "be": "VB",
    "expressed": "VBN"
}

In [7]:

# 'with' closes the file automatically when the block ends
with open('json-dataset.json', 'w') as f:
    f.write(out)
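
To use the dataset again later, the file can be read back into a dictionary. The sketch below writes a small dictionary with json.dump() and reads it back with json.load(), both from Python's standard json module. It repeats the dictionary so that it runs on its own; the name `loaded` is only illustrative.

```python
import json

# A small dictionary like the one made above, repeated so this cell runs on its own.
dataset = {'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}

# json.dump() writes directly to an open file.
with open('json-dataset.json', 'w') as f:
    json.dump(dataset, f, indent=4)

# json.load() reads the file back into a Python dictionary.
with open('json-dataset.json') as f:
    loaded = json.load(f)

# Check that nothing was lost on the way.
print(loaded == dataset)
```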
