Making (mini) datasets using JSON files
Ways of thickening, layering, ..., text.
JSON?
JSON is an open standard file format that stores information as objects made of key-value pairs.
It looks like this:
{
    "name": "XPUB1",
    "date": "26-10-2020",
    "number_of_students": 10
}
In a JSON file, you can store strings and numbers, but also lists or other (nested) objects, which Python calls dictionaries.
Note, for example, how the number 10 is written without quotation marks (") around it.
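As a quick check of these types, the sketch below parses the example above with Python's built-in `json` module. The variable names `text` and `record` are only illustrative.

```python
import json

# The JSON example from above, as a Python string
text = '''
{
    "name": "XPUB1",
    "date": "26-10-2020",
    "number_of_students": 10
}
'''

# json.loads() turns JSON text into Python objects
record = json.loads(text)

# Values in quotes become strings, the unquoted 10 becomes an integer
for key, value in record.items():
    print(key, '->', repr(value), type(value).__name__)
```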
dictionaries & JSON files
Python uses a dictionary to store data in the same way, as key-value pairs.
dataset = {}
We will use a dictionary to store the data, and then save it as a JSON file.
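In this way, we can add one (or multiple) layer(s) of value to a word ..., such as a string, a number, a list or another dictionary. Note that the key "semantic" appears twice below to show two possible kinds of value; an actual dictionary or JSON object should use each key only once.
{
    "common": "English",
    "language": "communication system",
    "formal": 6,
    "semantic": ["language", "formal", "informat"],
    "semantic": {
        "type": "word",
        "number of letters": 7
    }
}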
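A minimal sketch of how such layers could be built and read back in Python, assuming one entry per word. The key names "tag", "length", "related" and "properties", and the variable names `layered` and `repeated`, are hypothetical and chosen only for this illustration.

```python
import json

word = "semantic"

# Each key adds one layer of information about the word.
# The key names "tag", "length", "related" and "properties" are hypothetical.
layered = {
    word: {
        "tag": "JJ",                        # a string
        "length": len(word),                # a number
        "related": ["language", "formal"],  # a list
        "properties": {"type": "word"},     # a nested dictionary
    }
}

# Nested values are reached by chaining keys
print(layered["semantic"]["properties"]["type"])

# A key repeated in a dictionary literal keeps only its last value
repeated = {"semantic": ["language", "formal"], "semantic": {"type": "word"}}
print(repeated)

# The nested structure can be written out as JSON text
print(json.dumps(layered, indent=4))
```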
In [ ]:
# Adding a new key to the dictionary, assigning a string as value:
dataset['new'] = 'NEW WORD'
# or assigning a number as value:
dataset['new'] = 10
In [ ]:
# Printing the tag of a word in the dictionary
# (this works once the dataset contains the word 'are', see "Making a sample dataset")
print(dataset['are'])
In [ ]:
# Printing all the keys in the dataset
print(dataset.keys())
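Asking for a key that is not in the dictionary with square brackets raises a `KeyError`. A minimal sketch of some safer lookups, using a small stand-in dictionary (`example` and its contents are assumptions for illustration, not the dataset built in this notebook):

```python
# A small stand-in dictionary of words and POS tags
example = {'are': 'VBP', 'extends': 'VBZ'}

# .get() returns None (or a default you choose) when the key is missing
print(example.get('are'))
print(example.get('language', 'not in dataset'))

# "in" checks whether a key exists before using it
if 'extends' in example:
    print(example['extends'])

# .items() gives the keys and their values together
for word, tag in example.items():
    print(word, tag)
```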
Making a sample dataset
In [ ]:
# This is sample data, a list of words and POS tags:
data = [
    ('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'),
    ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'),
    ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'),
    ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'),
    ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'),
    ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'),
    ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'),
    ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'),
    ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.'),
]
In [ ]:
# Making a dataset with only verbs
dataset = {}
for word, tag in data:
    # All verb tags (VB, VBP, VBZ, VBN, ...) contain 'VB'
    if 'VB' in tag:
        dataset[word] = tag
print(dataset)
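Because dictionary keys are unique, a word that occurs more than once in the sample data (such as 'be' or 'expressed') ends up only once in the dataset. One possible variation, sketched here with a shortened copy of the sample data, is to use the tag as key and collect the words in a list. The names `sample` and `by_tag` are illustrative.

```python
# A shortened copy of the sample data above
sample = [('are', 'VBP'), ('extends', 'VBZ'), ('can', 'MD'), ('be', 'VB'),
          ('expressed', 'VBN'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN')]

# Use each verb tag as key, with a list of words as value
by_tag = {}
for word, tag in sample:
    if tag.startswith('VB'):
        # setdefault() creates an empty list the first time a tag is seen
        by_tag.setdefault(tag, []).append(word)

print(by_tag)
```

A list as value keeps every occurrence, which makes it possible to count how often a word appears with `len()` or to keep the order in which the words were found.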
Saving as JSON file
In [ ]:
import json
In [ ]:
# Turning the dictionary into a JSON formatted string, indented with 4 spaces
out = json.dumps(dataset, indent=4)
print(out)
In [ ]:
# Writing the JSON string to a file; "with" closes the file automatically
with open('json-dataset.json', 'w') as f:
    f.write(out)
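To check that a file was written correctly, it can be read back with `json.load()`. The sketch below is self-contained: it first writes a small stand-in dataset to a temporary folder, then loads it again. The temporary folder and the names `folder`, `path` and `loaded` are assumptions made for this example.

```python
import json
import os
import tempfile

# A small stand-in for the dataset of verbs
dataset = {'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}

# Write the file into a temporary folder, so nothing is overwritten
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'json-dataset.json')
with open(path, 'w') as f:
    json.dump(dataset, f, indent=4)

# Read the JSON file back into a Python dictionary
with open(path) as f:
    loaded = json.load(f)

# The loaded dictionary has the same keys and values as the original
print(loaded == dataset)
```

Here `json.dump()` writes directly to the open file, which combines the `json.dumps()` and `f.write()` steps into one.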