diff --git a/json-making-datasets.ipynb b/json-making-datasets.ipynb
new file mode 100644
index 0000000..501dd46
--- /dev/null
+++ b/json-making-datasets.ipynb
@@ -0,0 +1,213 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Making (mini) datasets using JSON files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ways of thickening, layering, ..., text."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "----------------------------------"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## JSON?\n",
+    "\n",
+    "JSON is an open standard file format that saves information in **key, value paired** objects. \n",
+    "\n",
+    "It looks like this:\n",
+    "\n",
+    "```\n",
+    "{\n",
+    "    \"name\": \"XPUB1\",\n",
+    "    \"date\": \"26-10-2020\",\n",
+    "    \"number_of_students\": 10\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "In a JSON file, you can store **strings** and **numbers**, but also **lists** or other **dictionary** objects.\n",
+    "\n",
+    "Look, for example, at how the number `10` is written without `\"`s around it: that makes it a number, not a string."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## dictionaries & JSON files\n",
+    "\n",
+    "Python uses a **dictionary** to store data in the same **key, value paired** way. 
\n",
+    "\n",
+    "`dataset = {}`\n",
+    "\n",
+    "We will use a dictionary to store the data, and then save it as a JSON file.\n",
+    "\n",
+    "In this way, we can *add* one (or multiple) layer(s) of *value* to a word ..., such as:\n",
+    "\n",
+    "```\n",
+    "{\n",
+    "    \"common\": \"English\",\n",
+    "    \"language\": \"communication system\",\n",
+    "    \"formal\": 6,\n",
+    "    \"semantic\": [\"language\", \"formal\", \"informal\"],\n",
+    "    \"English\": {\n",
+    "        \"type\": \"word\",\n",
+    "        \"number of letters\": 7\n",
+    "    }\n",
+    "}\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Adding a new key to the `dataset` dictionary (built under \"Making a sample dataset\" below; run those cells first), assigning a string as value:\n",
+    "dataset['new'] = 'NEW WORD'\n",
+    "# or assigning a number as value (this overwrites the string above):\n",
+    "dataset['new'] = 10"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Printing the tag of a word in the dictionary\n",
+    "print(dataset['are'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Printing all the keys in the dataset\n",
+    "print(dataset.keys())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Making a sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This is sample data, a list of words and POS (part-of-speech) tags:\n",
+    "data = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), 
('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Making a dataset with only verbs (the verb tags here all start with 'VB')\n",
+    "dataset = {}\n",
+    "\n",
+    "for word, tag in data:\n",
+    "    if tag.startswith('VB'):\n",
+    "        dataset[word] = tag\n",
+    "\n",
+    "print(dataset)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Saving as JSON file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "out = json.dumps(dataset, indent=4)\n",
+    "print(out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Writing the JSON string to a file; `with` closes the file when the block ends\n",
+    "with open('json-dataset.json', 'w') as f:\n",
+    "    f.write(out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}