{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Making (mini) datasets using JSON files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ways of thickening, layering, ..., text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----------------------------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## JSON?\n", "\n", "JSON is an open standard file format that stores information as **key, value paired** objects. \n", "\n", "It looks like this:\n", "\n", "```\n", "{\n", "    \"name\": \"XPUB1\",\n", "    \"date\": \"26-10-2020\",\n", "    \"number_of_students\": 10\n", "}\n", "```\n", "\n", "In a JSON file, you can store **strings** and **numbers**, but also **lists** or another **dictionary** object.\n", "\n", "Note, for example, that the number `10` is written without `\"`s around it, while the strings are quoted." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## dictionaries & JSON files\n", "\n", "Python uses a **dictionary** to store data in the same **key, value paired** way. 
\n", "\n", "`dataset = {}`\n", "\n", "We will use a dictionary to store the data, and then save it as a JSON file.\n", "\n", "In this way, we can *add* one (or multiple) layer(s) of *value* to a word ..., such as:\n", "\n", "```\n", "{\n", "    \"common\": \"English\",\n", "    \"language\": \"communication system\",\n", "    \"formal\": 6,\n", "    \"semantic\": [\"language\", \"formal\", \"informat\"],\n", "    \"semantic\": {\n", "        \"type\": \"word\",\n", "        \"number of letters\": 8\n", "    }\n", "}\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A small starting dictionary: one word and its POS tag\n", "dataset = {'are': 'VBP'}\n", "# Adding a new key to the dictionary, assigning a string as value:\n", "dataset['new'] = 'NEW WORD'\n", "# or assigning a number as value (this overwrites the string above):\n", "dataset['new'] = 10" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Printing the tag of a word in the dictionary\n", "print(dataset['are'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Printing all the keys in the dataset\n", "print(dataset.keys())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making a sample dataset" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# This is sample data, a list of words and POS tags:\n", "dataset = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'are': 'VBP', 'extends': 'VBZ', 'be': 'VB', 'expressed': 'VBN'}\n" ] } ], "source": [ "# Making a dataset with only verbs\n", "dataset = {}\n", "\n", "for word, tag in data:\n", " if 'VB' in tag:\n", " dataset[word] = tag\n", "\n", "print(dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving as JSON file" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import json" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"are\": \"VBP\",\n", " \"extends\": \"VBZ\",\n", " \"be\": \"VB\",\n", " \"expressed\": \"VBN\"\n", 
"}\n" ] } ], "source": [ "out = json.dumps(dataset, indent=4)\n", "print(out)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "f = open('json-dataset.json', 'w')\n", "f.write(out)\n", "f.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }