adding a json making notebook

4 years ago · 52a5f396b5
parent 5c40276198
commit 52a5f396b5
1 changed files with 213 additions and 0 deletions
--- a/json-making-datasets.ipynb
+++ b/json-making-datasets.ipynb
@@ -0,0 +1,213 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Making (mini) datasets using JSON files"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Ways of thickening, layering, ..., text."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "----------------------------------"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## JSON?\n",
+    "\n",
+    "JSON is an open standard file format that saves information in **key, value paired** objects. \n",
+    "\n",
+    "It looks like this:\n",
+    "\n",
+    "```\n",
+    "{\n",
+    "    \"name\": \"XPUB1\",\n",
+    "    \"date\": \"26-10-2020\",\n",
+    "    \"number_of_students\": 10\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "In a JSON file, you can store **strings** and **numbers**, but also **lists** (called arrays in JSON) or another nested **dictionary** (called an object in JSON).\n",
+    "\n",
+    "Note, for example, that the number `10` is written without `\"`s around it."
+   ]
+  },
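+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see how these JSON types arrive in Python, here is a minimal sketch using `json.loads` from the standard library. The names `text` and `record` are only illustrative, and the extra `tags` and `course` keys are made up for this example. Strings become `str`, whole numbers become `int`, lists become `list` and nested objects become `dict`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "# Part of the example above, plus a made-up list and a made-up nested object\n",
+    "text = '{\"name\": \"XPUB1\", \"number_of_students\": 10, \"tags\": [\"json\", \"python\"], \"course\": {\"year\": 2020}}'\n",
+    "\n",
+    "# json.loads turns a JSON string into Python objects\n",
+    "record = json.loads(text)\n",
+    "\n",
+    "# Print each key with its value and the Python type it became\n",
+    "for key, value in record.items():\n",
+    "    print(key, value, type(value).__name__)"
+   ]
+  },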
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## dictionaries & JSON files\n",
+    "\n",
+    "Python uses a **dictionary** to store data in the same **key, value paired** way. \n",
+    "\n",
+    "`dataset = {}`\n",
+    "\n",
+    "We will use a dictionary to store the data, and then save it as a JSON file.\n",
+    "\n",
+    "In this way, we can *add* one (or multiple) layer(s) of *value* to a word ..., such as:\n",
+    "\n",
+    "```\n",
+    "{\n",
+    "    \"common\": \"English\",\n",
+    "    \"language\": \"communication system\",\n",
+    "    \"formal\": 6,\n",
+    "    \"semantic\": [\"language\", \"formal\", \"informal\"],\n",
+    "    \"English\": { \n",
+    "            \"type\": \"word\",\n",
+    "            \"number of letters\": 7\n",
+    "        }\n",
+    "}\n",
+    "```\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
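+    "# Start from a small dictionary (it is rebuilt from the sample data further down)\n",
+    "dataset = {'are': 'VBP'}\n",
+    "# Adding a new key to the dictionary, assigning a string as value:\n",
+    "dataset['new'] = 'NEW WORD'\n",
+    "# or assigning a number as value (this overwrites the string above):\n",
+    "dataset['new'] = 10"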
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Printing the tag of a word in the dictionary\n",
+    "print(dataset['are'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Printing all the keys in the dataset\n",
+    "print(dataset.keys())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Making a sample dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# This is sample data, a list of words and POS tags:\n",
+    "data = [('Common', 'JJ'), ('languages', 'NNS'), ('like', 'IN'), ('English', 'NNP'), ('are', 'VBP'), ('both', 'DT'), ('formal', 'JJ'), ('and', 'CC'), ('semantic', 'JJ'), (';', ':'), ('although', 'IN'), ('their', 'PRP$'), ('scope', 'NN'), ('extends', 'VBZ'), ('beyond', 'IN'), ('the', 'DT'), ('formal', 'JJ'), (',', ','), ('anything', 'NN'), ('that', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('computer', 'NN'), ('control', 'NN'), ('language', 'NN'), ('can', 'MD'), ('also', 'RB'), ('be', 'VB'), ('expressed', 'VBN'), ('in', 'IN'), ('common', 'JJ'), ('language', 'NN'), ('.', '.')]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Making a dataset with only verbs\n",
+    "dataset = {}\n",
+    "\n",
+    "for word, tag in data:\n",
+    "    if 'VB' in tag:\n",
+    "        dataset[word] = tag\n",
+    "\n",
+    "print(dataset)"
+   ]
+  },
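+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The verbs dataset stores a single value (the tag) per word. As a sketch of the idea of adding layers of value to a word, each word can also point to a small dictionary of its own. The names `sample` and `layered` and the key `number of letters` are assumptions chosen for this example; the three word and tag pairs are copied from the sample data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A few (word, tag) pairs copied from the sample data\n",
+    "sample = [('are', 'VBP'), ('extends', 'VBZ'), ('be', 'VB')]\n",
+    "\n",
+    "# Each word gets its own dictionary, so several values can be layered onto it\n",
+    "layered = {}\n",
+    "for word, tag in sample:\n",
+    "    layered[word] = {'tag': tag, 'number of letters': len(word)}\n",
+    "\n",
+    "# Look up all the layers stored for one word\n",
+    "print(layered['extends'])"
+   ]
+  },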
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Saving as JSON file"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "out = json.dumps(dataset, indent=4)\n",
+    "print(out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write the JSON string to a file; the with-block closes the file automatically\n",
+    "with open('json-dataset.json', 'w') as f:\n",
+    "    f.write(out)"
+   ]
+  },
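+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To check the result, a saved file can be read back with `json.load`. This is a minimal sketch and not part of the steps above: it writes a tiny dictionary to a hypothetical file called `json-roundtrip-example.json`, so that it runs on its own and leaves `json-dataset.json` untouched, then loads it again. The names `verbs`, `filename` and `loaded` are illustrative."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "\n",
+    "# A tiny dictionary with two verbs from the sample data\n",
+    "verbs = {'are': 'VBP', 'extends': 'VBZ'}\n",
+    "\n",
+    "# Hypothetical file name, used only for this example\n",
+    "filename = 'json-roundtrip-example.json'\n",
+    "\n",
+    "# json.dump writes the dictionary straight to the open file\n",
+    "with open(filename, 'w') as f:\n",
+    "    json.dump(verbs, f, indent=4)\n",
+    "\n",
+    "# json.load reads the file back into a Python dictionary\n",
+    "with open(filename) as f:\n",
+    "    loaded = json.load(f)\n",
+    "\n",
+    "# Compare what was read back with what was written\n",
+    "print(loaded == verbs)"
+   ]
+  },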
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}