The following is an output, from within a Jupyter Notebook.

It looks great on github, viewed as a markdown file. However, code blocks look terrible on the Jekyle blog site. The problems seem to have to do with the failure to process the ` ```{.python .input n=53}` code.

Looking at Jekyl’s document page, I need a different syntax to mark the code blocks.


Parsing JSON strings in Pandas

Gary Feng, Princeton, NJ May, 2016

Table of Contents

%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
[
 {
  "data": {
   "application/javascript": "$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')",
   "text/plain": "<IPython.core.display.Javascript object>"
  },
  "metadata": {},
  "output_type": "display_data"
 }
]

Introduction

In the 2015/16 NAEP process data scheme there is a compromise in the data structure, where we needed to keep complex (hierarchical) information about the state of objects in a single SQL string field. To the extend we can, we used JSON to encode the data. The question is how to efficiently decode the data for analysis.

Loading JSON to dict

Here we compare two ways to convert the JSON string to a dict.

  • JSON: http://stackoverflow.com/questions/19483351/converting-json-string-to- dictionary-not-list-python
  • AST: http://stackoverflow.com/questions/15197673/using-pythons-eval-vs-ast- literal-eval

Here is one example keystroke datum. The eNAEP system somehow doubled the double-quote in the JSON so we have to clear that first. We create 2 versions, one with the original “true” and the other replacing it with “True”, because the JSON method requires true while the AST method requires True.

kstr = '{\
    ""name"":""text.change"",\
    ""textDiff"":{\
        ""diffs"":[{""edit"":""INS"",""pos"":1,""len"":1,""text"":""m""}],\
        ""textContext"":""m^ "",\
        ""textLength"":3},\
        ""selection"":{\
            ""selectedText"":"""",\
            ""startPos"":1,\
            ""endPos"":1,\
            ""lenSelected"":0,\
            ""textSelected"":"""",\
            ""isCollapsed"":true\
        }\
    }'
# we create 2 versions, one with the original "true" and the other replacing it with "True"
s1 = kstr.replace("\"\"", "\"")
print s1
s2 = kstr.replace("\"\"", "\"").replace("true", "True")
print s2

{.json .output n=2}
[
 {
  "name": "stdout",
  "output_type": "stream",
  "text": "{    \"name\":\"text.change\",    \"textDiff\":{        \"diffs\":[{\"edit\":\"INS\",\"pos\":1,\"len\":1,\"text\":\"m\"}],        \"textContext\":\"m^ \",        \"textLength\":3},        \"selection\":{            \"selectedText\":\"\",            \"startPos\":1,            \"endPos\":1,            \"lenSelected\":0,            \"textSelected\":\"\",            \"isCollapsed\":true        }    }\n{    \"name\":\"text.change\",    \"textDiff\":{        \"diffs\":[{\"edit\":\"INS\",\"pos\":1,\"len\":1,\"text\":\"m\"}],        \"textContext\":\"m^ \",        \"textLength\":3},        \"selection\":{            \"selectedText\":\"\",            \"startPos\":1,            \"endPos\":1,            \"lenSelected\":0,            \"textSelected\":\"\",            \"isCollapsed\":True        }    }\n"
 }
]

JSON

{.python .input n=3} import json j = json.loads(s1) j["textDiff"]["textContext"]

{.json .output n=3} [ { "data": { "text/plain": "u'm^ '" }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ]

{.python .input n=3} %timeit json.loads(s1)

{.json .output n=3} [ { "name": "stdout", "output_type": "stream", "text": "100000 loops, best of 3: 11.8 \u00b5s per loop\n" } ]

AST .literal_eval()

{.python .input n=4} import ast a=ast.literal_eval(s2) a["textDiff"]["textContext"]

{.json .output n=4} [ { "data": { "text/plain": "'m^ '" }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ]

{.python .input n=5} %timeit ast.literal_eval(s2)

{.json .output n=5} [ { "name": "stdout", "output_type": "stream", "text": "10000 loops, best of 3: 60.8 \u00b5s per loop\n" } ]

Conclusion

AST is about 6x slower. JSON wins, and we don’t need to convert true to True.

Looking forward, here’s probably how we are going to use it in Pandas:

```{.python .input n=9} import pandas as pd

ps = pd.Series([json.loads(s1), ast.literal_eval(s2)]) print ps print “Accessing individual elements” print ps.loc[0][“textDiff”][“textContext”] print “\nGetting elements in as a Pandas Series” print ps.apply(lambda x: x[“textDiff”][“textContext”]) ```

{.json .output n=9} [ { "name": "stdout", "output_type": "stream", "text": "0 {u'selection': {u'isCollapsed': True, u'lenSel...\n1 {u'selection': {u'isCollapsed': True, u'lenSel...\ndtype: object\nAccessing individual elements\nm^ \n\nGetting elements in as a Pandas Series\n0 m^ \n1 m^ \ndtype: object\n" } ]

This also suggests that I should parse all ExtendedInfo entry into objects so that I can access later.

Retrieving info from a dict

Since we have nested data structure in JSON converted into a nested dict, accessing it becomes a bit more complex. Here we compare the speed different algorithms. It comes down to a trade off between speed and the expressiveness of the method of data access.

  • Native python: a["textDiff"]["textContext"]
  • A path-like expression using dpath: dpath.util.get(j, 'textDiff/textContext')
  • A XPath-like full-featured expression using jsonpath_rw: parse('textDiff.textContext').find(j)[0].value

```{.python .input n=47} # install libraries ! pip install jsonpath-rw ! pip install dpath from jsonpath_rw import jsonpath, parse # [match.value for match in parse(‘textDiff.diffs.[*].len’).find(j)] import dpath.util

for jsonpath_rw

def getVal(json, key): # make sure this is JSON # parse(‘textDiff.diffs.[*].len’).find(j)[0].value try: res = parse(key).find(json)[0].value except: res = None return res


```{.json .output n=47}
[
 {
  "name": "stdout",
  "output_type": "stream",
  "text": "Requirement already satisfied (use --upgrade to upgrade): jsonpath-rw in /Users/garyfeng/anaconda/lib/python2.7/site-packages\nRequirement already satisfied (use --upgrade to upgrade): ply in /Users/garyfeng/anaconda/lib/python2.7/site-packages (from jsonpath-rw)\nRequirement already satisfied (use --upgrade to upgrade): decorator in /Users/garyfeng/anaconda/lib/python2.7/site-packages (from jsonpath-rw)\nRequirement already satisfied (use --upgrade to upgrade): six in /Users/garyfeng/anaconda/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from jsonpath-rw)\nCleaning up...\nRequirement already satisfied (use --upgrade to upgrade): dpath in /Users/garyfeng/anaconda/lib/python2.7/site-packages\nCleaning up...\n"
 }
]

dict

First, after json parsing j is simple a dict. Let’s time the good-o python dict access. Result is typically in the 100 nano-seconds.

{.python .input n=48} %timeit j["textDiff"]

{.json .output n=48} [ { "name": "stdout", "output_type": "stream", "text": "The slowest run took 27.37 times longer than the fastest. This could mean that an intermediate result is being cached \n10000000 loops, best of 3: 113 ns per loop\n" } ]

dpath

Next, let’s use the dpath library. Typically in the 100 micro-seconds, or 1,000x slower.

{.python .input n=49} %timeit dpath.util.get(j, 'textDiff')

{.json .output n=49} [ { "name": "stdout", "output_type": "stream", "text": "The slowest run took 16.18 times longer than the fastest. This could mean that an intermediate result is being cached \n10000 loops, best of 3: 104 \u00b5s per loop\n" } ]

jsonpath_rw

Using the full-featured jsonpath_rw is in the 6 ms range, or 60x slower than dpath, or 60,000x slower than dict.

{.python .input n=50} %timeit parse('textDiff.textContext').find(j)[0].value

{.json .output n=50} [ { "name": "stdout", "output_type": "stream", "text": "The slowest run took 8.60 times longer than the fastest. This could mean that an intermediate result is being cached \n100 loops, best of 3: 6.51 ms per loop\n" } ]

Conclusion

Use the dict method as much as we can for performance, unless we really need the flexibility of dpath. jsonpath_rw is nice but probably not necessary.

Try notedown

Compile this iPynb into markdown, using notedown.

{.python .input n=54} ! pip install notedown ! notedown "Parsing JSON strings in Pandas.ipynb" --to markdown > "2016-06-17-notedown example.md"

{.json .output n=54} [ { "name": "stdout", "output_type": "stream", "text": "Downloading/unpacking notedown\n Downloading notedown-1.5.0.tar.gz\n Running setup.py (path:/private/var/folders/bg/4qvb4_0573v7j7j71bwk16dc0000gp/T/pip_build_garyfeng/notedown/setup.py) egg_info for package notedown\n \nRequirement already satisfied (use --upgrade to upgrade): nbformat in /Users/garyfeng/anaconda/lib/python2.7/site-packages (from notedown)\nRequirement already satisfied (use --upgrade to upgrade): nbconvert in /Users/garyfeng/anaconda/lib/python2.7/site-packages (from notedown)\nRequirement already satisfied (use --upgrade to upgrade): ipython in /Users/garyfeng/anaconda/lib/python2.7/site-packages (from notedown)\nDownloading/unpacking pandoc-attributes (from notedown)\n Downloading pandoc-attributes-0.1.7.tar.gz\n Running setup.py (path:/private/var/folders/bg/4qvb4_0573v7j7j71bwk16dc0000gp/T/pip_build_garyfeng/pandoc-attributes/setup.py) egg_info for package pandoc-attributes\n \nRequirement already satisfied (use --upgrade to upgrade): six in /Users/garyfeng/anaconda/lib/python2.7/site-packages/six-1.10.0-py2.7.egg (from notedown)\nDownloading/unpacking pandocfilters (from pandoc-attributes->notedown)\n Downloading pandocfilters-1.3.0.tar.gz\n Running setup.py (path:/private/var/folders/bg/4qvb4_0573v7j7j71bwk16dc0000gp/T/pip_build_garyfeng/pandocfilters/setup.py) egg_info for package pandocfilters\n \nInstalling collected packages: notedown, pandoc-attributes, pandocfilters\n Running setup.py install for notedown\n \n Installing notedown script to /Users/garyfeng/anaconda/bin\n Running setup.py install for pandoc-attributes\n \n Running setup.py install for pandocfilters\n \nSuccessfully installed notedown pandoc-attributes pandocfilters\nCleaning up...\n" } ]