match object(src) {
    Ok(res) => return Ok(res),
    Err(JSONParseError::NotFound) => {} // if not found, that's OK
    Err(e) => return Err(e),
}
You have probably realized that this is really tedious, and this is where macros really shine:
macro_rules! try_parse_as {
    ($f:expr) => (
        match $f(src) {
            Ok(res) => return Ok(res),
            Err(JSONParseError::NotFound) => {} // if not found, that's OK
            Err(e) => return Err(e),
        }
    );
}
try_parse_as!(object);
try_parse_as!(array);
// ...
It is also possible to avoid macros by translating `Result<(&str, JSONValue), JSONParseError>` into `Result<Option<(&str, JSONValue)>, JSONParseError>` (where `Ok(None)` stands in for `Err(JSONParseError::NotFound)`), which allows for shorthands like `if let Some(res) = translate(object(src))? { ... }`.
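For illustration, a minimal sketch of that `translate` helper, assuming the `object`/`array` parsers and the `JSONParseError` type from the snippets above:

fn translate<T>(r: Result<T, JSONParseError>) -> Result<Option<T>, JSONParseError> {
    match r {
        Ok(v) => Ok(Some(v)),
        Err(JSONParseError::NotFound) => Ok(None), // "not found" becomes Ok(None)
        Err(e) => Err(e),                          // real errors still propagate
    }
}

// At the call site, `?` forwards real errors and `Some` means "parsed":
if let Some(res) = translate(object(src))? {
    return Ok(res);
}
if let Some(res) = translate(array(src))? {
    return Ok(res);
}
// ...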
Also, even though you chose to represent numbers as f64, a correct parsing algorithm is surprisingly tricky. Fortunately `f64::parse` accepts a strict superset of JSON number grammar, so you can instead count how many characters make up the number and feed that slice into the Rust standard library.
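To make that concrete, a rough sketch using the same `(rest, value)` convention as the snippets above; reusing `NotFound` for a malformed number is a simplification, a real parser would want a dedicated error variant:

fn number(src: &str) -> Result<(&str, f64), JSONParseError> {
    // Count the characters that can appear in a JSON number.
    // (This scan is more permissive than strict JSON, e.g. it accepts a leading '+'.)
    let len = src
        .find(|c: char| !matches!(c, '0'..='9' | '-' | '+' | '.' | 'e' | 'E'))
        .unwrap_or(src.len());
    if len == 0 {
        return Err(JSONParseError::NotFound);
    }
    // Let the standard library do the actual conversion; Rust's f64 parsing
    // accepts everything the JSON number grammar allows here.
    let n: f64 = src[..len].parse().map_err(|_| JSONParseError::NotFound)?;
    Ok((&src[len..], n))
}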
> Fortunately `f64::parse` accepts a strict superset of JSON number grammar
Just for the sake of completeness, and not to imply that you don't know this, but the JSON spec doesn't limit the size or precision of numbers, although it allows implementations to set other limits.
I have encountered JSON documents that (annoyingly) required the use of a parser with bigint-support.
> but the JSON spec doesn't limit the size or precision of numbers, although it allows implementations to set other limits.
This actually led to a data-loss-causing bug in the AWS DynamoDB Console's editor a couple of years ago. IIRC, the temporary fix was to fail if the input number couldn't be represented by a 64-bit float. Can't remember if it ever got a proper fix.
Another article on the front page is discussing the capabilities of AI for coding.
I wonder, given a 500 line problem, can any of the current cutting edge AIs make the code obviously, dramatically better?
How far can they go?
https://minimaxir.com/2025/01/write-better-code/
So, no?
Here is one in 73 lines of F# by Jon Harrop:
https://gist.github.com/isaksky/6681cfad8ced1708a04b2eca92fc...
I love how in F# you can express many concepts with simplicity. Thanks for the example.
Here's another approach at parsing JSON in F# that is part of a tutorial on building your own parser combinator:
https://fsharpforfunandprofit.com/posts/understanding-parser...
Thanks, very cool
Meta: github is now requiring a login to see gists?
I don't even have an account and I could see it.
Most GitHub pages used to be rendered on the server, but they often require JS for the actual content now.
I got a login page on the first click, but it went away after closing and reopening it.
No, I am not logged in and I can view it.
I guess it's faster with sudo because the regular account has some resource limits, or something? The post didn't mention which system is being used (Linux? macOS? Something else?), but I can't repro on my Linux system, where performance is identical.
Wild guess, but different versions of Rust for the users?
The compile time is too fast for that: different versions would recompile the whole dep tree, and even compiling just the end code would take more than 0.02s.
One run is not statistically significant %) Try running 10 times each, throw away the fastest and the slowest, indicate the average and min/max among the remaining runs.
Any random background processes might slow the system slightly during the initial run. A slightly better state of the page cache on the second run could play a role, too. Flushing the FS cache before each run might be a good idea; doing `cat file.json > /dev/null` could be an equally good idea.
If OP is running this on macOS, it could be due to macOS's code signing check that runs the first time you run an executable.
Or more likely, OP has multiple versions of Rust installed.
I wouldn't think the code signing phase is run while the benchmark is running and thus affecting those numbers.
Yeah, I'd repeat the experiment in reverse order; the result would probably flip too.
I have to commend the simplicity and clarity of thought in the write-up. I could see what you were up to just by skimming through it (also, thanks for the ready reference)!
I'm pretty sure you'd already know of this, but once you've written your own version, it might help to compare and take notes from a popular, well-established, benchmarked library: https://github.com/serde-rs/json.
And on a related note, this project was for educational purposes, but if the author wants to do more parsing in Rust, there's the excellent `nom` crate [1], which provides a JSON parser as an example [2].
It uses a very similar paradigm to what the author used in the article, and provides a lot of helper utilities. I used it to parse a (very, very small) subset of Markdown recently [3] and enjoyed the experience.
[1] https://github.com/rust-bakery/nom
[2] https://github.com/rust-bakery/nom/blob/main/examples/json.r...
[3] https://git.sr.ht/~bsprague/logseq-to-linkwarden
I have no Rust knowledge, so let me ask:
- Is this hard to do in 500 lines of Rust?
- Is there a catch, i.e. why is an implementation in Rust worth mentioning on HN?
Don't get me wrong, I'm really interested in Rust.
500 lines struck me as particularly long, but I would use a combinator library like nom, and it looks like they are trying to do it from scratch.
How large is the combinator library? It should not be excessively long, for the simple thing it normally does, especially in Rust which is FP-friendly.
I've seen a capable combinator lib written in about 500 lines of Rust. Probably could be shorter if minimal loc were a goal. The one I use is called "nom", it's very general and has a lot of built in functionality and options, so it's much larger than 500 lines to be sure.
> Is this hard to do in 500 lines of Rust?
No, not that much. The user posted mostly because they are learning the language.
> Is there a catch, i.e. why is an implementation in Rust worth mentioning on HN?
HN loves Rust, and it is a fun opportunity to explore Rust semantics in use without going into the more complex parts of the language. It belongs to the category of "cool first project", like a Sudoku solver, etc.
Parsing JSON is a Minefield: https://seriot.ch/projects/parsing_json.html
Hilariously, lots of "logging" (Lucene-based) tools explicitly require storing JSON docs and fall over if the schema's not quite right [1].
I regularly deal with situations where devs "log" by sending "json" to stdout of their container runtime, then expect the downstream infrastructure to magic structure into it perfectly. "You understand something has to parse that stream of bytes you're sending to find all the matching quotes and curly braces and stuff, right? What happens if one process in the container's emitting a huge log event and something else in the container decides to report that it's doing some memory stuff?" <blank stare> "I expect you'd log the error and open an incident?"
(The correct answer is to just collect the garbage strings in JSON (ha) and give them the unparsed crap to deal with themselves, but then "we're devs deving; we don't want to waste energy on operations toil".)
Later people ask "why's logging so expensive?"
Sigh.
[1] opensearch / elasticsearch, obvs
[2] https://12factor.net/logs
I need to work on my Rust JSON parser some more [0]. I intended it to handle deeply nested (malicious) objects gracefully and with few memory allocations. It's also pretty fast.
I had a goal of making it "async", so it could periodically yield for large input strings, but I'm not so sure it matters much.
Currently the API is pretty impractical to use because of the arena structure that is set up. It could be improved.
[0] https://github.com/conradludgate/sonny-jim
https://www.perlmonks.org/?node_id=995856
I have implemented two packages in Go for parsing JSON:
1. https://github.com/ezpkg/iter.json: written last year using Go iterators; the core parsing code [1a] is around 200 lines
1a. https://github.com/ezpkg/ezpkg/blob/main/iter.json/parser.go
2. https://github.com/iOliverNguyen/ujson: written 4 years ago using callbacks; around 400 lines
Regarding the 'sudo' issue: doing a benchmark by just running an example executable is not really recommended, because there are a ton of reasons why you might get differing performance.
It's probably better to set up an actual benchmark using a crate like Criterion instead [0].
[0] https://github.com/bheisler/criterion.rs
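For reference, a minimal Criterion setup for this kind of throughput measurement could look roughly like the following; the `my_json_parser::parse` call and the input path are placeholders, not the article's actual names:

// benches/json.rs -- assumes `criterion` under [dev-dependencies] and a
// [[bench]] entry in Cargo.toml with name = "json" and harness = false.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn bench_parse(c: &mut Criterion) {
    let data = std::fs::read_to_string("testdata/large.json").unwrap(); // placeholder input

    let mut group = c.benchmark_group("json");
    group.throughput(Throughput::Bytes(data.len() as u64)); // so Criterion reports MB/s
    group.bench_function("parse", |b| {
        b.iter(|| my_json_parser::parse(criterion::black_box(&data)))
    });
    group.finish();
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);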
It's fine for things like this where you want to get a rough performance indication to see on what order of magnitude things are at (~1MB/s vs. ~10 vs. ~50 vs. ~100). A few percent error margin is fine for that.
Tools like that exist to eliminate noise and variation, which is an entirely different issue. According to the article, "sudo" is about 70% faster. That has nothing to do with the benchmarking method.
It could eliminate issues where startup time and background scanning processes might interfere with initial throughput, though? Even things like CPU throttling could be affecting the test somehow. The main goal is to eliminate as many variables as you can.
It's fairly realistic for the first run to be from disk, and the second from cache in a scenario like this, and ~2x difference between the two isn't entirely unrealistic.
I once (maybe a long time ago?) made a parser for JSON by:
1. Reading the entire file into RAM.
2. Providing a `const char *get_value(const char *jstring, const char *path, ...)` function with a NULL-terminated parameter list that would return the position of the value of the key at the specified path.
3. Providing a `copy_value(const char *position)` function to copy the value at the specified position.
Slow? Yup!
But, it was easy and safe[1] and used absolutely minimal RAM![2]. The recursive nature of the JSON tree also allowed the caller to use a returned value from `get_value` as the `jstring` argument in further calls to `get_value`.
I might still have a fork of it lying around somewhere.
[1] "Safe" meaning "Caller had to check for NULL return values, and ensure that NULL terminated the parameter list".
[2] GCC with `-O2` and above does proper TCO, eliminating unbounded stack growth.
That’s not a parser though. For example, the caller would still need to convert escaped Unicode characters, floats, and bools. I guess it’s an incomplete tokenizer?
A better option would be to parse the JSON into BSON and then use that as the in-memory format. It uses minimal memory and is actually also fast to access without parsing into some other data structure.
That sounds a bit like what I came up with a while back. My lib can stream JSON to/from my own compact binary format that is easy to traverse/emit in C:
https://github.com/nwpierce/jsb/
TBH I didn't know about BSON (binary JSON?). All I had was a tiny little device that received JSON and needed to retrieve values from the tree.
Only safe and easy if the JSON file fits into RAM.
> Only safe and easy if the JSON file fits into RAM.
So?
Many of the other JSON parsing tools build up a tree in RAM[1], so the statement "Only works if the final tree fits in RAM" is just as true, and since JSON is mostly text, the final tree is not that much smaller than the source JSON anyway.
[1] Including the one this article is presenting, and all the other solutions presented in the comments.
What about empty string keys, duplicate keys and byte order marks?
\uFEFF{"": "value", "": null}
(Yup, there's a reason those are not recommended for use)
It'd be easier and cleaner to use miette for pretty error printing.
How many dependencies is that?
I bet I could rewrite that in 5000 lines.
I wonder how well it works with heavily nested files. AFAIU, this is a recursive parser.
Do not parse using strip_prefix and then to_string; you're allocating a new buffer for each token! You can use Cow<str> for the result of the "parse string value" function: since most JSON strings don't have escapes, you can just return a reference to that slice of the buffer, only allocating when there's actually an escape.
In general, when writing a parser, you should strive to minimize allocations and backtracking.
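A sketch of what that can look like (not the article's actual code): it assumes the article's `JSONParseError` type, reuses `NotFound` where a real parser would want more specific variants, and omits `\u` and a few other escapes:

use std::borrow::Cow;

// Borrow the input slice when there are no escapes; allocate only when an
// escape actually appears.
fn string(src: &str) -> Result<(&str, Cow<'_, str>), JSONParseError> {
    let rest = src.strip_prefix('"').ok_or(JSONParseError::NotFound)?;

    // Fast path: no backslash before the closing quote, so we can borrow.
    match rest.find(|c: char| c == '"' || c == '\\') {
        Some(end) if rest.as_bytes()[end] == b'"' => {
            return Ok((&rest[end + 1..], Cow::Borrowed(&rest[..end])));
        }
        Some(_) => {} // found a backslash: fall through to the slow path
        None => return Err(JSONParseError::NotFound), // unterminated string
    }

    // Slow path: decode escapes into an owned String.
    let mut out = String::new();
    let mut chars = rest.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return Ok((&rest[i + 1..], Cow::Owned(out))),
            '\\' => match chars.next() {
                Some((_, '"')) => out.push('"'),
                Some((_, '\\')) => out.push('\\'),
                Some((_, 'n')) => out.push('\n'),
                Some((_, 't')) => out.push('\t'),
                // \/, \b, \f, \r, \uXXXX omitted for brevity
                _ => return Err(JSONParseError::NotFound),
            },
            _ => out.push(c),
        }
    }
    Err(JSONParseError::NotFound) // unterminated string
}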
I changed it to use Cow<str>, performance improved from 121.02 MB/s to 278.66 MB/s
https://github.com/rectalogic/jsonparser
Good job!
Now try parsing Rust in 500 lines of JSON
I suppose that is simply intended as a joke, but to me it falls flat since JSON is a data format not a programming language. JSON cannot parse anything.
One line of JavaScript and Python.
True, but if one can write an LLM in 100 lines of JAX, perhaps we should just train that LLM to parse JSON?