match object(src) {
    Ok(res) => return Ok(res),
    Err(JSONParseError::NotFound) => {} // if not found, that's OK
    Err(e) => return Err(e),
}
You have probably realized that this is really tedious, and this is where macros really shine:
macro_rules! try_parse_as {
    ($f:expr) => (
        match $f(src) {
            Ok(res) => return Ok(res),
            Err(JSONParseError::NotFound) => {} // if not found, that's OK
            Err(e) => return Err(e),
        }
    );
}
try_parse_as!(object);
try_parse_as!(array);
// ...
It is also possible to avoid macros by translating `Result<(&str, JSONValue), JSONParseError>` into `Result<Option<(&str, JSONValue)>, JSONParseError>` (where `Ok(None)` stands in for `Err(JSONParseError::NotFound)`), which allows for shorthands like `if let Some(res) = translate(object(src))? { ... }`.
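For illustration, a minimal sketch of that `translate` helper, assuming the `object`/`array` parsers and the `JSONParseError` type from the snippets above:

fn translate<T>(r: Result<T, JSONParseError>) -> Result<Option<T>, JSONParseError> {
    match r {
        Ok(v) => Ok(Some(v)),
        Err(JSONParseError::NotFound) => Ok(None), // "not found" becomes Ok(None)
        Err(e) => Err(e),                          // real errors still propagate
    }
}

// At the call site, `?` forwards real errors and `Some` means "parsed":
if let Some(res) = translate(object(src))? {
    return Ok(res);
}
if let Some(res) = translate(array(src))? {
    return Ok(res);
}
// ...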
Also, even though you chose to represent numbers as f64, a correct parsing algorithm is surprisingly tricky. Fortunately `f64::parse` accepts a strict superset of JSON number grammar, so you can instead count how many characters make up the number and feed that slice into the Rust standard library.
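To make that concrete, a rough sketch using the same `(rest, value)` convention as the snippets above; reusing `NotFound` for a malformed number is a simplification, a real parser would want a dedicated error variant:

fn number(src: &str) -> Result<(&str, f64), JSONParseError> {
    // Count the characters that can appear in a JSON number.
    // (This scan is more permissive than strict JSON, e.g. it accepts a leading '+'.)
    let len = src
        .find(|c: char| !matches!(c, '0'..='9' | '-' | '+' | '.' | 'e' | 'E'))
        .unwrap_or(src.len());
    if len == 0 {
        return Err(JSONParseError::NotFound);
    }
    // Let the standard library do the actual conversion; Rust's f64 parsing
    // accepts everything the JSON number grammar allows here.
    let n: f64 = src[..len].parse().map_err(|_| JSONParseError::NotFound)?;
    Ok((&src[len..], n))
}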
> Fortunately `f64::parse` accepts a strict superset of JSON number grammar
Just for the sake of completeness, and not to imply that you don't know this, but the JSON spec doesn't limit the size or precision of numbers, although it allows implementations to set other limits.
I have encountered JSON documents that (annoyingly) required the use of a parser with bigint-support.
> but the JSON spec doesn't limit the size or precision of numbers, although it allows implementations to set other limits.
This actually led to a data-loss-causing bug in the AWS DynamoDB Console's editor a couple of years ago. IIRC, the temporary fix was to fail if the input number couldn't be represented by a 64-bit float. Can't remember if it ever got a proper fix.
Another article on the front page is discussing the capabilities of AI for coding.
I wonder, given a 500 line problem, can any of the current cutting edge AIs make the code obviously, dramatically better?
How far can they go?
https://minimaxir.com/2025/01/write-better-code/
So, no?
Here is one in 73 lines of F# by Jon Harrop:
https://gist.github.com/isaksky/6681cfad8ced1708a04b2eca92fc...
I love how in F# you can express many concepts with simplicity. Thanks for the example.
Here's another approach at parsing JSON in F# that is part of a tutorial on building your own parser combinator:
https://fsharpforfunandprofit.com/posts/understanding-parser...
Thanks, very cool
Meta: github is now requiring a login to see gists?
I don't even have an account and I could see it.
Most GitHub pages used to be rendered on the server, but they often require JS for the actual content now.
I got a login page on the first click, but it went away after closing and reopening it.
No, I am not logged in and I can view it.
I guess it's faster with sudo because the regular account has some resource limits, or something? The post didn't mention which system is being used (Linux? macOS? Something else?), but I can't repro on my Linux system, where performance is identical.
Wild guess, but different versions of Rust for the users?
The compile time is too fast for that: different versions would recompile the whole dep tree, and even compiling just the end code would take more than 0.02s.
One run is not statistically significant %) Try running 10 times each, throw away the fastest and the slowest, indicate the average and min/max among the remaining runs.
Any random background processes might slow the system slightly during the initial run. A slightly better state of the page cache on the second run could play a role, too. Flushing the FS cache before each run might be a good idea; doing `cat file.json > /dev/null` could be an equally good idea.
If OP is running this on macOS, it could be due to macOS's code signing check that runs the first time you run an executable.
Or more likely, OP has multiple versions of Rust installed.
I wouldn't think the code signing phase is run while the benchmark is running and thus affecting those numbers.
Yeah, I'd repeat the experiment in reverse order; the result would probably flip too.
I have to commend the simplicity and clarity of thought in the write-up. I could see what you were up to just by skimming through it (also, thanks for the ready reference)!
I'm pretty sure you'd already know of this, but once you've written your own version, it might help to compare and take notes from a popular, well-established, benchmarked library: https://github.com/serde-rs/json.
And on a related note, this project was for educational purposes, but if the author wants to do more parsing in Rust, there's the excellent `nom` crate [1], which provides a JSON parser as an example [2].
It uses a very similar paradigm to what the author used in the article, and provides a lot of helper utilities. I used it to parse a (very, very small) subset of Markdown recently [3] and enjoyed the experience.
[1] https://github.com/rust-bakery/nom
[2] https://github.com/rust-bakery/nom/blob/main/examples/json.r...
[3] https://git.sr.ht/~bsprague/logseq-to-linkwarden
I have no Rust knowledge, so let me ask:
- Is this hard to do in 500 lines of Rust?
- Is there a catch, i.e. why is an implementation in Rust worth mentioning on HN?
Don't get me wrong, I'm really interested in Rust.
500 lines struck me as particularly long, but I would use a combinator library like nom, and it looks like they are trying to do it from scratch.
How large is the combinator library? It should not be excessively long, for the simple thing it normally does, especially in Rust which is FP-friendly.
I've seen a capable combinator lib written in about 500 lines of Rust. Probably could be shorter if minimal loc were a goal. The one I use is called "nom", it's very general and has a lot of built in functionality and options, so it's much larger than 500 lines to be sure.
> Is this hard to do in 500 lines of Rust?
No, not that much. The user posted mostly because they are learning the language.
> Is there a catch, i.e. why is an implementation in Rust worth mentioning on HN?
HN loves Rust, and it is a fun opportunity to explore Rust semantics in use without going into the more complex parts of the language. It belongs to the category of "cool first project", like a Sudoku solver, etc.
Parsing JSON is a Minefield: https://seriot.ch/projects/parsing_json.html
Hilariously, lots of "logging" (Lucene-based) tools explicitly require storing JSON docs and fall over if the schema's not quite right [1].
I regularly deal with situations where devs "log" by sending "json" to stdout of their container runtime, then expect the downstream infrastructure to magic structure into it perfectly. "You understand something has to parse that stream of bytes you're sending to find all the matching quotes and curly braces and stuff, right? What happens if one process in the container's emitting a huge log event and something else in the container decides to report that it's doing some memory stuff?" <blank stare> "I expect you'd log the error and open an incident?"
(The correct answer is to just collect the garbage strings in JSON (ha) and give them the unparsed crap to deal with themselves, but then "we're devs deving; we don't want to waste energy on operations toil".)
Later people ask "why's logging so expensive?"
Sigh.
[1] opensearch / elasticsearch, obvs
[2] https://12factor.net/logs
I need to work on my Rust JSON parser some more [0]. I intended it to handle deeply nested (malicious) objects gracefully and with few memory allocations. It's also pretty fast.
I had a goal of making it "async", so it could periodically yield for large input strings, but I'm not so sure it matters much.
Currently the API is pretty impractical to use because of the arena structure that is set up. It could be improved.
[0] https://github.com/conradludgate/sonny-jim
https://www.perlmonks.org/?node_id=995856
I have implemented two packages in Go for parsing JSON:
1. https://github.com/ezpkg/iter.json: written last year using Go iterators; the core parsing code [1a] is around 200 lines
1a. https://github.com/ezpkg/ezpkg/blob/main/iter.json/parser.go
2. https://github.com/iOliverNguyen/ujson: written 4 years ago using callbacks; around 400 lines
Regarding the 'sudo' issue: doing a benchmark by just running an example executable is not really recommended, because there are a ton of reasons why you might get differing performance.
It's probably better to set up an actual benchmark using a crate like Criterion instead [0].
[0] https://github.com/bheisler/criterion.rs
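For reference, a minimal Criterion setup for this kind of throughput measurement could look roughly like the following; the `my_json_parser::parse` call and the input path are placeholders, not the article's actual names:

// benches/json.rs -- assumes `criterion` under [dev-dependencies] and a
// [[bench]] entry in Cargo.toml with name = "json" and harness = false.
use criterion::{criterion_group, criterion_main, Criterion, Throughput};

fn bench_parse(c: &mut Criterion) {
    let data = std::fs::read_to_string("testdata/large.json").unwrap(); // placeholder input

    let mut group = c.benchmark_group("json");
    group.throughput(Throughput::Bytes(data.len() as u64)); // so Criterion reports MB/s
    group.bench_function("parse", |b| {
        b.iter(|| my_json_parser::parse(criterion::black_box(&data)))
    });
    group.finish();
}

criterion_group!(benches, bench_parse);
criterion_main!(benches);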
It's fine for things like this where you want to get a rough performance indication to see on what order of magnitude things are at (~1MB/s vs. ~10 vs. ~50 vs. ~100). A few percent error margin is fine for that.
Tools like that exist to eliminate noise and variation, which is an entirely different issue. According to the article, "sudo" is about 70% faster. That has nothing to do with the benchmarking method.
It could eliminate issues where startup time and background scanning processes might interfere with initial throughput, though? Even things like CPU throttling could be affecting the test somehow. The main goal is to eliminate as many variables as you can.
It's fairly realistic for the first run to be from disk, and the second from cache in a scenario like this, and ~2x difference between the two isn't entirely unrealistic.
I once (maybe a long time ago?) made a parser for JSON by:
1. Reading the entire file into RAM.
2. Providing a `const char *get_value(const char *jstring, const char *path, ...)` function with a NULL-terminated parameter list that would return the position of the value of the key at the specified path.
3. Providing a `copy_value(const char *position)` function to copy the value at the specified position.
Slow? Yup!
But, it was easy and safe[1] and used absolutely minimal RAM![2]. The recursive nature of the JSON tree also allowed the caller to use a returned value from `get_value` as the `jstring` argument in further calls to `get_value`.
I might still have a fork of it lying around somewhere.
[1] "Safe" meaning "Caller had to check for NULL return values, and ensure that NULL terminated the parameter list".
[2] GCC with `-O2` and above does proper TCO, eliminating unbounded stack growth.
That’s not a parser though. For example, the caller would still need to convert escaped Unicode characters, floats, and bools. I guess it’s an incomplete tokenizer?
A better option would be to parse the JSON into BSON and then use that as the in-memory format. It uses minimal memory and is actually also fast to access without parsing into some other data structure.
That sounds a bit like what I came up with a while back. My lib can stream JSON to/from my own compact binary format that is easy to traverse/emit in C:
https://github.com/nwpierce/jsb/
TBH I didn't know about BSON (binary JSON?). All I had was a tiny little device that received JSON and needed to retrieve values from the tree.
Only safe and easy if the JSON file fits into RAM.
> Only safe and easy if the JSON file fits into RAM.
So?
Many of the other JSON parsing tools build up a tree in RAM[1], so the statement "Only works if the final tree fits in RAM" is just as true, and since JSON is mostly text, the final tree is not that much smaller than the source JSON anyway.
[1] Including the one this article is presenting, and all the other solutions presented in the comments.
What about empty string keys, duplicate keys and byte order marks?
\uFEFF{"": "value", "": null}
(Yup, there's a reason those are not recommended for use)
It'd be easier and cleaner to use miette for pretty error printing.
How many dependencies is that?
I bet I could rewrite that in 5000 lines.
I wonder how well it works with heavily nested files. AFAIU, this is a recursive parser.
Do not parse using strip_prefix and then to_string; you're allocating a new buffer for each token! You can use Cow<str> for the result of the "parse string value" function: since most JSON strings don't have escapes, you can just return a reference to that slice of the buffer, only allocating when there's actually an escape.
In general, when writing a parser, you should strive to minimize allocations and backtracking.
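A sketch of what that can look like (not the article's actual code): it assumes the article's `JSONParseError` type, reuses `NotFound` where a real parser would want more specific variants, and omits `\u` and a few other escapes:

use std::borrow::Cow;

// Borrow the input slice when there are no escapes; allocate only when an
// escape actually appears.
fn string(src: &str) -> Result<(&str, Cow<'_, str>), JSONParseError> {
    let rest = src.strip_prefix('"').ok_or(JSONParseError::NotFound)?;

    // Fast path: no backslash before the closing quote, so we can borrow.
    match rest.find(|c: char| c == '"' || c == '\\') {
        Some(end) if rest.as_bytes()[end] == b'"' => {
            return Ok((&rest[end + 1..], Cow::Borrowed(&rest[..end])));
        }
        Some(_) => {} // found a backslash: fall through to the slow path
        None => return Err(JSONParseError::NotFound), // unterminated string
    }

    // Slow path: decode escapes into an owned String.
    let mut out = String::new();
    let mut chars = rest.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            '"' => return Ok((&rest[i + 1..], Cow::Owned(out))),
            '\\' => match chars.next() {
                Some((_, '"')) => out.push('"'),
                Some((_, '\\')) => out.push('\\'),
                Some((_, 'n')) => out.push('\n'),
                Some((_, 't')) => out.push('\t'),
                // \/, \b, \f, \r, \uXXXX omitted for brevity
                _ => return Err(JSONParseError::NotFound),
            },
            _ => out.push(c),
        }
    }
    Err(JSONParseError::NotFound) // unterminated string
}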
I changed it to use Cow<str>, performance improved from 121.02 MB/s to 278.66 MB/s
https://github.com/rectalogic/jsonparser
Good job!
Now try parsing Rust in 500 lines of JSON
I suppose that is simply intended as a joke, but to me it falls flat since JSON is a data format not a programming language. JSON cannot parse anything.
One line of JavaScript and Python.
True, but if one can write an LLM in 100 lines of JAX, perhaps we should just train that LLM to parse JSON?