Is Parsing JSON a Minefield?
There's a great article titled Parsing JSON is a Minefield that comprehensively catalogs the ambiguities and contradictions in the JSON “spec”. As is so often the case with software, it turns out JSON has more complexity than you might expect (or at the very least, more than Douglas Crockford would tell you about).
The JSON grammar was famously printed on a business card to demonstrate how simple it was. This got me thinking about an interesting experiment: how correct would a JSON parser implemented purely from that business card be? Luckily, the author of “Parsing JSON is a Minefield” also created a comprehensive suite of tests covering all the edge cases discussed in the article. So of course I decided to write and test a JSON parser.
Baseline Assumptions
The JSON business card is, as you'd expect, pretty vague, so I had to make several assumptions about how things should work. I did my best to make decisions based on what a typical software developer would expect, but they are still assumptions shaped by my own biases. Just keep that in mind going forward.
Before I could do anything I needed to define how the parser would work. Unlike serde, which deserializes JSON into specific structs, I wanted to parse JSON into a more generic enum that can represent any JSON document.
pub enum JSONValue {
    Null,
    Boolean(bool),
    Number(f64),
    String(String),
    Array(Vec<JSONValue>),
    Object(HashMap<String, JSONValue>),
}
So a few assumptions were already being made: Number is always represented as a 64-bit floating point number (the same way JavaScript does it), String is a Rust String (which is UTF-8 and satisfies the “any-Unicode-character” requirement), and Object is a HashMap (meaning duplicate keys can't exist).
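That last choice has a visible consequence worth spelling out: if the input contains the same key twice, later values silently overwrite earlier ones. A quick sketch (the input here is my own illustration, not part of any test suite):

// With Object backed by a HashMap, the second "a" wins and the first is lost.
let parsed = parse(String::from("{\"a\":1,\"a\":2}"));
let mut expected = HashMap::new();
expected.insert(String::from("a"), JSONValue::Number(2.0));
assert_eq!(parsed, Ok(JSONValue::Object(expected)));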
Next I needed to define the behavior of the parser. I decided it should take a Rust String as input, which means I'm assuming all JSON is encoded as UTF-8 (or can be easily translated into UTF-8). Combined with a custom JSONParseError type, this resulted in the following function signature for the parser:
pub fn parse(input: String) -> Result<JSONValue, JSONParseError> {
    ...
}
Note: For brevity I've omitted several boilerplate trait implementations like PartialEq, Display, Debug, and From.
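That said, the From implementations matter for the code below, since the parser uses the ? operator on both f64 parsing and hex conversion. For reference, a minimal JSONParseError that satisfies everything the code below needs looks something like this (the exact shape and messages are one possible choice, not anything the business card dictates):

use std::num::{ParseFloatError, ParseIntError};

#[derive(Debug, PartialEq)]
pub struct JSONParseError {
    message: String,
}

impl JSONParseError {
    pub fn new(message: &str) -> Self {
        JSONParseError {
            message: String::from(message),
        }
    }
}

// These let `?` convert the standard library's parse errors into JSONParseError.
impl From<ParseFloatError> for JSONParseError {
    fn from(e: ParseFloatError) -> Self {
        JSONParseError::new(&e.to_string())
    }
}

impl From<ParseIntError> for JSONParseError {
    fn from(e: ParseIntError) -> Self {
        JSONParseError::new(&e.to_string())
    }
}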
Now that data types and a function signature existed, I could take the business card and translate it into a series of tests that I felt covered the described behavior. I put these tests in a separate file called business_card.rs.
use crate::*;

#[test]
fn parse_null() {
    assert_eq!(parse(String::from("null")), Ok(JSONValue::Null));
}

#[test]
fn parse_true() {
    assert_eq!(parse(String::from("true")), Ok(JSONValue::Boolean(true)));
}

#[test]
fn parse_false() {
    assert_eq!(parse(String::from("false")), Ok(JSONValue::Boolean(false)));
}

#[test]
fn parse_number() {
    assert_eq!(parse(String::from("1")), Ok(JSONValue::Number(1.0)));
    assert_eq!(parse(String::from("123")), Ok(JSONValue::Number(123.0)));
    assert_eq!(parse(String::from("123.45")), Ok(JSONValue::Number(123.45)));
    assert_eq!(parse(String::from("-1")), Ok(JSONValue::Number(-1.0)));
    assert_eq!(parse(String::from("-123")), Ok(JSONValue::Number(-123.0)));
    assert_eq!(
        parse(String::from("-123.45")),
        Ok(JSONValue::Number(-123.45))
    );
    assert_eq!(parse(String::from("123e8")), Ok(JSONValue::Number(123e8)));
    assert_eq!(parse(String::from("123e+8")), Ok(JSONValue::Number(123e8)));
    assert_eq!(parse(String::from("123e-8")), Ok(JSONValue::Number(123e-8)));
    assert_eq!(parse(String::from("123E8")), Ok(JSONValue::Number(123e8)));
    assert_eq!(parse(String::from("123E+8")), Ok(JSONValue::Number(123e8)));
    assert_eq!(parse(String::from("123E-8")), Ok(JSONValue::Number(123e-8)));
    assert_eq!(
        parse(String::from("-123.45e-8")),
        Ok(JSONValue::Number(-123.45e-8))
    );
}

#[test]
fn parse_string() {
    assert_eq!(
        parse(String::from("\"\"")),
        Ok(JSONValue::String(String::new()))
    );
    assert_eq!(
        parse(String::from("\"a\"")),
        Ok(JSONValue::String(String::from("a")))
    );
    assert_eq!(
        parse(String::from("\"Hello World!\"")),
        Ok(JSONValue::String(String::from("Hello World!")))
    );
    assert_eq!(
        parse(String::from("\"\\\"\"")),
        Ok(JSONValue::String(String::from("\"")))
    );
    assert_eq!(
        parse(String::from("\"\\\\\"")),
        Ok(JSONValue::String(String::from("\\")))
    );
    assert_eq!(
        parse(String::from("\"\\/\"")),
        Ok(JSONValue::String(String::from("/")))
    );
    assert_eq!(
        parse(String::from("\"\\b\"")),
        Ok(JSONValue::String(String::from("\x08")))
    );
    assert_eq!(
        parse(String::from("\"\\f\"")),
        Ok(JSONValue::String(String::from("\x0C")))
    );
    assert_eq!(
        parse(String::from("\"\\n\"")),
        Ok(JSONValue::String(String::from("\n")))
    );
    assert_eq!(
        parse(String::from("\"\\r\"")),
        Ok(JSONValue::String(String::from("\r")))
    );
    assert_eq!(
        parse(String::from("\"\\t\"")),
        Ok(JSONValue::String(String::from("\t")))
    );
    assert_eq!(
        parse(String::from("\"\\u2603\"")),
        Ok(JSONValue::String(String::from("\u{2603}")))
    );
    assert_eq!(
        parse(String::from("\"\\uD83D\\uDE00\"")),
        Ok(JSONValue::String(String::from("\u{1F600}")))
    );
}

#[test]
fn parse_array() {
    assert_eq!(parse(String::from("[]")), Ok(JSONValue::Array(vec![])));
    assert_eq!(
        parse(String::from("[1,2,3]")),
        Ok(JSONValue::Array(vec![
            JSONValue::Number(1.0),
            JSONValue::Number(2.0),
            JSONValue::Number(3.0)
        ]))
    );
}

#[test]
fn parse_object() {
    assert_eq!(
        parse(String::from("{}")),
        Ok(JSONValue::Object(HashMap::new()))
    );
    let mut o1: HashMap<String, JSONValue> = HashMap::new();
    o1.insert(
        String::from("name"),
        JSONValue::String(String::from("John Doe")),
    );
    assert_eq!(
        parse(String::from("{\"name\":\"John Doe\"}")),
        Ok(JSONValue::Object(o1))
    );
    let mut o2: HashMap<String, JSONValue> = HashMap::new();
    o2.insert(
        String::from("name"),
        JSONValue::String(String::from("John Doe")),
    );
    o2.insert(String::from("age"), JSONValue::Number(42.0));
    assert_eq!(
        parse(String::from("{\"name\":\"John Doe\",\"age\":42}")),
        Ok(JSONValue::Object(o2))
    );
}
Most of these tests are pretty straightforward. The biggest assumptions were about what each of the escape sequences meant. Chief among those was that \u four-hex-digits refers to UTF-16 encoded values and thus needs to account for surrogate pairs.
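To make that concrete, here's the arithmetic behind the \uD83D\uDE00 test case above. The constants come from the UTF-16 encoding scheme, not from anything on the business card:

// Decoding a UTF-16 surrogate pair into a Unicode scalar value.
let high: u32 = 0xD83D; // high surrogate (0xD800..=0xDBFF)
let low: u32 = 0xDE00; // low surrogate (0xDC00..=0xDFFF)
let code = 0x10000 + ((high & 0x03FF) << 10) + (low & 0x03FF);
assert_eq!(code, 0x1F600); // U+1F600 GRINNING FACE
assert_eq!(char::from_u32(code), Some('\u{1F600}'));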
Building the Parser
With all of that out of the way I was free to start hacking away at a hand-crafted recursive descent parser. This parser has two primary goals:
1) Be correct (this includes not crashing)
2) Be easy to understand
Performance, extensibility, security, and real-world usability are non-goals.
Here's the initial parser I came up with that conforms to the business card behavior:
use std::collections::HashMap;
use std::iter::Peekable;
use std::str::Chars;

// Consume the next character and fail unless it's exactly the one expected.
fn expect(cursor: &mut Peekable<Chars>, c: char) -> Result<(), JSONParseError> {
    if cursor.next() != Some(c) {
        return Err(JSONParseError::new(&format!("Expected character: {}", c)));
    }
    Ok(())
}

// Match a fixed token ("null", "true", "false") and yield the given value.
fn parse_token(
    cursor: &mut Peekable<Chars>,
    token: Chars,
    value: JSONValue,
) -> Result<JSONValue, JSONParseError> {
    for c in token {
        expect(cursor, c)?;
    }
    Ok(value)
}

fn parse_null(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "null".chars(), JSONValue::Null)
}

fn parse_true(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "true".chars(), JSONValue::Boolean(true))
}

fn parse_false(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "false".chars(), JSONValue::Boolean(false))
}

// Read a run of one or more ASCII digits into `number`.
fn read_integer(cursor: &mut Peekable<Chars>, number: &mut String) -> Result<(), JSONParseError> {
    let first = cursor.peek();
    if first < Some(&'0') || first > Some(&'9') {
        return Err(JSONParseError::new("Unexpected character"));
    }
    while let Some(c) = cursor.peek() {
        if c >= &'0' && c <= &'9' {
            number.push(*c);
            cursor.next();
        } else {
            break;
        }
    }
    Ok(())
}

// Accumulate the text of a JSON number, then defer to Rust's f64 parser.
fn parse_number_raw(cursor: &mut Peekable<Chars>) -> Result<f64, JSONParseError> {
    let mut number = String::new();
    if cursor.peek() == Some(&'-') {
        number.push('-');
        cursor.next();
    }
    if cursor.peek() == Some(&'0') {
        number.push('0');
        cursor.next();
    } else {
        read_integer(cursor, &mut number)?;
    }
    if cursor.peek() == Some(&'.') {
        number.push('.');
        cursor.next();
        read_integer(cursor, &mut number)?;
    }
    if cursor.peek() == Some(&'e') || cursor.peek() == Some(&'E') {
        number.push('e');
        cursor.next();
        if cursor.peek() == Some(&'-') {
            number.push('-');
            cursor.next();
        }
        if cursor.peek() == Some(&'+') {
            cursor.next();
        }
        read_integer(cursor, &mut number)?;
    }
    Ok(number.parse::<f64>()?)
}

fn parse_number(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    Ok(JSONValue::Number(parse_number_raw(cursor)?))
}

// Parse the four hex digits of a \u escape, recursing once to decode a
// UTF-16 surrogate pair when the first value is a high surrogate.
fn parse_unicode_raw(
    cursor: &mut Peekable<Chars>,
    high: Option<u32>,
) -> Result<char, JSONParseError> {
    expect(cursor, 'u')?;
    let mut u = String::new();
    for _ in 0..4 {
        if let Some(c) = cursor.next() {
            if c.is_ascii_hexdigit() {
                u.push(c);
            } else {
                return Err(JSONParseError::new("Unexpected character"));
            }
        } else {
            return Err(JSONParseError::new("Unexpected EOF"));
        }
    }
    let numeric = u32::from_str_radix(&u, 16)?;
    if let Some(h) = high {
        if numeric >= 0xDC00 && numeric <= 0xDFFF {
            let mut code = 0x10000;
            code += (h & 0x03FF) << 10;
            code += numeric & 0x03FF;
            char::from_u32(code).ok_or(JSONParseError::new("Invalid unicode character"))
        } else {
            Err(JSONParseError::new("Invalid surrogate pair"))
        }
    } else {
        if numeric >= 0xD800 && numeric <= 0xDBFF {
            expect(cursor, '\\')?;
            parse_unicode_raw(cursor, Some(numeric))
        } else {
            char::from_u32(numeric).ok_or(JSONParseError::new("Invalid unicode character"))
        }
    }
}

// Parse a string literal, translating escapes and rejecting raw control characters.
fn parse_string_raw(cursor: &mut Peekable<Chars>) -> Result<String, JSONParseError> {
    expect(cursor, '"')?;
    let mut s = String::new();
    let mut done = false;
    while let Some(c) = cursor.next() {
        if c == '"' {
            done = true;
            break;
        } else if c == '\\' {
            if let Some('u') = cursor.peek() {
                s.push(parse_unicode_raw(cursor, None)?)
            } else {
                match cursor.next() {
                    Some('"') => s.push('"'),
                    Some('\\') => s.push('\\'),
                    Some('/') => s.push('/'),
                    Some('b') => s.push('\x08'),
                    Some('f') => s.push('\x0C'),
                    Some('n') => s.push('\n'),
                    Some('r') => s.push('\r'),
                    Some('t') => s.push('\t'),
                    None => return Err(JSONParseError::new("Unexpected EOF")),
                    _ => return Err(JSONParseError::new("Unexpected character")),
                }
            }
        } else {
            match c {
                '\u{0000}'..='\x1F' | '\u{007F}' | '\u{0080}'..='\u{009F}' => {
                    return Err(JSONParseError::new("Unexpected control character"))
                }
                _ => s.push(c),
            }
        }
    }
    if done {
        return Ok(s);
    }
    return Err(JSONParseError::new("Unexpected EOF"));
}

fn parse_string(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    Ok(JSONValue::String(parse_string_raw(cursor)?))
}

fn parse_array(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    expect(cursor, '[')?;
    let mut array: Vec<JSONValue> = vec![];
    if cursor.peek() == Some(&']') {
        cursor.next();
        return Ok(JSONValue::Array(array));
    }
    loop {
        array.push(parse_value(cursor, limit)?);
        if cursor.peek() == Some(&']') {
            cursor.next();
            break;
        }
        expect(cursor, ',')?;
    }
    Ok(JSONValue::Array(array))
}

fn parse_object(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    expect(cursor, '{')?;
    let mut object = HashMap::new();
    if cursor.peek() == Some(&'}') {
        cursor.next();
        return Ok(JSONValue::Object(object));
    }
    loop {
        let key = parse_string_raw(cursor)?;
        expect(cursor, ':')?;
        object.insert(key, parse_value(cursor, limit)?);
        if cursor.peek() == Some(&'}') {
            cursor.next();
            break;
        }
        expect(cursor, ',')?;
    }
    Ok(JSONValue::Object(object))
}

// Dispatch on the next character, tracking the remaining nesting depth.
fn parse_value(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    if limit == 0 {
        return Err(JSONParseError::new("Exceeded nesting limit"));
    }
    let value = match cursor.peek() {
        Some(c) => match c {
            '"' => parse_string(cursor),
            '0'..='9' | '-' => parse_number(cursor),
            '[' => parse_array(cursor, limit - 1),
            '{' => parse_object(cursor, limit - 1),
            't' => parse_true(cursor),
            'f' => parse_false(cursor),
            'n' => parse_null(cursor),
            _ => return Err(JSONParseError::new("Unexpected character")),
        },
        None => return Err(JSONParseError::new("Unexpected EOF")),
    };
    value
}

// Public entry point: parse a complete document and require that nothing follows it.
pub fn parse(input: String) -> Result<JSONValue, JSONParseError> {
    let mut cursor = input.chars().peekable();
    let value = parse_value(&mut cursor, 1000)?;
    if let Some(_) = cursor.peek() {
        return Err(JSONParseError::new("Expected EOF"));
    }
    Ok(value)
}
Here are the new assumptions that were introduced during the implementation:
– “Control characters” means “Unicode control characters”, which includes C0 controls (0x00–0x1F), C1 controls (0x80–0x9F), and delete (0x7F)
– There should be some maximum recursion depth after which an error is returned. This isn't required to conform with the business card, but that paper tape isn't actually infinite so I'm counting this as a necessary implementation detail.
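The limit of 1000 is arbitrary on my part. A quick sanity check of the behavior (these assertions are my own, not part of any suite):

// 1001 nested arrays exceed the hard-coded depth limit of 1000...
let deep = "[".repeat(1001) + &"]".repeat(1001);
assert!(parse(deep).is_err());
// ...while 999 levels still parse fine.
let shallow = "[".repeat(999) + &"]".repeat(999);
assert!(parse(shallow).is_ok());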
The Minefield
So now it's time to see how my “naive” parser holds up against more rigorous test cases. To do this I added all the tests from the JSON Parsing Test Suite mentioned earlier to my project, and wrote the following build script to translate them into Rust tests:
use std::fs;
use std::io;

// Turn a test file path into a valid Rust identifier. Note that ".json" has to
// be stripped before "." is rewritten to "_dot_", or that replacement never matches.
fn make_name(name: &String) -> String {
    name.replace(".json", "")
        .replace("/", "_slash_")
        .replace("-", "_dash_")
        .replace("+", "_plus_")
        .replace(".", "_dot_")
        .replace("#", "_hash_")
        .replace("__", "_")
        .replace("minefield_slash_", "")
        .to_lowercase()
}

fn generate_accept_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() -> std::io::Result<()> {{
    if let Err(e) = parse(std::fs::read_to_string("{}")?) {{
        assert!(false, "Parsing failed: {{:?}}", e);
    }}
    Ok(())
}}"#,
        make_name(path),
        path
    )
}

fn generate_reject_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() {{
    if let Ok(data) = std::fs::read_to_string("{}") {{
        if let Ok(_) = parse(data) {{
            assert!(false, "Parsing should fail")
        }}
    }}
}}"#,
        make_name(path),
        path
    )
}

fn generate_undefined_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() {{
    if let Ok(data) = std::fs::read_to_string("{}") {{
        let _ = parse(data);
    }}
}}"#,
        make_name(path),
        path
    )
}

fn generate_test(path: &String) -> Option<String> {
    if path.contains("/y_") {
        return Some(generate_accept_test(path));
    } else if path.contains("/n_") {
        return Some(generate_reject_test(path));
    } else if path.contains("/i_") {
        return Some(generate_undefined_test(path));
    }
    None
}

fn generate_module(tests: Vec<String>) -> String {
    format!(
        r#"#[cfg(test)]
use crate::parse;
{}"#,
        tests.join("\n")
    )
}

fn main() -> io::Result<()> {
    let mut tests = vec![];
    for entry in fs::read_dir("minefield")? {
        let entry = entry?;
        let path = entry.path().display().to_string();
        if path.ends_with(".json") {
            if let Some(test) = generate_test(&path) {
                tests.push(test);
            }
        }
    }
    fs::write("src/tests/minefield.rs", generate_module(tests))?;
    Ok(())
}
The suite's tests are broken into three categories:
– y_ tests which should be accepted
– n_ tests which should be rejected
– i_ tests whose behavior is undefined
I decided that these should be handled as follows:
– y_ tests should return a JSONValue
– n_ tests should return a JSONParseError
– i_ tests should just return without crashing
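As a concrete example, a suite file like minefield/y_array_empty.json (the suite names its files in exactly this pattern; the listing below is my own illustration) would generate an accept test along these lines:

#[test]
fn y_array_empty() -> std::io::Result<()> {
    if let Err(e) = parse(std::fs::read_to_string("minefield/y_array_empty.json")?) {
        assert!(false, "Parsing failed: {:?}", e);
    }
    Ok(())
}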
So how did I fare?
The Results
The results at this point kind of surprised me. Out of 325 tests, only 18 failed. The thing is, though, there was a bug I already knew about: I knew from the start that some tests would fail because the business card grammar doesn't mention anything about whitespace.
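The fix boils down to skipping the four whitespace characters JSON allows wherever a token can begin. A minimal helper along these lines would do it (the name and the exact call sites, before each token in parse_value, parse_array, parse_object, and parse, are my own choice of how to slot it in):

// Skip the insignificant whitespace JSON permits between tokens.
fn skip_whitespace(cursor: &mut Peekable<Chars>) {
    while let Some(c) = cursor.peek() {
        match c {
            ' ' | '\t' | '\n' | '\r' => {
                cursor.next();
            }
            _ => break,
        }
    }
}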
After updating the parser to allow whitespace there were only two failing tests, and both failed for the same reason: despite being a control character, delete (0x7F) should be allowed in strings.
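Given the match in parse_string_raw above, the change amounts to dropping delete from the rejected ranges. The updated arm (a fragment of that function, shown in context):

} else {
    match c {
        // Delete (0x7F) is now allowed through; C0 and C1 controls are still rejected.
        '\u{0000}'..='\x1F' | '\u{0080}'..='\u{009F}' => {
            return Err(JSONParseError::new("Unexpected control character"))
        }
        _ => s.push(c),
    }
}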
After updating the parser to accept delete in strings, 100% of the tests passed. A parser I wrote in an afternoon, based on a business card and naive assumptions, passed every test of defined behavior in the JSON Parsing Test Suite.
Conclusion
At the end of the day, parsing JSON in the real world is still a minefield. While I may have been able to easily craft a parser that passes tests for all of the defined behavior, it's the undefined tests that are the real problem. The cases where there isn't a correct answer will still cause interoperability issues when passing data between different libraries, especially between different languages. Moreover, even for the defined y_ tests, there is no assertion about what a value parses into, just that it's accepted. As an example: my implementation can't safely handle integers greater than 2^53 because it treats all numbers as floats, but an implementation that parses into an actual integer type could.
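Here's a concrete illustration of that last point; the literal below is 2^53 + 1, the first integer an f64 cannot represent exactly:

// 9007199254740993 is 2^53 + 1; parsed as an f64 it silently rounds to 2^53.
let v = parse(String::from("9007199254740993"));
assert_eq!(v, Ok(JSONValue::Number(9007199254740992.0)));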
Despite the real-world challenges though, I think JSON parsing might be a navigable minefield. In an afternoon I was able to throw together a parser that's as correct as any mainstream JSON parser in use today (perhaps more correct than some), and I don't think that would be possible with many other data interchange formats.