Toast in the Machine

A Blog about Programming

There's a really great article titled Parsing JSON is a Minefield that provides some pretty comprehensive information about the ambiguities and contradictions in the JSON “spec”. As is so often the case with software, it turns out JSON has more complexities than you may expect (or at the very least more than Douglas Crockford would tell you about).

The JSON grammar was famously printed on a business card in order to demonstrate how simple it was. This got me to thinking about an interesting experiment: how correct would a JSON parser implemented purely based on that business card be? Luckily the author of the “Parsing JSON is a Minefield” article also created a comprehensive suite of tests that cover all the edge cases discussed in the article. So of course I decided to write and test a JSON parser.

Baseline Assumptions

The JSON business card is, as you'd expect, pretty vague, so I had to make several assumptions about how things should work. I did my best to make decisions based on what a typical software developer would expect, but they're still assumptions based on my own biases. Just keep that in mind going forward.

Before I could do anything I needed to define how the parser would work. Unlike serde, which deserializes JSON into specific structs, I wanted to parse JSON into a more generic enum that can represent any JSON file.

pub enum JSONValue {
    Null,
    Boolean(bool),
    Number(f64),
    String(String),
    Array(Vec<JSONValue>),
    Object(HashMap<String, JSONValue>),
}

So already there were a few assumptions being made: “Number” is always represented as a 64-bit floating point number (the same way JavaScript does it), String is a Rust String (which is UTF-8 and satisfies the “any-Unicode-character” requirement), and Objects are HashMaps (meaning duplicate keys can't exist).
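One consequence of the HashMap choice is worth spelling out: inserting a duplicate key silently overwrites the previous value, so a parser built on this representation ends up with last-value-wins semantics for duplicate object keys. A quick demonstration:

```rust
use std::collections::HashMap;

fn main() {
    let mut object: HashMap<String, f64> = HashMap::new();
    object.insert(String::from("a"), 1.0);
    // A duplicate key overwrites the earlier entry rather than erroring.
    object.insert(String::from("a"), 2.0);
    assert_eq!(object.len(), 1);
    assert_eq!(object["a"], 2.0);
}
```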

Next I needed to define the behavior of the parser. I decided it should take a Rust String as input, which means I'm assuming all JSON is encoded as UTF-8 (or can easily be translated into UTF-8). Combined with a custom JSONParseError type, this resulted in the following function signature for the parser:

pub fn parse(input: String) -> Result<JSONValue, JSONParseError> {
    ...
}

Note: For brevity I've omitted several boilerplate trait implementations like PartialEq, Display, Debug, and From.

Now that data types and a function signature existed, I could take the business card and translate it into a series of tests that I felt covered the described behavior. I put these tests in a separate file called business_card.rs.

use crate::*;

#[test]
fn parse_null() {
    assert_eq!(parse(String::from("null")), Ok(JSONValue::Null));
}

#[test]
fn parse_true() {
    assert_eq!(parse(String::from("true")), Ok(JSONValue::Boolean(true)));
}

#[test]
fn parse_false() {
    assert_eq!(parse(String::from("false")), Ok(JSONValue::Boolean(false)));
}

#[test]
fn parse_number() {
    assert_eq!(parse(String::from("1")), Ok(JSONValue::Number(1.0)));

    assert_eq!(parse(String::from("123")), Ok(JSONValue::Number(123.0)));

    assert_eq!(parse(String::from("123.45")), Ok(JSONValue::Number(123.45)));

    assert_eq!(parse(String::from("-1")), Ok(JSONValue::Number(-1.0)));

    assert_eq!(parse(String::from("-123")), Ok(JSONValue::Number(-123.0)));

    assert_eq!(
        parse(String::from("-123.45")),
        Ok(JSONValue::Number(-123.45))
    );

    assert_eq!(parse(String::from("123e8")), Ok(JSONValue::Number(123e8)));

    assert_eq!(parse(String::from("123e+8")), Ok(JSONValue::Number(123e8)));

    assert_eq!(parse(String::from("123e-8")), Ok(JSONValue::Number(123e-8)));

    assert_eq!(parse(String::from("123E8")), Ok(JSONValue::Number(123e8)));

    assert_eq!(parse(String::from("123E+8")), Ok(JSONValue::Number(123e8)));

    assert_eq!(parse(String::from("123E-8")), Ok(JSONValue::Number(123e-8)));

    assert_eq!(
        parse(String::from("-123.45e-8")),
        Ok(JSONValue::Number(-123.45e-8))
    );
}

#[test]
fn parse_string() {
    assert_eq!(
        parse(String::from("\"\"")),
        Ok(JSONValue::String(String::new()))
    );

    assert_eq!(
        parse(String::from("\"a\"")),
        Ok(JSONValue::String(String::from("a")))
    );

    assert_eq!(
        parse(String::from("\"Hello World!\"")),
        Ok(JSONValue::String(String::from("Hello World!")))
    );

    assert_eq!(
        parse(String::from("\"\\\"\"")),
        Ok(JSONValue::String(String::from("\"")))
    );

    assert_eq!(
        parse(String::from("\"\\\\\"")),
        Ok(JSONValue::String(String::from("\\")))
    );

    assert_eq!(
        parse(String::from("\"\\/\"")),
        Ok(JSONValue::String(String::from("/")))
    );

    assert_eq!(
        parse(String::from("\"\\b\"")),
        Ok(JSONValue::String(String::from("\x08")))
    );

    assert_eq!(
        parse(String::from("\"\\f\"")),
        Ok(JSONValue::String(String::from("\x0C")))
    );

    assert_eq!(
        parse(String::from("\"\\n\"")),
        Ok(JSONValue::String(String::from("\n")))
    );

    assert_eq!(
        parse(String::from("\"\\r\"")),
        Ok(JSONValue::String(String::from("\r")))
    );

    assert_eq!(
        parse(String::from("\"\\t\"")),
        Ok(JSONValue::String(String::from("\t")))
    );

    assert_eq!(
        parse(String::from("\"\\u2603\"")),
        Ok(JSONValue::String(String::from("\u{2603}")))
    );

    assert_eq!(
        parse(String::from("\"\\uD83D\\uDE00\"")),
        Ok(JSONValue::String(String::from("\u{1F600}")))
    );
}

#[test]
fn parse_array() {
    assert_eq!(parse(String::from("[]")), Ok(JSONValue::Array(vec![])));

    assert_eq!(
        parse(String::from("[1,2,3]")),
        Ok(JSONValue::Array(vec![
            JSONValue::Number(1.0),
            JSONValue::Number(2.0),
            JSONValue::Number(3.0)
        ]))
    );
}

#[test]
fn parse_object() {
    assert_eq!(
        parse(String::from("{}")),
        Ok(JSONValue::Object(HashMap::new()))
    );

    let mut o1: HashMap<String, JSONValue> = HashMap::new();
    o1.insert(
        String::from("name"),
        JSONValue::String(String::from("John Doe")),
    );

    assert_eq!(
        parse(String::from("{\"name\":\"John Doe\"}")),
        Ok(JSONValue::Object(o1))
    );

    let mut o2: HashMap<String, JSONValue> = HashMap::new();
    o2.insert(
        String::from("name"),
        JSONValue::String(String::from("John Doe")),
    );
    o2.insert(String::from("age"), JSONValue::Number(42.0));

    assert_eq!(
        parse(String::from("{\"name\":\"John Doe\",\"age\":42}")),
        Ok(JSONValue::Object(o2))
    );
}

Most of these tests are pretty straightforward. The biggest assumptions were what each of the escape characters meant. Among those, the major assumption was that \u four-hex-digits referred to UTF-16 code units and thus needed to account for surrogate pairs.
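The surrogate-pair math itself is just bit arithmetic: combine ten bits from each half on top of a 0x10000 base. Here's that calculation in isolation (decode_surrogate_pair is my own helper name, not part of the parser):

```rust
// Combine a UTF-16 surrogate pair into a single Unicode scalar value.
// High surrogates are 0xD800-0xDBFF, low surrogates are 0xDC00-0xDFFF;
// each contributes 10 bits on top of the 0x10000 base.
fn decode_surrogate_pair(high: u32, low: u32) -> Option<char> {
    let code = 0x10000 + ((high & 0x03FF) << 10) + (low & 0x03FF);
    char::from_u32(code)
}

fn main() {
    // "\uD83D\uDE00" from the test above decodes to U+1F600.
    assert_eq!(decode_surrogate_pair(0xD83D, 0xDE00), Some('\u{1F600}'));
}
```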

Building the Parser

With all of that out of the way I was free to start hacking away at a hand crafted recursive descent parser. This parser has two primary goals:

1) Be correct (this includes not crashing)
2) Be easy to understand

Performance, extensibility, security, and real world usability are non-goals.

Here's the initial parser I came up with that conforms to the business card behavior:

fn expect(cursor: &mut Peekable<Chars>, c: char) -> Result<(), JSONParseError> {
    if cursor.next() != Some(c) {
        return Err(JSONParseError::new(&format!("Expected character: {}", c)));
    }

    Ok(())
}

fn parse_token(
    cursor: &mut Peekable<Chars>,
    token: Chars,
    value: JSONValue,
) -> Result<JSONValue, JSONParseError> {
    for c in token {
        expect(cursor, c)?;
    }

    Ok(value)
}

fn parse_null(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "null".chars(), JSONValue::Null)
}

fn parse_true(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "true".chars(), JSONValue::Boolean(true))
}

fn parse_false(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    parse_token(cursor, "false".chars(), JSONValue::Boolean(false))
}

fn read_integer(cursor: &mut Peekable<Chars>, number: &mut String) -> Result<(), JSONParseError> {
    let first = cursor.peek();

    if first < Some(&'0') || first > Some(&'9') {
        return Err(JSONParseError::new("Unexpected character"));
    }

    while let Some(c) = cursor.peek() {
        if c >= &'0' && c <= &'9' {
            number.push(*c);
            cursor.next();
        } else {
            break;
        }
    }

    Ok(())
}

fn parse_number_raw(cursor: &mut Peekable<Chars>) -> Result<f64, JSONParseError> {
    let mut number = String::new();

    if cursor.peek() == Some(&'-') {
        number.push('-');
        cursor.next();
    }

    if cursor.peek() == Some(&'0') {
        number.push('0');
        cursor.next();
    } else {
        read_integer(cursor, &mut number)?;
    }

    if cursor.peek() == Some(&'.') {
        number.push('.');
        cursor.next();
        read_integer(cursor, &mut number)?;
    }

    if cursor.peek() == Some(&'e') || cursor.peek() == Some(&'E') {
        number.push('e');
        cursor.next();

        if cursor.peek() == Some(&'-') {
            number.push('-');
            cursor.next();
        }

        if cursor.peek() == Some(&'+') {
            cursor.next();
        }

        read_integer(cursor, &mut number)?;
    }

    Ok(number.parse::<f64>()?)
}

fn parse_number(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    Ok(JSONValue::Number(parse_number_raw(cursor)?))
}

fn parse_unicode_raw(
    cursor: &mut Peekable<Chars>,
    high: Option<u32>,
) -> Result<char, JSONParseError> {
    expect(cursor, 'u')?;

    let mut u = String::new();

    for _ in 0..4 {
        if let Some(c) = cursor.next() {
            if c.is_ascii_hexdigit() {
                u.push(c);
            } else {
                return Err(JSONParseError::new("Unexpected character"));
            }
        } else {
            return Err(JSONParseError::new("Unexpected EOF"));
        }
    }

    let numeric = u32::from_str_radix(&u, 16)?;

    if let Some(h) = high {
        if numeric >= 0xDC00 && numeric <= 0xDFFF {
            let mut code = 0x10000;
            code += (h & 0x03FF) << 10;
            code += numeric & 0x03FF;
            char::from_u32(code).ok_or(JSONParseError::new("Invalid unicode character"))
        } else {
            Err(JSONParseError::new("Invalid surrogate pair"))
        }
    } else {
        if numeric >= 0xD800 && numeric <= 0xDBFF {
            expect(cursor, '\\')?;
            parse_unicode_raw(cursor, Some(numeric))
        } else {
            char::from_u32(numeric).ok_or(JSONParseError::new("Invalid unicode character"))
        }
    }
}

fn parse_string_raw(cursor: &mut Peekable<Chars>) -> Result<String, JSONParseError> {
    expect(cursor, '"')?;

    let mut s = String::new();
    let mut done = false;

    while let Some(c) = cursor.next() {
        if c == '"' {
            done = true;
            break;
        } else if c == '\\' {
            if let Some('u') = cursor.peek() {
                s.push(parse_unicode_raw(cursor, None)?)
            } else {
                match cursor.next() {
                    Some('"') => s.push('"'),
                    Some('\\') => s.push('\\'),
                    Some('/') => s.push('/'),
                    Some('b') => s.push('\x08'),
                    Some('f') => s.push('\x0C'),
                    Some('n') => s.push('\n'),
                    Some('r') => s.push('\r'),
                    Some('t') => s.push('\t'),
                    None => return Err(JSONParseError::new("Unexpected EOF")),
                    _ => return Err(JSONParseError::new("Unexpected character")),
                }
            }
        } else {
            match c {
                '\u{0000}'..='\x1F' | '\u{007F}' | '\u{0080}'..='\u{009F}' => {
                    return Err(JSONParseError::new("Unexpected control character"))
                }
                _ => s.push(c),
            }
        }
    }

    if done {
        return Ok(s);
    }

    return Err(JSONParseError::new("Unexpected EOF"));
}

fn parse_string(cursor: &mut Peekable<Chars>) -> Result<JSONValue, JSONParseError> {
    Ok(JSONValue::String(parse_string_raw(cursor)?))
}

fn parse_array(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    expect(cursor, '[')?;

    let mut array: Vec<JSONValue> = vec![];

    if cursor.peek() == Some(&']') {
        cursor.next();
        return Ok(JSONValue::Array(array));
    }

    loop {
        array.push(parse_value(cursor, limit)?);

        if cursor.peek() == Some(&']') {
            cursor.next();
            break;
        }

        expect(cursor, ',')?;
    }

    Ok(JSONValue::Array(array))
}

fn parse_object(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    expect(cursor, '{')?;

    let mut object = HashMap::new();

    if cursor.peek() == Some(&'}') {
        cursor.next();
        return Ok(JSONValue::Object(object));
    }

    loop {
        let key = parse_string_raw(cursor)?;

        expect(cursor, ':')?;
        object.insert(key, parse_value(cursor, limit)?);

        if cursor.peek() == Some(&'}') {
            cursor.next();
            break;
        }

        expect(cursor, ',')?;
    }

    Ok(JSONValue::Object(object))
}

fn parse_value(cursor: &mut Peekable<Chars>, limit: usize) -> Result<JSONValue, JSONParseError> {
    if limit == 0 {
        return Err(JSONParseError::new("Exceeded nesting limit"));
    }

    let value = match cursor.peek() {
        Some(c) => match c {
            '"' => parse_string(cursor),
            '0'..='9' | '-' => parse_number(cursor),
            '[' => parse_array(cursor, limit - 1),
            '{' => parse_object(cursor, limit - 1),
            't' => parse_true(cursor),
            'f' => parse_false(cursor),
            'n' => parse_null(cursor),
            _ => return Err(JSONParseError::new("Unexpected character")),
        },
        None => return Err(JSONParseError::new("Unexpected EOF")),
    };

    value
}

pub fn parse(input: String) -> Result<JSONValue, JSONParseError> {
    let mut cursor = input.chars().peekable();

    let value = parse_value(&mut cursor, 1000)?;

    if let Some(_) = cursor.peek() {
        return Err(JSONParseError::new("Expected EOF"));
    }

    Ok(value)
}

Here are the new assumptions that were introduced during the implementation:

– “Control characters” means “Unicode control characters”, which includes C0 controls (0x00–0x1F), C1 controls (0x80–0x9F), and delete (0x7F)
– There should be some maximum recursion depth after which an error is returned. This isn't required to conform with the business card, but that paper tape isn't actually infinite so I'm counting this as a necessary implementation detail.

The Minefield

So now it's time to see how my “naive” parser holds up against more rigorous test cases. In order to do this I added all the tests from the JSON Parsing Test Suite I mentioned before to my project and wrote the following build script to translate those into Rust tests:

use std::fs;
use std::io;

fn make_name(name: &String) -> String {
    name.replace(".json", "")
        .replace("/", "_slash_")
        .replace("-", "_dash_")
        .replace("+", "_plus_")
        .replace(".", "_dot_")
        .replace("#", "_hash_")
        .replace("__", "_")
        .replace("minefield_slash_", "")
        .to_lowercase()
}

fn generate_accept_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() -> std::io::Result<()> {{
    if let Err(e) = parse(std::fs::read_to_string("{}")?) {{
        assert!(false, "Parsing failed: {{:?}}", e);
    }}

    Ok(())
}}"#,
        make_name(path),
        path
    )
}

fn generate_reject_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() {{
    if let Ok(data) = std::fs::read_to_string("{}") {{
        if let Ok(_) = parse(data) {{
            assert!(false, "Parsing should fail")
        }}
    }}
}}"#,
        make_name(path),
        path
    )
}

fn generate_undefined_test(path: &String) -> String {
    format!(
        r#"
#[test]
fn {}() {{
    if let Ok(data) = std::fs::read_to_string("{}") {{
        let _ = parse(data);
    }}
}}"#,
        make_name(path),
        path
    )
}

fn generate_test(path: &String) -> Option<String> {
    if path.contains("/y_") {
        return Some(generate_accept_test(path));
    } else if path.contains("/n_") {
        return Some(generate_reject_test(path));
    } else if path.contains("/i_") {
        return Some(generate_undefined_test(path));
    }

    None
}

fn generate_module(tests: Vec<String>) -> String {
    format!(
        r#"#[cfg(test)]
use crate::parse;
{}"#,
        tests.join("\n")
    )
}

fn main() -> io::Result<()> {
    let mut tests = vec![];

    for entry in fs::read_dir("minefield")? {
        let entry = entry?;
        let path = entry.path().display().to_string();

        if path.ends_with(".json") {
            if let Some(test) = generate_test(&path) {
                tests.push(test);
            }
        }
    }

    fs::write("src/tests/minefield.rs", generate_module(tests))?;

    Ok(())
}

These test suites are broken into three categories:

– y_ tests which should be accepted
– n_ tests which should be rejected
– i_ tests which are undefined

I decided that these should be handled as follows:

– y_ tests should return a JSONValue
– n_ tests should return a JSONParseError
– i_ tests should just return without crashing

So how did I fare?

The Results

The results at this point kind of surprised me. Out of 325 tests, only 18 failed. The thing is though, there was a bug I knew about already: I knew from the start that some tests would fail because the business card grammar doesn't mention anything about whitespace.

After updating the parser to allow whitespace there were only two failing tests. And both of these failed for the same reason: despite being a control character, delete (0x7F) should be allowed in strings.
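The whitespace fix itself is small. A sketch of the helper I mean, assuming the four whitespace characters JSON allows (space, tab, newline, carriage return) and that it gets called around each token:

```rust
use std::iter::Peekable;
use std::str::Chars;

// Consume any run of JSON whitespace without touching other characters.
fn skip_whitespace(cursor: &mut Peekable<Chars>) {
    while let Some(c) = cursor.peek() {
        match c {
            ' ' | '\t' | '\n' | '\r' => {
                cursor.next();
            }
            _ => break,
        }
    }
}

fn main() {
    let mut cursor = "  \n\ttrue".chars().peekable();
    skip_whitespace(&mut cursor);
    assert_eq!(cursor.next(), Some('t'));
}
```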

After updating the parser to accept delete in strings 100% of the tests passed. A parser I wrote in an afternoon based on a business card and naive assumptions passed for all the defined behavior in the JSON Parsing Test Suite.

Conclusion

At the end of the day, parsing JSON in the real world is still a minefield. While I may have been able to easily craft a parser that passes tests for all of the defined behavior, it's those undefined tests that are the real problem. The cases where there isn't a correct answer will still cause interoperability issues when passing data between different libraries, especially between different languages. Moreover, even for the defined y_ tests, there is no assertion about what a value parses into, just that it's accepted. As an example: my implementation can't safely handle integers greater than 2^53 because it treats all numbers as floats, but an implementation that supports parsing to an actual integer type could.
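That 2^53 boundary is easy to see directly: above it an f64 can no longer represent every integer, so distinct JSON numbers collapse to the same value.

```rust
fn main() {
    let max_safe = 2f64.powi(53); // 9007199254740992
    // 2^53 + 1 is not representable as an f64 and rounds back down...
    assert_eq!(max_safe + 1.0, max_safe);
    // ...while 2^53 + 2 is representable, so the gaps have begun.
    assert!(max_safe + 2.0 != max_safe);
}
```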

Despite the real world challenges though, I think JSON parsing might be a navigable minefield. In an afternoon I was able to throw together a parser that's as correct as any mainstream JSON parser in use today (perhaps more correct than some), and I don't think that's something that would be possible for many data interchange formats.

I recently saw yet another exchange about developers being expected to provide support for open source software. The debate on this topic isn't anything new. It's gone on long before I got into open source and will probably be going on long after I'm dead. I've heard it all before, but this time one of the takes made me stop and ask a question.

The basic idea was that it's irresponsible to your users to release software if you can't support it. It sounds pretty straightforward at first, but there are some very important assumptions you have to make in order to resolve that sentence. Specifically, what do you mean by “release” and “support”? I don't think any two people reading that sentence will insert the same definition for either of those terms, and that's a recipe for misunderstanding. In reality, both release and support are sliding scales.

First let's consider what we mean by support. Is support responding to an email from a user with a question? Is support acting as a maintainer for the project? Is support providing ongoing bug fixes with an SLA? Is support adding new features that get requested? Is support having an on call rotation? These are all valid definitions of support, but it's pretty clear that some of these would be an absurd standard to hold every open source developer to. But where do we draw the line? I think that varies from project to project and largely depends on what release means for that project.

Just for example, let's take my text editor. I've released this code by putting it on GitHub, but the intention isn't for people to actually use it as their text editor. The code is there for other developers to use as a reference for how to build terminal applications from scratch. The only documentation is a single README, and that begins with a note about how it exists for educational purposes. Now consider Helix. Helix has a logo, marketing page, nicely packaged binaries for multiple operating systems, and extensive end user documentation. Helix has a very different definition of release, and I think as a result the expectation of support is much higher than for my text editor.

Beyond that there are also external factors that can influence what support should be offered. What if the user would rather the software exist without support than not exist at all? What if your software results in a critical CVE? The context of a specific situation can't be ignored.

Ultimately, the point I'm trying to make is that the level of support a developer owes their users is a sliding scale that's influenced by a variety of factors. Reasonable people can disagree about where exactly we draw the specific lines, but treating it as a yes or no binary is always going to be wrong.

Lately I've been working on a text editor as a learning exercise in how to build terminal apps from scratch. In doing so I had to draw a line somewhere as to what “from scratch” actually meant. After all, I don't intend to invent the universe.

My initial plan was to restrict myself to Rust's standard library, but it quickly became clear that wasn't going to be sufficient. Specifically, this project requires the ability to change the terminal mode and handle signals, both of which require platform specific code to do manually. I think it's fair to say writing separate blocks of inline assembly for every architecture/operating system pair I want to support is totally out of scope for this project.

Ultimately I decided that the libc crate was a reasonable foundation to build off. libc is the standard library for the C programming language, but it's also part of the POSIX specification. This means it's guaranteed to be available on pretty much any OS except for Windows, and is also going to be the lowest level of abstraction possible that doesn't require writing platform specific code.

Another good choice would be the nix crate, which provides the same functionality as the libc crate, but without requiring the user to write any unsafe code. Ultimately I decided that dealing with the unsafe code myself was a reasonable part of the “from scratch” experience. Moreover, the libc crate is part of the rust-lang project, so it doesn't introduce a dependency on a third party.

Unfortunately the POSIX specification has some wiggle room, so portions of libc still differ between operating systems. Up until today I had only tried compiling the editor on Linux, and I was curious if it would build under FreeBSD. Unfortunately I was greeted with errors.

   Compiling takkun v0.1.0 (/usr/home/cameron/workspace/takkun)
error[E0308]: mismatched types
  --> src/terminal.rs:27:11
   |
27 |     c_cc: [0; 32],
   |           ^^^^^^^ expected an array with a fixed size of 20 elements, found one with 32 elements

error[E0560]: struct `termios` has no field named `c_line`
  --> src/terminal.rs:30:5
   |
30 |     c_line: 0,
   |     ^^^^^^ `termios` does not have this field
   |
   = note: available fields are: `c_iflag`, `c_oflag`, `c_cflag`, `c_lflag`, `c_cc` ... and 2 others

Some errors have detailed explanations: E0308, E0560.
For more information about an error, try `rustc --explain E0308`.
error: could not compile `takkun` due to 2 previous errors

Right away it's clear that the termios struct isn't the same on FreeBSD as it is under Linux. Specifically, the c_cc field is a different size and the c_line field only exists on Linux. So I'm going to need a way to conditionally initialize the termios struct depending on what OS I'm on.

Luckily Rust has a way to do this that even works when initializing static items. The cfg attribute can be used to conditionally include code, allowing me to do the following:

#[cfg(any(target_os = "linux"))]
static mut TERMIOS: libc::termios = libc::termios {
    c_iflag: 0,
    c_oflag: 0,
    c_cflag: 0,
    c_lflag: 0,
    c_cc: [0; 32],
    c_ispeed: 0,
    c_ospeed: 0,
    c_line: 0,
};

#[cfg(any(target_os = "freebsd"))]
static mut TERMIOS: libc::termios = libc::termios {
    c_iflag: 0,
    c_oflag: 0,
    c_cflag: 0,
    c_lflag: 0,
    c_cc: [0; 20],
    c_ispeed: 0,
    c_ospeed: 0,
};

And that was all it took. If I had also needed to write some OS specific logic, Rust provides the cfg! macro that can be used in boolean expressions. It's not a perfect out-of-the-box solution, but it's trivial compared to triggering system calls in pure Rust.
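For completeness, here's roughly what that looks like; the function and path are made up for illustration:

```rust
// cfg! expands to a compile-time boolean, so both branches must still
// type-check on every platform, unlike #[cfg(...)] which removes code
// from compilation entirely.
fn signal_pipe_path() -> &'static str {
    if cfg!(target_os = "freebsd") {
        "/tmp/editor-freebsd.pipe" // hypothetical path, for illustration only
    } else {
        "/tmp/editor.pipe"
    }
}

fn main() {
    println!("would use {}", signal_pipe_path());
}
```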

I recently followed through the Build Your Own Text Editor tutorial that walks through all of antirez's kilo editor as a way to learn more about how terminals really work under the hood. In the process I ended up having so much fun that I decided to re-implement everything in Rust and expand what I had into a more capable editor.

There are obvious features that I think anyone would want to add like an undo stack and support for multiple files, but I wanted to start with improving some of the ways the editor interacted with the terminal. This post documents some improvements I've made so far, but I'm still learning about this stuff. Expect some stream of consciousness, unanswered questions, and potentially incorrect (or more likely incomplete) information.

Alternate Screen Buffer

One of the first things that I found was that kilo leaves traces of itself behind in the terminal's scrollback. More complete editors like Vim, Emacs, and Nano all return the terminal to the state it was in before they were launched.

The answer is a feature called the alternate screen buffer. This StackOverflow answer describes how to switch into and out of the alternate screen buffer in a backwards compatible way by writing the escape sequences "\033[?1049h\033[2J\033[H" and "\033[2J\033[H\033[?1049l" respectively.

That same answer also provides a function for writing to stdin that still confuses me. It's unclear to me from the answer why I would want to write to stdin instead of stdout. Writing these strings to stdout seems to work fine, but I need to investigate that further.
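Here's how I'm writing those sequences, using stdout (\x1b is the same byte as the \033 octal escape in the answer):

```rust
use std::io::{self, Write};

// Switch to the alternate screen buffer, clear it, and home the cursor.
fn enter_alternate_screen() -> io::Result<()> {
    let mut out = io::stdout();
    out.write_all(b"\x1b[?1049h\x1b[2J\x1b[H")?;
    out.flush()
}

// Clear, home the cursor, and switch back to the normal buffer.
fn leave_alternate_screen() -> io::Result<()> {
    let mut out = io::stdout();
    out.write_all(b"\x1b[2J\x1b[H\x1b[?1049l")?;
    out.flush()
}

fn main() -> io::Result<()> {
    enter_alternate_screen()?;
    leave_alternate_screen()
}
```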

Resizing the Terminal

The next thing I wanted to fix was handling the terminal being resized. There is a naive solution, which is to check the terminal's size between every key press. Kilo's input loop runs every tenth of a second, so this works relatively well, but it seems inefficient and could be more responsive.

Unfortunately the only way to detect a terminal resize is handling the SIGWINCH signal. Signal handlers can be dangerous and are generally discouraged if not strictly necessary. They're essentially another thread that isn't allowed to do any locking, so ultimately the only way to safely communicate back to the main application thread is writing to a named pipe. The main thread can then read data from that pipe and trigger a resize.
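A minimal sketch of that pattern, with some loud caveats: it assumes Linux (SIGWINCH hardcoded to 28), uses an anonymous pipe rather than a named one, and declares the POSIX functions by hand so the example stands alone instead of using the libc crate.

```rust
use std::os::raw::{c_int, c_void};
use std::sync::atomic::{AtomicI32, Ordering};

// POSIX calls declared by hand so the sketch is dependency-free;
// in the editor these would come from the libc crate instead.
extern "C" {
    fn pipe(fds: *mut c_int) -> c_int;
    fn write(fd: c_int, buf: *const c_void, count: usize) -> isize;
    fn read(fd: c_int, buf: *mut c_void, count: usize) -> isize;
    fn signal(signum: c_int, handler: usize) -> usize;
    fn raise(signum: c_int) -> c_int;
}

const SIGWINCH: c_int = 28; // Linux's number; other platforms differ

static WRITE_FD: AtomicI32 = AtomicI32::new(-1);

// The handler only does async-signal-safe work: write a single byte
// into the pipe and return immediately.
extern "C" fn on_winch(_sig: c_int) {
    let fd = WRITE_FD.load(Ordering::SeqCst);
    unsafe {
        write(fd, b"w".as_ptr() as *const c_void, 1);
    }
}

fn simulate_resize() -> bool {
    let mut fds: [c_int; 2] = [-1, -1];
    unsafe {
        pipe(fds.as_mut_ptr());
        WRITE_FD.store(fds[1], Ordering::SeqCst);
        signal(SIGWINCH, on_winch as extern "C" fn(c_int) as usize);

        // Pretend the terminal was resized, then drain the pipe the
        // way the main loop would.
        raise(SIGWINCH);
        let mut buf = [0u8; 1];
        read(fds[0], buf.as_mut_ptr() as *mut c_void, 1) == 1
    }
}

fn main() {
    assert!(simulate_resize());
    println!("resize event received");
}
```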

If this were the only signal I needed to handle, I would probably forgo signal handling altogether and opt for the naive solution. Unfortunately there is one place I actually need one.

ctrl+z Support

In a typical terminal application, hitting ctrl+z will cause the process to be sent to the background. This utilizes a feature the shell provides called job control and is typically handled automatically. Unfortunately there are two problems that prevent this from working as expected.

The first problem is raw mode's handling of key presses. In raw mode, ctrl+z is sent along to the application instead of being handled by the terminal. Normally the terminal sends the SIGTSTP signal to the application when ctrl+z is pressed. I couldn't find any information on how to make a process background itself, so I guessed and tried making it send SIGTSTP to itself using libc's kill function.

unsafe {
    libc::kill(std::process::id() as i32, libc::SIGTSTP);
}

Thankfully that worked. The process was sent to the background with raw mode and the alternate screen buffer being exited as well.

The second problem is returning to the foreground. Running fg does resume the process, but it doesn't handle entering raw mode and the alternate screen buffer. To make matters worse, there are now multiple race conditions to contend with.

You might assume you can just have the next line of code after sending SIGTSTP re-enter raw mode and the alternate screen buffer, but that doesn't work. Signal handling isn't synchronous, so that code can run before the process is backgrounded. The only way to ensure that code runs after returning to the foreground is to handle the SIGCONT signal.

Using the same method as the resize handler, the signal handler can write data to a named pipe then have the main thread read that to trigger entering raw mode and the alternate screen buffer. There is another race condition though. The application thread can write to stdout before this handler is finished thus printing data outside the alternate screen buffer and not in raw mode. This means some sort of locking needs to be added to prevent writing in the time between SIGTSTP being sent and raw mode and the alternate screen buffer being re-entered.

Making it Feel More Responsive

The main application thread is now reading data from two signal handlers to handle resizing and returning to the foreground. With the input loop structured like kilo's, reading that data can only happen between key presses. In the worst case, that can mean waiting the full 1/10th of a second timeout before a signal takes effect. That's not a terrible delay, but it is enough to be perceptible. I decided I wanted better.

I ended up adding two threads to the application. The first thread reads from stdin then sends structured events back to the main thread via a Rust channel. The second thread reads the named pipe the signal handlers are using and sends structured events back to the main thread through the same channel. Instead of taking turns reading stdin and the signal handler pipe, the main thread now just reads the single event stream and decides what to write to stdout based on that.
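A stripped-down sketch of that architecture, with a made-up event type and fake producers standing in for the real stdin reader and signal-pipe reader:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical event type unifying both input sources.
enum Event {
    Key(u8),
    Resize,
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Stand-in for the stdin-reading thread.
    let key_tx = tx.clone();
    thread::spawn(move || {
        key_tx.send(Event::Key(b'q')).unwrap();
    });

    // Stand-in for the thread draining the signal handlers' pipe.
    thread::spawn(move || {
        tx.send(Event::Resize).unwrap();
    });

    // The main thread consumes one unified stream instead of taking
    // turns polling two sources.
    for _ in 0..2 {
        match rx.recv().unwrap() {
            Event::Key(k) => println!("key press: {}", k as char),
            Event::Resize => println!("resize"),
        }
    }
}
```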

Input Timeout and the Escape Key

At this point my editor is starting to feel much more polished. Resizing and ctrl+z work, and when it exits nothing is left behind in the terminal's scrollback.

With the new threaded architecture I decided to revisit the way stdin was being read. I thought the timeout might not be needed anymore. Reading stdin could now block forever without affecting the main thread, but this proved to be a mistake.

The 1/10th second timeout was useful for allowing the editor to periodically handle other tasks while waiting on the user to press a key. It served a second function however: detecting when the escape key was pressed.

It turns out that the only way to differentiate the user pressing the escape key from the terminal sending an escape sequence is to wait a little while to see if an escape sequence follows the escape character. If that 1/10th of a second passes with no further data we assume it was the user pressing escape.

But what about network delay? What if an escape sequence gets spread apart in time? What if user input gets bunched up? What if the user just types something in really fast? How can we tell the difference? The answer is you can't.

You can test it if you want. Open a text file with some content in Nano and as quickly as you can type <esc>[F. That's the escape sequence for End. If you do it fast enough the cursor will move to the end of the line. Nano thinks you're the terminal sending an escape sequence.

So ultimately I'm stuck with the timeout. It's the only way to distinguish the user from the machine. I'm never going to be happy with the ambiguity involved with handling the escape key, but that's just how it is when dealing with legacy behavior from the 1960s.
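
The disambiguation logic can be sketched in Python (for illustration; the real editor reads stdin with the timeout described above):

```python
import queue

ESC = "\x1b"

def classify(q, timeout=0.1):
    """Called after an ESC byte arrives: wait briefly for a follow-up byte.

    If nothing shows up within the timeout, assume the user pressed the
    escape key; otherwise assume the terminal is sending an escape sequence.
    """
    try:
        follow = q.get(timeout=timeout)
    except queue.Empty:
        return ("escape-key", None)
    return ("escape-sequence", follow)
```

Feed it a queue that already contains "[" and it reports an escape sequence; feed it an empty queue and it times out and reports the escape key. As the Nano experiment shows, a fast enough typist is indistinguishable from the terminal.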

Previous: Debugging Functions

Now that we have some debugging functions in place, we can start work on actually loading data off a disk. Currently our “disk” is only a single block though, so we need to update our build to generate a real disk image. We'll do this using qemu-img to generate a blank disk image, then we'll use dd to copy our bootloader onto the first block of that image.

build/disk.img: build/boot.img
	qemu-img create build/disk.img 1474560B
	dd if=build/boot.img of=build/disk.img bs=512 count=1 conv=notrunc

1,474,560 bytes is the size of a 1.44MB floppy disk. This is a useful format to work with because it's ubiquitous (at least for any machine we're targeting) and it has a disk geometry we can predict when using qemu. Now we just need to update our run command to point at the full disk image and also to tell qemu this is a floppy image and not a hard drive.
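
As a quick sanity check on that number, the arithmetic can be sketched in Python (the geometry constants here are the standard 1.44MB values, which we'll confirm from the BIOS later in this post):

```python
# Standard 1.44MB floppy geometry
CYLINDERS = 80
HEADS = 2
SECTORS_PER_TRACK = 18
BLOCK_SIZE = 512

total_blocks = CYLINDERS * HEADS * SECTORS_PER_TRACK  # blocks on the disk
total_bytes = total_blocks * BLOCK_SIZE               # size of the image
```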

.PHONY: run
run: build/disk.img
	qemu-system-x86_64 -drive format=raw,if=floppy,file=build/disk.img

Logical Block Addressing (LBA)

Disks are broken up into fixed-size groups of bytes called blocks, which are traditionally 512 bytes. Modern disks tend to use larger 4096 byte blocks, but DOS had long been retired by the time those existed.

Using logical block addressing, you can access the disk blocks indexed from zero. With a total size of 1474560 bytes, our floppy image is made up of 2880 blocks, with block 0 being the first block (which is the boot sector) and block 2879 being the last.

Unfortunately logical block addressing didn't exist until the mid 90s, so most machines that would have run DOS as a primary operating system don't support it. Instead, disks were addressed using cylinder-head-sector addressing, a scheme which requires knowing the physical geometry of the disk.

Cylinder-head-sector (CHS) addressing

Hard drives are made up of one or more metal discs called platters read by some number of magnetic heads (usually two per platter). The drives are sliced up into concentric rings called cylinders, and pie slices called sectors. The intersection of a cylinder and a head is a track, and the intersection of a track and a sector is a specific block. Using this, you can access a block by selecting a specific cylinder, head, and sector. If that sounds confusing don't worry, it's not necessary to fully understand.

In order to maintain our sanity we really need to think of the disk in terms of LBA. Luckily it's easy to convert from LBA to CHS (and vice versa), so we can build functions to read and write blocks based on their LBA address.

The formulas for converting an LBA address into a CHS tuple are as follows:

cylinder = LBA / (heads per cylinder * sectors per track)
head = (LBA / sectors per track) % heads per cylinder
sector = (LBA % sectors per track) + 1

As far as math is concerned this is just basic arithmetic, but there is some information we're missing. We need to know the number of heads per cylinder, the number of sectors per track, and if we want to ensure we're not reading past the end of the drive the number of cylinders (or total number of blocks, which we would need the number of cylinders to compute).
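
Translated into Python purely as a reference model (the real implementation will be assembly), the conversion in both directions looks like:

```python
def lba_to_chs(lba, heads_per_cylinder, sectors_per_track):
    cylinder = lba // (heads_per_cylinder * sectors_per_track)
    head = (lba // sectors_per_track) % heads_per_cylinder
    sector = (lba % sectors_per_track) + 1  # sectors are 1-indexed
    return (cylinder, head, sector)

def chs_to_lba(cylinder, head, sector, heads_per_cylinder, sectors_per_track):
    # Invert the above: count whole tracks, then add the offset within one
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)
```

For our 2880-block floppy (2 heads, 18 sectors per track), block 0 maps to (0, 0, 1) and block 2879 maps to (79, 1, 18).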

Determining Disk Geometry

Disk geometry can be determined by calling int 0x13 function code 0x08. This function will populate various registers with the disk info of a disk specified by the value in dl. We'll need to do some shuffling to get the data we want isolated, so we'll build a function for that.

; allocate some memory to store disk info
I_CYLINDERS: dw 0
I_HEADS: dw 0
I_SECTORS: dw 0


; Loads disk info at the I_CYLINDERS, I_HEADS, and I_SECTORS
; labels.
; dl: drive index
load_disk_info:
    pusha

    mov ah, 0x08
    int 0x13

    ; print a message and halt if there was an error
    jc disk_error

    ; isolate bits [5:0] of cx
    ; this is the number of sectors
    mov ax, cx
    and ax, 0x3f
    mov [I_SECTORS], ax

    ; isolate bits [15:8] of dx
    ; this is the number of heads - 1
    mov al, dh
    mov ah, 0
    inc ax
    mov [I_HEADS], ax

    ; combine ch with bits [7:6] of cl
    ; this is the number of cylinders - 1
    mov al, ch
    mov ah, cl
    shr ah, 6
    inc ax
    mov [I_CYLINDERS], ax

    popa
    ret


disk_error:
    mov bx, DISK_ERROR_MESSAGE
    call print
    hlt
    jmp $-1


DISK_ERROR_MESSAGE: db "Error Reading Disk!$"
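
The bit shuffling above is easier to see in a high-level sketch (a hypothetical decode_disk_info helper in Python, not part of the bootloader):

```python
def decode_disk_info(cx, dh):
    """Decode the registers returned by int 0x13 function 0x08."""
    sectors = cx & 0x3F                      # cl bits [5:0]
    heads = dh + 1                           # dh holds heads - 1
    cl, ch = cx & 0xFF, cx >> 8
    cylinders = (((cl >> 6) << 8) | ch) + 1  # 10-bit value, stored minus 1
    return (cylinders, heads, sectors)
```

For a 1.44MB floppy the BIOS returns cx = 0x4F12 and dh = 1, which decodes to 80 cylinders, 2 heads, and 18 sectors.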

To verify that the disk info is being loaded correctly, let's write a function to print that info in a human readable format.

; Prints disk info
; dl: drive index
print_disk_info:
    pusha

    call load_disk_info

    mov bx, CYLINDERS
    call print
    mov ax, [I_CYLINDERS]
    call print_hex
    mov bx, LINE_BREAK
    call print

    mov bx, HEADS
    call print
    mov ax, [I_HEADS]
    call print_hex
    mov bx, LINE_BREAK
    call print

    mov bx, SECTORS
    call print
    mov ax, [I_SECTORS]
    call print_hex
    mov bx, LINE_BREAK
    call print

    popa
    ret

CYLINDERS: db "Cylinders: $"
HEADS: db "Heads: $"
SECTORS: db "Sectors: $"

With all that in place, we can call print_disk_info with dl set to 0 to print the info for the first floppy disk.

; print disk info for first floppy disk
mov dl, 0x00
call print_disk_info

If everything is working correctly, you should see the following output:

Cylinders: 0050
Heads: 0002
Sectors: 0012

Converting from hex that's 80 cylinders, 2 heads, and 18 sectors, which is the geometry of a 1.44MB floppy disk. In the next post we'll use this information about the disk geometry to read an arbitrary block from the disk.

Previous: A Simple Toolchain

One final task I want to get out of the way before attempting to load anything from disk is creating a couple debugging functions. print_hex will print the contents of a register in hexadecimal format, and print_mem will print a specified amount of data from memory in hexadecimal format.

The print_hex function is relatively simple. It isolates 4 bits of the ax register at a time by bitshifting the value until the bits we want are in the lowest 4 positions, then masking that value with 0x000F. It then uses the resulting value as an index for the string 0123456789ABCDEF.

; Prints a register value in hexadecimal
; ax: the value to print
print_hex:
    pusha

    ; Overwrite the first character of HEX_VALUE with bits 15-12
    mov bx, ax
    shr bx, 12
    mov bx, [bx + HEX_TABLE]
    mov [HEX_VALUE + 0], bl

    ; Overwrite the second character of HEX_VALUE with bits 11-8
    mov bx, ax
    shr bx, 8
    and bx, 0x000f
    mov bx, [bx + HEX_TABLE]
    mov [HEX_VALUE + 1], bl

    ; Overwrite the third character of HEX_VALUE with bits 7-4
    mov bx, ax
    shr bx, 4
    and bx, 0x000f
    mov bx, [bx + HEX_TABLE]
    mov [HEX_VALUE + 2], bl

    ; Overwrite the fourth character of HEX_VALUE with bits 3-0
    mov bx, ax
    and bx, 0x000f
    mov bx, [bx + HEX_TABLE]
    mov [HEX_VALUE + 3], bl

    ; Print the now populated HEX_VALUE
    mov bx, HEX_VALUE
    call print

    popa
    ret

HEX_VALUE: db "****$"
HEX_TABLE: db "0123456789ABCDEF"
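
The same nibble-by-nibble table lookup, sketched in Python for comparison (a hypothetical to_hex helper):

```python
HEX_TABLE = "0123456789ABCDEF"

def to_hex(value):
    """Render a 16-bit value the same way print_hex does."""
    # Shift each 4-bit group into the low position, mask, and look it up
    return "".join(HEX_TABLE[(value >> shift) & 0x000F]
                   for shift in (12, 8, 4, 0))
```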

The print_mem function is mostly straightforward now that we have a print_hex function. It just runs a loop loading memory into ax then calling print_hex. The only caveat is it loads the values in big endian orientation so you can read the bytes in memory order from left to right. If we were to load an entire word at once by doing mov ax, [si] it would load as little endian and you'd see the bytes in the order 1, 0, 3, 2, 5, 4...

; Prints words of memory in hexadecimal
; bx: address of memory to print
; cx: number of words to print
print_mem:
    pusha

    ; SI will point to the current byte of memory
    mov si, bx

    print_mem_loop:
        ; Print one word of memory
        ; This loads bytes in big endian order
        mov ah, [si]
        mov al, [si + 1]
        call print_hex

        mov bx, SPACE
        call print

        ; Advance the memory pointer by 1 word
        ; Decrement the loop counter by 1 and exit if it's 0
        add si, 2
        sub cx, 1
        cmp cx, 0
        jne print_mem_loop

    mov bx, LINE_BREAK
    call print

    popa
    ret

SPACE: db " $"
LINE_BREAK: db 0x0a, 0x0d, "$"
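
To make the byte ordering concrete, here's the difference sketched in Python:

```python
data = bytes([0x12, 0x34, 0x56, 0x78])

# Loading a word at once (mov ax, [si]) is little endian:
# the first byte in memory lands in al, the second in ah.
little = int.from_bytes(data[0:2], "little")  # 0x3412

# print_mem instead loads the first byte into ah and the second
# into al, so the printed digits match the order of bytes in memory.
big = (data[0] << 8) | data[1]                # 0x1234
```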

Inspecting the Bootloader

Now that we have these debugging functions, we can have the bootloader print itself as a sanity check. This should output the same data as hexdump -C build/boot.img.

; print bootloader contents
mov bx, 0x7c00
mov cx, 256
call print_mem

Next: Getting Disk Info

Previous: Initial Bootloader

Now that we have a machine that can boot and execute some code, there is enough of a foundation to do a “Hello World” tutorial. There isn't too much we can do inside the 512 bytes of the boot sector, but it's enough space to play around with some fundamentals before moving on to loading an actual kernel.

So the boot loader from the previous post contained this block of code to print a dollar sign to the screen:

mov al, "$"
mov ah, 0x0e
int 0x10

You can probably infer what most of this does. int 0x10 triggers an interrupt handler that does something with the contents of ax. We printed a dollar sign so al is probably where we put a character we want to print, and the 0x0e in ah must refer to some sort of “print character” function. But how does that actually work? What if we want to print a string and not a single character? What if we want to use colors? Can we change the font? What if we want to deal with graphics instead of text?

INT 10H

INT 10H (or int 0x10 as I prefer writing it in code), is the interrupt vector for the BIOS video services. It can do several useful things, but what we care about right now is controlling text output.

When triggering INT 10H, ah is used as the function code. The function code we used (0x0e) is for teletype output, which behaves more or less like a TTY you're probably used to from modern operating systems. In text mode, it prints a character at the current cursor position then advances the cursor. It also line wraps if necessary, scrolls the screen if necessary, and responds to various control codes.

So that's enough information to output a character, and based on the described behavior you can probably figure out how to print full strings using a simple loop. This information also presents some unanswered questions though. Are there other types of output besides teletype? What modes besides text are there? What other function codes can INT 10H handle?

Video Modes

Let's start from the top. The very first function code INT 10H provides (function code 0) allows you to set the video mode. The value of al determines which of the many available modes the video card will use, but all of these fall into one of two categories: text and graphics.

In text mode, the video card interprets video memory as ASCII characters and renders those to the screen.

In graphics mode, the video card interprets video memory as pixel data. If you want to print text in this you'll have to write your own font rendering code, which is far beyond the scope of this post.

I'm going to choose to use text mode 3, which is an 80x25 character text mode with 16 colors. On any modern computer or emulator this is likely the mode the computer booted into, but we'll set it to be safe. As an added bonus, setting the video mode will also clear the screen.

mov ah, 0
mov al, 0x03
int 0x10

NOTE: BIOS also supports multiple “pages” of text that you can toggle between, but for the time being we'll just do everything on page 0.

Cursor Position and Shape

Before we begin outputting text, it's important to understand how to manipulate the cursor on screen. Some output functions don't advance the cursor, so manually advancing the cursor is necessary.

Function 0x01

This sets the cursor shape. The cursor is always a full character width, but you can specify the starting and ending rows to control the height and vertical position of the box that's drawn. In text mode 3, characters are 16 pixels tall. The default underline cursor starts at line 13 and ends at line 14 (I assume that last pixel is left blank for line spacing). If you wanted the cursor to be a box instead, you could fill from 0 to 14 instead.

; Make cursor a square
mov ah, 0x01
mov ch, 0
mov cl, 0x0e
int 0x10

Function 0x02

This sets the cursor position using row and column indexing with 0, 0 being the top left of the screen, and 24, 79 being the bottom right corner.

; Put cursor in bottom right corner
mov ah, 0x02
mov bh, 0
mov dh, 24
mov dl, 79
int 0x10

Text Output Functions

INT 10H provides four functions for text output.

  • Function 0x09 writes a character and attribute at the cursor position
  • Function 0x0a writes a character at the cursor position
  • Function 0x0e is the already discussed teletype output
  • Function 0x13 is string output

Note: The first three should work on any IBM compatible PC, but the string output function won't work on some older machines. I haven't been able to find a definitive answer as to where the exact cutoff is, but any machine with a VGA card is almost certainly new enough.

Function 0x09

This writes a character with a given attribute at the current cursor position. It does not advance the cursor or respond to control codes. It can, however, repeat a character multiple times (the repeat count is passed in cx).

The attribute refers to the color of the text, which in 16 color mode can be anything between 0x00 and 0x0f.

  • 0x00 – Black
  • 0x01 – Blue
  • 0x02 – Green
  • 0x03 – Cyan
  • 0x04 – Red
  • 0x05 – Magenta
  • 0x06 – Brown
  • 0x07 – White
  • 0x08 – Gray
  • 0x09 – Light Blue
  • 0x0a – Light Green
  • 0x0b – Light Cyan
  • 0x0c – Light Red
  • 0x0d – Light Magenta
  • 0x0e – Yellow
  • 0x0f – Bright White

; Set the page to use (used by function 0x02 and 0x09)
mov bh, 0

; Set the color to red (used by function 0x09)
mov bl, 0x04

; Move cursor to 10, 0
mov ah, 0x02
mov dh, 10
mov dl, 0
int 0x10

; Output "r" at the cursor position
mov ah, 0x09
mov al, 'r'
mov cx, 1
int 0x10

; Move cursor to 10, 1
mov ah, 0x02
mov dh, 10
mov dl, 1
int 0x10

; Output "e" at the cursor position
mov ah, 0x09
mov al, 'e'
mov cx, 1
int 0x10

; Move cursor to 10, 2
mov ah, 0x02
mov dh, 10
mov dl, 2
int 0x10

; Output "d" at the cursor position
mov ah, 0x09
mov al, 'd'
mov cx, 1
int 0x10

; Move cursor to 10, 3
mov ah, 0x02
mov dh, 10
mov dl, 3
int 0x10

Function 0x0a

This is the same as 0x09, but you don't provide an attribute value. The text will be printed using whatever the last attribute value at that position was.

Function 0x0e

The behavior of the teletype output function was mostly discussed above, but it should be added that this function will also retain the previous attribute value like 0x09.

Other Useful Functions

We've discussed all the text output functions BIOS provides, but there are several other useful functions you can use in conjunction with those.

Function 0x05

This function selects which display page to use.

; Set the currently displayed page to page 0
mov ah, 0x05
mov al, 0
int 0x10

Function 0x06

This function scrolls an area of the active page up by a given number of rows (passed in al, with 0 blanking the entire area). Text that goes outside that area isn't retained, so scrolling up one followed by scrolling down one would result in the top line of text in that area being blank.

; Clear the screen (assuming an 80x25 character video mode)
mov ah, 0x06

; Scroll 25 rows and set the attribute to 0x07 (white)
mov al, 25
mov bh, 0x07

; Top left corner of scroll area
mov ch, 0
mov cl, 0

; Bottom right corner of scroll area
mov dh, 24
mov dl, 79

int 0x10

Function 0x07

This scrolls the active page down. It behaves the same as 0x06 aside from the direction.

Printing Strings

Now that we've gone over all the various text related functions of INT 10H, we can put all that together to actually print strings. What's that you ask? Isn't that just using function 0x13?

It could be depending on what you were building, but this series is specifically about learning how DOS works, and function 0x13 isn't actually all that useful for displaying DOS's string format.

DOS’s print string function operates on dollar terminated strings. Rather than taking the length of the string as a parameter, it just keeps printing characters until it comes across a dollar sign. We could loop over our string to find its length, but if we’re looping anyway it’s easier to just print as we iterate.

As best I can tell from experimentation, the DOS print string function interprets characters the same as the BIOS teletype output, with one exception. DOS does not display the character 0x1b (escape), and also does not display the next character following 0x1b. I don't fully understand why, but I'll update this section once I do.

For now, I'm going to implement the print string function by using the BIOS teletype output function and call that close enough.

; Prints a $ terminated string
; bx: address of the string to print
print:
    pusha
  
    ; SI will point to the current character
    mov si, bx

    ; Set the arguments for BIOS teletype output
    mov ah, 0x0e
    mov bx, 0

    print_loop:
        ; load the next character from memory
        mov al, [si]

        ; check if we're at the end of the string
        cmp al, '$'
        je print_exit

        ; print the character
        int 0x10

        ; move on to the next character
        add si, 1
        jmp print_loop

    print_exit:
        popa
        ret

What if I don't want to go through the BIOS?

In some cases, you may find it more effective to directly write text to graphics memory instead of using the various character and string printing functions provided by the BIOS. In text modes, text is stored directly in graphics memory and the graphics adapter handles converting that text data into pixels. All that's necessary to modify this data directly is knowing where it's located in memory and how it's structured, but that will be discussed in a future post about VGA.

Conclusion

Now that we have a print function we can update our bootloader to print an actual “hello world” message.

; boot.asm
[bits 16]
[org 0x7c00]


main:
    ; disable interrupts
    cli

    ; make sure the CPU is in a sane state
    jmp 0x0000:clear_segment_registers
    clear_segment_registers:
        xor ax, ax
        mov ds, ax
        mov es, ax
        mov ss, ax
        mov sp, main
        cld

    ; re-enable interrupts
    sti

    ; print a message
    mov bx, msg
    call print

    ; "It's now safe to turn off your computer."
    hlt
    jmp $-1


; Prints a $ terminated string
; bx: address of the string to print
print:
    pusha
    
    ; SI will point to the current character
    mov si, bx

    ; Set the arguments for BIOS teletype output
    mov ah, 0x0e
    mov bx, 0

    print_loop:
        ; load the next character from memory
        mov al, [si]

        ; check if we're at the end of the string
        cmp al, '$'
        je print_exit

        ; print the character
        int 0x10

        ; move on to the next character
        add si, 1
        jmp print_loop

    print_exit:
        popa
        ret


msg:
    db "Hello World!"
    crlf db 0x0d, 0x0a
    endstr db '$'


; padding
times 510-($-$$) db 0

; literally magic
dw 0xaa55

Next: A Simple Toolchain

Previous: Introduction

We can't run anything if we don't have a way to boot the machine, so step one is writing a bootloader. We don't have a kernel to load yet so for now we'll focus on just getting the machine to boot and print a character.

Essential Tools

Before we begin you'll need a few tools to start developing. At minimum you'll need an assembler and an x86 emulator. I'm developing on Linux so I'm going to be using NASM and QEMU respectively. If you're running macOS you should be able to install them using Homebrew, and if you're using Windows I recommend setting up the Windows Subsystem for Linux.

Let's Boot a Computer

To get started, here's the most minimal bootloader I can write that's guaranteed to run on just about any x86 based machine. Don't worry if you don't understand it just yet, I'm going to break it all down later.

; boot.asm
[bits 16]
[org 0x7c00]


main:
    ; disable interrupts
    cli

    ; make sure the CPU is in a sane state
    jmp 0x0000:clear_segment_registers
    clear_segment_registers:
        xor ax, ax
        mov ds, ax
        mov es, ax
        mov ss, ax
        mov sp, main
        cld

    ; re-enable interrupts
    sti

    ; print a dollar sign
    mov al, "$"
    mov ah, 0x0e
    int 0x10

    ; "It's now safe to turn off your computer."
    hlt
    jmp $-1

; padding
times 510-($-$$) db 0

; literally magic
dw 0xaa55

You can build this by doing nasm -fbin boot.asm -o boot.img, and then run it by doing qemu-system-x86_64 -drive format=raw,file=boot.img. If everything worked correctly, the computer should have booted, printed a single $ to the screen, then halted.

The Boot Process

When you power on a computer, the first thing that happens is some firmware that lives in a ROM chip on the motherboard is loaded into memory and executed. In the case of IBM compatible PCs, that firmware is called BIOS.

Note: BIOS is not the generic term for firmware. It is a specific firmware that provides a specific set of functions for the operating system to make use of. So whenever I refer to BIOS I'm talking about that specific interface.

The way BIOS boots a computer is to read the first 512 bytes from some boot disk and load it into memory at the location 0x7c00 (that's 31744 in decimal). Astute observers will note that's exactly 1KB below 32KB. It then jumps to 0x7c00 and starts executing.

Note: the x86's memory is byte addressable counting from 0. As an example, that means the address 31 points to the 32nd byte of memory.

So what we've done is generate a 512 byte blob of machine code and told QEMU that blob is a hard drive. QEMU then takes that blob and reads the first 512 bytes of it (which is all of it) into memory and executes it.

What Does the Code Do?

Directives

There are two directives to the assembler at the beginning of the file.

[bits 16]
[org 0x7c00]

The first directive, bits 16, tells the assembler to generate 16-bit code.

The second directive, org 0x7c00 tells the assembler where in memory the binary will eventually be loaded. This is necessary for the assembler to calculate the correct memory addresses of labels. As an example from this code, mov sp, main needs to resolve to mov sp, 0x7c00 and not mov sp, 0x0000.

Resetting the Segment Registers

The first actual instructions occur after the main label. The lines beginning with cli and ending with sti are responsible for resetting the segment registers.

So far when I've talked about memory, it's been using a single 16-bit number (like 0x7c00). However, using a single 16-bit number you can only count as high as 65,535, which is far less memory than an x86 can access. The way x86 CPUs actually access memory is by using a pair of 16-bit numbers called the segment and offset. Memory is divided into 64KB sized segments that start at 16 byte intervals. So segment 0 spans from 0 to 65,535, segment 1 spans from 16 to 65,551, and so on. The offset is the address within a segment. This means that there are multiple segment and offset combinations that can point to any given memory address.

To summarize, you can calculate a memory address as segment * 16 + offset. Absolute memory addresses are often written as segment:offset. Following that convention, the boot sector is loaded at 0x0000:0x7c00.
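
The address calculation is simple enough to sketch (Python, for illustration), and it also demonstrates the aliasing mentioned above:

```python
def physical(segment, offset):
    # segment * 16 + offset, giving a 20-bit physical address
    return (segment << 4) + offset
```

Both 0x0000:0x7C00 and 0x07C0:0x0000 name the same byte, the start of the boot sector.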

So how do we actually use segment registers? The x86 has four segment registers: cs (code segment), ds (data segment), ss (stack segment), and es (extra segment). Whenever you issue an instruction that accesses memory, the value of the corresponding register is used. For example:

  • jmp 0x0001 jumps to cs:0x0001
  • mov ax, [0x0001] loads the value at ds:0x0001 into ax
  • push ax decrements sp then saves ax to ss:sp (the stack grows towards 0).
  • es can be used to specify other segments, such as doing mov ax, [es:0x0001] if you need to load memory not in the current segment.

It's possible for some (buggy) BIOSes to have the segment registers set up incorrectly on boot. They should all be 0x0000, but we reset them just to be safe. This process can seem a bit convoluted at first so I'll explain it in detail.

This section begins with cli to disable interrupts and ends with sti to re-enable them. This is a precaution that I'm not sure is actually necessary, but I've included it just to be safe.

Next is the instruction jmp 0x0000:clear_segment_registers. It may seem a bit odd to jump to the next instruction, but this is how you change the code segment. This is because you need to update the code segment and instruction pointer (that is to say the code offset) at the same time to prevent jumping to an unintended memory location.

Next we set the ds, es, and ss registers. These can be set directly, but not to an immediate value, so mov ds, 0 is not a valid instruction. Instead we need to set some other register (in this case ax) to 0, then move that value using mov ds, ax.

Finally we set the stack pointer (sp) to main so the stack grows away from our code.

Printing “$”

The next 3 lines are the actual code for the bootloader, which prints “$”.

This works by triggering the interrupt 10h to invoke a routine provided by the BIOS. The value in ah indicates which function to call, which in this case corresponds to “teletype output”. The value that gets printed is the character in al, which we set to $.

Halting the CPU

The next two lines halt the CPU. hlt executes the halt instruction, which causes the CPU to halt and wait for an interrupt. Because interrupts periodically fire, execution will eventually continue and we need to jump back to the halt instruction. $ is the address of the current instruction, and since hlt is a one byte instruction, jmp $-1 jumps back to it.

Padding the Boot Sector and Magic Number

times 510-($-$$) db 0 pads the file with zeroes until it's 510 bytes long. Finally, dw 0xaa55 writes the two byte magic number that the BIOS requires a boot sector to end with, thus completing the 512 byte sector.
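
As a sanity check, the constraints on the finished boot sector can be sketched in Python (a hypothetical validation helper; note that dw 0xaa55 is stored little endian, so the last two bytes on disk are 0x55 then 0xaa):

```python
def is_valid_boot_sector(image):
    """Check the constraints BIOS places on a boot sector."""
    return len(image) == 512 and image[510:512] == b"\x55\xaa"

# A stand-in 512-byte image: some code, zero padding, then the magic number
code = b"\xfa"  # cli
image = code + b"\x00" * (510 - len(code)) + b"\x55\xaa"
```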

Next: Hello World

Welcome to DOS from Scratch, a series of blog posts following my attempts to create a DOS compatible operating system from scratch.

But Why?

Why would I want to spend time writing code to imitate a 40 year old operating system? Nostalgia, mostly. My first experience using a computer was on a 286 running I don't even know what version of DOS. It was fascinating to me, but back then I was a young child who had no idea how computers worked. Now that I'm an adult with a firm grasp on systems programming, I want to revisit that period of history and learn what was really going on under the hood. I also want to document the process for anyone else who may have a similar nostalgic curiosity.

Project Philosophy

Because the purpose of this project is to learn how everything in the operating system works, I intend to write all of the code from scratch. I'll be referencing existing code to learn how things work, and occasionally running existing applications to test compatibility, but the final product will be 100% written by me. While there will be a bootstrapping phase using an existing OS and tools, I do intend for the operating system to eventually be entirely self hosting.

Getting Started

Now that all the background is out of the way, let's dive in. I'll be updating this page with a table of contents as new posts are published, so you should bookmark this page if you're interested in following along.

You can also check out the code on Github.

Table of Contents