Pao Ramen
-
Why Engineers Hate Their Managers (And What to Do About It)
Jun 29 · terriblesoftware.org ⎯ Discover why engineers hate managers, the common management anti-patterns that destroy trust, and practical solutions from someone who’s been on both sides.
-
toward a new aesthetic
Jun 29 · newsletter.sachajudd.com ⎯ Lately I’ve been thinking about what being online looks like. The actual look of it. The aesthetic. As I continue to think and write and talk about escaping…
-
Hayao Miyazaki: 9 Creative Lessons
Jun 29 · www.readtrung.com ⎯ Notes on the creative process for the legendary Japanese filmmaker and founder of Studio Ghibli.
-
Introduction to deep research in the OpenAI API | OpenAI Cookbook
Jun 28 · cookbook.openai.com ⎯ The Deep Research API enables you to automate complex research workflows that require reasoning, planning, and synthesis across real-worl…
-
GitHub - pacocoursey/xslt: Native web build system (XML+XSLT)
Jun 28 · github.com ⎯
-
You don’t understand "ordinary people." -- Richard P. Feynman
Jun 27 · genius.cat-v.org ⎯
-
What Problems to Solve - By Richard Feynman
Jun 27 · genius.cat-v.org ⎯
-
The ultimate Astro + Google Analytics guide
Jun 26 · daniel.es ⎯ Learn how to integrate Google Analytics (GA4) with Astro in multiple ways. Using Partytown, View Transitions, CookieConsent V3 and more.
-
Web Numbers
Jun 26 · ar.al ⎯ Domains? Where we’re going, we don’t need domains! Until the end of this year, the only way to have a secure web site is to have it accessible via a domain name. That, however, is changing. And the design of the Small Web will be changing along with it. IP addresses are about to make a comeback on the web in a big (ok, small) way, thanks to upcoming support for security certificates for IP addresses by Let’s Encrypt.
-
What is the Small Web?
Jun 26 · ar.al ⎯ Updated June 19th, 2023 Small Is Beautiful (Oct, 2022): What is the Small Web and why do we need it? Today, I want to introduce you to a concept – and a vision for the future of our species in the digital and networked age – that I’ve spoken about for a while but never specifically written about:
-
Take 3 minutes to delete these words and improve your writing forever
Jun 25 · newsletter.weskao.com ⎯ Before you hit “send,” delete these words from your writing. Your message will be stronger for it.
-
Quality is a trap
Jun 25 · ericwbailey.website ⎯ It is a proxy phrase, often born of an inability or unwillingness to articulate other concerns.
-
The six crises every organization faces [corrected image]
Jun 25 · theoverlap.substack.com ⎯ Each phase of growth solves one problem—and creates the next. Here's the cycle organizations repeat.
-
The Bitter Lesson is coming for Tokenization
Jun 25 · lucalp.dev ⎯ Highlights the desire to replace tokenization with a general method that better leverages compute and data. We’ll see tokenization’s fragility and review the Byte Latent Transformer architecture.
-
Agentic Search for Dummies — Benjamin Anderson
Jun 24 · benanderson.work ⎯ A simple, effective baseline for building AI search agents.
-
Monotone Functions and Cache Lines
Jun 24 · buttondown.com ⎯ If you've taken a calculus class, your professor might have explained the idea of a monotone function, or (in one direction) a nondecreasing function. A monotone function looks like this:

[figure: graph of a monotone function, with a flat plateau roughly over [4, 5]]

A monotone function f between two ordered sets is one which satisfies:

    x <= y implies f(x) <= f(y)

Note that this definition does not imply that f is increasing: it's okay if x < y and f(x) = f(y) (for example, a constant function is monotone); it's just that it's nondecreasing. In a calculus class you probably only met this concept for functions from real numbers to real numbers, but it's a coherent idea, and a useful abstraction, for any two ordered sets. For example, for set-to-set functions it shows up a lot in the database literature via ideas like the CALM theorem, which is interested in monotonicity from the perspective of programs: getting more inputs can only produce more outputs (the classic example is "if you add more edges to a graph, you can only get more cycles, not fewer").

One way to think about what a monotone function f gives you is that knowing the values of f(x) and f(y) is a sort of hint about the ordering of x and y. The contrapositive of our definition tells us what we can learn from a comparison of f(x) and f(y):

    Original statement: x <= y implies f(x) <= f(y)
    Contrapositives:    f(x) > f(y) implies x > y, and f(x) < f(y) implies x < y
    Implied:            we learn nothing if f(x) = f(y)

Looking at our graph from before, we see this is true: two points on the plateau (roughly around [4, 5]) can agree along the y-axis but be in either order along the x-axis. The comparison of x and y is in some sense "pickier" than the comparison of f(x) and f(y): some pairs that look the same under f can still be distinguished in their original forms as x and y.
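The plateau behaviour is easy to check concretely. A small Python sketch (my own illustration, not from the post), using floor as the monotone map:

```python
import math

# floor is monotone: x <= y implies floor(x) <= floor(y),
# but it is not strictly increasing: on the plateau [4, 5)
# every input maps to the same output.
xs = [3.7, 4.2, 4.9, 5.1]

# The definition holds pointwise:
for x in xs:
    for y in xs:
        if x <= y:
            assert math.floor(x) <= math.floor(y)

# The contrapositive: a strict comparison of outputs decides the inputs.
for x in xs:
    for y in xs:
        if math.floor(x) < math.floor(y):
            assert x < y

# But equal outputs tell us nothing about the input order:
assert math.floor(4.2) == math.floor(4.9) and 4.2 < 4.9
```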
What this suggests is that if we have a monotone function whose codomain comparisons are easier or faster to compute than those on its domain, we can use those instead, and fall back to the original comparison only when the mapped values are equal. Our implementation looks like this:

```python
def less_than(x, y):
    # Try the cheap comparison on the codomain first.
    if f(x) < f(y):
        return True
    if f(x) > f(y):
        return False
    # f(x) == f(y) tells us nothing; fall back to the real comparison.
    return x < y
```

That's all abstract. Let's get more concrete. A string, in at least some programming languages, is a struct containing a pointer, a capacity, and a length. If we have an array of strings, what we really have is an array of pointers, capacities, and lengths, where those pointers point somewhere into heap memory:

```
[ s1, s2, s3, s4 ]
        |
        v
[ (ptr1, cap1, len1),  --> somewhere in the heap
  (ptr2, cap2, len2),  --> somewhere in the heap
  (ptr3, cap3, len3),  --> somewhere in the heap
  (ptr4, cap4, len4) ] --> somewhere in the heap
```

What this means is that if we want to sort this list, we're not operating on one big slab of contiguous memory, as it might appear: every time we compare two strings, we have to follow each of their pointers and leap off somewhere into heap memory to compare them. This is not the biggest deal, but if the memory we try to read is not in cache, we're likely to take a cache miss (or even a page fault) and have to load it in, which is slow. You can extend this situation to rows in a database, which might not even reside in memory and need to be fetched from disk, or even over the network: comparing two of them might require reading data from some far-off place.

There is a well-known trick to improve this situation, which I know most commonly as an abbreviated key: store a small prefix of the string (8 bytes is a good number) next to its pointer:

```
[ (abbrev1, ptr1, cap1, len1),
  (abbrev2, ptr2, cap2, len2),
  (abbrev3, ptr3, cap3, len3),
  (abbrev4, ptr4, cap4, len4) ]
```

Now we've duplicated some of the data: part of each string is stored directly in our big slab of memory.
The heart of the trick is that the operation of "taking a fixed-size prefix" is monotone! Consider taking 3-byte prefixes of strings, with the following dataset:

```
[ "goofball", "goofy", "donald" ]
```

If we store our prefixes inline, they look like this:

```
[ ("goo", "goofball"),
  ("goo", "goofy"),
  ("don", "donald") ]
```

Now, to compare "goofball" and "goofy" we still have to go to the actual strings, since they agree on their prefix. But because of monotonicity, and the fact that "don" < "goo", we already know that "donald" must come before "goofy" without ever doing any pointer-chasing. We can observe this:

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Duration;

fn main() {
    println!("String: {:?}", run::<String>());
    println!("AbbreviatedString: {:?}", run::<AbbreviatedString>());
}

fn run<T: ComparableString>() -> Duration {
    let file = File::open("/usr/share/dict/words").unwrap();
    let reader = BufReader::new(file);
    let mut strings: Vec<T> = Vec::new();
    for line in reader.lines() {
        let word = line.unwrap();
        strings.push(word.into());
    }
    strings.shuffle(&mut thread_rng());
    let start = std::time::Instant::now();
    strings.sort_unstable();
    start.elapsed()
}

trait ComparableString: Ord + From<String> {}

impl ComparableString for String {}

#[derive(Debug, PartialEq, Eq, Ord)]
struct AbbreviatedString {
    abbreviation: [u8; 8],
    full_string: String,
}

impl From<String> for AbbreviatedString {
    fn from(full_string: String) -> Self {
        let mut abbreviation = [0; 8];
        let bytes = full_string.as_bytes();
        let len = bytes.len().min(8);
        abbreviation[..len].copy_from_slice(&bytes[..len]);
        AbbreviatedString {
            abbreviation,
            full_string,
        }
    }
}

impl PartialOrd for AbbreviatedString {
    // I believe this is also the derive impl.
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(
            self.abbreviation
                .cmp(&other.abbreviation)
                // Fall back to the full comparison.
                .then_with(|| self.full_string.cmp(&other.full_string)),
        )
    }
}

impl ComparableString for AbbreviatedString {}

impl From<AbbreviatedString> for String {
    fn from(abbreviated: AbbreviatedString) -> Self {
        abbreviated.full_string
    }
}
```

```
String: 21.027458ms
AbbreviatedString: 8.386ms
```

This is of course unfair, a best-case situation for the technique: all the words are different, and almost all of them differ within the first 8 characters, so we almost never have to read the actual string. If we had to do that more often (say, if all the strings shared the same first 8 characters), the inlining would contribute almost pure overhead. We can confirm this by making it so:

```rust
strings.push(format!("abcdefgh{}", word).into());
```

which degrades our performance, as we'd expect:

```
String: 21.33675ms
AbbreviatedString: 24.115ms
```

Links

- See Brandur Leach's excellent post on SortSupport in Postgres and the follow-up of implementing it for INET types.
- A related topic is German Strings, which store small strings inline in the space used by the pointer.
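As a closing sketch (my own illustration, not from the post): the same prefix-then-fallback ordering can be expressed in Python by sorting on a (byte prefix, full string) tuple, since tuple comparison checks the prefix first and consults the full string only on ties. In CPython this buys no cache-locality win (strings are already heap pointers), but it shows the ordering logic is unchanged:

```python
# Abbreviated-key ordering via Python tuples: the 8-byte prefix is
# compared first; the full string is consulted only when two prefixes
# tie (e.g. "goofball" vs "goofy").
def abbrev_key(s: str) -> tuple:
    return (s.encode("utf-8")[:8], s)

words = ["goofball", "goofy", "donald"]

# Sorting by the abbreviated key gives the same order as sorting directly,
# because prefix-taking is monotone under lexicographic order.
assert sorted(words, key=abbrev_key) == sorted(words)
```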
-
Reinforcement learning, explained with a minimum of math and jargon
Jun 24 · www.understandingai.org ⎯ To create reliable agents, AI companies had to go beyond predicting the next token.
-
Estimating the number of your RSS subscribers
Jun 24 · darekkay.com ⎯ Analyze server access logs to view the subscriber count from popular RSS readers.
-
Reproducing U-Net
Jun 23 · doubledissent.bearblog.dev ⎯ While trying to get some deep learning research reps in, I thought of doing some archaeology by trying to reproduce some older deep learning papers. Then,…
-
Loss
Jun 23 · en.m.wikipedia.org ⎯ Look up loss or lose in Wiktionary, the free dictionary.