Pao Ramen
-
Web Numbers
Domains? Where we’re going, we don’t need domains! Until the end of this year, the only way to have a secure web site is to have it accessible via a domain name. That, however, is changing. And the design of the Small Web will be changing along with it. IP addresses are about to make a comeback on the web in a big (ok, small) way, thanks to upcoming support for security certificates for IP addresses by Let’s Encrypt.
Jun 26 ⎯ ar.al
-
What is the Small Web?
Updated June 19th, 2023. Small Is Beautiful (Oct 2022): What is the Small Web and why do we need it? Today, I want to introduce you to a concept – and a vision for the future of our species in the digital and networked age – that I’ve spoken about for a while but never specifically written about:
Jun 26 ⎯ ar.al
-
Take 3 minutes to delete these words and improve your writing forever
Before you hit “send,” delete these words from your writing. Your message will be stronger for it.
Jun 25 ⎯ newsletter.weskao.com
-
Quality is a trap
It is a proxy phrase, often born of an inability or unwillingness to articulate other concerns.
Jun 25 ⎯ ericwbailey.website
-
The six crises every organization faces [corrected image]
Each phase of growth solves one problem—and creates the next. Here's the cycle organizations repeat.
Jun 25 ⎯ theoverlap.substack.com
-
The Bitter Lesson is coming for Tokenization
Highlights the desire to replace tokenization with a general method that better leverages compute and data. We’ll see tokenization’s fragility and review the Byte Latent Transformer architecture.
Jun 25 ⎯ lucalp.dev
-
Agentic Search for Dummies — Benjamin Anderson
A simple, effective baseline for building AI search agents.
Jun 24 ⎯ benanderson.work
-
Monotone Functions and Cache Lines
If you've taken a calculus class, your professor might have explained to you the idea of a monotone function, or (in one direction) a nondecreasing function. A monotone function looks like this:

[figure: graph of a nondecreasing function, with a flat plateau roughly around x in [4, 5]]

A monotone function f between two ordered sets is one which satisfies:

    x <= y implies f(x) <= f(y)

Note that this definition does not imply that f is increasing: it's okay if x < y and f(x) = f(y) (for example, a constant function is monotone); it's just that it's nondecreasing. In a calculus class you probably only learned about this concept in the context of functions from real numbers to real numbers, but it's a coherent idea, and a useful abstraction, for any two ordered sets. For example, for set-to-set functions, it shows up a lot in the database literature via ideas like the CALM theorem. The CALM theorem is interested in monotonicity from the perspective of programs: getting more inputs can only result in new outputs (the classic example is "if you add more edges to a graph, you can only get more cycles, not fewer").

One way you might think about what a monotone function f gives you is that knowing the values of f(x) and f(y) is a sort of hint that suggests what the ordering of x and y is. Looking at the contrapositive of our definition tells us what we can learn from a comparison of f(x) and f(y):

    Original statement: x <= y implies f(x) <= f(y)
    Contrapositives:    f(x) > f(y) implies x > y
                        f(x) < f(y) implies x < y
    Implied: we don't learn anything if f(x) = f(y).

Looking at our graph from before, we see this is true: if we have two points on the plateau (roughly around [4, 5]), they can agree along the y-axis but be in either order along the x-axis. The comparison of x and y is in some sense "pickier" than the comparison of f(x) and f(y): there are some things that look the same under f(x) and f(y) but can be distinguished in their original forms of x and y.
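To make the contrapositive concrete, here is a tiny sketch (my example, not the post's) using floor, which is monotone, as f:

```python
import math

# floor is monotone: x <= y implies floor(x) <= floor(y).
f = math.floor

# Contrapositive in action: when f disagrees, the inputs' order is decided.
assert f(1.2) < f(3.7)   # so 1.2 < 3.7, without ever comparing the inputs

# On a plateau, f agrees and tells us nothing about the order of the inputs.
assert f(4.2) == f(4.9)  # yet 4.2 < 4.9: equal images carry no information
```

Note that floor collapses the whole interval [4, 5) onto a single value, which is exactly the "plateau" situation from the graph.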
What this suggests is that if we have a monotone function where comparisons on its codomain are easier or faster to compute than those on its domain, we can use those instead and only fall back to the original comparison when the values are the same. Our implementation looks like this:

```python
def less_than(x, y):
    if f(x) < f(y):
        return True
    if f(x) > f(y):
        return False
    return x < y
```

That's all abstract. Let's get more concrete. A string, in at least some programming languages, is a struct containing a pointer, a capacity, and a length. If we have an array of strings, what we really have is an array of pointers and capacities and lengths, where those pointers point somewhere into heap memory:

```
[ s1, s2, s3, s4 ]
  |
  v
[ (ptr1, cap1, len1),  --> somewhere in the heap
  (ptr2, cap2, len2),  --> somewhere in the heap
  (ptr3, cap3, len3),  --> somewhere in the heap
  (ptr4, cap4, len4),  --> somewhere in the heap
]
```

What this means, if we want to sort this list, is that we're not operating on one big slab of contiguous memory the way it might look: every time we compare two strings, we have to follow each of their pointers and leap off somewhere into heap memory to compare them. This is not the biggest deal, but if the cache line we try to read is not in cache, we have to load it in, which is slow. You can extend this situation to rows in a database, which might not even reside in memory and need to be fetched from disk, or even over the network: comparing two of them might require reading data from some far-off place.

There is a well-known trick to improve this situation, which I know most commonly as an abbreviated key: store a small prefix (8 bytes is a good number) of the string next to its pointer:

```
[ (abbrev1, ptr1, cap1, len1),
  (abbrev2, ptr2, cap2, len2),
  (abbrev3, ptr3, cap3, len3),
  (abbrev4, ptr4, cap4, len4),
]
```

Now we've duplicated some of the data: part of the string is actually stored directly in our big slab of memory.
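In a higher-level language, the semantics of the fallback comparison can be sketched with tuple keys: Python compares tuples lexicographically, so the cheap fixed-size prefix is consulted first and the full string only breaks ties (an illustrative sketch of the comparison order, not the post's code — Python still materializes both tuple elements up front):

```python
def abbrev_key(s, n=8):
    """Pair a string with its n-character prefix; tuples compare prefix-first,
    falling back to the full string only when the prefixes tie."""
    return (s[:n], s)

words = ["goofy", "donald", "goofball"]
words.sort(key=abbrev_key)
assert words == ["donald", "goofball", "goofy"]
```

Here "donald" sorts before both "goof…" strings purely on the prefix, while "goofball" vs. "goofy" needs the full-string tiebreak.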
The heart of the trick is that the operation of "taking a fixed-size prefix" is monotone! Consider if we take 3-byte prefixes of strings, and we have the following dataset:

```
[ "goofball", "goofy", "donald" ]
```

If we store our prefixes inline, they look like this:

```
[ ("goo", "goofball"),
  ("goo", "goofy"),
  ("don", "donald"),
]
```

Now, comparing "goofball" and "goofy", we have to go to the actual strings to distinguish them, since they agree on their prefix. But because of monotonicity, and the fact that "don" < "goo", we already know that "donald" must come before "goofy" without ever having to do any pointer-chasing. We can observe this:

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Duration;

fn main() {
    println!("String: {:?}", run::<String>());
    println!("AbbreviatedString: {:?}", run::<AbbreviatedString>());
}

fn run<T: ComparableString>() -> Duration {
    let file = File::open("/usr/share/dict/words").unwrap();
    let reader = BufReader::new(file);
    let mut strings: Vec<T> = Vec::new();
    for line in reader.lines() {
        let word = line.unwrap();
        strings.push(word.into());
    }
    strings.shuffle(&mut thread_rng());
    let start = std::time::Instant::now();
    strings.sort_unstable();
    start.elapsed()
}

trait ComparableString: Ord + From<String> {}

impl ComparableString for String {}

#[derive(Debug, PartialEq, Eq, Ord)]
struct AbbreviatedString {
    abbreviation: [u8; 8],
    full_string: String,
}

impl From<String> for AbbreviatedString {
    fn from(full_string: String) -> Self {
        let mut abbreviation = [0; 8];
        let bytes = full_string.as_bytes();
        let len = bytes.len().min(8);
        abbreviation[..len].copy_from_slice(&bytes[..len]);
        AbbreviatedString {
            abbreviation,
            full_string,
        }
    }
}

impl PartialOrd for AbbreviatedString {
    // I believe this is also the derive impl.
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(
            self.abbreviation
                .cmp(&other.abbreviation)
                // Fall back to the full comparison.
                .then_with(|| self.full_string.cmp(&other.full_string)),
        )
    }
}

impl ComparableString for AbbreviatedString {}

impl From<AbbreviatedString> for String {
    fn from(abbreviated: AbbreviatedString) -> Self {
        abbreviated.full_string
    }
}
```

```
String: 21.027458ms
AbbreviatedString: 8.386ms
```

This is of course unfair and a best-case situation for this technique: all the words are different, and almost all of them differ within the first 8 characters, so we almost never have to read the actual string. If we had to do that more often (say, if all the strings shared the same first 8 characters), this inlining would contribute almost pure overhead. We can confirm this by making it so:

```rust
strings.push(format!("abcdefgh{}", word).into());
```

Which degrades our performance, as we'd expect:

```
String: 21.33675ms
AbbreviatedString: 24.115ms
```

Links

- See Brandur Leach's excellent post on SortSupport in Postgres and the follow-up of implementing it for INET types.
- A related topic is German Strings, which store small strings inline in the data used by the pointer.
Jun 24 ⎯ buttondown.com
-
Reinforcement learning, explained with a minimum of math and jargon
To create reliable agents, AI companies had to go beyond predicting the next token.
Jun 24 ⎯ www.understandingai.org
-
Estimating the number of your RSS subscribers
Analyze server access logs to view the subscriber count from popular RSS readers.
Jun 24 ⎯ darekkay.com
-
Reproducing U-Net
While trying to get some deep learning research reps in, I thought of doing some archaeology by trying to reproduce some older deep learning papers. Then,…
Jun 23 ⎯ doubledissent.bearblog.dev
-
Loss
Look up loss or lose in Wiktionary, the free dictionary.
Jun 23 ⎯ en.m.wikipedia.org
-
The secret of good metaphors
How do metaphors and analogies work, and what makes a good metaphor?
Jun 23 ⎯ www.doc.cc
-
Canine - An open source alternative to Heroku
Canine is an open source deployment platform that makes it easy to deploy and manage your applications.
Jun 23 ⎯ canine.sh
-
How Databases Store Your Tables on Disk
Ever wondered how your database stores and retrieves data efficiently from disk? In this article, we take a deep dive into the internal structure of databases, covering what pages are, how heap files store data, and how indexes (including clustered indexes) make data access faster and smarter. By the end, you’ll have a solid understanding of how tables are actually stored behind the scenes and how databases optimize performance using these low-level concepts.
Jun 23 ⎯ www.deepintodev.com
-
Cluely System prompt
Cluely System prompt.
Jun 23 ⎯ gist.github.com
-
GitHub - TheGP/untidetect-tools: List of anti-detect and humanizing tools and browsers, including captcha solvers and sms-activation.
List of anti-detect and humanizing tools and browsers, including captcha solvers and sms-activation.
Jun 22 ⎯ github.com
-
<syntax-highlight> element - Web Component
A web component that uses the CSS Custom Highlight API.
Jun 21 ⎯ andreruffert.github.io
-
Wilted lands and wounded worlds: visualizing environmental costs of war in Hayao Miyazaki’s Nausicaä of the Valley of the Wind
Audrey Aguirre, Upland High School, Upland, CA, USA. Email: audrey.a.aguirre (at) gmail (dot) com. Past studies on Nausicaä of the Valley of the Wind (風の谷のナウシカ; Topcraft, 1984) have prim…
Jun 21 ⎯ jgeekstudies.org
-
Cracovians: The Twisted Twins of Matrices
Linear algebra is typically explained using matrices. But matrix theory is just one possible perspective. Below, I describe an alternative approach to linear algebra. Tadeusz Banachiewicz (1882–195…
Jun 21 ⎯ marcinciura.wordpress.com