phpFashion

Var, Let, Const: Stop Complicating Your Life in JavaScript

JavaScript offers three ways to declare variables: var, let, and const. Many programmers aren't entirely clear on when to use which one, and most tutorials and linters force you to use them incorrectly. Let's see how to write cleaner and more understandable code without unnecessary rules that don't actually help us.

Let's Start with the Most Dangerous Part

JavaScript has one treacherous quirk: by simply omitting a variable declaration, you can unknowingly use a global variable. All it takes is forgetting var, let, or const:

function calculatePrice(amount) {
    price = amount * 100;    // Omission! Missing 'let'
    return price;            // We're using a global variable 'price'
}

function processOrder() {
    price = 0;               // We're using the same global variable!
    // ... some code calling calculatePrice()
    return price;            // We're returning a completely different value than expected
}

This is every developer's nightmare – the code appears to work correctly until something mysteriously starts failing elsewhere in the application. Debugging such errors can take hours because a global variable can be overwritten anywhere in the application.

That's why it's absolutely crucial to always declare variables using let or const.
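One safety net worth knowing about: in strict mode, assigning to an undeclared variable throws an error instead of silently creating a global. ES modules and class bodies are always strict; elsewhere you opt in with the 'use strict' directive. A minimal sketch (the function and variable names are made up for illustration):

```javascript
function calculatePriceStrict(amount) {
    'use strict';                    // opt in to strict mode for this function
    undeclaredPrice = amount * 100;  // no let/const: throws instead of creating a global
    return undeclaredPrice;
}

let caught = null;
try {
    calculatePriceStrict(5);
} catch (error) {
    caught = error;                  // the assignment above never succeeds
}

console.log(caught instanceof ReferenceError); // true
```

Strict mode only catches the omission at runtime, so declaring every variable remains the actual fix.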

Forget About var

The var keyword has been in JavaScript since its inception in 1995 and carries some problematic properties that were considered features at the time of the language's creation but proved to be a source of many bugs over time. After twenty years of language development, JavaScript's authors decided to address these problems – not by fixing var (to maintain backward compatibility) but by introducing the new let keyword in ES2015.

You can find plenty of articles on the internet dissecting the problems with var in the finest detail. But you know what? There's no need to get bogged down in the details. Let's just treat var as a relic of the past and focus on modern JavaScript.

When to Use let

let is the modern way to declare variables in JavaScript.

The nice thing is that the variable only exists within the code block (between curly braces) where it was defined. This makes the code more predictable and safer.

if (someCondition) {
    let temp = calculateSomething();
    // temp is only available here
}
// temp no longer exists here

In loops, the declaration is technically placed before the curly braces, but don't let that confuse you – the variable only exists within the loop:

for (let counter = 0; counter < 10; counter++) {
    // The counter variable only exists in the loop
}
// counter is no longer accessible here

When to Use const

const is used to declare constants. These are typically important values at the module or application level that should never change:

const PI = 3.14159;
const API_URL = 'https://api.example.com';
const MAX_RETRY_ATTEMPTS = 3;

However, it's important to understand one key detail: const only prevents assigning a new value to the variable – it doesn't control what happens with the value itself. This distinction is particularly evident with objects and arrays (an array is also an object) – const doesn't make them immutable objects, i.e., it doesn't prevent changes inside the object:

const CONFIG = {
    url: 'https://api.example.com',
    timeout: 5000
};

CONFIG.url = 'https://api2.example.com';  // This works!
CONFIG = { url: 'https://api2.example.com' };  // This throws TypeError!

If you need a truly immutable object, you need to freeze it first.
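Freezing can be sketched with the built-in Object.freeze(). Keep in mind that it is shallow: nested objects stay mutable unless you freeze them as well. The retry property below is a hypothetical nested setting added only to show this.

```javascript
const CONFIG = Object.freeze({
    url: 'https://api.example.com',
    timeout: 5000,
    retry: { attempts: 3 },          // hypothetical nested setting
});

try {
    CONFIG.url = 'https://api2.example.com'; // ignored, or TypeError in strict mode
} catch (error) {
    // strict code (for example an ES module) ends up here
}
CONFIG.retry.attempts = 10;          // works: the nested object is not frozen

console.log(CONFIG.url);             // https://api.example.com
console.log(CONFIG.retry.attempts);  // 10
```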

The let vs const Dilemma

Now we come to a more interesting question. While the situation with var vs let is clear, the use of const is the subject of many community discussions. Most tutorials, style guides, and linters promote the rule “use const wherever you can.” So we commonly see const used in function or method bodies.

Let's explain why this popular “best practice” is actually an anti-pattern that makes code less readable and unnecessarily restrictive.

The approach “if a variable's value isn't reassigned in the code, it should be declared as const” seems logical at first glance. Why else would const even exist? The more “constants,” the safer and more predictable the code, right? And faster too, because the compiler can better optimize it.

However, this entire approach fundamentally misunderstands the purpose of constants. It's primarily about communicating intent – are we truly trying to signal to other developers that this variable should never be reassigned, or do we just happen not to reassign it in our current implementation?

// Real constants - values that are constant by their nature
const PI = 3.14159;
const DAYS_IN_WEEK = 7;
const API_ENDPOINT = 'https://api.example.com';

// vs.

function processOrder(items) {
    // These AREN'T constants, we just happen to not reassign them
    const total = items.reduce((sum, item) => sum + item.price, 0);
    const tax = total * 0.21;
    const shipping = calculateShipping(total);
    return { total, tax, shipping };
}

In the first case, we have values that are constants by their nature – they express immutable properties of our system or important configuration data. When we see PI or API_ENDPOINT somewhere in the code, we immediately understand why these values are constants.

In the second case, we're using const just because we happen to not reassign the values right now. But that's not their essential characteristic – these are regular variables that we might want to change in the next version of the function. And when we want to do that, const will unnecessarily prevent us.

In the days when JavaScript was one big pile of global code, it made sense to try to secure variables against reassignment. But today we write code in modules and classes. Today the scope is commonly, and correctly, a small function, and within such a scope it makes no sense to worry about the difference between let and const.

Because it creates completely unnecessary mental overhead:

  1. The programmer has to think while writing: “Will I change this value? No? Then I must use const…”
  2. It distracts readers! When they see const in the code, they wonder: “Why is this a constant? Is this some important value? Does it have any significance?”
  3. In a month we need to change the value and have to deal with: “Can I change const to let? Is someone relying on this?”

Simply use let and you don't have to deal with these questions at all.

It's even worse when this decision is made automatically by a linter. That is, when the linter “fixes” variables to const because it only sees one assignment. The code reader then unnecessarily wonders: “Why must these variables be constants here? Is it somehow important?” And yet it's not important – it's just a coincidence. Don't use the prefer-const rule in ESLint!

By the way, the optimization argument is a myth. Modern JavaScript engines (like V8) can easily detect whether a variable is reassigned or not, regardless of whether it was declared using let or const. So using const provides no performance benefit.

Implicit Constants

In JavaScript, there are several constructs that we treat as constants without using the const keyword. Only one of them is actually enforced by the language:

// imported modules
import React from 'react';
React = something; // TypeError: Assignment to constant variable

// functions
function add(a, b) { return a + b; }
add = something; // allowed by the language, but almost always a bug

// classes
class User {}
User = something; // also allowed outside the class body

Imported bindings are read-only, just as if they were declared using const. Function and class declarations are not protected this way: their names can be reassigned (only inside a class body is the class name itself read-only). Still, these constructs define the basic building blocks of our code, and overwriting them could cause chaos in the application. Treat them as constants; the ESLint rules no-func-assign and no-class-assign can enforce that.

Constants in Classes

Classes were added to JavaScript relatively recently (in ES2015), and their functionality is still gradually maturing. For example, private members marked with # didn't arrive until 2022. JavaScript is still waiting for class constant support. For now, you can use static, but it's far from the same thing – it marks a value shared between all class instances, not an immutable one.
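The difference between "shared" and "immutable" can be illustrated with a short sketch. The class names and the tax rate are made up; the second class shows one possible workaround (a static getter without a setter), not an official class-constant syntax.

```javascript
class Order {
    static TAX_RATE = 0.21;          // shared by all instances, but still writable
}
Order.TAX_RATE = 0.5;                // no error, the value changes

class SafeOrder {
    // a getter without a setter: reading works, writing has no effect
    static get TAX_RATE() {
        return 0.21;
    }
}
try {
    SafeOrder.TAX_RATE = 0.5;        // ignored, or TypeError in strict mode
} catch (error) {
    // strict code ends up here
}

console.log(Order.TAX_RATE);         // 0.5
console.log(SafeOrder.TAX_RATE);     // 0.21
```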

Conclusion

  1. Don't use var – it's outdated
  2. Use const for real constants at the module level
  3. In functions and methods, use let – it's more readable and clearer
  4. Don't let the linter automatically change let to const – it's not about the number of assignments, but about intent

How to Deal with the Chaos of Empty Strings and NULL Values in MySQL?

You know the situation – you create a query WHERE street = '', but the system doesn't return all the records you'd expect. Or your LEFT JOIN doesn't work as it should. The reason is a common problem in databases: inconsistent use of empty strings and NULL values. Let's see how to solve this chaos once and for all.

When to Use NULL and When to Use an Empty String?

In theory, the difference is clear: NULL means “value is not set”, while an empty string means “value is set and is empty”. Let's look at a real example from an e-commerce site, where we have an orders table. Each order has a required delivery address and an optional billing address for cases where the customer wants to bill to a different location (typical checkbox “Bill to a different address”):

CREATE TABLE orders (
    id INT PRIMARY KEY,
    delivery_street VARCHAR(255) NOT NULL,
    delivery_city VARCHAR(255) NOT NULL,
    billing_street VARCHAR(255) NULL,
    billing_city VARCHAR(255) NULL
);

The billing_city and billing_street fields are nullable because the billing address is optional. But there's a difference between them. While a street can be legitimately empty (villages without street names) or unset (delivery address is used), the city must always be filled in if a billing address is used. So either billing_city contains a city name, or it's NULL – in which case the delivery address is used.

The Reality of Large Databases

In practice, both approaches often end up being mixed in the database. There can be several reasons:

  • Changes in application logic over time (e.g., switching from one ORM to another)
  • Different teams or programmers using different conventions
  • Buggy data migrations when merging databases
  • Legacy code that behaves differently than new code
  • Application bugs that occasionally let through an empty string instead of NULL or vice versa

This leads to situations where we have a mix of values in the database and need to write complex conditions:

SELECT * FROM tbl
WHERE foo = '' OR foo IS NULL;

Even worse, NULL behaves unintuitively in comparisons:

SELECT * FROM tbl WHERE foo = ''; -- doesn't include NULL
SELECT * FROM tbl WHERE foo <> ''; -- also doesn't include NULL

-- we must use
SELECT * FROM tbl WHERE foo IS NULL;
SELECT * FROM tbl WHERE foo <=> NULL;

This inconsistency in comparison operators' behavior is another reason why it's better to use only one way of representing empty values in the database.

Why Avoid the Dual Approach

A similar situation exists in JavaScript, where we have null and undefined. After years of experience, many JavaScript developers concluded that distinguishing between these two states brings more problems than benefits and decided to use only the system-native undefined.

In the database world, the situation is similar. Instead of constantly dealing with whether something is an empty string or NULL, it's often simpler to choose one approach and stick to it. For example, Oracle database essentially equates empty strings and NULL values, thus elegantly avoiding this problem. It's one of the places where Oracle deviates from the SQL standard, but it simplifies working with empty/NULL values.

How can we achieve something similar in MySQL?

What Do We Actually Want to Enforce?

  1. For required fields (NOT NULL), we want to enforce that they always contain meaningful values. That means preventing empty strings (or strings containing only spaces).
  2. For optional fields (NULL), we want to prevent storing empty strings. When a field is optional, NULL should be the only representation of an “unfilled value”. Mixing both approaches in one column leads to problems with querying and JOIN operations, as we showed above.

Solution in MySQL

Historically in MySQL, it made sense to use exclusively empty strings ('') instead of NULL values. It was the only approach that could be enforced using the NOT NULL constraint. If we wanted an automatically consistent database, this was the only way.

However, there's one important case where this approach fails – when we need a unique index on the column. MySQL considers multiple empty strings as the same value, while multiple NULL values are considered different.

However, since MySQL version 8.0.16, we can use CHECK constraints and have more control over what values we allow. We can, for example, enforce that a column will either be NULL or contain a non-empty string:

CREATE TABLE users (
    id INT PRIMARY KEY,

    -- Required field - must contain some non-empty text
    email VARCHAR(255) NOT NULL UNIQUE
        CONSTRAINT email_not_empty      -- rule name
        CHECK (email != ''),

    -- Optional field - either NULL or non-empty text
    nickname VARCHAR(255)
        CONSTRAINT nickname_not_empty
        CHECK (nickname IS NULL OR nickname != '')
);

When creating a CHECK constraint, it's important to give it a meaningful name using the CONSTRAINT keyword. This way, we get a meaningful error message, "Check constraint 'nickname_not_empty' is violated", instead of a generic constraint violation notice. This significantly helps with debugging and application maintenance.

The problem isn't just empty strings, but also strings containing only spaces. We can improve the CHECK constraint solution using the TRIM function:

CREATE TABLE users (
    id INT PRIMARY KEY,
    email VARCHAR(255) NOT NULL UNIQUE
        CONSTRAINT email_not_empty
        CHECK (TRIM(email) != ''),
   ...
);

Now validation bypass attempts like this one won't work either:

INSERT INTO users (id, email) VALUES (1, '   ');  -- all spaces

Practical Solution in Nette Framework

A consistent approach to empty values needs to be handled at the application level too. If you're using Nette Framework, the setNullable() method offers an elegant solution:

use Nette\Forms\Form;

$form = new Form;
$form->addText('billing_street')
    ->setNullable(); // empty input transforms to NULL
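Outside Nette, the same idea can be applied wherever input enters the application. Here is a minimal JavaScript sketch; emptyToNull is a hypothetical helper, not part of any framework, and it assumes that whitespace-only input should count as empty.

```javascript
// Hypothetical helper: converts blank input to null before it reaches the database
function emptyToNull(value) {
    if (value === null || value === undefined) {
        return null;
    }
    const trimmed = String(value).trim();
    return trimmed === '' ? null : trimmed;
}

console.log(emptyToNull(''));           // null
console.log(emptyToNull('   '));        // null
console.log(emptyToNull(' Main St '));  // Main St
```

Running every optional text field through one such function keeps the "NULL is the only empty value" rule in a single place.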

Recommendations for Practice

  1. At the start of the project, decide on one approach:
    • Either use only NULL for missing values
    • Or use only empty strings for empty/missing values
  2. Document this decision in the project documentation
  3. Use CHECK constraints to enforce consistency
  4. For existing projects:
    • Conduct an audit of the current state
    • Prepare a migration script to unify the approach
    • Don't forget to adjust application logic

With this approach, you'll avoid many problems with comparing, indexing, and JOIN operations that arise from mixing NULL and empty strings. Your database will be more consistent and queries simpler.


Renaming ENUM Values Without Data Loss: A Safe Guide

Renaming values in a MySQL ENUM column can be tricky. Many developers attempt a direct change, which often results in data loss or errors. We'll show you the correct and safe way to do it.

Imagine a typical scenario: You have an orders table in your database with a status column of type ENUM. It contains the values waiting_payment, processing, shipped, and cancelled. The requirement is to rename waiting_payment to unpaid and shipped to completed. How can this be done without risk?

What Doesn't Work

First, let's look at what does not work. Many developers try this straightforward approach:

-- THIS DOES NOT WORK!
ALTER TABLE orders
MODIFY COLUMN status ENUM(
    'unpaid',      -- previously 'waiting_payment'
    'processing',  -- unchanged
    'completed',   -- previously 'shipped'
    'cancelled'    -- unchanged
);

This approach is a recipe for disaster. MySQL will attempt to map existing values to the new ENUM, and since the original values are no longer in the definition, it will either replace them with an empty string or return the error Data truncated for column 'status' at row X. In a production database, this would mean losing important data.

Backup First!

Before making any structural changes to your database, it is absolutely crucial to create a data backup. Use mysqldump or another trusted tool.

The Correct Approach

The correct approach consists of three steps:

  1. First, extend the ENUM with new values.
  2. Update the data.
  3. Finally, remove the old values.

Let's go through it step by step:

1. The first step is to add the new values to the ENUM while keeping the original ones:

ALTER TABLE orders
MODIFY COLUMN status ENUM(
    'waiting_payment',  -- original value
    'processing',       -- unchanged
    'shipped',          -- original value
    'cancelled',        -- unchanged
    'unpaid',           -- new value (replaces waiting_payment)
    'completed'         -- new value (replaces shipped)
);

2. Now we can safely update the existing data:

UPDATE orders SET status = 'unpaid' WHERE status = 'waiting_payment';
UPDATE orders SET status = 'completed' WHERE status = 'shipped';

3. Finally, once all data has been converted to the new values, we can remove the old ones:

ALTER TABLE orders
MODIFY COLUMN status ENUM(
    'unpaid',
    'processing',
    'completed',
    'cancelled'
);
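If you perform such renames often, the three steps can be generated by a small script. The following JavaScript sketch uses a hypothetical helper, buildEnumRenameSql. It does no escaping, so it assumes trusted table, column, and value names, and it ignores other column attributes such as NOT NULL or DEFAULT, which MODIFY COLUMN would require you to restate.

```javascript
// Hypothetical helper: returns the SQL statements for the extend / update / shrink steps.
// No escaping is performed; use only with trusted identifiers and values.
function buildEnumRenameSql(table, column, currentValues, renames) {
    const renameMap = new Map(Object.entries(renames));
    const quoteList = (values) => values.map((value) => `'${value}'`).join(', ');

    // Step 1: keep the old values and append the new ones that are not present yet
    const added = [...renameMap.values()].filter((value) => !currentValues.includes(value));
    const extended = [...currentValues, ...added];

    // Step 3: each old value replaced by its new name, original order preserved
    const finalValues = currentValues.map((value) =>
        renameMap.has(value) ? renameMap.get(value) : value
    );

    return [
        `ALTER TABLE ${table} MODIFY COLUMN ${column} ENUM(${quoteList(extended)});`,
        // Step 2: one UPDATE per renamed value
        ...[...renameMap].map(
            ([from, to]) => `UPDATE ${table} SET ${column} = '${to}' WHERE ${column} = '${from}';`
        ),
        `ALTER TABLE ${table} MODIFY COLUMN ${column} ENUM(${quoteList(finalValues)});`,
    ];
}

const statements = buildEnumRenameSql(
    'orders',
    'status',
    ['waiting_payment', 'processing', 'shipped', 'cancelled'],
    { waiting_payment: 'unpaid', shipped: 'completed' }
);
console.log(statements.join('\n'));
```

Review the generated statements before running them, and run them in order.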

Why Does This Work?

This works because of how MySQL handles ENUM values. When performing an ALTER TABLE modification on an ENUM column, MySQL tries to map existing values based on their textual representation. If the original value does not exist in the new ENUM definition, MySQL will either throw an error (if strict SQL mode, for example STRICT_ALL_TABLES, is enabled in sql_mode) or replace it with an empty string.

That's why it's crucial to have both old and new values present in the ENUM simultaneously during the transition phase. In our case, this ensures that every record in the database retains its exact textual equivalent. Only after executing the UPDATE queries—when we are sure that all data is using the new values—can we safely remove the old ones.


Two Words That Kill Open Source

Do you know what you should NEVER, and I mean NEVER, say to open-source project authors? “I don't have time.” These two words can destroy a developer’s motivation faster than an iPhone battery drains while scrolling TikTok.

  • “I don't have time to write a fix.”
  • “I don't have time to create a bug report.”
  • “This should be in the documentation, but I don’t have time to write it.”

Really? REALLY?!

Imagine you're at a party, and someone says to you: “Hey, you with the beer! Make me a sandwich. I don’t have time to make it myself, I’m too busy eating chips.” How would you feel? Like a vending machine with a face? That’s exactly how I feel when I read words like that. My motivation to help vanishes instantly, and I feel the urge to do anything else — even absolutely nothing.

You see, we open source developers are a peculiar breed. We spend hours of our free time creating software that we then make available to everyone. For free. Voluntarily. It’s like Santa Claus handing out gifts every day of the year, not just on Christmas. We enjoy it. But that doesn’t give anyone the right to boss us around like we’re some kind of digital slaves. So, when someone comes with a request for a new feature but “doesn’t have time” to contribute, it immediately raises the question, “Why should I have the time then?” It’s like asking Michelangelo to paint your living room because you “don’t have time” to do it yourself — as if he has nothing better to do.

Over the years, I’ve accumulated dozens of issues across various projects where I’ve asked, “Could you prepare a pull request?” and the reply was, “I could, but I don’t have time this week.” If that poor soul hadn’t written that sentence, I probably would’ve solved the issue long ago. But by saying that, they basically told me they don’t value my time. So, did they fix it themselves a week later? Not at all… 99% of the things people promised to do were never delivered, which is why 99% of those issues remain unresolved. They hang there like digital monuments to human laziness.

So, dear users, before you write “I don’t have time,” think again. What you’re really saying is, “Hey, you! Your free time is worthless. Drop everything you’re doing and deal with MY problem!” Instead, try this:

  • Find the time. Trust me, it’s there. It might be hiding between episodes of your favorite show or in the time you spend scrolling through social media.
  • Offer a solution. You don’t need to submit a full patch. Just show that you’ve given it some real thought.
  • Motivate open source maintainers to take up your issue. For example, by showing how the change will be useful not just for you, but for the whole of humanity and the surrounding universe.

Next time you find a bug, request a new feature, or notice something missing from the documentation, try to help out the community in some way. Because in the open-source world, we’re all in the same boat. And to keep it moving forward, we all need to row. So don’t just sit there complaining that you “don’t have time” to paddle — grab an oar and do your part. Saying “I don’t have time” is the fastest way to kill the motivation of those who are giving you free software. Try to carve out those few minutes or hours. Your karma will thank you.


Why Is GPT the SQL of Our Century?

And conversely, was SQL the GPT of the seventies?

SQL, which emerged in the 1970s, represented a revolutionary breakthrough in human-computer interaction. Its design aimed to make queries as readable and writable as possible, resembling plain English. For instance, a query to fetch names and salaries of employees in SQL might look like this: SELECT name, salary FROM employee – simple and comprehensible, right? This made databases accessible to a broader audience, not just computer nerds.

Although this intention was commendable, it soon became clear that writing SQL queries still required experts. Moreover, because a universal application interface never emerged, using natural language for modern programming became more of a burden. Programmers today communicate with databases by writing complex SQL command generators, which databases then decode.

Enter GPT. These advanced language models bring a similar revolution in the era of artificial intelligence as SQL did for databases in its time. They enable interaction with databases on an entirely new level, 50 years on. Although there is currently no database directly controlled by a language model, we can use GPT to generate SQL. And it works great.

How to do it? Just copy the structure of your database, i.e., a dump with CREATE TABLE commands, into ChatGPT and include the name of the database server you are using. You can then ask questions directly in Czech.

If you want to make using this technique even easier, use the prepared SQL Wizard, where you just need to copy the database structure and you can start asking questions right away without further explanation.

Let's see a practical example. We will use two database tables linked through a foreign key. The users table will contain information about users, and the orders table will store information about orders made by users. Here are the definitions of these tables:

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100)
);

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    user_id INT,
    created_at DATE,
    amount DECIMAL(10, 2),
    FOREIGN KEY (user_id) REFERENCES users(id)
);

Copy these definitions into ChatGPT and you can immediately ask naturally and in Czech, for example, about all orders from user Jan Novák. Specifically, I would ask: “Write SQL that lists all orders from user Jan Novák.” And GPT might respond like this:

SELECT orders.order_id, orders.created_at, orders.amount
FROM orders
JOIN users ON orders.user_id = users.id
WHERE users.name = 'Jan Novák';

Many programmers can, of course, write such an SQL query faster than logging into ChatGPT 🙂 But let's try a slightly more complex example: “List users who have only one order.”

SELECT users.id, users.name, users.email
FROM users
JOIN orders ON users.id = orders.user_id
GROUP BY users.id, users.name, users.email
HAVING COUNT(orders.order_id) = 1;

Or “List users who have no orders.”

SELECT users.*
FROM users
LEFT JOIN orders ON users.id = orders.user_id
WHERE orders.order_id IS NULL;

Here the fundamental difference in approach is evident. While with GPT, I specify what data I want to get, in SQL, I have to describe how to get it. And that requires knowledge and experience.

Thanks to GPT, you can focus on what you want to get, instead of how to get it, which is especially valuable for those who are not SQL experts.

Finally, let's demonstrate how to use GPT to create database triggers. For example, you can write: “Add a column order_count to the users table that will track the number of orders. Initialize it to the value corresponding to the current number of orders for each user. And create a trigger that will update the value when an order is created or deleted.”

GPT will generate queries for adding and initializing the order_count column:

ALTER TABLE users ADD COLUMN order_count INT DEFAULT 0;
UPDATE users SET order_count = (SELECT COUNT(*) FROM orders WHERE orders.user_id = users.id);

and the required triggers to keep its value up to date:

CREATE TRIGGER update_order_count_after_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    UPDATE users
    SET order_count = order_count + 1
    WHERE id = NEW.user_id;
END;

CREATE TRIGGER update_order_count_after_delete
AFTER DELETE ON orders
FOR EACH ROW
BEGIN
    UPDATE users
    SET order_count = order_count - 1
    WHERE id = OLD.user_id;
END;

GPT offers a way to work effectively and intuitively with databases, even for those who are not SQL experts. It's a revolutionary tool that truly makes advanced database operations accessible to the general public. However, it is still crucial to carefully check each output to ensure data correctness and security.


Can Regular Expressions Be Used to Parse HTML?

Let's once and for all crack this eternal question that divides the programming community. I decided to dive into the dark waters of regular expressions to bring an answer (spoiler: yes, it's possible).

So, what exactly does an HTML document contain? It's a mix of text, entities, tags, comments, and the special doctype tag. Let's first explore each ingredient separately.

Entities

The foundation of an HTML page is text, which consists of ordinary characters and special sequences called HTML entities. These can be either named, like &nbsp; for a non-breaking space, or numerical, either in decimal &#160; or hexadecimal &#xA0; format. A regular expression capturing an HTML entity would look like this:

(?<entity>
	&
	(
		[a-z][a-z0-9]+     # named entity
		|
		\#\d+              # decimal number
		|
		\#x[0-9a-f]+       # hexadecimal number
	)
	;
)

All regular expressions are written in extended mode, ignore case, and let the dot match any character. In other words, they use the modifiers s, i, and x (six).
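To try the entity expression quickly, here is a compact JavaScript variant. JavaScript has no extended mode, so the pattern is written on one line without comments; the sample text is made up.

```javascript
// Same three alternatives as above: named, decimal, and hexadecimal entity
const entityPattern = /&(?:[a-z][a-z0-9]+|#\d+|#x[0-9a-f]+);/gi;

const text = 'Tom&nbsp;&amp;&#160;Jerry &#xA0; AT&T';
const entities = text.match(entityPattern);

console.log(entities); // [ '&nbsp;', '&amp;', '&#160;', '&#xA0;' ]
```

The bare ampersand in AT&T is not matched because no semicolon-terminated entity name follows it.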

Tags

These iconic elements make HTML what it is. A tag starts with <, followed by the tag name, possibly a set of attributes, and closes with > or />. Attributes can optionally have a value, which can be enclosed in double, single, or no quotes. A regular expression capturing an attribute would look like this:

(?<attribute>
	\s+                         # at least one white space before the attribute
	[^\s"'<>=`/]+               # attribute name
	(
		\s* = \s*               # equals sign before the value
		(
			"                   # value enclosed in double quotes
				(
					[^"]        # any character except double quote
					|
					(?&entity)  # or HTML entity
				)*
			"
			|
			'                   # value enclosed in single quotes
				(
					[^']        # any character except single quote
					|
					(?&entity)  # or HTML entity
				)*
			'
			|
			[^\s"'<>=`]+         # value without quotes
		)
	)?                           # value is optional
)

Notice that I am referring to the previously defined entity group.
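JavaScript regular expressions cannot call a named group the way (?&entity) does, but a similar reuse can be approximated by composing pattern sources. The sketch below is an assumption-laden port, not the expression from this article: the variable names are invented and the sample input is made up.

```javascript
// Building blocks, each written once and reused through .source
const entity = /&(?:[a-z][a-z0-9]+|#\d+|#x[0-9a-f]+);/;
const attrName = /[^\s"'<>=`\/]+/;
const unquotedValue = /[^\s"'<>=`]+/;

const attribute = new RegExp(
    '\\s+' + attrName.source +                                 // whitespace and attribute name
    '(?:\\s*=\\s*(?:' +
        '"(?:[^"]|' + entity.source + ')*"' +                  // value in double quotes
        "|'(?:[^']|" + entity.source + ")*'" +                 // value in single quotes
        '|' + unquotedValue.source +                           // value without quotes
    '))?',                                                     // value is optional
    'gi'
);

const attributeMatches = ' href="a&amp;b" disabled data-x=1'.match(attribute);
console.log(attributeMatches); // [ ' href="a&amp;b"', ' disabled', ' data-x=1' ]
```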

Elements

An element can represent either a standalone tag (so-called void element) or paired tags. There is a fixed list of void element names by which they are recognized. A regular expression for capturing them would look like this:

(?<void_element>
	<                  # start of the tag
	(                  # element name
		img|hr|br|input|meta|area|embed|keygen|source|base|col
		|link|param|basefont|frame|isindex|wbr|command|track
	)
	(?&attribute)*     # optional attributes
	\s*
	/?                 # optional /
	>                  # end of the tag
)

Other tags are thus paired and captured by this regular expression (I use a reference to the content group, which we will define later):

(?<element>
	<                  # starting tag
	(?<element_name>
		[a-z][^\s/>]*  # element name
	)
	(?&attribute)*     # optional attributes
	\s*
	>                  # end of the starting tag
	(?&content)*
	</                 # ending tag
	(?P=element_name)  # repeat element name
	\s*
	>                  # end of the ending tag
)

A special case is elements like <script>, whose content must be processed differently from other elements:

(?<special_element>
	<                  # starting tag
	(?<special_element_name>
		script|style|textarea|title  # element name
	)
	(?&attribute)*     # optional attributes
	\s*
	>                  # end of the starting tag
	(?>                # atomic group
		.*?            # smallest possible number of any characters
		</             # ending tag
		(?P=special_element_name)
	)
	\s*
	>                  # end of the ending tag
)

The lazy quantifier .*? ensures that the expression stops at the first ending sequence, and the atomic group ensures that this stop is definitive.

Comments

A typical HTML comment starts with the sequence <!-- and ends with -->. A regular expression for HTML comments might look like this:

(?<comment>
	<!--
	(?>           # atomic group
		.*?       # smallest possible number of any characters
		-->
	)
)

The lazy quantifier .*? again ensures that the expression stops at the first ending sequence, and the atomic group ensures that this stop is definitive.
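The effect of the lazy quantifier is easy to observe in JavaScript. Note that JavaScript has no atomic groups, so this sketch only demonstrates the lazy versus greedy difference; the sample HTML is made up.

```javascript
const html = '<!-- first -->\n<p>text</p>\n<!-- second\nline -->';

const lazyComment = /<!--.*?-->/gs;   // stops at the first -->
const greedyComment = /<!--.*-->/gs;  // runs to the last -->

const lazyMatches = html.match(lazyComment);
const greedyMatches = html.match(greedyComment);

console.log(lazyMatches.length);   // 2
console.log(greedyMatches.length); // 1
```

The greedy variant swallows the paragraph between the two comments, which is exactly what the lazy quantifier prevents.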

Doctype

This is a historical relic that exists today only to switch the browser to so-called standards mode. It usually looks like <!doctype html>, but can contain other characters as well. Here is the regular expression that captures it:

(?<doctype>
	<!doctype
	\s
	[^>]*         # any character except '>'
	>
)

Putting It All Together

With the regular expressions ready for each part of HTML, it's time to create an expression for the entire HTML 5 document:

\s*
(?&doctype)?              # optional doctype
(?<content>
	(?&void_element)      # void element
	|
	(?&special_element)   # special element
	|
	(?&element)           # paired element
	|
	(?&comment)           # comment
	|
	(?&entity)            # entity
	|
	[^<]                  # character
)*

We can combine all the parts into one complex regular expression. This is it, a superhero among regular expressions with the ability to parse HTML 5.

Final Notes

Even though we have shown that HTML 5 can be parsed using regular expressions, the provided example is not useful for processing an HTML document. It will fail on invalid documents. It will be slow. And so on. In practice, regular expressions like the following are more commonly used (for finding URLs of images):

<img.+?src=["'](.+?)["'].*?>

But this is a very unreliable solution that can lead to errors. This regexp incorrectly matches custom tags such as <imgs-tag src="image.jpg">, custom attributes like <img data-src="custom info">, or fails when the attribute contains a quote <img src="mcdonald's.jpg">. Therefore, it is recommended to use specialized libraries. In the world of PHP, we're unlucky because the DOM extension supports only the ancient, decaying HTML 4. Fortunately, PHP 8.4 promises an HTML 5 parser.
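
The three failure cases are easy to reproduce. The following sketch runs the pattern from the text against them in JavaScript; the helper name `extractSrc` is made up for this example.

```javascript
// The naive pattern from the text, copied verbatim.
const imgSrc = /<img.+?src=["'](.+?)["'].*?>/;

// Returns the captured "URL", or null when the pattern does not match.
function extractSrc(html) {
	const match = imgSrc.exec(html);
	return match ? match[1] : null;
}

const customTag = extractSrc('<imgs-tag src="image.jpg">');
const dataAttr = extractSrc('<img data-src="custom info">');
const apostrophe = extractSrc('<img src="mcdonald\'s.jpg">');

console.log(customTag);   // "image.jpg"   (not an <img> element at all)
console.log(dataAttr);    // "custom info" (not the src attribute)
console.log(apostrophe);  // "mcdonald"    (URL cut off at the apostrophe)
```

In the third case the pattern still matches, which is arguably worse than failing outright: the caller receives a truncated URL with no indication that anything went wrong.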


When Copilot Loses Direction: A Celebration of Shoddy Workmanship

A video from Microsoft, intended to be a dazzling demonstration of Copilot's capabilities, is instead a tragically comedic presentation of the decline in programming craftsmanship.

I'm referring to this video. It's supposed to showcase the abilities of GitHub Copilot, including how to use it to write a regular expression for searching <img> tags with the hero-image class. However, the original code being modified is as holey as Swiss cheese, something I would be embarrassed to use. Copilot gets carried away and instead of correcting, continues in the same vein.

The result is a regular expression that unintentionally matches other classes, tags, attributes, and so on. Worse still, it fails if the src attribute is listed before class.

I write about this because this demonstration of shoddy work, especially considering the official nature of the video, is startling. How is it possible that none of the presenters or their colleagues noticed this? Or did they notice and decide it didn't matter? That would be even more disheartening. Teaching programming requires precision and thoroughness, without which incorrect practices can easily be propagated. The video was meant to celebrate the art of programming, but I see in it a bleak example of how the level of programming craftsmanship is falling into the abyss of carelessness.

Just to give a bit of a positive spin: the video does a good job of showing how Copilot and GPT work, so you should definitely give it a look 🙂


Tabs Instead of Spaces as a Courtesy

You've probably encountered the “tabs vs. spaces” debate for indentation before. This argument has been around for ages, and both sides present their reasons:

Tabs:

  • Indenting is their purpose
  • Smaller files, as indentation takes up one character
  • You can set your own indentation width (more on this later)

Spaces:

  • Code will look the same everywhere, and consistency is key
  • Avoid potential issues in environments sensitive to whitespace

But what if it's about more than personal preference? ChaseMoskal recently posted a thought-provoking entry on Reddit titled Nobody talks about the real reason to use tabs instead of spaces that might open your eyes.

The Main Reason to Use Tabs

Chase describes his experience with implementing spaces at his workplace and the negative impacts it had on colleagues with visual impairments.

One of them was accustomed to using a tab width of 1 to avoid large indentations when using large fonts. Another uses a tab width of 8 because it suits him best on an ultra-wide monitor. For both, however, code with spaces poses a serious problem, requiring them to convert spaces to tabs before reading and back to spaces before committing.

For blind programmers who use Braille displays, each space represents one Braille cell. Therefore, if the default indentation is 4 spaces, a third-level indentation wastes 12 precious Braille cells even before the start of the code. On a 40-cell display, which is most commonly used with laptops, this is more than a quarter of the available cells, wasted without conveying any information.

Adjusting the width of indentation may seem trivial to us, but for some programmers, it is absolutely essential. And that’s something we simply cannot ignore.

By using tabs in our projects, we give them the opportunity for this adjustment.

Accessibility First, Then Personal Preference

Sure, not everyone can be persuaded to choose one side over the other when it comes to preferences. Everyone has their own. And we should appreciate the option to choose.

However, we must ensure that we consider everyone. We should respect differences and use accessible means. Like the tab character, for instance.

I think Chase put it perfectly when he mentioned in his post that “…there is no counterargument that comes close to outweighing the accessibility needs of our colleagues.”

Accessible First

The “mobile first” methodology became popular in web design because it ensures that everyone, regardless of device, has a great experience with the product. In the same way, we should strive for an “accessible first” environment in which everyone has the same opportunity to work with code, whether in employment or on an open-source project.

If tabs become the default choice for indentation, we remove one barrier. Collaboration will then be pleasant for everyone, regardless of their abilities. If everyone has the same opportunities, we can fully utilize our collective potential ❤️


This article is based on Default to tabs instead of spaces for an ‘accessible first’ environment. I read a similarly convincing post in 2008 and changed from spaces to tabs in all my projects that very day. It left a trace in Git, but the article itself has disappeared into the annals of history.

3 years ago in section Random


What are SameSite Cookies and Why Do We Need Them?

SameSite cookies provide a mechanism for recognizing what led to the loading of a page: clicking a link on another website, submitting a form, loading inside an iframe, a request made by JavaScript, and so on.

Identifying how a page was loaded is crucial for security. The serious vulnerability known as Cross-Site Request Forgery (CSRF) has been with us for over twenty years, and SameSite cookies offer a systematic way to address it.

A CSRF attack involves an attacker luring a victim to a webpage that inconspicuously makes a request to a web application where the victim is logged in, and the application believes the request was made voluntarily by the victim. Thus, under the identity of the victim, some action is performed without the victim knowing. This could involve changing or deleting data, sending a message, etc. To prevent such attacks, applications need to distinguish whether the request came from a legitimate source, e.g., by submitting a form on the application itself, or from elsewhere. SameSite cookies can do this.

How does it work? Let’s say I have a website running on a domain, and I create three different cookies with attributes SameSite=Lax, SameSite=Strict, and SameSite=None. Name and value do not matter. The browser will store them.

  1. When I open any URL on my website by typing directly into the address bar or clicking on a bookmark, the browser sends all three cookies.
  2. When I access any URL on my website from a page from the same website, the browser sends all three cookies.
  3. When I access any URL on my website from a page from a different website, the browser sends only the cookies with None and in certain cases Lax, see table:
Request type   Code on another website              Sent cookies
Link           <a href="…">                         None + Lax
Form GET       <form method="GET" action="…">       None + Lax
Form POST      <form method="POST" action="…">      None
iframe         <iframe src="…">                     None
AJAX           $.get('…'), fetch('…')               None
Image          <img src="…">                        None
Prefetch       <link rel="prefetch" href="…">       None
Other                                               None
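
The three rules and the table can be condensed into a small lookup. This is only a model of the behavior described here, not browser code; the trigger labels (`link`, `formGet`, and so on) are hypothetical names for the table rows.

```javascript
// SameSite classes the browser attaches to a cross-site request,
// keyed by how the request was triggered (hypothetical labels).
const CROSS_SITE_RULES = new Map([
	['link', ['None', 'Lax']],
	['formGet', ['None', 'Lax']],
	['formPost', ['None']],
	['iframe', ['None']],
	['ajax', ['None']],
	['image', ['None']],
	['prefetch', ['None']],
]);

// Returns the SameSite values whose cookies would be sent.
// crossSite = false covers points 1 and 2 (direct entry or same-site navigation).
function sentCookies(trigger, crossSite) {
	if (!crossSite) {
		return ['None', 'Lax', 'Strict'];
	}
	// Any trigger not listed falls back to None only.
	return CROSS_SITE_RULES.get(trigger) ?? ['None'];
}

console.log(sentCookies('link', true));
console.log(sentCookies('formPost', true));
console.log(sentCookies('formPost', false));
```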

SameSite cookies can distinguish only a few cases, but these are crucial for protecting against CSRF.

If, for example, there is a form or a link for deleting an item on my website's admin page and it was sent/clicked, the absence of a cookie created with the Strict attribute means it did not happen on my website but rather the request came from elsewhere, indicating a CSRF attack.

Create the cookie used to detect a CSRF attack as a so-called session cookie, without the Expires attribute; its validity is essentially infinite.
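
A minimal sketch of such a check in JavaScript might look as follows. The cookie name `samesite-check` and all function names are assumptions made for this example; they are not taken from any framework.

```javascript
// Hypothetical cookie name; any name and value will do.
const CHECK_COOKIE = 'samesite-check';

// Value for the Set-Cookie response header. There is no Expires or Max-Age,
// so the browser treats it as a session cookie.
function checkCookieHeader() {
	return `${CHECK_COOKIE}=1; Path=/; HttpOnly; SameSite=Strict`;
}

// Parses a Cookie request header into a Map of name => value.
function parseCookies(cookieHeader) {
	const cookies = new Map();
	for (const pair of (cookieHeader ?? '').split(';')) {
		const index = pair.indexOf('=');
		if (index > 0) {
			cookies.set(pair.slice(0, index).trim(), pair.slice(index + 1).trim());
		}
	}
	return cookies;
}

// True when the Strict cookie arrived, i.e. the request came from our own site.
function isSameSiteRequest(cookieHeader) {
	return parseCookies(cookieHeader).has(CHECK_COOKIE);
}
```

Note that the cookie is also missing on a visitor's very first request, so a check like this is best applied to actions such as form submissions, where the user must already have loaded a page from the site.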

Domain vs Site

“On my website” is not the same as “on my domain”: what matters is not the domain but the site (hence the name SameSite). Although the site often corresponds to the domain, for services like github.io, it corresponds to the subdomain. A request from doc.nette.org to files.nette.org is same-site, while a request from nette.github.io to tracy.github.io is already cross-site. Here it is nicely explained.
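
The distinction can be sketched in code. Browsers decide what counts as a site using the Public Suffix List; the example below replaces it with a tiny hard-coded array that covers only the hostnames mentioned above, so treat it as an illustration rather than a usable implementation. The function names are made up.

```javascript
// Tiny stand-in for the Public Suffix List; a real implementation needs the
// full list. Only the suffixes used in the examples are included.
const PUBLIC_SUFFIXES = ['github.io', 'org', 'io'];

// Returns the "site": the public suffix plus one more label to its left.
function siteOf(hostname) {
	const host = hostname.toLowerCase();
	const labels = host.split('.');
	// Walk from the longest candidate suffix to the shortest.
	for (let i = 0; i < labels.length; i++) {
		const candidate = labels.slice(i).join('.');
		if (PUBLIC_SUFFIXES.includes(candidate)) {
			// i === 0 means the hostname is itself a public suffix.
			return i === 0 ? host : labels.slice(i - 1).join('.');
		}
	}
	return host; // unknown suffix: treat the host as its own site
}

function isSameSite(hostA, hostB) {
	return siteOf(hostA) === siteOf(hostB);
}

console.log(isSameSite('doc.nette.org', 'files.nette.org'));   // true
console.log(isSameSite('nette.github.io', 'tracy.github.io')); // false
```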

<iframe>

From the previous lines, it is clear that if a page from my website is loaded inside an <iframe> on another website, the browser does not send Strict or Lax cookies. But there's another important thing: if such a loaded page creates Strict or Lax cookies, the browser ignores them.

This creates a possibility to defend against fraudulent acquisition of cookies or Cookie Stuffing, where until now, systemic defense was also lacking. The trick is that the fraudster collects a commission for affiliate marketing, although the user was not brought to the merchant's website by a user-clicked link. Instead, an invisible <iframe> with the same link is inserted into the page, marking all visitors.

Cookies without the SameSite Attribute

Cookies without the SameSite attribute were always sent during both same-site and cross-site requests. Just like SameSite=None. However, in the near future, browsers will start treating the SameSite=Lax flag as the default, so cookies without an attribute will be considered Lax. This is quite an unusually large BC break in browser behavior. If you want the cookie to continue to behave the same and be transmitted during any cross-site request, you need to set it to SameSite=None. (Unless you develop embedded widgets, etc., you probably won't want this often.) Unfortunately, for last year's browsers, the None value is unexpected. Safari 12 interprets it as Strict, thus creating a tricky problem on older iOS and macOS.

And note: None works only when set with the Secure attribute.
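
One way to avoid forgetting this rule is to build the header through a helper that adds Secure automatically. The following is a sketch; `buildSetCookie` and its option names are illustrative and not part of any library.

```javascript
// Builds a Set-Cookie header value (illustrative helper).
function buildSetCookie(name, value, { sameSite = 'Lax', secure = false } = {}) {
	if (!['Strict', 'Lax', 'None'].includes(sameSite)) {
		throw new Error(`Unknown SameSite value: ${sameSite}`);
	}
	// SameSite=None only works together with Secure, so force it on.
	const needsSecure = secure || sameSite === 'None';
	const parts = [`${name}=${encodeURIComponent(value)}`, 'Path=/', `SameSite=${sameSite}`];
	if (needsSecure) {
		parts.push('Secure');
	}
	return parts.join('; ');
}

console.log(buildSetCookie('widget', '1', { sameSite: 'None' }));
// widget=1; Path=/; SameSite=None; Secure
```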

What to Do in Case of an Attack?

Run away! That is the basic rule of self-defense, both in real life and on the web. A huge mistake made by many frameworks is that upon detecting a CSRF attack, they display the form again and write something like “The CSRF token is invalid. Please try to submit the form again”. By resubmitting the form, the attack is completed. Such protection makes no sense when you actually invite the user to bypass it.

Until recently, Chrome did that during a cross-site request—it displayed the page again after a refresh, but this time sent the cookies with the Strict attribute. So, the refresh eliminated the CSRF protection based on SameSite cookies. Fortunately, it no longer does this today, but it's possible that other or older browsers still do. A user can also “refresh” the page by clicking on the address bar + enter, which is considered a direct URL entry (point 1), and all cookies are sent.

Thus, the best response to detecting CSRF is to redirect with a 302 HTTP code elsewhere, perhaps to the homepage. This rids you of dangerous POST data, and the problematic URL isn't saved to history.
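
As a sketch, the decision can be isolated in one small function. The name `csrfResponse` and the default target `/` are assumptions; how the application determines `sameSite` is left to the caller.

```javascript
// Decides how to answer a state-changing request.
// `sameSite` is the result of whatever check the application uses;
// `homeUrl` is an assumed default redirect target.
function csrfResponse(sameSite, homeUrl = '/') {
	if (sameSite) {
		return null; // nothing suspicious, continue with normal processing
	}
	// Do not re-render the form: redirect away and drop the POST data.
	return { status: 302, headers: { Location: homeUrl } };
}

// With Node's built-in http module, usage would look roughly like this:
//   const reply = csrfResponse(requestIsSameSite);
//   if (reply) { res.writeHead(reply.status, reply.headers); res.end(); return; }
```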

Incompatibilities

SameSite hasn't worked nearly as well as it should have for a long time, mainly due to browser bugs and deficiencies in the specification, which, for example, didn't address redirections or refreshes. SameSite cookies weren't transferred during saving or printing a page, but were transferred after a refresh when they shouldn't have been, etc. Fortunately, the situation is better today. I believe the only serious shortcoming that persists in current browser versions is the one mentioned above for Safari.

Addendum: Besides SameSite, the origin of a request can now also be distinguished by the Origin header, which is more privacy-respecting and more accurate than the Referer header.

5 years ago in section Random


How to Properly Set Up CSP and `script-src`

Content Security Policy (CSP) is an additional security feature that tells the browser what external sources a page can load and how it can be displayed. It protects against the injection of malicious code and attacks such as XSS. It is sent as a header composed of a series of directives. However, implementing it is not trivial.

Typically, we want to use JavaScript libraries located outside our server, such as Google Analytics, advertising systems, captchas, etc. Unfortunately, the first version of CSP fails here. It requires a precise analysis of the content loaded and the setting of the correct rules. This means creating a whitelist, a list of all the domains, which is not easy since some scripts dynamically pull other scripts from different domains or are redirected to other domains, etc. Even if you take the effort and manually create the list, you never know what might change in the future, so you must constantly monitor if the list is still up-to-date and correct it. Analysis by Google showed that even this meticulous tuning ultimately results in allowing such broad access that the whole purpose of CSP falls apart, just sending much larger headers with each request.

CSP level 2 approaches the problem differently, using a nonce, but only level 3 completed the solution. Unfortunately, as of 2019, it does not have sufficient browser support.

Regarding how to assemble the script-src and style-src directives to work correctly even in older browsers and to minimize the effort, I have written a detailed article in the Nette partner section. Essentially, the resulting form might look like this:

script-src 'nonce-XXXXX' 'strict-dynamic' * 'unsafe-inline'
style-src 'nonce-XXXXX' * 'unsafe-inline'

Example of Use in PHP

We generate a nonce and send the header:

$nonce = base64_encode(random_bytes(16));

header("Content-Security-Policy: script-src 'nonce-$nonce' 'strict-dynamic' * 'unsafe-inline'");

And we insert the nonce into the HTML code:

<script nonce="<?=$nonce?>" src="..."></script>

Example of Use in Nette

Since Nette has built-in support for CSP and nonce since version 2.4, simply specify in the configuration file:

http:
	csp:
		script-src: [nonce, strict-dynamic, *, unsafe-inline]
		style-src: [nonce, *, unsafe-inline]

And then use in templates:

<script n:nonce src="..."></script>
<style n:nonce>...</style>

Monitoring

Before you set new rules for CSP, try them out first using the Content-Security-Policy-Report-Only header. This header works in all browsers that support CSP. If a rule is violated, the browser does not block the script but instead sends a notification to the URL specified in the report-uri directive. To receive and analyze these notifications, you might use a service like Report URI.

http:
	cspReportOnly:
		script-src: [nonce, strict-dynamic, *, unsafe-inline]
		report-uri: https://xxx.report-uri.com/r/d/csp/reportOnly

You can use both headers simultaneously, with Content-Security-Policy having verified and active rules and Content-Security-Policy-Report-Only to test their modifications. Of course, you can also monitor failures in the strict rules.
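
For readers running Node.js rather than PHP or Nette, sending both headers at once could be sketched like this. The function names are made up, `reportUrl` is a placeholder supplied by the caller, and the stricter report-only policy is merely an example of a modification being tested, not a recommendation from the article.

```javascript
const crypto = require('node:crypto');

// A fresh nonce for every response: 16 random bytes, base64-encoded.
function generateNonce() {
	return crypto.randomBytes(16).toString('base64');
}

// Returns both headers for one response.
function cspHeaders(nonce, reportUrl) {
	return {
		// Verified, active rules (same form as shown earlier).
		'Content-Security-Policy':
			`script-src 'nonce-${nonce}' 'strict-dynamic' * 'unsafe-inline'`,
		// Candidate rules under test: violations are reported, not blocked.
		'Content-Security-Policy-Report-Only':
			`script-src 'nonce-${nonce}' 'strict-dynamic'; report-uri ${reportUrl}`,
	};
}

const nonce = generateNonce();
const headers = cspHeaders(nonce, 'https://example.com/csp-report');
console.log(headers);
```

The same nonce must then be placed on every script element in the page, just as in the PHP example.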

5 years ago in section Random


phpFashion © 2004, 2025 David Grudl | about the blog

Source code examples may be used without further restrictions, provided you credit the author and the URL of this website.