mysql: bump sqlalchemy.url MariaDB/MySQL charset to to 'utf8mb4' to get full UTF-8 support
The change in 210e76d69b62 only changed character_set_database, as shown by output after:
--- a/kallithea/model/base.py +++ b/kallithea/model/base.py @@ -46,3 +46,8 @@ def init_model(engine): engine_str = obfuscate_url_pw(str(engine.url)) log.info("initializing db for %s", engine_str) meta.Base.metadata.bind = engine + + meta.Session.configure(bind=engine) + for a, b in meta.Session().execute('''show variables''').fetchall(): + if 'character_set_' in a: + print(a, repr(b))
Before, with charset=utf8, the utf8mb3 charset was used all the way through the stack:
[kallithea.model.base] initializing db for mysql://kallithea-test:XXXXX@localhost/kallithea-test?charset=utf8 character_set_client 'utf8' character_set_connection 'utf8' character_set_database 'utf8mb4' character_set_filesystem 'binary' character_set_results 'utf8' character_set_server 'latin1' character_set_system 'utf8'
With explicit charset=utf8mb4:
[kallithea.model.base] initializing db for mysql://kallithea-test:XXXXX@localhost/kallithea-test?charset=utf8mb4 character_set_client 'utf8mb4' character_set_connection 'utf8mb4' character_set_database 'utf8mb4' character_set_filesystem 'binary' character_set_results 'utf8mb4' character_set_server 'latin1' character_set_system 'utf8'
tests: introduce REUSE_TEST_DB as an alternative to TEST_DB
With TEST_DB, the specified database would be dropped (if existing) and created from scratch. That would fail if the user only had been granted access to that database without general database creation permissions.
With REUSE_TEST_DB, the database must exist, and the tables in the existing database will be deleted before creating the new ones.
This is for example useful for testing on PostgreSQL without full database privileges. (With MariaDB/MySQL, it is possible to grant create permissions to a database before it exists, so it is easier to use TEST_DB.)
Support use of an existing database so the Kallithea database user doesn't have to be granted permissions to create databases.
The existing database must of course have been created "correctly", for example using the right charset and collation on MariaDB/MySQL. Creating the database manually also provides a way to avoid the hardcoded defaults.
db: refactor to clarify that we always invoke SA create_all with checkfirst=False
create_all(checkfirst=True) would skip creating tables that already exist, potentially leaving them with wrong schema.
We are strict and want a successful create_all to leave all the tables in a fully known state. We thus want it to fail if any tables already are present. Existing tables should not silently be accepted.
In MySQL, the character sets for server, database, tables, and connection are set independently. Ideally, they should all use UTF-8, but systems tend to use latin1 as default encoding, for example:
To make things work consistently anyway, we have so far specified the utf8mb4 charset explicitly when creating tables, but there is no corresponding simple option for specifying the collation for tables. We need a better solution.
and there is thus no longer any need for specifying the charset when creating tables.
To be reasonably resilient across all systems without relying on system defaults, we will now start specifying the charset and collation when creating the database, but drop the specification of charset when creating tables.
For investigation of these issues, consider the output from: show variables like '%char%'; show variables like '%collation%'; show create database `KALLITHEA_DB_NAME`; SELECT * FROM information_schema.SCHEMATA WHERE schema_name = "KALLITHEA_DB_NAME"; SELECT * FROM information_schema.TABLES T, information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = "KALLITHEA_DB_NAME";
mysql: bump charset to to 'utf8mb4' to get full UTF-8 support
We used to use 'utf8', but in MySQL, this is just an alias for 'utf8mb3' which isn't full unicode. 'utf8mb4' has less surprises (especially if used with the 'utf8mb4_unicode_ci' collation).
MySQL character sets for server, database, tables, and connection are set independently. Until now, we have specified 'utf8' when creating tables to overrule the database charset and in the default MySQL connection URL.
There is no good reason it only should be possible to comment on content lines. Other lines might not have an obvious locator, but we can live with that as long as each comment only apply in one place.
With this, we actually want the comment bubble on all lines with bubble markup, so we can loosen the css selector.
diff: drop per file ignore-whitespace and context - it didn't work and had conceptual issue (Issue #344)
Diffs are currently generated at the low level as one big diff between two vcs resisions, provided global values for diff context size and flag for ignoring whitespace. All files use the same flags. There is no way to actually compute the full diff using these use per file flags, and no simple and efficient way to add it.
The best option is thus to drop the failed attempt at making it per file, and just rely on the simple global flags in the URL.
The links for changing whitespace and context is sometimes shown for the whole "page", and sometimes next to the diff for one file. For now, keep showing the link in these places, but make sure it navigates back to the FID of the section where the link was clicked.
The implementation is completely rewritten and moved to a more appropriate location in helpers.
With a more clean implementation, we also consistently use the simple getters to extract values from the URL.
diff: fix ignorews/context link to use the right target as anchor
The value in url_fid might not be a valid anchor.
For changesets, url_fid would be like 'C--9c390eb52cd6' even though the actual target included the changeset hash and were like 'C-1536d03b4869-9c390eb52cd6'.
For pullrequests and compare, it wouldn't link to anything at all, even though there was a target like 'C--56535da5df40'.
Instead, pass id_fid as anchor value as a separate argument. That one is a valid anchor.
jQuery has weak "friendly" semantics for data attributes and do for example interpret the data attribute name 'id_for' as 'idFor'. We could thus not retrieve it as 'id_for'.
Instead, avoid the special semantics of '_' by just naming the data attribute 'for'.
db: better support for databases with "odd" characters in the name, such as "-"
Add missing quoting, using the quoting flavour preferred by the DB.
Tested with
echo "CREATE USER 'kallithea-test'@'localhost' IDENTIFIED BY 'password'"|sudo -u mysql mysql echo "GRANT ALL PRIVILEGES ON \`kallithea-test\`.* TO 'kallithea-test'@'localhost'" | sudo -u mysql mysql TEST_DB='mysql://kallithea-test:password@localhost/kallithea-test?charset=utf8mb4' py.test
Column widths did in some cases break with 32757d5e9d0b. The markup was too magic to find and fix the exact root cause, but we fix it by refactoring to a slightly different approach:
Use fixed table-layout to be able to control columns width from the first row, without too much auto sizing.
Instead of using padding, put the line number in a centered block-inline with right-align and min-width. The numbers will thus generally be right aligned, but will expand to use less margin for big line numbers.
lib: use sha1 instead of md5 in a couple of places
md5 is dead and should be avoided. In the places changed here, we want to keep using hashes without trivial collisions, but do not expect strong crypto security. sha1 seems like a trivial step up from md5 and without obvious alternatives. It is more expensive than md5, but we can live with that in these places.
The remaining few uses of md5() cannot be changed without breaking backwards compatibility or external API.
lib/diffs: mark trailing tabs similar to trailing spaces
So far, spaces have not been marked up, but trailing spaces are followed by '<i></i>' (newline symbol). Tabs have been marked up as '<u>\t</u>' (underlined with icon), but trailing tabs didn't get the same "trailing whitespace" markup as trailing spaces got.
Fix this unfairness by handling trailing tabs explicitly as '<u>\t</u><i></i>' so they get both kinds of markup.
notifications: explicitly add author of pull request in invitation mail
When an author of a pull request does not get feedback from any of the reviewers, he is left wondering whether they got the invitation mail at all. By explicitly sending the invitation also to the author, he can at least know if e-mail works at the server side and how reviewers see it. The mail can also be forwarded as a 'reminder'.
logging: try to avoid using ANSI color codes when not logging to a terminal
Color on the console is nice, but it is annoying to get these codes when for example redirecting to a file.
The formatter doesn't know exactly where the message will be used - it is fully configurable. In the default configuration, the messages are passed to a StreamHandler to sys.stderr .
As an approximate optmization, hardcode the color formatters to only emit color formatting if logging to something that isatty.
permissions: drop hg.create.write_on_repogroup "Repository creation with group write access" setting
Simplify permissions system and get rid of some confusing tech debt.
Before, the global 'write_on_repogroup' setting controlled what write permission on a repo group meant.
With this change, users can create repositories in a repo group if and only if they have write access. Write access to a repo group will now mean the permission to create repositories in it.
Write access to repo groups must be granted explicitly. There should not be any other reason to grant write access than to allow users to create repos. There is thus no upgrade concerns for this change.
An admin that doesn't want users to create repos in a repogroup should just not give them write access.
These global settings might still exist in the database, but is ignored and no longer used and do no harm.
inifile: make it possible for expand() to comment out settings without assigning new value
For completeness, when we already can create and update values, also make it possible to delete. This fits nicely into the implementation. Potentiallly "nice to have" but not used yet.
The ini file is a text file and it only really makes sense to set string values. Intuitively, it makes sense that setting something to None means that the old value should be commented out but no new value should be set. If we want to set a value of "None", then specify "None" - not None.
lib: drop own asbool implementation and consistently use tg.support.converters as utils2.asbool
str2bool never reported error on odd input such as '' or '-1', but the tg asbool behaviour of raising ValueError("String is not true/false: %r" % obj) in that case seems fine.
uwsgi: slim down and tweak the default '[uwsgi]' configuration section
The goal is to have a basic working setup that show how and why to use uWSGI. System administrators should check uWSGI documentation for further information and general advice about operations.
uwsgi: drop unnecessary dependency of http module - just use http-socket directly
The http plugin has more advanced http functionality like https and load balancing, duplicating what in many setups is handled by separate front-end servers. In this simple template, just use the basic http-socket.
i18n: fix dead code in Accept-Language workaround from 7c7d6b5c07c7
AppConfig is just providing defaults, and the .ini file has not been read yet. There is thus no point in checking if i18n.lang has been set, and the code would thus *always* set i18n.lang=en ... but that was fine, as it just is a default.
email templates: fix missing translation of titles and buttons
The buttons and titles of email templates were not correctly translated. The corresponding strings were not part of the i18n files because they were not recognized by the extraction logic.
The benefit of this functionality is questionable. Especially in bigger setups with multiple front-end instances all serving the same multitude of repositories, making the hit rate very low. And the overhead of storing cache invalidation data *in* the database is non-trivial.
We preserve a small cache in Repository SA records, but should probably just in general know what we are doing and not ask for the same information multiple times in each request.
setup: install pip in virtualenv to make sure we have the latest version
Older versions are good enough for bootstrapping (and might also be good for everything else) but gives:
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available. You should consider upgrading via the 'pip install --upgrade pip' command.
The user might still get this warning initially, but then it goes away ...
config: set base_path config in set_app_settings using Ui.get_repos_location() instead of in app_cfg using make_ui()
Only hit the database once (when starting the application) to get this.
It would perhaps be more elegant to set it directly, for example in kallithea.base_path ... but it seems like a setting that really belongs in the .ini file ...
auth: for default permissions, use existing explicit query result values instead of following dot references in ORM result objects
There has been reports of spurious crashes on resolving references like .repository from Permissions:
File ".../kallithea/lib/auth.py", line 678, in __wrapper if self.check_permissions(user): File ".../kallithea/lib/auth.py", line 718, in check_permissions return user.has_repository_permission_level(repo_name, self.required_perm) File ".../kallithea/lib/auth.py", line 450, in has_repository_permission_level actual_perm = self.permissions['repositories'].get(repo_name) File ".../kallithea/lib/vcs/utils/lazy.py", line 41, in __get__ value = self._func(obj) File ".../kallithea/lib/auth.py", line 442, in permissions return self.__get_perms(user=self, cache=False) File ".../kallithea/lib/auth.py", line 498, in __get_perms return compute(user_id, user_is_admin) File ".../kallithea/lib/auth.py", line 190, in _cached_perms_data r_k = perm.UserRepoToPerm.repository.repo_name File ".../sqlalchemy/orm/attributes.py", line 285, in __get__ return self.impl.get(instance_state(instance), dict_) File ".../sqlalchemy/orm/attributes.py", line 721, in get value = self.callable_(state, passive) File ".../sqlalchemy/orm/strategies.py", line 710, in _load_for_state % (orm_util.state_str(state), self.key)
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <UserRepoToPerm at ...> is not bound to a Session; lazy load operation of attribute 'repository' cannot proceed (Background on this error at: http://sqlalche.me/e/bhk3)
Permissions are cached between requests: SA result records are stored in in beaker.cache.sql_cache_short and resued in following requests after the initial session as been removed. References in Permission objects would usually give lazy lookup ... but not outside the original session, where we would get an error like this.
Permissions are indeed implemented/used incorrectly. That might explain a part of the problem. Even if not fully explaining or fixing this problem, it is still worth fixing:
Permissions are fetched from the database using Session().query with multiple class/table names (joined together in way that happens to match the references specified in the table definitions) - including Repository. The results are thus "structs" with selected objects. If repositories always were retrieved using this selected repository, everything would be fine. In some places, this was what we did.
But in some places, the code happened to do what was more intuitive: just use .repository and rely on "lazy" resolving. SA was not aware that this one already was present in the result struct, and would try to fetch it again. Best case, that could be inefficient. Worst case, it would fail as we see here.
Fix this by only querying from one table but use the "joinedload" option to also fetch other referenced tables in the same select. (This might inefficiently return the main record multiple times ... but that was already the case with the previous approach.)
This change is thus doing multiple things with circular dependencies that can't be split up in minor parts without taking detours:
The existing repository join like: .join((Repository, UserGroupRepoToPerm.repository_id == Repository.repo_id)) is thus replaced by: .options(joinedload(UserGroupRepoToPerm.repository))
Since we only are doing Session.query() on one table, the results will be of that type instead of "structs" with multiple objects. If only querying for UserRepoToPerm this means: - perm.UserRepoToPerm.repository becomes perm.repository - perm.Permission.permission_name looked at the explicitly queried Permission in the result struct - instead it should look in the the dereferenced repository as perm.permission.permission_name